Quick PDF sorting and searching: SWISH++

Problem: searching and sorting a large collection of PDF files automatically.
Solution: SWISH++ for indexing and searching, plus pdftotext, find, and a few Bash or Perl scripts to extract the text from the PDF documents.

The usual approach is to use Beagle or some other desktop search tool, but I will show how SWISH++ can do the same job much faster and with far fewer resources.

Introduction: how to index PDF files

Perl lovers like to say that “there is more than one way to do it”. So here is my way to do it. Briefly, the solution consists of these steps:

use find to locate all PDF documents and convert them to text with the pdftotext tool
index these text files with index++ to get an index file
choose a relevance threshold experimentally
search the index file by keyword using search++
move the files found into the required directory

Finding PDF documents and extracting their text
Simply ask find to locate all *.pdf files and run pdftotext in quiet mode on each one. This can be done with the command:

find . -name '*.pdf' -exec pdftotext -nopgbrk -q {} \;

This works only for English text; other languages are not supported.
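
Rerunning the find command converts every PDF again, which gets slow as the collection grows. A minimal sketch of a variant that skips PDFs that already have a text file beside them; convert_pdfs is a name made up for this example, and it assumes pdftotext is on the PATH:

```sh
#!/bin/sh
# convert_pdfs [DIR]: hypothetical helper, not part of SWISH++ or pdftotext.
# Converts every PDF under DIR (default: current directory) to text,
# skipping PDFs that already have a .txt file beside them.
convert_pdfs() {
    find "${1:-.}" -type f -name '*.pdf' -exec sh -c '
        for pdf do
            txt="${pdf%.pdf}.txt"
            [ -e "$txt" ] && continue          # already converted
            pdftotext -nopgbrk -q "$pdf" "$txt" ||
                echo "conversion failed: $pdf" >&2
        done
    ' sh {} +
}

# Usage: convert_pdfs ~/papers
```

The existence check is deliberately crude: a PDF that changed after its text file was written is not converted again, so delete the stale .txt file to force a new conversion.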

Making index file

This is even simpler: just ask index++ to index all of our text files, starting from the current directory and descending into its subdirectories:

index++ -e "text:*.txt" .

The dot at the end is required!

What is SWISH++

SWISH++ is rarely mentioned on the Net: there is only the project homepage and an article about applying the system to a real search engine. Some people say that SWISH++ is the fastest search engine ever.
A description of this excellent search system can be found in the Debian package: Simple Document Indexing System for Humans: C++ version. It is especially suitable for building a fast and efficient search engine.
Here are some advantages of SWISH++:

Lightning-fast indexing
Indexes META elements, ALT, and other attributes
Selectively not index text within HTML or XHTML elements
Intelligently index mail and news files
Index Unix manual page files
Apply filters to files on-the-fly prior to indexing
Index non-text files such as Microsoft Office documents (antiword required)
Modular indexing architecture
Index new files incrementally
Index remote web sites
Handles large collections of files
Lightning-fast searching
Optional word stemming (suffix stripping)
Ability to run as a search server
Easy-to-parse results format

SWISH++ consists of two tools: index++ and search++. The first indexes files, and the second searches within the index. It's like your personal Google, but small, fast, and console-based. 🙂

Install SWISH++ in Debian

Use the following command to install swish++ in Debian:

# aptitude install swish++
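
The steps in this article also need pdftotext, which current Debian releases ship in the poppler-utils package. A small sketch for checking that everything is in place before starting; missing_tools is a name made up for this example:

```sh
#!/bin/sh
# missing_tools TOOL...: hypothetical helper that prints the name of
# every listed command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: missing_tools pdftotext index++ search++
# No output means all three tools are installed.
```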

Indexing files

index++ builds the index file from the text documents produced by pdftotext (oh yeah, the UNIX way!). It supports formats such as text, HTML, XML, LaTeX, and mail: anything that is essentially text, perhaps with some markup in it. On my desktop machine indexing is very fast: an Intel P4 630 at 3 GHz with 2 GB of RAM indexes about 270 files in 5 seconds.

With a verbosity level of 3, you get more information about the indexing process:

index++ -v3 -e "text:*.txt" .

The dot at the end is important; the manual page explains more. The output will look like this:

watters_etal_paleobio_2001.txt (2704 words)
WaveMetriconChip64.txt (1351 words)
wshedtopoalgoJMIV.txt (4042 words)
Ye.IJDAR.1.txt (4470 words)
YucelITIP01.txt (1678 words)

./edg:
morphology.txt (753 words)
LuengoEtAl_IbPRIA05.txt (1227 words)
Cuisenaire2005_1250.txt (1162 words)
icpr2004_nucleus.txt (1234 words)
OrtizEtAl_SPIE01.txt (1463 words)
Angulo_VIIP04.txt (1658 words)
682.txt (1901 words)
comorph.txt (1948 words)

index++: ranking index…
index++: writing index…

index++: done:
00:05 (min:sec) elapsed time
548 files, 271 indexed
2465116 words, 1046139 indexed, 56281 unique

The result is the file swish++.index, which holds all the information about the indexed files.
Great: this huge collection of articles was indexed so quickly! Now we are ready to search it.

Searching files

Let's find something in our collection by keyword. To do that, ask search++ to look in the swish++.index database. For example, I can search for papers about morphological image analysis that do not mention medicine:

$ search++ morphology and erosion and dilation not medicine

And here are the results (output shortened):

# results: 125
99 ./Krylov2.txt 3771 Krylov2.txt
49 ./13300407.txt 3103 13300407.txt
46 ./morph1.slides.printing.6.txt 4369 morph1.slides.printing.6.txt
37 ./lecture_morphology_sara.txt 6746 lecture_morphology_sara.txt
30 ./SIGGRAPH2002_Sketch-Mitchell.txt 5308 SIGGRAPH2002_Sketch-Mitchell.txt
26 ./MorphologicalImageProcessing.txt 7642 MorphologicalImageProcessing.txt
25 ./phdsymp2002_ledda.txt 8298 phdsymp2002_ledda.txt
23 ./lab2_manual.txt 9313 lab2_manual.txt
23 ./Project 1.txt 9946 Project 1.txt
22 ./morphology.txt 11212 morphology.txt
22 ./edg/morphology.txt 11212 morphology.txt
22 ./slides-6-geometry.txt 11717 slides-6-geometry.txt
22 ./V1BFOGG8.txt 10797 V1BFOGG8.txt
18 ./71650638.txt 13978 71650638.txt

The first column is the relevance rank, the second is the relative file path, the third is the file size, and the fourth is the name. Simple and clean. So it's easy to find an article if you remember something about it (an author name, keywords, or even a phrase from it).
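
Because the format is this regular, picking out the files above a chosen relevance threshold takes only a few lines of awk. The sketch below is an illustration, not part of SWISH++: paths_above_rank is a made-up name, and it assumes every indexed path ends in .txt, as in the output above, so that file names containing spaces still parse.

```sh
#!/bin/sh
# paths_above_rank MIN: hypothetical helper. Reads search++ results on
# standard input and prints the path of every result ranked MIN or higher.
paths_above_rank() {
    awk -v min="$1" '
        /^#/ { next }                     # skip lines such as "# results: 125"
        $1 + 0 >= min + 0 {
            line = $0
            sub(/^[0-9]+ /, "", line)     # drop the rank column
            # the path ends at the first ".txt" followed by the size column
            if (match(line, /\.txt [0-9]+ /))
                print substr(line, 1, RSTART + 3)
        }'
}

# Usage:
#   search++ morphology and erosion and dilation not medicine | paths_above_rank 25
```

The threshold of 25 is only a placeholder; a good value depends on the collection and has to be found by trial.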

What we get

I have a vast collection of scientific articles in English, and it's very hard to remember the exact title and content of each paper. Using this approach, I sorted more than 2400 papers in about 2 hours. The task was harder for SWISH++ because the papers are so similar in content. I estimate the precision at approximately 60-70%. Of course, I looked through the sorted papers myself, so it was a semi-automatic process 😉
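
A sketch of how the final step, moving the files found into a topic directory, might be scripted. It is an assumption about the workflow rather than the exact script used here: move_pdfs_to is a hypothetical helper that reads .txt paths on standard input (for example, paths picked from the search++ results) and moves the PDF lying next to each one. The .txt files stay in place so that the existing index remains valid.

```sh
#!/bin/sh
# move_pdfs_to DIR: hypothetical helper. For every .txt path read from
# standard input, moves the PDF with the same base name into DIR.
move_pdfs_to() {
    dest=$1
    mkdir -p "$dest" || return 1
    while IFS= read -r txt; do
        pdf="${txt%.txt}.pdf"
        if [ -f "$pdf" ]; then
            # -n (GNU and BSD mv): never overwrite a file already in DIR,
            # e.g. when two subdirectories hold a morphology.pdf each
            mv -n "$pdf" "$dest"/
        fi
    done
}

# Usage: move_pdfs_to sorted/morphology < hits.txt
```

Paths without a matching PDF are skipped silently, which suits the semi-automatic mode: whatever was moved still gets checked by eye afterwards.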

Links:

I can't say everything about this shiny search system in one post, but I have tried to show how quickly and easily I work with loads of PDF articles on my Debian box.

For further information, you may be interested in the project's SourceForge page. Here are many articles about search engines and, in particular, about SWISH++, and documentation about SWISH-e is here. I hope that with this post there is now one more article about this very useful system, SWISH++.

7 thoughts on “Quick PDF sorting and searching: SWISH++”

Mark_in_Hollywood on January 4, 2008 at 8:36 pm said:

mark@Lexington-19:~/PDF$ index++ -e “text:*.txt” .
index++: error: ““text”: no such indexing module

How do I fix this?
Marco Fang on March 24, 2008 at 7:12 pm said:

Virens,
This solution is simply TOO GREAT!! I have Gigs of pdf files and I had no way to quickly find some simple text in them for many years! Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!
virens on April 2, 2008 at 3:43 pm said:

2Marco Fang Says:
“Virens, This solution is simply TOO GREAT!!”
Thanks for such words 🙂

I’m using SWISH++ for the same purposes, i.e., for searching in my scientific articles for words combination or literature citations.

“Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!”

🙂 I will try to translate to English my articles but currently I’m busily working under my Ph.D. thesis.
Sivakumar B on October 17, 2008 at 11:41 am said:

Is it possible to sort the contents of a PDF file. I am not referring to the sorting of file names with PDF extension. I am referring to SORTING OF CONTENTS within a PDF file. Is it possible to print the sorted contents only e.g., if the pages in a PDF file contain identical figure and/or letter, or name or address, is it possible group such pages and print them at a single stretch?
If any one has any idea? Please reply to me at [email protected]
yeah on July 12, 2009 at 3:15 pm said:

This solution seems great ! In all the unix way 🙂 . Just a question anything for other language than english ? (personnally I search a way to use pdftotext or other Linux / unix tools for french and chinese .

thankx
Paul J. Lucas on December 11, 2009 at 1:55 am said:

You can skip the “find” pre-conversion of PDF files to plain text. SWISH++ has the ability to filter files on-the-fly. Consult the FILTERS section in the swish++.conf(4) man page.
indexed by google on January 3, 2010 at 5:21 am said:

HmI’ve considered the evidently simple ways the big G works. The truth of the thing is that even though Google “indexes” your page abundant times, it still takes a tonne of work on your part to get your website to become interesting to Google. This lends to my knowledge of search engines.
