Nov 15 2007
 


Problem: you have a large pile of PDF files and want to search and sort them automatically.
Solution: SWISH++ can do this with the help of pdftotext, find, and a little Bash or Perl scripting to extract the text from the PDFs and index it for fast searching.

The common way is to use Beagle or a similar desktop search tool, but I will show how SWISH++ can do the same job much faster and with far fewer resources.

Introduction: how indexing PDFs works

Perl lovers like to say that "there is more than one way to do it". So, this is my way to do it. Briefly, the solution consists of these steps:

  • use find to locate all PDF documents and convert them to text with the pdftotext tool
  • index these text files with index++ to get an index file
  • choose a relevance threshold experimentally
  • search the index file by keywords using search++
  • move the files found into the required directory

Searching for PDF documents and extracting their text
Simply ask find to locate all *.pdf files and run pdftotext in quiet mode on each one. This can be done with the command:

find . -name '*.pdf' -exec pdftotext -nopgbrk -q {} \;
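
The one-liner converts every PDF again each time it runs. Here is a minimal sketch of a repeatable variant, assuming a POSIX-like shell. The function name convert_pdfs and the PDFTOTEXT override variable are my own hypothetical names, not part of pdftotext or SWISH++. It skips PDFs whose .txt file is already newer, and it copes with spaces in file names (but not with newlines in them).

```shell
# convert_pdfs DIR
# Run pdftotext on every PDF under DIR unless an up-to-date .txt already exists.
# PDFTOTEXT is a hypothetical override, handy for testing with a stand-in tool.
convert_pdfs() {
    find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        # Skip PDFs that were already converted and have not changed since.
        if [ -e "$txt" ] && [ "$txt" -nt "$pdf" ]; then
            continue
        fi
        "${PDFTOTEXT:-pdftotext}" -nopgbrk -q "$pdf" "$txt" ||
            echo "conversion failed: $pdf" >&2
    done
}
```

Calling convert_pdfs . from the top of the collection should give the same .txt files as the find command, only without redoing finished work.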

This works only for English; other languages are not supported.

Making index file

Here it is even simpler: just ask index++ to index all of our text files, starting from the current directory and descending into its subdirectories:

index++ -e "text:*.txt" .

The dot at the end is required!

What is SWISH++

There are few mentions of SWISH++ on the Net: only the project homepage and an article about applying the system to a real search engine. Some people say that SWISH++ is the fastest search engine ever.
A description of this excellent search system can be found in the Debian package: Simple Document Indexing System for Humans: C++ version. It is especially suitable for building a fast and efficient search engine.
Here are some advantages of SWISH++:

  • Lightning-fast indexing
  • Indexes META elements, ALT, and other attributes
  • Selectively not index text within HTML or XHTML elements
  • Intelligently index mail and news files
  • Index Unix manual page files
  • Apply filters to files on-the-fly prior to indexing
  • Index non-text files such as Microsoft Office documents (antiword required)
  • Modular indexing architecture
  • Index new files incrementally
  • Index remote web sites
  • Handles large collections of files
  • Lightning-fast searching
  • Optional word stemming (suffix stripping)
  • Ability to run as a search server
  • Easy-to-parse results format

SWISH++ consists of two tools: index++ and search++. The first one indexes files, and the second one searches within the index. It's like your personal Google, but small, fast and console-based. :-)

Install SWISH++ in Debian

Use the following command to install swish++ in Debian:

# aptitude install swish++
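
Before going further it is worth checking that all three tools used in this post are actually on the PATH. A small sketch follows; the function name missing_tools is hypothetical, and which package ships pdftotext depends on your Debian release (on current systems it is usually poppler-utils, but treat that as an assumption).

```shell
# missing_tools NAME...
# Print the name of every listed command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Run it as missing_tools pdftotext index++ search++ and install whatever it prints; no output means everything is in place.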

Indexing files

index++ makes an index file from the text documents produced by pdftotext (oh yeah, the UNIX way!). It supports formats such as text, HTML, XML, LaTeX and mail: anything that can be converted to text, perhaps with a few tags left in. On my desktop machine indexing is very fast: an Intel P4 630 at 3 GHz with 2 GB of RAM indexes about 270 files in 5 seconds.

With a verbosity level of 3, one can get more information about the indexing process:

index++ -v3 -e "text:*.txt" .

The dot at the end is important; the manual page explains more. The output will look like this:

watters_etal_paleobio_2001.txt (2704 words)
WaveMetriconChip64.txt (1351 words)
wshedtopoalgoJMIV.txt (4042 words)
Ye.IJDAR.1.txt (4470 words)
YucelITIP01.txt (1678 words)

./edg:
morphology.txt (753 words)
LuengoEtAl_IbPRIA05.txt (1227 words)
Cuisenaire2005_1250.txt (1162 words)
icpr2004_nucleus.txt (1234 words)
OrtizEtAl_SPIE01.txt (1463 words)
Angulo_VIIP04.txt (1658 words)
682.txt (1901 words)
comorph.txt (1948 words)

index++: ranking index...
index++: writing index...

index++: done:
00:05 (min:sec) elapsed time
548 files, 271 indexed
2465116 words, 1046139 indexed, 56281 unique

The result is the swish++.index file, which holds all the information about the indexed files.
Great: this huge collection of articles was indexed so quickly! Now we are ready to search for something in it.
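
Indexing takes only seconds, so one can simply rerun index++ whenever new papers arrive. Still, a script may want to know whether that is needed at all. The sketch below assumes the index lives in swish++.index inside the indexed directory, as it does here; the function name index_is_stale is hypothetical.

```shell
# index_is_stale [DIR]
# Succeed (exit status 0) if DIR has no swish++.index yet, or if any .txt file
# under DIR is newer than the index; fail otherwise.
index_is_stale() {
    dir=${1:-.}
    index="$dir/swish++.index"
    [ -f "$index" ] || return 0
    [ -n "$(find "$dir" -type f -name '*.txt' -newer "$index" | head -n 1)" ]
}
```

A script could then say: if index_is_stale .; then index++ -e "text:*.txt" .; fi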

Searching files

Let's find something in our collection of files by keywords. This is done by asking search++ to look in the swish++.index database. For example, I can search for papers about morphological analysis of images that do not mention medicine:

$ search++ morphology and erosion and dilation not medicine

And here are the results (output shortened):

# results: 125
99 ./Krylov2.txt 3771 Krylov2.txt
49 ./13300407.txt 3103 13300407.txt
46 ./morph1.slides.printing.6.txt 4369 morph1.slides.printing.6.txt
37 ./lecture_morphology_sara.txt 6746 lecture_morphology_sara.txt
30 ./SIGGRAPH2002_Sketch-Mitchell.txt 5308 SIGGRAPH2002_Sketch-Mitchell.txt
26 ./MorphologicalImageProcessing.txt 7642 MorphologicalImageProcessing.txt
25 ./phdsymp2002_ledda.txt 8298 phdsymp2002_ledda.txt
23 ./lab2_manual.txt 9313 lab2_manual.txt
23 ./Project 1.txt 9946 Project 1.txt
22 ./morphology.txt 11212 morphology.txt
22 ./edg/morphology.txt 11212 morphology.txt
22 ./slides-6-geometry.txt 11717 slides-6-geometry.txt
22 ./V1BFOGG8.txt 10797 V1BFOGG8.txt
18 ./71650638.txt 13978 71650638.txt

The first column is the relevance rank, the second is the relative file path, the third is the file size, and the fourth is the name. Simple and clean. So it's very easy to find an article if you remember something about it (author name, keywords, or even a phrase from it).
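
Choosing a relevance threshold then comes down to dropping every line whose first column is too small. A rough sketch with awk follows; the function name filter_by_rank is hypothetical, and the path extraction assumes that every path starts with ./ and ends in .txt, as in the listing above. File names that break this assumption would need a more careful parser.

```shell
# filter_by_rank MIN
# Read search++ results on stdin and print the path of every hit whose rank
# (first column) is at least MIN.
filter_by_rank() {
    awk -v min="$1" '
        /^#/ { next }    # skip comment lines such as "# results: 125"
        $1 + 0 >= min {
            # Assumed layout: rank, path, size, name. The path may contain
            # spaces, so take everything from "./" up to ".txt <size> ".
            if (match($0, /\.\/.*\.txt [0-9]+ /)) {
                path = substr($0, RSTART, RLENGTH)
                sub(/ [0-9]+ $/, "", path)    # cut off the trailing size
                print path
            }
        }'
}
```

For example, search++ morphology and erosion and dilation not medicine | filter_by_rank 25 would keep only the stronger hits; the right number is something to find by experiment.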

What we get

I have a vast collection of scientific articles in English, and it's very hard to remember the exact name and content of each paper. Using this approach, I sorted more than 2400 papers in about 2 hours. The task was harder for SWISH++ because the papers' content is so homogeneous. I estimated the precision at approximately 60-70%. Of course, I looked through the sorted papers myself, so it was more of a semi-automatic mode ;-)
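
The last step, moving the files found into a directory, can be sketched like this. The function reads a list of .txt paths, one per line, and moves the PDF that sits next to each of them. The name move_matches is hypothetical, and the rule of never overwriting an existing file is my own choice: the same file name can turn up in two directories, and losing a paper silently would be worse than skipping it.

```shell
# move_matches DEST
# Read .txt paths from stdin (one per line) and move the matching .pdf files
# into DEST. Files whose name is already taken in DEST are left where they are.
move_matches() {
    dest=$1
    mkdir -p "$dest" || return 1
    while IFS= read -r txt; do
        pdf="${txt%.txt}.pdf"
        if [ ! -f "$pdf" ]; then
            echo "no PDF for $txt" >&2
            continue
        fi
        target="$dest/$(basename "$pdf")"
        if [ -e "$target" ]; then
            echo "skipped, name already taken: $pdf" >&2
            continue
        fi
        mv -- "$pdf" "$target"
    done
}
```

Run it from the directory where the index was built, since the paths in the results are relative to it. Looking through the moved papers by hand afterwards is still a good idea.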

Links:

I can't cover everything about this shiny search system in one post, but I tried to show how quickly and easily I work with loads of PDF articles on my Debian box.

For further information, you may be interested in the project's SourceForge page. Here are many articles about search engines, and about SWISH++ in particular, and the documentation for SWISH-e is here. I hope that with this post there will be one more article about this very useful system, SWISH++.


 Posted by at 10:16 am
  • Mark_in_Hollywood

    mark@Lexington-19:~/PDF$ index++ -e “text:*.txt” .
    index++: error: ““text”: no such indexing module

    How do I fix this?

  • http://www.iloho.com Marco Fang

    Virens,
    This solution is simply TOO GREAT!! I have Gigs of pdf files and I had no way to quickly find some simple text in them for many years! Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!

  • http://debianletters.blogspot.com/ virens

    2Marco Fang Says:
    “Virens, This solution is simply TOO GREAT!!”
    Thanks for such words :)

    I’m using SWISH++ for the same purposes, i.e., for searching in my scientific articles for words combination or literature citations.

    “Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!”
    :-) I will try to translate to English my articles but currently I’m busily working under my Ph.D. thesis.

  • http://debianadmin.com Sivakumar B

    Is it possible to sort the contents of a PDF file. I am not referring to the sorting of file names with PDF extension. I am referring to SORTING OF CONTENTS within a PDF file. Is it possible to print the sorted contents only e.g., if the pages in a PDF file contain identical figure and/or letter, or name or address, is it possible group such pages and print them at a single stretch?
    If any one has any idea? Please reply to me at [email protected]

  • yeah

    This solution seems great! All in the Unix way :). Just a question: is there anything for languages other than English? (Personally, I am looking for a way to use pdftotext or other Linux/Unix tools for French and Chinese.)

    thankx

  • Paul J. Lucas

    You can skip the “find” pre-conversion of PDF files to plain text. SWISH++ has the ability to filter files on-the-fly. Consult the FILTERS section in the swish++.conf(4) man page.

  • http://identi.ca/jfreemon indexed by google

    HmI’ve considered the evidently simple ways the big G works. The truth of the thing is that even though Google “indexes” your page abundant times, it still takes a tonne of work on your part to get your website to become interesting to Google. This lends to my knowledge of search engines.