Debian Admin - Your way to Debian World

November 15, 2007

Quick PDF sorting and searching: SWISH++

by @ 10:16 am. Filed under General, Database


Problem: searching and sorting a large collection of PDF files automatically.
Solution: SWISH++, together with pdftotext, find, and a few Bash or Perl scripts, can index PDF documents and search within them quickly.

The common way is to use Beagle or some other desktop search tool, but I will show how SWISH++ can do the same job much faster and with far fewer resources.

Introduction: How indexing PDFs works

Perl lovers like to say that “there is more than one way to do it”. So, here is my way to do it. Briefly, the solution consists of these steps:

  • use find to locate all PDF documents and convert them to text with the pdftotext tool
  • index these text files with index++ to get an index file
  • choose a relevance threshold experimentally
  • search the index file by keywords using search++
  • move the files found into the required directory

Finding PDF documents and extracting text from them
Simply ask find to locate all *.pdf files and run pdftotext in quiet mode on each one. This can be done with the command:

find . -name '*.pdf' -exec pdftotext -nopgbrk -q {} \;

In my experience this works reliably only for English-language documents; other languages are not supported.
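When the collection grows, re-running the one-liner converts every PDF again. Here is a rough sketch of the same step wrapped in a shell function that skips PDFs whose text copy is already newer. The function name convert_new_pdfs and the skip-if-newer rule are my own assumptions for illustration; only find and the pdftotext options come from the command above.

```shell
#!/bin/sh
# Sketch: convert every PDF under a directory to text, skipping files
# whose .txt copy already exists and is newer than the PDF.
# convert_new_pdfs is a hypothetical helper name, not part of any package.

convert_new_pdfs() {
    root=${1:-.}
    # "-exec sh -c ... {} +" passes file names as arguments, so names
    # containing spaces survive intact.
    find "$root" -type f -name '*.pdf' -exec sh -c '
        for pdf do
            txt=${pdf%.pdf}.txt
            # -nt is false when the text file does not exist yet
            if [ "$txt" -nt "$pdf" ]; then
                continue
            fi
            # same options as the one-liner: no page breaks, quiet mode
            pdftotext -nopgbrk -q "$pdf" "$txt" || echo "failed: $pdf" >&2
        done
    ' sh {} +
}
```

Calling convert_new_pdfs with no argument starts from the current directory, like the original command.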

Making index file

This step is even simpler: just ask index++ to index all of our text files, starting from the current directory and descending into its subdirectories:

index++ -e 'text:*.txt' .

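The dot at the end is required: it is the directory to index. Also make sure the quotes are plain ASCII quotes; "smart" quotes pasted from a web page become part of the module name and index++ rejects it.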

What is SWISH++

There are only a few mentions of SWISH++ on the Net - just the project homepage and an article about applying the system to a real search engine. Some people say that SWISH++ is the fastest search engine ever.
A description of this excellent search system can be found in the Debian package: Simple Document Indexing System for Humans: C++ version. It is especially suitable as a fast and efficient search engine.
Here are some advantages of SWISH++:

  • Lightning-fast indexing
  • Indexes META elements, ALT, and other attributes
  • Selectively not index text within HTML or XHTML elements
  • Intelligently index mail and news files
  • Index Unix manual page files
  • Apply filters to files on-the-fly prior to indexing
  • Index non-text files such as Microsoft Office documents (antiword required)
  • Modular indexing architecture
  • Index new files incrementally
  • Index remote web sites
  • Handles large collections of files
  • Lightning-fast searching
  • Optional word stemming (suffix stripping)
  • Ability to run as a search server
  • Easy-to-parse results format

SWISH++ consists of two tools: index++ and search++. The first indexes files, and the second searches within the index. It's like your personal Google, but small, fast and console-based. :-)

Install SWISH++ in Debian

Use the following command to install SWISH++ on Debian:

# aptitude install swish++
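After installing, it may be worth checking that all three tools used in this post are actually on your PATH. The helper below is only a sketch; missing_tools is a name I made up, and it relies on nothing but the standard "command -v" builtin. Note that pdftotext is packaged separately from SWISH++ (on current Debian it is in poppler-utils, as far as I know).

```shell
#!/bin/sh
# Sketch: print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of SWISH++.

missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (prints nothing when everything is installed):
#   missing_tools index++ search++ pdftotext
```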

Indexing files

index++ builds an index file from the text documents produced by pdftotext (oh, yeah, the UNIX way!). It supports formats such as text, HTML, XML, LaTeX and mail - anything that can be reduced to text, possibly with a bit of markup. On my desktop machine indexing is very fast: an Intel P4 630 at 3GHz with 2GB of RAM indexes about 270 files in 5 seconds.

With a verbosity level of 3, you get more information about the indexing process:

index++ -v3 -e 'text:*.txt' .

The dot at the end is important; the manual page has the details. The output will look like this:

watters_etal_paleobio_2001.txt (2704 words)
WaveMetriconChip64.txt (1351 words)
wshedtopoalgoJMIV.txt (4042 words)
Ye.IJDAR.1.txt (4470 words)
YucelITIP01.txt (1678 words)

./edg:
morphology.txt (753 words)
LuengoEtAl_IbPRIA05.txt (1227 words)
Cuisenaire2005_1250.txt (1162 words)
icpr2004_nucleus.txt (1234 words)
OrtizEtAl_SPIE01.txt (1463 words)
Angulo_VIIP04.txt (1658 words)
682.txt (1901 words)
comorph.txt (1948 words)

index++: ranking index…
index++: writing index…

index++: done:
00:05 (min:sec) elapsed time
548 files, 271 indexed
2465116 words, 1046139 indexed, 56281 unique

The result is a file called swish++.index, which holds all the information about the indexed files.
Great: this huge collection of articles was indexed so fast! Now we are ready to search for something in it.

Searching files

Let's find something in our collection using keywords. To do this, ask search++ to look in the swish++.index database. For example, I can search for papers about morphological analysis of images that do not mention medicine:

$ search++ morphology and erosion and dilation not medicine

And here are the results (output abridged):

# results: 125
99 ./Krylov2.txt 3771 Krylov2.txt
49 ./13300407.txt 3103 13300407.txt
46 ./morph1.slides.printing.6.txt 4369 morph1.slides.printing.6.txt
37 ./lecture_morphology_sara.txt 6746 lecture_morphology_sara.txt
30 ./SIGGRAPH2002_Sketch-Mitchell.txt 5308 SIGGRAPH2002_Sketch-Mitchell.txt
26 ./MorphologicalImageProcessing.txt 7642 MorphologicalImageProcessing.txt
25 ./phdsymp2002_ledda.txt 8298 phdsymp2002_ledda.txt
23 ./lab2_manual.txt 9313 lab2_manual.txt
23 ./Project 1.txt 9946 Project 1.txt
22 ./morphology.txt 11212 morphology.txt
22 ./edg/morphology.txt 11212 morphology.txt
22 ./slides-6-geometry.txt 11717 slides-6-geometry.txt
22 ./V1BFOGG8.txt 10797 V1BFOGG8.txt
18 ./71650638.txt 13978 71650638.txt

The first column is the relevance rank, the second is the relative path of the file, the third is the file size, and the fourth is the name. Simple and clean. So it's very easy to find an article if you remember something about it (author name, keywords, or even a phrase from it).
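The "relevance threshold" step from the introduction can be sketched as a small filter over this output. The function name hits_above is hypothetical, and the parsing is a heuristic of mine: it assumes every indexed path ends in .txt (true for pdftotext output) so that paths containing spaces, such as "./Project 1.txt", can still be separated from the size and name columns.

```shell
#!/bin/sh
# Sketch: read search++ results on stdin and print the path of every hit
# whose rank is at least the given threshold.
# hits_above is a hypothetical helper, not part of SWISH++.

hits_above() {
    awk -v min="$1" '
        /^#/ { next }                     # skip the "# results: N" header
        ($1 + 0) < (min + 0) { next }     # below the relevance threshold
        {
            line = $0
            sub(/^[0-9]+ /, "", line)     # drop the rank column
            # assumption: the path ends in ".txt" and is followed by the size
            if (match(line, /^.*\.txt [0-9]+ /)) {
                path = substr(line, 1, RLENGTH)
                sub(/ [0-9]+ $/, "", path)  # drop the size column
                print path
            }
        }'
}
```

For example, "search++ morphology and erosion and dilation not medicine | hits_above 25" would keep only the hits ranked 25 or higher; the right threshold has to be found by trial and error for each collection.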

What we get

I have a vast collection of scientific articles in English, and it's very hard to remember the exact name and content of each paper. Using this approach, I sorted more than 2400 papers in about 2 hours. The task was harder for SWISH++ than usual because the papers are so similar in content. I estimate the precision at approximately 60-70%. Of course, I looked through the sorted papers myself, so it was more of a semi-automatic mode ;-)
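The last step, moving the papers that were found into a topic directory, is not shown above. A rough sketch follows, assuming one .txt path per line on standard input and a PDF with the same base name next to each text file. The name move_hits and the rule of refusing to overwrite an existing file are my own choices (the collection above has two different files called morphology.txt, so name clashes do happen).

```shell
#!/bin/sh
# Sketch: read .txt paths on stdin and move the matching PDFs into a directory.
# move_hits is a hypothetical helper, not part of SWISH++.

move_hits() {
    dest=$1
    mkdir -p "$dest" || return 1
    while IFS= read -r txt; do
        pdf=${txt%.txt}.pdf
        target=$dest/$(basename "$pdf")
        if [ ! -f "$pdf" ]; then
            echo "no PDF for $txt" >&2
        elif [ -e "$target" ]; then
            # never overwrite a paper that is already sorted
            echo "skipped, already exists: $target" >&2
        else
            mv -- "$pdf" "$target"
        fi
    done
}
```

With the paths saved in a file (hits.txt is just an example name), "move_hits morphology-papers < hits.txt" moves the corresponding PDFs and reports anything it skipped on standard error.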

Links:

I can't cover everything about this shiny search system in one post, but I have tried to show how quickly and easily I work with loads of PDF articles on my Debian box.

For further information, you may be interested in the SourceForge page of the project. Here are many articles about search engines and, in particular, about SWISH++, and documentation about SWISH-e is here. I hope that with this post there will be one more article about this very useful system - SWISH++.


3 Responses to “Quick PDF sorting and searching: SWISH++”

  1. Mark_in_Hollywood Says:

    mark@Lexington-19:~/PDF$ index++ -e “text:*.txt” .
    index++: error: ““text”: no such indexing module

    How do I fix this?

  2. Marco Fang Says:

    Virens,
    This solution is simply TOO GREAT!! I have Gigs of pdf files and I had no way to quickly find some simple text in them for many years! Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!

  3. virens Says:

    2Marco Fang Says:
    “Virens, This solution is simply TOO GREAT!!”
    Thanks for such words :)

    I’m using SWISH++ for the same purposes, i.e., for searching in my scientific articles for words combination or literature citations.

    “Thanks Google let me find this page and solved my big problem, and big thanks to you who wrote this great article!”

    :-) I will try to translate to English my articles but currently I’m busily working under my Ph.D. thesis.

DebianAdmin is not related to the Debian Project.
This site is copyright © 2006,2007 Debian Admin