Full text search for Python (or with a clean C API)?
by Uche Ogbuji
One of the tasks we shall add before the 1.0 release of 4Suite is full-text search for documents in the repository. We actually had this support before (in 4Suite 0.11.1), using Swish++. It was implemented with quite a kludge: we'd use os.system calls to invoke the indexer on files which we temporarily copied to disk. Even this hack was undone by the confusion over the forks and future of the various Swish code-bases. It looks as if things have finally settled down under the Swish-E umbrella, but now we're looking at all options.
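For readers who have not seen this kind of kludge, here is a rough sketch of its shape: copy the document text to a temporary file, hand that file to an external indexer process, then clean up. The function name and the `indexer_argv` parameter are hypothetical, and `subprocess.run` stands in for the `os.system` calls we actually used. No real Swish++ command line is shown or implied.

```python
import os
import subprocess
import sys
import tempfile


def index_via_temp_file(text, indexer_argv):
    """Write text to a temporary file, run an external indexer on it,
    and return the indexer's exit status.

    indexer_argv is a hypothetical argument list for whatever indexer
    is installed; the temporary file path is appended as the last argument.
    """
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        # Copy the repository document out to disk so the indexer can read it.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        completed = subprocess.run(indexer_argv + [path], check=False)
        return completed.returncode
    finally:
        # Remove the temporary copy whether or not the indexer succeeded.
        os.remove(path)


if __name__ == "__main__":
    # Stand-in "indexer": a Python one-liner that merely reads the file.
    fake_indexer = [sys.executable, "-c", "import sys; open(sys.argv[1]).read()"]
    print(index_via_temp_file("<doc>hello</doc>", fake_indexer))
```

Every indexing pass pays for a file copy and a process launch, and the index itself lives wherever the external tool decides to put it, which is why we would rather have a library interface.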
Our preference is for a search engine with a clean C API to which one can pass text and get indexes back in a nice data structure. Another preference is for XML indexing features. Since we'd ask full-text search users to install that engine separately, a clean, simple install would be welcome. And if it already came with a Python API, it would save a good deal of work.
Bill Ellridge suggested mnoGoSearch (which has a horrible name from a PHB's point of view), and he even tried his hand at a Python port of the Perl/C module for it. But I am not encouraged, since I have not even been able to get mnoGoSearch working from an end-user POV. I tried to set it up as the search engine for the 4Suite mailing list, and no matter how much I tinker with the config file, the indexer dies with an error.
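To make "pass text and get indexes back in a nice data structure" concrete, here is a toy sketch of the interface shape we have in mind, written in pure Python. The names `index_text` and `search` are invented for illustration and do not belong to any existing engine. A real engine would add stemming, ranking, stop words and persistent storage.

```python
import re
from collections import defaultdict

_WORD = re.compile(r"\w+")


def index_text(doc_id, text, index=None):
    """Add one document's text to an inverted index and return the index.

    The index maps term -> {doc_id: [word positions]}.
    """
    if index is None:
        index = defaultdict(lambda: defaultdict(list))
    for position, match in enumerate(_WORD.finditer(text.lower())):
        index[match.group()][doc_id].append(position)
    return index


def search(index, query):
    """Return the set of doc_ids that contain every term in the query."""
    terms = _WORD.findall(query.lower())
    if not terms:
        return set()
    # index.get avoids creating empty entries for unknown terms.
    hits = [set(index.get(term, {})) for term in terms]
    return set.intersection(*hits)


if __name__ == "__main__":
    idx = index_text("a", "Full text search for Python")
    index_text("b", "Python with a clean C API", idx)
    print(sorted(search(idx, "python")))
    print(sorted(search(idx, "clean python")))
```

The point is that the caller supplies text directly, with no files on disk, and gets back a structure it can store in the repository however it likes.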
So we're still looking. The Open Source Search Engines page is a great resource, but its summaries don't really give me the sort of in-depth information one needs to evaluate a search engine for such an intimate use.
This is a wheel I'd hate to reinvent, so I'd be grateful for any suggestions.
Do you have a favorite full-text search engine you would recommend for Python users? Or do you know of one with a well-designed C API?
You could check out Zope/Python solutions
There was recently a thread on zope-dev about which full text search solution was going to be used for Zope going forward. The thread starts here:
I believe there is one developed and maintained by PythonLabs, "ZCTextIndex".
Then there is another that does sophisticated stemming: Andreas Jung's TextIndexNG.
It might not be terribly difficult to adapt either one for use with the 4Suite repository...
Python wrapper for swish-e
Have you noted that firstname.lastname@example.org posted to the swish-e mailing list (email@example.com) last week that he is "in the process of writing a python wrapper for swish-e"?
Try LuPy or pyndex from Divmod.org
You may want to consider these. Pyndex is pure Python.