In Search of Perfect Search, Redux

by Scot Hacker


A while ago, I posted here that I was in search of the perfect search engine. Today, I have a functioning search engine in place at the jschool and all is well.




My initial plan had been to use mnoGoSearch, but the $800 fee for running it on Windows (it's free on other platforms) nixed those plans. (I'd love to run the site off an Xserve, but that's not in the cards for now.) In the end, I turned to the open source search engine Swish-e. One of the things that drew me to Swish-e was the wide availability of interfaces for it - Swish-e is bundled with a Perl CGI, but there are also interfaces written in PHP, C++, Java, and other languages.




Getting the indexer to run the way we wanted it to proved to be fairly easy. I was able to tell it to exclude certain directories and to skip common stopwords or words that appear on more than n% of pages, and I scheduled index creation to run nightly.
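To make the "more than n% of pages" rule concrete, here is a minimal Python sketch of the idea. It is not Swish-e's implementation; the function name `frequent_words` and the sample pages are made up for illustration.

```python
from collections import Counter


def frequent_words(pages, max_percent):
    """Return the words that appear on more than max_percent % of pages.

    `pages` maps a (hypothetical) path to that page's plain text.
    """
    doc_freq = Counter()
    for text in pages.values():
        # Count each word once per page: document frequency, not term frequency.
        doc_freq.update(set(text.lower().split()))
    limit = len(pages) * max_percent / 100
    return {word for word, count in doc_freq.items() if count > limit}


# Illustrative data only.
pages = {
    "/index.html": "welcome to the school of journalism",
    "/about.html": "about the school",
    "/news.html": "news from the newsroom",
}
print(sorted(frequent_words(pages, 80)))  # → ['the']
```

Words returned by a filter like this would be left out of the index alongside any hand-picked stopwords.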




I had initially intended to write my own PHP interface to Swish-e indexes. I quickly learned that, while simple search interfaces are simple to write, advanced interfaces, which include booleans, phrase highlighting, intelligent ranking and sorting, limiting, stopwords, and stemming, are a whole 'nuther matter. Simple scripts quickly grow green mould. I went back to the bundled Perl interface, which works great out of the box. Customizing its behavior and appearance wasn't 100% trivial, but neither was it un-doable (Perl is not my first language).




My only complaints with Swish-e as it stands today are:




1) While it will index and search meta tags such as keywords and descriptions, there is currently no easy way to get these to rank higher in results (to crank up the document "weight") without modifying the source code of the indexer itself. This feature is slated to be tweak-able in the conf file in a future version.




2) There is no built-in logging facility. If you want to see what people are searching for, you'll need to modify the scripts either to stuff variables into your Apache logs or to generate an external log, which you'll then need to parse with other tools.
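The external-log route can be quite small. The sketch below is in Python rather than the Perl of the bundled script, and everything in it is an assumption for illustration: the tab-separated layout, the function names `log_query` and `top_queries`, and the idea that the search script calls `log_query` once per request.

```python
import csv
from collections import Counter
from datetime import datetime, timezone


def log_query(log_path, query, hits):
    """Append one search to a tab-separated log: timestamp, query, hit count."""
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([datetime.now(timezone.utc).isoformat(), query, hits])


def top_queries(log_path, n=10):
    """Return the n most common queries in the log, ignoring case."""
    with open(log_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        counts = Counter(row[1].strip().lower() for row in reader if len(row) >= 2)
    return counts.most_common(n)
```

A nightly job could call `top_queries` on the log and mail the result to whoever looks after the site.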




Other than that, I'm totally happy with it. I'm auto-indexing 3,000 pages every night, and I built a PHP tool to kick off a manual index when needed (to make a new document appear in search results immediately).
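A manual re-index tool of this kind mostly just runs the indexer with the same configuration file the nightly job uses. Here is a rough Python sketch of that idea (the tool described above is in PHP). The function names are hypothetical, and it assumes the `swish-e` binary is on the PATH and is given its configuration file with the `-c` option.

```python
import subprocess


def build_index_command(config_path, binary="swish-e"):
    """Build the indexer command line: the binary plus -c <config file>."""
    return [binary, "-c", config_path]


def run_manual_index(config_path, runner=subprocess.run):
    """Run the indexer once and report whether it exited successfully.

    `runner` is injectable so the function can be tried without swish-e installed.
    """
    result = runner(
        build_index_command(config_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Passing the command as a list, rather than as a shell string, keeps the config path from being interpreted by a shell, which matters if the tool is ever reachable from a web form.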




My boss even found what could be considered a bug in swish.cgi -- his comments on some anomalous sorting behavior may result in future improvements -- which would constitute his first official contribution to an open source development project.




In any case, the lesson here is that simple search is easy, while really good search is correspondingly hard. Hats off to everyone who has put their energy and time into making Swish-e what it is today so that people like me can come along and have a high-quality web application to plug in quickly and freely. Your efforts do not go unnoticed (and that goes for open source developers everywhere).