Heard any good strategies for super-fast catalog-database search results?

by Derek Sivers

Related link: http://www.cdbaby.com/

Heard any good strategies for super-fast catalog-database search results?

We've got a database of 70,000 albums with 5 varchar fields that people may search:

- artist name
- album title
- description
- mis-spellings of artist name
- similar/related artists

The problem is that most people just like to use the one generic SEARCH _______ [search] box on the front page, and expect to find exactly what they were looking for.

I used to return the results using "LIKE" like this:
SELECT * FROM albums WHERE artistname='$word' -- (exact match always gets first priority)
SELECT * FROM albums WHERE albumname='$word'
SELECT * FROM albums WHERE (artistname LIKE '%$word%' OR albumname LIKE '%$word%' OR style LIKE '%$word%' OR mispeling LIKE '%$word%');

Everyone was happy with the results they got, except the search would take 5 to 40 seconds! (Yes all those fields named above are indexed.)

So I switched to MySQL's FULLTEXT search:
SELECT * FROM albums WHERE artistname='$word'
SELECT * FROM albums WHERE albumname='$word'
SELECT * FROM albums WHERE MATCH (albumname, soundlike, mispeling, style) AGAINST ('$word')

MUCH faster! Way way faster! Customers stopped complaining. Only problem is... artists started complaining. If the musician's name is Bob "Red" Thomas and someone searches for Bob Thomas, he'll be in the FULLTEXT search results... under about 1000 others with any "Bob" and "Thomas" in their name.

But it was "good enough" so I left it that way, and apologized to the folks affected by the fulltext search results.

But now that I'm thinking of switching to PostgreSQL for various reasons, the issue of search-results-speed has come up again.

We already have a dual-Xeon fast-RAID-SCSI box dedicated to be the master, and a slave dual-Xeon fast-RAID-SCSI box dedicated to be used for nothing but read-only search-results, and STILL it's not fast enough.

(I should mention I've tried lots of other ways in the past, too, such as having PHP serialize the results of every search, saved in a filename that's the MD5 hash of the SQL query itself, so that if someone made that same query later that day it would unserialize the previous search-results, which were flushed each night at midnight. That helped a bit for common searches.)

So here I am asking the O'Reilly crowd - how do the big online catalog sites search many fields in their massive databases in millseconds? Do you just keep throwing more hardware at it, or am I going about things the wrong way entirely?

Heard any good strategies for super-fast catalog-database search results?


2004-10-02 16:37:38
Most Bypass the RDBMS all together
Drop the catalog to a file(s) and then let a search engine
deal with it. Some RDBMS vendors sell search engine extensions and some have search engine like functionality built in but none are really as good as specialized engines like Verity ($$$) or
engine libraries (Lucene) and even WAIS can still be usefull for
these types of searches.
2004-10-02 16:45:38
Can build word index
You can build a word index, this takes work to maintain, and is "expensive" in computer time to build, but less expensive to run.

What you would do, is find every word, or phrase that you want to match and create an entry for it in a table. If you want good normalization you can then create another table which links the album to the word/phrase or you can just build a comma delimited list of albums with the word. The advantage is that you no longer have 5 fields to search on. The disadvantage is you can't give weightings to the different fields; but it dosen't seem like you where doing that anyway.

Then a simple index on that field, and you should be fine.

2004-10-02 22:50:36
Super Fast Searching
Approach A
dump each album out to a html file, and set up robots.txt so that indexing engines could spider it.

Next, add a google search button to search inside your site. This way, you

a) reduce your server requirements, and
b) improve artist visibility on all search engines

Approach B
Study the query plan and try to understand how your RDBMS is actually executing the search, and you'd be able to better appreciate how to optimize your queries.

I'd say Approach B loses to Approach A

Approach C
Tune your database's cache settings.

2004-10-02 22:52:05
In memory DB?
Do you have the option of using an in memory database? This might greatly speed your queries.

Some of the comments on the MySQL FUlltext Search help might give a clue. http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html

Also when are you guys (cdbaby.com) getting the new Devil Makes Three album???

2004-10-02 23:43:31
Look in the PostgreSQL contrib directory for tsearch2. It's a full text indexing system for PostgreSQL.


2004-10-03 00:12:09
inverted index...
Move to an inverted index if the content supports it. If you're talking about your CDBABY inventory, I'd avoid stemming, as it was cause you trouble with the odd band names, but it's a fast and effective way of getting reasonable search results quickly. Lucene is the library of choice for java related development. There's other open source for C/C++ code if you can use GPL bits - the book Managing Gigabytes references some open source code there (MG2). Or there's pyLucene

When you get to scaling, you can also investigate replication and redundancy to achieve that. Good ole network queueing theory there (the through hardware at it solution).


2004-10-03 04:17:54
text indexing with postgresql

I am looking for something similar (full text indexing) in migrating a database to postgresql.

I thought I might try this out and see how well it works...

Full Text Indexing with PostgreSQL

2004-10-03 04:34:43
text indexing with postgresql

This looks better though - tsearch2

2004-10-03 10:05:29
use a search engine
I switched Webjay over to using Google for search. My thought was that search is hard enough that you should avoid doing it yourself. This worked out surprizingly well, though there are some glitches. See http://webjay.org/query to take it for a test drive.
2004-10-03 18:25:44
Full text...
You could use a generic full-text engine, but have it index a weighted document. In this case, you might repeat the artist's name eight times, then the album name a couple times, then list the individual songs. Then artist names are going to be implicitly weighted much higher than other parts of the document.

By using a generic full text search, you'll get features like checking for word proximity and other things that database full text searches might not have.

(BTW, this was suggested to me by Simon Willison, who used it successfully for searching an event database)

2004-10-03 22:00:39
Don't dump to files
Instead of dumping to files, parse the file path.

http://example.com/artists/23523/ can be the artist with id 23523, for instance. However, as this looks like a normal file to google, and not the result of a GET query, it gets indexed the same as all the normal files. You get the benefits of google's search without giving up the benefits of a live database (for instance, for people who don't mind waiting you can provide more complex searches than google can do, tailored to your data structures, and you can update the info more easily).

That said, I'd also suggest a technology like Lucene if you want more direct control over your searches. Its not hard to set up, its fast, and its powerful.

2004-10-04 02:28:06
I'm using the MySQL freetext search for a similar app
I have the same issues (currently using MySQL freetext search).
Also I have a similar application (e-commerce stuff written with LAMP) where a better search brings more sales and therefore is something I can though resources/money at to get right.
I did alot of research on the web for better solutions. Verity type soutions don't seem to fit well because they are are about indexing alot of words in docs of diffferent formats and I want to index just a few words per item and not delve into documents.
I don't have a good solution for plurals, and information which comes from different sources (eg. some is much more verbose than others) and yet needs to have the same weighting. Also I have some stuff coming from web services (eg. Amazon) so it is difficult to merge this with the (freetext) search from my database (it doesn't seem to make sense (from a point of view of speed) to insert this so that I can then query it all together.

I'm going to see what the mysql freetext search developer has to say at mysqlcon Europe (Nov 2004)

2004-10-04 02:28:41
inverted index...
writing your own weighted reverse index is trivial and unlike any native RDBMS fulltext search works accross tables. Most website searches are rubbish, because people just do what you have done -- use SQL 'like' or the native RDBMS 'fulltext' search (god forbid they use the MS SQL Server fulltext search it is dreadful, even JoelWhoLovesMicrosoftSoftware thinks its awful).

You might want to look at Plucene if you are using Perl or even if you are not I should point you at my article on perl.com about Adding Search Functionality to Perl Applications.

Unfortanately it is hard to find on the oreilly websites because they use atomz for searching at it really isn't much use. Shame as they could build something decent using Perl very easily. compare this search on google to perl.com

2004-10-04 04:07:59
I'm using the MySQL freetext search for a similar app
write a reverse index, you can easily add weightings, etc.

Its trivial even in languages that aren't perl ;)

2004-10-04 08:12:39
Devil Makes Three
Longjohns, Boots, and a Belt - just arrived by mail and should hit the site in the next couple days. :-)
Joseph Scott
2004-10-04 14:22:29
Something doesn't seem right here
What would be really helpful is if you could export your schema and data for people to try out. Short of that I've got a test customers table in a postgresql database that should be close enough to give some ideas. Just for kicks I searched for something nice and generic (FIRST). The query returns 238 rows. Translating your query to my test customers table looks like this:

SELECT * FROM customers WHERE first_name = 'FIRST'
SELECT * FROM customers WHERE last_name = 'FIRST'
SELECT * FROM customers WHERE first_name LIKE '%FIRST%'
OR last_name LIKE '%FIRST%'
OR address_1 LIKE '%FIRST%'
OR address_2 LIKE '%FIRST%'

There are more than 90,000 rows in this table, so it should come reasonably close to the type of query that you are doing. After running the above query a few times the average time was about 2.5 seconds. Not great, but better than the 5 to 40 seconds you are reporting. One big difference here is load of course, but hey, I had to start some where.

Your query seems like it might be causing the system to do more work than it needs to, so I rewrote my query to look like this:

FROM customers
WHERE first_name = 'FIRST'
OR last_name = 'FIRST'
OR first_name LIKE '%FIRST%'
OR last_name LIKE '%FIRST%'
OR address_1 LIKE '%FIRST%'
OR address_2 LIKE '%FIRST%'

This query after being run a few times has an average of about 2 seconds. Again, load is likely not the same as your system, but the new query ran half a second faster on my system.

From the hardware that you describe you should be able to get things even faster than that. My gut instinct is that you can probably get the kind of times your looking for with postgresql (or mysql for that matter), without having to do some alternative hash index. This is given with the normal grain of salt since I can't replicate your load, etc since I don't know what it is.

On a different note, you search is a little too plain. I tried searching for two words in quotes on cdbaby and it seemed to completely ignore the quotes. If I search for my name for instance, I got the same results for Joseph Scott as I did for "Joseph Scott". That isn't very helpful.

Just for kicks I drop the = clause to make the query look like:

FROM customers
WHERE first_name LIKE '%FIRST%'
OR last_name LIKE '%FIRST%'
OR address_1 LIKE '%FIRST%'
OR address_2 LIKE '%FIRST%'

The time dropped down to 1.5 seconds. I know it know that changes your results, but it would be start. Perhaps you should run two different queries, first one with exact matches and the second with partial matches.

2004-10-05 05:17:58
choices of index...
I have to say first that everyone else is right: you want something more like lucene for your basic searches.

However, you mentioned "all of these columns are indexed" and I thought it was worth chipping in on another topic... indexing every column or combinations does **not** always make things faster. If you take a look at the query execution path you may find its picking the 'wrong' path, simply because you've done that. I'm more used to this in oracle, but you should be able to provide index hints in the queries, check the manual.

We hit a similar problem for an 'advanced search' page that needed to be done in sql. There were a dozen or so fields on the form, and the user could fill whichever they wanted. Naively building the query resulted in every query being badly tuned. The solution was to choose a different, tuned+hinted, query builder depending on the fields that the user had filled, defaulting to the naive one when we didn't match a specialized query.

2004-10-06 19:56:03
Thanks everyone for your AWESOME comments, below. Great stuff, and I'm going to give most of it a try.

I'll make a new post in the future with results.

2004-11-06 19:50:25
great article!
But now I use Myql 4.0 ,I don't know those acticle will help me .
-- posted by soudown