CD Baby rewrite : thought process of subsets/limiters solution

by Derek Sivers

A big change I'm adding in the CD Baby rewrite is the ability for the store to be browsing/searching only a subset of its items. This is useful for genre-specific stores, say jazz.cdbaby.com, where browsing and searching the catalog would only show you jazz albums.

I planning, I was calling these "LIMITERS", because in SQL terms, I imagined it would work like this: Say I'm searching and browsing the store, looking at top-sellers or new-arrivals:

SELECT * FROM items ORDER BY sold DESC LIMIT 20
SELECT * FROM items ORDER BY date_added DESC LIMIT 20

... if you add a LIMITER of showing only jazz, the queries become:

SELECT * FROM items WHERE style='jazz' ORDER BY sold DESC LIMIT 20
SELECT * FROM items WHERE style='jazz' ORDER BY date_added DESC LIMIT 20

... if you add another LIMITER of showing only artists from Sweden, the queries become:

SELECT * FROM items WHERE style='jazz' AND location='SE' ORDER BY sold DESC LIMIT 20
SELECT * FROM items WHERE style='jazz' AND location='SE' ORDER BY date_added DESC LIMIT 20

... and so on.

At first I figured some smart object would parse a config of limiters and add it to all SQL queries. But I realized that would lead to some awfully slow queries, since some of the other queries in browsing and searching the store can get pretty complex on their own, and adding in the extra limitations would be even worse.

Jeremy suggested memcached. Brilliant! We pass it all the complex queries we want, and just tell memcached to remember the results, so our customers browsing the site see it fast. No extra work needed. Speed problem solved.

BUT... since one of my planned LIMITERs would be a join on another table, we foresaw that trying to get a class to just add a join to all of our queries is asking for trouble. The whole "complicated queries on the fly" idea had to be nixed.

So then I thought that I'd export a bunch of nightly cache tables of all of our items with these various limiters. Say, "jazz.db" would be a subset of our catalog, already defined as only jazz, where we'd have all the fields needed for browsing (id, artist_name, item_name, price, description, etc) - and search only that cache-table when browsing the jazz store.

We were about to do this when we looked at the stuff for browsing, realized it was all just duplicated cache-info, and all that we really needed to know is what item ID#s are included in this genre-specific store we're browsing! That's it! Everything else can be reliably joined in. No need to export entire new databases or tables. Just give a list of item IDs. Let's call it catalogs and it would look like this:

CREATE TABLE catalogs (
id serial primary key,
name text,
description text
-- etc...
);
CREATE TABLE catalog_items (
catalog_id int not null REFERENCES catalogs(id),
item_id int not null REFERENCES items(id),
PRIMARY KEY (catalog_id, item_id)
)

Looking at top-sellers or new-arrivals becomes:

SELECT items.* FROM items
LEFT JOIN catalog_items ON items.id=catalog_items.id
WHERE catalog_items.catalog_id=5
ORDER BY sold DESC LIMIT 20


SELECT items.* FROM items
LEFT JOIN catalog_items ON items.id=catalog_items.id
WHERE catalog_items.catalog_id=5
ORDER BY date_added DESC LIMIT 20

Perfect! Now we can populate item IDs into catalog_items however we like, by any crazy logic or no logic at all, and the store can easily browse/search it as a subset.

Last step : how to know set up our system so that it sometimes joins against this catalog_items table when needed, (jazz.cdbaby.com), but doesn't when not (www.cdbaby.com)?

Easy! We'll ALWAYS join against it. If you're browsing www.cdbaby.com with no limiters then that's catalog #1, a catalog of all available items that we'll populate into this join-table every night like we do the others. Catalog #2 and up will be the subsets.

This even solves a problem I wasn't looking to solve: the trouble of always having to pass "WHERE active=TRUE" in every single search, so that I'm not showing items until they're approved to go on the site. It ALSO solves a problem of not including abstract items' variations, which is a topic of a future post.

2 Comments

aristotle
2005-01-23 13:56:25
There's more to SQL than what you learned with MySQL...
Isn't this kind of problem exactly what VIEWs were invented to solve?
dereksivers
2005-01-24 21:03:19
There's more to SQL than what you learned with MySQL...
I felt views would be too "expensive" since sometimes the queries it takes to create these subset-stores will be complex. Though I gave the example of "a jazz store", above, some will be as complex as "Midwest Bluegrass" which will have to only include products from one of 5 states in the USA with their musical style set to one of two styles.


Putting that in a view would make it run that query every single time on every single call, with extra qualifiers on top of it.


Yeah I coulda dumped all that on memcached, but then it also means that every time I wanted to create a new subset store, I'd have to go create a new view on the database.


I'm very happy with the joining-table solution described above.