the joy of a self-joining SQL query

by Derek Sivers

At CD Baby, our "songs" database table has a list of every song on every album. (Albums are identifed by their sku we call "albumcode")

As we digitize each CD into the FLAC format, we add an "f" into the field called "encoded".

The problem:
I want to find all albums where we have SOME of the FLAC files now, but not all. (The problem albums - I want to re-do these.)

The answer:
Tell MySQL, "show me all albums where encoded has f AND encoded does NOT have f".


SELECT DISTINCT(s1.albumcode)
FROM songs s1
LEFT JOIN songs s2 ON s1.albumcode=s2.albumcode
WHERE s1.encoded LIKE '%f%'
AND s2.encoded NOT LIKE '%f%'


I haven't had a query make me smile in a while. :-)

Other examples of this? Or was there a better way to solve this problem?


4 Comments

Jonathan Gennick
2004-05-27 05:46:00
Aggregation presents a solution
I can think of two alternative solutions involving aggregation.


First, I created the following data to test with:



SQL> SELECT * FROM songs;


ALBUMCODE SONG ENCODED
--------------- -------------------- -------
Album 1 Song 1 a
Album 1 Song 2 a
Album 1 Song 2 a
Album 2 Song 1 b
Album 2 Song 2 bf
Album 2 Song 2 b
Album 3 Song 1 cf
Album 3 Song 2 cf
Album 3 Song 2 cf


Then I wrote the following query, which uses Oracle Database 10g's new, regular expression functionality to look for a trailing 'f' in the encoded column:


SQL> SELECT albumcode
FROM songs
GROUP BY albumcode
HAVING COUNT(REGEXP_SUBSTR(encoded,'.f'))
NOT IN (COUNT(*),0);


ALBUMCODE
---------------
Album 2


Not everyone has a regular expression substring function at their disposal, so I also took a shot at creating an aggregate query using standard SQL, and came up with the following:


SQL> SELECT albumcode
FROM songs
GROUP BY albumcode
HAVING SUM(
CASE SUBSTR(encoded,-1,1)
WHEN 'f' THEN 1
ELSE 0
END) NOT IN (COUNT(*),0);


ALBUMCODE
---------------
Album 2


This query uses CASE and SUBSTR in place of REGEXP_SUBSTR in the previous query.


I can't speak for the performance of any of these queries under MySQL. However, I did test them in my Oracle database. The self-join solution required 14 consistent block gets (i.e. 14 logical reads). The two aggregate solutions each required half that: only 7 consistent block gets.
All three solutions show a sort step in their execution plan. The aggregate functions require a sort for the aggregation. The self-join requires a sort because of the DISTINCT. Given the difference in I/O, the aggregation solutions appear to be better choices, at least in Oracle they do. MySQL results could be different.

Jonathan Gennick
2004-05-27 05:49:27
Aggregation presents a solution
Derek, this is really an interesting problem that you posed. I wonder whether there might also be a UNION (or rather an EXCEPT) based solution. I'll give that some thought.
Jonathan Gennick
2004-05-27 06:22:15
No UNION solution that I can think of
Ok. I could not come up with a UNION solution that even begins to make sense. I did however, put together a query using analytic functions that could form the basis of a solution, but the logical I/O is no better than my previous, and probably easier to understand, GROUP BY queries:



SQL> SELECT DISTINCT albumcode
FROM (
SELECT albumcode,
COUNT(REGEXP_SUBSTR(encoded,'.f')) OVER
(PARTITION BY albumcode) fcount,
COUNT(*) OVER
(PARTITION BY albumcode) allcount
FROM songs
)
WHERE fcount NOT IN(allcount, 0);


ALBUMCODE
---------------
Album 2


Oh, I also discovered an error of no consequence in my data. I had my songs named: 'Song 1', 'Song 2', 'Song 2'. That last should have been 'Song 3'. Fortunately, this error isn't material to my solutions.


Well, now that I've had my fun with SQL for the day, I suppose I ought to get back to work...

jcworsley
2004-11-23 10:08:22
Hmm.
Hey Derek, just browsing your blog here and came across this and it got my SQL curiousity going. Here's yet another alternative to Gennick's preceding ones (hi Jonathan!), specific to a couple tricks I've used in PostgreSQL.


I'm not totally familiar with your entire schema here, obviously, but I am guessing that "encoded" is of type text or varchar and contains letters other than "f" for flac to indicate other types of encoding (perhaps "m" for mp3, etc).


So one simple solution here is to aggregate by albumcode, "sum" up the encoded letters for that group, count the number of "f" letters found, and compare it to the number of songs on the album (count(*) for that albumcode). Some gory details:


First, I keep a custom sum() aggregate function of type text in most of my databases which is easily created thusly:


CREATE AGGREGATE sum (
BASETYPE = text,
SFUNC = textcat,
STYPE = text,
INITCOND = ''
);


This function just concatenates text values for an aggregated group, and I find it periodically quite useful. Sometimes I'll call it "sumtext()" on client's databases if I want to be quite explicit that I am not intending to alias the functionality of the built-in sum() function (and in case future PostgreSQL releases choose to add their own functionality for sum(text)). To keep it simple I'll call it "sum()" for here.


Now we just needs a means of counting up the found "f"s in the summed output. If there are only a small number of possible letters in this field, you could do a translate() on those letters you do not want (e.g., translate(sum(encoded), 'msl', ''), if you had "m" "s" and "l" encodings), but since that's opaque to me I'll just slap a simple caching pl/perl regex into it like so:


CREATE FUNCTION count_letter(text, char)
RETURNS integer
AS '
$string = shift();
$character = shift();
(@matches) = $string =~ m/$character/g;
return scalar(@matches);'
LANGUAGE 'plperl'
IMMUTABLE;


Now, the actual solution, which looks like:


SELECT albumcode
FROM songs
GROUP BY albumcode
HAVING count_letter(sum(encoded), 'f') != count(*);


This should give you each albumcode where the number of associated song records does not match the number of "f" characters found in the concatenated and counted "encoded" field. The "DISTINCT" is omitted since the GROUP BY would make it redundant.


The only other possible consideration would be if the encoded field could contain "NULL" values in which case you'd need to throw a coalesce() in there.


/geekramble.


Best,
John.