Debunking GROUP BY Myths

by Roland Bouman

There is a popular myth about the SQL GROUP BY clause. The myth holds that 'standard SQL' requires columns referenced in the SELECT list of a query to also appear in the GROUP BY clause, unless these columns appear exclusively in an aggregated expression. MySQL is often accused of violating this standard.

In this article I will attempt to debunk this myth, and to provide a more balanced view regarding MySQL's treatment of GROUP BY at the same time.

8 Comments

Aristotle Pagaltzis
2007-05-24 18:13:57

I have to say after 4800 words, I am a little disappointed: I expected something a little more interesting. I came out with more understanding, so it was worth the read; but there isn’t much of practical use here. At most, it might help optimising queries on MySQL.


Why do I say that?


Consider what distinguishes a column that is functionally dependent on a grouping column: in any rows where the grouping column’s values are identical, so are the functionally dependent column’s values. This is the definition of a functionally dependent column, after all. Now, since the value of a grouping column is by definition identical for all rows in a group, then likewise the value of a column that is functionally dependent on any grouping column will also be identical for all rows in a group. So far this is just what you established in the article.


However, this leads me down a route you didn’t mention at all: it means that any aggregate function whose result can be computed from columns of any single row will yield the same value as any other such function if its expression is based only on functionally dependent columns. In MySQL, that includes only MIN() and MAX(), but in other databases it might also include functions like MEDIAN() or COMMONEST(). All of these functions compute the given expression once for each row, then pick one of the computed values. If that expression only uses values from functionally dependent columns, then it will evaluate to the same result on every row of the group, and therefore regardless of which row the aggregate function picks, it will yield the same result. So it doesn’t matter whether you say MIN(func_dep_column) or MAX(func_dep_column) or (per SQL’92 or ’2003) func_dep_column in the SELECT clause – the result will be identical.


In other words, rather than (needlessly) adding the functionally dependent column to the GROUP BY clause when you enable ONLY_FULL_GROUP_BY, you can use any one of these aggregate functions to legitimise the column in question. You do not need to maintain two concurrent mentions of the column (one in the SELECT clause and one in the GROUP BY clause). This is probably slower than disabling ONLY_FULL_GROUP_BY and using no aggregate function at all, but should be faster than adding the column to the GROUP BY clause. It is most likely more portable.

Roland Bouman
2007-05-25 00:26:20
Hi Aristotle,


thanks for you comment.


"I have to say after 4800 words, I am a little disappointed: I expected something a little more interesting. I came out with more understanding, so it was worth the read; but there isn't much of practical use here. At most, it might help optimising queries on MySQL."


Well, it's hard to please everybody. Next time, you might want to read the first and the last section of a long entry - I usually have a proper "abstract"/"conculsion" structure which should be enough to decide if you find the article interesting.


"However, this leads me down a route you didn't mention at all: it means that any aggregate function whose result can be computed from columns of any single row will yield the same value as any other such function if its expression is based only on functionally dependent columns."


You are absolutely right. Believe it or not, but I realised I did not include a section to describe that trick. I will edit the entry shorly to include a section to describe this.

Aristotle Pagaltzis
2007-05-25 02:29:37

Ah, as I said: I learned something. It’s not that I could have skipped the middle parts – I had to read it most of it to follow. I didn’t already know this. I was just hoping there would be a cleverer conclusion to the question.

3bit
2007-05-30 03:40:24
From my point of view MySQL violates the Standard. Demonstrated by your Example Query:


mysql> SELECT species
-> , MIN(birth)
-> , MAX(birth)
-> , birth
-> FROM menagerie.pet
-> GROUP BY species;


Because birth is functionally dependent on the the key (in this case maybe name) but it is not functionally dependent on the species column. (So it is not clear which birth date is to be selected for a species.) Thus prohibiting the use of it in the SELECT list without using it in the GROUP BY clause.

Roland Bouman
2007-05-30 05:17:51
Hi 3bit,


thanks for your comment ;)


"From my point of view MySQL violates the Standard."


Yes - I said so in the conclusion:


"In MySQL, one can write GROUP BY queries that reference non-aggregated columns in the SELECT list that are not included in the GROUP BY clause, even if these columns are not functionally dependent upon the GROUP BY clause. This behaviour conforms to none of the SQL standard's versions."


But also:


"It is possible to avoid this behaviour by including ONLY_FULL_GROUP_BY in the sql_mode server setting, but it might make more sense to take advantage of the ability to write only partial GROUP BY clauses."


Meaning that if you do include ONLY_FULL_GROUP_BY in the sql mode, you cannot use partial GROUP BY clauses even if they are perfectly safe because of functional dependency.


The Myth being debunked here is not that MySQL does not support the standard - the Myths being debunked here are:


1) The standard requires only full group by clauses, and
2) MySQL is not capable of rejecting partial group by clauses.

Dominique
2007-09-20 16:20:43
I was trying to do a Group by on a "partial" content of a column.
A\B\C\D
A\B\C
A\B\


I want a Group by on A\B
Thanks

Roland Bouman
2007-09-21 00:41:19
Hi Dominique,


you mean something like this:


GROUP BY LEFT(col,3)


(makes groups of those rows that have the first three characters of the col column in common)

Adrian
2007-12-09 08:47:38
Excellent explanation !


I was looking for some info regarding group by clause
and it couldn't be explained in a better manner !


Well done and thanks for great info !