Data Warehousing at MySQLCon: A tutorial and a session you shouldn't miss

by John Adams

At last summer's OSCON, I attended John Paul Ashenfelter's Data Warehousing with MySQL and OSS Tools tutorial and Roger Magoulas' Building the Open Warehouse session. Go on, click on those links, and--surprise! They don't go to the OSCON 2004 archives, do they? And those sessions have new names--weird!

Newer versions of both those presentations are coming up at the MySQL Users Conference. That's where those links go--and if you have the chance, perhaps you should go there, too.

Here are some of the highlights from my copious notes on both these presentations. I expect this week's versions will be similar but different. I'm guessing Ashenfelter will focus less on the various open source tools available to surround the MySQL database and more on specifics of using MySQL in data warehousing--and I'd attend a second time to get that. Magoulas was only six months into O'Reilly's open warehouse, so I know he'll have new things to report.

John Paul Ashenfelter’s tutorial, “Building a Data Warehouse with MySQL and OSS Tools”, laid out a straightforward implementation of data marts with MySQL as the RDBMS and other open source tools providing various services around it. (If Ashenfelter discussed his choice of operating system, it’s not in my notes, nor is it in the tutorial handouts.)

In the process, he made a convincing case that a dubious piece of Ralph Kimball’s advice--“Plan a data warehouse, but build a data mart”--can be a good plan for a mid-sized business. In a mid-sized business, the first data mart may be all the data warehouse that company needs (at least at that time), and can provide a sufficiently large chunk of the eventual warehouse that the warehouse can actually be built without completely reworking the data mart.

(It’s bad advice for very large enterprises, but it’s no longer current advice. Most large companies have now discovered the consequences of this path and coped in one of two ways: they either integrated their systems painfully into a single warehouse, or glued their systems together precariously into a somewhat consistent set of marts.)

The first half of Ashenfelter’s tutorial was a primer on data warehousing for open source programmers--a good summary for this audience, particularly in detailing the ways in which decision support systems differ from OLTP systems. (Not enough emphasis can ever be placed on this difference. These two major uses of the RDBMS are as different as night and day. Training is always application-specific, and thus tends to cover either transactional processing or analytic processing. Education tends toward the transactional model--it’s more interesting as a CS topic.)

The second half began with a small data mart building case, then went into some detail on open source tools available for data warehousing.

Roger Magoulas’ talk, “Building the Open Warehouse”, went through the implementation of an open source data warehouse, built at O’Reilly from third-party book sales information.

This was an inspiring presentation. At the time (and since), I've been about ready to leave data warehousing. Hearing from Magoulas about the speed and agility with which he and the rest of the O'Reilly people put together their warehouse was a wonderful tonic.

Like Ashenfelter's tutorial, Magoulas' presentation focused on what data warehousing could accomplish in a medium-sized business. Accordingly, certain items in Magoulas' presentation which wouldn't fly in a large enterprise made perfect sense in his context. (I had to work at making this mental adjustment--my last gig was working with a 10 TB warehouse, which struck me as a little small, comparatively speaking.)

Here's an example. Magoulas said that the data warehouse is a perfect open source project, because the warehouse isn't required to be up 24 x 7, and because you could restore it by simply reloading it. My first reaction to this was, "Where did you come from and who let you up there?" My second reaction was, "You came from an underserved market in data warehousing, and in that context, you are absolutely right." (Yes, I know there are open source applications which do run 24 x 7--I think Magoulas' point was aimed at deflecting a management objection to a specific open source implementation rather than advocating open source generally, and that's a tactically wise move in many cases.)

I'll write more about Magoulas' presentation later, but since Ashenfelter's tutorial starts in--dang! Where did the time go?--four hours, I'll post this now, so you can get what value there is out of my recommendation.

If you attend one of these presentations, tell me: Was it good for you, too?