The (data) medium is the message

by Simon St. Laurent

Marshall McLuhan published Understanding Media in 1964, back when the foundations of most of our current information processing systems were still developing. While McLuhan's discussion of media per se may not seem relevant to the dull work of information management (as opposed to the glossy hype of Wired, which adopted him as patron saint), some of McLuhan's fundamental insights apply as strongly to relational databases and XML as they do to television or the Web.

In Understanding Media, McLuhan first explored a general theory of media and then explored speech, writing, print, the telegraph, the photograph, the phonograph, and much more, including things like roads, housing, and automation. Perhaps the best summary of what he was getting at comes from the Introduction to the Second Edition:

"Environments are not passive wrappings but active processes."

In computing, we generally regard environments as active processes we use to get things done, but we don't often look at the impact that the environment has on the way we think. People recognize, for instance, that there are real transition costs in moving from Perl to Java to Lisp, but people who remain immersed in one environment don't always recognize how the lessons they learn in that environment may be completely inappropriate in other environments. The paths, the priorities, and the best practices emerge from a combination of design and discovery that is particular to each language.

In my experience, these differences are even less recognized on the data side of computing, which most developers seem to regard as passive storage. Typically, convenience rules the choices here, with developers either working from legacy data or building systems using tools they already understand or have paid for. "It's all just data" is a fairly common expression, and a lot of developers see the code they write rather than the form of the data as the important part of the puzzle.

Developers who look at information this way make the same mistake made by people who think newspapers and television both deliver news, so what does it matter? To some degree, you can get the same information from different media sources, but no one expects television to be a reading of newspaper stories or the newspaper to be a transcript of the nightly news on TV. Both are containers for information, but the shape of the container inevitably affects the way the information is both produced and consumed. Sophisticated consumers and media moguls both typically understand this. The consumers try to get information from multiple sources and compare different media, while moguls have spent the last decade building business empires that span different media to reach different customers with similar (advertising and more) messages.

While the developer's view of information politics is usually more local and better understood than mass media politics and the FCC, the differences between media persist. Relational databases are all about linked tables and structured atomic data, and the possibilities that combination opens. Object stores and serializations are generally about flexible hierarchies, with relatively direct linkages to particular processing expectations. XML is about labeled, hierarchically structured containers, with a general separation between content and processing expectations. (I'm using XML here generically for both XML documents and collections of XML documents.) RDF is about directed graphs, avoiding specific processing expectations regarding their content but with a well-defined general model for manipulating the graphs. Plain text, of course, offers a sequence which may or may not contain identifiable repetitive structures.
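To make those contrasts concrete, here is a minimal sketch that puts one small fact into three of these media using only Python's standard library. The table names, element names, and the `ex:` identifiers are all invented for illustration, and the triples are plain tuples rather than a real RDF library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational: atomic values in rows, linked across tables by keys.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, "
    "author_id INTEGER REFERENCES author(id))"
)
conn.execute("INSERT INTO author VALUES (1, 'Marshall McLuhan')")
conn.execute("INSERT INTO book VALUES (1, 'Understanding Media', 1)")
relational_rows = conn.execute(
    "SELECT book.title, author.name "
    "FROM book JOIN author ON book.author_id = author.id"
).fetchall()

# XML: labeled containers nested in a hierarchy.
book_el = ET.Element("book")
ET.SubElement(book_el, "title").text = "Understanding Media"
ET.SubElement(book_el, "author").text = "Marshall McLuhan"
xml_text = ET.tostring(book_el, encoding="unicode")

# RDF-style: a directed graph held as (subject, predicate, object) triples.
# The "ex:" names are hypothetical identifiers, not real URIs.
triples = {
    ("ex:book1", "ex:title", "Understanding Media"),
    ("ex:book1", "ex:author", "ex:mcluhan"),
    ("ex:mcluhan", "ex:name", "Marshall McLuhan"),
}

def objects(graph, subject, predicate):
    """Follow one kind of edge out of a node in the triple graph."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)
```

The fact is the same in each case, but the questions that come naturally differ: a join in the first, a walk down a tree in the second, and edge-following in the third.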

Perhaps the most important thing to recognize about all of these forms is that they are different. There are, of course, cases where relationally-modeled information can be represented as objects, XML or RDF, and there are cases where object stores or RDF triple stores use relational databases as back-ends, but these all involve tinkering and compromises. There is no general way for an XML document to serve as an efficient foundation for relational queries, nor is RDF much good at modeling XML's mixed content. While it may be convenient in some cases to serialize objects to XML, it requires lots of metadata if the object needs to be reconstituted in the same form, and the XML produced by serializations often looks alien to people who actually care to work with XML itself.
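The serialization point can be sketched in a few lines. This is an assumed, deliberately naive serializer, not any particular library's format; the `Reading` class, the `object`/`field` element names, and the `CASTS` table are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Reading:
    sensor: str
    value: float
    ok: bool

def to_xml(obj):
    """Serialize an object generically, recording class and field types."""
    root = ET.Element("object", {"class": type(obj).__name__})
    for name, value in vars(obj).items():
        field = ET.SubElement(
            root, "field", {"name": name, "type": type(value).__name__}
        )
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Without the type attributes, every value would come back as a string.
CASTS = {"str": str, "float": float, "bool": lambda s: s == "True"}

def from_xml(text):
    """Rebuild a Reading from the generic form, using the stored types."""
    root = ET.fromstring(text)
    kwargs = {f.get("name"): CASTS[f.get("type")](f.text) for f in root}
    return Reading(**kwargs)

original = Reading(sensor="kitchen", value=21.5, ok=True)
serialized = to_xml(original)
restored = from_xml(serialized)

# What someone writing the XML by hand might have chosen instead:
#   <reading sensor="kitchen" ok="true">21.5</reading>
```

The round trip works, but only because the markup is full of metadata about the program. The labels describe the object model (`object`, `field`, `type`) rather than the information, which is roughly why such output looks alien to people working with the XML directly.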

At the same time, these different approaches do particular tasks very well. The relational model allows the efficient processing of vast quantities of information stored in unordered rows in related tables. Object stores let developers put objects somewhere without having to spend time creating pathways between their existing model and a different model. XML comes from the document world, and most of its functionality is aimed at creating labeled structured content that both humans and computers can process. RDF is about assertions and how they combine to make statements, and while humans frequently have a hard time making sense of URI chains, some programmers find that RDF solves classification and other problems easily.

XML currently carries the unfortunate burden of being the medium the other forms think they understand. Object serializations in particular produce an enormous amount of lousy markup. Technically, it's XML, but its creators plainly cared about their program and not much about XML, or how anyone else might want to process the XML they create. Relational database folks have faced the same problem for years, as developers find all kinds of strange structures in databases that reflect the needs of a particular program rather than a vaguely sane normalization of data according to relational best practices. (At least relational databases and XML share a notion of named containers for data, though how they work with those containers is very different!) RDF creates similar problems for XML, as lately there's been a flurry of proposals for 'fixing' XML with RDF tools and structures. RDF's own XML syntax isn't widely beloved either, but as long as you never look at the XML...

I don't think I invented it, but I've long described a difference between 'square' and 'groovy' data. 'Square' data is easily atomized and tabulated, a perfect fit for relational models. 'Groovy' data is information that doesn't neatly fit in a box, typically information created by and for human users directly. XML fits that kind of data very well, with its tolerance for arbitrarily recursive structures, reuse of content through inclusion, and structures flexible enough to mix raw text nodes with containers. RDF feels like 'puzzle' data to me, interlocking pieces which form larger pictures when assembled. Object stores are kind of a combination of all three of these, with the strong demands for structure common to relational work, the hierarchy of XML (though with multiple and different kinds of hierarchy), and a massive dose of RDF's interconnectedness. I don't quite know what to call that, as it both subsets and supersets the other categories.
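A small sketch of what 'groovy' looks like in practice, assuming Python's standard `xml.etree.ElementTree` and an invented `note` document. It handles only one level of nesting, which is enough to show text nodes mixed with containers.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed content: raw text interleaved with labeled containers.
note = ET.fromstring(
    "<note>Call <person>Ana</person> before <time>noon</time>, please.</note>"
)

def ordered_pieces(element):
    """List text runs and child elements in document order (one level deep)."""
    pieces = []
    if element.text:
        pieces.append(("text", element.text))
    for child in element:
        pieces.append((child.tag, child.text))
        if child.tail:
            pieces.append(("text", child.tail))
    return pieces

def flatten(element):
    """A 'square' view: one named value per child, as a table row might hold."""
    return {child.tag: child.text for child in element}

pieces = ordered_pieces(note)
fields = flatten(note)
sentence = "".join(text for _, text in pieces)
```

The flattened `fields` keep the labeled values but drop the surrounding words and their order, while `pieces` can still rebuild the whole sentence. That lost remainder is the part that doesn't neatly fit in a box.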

I think it's time for developers to take a closer look at how they're storing data, and what that means for the data and for other developers. We seem to have moved into an age where modeling information too tightly against a particular set of processing expectations incurs significant costs, and it's time to start thinking about which medium fits our information best rather than what we want to do with the information today.

Diversity has its costs, but recognizing the positive contributions these different media can make should lower costs over the long term. Use relational databases where they're appropriate, XML where it's appropriate, RDF where it's appropriate, and strictly object approaches where they are appropriate. There's no need to put everything in a single model, despite the claims of both one-model purists and vendors trying to solve everyone's problems. It may take some learning and some looking around at different models, but that upfront investment should help avoid painful legacy disasters later.

Why can't data all be the same?


2003-05-08 09:16:10
Newspapers and TVs are containers?
I always considered these to be interfaces. We can push different kinds of information through those interfaces.

In much the same way, I think that the data formats described evolved to support interfaces with varying capacities and protocols, and varying assumptions regarding the capabilities of the receiving processors that interpret the information.