Stop making those gigabyte XML files, already

by Uche Ogbuji

I've been hollering this for years now (softly counseling in the case of my clients), and I'm glad to hear others giving the same advice. As no less a sage than Mike Kay says:

"I wonder whether [creating huge XML files] is a wise way of using XML. Even with XML databases, most databases are optimized to handle large numbers of small/medium documents rather than a single gigantic one. I don't think that using an XML document as a replacement for a database is a particularly good idea. It's not the job it was designed for."

Yes, folks. XML is not designed to be a monolithic database instance implementation. If you're dealing with gigabyte XML files, I can almost guarantee your design is broken somewhere. Between modern file systems and modern archive formats and tools, there is no reason not to decompose XML into reasonable chunks.
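For anyone already stuck with one of these monsters, here is a minimal sketch of one way to break it up, using only Python's standard library. It assumes the big file is essentially a root element wrapping a long run of sibling records. The `split_records` name, the `record` tag, the `chunk` wrapper element, and the chunk size are all hypothetical choices for illustration, not anything prescribed by XML itself.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_records(source, out_dir, record_tag="record", per_chunk=1000):
    """Stream `source` and write groups of records to small XML files.

    Assumes records are direct children of the root element. The tag name,
    the <chunk> wrapper and the chunk size are illustrative assumptions.
    Returns the list of chunk file paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batch = []

    def flush():
        # Write the pending records as one small, well-formed document.
        if not batch:
            return
        chunk_root = ET.Element("chunk")
        chunk_root.extend(batch)
        path = out_dir / f"chunk-{len(written):05d}.xml"
        ET.ElementTree(chunk_root).write(
            str(path), encoding="utf-8", xml_declaration=True
        )
        written.append(str(path))
        batch.clear()

    # iterparse reads incrementally, so the whole file is never held at once.
    context = ET.iterparse(str(source), events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    depth = 1
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1 and elem.tag == record_tag:
            batch.append(elem)
            root.remove(elem)  # detach so memory use stays bounded
            if len(batch) >= per_chunk:
                flush()
    flush()
    return written
```

Calling `split_records("big.xml", "chunks", per_chunk=500)` would leave a directory of small files that a file system or archive tool can manage individually. Anything in the source that is not a direct-child record is simply left out of the chunks in this sketch, so a real migration would need to decide what to do with it.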

Update: for a bonus, see Kay's argument against some overcooked RDBMS dogma. I strongly agree with him here, as well, even though I'd guess Fabian Pascal and gang are still looking for scalps of such heretics.


2004-07-29 12:43:07
The problem is that linking isn't intuitive to anybody used to the HTML model of linking, which was so simple. There may now be a standard means of referencing another document (XLink), but that standard isn't easily understood, and it has the added hassle that people writing otherwise simple XML have to deal with the complexity of other namespaces and schemas. In other words, their own business knowledge (what does MY XML file represent?) gets mixed up with the technical knowledge needed to process a linked document correctly.

This lack of an accepted (by the masses, not necessarily by a committee) standard for linking is one of the reasons I feel the Semantic Object Web isn't. As in, it's not a web, because there's no direct way for me in RDF to specifically reference another RDF document and keep going the web metaphor that we're so used to in HTML. Yes, I can include URLs as URIs, but there's no guarantee that any particular URL is or isn't an RDF file, or even is a file at all.

Yes, that's also true of HTML's anchor tags, but the Semantic Object Web should have defined its own semantics for dealing with that, rather than falling back on the specifics of the web server and a little blind luck to find more RDF documents.

FOAF in particular is a MAJOR failing because of this. I shouldn't have to define my own URIs and metadata for my friends. I should provide their URIs, and those URIs should also be full URLs to their own FOAF files, which could be used to follow a FOAF chain around the world. Right now, it doesn't do that, so IMHO it's a technical failure.

The Semantic Object Web can't be "surfed". Thus, to some, the only apparent way to make all your data processable is to stick it all into a single file. Otherwise, you're basically on your own for defining the mechanisms by which multiple XML files become a single knowledge base; the standards haven't helped us here at all.
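As a rough illustration of the kind of do-it-yourself mechanism meant here, the following sketch treats a directory of small XML files as one logical collection, using only Python's standard library. The `iter_collection` name, the `*.xml` naming pattern, and the `record` tag are hypothetical. It is a convention invented for the example, not any standard, which is rather the point.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_collection(directory, match="record"):
    """Yield (file name, element) pairs for every matching element
    across all *.xml files in `directory`, in sorted file order.

    The directory layout and tag name are assumptions for this sketch.
    """
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        for elem in root.iter(match):
            yield path.name, elem
```

A caller could then query across files as if they were one document, for example `[e.get("id") for _, e in iter_collection("chunks")]`, while each file stays small enough to parse on its own.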