Sane XML

by Simon St. Laurent

Is XML driving you insane? It shouldn't, and it doesn't have to. Sanity is within reach, if you're willing to discard a lot of junk and take a look at some tools that fit XML neatly.

The XML effort began back in 1996, focused on creating a subset of SGML that would be easier to work with and would finally let it reach a much wider group of developers and users.

Today, in 2004, XML is far more tangled than SGML ever dreamed of being. Even if you include other SGML-related ISO projects, like HyTime, in the mix, XML has far outstripped its supposedly complex parent. In practice, most users focus only on a tiny subset of the capabilities that standards bodies and vendors have provided, but choosing a subset that doesn't inflict major pain over the course of a project is still difficult.

Over three years ago, a group of developers in which I participated proposed "Common XML", a subset of XML 1.0 and Namespaces in XML. We thought it trimmed the fat pretty reasonably and enhanced the interoperability that had been compromised by several design decisions in XML 1.0 itself. In practice, I think we got things mostly right, as developers who work with XML tend to stick to the parts whose use we encouraged, and seem to have gotten the message that some of the pieces we described as extensions may or may not work as expected across applications. (I don't credit Common XML with making any changes; it just codified practices people have largely found on their own.)

Today, the XML landscape is far more complex, with specifications good and bad littering the computing world. One of the most bloated, W3C XML Schema, has dominated the tools world despite interoperability and complexity issues. Thanks in large part to early support from vendors, this collection of issues masquerading as a schema language continues to dominate the XML world - and in my opinion, makes the cost of using XML much higher for both vocabulary creators and consumers of those vocabularies. W3C XML Schema is only one of a number of complicating specifications from the W3C, and the W3C is starting to find itself troubled by the additions it made to SGML.

Developers don't need these headaches, though they may feel trapped by currently available tools, and many of them haven't heard that there are in fact alternatives to W3C XML Schema. Using XML shouldn't be a mind-bending experience, and it's possible to discard most of W3C XML Schema and still get work done - even get more and better work done.

The key to this sanity is a strict focus on XML and XML documents. Stop pretending that these things have object hierarchies, and stop hacking around the conflicts between object hierarchies and document realities with broken tools like substitution groups and keys. Focus on the documents themselves and the structures you'd like to have in those documents, and there's a chance you'll produce documents that are a pleasure, rather than a burden, to work with. You can build schemas using this understanding of documents with RELAX NG, a schema language that describes document structures, not type structures abstracted on top of document structures.
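To make "describes document structures" concrete, here is a minimal sketch of a RELAX NG schema in its XML syntax. The vocabulary (addressBook, card, name, email, note) is hypothetical and chosen only for illustration; it is not taken from any schema discussed here.

```xml
<!-- Hypothetical vocabulary, for illustration only. -->
<!-- The pattern nests the same way the documents it describes do. -->
<element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="card">
      <element name="name"><text/></element>
      <element name="email"><text/></element>
      <!-- A card may carry at most one note. -->
      <optional>
        <element name="note"><text/></element>
      </optional>
    </element>
  </zeroOrMore>
</element>
```

The schema reads like a template of the document: an addressBook holds any number of cards, and each card holds a name, an email, and an optional note. Nothing in it declares a type hierarchy.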

I gave a presentation last week on how to use RELAX NG to create schemas which work with W3C XML Schema tools - you don't have to give up compatibility with current tools to escape the complexity. There's plenty of information available to get you started, as well as a new O'Reilly book that's also available online.

Take a look at RELAX NG, and start using it where you can. Start by writing new schemas in RELAX NG, and convert them to W3C XML Schema later if you need to. Ask other developers for schemas in RELAX NG format. Even the W3C, purveyor of W3C XML Schema, has found RELAX NG to be useful.

XML was never meant to be complicated. You shouldn't have to buy a continuous stream of books, even O'Reilly books, to get your work done using XML. (Given the state of the XML book market, it seems clear that the treadmill has exhausted people.) While you'll undoubtedly still find data modeling a challenge, RELAX NG will let you focus on your information structures rather than on the intricacies of a bloated schema specification half-hidden by tools.

What else about XML makes you crazy?


2004-01-22 13:36:04
Perhaps the cart is before the horse
I believe a lot of what's happening is that people are focusing on the XML representation before they understand what it is they're representing. The simplicity of the abstraction is lost.

There are some domains where XML is THE THING (or very close to it) and others where it is just a best approximation, given the limitations of the syntax for expressing THE THING. The essence gets lost, even when all the information is conveyed.

Maybe I'm old-fashioned, but I remember when we would first take pains to understand the abstract user model BEFORE we attempted to represent it physically in database or serialized file format. Nowadays it seems like people confuse the two; they think the format is the data, and forget that any physical representation introduces notational artifacts that have no bearing on the user's domain experience.

This was the problem with DOM: the user's data as a document object model. The user doesn't think in terms of document object models (unless he's an editor, perhaps). The information users deal with every day can only be approximated in whatever physical representation you choose. And a good UI makes the purely representational artifacts disappear behind the video screen. (This is why many mainframe interfaces are such dogs: they're too close to the database tables and rows that underlie them.)

XML is just another way to represent information, but unfortunately many still think it is THE way, and one that somehow has implicit importance to the guy who must maintain such information. Phooey!

2004-01-22 13:48:29
How many carts?
I think we may be talking across each other here, since - apart from your model-first preference - I don't think we're very far apart.

I think the problem in W3C XML Schema is more ferocious than whether an abstract model matters more or less than the syntax.

In W3C XML Schema, you effectively define two different things simultaneously: an abstract model and representations of that model. The two pieces tangle with each other, making it very difficult to work with one or the other but not both.

In RELAX NG, you're defining a representation - one that also has a clean underlying model. Removing the extra abstraction of types may seem like a drawback to model-first developers, but in many ways it's a gift. It means, for instance, that you can develop the model in tools which are optimized for model development - think UML - and then convert it into RELAX NG without fear of model mismatch. Eric has written about this in the RELAX NG book, for instance.

The RELAX NG approach lets you hitch whatever horse you'd like to the cart; W3C XML Schema requires you to hitch your horse to multiple interlocking carts. I think you'll find RELAX NG is actually easier if you want to define the model and the representation separately, though I'm not sure you were arguing that anyway.

2004-01-22 14:41:52
How many carts?
Okay, I'm spinning off your "XML is not objects" statement, not the "RELAX NG is a better XML metalanguage" one. The former is my hobbyhorse (and cart?) of the moment.

XML Schema is, frankly, a failed effort, one kept alive by extraordinary means. I'm betting we'll all be discussing XML Schema with the same wistfulness as ALGOL 68 in ten or fifteen years. All memories soften with age.

XML Schema exacerbates the situation I'm harping on precisely because it does provide datatype info. It's like a little siren song that reinforces the notion that XML structures are just flattened objects (with hordes of little getter/setters implied).

Model-to-model bindings such as UML to RELAX NG are okay as far as they go, but as you eventually wind up with other interacting representations, no binding tool at present is powerful enough to handle that seamlessly. I don't think it will ever happen; laziness is a virtue, but it's real work to maintain.

It's not like I'm preaching to the preacher here, or anything....