Best practices for XML vocabulary creation

by Simon St. Laurent

I've had a small flurry of odd questions about how to do a good job of creating XML vocabularies. The answers are different for every vocabulary, but here's what I've learned in the past few years.

Documentation, documentation, documentation!

Whatever you do, make it easy for users to figure out how they're supposed to use your vocabulary. Start writing documentation when you start developing, and use the documentation to keep yourself clear about what you're doing. Prose descriptions accompanied by a rich set of examples provide users with far more help than a bare schema. "Tag abuse" and the many weird application issues it creates can be greatly reduced with good documentation. Make your documentation easy to find - RDDL's approach of putting both human- and machine-readable documentation at the namespace URI is a good one.

Listen to the problem you're trying to solve.

This one comes in a lot of forms. Sometimes it's requirements gathering, other times it's notes on a napkin about what a process needs to operate. However you do it, make sure you keep thinking about the problem throughout your development process and afterward. Be sure to document what the problem is and how you think you've addressed it. Contemplate the possibility that your solution might create worse problems.

Step away from the program.

It's tempting to write XML vocabularies that directly reflect the data structures you've written into a program. There are several large problems with this. First, binding data to program structures that tightly often makes it more difficult to make things work when either the data format or the program structure changes. Second, you may well find you've limited how well your vocabulary plays with programs that run in different environments or with different assumptions. Perhaps worst of all, however, the structures you've used in your program may be far from the ideal way to present the same data in XML. (Take a look at WordprocessingML or Apple's plists if you don't believe that object serialization can produce excessively complicated markup.)

Plan for change.

Versioning is tough everywhere, but XML offers some opportunities to make things easier on yourself. You can leave spaces for further expansion, creating vocabularies whose processors can tolerate the possibility that they don't understand every single part of a document and work with what they do know. You can use schema languages like RELAX NG and Schematron, which are more tolerant of changing, modularized, and interlinked structures than DTDs or W3C XML Schema. And make sure that places where change is foreseen are well-marked in the excellent documentation you produce.
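
As a rough illustration of a processor that works with what it knows and tolerates the rest, here is a minimal Python sketch. The `book` vocabulary, its element names, and the `read_book` helper are all invented for this example; a real vocabulary would also need documented rules about which unknown content is safe to skip.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these element names are invented for illustration.
KNOWN_FIELDS = {"title", "author"}

def read_book(xml_text):
    """Return (fields this reader understands, names of elements it skipped)."""
    root = ET.fromstring(xml_text)
    understood = {}
    skipped = []
    for child in root:
        if child.tag in KNOWN_FIELDS:
            understood[child.tag] = (child.text or "").strip()
        else:
            # Unknown element: note it and move on rather than failing.
            skipped.append(child.tag)
    return understood, skipped

# A document written against the original vocabulary.
OLD_DOC = "<book><title>Example Title</title><author>A. Writer</author></book>"

# A later document that adds an element this reader has never heard of.
NEW_DOC = ("<book><title>Example Title</title><author>A. Writer</author>"
           "<edition>2</edition></book>")

if __name__ == "__main__":
    print(read_book(OLD_DOC))
    print(read_book(NEW_DOC))
```

The older reader still gets the title and author out of the newer document, and the list of skipped names gives it something to log or report instead of a hard failure.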

Remember that you're creating documents, not just schemas.

It's tempting to look at XML vocabularies as things like C structs, where you define the structure and then a program comes along and fills them. Unlike structs, however, XML documents are open to anyone who happens to encounter them, not just your programs. People and programs write XML in all kinds of ways, many of them chaotic. Make sure you've considered what the document structures look like, not just what the schema structures look like.

Take advantage of hierarchy, except when you can't.

XML is all about trees. Elements contain attributes, other elements, and text. Containment is everywhere in XML. If two or more things go together, put them in a common parent element. While it's certainly possible to go crazy and add too many levels to an XML document, there's a balance between possible and necessary that isn't too hard to achieve in practice. There are also times you need to cut across hierarchies - cross-references and keys are historic examples - and you should take advantage of XML's facilities for doing that. If you don't find hierarchy a good fit overall for your document structure, however, you should probably contemplate leaving XML behind and using a different approach entirely.
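
To make the mix of containment and cross-reference concrete, here is a small Python sketch. The `library` vocabulary, the `id` and `ref` attribute names, and the `authors_by_title` helper are assumptions made up for this example. Things that belong together share a parent element, while the many-to-many link between books and authors cuts across the tree by identifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: element and attribute names are invented here.
DOC = """
<library>
  <authors>
    <author id="a1"><name>A. Writer</name></author>
    <author id="a2"><name>B. Editor</name></author>
  </authors>
  <books>
    <book>
      <title>First Example</title>
      <by ref="a1"/>
      <by ref="a2"/>
    </book>
    <book>
      <title>Second Example</title>
      <by ref="a1"/>
    </book>
  </books>
</library>
"""

def authors_by_title(xml_text):
    """Map each book title to its author names by resolving ref -> id."""
    root = ET.fromstring(xml_text)
    # Containment: every author lives under the common <authors> parent.
    names = {a.get("id"): a.findtext("name")
             for a in root.findall("./authors/author")}
    # Cross-reference: each <by ref="..."/> points at an author's id.
    return {
        book.findtext("title"): [names[by.get("ref")] for by in book.findall("by")]
        for book in root.findall("./books/book")
    }

if __name__ == "__main__":
    print(authors_by_title(DOC))
```

Each author is described once, and a book with several authors simply carries several references.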

Learn XPath.

If it's easy to get to your information using XPath 1.0, you've probably created an XML document structure that other people can process easily. If you find yourself needing to use a lot of named axes to get around, seeking an element's parent's sibling's parent to interpret that element's meaning, you probably need to take a close look at your vocabulary and contemplate restructuring.
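
Here is a hedged sketch of the difference, using Python's standard ElementTree module, which supports only a subset of XPath 1.0. The `order` documents and both helper functions are invented for illustration. In the flat version a price is tied to its name only by position, so the XPath needs a named axis; in the nested version a short path with a simple predicate is enough.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents carrying the same data in two shapes.
FLAT = ("<order><name>widget</name><price>9.99</price>"
        "<name>gadget</name><price>24.50</price></order>")
NESTED = ("<order><item><name>widget</name><price>9.99</price></item>"
          "<item><name>gadget</name><price>24.50</price></item></order>")

def price_flat(xml_text, wanted):
    # XPath 1.0 for this lookup needs a named axis, e.g.:
    #   /order/name[. = 'widget']/following-sibling::price[1]
    # ElementTree's XPath subset has no following-sibling axis,
    # so the siblings are walked by hand instead.
    children = list(ET.fromstring(xml_text))
    for i, child in enumerate(children):
        if child.tag == "name" and child.text == wanted:
            for later in children[i + 1:]:
                if later.tag == "price":
                    return later.text
    return None

def price_nested(xml_text, wanted):
    # XPath 1.0 for this lookup: /order/item[name = 'widget']/price
    # Assumes `wanted` contains no quote characters.
    root = ET.fromstring(xml_text)
    return root.findtext(f"./item[name='{wanted}']/price")

if __name__ == "__main__":
    print(price_flat(FLAT, "gadget"))
    print(price_nested(NESTED, "gadget"))
```

Both functions return the same price, but the nested structure gets there with one short path, while the flat one depends on sibling order staying exactly as it is.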

That'll do for the basics. There's lots more, of course, but given that foundation, there's hope of people getting it right.

Any more best practices?


2005-03-01 20:28:49
Here's another:
Avoid attributes in all but a few very specific cases. Avoid mixed content if you are creating a data-oriented (as opposed to document-oriented) vocabulary.

Both of these will make your job in accessing, building, and transforming documents much easier. You'll have fewer special cases to deal with. You'll also have an easier time when the time comes to expand the vocabulary.

2005-03-02 04:51:40
Sorry, but those aren't best practices in my view.

You're foreclosing on a lot of flexibility, probably because you're already thinking about the programs you'll use to process the XML rather than the data you'll be transferring.

If attributes fit your work, use them. If mixed content makes something easier, use it. Get the data right before you contemplate processing.

2005-03-02 04:52:39
Comment above ("Sorry...") is supposed to be a reply to this comment.