Usage Schemas to tame ODF and OpenXML down-conversions

by Rick Jelliffe

Kitchen-sink standards are developed by committees and have to cope with a wide variety of different applications. If someone's software does something, some element or attribute or value has to be stuck in to support it. Sometimes the backdoor of properties (open-ended name/value lists) is used, so that the schema can be simplified at the expense of no longer enumerating the possible values. Schemas like DocBook, TEI, ODF and OpenXML are classic kitchen sinks.
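To make the properties backdoor concrete, here is a made-up contrast (the element and attribute names are invented for illustration, not taken from any of the schemas above). In the first fragment the schema enumerates every legal value; in the second, the schema only has to say "zero or more name/value pairs", so it stays small, but validation can no longer catch a value that no other application supports:

    <!-- Enumerated in the schema: the attribute is declared as,
         say, (solid | dashed | dotted), so validators catch typos
         and unsupported values -->
    <border border-style="dashed"/>

    <!-- The properties backdoor: the schema allows any name/value
         pair, so each application can stuff its own values in -->
    <properties>
      <property name="border-style" value="vendor-ridge-groove"/>
    </properties>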

There is an objective way to detect them: check their Structured Document Complexity Metric, and if it is over 300, you probably have a kitchen sink. I gave some metrics earlier in Comparing Office Document Formats.

Now the trouble with kitchen-sink schemas is that any particular set of documents will only use a subset of the total possible features. So writing a complete converter that accepts any possible input from a kitchen-sink schema and outputs it to some more targeted document type is a completely wasteful process. YAGNI. But, and here's the rub, every so often someone will in fact use one of the elements you didn't expect.

One way to cope with this is the usage schema. This is a schema derived from sampling representative documents. When new documents come in, you first validate them against the usage schema, and if there is a problem, escalate it to project management to discuss how to handle it. It is a sign that the data is not what they expected.
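What gets escalated can simply be the validation report. With the Schematron approach described below, a failure might look something like this SVRL fragment (SVRL is Schematron's standard reporting vocabulary; the flagged element and its location here are invented for illustration):

    <svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
      <svrl:fired-rule context="*"/>
      <!-- A node fell through to the catch-all rule: it appeared in no
           sample document, so a human needs to look at it -->
      <svrl:successful-report test="true()"
          location="/office:document-content/office:body/office:text/text:alphabetical-index[1]">
        <svrl:text>Unexpected element text:alphabetical-index
          (not found in the sample documents)</svrl:text>
      </svrl:successful-report>
    </svrl:schematron-output>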

There are some tools to generate XSD usage schemas, but you can also generate them using Schematron. The tool I use first generates all the three-level XPaths found in the documents, then makes a Schematron schema that reports any node that was not caught by those XPaths. Very straightforward, but effective.
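Here is a minimal sketch of what such a generated schema could look like, assuming ISO Schematron and a couple of harvested ODF paths (the paths are illustrative; a real run would emit one rule per distinct path found in the samples). Within a Schematron pattern the first rule whose context matches a node claims it, so the harvested rules act as a sieve and only unexpected elements fall through to the catch-all:

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:ns prefix="office" uri="urn:oasis:names:tc:opendocument:xmlns:office:1.0"/>
      <sch:ns prefix="text" uri="urn:oasis:names:tc:opendocument:xmlns:text:1.0"/>
      <sch:pattern>
        <!-- One sieve rule per three-level path harvested from the samples;
             the always-true assert produces no output, the rule exists
             only to claim the matched nodes -->
        <sch:rule context="office:body/office:text/text:p">
          <sch:assert test="true()">expected</sch:assert>
        </sch:rule>
        <sch:rule context="office:text/text:p/text:span">
          <sch:assert test="true()">expected</sch:assert>
        </sch:rule>
        <!-- Catch-all: anything not claimed above was never seen in the samples -->
        <sch:rule context="*">
          <sch:report test="true()">Unexpected element <sch:name/>:
            escalate to project management.</sch:report>
        </sch:rule>
      </sch:pattern>
    </sch:schema>

Run over an incoming document with any Schematron implementation, this produces a report like the SVRL fragment above, and anything it flags goes to project management.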

Another use for usage schemas is in software development. If the customer has provided a sample of the output format, then make a usage schema for that and check that the output from your converter validates. Escalate any differences to project management. This gives a way of proving that your program meets their specs, and also of showing where their specs (e.g. the sample output) were inadequate.
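The harvesting step itself is simple. Here is a minimal XSLT 1.0 sketch of the idea (my own illustration, not the actual tool mentioned above): it prints one grandparent/parent/self line per element, and the distinct lines become the rule contexts of the generated usage schema. Diffing the harvested lists from the customer's sample and from your converter's output is also a quick way to see exactly which differences need escalating.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- Print the three-level path of every element, one per line;
           deduplicate the output (e.g. with sort -u) to get the
           distinct paths used in the sample -->
      <xsl:template match="*">
        <xsl:value-of select="name(../..)"/>
        <xsl:text>/</xsl:text>
        <xsl:value-of select="name(..)"/>
        <xsl:text>/</xsl:text>
        <xsl:value-of select="name()"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:apply-templates select="*"/>
      </xsl:template>
    </xsl:stylesheet>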

4 Comments

orcmid
2007-03-08 20:06:17
All right!!


Thanks for this. The people who think they can mandate a format by government-wide fiat and have interoperability solved need to understand profiling, and agreement on the profiles used, as part of important practices.


I also like the kitchen-sink notion. It helps people visualize the problem and you also give a lead to technical approaches for confining a format to address a particular usage scenario.


Give us more! More!!

M. David Peterson
2007-03-08 23:20:39
>> Give us more! More!!


times that by 2, please :D

Michael
2007-03-10 12:50:40
"Kitchen-sink" seems a bit perjorative. In document-centered XML, it is not so much that different applications support different features. Instead, it is that the documents themselves are complex, and the applications are simply evolving to meet real user needs for both the common and the rare features.


You can see this in the evolution of the MusicXML format. Even MusicXML 1.0 had a complexity metric well over 300. While it was useful, it was also incomplete for musicians to use for effective document interchange. MusicXML 1.1 goes far beyond that and 2.0 will go even farther, as the format evolves to meet distribution as well as interchange needs. It's more work for the software developer (hence the usefulness of usage schemas) but in the service of making things transparent to the document creators and users. The whole point of increasing the DTD complexity is to enable the more complete interchange of digital sheet music documents that our customers are demanding.


To me, 300 seems more like a cutoff between "toy applications" and "real applications" for DTDs and schemas, rather than a "kitchen-sink" metric (only half :-)).

Rick Jelliffe
2007-03-12 16:44:26
Michael: You're right that some things are just complicated. But the kitchen-sink effect is especially strong where vendors with different specialist features all need to adopt the standard format as their native format, and so need it to support their unique features as well as the general ones.


The classic case here is the CALS exchange table model, which was at the roots of OASIS Open. The US military CALS project made a very large table model that kitchen-sinked all the features of every table that they could find. (Kitchen table?) None of the software products could support it. So the software vendors got together to make a profile to allow exchange, and agreed that the CALS features that none or few supported would be dropped, and that the features that most supported would be adopted by all.


This is a really good model, because it shows that 1) kitchen-sink schemas where every possibility is catered for allow modeling but prevent interoperability, 2) profiles based just on what everyone supports are good for interoperability but mean that everyone gets the features of the worst products, and 3) exchange profiles require users to tone down what they want and vendors to cooperate to add shared features.