Converting XML Schemas to Schematron (#1)

by Rick Jelliffe

In my blog Converting Content Models to Schematron I outlined some code ideas. Recently we (Topologi) have been working on an actual implementation for a client: a series of XSLT 2 scripts that we want to release as open source in a few months' time.

Why would you want to convert XSD to Schematron?

The prime reason is to get better diagnostics: grammar-based diagnostics basically don't work, as the last two decades of SGML/XML DTD/XSD experience make plain. People find them difficult to interpret, and they give the response in terms of the grammar, not the information domain. And error messages are reported in terms of where the error was detected, not where the error was. For example, given a content model (a, (b, c)?, c, d ) and a document <a/><c/><c/><d/> you will get an error "Expected a d" at the location of the second c element; however the problem really is that the b is missing.
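
The gap between where an error is detected and where it arguably lies can be shown with a small sketch. This is only an illustration in Python, not how any particular validator works: it lists the two child sequences that (a, (b, c)?, c, d) allows, and the helper name first_error is made up for the example.

```python
# The only two child sequences allowed by the content model (a, (b, c)?, c, d).
VALID = [["a", "c", "d"], ["a", "b", "c", "c", "d"]]

def first_error(children):
    """Return (index, expected names) at the point where a left-to-right
    check can no longer extend the input to a valid sequence, else None.
    (Hypothetical helper; it does not check for input that ends too early.)"""
    for i in range(len(children)):
        prefix = children[: i + 1]
        if not any(seq[: i + 1] == prefix for seq in VALID):
            # Names that would have been legal at position i.
            expected = sorted({seq[i] for seq in VALID
                               if len(seq) > i and seq[:i] == children[:i]})
            return i, expected
    return None

doc = ["a", "c", "c", "d"]
print(first_error(doc))                # (2, ['d']): reported at the second c

# One plausible repair is far from the reported position: insert b after a.
repaired = doc[:1] + ["b"] + doc[1:]
print(repaired in VALID)               # True
```

The check gives up at index 2 and asks for a d, while the repair that adds the missing b is made at index 1.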

Schematron converted from a grammar still does not have much info to go on. Of course, the Schematron scripts should be easier to customize for tailored assertions and diagnostics. But also the phase mechanism is very useful: we can implement multiple different ways of checking the grammar and let the user decide which one provides the best information.

A secondary reason is that Schematron only needs an XSLT implementation. There is still quite a suspicion that XML Schema implementations are partial or broken. Japan Industrial Standards' comments on Open XML were that they could not in fact even get the Schemas to run under Xerces and another major implementation. XSLT is much more common. However, we have decided to use XSLT2, and SAXON in particular, because it offers us some shortcuts.

One shortcut that is quite fun is this possibility (I am not sure whether we will implement this method this round, it is outside our initial brief): by converting the child element names of an element into a string, such as "H1 p div div div table ht p" for example, and then converting a grammar such as (H1 | H2 | H3 | P | div | table)* into a regular expression equivalent, we can actually use the built-in regex recogniser of the XPath2 functions to validate the document. Just using a vanilla XSLT2. And this even copes with the minOccurs/maxOccurs cardinality constraints, too.
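
As a rough sketch of the idea, here is the same trick in Python, with the re module standing in for the XPath 2 regex functions. The function names and the trailing-space encoding of names are assumptions made for this illustration, not our XSLT.

```python
import re

def names_to_string(names):
    # Each name is followed by one space so that "H1" cannot match inside "H10".
    return "".join(name + " " for name in names)

def choice_regex(names, min_occurs=0, max_occurs=None):
    # Regex for a repeated choice group; max_occurs=None stands for "unbounded".
    alternatives = "|".join(re.escape(name) for name in names)
    upper = "" if max_occurs is None else str(max_occurs)
    return "(?:(?:%s) ){%d,%s}" % (alternatives, min_occurs, upper)

def is_valid(children, pattern):
    # fullmatch anchors at both ends; with XPath 2 matches() the pattern
    # would need explicit ^ and $ anchors to get the same effect.
    return re.fullmatch(pattern, names_to_string(children)) is not None

model = choice_regex(["H1", "H2", "H3", "P", "div", "table"])
print(is_valid(["H1", "P", "div", "div", "table"], model))                   # True
print(is_valid(["H1", "p", "div", "div", "div", "table", "ht", "p"], model)) # False
```

The second call fails because names are case-sensitive and "p" and "ht" are not in the choice. A bounded particle such as minOccurs="1" maxOccurs="3" maps directly onto the {1,3} quantifier.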

This is rather exciting as these things go because it means that we can have a fallback validator that completely covers all the constraints of a grammar system, without leaving Schematron or the world of assertions. The downside? If implemented in a simple way, you only get the same kinds of diagnostics as a conventionally implemented XSD system will give you. But the advantage of having a complete Plan B means that we can concentrate on useful messages for the Plan A.

I'll blog on how we implemented it over the next few weeks. Basically, we have a two-stage architecture: the first stage (3 XSLTs) takes all the XSD schema files and does a big series of macro processes on them, to make a single document that contains all the top-level schemas for each namespace, with all references resolved by substitution (except for simple types which we keep). This single big file gets rid of almost all the complications of XSD, which in turn makes it much simpler to then generate the Schematron assertions.
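
To give a feel for what "resolved by substitution" means, here is a minimal sketch in Python (the real preprocessor is XSLT 2 and covers far more). It handles only one kind of reference, named model groups, and assumes a single schema document, unprefixed ref values and no circular group definitions. The function name inline_group_refs is made up for the example.

```python
import copy
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def inline_group_refs(schema):
    """Replace every <xs:group ref="..."/> with a copy of the content of the
    top-level <xs:group name="..."> it points to (hypothetical helper)."""
    groups = {g.get("name"): g for g in schema.findall(XS + "group") if g.get("name")}
    changed = True
    while changed:  # repeat so that groups referring to other groups are resolved
        changed = False
        for parent in list(schema.iter()):
            new_children = []
            for child in list(parent):
                if child.tag == XS + "group" and child.get("ref") in groups:
                    definition = groups[child.get("ref")]
                    new_children.extend(copy.deepcopy(part) for part in definition)
                    changed = True
                else:
                    new_children.append(child)
            parent[:] = new_children
    return schema

source = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:group name="inlines">
    <xs:choice><xs:element name="b"/><xs:element name="i"/></xs:choice>
  </xs:group>
  <xs:element name="p">
    <xs:complexType><xs:sequence><xs:group ref="inlines"/></xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

schema = inline_group_refs(ET.fromstring(source))
sequence = schema.find(".//" + XS + "sequence")
print([child.tag for child in sequence])
```

After the call, the sequence inside p holds the xs:choice directly, so a later stage never has to chase a reference.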

We have so far made the preprocessor, implemented simple type checking (including derivation by restriction) and the basic exception content models (empty, ALL, mixed content), with content models under way at the moment. I think the pre-processor stage might be useful for other projects involving XML Schemas.

Actually, the difficulty has been in an unexpected direction. XML Schemas is so unpleasant to work with that one programmer asked to be taken off the project because it was simply too much to cope with, and another has left the company (to take up an overseas appointment) but not before also getting frustrated, boggled and bogged down by XSD! Things like complex type with simple content derived by extension from a simple type with simple content etc. become a maze or rat's nest. (Hopefully we have that under control and we'll be able to attend to our backlog of other work ASAP: we have been pretty poor.)

It is interesting that in almost eight years of Schematron, I don't recall anyone complaining it was too difficult. Instead, I regularly get surprised to hear of quite important projects where it has been quietly used without fuss or drama, and just chugs away doing its thing, with everyone involved feeling (and being) in control. This week for example I heard about the UK taxation office's use of Schematron for checking incoming documents being lodged. I think some of the reason for the success might be that because Schematron is small, it can be kept under control and understood, and that because there is zero support from the large software players, it is never used as part of an attempt to up-sell big hardware or message busses or protocols or enterprise systems etc.: it gets used for POX (Plain Old XML) sites.

8 Comments

J. Prevost
2007-09-24 23:18:57
I’m a big fan of Relax NG, myself.


I certainly agree that XML Schemas is almost the worst content model specification system available. Almost because I’m not sure whether it’s better or worse than DTD—I’d almost rather use DTDs if they were namespace-aware.


(As far as smallness of implementation: There’s a Relax NG-based XML editing mode for emacs called nxml-mode. It has *built into it* a Relax NG validator, which means it validates everything in real time as you edit it in emacs, without calling out to any external programs.)


Anyway, I support anything that encourages people to consider options other than XML Schemas. Since XMLS has the "stamp of approval", people far too often use it to specify their content models, even though there are other choices (with standards of their own) that are better for their specific task. (Which is to say: Practically any task at all.)

Rick Jelliffe
2007-09-25 00:14:13
J.: RELAX NG is really good. Because it has a slightly more powerful class of grammar than XML Schemas, some schemas using RELAX NG will be more difficult to translate to Schematron.

bryan
2007-09-25 01:39:06
Hmm, I had to do a similar thing for UBL, although I just went from an XML representation of all elements, which is not very different from the normalized schema you are discussing. My target was Schematron 1.5, though. I found that basically the Schematrons I could generate were too verbose to work in most XSLT processors. I could probably have done some optimizations, but I think for Order I would still have been looking at 1+ MB. How is the verbosity of the Schematrons? Any optimization strategies? Have you noticed any consistent relationships between size of input/type of schema structure and size of output?

Rick Jelliffe
2007-09-25 06:15:53
Bryan: We are not really at the stage of optimization yet. But certainly it is a characteristic of Schematron and similar languages that you can have zillions of diagnostics. I think Schematron is however the only general rule-based language (I am not up-to-date at the moment, happy to be corrected) which takes this issue seriously: hence the provision of phases from almost the beginning, and the more recent additions in ISO Schematron of @role and @flag.


A phase allows the user (interface) to select up front which patterns they are interested in. @role and @flag provide richer information for the user (interface) to select and sort results, for example when outputting to the ISO SVRL format.


For example, a phase to just select validation of a particular namespace. Or to just select validation of simple typed data not complex types. Or we follow a user model to validate typos (name not found in schema), context (parent/name not found in schema), required elements, and further.


At one stage I was thinking about whether we needed to provide a mechanism for chaining phases: so that if there are no "typo" assertion failures we then validate in "context" mode, etc. However, actually this gives us no capability that cannot be done merely by selecting and sorting the SVRL output (helped by @flag and @role perhaps.)


Because the SVRL does provide path information to the instance, the diagnostics can be sorted by XPath, and then potentially sorted within these so that the user is presented with the most specific diagnostics first. But there is definitely quite a lot of exciting experimentation possible here.


One thing we have done is to ask our prime user whether they are interested in "what comes next?" diagnostics or in "what is wrong?" diagnostics. This influences the kinds of assertions and phases we make. There are in fact many different ways of implementing tests for content models, each of which fits into a different scenario.


Ultimately, you need a good range of different assertions neatly arranged into different phases, so that users can drill into the information they are interested in. For example, in the blog I give an example of (a, (b, c)?, c, d ) but it is certainly possible to write assertions that are just as misleading as the ones that a grammar will give: just a simple rule such as


An element c that is preceded by an element a should be followed by an element d.
...


whereas the following would be more useful:


A b should be followed by two c's.
...


The biggest optimization we have been doing bits of is simply to use abstract rules for representing simple type hierarchies. And there is quite a lot of scope for coalescing rules that have the same assertions (by adding unions in their context: easy.)
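

The coalescing step might look roughly like this, sketched in Python over a hypothetical in-memory list of (context, assertions) pairs rather than real Schematron markup. The rule contents are invented for illustration.

```python
def coalesce(rules):
    """Merge rules whose assertion lists are identical by joining their
    context expressions with the XPath union operator."""
    merged = {}  # assertion tuple -> list of contexts, in first-seen order
    for context, assertions in rules:
        merged.setdefault(tuple(assertions), []).append(context)
    return [(" | ".join(contexts), list(assertions))
            for assertions, contexts in merged.items()]

rules = [
    ("price", ["number(.) = ."]),
    ("title", ["string-length(.) > 0"]),
    ("cost",  ["number(.) = ."]),
]
for context, assertions in coalesce(rules):
    print(context, assertions)
```

One caveat: within a pattern a node is only tested by the first rule whose context matches it, so merging moves "cost" ahead of "title" here. That is harmless when contexts do not overlap, but it needs checking when they do.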

amike
2007-09-25 13:34:23
Sorry, but I still prefer Examplotron (XSLT) used with XSD, for 2 reasons:
- readable by a non-XPath expert who can validate the diagnostics
- close to set theory: if A is part of B, not-A is easy to get with Examplotron.



Rick Jelliffe
2007-09-25 15:44:24
Amike: Yes, Examplotron is a great idea. There could certainly be an Examplotron to Schematron converter (i.e. using Schematron as an implementation API like with the XSD converter discussed here.)


The downside is that, probably even more than other schema languages, Examplotron does not support traceability back to requirements.

Radkrishna
2007-09-30 00:15:07
Hi,
I want to know how to convert special character values for XML generation.


Rgds,
RK

Rick Jelliffe
2007-10-11 18:40:44
(#2) of this series is available at
http://www.oreillynet.com/xml/blog/2007/10/converting_xml_schemas_to_sche_1.html