Validating concurrent and interrupted elements in XML with Schematron

by Rick Jelliffe

Data-structures people like to think of an XML document as, superficially, a rooted tree of the type called an Attribute Value Tree (AVT) and, once you add IDREFs, a kind of ordered, directed graph. This puts the emphasis on the element structure. But of course an XML document is more than that: it is also a tree of entities, a tree of notations, a tree of character set encodings, and so on. A relational person might see tables of atomic values split up and regrouped according to keys. An SGML or markup person would see it as linear text to which various range annotations have been added to provide metadata. Because the element ranges are synchronous (i.e. they do not overlap), they can also be viewed as a tree. However, there is no reason why a subrange must relate in any semantic way to its containing range: the element is a property of the text, not the other way around.

There are particular, admittedly niche, areas where the synchronous restriction galls, so various systems for concurrent markup have been proposed. Many of these go beyond the meager resources that XML allows, back towards the parsing power of SGML, and some even extend SGML. I was looking over Michael Sperberg-McQueen's Rabbit/Duck grammars, which deal with validating concurrent structures, and I wondered how Schematron could be used.

Let's take the most common case of overlapping markup, bold and italic, because it is easy to visualize. We want the phrase THE GRAPHS OF WRATH rendered with GRAPHS OF in italic and OF WRATH in bold.

A brave but naive soul would mark this up as
THE <i>GRAPHS <b>OF </i>WRATH</b>

but the XML markup has to be
THE <i>GRAPHS <b>OF </b></i><b>WRATH</b>

Let's make a constraint that there can only be one "phrase" of bold in our text. An odd constraint, but it relates to grouping arbitrary elements together. First, let's use markup to indicate connection, with an IDREF attribute called join.

THE <i>GRAPHS <b id="b1">OF </b></i><b join="b1">WRATH</b>
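The join convention can be resolved mechanically. As a minimal sketch, assuming the fragment is wrapped in a hypothetical root element called line (the fragment needs a single root to parse) and using a helper name, joined_text, invented here for illustration, the pieces of one phrase can be gathered in document order with Python's standard library:

```python
import xml.etree.ElementTree as ET


def joined_text(root, head_id):
    """Concatenate, in document order, the text of the b element whose id is
    head_id and of every b element whose join attribute points at it."""
    parts = []
    for b in root.iter("b"):
        if b.get("id") == head_id or b.get("join") == head_id:
            # itertext() yields the element's own text content, not its tail
            parts.append("".join(b.itertext()))
    return "".join(parts)


# <line> is an assumed wrapper element, not part of the original markup.
doc = ET.fromstring(
    '<line>THE <i>GRAPHS <b id="b1">OF </b></i><b join="b1">WRATH</b></line>'
)
print(joined_text(doc, "b1"))  # prints "OF WRATH"
```

The two b elements sit at different depths, one inside i and one beside it, yet the phrase is recovered as a single unit.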

Note that we have now represented the concurrent structures, but @join does not require that the sections be contiguous: interrupted or dispersed sections are possible too! Now for the Schematron schema:

<rule context=" $context ">
<let name="head" value=".//b[not(@join)]"/>
<assert test="count($head) = 1 and count(.//b[@join][not(@join = $head/@id)]) = 0">
There should be exactly one phrase of bold. This should be marked up with one or more
b elements, where one of those b elements has an id and the others have join attributes
that refer to it.
</assert>
</rule>

The same approach can be extended for different occurrence and position constraints.
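As a rough cross-check outside a Schematron processor, the one-phrase constraint can be sketched in Python with only the standard library. The function name check_one_bold_phrase and the wrapper element line are assumptions made for illustration. The sketch searches all descendants of the context node, since in the example the first b sits inside i; it is not a substitute for a real Schematron implementation.

```python
import xml.etree.ElementTree as ET


def check_one_bold_phrase(context):
    """Return True when the b elements under context form exactly one phrase:
    one b without a join attribute, and every joined b pointing at its id."""
    bolds = list(context.iter("b"))
    heads = [b for b in bolds if b.get("join") is None]
    if len(heads) != 1:
        return False
    head_id = heads[0].get("id")
    # A joined b fails if the head has no id or the join names another id.
    return all(
        b.get("join") == head_id for b in bolds if b.get("join") is not None
    )


# <line> is an assumed wrapper element standing in for the rule context.
good = ET.fromstring(
    '<line>THE <i>GRAPHS <b id="b1">OF </b></i><b join="b1">WRATH</b></line>'
)
two_phrases = ET.fromstring('<line><b id="b1">A</b> and <b id="b2">B</b></line>')
dangling = ET.fromstring('<line><b id="b1">A</b><b join="zz">B</b></line>')

print(check_one_bold_phrase(good))
print(check_one_bold_phrase(two_phrases))
print(check_one_bold_phrase(dangling))
```

Replacing the test on the number of unjoined b elements with a different bound gives a different occurrence constraint, in the same way the count in the assert can be varied.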