PVL: a minimal schema/stripping language for XML

by Rick Jelliffe

While doing some training last week, it became clear again that some XML users are currently in a bind over insignificant whitespace. They may have indented XML documents, but they don't want to have to validate or transform them just to strip the whitespace from element content.

I have also been thinking about a similar problem from a different angle: how to make an ultra-simple, efficient validator that does not use a grammar, just the XML processor's stack, for basic parent/(child|attribute|pi|data|element) validation. This is part of an ongoing interest in implementing XSD as Schematron assertions; one problem with that approach is that Schematron (using XPath 2, for example) pretty much requires random access to a built tree. But for any kind of high-transaction-rate work, you really want to be able to fail early if there are foreign elements, billion laughs attacks, etc. In my company's Interceptor product, we have Schematron processing as a third stage, after basic size and evil-string detection and then WF/validation checking. But schema validation does blow out the timing more than is desirable.
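As a rough illustration of the "just the stack" idea, here is a minimal Python sketch built on the standard xml.sax module. The table shape and the names ALLOWED, PairValidator and is_valid are made up for this illustration and are not PVL syntax; the sketch only checks parent/child element pairs, not attributes, PIs or data.

```python
import xml.sax

# Hypothetical table for illustration: parent name -> allowed child names.
# The key None stands for the document itself, so it lists the allowed roots.
ALLOWED = {
    None: {"doc"},
    "doc": {"title", "para"},
    "para": {"emph"},
}


class PairValidator(xml.sax.ContentHandler):
    """Fails at the first parent/child pair that is not in the table."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.stack = []  # names of the currently open elements

    def startElement(self, name, attrs):
        parent = self.stack[-1] if self.stack else None
        if name not in self.allowed.get(parent, set()):
            # Raising from the handler aborts the parse at this element.
            raise xml.sax.SAXException(f"{name!r} not allowed in {parent!r}")
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()


def is_valid(xml_text, allowed):
    """True if the text is well formed and every pair is in the table."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), PairValidator(allowed))
        return True
    except xml.sax.SAXException:
        # Also covers SAXParseException, i.e. input that is not well formed.
        return False
```

The only state is the stack of open element names, so memory use is bounded by document depth rather than document size, and a foreign element is rejected as soon as its start tag is seen.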

So, for your delight, I present the Path Validation Language, a thought experiment in minimal schema languages, rather like UNIX access control lists (ACLs). You more or less make an ACL entry for each information item in each significant context. I've tried to make it so that XSD, DTD, and RELAX NG schemas (and perhaps even the streamable parts of some Schematron schemas) could be readily simplified into a PVL schema. Obviously it could also be extended to allow simple datatyping or attribute defaulting, but if you go too far you may as well have the real thing. (Though I increasingly think that grammars get in the way of schemas: better to have path-based datatype attribution, for example.)
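To show how per-context entries could drive whitespace stripping, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It is a simplification under stated assumptions: contexts are keyed by element name only rather than by path, and the ELEMENT_ONLY table and the strip_insignificant function are hypothetical names, not part of PVL.

```python
import xml.etree.ElementTree as ET

# Hypothetical table for illustration: elements declared to have
# element-only content, so whitespace directly inside them is insignificant.
ELEMENT_ONLY = {"doc", "section"}

XML_WHITESPACE = " \t\r\n"


def _is_blank(text):
    """True for a text node made up only of XML whitespace characters."""
    return text is not None and text.strip(XML_WHITESPACE) == ""


def strip_insignificant(elem):
    """Drop whitespace-only text nodes inside element-only elements."""
    if elem.tag in ELEMENT_ONLY:
        if _is_blank(elem.text):
            elem.text = None
        for child in elem:
            # A child's tail is the text between it and the next sibling,
            # which still belongs to the content of elem.
            if _is_blank(child.tail):
                child.tail = None
    for child in elem:
        strip_insignificant(child)
    return elem
```

Elements that are not in the table are left alone, so a single space between two inline elements in mixed content survives, which is exactly what blanket stripping would get wrong.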

Actually the syntax is unimportant (could be a PI, could be a config file, could be in XML syntax, could be just an internal datastructure from compiling a schema). The more interesting thing is the question of whether we actually need something much simpler than DTDs (which can either be written or generated from schema languages) for situations where XSD (or even RELAX NG!) is too complex or inefficient.


5 Comments

M. David Peterson
2006-04-11 06:12:55
Hey Rick,


Nice! Have you rocked this up into any sort of proof of concept code, or are you purely in theory mode at the moment?


2006-04-11 23:08:34
No code for now for something tightly coupled to a parser. I guess the most useful code would be something insertable into a SAX stream.


But the "Usage Schema" in some Topologi products extracts parent/child pairs from a document or DTD and makes a Schematron schema from them: this is great for determining that a new document is marked up in the same kind of way that old documents were, for example. So I am sure that even tiny twigs like this catch a good number of errors: name and namespace typos, enforcing element/mixed/empty/data content in elements, enforcing containment and basic structure. All good.
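To make the parent/child-pair idea concrete, here is a minimal Python sketch. It is not the Topologi implementation and it does not generate Schematron; it only collects the pairs from one document and reports the pairs in a second document that the first never used. The function names are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def parent_child_pairs(xml_text):
    """Return the set of (parent name, child name) pairs used in a document."""
    root = ET.fromstring(xml_text)
    pairs = {(None, root.tag)}  # None marks the document as parent of the root
    for parent in root.iter():
        for child in parent:
            pairs.add((parent.tag, child.tag))
    return pairs


def unseen_pairs(old_text, new_text):
    """Pairs that occur in the new document but never in the old one."""
    return parent_child_pairs(new_text) - parent_child_pairs(old_text)
```

An empty result from unseen_pairs suggests the new document is marked up in the same kind of way as the old one, at least at the level of containment.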


Interesting to consider that XML was developed by separating validity into two layers: WF and DTD valid. Perhaps schema/DTD valid needs to be forked into two layers too, because of the need for insignificant whitespace node stripping.


Another approach would be for XML to simply allow xml:space="strip", but that is probably too radical.

M. David Peterson
2006-04-12 03:55:25
re: xml:space="strip"


I think the part that would make it too radical is the length of time it would take from now until any such point as a new version of the XML spec is RTM'd, if ever. Obviously XSLT already provides this functionality, but for folks who would rather bathe in a tub of Tar-and-Feathers than work with XSLT, that's obviously not an option, although a simple template that did nothing but a deep-copy of the associated document with and could definitely act in the capacity of a pre-processor in this regard, ensuring there will be no need to start warming up the Tar bath just yet ;)


This method would dump any associated PI's, but if there do happen to be any, these can always be merged back into whichever object model document they happen to be using after the fact. Not sure if the PI's would even be much of a concern, but if they were, then this option should help ensure this could be a simple hack that wouldn't require a whole lot of overhead.


Your PVL 'spec' would obviously add a few simple pieces to this to allow for a bit more granularity, so I think it's fantastic in this regard. Definitely worth looking deeper into some proof-of-concept code.

M. David Peterson
2006-04-12 03:59:31
it seems it didn't like my markup in the comment... let's try that again.


<xsl:strip-space elements="*" />


and


<xsl:output method="xml" indent="no" />


should be inserted to read:


"... deep-copy of the associated document with <xsl:strip-space elements="*" /> and <xsl:output method="xml" indent="no" />could definitely act in the capacity of a pre-processor in this regard, ensuring there will be no need to start warming up the Tar bath just yet ;)"



Innovimax
2006-04-16 04:01:45
Hi,


Great idea and great work


I've made some comments on my blog about your proposal: http://innovimax.fr/blog/index.php/2006/04/16/4-pvl-a-path-validation-language


I'm opening a SourceForge project to bring this idea to life


MoZ