Extreme Markup Languages, day 1

by Uche Ogbuji

I'll use this entry as an anchor for my observations on the first day of Extreme Markup Languages (See also: Looking forward to Extreme Markup Languages). I'll update it with a note each time a new talk begins, but I'll add my comments on the talk in the comments section. I also added a numbering scheme for the talks, to correlate to comments.

If you happen to be reading this in an aggregator, much of the meat is in the comments, so you might want to click through.

D1.1. B. Tommie Usdin, one of the organizers, opens up the conference with "Riding the wave, riding for a fall, or just along for the ride?"

D1.2. "Easy RDF for real-life system modeling", Thomas B. Passin

D1.3. "Writing an XSLT optimizer in XSLT", Michael Kay

D1.4. "From Word to XML to mobile devices", David Lee

D1.5. "MYCAREVENT: OWL and the automotive repair information supply chain", Martin Bryan & Jay Cousins (Martin presented alone)

D1.6. "Advanced approaches to XML document validation", Petr Nalevka, Jirka Kosek (Petr presented alone)


2007-08-07 07:32:12
One of the things Tommie said that comes up a lot is the question "who cares about XML?" She echoed Henry Thompson's remark that most XML is machine-generated, and moved on to a point I've often pondered. As Liam Quin reports from his surveys of XML users, each group (e.g. Web services folks, RSS folks, DBMS folks) says "99% of XML is for [insert name of group here]", with the implication that any features not important to that group should be discarded. It also means that since each of these groups tends to produce XML through automata, each of them dismisses XML as a technology of minor value.

It could add up to an existential crisis for those who specialize in such technology, except that I think it rather energizes us (my background is old-school enterprise architecture, but these days I try to make my work as XML-plumbing-intensive as possible). XML generated by machines with only the most cursory craftsmanship is what has built the technology to its present ubiquity, no question. But that resulting ubiquity means XML offers a great deal more value than many of its independent constituencies admit. This value comes from the bridge it creates across, e.g., the Web services, RSS and DBMS worlds.

And that's where those of us with craftsmanship come in. We turn that ocean of XML with poor value into the substrate for integration solutions with very high, strategic value. A dumb machine will never be able to do that. The state of this craft is the primary concern of this conference, as I understand it (this is my first time here).

2007-08-07 08:54:03
D1.3. Good quote: "It's so obvious I'm surprised it hasn't been patented" - Mike Kay

2007-08-07 09:19:17
D1.3. He points out that optimization is really just a transformation, and that XSLT is designed for such transformation, so XSLT is a great language for expressing such optimization.

In an example he shows how he decomposes XPath into an XML representation, and then uses the match for an xsl:template to express the criteria for a rewrite, with, of course, the rewritten logic as the template body.
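A minimal sketch of the idea, with invented element names (this is not Saxon's actual internal representation): an expression tree serialized as XML, and a template whose match pattern encodes the rewrite condition, with the cheaper equivalent as the template body.

```xml
<!-- Hypothetical decomposition of the expression count($x) = 0.
     Element and attribute names are invented for illustration. -->
<!-- The match pattern is the rewrite criterion; the body emits the
     cheaper equivalent empty($x), avoiding a full count. -->
<xsl:template match="eq[fn[@name = 'count'] and literal[@value = '0']]">
  <fn name="empty">
    <xsl:copy-of select="fn[@name = 'count']/arg"/>
  </fn>
</xsl:template>
```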

Some rewrites can be complex and require analysis, e.g. understanding the context dependencies of expressions (so you know, for example, when you can safely move an expression) or static type inference (for XSLT 2.0).

An example is the is-in-doc-order property, which can be used to eliminate unnecessary sorts. It's attached to constructs such as union ("|"), and can be lazily computed.

He is hesitant to go too far down this path on Saxon because it's such a well-deployed product in mission-critical systems. He is hoping to find a grad student who could work on this in a lower-risk environment. As he puts it, this is "high risk, but potentially high reward". I do wonder whether he couldn't just have an experimental branch of Saxon, but I do hope he finds the help to get it done one way or another.

Overall: Wow. What a dynamite talk! It goes up there with James Clark's introduction of nxml a few years ago as a talk that gives you rich insight into the mind of a great developer.

2007-08-07 09:32:44
D1.3. Question time.

One good question was whether the output of this would be a library of public domain XSLT templates that could be reused in other processors. Mike Kay said that some rewrites operate at the standard XSLT level, while others use representations of internal constructs, and so would not be so easily reused. He also made the point that some rewrites make more sense for certain implementations.

Steve Newcomb wondered whether evolutionary algorithms might be useful here. Mike Kay thought that maybe the first step is RDBMS optimization theory, but agreed it would be fun to fire all sorts of trickery at the problem.

John Cowan plugged the Q functional language based on rewrites along this line: http://q-lang.sourceforge.net/

2007-08-07 09:33:11
D1.2. Unfortunately, Firefox crashed on me during this session. I didn't lose any text in the text area (thanks to FF crash recovery), but I did lose time and attention while getting my session set up again. The basic idea is that Tom has a very friendly syntax for RDF. I'll try to post a bit more on that as I catch up.

2007-08-07 11:42:40
D1.4. Background is the age-old problem that users (clinicians, in this case) have been using Office suites for ages, and that's what they know, but the developers want to create useful stuff with rich markup. How to bridge Office docs to rich markup?

They tried exporting to RTF, HTML and Word/XML, and using VBScript, settling in the end on Word/XML export followed by conversion to their custom schema. Their entry documents are entirely Word tables: they found these hardest for the user to screw up. He suggests very sparing use of Word macros--only for extraction, early validation and auto-correction.

He felt pressure to create the workflow as a monolithic program, but went with a pipeline approach, which he found very suitable.

For some things he could have used XSLT, but he prefers XQuery.

Asked about using a pipeline language, he said he preferred to write his own, as a custom Java program (I'm not sure how this differentiates from a monolithic program).

John Cowan mentioned that in his experience at Reuters Health they were able to do it all using plain text. "Inferring markup is much easier than cleaning up markup".

2007-08-07 12:41:18
D1.5. Data modeling using OWL and the like is a mainstay of my day-to-day job, so I looked forward to this one.

MYCAREVENT is a service portal for information relating to vehicle repair and maintenance, created by a group of auto manufacturers, repair/maintenance companies, IT companies, etc.

One reason they use a full ontology (OWL DL, to be precise) is that terminology differs across the various domains, e.g. car owners and drivers, mechanics, manufacturers, etc. It's very important to model and map across such concepts.

They started from a formal UML model of the top-level process involved in fault diagnosis. From this they created XML schemas, then a shared terminology, and finally the ontology.
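The sort of cross-domain mapping described above can be expressed directly in OWL. A hypothetical fragment (the URIs and class names are invented for illustration, not from the MYCAREVENT ontology):

```xml
<!-- owl:equivalentClass asserts that two classes from different
     vocabularies denote the same concept. All names here are invented. -->
<owl:Class rdf:about="http://example.org/driver-vocab#Motorist">
  <owl:equivalentClass
      rdf:resource="http://example.org/garage-vocab#Customer"/>
</owl:Class>
```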

2007-08-07 14:18:42
D1.6. He showed examples of how Relaxed, their validator software, can catch errors the W3C validator cannot, such as a form within a form, or "%" in a border width in a style attribute.

He put forward NVDL as the solution for compound document validation, e.g. XHTML+RDF+SVG+MathML. NVDL of course allows you to combine existing schemata for the component languages. The JNVDL implementation has an impressive list of built-in component schema languages.
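For flavor, a minimal NVDL rules document along these lines (a sketch; the schema file names are invented), dispatching each namespace in a compound document to its own schema:

```xml
<!-- Sketch: validate XHTML content against one RELAX NG schema and
     embedded SVG against another; reject anything else. Schema
     locations are hypothetical. -->
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.rng"/>
  </namespace>
  <namespace ns="http://www.w3.org/2000/svg">
    <validate schema="svg.rng"/>
  </namespace>
  <anyNamespace>
    <reject/>
  </anyNamespace>
</rules>
```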

They have a Relaxed validation service at http://relaxed.vse.cz/relaxed/, built on JNVDL. It includes a simple way to build NVDL examples from a form (called the "namespace restaurant").

See also:

* http://www.xmlhack.com/read.php?item=2123
* http://xmlhack.com/read.php?item=2120
* http://www.oasis-open.org/archives/ubl/200602/msg00117.html

2007-08-09 06:29:31
D1.2. Tom Passin said it's OK for me to link to his support files site for his talk, in which you can find examples of the RDF notation, and more.


David Lee
2007-08-10 17:14:44
Regarding D1.4's "not sure how this differentiates from a monolithic program":

The difference is that with a pipeline, regardless of whether it is written with "in house" code or some yet-to-be-standardized pipeline language, there are well-established and documented entry and exit points, or "nodes", in the process.

This differs significantly from a "monolithic" program, where there is only one input and one output, all in black-box code. In the case of this pipeline, regardless of what language the pipeline or the transformations are written in, or who wrote them, each step or node can in theory be used as an alternate input or output point.
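The distinction David draws can be sketched in a few lines (Python here purely for brevity; the stage names and the string-rewriting "transformations" are invented toys, not the D1.4 system): each stage has a documented input/output contract, so any stage boundary can serve as an alternate entry or exit point.

```python
# Toy pipeline: each stage is a named function with a clear input and
# output, so intermediate results can be inspected, and the pipeline can
# be entered or exited at any node. Stage names are hypothetical.

def extract_tables(word_xml: str) -> str:
    """WordprocessingML in, bare table markup out (toy rewrite)."""
    return word_xml.replace("w:tbl", "table")

def to_custom_schema(tables: str) -> str:
    """Bare table markup in, target schema out (toy rewrite)."""
    return tables.replace("table", "record")

PIPELINE = [extract_tables, to_custom_schema]

def run(doc: str, stages=PIPELINE) -> str:
    for stage in stages:  # each boundary is a documented entry/exit point
        doc = stage(doc)
    return doc

print(run("<w:tbl/>"))  # -> <record/>
```

By contrast, a monolithic program would bury both rewrites inside one function, leaving no seam at which to tap the intermediate form.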