Extreme Markup, Day 1

by Simon St. Laurent

Related link: http://extrememarkup.com/

It's early August, so it's time to think about XML in depth.

B. Tommie Usdin opened the Extreme Markup Languages Conference "in praise of the edge case" by looking at past attendee compliments and complaints, and then exploring the comments reviewers had given papers this year. Her point: this is a conference where ideas aren't required to be immediately practical, but rather one where difficult questions can surface in the hope of their finding answers eventually.

Usdin talked about the challenges corporate IT managers face in quarterly goals and business structures that require clear budgets. For better or worse, focusing strictly on questions with clear answers that have immediate returns can deaden conversations about potentially exciting but more difficult projects. Things that "can't be done" become doable over time, and things that once moved from edge case to core - like mixed content in documents - can move back out to edge case status depending on how technology is used.

I come to Extreme for an annual brain-stretching, something that can keep me interested even while I perform much more mundane tasks. Ideas I hear at Extreme sometimes take years to percolate, but when they reach maturity they prove extremely useful. It's good to have a conference that takes a long-term view of what's practical rather than jumping on what's hot this week.

Elliotte Rusty Harold, prolific XML author and keeper of the Cafe Con Leche XML news site, gave a talk on his Randomizer project.

While lots of people find raw XML documents obscure enough already, there are a lot of reasons why people can't share their documents. Copyrights, security, and simple embarrassment can all get in the way. To make it easier for developers to exchange information, Harold has developed a tool that obscures XML document content and element and attribute names while preserving the structure of the markup.
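To make the idea concrete, here is a minimal Python sketch of this kind of obscuring. It is not Harold's Randomizer and makes no claim about how that tool works; the function names (obscure, scramble_text) and the renaming scheme (n0, n1, ...) are invented for illustration. It assumes a document without namespaces, and it keeps text lengths and punctuation, which a serious tool might well want to hide too.

```python
import random
import string
import xml.etree.ElementTree as ET


def scramble_text(text, rng):
    """Swap letters and digits for random ones; keep whitespace and punctuation."""
    if text is None:
        return None
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


def obscure(xml_source, seed=None):
    """Return a copy of the document with names and content obscured.

    Hypothetical helper: the tree shape is preserved, and every occurrence
    of a given element or attribute name gets the same replacement name.
    """
    rng = random.Random(seed)
    names = {}  # original name -> replacement name

    def rename(name):
        if name not in names:
            names[name] = "n%d" % len(names)
        return names[name]

    root = ET.fromstring(xml_source)
    for elem in root.iter():
        elem.tag = rename(elem.tag)
        new_attrib = {rename(k): scramble_text(v, rng)
                      for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(new_attrib)
        elem.text = scramble_text(elem.text, rng)
        elem.tail = scramble_text(elem.tail, rng)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = '<order id="A17"><item sku="x9">Blue widget</item></order>'
    print(obscure(sample))
```

Because the replacement names are handed out in document order, two documents with the same structure produce the same obscured names, which is convenient for test suites but is also one more thing an attacker could learn from.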

This should reduce the "I'd tell you but I'd have to kill you" problem in sharing XML documents for debugging purposes. What was most interesting about the talk, though, was the way the project set off a number of conversations with the audience, questioning how precisely the Randomizer worked and asking for a variety of additional features. Different levels of structure and content randomization came up a few times, and there were a number of questions about Harold's approach to ensuring that documents couldn't be converted back to their originals.

Hopefully Randomizer will make a contribution to improving XML software by making it much safer to share use cases and create test suites.


Next up was Angelo Di Iorio of the University of Bologna, taking a crack at getting underneath the complexity of XML Schema (and even DTDs) by using a much smaller set of patterns to define document structures, and by creating schemas by processing annotated model documents, somewhat like Examplotron does. In a bit of irony, the processing uses XML Schema and even extensions to XML Schema, called SchemaPath, which deals with co-occurrence constraints (if X is here, Y must be here) and other kinds of conditional expression.
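For readers who haven't run into co-occurrence constraints, the "if X is here, Y must be here" rule can be sketched in a few lines of Python. This is only an illustration of the kind of rule involved, not SchemaPath syntax or Di Iorio's processing; the function name, the rule format, and the sample element names are all made up for the example.

```python
import xml.etree.ElementTree as ET


def cooccurrence_violations(root, rules):
    """Check rules of the form (context, present, required).

    Each rule says: inside an element named `context`, if a child named
    `present` appears, a child named `required` must appear as well.
    Returns one (context, present, required) tuple per violation found.
    """
    violations = []
    for elem in root.iter():
        children = {child.tag for child in elem}
        for context, present, required in rules:
            if (elem.tag == context
                    and present in children
                    and required not in children):
                violations.append((context, present, required))
    return violations


if __name__ == "__main__":
    doc = ET.fromstring(
        "<catalog>"
        "<book><isbn/><publisher/></book>"
        "<book><isbn/></book>"
        "</catalog>"
    )
    # Hypothetical rule: a book with an isbn must also name its publisher.
    print(cooccurrence_violations(doc, [("book", "isbn", "publisher")]))
```

Note that the rule is tied to a context (the parent element), which is the same flavor of context-focused checking the talk argued for.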

There were some good ideas along the way. I don't frequently hear about doing more with less (or at least I haven't at XML conferences for a few years). In a section on "syntactical minimality and semantic expressivity," Di Iorio looked at ways that fewer patterns can be used to do a wide variety of different things. A context-focused approach, rather than one where every element and attribute is defined in something of a vacuum, is interesting, especially the possibility of saying that "this extra possibility is [not] allowed in this context."


Anne Cregan opened the afternoon with a talk on reconciling OWL and Topic Maps. OWL, the Web Ontology Language, ties into RDF and the Semantic Web, while Topic Maps have emerged from the worlds of indexes, cross-references, tables of contents, and all kinds of related structures to become a general metadata framework.

The RDF/OWL/Semantic Web community and the Topic Maps community have taken very different approaches to similar problems, and sometimes compete. At an earlier Extreme (which I missed), there was a proposed battle between the two sides, though in the end they seem to have decided that cooperation was a better idea. Cregan suggested that the Topic Maps approach emphasizes humans finding information, while the RDF approach is more about computers managing information.

Cregan explored the parallels and disjunctions between Topic Maps and the RDF/OWL specifications, then set out to see whether the Topic Maps Data Model (TMDM) could be reconciled with OWL Description Logic (OWL-DL). She concluded that their fundamental similarities as entity-relationship models made it possible. Working in OWL-DL could even serve as a means of enforcing TMDM's expectations.

As with most difficult projects, there are some caveats. Cregan hasn't attempted to reverse the process, using Topic Maps to create OWL-DL. There are still some issues about type instances and supertypes that need to be hammered out (currently needing extra code beyond OWL-DL processing, or ugly workarounds).


Lars Marius Garshol of Ontopia continued the RDF/Topic Maps discussion, also seeking to reconcile the two approaches. While conversions are possible, there hasn't yet been a seamless approach. Garshol proposed a unifying model, Q. Topic Maps and RDF could both be converted into this model, and converted back out, making them interoperable. Garshol is also hoping to simplify the Topic Maps data model.

Unlike Cregan's approach, which used RDF and OWL directly, Garshol preferred to abstract a layer beyond those to more directly accommodate Topic Maps' greater complexity. Neither RDF itself nor object models seemed like the right answer, so Garshol turned to quads, adding an extra piece to RDF triples. This lets him add identity to RDF triples, and simplifies the modeling.
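As a rough picture of what giving a triple an identity buys you, here is a small Python sketch. It is an assumption-laden illustration, not Garshol's Q model: the field order, the identifier scheme (s1, s2, ...), and the sample statements are all invented here. The point is only that once a statement has its own identifier, another statement can refer to it.

```python
from collections import namedtuple

# A plain RDF-style statement, and the same statement with an added identifier.
Triple = namedtuple("Triple", "subject predicate object")
Quad = namedtuple("Quad", "id subject predicate object")


def to_quads(triples, prefix="s"):
    """Give each triple an identifier (hypothetical scheme: s1, s2, ...)."""
    return [Quad("%s%d" % (prefix, i), *t)
            for i, t in enumerate(triples, start=1)]


triples = [
    Triple("puccini", "born-in", "lucca"),
    Triple("puccini", "composed", "tosca"),
]
quads = to_quads(triples)

# A statement about a statement: the subject is the first quad's identifier.
quads.append(Quad("s3", quads[0].id, "scope", "biography"))

if __name__ == "__main__":
    for quad in quads:
        print(quad)
```

With bare triples, saying something about the "born-in" statement itself would need a workaround; with an identifier attached, it is just another row.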

While Garshol saw this approach as an improvement, there's still room for additional development, depending on how much bloat matters, how much you value supporting parts of the model which aren't normally used, issues of duplicate nodes, problems of round-tripping, your expectations about working with scope, and some odd issues around language tags and URI usage in RDF. In the end, as a surprise, Garshol leapt to quints instead of quads to solve context problems.


Kristoffer Rose of IBM spoke next, showing how DFDL (the Data Format Description Language, pronounced "daffodil") might help XML processors get into a wide variety of file formats, including binary data. DFDL itself is built using annotations to XML Schema, indicating data format information and supporting structure directives. You can, for instance, tell a schema what the separator and terminator are for a given piece, as well as give type information about its content. It's flexible enough to support things like control characters XML normally prohibits.
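The separator-and-terminator idea is easy to picture with a toy example. The Python sketch below is not DFDL and uses none of its annotation syntax; the function name, its parameters, and the sample data are assumptions made for illustration. It takes a description of a delimited text format (field names, types, separator, terminator) and hands back an XML tree, which is roughly the effect a format description like this is after.

```python
import xml.etree.ElementTree as ET


def parse_delimited(data, record_name, fields, separator=",", terminator="\n"):
    """Turn delimited text into an XML tree.

    `fields` is a list of (name, convert) pairs; `convert` is called on each
    value as a type check and raises ValueError when the content doesn't fit.
    """
    root = ET.Element(record_name + "s")
    for raw in data.split(terminator):
        if not raw:
            continue  # skip the empty piece after the final terminator
        record = ET.SubElement(root, record_name)
        for (name, convert), value in zip(fields, raw.split(separator)):
            convert(value)
            ET.SubElement(record, name).text = value
    return root


if __name__ == "__main__":
    data = "widget|3;gadget|12;"
    tree = parse_delimited(data, "item", [("name", str), ("qty", int)],
                           separator="|", terminator=";")
    print(ET.tostring(tree, encoding="unicode"))
```

Real binary formats need far more than this (lengths, byte order, encodings), which is where a declarative description attached to a schema starts to earn its keep.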

Wisely (in my opinion, anyway), they seem to have taken a conservative approach to rules for how these annotations flow through the schema, limiting the effect of the annotations to the element directly annotated. It does support some more ambitious features like scoped default values.

DFDL processing lets developers treat non-XML documents as if they were in XML, returning XML models as if the document had come from an XML parser. It's an interesting approach, one that has lots of promise for the many forms of data which meet its requirements. I do wonder how widely it will be adopted, but maybe the Global Grid Forum can make it work well enough to be attractive. Draft specifications have been published, but there is still ongoing discussion and only a few implementations have appeared so far.

Jon Bosak, the "father of XML" who got the project rolling, gave the final talk of the afternoon. Jon, a last-minute substitute, talked about Universal Business Language (UBL). UBL's focus is on business documents, taking existing practices that run on paper (and faxes) and providing a framework to use them electronically. That means reconciling that approach with another existing practice, EDI. As Jon put it:

"the key question is how to make this cheap enough for smaller companies.... How do you get cheap software? Through standardization. And that means trading off all those proprietary features."

One especially interesting piece is that they want to stay close to paper documents in part so that courts and other non-technical interpreters can look at transactions and understand what happened, or was supposed to happen.

Jon described UBL as "a radically un-new idea. We're trying to connect with existing systems," and pointed out that "disruptive" isn't exactly a popular word among businesses trying to get their work done. UBL is also meant to be stable, providing a foundation like HTML has done. Much to my happiness, they're also creating a "small business subset," a profile that lets people work with a smaller set of pieces rather than demanding a huge initial implementation. Public review on that should begin soon.

What would you like to see in the markup conversation?