Extreme Markup, Day 2

by Simon St. Laurent

Related link: http://extrememarkup.com/

After an evening enjoying Montreal - I recommend the Lac Saint-Jean dessert crêpe at Chez Suzette - it was back to the mental stretching at Extreme Markup.

(I'm going to update this post over the course of the day.)

Ken Holman started the morning off by talking about how a project he'd presented here two years ago had seemed great, but then he found limitations. His original approach, LiterateXSLT™, was based on creating 'literate results' - annotated XSL Formatting Object Documents with XPath information about where to find source material to fill them. Then Holman combined the document structure and the annotations to generate a stylesheet.

That worked well, unless you needed to reuse the stylesheet with other data. He shifted the XPath information to separate files, making it possible to reuse the same target layout with multiple source vocabularies. In doing this he was able to "drastically reduce the number of annotations," making the base document much easier to work with. Instead of expecting all of the information in one document and producing monolithic XSLT, Holman's new approach - ResultXSLT™ - synthesizes XSLT which calls imported templates. Those imports carry the detailed information about how the vocabulary relates to the expectations of the stylesheet. (Ken's also done some tricky work using namespaces as a signal for certain kinds of processing rather than as a vocabulary identifier, an approach that deserves a much closer look.)
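To make the shape of that split concrete, here is a rough sketch in Python. It is not Holman's implementation. The function name `synthesize`, the `page` and `slot` result elements, and the file name `invoice-bindings.xsl` are all invented for illustration. The only idea it borrows from the talk is that the generated stylesheet holds the layout and calls named templates, while a separate imported module (one per source vocabulary) supplies the XPath details.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL)


def synthesize(slot_names, vocabulary_module):
    """Build a stylesheet that imports a vocabulary-specific module.

    slot_names: hypothetical names of the places in the layout to fill.
    vocabulary_module: href of a stylesheet assumed to define one named
    template per slot, containing the XPath for one source vocabulary.
    """
    sheet = ET.Element(f"{{{XSL}}}stylesheet", {"version": "1.0"})
    # xsl:import has to come before any other top-level element.
    ET.SubElement(sheet, f"{{{XSL}}}import", {"href": vocabulary_module})

    root_template = ET.SubElement(sheet, f"{{{XSL}}}template", {"match": "/"})
    # Stand-in for the real target layout (XSL-FO in the talk).
    page = ET.SubElement(root_template, "page")
    for name in slot_names:
        slot = ET.SubElement(page, "slot", {"name": name})
        # The layout only knows the slot name; the import knows the XPath.
        ET.SubElement(slot, f"{{{XSL}}}call-template", {"name": name})

    return ET.tostring(sheet, encoding="unicode")


if __name__ == "__main__":
    print(synthesize(["title", "author"], "invoice-bindings.xsl"))
```

Pointing the same layout at a different vocabulary then means changing only the imported module, which is the reuse the separate files were meant to buy.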

One extra degree of separation can frequently simplify a larger set of problems. While extra separation does mean extra work, the extra flexibility becomes more useful as the size of the problems grows. (Those ™ marks don't indicate Ken's interest in owning the technology - he's happy to see the technique used by other people.)

(I've written here about Ken's XSLT training in the past.)


Matthijs Breebart was next, presenting on an issue that has become more and more common as XML has reached into more and more places. Sometimes those places are readily accessible, but other times those places are accessible only under certain circumstances: on a particular network, or with a particular license. Breebart's case involved annotating laws with commentary, both public and commercial, coming in a variety of different formats.

The commentary was organized by vendor, not by content. Looking for information about a particular subject required going back and forth between a number of different sites, often to find relatively little new. While vendors use URLs to create permanent links, they all use different systems. Breebart wasn't thrilled by the prospect of manually reconciling these, and processing URLs to try to sort out the vendor-specific approach wasn't fun either. The next step was asking vendors to use a standard form - getting everyone in one room.

Modeling data was one project, and then they had to figure out identifiers. Using random strings and a registry had disadvantages. They wound up combining some meaningful information for internal parts of regulations and a meaningless identifier (and number) for regulations themselves. They still have 10,000 identifiers, but having the internal portion of the identifier follow the structure of the document spared them many, many more identifiers.

For sharing the information, they used RELAX NG to create a schema, and then generated XML Schema as appropriate. Once they had this set up, they could share the identifier list among all of the parties. They could also create a direct transformation of the XML describing the identifier to a URI, giving them a much more compact approach.
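As an illustration only, here is a minimal Python sketch of that kind of transformation. The element names (`identifier`, `regulation`, `part`), the sample values, and the `example.org` base are my own placeholders, not the project's actual schema or URI scheme. It just shows an opaque identifier for the regulation followed by segments that mirror the document's internal structure.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical identifier description: an opaque id for the regulation,
# plus a structural path for the part inside it.
SAMPLE = """
<identifier>
  <regulation id="R0001234"/>
  <part type="chapter" number="2"/>
  <part type="article" number="14"/>
  <part type="paragraph" number="3"/>
</identifier>
"""

BASE = "http://example.org/regulations"  # placeholder, not a real service


def identifier_to_uri(xml_text, base=BASE):
    """Turn an XML identifier description into a single URI string."""
    root = ET.fromstring(xml_text)
    regulation = root.find("regulation")
    if regulation is None or not regulation.get("id"):
        raise ValueError("identifier needs a regulation id")

    # The meaningless part: one registered id per regulation.
    segments = [quote(regulation.get("id"), safe="")]
    # The meaningful part: follows the structure of the document,
    # so it needs no registry entries of its own.
    for part in root.findall("part"):
        segments.append(quote(f"{part.get('type')}-{part.get('number')}", safe=""))

    return base + "/" + "/".join(segments)


if __name__ == "__main__":
    print(identifier_to_uri(SAMPLE))
    # http://example.org/regulations/R0001234/chapter-2/article-14/paragraph-3
```

Only the first segment would have to be looked up in a shared list; everything after it can be derived from the document itself.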

It's a lot of work to reconcile something which seems simple on the surface, but knowing what to call something makes it vastly easier to reference it without paying a visit. Once the common format is established, it should be easier to ensure that new software supports it.


After the coffee break, Ann Wrightson asked a basic question: why is some XML so difficult for humans to read? Wrightson's question strikes at a key concern of mine, the interest I've always had in XML as a meeting place between human and computer understandings.

Wrightson said that "An awful lot of it has to do with how computers communicate with humans," and then focused on "situation semantics," looking at signs and basic items of information Wrightson called "infons". She looked at a variety of ways these work and break down, using a conversation about rugby for illustration.

Wrightson then carried this over to XML, examining how people can try to fill in the blanks when obscure markup names are used, and the immediate limits people hit when identifiers don't conform to understood (natural language) expectations. She also examined the value of context, even partial context, for making sense of those identifiers. Abbreviations, numbers, using modeling roles for names, and opaque identifiers, all popular in computing for a lot of reasons, are frequently not helpful.

Wrightson then asked a key question: "Is human readability of XML just 'semantic sugar'?" The answer seems to depend on how much you value keeping humans close to their data. If you're excited about packing as much information into a document as possible knowing that processors on the other end will devote major resources to presenting it, then maybe it is just semantic sugar. On the other hand, if there is more than one display possibility, and especially if humans will have to interact with it in any of those possibilities...

Wrightson concluded with a lovely bit of Klingon (a translation of a Shakespeare sonnet) marked up with Elvish, giving us all an extra opportunity to contemplate how syntax sugar tastes.


Next up will be Walter Perry, about whom Elliotte Rusty Harold said this morning:

a talk from Walter Perry, one of the most iconoclastic thinkers in the XML space. He's so diametrically opposed to the conventional wisdom that most people can't even hear what he's saying. It's like trying to explain atheism to an eighth grade class in a Texas Christian school.

(I'm no doubt oversimplifying Walter's positions here or getting them wrong, but here we go.)

Perry began with a quote from Peter Murray-Rust about working in fields we don't understand (even things so reportedly simple as chemical bonds), and leaving space for other people to work. He then contrasted schemas and indices, and set up some assumptions about search, suggesting that search is about finding semantic value in a particular context.

Internal contexts - like those described by schemas - are the traditional focus of a lot of XML work. External contexts - whether hyperlinks, indexing systems, or the processes which create and consume documents - seem more interesting to Perry. (This seems to me to be where the fracture line between his views and those of the traditionalists opens, and why the conversation is so difficult.)

External processes may be interested in the internal structure of documents, but they're (at least potentially) less concerned about what the internal structure or type of the document is than they are about how they can use that document and its content. The semantics created by these processes are more interesting to Perry than the lexical details.

To Perry, it's a problem that schemas currently focus on one kind of consumption and validation, rather than an "Open-World Internetworked View" of "What Processes produce and publish (and at what URIs) documents with an external context that we understand and therefore might use." Partial processing is a possibility here, as is processing document structure in ways that vary dramatically from their creators' (or specifiers') intentions. It's the combination of the semantic expectations the reader brings with the content of the document that produces meaning, not just the internal definition.

When I first got into XML, I heard lots of stories, good and bad, about SGML and XML consultants who would show up, create a vocabulary, and head home. Processing and vocabulary evolution were implementation details, not part of the core of XML work. Perry seems to reject that approach, insisting that the XML work is going on all of the time, not just during one phase of vocabulary or document creation.

(And I have to love any talk with a slide contrasting Finnegans Wake with a Burger King pickup receipt.)

For the afternoon, the schedule split into two tracks. One is squarely focused on XQuery, while the other is about information integration issues. After seven years of hearing about XQuery without it yet reaching maturity, I've decided to tune out XQuery until it's, well, ready. So on with information integration.

(Elliotte Rusty Harold is covering the XQuery material if you're interested, and also has more on the morning presentations.)

Lee Iverson kicked off the afternoon talking not about XML, but "what XML is for, so that we can do XML-like things with anything." He asked whether our current software models - front-end/business rules/database, and model/view/controller - might be preventing a number of useful possibilities. He abstracted them to context/knowledge/data approaches, and examined the ways these pieces are layered.

Iverson confessed to not being an XML person, preferring his data models separated from the syntax, referring to HyTime as an approach that allowed software to treat diverse data sources as having a common structure. Iverson suggested "working with as much as we can manage," showing a diagram of a generic data model using typed nodes to create a simple and (perhaps) universal data model.

It's intriguing in some ways, and if you're looking for universal data models that can operate over a variety of data types (I'm very clearly not), this is definitely a good place to look.


Wendell Piez followed, talking about a question that frequently dogs XML projects and applications, the idea that format and content are best kept separated. Piez started with an example of structure that had come up earlier, the sonnet, on a slide titled "but this poem doesn't validate!"

Piez moved on to books, using the Table of Contents for Marshall McLuhan's Understanding Media, and looking at how the relationship between title and subtitle can vary and sometimes vanish. Looking more generally, Piez cited tables as a common case where semantics lurk in the content but vary from table to table. Scalable Vector Graphics (SVG) offers even larger questions about semantics lurking in the markup.

Where was Wendell headed with this? Web Graphics Layout Language (WGLL, pronounced WGLL), the vocabulary he created so that he could generate SVG (with XSLT and Cocoon) without getting stuck in the mire of writing SVG directly. It's a minimal format, inserted into SVG to simplify creation of graphics, layouts, and some simple animations to enhance the interface.

Relating the content back to a more generalized model may be slightly easier than working from SVG, but Piez suggested that it's an acceptable cost for a lot of cases. It strikes a balance between classic descriptive markup without formatting and moving directly into formatting.

As Piez writes in the paper, it's time to demystify our perspective on what descriptive markup can do for documents:

Introducing a layered system for the production of digital media provides many advantages for scaling, application design and management, and long-term maintenance. But it doesn't actually take us closer to the "truth" of a text.


For the last sessions, I thought I'd try Michel Biezunski's presentation on Talking About Talking About Topic Maps. While earlier sessions have gone meta on meta on Topic Maps, this is the first one to do that at the "what are we trying to accomplish" level rather than in a strictly formal sense.

Biezunski opened with his concerns about "ontologies," worried that the word is too constraining, too focused on reaching a single agreement about categorization. He'd prefer to take a more pragmatic approach. He also looked at clarifying the distinction between "semantic interoperability" and "semantic integration", with the former creating harmonized processes and systems while the latter is more about aggregating data from a variety of different perspectives.

Biezunski sees connections between the Semantic Web and Artificial Intelligence communities, and sees the Semantic Web as a synthetic processing of data. In his own work, Biezunski keeps finding the boundaries between automatable work and projects requiring human involvement to be fluid, changing depending on the situation, "very delicate".

After describing a number of terms he uses to reach his discussion of perspective (which he acknowledges as biased and not universal) Biezunski showed a variety of different and complex views of Lower Manhattan. He concluded with a question that seems ridiculous outside of a computing context:

Should there be one common perspective to describe Lower Manhattan?

Many systems seem intent on creating single massively-aggregated perspectives, and Biezunski challenges that, answering with another question: "what for?" While top-down approaches with everything in place work for islands of information, and provide interoperability, bottom-up aggregation of perspectives allows more careful focus on relevant material. Getting back to Topic Maps, Biezunski sees Topic Maps as amenable to multiple views.


Steven Newcomb closed out the day with discussion of Subject Map Disclosures, asking "how do we know what subject it is we want to talk about?" and formalizing the perspectives Biezunski described as Subject Map Patterns. Subject maps themselves are sets of unique subject proxies containing potentially multiple perspectives on multiple subjects. Disclosing Subject Maps means providing definitions of property classes.

It sounds like a promising way to present multiple perspectives on information, but the combination of Sudafed kicking in and the remnants of sneezing are making it hard for me to follow how this works out.

Patrick Durusau gets in a great quote I can't let go by. Riffing on The Princess Bride, Durusau said "Everything has perspective. Anyone who tells you otherwise is selling you an undisclosed perspective."

(I probably should have noted this earlier, but like Roger Sperberg, I'm here as press.)

Namespace triggers? URI generation? Situational context?