Extreme Markup, Day 3

by Simon St. Laurent

Related link: http://extrememarkup.com/



Even though I can't stop sneezing, I still need to hear the morning sessions about overlap at Extreme - it's the subject that strikes me as producing the most creative thought in the whole markup area. (Elliotte Rusty Harold gives some excellent background on the subject.)



Like yesterday, I'll be updating this article throughout the day.






Syd Bauman opened the morning with a discussion of the ways the Brown Women Writers Project has been dealing with overlapping structures in documents. Bauman acknowledged that:



XML is probably not the best way to model humanities texts, but it's what we're using today.



The overlap problem is an old one, and the Text Encoding Initiative (TEI) has been trying to deal with it for a long while. Bauman cited Hx and Steve DeRose's CLIX as prior work before setting off on a brief discussion of YAMFORX (yet another method for overlap representation in XML), complete with images of, er, yams with fork legs standing in various places.



Empty milestone elements are a classic approach to marking the start and end points of structures that can't be nested cleanly. To indicate which start and end milestones go together, identifiers are stored in sID and eID attributes.



DTD-based validation can't check that these milestones are used properly (and I don't think W3C XML Schema can either), but Bauman showed RELAX NG handling part of the rule set, and Schematron can validate all of it.
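
To make the pairing constraint concrete, here's a minimal Java sketch - mine, not Bauman's code; only the sID/eID attribute convention comes from the description above - of the kind of co-occurrence rule a grammar-based schema has trouble expressing: every sID must be matched by a later eID, and nothing may be left dangling. In Bauman's setup, this is the sort of rule Schematron handles.

    import java.util.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class MilestoneCheck {

        // Walks all elements in document order and checks that every start
        // milestone's sID is later matched by an eID, and that no eID appears
        // without an earlier sID. Element names don't matter for this rule.
        public static List<String> check(Document doc) {
            List<String> problems = new ArrayList<>();
            List<String> open = new ArrayList<>();          // sIDs still awaiting an eID
            NodeList all = doc.getElementsByTagName("*");   // all elements, document order
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                if (e.hasAttribute("sID")) {
                    open.add(e.getAttribute("sID"));
                }
                if (e.hasAttribute("eID") && !open.remove(e.getAttribute("eID"))) {
                    problems.add("eID '" + e.getAttribute("eID") + "' has no earlier sID");
                }
            }
            for (String id : open) {
                problems.add("sID '" + id + "' is never closed by an eID");
            }
            return problems;
        }

        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(args[0]);
            check(doc).forEach(System.out::println);
        }
    }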



After sorting through the schema side, Bauman turned to practical use of this technique in TEI, looking at how to integrate it with the existing vocabulary. Processing involves two levels - one for the regular markup, and then extra work to support the YAMFORX pieces, possibly converting regular XML to YAMFORX and back again.



Bauman concluded with a list of what still needs to be done to make this practical, but it's a promising start.



[paper]





Next up was Paul Caton, also from Brown, discussing how LMNL could address similar problems. While LMNL has been a frequent subject at Extreme, it hasn't seen much implementation; Caton's work developing Limner, however, suggests to him that LMNL has promise.



Caton finds LMNL important "simply because it's here now... because it helps us appreciate the context in which it exists," showing us a lot about traditional markup. Caton noted that everything in markup cycles around, with ideas constantly being reinvented, the leading edge becoming the trailing edge becoming the leading edge.



Caton showed a book he'd used for his master's thesis fifteen years ago, marked up by hand with multi-color highlighting and inserted paper notes. Caton wants to be able to do such annotation collaboratively, with tools for separating layers of annotations.



After a brief description of LMNL's layers, owner layers, ranges, overlays, and interaction with text, Caton looked at how he could apply these tools to his own application, a web-based tool for creating a variety of kinds of markup. The left-hand side shows the text, while the right-hand side allows the creation of ranges through a form that takes range and overlay information as well as text at the start and end of the range. The information is then stored in MySQL, and the application can provide a LMNL representation of the text, both as regular text with highlighting and as a more abstract view showing the layers graphically. (Caton is working on developing additional graphic representations using multiple planes.)



Caton also brought up attributed range algebra, work by Gavin Thomas Nicol that builds on his earlier core range algebra. There's no layer model, just ranges and sequences. This may avoid some data model issues in LMNL. (I'm a fan of LMNL but haven't found the data model compelling. One of these days I'd like to get back to working with LMNL, as it seems to make possible a lot of projects I've wanted to do for years.)



Caton also showed a frightening diagram of the many conversations in this space - perhaps a good sign of the intensity of the discussion, but also a clear indication that there is much left to be done in this area.






After the complexities of overlap, we moved to the challenges of difference calculation. Erich Schubert presented an effort to calculate differences among XML documents in a way that ensures the results are interpretable by humans.



After noting that the most frequently used output of GNU diff is a verbose form readable by people, Schubert looked at the reasons text difference algorithms don't quite work with marked-up documents, and then examined why many tree-based approaches work better on XML but aren't very usable by humans trying to sort out what's changed.



Schubert's preferred approach builds on query-by-example, which can support looser matching and handle questions like finding content that has moved within a document. Comparing nodes as graphs also supports a variety of structural possibilities, and Schubert explored a number of different ways to assess node-set correspondence. The paper also describes a number of places this work could go, from ways to optimize performance to integration with other types of data and databases.



The software implementing this is available as open source (along with the slides), if you want to explore in greater depth. It looks like a few people will be doing just that, as someone shouted "we want this!" during the applause.



[paper]






Sudafed has stopped my sneezing, but it's strange typing when I can't feel the tips of my fingers. If this entry lurches off into Hunter S. Thompson territory, let me know.






All this extreme structure is cool, but sometimes people just want to go extremely fast. Steve DeRose presented an analysis of a wide range of XML operations, looking at the constant battle to make them run efficiently. DeRose focused on tree management after the document has been parsed - DOM, storage management, and location identifiers.



DeRose started by explaining the costs of various kinds of processing, and how algorithm design affects those costs.



As a convenient first target, DeRose picked on the SGML & operator in DTDs: since an & group allows its items to appear in any order, naive processing expands it into every possible ordering - a factorial blow-up. A 23-item list in a document he was processing produced 10 to the 24th possibilities.



Next, he moved on to XPath and DOM operations, examining the different axes and their possibilities. XPath tends to return lists of nodes, while DOM typically returns single nodes. DOM also doesn't have a native notion of preceding or following nodes. These differences make XPath implementations do extra work when XPath is built on top of DOM. DeRose suggested that XPath processing can be optimized by storing extra information about axes with nodes, reducing the number of nodes that need to be traversed for a given operation.
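
As a rough illustration of that kind of bookkeeping - my own sketch, not DeRose's implementation - assigning each element a pre-order and a post-order number up front turns ancestor/descendant and following tests into integer comparisons rather than tree walks:

    import java.util.IdentityHashMap;
    import java.util.Map;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class AxisIndex {

        // Position of each element in a pre-order and a post-order walk.
        // Elements only, for brevity.
        private final Map<Node, Integer> pre = new IdentityHashMap<>();
        private final Map<Node, Integer> post = new IdentityHashMap<>();
        private int preCount = 0, postCount = 0;

        public AxisIndex(Document doc) {
            walk(doc.getDocumentElement());
        }

        private void walk(Node n) {
            pre.put(n, preCount++);
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    walk(c);
                }
            }
            post.put(n, postCount++);
        }

        // a is an ancestor of b iff a starts before b and finishes after it.
        public boolean isAncestor(Node a, Node b) {
            return pre.get(a) < pre.get(b) && post.get(a) > post.get(b);
        }

        // b is on a's following axis iff b starts after a and is not a's descendant.
        public boolean isFollowing(Node a, Node b) {
            return pre.get(b) > pre.get(a) && !isAncestor(a, b);
        }
    }

The obvious cost of this sort of scheme is keeping the stored numbers up to date when the document changes.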



As XML documents grow larger, the number of nodes grows, and frequently the depth of the node tree grows as well. DeRose found 'typical' documents to go eight levels deep, while military manuals using CALS tables went to 13. Similarly, most nodes have a relatively small number of child elements, but projects like dictionaries may have thousands and thousands of siblings at a single level.



DeRose concluded by looking at a few different ways to store XML data. Raw XML source is usually a forward-only approach, which works less well for larger documents unless the program happens to save some data along the way. That saved data doesn't have to be a full DOM tree - it could, for example, be an index of where elements start and stop within the document. (Unicode normalization and entities can make that extra interesting.)
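
Here's a minimal sketch of that lighter-weight idea - mine, using the standard StAX pull API rather than anything from DeRose's paper - recording where the parser reports each element's start and end tags so later passes can seek back into the source. As the parenthetical warns, entity expansion and normalization can make such offsets approximate.

    import java.io.FileReader;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class OffsetIndex {

        /** One element per entry: its name and the parser-reported tag offsets. */
        public static class Entry {
            final String name;
            final int start;
            int end;
            Entry(String name, int start) { this.name = name; this.start = start; }
            public String toString() { return name + " [" + start + ".." + end + "]"; }
        }

        // Streams the document once, recording the character offsets StAX reports
        // for each start and end tag. Offsets come from Location.getCharacterOffset(),
        // which implementations may report only approximately (or as -1).
        public static List<Entry> build(String file) throws Exception {
            List<Entry> index = new ArrayList<>();
            Deque<Entry> open = new ArrayDeque<>();
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileReader(file));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    Entry e = new Entry(r.getLocalName(),
                            r.getLocation().getCharacterOffset());
                    index.add(e);          // entries stay in document order
                    open.push(e);
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.pop().end = r.getLocation().getCharacterOffset();
                }
            }
            r.close();
            return index;
        }
    }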



Relational databases can collect information along the way, remembering parts of the document or all of it - but "relational databases are not very efficient in their use of space," creating new problems. DeRose did suggest some kinds of information especially worth keeping available, but it's still a hike. Relational databases also impose some new costs, because XML is ordered and relational database tables, by definition, are not.



DeRose's answer built on earlier work by Dongwook Shin on child sequences. Child sequences, which keep track of a node's position at each level of the tree, are very fast, but they are also somewhat brittle, requiring renumbering when changes are made to the document.
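
A child sequence is just a node's position at each level, from the root down. Here's a minimal sketch of the idea - mine, following the general notion rather than Shin's actual numbering scheme:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;

    public class ChildSequence {

        // Returns the 1-based element position at each level from the root down,
        // e.g. [1, 3, 2] means: second element child of the third element child
        // of the root. Comparing two sequences lexicographically gives document
        // order; inserting or removing an element renumbers everything after it,
        // which is where the brittleness comes from.
        public static List<Integer> of(Element target) {
            List<Integer> path = new ArrayList<>();
            for (Node n = target; n != null && n.getNodeType() == Node.ELEMENT_NODE;
                    n = n.getParentNode()) {
                int pos = 1;
                for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                    if (s.getNodeType() == Node.ELEMENT_NODE) {
                        pos++;
                    }
                }
                path.add(pos);
            }
            Collections.reverse(path);
            return path;
        }
    }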



In questions, Daniela Florescu suggested that database vendors have already solved a lot of these problems, and the markup community needs to catch up to their work.



[paper]






In the afternoon, Mirco Hilbert and Andreas Witt presented their take on the overlap issues, returning to one of the oldest pieces in the conversation, SGML's CONCUR feature, which they have implemented in their Multi-Layered XML (MuLaX).



CONCUR itself allows documents to be marked up using multiple DTDs; when processed, an application only sees one of the structures at a time. MuLaX goes beyond DTDs to support XML Schema and RELAX NG, though it uses a similar syntax and a similar approach of reporting a single annotation layer's structure at a time. While you could perhaps do something similar with namespaces, there are times when the same namespace might reasonably be used in multiple annotation layers, and namespaced XML still doesn't support overlap.



One of the more interesting angles on MuLaX was the discussion of multiple processing models, highlighting how the same data could take different routes to arrive at similar results. Also of interest: in their broader work they used multiple approaches to overlap, not just MuLaX.



[paper]






The next presentation took a fresh look at XML parsing and processing. Virtually every parser to date has reported the information stored in XML documents in a single pass from beginning to end, whether generating SAX events or DOM trees. Antonio Sierra's Free Cursor Mobility (FCM) breaks that pattern, offering more flexibility as well as an opportunity to avoid storing a complete object version of the document in memory.



Sierra's approach was designed for smaller platforms without major resources, which drove the decision to move a cursor within the document rather than duplicating the information in memory. After exploring the pros and cons of DOM, Push (SAX), and Pull parsing, Sierra showed how his FCM processor behaves.



FCM builds on the pull parser approach, using an iterator-based API that can move forward and backward through the document. The API looks a bit like regular Java iterators, with hasNext() and hasPreviousBrother() methods, as well as ways to move the cursor in the document - to the parent element, for instance, or to the previous sibling. It also allows programs to skip around, not parsing the contents of elements the program finds less interesting.
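
I haven't seen the FCM API itself, so the following is a purely hypothetical Java sketch of what a bidirectional cursor in this style might look like. Only hasNext() and hasPreviousBrother() come from the presentation; every other name here is my own invention, for illustration.

    // Hypothetical sketch of a bidirectional, pull-style cursor in the FCM spirit.
    // Only hasNext() and hasPreviousBrother() are named in the presentation; the
    // rest is invented to show the shape such an API might take.
    public interface XmlCursor {

        boolean hasNext();             // is there another node ahead in document order?
        boolean hasPreviousBrother();  // does the current element have a preceding sibling?

        void next();                   // advance to the next node
        void previousBrother();        // move back to the preceding sibling
        void toParent();               // move up to the parent element

        void skipElement();            // jump past the current element's content
                                       // without parsing it

        String name();                 // name of the element under the cursor
        String text();                 // character content at the cursor
    }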





The last two sessions of the day focused on W3C work, both in progress and to come.



Felix Sasaki opened with a talk on "Schema Languages and Internationalization Issues", combining two traditionally thorny subjects. Some of the features they want to see supported - room for language identifiers, directionality indicators, and ruby markup - seem like things that can be integrated into vocabularies easily, but much of this seems like material that goes beyond schemas. On the other hand, he's also looking at issues that could cause problems in processing chains that include schemas.



Mostly he seems to be explaining how namespaces, pattern-based descriptions, and modularization may make it easier to implement an Internationalization Tag Set (ITS). They appear to be hoping for namespace sectioning, using something like Namespace Routing Language, or possibly things like schema annotations or architectural forms.



In questions, Eric van der Vlist suggested that processing instructions might be an alternate path for some of these projects, avoiding the complications of schemas. He pushed strongly for not breaking existing schemas.



[paper]






For the last session, Liam Quin, the W3C's XML Activity Lead, is asking the audience "what are we not doing that we should be doing, and what are we doing that we should not be doing?" He's opened with a brief introduction of what's up at the W3C, including possible profiling of schemas, XLink 1.1, XSLT 2.0, XML Query, and XSL-FO 1.1.



Tommie Usdin expressed concerns about partial interoperability.

I, of course, proposed that the W3C had done well when they were simplifying SGML into XML, DSSSL into XSL, and (well, sort of) HyTime into XLink. The W3C has since morphed into design-by-committee, and so I suggested getting that XSLT 2/XQuery/XPath 2 mess out the door and shutting down for a few years. The world could catch up to what's been done, and when we go to subset again (or heck, create new features) we'll have experience.



Ann Wrightson proposed that the W3C focus on making sure that schema reduction works, and that implementations actually interoperate.



John Cowan proposed shutting down the XML Core Working Group. (Of course, he noted that he was asking just after he'd finished doing what he'd wanted to accomplish there, XML 1.1!)



Scott Tsao of Boeing expressed his concerns that the W3C is moving quickly, but education of people like the vendors he has to work with is far behind the W3C's work. He said "ultimately, my message is to slow down developing new standards." He also related issues Boeing had had using XML Schema - "as it turned out, several XML editors that we are evaluating - none of them are able to support the test cases I put together quickly from Part 0." Tsao seems to be pushing test-driven development at the W3C, which sounds intriguing.



Quin seemed hopeful about the growth of test cases at the W3C.



An audience member whose name I didn't catch asked Quin about RELAX NG and W3C XML Schema. Quin felt it was "a strength of XML that we have multiple ways to do things," and the audience seemed to agree that consolidating XML Schema and RELAX NG probably wasn't a good idea.



Igor Ikonnikov suggested that the W3C slow down with incremental changes but prepare in the long term for some larger-scale qualitative changes, noting Daniela Florescu's comments from yesterday as one possibility. He also suggested a certification program, which got support from another audience member. Quin noted some difficulties of member-funded organizations doing certification.



C. Michael Sperberg-McQueen wanted to ensure that the schema user experience workshop was better represented - not so much a profile, but rather patterns of usage. He said "there was no support for anything like a subset." Ann Wrightson responded that she doesn't currently have confidence in interoperability, and that she wants a "substantial area... in which people like me can operate with confidence."



Jon Bosak said "certification is a wonderful thing and would benefit us all," but noted there were problems around liability. Ken Holman also expressed concerns that products got tuned for test suites and not for the real world.



Emilia Georgieva asked for something like an XSLT "Enterprise Edition" - documentation of XSLT use in large-scale situations. Her work at eBay involves 4 million lines of XSLT code, a maintenance nightmare. She asked for formal documentation of best practices, especially for issues like naming things, and for better ways to integrate XSLT with tools. She suggested that companies she has worked with are shying away from XSLT because of these issues.



Quin said that the W3C was especially interested in how XSLT 2.0 would play in such environments, and suggested she send in a formal comment.



Steve Newcomb made some general comments as a "message to Tim [Berners-Lee]". He acknowledged that he was disgusted with standards bodies, but he still sees the positive, hoping for some progress toward doing the right thing. Newcomb suggested that "unfortunately or fortunately for the W3C, because of Tim's role as absolute monarch... we have a situation where the people perish or live depending on Tim's vision. I would like to see Tim articulate a way forward for the institutional character of the W3C which would make it more than just a vendor consortium, motivated by higher goals than merely avoidance of situations and technologies that would be disruptive to their members' business models." Newcomb felt that a message from this 'monarch' could have a "salutary effect" on the field, which is currently a "garden of weeds".



Where's your overlap? What's your diff?