Extreme Markup Languages, day 3

by Uche Ogbuji

I'll use this entry as an anchor for my observations on the third day of Extreme Markup Languages. I'll update it with a note each time a new talk begins, but I'll add my comments on the talk in the comments section. There is a numbering scheme for the talks, to correlate to comments.

If you happen to be reading this in an aggregator, much of the meat is in the comments, so you might want to click through.

D3.1. "Principles, patterns, and procedures of XML schema design: Reporting from the XBlog project", Anne Brüggemann-Klein, Thomas Schöpf, Karlheinz Toni

D3.2. "Enhancing AIML Bots using semantic web technologies", Eric Freese

D3.3. "Converting into pattern-based schemas: A formal approach", Antonina Dattolo, Angelo Di Iorio, Silvia Duca, Antonio Angelo Feliziani, Fabio Vitali

D3.6. "Relational database preservation through XML modeling", José Carlos Ramalho, Miguel Ferreira, Luís Francisco da Cunha Cardoso de Faria, Rui Castro

D3.7. "Mind the Gap: Seeking holes in the markup-related standards suite", Chris Lilley, James David Mason, Mary McRae


2007-08-09 06:46:48
D3.1. They're creating a new Weblog platform, for research purposes, and not because the world needs one (I can't judge, being in the process of reinventing that wheel myself).

Started with a UML model of a Weblog system. Said there was some negative feedback from the paper reviewers about modeling such a thing in OO. Blimey! I thought I was down on OO but even I'm not as fundamentalist as that. Model in whatever drapes your brain all comfy, OO, ER or somat. They did use classic DTD design principles to translate that OO model into the XML manifestations, and the audience seemed to appreciate this.

Of course, once they started with OO, they decided that WXS is the most natural schema language to use. I think the problem here is not that they started with OO, but that they couldn't separate thinking about source code from thinking about content. Almost any other schema language would have been better for the sorts of rich, mixed content Weblogs tend to be built on.

They went with a very mechanical process of mapping abstract classes to abstract WXS types, building on a very rigid design pattern they call "double extension". This doesn't feel to me a likely way to achieve the sort of flexibility required for a typical Weblog system.

I did stand up and put in my two cents on that point. I might not have been as concerned except that they say they are looking to use this system as a platform for teaching students about design, and it's hardly a pedagogical approach I could favor.

2007-08-09 07:41:15
D3.2. Eric remarked that he'd been explaining to his mother what he does, and she never got it until he told her he's building the Star Trek computer. She finally got it, but now that means he actually spends some of his spare time working on the Star Trek computer. Hey, I'll try that for my next elevator pitch. Err, maybe. Depends on the elevator and the occupant.

Pithy quote: "One of the reasons the Semantic Web hasn't taken off is that the porn industry hasn't yet taken it up".

Anyway he's working on an extensible Eliza, as one way of putting it. A set of chat bots.

The substrate knowledge of the bots is RDF triples. The system builds on standard RDF vocabs, such as using FOAF to answer questions about known individuals. It can use MusicBrainz to carry on discussion or provide information about digital music, etc.
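To make the idea concrete, here is a minimal sketch of answering a chat-style question from a handful of triples. It is not Eric's implementation: the in-memory list, the `match` and `who_does_know` helpers and the example.org URIs are all made up for illustration. Only the FOAF namespace and its `name` and `knows` properties are real.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical facts; the example.org URIs are invented for this sketch.
TRIPLES = [
    ("http://example.org/people#alice", FOAF + "name", "Alice"),
    ("http://example.org/people#bob", FOAF + "name", "Bob"),
    ("http://example.org/people#alice", FOAF + "knows",
     "http://example.org/people#bob"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def who_does_know(name):
    """Answer 'Who does <name> know?' by following foaf:knows links."""
    answers = []
    for subj, _, _ in match(TRIPLES, p=FOAF + "name", o=name):
        for _, _, friend in match(TRIPLES, s=subj, p=FOAF + "knows"):
            answers.extend(
                n for _, _, n in match(TRIPLES, s=friend, p=FOAF + "name"))
    return answers
```

A real bot would presumably sit on a proper RDF store and map AIML patterns onto such queries; the point here is only that a question about a person reduces to a couple of triple-pattern lookups.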

Much of the talk was off-the-cuff because of technical problems with Eric's laptop and the projector (and he was using Windows!).

2007-08-09 08:54:00
D3.3. XML pattern sleuthing, in effect. Started with an XML snippet with purposefully unhelpful generic identifiers, and listed some of the clues that can be used to deduce useful facts from the document. For example one might guess something about elements that contain mixed content, or elements that in every case contain only numbers.
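The two clues I mention can be gathered mechanically. What follows is a rough sketch of my own, not the authors' formal approach: the sample document, the function names and the wording of the clues are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Deliberately unhelpful generic identifiers, as in the talk's example.
SAMPLE = ("<a><b>Some <c>emphasized</c> words here.</b>"
          "<b>Plain text only.</b><d>42</d><d>7</d></a>")


def profile(xml_text):
    """Count, per element name, how often it has mixed or numeric content."""
    stats = defaultdict(lambda: {"count": 0, "mixed": 0, "numeric": 0})
    for el in ET.fromstring(xml_text).iter():
        s = stats[el.tag]
        s["count"] += 1
        children = list(el)
        has_text = bool((el.text or "").strip()) or any(
            (c.tail or "").strip() for c in children)
        if children and has_text:
            s["mixed"] += 1  # text interleaved with child elements
        if not children and (el.text or "").strip().isdigit():
            s["numeric"] += 1  # leaf element holding only digits
    return stats


def clues(xml_text):
    """Turn the raw counts into human-readable guesses per element name."""
    out = {}
    for tag, s in profile(xml_text).items():
        if s["mixed"]:
            out[tag] = "mixed content (prose-like)"
        elif s["numeric"] == s["count"]:
            out[tag] = "always numeric"
    return out
```

On the sample, `clues(SAMPLE)` flags `b` as prose-like and `d` as always numeric, and says nothing about `a` or `c`.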

Offered a pretty good list of "dimensions" of a document:

* content
* structure
* presentation
* behavior
* metadata

I like this, and might just have to steal it. I thought for a moment that behavior and presentation could be collapsed, but I can accept that presentation is important enough to stand alone.

Another useful insight presented was 3 main roles in XML doc creation:

* Schema designer
* Content author
* Document editor

I'm not sure I agree with how they categorized the latter 2. They had the author as only concerned with pure text, while the editor is responsible for markup. In my experience, the author is likely to make the first pass at both text and markup, and the editor will correct text, refine markup, and most importantly approve for publication. With that clarification I've found this set of actors fundamental in every CMS deployment on which I've worked.

Much of the talk is a discussion of unnecessary strictness in schemata, advocating minimal strictness. This is a design issue that comes up often, and I appreciate the analysis of what sorts of schema constructs add what level of strictness, even if I don't agree that laissez-faire is always best (it all depends, I'd say). I hope this part of their paper is not impenetrable because I'd love to check it out.

Again I think there is good analysis in this presentation, but some of the mapping of this to real workflow is a bit misplaced. I think that some of the effort in relaxing schemata should occur not as a loosening of schema, but rather as a tidying step that occurs either as an automated process prior to the role of the editor, or as a set of tools available to the editor himself. Strict validation should occur after this point, but I think one advantage of strict validation is that it keeps the downstream processing manageable.

So we can afford authors some flexibility without necessarily surrendering to chaos in the document corpus. To give a similar example, we don't use validation on random tag soup we get on the Web. We tidy it (using Cowan's TagSoup tool in my case) and *then* we can rely on consistency. No one thinks it's realistic to tame all the producers of tag soup, but that doesn't mean we surrender to forcing all our systems to deal with the mess.

2007-08-09 12:50:04
D3.6. RODA is a huge Portuguese digital archival project (36TB at present). An important aspect of building trust in their system is in documenting all actions taken on digital objects. The trust requirements can rise as high as their being used as evidence in law.

Their system implements ISO OAIS ( http://en.wikipedia.org/wiki/OAIS ).
They have some conventions such as normalizing all images to TIFF, all documents to PDF, and all database information to XML. For the latter case they developed DBML, an XML format for storing an RDBMS, including the DB metadata (its structure), with defined SQL transforms to and from DBML. One sample practicality is that BLOBs are extracted and stored as separate files. They also use the Fedora repository for digital objects ( http://fedora.info/ ).
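For a feel of what such a dump involves, here is a minimal sketch using SQLite. It is emphatically not DBML: I don't have the real format to hand, so the element and attribute names (`database`, `table`, `structure`, `column`, `data`, `row`, `field`, `file`) and the BLOB file-naming scheme are my own assumptions. It covers only the database-to-XML direction, and it collects BLOBs in a dict rather than writing them to disk.

```python
import sqlite3
import xml.etree.ElementTree as ET


def dump_database(conn):
    """Serialize the structure and data of every table to an XML tree.

    BLOB values are not inlined: they are collected in a dict keyed by a
    generated file name, and the XML carries only a reference to that name.
    """
    root = ET.Element("database")
    blobs = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        t_el = ET.SubElement(root, "table", name=table)
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        struct = ET.SubElement(t_el, "structure")
        for col in cols:
            ET.SubElement(struct, "column", name=col[1], type=col[2])
        data = ET.SubElement(t_el, "data")
        for n, row in enumerate(conn.execute(f'SELECT * FROM "{table}"')):
            r_el = ET.SubElement(data, "row")
            for col, value in zip(cols, row):
                f_el = ET.SubElement(r_el, "field", name=col[1])
                if isinstance(value, bytes):
                    fname = f"{table}-{n}-{col[1]}.bin"
                    blobs[fname] = value  # to be stored as a separate file
                    f_el.set("file", fname)
                elif value is not None:
                    f_el.text = str(value)
    return root, blobs
```

Calling `dump_database` on a connection returns the XML tree plus the extracted BLOBs; `ET.tostring(root, encoding="unicode")` gives the serialized form. Keeping the structure alongside the data is what makes the reverse transform plausible.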

Their portal: http://roda.iantt.pt/

The talk resulted in a torrent of response from the attendees, mostly every manner of finger-wagging. Frankly this surprised me more than anything else in the talk. I didn't get the impression that RODA is doing anything other than making a game attempt at digital archiving, and I think much about their approach seems reasonable enough, considering the magnitude of the ambition. And what's wrong with such an ambitious undertaking? We're sure to learn useful lessons from it, and I don't see how any likely loss is significantly more than making no attempt. I guess that's where federation is a virtue. Let the Portuguese do it their way, and the Spanish theirs, and the Canadians theirs, and so on, and we'll all learn what's useful and what's potty.

2007-08-09 14:35:19
D3.7. So this one is "what else do we want XML standards committees to work on?" Hmm. I'm tempted to just think "nothing, y'all, we have enough of a leaky barge laden with bullion and scrap. We might want to tidy that up a bit, first?" Yeah, I know. Whom am I kidding?

Mike Kay asks when we'll clean up all the mess at the core of XML. Lilley defers to Liam. Liam says W3C hasn't the resources to do it. What about XML 1.1? Anyone here using that? No hands. John Cowan says "not even me".

Syd Bauman would like to see the XML community agree on a way for a document to give a hint to the processor of what schema to use.

Ann Wrightson wants various standards orgs to stop overlapping and conflicting. Mike Kay says "competition is good. Without it you get stagnation". Ann clarifies that she is really more concerned about non-fundamental standards. Lilley mentions W3C memoranda of understanding between e.g. W3C and OMG or OASIS.

John Cowan praises Unicode/ISO collaboration. Lilley says W3C has a mechanism for rescinding a rec, and he might like HTML 3.2 to enjoy such fate. McRae says OASIS does not yet have a deprecation mechanism, but some on staff are pushing for that.

[I missed who] mentioned that DSSSL is outdated and could be replaced by XSL*. Mason says some people are still using DSSSL, that about once a month or so there is a flurry of activity on the Mulberry DSSSL list, and that there are constituents, particularly in Japan where there is still desire to maintain it. Lilley asked whether the existence of both interferes with anyone's use of XSL-FO. No one said so.

Fabio Vitali mentions that the major standards orgs are slow to tackle the big areas, and for example, poor standardization in Web 2.0, blogs, wikis, collab, IM, etc. Lilley mentions W3C Web API work, Jabber at IETF, etc.

Henry Thompson mentioned that the likes of ISO standardization of ping and HTML and such were a political response to German law (you couldn't make a mandate that didn't have ISO's stamp). But it takes more than wishful thinking to marshal the sorts of resource-intensive activity such standardization requires. Mason points out how ISO HTML and W3C's HTML diverged through the separate processes of the two orgs.

Martin Bryan pointed out that we have to be careful not to standardize stuff too early, while it should still be in the hands of the industry. Also pointed out possible conflict between DSDL and W3C's work on streaming transforms. Wrightson proposed a joint workshop.

Mason communicated some suggestions from folks he asked: ZIP (used by both ODF and OO XML), a graphical language for representing XML validation models, a general validation reporting language, and standardizing SAX.

Steve DeRose asked for standard ways of communicating entity boundaries. Also: "annotations please....Like Annotea, but something that works. Sharing annotations locally. So I can go to any Web site and stick a note on it, [and to some extent share those]." Lilley thinks it sounds simple but is really a hotch-potch of requirements. DeRose counters that the set of problems is well known and it would still be worthwhile to try. Another asked whether that could have been easily added to XLink. Response: "let's not put more gasoline on that fire".

That rekindled the whole competing needs discussion, to which Henry Thompson contributed his maxim: "Most standard orgs want to meet the 80/20 point. Problem is one person's 80 is another person's 20."

John Tucker pled against the copyright/cost of ISO standards. Mason hoped W3C, OASIS etc. could serve as lights for ISO, but pointed out that this matter is tied up in the fact that high tech is just a small piece of ISO, and that some ISO national bodies are essentially publishing houses. SC34 makes a regular practice of saying, when it collabs with other standard orgs, that the online ISO version must be as free as the original body's version. Debbie Lapeyre pointed out that Mason's tireless effort kept available a lot of what's built this whole community, especially in the early days.

Wrightson mentioned confusion between W3C and OASIS re: Web services. She wants the orgs to get together and give a reasonable account of what's going on in the large.

Final note. This session especially made me think: Rick Jelliffe, we missed ya here.