From duct tape to chewing gum and baling wire

by Simon St. Laurent

Related link: http://www.w3.org/MarkUp/2004/02/xhtml-rdf.html



There's long been a disconnect between the regular (X)HTML Web and the RDF Semantic Web, one frequently bridged with duct-tape-like solutions. Unfortunately, rather than improve on that duct tape with a clean joint, a recent proposal suggests replacing the duct tape with a bit of bubblegum and wrapping the whole thing in baling wire.



In the early days of HTML, the META tag seemed like a pretty cool thing, a way to put information into the HEAD of an HTML document that applications could interpret - or not interpret - as they liked. It was easily extensible in those cheerful days before namespaces, even if much of its use was for architecturally suspect things like blending HTTP headers into HTML documents.



As people have come up with new things to do with the Web, META (now typically lower-cased as meta) has continued to be popular. Looking at website source code, you may find things like:



<meta name="ICBM" content="42.46558,-76.41397" />


for use with GeoURL, or:



<meta name="dc.creator" content="Simon St.Laurent" />


The former relies, in classical HTML style, on the expectation that it's unlikely someone else will create a property named "ICBM" and use it for something other than geographic coordinates. The latter reflects a more paranoid time, when Dublin Core metadata is typically prefixed with dc or DC to avoid name collisions.



This prefixing is accomplished here with a duct-tape solution, something convenient that fixes the problem roughly for a large class of processors and conveniently doesn't require much hunting around in the document. You can implement tools which hunt for this duct tape without even using XML tools - regular expressions will do just fine.
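As a rough illustration of how little machinery the dot convention needs, here is a minimal sketch in Python using only a regular expression. The sample page, the pattern, and the variable names are assumptions made for this example; the pattern expects name before content and double-quoted values, which is exactly the kind of rough fit duct tape offers.

```python
import re

# Hypothetical sample page; the two meta tags are the ones quoted above.
html = '''<html><head>
<meta name="ICBM" content="42.46558,-76.41397" />
<meta name="dc.creator" content="Simon St.Laurent" />
</head><body></body></html>'''

# Matches only meta tags whose name starts with "dc." or "DC.".
# Assumes name precedes content and both values are double-quoted.
META_RE = re.compile(
    r'<meta\s+name="(?:dc|DC)\.([A-Za-z]+)"\s+content="([^"]*)"'
)

# Map each Dublin Core property name (lower-cased) to its value.
dublin_core = {prop.lower(): value for prop, value in META_RE.findall(html)}
print(dublin_core)
```

No XML parser is involved, and the ICBM tag is simply ignored because it lacks the prefix.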



Unfortunately, this convention for Dublin Core doesn't meet the expectations of URI and QName-obsessed RDF triple processing. As Birbeck points out in the Note linked above, the triples don't work because the dotted name doesn't supply the full set of information RDF expects.



Birbeck's solution is to replace the duct tape holding together the prefix and the name with a colon - the bubblegum - and then wrap the whole thing in baling wire, since you now need to find your namespace declarations in the surrounding context.



Why do I describe this as mere baling wire and not a more robust form of adhesive? It's because the solution proposed here relies on a use of QNames that requires lots of extra work on the part of the implementer. The information needed to process the meta tag successfully is now a namespace declaration likely elsewhere in the document, but that namespace information won't flow naturally to your processor, even if the XHTML is well-formed XML. This meta element, for instance:




<meta name="dc:creator" content="Simon St.Laurent" />


will still be reported to an application by a garden-variety XML processor as a meta element containing an attribute named "name" whose value is "dc:creator" and an attribute named "content" whose value is "Simon St.Laurent".
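To see this concretely, here is a small sketch using Python's standard xml.etree.ElementTree as a stand-in for a garden-variety namespace-aware processor. The surrounding XHTML document is a hypothetical wrapper invented for the example; only the meta element comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML wrapper that declares the dc prefix on the root element.
doc = '''<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Example</title>
    <meta name="dc:creator" content="Simon St.Laurent" />
  </head>
</html>'''

root = ET.fromstring(doc)
meta = root.find('.//{http://www.w3.org/1999/xhtml}meta')

# The element name is namespace-qualified, but the attribute values are
# handed over as plain strings: "dc:creator" is just text to the parser.
print(meta.tag)
print(meta.attrib)
```

The dc declaration on the root element has no effect on what the application sees in the name attribute.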



There is no magic processing of the name attribute into "a QName whose URI is 'http://purl.org/dc/elements/1.1/' and whose local name is 'creator'", which is the information you need to actually make the triple. XML processors do this work for element and attribute names, but not for content.



Even if a schema identifies the name attribute as being of type xs:QName, only schema-compliant processors producing a post-schema validation infoset (PSVI) will provide that, and there aren't a whole lot of those in the world. The PSVI and similar approaches aren't particularly renowned for their efficiency in any case, and it's especially hard to justify using that heavy a style of processing on what was until recently pretty ordinary and easy HTML, even if it was cleaned up to XHTML.



If a schema isn't available, the lucky implementer gets to keep track of which namespace prefixes are in scope at a given point in the document and break down the QName manually. If developers are using an environment that didn't value prefixes enough to keep them around, they're out of luck. If the user didn't bother to declare the namespace, the document is still well-formed XML, but the developer can either guess what it was supposed to be - the equivalent of the earlier duct-tape solution with the dot - or just give up.
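The bookkeeping described above might look something like the following sketch, which uses the namespace events of Python's xml.etree.ElementTree.iterparse to maintain a stack of in-scope prefixes. The function name resolve_meta_qnames, the sample document, and the undeclared geo prefix are all hypothetical, chosen only to show both the success case and the give-up case.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: dc is declared, geo is deliberately NOT declared.
doc = b'''<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Example</title>
    <meta name="dc:creator" content="Simon St.Laurent" />
    <meta name="geo:position" content="42.46558,-76.41397" />
  </head>
</html>'''


def resolve_meta_qnames(xml_bytes):
    """Resolve QName-like values of meta/@name against in-scope prefixes."""
    in_scope = []   # stack of (prefix, uri) declarations currently in scope
    resolved = []   # (namespace URI or None, local name, content) tuples
    events = ('start-ns', 'end-ns', 'start')
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == 'start-ns':
            in_scope.append(payload)      # payload is (prefix, uri)
        elif event == 'end-ns':
            in_scope.pop()                # a declaration went out of scope
        elif payload.tag.rsplit('}', 1)[-1] == 'meta':
            name = payload.get('name', '')
            if ':' not in name:
                continue                  # not a QName-style value
            prefix, local = name.split(':', 1)
            # Innermost declaration wins; None means the prefix is undeclared.
            uri = next((u for p, u in reversed(in_scope) if p == prefix), None)
            resolved.append((uri, local, payload.get('content')))
    return resolved


resolved = resolve_meta_qnames(doc)
for item in resolved:
    print(item)
```

Note that the tree ElementTree returns from a plain parse discards prefixes entirely, so this only works by listening to the namespace events during parsing. An environment without such a hook leaves the developer with the guess-or-give-up choice, as does the undeclared geo prefix here.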



There is a simple solution that cleans up the joint and builds a stronger structure on which RDF and other developers can build, however. It does mean discarding the HTML META element's extensibility through a name attribute, and turning to the very technology that makes this particular use of META so painful: namespaces.



Instead of fiddling with this:



<meta name="dc:creator" content="Simon St.Laurent" />


use this:



<meta dc:creator="Simon St.Laurent" />


This will be consistently reported as a meta element with an attribute whose URI is "http://purl.org/dc/elements/1.1/", whose local name is "creator", and whose value is "Simon St.Laurent". All the information needed to create the triple is available, without the need to use tools any more complicated than a namespace-aware XML processor, which is most of them.
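For comparison, here is a sketch of what a namespace-aware parser hands over for the attribute form, again using Python's xml.etree.ElementTree. The XHTML wrapper and the variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

DC = 'http://purl.org/dc/elements/1.1/'

# Hypothetical XHTML wrapper declaring the dc prefix.
doc = '''<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Example</title>
    <meta dc:creator="Simon St.Laurent" />
  </head>
</html>'''

root = ET.fromstring(doc)
meta = root.find('.//{http://www.w3.org/1999/xhtml}meta')

reported = []
for qualified_name, value in meta.attrib.items():
    # ElementTree writes namespaced names as "{uri}local".
    if qualified_name.startswith('{'):
        uri, local = qualified_name[1:].split('}', 1)
        reported.append((uri, local, value))

print(reported)
```

The parser has already resolved the prefix, so the application never has to look at namespace declarations itself.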



You can even go from there to:



<meta><dc:creator>Simon St.Laurent</dc:creator></meta>


Once you start treating meta as a first-class container, you can make much more sophisticated statements, using all of RDF/XML if you really want. It becomes a clean joint between HTML and RDF using core XML mechanisms, capable of supporting far more weight.
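A sketch of the container form, again with Python's xml.etree.ElementTree: each child of meta arrives with its namespace URI and local name already resolved, so forming a triple is little more than concatenation. The subject URI http://example.org/page and the XHTML wrapper are hypothetical; joining namespace URI and local name to get the property URI follows the RDF/XML convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical subject: the URI of the page the metadata describes.
SUBJECT = 'http://example.org/page'

doc = '''<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Example</title>
    <meta><dc:creator>Simon St.Laurent</dc:creator></meta>
  </head>
</html>'''

root = ET.fromstring(doc)
meta = root.find('.//{http://www.w3.org/1999/xhtml}meta')

triples = []
for child in meta:
    # child.tag looks like "{namespace-uri}local-name".
    uri, local = child.tag[1:].split('}', 1)
    triples.append((SUBJECT, uri + local, child.text))

print(triples)
```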



The mechanisms for including metadata in the body that are described in the rest of the Note are yet another tangled mess of chewing gum and baling wire that needs repair, but as this blog entry is already far too long, I'll leave that as an exercise for the reader.



Note: If you're curious about the gum and baling wire metaphor, see
this explanation.




Ever notice that using markup syntax as designed is a lot easier than extending syntax in ways that aren't magically supported?


7 Comments

mdubinko
2004-03-10 11:00:18
Validation is the issue
I agree that URI-madness is sweeping the XML world, and not in a good way.


My understanding is that the major need of the XHTML community is clean validation. Having arbitrarily namespaced elements/attributes mucks things up in a world where DTD (alas) can't be escaped.


This is good rant-fodder though. :-) What's the deal with *full* URIs getting used everywhere? Short reverse-DNS prefixes, with judiciously chosen pre-defined segments, have seemed to work great for ages, e.g. "java.lang.String".


So, since Dublin Core is quite widespread, why not just accept "dc" as a prefix and have:


<meta name="dc.name" ...
or
<meta name="org.geourl.ICBM" ...


-m

simonstl
2004-03-10 16:25:51
DTDs can cope with
<!ELEMENT meta ANY>


That does leave the attribute option high and dry, as there isn't an equivalent ATTLIST declaration.


Still, I'll take:


<meta><dc:name>whoever</dc:creator></meta>


over QNames in content any day, even if the W3C can't break itself of the DTD habit while breaking a variety of other things in XHTML 2.0.

simonstl
2004-03-10 16:26:53
</dc:name> , not </dc:creator>
Oops.

dmh
2004-03-12 04:40:54
My own suggestion...

I think something like the following would be good enough, and is as clean as I think it can be without creating problems with DTDs:



<head>
<title>Document Title</title>
<link rel="schema:DC" href="http://purl.org/dc/elements/1.1/"/>
<link rel="schema:Con" href="urn:x-dmh:contact"/>
<meta property="DC:creator">
<meta property="Con:fullName">Jane Doe</meta>
</meta>
<meta property="DC:language" scheme="RFC1766">en-GB</meta>
<meta property="DC:rights">Copyright 2004 Example Corp.</meta>
</head>

simonstl
2004-03-12 05:44:38
Good enough?
Dumping effective namespace declarations into the link element creates even more chaos, as we now have things that look and act like QNames in content, but aren't, while sticking to a painfully broken use of attributes to identify name/value pairs.


(Maybe DC.creator instead of DC:creator would ease the confusion, but that's merely trading in one flavor of chewing gum for another.)


Maybe it's time for XHTML 2.0 to stop letting the limitations of DTDs be limitations for XHTML, and spend a little of the energy they've spent on DTDs looking at the core of their markup instead.


Both RELAX NG and W3C XML Schema offer better options for dealing with these kinds of validation issues. META-hacks have always been pretty weak, and trying to use them to support the Semantic Web would be hysterically funny if it wasn't so sad.

su
2004-03-12 05:52:08
My own suggestion...
But aren't you misusing the href attribute here?

dmh
2004-03-12 06:23:44
Good enough?
Well I certainly don't deny it's a bit of a hack, caused as you point out by the limitations of DTDs. I suspect this DTD/namespace/metadata issue is why it's been so long since the last XHTML 2.0 Working Draft.