XML 1.0 (draft fifth edition) builds a foundation then doesn't use it

by Rick Jelliffe

The comments period for the XML 1.0 fifth edition revision finished last Friday 16th May. I didn't make a submission, in part because I felt I have had a good run in the past and my concerns are pretty well known and unchanged.

In XML 1.0, we went strongly against accepted wisdom which held 1) that the future was Unicode so you didn't need to support existing encodings, 2) that the present was beautifully layered so one standard shouldn't try to overcome the deficiencies in others, and 3) that we should all live in a Standards Fantasyland (on the map near Boogie Wonderland) where even if the world had gone one way that didn't agree with what the existing standards said, we should follow the standard. A complete triumph of engineering (systematizing what works) over schematising (insisting on the right way to do things).

So for 1) the XML encoding header allows multiple encodings. Now, ten years later, we are finally reaching the stage where UTF-8 for web pages has exceeded ASCII and ISO 8859/Windows encoded pages (Unicode wrangler Mark Davis, now with Google but for a long time with IBM, recently released some figures on this), so it may indeed be coming closer to the time when XML can be simplified so as to only support UTF-* encodings. I doubt there will be any demand for that simplification, though, because multiple-encoding support is handy, free (everyone has large transcoder libraries) and doesn't get in anyone's way.

For 2) the example is that XML adopted what we now call IRIs for system identifiers in entities: it took the IETF almost a decade to catch up and formalize this, surely a record for any standard. "Internet time"? Are you kidding? XML deliberately didn't use the official URL syntax, but opted for the approach that it was better to have the software shield the user from the details of delimiting. I think there are very few advocates of XML simplification who would be prepared to go back to using vanilla URL syntax. But now, 10 years later, entities are fast disappearing (mind you, just this week I had a seminar where there were surprisingly many questions on trying to use entities with schemas) and the IRI spec is out. Namespaces and XLink should be using IRIs now, but there is an underlying problem that character-by-character comparison of IRIs is not robust unless they are canonicalized.

For 3) the example was again the XML header specifying the encoding, despite the information supposedly being available in the HTTP MIME headers. But the standards got it wrong: in effect, the person who creates a file is not the person who sets the HTTP MIME header. Now, 10 years later, the relative reduction in the number of encodings in widespread use does make encoding sniffing a much more workable approach, but it is still too fallible and time-wasting for mission-critical data.
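
To make the point about in-document labelling concrete, here is a minimal sketch of reading the encoding label out of the XML declaration itself rather than trusting a transport header. The helper name declared_encoding and the regular expression are illustrative assumptions, not part of any XML library, and the sketch only handles ASCII-compatible encodings (it makes no attempt at UTF-16 or EBCDIC detection).

```python
import re

# Illustrative pattern (an assumption, not a full XML declaration grammar):
# matches <?xml version="..." and an optional encoding="..." that follows it.
_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["'][^"']*["'](?:\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])?"""
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    m = _DECL.match(data)
    if m and m.group(1):
        return m.group(1).decode("ascii").lower()
    return default

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\xe9</note>'
label = declared_encoding(doc)   # label chosen by whoever created the file
text = doc.decode(label)         # decode with the author's own label
```

The label travels with the file, so it survives being copied between servers whose HTTP configuration the author never sees.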

In XML 1.1, engineering won again. The decision was made to open up the naming rules from XML 1.0 to remove a dependency on versions of Unicode. However, because this meant in turn that XML 1.1 processors would not as reliably detect encoding errors (when you see "encoding error" think "database corruption" or "spurious data" or "spurious rejected documents") the treatment of the C1 range of control characters (0x80-0x9F in ISO 8859-* encodings) was clarified to be non-well-formed (with special treatment for IBM's NEL character). Control characters have no place in markup, as confirmed by Unicode Technical Reports and as emphasized recently by the OOXML BRM, which required MS to change a couple of places where some control characters could be entered even though harmlessly delimited. I was startled during the OOXML debates by how strongly this was held to be a vital, core part of the XML story from all sides.
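
As a rough illustration of why C1 controls are a useful tripwire for encoding errors, the sketch below decodes Windows-1252 bytes as if they were ISO 8859-1 and then looks for code points in the C1 range. The function name find_c1_controls is hypothetical, and leaving NEL (U+0085) out of the flagged set is an assumption modelled on XML 1.1's special treatment of that character.

```python
def find_c1_controls(text):
    """Return (index, 'U+XXXX') for each C1 control other than NEL (U+0085)."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F and ord(ch) != 0x85
    ]

# Curly quotes saved as Windows-1252 occupy bytes 0x93 and 0x94.
raw = "\u201cquoted\u201d".encode("cp1252")

# Mislabelled as ISO 8859-1, those bytes decode "successfully" to C1 controls.
misread = raw.decode("iso-8859-1")

hits = find_c1_controls(misread)
```

Here the decode raises no exception, so the C1 check is the only thing that exposes the mislabelled encoding.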

XML 1.1 was an enormous flopperoony, for the unsurprising reason that if you put version="1.1" then an XML 1.0 processor would spit the dummy. Some people have tried to claim that it failed because previously well-formed 1.0 documents that had C1 controls in them became non-WF. I have never seen such a document in the last decade, nor have I ever had any credible reports of one, and I can see no cases where putting C1 control characters in a document would be legitimate practice, so I think it is just bluffing: there has always been a wing of users of XML whose life would be easier if they could embed raw binary into XML and they deserve no sympathy or help.

So along comes XML 1.0 (fifth edition) as a draft. It has only a couple of changes of significance. The first is that it finally puts in place a rudimentary versioning system: E10 allows an XML 1.0 processor to parse an XML 1.x document on the understanding that it only reports things in terms of XML 1.0 rules and capabilities.
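
The versioning behaviour described above can be sketched as a small decision function. This is an illustration under stated assumptions, not text from the erratum: the name accept_as_xml10 and the lenient flag are invented here, with lenient=False standing in for an older processor that insists on "1.0" and lenient=True for one that processes any 1.x document under XML 1.0 rules.

```python
import re

# Assumed shape of a minor version number: "1." followed by digits.
_VERSION_1X = re.compile(r"1\.[0-9]+")

def accept_as_xml10(version: str, lenient: bool) -> bool:
    """Decide whether an XML 1.0 processor goes on to parse the document.

    A True result only means parsing proceeds; the document is still
    judged against XML 1.0 rules and capabilities.
    """
    if version == "1.0":
        return True
    return lenient and _VERSION_1X.fullmatch(version) is not None

old_parser = accept_as_xml10("1.1", lenient=False)
new_parser = accept_as_xml10("1.1", lenient=True)
```

With this in place a document can be honestly labelled with its minor version without being rejected outright by every 1.0 processor.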

The second change then makes a mockery of the first. It introduces the lax naming rules from XML 1.1. Now such a change is not required for any reason, because XML 1.1 exists and could be used. So rather than go into a well-managed regime where documents are well-labelled, and XML minor versions chug along, XML 1.0 draft fifth edition just allows a new XML 1.0 parser to accept documents that all the other old XML 1.0 parsers will reject: and remember this is not because of previous bad practice being more consistently exposed, but because some innocent person has created a document with the new name characters and the XML 1.0 processors deployed in the last decade reject it.
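
For readers who want to see what "lax" means in practice, here is a sketch of the name-start check as a plain range test. The ranges are transcribed from the NameStartChar production as it appears in XML 1.1 and the fifth edition draft, to the best of my reading, so verify them against the spec before relying on them; the function name is_name_start_5e is an assumption. The earlier editions instead used explicit character tables tied to an old Unicode version, which are too long to reproduce here.

```python
import unicodedata

# Ranges transcribed from the NameStartChar production (check against the spec).
_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

def is_name_start_5e(ch: str) -> bool:
    """True if ch may begin a name under the range-based rule."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _NAME_START_RANGES)

# An Ethiopic letter, outside the older editions' tables as far as I know.
ethiopic_ok = is_name_start_5e("\u1200")

# U+0378 is accepted even though Unicode has assigned no character to it.
unassigned_ok = is_name_start_5e("\u0378")
unassigned_category = unicodedata.category("\u0378")  # "Cn" means unassigned
```

The check never consults the Unicode character database, which is what removes the dependency on Unicode versions and also what lets unassigned code points through.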

Basically, the W3C XML WG is saying that if you get a document that breaks in this way, it is the receiver's problem. The sender can say "But it is well-formed against the latest version of XML 1.0" and the XML WG washes their hands. It is the triumph of bad engineering practice, of doing what can be guaranteed to fail, of putting the responsibility on the wrong person. It will cause problems first for the nominal beneficiaries of these extra name characters (since they will be unreliable) and second for people using non-UTF-8 encodings who won't get as many WF errors. So who will benefit: the makers of standards who will have less housekeeping. They are not an unworthy set of stakeholders.

The W3C XML WG needs to revise the goals of XML (in s 1.1) to accommodate these changes. In particular,

6. XML documents should be human-legible and reasonably clear.

no longer holds. The new rules allow a blank check, so you could have a document entirely made with element and attribute names from code points which have never even been allocated a character by Unicode. With the fifth edition, the goal becomes

6. XML documents may be human-legible and reasonably clear.

And the goal 5. needs changing

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero

because in effect support for these new naming characters becomes an optional feature: does your XML 1.0 parser support editions 1-4 or edition 5?

I didn't write a comment to the W3C XML WG because nothing has changed over the last 10 years that makes the decisions in XML 1.0 and in XML 1.1 inappropriate. I don't have any new information that changes anything, and the XML WG certainly has produced none. All that is needed is for the fifth edition to fix up the minor versioning issue, and then we could all transition to 1.1 on an as-needs basis. This minor-versioning fix is already at least five years overdue: fixing it opens the door for XML 1.1 to have a snowflake's hope and will allow a better transition to XML 1.2 potentially including some other overdue changes (building in xml:id, namespaces, etc.)

To summarize: XML 1.0 (fifth edition) is bad from a standardization and engineering viewpoint, betrays the goals of XML 1.0 which have served well for the last decade, and may hurt the end-users it is intended to support. It sets up a workable versioning mechanism then fails to use it for a significant change. It provides a good foundation for workable minor versioning, then ignores the foundation and builds on sand with its allowing of incompatible names.

I may be wrong, but it looks like a hack to me. However, fortunately it barely impacts anyone in the West, including me nowadays, so who cares? Interoperability, schminteroperability! Unambiguous labelling of data formats, gedoudahere!

I am not trying to suggest the W3C XML WG is doing this because they prefer to sit by some giddy swimming pool in their floral-printed bathing costumes sipping umbrella-ed beverages, that they clear their desk by making incompatibility problems someone else's problem, or any laziness! But I think they at least owe it to us to explain why they are making a substantive minor-version change as an edition change, failing to use the versioning mechanism they are setting up at the same time, which would allow people who needed this feature to access an already-existing minor version!


John Cowan
2008-05-22 15:15:21
(Disclaimer: I speak for myself, not the XML Core WG; nevertheless, I had a lot to do with XML 1.1 and not a little with XML 1.0 5e.)

The fact that (as everyone knows) there were no documents with C1 controls in them is irrelevant to the XML 1.1 flop. What mattered very much is that it became a political stick to beat XML 1.1 with; it helped people who didn't want it anyhow to make it irrelevant. Including that feature hurt XML 1.1's chances of success.

And it's disingenuous to say that XML 1.1 "exists and is available". For people who need and want native element and attribute names, XML 1.1 effectively does not exist and is not available, because there is essentially no support for it.

Goals are goals, not requirements. It's already possible to use names that nobody can read because they use barely-distinguishable or ultra-obscure Chinese characters. Likewise XML 1.0 had plenty of optional features, and in one sense every time XML 1.0 changes in any way an option exists: parsers can be fixed or not fixed. (Talk to Elliotte Rusty Harold about this sometime.)

Is it a hack? Yes. I tried once, the right way; now I'm trying again, the wrong way.

Rick Jelliffe
2008-05-22 20:08:25
John: Cart before the horse. When there is inadequate provision of the layering infrastructure to positively support plurality (i.e. the old inadequate versioning) then it imposes artificial decisions.

Xerces2 supports XML 1.1, and consequently Java apps. I don't think that is "essentially no support." (Nor disingenuous!)

As for right/wrong: why not try it the systematic way? Each layer builds on the last and allows plurality at the next level. That is the only successful architecture for these standards and for evolution: TCP/*, MIME content types, etc etc. The layering/selection capability goes first.

John Cowan
2008-05-23 07:09:42
We *did* try it the systematic way. Now we try it the brutally pragmatic way. And the 800-pound gorilla is nodding and smiling this time.