Has XML on the Web really failed?

by Robin Berjon

Related link: http://www.xml.com/pub/a/2004/07/21/dive.html




Mark Pilgrim has written an interesting piece on XML.com arguing that XML on the Web Has Failed. That's a bold claim to make, though, and while his arguments ferret out some real problems, they also carry serious flaws.




Starting with the flaws: he considers "strike 3" to be that XML parsers are all buggy, broken, and liberal in what they accept rather than draconian. For this he refers to the XML specification, notably the part that describes how encodings can be inferred when a higher-level protocol carries the XML:




When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML.



There are two things to stress here. The first is that the specification says "should", not "must". The second is that this sentence sits in Appendix F.2, which is non-normative anyway. Inferring from that part of the specification, as Mark does, that XML parsers which don't support RFC 3023 are broken is therefore wrong.
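
To see what "supporting RFC 3023" would actually mean for a parser, here is a minimal sketch of the precedence rules that RFC lays down; the function name and the simplified handling are mine, and real processors have more cases to deal with:

    # A rough sketch of RFC 3023's precedence rules for a document received
    # over HTTP; simplified, and the function name is made up for illustration.
    def effective_encoding(media_type, charset_param, xml_decl_encoding):
        if media_type == "text/xml":
            # For text/xml the charset parameter is authoritative; if it is
            # omitted the MIME default of us-ascii applies, and the encoding
            # pseudo-attribute inside the document is simply ignored.
            return charset_param or "us-ascii"
        # For application/xml the charset parameter still wins when present;
        # otherwise the document's own declaration (or BOM detection) is used.
        return charset_param or xml_decl_encoding or "utf-8"

    # The document says UTF-8, the server says nothing: a strictly conforming
    # client must nonetheless read text/xml as US-ASCII.
    print(effective_encoding("text/xml", None, "utf-8"))         # us-ascii
    print(effective_encoding("application/xml", None, "utf-8"))  # utf-8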




His "strike 2" is that too many Web servers are misconfigured. Well, that's sadly true, but it's not really a failure of XML. In fact, it's a big enough problem that some Web clients (most infamously IE) have included content "sniffing" code to try to deal with the issue, in turn making it far easier to never notice that servers are misconfigured in the first place. Yes, it's a problem; no, it's not something XML can do much about. At best, it can try to derive practices that work around the problem.
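
For what it's worth, the mismatch is easy to spot from the client side. The snippet below is a quick, hypothetical check (the URL and the crude regular expression are mine) that compares what the server claims with what the document itself declares:

    # Compare the charset the server sends with the encoding the document
    # declares; a disagreement usually means the server is misconfigured.
    import re
    from urllib.request import urlopen

    def check(url):
        with urlopen(url) as resp:
            header_charset = resp.headers.get_content_charset()  # from Content-Type
            head = resp.read(1024)
        m = re.search(rb'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
        declared = m.group(1).decode("ascii").lower() if m else None
        if header_charset and declared and header_charset.lower() != declared:
            print(f"{url}: server says {header_charset}, document says {declared}")

    # check("http://example.org/feed.xml")  # hypothetical URL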




"Strike 1" is one area where he makes a good argument: RFC 3023 is quite likely broken. Indeed, XML is very good at labelling the encoding it is using; the rules are precise and deterministic. Why oh why, then, would someone else, at another layer, want to add their own rules? File format and transport-level metadata should be kept as separate as they can possibly be, for there is little value and high cost in having incestuous relationships between the two.

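To show just how mechanical those in-band rules are, here is a rough and deliberately incomplete sketch of the kind of detection Appendix F describes (it ignores UTF-32, EBCDIC, and BOM-less UTF-16, among other cases):

    import re

    def sniff_xml_encoding(data: bytes) -> str:
        # Byte order marks settle the question immediately.
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if data.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"
        # In an ASCII-compatible encoding, the declaration (if any) names it.
        if data.startswith(b"<?xml"):
            m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
            if m:
                return m.group(1).decode("ascii")
        # No BOM and no declaration: the specification says UTF-8.
        return "utf-8"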



So how to move forward? Well, I don't claim to have the perfect answer, but it would appear to me that forbidding the use of the charset parameter on all XML media types is a good step forward. Processors will then be able to rely simply on the information contained in the XML documents themselves, and we'll avoid clashes that buy us nothing and decrease interoperability. Content producers will then either produce well-formed XML or not, and when they don't it will be easy to fix. As a side note, I feel that Mark vastly overstates the value of being able to use whichever encoding one wants. I've been sticking to UTF-8 and UTF-16 only for several years, and have been very happy using those for many languages.

A second step could be to stop kidding ourselves that XML is "just text" and drop the text/xml MIME type altogether, using only the application/* and image/* ones. That'll teach those gung-ho transcoding proxies to behave.
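
In concrete terms, that advice amounts to very little work on the server side. A toy sketch (the handler and the document are made up, and this is in no way production code) would be:

    # Serve XML as application/xml with no charset parameter and let the
    # document's own encoding declaration do the talking.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DOC = '<?xml version="1.0" encoding="utf-8"?>\n<greeting>hello</greeting>\n'

    class XMLHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = DOC.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/xml")  # no charset
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8000), XMLHandler).serve_forever()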




Update! Right in the nick of time, it would seem that Murata/Kohn/Lilley have produced an Internet Draft obsoleting RFC 3023 :) The list of changes is as yet incomplete, but one change should already help: text/xml and text/xml-external-parsed-entity are deprecated. If you are interested in this topic, I highly recommend reading the text in section 3 that justifies this.




Is XML on the Web really broken? Or has Mark produced a stimulating but somewhat over-enthusiastic piece?