Does XML Suck? Revisited

by Jeff Bone

XML-bashing seems to have become a semi-popular pastime of late... of the many critiques I've come across, this presentation is one of the best. Here are a few hopefully reasonable comments addressing the whole anti-XML sentiment that's floating around and this critique in particular.

Up front... I'm not particularly an XML advocate; I've been involved in none of the XML specifications or working groups. However, while I share some of the same feelings that somehow XML is rather "grotty," I've developed what I hope is a reasonable position on the matter.

Some of Aaron's arguments are pretty good, but some rest on a few assumptions and philosophical positions that are, IMHO, erroneous.

What Technology Should (and Shouldn't) Try to Achieve

All instances of technology have one meta-purpose: to accomplish or achieve some function, feature, or design requirement. That's it. It's not the job of a technology to be beautiful, aesthetically pleasing, etc. In fact, there's *no such thing* as beautiful or aesthetically pleasing technology. We technologists are prone to thinking that technology can have these qualities because we "feel" that some technologies do; however, this is a trap --- one that we technologists often fall into.

A technology should accomplish what it is designed to accomplish in a reasonably efficient manner, with minimal cognitive overhead. Let's break that down: technology should be designed to accomplish some purpose. It should not attempt to accomplish things for which it is not explicitly and specifically designed. (A desire to make our artifacts very general is another pervasive trap that we techies fall into.) It should fulfill its design in a reasonably efficient manner; this means that it shouldn't be obviously and grossly inefficient, i.e., some other design should not be able to fulfill the specific requirements in a significantly more efficient manner. Efficiency is a loosey-goosey term, but it can mean computational efficiency, storage or bandwidth efficiency, ease-of-use, or any combination of the above and other types of efficiency. Efficiency is likely to be domain-specific, so the statement needs to be interpreted in the context of the particular application. Minimal cognitive overhead means that the simplest design that achieves the design requirements reasonably efficiently wins.

All other considerations about a technology are illusionary, insignificant, unimportant. A technology that does what it is supposed to in an efficient manner with no more than a minimum amount of complexity is "good enough" --- and there's no such thing as "better than good enough." The mistaken belief that there is such a thing is what leads to a chronic and puzzling thing in the marketplace: technologies that are deemed "best of breed" (i.e., exceed their design requirements on somebody's subjective quality assessment) *ALWAYS* underperform "inferior" but adequate technologies in the market. (NeXT, Beta, Objective C, the Mac, Be, Newton, etc. etc. etc.)

Aaron falls into this trap when he describes XML as "technologically terrible." He's expressing an aesthetic opinion dressed up as a technological argument. The question we should ask when evaluating XML (or any technology) is: "can something else do as good or better a job on the relevant / important dimensions?" The answer for XML *might* be "yes", but in absence of any compelling evidence to the contrary it's probably "no."

"Does XML Suck?" Revisited

Aaron lists "verbosity" as one of the problems with XML. He's not alone in that complaint. However, this criticism is off base for several reasons. First, it's important to distinguish between XML-the-syntax and XML-the-datastructure. Complaints about syntax are, generally, pretty silly. Two otherwise equivalent syntaxes for something should be considered the same; and a range of techniques exist for reducing the verbosity of XML (including judicious use of namespaces and schemas, as well as things like binary and indexed representations). Complaining about XML's verbosity is a generalization from certain bad examples of XML. And: some researchers at IBM Almaden about two years ago [lost the ref, anybody] showed that a reasonably efficient XML representation of something was, IIRC, within an order of magnitude of the size of the minimal representation / encoding of something that carried the same amount of structural and semantic information. That is, with appropriately efficient use of namespaces, etc., XML will be less than 10x the minimal compressed size of the same info given any non-lossy compression scheme. In my experience, constant-factor differences of that order can largely be ignored in almost all systems. (It's the O(n^2), O(n!) etc. stuff that we've got to worry about.)
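One rough way to get a feel for the syntactic-overhead argument is to measure it. The sketch below is only an illustration, not the IBM Almaden experiment: the record layout, the tab-delimited rival format, and the function names (`make_records`, `to_xml`, `to_compact`, `measure`) are all made up for the example. Note that the tab-delimited form drops the field names, so it carries less structural information than the XML and is, if anything, an unfairly small baseline.

```python
import zlib
import xml.etree.ElementTree as ET


def make_records(count):
    """Build some hypothetical, highly regular sample records."""
    return [{"id": i, "name": f"user{i}", "score": (i * 7) % 100}
            for i in range(count)]


def to_xml(records):
    """Serialize records as element-only XML, one <record> per entry."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def to_compact(records):
    """Serialize the same values as tab-delimited lines (field names are lost)."""
    lines = ["\t".join(str(v) for v in rec.values()) for rec in records]
    return "\n".join(lines).encode("utf-8")


def measure(count=500):
    """Return raw and zlib-compressed byte sizes for both encodings."""
    records = make_records(count)
    xml_bytes = to_xml(records)
    compact_bytes = to_compact(records)
    return {
        "xml_raw": len(xml_bytes),
        "compact_raw": len(compact_bytes),
        "xml_zlib": len(zlib.compress(xml_bytes, 9)),
        "compact_zlib": len(zlib.compress(compact_bytes, 9)),
    }


if __name__ == "__main__":
    for label, size in measure().items():
        print(f"{label:>12}: {size} bytes")
```

Comparing the raw ratio with the compressed ratio gives a sense of how much of the XML's bulk is repeated markup that a general-purpose compressor removes. The exact numbers depend entirely on the data, so treat any one run as anecdote rather than evidence.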

Aaron also emphasizes that XML isn't the most human-readable representation. But that assumes that XML is intended to be read / written by humans. It's tempting to make that assumption, but IMHO it's incorrect: we make that assumption because HTML is often read / written by humans. However, just as *more* HTML is created / processed by software than by people, so too (and even more so) is more XML created / processed by software than by people. The nice thing about both is that they *can* --- when necessary --- be processed by humans. XML represents a nice tradeoff between human readability and efficient machine representation.

Aaron's arguments about complexity sound reasonable on the surface, however... Complexity is a tough thing to pin down. In computer science we have good (at least adequate) tools for analyzing and understanding computational complexity e.g. time-space tradeoffs, algorithmic complexity, etc. Information theory gives us some tools for dealing with information complexity... But we have very poor or no tools for analyzing and quantitatively addressing problems of dynamic complexity of component interactions in systems, representational complexity of data structures, expressivity of languages, etc. I've spent over a year trying to create a theory of the former (compositional complexity in software architectures) and let me tell you, complexity is a complex notion. ;-)

Representational complexity and expressivity is an even less studied and less understood area, and while Aaron may in fact be right his argument isn't well supported. And there are hints from information theory --- such as the size order of XML vs. theoretic optima --- that indicate that it's wrong.

If Aaron can state exactly what he means by "complexity" and quantify / generalize his argument, it would be significant not only as XML criticism but as an important result in computer science.

The "acronym proliferation" problem is very real, but it's a function of where this technology is in its lifecycle and the amount of attention it has received. It's not surprising that there's a "fan-out" of overlapping applications / standards / etc. related to XML --- it's relatively early, very general, and lots of people are trying to do stuff. That leads to quite a bit of noise and frustration but --- inevitably --- there will be a "fan-in" to a few general, standard tools for various things. Aaron even recognizes that this is happening: "Even here, the situation is improving."

The bottom line is this: it *might* be possible to design a similar representational mechanism that accomplishes all the things that XML accomplishes --- i.e., multidimensional reference structures with arbitrary attribution and strong typing... But *today* there are no existence proofs of such alternatives and, indeed, I believe that if there were they would strongly resemble XML except in the trivial details. In the absence of proposals for such alternatives, it would seem that criticizing XML is a rather empty exercise.

And Aaron recognizes the most important argument *for* XML --- its socioeconomic benefits. "Everybody's doing it" is a very *good* argument for any technology; systems that can communicate through such a mechanism grow in value with the square of the number of components, per Metcalfe's law. Anyone using other idiosyncratic technologies to accomplish some or all of the same things actually inhibits the overall growth of value of the system.
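The Metcalfe's-law point can be made concrete with a little arithmetic. This is a minimal sketch that assumes "value" is simply the number of possible pairwise connections between components that share a format, which is one common way of motivating the n-squared growth; the function names `pairwise_links` and `total_value` are made up for the example.

```python
def pairwise_links(n):
    """Number of distinct pairs among n components that can talk to each other."""
    return n * (n - 1) // 2


def total_value(group_sizes):
    """Sum of pairwise links when components are split into incompatible groups."""
    return sum(pairwise_links(size) for size in group_sizes)


if __name__ == "__main__":
    # Ten components sharing one format versus the same ten split 6 / 4
    # across two formats that cannot read each other.
    print("one shared format:", total_value([10]))      # 45 links
    print("two idiosyncratic:", total_value([6, 4]))    # 15 + 6 = 21 links
```

Under that assumption, splitting ten components across two incompatible formats cuts the possible connections from 45 to 21, which is the sense in which an idiosyncratic format holds back the value of the whole system.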

What do you think? Does XML suck? Is it horribly inefficient? Are there better alternatives that accomplish the same thing?


2002-08-06 13:03:34
10x size is an issue for humans
You talk about 10x size not being an issue, since it's not an order-level change (O(n)->O(n^2)), etc. This is a fine and dandy argument for things that machines have to deal with, but it's not a fine thing for things humans have to grok. In fact, I wouldn't be surprised if human-complexity of something like XML is on the order of O(s^2), where s is the size.

In other words, linear increases in complexity for a machine are likely not linear increases in complexity for humans; it's probably polynomial for humans.

Furthermore, XML as a datastructure is pretty laughable too. Starting with something like SML (a minimal XML-like language), it's nice. But pull in things like processing instructions, mixed content, etc., and you've ruined it.

Also, XML Schema largely epitomizes the wrong mentality behind XML. E.g., it can't even represent itself; the lack of formality and cleanness is quite poor.

2002-08-06 13:21:12

I wrote an article on my personal web site about this some time ago. It was edited to a degree, and published in XML-Journal. I was surprised, as the title of it was "10 Years of Markup Madness" (was shortened by XML-J to "Markup is Madness" -- fair enough).

Technology should be simple. We have all seen what happens when it's not. XML seems to be a way of encoding data in marked-up text format to support interoperability. Too bad the DTD syntax is so bizarre. Too bad XML is inherently harder to parse than, say, JavaScript (NewtonScript) frame notation.

Being an unemployed software developer sucks. Dealing with the recruiters here in Los Angeles is a joke. Most of them ask "Do you know XML?" Sure, I've used it a bit. Do I "believe?" No. It's just another way of marking up your data, that's it. It's complex, error-prone, too verbose, and simply putting data in XML format doesn't guarantee any other application can make sense of it unless they know how to read and interpret your own format, so while it may be a more clear way of representing data than say, a proprietary binary format, I don't really see the win here.

The neat thing about simple and arguably "best of breed" technologies like BeOS (all hail) is that performance is generally much better. Today's operating systems, as "more complete" as they are, still fail to achieve the level of performance and ease-of-development that Be did. This is not a win-win for developers, or ultimately consumers.

Luckily, though, the industry doesn't listen to me. =)


2002-08-06 13:23:16
10x size is an issue for humans
Sorry, I should've been more clear. The 10x (approx) inflation is the inflation due solely to the syntactic overhead of XML. Actual inflation of any given representation of the same information is likely to be much, much less than this. Also, this is inflation compared to a *highly compressed* form of the same information. I don't know about you, but I can't read zip files myself... ;-) Any similar, human-readable form of the same information is likely to be roughly the same size, assuming no loss of structure, content, or metainformation.

2002-08-06 13:30:21
I agree with Steve that technology should be simple --- as simple as it can be for its task and no simpler. None of the arguments that I've come across against XML convince me that there's a particularly well-formed notion of "simple" as regards representational formats, or that XML has unnecessary complexity for its purpose.

Other proposals - Steve's favorite is NewtonScript Frames - don't convey as much information as XML conveys. In order to decide the issue, we'd have to have another format that was equally rich in its ability to attribute anything to any degree of depth. Frames don't do this --- they lose metadata, typing, etc. relative to XML --- so comparing XML to them is an apples-to-oranges exercise.

I'm open to the idea that it could be done in some other, better way... but there's no existence proof to justify that argument, and without such a proof it seems like a rather pointless thing to argue about.

2002-08-07 10:29:08
depends on the user
as for the 'beauty' of xml ... i suppose the original writer meant to say it's valid to declare a piece of technology 'beautiful' if it does what it says, if it does it correctly, fully, in appropriate time --- and if it's the most simple solution to a problem, but not simpler! (cf. a. einstein)

in light of that ... let any well-researched, efficient algorithm be implemented by a poorly-skilled programmer, and the results won't be as efficient. same applies to xml, xml schemas, and so on. processing instructions and cross-referenced notations can spoil any xml representation, sure. but inappropriate usage of tools can ruin any craft.

also, one should not forget that a huge pile of the xml specs deals with accurate definitions of how to encode your xml documents (iso/utf/etc.) or how to handle white-space (normalization etc). questions unanswered by previous 'formats'.

regarding file sizes: c'mon, do you still code in assembler and count cycles? big iron is cheap enough.


2002-08-07 12:03:50
The representation of data is the question
I agree with 'ftobin' in that "XML as a data structure is pretty laughable". When a person analyses XML to its lowest element, one finds that XML is essentially about how to structure data. The advent of XML being used as a datatype system (DTD, Xschema, Relax-NG) and thereby the proliferation of XML Databases, compromises what XML was intended to do. Remember, XML as derived from SGML came out of the publishing space. Now, XML is entering the Database systems space, something it was not intended for. This ‘adoption’ now means that XML and XML Databases in particular must now deal with such things as data independence, data integrity, data manipulation…etc. Clearly, this is not what XML was for, but that is where it is heading. The laughable part is XML is heading back to the past where such things as hierarchic databases existed (ex: IMS database from IBM). Hierarchic structures are based on graph theory, which was abandoned due to its complexity and gave way to relational theory. So basically, XML is just repeating the very same mistakes of time past. XML is no more than a technological fashion trend, all glitter no substance.

2002-08-07 14:46:50
The representation of data is the question
While I don't necessarily agree that there's a good distinction between XML being a format and XML being a database, I do think you're right about one thing: graphs are expensive, and we ditched hierarchical DBs for many applications years ago due to their complexity on several dimensions (no pun intended). RDBMS ruled the day for years, however... Nobody seems to have made much of it, but over the last few years the hierarchical model has crept back to the fore --- it's the data model for the biggest DB on the planet --- "the Web!"

I've run into the problem of mapping between hierarchical and relational models several times throughout my career, first in building tools to do large-scale data migration between e.g. ISAM and other hierarchical DBs and RDBMS (Evolutionary Technologies, Extract), in doing news aggregation and classification at Clickfeed, and more recently (currently, at Deepfile!) in gathering and using filesystem metadata and statistics for various purposes. So the problem is near and dear to my heart...

The problem is this: if the information you're interested in is inherently graph-based --- if the reference structures and their topology are meaningful --- then you can't simply make e.g. computational complexity disappear by using a relational model and normalizing. Many of the things you might want to do with that data are inherently traverse-and-compare type problems, and RDBMS don't give you any help with this by themselves. These problems aren't automatically made less complex by a set-theoretic reformulation, and finding solutions that aren't exponential, factorial, or worse is HARD.

Some of these may be reformulated as linear algebraic problems with polynomial-time solutions, but many, many, many of them ultimately reduce to the Knapsack Problem.
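To illustrate the traverse-and-compare point above, here is a small sketch using Python's built-in sqlite3 module. It stores a hierarchy as a plain parent-pointer table and then has to walk it level by level to answer "what lives under this node?". The table layout, the sample paths, and the names `build_tree_table` and `descendants` are assumptions made for the example, and the code assumes the data really is a tree (a cycle would make the loop run forever).

```python
import sqlite3

# Hypothetical filesystem-like hierarchy: (name, parent) pairs.
EDGES = [
    ("/", None),
    ("/etc", "/"),
    ("/home", "/"),
    ("/home/jb", "/home"),
    ("/home/jb/notes.xml", "/home/jb"),
]


def build_tree_table(edges):
    """Load the hierarchy into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (name TEXT PRIMARY KEY, parent TEXT)")
    conn.executemany("INSERT INTO node (name, parent) VALUES (?, ?)", edges)
    return conn


def descendants(conn, root):
    """Collect everything below root, issuing one query per level of depth."""
    found = []
    frontier = [root]
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT name FROM node WHERE parent IN ({placeholders}) "
            "ORDER BY name",
            frontier,
        ).fetchall()
        frontier = [name for (name,) in rows]
        found.extend(frontier)
    return found


if __name__ == "__main__":
    conn = build_tree_table(EDGES)
    print(descendants(conn, "/home"))
```

The normalized table stores the edges cheaply, but the traversal still has to happen somewhere: here in an application-side loop, one round trip per level. Normalizing the data did not make the graph walk go away.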

Having said all that, here's the summary: XML isn't really repeating the mistakes of the past. Rather, it's another attempt to address a class of data (graph structured data) that is unavoidable in some domains. [-jb]

2002-08-09 10:40:07
Aesthetics are an essential part of science and technology
The argument that technology should only do its job, while prevalent, is absolute nonsense. Perhaps it's because of insecurity on the part of technologists, but they continually refuse to admit the importance of beauty in our work. Science and technology have no better mechanism for sorting out good theories and designs from bad ones than our own sense of beauty and aesthetics, and following that path will lead us to software that's more secure, reliable, faster and easier to work with. Ignoring that, like the people behind XML have done, is fraught with great peril.

For a book that discusses these subjects in detail, I recommend Machine Beauty by David Gelernter.

2002-08-09 14:48:32
Aesthetics are an essential part of science and technology
Perhaps there's a better way to put my point... of course some technology is beautiful, but I maintain that what we perceive as "beautiful" re: technology is exactly and only that which achieves functionality and minimal complexity. :-)

OTOH, the idea of any practical application of aesthetics to technology is absolute nonsense. Down that path lies art school mumbo-jumbo. If indeed aesthetics *is* important to technology creation, then e.g. the creators of XML could justify certain syntactic decisions on the basis of e.g. a deep and abiding love of the triangular form of angle brackets. And there wouldn't be any way to argue with that because, hey, technological choices are equivalent (under this hypothesis) to aesthetic choices.

And that's nonsense.