The Design Goals of XML

by Rick Jelliffe

One of the really neat things about the XML specification is not just that it makes its design goals explicit (I gave a twist to this idea in the Schematron standard by mentioning various non-goals too) but that the goals were really well chosen.

A decade ago, Tim Bray wrote up his Annotated version of the XML spec, which includes some hypertext comments on the Goals section.

Recently, I have heard several times people quoting the XML goals to support various opinions on what makes a good or bad markup language (schema). In particular, goal #10, "Terseness in XML markup is of minimal importance", gets used to claim that abbreviated element names go against the spirit of XML (a blithe spirit indeed). (See here for example.)

But if we look at the XML Spec, we see that these are not general goals for XML documents to follow, but goals for the committee designing XML the technology: they are explicitly design goals. Tim's comments are useful here; on goal #10 he writes:
The historical reason for this goal is that the complexity and difficulty of SGML was greatly increased by its use of minimization, i.e. the omission of pieces of markup, in the interest of terseness. In the case of XML, whenever there was a conflict between conciseness and clarity, clarity won.


I have always attributed the goals to Jon Bosak. Tim mentions that Jon's "stewardship of the XML process has been marked by a combination of deft political maneuvering with steadfast insistence on the principle of doing things based on principle, not expediency", where I think "principle" requires having clear goals and pursuing them. (Regular readers of this blog might see that my Reasonable Principles for Reviewing Open XML and other Standards follows this line. If you get hold of the Standards Australia comments on the DIS 29500 ballot, you can see that most of them try to state the general principle behind each specific problem.)

But Dave Hollander and Michael Sperberg-McQueen mention how the goals were the foundation for the XML design effort too. The goals were a fait accompli by the XML ERB by the time the larger XML WG formed (another good thing about Jon Bosak: he welcomed all sorts of stakeholder involvement), but I don't recall any of us on the WG (which would now be called an Interest Group, not to be confused with the current XML WG which took over from the old ERB) ever complaining about the goals.

Alice through the Looking Glass



Looking at the goals (and see Tim's comments if you don't trust mine) you can see that most of the goals are specific responses to problems either with SGML or with the SGML process at ISO then. (ISO standards were supposed to have 10-year reviews which would be an opportunity for changes to be addressed, outside the ordinary maintenance process. But some influential and vital members of the ISO group had been committed to keeping SGML unchanged for as long as possible, and many of the other members who wanted change wanted changes that would support technologies such as ISO HyTime better: these would be changes that made SGML more complicated and variegated rather than simpler, to the frustration of all.)

1. XML shall be straightforwardly usable over the Internet.


SGML had a particular issue in that it was, by design, retargetable. Before Unicode and URLs, every different system had different character sets and different ways of locating files. So SGML provided a mechanism for labelling that an entity (resource) would need some system-specific fix in order to be useful, and a mechanism for naming entities regardless of their location (PUBLIC identifiers).

Because of this goal, SGML's SDATA entities were removed, as was the use of unresolved entities (entities with PUBLIC identifiers but no SYSTEM identifiers). It was unfeasible to expect users to fix documents to suit their local systems: that is geekstuff. The use of Unicode and URLs was a no-brainer from this goal.
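
To make the contrast concrete, here is a sketch with invented identifiers and URLs: SGML let a document type declaration carry a public identifier alone, leaving resolution to the local system, while XML requires that any public identifier be paired with a system identifier, which in practice is a URL.

    <!-- SGML: a public identifier alone; the local system must work out where the DTD lives -->
    <!DOCTYPE greeting PUBLIC "-//Example//DTD Greeting//EN">

    <!-- XML: a public identifier must be accompanied by a system identifier (here a URL) -->
    <!DOCTYPE greeting PUBLIC "-//Example//DTD Greeting//EN"
                              "http://www.example.com/dtds/greeting.dtd">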

2. XML shall support a wide variety of applications.

While some people had been using SGML for non-publishing uses (Dave Peterson at MIT, for example, had been using it for numerical data from 1985 IIRC) its complexity and strangeness made it difficult for, in particular, people from the database world. Now, as it turns out, these problems can fruitfully be solved by treating them as publishing applications. But this has been a success of XML and HTML, not SGML per se.

3. XML shall be compatible with SGML.


In fact, ISO 8879 was changed to allow this. In particular, to allow documents with no DTDs, hex numeric character references (I had tried to get them introduced in my successful 1996 Corrigendum to SGML, but horse-traded them away to get support for the main requirement, to support CJK characters better) and the empty-element form <x/>.
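
For example (the element names here are invented), a hexadecimal character reference and the empty-element form look like this:

    <para>The ideograph &#x4E2D; is referenced in hex; the decimal equivalent would be &#20013;.</para>
    <pagebreak/>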

4. It shall be easy to write programs which process XML documents.


SGML parsers were a pain to write. An SGML processor was really a compiler compiler where you could change delimiters, keywords and a whole lot of different behaviours. Note that "processor" here is a defined term: an XML processor is the parser and support utilities. This goal does not state that it is against XML's goals to write complicated programs that use XML data!

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.


SGML had an ancillary document, the SGML declaration, which told you which features a particular document needed. In theory, you could then look up the SGML system declaration for an application and see whether it matched. In fact, XML can be regarded as largely a particular SGML declaration, superseding the default Reference Concrete Syntax defined by ISO 8879:1986 (and taking on board the Extended Reference Concrete Syntax proposals which I and the CJK DOCP group were promoting).

Now, in fact, there are two big optional features in XML: DTDs and non-UTF-8/UTF-16 character encodings. Many early home-made XML parsers did not support DTDs or ignored them, and many supported only a limited number of character encodings.
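
Both optional features surface in the XML declaration at the top of a document; the encoding named below is just one possibility:

    <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

Here encoding names something other than the universally supported UTF-8 or UTF-16, and standalone="no" signals that external markup declarations (i.e. a DTD) may affect the document's content: exactly the two things a minimal home-made parser might not handle.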

6. XML documents should be human-legible and reasonably clear.


As Tim notes, this is a goal which blocks off any attempt to allow binary data and non-graphical characters in XML. Text is king.

Before XML, I organized an effort to make up a set of rules for the Unicode characters that could be used in names in markup: this Native Language Markup list was part of the Extended Reference Concrete Syntax, was adopted and improved by XML 1.0 (and downgraded in XML 1.1). Out of this effort came a strong belief that XML should not contain non-graphical or control characters: this ended up being reworked into a W3C and Unicode Consortium note, Unicode in XML and other Markup Languages.
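
As a sketch (the Japanese element names are invented for illustration), XML 1.0's name rules allow markup like the following, while still forbidding control characters such as backspace anywhere in the document:

    <文書>
      <題名>design goals</題名>
    </文書>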

But the issue crops up periodically. Indeed, it is one area where I think OOXML goes seriously wrong: in a few places it provides a mechanism for circumventing XML's character repertoire restrictions. I think it is completely bogus that, just because someone generated an automatic name and used the backspace character as part of it, this should be regarded as acceptable practice in the standard. Several National Bodies have commented on it: I hope ECMA will have the good sense to remove it or, at the least, severely deprecate it. For example, it is clearly a security hole to allow backspace in names, where the visible name may be coded differently than its readers expect: a kind of spoofing.

7. The XML design should be prepared quickly.


SGML's "10 year review" had not even really started properly after 10 years. In fact, XML was the 10 year review of SGML!

8. The design of XML shall be formal and concise.


SGML has attracted criticism that it did not use academic formalisms, and was difficult to characterize with formalisms. I don't know why this isn't a criticism of the formalisms just as much: cart before the horse. Anyway, XML being simpler is much more friendly to simple theoretical formalisms, and consequently easier to write parsers for using compiler compilers. (In a sense, XML represents a move to unbundle the markup language from the compiler compiler technology. In my idle moments, I wonder whether compiler compiler systems would have been more capable of handling SGML if the SGML spec had been freely available on the internet in PDF or whatever: the lack of an open standard for SGML meant that the academic/private-hacker community (apart from James Clark) did not connect with the standard or its challenges.)

9. XML documents shall be easy to create.


Tim wrote in 1998:
The main goal was in fact to design XML in such a way that it would be tractable to design and build XML authoring systems. Our success in meeting this design goal remains to be established in the marketplace.

In 2008, the success is completely established: it is difficult to find anything anywhere which doesn't use XML, even when it is a mad choice!


10. Terseness in XML markup is of minimal importance.


SGML was designed with great attention to the requirements of its users: i.e. typists. Minimizing the number of keystrokes it took to mark up a raw text file was a large part of the economic value proposition of SGML. SGML allowed you to leave off many delimiters, omit many tags, and gave many kinds of shortcuts so that you could just use simple keyboard symbols instead of explicit tags.
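
To make that concrete, here is the sort of end-tag omission that SGML's OMITTAG minimization permitted (HTML-flavoured names, used purely as a sketch), followed by the fully tagged form that XML insists on:

    <!-- SGML with OMITTAG: the parser infers the missing </li> end tags -->
    <ul>
    <li>first item
    <li>second item
    </ul>

    <!-- XML: every element is explicitly closed -->
    <ul>
    <li>first item</li>
    <li>second item</li>
    </ul>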

It is tempting to think of this as an old-fashioned concern which we, in our age of RIAs and off-shored outsourcing, don't need to worry about. But what XML did was, in fact, to cast adrift the users of Wiki-like markup into a standards-free world, which has greatly harmed the adoption of Wiki-like markup. And when we look at the upheaval in the HTML 5 discussions going on currently, a central meme there is that XML's restricted syntax is simply inappropriate for vanilla HTML. (For an alternative, see my ECS.)

This goal #10 has been the cause of much of XML's success: with a stroke it allowed many SGML features to be removed without much fuss: DATATAG, SHORTREF, OMITTAG, SHORTTAG. Coupled with this goal was the realization that, to a major extent, HTTP compression was the correct layer for reducing the transmission size of documents, rather than XML language features. (Of course, it is not true that terseness is of no importance in language syntax standards: the prefix mechanism in XML namespaces is a terseness mechanism after all!) And the removal of these features meant that the DTD was no longer necessary, a big win which many people had been seeking.
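
The namespace prefix is a good reminder of that: a two-letter prefix stands in for a long identifying URI on every element (the URI below is invented):

    <fr:table xmlns:fr="http://www.example.com/2008/furniture">
      <fr:leg/>
      <fr:top/>
    </fr:table>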

But to treat this design goal as somehow indicating any policy about how long a name in a schema should be goes beyond the intent of those goals, at least as I ever understood it. The goal of Native Language Markup was to allow people to mark up documents using their customary names and symbols. This is different to the goal of literate programming, which is where I think people are getting confused.

In fact, what we are seeing with XML is that for international standards, and in nations from the Indo-European language groups, or with English as an official language, or where (such as Indonesia) the simple 26-letter alphabet has been adopted for transcription, schemas do restrict themselves to the ASCII repertoire for names. No surprises there. However, for national and local documents in other languages or scripts, Native Language Markup is a big success, very spectacularly in the Chinese UOF spec and in Murata-san's schemas for Japanese local governments.

11 Comments

David Carlisle
2008-01-07 02:15:09
Minimizing the number of keystrokes it took to markup a raw text file was a large part of the economic value proposition of XML. XML allowed you to leave off many delimiters, omit many tags, and gave many kinds of shortcuts so that you could just use simple keyboard symbols instead of explicit tags.



s/XML/SGML/g

Aristotle Pagaltzis
2008-01-07 05:17:47

Now, as it turns out, it turns out that solutions many problems can fruitfully be engineered by treating them as publishing applications, but XML and HTML has made this clearer not SGML per se.


This appears to have been written at 3AM… I can’t parse that sentence at all. :-(

Mark
2008-01-07 06:15:35
The main goal was in fact to design XML in such a way that it would be tractable to design and build XML authoring systems. Our success in meeting this design goal remains to be established in the marketplace.


In 2008, the success is completely established: it is difficult to find anything anywhere which doesn’t use XML, even when it is a mad choice!



I do not think this is true if "authoring systems" means systems for non-techie authors to write content in WYSIWYG a la MS Word (unless the XML schema is very simple). Even much HTML is "authored" in WYSIWYG a la Dreamweaver. I think this has been a huge impediment to XML adoption. If you've ever used a WYSIWYG XML editor and been stuck "between tags" you know what I mean. Authors understand paragraphs and indenting, font sizes and colors, etc., but complex underlying structure is impossible to explain. Don't know if Schematron would be useful in this regard.

John Cowan
2008-01-07 14:34:02
Two failures of the XML process:


It would have been better to require that public identifiers be URIs, thus allowing system identifiers to continue their historic function of being local addresses for things. Instead, we have public ids limited to the charset of formal public ids, but nobody uses (or knows how to use) FPIs.


The XML Recommendation should have made clear that the purpose of attributes of type NOTATION was to specify the data type of the content of an element, rather than allowing this function to be reinvented as xsi:type.


Attribute value normalization is just silly.

Rick Jelliffe
2008-01-07 16:13:06
David, Aristotle: Thanks for the catches: hopefully clearer now.


Mark: The current generation of office applications use XML, and it is quite deeply ingrained, not just superficial serialization. (The use of XPaths for locating forms information, for example.)


John: FPIs seem quite similar to URNs, don't they? I think the requirement for being fully resolved means that FPIs serve documentary and legacy purposes (the DocBook/Catalogs-using sector) as far as most of the world is concerned. Would you then have a URN for the ISO public entity sets? (Actually, now that ISO is handing them over to W3C for maintenance, I expect they will have a permanent URL that would be useful as a kind of FPI.)


I had several conversations with fellow W3C XML Schemas WG members about notations and types. There seemed to be a strong feeling that types were modern and new and productive while notations were a failed complication, but this was always based on a necessary distinction between type and notation that eluded me.


I think the idea may have been that a notation is only a lexical-space thing, while a type has both lexical and value spaces. I think there was also a strong agenda at play with XML's datatyping, that what was needed was a kitchen sink of primitives. I took the opposite view, that what was needed was a DTLL-style system to allow parsing of arbitrary data into useful types (e.g., primitives for the value space is OK, but you need extensibility for the lexical space).


The example I always bring up is of dates. The XML Schema method is that a date is only a date if it conforms to the lexical space of the ISO 8601 subset that was developed. My view would be that the lexical space of a date should be whatever the user wants, as long as it can be rigorously marked up so that you know what the mapping to and from ISO 8601 is. These two views are clearly dataheads versus docheads: from a markup/publishing background, the text data is a pre-existing fact that needs to be catered to; from the database POV the notation of the data should be an artifact of the localizing/formatting system.


Actually, it is simplistic to see this as something where documents and databases are mutually exclusive. At the moment I am working with some ACORD data: the document passes through many stages, and at each stage the data may be unchecked text or checked values. This is not a publishing problem, but a workflow one, and XML Schemas' approach to typing doesn't provide help.


(From the type-theoretical aspect, I think what I am saying is that the lexical space and value space form separate trees, and that XSD over-simplifies things (XSD over-simplifying? Believe it or not! :-) ) by trying to unify them.)

Mark
2008-01-07 19:17:58
Rick:


Just because there is some XML under the hood does not mean that the output is usable for any specific purpose. I have seen a government agency invest years and millions of dollars to support mandated use of XML only to be unable to produce a product that end users would even consider using. One of the rallying cries for XML was to enable much of the information that is "lost" in complex documents to be usable and analyzable. I can do this with XML, but I have to use an editor like XMLSpy. I can drop these into databases and produce amazing search and metrics easily. MS Word does not even come close to producing XML usable for many purposes. I don't need Microsoft's schema, I need tools which support my schema. If you cannot embed complex markup in documents just use HTML with meta tags and be done with it. In fact that's what most people do. That's because the authoring tools are just not there yet.

Rick Jelliffe
2008-01-07 22:50:39
Mark: Good point. I don't read Tim's comment to be about making good user interfaces, per se: when he said "tractable" I think he really meant it: the structured editors for SGML usually only made a small subset of SGML features available for editing, if they made any at all, because it was so hard.


In practice, it is impossible ever to have a universal XML editor that gives the user the ideal interface for each element. This is because, even with configuration, arbitrary XML can describe all sorts of information which may have a preferred form to a particular individual user.


What I think will happen is that office applications will increasingly make available smarter widgets/controls that provide more kinds of generic views of data. In the past, we have had lists and tables, and that was about it. I think you may be underplaying how striking SmartArt in MS Office is, when considered as an example of an application providing new forms of embedded structured editing tightly coupled to application views.


If HTML hadn't stagnated for the obvious reasons (centrally, that an entrenched layer in MS doesn't get the "grow the pie" mentality that successful standards-making requires: how much wasted energy results from that attitude!) then it should have been adding lots of elements for these kinds of extra widgets and controls: menus, dialog boxes, sliders, not to mention the kinds of smart graphics. Thankfully the HTML 5 effort may make up for almost a decade of lost opportunity.


So generic office applications will increasingly become shells for manipulating nested and linked sets of generic structured controls, with each application providing a large number of them, and with the structuring of the controls providing the micro-structuring needed for convenient mapping to "information units" (in Eve Maler's sense).


This trend seemingly goes flat against the desire for fixed, small standards for office documents, and indeed I don't see how ODF or OOXML (to only a slightly lesser extent) will not be dinosaured by this trend. The thing will be how to exchange components, how to represent their signatures (schemas, abstract patterns), how to try to make them declarative and platform neutral (in the way that font metrics are), how to standardize them, etc.


The main example that people are aware of at the moment is charts. There are obviously a zillion ways to draft charts from the same data. There are obviously abstract properties that can be extracted about charts to provide hints on different platforms and graceful degradation. I suspect the answer is more in getting standards agreement on platform-independent, GUI-neutral CSS-ish properties for charts and the important widgets as they emerge, rather than expecting that vendors and developers will agree either on the same XUL or on the same WORA API.

David Webber
2008-01-08 10:02:41
And then there was XML Schema (xsd) aka SGML V2 where all of the above principles were ignored.


Particularly to wit - XSD can only be machine interpreted as the syntax is intentionally arcane, complex and convoluted; writing an XSD parser is non-trivial; XSD invites complex usage and mechanisms in XML exchanges.


In the midst of the XSD work - against much better wisdom - namespaces were also unleashed on the unsuspecting and ill-prepared for the full consequences.


Could we say that technologies such as ISO Schematron and OASIS CAM have emerged to mitigate the morass caused by W3C Schema and namespaces? At least there is some sanctuary offered for those that appreciate the elegance and utility of simple well-formed XML - and I'm happy to see that in certain standards work this is making a come-back as people roll the complexity wheel around again.


len
2008-01-09 06:19:23
The part Rick leaves out is the struggle going on in the markup standards market between DSSSL and HyTime, and of both against the proprietarization of markup by companies such as Interleaf and Xerox. The first had personality issues but, aside from that, it was a struggle of the link processors or relationship abstractions vs the stylesheet processors or control abstractions. In the zeitgeist of the time, the review of SGML was difficult to achieve given the politics and personalities. The second revealed the document market for what it was at the time: singular, predatory and insulated. A set of mini-fiefdoms prevailed. These were not wiped out by the web or by XML but by the monopoly status achieved by MS Office. In this sense, the dinosaur had better survival and reproduction strategies and prevailed over the small, nimble but corrupt attempts to use diluted markup and anemic wysiwyg to create a market. They are the exemplar of why the right strategy at the wrong time doesn't succeed.


XML had the benefit of the years of experience with SGML and its predecessors GML and GenTagging. Where SGML had not been applied to networks at scale (it had been applied, but only in isolated corporate and military hypermedia systems), HTML was the source of experience. XML was not a review of SGML. It was a burglary. I mean that not in the criminal sense but in the sense that control was consciously and deliberately wrested from the organizations and personalities and carried away by a self-selected group who were adopted by the W3C though not widely endorsed. It wasn't the most delicate way to get that done but it did open the door to extensible markup or SGML On The Web and it did succeed in paring down the standard into a smaller specification. Other side effects were to have two different and incompatible syntaxes for document (XML) and stylesheet (CSS) information, to confuse a generation of developers about the utility of markup, and to make permanent the role of HTML as the kudzu of the Internet. Its most singularly noticeable virtue was that it brought an end to the HyTime vs DSSSL Wars with a great deal of face saving. The virtues of the work in these could then be reabsorbed similarly to the way a boneyard is scavenged for working parts.


Some of the goals as stated, particularly terseness, have been shown to be hobgoblins over the years. The notion that markup would in general be seen only by the processors (a much older speculation than the web itself) was/is/always will be false. That markup is the favored design for human readability (sounds good) tends to vary by application language. Both of these goals began to fall apart as soon as graphics apps appeared in XML syntax. Sometimes verboseness matters precisely because of human readability. Binaries are still an important means to achieve speed and, in part, if weakly, to provide some measure of IP protection. The document experts are simply wrong about these because they work primarily in static information and don't deal with the issues of real-time systems.


That the document metaphor can be extended to other applications was an idea that predated the web and XML by many years. There are books on the so-called "document database" that precede the web by a decade at least and possibly a half-century if you count Vannevar Bush.


What XML did achieve is a mode of operation in how to win in the marketplace of standards. Come in low, hit hard, make as much noise as possible, deride competitors, derail antecedents, and then sit as hard as possible on the original design so that no counter-revolution is possible. Trotsky and Lenin would see that as a fairly neo-Marxian strategy and be pleased by its success.


What XML has failed to achieve is any real independence from the web browser containers with the HTML legacy. I expect that change is coming but not from the document or database communities. It may come from the graphics communities or it may be the case that the browser finally evaporates into the 0 point of the screen/world coordinate system.


What is revealing is that 12 or so years on into the XML period, the web is showing signs of fracturing not because technologies can't dominate it but because communities are discovering their own boundaries and realizing that they want the walled gardens. Socialization is creating this effect as networks discover that self-selection includes exclusion as a principle of organization, and that taking a territory or market is still an act of force as much as finesse.


Here the XML principles may or may not apply, but the experience proved iconic.

Yoon Kit
2008-01-10 09:54:13
Rick,


The comment at openmalaysiablog [http://www.openmalaysiablog.com/2007/01/ooxml_has_poor_.html] was mainly about the element names. MSOOXML's scrgbClr, algn, blurRad, dir, dist, rotWithShape names are unnecessarily terse and unclear. It reads like variables from their C code.


Do you think that it's still a good idea to have names such as this within the spec? Or should Microsoft clear it up for the international community to understand? r shd Msft clr t p fr th intnl cmnt to unstnd?


yk.