Comparing XML office document formats #2: what counts

by Rick Jelliffe

As several commentators point out, the simple examples in my previous blog post, Comparing Office Document Formats, differ considerably in size and complexity from one office format to another. But it is useful not to jump to conclusions. Don't be scared of wrapper elements: HTML has too few, and although it is popular it is impoverished because of it.

The bottom line for data formats is "is the information extractable?" not "is the markup pretty?" Complexity is certainly undesirable, but the choices are not simple. For example, if you decide to have really simple elements that each serve only one signalling purpose, or favour elements over attributes, you will probably end up with deeper nesting that may scare the horses: people will be perturbed. Yet in a sense you are uncomplexifying: at least you are increasing cohesion and decreasing coupling.
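
To make the trade-off concrete, here is a minimal sketch in Python. The two snippets are hypothetical markup invented for illustration; they are not real ODF or Office Open XML, and the element and attribute names (doc, p, list, item, kind, level) are assumptions. The point is only that the same list items can be extracted from a flat, attribute-heavy encoding and from a deeper encoding with wrapper elements, even though the second nests twice as deep.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat encoding: list membership is signalled by attributes.
FLAT = """<doc>
  <p list="bullet" level="1">apples</p>
  <p list="bullet" level="1">pears</p>
  <p>plain paragraph</p>
</doc>"""

# Hypothetical nested encoding: each element has a single purpose,
# and wrapper elements carry the list structure.
NESTED = """<doc>
  <list kind="bullet">
    <item><p>apples</p></item>
    <item><p>pears</p></item>
  </list>
  <p>plain paragraph</p>
</doc>"""


def items_flat(xml_text):
    """Return the text of paragraphs that carry a 'list' attribute."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p") if p.get("list") is not None]


def items_nested(xml_text):
    """Return the text of paragraphs wrapped in list/item containers."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.findall("./list/item/p")]


def max_depth(elem, depth=1):
    """Return the deepest element nesting level, counting the root as 1."""
    return max([depth] + [max_depth(child, depth + 1) for child in elem])


if __name__ == "__main__":
    print(items_flat(FLAT), max_depth(ET.fromstring(FLAT)))
    print(items_nested(NESTED), max_depth(ET.fromstring(NESTED)))
```

In the flat version the extractor has to know which attribute signals list membership; in the nested version a path expression does the work, at the cost of the extra wrapper levels.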

5 Comments

Tim Bornholtz
2006-07-25 09:39:42
I think you may have a typo (but I'm not 100% sure). Before the first example you have "back from the Open Office document to the HTML" but I think you meant "back from the Office Open XML document to the HTML".
I get quite confused when people are referring to Open Office, the product that implements ODF, or Office Open XML, the file format in Microsoft Office. In this whole debate this is the one point that angers me. I think Microsoft deliberately picked a name for their format that would only serve to confuse the masses. Open Office was in use by OO.o long before MS came up with this format.
Rob Weir
2006-07-25 09:53:22
Interesting topic. I agree that we cannot fairly equate complexity with quality, but it is clear that in most domains increased complexity does bring increased costs of review, testing, etc., in order to reach a given level of quality. So, schema complexity is interesting to know when estimating costs to write a book on the format, to write a tool to read the format or even to review/edit a specification. Do you have any sense how the overall complexity nets out for these two schemas? Is your Document Complexity Metric applicable here? Can it be fairly compared across XML Schema and RELAX NG definitions?


So, I'm suggesting overall schema complexity matters to some (those who need to deal with the schema comprehensively, examples given earlier), but as you note, complexity to extract information matters more to others.

Peter Sefton
2006-07-25 12:52:56
I think the situation with lists is a little more complicated than what appears in the file format; we also need to consider the applications used to edit the formats.


While "ODF has nice explicit markup for list containers", these list containers are not that useful in a word processing format. They'd be good for some technical manuals and legal documents where document structure is strict, but for the average user trying to combine bullet lists with numbered lists with blockquotes in ad hoc ways, the ODF approach is a mess of interacting multi-level list structures, list styles and paragraph styles.


To make matters worse, the implementation of lists in OpenOffice.org Writer makes it almost impossible to use lists in any sensible way and the default template comes with list styles that are woefully inadequate.


The result of these problems with Writer is that it is likely that something that semantically should be a single list with another embedded in it will end up as three lists in a row in an ODF file, so you'll have a hard time writing a generic ODF to HTML converter.


Microsoft Word has its own issues with lists, but I think that lack of explicit structure in the Office Open XML is actually an advantage.


I have written a fair bit on the word processing to HTML problem. This post has some more detail about the challenges.

len
2006-07-26 06:47:38
Lossless round trips are better if that is a process the information will be subjected to (scope of operations). Yes, then wrapper elements are good in that situation. OTOH, if market uptake (reach of users) is the dominant selection criterion, then ease of use and recognizability are more important. HTML wins when the requirement is that the basic operations (entry of the markup in any editor) have to be memorizable (you and I can both hack HTML from memory; something we wouldn't do with ODF or Docbook, etc., without considerable practice). Many SGMLers tried to explain that one: the use of HTML for long lifecycle information would have a high cost later. In fact, markets seldom care about that and optimize for the shorter cycles because immediate growth fuels the engine of expansion and expansion fuels the growth of initiatives. See the history of Rome or any startup trying to become a bigCo without venture capital.


No free semantic lunch. Or as Spike tells Willow, "magic always has a price, a consequence." :-) The more forces involved, the higher the costs. In a tensor, the lumpier the manifold.

sam
2007-02-27 16:09:56
test