Comparing XML office document formats: HTML, ODF, WordML, FO, Word2007

by Rick Jelliffe

I'm teaching a course this week where the attendees requested some basic information on office file formats. People want to know how easy it is to convert the kind of XML they generate into other formats. So I took a simple HTML file with headings, a paragraph, a table and a list and converted it to various office XML(ish) formats: HTML (through Word 2000), WordML (through Word 2003), pre-standard ODF (STW, through Open Office 2.0.2), ODF (through Open Office 2.0.2), XML from the Office 2007 beta, and XSL-FO using HTML2FO.

The document is a simple one that looks like this (the original has no formatting: the following example may inherit styles not intended):



A heading

A paragraph

A subheading

A  B
1  2

  • a
  • bullet
  • list






Some of the file formats save as ZIP, in which case I extracted the content file and left out any style or metadata files (some of the MS files have embedded metadata). Most of the formats just spew out data onto one line, so I reformatted the XML in Topologi Markup Editor using "Publishing Style" in the Foreman, and XML delimiting.
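For readers without Topologi, the extract-and-reformat step can be approximated with a few lines of Python. This is only a sketch using the standard library, not the "Publishing Style" formatting used for the examples here; the function name `extract_pretty_xml` and the in-memory stand-in package are assumptions for illustration. Re-indenting mixed content can also change whitespace, so treat the result as something to read, not to round-trip.

```python
import io
import zipfile
from xml.dom import minidom


def extract_pretty_xml(package, member):
    """Read one XML member out of a ZIP package and return it re-indented."""
    with zipfile.ZipFile(package) as zf:
        raw = zf.read(member)
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Build a tiny stand-in package in memory (hypothetical content, not a real
# office file) so the sketch runs without any external documents.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml", "<doc><h1>A heading</h1><p>A paragraph</p></doc>")
buf.seek(0)

pretty = extract_pretty_xml(buf, "content.xml")
print(pretty)
```

With a real file, the first argument would be the path to the package and the second the member to pull out, for example "content.xml" for the Open Office files or "word/document.xml" for a Word 2007 file.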

The WordML file contained a few odd characters like U+F0A7 (Topologi replaces them with a PI in the examples below to mark them out), which is a character in the Private Use Area (PUA) of Unicode. I'm not completely sure what the purpose of this is, but I suspect they have mapped the Wingdings font to the PUA. I don't know why they don't just use the real Unicode characters there; perhaps the same mechanism is used for accessing user-defined fonts (as used in East Asia) with non-standard characters.
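Private Use Area characters of this kind are easy to find programmatically, because Unicode assigns them the general category "Co". The following is a minimal sketch; the helper name `private_use_chars` and the sample string are made up for illustration and are not taken from the WordML file.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, 'U+XXXX') for every Private Use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# Hypothetical list-item text that starts with a PUA code point.
sample = "\uf0a7 a bullet item"
found = private_use_chars(sample)
print(found)
```

This only reports where such characters occur. What glyph each one stands for depends on the font the application applies, which is exactly why they are awkward to convert.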

Word 2007 gives two options for saving as XML. If you just save it as a Word document it is saved as a ZIP file, and the XML contents are in "word/document.xml". You can also save it directly to XML, which combines all the parts into one file: that's what I used below, with the extra parts removed.

As for sizes, this is a tiny example, but it shows the spread:

512 eg.html
4.0K eg-word.htm (Word 2000)
4.5K eg-fo.xml (HTML2FO)
4.5K eg-word2007.docx/word/document.xml (Word 2007)
8.0K eg.stw (ZIP file)
8.5K eg.odt (ZIP file)
9.5K content-swt.xml (extracted contents)
10K content-odt.xml (extracted contents)
10K eg.rtf (Word 2000)
11K eg-word2007.xml (Word 2003)
12K eg.word2007.docx (Word 2007)
40K eg-word2007.xml (Word 2007)
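
To put the XML entries in the listing on one scale, they can be expressed as multiples of the original HTML file. This is a rough sketch that assumes "K" means 1024 bytes and that the listed sizes are rounded; the shortened labels are mine.

```python
KIB = 1024  # assumption: "K" in the listing means 1024 bytes

# (label, size in bytes), taken from the listing above
sizes = [
    ("eg.html", 512),
    ("eg-word.htm (Word 2000)", 4.0 * KIB),
    ("eg-fo.xml (HTML2FO)", 4.5 * KIB),
    ("word/document.xml (Word 2007)", 4.5 * KIB),
    ("content-swt.xml", 9.5 * KIB),
    ("content-odt.xml", 10 * KIB),
    ("WordML (Word 2003)", 11 * KIB),
    ("single-file XML (Word 2007)", 40 * KIB),
]

baseline = sizes[0][1]
ratios = {label: size / baseline for label, size in sizes}
for label, ratio in ratios.items():
    print(f"{ratio:5.0f}x  {label}")
```

On these figures the single-file Word 2007 XML is 80 times the size of the HTML, while the extracted "word/document.xml" is 9 times.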

15 Comments

Jim M.
2006-07-15 15:00:51
You might have started with an "Original HTML" with code that validates. E.g., where is your DOCTYPE, charset declaration, etc?
Rick Jelliffe
2006-07-15 19:59:21
There is no need for a charset declaration because I only used ASCII. Consequently every possible default charset potentially involved (text/plain = ASCII, text/html = ISO 8859-1, xml = UTF-8) will be correct.


As for DOCTYPE, surely you are trolling? There are lots of variant possibilities; more interesting ones would be what happens when the XHTML namespace is used and what happens when class attributes are used.

Dragan Sretenovic
2006-07-16 08:05:51
Isn't it amazing how much effort goes into "formatting" and how little care goes into semantic compatibility, in all of the document formats? One would expect those demanding compatibility to be interested in what is IN the documents, rather than only in how they look.


All complex proprietary tags are just obscuring the content more and more. This is also a bad design, since content and formatting are mixed. HTML may still be the best solution, because it can separate most of formatting into CSS easily.


Not to mention the absolute lack of the "linking" capability that made modern (Google and similar) web search so useful. Is there an equivalent of the HTML "a" tag in other formats that can reference outside of the document? Or something like: arizona_trip_2006.doc#day5

len
2006-07-16 12:32:32
The simplicity of the HTML says it all.


I've had to translate a fair amount of RTF into HTML and then use it in report generators. The simplicity of raw HTML beats the heck out of any XML format with built-in compatibilities.


What do you get for giving up future and past proofing? Conservation of effort.

xix [nine-teen]
2006-07-16 23:14:21
Hmm, comparing HTML w/o CSS with formats including style definitions, isn't it like comparing oranges with apples?
And isn't the example too simple? I don't mean that layout languages should be complex, but the last 10 years of HTML showed that HTML has its limitations, probably because it's used nowadays for things it was never intended for.


xix [nine-teen]

Kurt Cagle
2006-07-16 23:56:35
Of course, even with HTML + CSS - which DOES get considerably farther down the stylistic pike, you're still talking about a surprisingly minimal set of information required compared to any of the "formal" page layout specifications. However, I think a more honest test may be to put together a formal page layout that needs to be replicated in any of these languages, including HTML+CSS. I suspect that at least some of the disparity in sizes might disappear.
Rick Jelliffe
2006-07-17 01:43:00
What I wanted to see with these files was just how text, lists and tables were handled in the different formats. xix is right that size is not everything (it is not nothing, on the other hand.) But Dragan is right about semantics too. And, I should mention that the Open Office XML format is not fully baked yet.


Interestingly, there is an error in the HTML: the first TH has a TD end tag. The HTML generated by Word 2000 strips out the TD and replaces it with bold and generates an extra row. Yuck. The XSL-FO strips out info about whether h1 or p is used: perhaps this is not intrinsic to FO and there is a way to keep this info, I don't know. But we couldn't recover the original HTML from the FO.


ODF and the Word XML outputs do preserve enough information in the input to regenerate the original HTML. Except for the element.
Rick Jelliffe
2006-07-17 01:43:54
oops...I mean the <div> element
Florian Schönherr
2006-07-17 04:24:00
I'd like to see a format like DocBook in this comparison, too.
len
2006-07-17 06:04:26
Size is proportional to the handling semantics being exposed. To handle multiple subjective views for any given element, the underlying object is complexified and this feeds back into the external representation. Think of the markup as the shape of the manifold to which the semantics are mapped. The HTML application objects are doing just a few things ok. The other application objects are doing many things well. Flexibility comes at the cost of a very lumpy manifold. Any mapping to that shares that cost expressed in the markup.
James Hales
2006-07-25 07:23:39
Very interesting. The relative sizes of these formats are unbelievable, and the actual markup looks hard to read.


So all that is due to the formats trying to be more flexible? Which capabilities included in these formats have caused such size and complexity?

Rick Jelliffe
2006-07-25 09:47:50
I have another blog entry on the issue of size and complexity, with some XSLT code.
SomeOne
2006-08-31 03:00:14
WordProcessingML, OpenDocument and XSL FO are designed to solve different problems than HTML. They focus on layout, formatting and printing. Wordprocessing applications are used (or should be used) to create printed documents, using them in a purely digital way/world is quite pointless.


Furthermore you're not comparing the formats but tools for creating those formats against your HTML skills. I'm not aware of any WYSIWYG HTML editor that will create such compact HTML code.


I recently did a project that involved creating WordProcessingML documents, and I can tell you that you can strip nearly half of the markup from your file without losing anything.

LeChe
2006-12-21 02:04:39
I agree with what has been said by len and others before: HTML has a completely different goal than the other formats.
Since there is no layout information present in HTML, you rely on the browser's interpretation of your "styling" information. To generate a truly 100% un-misinterpretable layout, the document would be _very much_ longer. Thus, for fairness, you should include the business rules for interpreting the HTML of all the major browsers - let's see which one is the longest then! :)
On the other hand, it is true that MS puts loads and loads of trash in their documents. If you look at the XML structure, there is so much redundant and even unnecessary information that makes the documents so incredibly long.
To sum it up: if you want an HTML document that defines its layout properly without relying on ANY interpretation by the browser for where to position an element etc., you will end up with a longer document than a well-formed XML document with no redundancy in it. You are just not taking advantage of all the programmatic assumptions and layout definitions in the browser.


Regards,
Che


Btw: I very much agree with the comment of one person: to be fair you should compare a Dreamweaver (or any other WYSIWYG editor) HTML document to the rest...


And Rick, Jim M. has a point: your document is NOT valid. You are REQUIRED to specify a doctype for an HTML document to be valid. Just because all browsers nowadays handle HTML even without doctype does not mean it is optional!
Just ranting because it is horrible to see what people do to the HTML lately... :)


Rick Jelliffe
2006-12-21 05:38:24
LeChe: Thanks for the comment!


To be "fair", I think I would have to try every possible application saving multiple files of different sizes and usages of styles/hardcoding to every possible format each supports. And then import them to every other application and re-save them in every possible format. This should only require a few tens of thousands of documents. :-)


The blog is entirely factual: I (tried to) make sure there was no evaluation or interpretations of the facts offered. Facts are not fair or unfair; they are pre-fair. I suppose selective presentation of facts is unfair, and incomplete facts can be misleading, but I didn't vet the results or make comments that could be misleading.


But there is very little material on the WWW that actually directly compares the different formats. Part of the reason is that people are too lazy to do it; they would prefer to sit in their armchairs coddling their prejudices.


As for HTML validation requiring a DOCTYPE declaration, validating the example will not change the structures or information set in any significant way. Nor would it, I'd expect, alter any of the outputs in any significant way.