Comparing XML office document formats: using XML Metrics

by Rick Jelliffe

I've blogged recently comparing the two contenders for the standard office XML format crown: the Sun/IBM sponsored Open Document Format (ODF) and the Microsoft sponsored MS Office Open XML format (MSOOX). I've also blogged recently on various metrics for XML: magic numbers that provide objective evidence to help characterize things like complexity in documents, and so help with evaluation and estimation. A reader, unsurprisingly, asked if I could combine the two threads and provide some metrics on ODF and MSOOX.

Fair enough! Here are some XML metrics for a large document with almost 180,000 words, tables, lists, sidebars and some graphics. I chose a large document so that bootstrap effects would be minimized. I used the ODF v.1.0 specification, converting it from .SXW to .DOC and .ODT in Open Office 2.0, then converting the .DOC to .DOCX in Word 2007 beta. Then I used a COTS archiver to treat the ODT and DOCX files as ZIP archives, and extracted the XML files containing the basic text and markup: content.xml (ODF) and word/document.xml (MSOOX). I chose to start from the .SXW format because I didn't want to have any MS-dependencies in the data, .DOC being proprietary.
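The extraction step does not need a desktop archiver; it can also be scripted. The following is a minimal sketch, assuming Python's standard zipfile module. The function name extract_main_xml and the MAIN_PART table are illustrative, not part of any tool used for this post.

```python
import io
import zipfile

# Main text-and-markup part inside each package (part names as given in the post).
MAIN_PART = {".odt": "content.xml", ".docx": "word/document.xml"}


def extract_main_xml(package, suffix):
    """Return the raw bytes of the main XML part of an ODT or DOCX package.

    `package` may be a file path or a binary file-like object;
    `suffix` selects which part name to read.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.read(MAIN_PART[suffix])


# Demo on a tiny in-memory stand-in rather than a real document.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
demo.seek(0)
print(extract_main_xml(demo, ".docx"))
```

With real files, the call would look like extract_main_xml("spec.odt", ".odt"), where the file name is a placeholder.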

I also resaved the document to .DOC, re-opened it and re-exported it to .DOCX and extracted the word/document.xml file. Resaving data is a good trick when doing data conversion, because it removes extraneous information or structures from the source: the first .DOC is what Open Office thinks .DOC looks like, the second .DOC is how Microsoft does things.

I used the upcoming release of the Topologi Complexity Detective to create the metrics. The reports on the ODF document are here Download file; the reports on the original MSOOX document are here Download file, and the better reports on the resaved MSOOX documents are here Download file. Comments below.
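For readers who want a rough feel for this kind of number without the tool, here is a minimal sketch of a few very simple structural counts, assuming Python's standard xml.etree.ElementTree. These are not the metrics that Complexity Detective reports; the function name simple_metrics and the particular counts are illustrative choices only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def simple_metrics(xml_text):
    """Compute a few crude structural counts for an XML document."""
    root = ET.fromstring(xml_text)
    names = Counter()
    attributes = 0
    text_chars = 0
    max_depth = 0
    # Iterative depth-first walk; the root element is at depth 1.
    stack = [(root, 1)]
    while stack:
        element, depth = stack.pop()
        names[element.tag] += 1
        attributes += len(element.attrib)
        # Character data directly inside the element plus any text after it.
        text_chars += len(element.text or "") + len(element.tail or "")
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in element)
    return {
        "elements": sum(names.values()),
        "distinct_element_names": len(names),
        "attributes": attributes,
        "max_depth": max_depth,
        "text_characters": text_chars,
    }


sample = '<doc><p a="1">Hello <b>world</b>!</p><p/></doc>'
print(simple_metrics(sample))
```

Running the same function over content.xml and word/document.xml would give directly comparable counts, though any difference still needs analysis to explain.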

17 Comments

Micah Dubinko
2006-08-18 07:16:45
Hey Rick,


Interesting stuff (as usual). Is it just me, or are the linked word-1 and word-2 files identical? -m

Rick Jelliffe
2006-08-18 09:12:04
Thanks, and well spotted. I've corrected the second link and updated the blog. The difference is less than a percent.
Joshua Franklin
2006-08-18 12:33:42
Great article. I've been converting some spreadsheets to OpenOffice.org content.xml files and loading data into a database, and I was pleasantly surprised how easy it was. (Better than CSV, and I easily got the "Track Changes" annotations and cell styles which the client was using.) I'd love to see a similar comparison for spreadsheets.
Brian Jones
2006-08-18 16:41:06
Hey Rick, do you think you could post a link to the document itself? I'm really curious to see the cases where you are getting the base 64 encoded data...
Thanks.


-Brian

Kaj Kandler
2006-08-18 17:29:31
I believe OpenOffice 2.0 supports ODF 1.0, while you write several times that you are comparing ODF 1.1. Am I mistaken?


K

Kaj Kandler
2006-08-18 17:33:11
-Off topic - message to the webmaster:
This comment form butchers less than and greater than signs. It needs a little decode/encode magic.


Let's see if my signature makes it this time?


K<o>
P.s.: Yes!

Rick Jelliffe
2006-08-18 20:43:47
Here is an example. The SXW original has



<text:p text:style-name="Text body">Chapter
<text:reference-ref text:reference-format="chapter" text:ref-name="Introduction">1</text:reference-ref> contains an introduction to the
<text:user-field-get text:name="CommitteeName">OpenDocument</text:user-field-get> format. The structure of documents that conform to
the <text:user-field-get text:name="CommitteeName">OpenDocument</text:user-field-get> specification is explained in chapter
<text:reference-ref text:reference-format="chapter" text:ref-name="Document Structure">2</text:reference-ref>.


The MSOOX has



<w:r w:rsidR="00DE46CC">
<w:instrText xml:space="preserve"> REF Ref_Introduction \n \h
</w:instrText>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t>1
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t xml:space="preserve"> contains an introduction to the OpenDocument format.
The structure of documents that conform to the OpenDocument
specification is explained in chapter
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="begin">
<w:fldData xml:space="preserve">CNDJ6nn5us4RjIIAqgBLqQsCAAAACAAAABkAAABSAGUAZgBfAEQAbwBjAHUAbQBlAG4AdAAlADIAMABTAHQAcgB1AGMAdAB1AHIAZQAAAAAA
</w:fldData>
</w:fldChar>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:instrText xml:space="preserve"> REF Ref_Document%20Structure \n \h
</w:instrText>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t>2
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="end"/>
</w:r>



I should note that this may well be a fault of OpenOffice's output converter rather than Word. Because of the impenetrable .DOC stage, it is impossible to know or fix. (Well, at least there is open source!)
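The base64 payload in w:fldData can be inspected directly. The following is a minimal sketch, assuming Python's standard base64 and re modules; it makes no claim about the layout of the surrounding binary structure and only scans the decoded bytes for runs of UTF-16LE printable ASCII. The helper name utf16_strings is illustrative.

```python
import base64
import re

# The w:fldData content from the MSOOX excerpt above.
FLD_DATA = (
    "CNDJ6nn5us4RjIIAqgBLqQsCAAAACAAAABkAAABSAGUAZgBfAEQAbwBjAHUAbQBlAG4A"
    "dAAlADIAMABTAHQAcgB1AGMAdAB1AHIAZQAAAAAA"
)


def utf16_strings(raw):
    """Find runs of 4+ printable ASCII characters stored as UTF-16LE."""
    runs = re.findall(rb"(?:[\x20-\x7e]\x00){4,}", raw)
    return [run.decode("utf-16-le") for run in runs]


raw = base64.b64decode(FLD_DATA)
print(len(raw), utf16_strings(raw))
```

The only text run found is Ref_Document%20Structure, the same bookmark name that appears in the REF instruction of the field, so at least that part of the payload duplicates information already present in the markup.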

Rick Jelliffe
2006-08-18 20:57:35
You are right, the DOCTYPE declaration has "-//OpenOffice.org//DTD OfficeDocument 1.0//EN". I will correct the text, but it doesn't alter anything.
edges
2006-08-25 21:39:28
This article really leaves me even less convinced of the value of such metrics than before. What is desirable isn't more complexity or less complexity, but right complexity. That isn't going to be determined by these numbers but by actually analyzing the content and their representations in detail, and looking at how well the problem domain is actually represented, how completely, how precisely and how intelligible the result is. All these metrics seem to lead to is vague speculations and to encourage you to look more closely at this or that detail to determine the why of some numeric difference. An equally big design difference might lead to equal numbers for different reasons, and if you use the numbers as your guide of where to look you'll never look there. It seems to me the moral of the story is that if you want a real answer you have to look at it in detail, and you might as well just start out that way. As it is, you didn't really get to a real answer, just a number of incomplete suppositions.
Rick Jelliffe
2006-08-28 01:46:54
[edges] Yes, you are completely right that a metric then requires analysis, but I don't think I said otherwise.
But your phrase "vague speculation" hits the nail on the head: without some objective measures, your analysis and the evidence on which the analysis is based is speculation or guesstimation. With a metric, we can provide some objective evidence; to put it another way, if we make statements about something but cannot come up with any objective metrics to back up our statement, then a manager can reasonably suspect that we are on flimsy ground. I have seen projects where a simple metric resolved an issue that the client thought was some kind of personality conflict between two consultants: the metric put the onus on the party saying "there is no difference between these two schemas" to have to show why the numbers (i.e. the objective evidence) varied so much. Metrics help save us from consultants; or, at least, good quality consultants are happy to modify their opinions in the face of more evidence.
In the case of ODF/MSOOX, we might easily say "Oh, of course ODF is simpler" out of prejudice and yet, on several fairly straightforward measures, it is not the case.
You seem to think it is a flaw if a metric encourages you to look into some detail; on the contrary, that is part of their function and why they can be useful.
I also agree that "right complexity" has a place; however, the "right"ness belongs to analysis, but the "complexity" belongs to metrics. There may indeed be better metrics, and they may involve measuring programmer productivity rather than the schema itself of course; but that is not a point against metrics in general.
SomeOne
2006-08-30 16:25:29
Those fldChar and fldData elements are most probably used for cross referencing/toc/indexing.


Word can create such things based on styles (headings,...) or special fields (field values).


It seems to me that OpenOffice.Org Writer is inserting fields when exporting to .doc.


Using styles could result in different markup and different metrics.


BTW, Ecma released draft 1.4 of Office Open XML on 23rd of August 2006 and MS Office 2007 is not (fully) compatible with this version.


hAl
2006-10-07 08:42:59
Starting with an ODF predecessor, the conversion to .DOC could have been a reason for the poor size of the .DOC file compared to the ODT file.
If I understand you correctly, you have converted a .SXW file to the .DOC file using OpenOffice. That seems a weird way to start this test, as OpenOffice (or any such converter) isn't likely to create an efficient, small, complex .DOC file but rather a larger, less complex file, as that is easier for conversion purposes.
Rick Jelliffe
2006-10-13 01:40:28
hAl, yes, the most that can be said is that at least one workflow produces these figures. As to whether other workflows produce other figures, and how significant the figures are, I leave to the reader. I wasn't trying to "show up" either OOX or ODF, I just took the most straightforward path on my system.
Tim Small
2007-09-24 11:54:58
this is very helpful explaining the next office 2007 tools
Rick Jelliffe
2007-09-24 18:42:27
Tim: Thanks.


Of course, these numbers are based on old versions of the technologies, not the versions of 2007 (e.g. ODF 1.1 or DIS29500). And I expect the 2008 versions (ODF 1.2 and IS 29500) will be different again. So that is a big caveat on these numbers.

Prashanth
2008-03-23 22:38:26
1. What is the difference between a document and a format?
2. What is the difference between a Quality Manual and a Quality Procedure?