Holes in Microsoft Office XML

by Simon St. Laurent

I've spent a lot of time this week rethinking Microsoft Office 2003's XML capabilities and figuring out how I feel about them a year after writing a book on them.

When I first saw the demos for Microsoft Office 2003, I was blown away. Word would let me edit XML documents, Excel would give me easy access to both reading and creating tabular XML data, Access would let me get data in and out, and XDocs (now InfoPath) would make it easier for people to create XML documents using forms. Everything looked great.

Since then, I've become sadder but wiser. The glossy demos hid an underlying reality which was far less delightful. The marketing story breaks down under any close inspection. While there are useful pieces here, no question, Office 2003's XML support has to be viewed as a Microsoft 1.0 release, awaiting maturation to a Microsoft 3.0 release, when the parts finally come together.

What's left me so bitter? A list is probably the best way to outline this briefly, without driving too deeply into a conversation that requires deep understanding of all of these parts. First, the technical issues:

WordprocessingML is a horrible mess.

On the bright side, it exposes everything in a Word document - you don't lose anything by saving to XML. On the dark side, the structure of what you get is amazingly twisted, and it's painfully obvious that WordprocessingML (formerly the catchier WordML) is a serialization of internal structures in Word, not an XML vocabulary designed by people who actually care about working with XML.

Word's XML functionality isn't integrated with its existing style functionality.

Users, at least some users, understand the style drop-down and applying styles. The XML task pane is new and additional. Creating XML documents in Word that look like you want them to look can require using both styles and XML. My advice to people who want to use Word to create XML: forget about the XML tools, unless you need lots of precise nesting and attributes. Use styles instead, and then extract the information from WordprocessingML. (Yes, I know that's painful.)

Word's XML functionality has awful interactivity.

Word's custom schema XML functionality lets you add XML to documents, but the process for making Word respond to that XML is presently incredibly clunky. Right now, to make Word itself make your styles match your markup, you get to save and close your document, then reopen it. Chapter 5 of Office 2003 XML walks through the current process for making this more automatic, which requires a lot of painstaking coding. There's lots of potential for creating a better experience with various panes and interfaces, hopefully soon in Visual Studio 2005 Tools for Office, which promises to be a "drag-and-drop experience rather than a code-writing experience," but that's still a task for .NET programmers, not the traditional Office power-user approach of using VBA within the application.

SpreadsheetML is incomplete.

Excel's SpreadsheetML vocabulary is infinitely easier to read and process than WordprocessingML, but it's also missing lots and lots of things, notably charts and VBA. Save an Excel spreadsheet as XML, and suddenly all the VBA vanishes. Big problem.

Excel lacks an interface for modifying XML schemas that are part of a spreadsheet.

Excel's facilities for importing and exporting XML through maps and lists are generally magnificent. So long as my XML is roughly tabular, with a fair amount of tolerance for other structures, I can show and edit XML as parts of my spreadsheet. This means that I can write reusable spreadsheets, which, for instance, take quarterly sales data and show it as charts, break it down twenty different ways, and offer projections. At the end of every quarter, I can load a new XML document and everything will instantly update. There's one big oops, though - if the schema of the XML data structure changes, there's no interface for changing the maps inside of Excel. Right now, you have to export the spreadsheet to SpreadsheetML, edit the schema in that XML, and then reload. That works, unless you had charts and VBA in your spreadsheet... then it's time for a lot of reconstructive surgery.

Access supports functionality in the GUI that's missing in VBA.

The XML functionality is Access is one of my favorite parts of Office. The designers wisely recognized that database columns and XML elements both have names, and that XML structures can easily represent relational tables. They didn't try to do anything especially funky, but let you get XML in and out. Despite that, there are a few big problems. Access lets you apply XSLT transformations on import and export through the GUI, but not through VBA! I've worked out how to apply transformations, but it means serializing a file to disk, reopening and parsing it, and reserializing it. This might not matter so much, except that Access only sees elements, not attributes, and lots of data formats use attributes....

The Web Services Toolkit is, well, lousy.

Word, Excel, and Access all rely on the Web Services Toolkit if you want to create web services-based apps from VBA. It generates lots of code, but isn't very sophisticated, even about SOAP. As Jeff Webb has pointed out, and as we demonstrated in the book, doing REST (basically really simple XML over HTTP) from Office is a lot easier, even though you have to write your own code. The documentation doesn't tell you that, of course.

InfoPath is its own universe.

InfoPath is built on an ingenious XSLT hack, and reuses lots of existing Web development technologies, which is a good thing. Unfortunately that also makes it very different from the rest of the Office suite, which generally use VBA. It's also pretty different from the .NET tools Microsoft is developing in Visual Studio Tools for Office.

Outlook? PowerPoint?

There isn't much to say about Outlook and PowerPoint in XML.

Visio is a bold SVG loner.

Visio 2003 came with amazing SVG support, but it's pretty much the only Microsoft product that's noticed SVG. Cool, but alone.

These technical problems are compounded by a number of marketing problems:

Office Standard vs. Office Professional vs. Enterprise vs. standalone apps.

For whatever reason, Microsoft decided to split the XML functionality depending on which version of Office people bought. Office Standard includes a version of Word which saves and opens WordprocessingML but doesn't let you work with the custom schema features, and a version of Excel that supports SpreadsheetML but lacks the easy XML import and export. The extra features are there in the Professional, Enterprise, and standalone box versions. This makes it really really hard to talk about these things as Office features, since lots of people will try them and find they just don't work. I was startled when my new O'Reilly laptop arrived with a copy of Office Standard, and suddenly some pieces I'd written with Professional didn't work.

InfoPath is rare.

InfoPath doesn't suffer from the Standard/Professional distinction, but instead it only comes with the Enterprise edition or as a standalone product. Because Microsoft chose to combine the forms designer and the data entry side into one application, you get to buy a license of InfoPath if you ever want to fill out InfoPath forms. I'm kind of happy Microsoft hasn't offered a free form-filling application, as this gives Mozilla more time to work on XForms, but it does make it hard to publish on InfoPath. The only market for InfoPath is inside the enterprise.

.NET and VBA are two different cultures.

I could, if I wanted, use .NET tools and wrappers for Office to use .NET's more powerful web services and thereby avoid the Web Services Toolkit for VBA, but that's a huge leap from the VBA that I've been using in Office for years. I also don't know how many .NET developers are particularly familiar with the Office object model.

It's harder to use than the demos suggest.

Microsoft's put a lot of effort into marketing this functionality, but hasn't spent much time dwelling on how hard it is to use. It's hard to sustain the excitement generated by watching a demo when you find yourself plunked into WordprocessingML and the badly-named Smart Documents API.

Who needs this again?

In general, I have a hard time seeing who the specific market is for these things. The XML import in Excel - sadly turned off in Office standard - seems like something that could be helpful to spreadsheet developers at every level. I have a hard time seeing people spending the time to create Word 'solutions' using XML, but if that was an order of magnitude or so easier it might be something that could reach from the enterprise to small businesses. I once had hopes we could use this stuff in our own (O'Reilly publishing process), but the complication level quickly dashed that. I guess it's fair to say that the Word XML features are all aimed at a market of people who have tremendous patience.

Schema ownership questions remain.

Reading things like this leave me wondering how safe it would be to build applications around these schemas. My guess overall is (and has been) that it's pretty safe, but it's still a higher barrier than I'd like to see, knowing how paranoid corporate lawyers can be. A simple statement that "we reserve the right to change these schemas but will never ever sue anyone for using them however they like" would make these things easier.

WordprocessingML is a mouthful.

It was a lot easier to say WordML.

All of these are certainly fixable, and there's reason to hope for improvement. I probably should have know better than to put so much time into what is effectively a 1.0 release. Office is an especially tough piece of software to change deeply, after a few decades of development, and it's not that surprising on reflection that this first generation of XML support is a thin candy shell around an existing product.

Can you do interesting things with Office XML functionality? Sure! Is Office's XML functionality a powerfully useful feature? Only occasionally.

I'll keep reporting on this as things improve. Hopefully that'll make for some brighter blog entries.

Have you managed to surpass these problems and do interesting work with Office XML?


2005-03-14 07:13:05
Hi Simon,

It would be interesting to go through your list again in a discussion of OpenOffice. It would win some (e.g. PowerPoint XML) lose some (web services toolkit) and be irrelevant for others. And of course the release of the OpenOffice 2.0 beta would make this new version of your list a lot more work, but even more interesting.

2005-03-16 06:09:19
Editing Word documents on the server

I was very interested to read this article. I have been working closely with WordML for about 6 months, albeit not in conjunction with Word itself.

I have certainly felt that I was on the bleeding edge of this technology sometimes, and the lack of extensive documentation and community support has been frustrating. However, I was at least relieved that the schema documentation is accurate and the process of round-tripping word documents to XML and back is reliable.

We have come up with an online document review and collaboration system called PleaseReview http://www.pleasereview.com.

We allow reviewers to view an HTML rendition of the document, so they can interactively comment and propose changes (using a pure web interface) and generally collaborate in a way which is not possible with Word's built-in commenting features. After the review has finished, the author can download the document with agreed changes already applied, and open comments shown as tracked changes.

I think this project would have been almost impossible without using WordML, because we effectively edit the document on the server to apply comments stored in a database. So I would say we have definitely used Word's XML features with a great deal of success. Maybe you would even like to consider using it in O'Reilly's review cycle :-)

2005-04-18 07:58:04
Anonymous reader comments
I apparently didn't list quite everything in this piece. Here are a few more comments from another reader.


I enjoyed reading your article about XML support in Office 2003.

Here are some points regarding Word 2003's XML support that you didn't mention. These apply to the case where you are loading a pure XML file (not WordML) and then saving it back as a pure XML file (they call it "data only").

- It will lose all XML Comments.

- It will lose all XML Processing Instructions that are inside the root element.

- It does not honor special attributes like xml:space="preserve". It will mistakenly lose newlines inside such attributes. Also, when saving back to disk, these special attributes will be lost.

- It has problems with whitespace handling. Markup like this...

This is <bold>very</bold> important.

will come in and get saved as this...

This is<bold>very</bold>important.

So, in general, if Word opens some arbitrary XML file and then saves it back to disk, it will horribly corrupt it in most cases.

Thought you might be interested in these items.

2005-04-18 15:01:48
I can't retrieve microsoft
office 2003 documents stored in a RDBMS


I can't retrieve microsoft
office 2003 documents stored in MySQL 4.0.20.a-nt database running on Windows 2003. I know that microsoft have introduced many changes in office 2003 mainly related with XML compatibility and there must be the origin of the situation but I don't undesrstand exactly why. I upgraded MyODBC in Windows server waiting to obtain office 2003 documents retrieval but it didn't fix the trouble. Do we have to await Micorsoft maturation to obtain compatibility with RDBMS ?


2005-05-17 04:41:48
Java and SpreadsheetML

Reading your article, I very much agree with your scepticism, especially regarding WordMessML. However, let's see things from the sunny side. SpreadsheetML is the thing I was working with during these past months and it's shortcommings (no charts, no VBA, no etc.) could be seen as benefits if you want to keep things basic and simple. At least it opens the possibility to integrate Excel, data and structures held therein with Java.

We use SpreadsheetML to open up Excel files, process it's data and data-structures in a Java-application and after transformation store the results in other data-repositories. We use SpreadsheetML to dynamically create Excel files in a web-application and in this way give our clients the opportunity to download the requested data in an application (Excel) that is ready to do analyses, calculations and information-retrieval.

I gather feedback on your article is not intended to promote software libraries however, our solution to integrating Excel and Java can be found at http://xelem.sourceforge.net/