XML design: data or documents?

by Michael C. Daconta

I have recently been getting back into XML processing for some upcoming work with business glossaries. So, I will be examining the relevant business glossary formats in forthcoming blog posts...



As a precursor to that, I would like to delve a bit into XML design by bringing up something that has troubled me on several occasions over the past six months. First, as I was writing a simple iTunes de-dupe utility, I had occasion to parse and process the iTunes XML format. There are many good articles on this format, such as the xml.com article entitled "Hacking iTunes". The iTunes format is an XML data dump of a data structure called a "plist", which is a list of key/value pairs. In fact, "plist" stands for "property list". While using XML to persist data structures provides some minimal benefits via text encoding, it seems harmful to the larger goal of XML being easily understandable, and thus processable, by many applications. So, while dumping data structures to XML is not evil, it is also not recommended. The reason is that the data remains fairly tightly coupled to the program that produced it, and thus the semantic value of the data, as a standalone entity, is diminished. In short: better to design XML documents than dump XML data.
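
To make the coupling concrete, here is a minimal sketch of what a consumer has to do with a plist-style dump. The fragment is a hypothetical, simplified sample (the artist and year values are invented), and the helper name plist_dict_to_pairs is made up for illustration. The point it shows is that each value is tied to its key only by document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the plist style: the meaning lives in the text
# of each <key>, and a value belongs to a key only because it comes next.
PLIST_DUMP = """
<dict>
  <key>Name</key><string>The Good Soldier</string>
  <key>Artist</key><string>Example Artist</string>
  <key>Year</key><integer>2008</integer>
</dict>
"""


def plist_dict_to_pairs(dict_elem):
    """Pair each <key> with the element that immediately follows it."""
    children = list(dict_elem)
    pairs = {}
    for i in range(0, len(children) - 1, 2):
        key_elem, value_elem = children[i], children[i + 1]
        if key_elem.tag != "key":
            raise ValueError("expected <key>, found <%s>" % key_elem.tag)
        pairs[key_elem.text] = value_elem.text
    return pairs


track = plist_dict_to_pairs(ET.fromstring(PLIST_DUMP.strip()))
print(track["Name"])
```

Nothing in the markup itself says that "The Good Soldier" is a name; only the position of the string after the key does.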



Besides iTunes, this cropped up again about a month ago when I was examining a Microsoft dump of system information on Windows Server 2003. I looked for that format on the web but was unable to locate it; if you have a link to it, please post it in a comment. That system information format fell into the same trap of simply dumping a data structure to XML. The problem I have with this is that it shifts the semantics out of the element, which is the more stable part, and into the element's value, which is the more variable part. An element can have a unique ID. An element can be described via one or more schema types. An element can be reliably referenced externally. Thus, the semantics belong in the elements and not in the element (or attribute) value.
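
As a rough illustration of semantics living in the elements, here is a sketch against a hypothetical "designed" document. The element names Library, Track, Name and Year, the id values, and the sample data are all assumptions made up for this example, not any real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical "designed" document: element names carry the meaning.
# Library, Track, Name, Year and the id values are illustrative only.
DESIGNED = """
<Library>
  <Track id="t1">
    <Name>The Good Soldier</Name>
    <Year>2008</Year>
  </Track>
  <Track id="t2">
    <Name>Another Song</Name>
    <Year>2007</Year>
  </Track>
</Library>
"""

library = ET.fromstring(DESIGNED.strip())

# An element can be referenced by a unique ID...
track = library.find("./Track[@id='t1']")

# ...and its parts can be addressed by name, independent of their order.
name = track.find("Name").text
year = int(track.find("Year").text)

# Every track name in the library, with no knowledge of sibling positions.
all_names = [n.text for n in library.iter("Name")]

print(name, year, all_names)
```

Here a query names what it wants directly, and a schema could constrain Name and Year separately, which is not possible when both are just values following a generic key.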


Do you agree?


Do you have other examples of data structures dumped to XML?


As a related reference, this xml.com article calls this a "dynamic document". However, since I am arguing here that an XML document is an expression of design, I would say that is, at best, a misnomer. To me, dumping data structures to XML seems to be a case of "interoperability lip service". So, do you design XML documents or dump XML data?

Whichever you do, why?


Until next time, see you in the trenches. - Mike


15 Comments

Paul
2008-04-20 19:03:55
Numerous times. Let's face it, most XML out there is poorly designed. I work with sports data and have seen lots of formats. Some are better than others but all have serious flaws. And yes, some are AppDBdumpML like the awful iTunes XML. So our business normalizes them all into SportsML. We're doing fine, thank you.
bryan
2008-04-21 01:21:08
I agree that dumping to XML as opposed to designing a format is interoperability lip service, but depending on the situation, lip service is all that is needed.


In the case of plists, I think this is on the cusp between something that needs to be designed and something that only needs dumping. Generally, a well-designed format is needed for meaningful business data that you expect people to spend money to comply with or program against, or in other cases where organizations will make a significant investment to work with the format. Design of the format is also a function of a data system that includes such things as coherent documentation and everything else that one needs for official standards.


Plists seem to me to be something that people use to tweak their Mac settings (maybe I'm wrong on this; it's all I ever used them for). As such, the data system around the format (documentation, well-thought-out structure, etc.) can be kept to a minimum, because it can be assumed that individual users needing to do some small thing with the format will end up creating tools, tips, tutorials, etc. If that is the purpose of an XML-based format, then it might make sense not to put money into structuring the format.



Toby
2008-04-21 04:12:42
It's true that simply dumping data to an ill-considered XML format falls a long way short of the potential gains that designing and using a good XML format offer.


Nevertheless, it's also true that, given a bunch of arbitrary data in an undocumented XML format, I can get a lot further with parsing and manipulating that data than if it were dumped to an undocumented binary format.


This is one of the biggest wins of XML; even if XML emitters thoughtlessly abuse it, end users can still extract some data, even if less than would be ideal. And since most users of any technology probably will mindlessly abuse it, you might as well cater for that use case.


How far would you have got with "Hacking iTunes" if plists were binary blobs?

len
2008-04-21 14:46:31
Dumping data and tagging it is useful. It isn't interoperable but it isn't a bad way to expose the guts of the data. Do that enough times with enough objects that the designers claim are derived from the same classes and you'll be able to create a pretty good one size fits all schema. It won't be very selective but you'll be able to prove "this" IS "one" of "those" for what that's worth. Where you have legacy, it is almost as good as those applications that reverse-engineer existing code into UML diagrams.


In YeOldDaze, we debated document design vs tag-sprinkling. In the former, we used our mighty powers of analysis and interview to design the perfect expression of the instance structure as cleanly as possible. In the latter, we took some samples of the printed documents and put tags wherever they seemed appropriate.


Then we applied style and reverse engineered a DTD from the instances.


Which worked better? All too often, tag sprinkling. Why? Expectations and existing standards (say 38784).


XML Doesn't Care. You Have To.

Michael Daconta
2008-04-21 16:43:12
Sounds like some of you are arguing for a good dose of "practicality" ... and to that I heartily agree that there are always those situations where "quick and dirty" is "good enough".
I also certainly agree that XML is better than a binary alternative.


I think the real question is where do you draw the line? Personally, I would hope that commercial products would be a good place to start for some thoughtful design.


I really like what len said:
"XML doesn't care, you have to".


- Mike
Asbjørn Ulsberg
2008-04-22 00:45:12
I actually think the OOXML format is a prime example of such an XML "dump". It's basically just the binary format serialized to XML without much thought or work put into it. At least that's what it looks like. And as Paul writes, most XML is poorly designed or not designed at all. However, as Toby writes, I'd take a bad XML dump over almost any other format any day, especially binary ones.


There are numerous other formats available today that give the same self-describing semantics and well-defined parsing rules, though, like YAML and JSON.

joshuadfranklin
2008-04-22 11:58:19
MATLAB has a savexml() function that simply dumps var names and a text representation of a matrix, like:


[2 6 3.14159 3]


This is the worst XML I have ever worked with.

damour
2008-04-22 17:42:38
I had to do the same thing, identify dupes. I was VERY disappointed in the XML format for iTunes. The whole <key>Name</key><string>The Good Soldier</string> structure is just awful. Depending upon the order of nodes for relationships is always a bad decision in my book. What did you end up with for your programmatic solution? I had to mix some complicated XPath with a hash data structure loop to get what I wanted. Would XQuery help with this in any way?
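
For readers after a starting point, here is a minimal sketch of the hash-based duplicate check described above. It swaps the XPath step for Python's standard plistlib module, which reads XML property lists into plain dictionaries. The sample data is invented, and treating "same name and artist, ignoring case" as a duplicate is an assumption; the keys "Tracks", "Track ID", "Name" and "Artist" follow the iTunes library layout. It only reports candidates and does not modify anything.

```python
import plistlib
from collections import defaultdict

# Hypothetical, heavily trimmed library file; a real export has many more keys.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>101</key>
    <dict>
      <key>Track ID</key><integer>101</integer>
      <key>Name</key><string>The Good Soldier</string>
      <key>Artist</key><string>Example Artist</string>
    </dict>
    <key>102</key>
    <dict>
      <key>Track ID</key><integer>102</integer>
      <key>Name</key><string>the good soldier</string>
      <key>Artist</key><string>Example Artist</string>
    </dict>
    <key>103</key>
    <dict>
      <key>Track ID</key><integer>103</integer>
      <key>Name</key><string>Another Song</string>
      <key>Artist</key><string>Example Artist</string>
    </dict>
  </dict>
</dict>
</plist>
"""


def find_duplicates(library):
    """Group track IDs by normalized (name, artist); keep groups of 2 or more."""
    groups = defaultdict(list)
    for track in library.get("Tracks", {}).values():
        key = (track.get("Name", "").strip().lower(),
               track.get("Artist", "").strip().lower())
        groups[key].append(track["Track ID"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


library = plistlib.loads(SAMPLE)
duplicates = find_duplicates(library)
print(duplicates)
```
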
Michael Daconta
2008-04-23 01:57:44
Interesting comments ... here are some responses ...

I thought about discussing the OOXML format here in relation to this but did not have enough time to dig into the format to really determine whether it was dumping or design. I did download some of the specification from the web but it was clear that a cursory look would not be enough to make a determination. I would love to see a detailed analysis that examines the spec and summarizes the salient issues on that question.


Thanks Joshua for the MATLAB example. I am very interested in the import/export of commercial products in this regard. Most exported XML that I have seen so far is horrible! Does anyone else have good examples? It would be great to see a searchable online database of these formats for the various products. Sounds like a good category for some wiki pages.


"damour" asked about my de-dupe utility... unfortunately, I stalled in my development of it once I learned that the XML format cannot be used to actually eliminate the songs from iTunes. In other words, the iTunes database is distinct from the XML document. I read you could delete the database and then re-import from the XML but that sounds risky. So, I need to research this some more and find the right solution. If anyone has solved this I would love to hear from you.


Excellent comments!!! Keep em coming ... - Mike
Rinie
2008-04-23 14:22:05
iTunes / plist is like the Windows registry written in XML. Indeed awkward. However, 'Thus, the semantics belong in the elements and not in the element (or attribute) value' may lead to excess verbosity, something unreadable; it feels like you have a book with one word per page.
Say:
<year>2008</year>
<month>APR</month>
<day>04</day>


For RDBMS data externalised in XML I like 1 element per tablerow instead of 1 element per field.

Michael Daconta
2008-04-24 03:20:51
Hi Rinie,


Your example was a date:
2008
APR
04


However, I would argue that you did not capture the key semantics about the date for example:

2008
APR
04


And that is the problem I would have with one element per row. Excess verbosity should no longer be a concern except where you are a thousand percent certain that performance is the top concern. (Keep thinking ... remember Donald Knuth ... remember Donald Knuth ... :)


One element per row would not tell me which columns in that database row are really important to the business ... those would need to be separated out.


Regards,


- Mike

Michael Daconta
2008-04-24 03:27:59
Shoot, I forgot to format the XML:


I meant:
<CreationDate>
<year>2008</year>
<month>APR</month>
<day>04</day>
</CreationDate>


- Mike

len
2008-04-25 06:12:52
Here's one to ponder: many of the NIMS systems specify XML languages such as CAP and transfers using SOAP. In many commercial systems such as HAN, we see RSS/Atom and REST.


CAP is a message with clear semantics (well, tractable) but specific in application (say, constrained). RSS/Atom is wide open and reusable across multiple domains of Mass Alerting.


Which would you prefer, Mike?

Michael Daconta
2008-04-25 07:56:36
Hi Len,


For me, the answer is obvious:
In emergency situations, clear semantics is paramount.
There is so much going on, so much built-in ambiguity and chaos, that we cannot afford to add to it through poor preparation.
So, for emergencies, CAP over RSS any day.


Anyone else have other thoughts on the point Len raised?


Do you disagree? (It is ok to disagree... on this blog, we know how to respectfully disagree with each other and occasionally we change our mind if someone shows us a better way)... :)


Regards,


- Mike

len
2008-04-25 09:10:50
Thanks Mike. The contract rules, of course, so it is CAP in any case. And I agree. Another advantage of CAP was that it made the application easier to design, although the messaging is easy compared to the rest of what NIMS and PHIN in conjunction drive to the surface (e.g., the incident forms we talked about earlier).


It seemed to me the RSS/subscription approach is easier on the wire, but the CAP approach is more suitable for ensuring all of the necessary rules are being followed. CAP simplified the condition where we are answering a customer that has an existing CAP-compliant mass or wide area notification system and wants us to hook yet another agency-centric app into the pipes.