Parsing XML... backwards?

by Michael Day

Okay, I've heard jokes about people parsing XML files backwards, starting at the end of the file and reporting SAX events in reverse document order, but it seems that someone has actually gone and done it. The justification sounds almost plausible: an instant messaging client (Adium on the Mac) that writes out XML message log files and uses backwards parsing as a method for retrieving the last N messages in constant time, regardless of how many messages the file contains in total. However, it's crazy to think of doing this for XML in general.

16 Comments

bpsm
2007-03-13 02:08:11
100% right. If you find yourself needing to parse XML backwards, perhaps you shouldn't be using XML in the first place.
M. David Peterson
2007-03-13 02:12:01
>> Just don't parse XML backwards!


AMEN, BROTHER!!! :D


>> The justification sounds almost plausible: an instant messaging client (Adium on the Mac) that writes out XML message log files and uses backwards parsing as a method for retrieving the last N messages in constant time, regardless of how many messages the file contains in total. <<


Of course you could always prepend each new entry to the log file similar to the way many feed generators will add a new entry to an existing web feed file. I guess maybe that would have been too easy of a solution? ;-)


Of course, with all of this, if their is ever going to be an XML 2.0, I pray to the almighty XML Gods that it comes in the form of "Concurrent XML" which would be less of change in the XML specification itself, and more of a technique in which suggests a particular way of creating an XML document that could reliably be broken up into smaller pieces, each of which could be processed independently and aide-effect free.


In short: Similar to canonical XML, it seems to me that a lot of the current problems with XML could just as easily be fixed by developing a best patterns and practices extension to the existing XML 1.*0* ;-) specification, rather than attempting to come up with back-a$$-wards solutions like processing an XML file backwards.


Thoughts?

M. David Peterson
2007-03-13 02:13:17
s/aide-effect/side-effect
Michael Day
2007-03-13 03:52:08
The thing is, there are many ways of having many little XML documents in one file: ZIP, for example, or multipart MIME. If you need constant time concatenation and retrieval, you can use the simple XML document chunk plus binary trailer method that I outlined, or multiple files in a directory, or your own database format. I don't see a great need for new XML specifications in this area.


It is true that XML documents are not appendable, but I don't really think that is a bug, more of a feature. Rather than changing XML, I would prefer to extend access methods (eg. XPath/XQuery, XSLT) to be able to treat multiple documents as one big data model. If you think about it, these technologies already support multiple files as one big data model thanks to external entities, and multiple documents thanks to XInclude and the XPath document() function, so it's not a big step.

Oleg Tkachenko
2007-03-13 04:33:11
"how do you append to the document in the first place if the root element is already closed?"


And why do you need XML document for logs when you can write XML fragment aka external general parsed entity?


"And if you never close the root element, then it's not an XML file at all."


Wrong. It's not XML _document_, but it can be perfectly wellformed external general parsed entity.

Michael Day
2007-03-13 04:42:04
Wrong. It's not XML _document_, but it can be perfectly wellformed external general parsed entity.


Good point! Breaking a document up into entities is a clever way of making a single document that can have content inserted at arbitrary points (entity boundaries) in constant time. I still think it's a bad idea to parse the resulting document backwards, though :)

Oleg Tkachenko
2007-03-13 04:43:27
I was writing about logging in XML format years ago - http://www.tkachenko.com/blog/archives/000053.html
The idea is just to have XML fragment
...
...


Then you can append data to the log really fast - just to the end of file with no parsing whatsoever. And for reading - some XML API like .NET XmlReader able to read XML fragments, otherwise just include fragment into wrapper document that's it. Simple and effective.

M. David Peterson
2007-03-13 04:43:40
@Michael,


Oh I agree 100%. My thoughts are based on the notion that there seems to be all of these efforts to try and solve problems in unconventional ways that make absolutely no sense. e.g. the example you provided above. If there is going to be any work done in this area, it seems to me that the best thing to place focus on is taking what we have and defining ways to utilize them to take full advantage of new areas of technology. Concurrent programming is obviously a significant piece of our future, and finding ways (like those you've just mentioned) to take full advantage of it with XML is developing into a hot topic as of late, and one of the solutions I have heard mentioned in various places is parsing and XML document backwards and forwards at the same time, which as you've already pointed out is a bad idea all together. As such, I would hope that if folks are going to spin cycles coming up with seemingly better ways to process XML, they would spin them in ways that actually make sense: Like creating best patterns and practices docs that define how to best go about taking advantage of parsing and processing XML documents concurrently. XProc seems like it would be a good place to start, as does taking advantage of the things you've mentioned like zipped archives of documents to be processed, etc...

Oleg Tkachenko
2007-03-13 04:46:05
I meant
<event>.../<event>
<event>.../<event>


Yes, I'm with you on parsing XML _documents_ backwards.

Peter Hosey
2007-03-13 23:39:29

First problem: the document encoding. You don't know what it is unless you sniff the beginning of the file and read the XML declaration, if present.


Truth.


Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it's at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it.


Truth. I should probably add a forward-parse phase to LMX 2.0 or 3.0.


Third problem: comments. Say you're parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? Wrong, it's the end of an element: . This is a killer for efficient parsing as it means you need potentially unbounded look-ahead (or look-behind, in this case) to decide what something is. (This problem could be avoided if comments were symmetrical and ended with --!>, but XML just wasn't designed to be parsed backwards).


Truth. LMX is pretty naïve about such things (by necessity).


Fourth problem: processing instructions. As with the comments, how can you tell what this is: ?>. The problem this time is that text can contain unescaped > characters (as long as they don't follow ]]), so a backwards parser may need to look a very long way ahead to tell if this is the end of a processing instruction or just some text.


LMX doesn't handle unescaped >'s anyway--it assumes that a > always ends a tag.


Fifth problem: documents that are not well-formed. I shudder to think what this parser would do with them, especially considering that it may stop parsing before it even reaches the beginning of the document.


Very little. LMX is a delicate parser that expects well-formed input. Anything else is likely to make it behave more or less randomly. ☺


Sixth problem: how do you append to the document in the first place if the root element is already closed? And if you never close the root element, then it's not an XML file at all.


Our solution to this is actually one of the problems that led me to implement a reverse parser in the first place.


A common suggestion that people make when they see LMX is "Why not just save the messages in reverse order--with the newest message at the start of the file? Then you wouldn't need a reverse parser!" We don't do this because file I/O has no insert mode; we would need to rewrite all the older messages again, which would get very expensive as the chat goes on longer and longer.


But this overwrite behavior came back to help us when the question you mention arose. We simply seek to the start of the </chat> tag, then overwrite it with the new message and a new </chat> tag. So we get to have a real XML document, and add new elements to the log.


Sure, you might think that all of these problems can be avoided by making sure that your XML is in a fixed encoding, doesn't use entities or processing instructions or whatever, but what's the point of choosing an open standard text format if you're going to impose arbitrary limitations on its use?


Truth. LMX could definitely be better about supporting these things, though I don't see any hope for being very flexible in parsing borken XML.


There are a number of other ways that applications can use to maintain growable log files. You can write multiple well-formed XML documents to a single file, following each one by a binary trailer that gives the size of the last chunk of XML.


That's an interesting idea. I admit that we never thought of that.


Michael Day
2007-03-13 23:50:25
But this overwrite behavior came back to help us when the question you mention arose. We simply seek to the start of the </chat> tag, then overwrite it with the new message and a new </chat> tag. So we get to have a real XML document, and add new elements to the log.

Okay, that makes sense. I had assumed (for no real reason) that you only wished to append to the file, but of course you can always truncate it slightly and then append.


Another thing I avoided mentioning was robustness: if the application dies halfway through appending a new chunk to the file the results may not be pretty. The chunk + binary trailer approach can be engineered to survive this, as the last thing that you write in the trailer is a stamp saying that everything is okay. That way if there is garbage at the end of the file because someone pulled the plug out when you were writing to it last time you can always scan backwards for the last good trailer and truncate the file at that point, a bit like a journalling file system in some regards.

Peter Hosey
2007-03-13 23:51:02

I failed to escape part of my comment, so the first few paragraphs got eaten. Sorry. Here's the missing portion:


First problem: the document encoding. You don't know what it is unless you sniff the beginning of the file and read the XML declaration, if present.


Truth.


Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it's at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it.


Truth. I should probably add a forward-parse phase to LMX 2.0 or 3.0.


Third problem: comments. Say you're parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? ...

Peter Hosey
2007-03-14 03:02:59

Another thing I avoided mentioning was robustness: if the application dies halfway through appending a new chunk to the file the results may not be pretty.

We write the entire message element + end-chat-tag out at once, then immediately call fsync and fcntl(F_FULLSYNC). It's still possible that this could break, but you pretty much need a meteorite to strike your hard drive for that to happen.

jeff
2008-02-21 18:32:16
My friend has att yahoo dsl this error just occoured today Feb 21 the isp tech said it was an isp problem another att yahoo tech said it was internet explorer has anyone ever had this error. line b parsing xml: 'null' is null or not an object
jeff
2008-02-21 18:36:52
I meant to also say this was the first site I found on the net that was even remotely close to relating to my error message
Michael Day
2008-02-21 20:15:25
jeff, this is a problem relating to whatever JavaScript is running on the site that your friend visited. Talk to the webmaster of the site in question, and tell them your browser version. Other than that, it's nothing to do with XML, and doesn't really belong on this forum. :)