exploring the problems involved in comparing XML

by Rod Chavez



i've spent the last week trying to get my brain wrapped around the issues
involved with performing XML comparisons and how they might be solved. i also
decided early on in my reading and discussions that i'd better attack this
problem in stages, otherwise i'd have nothing to show for it for a very long
time. in other words, this is a non-trivial problem. at least for me <g>




so what makes this a hard problem? after all, there's a ton of "difference"
programs out there, like diff, windiff, tkdiff, etc. why not take the two
XML-documents to be compared and run one of these programs over them and see
what it says? it will certainly tell us something, why can't we stop there?




the short answer is that while XML is a textual format, it is more then "just
text". it has structure too, and in order to provide a useful diff this
structure must be taken into account





the rest of this post assumes some familiarity with XML, at least at the
high-level. if you've never seen XML before (ie, you've spent the last 5 years
under a rock <g>), i'd recommend taking a quick look at
Norman Walsh's
A Technical Introduction
to XML





let's take a look at some real XML and see what some differences are, and
aren't. consider the following two documents:


<foo>
<bar/>
<baz/>
</foo>


<foo>
<bar></bar>
<baz></baz>
</foo>

from a "pure text" point-of-view, these documents are different. in particular,
lines 2 and 3 are different. but from an XML point-of-view, these documents can
be considered identical. they both have a "root" element foo, which in
turn has two child elements, bar and baz. the fact that in the
first document the elements bar and baz are atomic, and in the second
they aren't, doesn't change the fact that they are both childless, entityless
tags. hence, the two documents are considered equal. simple, eh? maybe, but
things get complicated quick




next let's look at the same two documents, but this time the second will now be
different (both in a text and XML way):


<foo>
<bar/>
<baz/>
</foo>


<foo>
<bar a="x"></bar>
<baz></baz>
</foo>

now things have gotten interesting. in the second document, the element
bar now has an attribute a whose value is "x". so from an XML
point-of-view, these documents are different. "ok, so they're different, what's
so hard about that?". well, the hard part is what do you display? diff
utilities not only tell you if the documents are different, they tell you
how they are different. they use different mechanisms to indicate where
and what type of differences there are. let's take a look at some choices,
easiest to hardest:

  1. physical text-diff

    this is the easiest to understand (and implement), but it's the least useful
    for XML. in the latter case above, both the bar and baz elements
    would be shown as deleted and then inserted. this is due to the line-oriented
    viewpoint of a text-diff

  2. logical XML-diff

    now we're doing better. the baz element would not be shown as
    having changed, while the bar element would. but not the entire element
    since the only change is that an attribute has been added. so rather then
    having the line with the bar element seeming to be deleted and then
    added, you'd want to see just the added attribute appearing



    you can see an example of this on Microsoft's
    GotDotNet site where they have an online
    XML Diff and Patch
    tool. if you feed the two documents we're discussing to their service, they
    will perform a logical XML-diff, and display just the added attribute
    high-lighted in yellow



    but there's still a problem with this, which is that while you're now viewing
    the difference between the two documents, you're not seeing either
    document. you're just seeing the change relative to the logical structure of
    the documents. why would this be a problem? well, suppose you had changed an
    XML document that was being tracked under a source-code control system like
    cvs. after seeing the differences, you
    might want to merge just some of the changes that have occurred between the
    two documents. in the logical view you don't even know what line the
    change occurred on. in a small, simple XML document this wouldn't be that big a
    deal, but what about something more involved, like a
    WSDL or a
    Schema file? trying to take the
    changes you're interested in and "back" then into one of the files might prove
    to be not only tedious, but error-prone as well



    btw, please note that i'm not trying to pick on Microsoft's "XML Diff and
    Patch" tool. it's actually very good. i'm just trying to point out the merits of
    the various types of diff strategies, and point to samples where they
    exist on the web. as tools go, XML Diff and Patch is quite complete

  3. physical XML-diff

    this one has the potential to be very useful. it would understand not just what
    had changed between the two documents from the perspective of XML, but it would
    also understand how to relate any discovered differences back to the physical
    layout of the original document. so you'd see the original document in the
    display, and any changes to be displayed
    (red things being deleted,
    yellow things being added) would
    appear in the right place. the tricky part is how to perform a logical XML-diff
    while still somehow preserving the information needed to relate difference back
    correctly





having laid out the problem space somewhat, i'd like nothing better then then
to tell you i've got the whole thing wrapped up, here's an implementation, now
go play with it... NOT! unfortunately this problem is kinda tricky, and i'm an
incrementalist be nature. i like to start off with the small, easy parts of a
problem, solve those to get familiar with the issues, and then build momentum
towards the hard parts (or, at least, the parts that seem hard to me)




so i decided to attack this whole area in the order i laid out above. i wrote a
physical text-diff online service
that you can play with right now, or you can
download and install it
yourself on any compatible servlet container. i've only run it on WLS, but it
should run elsewhere




if you know how to install a WAR (since that's how it's packaged) on your
servlet container, you're all set. check your servers documentation. for WLS,
the following command (all on one line) will do the trick, assuming you've got
the install i documented in my last blog
post. otherwise, make the
appropriate changes for your installation


/bea/jdk141_03/bin/java -cp /bea/weblogic81/server/lib/weblogic.jar
weblogic.Deployer -adminurl http://localhost:80 -username weblogic
-password weblogic -name diff -deploy diff.war

btw, this WAR includes all the source, so you're welcome to check it out,
modify it, etc. please let me know if you find any bugs, especially if you fix
them <g>




now let's walk through what it does and how it works. first, you file upload
two documents. in order to perform file upload, i'm using some
Java classes written by
Jason Hunter. it's a
very useful set of utilities that help the developer with an aspect of servlet
programming that for some reason has never been addressed by the J2EE standard
itself




let's take a look at the file upload process in more detail. on the client, you
want a form tag whose encoding is "multipart/form-data" and method is
"post". here's the relevant part of the HTML


<form ENCTYPE="multipart/form-data" ACTION="phys_text_diff/" METHOD="POST">
<input name="rig" type="file" size="50"/>
<input name="rox" type="file" size="50"/>
<input value="compare" type="submit"/>
</form>

i've stripped out any of the formatting, so you can see just the form and the
file upload controls. i use a table for formatting, and that's pretty gruesome
to look at. you can see the two file upload controls, the submit button and the
form tag wrapping them all




now let's look at the server side. as you can see from the client part, these
files are arriving as part of an HTML-POST so we've got to have our code inside
the doPost(...) servlet method. i could also have done this inside a
JSP, but i got started writing a straight servlet and never looked back. since
that's not really the point of this exercise, it doesn't seem to big a deal,
but i might change that later. what do you think?


protected void doPost(HttpServletRequest req, HttpServletResponse resp)
throws java.io.IOException
{
MultipartParser mp = new MultipartParser(req, 1024 * 400);
Part part;

...

while ((part = mp.readNextPart()) != null)
{
if (part.isFile())
{
// save file contents while they're available
FilePart filePart = (FilePart) part;

...
}
}

...
}

again for clarity, i've stripped away some of the boilerplate in order to focus
on the details. in this case, you can see a
com.oreilly.servlet.multipart.MultipartParser being created (you have to
hand in the HttpServletRequest and the max file size you want to handle), and
then it loops over each com.oreilly.servlet.multipart.Part testing each
in turn to see if any is a "file", and if they are, casts the Part to a
com.oreilly.servlet.multipart.FilePart and start pulling out the
contents of each file. note: make sure you drain the contents of each
Part as it's returned. once you get the next Part, all the contents of the
previous Part are lost. so if you didn't copy them off, they're gone. Jason's
doc is pretty clear on this point, but i still screwed this up in my initial
implementation




if you're interested in other code samples and information on file-uploads,
here's something you might want to take a
look at




now that we've got the files up on the server, we need to compare them and
report the differences. not surprisingly, this turns out to be a "hard"
problem. for example, let's imagine a file 10,000 lines long, where each line
is unique with respect to all other lines. then let's suppose you copy the file
and in the copy, delete the second line. when you compare these two files, you
want to be told that there's one line missing in the second file. but a valid
(and useless) answer would be that 9,999 lines had been deleted, and a
different 9,998 lines had been added. in a sense, both of these answers is
correct, but only one is useful. we want that one <g>




to understand this better, i spent some time googling around. it turns out that
this problem, comparing long sequences of data to find the useful
differences comes up over and over again, in very diverse settings. for
example, research is being done to compare very long biosequences, in an area
known as Computational Biology. here's a
paper i found on the
subject




the interesting thing about many of the papers i found is that when you
back-track the references from almost any of them, you either directly or
indirectly arrive at a paper published in 1976 by J. W. Hunt and M. D. McIlroy,

An Algorithm for Differential File Comparison
. it's pretty readable and
lays out the problem quite nicely. in a nutshell, it describes the problem as
taking two files as input and then computing the minimum number of changes
required to transform one file into the other. if this sounds familiar, this
paper was generated to document the research that went into the original
implementation of diff




while i found this information very interesting, i wasn't looking forward to
turning this (or any of the other) papers directly into working code. i googled
around for implementations i could use, and found
GNU Diff for Java. it's a port of
GNU Diff into Java. it's all represented by the class bmsi.util.Diff.
it's provided as is, and it's license is GPL. it works, and pretty much does
exactly what i wanted. here's how i used it


String[] rigLines = (String[]) rigList.toArray(new String[0]);
String[] roxLines = (String[]) roxList.toArray(new String[0]);
Diff diff = new Diff(rigLines, roxLines);
Diff.change change = diff.diff(Diff.forwardScript);

here you see a bmsi.util.Diff being created while taking as input two
String arrays, each array representing a file, and each element of the array
being a line from that file. next, the diff(...) method is called with an
argument indicating that the results should be returned in "forward" order,
from the top of the file to the bottom. the results themselves arrive in the
form of a linked list of bmsi.util.Diff.change objects




btw, looking at the HTML and Java, you'll notice that i
use the word rig and rox as an identifier or prefix. some of you
will recognize this as a somewhat subtle science-fiction reference. in any case, rig is
short for "original", and rox is short for "copy"




finally, it's time to format the results.


int rigLine = 0;
int roxLine = 0;

while (change != null)
{
if (change.deleted != 0) // DELETED
{
// deleted - number of lines in rig
// line0 - first line deleted in rig
// line1 - line before delete in rox

if (rigLine < change.line0)
{
int clearLines = change.line0 - rigLine;

printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
rigLine += clearLines;
roxLine += clearLines;
}

printDiffLines(pw, rigLine, roxLine, rigLines, change.deleted, -1);
rigLine += change.deleted;
}

if (change.inserted != 0) // INSERTED
{
// inserted - number of lines in rox
// line0 - line before insert in rig
// line1 - first line inserted in rox

if (roxLine < change.line1)
{
int clearLines = change.line1 - roxLine;

printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
rigLine += clearLines;
roxLine += clearLines;
}

printDiffLines(pw, rigLine, roxLine, roxLines, change.inserted, 1);
roxLine += change.inserted;
}

change = change.link;
}

if (rigLine < rigLines.length || roxLine < roxLines.length)
printDiffLines(pw, rigLine, roxLine, rigLines, rigLines.length - rigLine, 0);

so what's going on here? it's not too complicated, but it took me a while to
get this just right (famous last words <g>). first, the code loops over each
item in the change object until there are no more changes to process. the first
tricky point i had to deal with is that while we're dealing with a list of
changes, i still need to display all the lines in both files. in other words,
if the first change occurs on line 5, i've still gotta print lines 1 through 4




the two variables rigLine and roxLine are there to deal with
this. you can see these used inside both the "DELETED" and "INSERTED" sections
of the while loop. the if in both those sections check to see if
there are unprinted lines before the current change. if there are, they are
printed. finally, after the loop terminates, if there are any lines left they
are printed as well




at the bottom of the two sections, the lines of the change itself are printed.
and of course, anytime lines are printed, rigLine and roxLine are
incremented




well, that's pretty much it. as you can tell, there's still lots to do. i've
got lots more info about other products and libraries that take a crack at this
problem, and others besides. for example, how about supporting merge?
there seem to be several approaches to this. and how about schema? so far i
haven't really seen anyone take on that problem. still looking. finally, what
about having this available as a web-service? and if you have any ideas,
comments or suggestions, please don't keep them to yourself. let me know



any thoughts on the diff problem in general? or XML-diff in particular?


4 Comments

anonymous2
2003-10-09 17:07:05
Diffing XML
You may want to check out xmlunit (http://). They have solved this problem very nicely. Granted, the code needs to be refactored away from the unit testing concerns, but the algorithms for determining whether or not XML is the same or equal are there. From what we have seen, it seems to be pretty fast.


-John (jburwell@nospam.bkbank.com)

rodc
2003-10-10 10:25:05
Diffing XML
the problem with some of the test-utilities i've seen is that they just tell you that something is different, since that's usually all you need to pass/fail a test, they don't tell you all the ways they are different


but i'll check this tool out to see if i can use it


thanks!

anonymous2
2003-10-10 12:02:14
Diffing XML
I have used XMLUnit very successfully to compare files. The compare classes can be used outside of a unit test and it can give you detailed differences. Here is a snippet that will report all the differences between two files:


protected void checkXmlEqual(Reader inExpected, Reader inActual, String name) {


DetailedDiff diff = null;
try {
diff = new DetailedDiff(new Diff(inExpected, inActual));
} catch (Exception e) {
__log.info(" " + name + " has errors");
__log.info(" " + e.toString());
return;
}


if (!diff.similar()) {
__log.info(" " + name + " has errors");
__log.info(" " + diff.toString());
}
}

corlie1
2004-01-19 00:05:43
Based on your code: "Diffing" message-versions in forumsoftware.
Hi Rod and others,


Thanks for the code! I have used it to build an diff-feature of versions of a message in my open source webforum-application Outlineforum (1). This feature is important, because the idea of this webforumsoftware is that the discussion is kept up to date. When in this threaded-forum an "parent" message is updated, the child-message must be checked to see if it is still valid. This is showed by changing the background color of this message in red. In order to validate the message, one needs a quick look at what changes has been made in the parent-message.


If you want to have a quick look how I use the code:
Go to http://outlinerforum.xs4all.nl:8080/oforum/nodes.action?focus_id=207
Click on line "9 New car project" on the H-image to get an list of earlier
versions of this message. Select on the "Diff"-image in front of one of earlier versions of the message to have it compared with the latest version.


Thanks again!


Cor Lieftink
corl@xs4all.nl


(1) projectsite Outlinerforum (formerly named Policymaker):
http://sourceforge.net/projects/policymaker
(2) homepage: http://outlinerforum.xs4all.nl/