Thoughts on XPath, XML, Python

by Uche Ogbuji

Nelson says: "There's the stock Python install, which barely does anything [for XML]." That's overstated. Plain old SAX and minidom may not be ideal, but they're usable. Various bugs in PySAX and minidom (see, for example, this article) have unfortunately plagued the standard library, but starting with Python 2.3, I think that they deliver what's promised. The main problem is that what they promise doesn't fit Python's shoes all that well. PySAX's very literal translation of Java's class/method callback style feels very stilted in a language that now has the likes of generators and nested scopes. I suspect that if PySAX were in development now, things would be very different. It's to some extent a legacy problem. I used to recommend SAX to those who need performance, but I think my own recent work (represented in Amara) and that of Fredrik Lundh (in ElementTree) may be enough to render PySAX obsolete. As for minidom, it could do with a lot more Python-friendly sugar so that people don't have to think in the W3C's over-elaborated API, but once you get the hang of DOM, you can pretty much do whatever you need with it.
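To make the contrast concrete, here is a minimal sketch (not taken from Nelson's post, and not Amara's API). It pulls the same element text out of a small made-up document twice: once with a PySAX handler class, and once with a generator wrapped around ElementTree's iterparse. It assumes a current Python in which ElementTree is importable as xml.etree.ElementTree; the sample document and the names TitleHandler, titles_via_sax and iter_titles are invented for illustration.

```python
import io
import xml.sax
from xml.etree import ElementTree

# Hypothetical sample document, used only for this illustration.
SAMPLE = b"<feeds><feed><title>One</title></feed><feed><title>Two</title></feed></feeds>"


class TitleHandler(xml.sax.ContentHandler):
    """Callback style: parser state has to live on the handler object."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means we are not inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so collect them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


def titles_via_sax(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles


def iter_titles(data):
    """Generator style: an ordinary loop carries the state."""
    # iterparse reports "end" events by default, so elem.text is complete.
    for _event, elem in ElementTree.iterparse(io.BytesIO(data)):
        if elem.tag == "title":
            yield elem.text


sax_titles = titles_via_sax(SAMPLE)
gen_titles = list(iter_titles(SAMPLE))
print(sax_titles, gen_titles)
```

Both halves produce the same list; the difference is how much bookkeeping the handler class needs compared with the plain loop.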

So rather than completely writing off the stdlib XML facilities, as Nelson did, I damn it with faint praise. Not a difference worth much bother? Perhaps. Moving on, here's Nelson again:

PyXML, which has an ugly hack to confusingly install on top of the default Python libraries. But if you follow the advice of Python's most visible XML expert, Uche Ogbuji, you may think there's something wrong with PyXML and install 4Suite instead, which is the same as PyXML only different.

I've done a horrible job of explaining 4Suite if people are thinking it's in any way similar to PyXML. The two could hardly be more different. Maybe Nelson means that the XPath libraries are the same? This isn't true either. Years ago we did copy the 4Suite code base to PyXML, and it was massaged to make it fit better into PyXML overall. Since then, the XPath in 4Suite has evolved into an entirely different beast: much faster, more extensible, and with a cleaned up API.

Or should you use Amara instead? Fair question. When I developed Amara I considered lumping all that code back into 4Suite, but I thought it better to release it as a separate 4Suite add-on. For one thing, I think it has a very different flavor: focusing on Python idioms rather than what-would-W3C-do (which we'd been peeling away from gradually in 4Suite, anyway).

I think I can make a workable soundbite for the cause: If you're coming more from a Python background, and XML is just something that's getting in your way, try Amara. If you're coming from an XML background, and you think in DOM, XSLT and all that, try 4Suite. Does anyone find that soundbite useful? Based on it, I think Nelson should be trying Amara rather than just 4Suite. I should point out that Amara is very fast as well (and 4Suite has made huge strides from when it was too slow to bear: it's now very respectable, if not blistering).

ElementTree which is brilliantly fast and simple to use, but limited

Hmm. Several times I've made the mistake of claiming some limitation in ElementTree, and then along comes Fredrik to straighten me out. ElementTree is a lot more versatile than one might think at first glance. So why did I develop Amara? Why didn't I just use ElementTree? I did for a while, and I always felt that ElementTree does a great job of loosening DOM's shackles in favor of something more Python-flavored (hats off to Fredrik, who tried to convince me that DOM was not good enough for Python long before I saw the light). But I honestly think ElementTree doesn't go quite far enough. Amara follows the principle that once I decide to shrug off DOM, I want to be able to use every possible nifty tool in Python's arsenal to make the XML feel native to the language. I want something closer to Gnosis Utilities Objectify, but using a much more declarative framework. I think that Amara's unique niche is a combination of extreme Python-friendliness and declarativity. I think that XML without declarativity results in far too much code, and code that is too brittle, even in Python.
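As a rough illustration of what loosening the DOM shackles looks like in practice, the sketch below reads the same values from a made-up document with minidom and then with ElementTree. It assumes ElementTree is importable as xml.etree.ElementTree, as in current Python releases; the sample document and variable names are hypothetical, and none of this shows Amara's API.

```python
from xml.dom import minidom
from xml.etree import ElementTree

# Hypothetical sample document, used only for this illustration.
SAMPLE = '<book id="b1"><title>Python and XML</title><year>2001</year></book>'

# DOM: text lives in child Text nodes, so each lookup walks node lists.
dom = minidom.parseString(SAMPLE)
book = dom.documentElement
dom_id = book.getAttribute("id")
dom_title = book.getElementsByTagName("title")[0].firstChild.data
dom_year = int(book.getElementsByTagName("year")[0].firstChild.data)

# ElementTree: attributes and child text are one method call away.
root = ElementTree.fromstring(SAMPLE)
et_id = root.get("id")
et_title = root.findtext("title")
et_year = int(root.findtext("year"))

print(dom_id, dom_title, dom_year)
print(et_id, et_title, et_year)
```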

xmltramp, which is even more hacky.

I'll risk the flames and be honest. I don't think xmltramp is (yet) industrial strength. It's a lot hackier than ElementTree, Gnosis, generateDS, 4Suite or even Amara. It looks and probably feels great on the first foray, but I don't think that experience will scale to heavy usage. Besides, it doesn't support XPath.

But what's missing is a clear single simple library to use.

I don't believe a single choice is appropriate. I want many options. I think people who want just one way to process XML are limited by sketchy experience with XML. Just as I wouldn't expect one single library for text processing in Python (and I expect no one would suggest such a thing), I can't imagine how anyone could shoehorn all the breadth and variety of XML use cases into a single idiom, or even two or three. XML is ridiculously versatile, and this necessitates broad choice. I do a lot with XML and consequently, I often use 3-4 different tools in any given day.

PyXML seems the most standard, but it seems very slow and it tries to be more DOM-like than Python-like. I hate DOM.

I don't promote PyXML as any sort of standard. To me the only standard is Python's stdlib and PyXML is not in it. It's just a choice, and a flawed one for some of the reasons you mention. I think PyXML was important, but has been overtaken by events. I'm not entirely blameless in that matter, and I'm sorry I never had all the energy to work on PyXML as hard as, say, 4Suite, but I think at this point it's too late.

[with PyXML] from xml.dom.ext.reader import Sax2

Yuck. That's the ancient DOM code included in PyXML. Many people make the mistake of invoking it. It is dreadfully slow and consumes a dreadful amount of memory. Always use PyXML's minidom. Just replace the above with:

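from xml.dom import minidom  # minidom lives in the xml.dom package
from xml import xpath  # provided by PyXML
doc = minidom.parse('foo.opml').documentElement
# Select every xmlUrl attribute in the document
for url in xpath.Evaluate('//@xmlUrl', doc):
    print url.value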

You'll get a lot more speed, but all my other downer comments on PyXML still apply. There are better options.
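For readers who don't have PyXML's xml.xpath available, here is a hedged sketch of the same lookup using only the standard library's minidom. It approximates //@xmlUrl by checking every element for the attribute. The inline OPML sample and the helper name xml_urls are made up for illustration, and the code is written for Python 3 rather than the Python 2 of the snippet above.

```python
from xml.dom import minidom

# Hypothetical OPML sample, standing in for foo.opml.
OPML = """<opml version="1.1">
  <body>
    <outline text="News">
      <outline text="Feed A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Feed B" xmlUrl="http://example.com/b.xml"/>
    </outline>
  </body>
</opml>"""


def xml_urls(document):
    """Return the xmlUrl value of every element that has one, in document order."""
    return [
        element.getAttribute("xmlUrl")
        for element in document.getElementsByTagName("*")  # "*" matches all elements
        if element.hasAttribute("xmlUrl")
    ]


doc = minidom.parseString(OPML)
urls = xml_urls(doc)
for url in urls:
    print(url)
```

This trades the XPath expression for a list comprehension, which is fine for a one-attribute lookup but gets clumsy quickly for anything more selective.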

the awfulness of the libxml2 API

I couldn't agree more. libxml2 is a miracle of function, but alas in a form that doesn't suit Python one bit. I know that folks are working on better libxml2 wrappers, but familiar as I am with the C code, I honestly don't believe they can produce anything truly Pythonesque without losing all the performance gains.

So that's all the chatter. But code speaks louder, and I'll offer some in a subsequent entry.


2005-01-16 04:48:32
Thanks for the comments
Thanks for these comments. They are helpful.

I'm a reasonably experienced programmer, and I use Python most of the time. Reading Nelson's blog, I thought that he said very well some of the sense that I have about XML and Python.

I'm working on a project that uses a lot of XML and I often struggle with it. There are many reasons (some of which you've covered in your columns), but two are: XML is a big swallow, and things change very fast.

By a "big swallow" I mean that there is a great deal to learn at once: it seems like I can't understand how to proceed without Unicode, some UML, related technologies like XPath, etc. By "things change" I don't just mean in the XML landscape (although that is so); I also mean in Python and XML. For instance, does the "Python and XML" book mean anything at this point? It is hard for me to know.

If part of Nelson's drift was to ask: what should a person choose that they won't regret in a month, or five years from now? then I'm asking the same thing. I'm very willing to work at it, but I'm sometimes unsure what "it" is.

Again, thanks,

2005-01-16 10:05:12
Thanks for the comments
You're right that the fundamental problem is that XML is a very complex beast. I compared it to text processing, and I think it's a very apt comparison. You wouldn't expect to use a single tool, or one technique for all your text processing needs. I think too many Python developers fail to realize that XML is almost as complex a subject matter. Whether this is a strength or weakness of XML is a topic for separate discussion.

As for 'does the "Python and XML" book mean anything at this point?', well, we all know that books go out of date, and you've pointed out yourself how fast the landscape is changing. A lot of the book is out of date and we can all hope for a second edition. Meanwhile, I did offer a few updates in "A Python & XML Companion". Of course, the landscape has changed since then, and though all the code in that article is still valid, I would probably choose to do things a bit differently now :-)

You ask a great question: "what should a person choose that they won't regret in a month, or five years from now?" I'm not sure my crystal ball extends to five years, but I think I might try to look at that question in an upcoming article in my

2005-01-17 05:02:34
losing all of the performance gains?
> I know that folks are working on better libxml2
> wrappers, but familiar as I am with the C code,
> I honestly don't believe they can produce
> anything truly Pythonesque without losing all
> the performance gains.

I think that's a bit overstated:

import time
from lxml import etree

start = time.time()
tree = etree.parse('ot.xml') # 3.4 megs
print tree.xpath('(//v)[100]/text()')
end = time.time()
print "Time taken:", end - start

$ python2.3
['And Adah bare Jabal: he was the father of such as dwell in tents, and of such as have cattle.\n']
Time taken: 0.233765125275

[Though granted the first time it runs it takes 0.64 s; later on I benefit from OS caching]

If you don't use 'text()' in your XPath expression and query for elements, you'll get ElementTree compatible elements back. In my book that's Pythonesque.


2005-01-17 09:31:25
>>But what's missing is a clear single simple library to use.

>>I don't believe a single choice is appropriate.

'A good default is worth a dozen options.'

2005-01-18 18:01:05
losing all of the performance gains?
Wow. This is very impressive. As I've said before, I'll be interested in writing an article on lxml once it's ready for general release.

2005-01-18 18:04:00
Fair enough. Of course opinions differ here. I've heard from several people that they don't like having so many choices. But then there are many who prefer to have such diversity (not least of all the 74+ people who each chose to roll their own).

One instructive note: Java has a reasonable (for Java) default: Sun's JAXP reference implementation. Yet there are hundreds of additional choices, many of which are also popular.

XML is IMO way too varied a construction field for one class of truck.

2005-01-20 09:20:20
losing all of the performance gains?
It's getting there now. It has also gained basic XSLT support, and basic Relax NG support as well. I hope to do a first release in a few weeks' time. I'll let the world know when it's ready, no fear. :)



2005-03-01 08:55:24
Macromedia Flash
Oddly enough, I'm a big fan of Macromedia Flash's implementation of XML. It is a very simple and elegant solution. I suggest looking into it.

Also, as an opinion: I think that to become very successful, elegance is needed not just in the design but also in the naming of variables and modules. It’s just like poetry. For instance: “from elementtree.ElementTree import ElementTree” and “binderytools.bind_file” look inelegant and will turn people off from the outset. I want to see "from libraryXML import document" and "document.loadXML(filename)".