Code, respecting XPath, XML, Python

by Uche Ogbuji

Related link: http://www.nelson.monkey.org/~nelson/weblog/tech/python/xpath.html



First of all, here are the three snippets Nelson posted:



PyXML



from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.opml').documentElement
for url in xpath.Evaluate('//@xmlUrl', doc):
print url.value


My take: this uses the ancient 4DOM code. I expect it to be slow as hell and suck all the memory out of your computer. People, avoid the line from xml.dom.ext.reader import Sax2 like the plague. If there are docs that still suggest it, they really should be fixed. If you do use PyXML, use minidom, but I personally have not been much of an advocate of PyXML in ages.



libxml2



import libxml2
doc = libxml2.parseFile('foo.opml')
for url in doc.xpathEval('//@xmlUrl'):
print url.content


My take: as Nelson admits this snippet is very deceptive. It doesn't show even a fraction of the hair-pulling that would characterize a real-world version of the same code. It ignores the fact that libxml2 forces you to do your own memory management, that it requires very hideous C-ish idioms to work through the XPath results, etc.



ElementTree



from elementtree import ElementTree
tree = ElementTree.parse("foo.opml")
for outline in tree.findall("//outline"):
print outline.get('xmlUrl')


My take: ElementTree is always a breath of fresh air, but Nelson mentions that he was hampered by the XPath limitations (no attribute axis, for example). Well, there is always some cost to max simplicity, max performance.



And out of my corner are the following offerings.



4Suite:



from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("foo.opml")
for url in doc.xpath("//@xmlUrl"):
print url.value


Here you have 100% of XPath's power, plus the option to extend XPath in Python, if need be. It's also plenty fast these days, if not quite as fast as libxml2, and probably not as fast as cElementTree.



Amara



from amara import binderytools
rule = binderytools.preserve_attribute_details(u'*')
doc = binderytools.bind_file("foo.opml", rules=[rule])

for url in doc.xpath("//@xmlUrl"):
print url.value


Looks very similar to the 4Suite example besides the imports and the declared rule. Amara does not support XPath attributes by default (to save space, similar, I'd guess, to the reasoning in ElementTree), but you can trivially enable them by asserting the above rule. 4Suite has no such limitations, but Amara's edge is more clearly shown if you're not using XPath. For example, Amara would allow you to access an XHTML title easily, without needing XPath: print doc.html.head.title. This is what I mean by extreme Python-friendliness. I should point out, though, that Amara's XPath implementation does have some other limitations, but not any most users are likely to run into.





Got code of your own?


5 Comments

aristotle
2005-01-16 04:50:11
libxml2 bindings
Interesting that you say it's hard to make Pythonesque bindings. Over here in Perl-land I've been using XML::LibXML for quite a while, and I have to say Matt Seargent and Christian Gland did a rather good job making the interface feel Perlish despite keeping pretty close to the C API. I've been a fan for a long time: it's a low memory footprint speed devil with a pleasant interface — nothing not to like there.
uche
2005-01-16 09:55:57
libxml2 bindings
Interesting. I don't know all that much about Perl, so I think I speak for many readers that it would be an interesting to see an example of this, if you could post one here.
faassen
2005-01-17 04:53:53
lxml.etree
Since last week, thanks to lxml.etree, you can do libxml2 *and* elementtree at the same time. This allows you to write code like this:


>>> from lxml import etree
>>> tree = etree.parse('ot.xml')
>>> tree.xpath('(//v)[5]/text()')
[u'And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\n']


or, say, this, modifying the elements returned:


>>> result = tree.xpath('(//v)[5]')
>>> result[0].text = 'The day and night verse.'
>>> tree.xpath('(//v)[5]/text()')
[u'The day and night verse.']


Unlike in the libxml2 example, there is no deception involved -- a Pythonic API and memory management is taken care of.

tomvons
2005-02-28 02:27:11
PyXML example


For the casual Googler, the minidom way would be:



from xml.dom import minidom
from xml import xpath


doc = minidom('foo.opml').documentElement
for url in xpath.Evaluate('//@xmlUrl', doc):
print url.value



Riccardo
2006-06-10 14:07:57
I'm the casual Googler :) I needed one year to pass by here, but found the answer I was searching for.
Tnx Tomvons (and others too)