Code, respecting XPath, XML, Python
by Uche Ogbuji
First of all, here are the three snippets Nelson posted:
from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.opml').documentElement
for url in xpath.Evaluate('//@xmlUrl', doc):
    print url.value
My take: this uses the ancient 4DOM code. I expect it to be slow as hell and suck all the memory out of your computer. People, avoid the line "from xml.dom.ext.reader import Sax2" like the plague. If there are docs that still suggest it, they really should be fixed. If you do use PyXML, use minidom, though I personally have not been much of an advocate of PyXML for ages.
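For anyone who wants to follow the minidom suggestion, here is a minimal sketch using the xml.dom.minidom module from the standard library. minidom has no XPath support of its own, so the sketch walks the outline elements directly, which matches '//@xmlUrl' only as long as xmlUrl appears on outline elements. The inline sample document and the helper name feed_urls are made up for illustration, and the syntax is current Python rather than the Python 2 of the snippets in this post.

```python
from xml.dom import minidom

# Hypothetical sample standing in for foo.opml.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="News">
      <outline text="Example A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Example B" xmlUrl="http://example.org/b.xml"/>
    </outline>
  </body>
</opml>"""


def feed_urls(opml_text):
    """Return the xmlUrl value of every outline element that carries one."""
    doc = minidom.parseString(opml_text)
    try:
        # getElementsByTagName searches all descendants in document order.
        outlines = doc.getElementsByTagName("outline")
        return [
            node.getAttribute("xmlUrl")
            for node in outlines
            if node.hasAttribute("xmlUrl")
        ]
    finally:
        # Break the DOM's internal reference cycles so the tree is freed promptly.
        doc.unlink()


if __name__ == "__main__":
    for url in feed_urls(SAMPLE_OPML):
        print(url)
```

The grouping outline ("News") has no xmlUrl attribute, so hasAttribute filters it out rather than yielding an empty string.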
import libxml2
doc = libxml2.parseFile('foo.opml')
for url in doc.xpathEval('//@xmlUrl'):
    print url.content
My take: as Nelson admits, this snippet is very deceptive. It doesn't show even a fraction of the hair-pulling that would characterize a real-world version of the same code. It ignores the fact that libxml2 forces you to do your own memory management, that it requires hideous C-ish idioms to work through the XPath results, and so on.
from elementtree import ElementTree
tree = ElementTree.parse("foo.opml")
for outline in tree.findall("//outline"):
    print outline.get('xmlUrl')
My take: ElementTree is always a breath of fresh air, but Nelson mentions that he was hampered by the XPath limitations (no attribute axis, for example). Well, there is always some cost to max simplicity, max performance.
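One way to live without the attribute axis is to select the elements and read the attribute in Python. Here is a minimal sketch, assuming the xml.etree.ElementTree module that ships with current Python (the descendant of the elementtree package imported above). The sample document and the function names are illustrative only. The second function relies on the "[@attrib]" predicate, which later ElementTree versions accept even though attributes still cannot be selected as results.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for foo.opml.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="News">
      <outline text="Example A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Example B" xmlUrl="http://example.org/b.xml"/>
    </outline>
  </body>
</opml>"""


def urls_by_iteration(opml_text):
    """Walk every outline element and keep the xmlUrl values that exist."""
    root = ET.fromstring(opml_text)
    # iter() visits matching descendants in document order;
    # get() returns None when the attribute is absent.
    return [
        element.get("xmlUrl")
        for element in root.iter("outline")
        if element.get("xmlUrl") is not None
    ]


def urls_by_predicate(opml_text):
    """Let the path expression filter for outlines that have an xmlUrl."""
    root = ET.fromstring(opml_text)
    return [
        element.get("xmlUrl")
        for element in root.findall(".//outline[@xmlUrl]")
    ]


if __name__ == "__main__":
    for url in urls_by_iteration(SAMPLE_OPML):
        print(url)
```

Either way the result is a list of plain strings rather than attribute nodes, which is usually what you wanted from '//@xmlUrl' in the first place.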
And out of my corner are the following offerings.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("foo.opml")
for url in doc.xpath("//@xmlUrl"):
    print url.value
Here you have 100% of XPath's power, plus the option to extend XPath in Python, if need be. It's also plenty fast these days, if not quite as fast as libxml2, and probably not as fast as cElementTree.
from amara import binderytools
rule = binderytools.preserve_attribute_details(u'*')
doc = binderytools.bind_file("foo.opml", rules=[rule])
for url in doc.xpath("//@xmlUrl"):
    print url.value
Looks very similar to the 4Suite example besides the imports and the declared rule. Amara does not support attributes in XPath by default (to save space, similar, I'd guess, to the reasoning in ElementTree), but you can trivially enable them by asserting the above rule. 4Suite has no such limitation, but Amara's edge shows more clearly when you're not using XPath. For example, Amara would allow you to access an XHTML title easily, without needing XPath:
print doc.html.head.title
This is what I mean by extreme Python-friendliness. I should point out, though, that Amara's XPath implementation does have some other limitations, but none that most users are likely to run into.
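To make the idea of attribute-style access concrete, here is a toy sketch of a binding layer built on the standard library's ElementTree. It is not Amara's implementation or API. The class names Bound and Document and the sample document are invented for illustration, and the sketch ignores namespaces, attributes and repeated children. It is written in current Python syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (no namespace, to keep the sketch short).
SAMPLE_XHTML = """<html>
  <head><title>Hello, bindings</title></head>
  <body><p>Some text</p></body>
</html>"""


class Bound:
    """Hypothetical wrapper that exposes child elements as Python attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal lookup fails; refuse private names so a
        # missing _element can never cause infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        child = self._element.find(name)  # first direct child with this tag
        if child is None:
            raise AttributeError(name)
        return Bound(child)

    def __str__(self):
        # Printing a bound element shows its leading text content.
        return self._element.text or ""


class Document:
    """Hypothetical document object: the root is reachable by its tag name."""

    def __init__(self, xml_text):
        self._root = ET.fromstring(xml_text)

    def __getattr__(self, name):
        if not name.startswith("_") and name == self._root.tag:
            return Bound(self._root)
        raise AttributeError(name)


if __name__ == "__main__":
    doc = Document(SAMPLE_XHTML)
    print(doc.html.head.title)
```

The whole trick is __getattr__: each dotted step becomes a child lookup, so the path through the document reads like ordinary object access.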
Got code of your own?
Interesting that you say it's hard to make Pythonesque bindings. Over here in Perl-land I've been using XML::LibXML for quite a while, and I have to say Matt Sergeant and Christian Glahn did a rather good job of making the interface feel Perlish despite keeping pretty close to the C API. I've been a fan for a long time: it's a speed demon with a low memory footprint and a pleasant interface. Nothing not to like there.
Interesting. I don't know all that much about Perl, so I think I speak for many readers in saying it would be interesting to see an example of this, if you could post one here.
Since last week, thanks to lxml.etree, you can do libxml2 *and* elementtree at the same time. This allows you to write code like this:
I'm the casual Googler :) I needed one year to pass by here, but found the answer I was searching for.
Tnx Tomvons (and others too)