Amara XML Toolkit: Simple things simple. Complex things possible.

by Uche Ogbuji

Related link: http://www.nelson.monkey.org/~nelson/weblog/tech/python/xpath2.html




I designed Amara XML Toolkit to make the simple things easy and the complex things possible. I'm open to honest, constructive criticism of where I failed in that aim, but I don't want any misconceptions floating out there.



Cutting to the high-speed chase scene, here is how Nelson Minar can do what he wants in Amara:



from amara import binderytools
doc = binderytools.bind_file("foo.opml")

for outline in doc.xpath("//outline"):
    print outline.xmlUrl


If someone thinks that's too complex, I'll be happy to hear ideas for making it simpler. It's four lines of code, very similar to the ElementTree example. In my previous blog entry I was under the impression that Nelson really wanted to use XPath in attributes, so I showed how to make that possible in Amara. He seems to have misread that as meaning that throwing in such a rule is the only way to parse a document in Amara.



In reality, 90% of Amara users will never need to invoke a special rule while parsing XML. The defaults are generally fine, tuned to balance speed and memory use against functionality.



Amara does let you turn custom behaviors on and off with simple declarative rules, and it lets you tune those rules so that they apply to just portions of a document. I think this is a good way to save users a lot of code. Yes, the downside is that you have to learn the available rules, but that is inevitable, and I've always thought it's easier to read documentation on an existing capability than to write code to reinvent it.



But as I always say, code speaks louder than words, so here is more. Above I challenged folks to show how they could make the Amara bindery example simpler. Well, in my last release of Amara I decided to take on that challenge myself. Amara 0.9.2 introduces the Pushbind. With Pushbind, here is code that does what Nelson wants:



from amara import binderytools
for frag in binderytools.pushbind('outline', source='foo.opml'):
    print frag.outline.xmlUrl


There you go. One fewer line, and the XML looks, to all appearances, like any other Python object coming in from an iterator. One nice bonus is that it is extremely memory efficient. In fact, it never uses much more memory, in general, than it takes to represent one outline element. This is true whether foo.opml is 1KB or 1MB.
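To illustrate the general idea of handing out one element at a time from an iterator, here is a minimal sketch in modern Python 3 that uses only the standard library's xml.etree.ElementTree.iterparse. It is an analogy, not Pushbind's implementation: the function name iter_outline_urls and the sample OPML document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

def iter_outline_urls(source):
    """Yield the xmlUrl attribute of each outline element, one at a time.

    Hypothetical helper for illustration; not part of Amara.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "outline":
            url = elem.get("xmlUrl")
            if url is not None:
                yield url
            # Discard the element's text, attributes and children once it
            # has been handled. The emptied element itself stays attached
            # to its parent, so memory use is small but not strictly constant.
            elem.clear()

# Made-up sample data standing in for foo.opml.
SAMPLE_OPML = b"""<opml version="1.1">
  <body>
    <outline text="A" xmlUrl="http://example.org/a.rss"/>
    <outline text="Group">
      <outline text="B" xmlUrl="http://example.org/b.rss"/>
    </outline>
  </body>
</opml>"""

urls = list(iter_outline_urls(io.BytesIO(SAMPLE_OPML)))
for url in urls:
    print(url)
```

The caller sees a plain iterator of values, while the parser only ever holds the element currently being processed plus a skeleton of its ancestors.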



As an illustration for general users, the following code prints all verses containing the word 'begat' in Jon Bosak's Old Testament in XML, a 3.3MB document, again without ever needing to have the entire document in memory (although there is always the possibility that the loop will outrun Python's garbage collector).



from amara import binderytools
for frag in binderytools.pushbind('v', source='ot.xml'):
    text = unicode(frag.v)
    if text.find('begat') != -1:
        print text.encode('utf-8')  # There is some non-ASCII text in ot.xml


I personally think that Pushbind handles just about any of the cases that make people turn to SAX.
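For comparison, here is a rough sketch of what the same 'begat' search looks like when written directly against SAX, using the standard library's xml.sax in modern Python 3. The class name VerseHandler and the two-verse sample document are invented for this example; the sample is not taken from ot.xml.

```python
import xml.sax

class VerseHandler(xml.sax.ContentHandler):
    """Collect the text of every <v> element that contains a given word.

    Hypothetical handler for illustration only.
    """

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.matches = []
        self._buffer = None  # None while we are outside a <v> element

    def startElement(self, name, attrs):
        if name == "v":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "v" and self._buffer is not None:
            text = "".join(self._buffer)
            if self.word in text:
                self.matches.append(text)
            self._buffer = None

# Made-up sample data standing in for ot.xml.
SAMPLE = b"<book><v>Abraham begat Isaac.</v><v>In the beginning.</v></book>"

handler = VerseHandler("begat")
xml.sax.parseString(SAMPLE, handler)
for verse in handler.matches:
    print(verse)
```

The state tracking across three callbacks is the bookkeeping that an iterator-style interface hides from the caller.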




6 Comments

etoipiisminus1
2005-01-18 08:01:54
Your example
First, let me say Amara really looks nice! Thanks!


But I'm trying your example and I get tons of lines like:



instead of the text I expected... I'm missing something basic here, I'm sure.


This is on Windows XP, Python 2.3.4, latest Amara and 4suite.

etoipiisminus1
2005-01-18 11:38:16
Your example
Ok, I think I've figured it out. This works as I expect.

from amara import binderytools
for frag in binderytools.pushbind('v', source='ot.xml'):
    text = unicode(frag.v)
    if text.find('begat') >= 0:
        print text.encode('utf-8')

Is that what you meant?
uche
2005-01-18 19:04:46
Your example
Oops. Yes. That's what I get for modifying after I test :-). Modified in the article.


BTW, after another comment got me thinking, I decided to update pushbind so that the extra frag thingie is no longer required. As of 0.9.3 (soon to be released), you can write:


from amara import binderytools


for v in binderytools.pushbind('v', source='ot.xml'):
    text = unicode(v)
    if text.find('begat') != -1:
        print text.encode('utf-8')


Thanks.

effbot
2005-01-22 05:36:38
pushbind vs. sax
"I personally think that Pushbind handles just about any of the cases that make people turn to SAX"


Did you run your sample on the full OT.XML file? On my 3 GHz PC, it takes over two minutes to finish, using around 50% CPU throughout (which is enough to bring the fans up to full speed).


(cElementTree does the same thing in about 0.15 seconds, printing included, at no noticeable load)


And if I press control-C during the parse, the process prints an error message, but doesn't return. It just sits there, burning CPU like crazy.

uche
2005-01-22 12:09:02
pushbind vs. sax
Hmm. Yes, I did. On my Dell Inspiron 8600 machine, which sounds pretty comparable to yours, it takes about 10 seconds. But wait, there's one thing: By the time I did get around to trying out ot.xml, I was using CVS which was closer to the (just released) 0.9.3 than to the 0.9.2 released at the time.


Your two minutes threw me back in my chair, though, so I tried it with Amara 0.9.2. Yes indeed. About 2 CPU-churning minutes. The funny thing is that I hadn't even noticed this disparity (I've been more preoccupied with correctness than speed). Looking through the changes in the relevant time-frame, the most likely cause is that I removed some crufty code for XPath location tracking (no longer necessary). The only conclusion I can draw is that the removed code was a pig.


Anyway, I'd be interested to see whether you can reproduce the dramatic speedup in 0.9.3.


Thanks for the note.


--Uche


effbot
2005-01-22 14:04:13
pushbind vs. sax
on my machine, the 0.9.3 pushbind is 10-15 times faster than before (12x on my main benchmark), bind_file is a bit faster, while pushdom seems to have slowed down slightly.


(but despite the slowdown, pushdom is still the fastest way to parse things under amara, at least for my samples).