A handy Python script for parsing the O'Reilly book catalogue.

by Jacek Artymiak

I'm working on an online bookstore project, which I'd like to automate as much as possible. Things like book insertion and deletion ought to happen automatically, with only an occasional need to touch the code or HTML. My language of choice for this project is Python, with a touch of urllib and re. Since one of my sources of book titles and ISBNs is the O'Reilly book catalogue, I thought I'd share this little script with other ORA fans.



#!/usr/bin/python

import urllib, re, sys

try:
    page = urllib.urlopen('http://www.oreilly.com/catalog/prdindex.html')
except IOError, (errno, strerror):
    sys.exit("I/O error(%s): %s" % (errno, strerror))

title = ""
isbn = ""
price = ""

page = page.read()
page = page.replace("\n", "")
page = page.replace("\r", "")
page = page.replace("> ", ">")
page = page.replace("&nbsp;", " ")
page = page[page.find("<b>Examples</b></td>") + len("<b>Examples</b></td>"):]

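while 1:
    # Skip ahead to the next table row.
    page = page[page.find("<tr ") + len("<tr "):]

    if len(page) <= 1:
        break

    # Jump to the next catalogue link. When find() returns -1 the slice
    # keeps at most one character, which ends the loop.
    page = page[page.find("http://www.oreilly.com/catalog/"):]

    if len(page) <= 1:
        break

    # The title is the link text, right after the "> that closes the tag.
    page = page[page.find("\">"):]
    page = page[2:]

    title = page[:page.find("</a>")]

    # The next cell holds the ISBN; drop the hyphens.
    page = page[page.find("\">"):]
    page = page[2:]

    isbn = page[:page.find("</td>")]
    isbn = isbn.replace("-", "")

    # The cell after that holds the price.
    page = page[page.find("\">"):]
    page = page[2:]

    price = page[:page.find("</td>")]

    print title + ":" + isbn + ":" + price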
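
The slicing loop above depends on the cells arriving in exactly the right order. For comparison, here is a minimal sketch of the same extraction done with re, written for Python 3 rather than the Python 2 of the script. It is not how the script itself works. The SAMPLE_ROWS markup, the placeholder titles, ISBNs and prices in it, the ROW_RE pattern, and the parse_rows helper are all assumptions made up for illustration; the real catalogue page may be laid out differently.

```python
import re

# Hypothetical markup, shaped like the rows the script appears to expect:
# a catalogue link, then one cell for the ISBN and one for the price.
SAMPLE_ROWS = (
    '<tr bgcolor="#ffffff">'
    '<td><a href="http://www.oreilly.com/catalog/example1/">Example Title One</a></td>'
    '<td align="left">0-000-00000-1</td>'
    '<td align="right">$10.00</td>'
    '</tr>'
    '<tr bgcolor="#eeeeee">'
    '<td><a href="http://www.oreilly.com/catalog/example2/">Example Title Two</a></td>'
    '<td align="left">0-000-00000-2</td>'
    '<td align="right">$20.00</td>'
    '</tr>'
)

# One match per book: the link text, then the contents of the next two cells.
ROW_RE = re.compile(
    r'<a href="http://www\.oreilly\.com/catalog/[^"]*">(?P<title>.*?)</a>'
    r'.*?<td[^>]*>(?P<isbn>.*?)</td>'
    r'.*?<td[^>]*>(?P<price>.*?)</td>',
    re.S,
)


def parse_rows(html):
    """Return a list of (title, isbn, price) tuples found in html."""
    records = []
    for match in ROW_RE.finditer(html):
        isbn = match.group("isbn").replace("-", "")  # same clean-up as the script
        records.append(
            (match.group("title").strip(), isbn, match.group("price").strip())
        )
    return records


# Print in the same title:isbn:price format as the script.
for title, isbn, price in parse_rows(SAMPLE_ROWS):
    print(title + ":" + isbn + ":" + price)
```

One thing to keep in mind with the title:isbn:price format: a book title can contain a colon itself, so anything that reads these lines back is probably safer splitting from the right, for example with line.rsplit(":", 2).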