I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need form automation, so I’m using urllib2 rather than Mechanize, which my friend Ron Reiter recommended.
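For plain fetching, urllib2 needs almost no ceremony. A minimal sketch (the URL is just a placeholder):

    import urllib2

    # Fetch a page's raw HTML -- no form handling, no cookies, just bytes.
    page = urllib2.urlopen('http://example.com/').read()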
At first, when I wanted to get some information from a web page quick-and-dirty style, I used regular expressions. This approach works, but it isn’t much fun to write or to maintain. So for the next harvesting task, I decided to learn BeautifulSoup. BeautifulSoup has an excellent interface, and a parser that copes with messed-up (read: random.shuffle()-ed) tags.
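To give a taste, here’s the regex approach on a made-up snippet of markup. It works, right up until the markup shifts by one attribute:

    import re

    # Quick-and-dirty extraction -- brittle, since any change to the markup breaks it.
    page = '<a href="/artist/42">Fiona Apple</a> <a href="/artist/7">Tori Amos</a>'
    names = re.findall(r'<a href="/artist/\d+">([^<]+)</a>', page)
    print names  # ['Fiona Apple', 'Tori Amos']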
Unfortunately, BeautifulSoup is based on Python’s built-in HTML parser (htmllib.HTMLParser), which is a bad excuse for an HTML parser.
I decided to give up on it when I tried to parse a page that had JavaScript in it, and the JavaScript contained a string with HTML inside. HTMLParser choked on it, and as a result the page was unparseable with BeautifulSoup.
Distraught, I remembered another HTML parser I knew of: lxml. I played with it a little, and it ate up all the pages that BeautifulSoup choked on, then asked for more. Efficiently.
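Here’s a toy version of the kind of page that broke HTMLParser, and lxml shrugging it off (the markup is made up):

    from lxml import html

    # A JavaScript string literal containing markup -- libxml2 treats <script>
    # content as opaque, so the inner "tag" doesn't derail the parse.
    page = '<html><body><script>var s = "<b>oops</b>";</script><p>hi</p></body></html>'
    tree = html.fromstring(page)
    print tree.findtext('.//p')  # hi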
Now I had a problem. I already had a harvester written against BeautifulSoup’s interface, and after looking at lxml’s, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere near as quick and easy as BeautifulSoup’s. Also, it uses XPath.
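To see the gap, here’s the same query in both dialects, on a toy snippet (assuming BeautifulSoup 3’s API and lxml’s EXSLT regular-expression support):

    import re
    from BeautifulSoup import BeautifulSoup
    from lxml import html

    page = '<a href="amg.dll?sql=11:abc">Artist</a><a href="amg.dll?sql=10:def">Album</a>'

    # BeautifulSoup: readable keyword arguments.
    soup = BeautifulSoup(page)
    links = soup.findAll('a', href=re.compile('sql=11:'))

    # Raw lxml: the same query spelled out as XPath.
    tree = html.fromstring(page)
    links = tree.xpath('.//a[re:test(@href, "sql=11:")]',
                       namespaces={'re': 'http://exslt.org/regular-expressions'})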
My solution: a BeautifulSoup-style interface wrapper for lxml, which I present here. It mostly does nasty XPath conversions, and it let me keep my already-written harvester working as well as write the next one. I also gave it to a friend, who found it useful.
Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using Firebug’s “Inspect”, we can see that all the interesting links are in a table with the id “large-list”. We can also see that all the artist links have “sql=11:” in them.
So, our code looks like this:
In [4]: soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde')
In [5]: names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))]
In [6]: print names
['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova', 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]
So, here is the relevant code. It mostly translates nice function calls into nasty XPath. It is not especially well documented, since it was a quick-and-dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.
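To give the flavor, here is a minimal sketch of how such a wrapper can look. The names fetch_soup, find, find_all and all_text match the session above (in use they’d live in a module, e.g. parse_html), but the translation logic here is only an illustration, assuming lxml’s XPath engine and its EXSLT regular-expression extension:

    import urllib2
    from lxml import html


    class SoupElement(object):
        """A thin, BeautifulSoup-flavored wrapper around an lxml element."""

        _NS = {'re': 'http://exslt.org/regular-expressions'}

        def __init__(self, element):
            self._element = element

        def _to_xpath(self, tag, id, attrs):
            # Translate nice keyword arguments into a (nasty) XPath expression.
            conditions = []
            if id is not None:
                conditions.append('@id="%s"' % id)
            for name, value in attrs.items():
                if hasattr(value, 'pattern'):  # a compiled regular expression
                    conditions.append('re:test(@%s, "%s")' % (name, value.pattern))
                else:
                    conditions.append('@%s="%s"' % (name, value))
            xpath = './/%s' % (tag or '*')
            if conditions:
                xpath += '[%s]' % ' and '.join(conditions)
            return xpath

        def find_all(self, tag=None, id=None, **attrs):
            xpath = self._to_xpath(tag, id, attrs)
            return [SoupElement(e)
                    for e in self._element.xpath(xpath, namespaces=self._NS)]

        def find(self, tag=None, id=None, **attrs):
            matches = self.find_all(tag, id, **attrs)
            return matches[0] if matches else None

        def all_text(self):
            # The text of this element and all its descendants, concatenated.
            return self._element.text_content()


    def fetch_soup(url):
        return SoupElement(html.fromstring(urllib2.urlopen(url).read()))

The one design choice worth noting: regex attribute values are handled with EXSLT’s re:test(), which lxml supports out of the box, so BeautifulSoup-style href=re.compile(...) calls translate directly.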