Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from allmusic.com and a (very) naive implementation of PageRank.
The PageRank implementation follows the Wikipedia description almost word for word. It is not efficient, but it works well enough for this presentation. I included it and the allmusic.com example mostly because I thought the results were pretty cool, and because they make very good teaching material.
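To give a feel for what "naive" means here, below is a minimal sketch of PageRank by plain iteration over a dictionary. It is not the code from the presentation: the function name `pagerank`, the `damping` and `iterations` parameters, and the tiny `influences` graph with its artist names are all made up for illustration. I am also assuming that an edge from A to B means "A lists B as an influence".

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Naive PageRank. `graph` maps each node to the list of nodes it links to."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start with the rank spread evenly over all nodes.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Every other node splits its rank evenly among its targets.
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical data: an edge A -> B means "A lists B as an influence".
influences = {
    "artist_a": ["artist_c"],
    "artist_b": ["artist_c"],
    "artist_c": ["artist_d"],
    "artist_d": [],
}

ranks = pagerank(influences)
for name, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 3))
```

Each pass touches every edge once and the number of passes is fixed, so this is fine for a few thousand artists but would need a convergence check and a sparse-matrix representation for anything much larger.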
Here is the presentation, and the code is available here.
Here is how to run it:
D:\work\pywebil-harvesting\upload>allmusic.py "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl
simple_pagerank.py out.pkl
Happy harvesting!
Nice presentation!
In the presentation you mention BeautifulSoup and that it is based on HTMLParser, which can be problematic. Note that up to version 3.0.7a, BeautifulSoup used SGMLParser, which is much more robust. For this reason, as of this writing (July 2009), version 3.0.7a is preferable for many uses.