PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from allmusic.com and a (very) naive implementation of PageRank.

The PageRank implementation follows the Wikipedia description almost word for word. It is not efficient, but it works well enough for this presentation. I included it and the allmusic.com example mostly because I think the results are pretty cool, and because they make very good teaching material.
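For readers who have not seen PageRank before, here is a minimal sketch of the iterative formulation. It is not the code from the presentation: the function name `pagerank`, the dictionary-based graph format, the damping factor of 0.85, the fixed iteration count, and the toy artist names are all assumptions made for illustration. An edge here points from an artist to the artists who influenced them, so rank flows toward the influencers.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Naive iterative PageRank.

    graph: dict mapping each node to a list of nodes it links to
    (hypothetical format, chosen for this sketch).
    Returns a dict mapping each node to its rank; ranks sum to 1.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Nodes without outgoing links spread their rank evenly over all nodes.
        dangling = sum(ranks[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}

        # Every other node splits its rank equally among its link targets.
        for node, targets in graph.items():
            if not targets:
                continue
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share

        ranks = new_ranks

    return ranks


# Toy data (made up): artist_a and artist_b both list artist_c as an influence.
toy_graph = {
    "artist_a": ["artist_c"],
    "artist_b": ["artist_c"],
    "artist_c": [],
}
ranks = pagerank(toy_graph)
most_influential = max(ranks, key=ranks.get)
print(most_influential)
```

A fixed number of iterations keeps the sketch short; a more careful version would stop once the ranks change by less than some tolerance between rounds.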

Here is the presentation, and the code is available here.

Here is how to run it:

D:\work\pywebil-harvesting\upload>allmusic.py "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl

simple_pagerank.py out.pkl

Happy harvesting!

This entry was posted in Python.

One Response to PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

  1. Tal Einat says:

    Nice presentation!

    In the presentation you mention BeautifulSoup and that it is based on HTMLParser, which can be problematic. Note that up to version 3.0.7a, BeautifulSoup used SGMLParser, which is much more robust; for this reason, as of this writing (July 2009), version 3.0.7a is preferable for many uses.