PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using harvested data and a (very) naive implementation of PageRank.

The PageRank implementation follows the Wikipedia description of the algorithm word for word. It is not efficient, but it works well enough for this presentation. I included it and the example mostly because I thought the results were pretty cool, and because it makes very good teaching material.
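To give a rough idea of what a naive PageRank looks like, here is a minimal sketch. It is not the code from the presentation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the tiny `influenced_by` graph with its placeholder artist names are all assumptions made for illustration. Each artist points to the artists who influenced them, so rank flows toward the influencers.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Naive PageRank by repeated iteration.

    links maps each node to the list of nodes it points to.
    Returns a dict mapping each node to its rank; ranks sum to 1.
    """
    # Collect every node, including ones that only appear as targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    ranks = dict.fromkeys(nodes, 1.0 / n)

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly
        # over all nodes, so that no rank leaks out of the graph.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = dict.fromkeys(nodes, base)

        # Every node splits its damped rank equally among its targets.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share

        ranks = new_ranks

    return ranks


# Hypothetical data: each artist points to the artists who influenced them.
influenced_by = {
    "artist_a": ["artist_c"],
    "artist_b": ["artist_c"],
    "artist_c": ["artist_d"],
    "artist_d": [],
}

ranks = pagerank(influenced_by)
for artist in sorted(ranks, key=ranks.get, reverse=True):
    print(artist, round(ranks[artist], 3))
```

The sketch recomputes every rank on each pass over plain dictionaries, which is fine for a toy graph but slow for a large one. A matrix-based version would scale better.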

Here is the presentation, and the code is available here.

Here is how to run it:

D:\work\pywebil-harvesting\upload> "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl out.pkl

Happy harvesting!

One reply on “PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists”

Nice presentation!

In the presentation you mention BeautifulSoup and note that it is based on HTMLParser, which can be problematic. Note that up to version 3.0.7a, BeautifulSoup used SGMLParser, which is much more robust. For this reason, as of this writing (July 2009), version 3.0.7a is preferable for many uses.
