Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from allmusic.com and a (very) naive implementation of PageRank.
The PageRank implementation was based on wikipedia word-by-word, and is not efficient, but it works well enough for this presentation. I included it and the allmusic.com example mostly because I thought the results are pretty cool, and it’s very good teaching material.
Here is how to run it:
D:\work\pywebil-harvesting\upload>allmusic.py "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl