From time to time, I need to harvest a website, or many websites. For example, to collect the data from IMDB to run the Pagerank algorithm. Other times I need to query some non-web servers.
Usually in such cases, I have a ‘read_single_url’ function that is called in a loop from a ‘read_all_urls’ function. The straightforward implementation of this runs slowly, and not because read_single_url takes a long time to parse the pages it downloads. The delay is mostly due to the latency of network operations: each call waits for its response before the next request is sent. Even on a high-bandwidth connection, your bandwidth utilization will be quite low.
To fix this, I wrote a function named threadmap that runs each call of read_single_url in a separate thread. Just like map, threadmap runs a given function for each element in the input sequence, and returns once all the calls are complete.
Here is an example use of the function:
threadmap.threadmap(query_server, url_list, max_threads=10, on_exception=threadmap.IGNORE)
My first naive implementation just created a thread for each element in the list, and started them all simultaneously. This caused network IOErrors and other problems. I handled this by setting a maximum number of threads that may run at once (the max_threads argument).
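To illustrate the idea of capping concurrency, here is a minimal sketch in Python 3. It is not the actual threadmap code: the name bounded_threadmap and the use of a semaphore are my assumptions about one simple way to do it. Exceptions are deliberately not handled here, so a failing call simply leaves None in its slot.

```python
import threading


def bounded_threadmap(func, items, max_threads=10):
    """Call func(item) for every item, with at most max_threads running at once.

    Hypothetical sketch; returns the results in input order, like map().
    """
    items = list(items)
    results = [None] * len(items)
    slots = threading.Semaphore(max_threads)

    def worker(index, item):
        try:
            results[index] = func(item)
        finally:
            # Free the slot even if func raised, so the loop below never stalls.
            slots.release()

    threads = []
    for index, item in enumerate(items):
        slots.acquire()  # blocks while max_threads workers are still busy
        thread = threading.Thread(target=worker, args=(index, item))
        thread.start()
        threads.append(thread)

    # Return only once all the calls are complete.
    for thread in threads:
        thread.join()
    return results
```

Each worker writes to its own index of the results list, so the output order matches the input order even though the calls finish in arbitrary order.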
The next issue I had to handle was exceptions. It is not obvious what the best course of action is once the inner function raises an exception. At the very least, the exception has to be caught so that threadmap’s synchronizing code is allowed to run.
My current implementation allows for a few different behaviors: ignoring the exception, aborting threadmap, retrying, and returning a default value for the problematic call. To implement these behaviors, I used the traceback module, after reading Ian Bicking’s excellent explanation of exception re-raising.
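The four behaviors could be sketched roughly as follows, in Python 3. This is only an illustration under stated assumptions, not the actual code: of the policy names, only IGNORE appears in the example call above; ABORT, RETRY, DEFAULT, the default and max_retries parameters, and both function names are hypothetical. For brevity it starts one thread per item, and "abort" here waits for all threads before re-raising. In Python 3 the traceback travels with the exception object, so the sketch stores the exception and re-raises it in the calling thread instead of using the traceback module.

```python
import threading

# Only IGNORE appears in the post; the other policy names are hypothetical.
IGNORE, ABORT, RETRY, DEFAULT = "ignore", "abort", "retry", "default"


def call_with_policy(func, item, on_exception=IGNORE, default=None, max_retries=3):
    """Run func(item) under a policy; return a (value, error) pair."""
    attempts = max(1, max_retries) if on_exception == RETRY else 1
    error = None
    for _ in range(attempts):
        try:
            return func(item), None
        except Exception as exc:
            error = exc  # remember the last failure
    if on_exception == ABORT:
        return None, error  # the caller re-raises this
    if on_exception == DEFAULT:
        return default, None
    # IGNORE, or RETRY with no attempts left: give up on this item.
    return None, None


def threadmap_with_policy(func, items, **policy):
    items = list(items)
    results = [None] * len(items)
    errors = [None] * len(items)

    def worker(index, item):
        # Exceptions are caught inside call_with_policy, so every worker
        # finishes normally and the join() calls below can complete.
        results[index], errors[index] = call_with_policy(func, item, **policy)

    threads = [threading.Thread(target=worker, args=(i, item))
               for i, item in enumerate(items)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    for error in errors:
        if error is not None:
            raise error  # surfaces in the caller, with the worker's traceback
    return results
```

With the ignore policy a failed call leaves None in its slot, so a caller cannot tell it apart from a call that really returned None; the default policy exists for exactly that case.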
For those interested, here’s a copy of the code. I’ll be glad to read any comments or suggestions about it.
“For example, to collect the data from IMDB to run the Pagerank algorithm.”
You know IMDB make their data publicly available for download, right? http://www.imdb.com/interfaces
Now I do, thanks.
However, last time I used it, it was for a completely different harvest :)
I actually found out only yesterday that IMDB provides those services (though not for free). But what I wanted to say, as a guy who writes data mining bots almost for a living: awesome code :)
Three years on, I still find this small library very useful. How about uploading it to github?