Categories
Databases Design Programming

A Simple Race-Condition

Lately, I’ve mostly been working on my startup. It’s a web-application, and one of the first things I’ve written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won’t present the code itself here, instead I’ll try to present the essence of the bug.

Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let’s say we create some database table, that has the following fields:

  • Unique key: input
  • output
  • use_count *
  • Last use date *

I’ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.
Here is how code using the cache would look like (in pseudocode form):

result = cache.select(input)
if result:
    return result
result = compute(input)
cache.insert(input, result)
return result

This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condtion: What happens if there are two requests for the same input at about the same time?
Here’s a simple scheduling that will show the bug:

Thread1: result = cache.select(input). result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input) result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) – exception – duplicate records for the unique key input!

This is a classic race condition. And here’s a small challenge: What’s the best way to solve it?

Categories
Programming Python Utility Functions

Harvesting with threadmap

Dune 2 Harvester

From time to time, I need to harvest a website, or many websites. For example, to collect the data from IMDB to run the Pagerank algorithm. Other times I need to query some non-web servers.

Usually in such cases, I have a ‘read_single_url’ function that is called in a loop from a ‘read_all_urls’ function. The straightforward implementation of this will run slowly. This is not because read_single_url takes a lot of time to parse the websites it downloads. The delay is mostly due to the latency of network operations. Even on high bandwidth connections, your bandwidth utilization will be quite low.

To fix this, I wrote a function named threadmap that runs each call of read_single_url in a separate thread. Just like map, threadmap runs a given function for each element in the input sequence, and returns once all the calls are complete.

Here is an example use of the function:

threadmap.threadmap(query_server,
                    url_list,
                    max_threads=10,
                    on_exception=threadmap.IGNORE)

My first naive implementation just created a thread for each element in the list, and started them all simultaneously. This caused network IOErrors and other problems. This issue was handled by setting a maximum number of threads that may run at once.

The next issue I had to handle was exceptions. It is not obvious what is the best course of action once the inner function raises an exception. At the least, the exception has to be handled so that threadmap’s synchronizing code may be allowed to run.
My current implementation allows for a few different behaviors: ignoring the exception, aborting threadmap, retrying, and returning a default value for the problematic call. To implement these behaviors, I used the traceback module, after reading Ian Bickings’ excellent explanation of exception re-raising.

For those interested, here’s a copy of the code. I’ll be glad to read any comments or suggestions about it.