A Simple Race-Condition

Lately, I’ve mostly been working on my startup. It’s a web application, and one of the first things I wrote was a caching mechanism for some lengthy operations. Yesterday, I found a classic race condition in that module. I won’t present the code itself here; instead, I’ll try to present the essence of the bug.

Consider a web application that must perform lengthy operations from time to time, either IO-bound or CPU-bound. To save time, and perhaps also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let’s say we create a database table with the following fields:

  • input (unique key)
  • output
  • use_count *
  • last use date *

I’ve marked with an asterisk the fields that are optional and relate to managing the size of the cache; they are not relevant at the moment.
Here is how code using the cache would look (in pseudocode form):

result = cache.select(input)
if result:
    return result
result = compute(input)
cache.insert(input, result)
return result

This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race condition: what happens when there are two requests for the same input at about the same time?
Here’s a simple scheduling that exposes the bug:

Thread1: result = cache.select(input) – result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input) – result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) – exception – duplicate record for the unique key input!
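
To make this concrete, here is a runnable sketch that forces this exact interleaving, using sqlite3 as the shared database and a barrier so both threads finish “computing” before either inserts (everything here is illustrative; it is not the real module):

import sqlite3
import threading

DB = "cache_demo.db"

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS cache (input TEXT PRIMARY KEY, output TEXT)")
setup.execute("DELETE FROM cache")
setup.commit()
setup.close()

barrier = threading.Barrier(2)  # forces the problematic scheduling

def handle_request(input):
    conn = sqlite3.connect(DB)  # one connection per thread
    row = conn.execute("SELECT output FROM cache WHERE input = ?", (input,)).fetchone()
    if row:
        return row[0]
    result = input.upper()      # stand-in for the lengthy compute()
    barrier.wait()              # both threads get here before either inserts
    try:
        conn.execute("INSERT INTO cache VALUES (?, ?)", (input, result))
        conn.commit()
    except sqlite3.IntegrityError as e:
        print("duplicate insert:", e)   # the loser of the race lands here
    finally:
        conn.close()
    return result

threads = [threading.Thread(target=handle_request, args=("some input",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()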

This is a classic race condition. And here’s a small challenge: What’s the best way to solve it?

14 Responses to A Simple Race-Condition

  1. Erez says:

    Assuming this is a class method (not really important), I would put in the __init__ something like:
    self.global_compute_locks = collections.defaultdict(threading.Lock)

    And then:
    [python]
    result = cache.select(input)
    if result:
        return result

    with self.global_compute_locks[input]:
        result = cache.select(input)
        if not result:
            result = compute(input)
            cache.insert(input, result)
    return result
    [/python]
    I assume the slowdown from the double cache.select pales in comparison to compute.

  2. dbrodie says:

    Well, if
    result = compute(input)
    is consistent enough between threads (meaning you don’t care which thread’s result you use) then the obvious solution is:
    [python]
    result = cache.select(input)
    if result:
        return result
    result = compute(input)
    lock(cache)
    prev_result = cache.select(input)
    if prev_result:
        unlock(cache)  # release before the early return, or the lock leaks
        return prev_result
    cache.insert(input, result)
    unlock(cache)
    return result
    [/python]
    Here I am assuming that a cache lookup is relatively cheap compared to the computation (usually true); otherwise you could just include the compute in the lock and simplify the code.

  3. budowski says:

    Another option which doesn’t require code modifications:
    When you create the table and apply the constraint (Create Unique), you could set an “Ignore duplicate key” option (for example in MSSQL: http://msdn.microsoft.com/en-us/library/csh347s8(VS.71).aspx).

    This means that the second INSERT will be ignored; the DB will only issue a warning rather than an error.

    I’m pretty sure MySQL and Oracle also have this option.
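
    For illustration, a minimal sketch of such an index in MSSQL, issued through any Python DB-API connection (conn and the table name cache are assumptions, not from the post):
    [python]
    cur = conn.cursor()
    cur.execute("""
        CREATE UNIQUE INDEX ix_cache_input
        ON cache (input)
        WITH (IGNORE_DUP_KEY = ON)
    """)
    conn.commit()
    [/python]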

  4. Inger says:

    Why can’t you just add the input to the table with NULL as the output, and then use an UPDATE statement to fill in the result when it’s done? This way you update only one row. (Of course, I assume here that all operations complete successfully; otherwise you’ll need to delete the row, or maybe cache the error.)

    No locking mechanisms are required here.
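
    A rough sketch of that idea (DuplicateKeyError and cache.update are hypothetical names, matching the post’s pseudocode style):
    [python]
    try:
        cache.insert(input, None)      # reserve the row; fails if already claimed
        result = compute(input)
        cache.update(input, result)    # fill in the reserved row
        return result
    except DuplicateKeyError:
        # Someone else claimed this input; their output may still be NULL.
        return cache.select(input)
    [/python]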

  5. lorg says:

    General note:
    What I wrote in the post may apply to general cases of multiprogramming, and not just multi-threading. Because of that, every synchronization object should be applicable to these other cases, which might affect its simplicity of use.

    Most of the commenters noticed, but did not articulate, that there are actually two issues to solve:
    1. Caching, as presented in the post itself
    2. Synchronizing, i.e. knowing which operations are happening *now*

    @Erez:
    As we discussed elsewhere, you’d have to make sure that your dict is threadsafe (which is doable).
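
    For illustration, one way to do that (a sketch; the names are mine): guard the dict with its own lock, so that getting-or-creating a per-input lock is atomic:
    [python]
    import threading

    locks_guard = threading.Lock()
    compute_locks = {}

    def lock_for(input):
        # Hold the guard only for the cheap dict operation, never during compute().
        with locks_guard:
            return compute_locks.setdefault(input, threading.Lock())
    [/python]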

    @Inger:
    I disagree. If another thread discovers the row already exists with a NULL output, it needs to either wait for the result to be put there, or compute it itself (which means computing it twice). If you want it to wait, you need some kind of synchronization mechanism.

    @dbrodie:
    You can do that, or you can just catch the exception and ignore it :)

    @Brodie:
    Your solution is equivalent to Budowski’s, just with locking, which is unneeded.

    My solution:
    Well, since I’m all for doing the least work required, I didn’t use a synchronization mechanism, and in effect used budowski’s and dbrodie’s solution: I caught and ignored the exception raised for duplicate rows. Once this mechanism is more heavily used, and I’m interested in preventing duplicate computations happening at the same time, I’ll implement some kind of synchronization, probably along the lines Erez suggested.
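
    In sketch form (the exact exception depends on the DB layer; DuplicateKeyError is a stand-in):
    [python]
    result = cache.select(input)
    if result:
        return result
    result = compute(input)
    try:
        cache.insert(input, result)
    except DuplicateKeyError:
        pass  # another request cached the same input first; the results are equivalent
    return result
    [/python]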

    Also, thanks to everyone for your comments: I appreciate the discussion on this issue. I think it’s a worthwhile discussion that makes you think.

  6. silky says:

    Well, surely the insert would check to see if the entry exists before it inserts, given that compute() takes time.

    The loss would be that it computed() for nothing, but then if you have lots of threads and another thread has already acted on the first set of data, life is good anyway, because the second insert() is skipped.

    That would be my approach.

    A similar option would be for compute() to check, occasionally, whether it still needs to continue. This may or may not be appropriate depending on what it does, but it would let it short-circuit longer calculations if some other thread has already done them.
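
    A sketch of that second idea (all helper names are hypothetical):
    [python]
    def compute(input):
        state = initial_state(input)
        for step in work_steps(input):
            # Occasionally check whether another thread already cached this input.
            cached = cache.select(input)
            if cached:
                return cached  # short-circuit: someone else finished first
            state = advance(state, step)
        return final_result(state)
    [/python]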

  7. Jack Pepper says:

    Add a table:

    create table cache_index (
        input varchar primary key,
        state enum('current', 'pending'),
        cachetime timestamp default NOW()
    );

    Your program is already multithreaded, and therefore is using semaphores already, right? So create another mutex semaphore “cachelock” to protect setting the cache_index table’s state variable:

    function iscache(requestinput) {
        cachelock.acquire_read();
        cachestate = cache_index.select("select state where input = requestinput");
        cachelock.release();
        return cachestate; /* a NULL return means no value existed at all */
    }

    So now your main code looks like this:

    cachestate = iscache(input)
    if isnull(cachestate):
        cachelock.acquire_write()
        cache_index.execute("replace into cache_index values input=input, state='pending'")
        cachelock.release()
        result = compute(input)
        cache.insert(input, result)
        cache_index.execute("replace into cache_index values input=input, state='current'")
        return result
    if cachestate == 'current':
        result = cache.select(input)
        return result
    /* implied else: cachestate == 'pending' */
    wait()?  /* I don't know the use case for waiting ... */

    And naturally you have some housekeeping thread that clears out the cache when things have gotten too old.

  8. arkon says:

    imri… mind you, you should catch the double insertion in the SQL code already, so it’s seamless to the application. I preferred this way in my code (back in Digicash), to separate the DB as much as possible from the user code. So actually I had a try/catch in the SQL code which rethrows only if it’s not a duplicate-key exception…

    good luck

  9. Emil Friðriksson says:

    I suggest you use the MySQL ‘REPLACE’ syntax instead of ‘INSERT’, as documented here: http://dev.mysql.com/doc/refman/5.0/en/replace.html

    It inserts the row if there is no row there; if there is already a row with the same unique key, it deletes that row and inserts a new one with the new data. You could also just put an exception handler around the insert, since you are aware of the possibility of duplicate inserts, but I’d use ‘REPLACE’.

  10. Emil Friðriksson says:

    You could also use ‘INSERT … ON DUPLICATE KEY UPDATE’ so you can update the use_count… that makes even more sense than my previous recommendation.

  11. Emil Friðriksson says:

    Sorry, here is the documentation for it: http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
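
    A sketch against the table from the post (cur is a MySQL DB-API cursor; the exact column names are assumptions):
    [python]
    cur.execute(
        "INSERT INTO cache (input, output, use_count) VALUES (%s, %s, 1) "
        "ON DUPLICATE KEY UPDATE use_count = use_count + 1",
        (input, result),
    )
    [/python]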

  12. Rani says:

    I don’t get all the solutions that lock *the entire cache*.
    Assuming that compute() is reentrant, there’s no reason not to have multiple concurrent compute()s (for distinct inputs) and hence the locking has to be entry-based, similar to what Jack suggested.

  13. Motoma says:

    In a situation where the compute function is computationally intensive, it is not acceptable to compute the result multiple times. In this situation, you will want the first request to compute and cache, while the remaining requests block until the result has been cached, and then query the cache again.

    A little python-ish pseudocode:

    [code]
    def query(input):
        result = cache.select(input)
        # Result has already been computed.
        if result:
            return result

        if lock(input, blocking=False):
            # Locking succeeded, meaning no one else is computing.
            # The result has never been computed, so compute and cache it.
            result = compute(input)
            cache.insert(input, result)
            unlock(input)
            return result

        # Locking failed: the result is being computed by another request.
        # Block until the computation is complete, then query again.
        # (Recursion is not necessary; we could return a code indicating to re-query.)
        lock(input, blocking=True)
        unlock(input)
        return query(input)
    [/code]

  14. ObiWan says:

    I’m quite late on this, but I think there’s a consideration that needs to be made, and about which you gave no information: the cost, in computational time and network traffic, of a cache fetch, a record lock, and a record insert.

    See, your solution (that is, ignoring the error) may be OK, but it may cause unnecessary network traffic due to the “insert” attempt; so if such traffic is a problem, it may be worth considering the use of a lock.

    Another possible solution may be the use of a shared queue managed by a separate process, which will then perform the dequeue/insert tasks in the background; I mean something like:

    result = cache.select(input)
    if result:
        return result
    result = enqueue(input)

    So the code will just check whether an entry is already cached; if not, it will pass the input value to the queue handler, which will call the “compute” function, store the result into the cache, enqueue the element for insertion, and return the result. At a “later time”, the queue-handling process will then, in the background, dequeue one element at a time in FIFO style and insert them into the database.
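
    As a rough Python sketch of that queue handler (using a worker thread rather than a separate process for brevity; cache and compute are the post’s, everything else is illustrative):
    [python]
    import concurrent.futures
    import queue
    import threading

    work_queue = queue.Queue()

    def queue_handler():
        # Background worker: dequeues inputs in FIFO style, computes, and caches.
        # Since this is the only writer, the duplicate-insert race disappears.
        while True:
            input, done = work_queue.get()
            result = cache.select(input)   # an earlier queue entry may have handled it
            if not result:
                result = compute(input)
                cache.insert(input, result)
            done.set_result(result)

    threading.Thread(target=queue_handler, daemon=True).start()

    def enqueue(input):
        done = concurrent.futures.Future()
        work_queue.put((input, done))
        return done.result()               # block until the handler has the result
    [/python]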