A Simple Race-Condition

Lately, I’ve mostly been working on my startup. It’s a web-application, and one of the first things I’ve written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won’t present the code itself here, instead I’ll try to present the essence of the bug.

Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let’s say we create some database table, that has the following fields:

  • Unique key: input
  • output
  • use_count *
  • Last use date *

I’ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.
Here is how code using the cache would look like (in pseudocode form):

result = cache.select(input)
if result:
    return result
result = compute(input)
cache.insert(input, result)
return result

This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condtion: What happens if there are two requests for the same input at about the same time?
Here’s a simple scheduling that will show the bug:

Thread1: result = cache.select(input). result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input) result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) – exception – duplicate records for the unique key input!

This is a classic race condition. And here’s a small challenge: What’s the best way to solve it?

Posted in Databases, Design, Programming | Tagged , , , , , | 14 Comments

Debugging in IE.

I did something I shouldn’t have done: from JavaScript, I appendChildNodes()-ed some text and an img to an existing img. I apologize. Firefox told me it was OK. To put it more accurately, Firefox didn’t tell me anything, and just didn’t show the text and the img, which was what I wanted it to do. IE really didn’t like it.

Finding out what IE was so upset about wasn’t fun, as I didn’t have a debugger for IE. So I started looking for one. Not wanting to install visual studio just for that, I installed(*) “Microsoft Script Debugger”, which is one old piece of software. It’s so old that its readme states that it works with IE 4. At least it works. It lacks watches and some other features you’d expect from a debugger (which Firebug has!), but it got the job done. Mostly.

I wasted about 40 minutes on that issue.

* Installing “Microsoft Script Debugger”, contrary to what some my tell you, does not require installing old office versions. I followed a download link on Microsoft’s website, and got it. It does require some voodoo if you’re running Vista, but nothing too hard to handle.

Posted in Javascript, Programming, rants | Tagged , , , , | 1 Comment

Actual Data Always Needs To Be Explicit

This might seem obvious, but it wasn’t to me it first. Now I consider it a database design rule of thumb, or even a patten.
I’ll explain using an example. Consider an application where you also need automatic tagging of text. (As in generating keywords.) So you’ll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.

You think this might be enough, as your smart tagging algorithm can add tags and attach tags to blog posts. If you want to change it manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.

The problems start when you want to run the algorithm more than once.
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn’t require any change to the DB. Now, let’s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.

What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.

How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark “negative tags” , which are also pieces of information. The easiest way I found of doing this is adding a boolean “valid” column to the mapping table.

It’s important to note that this also applies to all mapping types and not just to many-to-many. So even when you don’t naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example “relationship type” in a social network or “tag weight” as in our example, you would already have a separate table anyway.

I encountered this issue when I implemented my multiple source db-design. A reminder: I had data collected from various sources, and then combined together to a final merged record. The combining was done automatically.
My mistake was that I only considered the records as pieces of data, and didn’t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.

Posted in Databases, Design, Programming, startup | Tagged , , | Leave a comment

Computing Large Determinants in Python

Story:
For my seminar work, I had to calculate the determinant of a large integer matrix.
In this case, large meant n>=500. You might say that this isn’t very large and I would agree. However, it is large enough to overflow a float and reach +inf. Reaching +inf is bad: once you do, you lose all the accuracy of the computation. Also, since determinant calculations use addition as well as subtraction, if +inf is reached in the middle of the calculation the value might never drop back, even if the determinant is a “small” number.

(Note: Even if the calculation hadn’t reached +inf, but just some very large floating point number, the same phenomenon could have occurred.)

As I said, my matrix was composed of integers. Since a determinant is a sum of multiples of matrix elements, you would expect such a determinant to be an integer as well. It would be nice to take advantage of that fact.

How should one compute such a determinant? Well, luckily Python has support for arbitrarily large integers. Let’s compute the determinant with them!
Unfortunately, Python’s numpy computes determinants using LU decomposition, which uses division – hence, floating point values.
Well, never mind, I know how to calculate a determinant, right?
Ten minutes later the naive determinant calculation by minors was implemented, with one “minor” setback – it takes O(n!) time.
Googling for integer determinants yielded some articles about division free algorithms, which weren’t really easy to implement. There was one suggestion about using Rational numbers and Gauss Elimination, with pivots chosen to tame the growth of the nominator and denominator.

So, I hitched up a rational number class, using Python’s arbitrarily large integers as nominator and denominator. Then, I got some code to do Gauss Elimination, although my pivoting wasn’t the most fancy. This seemed to work, and I got my desired large determinants.

Epilogue:
Not too long ago, I ran across mpmath, a Python library for arbitrarily large floating point numbers. It is possible I could have used it instead of my own rational numbers. Next time I run into a relevant problem I’ll keep this library in mind.
Also, an even shorter while ago, I became aware that since 2.6 Python boasts a fractions module. This will sure come in handy in the future.

Code:
Available here. If you intend to use it in Python 2.6 and above, I suggest replacing my Rational class with Python’s Fraction.
(Note that I adapted this module to my needs. Therefore it has a timer that prints time estimations for computations, and it uses Psyco.)

Posted in Algorithms, computer science, Math, Programming, Python, Research | Tagged , , , , , | 2 Comments

Easy Harvesting

Harvester
Image by existentist.

I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need forms automation, so I’m using urllib2 and not Mechanize like my friend Ron Reiter recommended.

At first, when I wanted to get some information from a web page quick&dirty-wise, I used regular expressions. This approach works, but is not especially fun to write or to maintain. So for the next harvesting task, I decided to learn Beautifulsoup. Beautifulsoup has an excellent interface, and a parser that deals with messed up (read: random.shuffle()-ed) tags.

Unfortunately, Beautifulsoup is based on Python’s built in html parser, (htmllib.HTMLParser), which is a bad excuse for an html parser.
I decided to give up on it, when I tried to parse a page that had Javascript in it, and the Javascript had a string that contained html. HTMLParser choked on it, and as a result, the page was unparse-able with BeautifulSoup.

Distraught, I remembered that I knew of some other html parser called lxml. I played with it a little, and it seemed to eat up all the pages that BeautifulSoup choked on, and then asked for more. Efficiently.

Now, I had a problem. I already had a harvester written using BeautifulSoup’s interface, and after looking at lxml’s interface, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere as quick&easy as BeautifulSoup’s. Also, it used xpath.

My solution: a BeautifulSoup interface wrapper for lxml, which I present here. It mostly does nasty xpath conversions, and it allowed me to make my already written harvester work as well as to write the next harvester. I also gave it to a friend who found it useful.

Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using FireBug’s “inspect”, we can see that all the interesting links are in a table with the id “large-list”. Also, we can see that all the artist links have “sql=11:” in them.
So, our code looks like this:

In [4]: soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde')
In [5]: names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))]
In [6]: print names
['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova',
 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]

So, here is the relevant code. It mostly translates nice function calls to nasty xpath. It is not really well documented, as it was a quick and dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.

Posted in Programming, Python, Utility Functions | Tagged , , , | Leave a comment

You know you need to install a new Python version when…

>>> import decimal
>>> decimal.Decimal('0.2') <&nbsp; 0.3
False
>>>

This little gem took me two hours to track down.
It turns out that since my code is using sqlobject, it also uses the decimal module. I had some constants set up, and from within some larger algorithm I wanted to compare values extracted from the db to those constants. However, my code didn’t seem to work, and I was sure the problem lay somewhere else in the algorithm.
Two hours later, and I found this comparison. It might be similar to this issue, but I’m not sure.

In any case, I was using Python 2.5.2, and decimal works as expected in Python 2.6, so I guess it is indeed time for an upgrade.
After some more checking with Python 2.6, it seems that:

>>> decimal.Decimal('0.6') <&nbsp; 0.3
True

Not good.

Posted in Programming, Python | Tagged , , , | 4 Comments

A Python Rant

Last night I encountered yet again one of Python’s annoyances.
The annoyance I’m referring to is the lack of string like functions for lists. Trivial examples include find() and rfind(). Before you mention index though, it’s important to point out that index() checks for equality. I’d be much happier if instead it could take a function argument for comparisons.
A less trivial example is split(), also with a possible criterion argument. A complicated example is regular expressions.

It seems that most of these functions, when applied to lists should take at least a function argument. Maybe regular expressions for lists would be better with a key argument though.
This reminds me a bit of C++’s generic algorithms for collections.

On a similar subject, it would have been nice, if along with heapq, bisect functions would receive a key argument.

Posted in Programming, Python, rants | Tagged , , , , , | 6 Comments

Mochikit Drag&Drop Corner Case

I found myself working again on the UI for my startup. As my Javascript library, I use Mochikit. One of the reasons for that is that it’s the Turbogears builtin, and I came to like it. The other is that it’s really easy to create DOM objects with it.

In any case, Mochikit has really easy support for good looking drag&drop. However, as usual, my requirements were strange enough to fall upon the following corner case:
I wanted to add a “tool tip popup” for some text, where I would display pertinent information to said text. To make the tool tip popup thingy work, I used the following css “on mouse over” visibility trick:
[css]
.tooltip {
display: none;
}

.parent_object_class:hover .tooltip {
display: block;
}
[/css]

This works beautifully, and with a little bit of positioning, and maybe an event here and there, you can make it appear where you want.

Cue the drag&drop. I wanted to add some drag&drop based slider to that tool tip. Since I wanted to limit the “draggability” of the slider’s selector, I used the snap argument for Mochikit’s Draggable object so that if you move the mouse too far, the dragged selector stays at the limit of a predefined area.
This was all very well, and both of the tricks described worked pretty fine separately, until I tried to put them together.
When dragging and leaving the allowed area for the drag, because of the snap argument, the dragged object stays back, and mouse is no longer over a child element of the original tooltip and tooltipped text. This means that the css trick no longer applies, and the tooltip loses visibility. This would have been fine if the drag ended there. However, the drag was not ended, and at each move of the mouse, the coordinates would grow more. Since I use the drag coordinates to compute the result of the drag, I got some pretty strange results.

To work around this behavior, I used Draggable’s starteffect and endeffect optional arguments to make sure the tooltip remained visible, thus avoiding this issue.

Still, there were many other issues with all this drag&drop going around, and I decided to go for a simpler design, and not put in more time on this.
Issue sealed with a Keep It Simple Stupid.

Posted in Javascript, Programming, startup, web-design | Tagged , , , , | Leave a comment

Database Design Problem

A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.

While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.

Problem description:
We want to to have a table with collected information from various sources.

For example, let’s say we want to collect information about paintings. We’d want to have a database holding for each painting we know about its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvest information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough).

Continue reading “Database Design Problem” »

Posted in Databases, Design, startup | Tagged , , , , , , | 5 Comments

Website development and not supporting Internet Explorer 6

My partner and I started to work on a website a few months ago. We have a working prototype, and we are always improving it. My work is mostly concentrated on a smart Python backend, and on a Javascript front-end, while a thin controller acts as a messenger between the two.
Lately, I’ve worked on improving the UI. As expected, I rely heavily on CSS. I generate a lot of html elements using Mochikit and format them with CSS classes. While obviously better than the old alternatives, I still don’t like CSS. Maybe it’s because I don’t understand it deeply enough, but for me, there is still a lot of voodoo involved. An example I found, which luckily I didn’t run into yet, is collapsing margins.

Still, even with all its voodoo, CSS is bearable. At least until you get to IE. My latest run in with it was a scrolling bug, and I ran into many other issues. However, as much as I complain, I’m probably getting it easy, as when we started work, we decided not to support IE 6, at least until required.
Our reasoning was:

  • Developing for IE6 both independently and consistently with other browsers has a high cost attached to it.
  • IE6′s use rates are declining, and will decline even more by the time we launch (See these statistics for example).
  • Our first versions were mostly required as a prototype to prove our technology to potential investors.
  • As a two-men team, and a one man programming team, we are very low on development resources.

Given my latest bout of UI programming, this choice made me just a little bit happier.

Posted in CSS, Javascript, startup, web-design | Tagged , , , , | 1 Comment