Archive for the ‘Databases’ Category

Bulk INSERTs FTW

Friday, July 17th, 2009

A short while ago, I had to research some API for a company I’m consulting for. This API yields very good quality data, but isn’t comfortable enough to process it for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about a 1.2 gigs of RAM.)

Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller sized inputs (a few megs) it worked very well. However, for our real input the processing, had it not died in a fiery MemoryError (again!) would have taken more than two weeks to finish.

(As my defense regarding the MemoryError I’ll add that we added an id cache for records, to try and shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.

A Simple Race-Condition

Sunday, June 28th, 2009

Lately, I've mostly been working on my startup. It's a web-application, and one of the first things I've written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won't present the code itself here, instead I'll try to present the essence of the bug.

Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let's say we create some database table, that has the following fields:

  • Unique key: input
  • output
  • use_count *
  • Last use date *

I've marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.
Here is how code using the cache would look like (in pseudocode form):

result = cache.select(input)
if result:
    return result
result = compute(input)
cache.insert(input, result)
return result

This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condtion: What happens if there are two requests for the same input at about the same time?
Here's a simple scheduling that will show the bug:

Thread1: result = cache.select(input). result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input) result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) - exception - duplicate records for the unique key input!

This is a classic race condition. And here's a small challenge: What's the best way to solve it?

Actual Data Always Needs To Be Explicit

Friday, April 10th, 2009

This might seem obvious, but it wasn't to me it first. Now I consider it a database design rule of thumb, or even a patten.
I'll explain using an example. Consider an application where you also need automatic tagging of text. (As in generating keywords.) So you'll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.

You think this might be enough, as your smart tagging algorithm can add tags and attach tags to blog posts. If you want to change it manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.

The problems start when you want to run the algorithm more than once.
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn't require any change to the DB. Now, let's say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.

What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.

How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark "negative tags" , which are also pieces of information. The easiest way I found of doing this is adding a boolean "valid" column to the mapping table.

It's important to note that this also applies to all mapping types and not just to many-to-many. So even when you don't naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example "relationship type" in a social network or "tag weight" as in our example, you would already have a separate table anyway.

I encountered this issue when I implemented my multiple source db-design. A reminder: I had data collected from various sources, and then combined together to a final merged record. The combining was done automatically.
My mistake was that I only considered the records as pieces of data, and didn't consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.

Database Design Problem

Sunday, February 22nd, 2009

A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.

While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.

Problem description:
We want to to have a table with collected information from various sources.

For example, let's say we want to collect information about paintings. We'd want to have a database holding for each painting we know about its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvest information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough).

(more...)