Categories
Programming, Python

Simple SQLObject DB Migration how-to

I’ve been using SQLObject for plnnr.com for quite some time now, and so far my experience with it has been positive. I’ll probably switch ORMs when I move to Django, but for now it stays. And while it stays, I need to be able to upgrade my schema to add features.
SQLObject already has a tool for the job, sqlobject-admin. There are instructions on how to use it, but I found them unsatisfactory.
(By the way, both django’s ORM and sqlalchemy also have tools for that, django-south and sqlalchemy-migrate respectively.)

So here is how I use sqlobject-admin to do migrations. Note that if you’re using TurboGears 1.0, you would probably use tg-admin instead. In that case, bear in mind that tg-admin just simplifies the job by filling in various standard parameters for you; apart from that, the idea stays the same.
Notes:
* I wrote these instructions on a Windows machine. On Linux machines it should be almost the same, but might require tweaking.
* I used a specific db URI in the examples. You can change it to whatever you want.
* I once had to tweak the main sqlobject-admin file to add the current dir to sys.path. YMMV.

1. Example project:
Let’s set up a project that uses SQLObject. We’ll create a single file, ‘main.py’, with the following content:

import sqlobject
 
sqlobject.sqlhub.processConnection = sqlobject.connectionForURI('sqlite:/D|/work/sotest/sotest.sqlite')
 
class MyThing(sqlobject.SQLObject):
    bla = sqlobject.StringCol()

This is about as simple as I could get it with sqlobject.

2. Starting to use sqlobject-admin
Sqlobject-admin has quite a bit of bureaucracy to go through before you get everything to work right. For a simple project, I cheat (i.e. fake an egg :), and do the following:
a. Create a directory in your project called sqlobject-history
b. If your project name is sotest, create a directory inside your project called sotest.egg-info
c. Inside that dir create a file called sqlobject.txt
d. Inside that file write:

db_module=main
history_dir=$base/sqlobject-history

(Note that main here is the name of the module we created earlier.)
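If you prefer to script this bit of boilerplate, a small Python sketch like the following, run from the project’s root directory, should produce the same layout (adjust the name if your project isn’t called sotest):

import os

# fake the egg metadata that sqlobject-admin looks for
for directory in ("sqlobject-history", "sotest.egg-info"):
    if not os.path.isdir(directory):
        os.mkdir(directory)

with open(os.path.join("sotest.egg-info", "sqlobject.txt"), "w") as f:
    f.write("db_module=main\n")
    f.write("history_dir=$base/sqlobject-history\n")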

3. Start using sqlobject-admin
This will be the workflow with sqlobject-admin:
1. Have the creation SQL for the current code version recorded.
2. Update your code.
3. Generate the creation SQL for the new code version, *without updating the db*.
4. Create an upgrade script from the diff between the two versions.
5. Run the upgrade script.

More specifically:
1. The first time, do:

sqlobject-admin record --egg=sotest -c sqlite:/D|/work/sotest/sotest.sqlite

2. To see that everything works, do:

sqlobject-admin list --egg=sotest -c sqlite:/D|/work/sotest/sotest.sqlite

and:

sqlobject-admin status --egg=sotest -c sqlite:/D|/work/sotest/sotest.sqlite

3. Update your database definition (in the Python file). For example, change the contents of main.py to:

import sqlobject
 
sqlobject.sqlhub.processConnection = sqlobject.connectionForURI('sqlite:/D|/work/sotest/sotest.sqlite')
 
class MyThing(sqlobject.SQLObject):
    bla = sqlobject.StringCol()
    bla2 = sqlobject.StringCol()

4. Here is the critical part. Do

sqlobject-admin record --egg=sotest -c sqlite:/D|/work/sotest/sotest.sqlite --no-db-record

In the sqlobject-history directory there should now be two subdirectories, one for each version. Let’s call the old version X and the new version Y. In the old version’s directory, create a file named:
upgrade_sqlite_Y.sql (where Y is the new version’s name).
In this file, write the SQL that adds the bla2 column to the MyThing table. You can use the creation SQL files in the respective version directories as a reference.
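For this toy example, and assuming SQLObject’s default naming (the MyThing class maps to a my_thing table, and StringCol to TEXT on SQLite), the upgrade script only needs a single statement. Here is a sketch that writes it, with X and Y standing for your actual version directory names:

# assumes the default table/column naming; verify against the creation SQL files
upgrade_sql = "ALTER TABLE my_thing ADD COLUMN bla2 TEXT;\n"

# X and Y are placeholders for the real version names
with open("sqlobject-history/X/upgrade_sqlite_Y.sql", "w") as f:
    f.write(upgrade_sql)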

(Note: if we used --edit, an editor would open, and if the edited file has any content when you close it, it is saved as the upgrade script. I don’t like using this method. Also note that if you’re on Windows, you’ll have to patch sqlobject-admin to open your editor, as the command it uses only works on Linux machines.)

5. run

sqlobject-admin upgrade --egg=sotest -c sqlite:/D|/work/sotest/sotest.sqlite

6. Make sure everything is OK with sqlobject-admin status.

4. After using the upgrade script
You can use the same upgrade script for other instances of your project. Just make sure the version numbers are correct, and that the first version is recorded in the database.

I hope this will be useful for someone using SQLObject; I know I needed this kind of how-to. If you have any questions, feel free to ask them in the comments below.

Categories
Databases, Programming

Bulk INSERTs FTW

A short while ago, I had to research some API for a company I’m consulting for. This API yields data of very good quality, but it isn’t convenient enough for processing and further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt used pickle files. It worked nicely enough, but when the input data reached 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)

Afterwards we tried SQLite, with similar results. Even after clearing out the memory errors, the code (SQLite + SQLAlchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided we needed a *real* database engine, and arranged to get a nice SQL Server machine with plenty of RAM and CPUs. We used the same SQLAlchemy code, and for smaller inputs (a few megs) it worked very well. However, for our real input, the processing, had it not died in a fiery MemoryError (again!), would have taken more than two weeks to finish.

(In my defense regarding the MemoryError, I’ll add that we added an id cache for records to try to shorten the timings. We could have avoided this cache, and the MemoryError with it, but the timings would have been worse. Not to mention that most of the memory was taken up by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a CSV file (with tabs and newlines as delimiters), then do a BULK INSERT, and a short while later you’ll have your information inside.
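Here is a hedged sketch of the idea, assuming SQL Server reached through pyodbc; the connection details, table name, and file path are made up, and the path must be accessible from the database server, since BULK INSERT reads the file on the server side:

import csv
import pyodbc

# `records` stands in for whatever the API yields; hard-coded here to keep the sketch self-contained
records = [("id-1", "some value"), ("id-2", "another value")]

# step 1: dump everything to a tab-delimited file
with open(r"C:\data\records.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(records)

# step 2: let the server load the whole file in one statement
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=myserver;DATABASE=research;Trusted_Connection=yes")
conn.execute(
    "BULK INSERT dbo.records FROM 'C:\\data\\records.csv' "
    "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')"
)
conn.commit()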
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.

Categories
Databases, Design, Programming

A Simple Race-Condition

Lately, I’ve mostly been working on my startup. It’s a web application, and one of the first things I wrote was a caching mechanism for some lengthy operations. Yesterday, I found a classic race condition in that module. I won’t present the code itself here; instead, I’ll try to present the essence of the bug.

Consider a web application that has to perform lengthy operations from time to time, either IO-bound or CPU-bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let’s say we create a database table with the following fields:

  • Unique key: input
  • output
  • use_count *
  • Last use date *

I’ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.
Here is what code using the cache would look like (in pseudocode form):

result = cache.select(input)
if result:
    return result
result = compute(input)
cache.insert(input, result)
return result
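
To make this concrete, here is a minimal sqlite3 sketch of the same pattern; compute() is a stand-in for the real lengthy operation, and the primary key on input plays the role of the unique key:

import sqlite3

conn = sqlite3.connect("cache.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache ("
    "input TEXT PRIMARY KEY, "  # the unique key
    "output TEXT)"
)

def compute(input):
    # stand-in for the real lengthy operation
    return input.upper()

def cached_compute(input):
    row = conn.execute("SELECT output FROM cache WHERE input = ?", (input,)).fetchone()
    if row is not None:
        return row[0]
    result = compute(input)
    conn.execute("INSERT INTO cache (input, output) VALUES (?, ?)", (input, result))
    conn.commit()
    return result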

This code will work well under normal circumstances. However, in a multithreaded environment, or in any environment where access to the database is shared, there is a race condition: what happens if there are two requests for the same input at about the same time?
Here’s a simple interleaving that shows the bug:

Thread1: result = cache.select(input). result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input). result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) raises an exception: a duplicate record for the unique key input!

This is a classic race condition. And here’s a small challenge: What’s the best way to solve it?

Categories
Databases, Design, Programming, startup

Actual Data Always Needs To Be Explicit

This might seem obvious, but it wasn’t to me at first. Now I consider it a database design rule of thumb, or even a pattern.
I’ll explain using an example. Consider an application where you need automatic tagging of text (as in generating keywords). So you’ll have a table for objects with textual fields (for instance, blog posts), and a table for tags.
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.

You might think this is enough: your smart tagging algorithm can add tags and attach them to blog posts. If you want to change things manually, no problem, you just modify either of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.

The problems start when you want to run the algorithm more than once.
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn’t require any change to the DB. Now, let’s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.

What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.

How about deletions? What if the first time around the algorithm made a mistake and added a wrong tag, which was then manually removed? Running the algorithm again would add the tag back. We need some way to mark “negative tags”, which are also pieces of information. The easiest way I found of doing this is adding a boolean “valid” column to the mapping table.
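
To make this concrete, here is a hedged sketch of what such a mapping table could look like with SQLObject (as in the migration how-to above); the Post and Tag classes and the column names are illustrative, not a prescribed design:

import sqlobject

class PostTag(sqlobject.SQLObject):
    post = sqlobject.ForeignKey('Post')
    tag = sqlobject.ForeignKey('Tag')
    # who created the mapping: the tagging algorithm or a human
    source = sqlobject.EnumCol(enumValues=['algorithm', 'manual'])
    # False marks a "negative tag" that the algorithm must not re-add
    valid = sqlobject.BoolCol(default=True)

On a re-run, the algorithm would then only touch rows it created itself (source = 'algorithm') and respect anything marked invalid.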

It’s important to note that this applies to all mapping types, not just many-to-many. So even when you don’t naturally need a separate table for a mapping, you should consider adding one if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example a “relationship type” in a social network or a “tag weight” as in our example, you would already have a separate table anyway.

I encountered this issue when I implemented my multiple-source DB design. A reminder: I had data collected from various sources, which was then combined into a final merged record. The combining was done automatically.
My mistake was that I only considered the records themselves as pieces of data, and didn’t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as outlined in this blog post.

Categories
Databases, Design, startup

Database Design Problem

A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.

While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.

Problem description:
We want to have a table with information collected from various sources.

For example, let’s say we want to collect information about paintings. We’d want a database holding, for each painting we know about, its dimensions, painter, description, a link to an image file, etc. Since we collect this information from various sources (maybe harvesting it from multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough.)