Bulk INSERTs FTW

A short while ago, I had to research some API for a company I’m consulting for. This API yields very high-quality data, but isn’t convenient enough to process for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data reached 850 megs, it died horribly with a MemoryError.

(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)

Afterwards, we tried SQLite, with similar results. Even after clearing it of memory errors, the code (SQLite + SQLAlchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get a nice SQL Server machine with plenty of RAM and CPUs. We used the same SQLAlchemy code, and for smaller inputs (a few megs) it worked very well. However, for our real input, the processing, had it not died in a fiery MemoryError (again!), would have taken more than two weeks to finish.

(In my defense regarding the MemoryError, I’ll add that we added an id cache for records to try to shorten the run time. We could have avoided this cache, and with it the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a flat file, with tabs between fields and newlines between records.
Then do a BULK INSERT, and a short while later, you’ll have your information inside.
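A minimal sketch of the recipe, assuming made-up records, file path, table name, and columns (only the tab/newline delimiters and the BULK INSERT options come from the actual technique):

```python
import csv
import os
import tempfile

# Hypothetical records pulled from the API.
records = [
    (1, "alpha", 3.14),
    (2, "beta", 2.71),
]

# Dump everything into a tab-delimited file: tabs between fields,
# newlines between records, exactly what BULK INSERT expects below.
path = os.path.join(tempfile.gettempdir(), "records.tsv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(records)

# The matching T-SQL, run once on the server (table name is invented).
# FIELDTERMINATOR/ROWTERMINATOR mirror the delimiters used above.
bulk_insert_sql = """
BULK INSERT dbo.Records
FROM '{path}'
WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')
""".format(path=path)
print(bulk_insert_sql)
```

One statement then loads the whole file server-side, instead of tens of millions of round-trips through the ORM.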
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.


One Response to Bulk INSERTs FTW

  1. budowski says:

    In your case (really large CSV files), the BULK INSERT process probably takes up a lot of time.
    You could speed the process up by hundreds of percent: instead of CSV files, use a binary dump, where each record and column is saved in a binary format.
    The BULK INSERT command is then given a format file (a bcp format file, via the FORMATFILE parameter) that defines what each record looks like.

    In a project I worked on, this dramatically increased processing speed (we processed tens of GBs of data).

    More information on the bcp format files:
    http://msdn.microsoft.com/en-us/library/ms191516.aspx
    http://msdn.microsoft.com/en-us/library/ms191479.aspx

    (Note: BULK INSERT can also be executed using the bcp command line utility)
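For a rough idea of what such a file contains, here is a sketch of a non-XML bcp format file. The column names and types are invented for illustration; the exact field layout (version, column count, then one line per field: host field order, host data type, prefix length, data length, terminator, server column order, server column name, collation) is specified in the MSDN pages linked above.

```text
9.0
3
1  SQLINT   0  4    ""   1  Id     ""
2  SQLFLT8  0  8    ""   2  Value  ""
3  SQLCHAR  2  100  ""   3  Name   SQL_Latin1_General_CP1_CI_AS
```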