A short while ago, I had to research some API for a company I’m consulting for. This API yields very good quality data, but it isn’t convenient enough to use directly for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.
(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)
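For illustration, that first attempt was roughly of the following shape (the record layout and file name here are made up, not the real ones): everything accumulates in memory before a single dump.

```python
import pickle

def fetch_records():
    # Stand-in for iterating the real API; the record layout is an assumption.
    for i in range(1000):
        yield {"id": i, "payload": "some data"}

# All records are built up in memory before the dump; this is what collapsed
# once the input reached ~850 megs, on top of the API's own ~1.2 gig footprint.
records = list(fetch_records())
with open("records.pkl", "wb") as f:
    pickle.dump(records, f)
```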
Afterwards, we tried sqlite, with similar results. Even after clearing out the memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.
We decided that we needed some *real* database engine, so we arranged to get a nice SQL Server machine with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller inputs (a few megs) it worked very well. For our real input, however, the processing would have taken more than two weeks to finish, had it not died first in a fiery MemoryError (again!).
(In my defense regarding the MemoryError, I’ll add that we added an id cache for records to try and shorten the timings. We could have avoided this cache, and with it the MemoryError, but the timings would have been even worse. Not to mention that most of the memory was taken up by the API…)
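For concreteness, the insertion code in both attempts was roughly of this per-row ORM shape. The model, connection string, and record source below are hypothetical stand-ins, and the id cache isn’t shown.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Record(Base):
    # Hypothetical model; the real schema isn't shown here.
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    payload = Column(String(200))

engine = create_engine("sqlite:///records.db")      # or the SQL Server connection string
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
for i in range(1000):                                # stand-in for iterating the API
    session.add(Record(id=i, payload="some data"))   # one ORM object per record
    if i % 100 == 0:
        session.flush()                              # periodic flush, still row by row
session.commit()
```

Every record becomes a Python object and a separate INSERT statement, which is where both the memory and the two weeks would have gone.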
At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.
The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
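A minimal sketch of that recipe, assuming a hypothetical `records` table, file path, and connection string (note that BULK INSERT reads the file on the SQL Server machine, so the path has to be visible to it):

```python
import csv
from sqlalchemy import create_engine, text

def fetch_records():
    # Stand-in for iterating the real API.
    for i in range(1000):
        yield (i, "some data")

# Step 1: dump everything into a tab-delimited file, one line per record.
with open(r"C:\dumps\records.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(fetch_records())

# Step 2: let SQL Server load the whole file in one statement.
engine = create_engine("mssql+pyodbc://user:password@dbserver/research"
                       "?driver=ODBC+Driver+17+for+SQL+Server")
with engine.begin() as conn:
    conn.execute(text(
        "BULK INSERT records "
        "FROM 'C:\\dumps\\records.tsv' "
        "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')"
    ))
```

The win is that the rows never pass through Python objects or individual INSERT statements on their way in.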
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.
My suggestion: add FTW as a possible extension to the BULK INSERT syntax. It won’t do anything, but it will certainly fit.