PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from and a (very) naive implementation of PageRank.

The PageRank implementation follows the Wikipedia description word for word and is not efficient, but it works well enough for this presentation. I included it and the example mostly because I thought the results were pretty cool, and it’s very good teaching material.
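To give a feel for what a naive PageRank looks like, here is a minimal sketch. It is not the code from the presentation: the function name, the damping factor of 0.85, the fixed iteration count and the tiny artist graph are all assumptions made up for illustration. An edge from A to B is taken to mean "A was influenced by B", so rank flows toward the influencers.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Naive PageRank by repeated redistribution of rank.

    graph maps each node to the list of nodes it links to.
    The damping value and iteration count are illustrative defaults.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's rank evenly among its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical data: each artist maps to the artists that influenced them.
influences = {
    "band_a": ["pioneer"],
    "band_b": ["pioneer", "band_a"],
    "pioneer": [],
}
ranks = pagerank(influences)
for artist in sorted(ranks, key=ranks.get, reverse=True):
    print(artist, round(ranks[artist], 3))
```

Sorting by rank puts the artist that everyone points to at the top, which is the whole idea behind the "most influential" list.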

Here is the presentation, and the code is available here.

Here is how to run it:

D:\work\pywebil-harvesting\upload> "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl out.pkl

Happy harvesting!

Databases Programming


A short while ago, I had to research an API for a company I’m consulting for. This API yields very good quality data, but the data isn’t in a form that is comfortable to process for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)

Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller inputs (a few megs) it worked very well. However, for our real input, the processing would have taken more than two weeks to finish, had it not died in a fiery MemoryError (again!).

(In my defense regarding the MemoryError, I’ll add that we had added an id cache for records, to try to shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.
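Here is a rough sketch of the recipe, not our actual code. The table name, the file location and the helper names are hypothetical, and the sample records are made up. The script only writes the dump file and builds the statement text; running it against the server is left out. I use '0x0a' as the row terminator on the assumption that the file has bare line feeds, and encoding options will depend on your server version.

```python
import os
import tempfile


def clean(value):
    """Stringify a value and blank out the characters used as delimiters."""
    return str(value).replace("\t", " ").replace("\r", " ").replace("\n", " ")


def dump_rows(rows, path):
    """Write rows as tab-separated fields, one record per line.

    Returns the number of records written.
    """
    count = 0
    # newline="\n" keeps line endings as bare line feeds on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for row in rows:
            out.write("\t".join(clean(value) for value in row) + "\n")
            count += 1
    return count


def bulk_insert_statement(table, path):
    """Build the BULK INSERT text for a tab/line-feed delimited file.

    table and path are assumed to be trusted values, not user input.
    """
    return (
        "BULK INSERT {} FROM '{}' "
        "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '0x0a')"
    ).format(table, path)


# Hypothetical records standing in for the API output.
records = [(1, "first record"), (2, "has\ta tab"), (3, "has\na newline")]
dump_path = os.path.join(tempfile.mkdtemp(), "records.tsv")
written = dump_rows(records, dump_path)
statement = bulk_insert_statement("records", dump_path)
print(written)
print(statement)
```

Writing the file is a single streaming pass, so nothing has to be cached in memory, and the database does the heavy lifting in one go instead of one insert per record.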

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.


A New Kind of Journalism and Citizen Involvement

It’s become a fashion of late to write about the effect the Internet had on journalism, and the way people get informed. Usually the discussion revolves around blogs, twitter, how the newspapers are dying, and so on.

I’d like to point out something different that I’ve observed of late.

It started a few weeks ago, with the story of judge Drori. He acquitted a man who ran over the clerk at a parking lot, because she refused to let him leave without paying, and stood in the way of his car. After people read the story, and the actual court ruling, there was public outrage. Judge Drori wrote a ruling of about 300 pages, in which he explains the acquittal. Many people commented on the ruling itself, its length, the reasons given in it and so on.

The second story is about the farmer Shay Dromi, who was acquitted today of killing. Two years ago, two Bedouin burglars broke into his farm at night, poisoned his dog, and then went about their business of stealing his property. At least they would have, had Dromi not noticed them, confronted them, shot one to death, and wounded the other. This was amid a wave of crime and break-ins in the area, while the police weren’t doing much to stop that wave. As I said, Dromi was acquitted, the ruling was also published, and many people commented on the subject.

Now we can get to the point: usually, acquittals or convictions of the “small” people don’t merit much press. Judge Drori’s ruling probably would not have received that much publicity had he not been up for a seat at the supreme court. Dromi’s story was publicized heavily a few years back, and the Knesset even changed the self-defense law because of this case.
However, the publishing of full-text rulings is new. Except for a case I was personally involved in, I never read court rulings. I don’t really know a lot about law.

Having these two stories published online, and not only in print, allows publishers to link to the full rulings. Newspapers will never come with 300 attached pages of dense law text.
Yet online it’s as easy as creating a link – and just like that, you have citizens reading rulings, understanding court processes, having opinions, commenting, and getting involved.

I find this amazing, and it makes me optimistic. The times are a-changing.


Origami Lizard On Book

I took this quite some time ago. The book was lying on my desk; I don’t remember the specific reason I needed it at the time.
I thought I’d also prepare the instructions for this critter – they are quite easy and yield this nice lizard/dinosaur thingy. When I first created this fold, I thought it was quite an achievement to make it with four legs.

In any case, the image came out very nicely. I find it amusing that in the book I’m currently reading, there are mentions of computational origami as a prerequisite of transcendence.