Algorithms computer science Math Programming Python

Pagerank recommends movies

So how was your Yom Kippur? After Yom Kippur ended, I sat down to write something that was nagging me for quite some time. I wanted to see what pagerank had to say about movies. I’ve always liked to picture things as graphs, and looking at movies and actors as nodes in a graph is very intuitive. I also think that the idea has some merit. I like most of the movies that Edward Norton appears in, so it makes sense to give him a high score, and see which other movies Edward Norton “recommends”.

So I fired up my IDE, and started to work.

Well, at first I was a little bit disappointed, because IMDbPY doesn’t have code to read the IMDB charts. Well, I was going to extend it a little bit, but then I remembered that I already downloaded all the data I needed some time ago (more than half a year I believe), and that was good enough. So I opened up some old pickle file I had lying around, and Voila! after some playing around with the format, I had a weighted graph at my disposal. I weighted the graph according to the IMDB listing – if an actor appeared higher on the cast, his (or hers) weight to and from that movie will be higher.

I implemented a quick and dirty pagerank, and then added some code to score my recommendations higher. I did that by adding each iteration my recommendation’s weight to the movie’s score.

Without recommendations, the top spots were:

  • 24
  • The Sopranos
  • The Wizard of Oz
  • JFK
  • Cidade de Deus
  • Citizen Kane
  • King Kong
  • Voyna i mir
  • Dolce vita, La
  • The Patriot

After adding “Edward Norton” and “Tom Hanks” as recommendations, I got the following results:

  • American History X
  • Fight Club
  • 24
  • Red Dragon
  • The Sopranos
  • Catch Me If You Can
  • The Wizard of Oz
  • JFK
  • Forrest Gump
  • Cidade de Deus

Well, I was surprised a little to see 24 and the Sopranos rank so high, and also a little bit disappointed. The recommendations didn’t really add enough information, and more work will have to be done for this to work properly. It has the right ‘smell’ however. It looks like it has potential to be useful – I can just see the program, having a small table with a ‘seen/unseen’ checkbox, and a place to write your favorites, and the program will tell which movies you should watch next. This reminds me a little bit of the recommendations on soulseek and Amarok. It would be really fun to write a total recommendation engine, with the best of all worlds. Someone probably already implemented it.

In any case, I’ll work on it some more. This is interesting.

If anyone wants the code, tell me, and I’ll make it ‘online worthy’…

Personal Programming Python

Testing, 1 2 3, Testing

Finally, I’m after the test in complex functions. Not that it went as well as I wanted, but hey, you can’t have it all. Maybe gonna try again sometime later. In the meantime, I got this “today is the first day of the rest of your life” feeling. After finishing with my previous work, and the previous semester, I’m going to move to Haifa (with my girlfriend) in less than a month, start the next (and last) semester soon, and also start real work on some real projects (one of those being diStorm). Here’s to starting and ending things.

On the flipside, after finishing the test, I had some testing issues to work out. See – let’s say you got a piece of code you change. Well, the obvious thing to do is make sure it works. It is also good to find all the places that reference that piece of code, and make sure these are updated, and test those as well. It is even better if you also have some unit-testing code available (built using the easy-to-use unittest module). It’s even better if your test code actually checks the relevant pieces of code.

But alas, it can all fail due to human stupidity. In that particular instance, mine. After all this work, not running the damn unit-test code is just plain stupid. So, I hope I’ve learned my lesson. Again. Never commit untested code.

And while we are on the subject of testing, check out, over at Ned Batchelder’s place (I also happen to read and like his blog). Coverage is an excellent tool to improve your test code. Just run your test code with, likewise: -x

and then run: -r

and get some nice looking, informative results:

Name                                        Stmts   Exec  Cover
exptree                                        65     33    50%
exputils                                       85     48    56%
template_gen                                  100     67    67%
test_exputils                                  41     41   100%

Excellent. You can also get an annotated version of the source, telling you which line was run, and which wasn’t. It just doesn’t get anymore useful then that. So happy testing. I hope you fare better then me.

Programming Python Utility Functions

Small Python Utility Functions

While working with Gil Dabach on Distorm3, I found out that I’ve been missing a lot of utility functions. I’m going to write about some of them now.

def classify_to_dict(seq, key_func):
    result = {}
    for item in seq:
        key = key_func(item)
        if key in result:
            result[key] = [item]
    return result

This is really a simple function, but a very powerful one. Here is a simple demonstration:

>>> base_tools.classify_to_dict(range(5), lambda k: k%2)
{0: [0, 2, 4], 1: [1, 3]}

Note the similarity and difference from groupby.
If any of you can suggest a better name then classify_to_dict (maybe just classify?), I’ll be happy to hear it.

Some other very useful functions include arg_min and arg_max, which respectively return the index of the maximum and minimum element in a sequence. I also like to use union and intersection, which behave just like sum, but instead of using the + operator, use the | and & operators respectively. This is most useful for sets, and on some rare occasions, for numbers.

Another function I like (but rarely use) is unzip. I know, I know, paddy mentioned it is pretty obvious, and I know that it could just as well be called transpose, however, I still find using unzip(seq) a better choice then the much less obvious zip(*seq). Readability counts.

What are your favorite utility functions?

A minor update: I’ve incorporated the use of a syntax highlighter now, and it should be enabled for comments as well by the next challenge (which will be soon enough).