Posts Tagged ‘Algorithm’

Rhyme and Reason with Python

Wednesday, February 6th, 2008

After reading xkcd's blog, I went to LimerickDB.com. It looks and works just like bash.org, but for limericks. Quite amused with some of the limericks available, I wanted to try and compose my own.
The first one came out fine, but I got stuck without a rhyme for the second one:

a man named guido once thought
he'd invent a language well wrought
indentation as tabs

Uhm, no luck there... "tabs" does not rhyme well. Then an idea came... I could create a rhyme index!
How to do it?
First, I took my already written word index. Last time I used it was to solve a substitution cipher for my crypto homework.
Then, I looked for the phonemes for each word, using espeak - a text to speech program:

lorg@kild:~$ espeak -q -x hello
 h@l'oU

After doing so, I collected words that had similar phonemes endings, and there it was, a rhyme index:

In [84]: words.get_word_rhymes("tabs", word_index, rhyme_index)[:10]
Out[84]:
['crabs', 'stabs', 'macnabs',
 'dabs', 'cabs', 'scabs',
 'slabs', 'arabs', 'snobs',
 'thebes']

Note that better rhymes are listed first - according to the length of the match.
To create this list I didn't use stress information, although with some more work it may be done.

Back to our limerick, we can see that "tabs" does not have really good rhymes. Better luck with indentation maybe:

In [86]: words.get_word_rhymes("indentation", word_index, rhyme_index)[:20]
Out[86]:
['kishion', 'alleviation', 'expostulation',
 'machination', 'syrophenician', 'proposition',
 'infatuation', 'computation', 'mansion',
 'meditation', 'aggression', 'anticipation',
 'clarification', 'logician', 'gratulation',
 'discretion', 'quotation', 'presentation',
 'dissolution', 'deprecation']

And we get (not the best limerick I admit):

a man named guido once thought
he'd invent a language well wrought
with space indentation
and good presentation
we all got the language we sought

My original ending line was "it left all the perl hackers distraught"... but it ain't true, so I gave it up. The 'distraught' was found with my script as well.
If you want to look for your own rhymes, then you can use my hacked-up script, although it isn't written well enough for my taste. (For example, don't expect to see documentation :)
I'll upload a better version once I refactor it.
To save you the trouble of generating the word index and the rhyme index from scratch, I used cPickle to store them. You can use them for lookups regardless of the script.

Elegant Line Clipping Algorithm

Monday, November 26th, 2007

I actually wrote about that a long time ago, in the wiki that once was algorithm.co.il. Since the wiki is long gone, and I figured this is a very elegant algorithm, I decided I should write about it.

First, some background - while writing Ascii-Plotter, I had to plot lines. To plot lines correctly I had to clip them. Now since this was a fun project, I decided I wanted to figure out how to do that myself, even though the problem is really solved. So I sat down with pen and paper, and did a little thinking, and I came up this nice algorithm.

The idea is simply this: consider a line, that is not parallel to any of the axes. This line intersects all the four lines that make up the limits of our screen. Here is a diagram:

An intersection of a line with a bounding box

Now notice that there are six points of interest: the four intersection points, and the two endpoints of the line. For each point, we find its parameter in the parametric representation of the line, where a and b are the endpoints:

x(t) = a+t(b-a)

To find out the actual endpoints of the clipped line, we sort the six points according to their parameter, and take the two innermost points (with zero based indexing, those would be 2 and 3). What if the line is outside the box you say? We check that the result points' parameters are between 0 and 1 (which correspond to the original endpoints).

Note that for a line parallel to one of the axes, a similar method applies, but without two of the intersection points. However, it is probably easier to just calculate the clipping directly in that case.

Here is some sample code, without all the surrounding checks (for parallelism with the axes, trivial clipping etc...).
The full version is available in the code of Ascii-Plotter.

points_t = [0.0,1.0]

points_t.append( float(rect_bottom_left[0]-line_pt_1[0])/(line_pt_2[0]-line_pt_1[0]) )
points_t.append( float(rect_top_right[0]-line_pt_1[0])/(line_pt_2[0]-line_pt_1[0]) )
points_t.append( float(rect_bottom_left[1]-line_pt_1[1])/(line_pt_2[1]-line_pt_1[1]) )
points_t.append( float(rect_top_right[1]-line_pt_1[1])/(line_pt_2[1]-line_pt_1[1]) )

points_t.sort()
if points_t[2] <0 or points_t[2]>= 1 or points_t[3] <0 or points_t[2]>= 1:
    return None
result = [(pt_1 + t*(pt_2-pt_1)) for t in (points_t[2],points_t[3]) for (pt_1, pt_2) in zip(line_pt_1, line_pt_2)]
return (result[0],result[1]), (result[2], result[3])

Simple word lookup for crosswords and such…

Sunday, November 11th, 2007

I am swamped with university homework. I did manage to come up with some small somewhat useful utility.

Do you know when you need a word that ends with a double letter and then an 'n'? you would want to write it like this: "xxN$". Well, I wrote a simple tool that translates a 'word lookup syntax' into regular expressions, and then does a lookup for the word.

This raises a two questions:

What exactly is the syntax?
Small letters stand for any letter, and big letter stand for their specific letters. So for example, "bEEn" would match "been", "seen" and "keep", but not "boon". Punctuation is left unchanged, so you can use $,^, (?!xy) and so on. I didn't work too too much on it, so [] doesn't translate well (the letters inside it will come out garbled).

What is the data for the lookup?
Well, I used the excellent natural language toolkit (obviously in Python). Very easy to start using it, with a nice tutorial. It's nice knowing that such a tool is available. The source data is actually an extract from the Gutenberg project. The result words are sorted according to frequency in the source language, and the 20 most frequent words are returned.

Without further ado, here are some code snippets. (The compiled regexp is printed. Note how ugly (yet simple) it is :)
(Note: I did cut out some of the output...)

In [1]: fd = words.create_word_index()

In [2]: words.find_words("^bEEn$",fd)
^(?P<b>.)EE(?!((?P=b))|(E))(?P<n>.)$
Out[2]:
['BEEN',
 'SEEN',
 'KEEP',
 'FEET',
 'FEEL',
 'SEEK',
 'SEED',
 'MEET',
 'DEEP',
 'NEED',
 'SEEM',
 ...]

In [3]: words.find_words("^jvfpf$",fd)
^(?P<j>.)(?!((?P=j)))(?P<v>.)(?!((?P=j))|((?P=v)))(?P<f>.)(?!((?P=j))|((?P=f))|((?P=v)))(?P<p>.)(?P=f)$
Out[3]:
['THERE',
 'THESE',
 'WHERE',
 'JESUS',
 'MOSES',
 'HOSTS',
 'LINEN',
 'PIECE',
 'SCENE',
 'WHOSO',
 'POSTS',
 'EPHAH',
 ...]

Note that if you find some of the result words a bit too uncommon for your taste (e.g. Ephah), I would suggest removing the bible from the source data... :)

The (very ugly, uncommented, draft code) is available here. I apologize for its unseemliness, but I figured that it's better to make it available in its current form, rather than never put it online for lack of time and will. It could also be seen that the code isn't terribly efficient. It is however, 'good enough' for interactive work.

Pagerank recommends movies

Saturday, September 22nd, 2007

So how was your Yom Kippur? After Yom Kippur ended, I sat down to write something that was nagging me for quite some time. I wanted to see what pagerank had to say about movies. I've always liked to picture things as graphs, and looking at movies and actors as nodes in a graph is very intuitive. I also think that the idea has some merit. I like most of the movies that Edward Norton appears in, so it makes sense to give him a high score, and see which other movies Edward Norton "recommends".

So I fired up my IDE, and started to work.

Well, at first I was a little bit disappointed, because IMDbPY doesn't have code to read the IMDB charts. Well, I was going to extend it a little bit, but then I remembered that I already downloaded all the data I needed some time ago (more than half a year I believe), and that was good enough. So I opened up some old pickle file I had lying around, and Voila! after some playing around with the format, I had a weighted graph at my disposal. I weighted the graph according to the IMDB listing - if an actor appeared higher on the cast, his (or hers) weight to and from that movie will be higher.

I implemented a quick and dirty pagerank, and then added some code to score my recommendations higher. I did that by adding each iteration my recommendation's weight to the movie's score.

Without recommendations, the top spots were:

  • 24
  • The Sopranos
  • The Wizard of Oz
  • JFK
  • Cidade de Deus
  • Citizen Kane
  • King Kong
  • Voyna i mir
  • Dolce vita, La
  • The Patriot

After adding "Edward Norton" and "Tom Hanks" as recommendations, I got the following results:

  • American History X
  • Fight Club
  • 24
  • Red Dragon
  • The Sopranos
  • Catch Me If You Can
  • The Wizard of Oz
  • JFK
  • Forrest Gump
  • Cidade de Deus

Well, I was surprised a little to see 24 and the Sopranos rank so high, and also a little bit disappointed. The recommendations didn't really add enough information, and more work will have to be done for this to work properly. It has the right 'smell' however. It looks like it has potential to be useful - I can just see the program, having a small table with a 'seen/unseen' checkbox, and a place to write your favorites, and the program will tell which movies you should watch next. This reminds me a little bit of the recommendations on soulseek and Amarok. It would be really fun to write a total recommendation engine, with the best of all worlds. Someone probably already implemented it.

In any case, I'll work on it some more. This is interesting.

If anyone wants the code, tell me, and I'll make it 'online worthy'...

How PyTuner works

Wednesday, August 1st, 2007

PyTuner is really quite simple. Here is the outline of the algorithm behind it:

  1. Record audio data with pymedia.
  2. Compute the FFT of the audio data - now we have the strength of each frequency in the sample.
  3. Find peaks in the result of the FFT. This means looking for the dominant frequencies.
  4. Find the lowest frequency that has a peak besides the zero frequency.
  5. Find the nearest 'desired note' (the notes we should tune to) to that frequency.
  6. Mark the difference between the actual frequency and the desired note frequency.

There is more to the algorithm. First, to find peaks, I used a simple algorithm - I computed the average strength of the frequencies in the sample, and looked at all the frequencies above two standard deviations above that average. This yields an array, where each frequency is marked by 0, or 1. (Instead of 1 you can mark it with its original strength). For each consecutive sequence of frequencies with 1 - I computed the average of the run, and that was the peak location. So for example, if the data was:
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
The peak would be in the frequency matching index 5.
(If you used the original strength instead of 0 and 1, you can apply a weighted average to get a more accurate result)

To find the actual note of a given frequency, I used the fact that notes (half-tones) are separated by a ratio of 2^(1/12) . (the twelfth root of 2). So, if our base note is A, at 110 Hz, to compute the note we would use the following code snippet:

note_names = "A A# B C C# D D# E F F# G G#".split()
note_index = math.log(note_freq/base_freq, 2**(1.0/12))
note_name = note_names[note_index%12]

While writing PyTuner, at first I implemented the FFT on my own, just to know how to do it once. I found it helped me to better understand the nature of the algorithm, and its output. I find the practice of implementing various algorithms on your own, for educational purposes, quite healthy. I always like to know how things work 'behind the scenes'.

Proofreading and what’s wrong with the heapq module

Friday, May 4th, 2007

Consider the following problem:

you have a large book you would like to proofread, with many chapters (100+) and a few men (4) at your disposal. How would you distribute the chapters among the men, considering that each proofreader must get whole chapters?

Well, the obvious solution is just divide the number of chapters by the number of men, and give each proofreader an appropriate number of (randomly picked?) chapters. But what if the chapters or of varying length? Well, you could just distribute them randomly, but that just doesn't feel right now does it?

I was asked by a friend to help him write a script to solve this, and quite quickly I came up with the solution:

Sort the chapters according to length (number of words), in descending order. Keep the proofreaders in a sorted order. While there are still chapters left - get the next longest chapter, and give it to the proofreader with the least total length. Rearrange the proofreaders.

This algorithm is a greedy algorithm and is also regarded as the straightforward way to solve this problem.

Well, this seems simple enough - In the case where there are more then a few proofreaders - well, the obvious solution is to use a minimum heap. Using a minimum heap in python should be easy - just import the heapq module, use heapify and various other functions, just as you would use sort, and you are home free.

Not quite...

Turns out, that using the heapq module isn't so easy as it seems, or at least not as quick and simple as I would like it to be. While you can sort list using your own compare function, and also providing a key-function, the heapq module sorts items using Python's standard operators. This means, that to use heapq properly to sort an object according to some property, you have to write an object wrapper that has __lt__ and similar functions defined. All this coding could be saved, if heapq had a key argument that could be used, or any other method to control the comparisons.

And what about the proofreading you ask? Well, it worked. It divided the chapters quite evenly, although we did end up using sort() repeatedly, as the number of proofreaders was small and we did not want to overly complicate matters.

This again emphasizes a something I've come to discover - being able to write something quickly is usually more important then later being able to run it quickly. If you write something in an environment that allows for quick coding - later if speed is required you'll be able to optimize it quickly, or even change the design or the basic algorithm.