Algorithm.co.il

Favourites

Rhyme and Reason with Python

Posted on 2008-02-06 by lorg 1 Comment

After reading xkcd’s blog, I went to LimerickDB.com. It looks and works just like bash.org, but for limericks. Quite amused with some of the limericks available, I wanted to try and compose my own.
The first one came out fine, but I got stuck without a rhyme for the second one:

a man named guido once thought
he’d invent a language well wrought
indentation as tabs

Uhm, no luck there… “tabs” does not rhyme well. Then an idea came… I could create a rhyme index!
How to do it?
First, I took my already written word index. Last time I used it was to solve a substitution cipher for my crypto homework.
Then, I looked for the phonemes for each word, using espeak – a text to speech program:

lorg@kild:~$ espeak -q -x hello
 h@l'oU

After doing so, I collected words that had similar phonemes endings, and there it was, a rhyme index:

In [84]: words.get_word_rhymes("tabs", word_index, rhyme_index)[:10]
Out[84]:
['crabs', 'stabs', 'macnabs',
 'dabs', 'cabs', 'scabs',
 'slabs', 'arabs', 'snobs',
 'thebes']

Note that better rhymes are listed first – ranked by the length of the phoneme match.
To create this list I didn’t use stress information, although with some more work that could be added.
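Reconstructing such a rhyme index is straightforward. Below is a minimal sketch, not the original script: the helper names and the toy phoneme strings are my own, and the real phonemes come from espeak.

```python
import subprocess
from collections import defaultdict

def phonemes(word):
    """Ask espeak for a word's phoneme string (espeak must be installed)."""
    out = subprocess.run(["espeak", "-q", "-x", word],
                         capture_output=True, text=True)
    return out.stdout.strip()

def build_rhyme_index(phoneme_map):
    """phoneme_map maps word -> phoneme string. Index every phoneme
    suffix to the words ending with it; a longer shared suffix
    means a better rhyme."""
    index = defaultdict(set)
    for word, ph in phoneme_map.items():
        for i in range(len(ph)):
            index[ph[i:]].add(word)
    return index

def get_rhymes(word, phoneme_map, index):
    """Candidate rhymes for `word`, best (longest suffix match) first."""
    ph = phoneme_map[word]
    seen, rhymes = {word}, []
    for i in range(len(ph)):            # longest suffix first
        for other in sorted(index.get(ph[i:], ())):
            if other not in seen:
                seen.add(other)
                rhymes.append(other)
    return rhymes
```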

Back to our limerick, we can see that “tabs” does not have really good rhymes. Better luck with indentation maybe:

In [86]: words.get_word_rhymes("indentation", word_index, rhyme_index)[:20]
Out[86]:
['kishion', 'alleviation', 'expostulation',
 'machination', 'syrophenician', 'proposition',
 'infatuation', 'computation', 'mansion',
 'meditation', 'aggression', 'anticipation',
 'clarification', 'logician', 'gratulation',
 'discretion', 'quotation', 'presentation',
 'dissolution', 'deprecation']

And we get (not the best limerick I admit):

a man named guido once thought
he’d invent a language well wrought
with space indentation
and good presentation
we all got the language we sought

My original ending line was “it left all the perl hackers distraught”… but it ain’t true, so I gave it up. The ‘distraught’ was found with my script as well.
If you want to look for your own rhymes, then you can use my hacked-up script, although it isn’t written well enough for my taste. (For example, don’t expect to see documentation :)
I’ll upload a better version once I refactor it.
To save you the trouble of generating the word index and the rhyme index from scratch, I used cPickle to store them. You can use them for lookups regardless of the script.

Filed under: Algorithms, Favourites, Humour, Programming, Python

Manually fuzzing my own compiler

Posted on 2008-01-30 by lorg Leave a Comment

As I mentioned before, I had to write a compiler for simplified CPL. An obvious requirement was that the compiler generate correct code. A less obvious but nonetheless important requirement was that after a syntax error, the compiler should keep parsing the source program.

Now, the default behavior of a parser generated by Bison upon a syntax error is to return from the parsing function, yyparse. You may of course call yyparse again, but this would be mostly meaningless – you have lost all your context. A simple example would be a language that requires ‘program’ as its first word. Once you are past that token, you will not be able to parse the source program again, because your first production (which might look like this):

program: TOK_PROGRAM declarations statements TOK_END

won’t parse.

This is solved in Bison by adding error productions. For example:

expression: '(' error ')'

This production means that an error encountered within parenthesis may be considered a valid expression for the purposes of parsing. A proper semantic action for that error (the code that runs when the production is parsed) will add an error message to some error list, and maybe do some general book-keeping.

So where does the fuzzing come in?
Well, my compiler was mostly working, but it still had no error recovery. That means that any syntax error would cause it to exit, reporting just that one error. Consider your favorite compiler (such as gcc) exiting on the first missing semicolon. This is just not acceptable. So I added my basic error recovery and was pretty satisfied.
Then, I had to test the newly written error handling. So I wrote a CPL program with the goal of trying to kill my own compiler. That’s a fun way to test your code. This also happens to be a ‘fuzzing mindset’. I actually managed to find some holes I didn’t think about, and closed them. Of course, these were not security holes, just ‘compilation holes’.
Here is an excerpt from one of the programs I wrote:
x=x++++++++++++++100;
x=x=3;
}
for (x=1; x<100; x=x+1; {
x = x;
}
else () {
for (x=1; x<100; x=x+1;) {
x = -1;
}
}

It goes on like this for a bit, so I’ll save you the trouble. Later I did some deeper testing of the code generation. I wrote a test case for every possible operator in the language (there aren’t that many) and for each type (real and int – did I mention it was simplified CPL?). Since the language doesn’t support strings, each of these test cases printed 1 for success and 0 for failure. I ran the compiled output with my VM script and then had a simple Python script collect all the results and return failure if any of them failed. I later also tested control flow structures using the same method.
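The result-collecting step might look roughly like this – a sketch, not the original script; the tests directory layout, the file extension, and the VM command line are all assumptions:

```python
import subprocess
import sys
from pathlib import Path

def collect(results):
    """results maps test name -> VM stdout. A test passes iff it printed 1
    (the language has no strings, so 1/0 is the whole protocol)."""
    return sorted(name for name, out in results.items() if out.strip() != "1")

def run_all(test_dir, vm_cmd):
    """Run every compiled test program through the VM and gather its output.
    The 'tests' directory and '*.out' extension are hypothetical."""
    results = {}
    for prog in sorted(Path(test_dir).glob("*.out")):
        proc = subprocess.run(vm_cmd + [str(prog)],
                              capture_output=True, text=True)
        results[prog.name] = proc.stdout
    return results

if __name__ == "__main__":
    failed = collect(run_all("tests", ["python", "vm.py"]))
    for name in failed:
        print("FAILED:", name)
    sys.exit(1 if failed else 0)
```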
I had all of these tests available in my makefile, and after each change all I had to do was ‘make test’. Which I did, often.

Looking back after finishing the assignment, it occurred to me that I could have also written an actual ‘fuzzer’ for source programs. However, in the context of such a university assignment this seems like overkill. Had I been writing a real compiler, there’s a very good chance I’d have written a ‘source fuzzer’.

All in all, after I got over the nastiness which is Bison, it was pretty fun. It has been a long time since I wrote C or C++, and it’s good to clean some of the rust away from time to time.

Filed under: Compilation, computer science, Favourites, Programming, Projects, Python

Fractals in 10 minutes no. 3 – The Dragon

Posted on 2008-01-16 by lorg 3 Comments

When I first looked through the pages of the book “Hacker’s Delight”, I found myself looking at the chapter about bases. There I learned a very curious fact – with the digits 0,1 and the base -2, you can represent any integer. Right afterwards I learned something even more interesting – with the digits 0,1 and the base 1-i, you can represent any number of the form a+bi where a and b are integers. Having no use for this curious fact at the time, I let the subject go.
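That base -2 claim is easy to verify in code – a quick sketch of my own, not from the book:

```python
def to_base_minus2(n):
    """Digits (0/1, most significant first) of the integer n in base -2.
    Repeated division works just like in an ordinary base, except the
    remainder must be forced into {0, 1}."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:               # Python floors toward the divisor's sign
            n, r = n + 1, r + 2
        digits.append(r)
    return digits[::-1]

# to_base_minus2(3) == [1, 1, 1], since 4 - 2 + 1 == 3
```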
Some time later, I was reading through Knuth’s “Art of Computer Programming”, and found that with the base of (1-i)^-1, and digits of 0,1 you can generate the dragon fractal!

The dragon fractal

Generating the fractal is quite simple actually:

def create_dragon_set(n):
    """calculate the dragon set, according to Knuth"""
    s = set([0.0+0.0j])
    for i in range(n):
        new_power = (1.0-1.0j)**(-i)
        s |= set(x+new_power for x in s)
    return s

(By the way, can you do it better?)

The annoying part is converting the complex numbers to drawable integer points. After doing so, I used PIL to draw the jpeg.
Here’s a link to the code.
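The conversion step might look something like this – my own reconstruction, not the code behind the link:

```python
def to_pixels(points, size=512):
    """Scale an iterable of complex numbers to integer pixel
    coordinates inside a size x size image, preserving aspect ratio."""
    points = list(points)
    xs = [p.real for p in points]
    ys = [p.imag for p in points]
    min_x, min_y = min(xs), min(ys)
    span = max(max(xs) - min_x, max(ys) - min_y)
    scale = (size - 1) / span
    return [(int((p.real - min_x) * scale), int((p.imag - min_y) * scale))
            for p in points]

# Drawing with PIL could then be roughly:
#   img = Image.new("L", (size, size), 0)
#   for x, y in to_pixels(create_dragon_set(14)):
#       img.putpixel((x, y), 255)
#   img.save("dragon.jpg")
```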

Filed under: Favourites, Fractals, Math, Programming, Python

Fractals in 10 minutes No. 2 – Recursive Spring

Posted on 2007-12-19 by lorg 2 Comments

Imagine a straight wire. Bend this wire until its ends meet. You get a ring. Next stage.
Take another straight wire, bend it as before, but this time, don’t let the ends meet; instead continue bending the wire more and more, until you get a spring. Now, think of this spring as the first wire, and bend it until the spring’s ends meet. We get a spring-ring. (See pictures below.)

At this point it should be obvious – instead of making the ends of the spring meet, we can continue bending the spring into a meta-spring, and again make this meta-spring’s ends meet. Rinse and repeat ad-infinitum.

At first I had a hard time picturing it, and I thought it would be tough to write about it without drawing it. So I sat down with pen and paper, and tried to draw it. That was even harder than picturing it. So I thought I could write a script to draw it, and that turned out to be pretty easy.

Here are the first four stages. The second and third are shown from the side. Click for bigger versions.

To implement this, I wrote a function that draws any track. A track is defined to be a function that takes a float argument between 0 and 1, and returns the appropriate 3d point, where 0 is one end of the track and 1 is the other. (Actually, I also decided that tracks should return a “right hand” vector as well.) Then I wrote a ring:

def ring(t):
    pi = math.pi
    cos = math.cos
    sin = math.sin
    x = numpy.array((2*cos(2*pi*t),2*sin(2*pi*t),0))
    return (x,-x/3)

Lastly I wrote a wrapper function, that takes any track, and springifies it. It was called springify_track. At this point I could do this:
(The constants are the radius of the spring loops, the sampling delta, and the number of loops.)

t1 = ring
t2 = springify_track(t1, 0.25, 0.1, 50)
t3 = springify_track(t2, 0.06, 0.0011, 5000)
t4 = springify_track(t3, 0.011, 0.000012, 500000)
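I haven’t reproduced the original springify_track here, but from the description it could be sketched along these lines – the numerical-derivative approach and the choice of vectors are my guesses:

```python
import math
import numpy

def springify_track(track, radius, delta, nloops):
    """Turn any track into a spring that coils around it: follow the
    original track while circling it `nloops` times at distance `radius`.
    The direction of travel is estimated numerically with `delta`."""
    def spring(t):
        point, right = track(t)
        right = right / numpy.linalg.norm(right)
        # estimate the forward direction by sampling a nearby point
        if t + delta <= 1.0:
            ahead, _ = track(t + delta)
            forward = ahead - point
        else:
            behind, _ = track(t - delta)
            forward = point - behind
        forward = forward / numpy.linalg.norm(forward)
        up = numpy.cross(forward, right)
        angle = 2 * math.pi * nloops * t
        offset = radius * (math.cos(angle) * right + math.sin(angle) * up)
        # the offset direction doubles as the new "right hand" vector
        return (point + offset, offset / numpy.linalg.norm(offset))
    return spring
```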

Filed under: Favourites, Fractals, Graphics, Programming, Python

Python module usage statistics – Cont.

Posted on 2007-11-21 by lorg 3 Comments

Well, very embarrassingly for me, it turns out I had a bug in my original post and code. Per Doug’s suggestion, I tried running the script I wrote on the standard library, and got results I didn’t quite believe. So I checked them, opened socket.py, and there was an “import _socket”. However, the module “_socket” was not listed in my results. After some digging around (and feeling my face getting red hot…) I found the bug. It was the smallest thing: I forgot to pass re.MULTILINE as an argument to re.search, so the start of my regexp, “^[ \t]*import”, didn’t match except at the beginning of the file. #@&*!!! There are just no words to express how I felt at that moment.
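The difference the missing flag makes is easy to demonstrate:

```python
import re

source = "import os\nimport _socket\n"
pattern = r"^[ \t]*import\s+(\w+)"

# Without re.MULTILINE, ^ anchors only at the start of the whole string,
# so imports past the first line are silently missed:
assert re.findall(pattern, source) == ["os"]

# With re.MULTILINE, ^ anchors at the start of every line:
assert re.findall(pattern, source, re.MULTILINE) == ["os", "_socket"]
```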

So, I fixed the bug, tested my code a bit, fixed some more (very minor) bugs, and here are the top 25 of the (hopefully correct this time!) results:

  • sys,1426
  • os,1250
  • unittest,566
  • time,446
  • re,383
  • string,321
  • types,298
  • setuptools,264
  • pkg_resources,217
  • cStringIO,184
  • zope.interface,177
  • datetime,173
  • shutil,167
  • os.path,162
  • gtk,143
  • StringIO,143
  • random,136
  • tempfile,132
  • copy,131
  • threading,128
  • distutils.core,127
  • doctest,126
  • md5,125
  • setuptools.command.e,116
  • logging,116

Apart from the larger numbers, the arrangement pretty much stayed the same.

(This seems to at least confirm my claim that the results were representative)
Here are the results for the standard library (/usr/lib/python2.5/):

  • sys,1309
  • os,1065
  • ctypes,588
  • re,496
  • string,493
  • types,435
  • time,374
  • numpy,297
  • warnings,254
  • os.path,204
  • cStringIO,196
  • common,185
  • math,159
  • traceback,158
  • gettext,152
  • codecs,147
  • StringIO,147
  • copy,133
  • __future__,128
  • tempfile,126
  • random,119
  • threading,117
  • unittest,108
  • numpy.testing,105
  • errno,100

These results seem different. ctypes seems to be the biggest change. Note that these results might be slanted, as some stdlib modules are implemented in C, and I didn’t test a ‘clean’ Python installation (but rather one with all my non-default modules installed).

Here is a link to the new results, the new stdlib results, and the new fixed graph. I created it using the first 325 modules. The fixed R squared value is now approx. 0.99. Here is the code.

I would like to apologize to anyone who was misled by the original results, and again state that it is quite possible that there are still more bugs there, but I still claim the results to be representative.

Filed under: Favourites, Math, Programming, Python

Python module usage statistics

Posted on 2007-11-19 by lorg 5 Comments

IMPORTANT UPDATE: The code I used to create these statistics had some bugs. The fixed statistics are available here.

After reading Doug Hellmann’s post about the Python stdlib modules he needs the documentation for, I commented there that I need the documentation for logging because I don’t use it too frequently. Later, I thought about it a little more, and I wanted to check which modules are actually used often, and which rarely.

My first step was to “grep import *.py” on my python scripts directory. Later I progressed to writing a simple script which basically does the same thing, but also rates each imported module according to frequency (it looks for “import <modules>” and “from <module> import bla”). The results weren’t interesting enough. Before going any further, I sat down, and wrote my expected list of frequent stdlib modules:

  • sys
  • os
  • re
  • math
  • thread
  • urllib2
  • time
  • random
  • struct
  • socket
  • itertools
  • operator

My next step was to try Google Codesearch. Some time ago I wrote a little script to harvest results from Google Codesearch, but enough time has passed, and the script doesn’t work anymore. Parsing their output is messy enough, and their results don’t seem that helpful anyway. (They don’t return all the results in a project). So I thought about using koders.com, when I had a flash. I can use Cheeseshop! easy_install must have a way to just download the source of a project! (For those who don’t know: you can do easy_install SQLObject, and easy_install will look for the module in the PyPI, download it and then install it for you). Indeed, easy_install had this option (-eb), and I was in luck. Quickly, I got a list of all the modules in PyPI, shuffled it, and picked 300 randomly. (Actually, downloading them wasn’t that quick, and about 20 didn’t download right :).

I ran my statistics script, but non-stdlib modules were also listed. I fixed some bugs, improved the script a little, and (partially) disallowed inner project imports using a little heuristic. These are the improved results:

  • sys,113
  • os,107
  • setuptools,57
  • common,50
  • unittest,50
  • __future__,25
  • distutils.core,24
  • zope.interface,22
  • re,19
  • pygame,18
  • time,13
  • datetime,11
  • string,10
  • zope,10
  • pyglet.gl,9
  • types,8
  • random,7
  • pkg_resources,7
  • gtk,6
  • struct,6
  • ez_setup,6
  • zope.component,5
  • math,5
  • logging,5
  • sqlalchemy,5

(Please note that I do not argue that these results are completely true, as my code might still contain problems, and it probably filters too much. However, I do claim that these results are representative).

Well, some surprises… I didn’t expect setuptools and types to be there. Especially types… who needs it anymore (and anyway)? Seeing unittest so high in the list gives me a pretty good feeling about opensource. __future__ being there is amusing… Seems that we just can’t wait for the next version. And indeed, logging is pretty low, but not as low as I thought it would be.

Another curiosity: the module frequencies seem (unsurprisingly) to obey Zipf’s law. Here is a small graph showing the results: module imports on a log-log graph

(plotted with Google Docs). The R squared value is 0.92. For the original results (where inner dependencies were allowed), it seemed that the law was even stronger.
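For the curious, such a straight-line fit on a log-log scale can be computed directly – a generic least-squares sketch, not the original Google Docs computation:

```python
import math

def fit_power_law(frequencies):
    """Least-squares slope, intercept, and R^2 of log(frequency)
    against log(rank). Zipf's law predicts a straight line with
    slope near -1."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot
```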

For those interested in the full results, here is a csv file containing them.

One final note: while these results are true for projects in Python, they don’t represent usage in interactive mode. Since I use interactive mode a lot, and during interactive work I use os and re quite a lot, I bet the results would be changed if we somehow could count interactive mode usage.


Filed under: Favourites, Math, Programming, Python

Pagerank recommends movies

Posted on 2007-09-22 by lorg 1 Comment

So how was your Yom Kippur? After Yom Kippur ended, I sat down to write something that was nagging me for quite some time. I wanted to see what pagerank had to say about movies. I’ve always liked to picture things as graphs, and looking at movies and actors as nodes in a graph is very intuitive. I also think that the idea has some merit. I like most of the movies that Edward Norton appears in, so it makes sense to give him a high score, and see which other movies Edward Norton “recommends”.

So I fired up my IDE, and started to work.

Well, at first I was a little disappointed, because IMDbPY doesn’t have code to read the IMDB charts. I was going to extend it a little, but then I remembered that I had already downloaded all the data I needed some time ago (more than half a year, I believe), and that was good enough. So I opened up some old pickle file I had lying around, and voila! After some playing around with the format, I had a weighted graph at my disposal. I weighted the graph according to the IMDB listing – if an actor appeared higher in the cast, his (or her) weight to and from that movie was higher.

I implemented a quick and dirty pagerank, and then added some code to score my recommendations higher. I did that by adding, on each iteration, my recommendations’ weight to the movie’s score.
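From that description, a quick-and-dirty personalized pagerank might look like this – the graph format and the exact boost mechanics are my guesses, not the original code:

```python
def pagerank(graph, boosts=None, damping=0.85, iters=50):
    """Weighted PageRank with a crude 'recommendation' boost: each
    iteration, boosted nodes get extra score added before renormalizing.
    graph: {node: {neighbor: edge weight}}."""
    boosts = boosts or {}
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, out in graph.items():
            total = sum(out.values())
            if total == 0:
                continue            # dangling node: its rank leaks
            for m, w in out.items():
                new[m] += damping * rank[n] * w / total
        for n, b in boosts.items():  # the per-iteration recommendation bonus
            new[n] += b
        s = sum(new.values())        # renormalize so scores stay comparable
        rank = {n: v / s for n, v in new.items()}
    return rank
```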

Without recommendations, the top spots were:

  • 24
  • The Sopranos
  • The Wizard of Oz
  • JFK
  • Cidade de Deus
  • Citizen Kane
  • King Kong
  • Voyna i mir
  • Dolce vita, La
  • The Patriot

After adding “Edward Norton” and “Tom Hanks” as recommendations, I got the following results:

  • American History X
  • Fight Club
  • 24
  • Red Dragon
  • The Sopranos
  • Catch Me If You Can
  • The Wizard of Oz
  • JFK
  • Forrest Gump
  • Cidade de Deus

Well, I was surprised a little to see 24 and the Sopranos rank so high, and also a little bit disappointed. The recommendations didn’t really add enough information, and more work will have to be done for this to work properly. It has the right ‘smell’ however. It looks like it has potential to be useful – I can just see the program, having a small table with a ‘seen/unseen’ checkbox, and a place to write your favorites, and the program will tell which movies you should watch next. This reminds me a little bit of the recommendations on soulseek and Amarok. It would be really fun to write a total recommendation engine, with the best of all worlds. Someone probably already implemented it.

In any case, I’ll work on it some more. This is interesting.

If anyone wants the code, tell me, and I’ll make it ‘online worthy’…

Filed under: Algorithms, computer science, Favourites, Math, Programming, Python

Understanding the Nyquist Limit

Posted on 2007-08-18 by lorg 2 Comments

The Nyquist limit of half the sampling rate often seemed to me like a ‘magical limit’. After I read the proof I understood it better, and the graphical representation of the overlapping spectra (also seen on the Wikipedia page of the proof) is quite convincing. Yet now I have found a very interesting and ‘easy’ explanation, which I’m going to relate. The explanation may not be complete, but it certainly gave me a good feel for the limit.

It all started when, one late night, I was driving with my girlfriend from Haifa to Tel-Aviv. (In Israel, many mathematical school problems start that way :). As it sometimes happens, part of the road was under repairs, and there were ‘danger lights’ alongside the road. These danger lights were many beacons, spaced a few meters apart along the road, each blinking in its own time. Driving alongside them, you could see it as if the ‘light’ was traveling through them (see illustration: 0 is unlighted, x is lighted):

… x 0 0 x 0 …

… 0 x 0 0 x …

… 0 0 x 0 0 …

and so on. To me it seemed as if the light was advancing. Let us assume now a light pattern of one lighted, two off, and so on, going forward, one beacon at a time. This is the pattern in the illustration. This pattern however, is identical to the pattern of one lighted, two off, going backwards, two beacons at a time. Just like watching a rotor – you could parse it either way, and both ways are correct. This could be interpreted as aliasing.

At this point – this will already start to ‘smell’ like something is going on with the sampling theorem… Come to think of it, watching a rotor is exactly an example of the application of the sampling theorem! Cool!

(Warning: this is very reminiscent of group theory. Every now and then I will use it a little. To the non-mathematically-inclined: bear with me :)

Let’s move on. Imagine now a finite number of beacons. Let’s say five. Now consider a single light traveling through the beacons. What is the maximum speed at which it can travel where we can still discern the direction? (This is equivalent to an infinite number of beacons, with a pattern of 1 on, 4 off – just like Z5 ≅ Z/5Z.)

Let’s check it out. (Note: since 5 is prime, each speed generates exactly one pattern.) We will obviously ignore the zero speed, and move on to a speed of 1. At this speed, the pattern generated is:

x 0 0 0 0
0 x 0 0 0
0 0 x 0 0
0 0 0 x 0
0 0 0 0 x

or in reverse.

At the speed of 2, the pattern generated is:

x 0 0 0 0
0 0 x 0 0
0 0 0 0 x
0 x 0 0 0
0 0 0 x 0

or in reverse.

Notice that in both cases we can discern which direction the pattern is going. But a speed of 3 will be exactly the reverse of a speed of 2! Equivalently, a speed of 4 will not be discernible from a reversed speed of 1! If you don’t believe me, try to see it in the patterns above. This is equivalent to saying that over Z5, -1 == 4, and -2 == 3. We can carry on with this and show it is the same for any n. (In the additive group Zn, it is easy to show that the inverses lie in the two different halves of the group.) To summarize – given a finite number N of beacons, we will be able to discern the direction of speeds up to floor((N-1)/2). Cool.
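The whole argument is small enough to check in a few lines of Python (my own illustration, not from the original post):

```python
def pattern(n, speed, steps):
    """Frames of a single light moving through n beacons,
    advancing `speed` positions (mod n) per time step."""
    return [tuple(1 if b == (speed * t) % n else 0 for b in range(n))
            for t in range(steps)]

# Over 5 beacons, a speed of 3 produces exactly the frames of
# a speed of -2 (i.e. speed 2 going backwards), and 4 those of -1:
assert pattern(5, 3, 5) == pattern(5, -2, 5)
assert pattern(5, 4, 5) == pattern(5, -1, 5)
```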

I know that for the mathematically inclined this might seem obvious, and for the rest of the world, this could seem a little cryptic. But I’m really happy with finding this connection, and such a simple example of the Nyquist Limit.

Mathematical note: I bet I got a few mathematical things wrong here. I didn’t write a rigorous proof (but I did proof read it). Tell me your corrections, and what you think. I’ll be glad to hear.

Filed under: Favourites, Group Theory, Math

"Where is Waldo?", or "Security by Origami"

Posted on 2007-05-20 by lorg 6 Comments

The Problem

A friend of mine gave me a riddle this morning regarding “Where’s Waldo?”. The riddle is as follows:

You and a friend play “Where’s Waldo?”. You solve the puzzle before your friend, and you want to prove to your friend you solved the puzzle, without giving him any hints. How do you do this?

This is, of course, very reminiscent of zero knowledge proofs. A good zero knowledge proof will allow your friend to be convinced before he has found Waldo himself, and even if he never finds him at all. A somewhat less interesting solution is one that will allow your friend to be convinced only if and when he finds Waldo himself and can then verify your proof.

There are obviously many “intuitive” solutions, which I will not write here… I will write here the second solution I thought of, which I consider to be the more interesting one. However, this solution doesn’t solve the original problem – it only allows your friend to verify your solution once he has solved it himself.

The Solution

Take a sheet of paper of identical dimensions to the picture, and mark a spot on it in the position where Waldo would have been on that sheet of paper. Fold that sheet of paper to some kind of origami animal, and give it to your friend. Once he solves the puzzle, he can open the folding, and see for himself that the point was marked correctly.

This is obviously not a very good solution. It is just a glorified note – you write down your solution on a note, and ask your friend not to peek until he solves it himself. So I came up with an improvement:

Agree beforehand with your friend on some (large) origami folding (such as a beetle). He shouldn’t know the instructions to fold it. Take a sheet of paper, and mark Waldo’s position on it (with a dot). Hold the paper above the fold, and mark on the fold (with another dot) the projection of the original dot on the folding. Now unfold the origami – you have a crease pattern. Give the crease pattern to your friend. When he solves the puzzle, refold the origami, and prove to your friend that the projection of the dot on the fold coincides with the dot on the picture. As an added bonus – your friend just learned how to fold another origami beast!

Of course, this solution isn’t water-tight. It also relies on crease patterns being hard to solve. It is mostly security by obscurity – but this time, ‘security by obscurity’ becomes ‘security by origami’. I just found that fascinating – that origami may be considered a function that is ‘hard to reverse engineer’ (even if it is not so) – such as a hash function. Origami does behave a little bit like a hash…

Tell me your original solutions to the problem.

Filed under: Favourites, Math, Origami, Protocols, Security

© 2022 Algorithm.co.il - Algorithms, for the heck of it