Projects Statistics

The Art and Science of Pulling Numbers Out of Your Sleeve

About a year or so ago, I was reading R. V. Jones’ excellent book ‘Most Secret War’. One of the stories I remembered and told my colleagues about, was how Jones estimated the rocket production capabilities of the Germans. He did so after looking at an aerial photograph of a rocket fuel shed. One of my colleague then told me of a course she took at the Hebrew University of Jerusalem that teaches how to make such estimates. I let it go at the time, and just remembered that there is such a course.

A few days ago, I met up with this colleague, and I was reminded of this course. You see, right now I’m collecting data for a project I’m doing with a friend, and I needed estimation know-how. So I asked her the name of the course, and she gave it to me. Two Google search later and I had the English name of the course, “Order of Magnitude in Physics Problems”, and a textbook to look at. I just finished reading the introduction, and I know that I’m going to read the rest of it too. Not just for my current project – but because it’s such a good skill to have.

The book opens with a problem:

We dedicate the first example to physicists who need employment outside of physics. […] How much money is there in a fully loaded Brinks armored car?

The book then goes to show how to answer such a question intelligently. I’m hooked.

computer science Math Programming Python Statistics Utility Functions

Solution for the Random Selection Challenge

A few days ago, I wrote up two small Python Challenges. Several people have presented solutions for the first challenge, and I also posted my solution in the comments there.

However, the second challenge remained unsolved, and I will present a solution for it in this post.

Programming Python Utility Functions

Harvesting with threadmap

Dune 2 Harvester

From time to time, I need to harvest a website, or many websites. For example, to collect the data from IMDB to run the Pagerank algorithm. Other times I need to query some non-web servers.

Usually in such cases, I have a ‘read_single_url’ function that is called in a loop from a ‘read_all_urls’ function. The straightforward implementation of this will run slowly. This is not because read_single_url takes a lot of time to parse the websites it downloads. The delay is mostly due to the latency of network operations. Even on high bandwidth connections, your bandwidth utilization will be quite low.

To fix this, I wrote a function named threadmap that runs each call of read_single_url in a separate thread. Just like map, threadmap runs a given function for each element in the input sequence, and returns once all the calls are complete.

Here is an example use of the function:


My first naive implementation just created a thread for each element in the list, and started them all simultaneously. This caused network IOErrors and other problems. This issue was handled by setting a maximum number of threads that may run at once.

The next issue I had to handle was exceptions. It is not obvious what is the best course of action once the inner function raises an exception. At the least, the exception has to be handled so that threadmap’s synchronizing code may be allowed to run.
My current implementation allows for a few different behaviors: ignoring the exception, aborting threadmap, retrying, and returning a default value for the problematic call. To implement these behaviors, I used the traceback module, after reading Ian Bickings’ excellent explanation of exception re-raising.

For those interested, here’s a copy of the code. I’ll be glad to read any comments or suggestions about it.

Math Programming Python

Fun with Matrices

I’ll let the code speak for itself:

In [81]: m = Matrix(array([[1.0,1.0],[0.0,1.0]]))
In [82]: def my_sqrt(x, num_iters):
   ....:     r = 0.5*x
   ....:     for i in xrange(num_iters):
   ....:             r = 0.5*(r+x/r)
   ....:     return r
In [83]: m*m
array([[ 1.,  2.],
       [ 0.,  1.]])
In [84]: my_sqrt(m*m, 10)
array([[ 1.,  1.],
       [ 0.,  1.]])
In [85]: m
array([[ 1.,  1.],
       [ 0.,  1.]])

It’s always fun to see the math work out. At first when I learned that e^A for a matrix A may also be defined using the Taylor series in the usual way, I was really amazed. It still amazes me that this stuff works out so well. This is another kind of beauty.

Programming Python web-design

LStrings and ASCII-Plotter at

It seems Gregory is fond of ascii-art, as he took my scripts, and put them online at is a cool new website, which allows you to run scripts as small web applications. What Gregory actually did, with just a few key-strokes, was turning my two scripts into web applications. While these scripts aren’t the most useful web applications in the world, I could think of a few others that are.

Gregory also thought of another use for the UtilityMill. Consider my post ‘Rhyme and Reason with Python’. Let’s say you want to try out my script as well, but you don’t want to go through the hassle of downloading it and all the required dependencies, and then figure out how to run it. If I set it up as a small web application there, I could have any reader be able to see the code and its output without any effort. I’ll be sure to try something like that in my next code-oriented post.


Origami Pig Instructions

I took some time to create the instructions for this one:

Origami Pig

Read on for the full instructions.

Challenges Math Programming Python

Small Python Challenge No. 3 – Random Selection

This time I’ll give two related problems, both not too hard.

Lets warm up with the first:

You have a mapping between items and probabilities. You need to choose each item with its probability.

For example, consider the items [‘good’, ‘bad’, ‘ugly’], with probabilities of [0.5, 0.3, 0.2] accordingly. Your solution should choose good with probability 50%, bad with 30% and ugly with 20%.

I came to this challenge because just today I had to solve it, and it seems like a common problem. Hence, it makes sense to ask ‘what is the best way?’.

The second problem is slightly harder:

Assume a bell shaped function p(x) that you can ‘solve’. This means that given a value y, you can get all x such that p(x)=y. For example, sin(x)^2 in [0,pi] is such a function. Given a function such as Python’s random.random() that yields a uniform distribution of values in [0,1), write a function that yields a distribution proportional to p(x) in the appropriate interval.

For example, consider the function p(x) = e^(-x^2) in [-1,1]. Since p(0) = 1, and p(0.5)~0.779, the value 0 should be p(0)/p(0.5)~1.28 times more common than 0.5.

As usual, the preferred solutions are the elegant ones. Go!

note: please post your solutions in the comments, using [ python]…[ /python] tags (but without the spaces in the tags).

Algorithms Humour Programming Python

Rhyme and Reason with Python

After reading xkcd’s blog, I went to It looks and works just like, but for limericks. Quite amused with some of the limericks available, I wanted to try and compose my own.
The first one came out fine, but I got stuck without a rhyme for the second one:

a man named guido once thought
he’d invent a language well wrought
indentation as tabs

Uhm, no luck there… “tabs” does not rhyme well. Then an idea came… I could create a rhyme index!
How to do it?
First, I took my already written word index. Last time I used it was to solve a substitution cipher for my crypto homework.
Then, I looked for the phonemes for each word, using espeak – a text to speech program:

lorg@kild:~$ espeak -q -x hello

After doing so, I collected words that had similar phonemes endings, and there it was, a rhyme index:

In [84]: words.get_word_rhymes("tabs", word_index, rhyme_index)[:10]
['crabs', 'stabs', 'macnabs',
 'dabs', 'cabs', 'scabs',
 'slabs', 'arabs', 'snobs',

Note that better rhymes are listed first – according to the length of the match.
To create this list I didn’t use stress information, although with some more work it may be done.

Back to our limerick, we can see that “tabs” does not have really good rhymes. Better luck with indentation maybe:

In [86]: words.get_word_rhymes("indentation", word_index, rhyme_index)[:20]
['kishion', 'alleviation', 'expostulation',
 'machination', 'syrophenician', 'proposition',
 'infatuation', 'computation', 'mansion',
 'meditation', 'aggression', 'anticipation',
 'clarification', 'logician', 'gratulation',
 'discretion', 'quotation', 'presentation',
 'dissolution', 'deprecation']

And we get (not the best limerick I admit):

a man named guido once thought
he’d invent a language well wrought
with space indentation
and good presentation
we all got the language we sought

My original ending line was “it left all the perl hackers distraught”… but it ain’t true, so I gave it up. The ‘distraught’ was found with my script as well.
If you want to look for your own rhymes, then you can use my hacked-up script, although it isn’t written well enough for my taste. (For example, don’t expect to see documentation :)
I’ll upload a better version once I refactor it.
To save you the trouble of generating the word index and the rhyme index from scratch, I used cPickle to store them. You can use them for lookups regardless of the script.

Security web-design

Troubles with Wild Themes

Some time ago, I wrote that I was planning on using a new theme for this blog. To do this, I first looked for possible candidates on, and then started to adapt the one I liked. However, while working on the theme, I noticed hidden links in the code of the theme.

These links were hidden by using “font-size: 1px”. Hidden links are there to increase the search engine placement of the creator and his affiliates. In this case the creators were ‘‘. You can check their stats in technorati, and see they have about 250 blogs linking to them, mostly via regular credits.

Upon further examination I found two more themes by these guys, with the same hidden SEO links.
Afterwards, I checked some 20 blogs that had their themes, about half of which still had the SEO (Search Engine Optimization) links in them. Other linked urls were (a blog) and (an ad exchange service).
I reported this to, and it seems I can no longer find the themes via their search engines. However, the themes are still available for download.

The SEO links, while being an annoyance and an underhanded thing to do, are not the main issue here. The big problem I had with this theme was, that if someone has no compunction about putting SEO links, he\they might put a backdoor there as well. I’m not saying that these guys did it, but I know that if I was a bad guy looking to make money – I’d do it.
This is a very easy way to infect servers. Just prepare a few good-looking themes for wordpress, phpbb or any other standard web-application, sit back, and watch the botnets grow. You don’t only get to infect the server, you can try to infect any client that connects to an infected server. Instead of researching a new vulnerability, just use social engineering, like you do with end users surfing the web.

You can publish your themes, and evidently, they won’t go through too much scrutiny. With an appropriate Google Alert in place – you can also be informed whenever someone new installs your theme.

This is a trust issue – and it seems that you shouldn’t trust WordPress’ theme DB. While this may seem obvious to you, it wasn’t to me at first, and I bet it isn’t obvious to many others starting their small blogs and looking for a good theme.

Compilation Math Programming Python

Interesting links – 4

“Intermediate and Advanced Software Carpentry in Python” is an excellent reading by Titus Brown. If you feel you’re good with Python but want to improve it, or if you are an experienced programmer that wants to get better, this is a good place to go. I liked it.

During the time I was working on my Compilation course, I was thinking about the challenge of writing yacc in yacc. Well, I went searching for “yacc in yacc” and stumbled across this page. It is a part of a very strange ‘pattern wiki’. It has fascinating discussions of the subjects in it, but I don’t like the old-style wiki navigation. Still worth a taste.

Do you know the kind of duel where two people need to name the biggest number? If you thought something along the lines of ‘hey, I’ll write something with a lot of nines’, well, you are in for a surprise. This article has a nice solution for the problem. Excellent read, especially if you are into computation theory.  I really liked that one.