Two bugs don’t make a right

[Image: a "three lefts" road sign]
At my new startup, we do a bit of reasoning using implications. One of the more curious kinds is the negative implication. Consider the following exaggerated example:

  • a place being kid-friendly implies that it is not romantic;
  • a place being a strip club implies that it is not kid-friendly.

If we let negative implications chain transitively, it would follow that since being a strip club makes a place less kid-friendly, it also makes it more romantic. We don’t want that, so I had to write code that specifically ignores this situation. Before writing it, in the best tradition of TDD, I wrote a test for two chained negative implications. I implemented the code, the test passed, and I was happy.

For a while.

Fast forward a couple of weeks: I try adding some negative implications, and the program doesn’t behave as expected. My code doesn’t work. I turn back to my test, check it out, and sure enough, all the things the test asserts as True are actually True, and the test does test the right thing.

Digging deeper, I discovered the issue. I had two bugs: the first was that the code handling chained negative implications wasn’t working right. The second was in my graph building algorithm – it seems that I was forgetting to add some edges. What made that second bug insidious was that it hid the effect of the first bug from the test – effectively making the test pass.

So – for me it was – two negative implications don’t mean a positive one, and two bugs don’t make a feature.
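
For the curious, here is a minimal sketch of the kind of check involved (not the actual plnnr code – the tag names and data layout are made up for illustration): implications are signed edges, and propagation simply refuses to chain two negative edges into a positive conclusion.

def derive(implications, start_tag):
    """implications: dict mapping a tag to a list of (tag, sign) pairs,
    where sign is +1 ("implies") or -1 ("implies not")."""
    derived = {}
    stack = [(start_tag, +1)]
    while stack:
        tag, sign = stack.pop()
        for next_tag, edge_sign in implications.get(tag, []):
            if sign == -1 and edge_sign == -1:
                continue  # two negatives do NOT make a positive - stop the chain here
            new_sign = sign * edge_sign
            if derived.get(next_tag) != new_sign:
                derived[next_tag] = new_sign
                stack.append((next_tag, new_sign))
    return derived

implications = {
    'kid_friendly': [('romantic', -1)],
    'strip_club':   [('kid_friendly', -1)],
}
print(derive(implications, 'strip_club'))   # {'kid_friendly': -1}, and nothing about 'romantic'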


Optimizing Django ORM / Postgres queries using left join

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

from django.db import models

class SomeObject(models.Model):
    # ... some data fields ...
    pass

class ProcessedObjectData(models.Model):
    some_object = models.ForeignKey(SomeObject, db_index=True)
    # ... some more data fields ...

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged into a single table, but that is not right in our case: first, SomeObject “has standing on its own”; second, we want to allow more than one ProcessedObjectData per SomeObject.

Given this situation, I was interested in finding all the SomeObject rows that don’t have a certain type of ProcessedObjectData. A relatively easy way to express this (in Python + the Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (it takes a few seconds), once you go above 10k objects – and certainly at 100k – it starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, Adi suggested that I use a LEFT JOIN. After reading about it a little and playing around with it, I came up with a solution: do a LEFT JOIN in an inner select, and use the outer select to keep only the rows with NULL – indicating a missing ProcessedObjectData row. Here is an example of how this could look:

SELECT id FROM (
    SELECT some_object.id AS id, processed_object_data.id AS other_id FROM
    some_object
    LEFT JOIN
    processed_object_data
    ON
    (some_object.id = processed_object_data.some_object_id) AND
    (...SOME FILTER ON processed_object_data...)
) AS inner_select 
WHERE 
inner_select.other_id IS NULL
LIMIT 100
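
Wired into the ORM, this looks roughly like the sketch below (the filter column "kind" and its value are placeholders – the real condition depends on the actual schema and table names):

query = """
    SELECT id FROM (
        SELECT some_object.id AS id, processed_object_data.id AS other_id
        FROM some_object
        LEFT JOIN processed_object_data
          ON some_object.id = processed_object_data.some_object_id
         AND processed_object_data.kind = %s
    ) AS inner_select
    WHERE inner_select.other_id IS NULL
    LIMIT 100
"""
# raw() maps the selected "id" column back onto SomeObject instances
unprocessed = list(SomeObject.objects.raw(query, [SOME_KIND]))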

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.


Collision: the story of the random bug

So here I was, trying to write some Django server-side code, when every once in a while, some test would fail.
Now, it is important to know that we are using any_model, a cute little library that allows you to specify only the fields you need when creating objects, and randomizes the rest (to help uncover more bugs).

In this particular instance, the failing test stored objects on the server through an API, and then checked that the new objects existed in the DB. Every once in a while, an object didn’t exist. It should be noted that the table with the missing rows had a Django ORM URLField.

So first things first, I changed the code to print the random seed it was using on every failure. Now the next time it failed (a day later), I had the random seed in hand.

I then proceeded to use that random seed – and now I had a reproducible bug – it failed every time, consistently.
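
The seed trick itself amounts to just a few lines – here is a sketch (run_test_suite is a made-up stand-in, and this is not any_model’s actual API):

import random
import sys

seed = random.randint(0, 2 ** 32 - 1)    # or a seed taken from an earlier failure report
random.seed(seed)
try:
    run_test_suite()                      # hypothetical stand-in for the real test run
except AssertionError:
    # report the seed so the "random" failure can be replayed deterministically
    sys.stderr.write('test failed, random seed was %d\n' % seed)
    raise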

The next step was finding the cause of the bug. To cut a long story short – it turned out the code was looking for an object with a specific URL. Which URL? The one created for the first object (we had two).

The bug was that the second object was getting the same url as the first. I remind you, these urls are generated randomly. The troublesome url was http://72.14.221.99

I leave you to guess (or check) what the chances of this collision are
(the correct way would be to check any_model’s URL-generating code, and not just say 1 in 2^32… :)

So I made sure the second object got a new url, and all was well, and the land had rest for forty years. (or less).


Cheap language detection using NLTK

Some months ago, I was facing a problem of having to deal with large amounts of textual data from an external source. One of the problems was that I wanted only the English elements, but was getting tons of non-English ones. To solve that I needed some quick way of getting rid of non-English texts. A few days later, while in the shower, the idea came to me: use NLTK stopwords!

What I did was, for each language in NLTK, count the number of that language’s stopwords appearing in the given text. The nice thing about this is that it usually gives a pretty strong signal about the language of the text. Originally I used it only for English/non-English detection, but after a little bit of work I made it report which language it detected. I needed a quick hack, so this code is not rigorously tested, but I figure it is still interesting. Without further ado, here’s the code:

import nltk

ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS

# one stopword set per language shipped with the NLTK stopwords corpus
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    """Return the stopwords-corpus language whose stopwords overlap the text the most."""
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords)) for lang, stopwords in STOPWORDS_DICT.items()), key=lambda x: x[1])[0]

def is_english(text):
    """True if the text shares more stopwords with English than with all other languages combined."""
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)
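
A quick usage sketch (the exact output depends on which stopword corpora your NLTK installation has):

print(get_language('This is a short English sentence about nothing in particular'))
# -> 'english'
print(is_english('Ceci est une phrase en francais'))
# -> False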

The question to you: what other quick NLTK, or NLP hacks did you write?


Wikipedia Images

A few days ago a friend (x) of a friend (y) showed me and my friend (y) a small app he was developing that pulled photos from Flickr and Picasa. We suggested adding photos from Wikipedia as well, but he (x) said the photos were too big, and it was too much trouble to resize them.
Luckily for him, I knew about Wikipedia’s “hidden” image-resizing feature, and since it has been useful to me and to someone else, I thought I’d share it here.

Let’s say you are looking to resize the following image of the Eiffel Tower: http://en.wikipedia.org/wiki/File:Tour_Eiffel_Wikimedia_Commons.jpg. Then the url to the image itself is:

http://upload.wikimedia.org/wikipedia/commons/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg

To get the url of a resized image, just add ‘thumb/’ after ‘commons/’ and then append ‘/[%d]px-[filename]’ at the end, where %d is the new width. So for our image, the new url would be:

http://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/300px-Tour_Eiffel_Wikimedia_Commons.jpg
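
And since we’re programmers, here’s a small helper that builds such urls, assuming the upload.wikimedia.org layout shown above:

def wikimedia_thumb_url(image_url, width):
    """Turn an upload.wikimedia.org commons image URL into a resized-thumbnail URL."""
    prefix = 'http://upload.wikimedia.org/wikipedia/commons/'
    assert image_url.startswith(prefix)
    rest = image_url[len(prefix):]              # e.g. 'a/a8/Tour_Eiffel_Wikimedia_Commons.jpg'
    filename = rest.rsplit('/', 1)[-1]
    return '%sthumb/%s/%dpx-%s' % (prefix, rest, width, filename)

print(wikimedia_thumb_url('http://upload.wikimedia.org/wikipedia/commons/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg', 300))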

That’s it, simple and quick. Have fun adding Wikipedia to your content!


Python Module Usage Stats – Feb 2011

Here are the top 30 “base modules”, ordered by the number of PyPI projects importing them. These results are based on 11,204 packages downloaded from PyPI. Explanations, full results and the code used to generate them are available below.

Results


[Bar chart: top 30 base modules by number of PyPI projects importing them]

Full results are available (see Methodology to understand what they mean exactly).

Discussion

Some interesting tidbits and comparisons:

  • It seems django has gained “some popularity”. Zope is very high up on the list, and plone is at 42 with 907 projects importing it.
  • The number of projects importing unittest is somewhat depressing, especially relative to setuptools which is impressive. That might be because setuptools is somewhat a prerequisite to appear on PyPI (practically speaking), while unittest is not. (Edit: corrected by Michael Foord in a comment)
  • optparse with 1875 vs. getopt with 515.
  • cPickle with 690 vs. pickle with 598.
  • simplejson with 760 vs. json with 593.

I invite you all to find out more interesting pieces of information by going over the results. I bet there’s a lot more knowledge to be gained from this.

Background

Back in 2007 I wrote a small script that counted module imports in python code. I used it to generate statistics for Python modules. A week or two ago I had an idea to repeat that experiment – and see the difference between 2007 and 2011. I also thought of a small hypothesis to test: since django became very popular, I’d expect it to be very high up on the list.

I started working with my old code, and decided that I should update it. Looking for imports in Python code is not as simple as it seems. I considered using the tokenize and parser modules, but decided against that. Using parser would make my code version dependent and by the time I thought of tokenize, I had the complicated part already worked out. By the complicated part I mean of course the big regexps I used ;)

Methodology

Input: PyPI and a source distribution of the Python 2.7 standard library. I wrote a small script (cheese_getter.py) to fetch Python packages. It works by reading the PyPI index page and then using easy_install to fetch each package. Since there are currently a bit fewer than 13k packages on PyPI, this took some time.
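
cheese_getter.py itself is in the full code linked below; the rough idea is something like this sketch (a reconstruction, not the original script):

import re
import urllib
import subprocess

# the PyPI "simple" index is one big HTML page with one link per package
index_html = urllib.urlopen('http://pypi.python.org/simple/').read()
package_names = re.findall(r'<a[^>]*>([^<]+)</a>', index_html)

for name in package_names:
    # -e/--editable together with -b only fetches and extracts the source, without installing
    subprocess.call(['easy_install', '-e', '-b', 'packages', name])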

Parsing: I wrote a relatively simple piece of code to find “import x” and “from x import y” statements in code. This is much trickier than it seems: statements such as “from x import a,b”, “from . import bla” and

from bla import \
               some_module, \
               some_module2

should all be supported. In order to achieve uniformity, I converted each import statement to a series of dotted modules. So for example, “import a.b” will yield “a” and “a.b”, and “from b import c,d” will yield “b”, “b.c”, and “b.d”.
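
The normalization step is the simple part – something along these lines (a sketch, not the original regexp-based code):

def expand_dotted(module_name):
    """'a.b.c' -> ['a', 'a.b', 'a.b.c']"""
    parts = module_name.split('.')
    return ['.'.join(parts[:i + 1]) for i in range(len(parts))]

def expand_from_import(base, imported_names):
    """"from b import c, d" -> ['b', 'b.c', 'b.d']"""
    return expand_dotted(base) + ['%s.%s' % (base, name) for name in imported_names]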

Processing: I created three result types:

  1. total number of imports
  2. total number of packages importing the module
  3. total number of packages importing the module, counting only the first component of a dotted module name (e.g. “a”, not “a.b”).

I believe the third is the most informative, although there are interesting things to learn from the others as well.

Code: Full code is available. Peer reviews and independent reports are welcome :)


10 Awesome Theorems & Results

When I look back at various mathematical courses I took, most have at least one theorem that I really liked. Usually I like it because the proof has a surprising trick, sometimes it’s because of the unexpected conclusion, or maybe the unintuitive feel to it. In other cases it’s just the elegance of the proof, or the result itself.
Without further ado, here’s a selection of my favorite theorems, in no particular order:

1. Linear Algebra: the Cayley–Hamilton theorem. When I first grokked the fact that you can substitute matrices for the variables in polynomials, I was awestruck. Then I learned that you can define e^A using a Taylor series, but the fun doesn’t stop there. Using the eigenvalues you can greatly simplify the calculation, and it all “works out the same” (i.e., if A = P^(-1)DP and D is diagonal, then p(A) = P^(-1)p(D)P. This also works for Jordan forms). Also, since the complex numbers are isomorphic to the 2×2 matrices of the form [[a, b], [-b, a]], and the calculations work out exactly the same – well, everything “fell into place” for me. At the time it seemed to be one of the joys of Mathematics.
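
It’s also a fun theorem to poke at numerically – a quick check with numpy (a sketch for illustration, not from the original course):

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

coeffs = np.poly(A)    # characteristic polynomial det(xI - A); here x^2 - 5x + 6 -> [1, -5, 6]

# substitute the matrix itself for x: 1*A^2 - 5*A^1 + 6*A^0
p_of_A = sum(c * np.linalg.matrix_power(A, len(coeffs) - 1 - i) for i, c in enumerate(coeffs))

print(np.allclose(p_of_A, np.zeros_like(A)))   # True, as Cayley-Hamilton promises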

2. Calculus: the Bolzano-Weierstrass Theorem. One of the first non trivial results you learn in calculus, I originally learned the version that says: “Every bounded infinite set has a limit point”, and its proof was a bit more elegant in my eyes than the proof of the Wikipedia version. I liked it so much that one time when I was in boot camp in the service, I worked it out again just to keep my mind working. Good times.

3. Probability: The elegant result of V(x) = E(V(x|y)) + V(E(x|y)). Just the sight of it makes one sigh with contentedness, and the result itself is very nice.

4. Calculus, again: Stokes’ theorem and its friends. Very useful and unintuitive; in layman’s terms it says that you can reason about what happens inside a region just by knowing what happens on its boundary.

5. Numerical Analysis: Richardson Extrapolation: one of the most elegant forms of bootstrapping, you start with a simple approximation method as a building block, and at the end you get a very strong high-quality approximation.
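
The classic toy example is differentiation (a sketch for illustration, not course material): one Richardson step takes a central difference from O(h^2) to O(h^4) accuracy.

import math

def d_central(f, x, h):
    """Central difference: error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d_richardson(f, x, h):
    """Combine two h-levels to cancel the h^2 error term: error becomes O(h^4)."""
    return (4.0 * d_central(f, x, h / 2.0) - d_central(f, x, h)) / 3.0

h = 0.1
print(abs(d_central(math.sin, 1.0, h) - math.cos(1.0)))      # ~9e-4
print(abs(d_richardson(math.sin, 1.0, h) - math.cos(1.0)))   # ~1e-7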

6. Computability: the Parameter (s-m-n) theorem. Especially elegant, it basically gives the mathematical definition of a “bind” operation for function parameters. In simple terms, it uses the source code of a function f(x, y) to produce the source code of a function g(y) such that g(y) = f(a, y) for a given a. The nice thing about it is that it works on source code alone, without ever calling the functions themselves.
This theorem had the added bonus that once I grokked it, the test in computability was very, very easy :)
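
In programming terms it’s essentially partial application – though this is only an analogy, since the theorem works on program text rather than on closures:

import functools

def f(x, y):
    return x * 10 + y

g = functools.partial(f, 7)    # "bind" the first parameter: g(y) == f(7, y)
print(g(3))                    # 73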

7. Functional analysis: here it’s a relatively minor result that I ended up remembering distinctly: given z_1, …, z_n which are linearly independent in E, show that there exists a d > 0 such that any w_1, …, w_n satisfying ||w_i − z_i|| < d for each i are also linearly independent. The footnote says that such a finite, linearly independent set is called stable. When visualizing it I think of it this way: given such a set, kick it. As long as you don’t kick it too strongly – it will stay linearly independent. Now that’s stable.

8. Mathematical Logic: The Compactness theorem: “a set of first-order sentences has a model if and only if every finite subset of it has a model”. One direction is almost trivial, but the other is deep. When studying for the test in this course, I remember being stuck for days on an exercise that required the use of this theorem. Once I fully understood the method of its use, it became a favorite.
(By the way, the exercise was the following: let G be a countable set of first-order sentences, and p a first-order sentence. Show that if p is true in every countable model of G, then G |= p.)

9. Cryptography: I had learned a bit of cryptography on my own before taking the cryptography course. When I did take it, though, two methods were especially memorable. The first was the “Meet in the Middle” attack. Not to be confused with “Man in the Middle”, this method attacks symmetric ciphers constructed by repeatedly applying a simpler cipher. This known-plaintext attack got its name from its method of operation: the attacker calculates all possible decryptions of the ciphertext and stores them in a lookup table. Then, he calculates all encryptions of the plaintext and looks each one up in that table. Once a match is found, the combination of the encryption and decryption keys used is the key of the composed cipher.
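
Here’s a toy sketch of the attack’s shape, using a laughably weak one-byte XOR “cipher” purely for illustration (a real attack uses a real cipher and more known pairs to weed out false matches):

def enc(key, block):
    return block ^ key          # toy "cipher": XOR with a one-byte key

def dec(key, block):
    return block ^ key

k1, k2 = 0x17, 0xA9             # the two keys of the composed cipher
p = 0x42
c = enc(k2, enc(k1, p))         # one known plaintext/ciphertext pair

# Step 1: every possible decryption of the ciphertext goes into a lookup table.
middle_to_k2 = {dec(key2, c): key2 for key2 in range(256)}

# Step 2: every possible encryption of the plaintext is looked up in that table.
candidates = [(key1, middle_to_k2[enc(key1, p)])
              for key1 in range(256) if enc(key1, p) in middle_to_k2]

print((k1, k2) in candidates)   # True - found with ~2*256 operations instead of 256*256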

10. The second cryptography result that I liked was secret sharing. Trivial secret sharing is so simple, and yet effective, that when I first learned it I thought: “how come I didn’t think of this before?”.
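
For the curious, the trivial (XOR-based) 2-out-of-2 scheme really is this short – a sketch using Python 2 byte strings:

import os

def split_secret(secret):
    """Each share alone is indistinguishable from random noise; both are needed."""
    pad = os.urandom(len(secret))
    other = ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(secret, pad))
    return pad, other

def combine(share1, share2):
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(share1, share2))

s1, s2 = split_secret('attack at dawn')
print(combine(s1, s2))          # 'attack at dawn'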

There are obviously many more elegant theorems, some of which I’ve learned in my studies. I sure hope to learn a few more. Still, these are special. As a highschool math teacher once told us about the Pythagorean theorem: “I want you to remember the proof even if I wake you in the middle of the night”. The theorems in this short list come close to that ideal.

Now I wonder – what are your favorite theorems?


Beautiful Code

A few days ago, @edensh mentioned beautiful code on Facebook, and many people gave examples of assembly, while I was thinking of Python.

That got me thinking: what is beautiful code for me?

So here are my criteria for beautiful code:

  1. Readable (also visually pretty)
  2. Concise
  3. Does something non trivial (usually in an unexpectedly short manner)
  4. Good (solves the problem, efficiently)

If we consider code to be an implementation of a solution to a problem, then 3 & 4 usually apply to the solution, while 1 & 2 apply to the code itself. This brings me to why I like Python:

  1. Code is more readable. Specifically, I can still easily understand code I wrote years ago. Also, Python’s Zen encourages you to write readable code – “explicit is better than implicit”, for example, directly applies to readability.
  2. Python is visually appealing, although I guess that’s a matter of opinion :)
  3. Python almost always allows me to express my solutions easily & succinctly, whereas with other languages (C, C++, Java) I have to fight to “get my point across”.
  4. Python almost always has the right data structures to implement my solutions efficiently.

With that in mind, it’s clear to me now how assembly code can be beautiful.

Note that I didn’t mention C#, Ruby or Haskell. I don’t have much experience with these languages, but from what I’ve seen so far, it seems they may also help you write beautiful code. Of these, Haskell will probably be the first one I learn – I think it will be the most educational experience, although I’m pretty sure some will argue with me about Haskell’s readability :)

Now, My question to you is: what do you think makes code beautiful?


Small programming challenge No. 6 – nblocks

I came up with this challenge when I had to write a function that divides a sequence into percentiles. I needed it to calculate some statistics over trips for plnnr.com. “This sounds trivial,” I thought, and reached for my simple blocks function:

def blocks(seq, block_len):
    """blocks(range(5),2) -&gt; [[0, 1], [2, 3], [4]]"""
    seq_len = len(seq)
    if seq_len%block_len == 0:
        num_blocks = seq_len/block_len
    else:
        num_blocks = 1 + (seq_len/block_len)
 
    result = [[] for i in xrange(num_blocks)]
    for idx, obj in enumerate(seq):
        result[idx/block_len].append(obj)
    return result

So I set block_len to len(seq)/10 and called blocks(seq, block_len). Unfortunately, according to the docs of blocks (which I wrote…), when there is extra data, a “partial” block is added – which is exactly what we don’t want when calculating percentiles.
Instead, the behavior we want is nblocks(seq, number_of_blocks), for example: nblocks(range(10), 3) -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]].

This is a surprisingly hard function to write – or rather, harder than you’d expect. I’ll be especially glad if someone writes it elegantly. My solution works well enough, but it isn’t the prettiest.

So, you have the definition – let’s see if you can do better than me. Once enough solutions are presented, I will present my own.

IMPORTANT EDIT: the required signature is nblocks(seq, num_blocks). So for nblocks(range(10), 3), the return value should be 3 blocks, with the last one having an extra item. As a general rule, the extra items should be spread as evenly as possible.


Moving to a new server

I moved the blog to a new server, and a new WordPress install.

It will take me some time to get everything back in order, but once I finish it, I’m going to get back to writing regularly. Stay tuned!
