Categories

## Fast Peak Autocorrelation

So, I was at geekcon. It was a blast.

There were many interesting projects, and I didn’t get to play with them all. I did get to work a bit on the Lunar Lander from last year, and this year it was finished successfully. My part was the PC game which interfaced with the microcontroller controlling the lander. As you probably guessed, it was written in Python.

This year, as promised, I worked on the Automatic Improviser. I worked on it with Ira Cherkes.

While the final version worked, it didn’t work well enough, and there is still much work to be done. Still, we had excellent progress.
By the way, I know this subject has been tackled before, and I still wanted to try it myself, without reading too much literature about it.

One of the components of the system is a beat recognizer. My idea to discover the beat is simple: find the envelope (similar to removing AM modulation), and then find the low “frequency” of the envelope.
Instead of doing a Fast Fourier Transform for beat recognition, we were advised that autocorellation will do the trick better and faster. However, when trying to autocorellate using scipy.signal.correlate we discovered that autocorellation was too slow for real time beat recognition, and certainly wasteful.

To solve this issue, we decided to first do peak detection on the envelope, and then autocorellate the peaks. Since there shouldn’t be too many peaks, this has the potential of being really quick. However, there was no standard function to do autocorellation of peaks, so we implemented it ourselves. We were pretty rushed, so we worked fast. Here’s the code:

```def autocorrelate_peaks(peaks): peaks_dict = dict(peaks) indexes = set(peaks_dict.keys()) deltas = set() for i in indexes: for j in indexes: if j>i: continue deltas.add(i-j)   result = {} for d in deltas: moved = set(i+d for i in indexes) to_mult = moved & indexes assert to_mult <= indexes s = sum(peaks_dict[i-d]*peaks_dict[i] for i in to_mult) result[d] = s return result```

This function takes as input a list of tuples, each of the form (peak_index, peak_value), and returns a mapping between non-zero corellation offsets and their values.
Here’s is a sample run:

```In [7]: autocorrelate_peaks([(2, 2), (5,1), (7,3)]) Out[7]: {0: 14, 2: 3, 3: 2, 5: 6}```

After implementing this function, our recognition loop was back to real-time, and we didn’t have to bother with optimizing again.

Categories

## Easy Harvesting

Image by existentist.

I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need forms automation, so I’m using urllib2 and not Mechanize like my friend Ron Reiter recommended.

At first, when I wanted to get some information from a web page quick&dirty-wise, I used regular expressions. This approach works, but is not especially fun to write or to maintain. So for the next harvesting task, I decided to learn Beautifulsoup. Beautifulsoup has an excellent interface, and a parser that deals with messed up (read: random.shuffle()-ed) tags.

Unfortunately, Beautifulsoup is based on Python’s built in html parser, (htmllib.HTMLParser), which is a bad excuse for an html parser.
I decided to give up on it, when I tried to parse a page that had Javascript in it, and the Javascript had a string that contained html. HTMLParser choked on it, and as a result, the page was unparse-able with BeautifulSoup.

Distraught, I remembered that I knew of some other html parser called lxml. I played with it a little, and it seemed to eat up all the pages that BeautifulSoup choked on, and then asked for more. Efficiently.

Now, I had a problem. I already had a harvester written using BeautifulSoup’s interface, and after looking at lxml’s interface, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere as quick&easy as BeautifulSoup’s. Also, it used xpath.

My solution: a BeautifulSoup interface wrapper for lxml, which I present here. It mostly does nasty xpath conversions, and it allowed me to make my already written harvester work as well as to write the next harvester. I also gave it to a friend who found it useful.

Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using FireBug’s “inspect”, we can see that all the interesting links are in a table with the id “large-list”. Also, we can see that all the artist links have “sql=11:” in them.
So, our code looks like this:

```In [4]: soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde') In [5]: names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))] In [6]: print names ['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova', 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]```

So, here is the relevant code. It mostly translates nice function calls to nasty xpath. It is not really well documented, as it was a quick and dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.

Categories

## Small Python Interactive Interface Trick

Not too long ago, I helped someone with some research work, and one of the research tools was also used from the Python interactive prompt.
The interface used was handy enough for development: intuitive, readable, short, and useful. Still, for interactive research it wasn’t handy enough.
To improve the script, I decided to use a trick I heard about a long time ago – using __repr__:

```class NoParens(object): def __init__(self, obj): self.obj = obj def __call__(self, *args, **kwargs): return self.obj(*args, **kwargs) def __repr__(self): result = self.obj() return str(result)```

And using it looks like this:

```In [2]: import noparens   In [3]: def test(): ...: return 4**4 ...:   In [4]: t = noparens.NoParens(test)   In [5]: t Out[5]: 256   In [6]: t,t,t Out[6]: (256, 256, 256)```

For research consoles that interact with other programs or devices, such as debuggers and sniffers I found this useful, and made work a bit less annoying.

Categories

## Python Tidbits

Memory Leaks

With the advent of the Python’s garbage collector, it would seem that Python programs should not leak memory. However, this is not always the case. With just a bit of circular references, you can find yourself leaking quite a bit of memory.

A simple example is a bidirectional data structure, like a 2-way linked list, or a tree where the nodes have a reference to their parents. To really free those, you have to break the cycle manually. Another option is to use a weakref for some of the references. (Usually, I like to use weakref.proxy for the parent links.)

An easy way to test code for memory leaks is to run it in a loop (allocate, do job, deallocate, repeat), and call gc.collect() before and after. If the value that gc.collect() returns after the loop grows with the size of the loop, you might have a memory leak.

I found out about a special case of a potential memory leak just the other day.
Let’s say you want to implement some switch code. A common idiom is to write it like this:

```def a(x): print 'a', x def b(x): print 'b', x ... func = {0: a, 1: b}[x] func()```

This works well enough. However, if you do the same thing in a class instance, things start to get messy:

```class A(object): def __init__(self): self.lookup = {0: self.a, 1: self.b} def a(self, x): print 'a', x def b(self, x): print 'b', x def do_something(self): ... func = self.lookup[x] func(x)```

In this simple example, __init__ creates a reference cycle. This is because self.a creates a method object on the spot, with a reference to self.
So, if z is an instance of A, our cycle is: z->lookup->z.a->z.

Take heed.
(Note: Some cycles are collectible with gc.collect(). The simple cycle described here is such a case. It may get ugly though.)

Undefined
While writing the VM for vial, we needed some way to avoid testing assembly code for undefined values.

For example, after executing a div instruction, CF is undefined. We handle this in the templates by writing CF=UNKNOWN(), and we allow to VM to decide on a value. If you need the VM to just execute code, returning 0 for undefined values is fine. If, however, you are testing code, your actual CPU might return a non-zero value, and your test will fail with no real reason.

To solve this problem, our VM may be told to return a special singleton object for undefined values:

```class _UndefinedNumber(object): def __eq__(self, other): if other is self: return True return False def __ne__(self, other): return not self == other def binary_action(self, other): return self def unary_action(self): return self def __str__(self): return "Undefined" __repr__ = __str__   (__add__, __sub__, __mul__, __div__, __floordiv__, __mod__, __pow__, __and__, __xor__, __or__, __lshift__, __rshift__, __rlshift__, __rrshift__, __radd__, __rsub__, __rmul__, __rdiv__, __rfloordiv__, __rmod__, __rpow__, __rand__, __rxor__, __ror__) = 24*[binary_action]   (__neg__, __pos__, __abs__, __invert__) = 4*[unary_action]   Undefined = _UndefinedNumber()```

Now, we test only the flags that are not Undefined. Note that any operation on an Undefined returns an Undefined. This is done to allow registers to be Undefined as well. This way, if AX gets Undefined, so will AH and EAX.

Here is an example:

```In [2]: Undefined + 1 Out[2]: Undefined   In [3]: Undefined * 2 Out[3]: Undefined   In [4]: Undefined / Undefined Out[4]: Undefined```

Note that this constant might behave a bit strangely. For example, Undefined * 0 is Undefined, and Undefined – Undefined is also Undefined. Because of that, I’m not sure using this object in production circumstances is a good idea, so in the meantime, this is just a testing utility.

Categories

## Various Small Python Helpers

I’ve been working lately with arkon on various projects related to diStorm. One of these projects involved writing a solid Python API around an existing C API. This C API uses a lot of flags as arguments and return values of functions.
Since I want my code to be easy to use in an interactive shell, I’d like to return something that’s not just a number. The obvious solution is to use a Python Enum. There are many implementations going around, most of them equivalent. I’ve got the one I use. What I recently added though, is the SymbolInt:

```def SymbolInt(value, name): class _SymbolInt(int): def __str__(self): return name def __repr__(self): return 'SymbolInt(%d, "%s")' % (value, name) def __eq__(self, other): if isinstance(other, str): other = other.lower() return int(self)==other or name.lower() == other def __ne__(self, other): return not self == other return _SymbolInt(value)```

Short, and very much to the point, this makes exploratory API’s much more readable and usable. It is useful as a return value from enum functions as well.

Along with SymbolInt I wrote hexint:

```class hexint(long): def __str__(self): return hex(self)[2:-1] __repr__ = __str__ def __repr__(self): return hex(self)[:-1]```

Lastly, here’s a little function that I found missing from itertools:

```def head(iterable, num_items): for obj,i in zip(iterable,xrange(num_items)): yield obj```

UPDATE: Don’t use this head() function, use itertools.islice instead. Thanks go to Erez for pointing that out.

Categories

## Solution for the Random Selection Challenge

A few days ago, I wrote up two small Python Challenges. Several people have presented solutions for the first challenge, and I also posted my solution in the comments there.

However, the second challenge remained unsolved, and I will present a solution for it in this post.

Categories

From time to time, I need to harvest a website, or many websites. For example, to collect the data from IMDB to run the Pagerank algorithm. Other times I need to query some non-web servers.

Usually in such cases, I have a ‘read_single_url’ function that is called in a loop from a ‘read_all_urls’ function. The straightforward implementation of this will run slowly. This is not because read_single_url takes a lot of time to parse the websites it downloads. The delay is mostly due to the latency of network operations. Even on high bandwidth connections, your bandwidth utilization will be quite low.

To fix this, I wrote a function named threadmap that runs each call of read_single_url in a separate thread. Just like map, threadmap runs a given function for each element in the input sequence, and returns once all the calls are complete.

Here is an example use of the function:

```threadmap.threadmap(query_server, url_list, max_threads=10, on_exception=threadmap.IGNORE)```

My first naive implementation just created a thread for each element in the list, and started them all simultaneously. This caused network IOErrors and other problems. This issue was handled by setting a maximum number of threads that may run at once.

The next issue I had to handle was exceptions. It is not obvious what is the best course of action once the inner function raises an exception. At the least, the exception has to be handled so that threadmap’s synchronizing code may be allowed to run.
My current implementation allows for a few different behaviors: ignoring the exception, aborting threadmap, retrying, and returning a default value for the problematic call. To implement these behaviors, I used the traceback module, after reading Ian Bickings’ excellent explanation of exception re-raising.

Categories

## Small Python Challenge No. 2 – LRU Cache

Caching is easy. Consider the cache I used to optimize the recursive spring:

```class _NotInDict(object): pass _NotInDict = _NotInDict() def cached(func): cache = {} def wrapper_func(*args): prev_result = cache.get(args, _NotInDict) if prev_result is _NotInDict: result = func(*args) cache[args] = result return result return prev_result return wrapper_func```

This kind of cache is simple and effective (especially for recursions), and may be used in all sorts of situations. However, sometimes you want a size limited cache. In that case you have to decide on the criterion used to decide which items to throw away. There are many kinds of criteria used, for further reading check out wikipedia.

For now I’d like to discuss the LRU cache though. LRU stands for Least Recently Used, which means that you throw away the items you didn’t use for a long time. Time in this case is measured by actions. I thought of this type of cache when I worked on the recursive spring. Since each step in the ‘recursivation’ used two samples of the previous step, caching was an obvious choice, and if I had to size limit my cache, LRU whould be the type of cache to use, as you could be certain that the older samples would not have to be used until the next drawing.

The challenge for the weekend is to write an LRU cache in Python. The cache has to be general – support hash-able keys and any cache size required. It has to be efficient – in the size of the cache and the time it takes for a lookup and an update. Once the standard requirements have been met, the big competition should be on elegance. I wrote a benchmark implementation, which is efficient, but not fast. Once I see some solutions, I’ll talk a bit about mine, which is an interesting case-study.

Categories

## Simple word lookup for crosswords and such…

I am swamped with university homework. I did manage to come up with some small somewhat useful utility.

Do you know when you need a word that ends with a double letter and then an ‘n’? you would want to write it like this: “xxN\$”. Well, I wrote a simple tool that translates a ‘word lookup syntax’ into regular expressions, and then does a lookup for the word.

This raises a two questions:

What exactly is the syntax?
Small letters stand for any letter, and big letter stand for their specific letters. So for example, “bEEn” would match “been”, “seen” and “keep”, but not “boon”. Punctuation is left unchanged, so you can use \$,^, (?!xy) and so on. I didn’t work too too much on it, so [] doesn’t translate well (the letters inside it will come out garbled).

What is the data for the lookup?
Well, I used the excellent natural language toolkit (obviously in Python). Very easy to start using it, with a nice tutorial. It’s nice knowing that such a tool is available. The source data is actually an extract from the Gutenberg project. The result words are sorted according to frequency in the source language, and the 20 most frequent words are returned.

Without further ado, here are some code snippets. (The compiled regexp is printed. Note how ugly (yet simple) it is :)
(Note: I did cut out some of the output…)

```In [1]: fd = words.create_word_index()   In [2]: words.find_words("^bEEn\$",fd) ^(?P<b>.)EE(?!((?P=b))|(E))(?P<n>.)\$ Out[2]: ['BEEN', 'SEEN', 'KEEP', 'FEET', 'FEEL', 'SEEK', 'SEED', 'MEET', 'DEEP', 'NEED', 'SEEM', ...]   In [3]: words.find_words("^jvfpf\$",fd) ^(?P<j>.)(?!((?P=j)))(?P<v>.)(?!((?P=j))|((?P=v)))(?P<f>.)(?!((?P=j))|((?P=f))|((?P=v)))(?P<p>.)(?P=f)\$ Out[3]: ['THERE', 'THESE', 'WHERE', 'JESUS', 'MOSES', 'HOSTS', 'LINEN', 'PIECE', 'SCENE', 'WHOSO', 'POSTS', 'EPHAH', ...]```

Note that if you find some of the result words a bit too uncommon for your taste (e.g. Ephah), I would suggest removing the bible from the source data… :)

The (very ugly, uncommented, draft code) is available here. I apologize for its unseemliness, but I figured that it’s better to make it available in its current form, rather than never put it online for lack of time and will. It could also be seen that the code isn’t terribly efficient. It is however, ‘good enough’ for interactive work.

Categories

## Small Python Utility Functions

While working with Gil Dabach on Distorm3, I found out that I’ve been missing a lot of utility functions. I’m going to write about some of them now.

```def classify_to_dict(seq, key_func): result = {} for item in seq: key = key_func(item) if key in result: result[key].append(item) else: result[key] = [item] return result```

This is really a simple function, but a very powerful one. Here is a simple demonstration:

```>>> base_tools.classify_to_dict(range(5), lambda k: k%2) {0: [0, 2, 4], 1: [1, 3]}```

Note the similarity and difference from `groupby`.
If any of you can suggest a better name then classify_to_dict (maybe just classify?), I’ll be happy to hear it.

Some other very useful functions include `arg_min` and `arg_max`, which respectively return the index of the maximum and minimum element in a sequence. I also like to use `union` and `intersection`, which behave just like `sum`, but instead of using the + operator, use the | and & operators respectively. This is most useful for sets, and on some rare occasions, for numbers.

Another function I like (but rarely use) is `unzip`. I know, I know, paddy mentioned it is pretty obvious, and I know that it could just as well be called transpose, however, I still find using unzip(seq) a better choice then the much less obvious zip(*seq). Readability counts.

What are your favorite utility functions?

A minor update: I’ve incorporated the use of a syntax highlighter now, and it should be enabled for comments as well by the next challenge (which will be soon enough).