Categories
Programming Python Utility Functions

Easy Harvesting

Harvester
Image by existentist.

I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need forms automation, so I’m using urllib2 and not Mechanize like my friend Ron Reiter recommended.

At first, when I wanted to get some information from a web page quick&dirty-wise, I used regular expressions. This approach works, but is not especially fun to write or to maintain. So for the next harvesting task, I decided to learn Beautifulsoup. Beautifulsoup has an excellent interface, and a parser that deals with messed up (read: random.shuffle()-ed) tags.

Unfortunately, Beautifulsoup is based on Python’s built in html parser, (htmllib.HTMLParser), which is a bad excuse for an html parser.
I decided to give up on it, when I tried to parse a page that had Javascript in it, and the Javascript had a string that contained html. HTMLParser choked on it, and as a result, the page was unparse-able with BeautifulSoup.

Distraught, I remembered that I knew of some other html parser called lxml. I played with it a little, and it seemed to eat up all the pages that BeautifulSoup choked on, and then asked for more. Efficiently.

Now, I had a problem. I already had a harvester written using BeautifulSoup’s interface, and after looking at lxml’s interface, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere as quick&easy as BeautifulSoup’s. Also, it used xpath.

My solution: a BeautifulSoup interface wrapper for lxml, which I present here. It mostly does nasty xpath conversions, and it allowed me to make my already written harvester work as well as to write the next harvester. I also gave it to a friend who found it useful.

Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using FireBug’s “inspect”, we can see that all the interesting links are in a table with the id “large-list”. Also, we can see that all the artist links have “sql=11:” in them.
So, our code looks like this:

In [4]: soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde')
In [5]: names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))]
In [6]: print names
['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova',
 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]

So, here is the relevant code. It mostly translates nice function calls to nasty xpath. It is not really well documented, as it was a quick and dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.

Categories
Programming Python

You know you need to install a new Python version when…

>>> import decimal
>>> decimal.Decimal('0.2') <&nbsp; 0.3
False
>>>

This little gem took me two hours to track down.
It turns out that since my code is using sqlobject, it also uses the decimal module. I had some constants set up, and from within some larger algorithm I wanted to compare values extracted from the db to those constants. However, my code didn’t seem to work, and I was sure the problem lay somewhere else in the algorithm.
Two hours later, and I found this comparison. It might be similar to this issue, but I’m not sure.

In any case, I was using Python 2.5.2, and decimal works as expected in Python 2.6, so I guess it is indeed time for an upgrade.
After some more checking with Python 2.6, it seems that:

>>> decimal.Decimal('0.6') <&nbsp; 0.3
True

Not good.

Categories
Programming Python rants

A Python Rant

Last night I encountered yet again one of Python’s annoyances.
The annoyance I’m referring to is the lack of string like functions for lists. Trivial examples include find() and rfind(). Before you mention index though, it’s important to point out that index() checks for equality. I’d be much happier if instead it could take a function argument for comparisons.
A less trivial example is split(), also with a possible criterion argument. A complicated example is regular expressions.

It seems that most of these functions, when applied to lists should take at least a function argument. Maybe regular expressions for lists would be better with a key argument though.
This reminds me a bit of C++’s generic algorithms for collections.

On a similar subject, it would have been nice, if along with heapq, bisect functions would receive a key argument.

Categories
Design Programming

Open Question No. 1: Persistent Predicates?

Lately I’ve been developing a website. One issue that I’ll probably need to address in the near future is “persistent predicates”. By “persistent predicates” I mean the problem of treating predicates as data.

Consider the following situation: you are developing some big rss reader/aggregator and you want to allow users to specify handling rules. How would you keep these rules in memory, and how would you keep them on disk?
Obviously, this problem was solved before. Just consider email filters, or even packet filters in ethereal.

One way of approaching the problem is implementing simple predicate templates:
“%field contains %s” where field is subject, or body, etc.
Once that is accomplished, you can specify that a “filter” is some combination (for example logical and, or logical or) of multiple predicates. To store this, we’ll have an actual predicate table (or pickle) with their data, and a one-to-many mapping of filters to predicates.

Another option is allowing just some very simple predicates, and a filter will just “point” to (have an id/name of) the required predicate, and the required data. In this option, all data is stored with the filter.

A more complicated solution is to implement some logical serialize-able lanugage (such as the expression trees I used for diStorm or PyKoan). Using this language, the predicates can be very dynamic, and be combined and manipulated programmatically. This solution might be overkill for many projects though.

An interesting issue regarding handling of predicates, is their application to constraint solving. However, this is an issue for a future post. Suffice it to say, that when writing PyKoan I’m using a constraint solver. Since I’m representing predicates with expression trees, the ability to analyze and manipulate predicates is very handy.

Besides looking at existing solutions, I’m very curious to hear other peoples’ opinions. Feel free to write about your preferred solution in the comments.

Categories
Algorithms computer science Fractals Math Programming Python

Fractal Memory Usage and a Big Number

In a previous post I said I’d talk about 4**(4**(4**4)).
First, about the number. I first saw it mentioned in a math lesson back in high-school, when the teacher wanted to demonstrate estimation abilities.
He wrote it on the board, like so:

4444

and then asked students to estimate how big that number is. Turns out this number is much bigger than most students thought. (For comparison, the number of atoms in the observable universe is 1080.)

Given that this number is so big, it should come as no surprise that computing it will consume all available memory.
What is interesting is the way the memory is consumed:

(It died with a MemoryError just before the next “jump”.)

When I first saw this I was fascinated. To those interested it is generated by the long pow function in Python, implemented in longobject.c. According to the comment there, the algorithm is a left to right 5-ary exponentiation, taken from “Handbook of Applied Cryptography”.
In short, the simpler algorithm “left-to-right binary exponentiation” (used for smaller exponents) repeatedly squares the accumulator, and according to the binary digits of the exponents, multiplies it by the base. 5-ary exponentiation basically does the same thing, but in each iteration it ‘eats’ five digits of the exponent instead of just one.

It might be very interesting to study algorithms and code according to their memory usage. Once at work I used a similar memory usage graph to optimize the memory usage of some algorithm that was eating too much memory. I don’t remember the details as it happened a few years ago, but I do remember that I recognized the “type of the memory eating”, which helped me to narrow the problem down to a specific area of code, where indeed the problem was located. (I consider this a case of programming-fu :)

Notes:
1) The reason I’m talking about Python is because it has arbitrarily sized longs built-in.
2) This graph was generated using Windows’ task manager. On my Ubuntu machine it looks quite different, almost a straight line. Any guesses why?

Categories
Personal Programming Python Testing

My Bad Memory, High Load, and Python

About a month ago the new Ubuntu 8.04 was released and I wanted a clean install. I downloaded an image and burned it to a CD. Just before installing, I tried “check CD for defects” and found a few. Turns out (*) this was because of bad memory – and memtest confirmed it.
So I went to the shop, replaced the bad memory, and also bought two new sticks. I went home to install the new Ubuntu, and after the installation, Firefox crashed. After rebooting back to memtest, I saw this:

memory errors in memtest

Back at the computer shop, they asked me to reproduce the errors. Just firing up the computer and booting directly into memtest didn’t seem to do the trick, so I suspected that I had to overwork my computer a bit to reproduce this.

Since I was at the lab, I didn’t want to muck around too much.
So I thought, “what’s the quickest way to give your CPU a run around the block?”
That’s right – a tight loop:

while True:
    pass

However, this snippet doesn’t really play with the memory.

The next simplest thing to do, that also jiggles some ram is the following (and one of my favorites):

In [1]: x = 4**(4**4)
In [2]: y = 4**x

I will talk about this peculiar piece of code at a later post.

In any case, this snippet also didn’t reproduce the error. It is also quite unwieldy, as it raises a MemoryError after some time. Later at home I tried two more scripts.
The first is a variation on the one above:

x = 4**(4**4)
while True:
    try:
        y = 4**x
    except MemoryError:
        pass

I ran a few of those in parallel. However, my Ubuntu machine actually killed the processes running this one by one.

The second is smarter. It allocates some memory and then just copies it around:

import sys
import copy
megabytes = int(sys.argv[1])
 
x1 = [["a"*1000 + str(i) for i in range(1000)] for j in range(megabytes)]
while True:
    x2 = copy.deepcopy(x1)

After both of these scripts didn’t reproduce the problem and it still persisted arbitrarily, I returned the computer to the lab. Turns out that the two replacement sticks and the two new sticks weren’t exactly identical, and that was the cause of the problem. So now my memory is well again.

As for the scripts above, I once wrote a similar script at work. I was asked to help with testing some software in some stress testing. The goal was to simulate a heavily used computer. A few lines of Python later and the testing environment was ready.

Footnotes:
(*) – Finding out that it was a memory issue wasn’t as easy as it sounds. I didn’t think of running memtest. I checked the image on my HD with md5, and the hash didn’t match. I downloaded a second image, and again the hash didn’t match. I checked twice.
At this point I was really surprised: not only the second check didn’t match the published md5, it also didn’t match the first check. Some hours and plenty of voodoo later, a friend suggested running memtest, and the culprit was found.

Categories
Game Development Math Programming Projects Python

PyKoan – The Logic Game

As you can probably tell, I’m back from my undeclared hiatus. I’ve got lots of stuff to talk about, and I’ll be starting with PyKoan, one small project I’ve been working on lately in my spare time.

A few weeks ago I stumbled upon an article in wikipedia, regarding a logic game. This game fascinated me, especially because of the “Godel Escher Bach” connection. Quoting from Wikipedia:

Zendo is a game of inductive logic designed by Kory Heath in which one player (the “Master”) creates a rule for structures (“koans”) to follow, and the other players (the “Students”) try to discover it by building and studying various koans which follow or break the rule. The first student to correctly state the rule wins.

As it happens I’m also taking a mathematical logic course this semester. Having read about the game, I wanted to write a similar computer game. Hence – PyKoan.

PyKoan is a game where the goal is to discover some logical rule, for example, “For each x holds x%2 == 0”. This rule is applied to a koan – a list of integers. An example koan that “has Buddha nature” (follows the rule) is [0,2,8]. One which doesn’t is [1].

To implement this idea, I wrote an implementation of an expression tree very similar to the one used in vial, but with a different implemented language. I also did some experimentation with the design. Since I’ve been talking a lot about expression trees without giving a full explanation, in a future post I’ll write about the implementation used in PyKoan.

So far I didn’t code a lot of the game, just the expression tree framework, and a simple rule builder. When using Python’s interactive prompt, one can get a taste of how the game might feel:

In [19]: r = rulegen.create_rule(rulegen.easy_grammer, rulegen.easy_grammer_start)
 
In [20]: r.eval([])
Out[20]: False
 
In [21]: r.eval([0])
Out[21]: False
 
In [22]: r.eval([1])
Out[22]: False
 
In [23]: r.eval([2])
Out[23]: True
 
In [24]: r.eval([2,2])
Out[24]: True
 
In [25]: r.eval([2,2,3])
Out[25]: True
 
In [26]: r.eval([3])
Out[26]: False
 
In [27]: r.eval([4])
Out[27]: False
 
In [28]: print r
x exists such that x == 2

Here is how I would generate such an expression manually:

In [3]: from exptree import exps
 
In [4]: exps.Exists('x',exps.Eq('x',2))
Out[4]: exps.Exists(exps.Var(x), exps.Eq(exps.Var(x), exps.Imm(2)))
 
In [5]: print _
x exists such that x == 2

The game has many interesting possibilities for research, for example, computer players. Other possibilities include “just” guessing koans (not the rule itself), creating interesting and playable rules, and so on. There’s a lot to do.

This time, instead of just publishing the (unfinished) code, I decided to do something different. I’ve opened a space in assembla, with public read access. I’m opening this project for participation: if you want to join then leave a comment, or send me an email.

(Since Assembla seems to be going through some connectivity issues right now, here’s a link to a copy).

Categories
Programming Python

Singleton: my use-case for class decorators

It seems that class decorators will finally be making an appearance in Python 3 and Python 2.6. I’ve thought about using them from time to time, but the use-cases weren’t that common. However, I did find myself writing a lot of singletons.

This time I wanted to write a bunch of boiler-plate objects, that should all be singletons. So I thought of class decorators, to avoid repetition. Here is my singleton implementation, and it seems to work under Python2.6:

>>> def eq_is(self, other):
...     return self is other
...
>>> def singleton(cls):
...     cls.__eq__ = eq_is
...     return cls()
...
>>> @singleton
... class a(object):
...     def __repr__(self):
...         return 'foobar'
...
>>> a
foobar
>>> a == a
True

I just can’t wait for the new Pythons to come out of alpha!

Categories
Programming Python

The Case Against Floating Point ==

While working on the clipping bug, I had to do some floating point comparisons. To handle this virtual mine-field, I used simple precision comparisons.

Not surprisingly, I wanted to know what’s ‘the right way’ to do this in Python. When I discussed this subject on #python, I was told to take a look at the decimal module, of which I wasn’t aware at the time.

Most experienced programmers know that you shouldn’t compare floating point numbers with ==. If you want to check floating point equality, you usually decide on a precision, and check either the absolute error, or the relative error. Hence, floating point == isn’t used, maybe except for rare circumstances. I would even venture to say that when possible, static checkers should emit warnings on floating point ==.

On the other hand there are the beginner programmers. Those usually use == by mistake, and will be surprised by the strange results they sometimes get back.

My suggestion is to do either of the following:
1. Change floating point == to behave like a valid floating point comparison. That means using precision and some error measure.
2. Change floating point == to raise an exception, with an error string suggesting using precision comparison, or the decimal module.

Since this change is not backwards compatible, I suggest it be added only to Python 3.
Personally, I prefer no. 2. It is clearer, and less confusing.

Arguments against this suggestion are:

1. This change breaks existing programs:
I believe it most likely triggers hidden bugs. Since the suggestion is to change it only in Python 3, Those programs will most likely be broken by other changes as well, and will need to be changed anyway.

2. This change breaks compatibility with C-like languages:
I agree. However, the already agreed on change of the / operator is even a stronger break. Most arguments for changing the / operator apply here as well.

3. Programmers will still need the regular ==:
Maybe, and even then, only for very rare cases. For these, a special function\method might be used, which could be named floating_exact_eq.

What are your thoughts on the subject?

Categories
Programming Python Utility Functions

Various Small Python Helpers

I’ve been working lately with arkon on various projects related to diStorm. One of these projects involved writing a solid Python API around an existing C API. This C API uses a lot of flags as arguments and return values of functions.
Since I want my code to be easy to use in an interactive shell, I’d like to return something that’s not just a number. The obvious solution is to use a Python Enum. There are many implementations going around, most of them equivalent. I’ve got the one I use. What I recently added though, is the SymbolInt:

def SymbolInt(value, name):
    class _SymbolInt(int):
        def __str__(self):
            return name
        def __repr__(self):
            return 'SymbolInt(%d, "%s")' % (value, name)
        def __eq__(self, other):
            if isinstance(other, str):
                other = other.lower()
            return int(self)==other or name.lower() == other
        def __ne__(self, other):
            return not self == other
    return _SymbolInt(value)

Short, and very much to the point, this makes exploratory API’s much more readable and usable. It is useful as a return value from enum functions as well.

Along with SymbolInt I wrote hexint:

class hexint(long):
    def __str__(self):
        return hex(self)[2:-1]
    __repr__ = __str__
    def __repr__(self):
        return hex(self)[:-1]

This one makes program addresses readable.

Lastly, here’s a little function that I found missing from itertools:

def head(iterable, num_items):
    for obj,i in zip(iterable,xrange(num_items)):
        yield obj

UPDATE: Don’t use this head() function, use itertools.islice instead. Thanks go to Erez for pointing that out.