Categories

## Computing Large Determinants in Python

Story:
For my seminar work, I had to calculate the determinant of a large integer matrix.
In this case, large meant n>=500. You might say that this isn’t very large and I would agree. However, it is large enough to overflow a float and reach +inf. Reaching +inf is bad: once you do, you lose all the accuracy of the computation. Also, since determinant calculations use addition as well as subtraction, if +inf is reached in the middle of the calculation the value might never drop back, even if the determinant is a “small” number.

(Note: Even if the calculation hadn’t reached +inf, but just some very large floating point number, the same phenomenon could have occurred.)

As I said, my matrix was composed of integers. Since a determinant is a sum of multiples of matrix elements, you would expect such a determinant to be an integer as well. It would be nice to take advantage of that fact.

How should one compute such a determinant? Well, luckily Python has support for arbitrarily large integers. Let’s compute the determinant with them!
Unfortunately, Python’s numpy computes determinants using LU decomposition, which uses division – hence, floating point values.
Well, never mind, I know how to calculate a determinant, right?
Ten minutes later the naive determinant calculation by minors was implemented, with one “minor” setback – it takes O(n!) time.
Googling for integer determinants yielded some articles about division free algorithms, which weren’t really easy to implement. There was one suggestion about using Rational numbers and Gauss Elimination, with pivots chosen to tame the growth of the nominator and denominator.

So, I hitched up a rational number class, using Python’s arbitrarily large integers as nominator and denominator. Then, I got some code to do Gauss Elimination, although my pivoting wasn’t the most fancy. This seemed to work, and I got my desired large determinants.

Epilogue:
Not too long ago, I ran across mpmath, a Python library for arbitrarily large floating point numbers. It is possible I could have used it instead of my own rational numbers. Next time I run into a relevant problem I’ll keep this library in mind.
Also, an even shorter while ago, I became aware that since 2.6 Python boasts a fractions module. This will sure come in handy in the future.

Code:
Available here. If you intend to use it in Python 2.6 and above, I suggest replacing my Rational class with Python’s Fraction.
(Note that I adapted this module to my needs. Therefore it has a timer that prints time estimations for computations, and it uses Psyco.)

Categories

## Easy Harvesting

Image by existentist.

I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need forms automation, so I’m using urllib2 and not Mechanize like my friend Ron Reiter recommended.

At first, when I wanted to get some information from a web page quick&dirty-wise, I used regular expressions. This approach works, but is not especially fun to write or to maintain. So for the next harvesting task, I decided to learn Beautifulsoup. Beautifulsoup has an excellent interface, and a parser that deals with messed up (read: random.shuffle()-ed) tags.

Unfortunately, Beautifulsoup is based on Python’s built in html parser, (htmllib.HTMLParser), which is a bad excuse for an html parser.
I decided to give up on it, when I tried to parse a page that had Javascript in it, and the Javascript had a string that contained html. HTMLParser choked on it, and as a result, the page was unparse-able with BeautifulSoup.

Distraught, I remembered that I knew of some other html parser called lxml. I played with it a little, and it seemed to eat up all the pages that BeautifulSoup choked on, and then asked for more. Efficiently.

Now, I had a problem. I already had a harvester written using BeautifulSoup’s interface, and after looking at lxml’s interface, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere as quick&easy as BeautifulSoup’s. Also, it used xpath.

My solution: a BeautifulSoup interface wrapper for lxml, which I present here. It mostly does nasty xpath conversions, and it allowed me to make my already written harvester work as well as to write the next harvester. I also gave it to a friend who found it useful.

Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using FireBug’s “inspect”, we can see that all the interesting links are in a table with the id “large-list”. Also, we can see that all the artist links have “sql=11:” in them.
So, our code looks like this:

 In [4]: soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde') In [5]: names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))] In [6]: print names ['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova', 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]

So, here is the relevant code. It mostly translates nice function calls to nasty xpath. It is not really well documented, as it was a quick and dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.

Categories

## You know you need to install a new Python version when…

 >>> import decimal >>> decimal.Decimal('0.2') <  0.3 False >>>

This little gem took me two hours to track down.
It turns out that since my code is using sqlobject, it also uses the decimal module. I had some constants set up, and from within some larger algorithm I wanted to compare values extracted from the db to those constants. However, my code didn’t seem to work, and I was sure the problem lay somewhere else in the algorithm.
Two hours later, and I found this comparison. It might be similar to this issue, but I’m not sure.

In any case, I was using Python 2.5.2, and decimal works as expected in Python 2.6, so I guess it is indeed time for an upgrade.
After some more checking with Python 2.6, it seems that:

 >>> decimal.Decimal('0.6') <  0.3 True

Not good.

Categories

## A Python Rant

Last night I encountered yet again one of Python’s annoyances.
The annoyance I’m referring to is the lack of string like functions for lists. Trivial examples include find() and rfind(). Before you mention index though, it’s important to point out that index() checks for equality. I’d be much happier if instead it could take a function argument for comparisons.
A less trivial example is split(), also with a possible criterion argument. A complicated example is regular expressions.

It seems that most of these functions, when applied to lists should take at least a function argument. Maybe regular expressions for lists would be better with a key argument though.
This reminds me a bit of C++’s generic algorithms for collections.

On a similar subject, it would have been nice, if along with heapq, bisect functions would receive a key argument.