Categories

## Checking the ulam spiral

In the following post, the Ulam spiral is described. It’s a very simple object – write down consecutive natural numbers starting from 41 in a square spiral. Curiously, the numbers on the diagonal are primes: Reading this post, I immediately wanted to check how long this property holds. The original blog post suggests:

Remarkably, all the numbers on the red diagonal are prime — even when the spiral is continued into a 20 × 20 square.

Yet it doesn’t mention if it stops there or not. Without peeking (in Wikipedia), I wrote a simple program to check:

```import sys import psyco psyco.full()   UP = 0 LEFT = 1 DOWN = 2 RIGHT = 3   DIRECTIONS = {UP: (0, -1), LEFT: (-1, 0), DOWN: (0, 1), RIGHT: (1, 0)}   def generate_ulam_seq(n): start = 41 x = 0 y = 0 min_x = 0 max_x = 0 min_y = 0 max_y = 0 value = start direction = UP for i in xrange(n): yield x, y, value value += 1 add_x, add_y = DIRECTIONS[direction] x, y = x + add_x, y + add_y if x < min_x: direction = (direction+1) % 4 min_x = x if x > max_x: direction = (direction+1) % 4 max_x = x if y < min_y: direction = (direction+1) % 4 min_y = y if y > max_y: direction = (direction+1) % 4 max_y = y   def is_prime(n): return all(n%x != 0 for x in xrange(2, int((n**0.5) + 1)))   def main(): for x, y, v in generate_ulam_seq(int(sys.argv)): if x == y and not is_prime(v): print x, y, v   if __name__ == '__main__': main()```

Running it, we get:

```> ulam.py 10000 20 20 1681 -21 -21 1763 22 22 2021 -25 -25 2491 28 28 3233 -33 -33 4331 38 38 5893 -41 -41 6683 41 41 6847 42 42 7181 -44 -44 7697 -45 -45 8051 -46 -46 8413 48 48 9353```

So the property doesn’t hold for a very long time. Reading up on Wikipedia, it seems the Ulam spiral still has interesting properties related to primes in higher sizes. Maybe I’ll plot that one as well in a future post.

PS: Regarding primality testing, this is a cute, quick&dirty one liner. In the range of numbers we’re discussing, it is ‘good enough’.

Categories

## Fractals in 10 minutes No. 6: Turtle Snowflake

I didn’t write this one, but I found it’s simplicity and availability so compelling, I couldn’t just not write about it.
In a reddit post from a while ago, some commenter named jfedor left the following comment:

A little known fact is that you can do the following on any standard Python installation:

```from turtle import *   def f(length, depth): if depth == 0: forward(length) else: f(length/3, depth-1) right(60) f(length/3, depth-1) left(120) f(length/3, depth-1) right(60) f(length/3, depth-1)   f(500, 4)```

If you copy paste, it’s a fractal in less than a minute. If you type it yourself, it’s still less than 10. And it’s something you can show a kid. I really liked this one.

Categories

## PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from allmusic.com and a (very) naive implementation of PageRank.

The PageRank implementation was based on wikipedia word-by-word, and is not efficient, but it works well enough for this presentation. I included it and the allmusic.com example mostly because I thought the results are pretty cool, and it’s very good teaching material.

Here is the presentation, and the code is available here.

Here is how to run it:
``` D:\work\pywebil-harvesting\upload>allmusic.py "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl```

``` ```

```simple_pagerank.py out.pkl ```

Happy harvesting!

Categories

## Bulk INSERTs FTW

A short while ago, I had to research some API for a company I’m consulting for. This API yields very good quality data, but isn’t comfortable enough to process it for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about a 1.2 gigs of RAM.)

Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller sized inputs (a few megs) it worked very well. However, for our real input the processing, had it not died in a fiery MemoryError (again!) would have taken more than two weeks to finish.

(As my defense regarding the MemoryError I’ll add that we added an id cache for records, to try and shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.

Categories

## Computing Large Determinants in Python

Story:
For my seminar work, I had to calculate the determinant of a large integer matrix.
In this case, large meant n>=500. You might say that this isn’t very large and I would agree. However, it is large enough to overflow a float and reach +inf. Reaching +inf is bad: once you do, you lose all the accuracy of the computation. Also, since determinant calculations use addition as well as subtraction, if +inf is reached in the middle of the calculation the value might never drop back, even if the determinant is a “small” number.

(Note: Even if the calculation hadn’t reached +inf, but just some very large floating point number, the same phenomenon could have occurred.)

As I said, my matrix was composed of integers. Since a determinant is a sum of multiples of matrix elements, you would expect such a determinant to be an integer as well. It would be nice to take advantage of that fact.

How should one compute such a determinant? Well, luckily Python has support for arbitrarily large integers. Let’s compute the determinant with them!
Unfortunately, Python’s numpy computes determinants using LU decomposition, which uses division – hence, floating point values.
Well, never mind, I know how to calculate a determinant, right?
Ten minutes later the naive determinant calculation by minors was implemented, with one “minor” setback – it takes O(n!) time.
Googling for integer determinants yielded some articles about division free algorithms, which weren’t really easy to implement. There was one suggestion about using Rational numbers and Gauss Elimination, with pivots chosen to tame the growth of the nominator and denominator.

So, I hitched up a rational number class, using Python’s arbitrarily large integers as nominator and denominator. Then, I got some code to do Gauss Elimination, although my pivoting wasn’t the most fancy. This seemed to work, and I got my desired large determinants.

Epilogue:
Not too long ago, I ran across mpmath, a Python library for arbitrarily large floating point numbers. It is possible I could have used it instead of my own rational numbers. Next time I run into a relevant problem I’ll keep this library in mind.
Also, an even shorter while ago, I became aware that since 2.6 Python boasts a fractions module. This will sure come in handy in the future.

Code:
Available here. If you intend to use it in Python 2.6 and above, I suggest replacing my Rational class with Python’s Fraction.
(Note that I adapted this module to my needs. Therefore it has a timer that prints time estimations for computations, and it uses Psyco.)

Categories

## Easy Harvesting Image by existentist.

I’ve been doing a lot of harvesting (aka screen-scraping) lately. Fortunately, I don’t need forms automation, so I’m using urllib2 and not Mechanize like my friend Ron Reiter recommended.

At first, when I wanted to get some information from a web page quick&dirty-wise, I used regular expressions. This approach works, but is not especially fun to write or to maintain. So for the next harvesting task, I decided to learn Beautifulsoup. Beautifulsoup has an excellent interface, and a parser that deals with messed up (read: random.shuffle()-ed) tags.

Unfortunately, Beautifulsoup is based on Python’s built in html parser, (htmllib.HTMLParser), which is a bad excuse for an html parser.
I decided to give up on it, when I tried to parse a page that had Javascript in it, and the Javascript had a string that contained html. HTMLParser choked on it, and as a result, the page was unparse-able with BeautifulSoup.

Distraught, I remembered that I knew of some other html parser called lxml. I played with it a little, and it seemed to eat up all the pages that BeautifulSoup choked on, and then asked for more. Efficiently.

Now, I had a problem. I already had a harvester written using BeautifulSoup’s interface, and after looking at lxml’s interface, the situation didn’t seem too good. While lxml sports a solid interface, it’s nowhere as quick&easy as BeautifulSoup’s. Also, it used xpath.

My solution: a BeautifulSoup interface wrapper for lxml, which I present here. It mostly does nasty xpath conversions, and it allowed me to make my already written harvester work as well as to write the next harvester. I also gave it to a friend who found it useful.

Here is a short usage example.
Let’s say we are interested in harvesting the names of all the artists appearing on Fiona Apple’s page on allmusic.com.
First, using FireBug’s “inspect”, we can see that all the interesting links are in a table with the id “large-list”. Also, we can see that all the artist links have “sql=11:” in them.
So, our code looks like this:

```In : soup = parse_html.fetch_soup('http://allmusic.com/cg/amg.dll?p=amg&sql=11:jjfixqegldde') In : names = [x.all_text() for x in soup.find(id="large-list").find_all('a', href=re.compile("sql=11:"))] In : print names ['Alanis Morissette', 'Tori Amos', 'Jeff Buckley', 'Aimee Mann', 'Heather Nova', 'Astrid Williamson', 'Kari Newhouse', 'Sarah Blasko', 'Daniel Powter', ...]```

So, here is the relevant code. It mostly translates nice function calls to nasty xpath. It is not really well documented, as it was a quick and dirty solution, and its interface is similar to BeautifulSoup’s. I hope you find it useful. I did.

Categories

## You know you need to install a new Python version when…

```>>> import decimal >>> decimal.Decimal('0.2') <&nbsp; 0.3 False >>>```

This little gem took me two hours to track down.
It turns out that since my code is using sqlobject, it also uses the decimal module. I had some constants set up, and from within some larger algorithm I wanted to compare values extracted from the db to those constants. However, my code didn’t seem to work, and I was sure the problem lay somewhere else in the algorithm.
Two hours later, and I found this comparison. It might be similar to this issue, but I’m not sure.

In any case, I was using Python 2.5.2, and decimal works as expected in Python 2.6, so I guess it is indeed time for an upgrade.
After some more checking with Python 2.6, it seems that:

```>>> decimal.Decimal('0.6') <&nbsp; 0.3 True```

Not good.

Categories

## A Python Rant

Last night I encountered yet again one of Python’s annoyances.
The annoyance I’m referring to is the lack of string like functions for lists. Trivial examples include find() and rfind(). Before you mention index though, it’s important to point out that index() checks for equality. I’d be much happier if instead it could take a function argument for comparisons.
A less trivial example is split(), also with a possible criterion argument. A complicated example is regular expressions.

It seems that most of these functions, when applied to lists should take at least a function argument. Maybe regular expressions for lists would be better with a key argument though.
This reminds me a bit of C++’s generic algorithms for collections.

On a similar subject, it would have been nice, if along with heapq, bisect functions would receive a key argument.

Categories

## Fractals in 10 minutes No. 5: Sierpinski Chaos Game

A few years back, I was in some course where they also taught me some Matlab. One of the exercises was to draw the Sierpinski triangle using a method of progressing a point randomly. I was quite surprised at the time, because I thought the Sierpinski triangle was more of an analytical thing: you drew it using inverted triangles.

I wanted to check up on it, and it turns out the method has a name: Chaos Game.
To generate a Sierpinski triangle using this method, one starts with a some point inside the triangle. Then, at each step, the next point is half the distance from the current point to one of the corners selected at random. I think this method of generating the Sierpinski triangle is even easier than the analytical one. I used pylab (matplotlib) to create this image.
As I usually do, I also wanted to draw it using ascii-art. However, I must confess, I am not satisfied with the result:

```              %6
6*%8
%8  66
%6%*6%%6
6%'    '66
6866'  '6%86
%6* 86 '%8 *66
%**%%'***%6***%%
8'6            6*8
6%%*8          %*'*6
6%6'666        86# 86%
%%*''**%6      %**''*'%6
*%8'   '6%6    668'   '#%%
%% '6' '#*'6%  *% %8' '6''6*
%68%#6#*#68*86*%6#%8%8'6%6*6%*
```

The code isn’t really good, as I didn’t put much thought into it and just hacked it up. Still, I’m putting it up, and I might improve it someday.

Categories

## Fractals in 10 minutes No. 4: Mandelbrot and Julia in Ascii Art

I felt like it’s about time I tackled the Mandelbrot and Julia sets, in Ascii-Art. Heck, the Mandelbrot fractal is on the logo of this Blog! However, being the most well-known fractal, this issue was tackled already, with satisfactory results.

Still, I took the ten minutes required to draw them in Python, using numpy:

```def get_mandelbrot(x, num_iter = 10): c = x for i in range(num_iter): x = x**2 + c return abs(x)>2   def get_julia(x, c, num_iter = 10): for i in range(num_iter): x = x**2 + c return abs(x)>2```

“Hey!” you might say, “There’s no loop in here!”. Indeed, with numpy loops can sometimes be avoided, when using arrays. When the expression x**2 + c is applied to an array x, it is applied element-wise, allowing for implicit loops. The actual magic happens in the following function:

```def get_board(bottom_left, top_right, num_x, num_y): x0, y0 = bottom_left x1, y1 = top_right x_values = numpy.arange(x0, x1, (x1-x0)/float(num_x)) y_values = numpy.arange(y0, y1, (y1-y0)/float(num_y)) return numpy.array([[x+1j*y for x in x_values] for y in y_values])```

The result of get_board will be used as the “x input” later on. It should be noted though that while that’s a cute trick this time, it might grow unhandy for more complicated computations. For example, making each element to reflect the iteration number on which it “escaped”.
So, here are the results:

```############################   ##############
###########################   ###############
##########################     ##############
#####################                ########
####################                #########
############ #######                 ########
############                         ########
#########                           #########
#########                           #########
############                         ########
############ #######                 ########
####################                #########
#####################                ########
##########################     ##############
###########################   ###############

############################################################
############################################################
############################################################
############################ ###############################
######################### ##   #############################
############## #######           ###########################
######## ##      ###              ##########  ##############
#########  ###        ###         ######       #############
##############       ######         ###        ###  ########
###############  ##########              ###      ## #######
############################           ####### #############
##############################   ## ########################
################################ ###########################
############################################################
############################################################
```

Here’s the code. (With the usual bsd-license slapped on top of it :)