
10 Python Optimization Tips and Issues

Following my previous post on Optimizing Javascript, I thought I’d write a similar post regarding Python optimization.


Profiling, Profiling, Profiling

This seems obvious, but still merits a mention: don’t optimize your code before profiling it. If you need a profiler, just use Firebug’s. Having tests to make sure that your post-profiling code is equivalent to the original is also nice.
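On the Python side, the standard library ships a profiler of its own. A minimal sketch with cProfile and pstats (`slow_sum` is just an illustrative stand-in for real code):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # deliberately naive code, just to have something to profile
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# show the ten most expensive calls, sorted by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
print(out.getvalue())
```

The report names each function with its call count and time, which is usually enough to find the hot spot before touching any code.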

Later is better than now

As is well known, it is usually useful to have some of the page content fetched in the original HTTP GET, but to fetch non-essential content with Ajax after page load.

In some cases, user interaction with your website does not require fetching new content, just doing some processing, and generating HTML is especially slow. In that case, the user will have to wait until the processing is done. Unless, of course, you can ‘postpone it for later’: simply use setTimeout to do the work outside of the event handler.
For example, I used Bill Chadwick’s ArrowLine, which redraws on zoom-in/out. Since the redraw isn’t quick, this makes zooming in pretty slow, especially when quickly zooming more than once.
My solution: since what takes time is drawing the arrows themselves, I put the arrow-drawing code in a timeout callback, so the interface doesn’t become unresponsive.

(Of course, this might misbehave, and I had to make sure my code still works if multiple timeouts are enqueued, etc.)

A single div is all you need!

I had to write a dropdown widget for a UI, one that could show arbitrary html inside the dropdown box. The problem is, I had to have a lot of similar dropdowns. This in itself is pretty easy, as we all know how to use a for-loop.
It gets complicated when you find out that adding these dropdowns takes a lot of time. What’s worse, the part that takes time is adding the div that gets displayed when the dropdown is shown, but is otherwise hidden.

To solve this, you may first apply “Later is better than now”, and create each dropdown only when it’s needed. This will work. In some cases, though, a different approach is warranted: all of these dropdowns share the same design of the ‘dropdown div’, and only one of those divs is shown at any time (I had to write code to make sure of that). So, instead of making a new dropdown div for each dropdown, create just one and let them share it. This might complicate your code, but the rewards may be great.

To make it work, you may use an object factory, or a function that creates widget classes, or any other pattern that works for you.


Checking the Ulam spiral

In the following post, the Ulam spiral is described. It’s a very simple object – write down consecutive natural numbers starting from 41 in a square spiral. Curiously, the numbers on the diagonal are primes:

Ulam spiral

Reading this post, I immediately wanted to check how long this property holds. The original blog post suggests:

Remarkably, all the numbers on the red diagonal are prime — even when the spiral is continued into a 20 × 20 square.

Yet it doesn’t mention whether it stops there or not. Without peeking (at Wikipedia), I wrote a simple program to check:

import sys
import psyco

# spiral directions, in counterclockwise turn order
UP = 0
LEFT = 1
DOWN = 2
RIGHT = 3
DIRECTIONS = {UP: (0, -1),
              LEFT: (-1, 0),
              DOWN: (0, 1),
              RIGHT: (1, 0)}

def generate_ulam_seq(n):
    # yield (x, y, value) for the first n cells of the spiral, starting at 41
    start = 41
    x = 0
    y = 0
    min_x = 0
    max_x = 0
    min_y = 0
    max_y = 0
    value = start
    direction = UP
    for i in xrange(n):
        yield x, y, value
        value += 1
        add_x, add_y = DIRECTIONS[direction]
        x, y = x + add_x, y + add_y
        # turn whenever we step outside the bounding box visited so far
        if x < min_x:
            direction = (direction + 1) % 4
            min_x = x
        if x > max_x:
            direction = (direction + 1) % 4
            max_x = x
        if y < min_y:
            direction = (direction + 1) % 4
            min_y = y
        if y > max_y:
            direction = (direction + 1) % 4
            max_y = y

def is_prime(n):
    # trial division up to sqrt(n) - quick & dirty, see the PS below
    return all(n % x != 0 for x in xrange(2, int(n ** 0.5) + 1))

def main():
    for x, y, v in generate_ulam_seq(int(sys.argv[1])):
        if x == y and not is_prime(v):
            print x, y, v

if __name__ == '__main__':
    psyco.full()
    main()
Running it, we get:

> 10000
20 20 1681
-21 -21 1763
22 22 2021
-25 -25 2491
28 28 3233
-33 -33 4331
38 38 5893
-41 -41 6683
41 41 6847
42 42 7181
-44 -44 7697
-45 -45 8051
-46 -46 8413
48 48 9353

So the property doesn’t hold for very long. Reading up on Wikipedia, it seems the Ulam spiral still has interesting prime-related properties at larger sizes. Maybe I’ll plot that one as well in a future post.

PS: Regarding primality testing, this is a cute, quick & dirty one-liner. In the range of numbers we’re discussing, it is ‘good enough’.


Privacy mode not so private

I like my privacy. I also prefer to keep my information secure. I might be a bit more paranoid than most, but not extremely so. A short while ago, I discovered something disturbing about Firefox. It seems to be a ‘secret everybody knows’, yet Firefox doesn’t say anything about it.

What is it? When using Firefox’s ‘clear private data’ feature (under Options->Privacy), even with all the checkboxes checked, a lot of information is still kept. This is true even when using the new ‘private browsing’ mode, which supposedly allows you to browse without any record being kept.

How is the information kept? Using Local Shared Objects (LSOs), which are basically cookies used by Flash objects. Who uses these cookies? Almost everyone. The result? If you have trusted Firefox so far to keep your browsing history private, take a look at the following locations, and tell me what you see.

How to mitigate? The simplest option is just to delete the files you see in the aforementioned locations. Better yet, install BetterPrivacy. Of course, you can also install some kind of Flash blocker, or any other tool, to make sure you don’t keep those LSOs.
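As an illustration, hunting down the `.sol` files can be scripted. A sketch in Python, assuming the standard per-user Flash directory (`~/.macromedia` on Linux; the location differs per OS):

```python
import os

def find_lsos(root):
    # recursively collect Flash Local Shared Object (.sol) files under root
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.sol'):
                hits.append(os.path.join(dirpath, name))
    return hits

# On Linux, Flash keeps LSOs under ~/.macromedia (adjust for your OS).
# Review the list before deleting anything:
for path in find_lsos(os.path.expanduser('~/.macromedia')):
    print(path)  # os.remove(path) to actually delete
```

Note that, as mentioned below, deleting only the `.sol` files still leaves the per-site folder names behind.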

If you do end up using BetterPrivacy, be sure to check the “On cookie deletion also delete empty cookie folders” checkbox. If you don’t, while the cookies themselves are no longer kept, the record of the sites you visited still is.


PNG Minification

Following Ned Batchelder’s advice I’ve used pngout to minify the PNGs on my startup’s web page.

It took them down from 641,281 bytes to 338,705. This is quite a nice return for the effort of a download and a single command line:

for %i in (*.png) do pngout "%i"


Threat analysis, security by obscurity and WordPress

(Image: Rusty Lock, by Mykl Roventine)

I’ve been running wordpress for a long time now, and luckily, so far it hasn’t been hacked.
Of course, this doesn’t prove anything, as I didn’t count hacking attempts. Nor does it show it’s unhackable – on the contrary, I believe that my wordpress installation could be hacked by a determined attacker.

However, there’s a subtle issue at play regarding the ‘determined attacker’. There are several kinds of attackers today, and the two most notable ones are the ‘targeted attacker’ and the ‘mass attacker’. The targeted attacker aims to attack your resources specifically, probably because of his interest in them. The mass attacker, on the other hand, is interested in any resource like yours.

From this premise it follows that the two attackers will likely use different methods of operation. The mass attacker is looking to maximize his ROI. He will use mass tools with the widest coverage, and if an attack doesn’t work on a specific target, never mind – it will work on others. For him, work is worthwhile only if it allows him to attack a substantial number of new targets.
In contrast, the targeted attacker’s goal is to break into your resources specifically. For her, the fact that a given attack would yield hundreds of other targets is irrelevant, unless it helps attack you. She might start with off-the-shelf mass tools, but when those don’t work, the targeted attacker will study her target until she finds a vulnerability, and then use it.

Now, the question you should ask yourself is: who are you defending against? When defending against a mass attacker, making yourself unnoticed and uncommon might be worthwhile. A little security by obscurity will most likely be enough to thwart most of the attacks.
Against targeted attacks you need a more solid defense, using every trick in your bag, while staying aware that it probably won’t be enough. You should also seek to minimize the damage of a successful attack.

Today, most wordpress blogs are under mass attack. WordPress blogs are found, exploited and 0wned automatically, with the goal of getting the widest coverage.
For some time now I’ve been using a small trick that helps defend against mass attacks. The trick is simple – I added a small .htaccess file password-protecting the admin directory of my wordpress installation. In all probability the password could be brute-forced, or the protection bypassed entirely, by a very determined attacker, but against a mass attacker it is very effective.
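For reference, the trick amounts to plain HTTP basic authentication on the wp-admin directory. A sketch of such an .htaccess file (the AuthUserFile path is a placeholder; point it at an .htpasswd file, created with the htpasswd utility and kept outside your web root):

```apacheconf
# wp-admin/.htaccess -- ask for a password before WordPress's own login
AuthType Basic
AuthName "Restricted admin area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Mass tools probing wp-admin now hit a 401 before they ever reach WordPress code.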

I’ve seen suggestions to rename your files and directories – this will probably work too. Still, it should be noted that these methods only add obfuscation, and therefore only protect against mass attacks. Personally, I don’t consider the renaming method worthwhile – it complicates your installation and upgrade process, requires much more work to get right, and at best adds security similar to the .htaccess file, most likely less.

To conclude: do your threat analysis, and use the defense methods with the best ROI relative to that analysis. As one such method, do consider using .htaccess files to prevent access to your admin directory.


My solution to the counting sets challenge

A few days ago, I wrote up a challenge – to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was derived, and I’ll describe it in more depth here.
In the problem, I am given an initial group of sets, and then an endless ‘stream of sets’. For each of the sets in the stream, I have to measure its uniqueness relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique; one that is contained in ten – not so much.

So how to solve this problem? My original solution is somewhat akin to the classic “lion in the desert” problem, but more like the ‘blood test’ story. I couldn’t find a link to the story, so I’ll tell it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick, and so had to be put in isolation until he healed. The disease could only be detected via a blood test, but tests were expensive, and they didn’t want to test all of the soldiers. What did they do?

They took enough blood from each soldier. Then, from each sample they took a little bit, and divided the samples into two groups. They mixed together the samples of each group and tested the two mixed samples. If a mixed sample was positive, they repeated the process on the blood samples of all the soldiers in the matching group.
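The story’s procedure is ordinary binary group testing, and can be sketched in a few lines (the `is_positive` callback stands in for one pooled blood test):

```python
def find_positives(samples, is_positive):
    # recursive group testing: is_positive(group) runs one pooled test
    if not samples:
        return []
    if not is_positive(samples):
        # the pooled test cleared the whole group at once
        return []
    if len(samples) == 1:
        return list(samples)
    mid = len(samples) // 2
    return (find_positives(samples[:mid], is_positive) +
            find_positives(samples[mid:], is_positive))
```

With one sick soldier out of a hundred, this runs a handful of pooled tests instead of one test per soldier.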

Now my solution is clear: let’s build a tree of set unions. At the bottom level are unions of couples of sets. At the next level, unions of couples of couples of sets, and so on, until we end up with just two sets, or even just one – in case we are not sure the set is contained in any of the initial sets at all.

Testing works just like in the story. We start at the two biggest unions and work our way down. There is an optimization though – if a set appears more than, say, 10 times, it’s not very unique, and its score is zeroed. In that case, we don’t have to go down all the way, but can stop as soon as we pass the 10-‘positive results’ mark.

Here’s the code:

import itertools

# minimal stand-ins for helpers that live in the full code (linked below)
def union(sets):
    # union of an iterable of sets
    result = set()
    for s in sets:
        result |= s
    return result

def blocks(seq, n):
    # split seq into consecutive blocks of up to n items
    return [seq[i:i + n] for i in xrange(0, len(seq), n)]

class SetGroup(object):
    def __init__(self, set_list):
        cur_level = list(set_list)
        self.levels = [cur_level]
        while len(cur_level) > 1:
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
            self.levels.append(cur_level)
        # iterate from the root union down towards the original sets
        self.levels.reverse()
    def count(self, some_set, max_appear=None):
        indexes = [0]
        for level in self.levels:
            # a node at index x has children 2*x and 2*x+1 in the next level
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x < len(level))
            indexes = [x for x in indexes if some_set <= level[x]]
            if max_appear is not None and len(indexes) >= max_appear:
                return max_appear
        return len(indexes)

Here’s a link to the full code.

I didn’t implement this solution right away. At first, I used the naive approach of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe that’s because some elements appear in almost all of the sets, so computing the intersection for those elements takes a long time.
Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.
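For comparison, the naive baseline is just a linear scan over the initial sets. A sketch (the max_appear early exit is added here for symmetry with the tree version):

```python
def count_containing_naive(initial_sets, query, max_appear=None):
    # count how many of initial_sets are supersets of query,
    # scanning every set; optionally stop early at max_appear
    count = 0
    for s in initial_sets:
        if query <= s:  # query is a subset of s
            count += 1
            if max_appear is not None and count >= max_appear:
                return max_appear
    return count
```

This touches every set for every query; the union tree instead prunes whole halves of the initial group with a single subset test.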

Implementing this solution was a major part of taking the running time of the complete algorithm for the full problem I was solving down from about 2 days to about 15-20 minutes. That was one fun optimizing session :)