Following my previous post on Optimizing Javascript, I thought I’d write a similar post regarding Python optimization.
Before going on to the more interesting stuff, there are a few issues that need to be addressed:
0. Basics
Know the basics – especially profiling!
Just by looking at the profiling output, you can tell where does the computing time go. To get that information, I like to sort on cumulative time (i.e, time taken by a given function and all functions called from it, over all of its calls).
0.5. Knowing your goal, and your enemy
The kind of optimizations you do, and how far you’re willing to go is dependent on your code’s users. If you’re writing batch processing software, your required time for running might be a minute, an hour, or a day. So far, I had to optimize various cases of weeks to minutes for batch processing and also seconds to milliseconds for web-application UI.
Your timing should apply to typical input, and probably to your biggest probable input as well.
Create some simple benchmark you can test your code against. It’s important that your benchmark be typical ‘complexity-wise’, but smaller in size, so that running it and getting profiling results takes no more than a few seconds. You may even want multiple benchmarks, each one for a different size. That way, once you are more sure of yourself, you can run your code against the larger benchmark. If your benchmarks are real inputs – all the better.
1. Python vs. C and similar considerations
In my line of work, I usually do research oriented development. That means that it’s harder to know upfront where the bottlenecks will be. As a result, the prevailing attitude is usually “let’s write it in Python, and later, when the need arises, convert the critical code to C”.
So far, I haven’t had the chance to do that. Usually what happens is we write the code, it works well enough, and we figure that the flexibility of writing it in Python is more important than the Python to C conversion gains. Also, Python is not always the bottleneck – sometimes it’s a database, or some 3rd party API.
Usually “import pysco”, and changing the code to allow parallel processing is cheaper and simpler than the conversion to C.
2. The small time-eater
A common problem is when a relatively trivial function is taking a lot of cumulative time. That’s usually a sign you’re doing something wrong. I had this issue when I used my symbolic constants for a new project. Consider the following:
def SymbolInt(value, name): class _SymbolInt(int): def __str__(self): return name def __repr__(self): return 'SymbolInt(%d, "%s")' % (value, name) def __eq__(self, other): if isinstance(other, str): other = other.lower() return int(self)==other or name.lower() == other def __ne__(self, other): return not self == other return _SymbolInt(value) |
This one is very nice for interactive interfaces. However, in the new project, I found out that __eq__ was taking *a lot* of time. Way more than it should, even when I wasn’t comparing SymbolInt-s to strings!
It turned out that ‘or name.lower() == other’ was very bad speed wise. So for that project, I removed this subcondition, and voila! My code was fast!
3. The algorithm is critical
In many cases I’ve worked on, the greatest reductions in running time were due to algorithm changes. That means that playing with issues such as variable lookups and so on should come after you’re mostly settled on your algorithm. The latest example that I can think of is the set counting problem, where using my solution got me down from two weeks to 20 something minutes on my real input.
Later I did some simpler optimizations that chopped off a few more minutes.
4. Avoiding loops
That one is easy. Everyone knows you should avoid loops, especially nested ones. Still, there are some cases where your code just has to have these loops – because that’s the essence of what your code is doing.
To make loop avoidance possible, and specifically cartesian product kind of loops, consider refactoring your code to use set intersections and unions. As simple illustration, consider:
Instead of this,
for x in a: for y in b: if x == y: yield (x,y) |
do this:
return set(a) & set(b) |
Sometimes, applying this change doesn’t quite fit your algorithm. In that cases, try to change your algorithm to accommodate. For example, it might yield less accurate results. In that case, aim for returning more than you need, and then do a second pass to filter the bad ones. The time gains from avoiding the extra loops should still be worth it.
(Note: this is similar to doing your computations in your database queries instead of in your code. Similar ideas apply.)
5. Lookups
If you spend time looking for something, use a dict. If that is not feasable, use any other data-structure that fits your problem. For example, let’s say you’re looking for given strings in a lot of files. You can build a small index beforehand, and instead of looking at the files each time, just look at this index.
(Note: this is similar to creating an index on the database column you are searching on.)
6. Memory
When dealing with large inputs, you’ll usually want to reduce your memory requirements. Consider an algorithm that requires O(n) memory, for n-sized inputs. All you need is a factor of 4, and 500 megs of input, and your code will choke on many current machines.
Also, I’ve found out that writing your code in such a way as to use drastically less memory, will sometimes force me to write it more time-efficiently as well.
There are a few techniques to dealing with the memory issue. The central idea is to have as little of your data as you need available at any time.
7. Generators
Generator expressions are usually preferable to list comprehensions. Similarly, consider replacing this kind of functions:
def myfunc(some_input): ... result = [] for bla in foo: ... result.append(bar) return result |
with the following idiom:
def myfunc(some_input): ... for bla in foo: ... yield bar |
This has the added advantage of simplifying myfunc, as its state is kept for you. On really big inputs and outputs, this one could save you from keeping all of your output in memory.
If you are not familiar with generators, I suggest reading David Beazley’s presentation on the subject, it’s an excellent read, regardless of optimizations.
8. Outputs
If your goal is to generate output, dump it to a file as soon as possible. This is made simple by the previous idiom:
for bar in myfunc(): #process bar ... dump(foobar) |
Just make sure that dump doesn’t keep your data around for too long.
For example, I once had to insert a lot of data into a database. After I finished processing each record, I would insert it. The bottleneck was the database. I tried flushing only after several inserts (which meant inserts in chunks of N for various N), until I was introduced to the solution: bulk inserts.
The, my extraction script just dumped to a text file, which was lightning fast, and later I did the bulk insert.
9. Summing up
Do
sum(x for x in some_generator) |
Instead of
for x in some_list: my_sum += x |
Kidding!
Use your profiler, your head, psyco, and more experienced advice in the best order that suits you. As I’ve come to learn, getting advice from friends is an excellent way to avoid bashing your head against some mad bugger’s O(n2) wall.
Really nice info, I agree with everything. I’ve met lazy people (as programmers should be) that skip the profiling step and just guess where they need to optimize so I’d like to +1 how important it is. Not to mention it’s easy:
import profile
profile.run(“any_python_expression”)
where usually you’d replace any_python_expression with a call to your main function…
One more…
Understand where your program fits. It may be easier to optimise a cooperating program rather than the current one, to gain the performance gain needed for the whole system.
– Paddy.
Thanks for posting this. I found it to be very practical and helpful.
One cheap trick you can also use is to cache a method call in a tight loop. Method lookups can be expensive, so for example in:
> for i in xrange(1000):
> myobj.compute(i)
you can eliminate the lookup:
> compute = myobj.compute
> for i in xrange(1000):
> compute(i)
Nice one
Great post, thanks.
I guess the bottom line is that an algorithm should be designed for performance and later use profiling to improve implementation speed.
Benny,
It might be best to design first for correctness and then address any performance issues.