Databases Optimization

Optimizing Django ORM / Postgres queries using left join

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

class SomeObject(models.Model):
    #some data
class ProcessedObjectData(models.Model):
    some_object = models.ForeignKey(SomeObject, db_index = True)
    #some more data

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.

Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:

    SELECT AS id, AS other_id FROM
    ( = processed_object_data.some_object_id) AND
    (...some FILTER ON processed_object_data...)
) AS inner_select 
inner_select.other_id IS NULL

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.

Javascript Optimization

Javascript Optimization Tricks

Profiling, Profiling, Profiling

This seems obvious, but still merits a mention. Don’t optimize your code before profiling it. If you need a profiler, just use firebug’s. Having tests to make sure that your post-profiling code is equivalent to the original is also nice.

Later is better than now

As is well known, it is usually useful to have some of the page content to be fetched in the original http get, but have non-essential content fetched using ajax after page-load.

In some cases, user interaction with your website does not require fetching new content, but doing some processing. Generating html is especially slow. In that case, the user will have to wait until the processing is done. Unless of course, you can also ‘postpone it for later’: simply use setTimeout to do the work outside of the even handler.
For example, I used Bill Chadwick’s ArrowLine, which redraws on zoom-in/out. Since the redraw isn’t quick, this makes zooming in pretty slow, especially when quickly zooming more than once.
My solution: since what takes time is drawing the arrows themselves, I put arrow drawing code in a timeout callback, which means the interface doesn’t get unresponsive.

(Of course, this might misbehave, and I had to make sure my code still works if multiple timeouts are enqueued, etc..)

A single div is all you need!

I had to write a dropdown widget for UI, one which could show arbitrary html inside the dropdown box. Problem is, I had to to have a lot of similar dropdowns. This in itself is pretty easy, as we all know how to use a for-loop.
It gets complicated when you find out that adding these dropdowns takes a lot of time. What’s worse, the part that takes time is adding the div that gets displayed when you show it, but is otherwise hidden.

To solve this, first you may apply “Later is better than now”, and create each dropdown only when it’s needed. This will work. In some cases, a different approach is warranted: all of these dropdowns share the same design of the ‘dropdown div’, and only one of these divs will be shown at any time (I had to write code to make sure of it). So, instead of making a new drop-down div for each dropdown, create just one, and let them share it. This might complicate your code, but the rewards may be great.

To make it work, you may use an object factory, or a function that creates widget classes, or any other pattern that works for you.