Optimizing Django ORM / Postgres queries using left join

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

class SomeObject(models.Model):
    #some data
 
class ProcessedObjectData(models.Model):
    some_object = models.ForeignKey(SomeObject, db_index = True)
    #some more data

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.

Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:

SELECT id FROM (
    SELECT some_object.id AS id, processed_object_data.id AS other_id FROM
    some_object
    LEFT JOIN
    processed_object_data
    ON
    (some_object.id = processed_object_data.some_object_id) AND
    (...SOME FILTER ON processed_object_data...)
) AS inner_select 
WHERE 
inner_select.other_id IS NULL
LIMIT 100

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.

This entry was posted in Databases, Optimization and tagged , , , , , . Bookmark the permalink.

One Response to Optimizing Django ORM / Postgres queries using left join

  1. Ram Rachum says:

    Hi Imri,

    I’m clueless about SQL and moreso about optimizing it. But looking at your first line of Django, are you sure it’s as efficient as something like:

    SomeObject.objects.exclude(processed_object_data__whatever=’whatever’)

    Or are you unable to do it in this case?