• Skip to primary navigation
  • Skip to main content

Algorithm.co.il

  • About
  • Best Posts
  • Origami
  • Older Projects

Optimizing Django ORM / Postgres queries using left join

Posted on 2012-07-13 by lorg 1 Comment

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

class SomeObject(models.Model):
    #some data
 
class ProcessedObjectData(models.Model):
    some_object = models.ForeignKey(SomeObject, db_index = True)
    #some more data

class SomeObject(models.Model): #some data class ProcessedObjectData(models.Model): some_object = models.ForeignKey(SomeObject, db_index = True) #some more data

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.

Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:

SELECT id FROM (
    SELECT some_object.id AS id, processed_object_data.id AS other_id FROM
    some_object
    LEFT JOIN
    processed_object_data
    ON
    (some_object.id = processed_object_data.some_object_id) AND
    (...some FILTER ON processed_object_data...)
) AS inner_select 
WHERE 
inner_select.other_id IS NULL
LIMIT 100

select id from ( select some_object.id as id, processed_object_data.id as other_id from some_object left join processed_object_data on (some_object.id = processed_object_data.some_object_id) and (...some filter on processed_object_data...) ) as inner_select where inner_select.other_id is NULL limit 100

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.

Filed under: Databases, Optimization

Reader Interactions

Comments

  1. Ram Rachum says

    2012-07-13 at 11:42 pm

    Hi Imri,

    I’m clueless about SQL and moreso about optimizing it. But looking at your first line of Django, are you sure it’s as efficient as something like:

    SomeObject.objects.exclude(processed_object_data__whatever=’whatever’)

    Or are you unable to do it in this case?

    Reply

Leave a Reply Cancel reply

© 2022 Algorithm.co.il - Algorithms, for the heck of it