For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:
class SomeObject(models.Model): #some data class ProcessedObjectData(models.Model): some_object = models.ForeignKey(SomeObject, db_index = True) #some more data |
In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.
Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:
SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id')) |
Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:
Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.
This might not be 100% true, but it usually is and in this case – very much so.
So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:
SomeObject.objects.raw(...Insert SQL here...) |
As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:
SELECT id FROM ( SELECT some_object.id AS id, processed_object_data.id AS other_id FROM some_object LEFT JOIN processed_object_data ON (some_object.id = processed_object_data.some_object_id) AND (...some FILTER ON processed_object_data...) ) AS inner_select WHERE inner_select.other_id IS NULL LIMIT 100 |
That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.