For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:
class SomeObject(models.Model): #some data class ProcessedObjectData(models.Model): some_object = models.ForeignKey(SomeObject, db_index = True) #some more data |
In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.
Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:
SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id')) |
Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:
Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.
This might not be 100% true, but it usually is and in this case – very much so.
So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:
SomeObject.objects.raw(...Insert SQL here...) |
As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:
SELECT id FROM ( SELECT some_object.id AS id, processed_object_data.id AS other_id FROM some_object LEFT JOIN processed_object_data ON (some_object.id = processed_object_data.some_object_id) AND (...some FILTER ON processed_object_data...) ) AS inner_select WHERE inner_select.other_id IS NULL LIMIT 100 |
That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.
Hi Imri,
I’m clueless about SQL and moreso about optimizing it. But looking at your first line of Django, are you sure it’s as efficient as something like:
SomeObject.objects.exclude(processed_object_data__whatever=’whatever’)
Or are you unable to do it in this case?