UP | HOME
mattduck
PostsEmailRSSGithubCVEmacs.London  

2021-Jan-31

Filtering Django querysets without re-querying the database

1. ORM problems

I think most ORM patterns I've seen present a bad abstraction. Rather than providing an interface that safely hides the details of the underlying SQL, they make it extremely easy to trip yourself up by writing OOP code which looks conventional but is actually problematic when translated to relational data access patterns. One impact this has for users of your program is increased latency, as the impedance mismatch between SQL and the ORM code causes time to be wasted on:

  • Issuing many more queries to the database than required (See the N+1 problem).
  • Reading columns over the wire that aren't actually used in the application (SELECT * FROM mytable).
  • Querying the same data multiple times in a request.
  • Loading excess rows into the application to do processing that could easily have been pushed into the database.

(This doesn't mean ORMs don't have a place - the downsides may be worth it in situations where you can work much quicker with ORM classes than with SQL, performance isn't a concern, you have a very small dataset, etc.).

2. Django ORM and the QuerySet cache

Django's ORM suffers from these problems. Unlike SQLAlchemy, it doesn't provide a core API to compose SQL - you're forced to deal with Django's Model and QuerySet interfaces. It's very easy to end up in a place where your functions and SQL queries are mismatched, with queries fired off at various arbitrary points in the request path.

One of the ways that Django attempts to mitigate this problem is through the queryset result cache. Querysets try to be smart about not re-querying the database to answer basic questions that they already have the answer to. For example, if I iterate over a queryset multiple times, the results will be cached:

users = User.objects.all()
for user in users:  # SQL: SELECT id, name, etc FROM users
    print(user)

for user in users:  # Does not issue SQL
    print(user)

Similarly, if I iterate over the queryset and later call .count(), the queryset will return the length of the cached results, instead of issuing SELECT count(*) FROM users:

users = User.objects.all()
for user in users:  # SQL: SELECT id, name, etc FROM users
    print(user)

print(users.count())  # Does not issue SQL

3. How the result cache is implemented

The QuerySet class stores the cache on self._result_cache. When populated, this object appears to just be a list of model instances. It gets initialised to None when building a new queryset:

class QuerySet:
    """Represent a lazy database lookup for a set of objects."""

    def __init__(self, model=None, query=None, using=None, hints=None):
        self.model = model
        self._db = using
        self._hints = hints or {}
        self._query = query or sql.Query(self.model)
        self._result_cache = None
        # ...

You can read the Django source code on Github (these snippets are copied from master at time of writing). Here's the full implementation of .count(), which will only issue a COUNT(*) statement if the result cache is empty:

def count(self):
    """
    Perform a SELECT COUNT() and return the number of records as an
    integer.
    If the QuerySet is already fully cached, return the length of the
    cached results set to avoid multiple SELECT COUNT(*) calls.
    """
    if self._result_cache is not None:
        return len(self._result_cache)

    return self.query.get_count(using=self.db)

self._result_cache is mentioned 23 times in django/db/models/query.py. In addition to the count() implementation, it's used:

  • In exists(), __len__(), __iter__(), __bool__() and __getitem__() - these all operate on the cache if it's populated.
  • In _fetch_all() - this queries the database to populate the cache if it isn't already populated, and is called by methods like __len__().
  • In delete() and update() - these both clear the cache.
  • In __deepcopy__() - this explicitly ignores the cache so it isn't copied to new querysets.

4. When the cache is cleared

There are two main places where the result cache is invalidated:

  1. After writes, and
  2. when returning a clone of the queryset, which happens after calling common functions like .filter().

4.1. Avoid using .all()

Something to watch out for is that calling .all() returns a new queryset without any cached results. The code below looks very similar to the first snippet, but issues a redundant query:

users = User.objects.all()
for user in users:  # SQL: SELECT id, name, etc FROM users
    print(user)

for user in users.all():  # SQL: SELECT id, name etc FROM users
    print(user)

These additional queries can add up fast in real-world cases - particularly if you've used select_related() and prefetch_related() in an attempt to avoid N+1 problems, in which case calling .all() again could issue multiple queries to populate your pre-fetching.

5. Where the cache breaks downs

There are times where it seems like Django could re-use the cached results, but doesn't. For example, imagine we have already retrieved all the users, and we want to narrow the queryset further:

users = User.objects.all()
for user in users:  # SQL: SELECT id, name, etc FROM users
    print(user)

narrowed_users = users.filter(id__in=[1, 2, 3])
for user in narrowed_users:   # SELECT id, name, etc FROM users WHERE id IN (1, 2, 3)
    print(user)

For complex filters that mimic SQL operations, we of course want to push that into the database. But it's easy to imagine how this particular filter could be implemented in Python - we know the rows are already cached in memory, and the filter we're applying can be done in one line of Python ([row for row in result_cache if row.id in (1, 2, 3)]). But filter() returns a new QuerySet with our new IN clause and re-fetches the matching User rows from the database.

6. Why doesn't the QuerySet provide an API to support this?

It would be nice to have an additional API to filter against cached data if the queryset has already fetched rows from the database. It must be pretty common that Django codebases end up with business logic split between different parts of the codebase - sometimes operating on querysets, sometimes on model instances, sometimes issuing new queries, etc. Eg. I can imagine a code path like:

  1. You do some expensive pre-fetching at the top of your request path, and you pass the queryset along to some other code.
  2. You need to call a method on your model instances which returns a value that you want to use to narrow the queryset further. Iterating your queryset to call each model instance is OK because you're re-using the cache. But then…
  3. You have to call a third-party library which expects to operate on a QuerySet. You can give it a new queryset using .filter(myfield__in=my_matching_row_ids), but this is going to hit the database again and re-run all your expensive prefetching queries.

For this case, it would be useful if you could pass a queryset to the third party API which contains a narrowed view of the queryset that you already fetched from the database in step 1, with the cache already populated.

There are definitely arguments against offering an API like this - eg. it could introduce confusion around whether data is up to date or not, and "ideally" you should have structured your code using the right patterns to not get into these problems. But when that ship has already sailed I think a tool like this would have uses.

7. Hacking the result cache to avoid hitting the database

One way we can implement this behaviour is by overriding the result cache ourselves. For example:

# Build the root queryset and fetch the data
users = User.objects.all()
for user in users:  # SQL: SELECT id, name, etc FROM users
    print(user)

# Narrow your models to a list
narrowed_users = [u for u in users if u.id in (1, 2, 3)]

# Build your filtered queryset
narrowed_queryset = users.filter(id__in=[u.id for u in narrowed_users])

# Set the cache on the queryset
narrowed_queryset._result_cache = narrowed_users

# Iterate the queryset freely
for user in narrowed_users:   # Does not re-issue the query
    print(user)

# Further filtering will issue a new queryset that includes our filter
registered_users = narrowed_queryset.filter(is_registered=True)
for user in register_users:   # SELECT id, name, etc FROM users WHERE is_registered=true AND id IN (1, 2, 3)
    print(user)

This solves the problem we had - now we can call some Model/OOP functions to get the IDs of the rows we care about, and pass a new QuerySet downstream which matches on those IDs without re-fetching rows we already had from the database. They will only be re-fetched if the downstream code needs to further modify the queryset.

7.1. Is this terrible?

I searched for a bit and couldn't find examples of people doing this. I wouldn't expect it to be "supported" behaviour and it isn't something I'd want to do without a good test suite around my program to catch any changes in how QuerySet uses self._result_cache. You also want to be careful about using IN for large result sets. But it does seem to work nicely for some cases.

8. Further reading

There are a few Django libraries that provide different solutions for ORM caching. I don't think any of them would be helpful in this exact situation, but it's probably worth understanding what they can do.

You can find the QuerySet source code at django/db/models/query.py. Here it is on Github. It's quite readable, and is useful for understanding what a queryset does.

There are many articles about ORM pitfalls. I think a good starting point is Cal Paterson's The troublesome "Active Record" pattern and the pages linked there.