How four characters caused 579 errors

by Taylor Fausak on

When I started work this morning, my goal was to eliminate errors from our server log. We have a couple functions that occasionally throw errors when given weird input. Typically they don’t matter because no data gets messed up and the user doesn’t see anything.

However, they are problematic because they pollute the logs, making it harder to find real errors. After looking at them every day for a while, I decided to get rid of them. Unfortunately, in doing so I managed to introduce a real error. It didn’t affect any users, but it did affect our data.

So I started combing through the logs and looking for errors. Most of them were easy to fix. Throwing in an if/else or try/except block usually got the job done. Other times preconditions that used to be true weren’t true any more.

The trickiest ones, though, were logic errors. Take this chunk of code for example (we use Django, backed by MongoEngine):

if 'family_id' in request.GET:
    family = Family.objects(id=request.GET['family_id']).first()
    if family:
        request.session['family_id'] = family.id

Generally speaking, calling objects(...).first() instead of objects.get(...) is a good way to avoid handling DoesNotExist and MultipleObjectsReturned errors. However, the parameters have to be valid, otherwise it’ll return a ValidationError.

I saw a bunch of ValidationErrors in the logs, and this was the culprit. Turns out that family_id was sometimes set to None, which is an invalid ID. I knew it was an easy fix, so I changed it to this:

try:
    family = Family.objects(id=request.GET.get('family_id'))
except (ValidationError, Family.DoesNotExist):
    pass
if family is not None:
    request.session['family_id'] = getattr(family, 'id', None)

That took care of the ValidationErrors, but I introduced two more subtle bugs. So subtle, in fact, that our unit tests missed them. I pushed my commit and went on with my day.

A couple hours later, Cody let me know that the logs were exploding with errors. The messages were cryptic, but they started around the time I pushed that commit. My heart sunk. I immediately started thinking about the time I accidentally deleted all our data and hoped it wasn’t that bad.

I looked over my commits. (GitHub’s compare view helped immensely.) At first, nothing seemed amiss. I finally got it down to three possible commits and scrutinized every bit of them. That’s when I saw it: I wrote Family.objects(...) instead of Family.objects.get(...). That missing .get meant I was returning a QuerySet instead of a Family object.

That wouldn’t have been so bad if it wasn’t for the other change I made further down. Instead of using family.id, which would’ve raised an AttributeError when called on a QuerySet, I used getattr(family, 'id', None), which silently returns None.

I quickly changed the offending code and pushed a new commit. The final working code looked like this:

try:
    family = Family.objects.get(id=request.GET.get('family_id'))
except (ValidationError, Family.DoesNotExist):
    pass
if family is not None:
    request.session['family_id'] = family.id

During the six hours that the code was missing that very important .get, it managed to cause 579 errors. Luckily they didn’t affect any users and only had a minimal impact on our data.

After slamming my head on the desk for a couple minutes, I wrote a handful of test cases to make sure this problem never happened again.