
Josh Clow

Software is a human craft

Bug Reduction (probably) isn't the goal

May 05, 2020

I was looking over a document at work today talking about bug filing and triage process, and ran across a line in it that jumped out at me. To paraphrase, it argued that a big part of improving bug triage was to improve code quality so there are fewer bugs to triage.

This, or rather the broader notion that “having fewer bugs” should be a major goal of commercial software construction, is, in my opinion, incorrect, as the title of this post gives away.

Because that sounds contrarian, let me start by stating up front that I do not think people should spend no effort on catching bugs up front, nor do I think that the generalities I’m going to talk about apply to every specific situation. Too many bugs will kill a project. I just think that the reasons it will kill a project are the same reasons that too much focus on eliminating them will also kill a project.

Where do bugs come from?

My answer to this question is both glib and, I think, pretty apt. Bugs come from humans. That’s the glib part. The apt part is a function of what it means to be human and the limits of human cognition.

Humans make mistakes. One major reason we make mistakes with complex things is that we cannot hold all the relevant data in our heads at one time and dispassionately take action based on it. The more complex a thing is to do, the more likely we are to make a mistake doing it. There are ways to mitigate this, and we all apply them: sometimes we split a complex task into smaller sub-tasks so that we can work on each piece as if it were a less complex task. Sometimes we rely on multiple people to compensate for each other’s mistakes, since it’s unlikely two people will make the exact same mistake. Sometimes we abstract the problem away, reason about the abstraction, and then try to fit that back to the real problem. Sometimes we just practice the thing over and over again to reduce the chance for mistakes.

So I expect we can agree that coding is a complex thing, and the more complex the code gets, the more likely there are to be bugs. One interesting thing about that is that the number of bugs per line of code written by people tends to be pretty consistent over time. To be clear, “lines of code written” is an imperfect metric here, as is number of bugs, but if you pick definitions for those things and a methodology for measuring them, and keep that consistent over a variety of codebases, you will tend to find that whatever your “bugs per lines of code” number is, it stays pretty stable over a large enough sample size. McConnell fairly famously used “defects per thousand lines of code” in his book Code Complete, and while different methodologies produce different numbers, they all tend to reflect this notion: if you have X lines of code, you can expect on average to have Y bugs.
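Just to make that back-of-the-envelope relationship concrete, here’s a minimal sketch. The function and the density number are my own illustrative assumptions, not anything from McConnell or the document I’m describing; real defect densities vary widely by team, domain, and how you count.

    def expected_defects(lines_of_code: int, defects_per_kloc: float) -> float:
        """Back-of-the-envelope estimate: defects scale roughly linearly with code size."""
        return (lines_of_code / 1000) * defects_per_kloc

    # Purely illustrative density; pick your own definitions and measure them consistently.
    for loc in (10_000, 100_000, 1_000_000):
        print(f"{loc:>9} LOC -> ~{expected_defects(loc, defects_per_kloc=15):.0f} expected bugs")

The point isn’t the specific numbers; it’s that, with the density roughly fixed, the only input that meaningfully moves the output is how much code you write.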

It’s important not to misuse or misunderstand this metric. It is not an especially useful metric for making predictions about code quality, or for measuring whether your code quality is getting better or worse (in particular, as soon as you start measuring it with precision, you rightly raise concerns that engineers will either fear it will be held against them, game the system, or both). The main value for this discussion is in the fact that this stability holds true across programming languages, problem domains, and time.

When you think about that, that’s actually a pretty weird thing! You’d think that, as we’ve introduced better tools, better languages, better processes…as we’ve learned from the mistakes of the past and gotten better as developers (both individually and as a discipline), something in there would move the needle in terms of how many bugs an average developer will write as a function of lines of code written, but it seems like it doesn’t.

Of course, I’ve spoiled the answer by talking about complexity above. “Lines of code” is an imperfect but reasonably well-correlated proxy for code complexity. So from that perspective, it makes more sense that the metric would be stable. If software with more lines of code is more complex, of course you would expect it to have more bugs, independent of the language used, who is doing the work, the domain, etc.

But what this means is that bug prevention is a misleading concept. The only way to meaningfully reduce the number of bugs that you will generate in software production is to write less code over time. Even in domains that have high stakes (like, say, sending human beings to the moon), the techniques that work slow everything down and reduce the amount of code that gets written over time.

If bug prevention isn’t a goal, what should be?

Software construction is an imperfect art. So ultimately, any answer to this question will be flawed in some dimension, but to speak in generalities, the goal should be to manage risk. In commercial contexts, this includes managing risks to the business. If you produce incredibly buggy software and don’t fix it quickly enough, you will probably lose your customers. But also, if you take years to release even a single feature, then you will also probably lose customers. Other contexts usually have to make similar trade-offs between “too many bugs” and “too little code”, even if it’s just so that developers can get something done.

So one of the things that has been true in every software project I’ve worked on, going back to research, moving into a large commercial company (Microsoft), and on to smaller startups (Textio and Sift), is that there’s some level of risk tolerance around the effects of a bug making it in front of a customer. In other words, real-world software construction that needs to strike the balance between “quality” and “speed” necessarily has to let bugs through the net and into production. So if that’s an unavoidable consequence, you need to plan for it.

Coming back to that document

So this is what I think is so wrong about the claim that a major factor in bug triage strategy should be to create fewer bugs. Not only does creating fewer bugs probably not reduce business risk, it also ignores the whole point of the document being written: handling the bugs that do make it into production.

This may seem pedantic, but I’ve seen many organizations make the mistake of focusing most of their effort on “bug reduction” through constantly improving tests, bringing in static analysis tools, adding exacting code review standards, etc. When this has been successful at reducing bug counts, it’s not been because any of those things made the code better; it’s because they slowed code down to a crawl (in Windows, when I worked there, it was routine for a change I made to take a month or more to get to the equivalent of a “production” branch, and we only actually released the results to customers every few years), and as we’ve seen, slowing code down is effectively writing less code and yep, that reduces bugs!

But the problem with this is that if you only focus your energy on prevention, then when a bug inevitably gets through, it’s incredibly painful because you haven’t refined the tools and processes for dealing with it. Perhaps it goes “undetected” by anyone on the engineering team for months, but is detected by the customer, and that kills you at contract renewal time because they see your product as flawed. Or perhaps it gets detected but sits in a queue because nobody knows how to reconcile bugs with new feature work in a way that sticks for the organization. Or maybe it was detected immediately (perhaps because it took your site down, perhaps for a less dramatic reason) but it takes hours (or days) to “undo” the commit that caused the bug.

All of those things will tend to make people look at bugs getting into prod as “high cost” and therefore something to eliminate. But the real solution is to make the cost of a bug getting out as low as possible. Make it routine! If we’re going to fantasize about impossible worlds where we never let a bug escape into production, we should also fantasize about the impossible world where, when a bug gets out, it gets identified immediately, the associated change gets undone without a huge knot of dependencies making that impossible, and the fix shows up as a routine part of an engineer’s “what am I doing today?” list such that they can be confident it needs doing at the relative priority they see.

That is a fantasy, of course, but rather than pushing on the frequently diminishing returns of additional “bug reduction” solutions, I encourage everyone to spend equal time on figuring out how to lower the costs of a bug getting out. Improve your triage process! Get a ticketing system that works for you! Make rollbacks easy! Keep code velocity up so that it’s practical for engineers to make many small changes instead of one large change (thus reducing the chances that a rollback gets “stuck” because it’s too difficult to figure out what changes need to be backed out).
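To make “make rollbacks easy” a little more concrete, here’s one minimal sketch of a kill-switch style feature flag. I’m not claiming this is the only (or even the best) mechanism, and the function and flag names here are made up for illustration, but the idea is the point: when a bug escapes, turning it off should be a routine, low-cost operation rather than a multi-day untangling exercise.

    import os

    def flag_enabled(name: str, default: bool = False) -> bool:
        """Read a kill-switch style flag from the environment, so a risky code path
        can be turned off in production without reverting or redeploying code."""
        value = os.environ.get(f"FEATURE_{name.upper()}", str(default))
        return value.strip().lower() in ("1", "true", "yes", "on")

    def render_old_pricing(user: str) -> str:
        return f"old pricing page for {user}"

    def render_new_pricing(user: str) -> str:
        return f"new pricing page for {user}"  # the newer, riskier path

    def render_pricing_page(user: str) -> str:
        # If a bug escapes in the new path, flipping FEATURE_NEW_PRICING off is the
        # "rollback": cheap, immediate, and with no knot of dependencies to untangle.
        if flag_enabled("new_pricing"):
            return render_new_pricing(user)
        return render_old_pricing(user)

    print(render_pricing_page("demo-user"))

The same logic applies to keeping individual changes small and independently revertable: the cheaper it is to undo any one change, the less an escaped bug costs you.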

In other words, plan for humans being humans instead of being robots.