Three anti-patterns in bug management

Half of developers spend more than a quarter of their time fixing bugs. Top teams analyze root causes in depth, involve every engineer, and avoid bug backlogs.

Half of developers spend more than a quarter of their time fixing bugs, according to a Rollbar survey.

Best-in-class development teams produce fewer bugs to begin with and welcome newly found bugs as creative challenges to overcome. These teams have abandoned at least three widespread anti-patterns:

We fix bugs, but we don’t solve them
We leave bugs to junior engineers
We manage a backlog of bugs instead of eliminating them

1. We fix bugs, but we don’t solve them

The alerting system sends you a notification: the application crashed because of a null reference exception. This is a classic problem in programming languages without null safety (the majority; notable exceptions are Rust, Kotlin, Dart, and TypeScript when configured well). The developer figures out where the null value comes from, adds a test that covers this case (the test suite turns red), and adds a new branch to handle the null value. The test now passes, the suite is green again, the fix is deployed, and the day is saved. I call this “firefighting mode”. Sure, the developer put out the fire and the users are “saved”, but have we reinforced our system?
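To make the firefighting patch concrete, here is a minimal sketch in TypeScript. The `User` type, the `greeting` function, and the test helper are hypothetical names invented for illustration, not taken from any real codebase. The sketch assumes a project where the `strictNullChecks` compiler option is off, so the original version compiled without complaint and only failed at runtime.

```typescript
// Hypothetical domain type, invented for this illustration.
interface User {
  id: string;
  // The profile can be missing, for example on partially created accounts.
  profile: { displayName: string } | null;
}

// Before the patch (shown as a comment): with strictNullChecks off this
// compiles, then throws a TypeError at runtime when profile is null.
// function greeting(user: User): string {
//   return `Hello, ${user.profile.displayName}!`;
// }

// After the patch: a new branch handles the null value at this one call site.
function greeting(user: User): string {
  if (user.profile === null) {
    return "Hello!";
  }
  return `Hello, ${user.profile.displayName}!`;
}

// The regression test: red before the patch, green after it.
function testGreetingHandlesMissingProfile(): void {
  const result = greeting({ id: "u1", profile: null });
  if (result !== "Hello!") {
    throw new Error(`expected "Hello!", got "${result}"`);
  }
}

testGreetingHandlesMissingProfile();
```

The patch is correct as far as it goes, but it protects a single call site. Any other code that reads `user.profile` can still fail in exactly the same way.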

For the best teams, fixing the bug is only half the work. The second half involves taking the time to look at the issue in detail and ask:

What were the exact conditions that led to this bug? Did the failing process receive unexpected data? Did the developer who introduced the change use a faulty architecture or data model? Did they inherit a faulty architecture from earlier work? Did the development environment display an error?
Why is this situation different from our usual ways? Is there even a “usual way”, or standard, in our case?
What is the cheapest test or experiment we can devise to test our hypotheses? Let’s try it right now!
What worked and what did not?
What should we change? Our code, our standard? Do we need to provide some quick training at the next daily stand-up?

By doing this, teams invest more time into each problem to understand their whole system: their code, interfaces, users, and how the developers think. They solve bugs rather than just fix them, to avoid ever fighting this fire again.
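As an illustration only, here is what one outcome of such an analysis might look like for a null reference bug. Suppose the team concludes that missing data should be handled once, where it enters the system, and adopts that as a standard. The sketch below is in TypeScript and assumes `"strict": true` in `tsconfig.json`. Every name in it (`RawUser`, `ParsedUser`, `parseUser`, `greet`) and the placeholder fallback are assumptions made for the example, not a prescribed solution.

```typescript
// Hypothetical shape of the data as it arrives from outside the system.
interface RawUser {
  id: string;
  profile: { displayName: string } | null;
}

// Downstream code only ever sees this type, which contains no null.
interface ParsedUser {
  id: string;
  displayName: string;
}

// The single place where missing data is detected and handled.
function parseUser(raw: RawUser): ParsedUser {
  if (raw.profile === null) {
    // Assumed team standard: a missing profile gets a placeholder name.
    return { id: raw.id, displayName: "there" };
  }
  return { id: raw.id, displayName: raw.profile.displayName };
}

// No null branch needed here: the type guarantees a display name.
function greet(user: ParsedUser): string {
  return `Hello, ${user.displayName}!`;
}

console.log(greet(parseUser({ id: "u1", profile: null }))); // prints "Hello, there!"
```

With strict null checks on, the compiler rejects any code that reads `raw.profile.displayName` without first checking for null, so this class of bug is caught before it reaches production. Whether that is the right change for a given team depends on what the analysis actually uncovers.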

2. We leave bugs to junior engineers

It is customary for the CTO, VP of Engineering, or Engineering Manager to have the most junior developers take care of bugs on their own. They either set up a rotation (“Let’s have Billie take care of bugs this week, next week it will be Chandler”) or hand bugs mainly to junior developers at each sprint. The misconception behind this practice is that senior developers code better, so their time is better spent architecting and building new features, whereas junior developers have a lot to learn and bugs are a quick way for them to get their hands dirty.

But this thinking is wrong if your aim is to solve bugs and not just fix them. Although it could be good practice to involve junior engineers in the solving exercise, you also need your best engineers on the job. The question is not just “How do we get this fix into production?” but “Who needs to learn what in order to make the product better?” Everyone should participate because everyone has something to learn, no matter their seniority. The lean insight is to build quality in: develop each developer's ability to write good code.

3. We manage a backlog of bugs instead of eliminating them

I have seen this pattern emerge in companies as they scale their product, but also in single teams with pressure for delivery. Here’s a quick anatomy of a bug backlog:

Bugs that the company deems urgent because they directly affect revenue or reputation in a demonstrable manner. They can hide in the backlog for a few weeks before someone realizes how urgent they are.
Bugs that reporters care about because they degrade their experience with the product. Every minute or hour these bugs stay unresolved is a pain to those people. If the reporters are your clients or partners, they may already be gone by the time the bugs are fixed.
Bugs that fall in a muddy grey zone and that we keep just in case, but most likely the oldest ones have been sitting there for months or even years.

Each of these categories creates mental load and makes communication harder with clients, partners, or within the company itself. The catch is that this extra load often sits outside the development team and mostly falls on a Product Manager or a support team. On top of that, bugs go stale incredibly fast.

Here is the typical life of a bug reported by a user: when asked about it within 5 minutes, the issue is still fresh in their mind and they can provide rich details. Only a few days later, they no longer remember, and the bug falls into the muddy grey zone, waiting for a reproduction that will probably never happen. Most bugs that are at least one month old become completely useless, yet teams still spend time discussing them.

The human element

Despite its importance, we’re missing a strong theory of software quality management. We’re equipped with some tools (linters, testing frameworks, benchmarks, alerting systems) and methods (XP, Domain-Driven Design), but the human element seems mysteriously left out.

I practice and teach Dantotsu, which leverages lean principles and bridges this gap. The goal of Dantotsu is to reach 0 defects across a product flow by developing the people that develop, build, and service it. You can read more about Dantotsu in an article written by a tech lead from AutoRABIT, or in the talk I gave at FlowCon France. If you'd like to learn Dantotsu in my next class, email [email protected] and I'll send over some details.

There are three more anti-patterns I commonly see in tech startups. So stay tuned for part 2!
