You Can't Build More Nines

Published on February 19, 2025

(Originally published on Medium)

Software teams are built to... well... build. We have design processes, RFC processes, change management processes... lots of processes. All of them tend to be optimized for building.

But inevitably, after building enough complexity, we start to realize that our systems are not reliable enough. We start to measure uptime and, lo, there are not enough nines!

Of course, our first inclination is to build our way to more nines. Build CI/CD pipelines. Build canary deployments. Build a platform. Build synthetic testing.

It’s usually at this point that disillusionment sets in. Why aren’t we getting more nines? We built stuff for that!

But what are we measuring when we measure uptime anyway?

We are, in effect, measuring how often we see what we want to see when we look. Is it up now? Yes. How about now? Yes. Did our users mostly succeed?
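Taken literally, that measurement is a ratio: successful looks over total looks. Here is a minimal Python sketch, assuming a hypothetical probe log in which each entry records whether a single check saw the system up. The function name `availability` is illustrative and not taken from any particular monitoring tool.

```python
def availability(probe_results: list[bool]) -> float:
    """Fraction of checks that saw the system up (True) out of all checks made."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    # sum() counts True as 1 and False as 0, so this is successes / total.
    return sum(probe_results) / len(probe_results)


# Hypothetical probe log (assumption for illustration): 998 good checks, 2 failed.
probes = [True] * 998 + [False] * 2
print(availability(probes))  # 0.998
```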

Well, if that’s what we’re measuring, and we’re trying to build more nines, we have to ask: what is a nine composed of?

One might say it is composed of time slices in which we’re up, or of successful events. So could we say, then, that if we build a system that is up, we have built the nines?

Unfortunately, math would like a word. Those nines are a percentage, so we’re always subject to everything in the denominator.

Or, to paraphrase a common military saying: “the entropy gets a vote.” No matter how bulletproof you build the components of your system, the only way to make the nines go up is to be ready to deal with the host of surprises that drag them back down. By definition, a percentage is a zero-sum game. So, really, to add nines to your target, you have to subtract something else. You have to subtract the faults.
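To make the denominator point concrete, here is a minimal Python sketch, assuming availability is counted over events (say, requests) in a fixed window and that "number of nines" means -log10 of the failure fraction. The helper names `nines` and `failure_budget` and the request volume are hypothetical, chosen only for illustration.

```python
import math


def nines(good: int, total: int) -> float:
    """Nines in the success ratio good/total, computed as -log10(failures / total)."""
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("require total > 0 and 0 <= good <= total")
    failures = total - good
    if failures == 0:
        return math.inf  # no observed failures in this window
    return -math.log10(failures / total)


def failure_budget(total: int, target_nines: int) -> int:
    """Most failures allowed out of `total` events while still meeting the target."""
    # Integer division keeps the arithmetic exact (no floating-point rounding).
    return total // 10 ** target_nines


total = 1_000_000  # hypothetical monthly request volume (assumption)
print(round(nines(total - 1_000, total), 2))  # 3.0
print(failure_budget(total, 3), failure_budget(total, 4))  # 1000 100
```

At a fixed volume of traffic, moving from three nines to four shrinks the failure budget from 1,000 to 100: the extra nine comes entirely from removing faults.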

But, but, I’ve built systems to add nines!

So, it’s goat herding for me then?

We built our organization’s resilience.