<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Resilience in Software Foundation : News</title>
    <link>https://resilienceinsoftware.org/feeds/fbfef454aa2263e27c3975d1ff28ed2815c5/communication_posts.xml</link>
    <description>Recent news</description>
    <atom:link href="https://resilienceinsoftware.org/feeds/fbfef454aa2263e27c3975d1ff28ed2815c5/communication_posts.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Superficial Blamelessness</title>
      <description>&lt;p dir="ltr"&gt;In 2012, John Allspaw (then CTO of Etsy) wrote a &lt;a href="https://www.etsy.com/codeascraft/blameless-postmortems" rel="nofollow noreferrer noopener"&gt;seminal blog post&lt;/a&gt; on the need for what he called Blameless Postmortems. Built off the notion of a “Just Culture” from the research of &lt;a href="https://www.amazon.com/Just-Culture-Sidney-Dekker/dp/147247578X" rel="nofollow noreferrer noopener"&gt;Sidney Dekker&lt;/a&gt;, he summarized Etsy’s approach to balancing accountability in post-incident reviews with a need for avoiding blaming individuals who were involved in the incident:&amp;nbsp;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;“Having a "Just Culture" means that you’re making [an] effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;He goes on to explain how engineers involved in incidents can be encouraged to give detailed accounts of what happened without fear of punishment or retribution. The word blamelessness took on a life of its own and became slightly disconnected from Dekker’s research and John’s blog post. Further refinements such as &lt;a href="https://dev.to/reinh/beyond-blameless-3oif" rel="nofollow noreferrer noopener"&gt;blame-awareness&lt;/a&gt; came in to add some nuance to the ongoing discourse as well.&lt;/p&gt;
&lt;p dir="ltr"&gt;For many organizations, some form of &lt;em&gt;blamelessness&lt;/em&gt; has become a more standard practice and &lt;em&gt;blame-awareness &lt;/em&gt;has been gaining in popularity. However, there is an anti-pattern I have noticed as well, which I like to call superficial (or shallow) blamelessness that I think is important for people to be on the lookout for.&lt;/p&gt;
&lt;h1 dir="ltr"&gt;Why Try to Be Blameless?&lt;/h1&gt;
&lt;p dir="ltr"&gt;Resilience Engineering, and many of the connected disciplines it borrows from, aims to leverage incidents and the energy put behind them to better understand how a system works. The idea goes, the better you understand the dynamics in play, the less likely you are to make misguided suggestions, to operate on false assumptions, and to create accidental harm (among many other things).&lt;/p&gt;
&lt;p dir="ltr"&gt;Disciplines that look at complex systems prefer to focus on &lt;em&gt;interactions between components&lt;/em&gt; rather than specific parts themselves (what many consider to be the “root cause”): individual units can be working fine—reliably and in accordance with their goals—but still end up behaving surprisingly &lt;em&gt;as an ensemble.&lt;/em&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;Looking at the parts in isolation may work well enough when your systems are simple: find what broke, replace or fix or tweak it, and move on. But as they scale up and become more connected to other systems, they also necessarily become more complex. As a result, the way we construct explanations about those systems has to change as well. Looking at so-called “faulty” individual components or people that need to be corrected provides diminishing returns and loses its effectiveness.&lt;/p&gt;
&lt;p dir="ltr"&gt;Eventually, this is why blame hurts. Blame fundamentally implies that someone or something misbehaved, that they are &lt;em&gt;responsible&lt;/em&gt; for causing the outcomes. Blame, as a concept, invites retribution. And retribution in turn drives people involved in adverse events to protect themselves, which means it becomes harder to learn from what happened.&lt;/p&gt;
&lt;p dir="ltr"&gt;So not only do we get a focus that is centred on individual units, we make it harder to ever get a better understanding on top of it. This is where concepts such as blamelessness (or &lt;a href="https://dev.to/reinh/beyond-blameless-3oif" rel="nofollow noreferrer noopener"&gt;blame-awareness&lt;/a&gt;) come from: we realize that blame is an expected reaction to surprising negative events, but also know that post-incident responses that lean into these feelings do not tend to be effective in the long run.&lt;/p&gt;
&lt;p dir="ltr"&gt;But being blameless isn’t purely a concept for justice to ensure we don’t punish people for being put in a tough situation by a system that doesn’t even realize it—although it certainly can help—it is also a concept that invites a stance that is systems-oriented.&lt;/p&gt;
&lt;p dir="ltr"&gt;You don’t just &lt;em&gt;avoid&lt;/em&gt; blame. It’s a natural feeling to feel blame, but you have to go past it and shift your perspective towards the system “owning” the situation.&lt;/p&gt;
&lt;h1 dir="ltr"&gt;What Makes Blame Superficial&lt;/h1&gt;
&lt;p dir="ltr"&gt;Superficial blamelessness will be easiest to spot when retribution is still the norm, whether officially or only informally. People will avoid naming employees or even teams, and internal reports will look as non-specific as if they were intended for an external audience. But even if you take away the “name” bit from the cycle of &lt;em&gt;name&lt;/em&gt; → &lt;em&gt;blame&lt;/em&gt; → &lt;em&gt;shame&lt;/em&gt;, the cycle lives on.&lt;/p&gt;
&lt;p dir="ltr"&gt;Organizations where blame is less “ambient” or better harnessed will instead consider it safe to name names, and will invite a conversation with the people involved to get their perspective—one they do not have as a bystander—to get a richer view of how the system works. But even in these places, where it might be safe to give your account of what happened, there’s not yet any guarantee that you’re getting what you should out of it all. Blame is only one of many blockers on the way to a better systems perspective.&lt;/p&gt;
&lt;p dir="ltr"&gt;Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one. Consider, for example, whether suggestions of what to improve after adverse events mostly focus on what specific people involved need to do better, even without punishment. Common ones are:&lt;/p&gt;
&lt;ul&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Provide more training&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Adjust a worker’s behavior&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Write a specific patch or test&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Be reminded to “pay more attention”&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Add more reviewers or supervision to catch mistakes&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Tweak the process or procedures &lt;em&gt;directly&lt;/em&gt; related to the incident&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p dir="ltr"&gt;The locus of intervention, in these cases, is oriented towards specific people or components. Some of these might be useful and helpful, but if they’re most of what you get, you can suspect a systemic point of view isn’t strongly developed yet.&lt;/p&gt;
&lt;p dir="ltr"&gt;More systemic interventions are those that would change the conditions that lead to challenging situations. You could, for example, change or clarify broad pressures and goal conflicts. Likewise, the behavior you didn’t like in the context leading up to an incident might be valued in “normal” circumstances. Software development, after all, can have contributing factors that come from other departments too.&lt;/p&gt;
&lt;p dir="ltr"&gt;You might also find items such as tweaking processes or procedures that go beyond the circumstances of the incident. Maybe someone found out that relevant and useful information came up in informal discussion groups: those could perhaps be encouraged more, or the recipe expanded to other areas important to the organization, because they can help foster a culture that can deal with surprises better. If you find other teams have solutions that exist already, it’s worth asking if they could be borrowed and propagated, but also what could make it easier to figure out if there are ideas worth borrowing. And why not go further and find out under what circumstances that other team figured it out to see if there’s something to learn?&lt;/p&gt;
&lt;h1 dir="ltr"&gt;Control-centric vs. Empowerment-centric Approaches&lt;/h1&gt;
&lt;p dir="ltr"&gt;The distinction between systemic and individualistic is usually a good marker, mostly observable once you are looking at a completed review process and its list of actions. It might not be visible yet, because the process isn’t over or continuous, and it might also not be simple: people do what they can given the circumstances.&lt;/p&gt;
&lt;p dir="ltr"&gt;Another angle in which you can frame this comparison between individualistic and systemic elements is going to be based on an opposition between control and empowerment. If you are approaching the problem from a stance where you need to restrict and control what people do to keep it within specific parameters because they are not operating as you expect them to, then you are taking a control-centric approach. Ask yourself whether this isn’t a subtle way to be blaming people for the problems they’ve been asked to solve.&lt;/p&gt;
&lt;p dir="ltr"&gt;An empowerment-centric approach would instead ask how you can make the work simpler,&amp;nbsp; safer, easier, clearer, or more manageable. Involve the people impacted in the process, not because they need to do better, but because the system needs to make sure they’re operating in better conditions.&lt;/p&gt;
&lt;p&gt;&amp;nbsp; &lt;img src="/storage/4753047/download" alt="" width="130" height="130"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fred Hebert&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Staff SRE&lt;/p&gt;</description>
      <pubDate>Wed, 15 Apr 2026 22:30:00 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/11502437</guid>
      <link>https://resilienceinsoftware.org/news/11502437</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/11502437/compressed-56127c6f403a53544754f0e29d3aa87a.webp"/>
    </item>
    <item>
      <title>Saturation</title>
      <description>&lt;h1 dir="ltr"&gt;Saturation&lt;/h1&gt;
&lt;p dir="ltr"&gt;Every system has its limits. If you keep throwing more and more at a system, at some point it will no longer be able to function properly. The term &lt;em&gt;saturation&lt;/em&gt; refers to the state of a system where the demands on the system are high enough that the system is at the limits of its normal performance.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Electrical engineering provides a helpful visualization of saturation. Imagine you want to amplify an electrical signal by a factor of four. This is the behavior you want.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;img src="https://resilienceinsoftware.org/storage/4452706/download" width="990" height="390"&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;However, real amplifiers have limits on how much the voltage can swing. With some amplifiers, you get a clipping effect, where the output hits some maximum or minimum voltage. In the graph below, this happens above +3.3V and below -3.3V: the signal goes flat instead of following the shape of the input.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;img src="https://resilienceinsoftware.org/storage/4452707/download" width="990" height="390"&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;We say that the amplifier is in saturation when this happens. When it is in saturation, it no longer behaves like a linear amplifier anymore. Instead, it has undesired, non-linear behavior.&lt;/p&gt;
&lt;h2 dir="ltr"&gt;The Limits of Software Systems&lt;/h2&gt;
&lt;p dir="ltr"&gt;We often think of software as having an ethereal quality to it, “&lt;em&gt;only slightly removed from pure thought-stuff&lt;/em&gt;” as Fred Brooks puts it. But software needs to execute on hardware, and hardware always has physical limits. The primary physical limits of hardware are CPU cycles, memory, disk space, and network bandwidth. Each of these is a finite resource with fixed, limited capacity.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Very bad things can happen if you fully deplete these physical resources. For example, if you exhaust the available physical memory on a Linux machine, the dreaded OOM (Out of Memory) Killer will terminate a running process to reclaim some memory, and it may be a process you would prefer to keep running.&lt;/p&gt;
&lt;p dir="ltr"&gt;Since bad things like OOM kills happen when you reach the physical limits of these resources, there are also virtual limits that engineers put in place to fail before running out of physical resources. For example, many applications use &lt;em&gt;resource pools&lt;/em&gt; as a resource management strategy: the two I know best are thread pools and database connection pools.&amp;nbsp; Once all of the resources in a pool are in use, new requests are queued, and the requester must wait for a resource to become free. The effect of queueing is that it increases latency, which changes system behavior.&lt;/p&gt;
&lt;p dir="ltr"&gt;Speaking of queues, there are queues everywhere in the system, and they frequently have their own virtual limits in the form of maximum sizes. For example, if you wanted to know maximum size of the queue for incoming TCP socket connections on the Linux system, you can see it by doing:&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-family: 'Courier new';"&gt;$ sysctl net.core.somaxconn&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-family: 'Courier new';"&gt;net.core.somaxconn = 4096&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;You can view various limits by using the &lt;span style="font-family: 'Courier new';"&gt;sysctl -a&lt;/span&gt; and &lt;span style="font-family: 'Courier new';"&gt;ulimit -a&lt;/span&gt; commands, including maximum number of IP ports, maximum number of threads, or and maximum number of open file descriptors.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Even &lt;span style="font-family: 'Courier new';"&gt;sysctl&lt;/span&gt; and &lt;span style="font-family: 'Courier new';"&gt;ulimit&lt;/span&gt; won’t tell you about all of the limits on your system. I once saw a very confusing “No space left on device” error on a Linux system, even though there was plenty of free disk space. That was the day I learned that there is also a limit on the total number of files and directories that a given filesystem can support, even if these files aren’t consuming all of the space. In my case, the limit had been exhausted by a script that created many small files each time it was executed, but they never got cleaned.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Cloud computing makes it easier for us to provision additional compute, disk, and network resources. We can even automate the process of deploying additional compute resources when we’re running low. Such automated systems are called &lt;em&gt;autoscalers&lt;/em&gt;, because they automatically request or return resources based on a signal that indicates load, such as CPU utilization or request rate. But even though we describe the cloud as being &lt;em&gt;elastic&lt;/em&gt; (that’s the E in Amazon services like EC2 and EKS), cloud providers themselves have a finite amount of resources. For example, you may get an &lt;em&gt;insufficient capacity&lt;/em&gt; error, which means that your cloud provider cannot currently meet your request for additional resources. In practice, autoscalers also have virtual limits, e.g. a configured maximum number of pods or virtual machines that it will scale up to, in order to protect against an expected runaway increase in the number of pods.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Sometimes limits are imposed on us by external systems that we interact with. If your system interacts with a third-party system (such as a cloud provider) over an API, there’s a good chance that they enforce rate limits over those APIs.&amp;nbsp;&lt;/p&gt;
&lt;h2 dir="ltr"&gt;System Behavior Changes at the Limit&lt;/h2&gt;
&lt;p dir="ltr"&gt;Virtual limits exist to protect the system from breaching its physical limit. The idea is to reduce the potential damage to the system due to overload by failing earlier in a less harmful way. It’s similar to how a circuit breaker protects your house by opening a circuit to prevent the current from reaching a dangerous level. It’s annoying to have to go to the breaker box and flip the breaker closed again, but it’s better than having your house burn down. However, this may be of little consolation if your application falls over when you’ve breached a virtual limit.&lt;/p&gt;
&lt;p dir="ltr"&gt;In some cases, the system doesn’t behave any differently until the limit is actually breached. For example, a system that is writing data to disk will behave perfectly normally right until the disk is absolutely full, at which time it will return write errors. Unless you have an alert explicitly set up to track when your disk is almost full, you won’t find out until it’s too late.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;In many cases, though, the system behavior starts to change &lt;em&gt;as it nears saturation&lt;/em&gt;. As mentioned earlier, a common behavior change near saturation is an increase in response latency. For example, Java applications that run low on available memory will spend more time on garbage collection (gc), including stop-the-world phases of garbage collection which are known as &lt;em&gt;gc pauses&lt;/em&gt;. This means that as a service starts to run out of memory, response latency goes up. If the response latency gets too high, this can result in timeout errors. Similarly, high CPU utilization can also increase response latency, because the application threads have to wait to get access to CPU resources in order to service requests.&lt;/p&gt;
&lt;h2 dir="ltr"&gt;Examples of Incidents Involving Saturation&lt;/h2&gt;
&lt;p dir="ltr"&gt;Once you start thinking of saturation explicitly as a failure mode, you’ll start to see it in many incidents, either as part of the failure mode itself, or as a factor that makes response or recovery more difficult.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Here are some examples from public incident writeups, my emphasis added.&lt;/p&gt;
&lt;h3 dir="ltr"&gt;Waymo&lt;/h3&gt;
&lt;p dir="ltr"&gt;In December 2025, there was a large power outage in San Francisco. This had the surprising effect of &lt;a href="https://surfingcomplexity.blog/2025/12/23/saturation-waymo-edition/" rel="nofollow noreferrer noopener"&gt;paralyzing Waymos in the city.&lt;/a&gt; It turned out that Waymos are more likely to issue confirmation checks when encountering an intersection when the traffic lights are out. From the &lt;a href="https://waymo.com/blog/2025/12/autonomously-navigating-the-real-world" rel="nofollow noreferrer noopener"&gt;Waymo announcement&lt;/a&gt; (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;While we successfully traversed more than 7,000 dark signals on Saturday, the outage created a &lt;strong&gt;concentrated spike&lt;/strong&gt; in these requests. This created a &lt;strong&gt;backlog&lt;/strong&gt; that, in some cases, led to response delays contributing to congestion on already-overwhelmed streets.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;Cloudflare&lt;/h3&gt;
&lt;p dir="ltr"&gt;In November 2025, Cloudflare &lt;a href="https://surfingcomplexity.blog/2025/11/26/brief-thoughts-on-the-recent-cloudflare-outage/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; that involved a file being larger than a virtual limit on its size. From the &lt;a href="https://blog.cloudflare.com/18-november-2025-outage/" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;&lt;em&gt;The software&amp;nbsp;&lt;strong&gt;had a limit&lt;/strong&gt; on the size of the feature file that was below its doubled size. That caused the software to fail&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;Google Cloud Platform&lt;/h3&gt;
&lt;p dir="ltr"&gt;In June 2025, Google Cloud Platform &lt;a href="https://surfingcomplexity.blog/2025/06/14/quick-takes-on-the-gcp-public-incident-write-up/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; where recovery was made more difficult due to saturation. From the &lt;a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;&lt;em&gt;Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table),&amp;nbsp;&lt;strong&gt;overloading&lt;/strong&gt; the infrastructure.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;OpenAI&lt;/h3&gt;
&lt;p dir="ltr"&gt;In December 2024, OpenAI &lt;a href="https://surfingcomplexity.blog/2024/12/14/quick-takes-on-the-recent-openai-public-incident-write-up/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; that involved their Kubernetes API servers becoming saturated. From the &lt;a href="https://status.openai.com/incidents/ctrsv3lwd797" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;&lt;em&gt;With thousands of nodes performing these operations simultaneously, the&amp;nbsp;&lt;strong&gt;Kubernetes API servers became overwhelmed&lt;/strong&gt;, taking down the Kubernetes control plane in most of our large clusters.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;Canva&lt;/h3&gt;
&lt;p dir="ltr"&gt;Also in December 2024, Canva &lt;a href="https://surfingcomplexity.blog/2024/12/21/the-canva-outage-another-tale-of-saturation-and-resilience/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; that was triggered by a deploy. There was no problem with the newly deployed code itself, but clients downloading the new javascript files from the CDN set off a series of events that led to their API gateway getting overloaded. From their &lt;a href="https://www.canva.dev/blog/engineering/canva-incident-report-api-gateway-outage/" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;&lt;em&gt;API Gateway tasks begin failing due to&amp;nbsp;&lt;strong&gt;memory exhaustion&lt;/strong&gt;, leading to a full collapse.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;Rogers&lt;/h3&gt;
&lt;p dir="ltr"&gt;In July 2022, Rogers &lt;a href="https://surfingcomplexity.blog/2024/07/06/quick-takes-on-rogers-network-outage-executive-summary/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; that involved routers getting overloaded. From the &lt;a href="https://crtc.gc.ca/eng/publications/reports/xona2024.htm" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;The flood of IP routing data from the distribution routers into the core routers &lt;strong&gt;exceeded their capacity&lt;/strong&gt; to process the information.&amp;nbsp;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 dir="ltr"&gt;Slack&lt;/h3&gt;
&lt;p dir="ltr"&gt;In January 2021, Slack &lt;a href="https://surfingcomplexity.blog/2021/02/08/slacks-jan-2021-outage-a-tale-of-saturation/" rel="nofollow noreferrer noopener"&gt;experienced an outage&lt;/a&gt; that involved an internal provisioning system getting overloaded. From their &lt;a href="https://slack.engineering/slacks-outage-on-january-4th-2021/" rel="nofollow noreferrer noopener"&gt;writeup&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;The spike of load from the simultaneous provisioning of so many instances under suboptimal network conditions meant that provision-service hit two separate resource bottlenecks (the most significant one was the &lt;strong&gt;Linux open files limit&lt;/strong&gt;, but we also exceeded an &lt;strong&gt;AWS quota limit&lt;/strong&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 dir="ltr"&gt;Resilient Systems Deal Effectively with Saturation&lt;/h2&gt;
&lt;p dir="ltr"&gt;Saturation is a key concern in resilience engineering. In fact, according to the resilience engineering researcher David Woods, &lt;a href="https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems" rel="nofollow noreferrer noopener"&gt;the difference&lt;/a&gt; between a brittle system and a resilient one comes down to how well a system is able to deal with saturation.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Resilient systems are able to anticipate when saturation might happen soon, and take action before the imminent saturation turns into a real problem. As mentioned earlier, in cases like memory pressure, we can see symptoms that the system is getting close to a limit, but it in other cases like a disk filling up or hitting an autoscaling limit, there won’t be any indications that we’re close to a limit unless we know what signals to look for.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Resilient systems are also able to reconfigure themselves to deal with saturation. Incident response is the paradigmatic example of this sort of system reconfiguration activity, which is why resilience engineering is particularly relevant to incident response. When your organization experiences a saturation-related incident,&amp;nbsp; the human responders need to take action. You can think about the actions that they take as literally changing how the system works, in order to bring the system back to health. That’s what resilience is all about.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://resilienceinsoftware.org/storage/4452714/download" alt="" width="160" height="160"&gt;&lt;strong&gt;Lorin Hochstein&lt;/strong&gt;&lt;/p&gt;
      <pubDate>Mon, 02 Mar 2026 21:14:39 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/11475336</guid>
      <link>https://resilienceinsoftware.org/news/11475336</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/11475336/compressed-1e52f6cda0da04867c69e862eee9cffa.webp"/>
    </item>
    <item>
      <title>Four Responses to Overload</title>
      <description>&lt;blockquote&gt;
&lt;p dir="ltr"&gt;“A pattern is a way to generalize and transfer findings from one situation to others.” &lt;strong id="docs-internal-guid-5bd126e9-7fff-cee2-33a3-616e122a49c1"&gt;&amp;nbsp;&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;—&lt;/strong&gt;&lt;em&gt;Woods et al.,&lt;/em&gt; &lt;a href="https://www.researchgate.net/publication/351579618_Patterns_in_How_People_Think_and_Work_The_Importance_of_Pattern_Discovery_for_Understanding_Complex_Adaptive_Systems" rel="nofollow noreferrer noopener"&gt;Patterns in How People Think and Work: The Importance of Pattern Discovery for Understanding Complex Adaptive Systems&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;Overload is when the demands on a system, including its human operators, exceed their capacity to effectively manage those demands. It is so pervasive that people adapt to it without even noticing. What’s worse, our adaptations also conceal the fact that we are managing overload. All systems have various finite limits in capacity. The real world can and does routinely exceed those limits. As a result, adaptations to overload can be observed in almost all incidents.&amp;nbsp;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;“The threat of being trapped in a workload bottleneck leads to four patterns of adaptation.”&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;—&lt;/strong&gt;Woods et al.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;There are four responses to overload, each with its own tradeoffs. Two responses are reactive and urgent: shed load or reduce thoroughness. The other two require anticipation and enough time to take effect: add resources, or time-shift workloads.&lt;/p&gt;
&lt;h3 dir="ltr" style="text-align: center;"&gt;&lt;span style="font-size: 16px;"&gt;&lt;strong&gt;Four Responses to Overload&lt;/strong&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;div dir="ltr"&gt;
&lt;table style="width: 71.3864%; height: 239.852px; border-collapse: collapse; border-width: 1px; margin-left: auto; margin-right: auto;" border="1"&gt;
&lt;colgroup&gt; &lt;col style="width: 58.7709%;" width="486"&gt; &lt;col style="width: 41.2291%;"&gt; &lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr style="height: 0.5px;"&gt;
&lt;td style="height: 130.383px;" rowspan="2"&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 13px;"&gt;1. Shed load&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 13px;"&gt;2. Do all task components less thoroughly, consuming fewer resources&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="height: 130.383px;" rowspan="2"&gt;
&lt;p dir="ltr" style="text-align: left;"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr" style="text-align: left;"&gt;&lt;strong&gt;&lt;span style="font-size: 14px;"&gt;Tactical Responses&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="height: 129.883px;"&gt;&lt;/tr&gt;
&lt;tr style="height: 0px;"&gt;
&lt;td style="height: 109.469px;" rowspan="2"&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 13px;"&gt;3. Shift work in time to lower workload periods&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 13px;"&gt;4. Recruit more resources&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="height: 109.469px;" rowspan="2"&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Strategic Responses&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr style="height: 109.469px;"&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;We&amp;nbsp;&lt;strong&gt;shed load&lt;/strong&gt; by dropping all but the most essential tasks—sometimes at a painful cost for the tasks dropped, or if we misread which is most essential.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;When we &lt;strong&gt;reduce thoroughness&lt;/strong&gt;, spreading ourselves thin over every task, each task becomes more fragile—we trade off some immediate relief while increasing the likelihood that these points of fragility will break on us later.&lt;/p&gt;
&lt;p dir="ltr"&gt;The latter two responses depend on anticipation of the state of overload. When we &lt;strong&gt;add resources&lt;/strong&gt;, bring in more people for example, it will take time to get them up to speed so they can usefully contribute to managing the workload. But recruiting people and onboarding them is itself additional work that must trade-off against our limited resources. Bringing in new automation also requires investment and training. Adding resources only works when there is enough time to absorb those delays.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;We can &lt;strong&gt;time-shift&lt;/strong&gt; workloads: prepare some of the work ahead of time, or delay work to later. To do that effectively we need to understand changing tempos in our work. When we anticipate an upcoming rush, we need to act early enough to prepare those things that require extra time or care. Or we can choose to delay other work now if we know when things will slow down enough that we can return to it.&lt;/p&gt;
&lt;p dir="ltr"&gt;Here are a couple of examples of time-shifting in particular:&lt;/p&gt;
&lt;p dir="ltr"&gt;From &lt;a href="https://www.researchgate.net/publication/351579618_Patterns_in_How_People_Think_and_Work_The_Importance_of_Pattern_Discovery_for_Understanding_Complex_Adaptive_Systems" rel="nofollow noreferrer noopener"&gt;Patterns in How People Think and Work: The Importance of Pattern Discovery for Understanding Complex Adaptive Systems&lt;/a&gt;:&amp;nbsp;“Anesthesiology residents performed extra work during the set-up of the operating room before the patient entered, in order to avoid potential workload bottlenecks should certain contingencies arise later. When anesthesiologists needed some type of capacity to respond to acute physiological changes in a patient, they rarely had enough time or attentional resources available to carry out the tasks required to create the needed capability. Hence, expertise in anesthesiology, as in many high performance settings, consists of being able to anticipate potential bottlenecks and high tempo periods and to invest in certain tasks which prepare the practitioner to be highly responsive should the need to intervene arise.”&lt;/p&gt;
&lt;p dir="ltr"&gt;In the software context of scheduled database migrations, we typically prepare rollback plans. If something goes wrong during the migration, we are able to quickly recover. This is time-shifting some of the work of recovery ahead of the migration. There is another tradeoff here. If the database migration goes smoothly, and the prepared rollback proves unnecessary, we may feel in hindsight that those extra preparations were wasted.&lt;/p&gt;
&lt;p&gt;Overload is pervasive. Knowing this model of four responses to overload can help us to recognize its presence, to choose better adaptations when we know there’s overload, and to anticipate ways overload can cascade from one part of the system to another. Through the study of patterns, insights from one specific challenge (such as overload) become pragmatic tools we can apply in otherwise unrelated situations. However, looking at a pattern in isolation from concrete examples can feel too abstract. We recommend you also read &lt;a href="https://resilienceinsoftware.org/news/11453533" rel="nofollow noreferrer noopener"&gt;Expertise &amp;amp; Overload&lt;/a&gt; for a specific case which features real-world examples of all four responses to overload.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: 12px;"&gt;&lt;img src="https://resilienceinsoftware.org/storage/4400646/download" alt="" width="134" height="134"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 12px;"&gt;&lt;strong&gt;Eric Dobbs&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 21 Feb 2026 01:16:13 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/11469953</guid>
      <link>https://resilienceinsoftware.org/news/11469953</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/11469953/compressed-9d42efa11d85a763233ac50db9161340.webp"/>
    </item>
    <item>
      <title>Expertise and Overload</title>
      <description>&lt;p dir="ltr"&gt;Resilience engineering views incidents through a different frame than the conventional approach in the software industry, which tends to treat incidents as an irritating interruption to our work. The unrelenting pressure of the roadmap hangs over retrospectives. The clock is always ticking. We focus on the quickest fixes that might prevent the failure in the future so we can “get back to work.” To apply resilience engineering in the software business we must recognize that &lt;em&gt;incidents are part of everyday work in complex systems&lt;/em&gt;, not merely interruptions. In fact, incidents are a recurring opportunity for learning and discovery.&lt;/p&gt;
&lt;p dir="ltr"&gt;There are two features of (almost) all incidents that usually go unnoticed or ignored: expertise and overload. Both of these are pervasive in incidents, yet also elusive.&lt;/p&gt;
&lt;p dir="ltr"&gt;Expertise by its very nature conceals the difficulty of incident response work. It is deep, well-practiced know-how, judgement, and experience for effective problem-solving and effective coordination under pressure. Similarly, overload is so pervasive that people adapt to it without even noticing. The cognitive capacity of incident responders can be exhausted by the demands on attention and communication, contributing to distraction, impaired decision-making, delayed actions, and increased errors or stress.&lt;/p&gt;
&lt;p dir="ltr"&gt;In order to draw these elusive and important things into the open, I’m going to dig into a short example from a particularly difficult moment in an incident drill in which two experienced incident commanders deftly managed the overload. The high pressure of this moment is a perfect laboratory to examine these subtle and elusive features of both expertise and overload.&lt;/p&gt;
&lt;p dir="ltr"&gt;Consider this section of the video transcript from the incident drill. Let's get a first impression without commentary. (I’m deliberately not expanding the acronyms yet, bear with me.) Sarah is leading the response and Alex is her deputy.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;iframe class="embedly-embed" style="aspect-ratio: 640 / 480; max-width: 100%;" title="YouTube embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FgCL_Hc0GCFU%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DgCL_Hc0GCFU&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FgCL_Hc0GCFU%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube-nocookie" width="640" height="480" frameborder="0" scrolling="no" allowfullscreen="allowfullscreen"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;A few minutes later, Sarah is managing a few other threads of investigation.&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;iframe class="embedly-embed" style="aspect-ratio: 640 / 480; max-width: 100%;" title="YouTube embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FjX8maUosprc%3Ffeature%3Doembed%26modestbranding%3D1%26rel%3D0&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DjX8maUosprc&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FjX8maUosprc%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no" allowfullscreen="allowfullscreen"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;These exchanges happen quickly and casually. Why bother pointing to this at all? There is a lot to learn about both expertise and overload hiding between the lines of this dialog. Let's dig in.&lt;/p&gt;
&lt;p dir="ltr"&gt;First off, Sarah's claim during that exchange turns out to be a misunderstanding of Hamed's role. In my interviews with her she reported feeling embarrassed to have gotten this wrong during the drill. It is frequently the case that the expert must make effective decisions even with incomplete and inaccurate understanding. It is also notable that despite her misunderstanding of Hamed’s specific role, she correctly assessed that Tanya has deeper technical understanding than Hamed. This is a signal to me that Sarah also has expertise in quickly recognizing and contrasting the expertise of others.&lt;/p&gt;
&lt;p dir="ltr"&gt;Next up, the acronyms BCP and CAN. They are useful glimpses of how expertise conceals the difficulty of work.&amp;nbsp;Do you know these terms? I didn’t know either of them. Sarah and Alex are fluent in a shared vocabulary. For them, BCP is a &lt;em&gt;business continuity plan&lt;/em&gt; and CAN is a &lt;em&gt;conditions-actions-needs&lt;/em&gt; report. Does the expansion of the acronyms expand your understanding of the underlying complexity? This is the very point of acronyms. They encapsulate complex ideas. Experts use them to communicate efficiently, especially under pressure.&lt;/p&gt;
&lt;p dir="ltr"&gt;Many readers will have at least heard of &lt;a href="https://en.wikipedia.org/wiki/Business_continuity_planning" rel="nofollow noreferrer noopener"&gt;business continuity planning&lt;/a&gt; and related ideas of disaster recovery. Fewer readers will know that conditions-actions-needs reports are a communication framework for reporting progress during an incident. Both of these are deep rabbit holes that don’t add enough to this discussion, so I will avoid them for brevity.&lt;/p&gt;
&lt;p dir="ltr"&gt;As an observer who is unfamiliar with the jargon, these terms are opaque. Sarah and Alex know what they’re talking about, but the underlying complexity of this situation is concealed for the uninformed observer. Their fluency with shared jargon conceals the complexity they are tackling. There’s another subtlety on this point. Imagine someone who fluently understands BCP but has never heard of CAN. They would feel the elevated sense of urgency knowing that whatever else is going on, Sarah thinks they may need to engage the business continuity plan. They might view the delay in Alex’s response as lacking an appropriate level of urgency.&lt;sup&gt;1&lt;/sup&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;At this point in the incident, Sarah has previously asked four different times in Slack to learn more about the BCP. The only answer has been a link to the document. These are subtle clues of her expertise in recognizing and managing overload.&lt;/p&gt;
&lt;p dir="ltr"&gt;Take note of what she is doing and what she is &lt;em&gt;not&lt;/em&gt; doing. She isn’t reading the document herself. This hints that she is adapting to overload. A less experienced person might adapt differently, trying to still do all the things in front of them but &lt;em&gt;reducing thoroughness&lt;/em&gt; on each. Another option would be to just drop it—she could shed load by ignoring the BCP. I can tell that she hasn’t just dropped the work because she keeps asking for more details. And each request for information is itself an attempt to recruit resources to help cope with the overload. Let’s look at how she recruits Alex to do this work.&lt;/p&gt;
&lt;p dir="ltr"&gt;Sarah's expertise with overload goes beyond understanding her own limits. Even while she delegates the reading to Alex, she demonstrates awareness that he may also be overloaded.&lt;/p&gt;
&lt;p dir="ltr"&gt;Knowing the BCP will demand careful attention, she asks about his capacity (“&lt;em&gt;do you have bandwidth&lt;/em&gt;…”) even before she makes the specific request for help (“...&lt;em&gt;to read it&lt;/em&gt;?”).&lt;/p&gt;
&lt;p dir="ltr"&gt;What's more, we see a moment of investing in &lt;em&gt;common ground &lt;/em&gt;&lt;sup&gt;2&lt;/sup&gt; with Alex, and reducing his cognitive work for this task. She shares her screen to show exactly which document. This gesture is loaded with insight about how expertise tackles overload:&lt;/p&gt;
&lt;p dir="ltr"&gt;I now know that she’s opened the doc even if she hasn’t read it. Moreover, she’s keeping track of it to be able to quickly pull it into view. This is an example of &lt;em&gt;time-shifting&lt;/em&gt; work. She’s got the document close to hand when delegating to Alex. Showing the doc reduces the scope of work for Alex by priming his perception to recognize the document. Alex won't have to spend any extra energy wondering if he is looking at the correct document. It also subtly signals the importance of the request.&lt;/p&gt;
&lt;p dir="ltr"&gt;Sarah's demonstration of her expertise at managing overload and maintaining common ground happens in the space of a single breath.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Expertise hides what makes work difficult.&lt;/strong&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;Because overload is pervasive in incidents, experienced responders must develop tacit expertise in managing overload. But because that expertise is tacit, they won't even notice the subtle ways they help everyone around them cope with overload.&lt;/p&gt;
&lt;p dir="ltr"&gt;If you have ever had the experience of relief when one of your best responders shows up, you have felt the tacit knowledge. You don't know why, but somehow incidents just go better when they are in the room. And maybe the main thing you have learned is to call for their help if things get bad.&lt;/p&gt;
&lt;p dir="ltr"&gt;By contrast, this example features two responders who know overload and are explicitly familiar with a pattern of four responses to overload (for more on this, see the companion article on the &lt;a href="https://resilienceinsoftware.org/news/11469953" rel="nofollow noreferrer noopener"&gt;Four Responses to Overload&lt;/a&gt;). When I interviewed them individually about their experience in the drill, each brought up this pattern on their own without any prompting from me. Given that the pattern was that front-of-mind for them, when they participate in incident retros, they almost certainly bring these topics up for discussion and deliberately spread the explicit knowledge in their respective organizations.&lt;/p&gt;
&lt;p dir="ltr"&gt;What’s more, we can see Sarah repeatedly choose the two more strategic adaptations to her own overload: recruiting resources and time-shifting work. Unless the observer knows to look for it, and knows how to recognize it, this expertise in managing overload is invisible.&lt;/p&gt;
&lt;p dir="ltr"&gt;With similar expertise, Alex responds to Sarah’s request by time-shifting work:&amp;nbsp;“&lt;em&gt;Yeah, once I get this CAN out then I'll read it. I'm almost done&lt;/em&gt;.”&lt;/p&gt;
&lt;p dir="ltr"&gt;He responds, signalling he understands the importance of the request. He adapts to his own overload by momentarily deferring this request. He also explicitly states what he has prioritized instead—this is extra energy Alex spends to maintain common ground with Sarah. This provides the briefest update on her previous request for the CAN, and also offers her a chance to override that prioritization decision.&lt;/p&gt;
&lt;p dir="ltr"&gt;Once again, this display of expertise happens in the space of a single breath. Sarah’s short reply (“&lt;em&gt;Awesome. Thank you&lt;/em&gt;.”) serves to close this short conversation—implicitly validating Alex’s decision to finish the CAN first.&lt;/p&gt;
&lt;p dir="ltr"&gt;There are more examples of overload and expertise in the next exchange.&lt;/p&gt;
&lt;p dir="ltr"&gt;...&lt;em&gt;busily typing and voicing a complicated group of questions&lt;/em&gt;...&lt;br&gt;&lt;strong&gt;Sarah&lt;/strong&gt;: Oh man it takes me so long to type.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Alex&lt;/strong&gt;: There are some execution steps for you in the BCP for failing over.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Sarah&lt;/strong&gt;: Hold just a second while I get these... uh...&lt;br&gt;...&lt;em&gt;this train of thought trails off as she switches to voice what she's typing&lt;/em&gt;...&lt;/p&gt;
&lt;p dir="ltr"&gt;In the video recording we can see and hear many ways Sarah tries to phrase the questions. Alex could also see and hear her working. He nevertheless chooses to interrupt, increasing Sarah's cognitive load. But he also gives just enough indication to justify the interruption: this is about the BCP.&lt;/p&gt;
&lt;p dir="ltr"&gt;Sarah, explicitly time-shifts work again (“&lt;em&gt;Hold just a second&lt;/em&gt;.”). She also tacitly sheds load (...&lt;em&gt;this train of thought trails off&lt;/em&gt;...). This moment is the first we’ve seen of Sarah shedding load.&lt;/p&gt;
&lt;p dir="ltr"&gt;This example of shedding load is particularly instructive. We frequently adapt to overload without any conscious decision. Under these overloading demands, Sarah focuses attention on the most important or most urgent work—in this case, typing the questions, voicing them, and explicitly targeting them at specific responders.&lt;/p&gt;
&lt;p dir="ltr"&gt;Notice again how expertise hides the difficulty of the work. Some of the work disappears completely. Sarah normally does extra work to provide context of her thinking. That extra work just drops as her voice tracks exactly what she's typing. The absence of that work is only apparent by contrasting what "normal" sounds like for her under less intense circumstances.&lt;/p&gt;
&lt;p dir="ltr"&gt;Let’s look at one more observation of expertise with managing overload in the complicated group of questions Sarah typed into Slack:&amp;nbsp;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Can we move part of the traffic? Or is it all or nothing&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Are there servers in the area of the DC we can shut down&lt;/em&gt;? &lt;em&gt;Is our UAT there and able to be shut down for instance?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Can the DC vendor give us an ETA? Is there anything else they can do to cool the room they are not already doing&lt;/em&gt;? &lt;em&gt;Are we the only tenant there&lt;/em&gt;?&lt;/li&gt;
&lt;/ol&gt;
&lt;p dir="ltr"&gt;What we learn here is that the model of four responses to overload applies to machines just as it applies to people. Her first question (“&lt;em&gt;Can we move part of the traffic&lt;/em&gt;?”) is recruiting resources in the form of an alternate data center.&lt;/p&gt;
&lt;p dir="ltr"&gt;Her second question (“&lt;em&gt;Are there servers ... we can shut down&lt;/em&gt;?”) is working the same problem from the opposite direction. It would shed load on the network and also remove heat-generating computation from the data center.&lt;/p&gt;
&lt;p dir="ltr"&gt;Her third question (“&lt;em&gt;Anything else they can do to cool the room&lt;/em&gt;?”) is also recruiting resources.&amp;nbsp;She also explicitly directs those questions to specific responders. Each question is demanding on its own and Sarah is spreading that load across the team.&amp;nbsp;Notice how she is managing overload at many levels simultaneously: her own, the incident response team, and the machines and resources related to the data center.&lt;/p&gt;
&lt;p dir="ltr"&gt;Finally, she is able to turn her attention back to Alex (“&lt;em&gt;Alex, sorry, I put you in a buffer. What was that about the BCP&lt;/em&gt;?”). There is a subtle signalling that Alex had interrupted skillfully. Sarah shows her understanding that he has the details about the BCP and that there is work in that plan that she must execute.&lt;/p&gt;
&lt;p dir="ltr"&gt;It is again notable what we do not see here. Whatever else Alex has been doing, he’s managed his own attention and overload to be able to respond to Sarah at this point. Unlike the earlier exchange when he was working on the CAN and deferred her request, here he immediately engages by reading the six steps aloud.&lt;/p&gt;
&lt;h1 dir="ltr"&gt;Putting Overload &amp;amp; Expertise Insights Into Action&lt;/h1&gt;
&lt;p dir="ltr"&gt;This part of the incident drill illuminates three things that are difficult to even perceive: expertise, overload, and explicit expertise about overload. Instead of observing overload and expertise directly (which is often difficult or impossible), we can infer their presence by looking at what is observable: the adaptations themselves.&lt;/p&gt;
&lt;p dir="ltr"&gt;Overload is pervasive. Knowing this model of four responses to overload can help us to recognize its presence, to anticipate ways overload can cascade from one part of the system to another, and to choose better adaptations when we know there’s overload.&lt;/p&gt;
&lt;p dir="ltr"&gt;If you suspect you have hidden overload, look for any of the four adaptations. If you see time-shifting work or recruiting resources, you can infer that responders are anticipating the risk of overload and may be adequately managing it. Shedding load and reducing thoroughness suggest acute overload that deserves more aggressive intervention. For example, resources that have been recruited to support local overload do not come for free. Ensure whatever tradeoffs that are paying for the additional resources are also under consideration. If there are many tasks that are getting time-shifted, there may be hidden costs of context-switching to mitigate those.&lt;/p&gt;
&lt;p dir="ltr"&gt;If you already know overload is present, then review the four adaptations to weigh your options for managing it. If the overload is an immediate threat, focus on ways to shed load or reduce thoroughness. If you feel some breathing room, look around for resources you could recruit or find ways to time-shift the work.&lt;/p&gt;
&lt;p dir="ltr"&gt;Expertise can be even more elusive than overload. It is certainly much more difficult to categorize. So much expertise develops tacitly that experts themselves have a difficult time explaining how they do what they do. Although I have shown many examples of how subtle and invisible expertise is, I have not offered any specific advice about how to recognize expert performance in the wild. There are learnable skills for discovering this expertise but we will tackle those in a different article.&lt;/p&gt;
&lt;h1 dir="ltr"&gt;Learning in Addition to Fixing&lt;/h1&gt;
&lt;p dir="ltr"&gt;I’ve made another subtle move in this article by featuring an incident drill. Drills by their nature focus attention on how the incident responders do their work, instead of focusing on what needs to be fixed in the code or documentation, or monitoring. Humans develop skills and expertise through practice and experience. Conventional retrospectives focused only on the fixing miss out on all the places we could be sharing expertise across the organization or improving the work of running incidents.&lt;/p&gt;
&lt;p dir="ltr"&gt;Remember our resilience engineering frame: incidents are first-class work. We can gain a competitive advantage by rewarding ongoing professional development in incident response. Through this know-how we can continually earn the trust of our customers.&lt;/p&gt;
&lt;p dir="ltr"&gt;One cool trick you can now apply to all of your future retrospectives is to ask how people adapted to overload in this incident. These patterns of adaptations to overload happen at a wide range of tempos and scales. Your teams and your services have unique expressions of overload and associated adaptations. Incidents provide an opportunity to look closely at those specific adaptations to specific overload and where specific circumstances exceeded the capacity of the system to adapt. The general pattern is useful, but your people need the grungy local details about how overload shows up in your world. When reflecting on incidents after the fact, these observations can reveal improvements to the operational practices of your teams or the operational tooling of the services in question.&lt;/p&gt;
&lt;p dir="ltr"&gt;Sharing the model of four responses to overload across your teams, and regularly discovering how people apply that model in real incidents is a powerful driver to turn incident retros into a community of practice with ongoing transfer of expertise and know-how. In this way, you can create leverage from the pervasive combination of expertise, and overload in incidents and spread the more specific expertise about overload.&lt;/p&gt;
&lt;h1 dir="ltr"&gt;Further Reading&lt;/h1&gt;
&lt;p dir="ltr"&gt;Interested readers can learn a lot more in &lt;a href="https://www.researchgate.net/publication/333091997_Approaching_Overload_Diagnosis_and_Response_to_Anomalies_in_Complex_and_Automated_Production_Software_Systems" rel="nofollow noreferrer noopener"&gt;Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems&lt;/a&gt; by Marisa Bigelow. There are four case studies of different software and network incidents with thoughtful attention to overload. You may be surprised to notice that the responses to overload are present equally in the software services and in the people running those services. What’s more, Bigelow traces connections between the overload of the software and the overload among the responders—an example of how overload in one part of a system can cascade to other parts.&lt;/p&gt;
&lt;p dir="ltr"&gt;See also:&amp;nbsp;&lt;a href="https://www.researchgate.net/publication/351579618_Patterns_in_How_People_Think_and_Work_The_Importance_of_Pattern_Discovery_for_Understanding_Complex_Adaptive_Systems" rel="nofollow noreferrer noopener"&gt;Patterns in How People Think and Work: The Importance of Pattern Discovery for Understanding Complex Adaptive Systems&lt;/a&gt;. Woods, David &amp;amp; Licu, Tony &amp;amp; Leonhardt, Jörg &amp;amp; Rayo, Michael &amp;amp; Balkin, E. &amp;amp; Cioponea, Radu. (2021).&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;em&gt;Thanks to &lt;strong&gt;Fred Hebert&lt;/strong&gt;, &lt;strong&gt;Greg Holecek&lt;/strong&gt;, and &lt;strong&gt;James Boyd &lt;/strong&gt;for reviewing early drafts. &lt;/em&gt;&lt;strong id="docs-internal-guid-0de1ecf1-7fff-3a46-2162-5cbbbdbf999c"&gt;&lt;/strong&gt;&lt;em&gt;Special thanks to &lt;strong&gt;Courtney Nash&lt;/strong&gt;, &lt;strong&gt;Sarah Butt&lt;/strong&gt;, &lt;strong&gt;Alex Elman&lt;/strong&gt;, &lt;strong&gt;Hamed Silatani&lt;/strong&gt; and Uptime Labs. This article has been a long time coming. It started with a thematic analysis of many production incidents when I was a member of Alex’s team at a previous employer. That became a foundation for conference talks on the multi-party dilemma given by Sarah and Alex in 2023. The three of us shared the insights from the multi-party dilemma with Uptime Labs which informed the structure of the drill. Then Courtney had the brilliant idea that I should do an analysis of Sarah and Alex practicing that drill. Beth &lt;strong&gt;Adele Long&lt;/strong&gt;’s workshops on regenerative productivity helped me strongly narrow the focus of this article to something manageable. &lt;strong&gt;John Allspaw&lt;/strong&gt; clarified an important distinction between tacit expertise and the Law of Fluency—tacit expertise is why we need the many tools from Naturalistic Decision Making in order to discover expertise, whereas the Law of Fluency concerns the way experts actively adapting conceal the difficulty of their work to all the people around them. Those are tightly related, but also not the same thing. And I am particularly grateful to &lt;strong&gt;Brian Marick&lt;/strong&gt; for probably the best demonstration of constructive feedback I’ve ever seen—especially the examples of putting specific examples first, then explaining the theory from the concrete.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p dir="ltr"&gt;&lt;sup&gt;1&amp;nbsp;&lt;/sup&gt;&lt;span style="font-size: 12px;"&gt;Investigating the jargon in use between experts under pressure helps to reveal the complexity hidden in plain sight in day-to-day work. See &lt;a href="https://www.researchgate.net/publication/335606945_Chapter_3_Being_Bumpable_Consequences_of_Resource_Saturation_and_Near-Saturation_for_Cognitive_Demands_on_ICU_Practitioners" rel="nofollow noreferrer noopener"&gt;Chapter 3 Being Bumpable: Consequences of Resource Saturation and Near-Saturation for Cognitive Demands on ICU Practitioners&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;sup&gt;2 &lt;/sup&gt;&lt;span style="font-size: 12px;"&gt;A colloquial understanding of common ground will be good enough to let us focus on the subtleties of expertise and overload. That said, there is a deeper understanding of common ground and common grounding that I have not covered in this article but nevertheless fits perfectly in this context. Klein, Gary &amp;amp; Feltovich, Paul J. &amp;amp; Bradshaw, Jeffrey &amp;amp; Woods, David. (2005). &lt;a href="https://www.researchgate.net/publication/227992178_Common_Ground_and_Coordination_in_Joint_Activity" rel="nofollow noreferrer noopener"&gt;Common Ground and Coordination in Joint Activity&lt;/a&gt;. 10.1002/0471739448, ch6.&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;sup&gt;3 &lt;/sup&gt;&lt;span style="font-size: 12px;"&gt;This section heading is how Dr. Richard Cook paraphrased The Law of Fluency from Dr. David Woods: “Well”-adapted work occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced. Woods, David &amp;amp; Hollnagel, Erik. (2006). &lt;a href="https://www.researchgate.net/publication/329025433_Joint_Cognitive_Systems_Patterns_in_Cognitive_Systems_Engineering" rel="nofollow noreferrer noopener"&gt;Joint Cognitive Systems: Patterns in Cognitive Systems Engineering&lt;/a&gt;. 10.1201/9781420005684. p.20.&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: 12px;"&gt;&lt;img src="https://resilienceinsoftware.org/storage/4400602/download" alt="" width="134" height="134"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;span style="font-size: 12px;"&gt;&lt;strong&gt;Eric Dobbs&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 21 Feb 2026 00:52:15 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/11453533</guid>
      <link>https://resilienceinsoftware.org/news/11453533</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/11453533/compressed-c9ae4f8eeeaa7e95ef18d2a8870dcb44.webp"/>
    </item>
    <item>
      <title>Decompensation and Cascading Failures</title>
      <description>&lt;p dir="ltr"&gt;Consider the following scenario: A set of automated tasks has somehow failed to run to completion. Because a thorough fix will take some time and the tasks you need run are time-sensitive, you complete the work manually to keep things moving forward. If this sounds familiar, then you may have helped provide &lt;em&gt;compensation&lt;/em&gt; for that system. Compensation is a very interesting mechanism in software systems because it can keep complex systems alive, but also because it can be a factor in how they quickly and unexpectedly collapse.&lt;/p&gt;
&lt;p dir="ltr"&gt;There’s a tangling of closely related concepts in describing how complex systems collapse. Many of these lead to cascading failures, where a failure in one part of the system propagates to another one or to many others, which further causes other subsystems to fail.&lt;/p&gt;
&lt;p dir="ltr"&gt;Past figuring out how individual failures happen, there are ways in which the system fragilizes itself, such as increasing coupling between parts by &lt;a href="https://ferd.ca/notes/paper-going-solid.html" rel="nofollow noreferrer noopener"&gt;going solid&lt;/a&gt;. A more subtle way coupling can happen is through compensation, which hides that anything might even be going wrong.&lt;/p&gt;
&lt;p dir="ltr"&gt;In this post, we’ll take a look at what compensation means, and then how decompensation can rapidly happen and lead to a cascade of failures, and why this is relevant to Resilience Engineering.&lt;/p&gt;
&lt;h3 dir="ltr"&gt;Compensation&lt;/h3&gt;
&lt;p dir="ltr"&gt;A basic rule of complex systems is that &lt;a href="https://how.complexsystems.fail/#5" rel="nofollow noreferrer noopener"&gt;they always run in a degraded mode&lt;/a&gt;, but they continue to function because the system has ways to deal with some of these situations. Degradation can take multiple forms. A partial loss of capacity or temporary overload can be handled by tradeoffs between efficiency and thoroughness (see the &lt;a href="https://erikhollnagel.com/ideas/etto-principle/" rel="nofollow noreferrer noopener"&gt;ETTO principle&lt;/a&gt;), or through structural elements like redundancy and building spare capacity, for example. A more subtle coping mechanism relies on adaptive patterns of compensation, a term borrowed from biology that indicates the mechanism through which an organism finds new ways to perform a capacity that has been lost.&lt;/p&gt;
&lt;p dir="ltr"&gt;Compensation is rarely seen in software as we often think of it, e.g. code running without humans, because software tends to be designed such that everything does the job it has been designed to do, and little else. Instead, compensation in this context, as an adaptive mechanism, is most likely to be observed in the human part of a sociotechnical system.&lt;/p&gt;
&lt;p dir="ltr"&gt;Examples of humans compensating for software faults or flaws could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;In provisioning new enterprise accounts, the system will automatically allocate sufficient resources for the contract. However, because new users tend to do large migration jobs when setting up their environment, the customer success team makes a habit of manually overriding parameters to grant more capacity for the first weeks of use, before returning it to normal once usage is steady.&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;During an outage where hardware capacity is limited, a team will decide to run their own services hotter by tweaking their config to do more work on fewer devices or on older hardware (possibly slightly hurting their own tail latencies in the process) in order to leave more room for other teams’ more critical services whose lack of resources would have a worse impact on users.&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;A cache is added in front of a database to speed up responses, and limit the impact of slow queries. As the database becomes reliant on the cache to protect it, teams gradually build utilities and ad-hoc processes to backfill and prewarm the cache in case of issues, rather than spending time on the more challenging task of optimizing queries or re-thinking their product design.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p dir="ltr"&gt;Likewise, people can and will compensate &lt;em&gt;for each other&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;An on-call worker has been on an incident at night, and as they are recovering or exhausted, someone else who is not currently on-call intercepts and fields alerts instead, just to make sure their colleague is not woken up.&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;Designers are unavailable, but a graphical bug has been ticketed. You decide to fix it to the best of your ability by looking at similar elements in your product despite not having design as a requirement or expectation of your role.&lt;/p&gt;
&lt;/li&gt;
&lt;li dir="ltr"&gt;
&lt;p dir="ltr"&gt;The platform team is swamped with work to support other internal customers. But deadline pressure in the product team cannot wait for the platform team to become available. So the product team spins up cloud services on their own to work around the perceived bottleneck of the platform team.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p dir="ltr"&gt;You’ll note that all these examples of compensation require doing things such as a unit bending and unofficially extending their definition of work, using their own spare capacity to help or cover for another part of the system that is under stress, or doing a mix of both to provide capabilities that are desirable but currently not part of the nominal function. &lt;strong&gt;Compensation can be temporary, or, over time, become a permanent fixture of how the system works.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 dir="ltr"&gt;Decompensation&lt;/h2&gt;
&lt;p dir="ltr"&gt;Compensatory mechanisms are often called on so gradually that your average observer wouldn't even know it's taking place. Systems (or organisms) that appear absolutely healthy one day collapse, and that is how we discover they were overextended for a long while.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;To understand decompensation, let’s go back to biological systems, with an example of congestive heart failure provided by&amp;nbsp; Dr. Richard I. Cook in a group discussion a few years ago.&lt;/p&gt;
&lt;p dir="ltr"&gt;Effects of heart damage accumulate gradually over the years—partly just by aging—and can be offset by compensatory mechanisms in the human body. As the heart becomes weaker and pumps less blood with each beat, adjustments manage to keep the overall flow constant over time. This can be done by increasing the heart rate using complex neural and hormonal signaling.&lt;/p&gt;
&lt;p dir="ltr"&gt;Other processes can be added to this: kidneys faced with lower blood pressure and flow can reduce how much urine they create to keep more fluid in the circulatory system, which increases cardiac filling pressure, which stretches the heart further before each beat, which adds to the stroke volume. Multiple pathways of this kind exist through the body, and they can maintain or optimize cardiac performance.&lt;/p&gt;
&lt;p dir="ltr"&gt;However, each of these compensatory mechanisms has other, less desirable consequences. The heart remains damaged and they offset it, but the organism remains unable to generate greater cardiac output, such as would be required during exercise. You would therefore see "normal" cardiac performance at rest, with little ability to deal with increased demand. If the damage is gradual enough, the organism will adjust its behavior to maintain compensation: you will walk slower, take breaks while climbing stairs, and will just generally avoid situations that strain your body. This may be done without awareness of the decreased capacity of the system, and you may even resist acknowledging that you ever slowed down.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Decompensation happens when all the compensatory mechanisms no longer prevent a downward spiral&lt;/strong&gt;. If the heart can't maintain its output anymore, other organs (most often the kidneys) start failing. A failing organ can't overextend itself to help the heart; what was a stable negative feedback loop (slowing down some adverse effect) becomes a positive feedback loop (accelerating and amplifying things), which quickly leads to collapse and death.&lt;/p&gt;
&lt;p dir="ltr"&gt;Someone with a compensated congestive heart failure appears externally well and stable. They have gradually adjusted their habits to cope with their limited capacity as their heart weakened through life. However, looking well and healthy can hide how precarious of a position the person is in. Someone in their late sixties skipping their heart medication for a few days or ignoring their low-salt diet could be enough to tip the scales into decompensation.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Decompensation usually doesn’t happen because compensation mechanisms fail, but because their range is exhausted.&lt;/strong&gt; A system that is compensating looks fine until it doesn’t. That's when failures may cascade and major breakdowns occur.&lt;/p&gt;
&lt;h2 dir="ltr"&gt;Decompensation and Resilience in Software&lt;/h2&gt;
&lt;p dir="ltr"&gt;A classic case of decompensation in software systems is introducing queues to handle temporary overload in front of services with fixed capacity. However, as the lulls that are used to catch up become rarer with increased usage, the software system reaches &lt;a href="https://surfingcomplexity.blog/2025/12/23/saturation-waymo-edition/" rel="nofollow noreferrer noopener"&gt;saturation&lt;/a&gt;. People might try to compensate by increasing the queue size or increasing job timeouts, but as delays get worse, tasks get re-enqueued by clients, which further saturates the queue; at this point the ability to delay jobs is no longer effective at controlling load, and the accelerating feedback loop tells us the system is on its way to collapse.&lt;/p&gt;
&lt;p dir="ltr"&gt;From that point on, we may encounter a cascading failure. Some &lt;em&gt;tipping point&lt;/em&gt; is met, and one failure mode grows and propagates throughout the rest of the system. While cascading failures are fascinating and worth studying on their own, they can be caused by a variety of mechanisms. Decompensation is one of them, but so are &lt;a href="https://www.sciencedirect.com/topics/engineering/common-mode-failure" rel="nofollow noreferrer noopener"&gt;common mode failures&lt;/a&gt;—a simultaneous failure of several components due to a single shared dependency&lt;/p&gt;
&lt;p dir="ltr"&gt;Decompensation will tend to be more subtle because compensating behavior hides problems. We could think of auto-scaling reaching its limits and then killing a cluster because it falls behind on its ability to do work as a common mode failure. This could become decompensation if you discovered that whenever your system neared its scaling limits, engineers went in and made optimizations that kept squeezing the lemon further and further, trying to find ways to reduce demand.&lt;/p&gt;
&lt;p dir="ltr"&gt;Common mode failures are worth looking into because they too can cause cascades, and might reveal gaps in how we analyzed or planned our systems. While compensation and decompensation can reveal similar issues, they are of particular interest to Resilience Engineering because they are also signs of &lt;em&gt;adaptive behavior&lt;/em&gt;.&lt;/p&gt;
&lt;h2 dir="ltr"&gt;Adaptive Behavior, Adaptation, and Resilience Engineering&lt;/h2&gt;
&lt;p dir="ltr"&gt;It is worthwhile to make a distinction between compensation, which is a type of adaptive process, and adaptation writ large. See compensation as coping with ongoing pressures or degradations (even if they never end!), whereas successful adaptation could be thought of as reaching new, more optimal operating points. Think of how someone driving a manual car with worn out brakes (or who fears their brakes might overheat) might use &lt;a href="https://en.wikipedia.org/wiki/Engine_braking" rel="nofollow noreferrer noopener"&gt;engine braking&lt;/a&gt; to slow the car down when going downhill; this is very different from &lt;a href="https://en.wikipedia.org/wiki/Regenerative_braking" rel="nofollow noreferrer noopener"&gt;regenerative braking&lt;/a&gt; in electric or hybrid vehicles, which is a designed feature.&lt;/p&gt;
&lt;p dir="ltr"&gt;The reason this distinction is useful is that while compensation has people work outside the design envelope, finding this behavior is often key to &lt;em&gt;redefining&lt;/em&gt; the design envelope. Successful adaptation requires harnessing adaptive behaviors &lt;em&gt;and&lt;/em&gt; reworking the system.&lt;/p&gt;
&lt;p dir="ltr"&gt;This fundamentally requires you to study &lt;a href="https://humanisticsystems.com/2016/12/05/the-varieties-of-human-work/" rel="nofollow noreferrer noopener"&gt;work-as-done&lt;/a&gt; (a topic for another blog post to come) to understand where people act with reciprocity, sacrificing bits of their performance to help the overall system, where they bend rules and definitions to provide success. In places where this is a habit, you will find both strength in how they adapt, and brittleness in the risk of becoming dependent on this hidden work without knowing about it, and exhausting all such capacity in the future.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Part of Resilience Engineering in practice is knowing this happens, that it is necessary to deal with the unexpected, and knowing how to both protect, foster, and build on this type of capacity to keep systems successful.&lt;/p&gt;
&lt;h3 dir="ltr"&gt;Further Reading&lt;/h3&gt;
&lt;p dir="ltr"&gt;If you’d like to learn more about decompensation, along with two other contributors to complex system failures, I recommend starting with &lt;a href="https://www.researchgate.net/publication/284324002_Basic_patterns_in_how_adaptive_systems_fail" rel="nofollow noreferrer noopener"&gt;Basic Patterns in How Adaptive Systems Fail&lt;/a&gt;.&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://resilienceinsoftware.org/storage/4239872/download" alt="" width="134" height="134"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fred Hebert&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Staff SRE, Honeycomb&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Thu, 22 Jan 2026 05:22:00 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/11454232</guid>
      <link>https://resilienceinsoftware.org/news/11454232</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/11454232/compressed-4508ac3932b86825e20ec49f73d137d2.webp"/>
    </item>
    <item>
      <title>Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader</title>
      <description>&lt;p dir="ltr"&gt;&lt;em&gt;Author’s note: I need to credit my friend and former colleague Tim Nicholas for his review and contribution to this article and the insights within it. Tim has been pivotal in shaping how I understand failures in complex systems and Resilience Engineering, and how I incorporate this into my overall approach as an engineering leader. (Also, the spelling is New Zealand English.)&lt;/em&gt; &lt;/p&gt;
&lt;p dir="ltr"&gt;So you’re an Engineering Manager or a Director, or perhaps a Staff Engineer. You’ve discovered the world of Resilience Engineering and had some pretty significant "a ha" moments. Now you want to bring Resilience Engineering into your organisation, and set the world to rights. &lt;/p&gt;
&lt;p dir="ltr"&gt;I’ve been exactly where you are right now, and I’m going to suggest you hold your horses, and let me share my insights and recommendations from my experience as an Engineering Leader working to make Resilience Engineering a reality. There’s pitfalls on this journey that I can help you avoid.&lt;/p&gt;
&lt;p dir="ltr"&gt;As an SRE (or adjacent) leader, you will find yourself in a challenging position when it comes to Resilience Engineering. You’re the layer in the middle. You're stuck between the practitioners, who are passionate but may not have good insight into the realities and challenges of leaders, and the executive layers, who have different goals and concerns and don’t have the motivation to deviate from things they think are working. &lt;/p&gt;
&lt;p dir="ltr"&gt;Most people working in tech and IT, no matter the type of company or organisation, will be familiar with various practices around incidents. A lot of these are summed up in regular incident reporting. Often produced monthly or existing in a dashboard, this reporting includes data and metrics such as MTTR, incident counts by severity and root cause type, and compliance with completing post incident review (PIR) action items. &lt;/p&gt;
&lt;p dir="ltr"&gt;These are often non-negotiable—the default standards and expectations that have permeated deeply and widely throughout our industry. In many cases, Boards and Executives expect to see these reports as indicators of risk and operational effectiveness, to ensure they fulfill their corporate governance obligations. Engineering Leaders and Managers simultaneously use them as a purported lens to measure team and engineering system performance. Producing a Root Cause Analysis (RCA) is a deeply embedded expectation of both leadership and customers following a high severity incident, as is tracking completion of PIR action items. You can’t explore a monitoring or incident response tool without  being sold the ability to drive down your MTTR. &lt;/p&gt;
&lt;p dir="ltr"&gt;However, these are all things that those of us in the Resilience in Software community know and accept to be not just misleading and outdated, but also counterproductive and potentially harmful to the actual improvement of safety in an organisation.&lt;/p&gt;
&lt;p dir="ltr"&gt;That leaves those of us in Resilience Engineering with quite the dilemma. Do we speak out against these deeply embedded practices and expectations and seek to explicitly educate our colleagues and leaders, or is it all a necessary evil in making space for resilience work? &lt;/p&gt;
&lt;h2 dir="ltr"&gt;The Paradox&lt;/h2&gt;
&lt;p dir="ltr"&gt;The fact is, we are faced with a paradox. A contradiction that reveals deeper truths and a reality that is difficult to reconcile. &lt;/p&gt;
&lt;p dir="ltr"&gt;For an engineering leader to build trust and credibility—to be perceived as competent—they need to meet or exceed the expectations of leaders and executives by carrying out these default standards and practices. You believe this to be contrary to your goals, however my experience tells me you should do these things anyway in pursuit of bigger picture resilience and adaptive capacity. &lt;/p&gt;
&lt;p dir="ltr"&gt;Before you get upset and stop reading, hear me out. I’ve been the leadership layer in the middle, sandwiched between the practitioners who really understand Resilience Engineering and see the need for it every day, and the executives with different concerns, goals and motivations. &lt;/p&gt;
&lt;p dir="ltr"&gt;My goal when it comes to Resilience Engineering is to legitimise the work, the practices and the concepts. I want to make space for engineers to conduct incident analysis to better understand our systems and the conditions that enable incidents to occur. I also want to build understanding and acknowledgement of complexity and Resilience Engineering in leadership to enable better awareness and decision making. &lt;/p&gt;
&lt;p dir="ltr"&gt;Very specifically, my goals no longer include the death of root cause analysis, incident metrics, MTTR and PIR action tracking. &lt;/p&gt;
&lt;p dir="ltr"&gt;Telling executive leadership that incident counts, MTTR, and root cause analysis are wrong/incorrect/misleading gets you nowhere. I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations. This narrative is harming our ability to do resilience work, and undermines the credibility of our ideas, our practices and our movement. It is very difficult to educate people against their will, especially if their view of competence on a subject is different to your own. We must make Resilience Engineering appealing and useful to executives if we are to get their support and buy in.&lt;/p&gt;
&lt;p dir="ltr"&gt;Board and Executive pressure around incidents and the need to present something that is viewed as tangible and sound should never be underestimated. Once you are in the pressure cooker of “too many incidents” or “poor reliability,” incident data and reports can give leaders the tools to &lt;strong&gt;feel&lt;/strong&gt; in control but also demonstrate that they are in control. Many of the traditional incident practices and metrics help leaders communicate that they are controlling things, and also creates opportunities for them to exert control on what people do—even when they don't have control over the actual system or system outcomes. &lt;/p&gt;
&lt;p dir="ltr"&gt;The closer you are to the sharp end—the point where the most direct and often challenging work of a particular activity takes place—the more apparent it is that uncertainty exists. Many of the above-mentioned default standards and expectations around incidents are designed to help manage and control uncertainty. However, Resilience Engineering is essentially about embracing uncertainty as an unavoidable reality of complex systems. As a result we tend to want everyone to accept uncertainty as unavoidable. In the Boardroom, far removed from the sharp end, the perspective is very different. Uncertainty is something that should be managed and reduced. It can become a risk with acceptable probability, but it can't be unmanaged. From this different perspective, the system and its outcomes must be controlled, or have the illusion of being controlled, even if we know this to be at odds with the system’s reality. &lt;/p&gt;
&lt;h2 dir="ltr"&gt;My Recommendations&lt;/h2&gt;
&lt;p dir="ltr"&gt;You need to play the game that others think they are playing. As an engineering leader, you only have so much social capital and so many cards to play in what you can change and influence. You want these cards to be used for your agenda, and you want to make meaningful progress. This might be influencing for additional headcount or prioritising a particular piece of system improvement work. I don’t think those limited cards should be used for debating concepts like MTTR and incident counts. &lt;/p&gt;
&lt;h3 dir="ltr"&gt;Give Them What They’re Asking For&lt;/h3&gt;
&lt;p dir="ltr"&gt;If leaders and executives are asking, then continue to produce that incident data and show those PIR actions. If you’re not doing it then someone else will, and it’s much better to be in control of the data and the narrative. Position yourself and your SRE leaders as credible, in control and taking action, even if it’s not by your definition. If your leaders aren’t currently asking for typical incident reports and metrics, I strongly suggest  you have the data on hand to be  in a position to provide it should you be asked. You need your perceived competence and resilience work to ride the wave of leadership change and organisational evolution. Even if you succeed with one set of leaders in building their understanding that these are not the metrics or approaches they need, leaders come and go. If you are perceived as too far from industry norms you will have a hard time establishing trust and credibility with a new set of leaders who haven't been on the same journey. &lt;/p&gt;
&lt;h3 dir="ltr"&gt;Proceed with Caution When Educating and Influencing&lt;/h3&gt;
&lt;p dir="ltr"&gt;Rather than seeking to educate directly in a way that can be perceived as contrary, model the Resilience Engineering concepts, language and practices you want to see others adopt. Once your stakeholders become familiar with the concepts through a more implicit approach, you’ll find them more receptive to being introduced to new ways of understanding reliability and resilience, and different types of recommendations from their teams as a result. &lt;/p&gt;
&lt;p dir="ltr"&gt;Educating and influencing leadership on Resilience Engineering is not without risks, so I recommend that you proceed with caution. This approach assumes  leaders have the willingness to learn and the conceptual grounding to make informed decisions. In reality, this is rarely the case. When it comes to incidents, leaders tend to believe they know how things should be done and are resistant to alternative approaches and perspectives. Even if you do manage to succeed in educating leadership, the outcome is uncertain. At best they gain only a surface-level grasp of new concepts, at worst they misinterpret and misapply what they’ve learned, causing further harm. &lt;/p&gt;
&lt;p dir="ltr"&gt;A more effective strategy is to empower people closer to the work—those who have done the analysis and understand the tradeoffs—to provide recommendations for executive action. These recommendations should narrow the range of options and enable leadership to make decisions confidently. These recommendations are more likely to be accepted if they are backed by the credibility and trust gained through giving leaders what they’re asking for. &lt;/p&gt;
&lt;p dir="ltr"&gt;Anecdotally, attempting to influence leadership towards more of our Resilience Engineering methods and practices has a notable rate of burnout for SRE leaders and senior individual contributors. I suspect the phenomenon we are currently experiencing in this regard is part of some broader industry cycle due to factors outside of our control. Either way, we need our perceived competence and resilience work to ride out the wave of this industry cycle, just as we need to ride out the wave of leadership change and organisational evolution. And we need to be able to do that with our mental and emotional wellbeing intact. &lt;/p&gt;
&lt;h3 dir="ltr"&gt;Acknowledge and Negotiate the Paradox&lt;/h3&gt;
&lt;p dir="ltr"&gt;So what does this look like in practice? Think of it like putting spinach in a kid’s smoothie—give them what they think they want, and include the good, healthy stuff they also need in the process. If someone states human error as a cause for an incident, you can mention that this tends to be an indicator of underlying systemic challenges worth taking a closer look at. If you can, offer to provide SRE resources to conduct a more useful PIR. If leaders are getting caught up on incident counts and MTTR, you can use SLOs to provide a more meaningful representation of the quality of the customer experience, while providing context in the form of themes and patterns relevant to your audience. &lt;/p&gt;
&lt;p dir="ltr"&gt;It’s here in these conversations, with the background of outputs and metrics you’re not enthusiastic about, that you’re best placed to use those valuable cards to make recommendations based on insights from incident learning and resilience work. Whether it be highlighting systematic challenges and areas of underinvestment or making recommendations to support better prioritisation and improving the engineered system.&lt;/p&gt;
&lt;p dir="ltr"&gt;In all of this, you are constantly negotiating the paradox we face in Resilience Engineering. &lt;strong&gt;These opposing perspectives and methods will always coexist&lt;/strong&gt;. One might be more dominant than the other at certain points in time, but there will always be that push and pull. You need to be ready for it. By building the perception of competence and playing along with short term optics, you can pursue the bigger picture of Resilience Engineering. &lt;/p&gt;
&lt;p dir="ltr"&gt;I still hope that one day we will build broad industry recognition for Resilience Engineering concepts and practices, while outdated methods and metrics fade off into the sunset. Until that day comes, we play the game while working to change the rules.&lt;/p&gt;
&lt;p dir="ltr"&gt; &lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;&lt;img src="/storage/3784353/download" alt="" width="148" height="111"&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Michelle Casey&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;SRE/Engineering Leader &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;</description>
      <pubDate>Tue, 29 Jul 2025 17:16:37 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1288714</guid>
      <link>https://resilienceinsoftware.org/news/1288714</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1288714/compressed-9b89a4c0e316ea8a8eb7f5fe1854987c.webp"/>
    </item>
    <item>
      <title>MTTR Is (Still) Lying to You</title>
      <description>&lt;p dir="ltr"&gt;&lt;em&gt;Ed note: This is largely a re-post of what was in the 2022 VOID report. Since then, I continute to see ongoing conversations (and product marketing pages) touting the benefits of MTTR so I thought it was time to dust off these data that show how unreliable and unhelpful MTTR is to software organizations.&lt;/em&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;Software organizations tend to value measurement, iteration, and improvement based on data. These are great things for an organization to focus on; however, this has led to an industry practice of calculating and tracking Mean Time to Resolve, or MTTR. While it’s understandable to want to have a clear metric for tracking incident resolution, MTTR is problematic for a number of reasons.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;The first is solely statistical: based on data evaluated by &lt;a href="https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/" rel="nofollow noreferrer noopener"&gt;Stepan Davidovič&lt;/a&gt;, &lt;a href="https://mcusercontent.com/ad9b3a31c069c03a14227a79c/files/29e138f4-05b7-f560-a7d5-c337538a93eb/2022_Void_Report.pdf" rel="nofollow noreferrer noopener"&gt;The VOID&lt;/a&gt;, and &lt;a href="https://surfingcomplexity.blog/2024/12/01/mttr-when-sample-means-and-power-laws-combine-trouble-follows/" rel="nofollow noreferrer noopener"&gt;Lorin Hochstein&lt;/a&gt;, measures of central tendency like the mean, aren’t a good representation of positively-skewed data, in which most values are clustered around the left side of the distribution while the right tail of the distribution is longer and contains fewer values. The mean will be influenced by the spread of the data, and the inherent outliers. Consider these (real incident) data from the VOID:&lt;/p&gt;
&lt;p&gt;&lt;img style="float: left;" src="/storage/3758422/download" alt="" width="600" height="360"&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;The mean is clearly well to the right of the majority of the data points in the distribution—this is to be expected for a positively-skewed distribution with a small set of large outliers. So next you might consider a different MTTR: &lt;em&gt;Median&lt;/em&gt; Time to Respond. The median value will be less influenced by outliers, so it might be a better representation of the data that you could use over time.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;img src="/storage/3758445/download" alt="" width="600" height="349"&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;That certainly looks more representative, and your putative incident response metric just got 2.5 hours better simply by looking at it this way. Which begs the question: what actionable conclusions can you reach based on this information? Presumably, you calculate and track a metric like this in order to improve (or know when things are getting worse), which means you need to be able to detect changes in that metric.&lt;/p&gt;
&lt;p dir="ltr"&gt;This is where Stepan Davidovič was able to demonstrate something really powerful&amp;nbsp;&lt;a href="https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/" rel="nofollow noreferrer noopener"&gt;in his report on MTTR&lt;/a&gt;, which is also applicable to these data.&amp;nbsp;Davidovič demonstrated both with empirical data and Monte Carlo simulations how incident duration data do not lend themselves to reliable calculations of improvement related to any central tendency calculations of incident duration (or overall incident count). If you’re not familiar with Monte Carlo simulations, think of it as an A/B test on simulated data instead of real-world production data.&lt;/p&gt;
&lt;p dir="ltr"&gt;Davidovič took two different, equally-sized samples from three companies’ incident data, and compared the two sets, one which had experimental changes applied to it (e.g. a 10% improvement in incident duration), and the other which had no changes and acted as the control. He then calculated the MTTR difference between the two samples over a series of 100K simulations, such that a positive value indicates an improvement, or shortening of MTTR.&lt;/p&gt;
&lt;figure class="image"&gt;&lt;img src="/storage/3758463/download" alt="" width="600" height="360"&gt;
&lt;figcaption&gt;Source: Incident Metrics in SRE (Google/O’Reilly)&lt;/figcaption&gt;
&lt;/figure&gt;
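&lt;p dir="ltr"&gt;If you have never run one, the core of such a simulation fits in a dozen lines. The sketch below follows the same general shape as that experiment (two resampled groups, one given a simulated 10% improvement), but it is not Davidovič’s code; the lognormal parameters are invented stand-ins for real incident data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

random.seed(42)

# Stand-in for real incident durations: a lognormal gives the heavy
# right tail typical of incident data. Parameters are invented.
population = [random.lognormvariate(4.0, 1.2) for _ in range(500)]

TRIALS = 10_000
IMPROVEMENT = 0.90  # "treated" incidents are 10% shorter

worse = 0
for _ in range(TRIALS):
    control = random.choices(population, k=100)
    treated = [d * IMPROVEMENT for d in random.choices(population, k=100)]
    # Positive difference means MTTR looks better in the treated group.
    diff = sum(control) / 100 - sum(treated) / 100
    if diff &amp;lt; 0:
        worse += 1

print(f"{worse / TRIALS:.0%} of trials show MTTR moving the wrong way")
&lt;/code&gt;&lt;/pre&gt;
&lt;p dir="ltr"&gt;With these invented parameters, roughly a third of trials report MTTR getting worse despite the built-in improvement, in the same neighborhood as the figures below.&lt;/p&gt;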
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;38% of the simulations had the MTTR difference fall below zero for Company A, 40% for Company B, and 20% for Company C. Looking at the absolute change in MTTR, the probability of detecting&amp;nbsp; a 15-minute improvement is only 49%, 50%, and 64%, respectively. I don't know about you, but personally I wouldn't take those odds.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;Even though the experimental condition shortened incidents, the odds of detecting any improvement at all are well outside the tolerance of 10% random flukes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;Both the VOID report and Lorin’s experiments show the same outcomes. If the metric you’re using isn’t itself reliable, then how could it possibly be valuable as a broader measure of system reliability?&lt;/p&gt;
&lt;h3 dir="ltr"&gt;MTTx Are Shallow Data&lt;/h3&gt;
&lt;p dir="ltr"&gt;The second problem with MTTx metrics is they are trying to simplify something that is inherently complex. They tell us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result. MTTx (along with other data like severity, impact, count, and so on) are what John Allspaw calls &lt;a href="https://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/" rel="nofollow noreferrer noopener"&gt;“shallow” incident data&lt;/a&gt;. They are appealing because they appear to make clear, concrete sense of what are really messy, surprising situations that don’t lend themselves to simple summaries.&amp;nbsp;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;Two incidents of the same length can have dramatically different levels of surprise and uncertainty in how people came to understand what was happening. They can also contain wildly different risks with respect to taking actions that are meant to mitigate or improve the situation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;The length of an incident yields little internally actionable information about the incident. Consider two chess games of the same length but very different pace, expertise, and complexity of moves; feature films that generally all clock in at around two hours but can have wildly different plots, complexity of characters, production budgets; War and Peace (~1,200 pages) vs. Slaughterhouse Five (~200 pages).&amp;nbsp;&lt;/p&gt;
&lt;h3 dir="ltr"&gt;Moving Beyond MTTx&lt;/h3&gt;
&lt;p dir="ltr"&gt;Ideally, given these data, organizations should stop using MTTX, or TTx data in general, as a metric related to organizational performance or reliability. First and foremost, if you are collecting this metric, chart the distribution of your incident data. If they’re positively skewed, then you’re not measuring what you think you’re measuring with MTTx, and you can’t rely on it.&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Second, ask yourself what decisions you’re making based on those data. Are you prioritizing hiring SREs, or are you trying to justify budget or infrastructure investments? What could you decide given either a decrease or increase in TTx (were that a reliable metric)?&amp;nbsp;&lt;/p&gt;
&lt;p dir="ltr"&gt;Given that MTTx metrics are not a meaningful indicator of an organization’s reliability or performance, the obvious question is what should organizations use instead? I’d like to challenge the premise of this question. While rolled-up quantitative metrics are often presented on organizational dashboards, quarterly reviews, and board presentations, they generally fail to capture the underlying complexity of software-related incidents. That said, it may not be possible to immediately or completely abandon metrics that management or execs want to see. Vanessa Huerta Granda wrote an &lt;a href="https://www.learningfromincidents.io/posts/looking-beyond-the-metrics" rel="nofollow noreferrer noopener"&gt;excellent post&lt;/a&gt; detailing a process of using MTTR and incident count metrics as a way to “set the direction of the analysis we do around our entire incident universe.” They can be an excellent jumping-off point to then present the insights, themes, and desired outcomes of detailed qualitative incident analyses that highlight knowledge gaps, production pressures and trade-offs, and organizational/socio-technical misalignments, which are also an important form of data:&amp;nbsp;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr"&gt;Because we had that [qualitative] data we were able to make strategic recommendations that led to improvements in how we do our work. In addition, having that data allowed us to get buy-in from leadership to make a number of changes; both in our technology as well as in our culture. We beefed up a number of tools and processes, our product teams started attending retrospectives and prioritizing action items, most importantly everyone across the org had a better understanding of what all went behind ‘keeping the lights on.’&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p dir="ltr"&gt;If quantitative metrics are still something your org is asking for, I suggest focusing on Service Level Objectives (SLOs) and cost of coordination data. Chapter 17 of &lt;a href="https://www.amazon.com/Implementing-Service-Level-Objectives-Practical/dp/1492076813" rel="nofollow noreferrer noopener"&gt;Implementing Service Level Objectives&lt;/a&gt; covers using SLOs as a substitute for MTTx and other common problematic reliability metrics. If people or staffing is something you need metrics for, consider tracking how many people were involved in each incident, across how many teams, and at what levels of the organization—these factors are what Dr. Laura Maguire deemed “hidden costs of coordination” that can add to the cognitive demands of people responding to incidents. There’s a big difference between a 30-minute incident that involves 10 people from three engineering teams, executives, and PR versus one that an engineer or two puzzles over for a day or so, and is lower impact.&lt;/p&gt;
&lt;p dir="ltr"&gt;I am aware of a number of companies that have moved away from tracking shallow incident metrics, and none of them regret walking away from those practices. If you're looking for folks who are actively working on these kinds of issues in their own organizations, you can find us by &lt;a href="https://resilienceinsoftware.org/memberships" rel="nofollow noreferrer noopener"&gt;becoming an RISF member&lt;/a&gt;.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;MTTR Resources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mcusercontent.com/ad9b3a31c069c03a14227a79c/files/29e138f4-05b7-f560-a7d5-c337538a93eb/2022_Void_Report.pdf" rel="nofollow noreferrer noopener"&gt;2022 VOID Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/" rel="nofollow noreferrer noopener"&gt;Incident Metrics in SRE&lt;/a&gt; (Google Report)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://surfingcomplexity.blog/2024/12/01/mttr-when-sample-means-and-power-laws-combine-trouble-follows/" rel="nofollow noreferrer noopener"&gt;MTTR: When Sample Means and Power Laws Combine, Trouble Follows&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.adaptivecapacitylabs.com/2018/03/23/moving-past-shallow-incident-data/" rel="nofollow noreferrer noopener"&gt;Moving Past Shallow Incident Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/articles/incident-metrics-void/" rel="nofollow noreferrer noopener"&gt;Moving Past Simple Incident Metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.learningfromincidents.io/posts/looking-beyond-the-metrics" rel="nofollow noreferrer noopener"&gt;Making Sense out of Incident Metrics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;img src="/storage/3758476/download" alt="" width="132" height="145"&gt;&lt;strong&gt;Courtney Nash&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Founder, The VOID&lt;/p&gt;
&lt;p&gt;VP, Resilience in Software Foundation&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://resilienceinsoftware.org/feeds/fbfef454aa2263e27c3975d1ff28ed2815c5/communication_posts.xml" rel="nofollow noreferrer noopener"&gt;RSS feed&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Thu, 27 Feb 2025 19:33:10 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1157532</guid>
      <link>https://resilienceinsoftware.org/news/1157532</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1157532/compressed-e2d425907ea649ea46621a814eb83c4c.webp"/>
    </item>
    <item>
      <title>Three Takes on Four Concepts for Resilience Engineering</title>
      <description>&lt;p&gt;&lt;em&gt;Ed note: The first time I read Dr. David Woods' paper &lt;a href="https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implications_for_the_future_of_resilience_engineering" rel="nofollow noreferrer noopener"&gt;Four Concepts for Resilience Engineering,&lt;/a&gt; I felt so many things click in my brain. While the field of Resilience Engineering is not new, those of us in the software industry are relatively new to RE, and seeing a paper that is so clearly applicable to software systems left a strong impression on me. When I reached out to the RISF community for someone to contribute their thoughts on this paper, multiple hands went up in Slack—and they were all so good I couldn't pick just one. So what follows is three software engineers' takes on Woods' paper: &lt;a href="https://erikarow.land/notes/paper-four-concepts-resilience" rel="nofollow noreferrer noopener"&gt;Erika Rowland&lt;/a&gt; (ER), &lt;a href="https://dobbse.net/thinair/2023/05/resilience-and-reliability.html" rel="nofollow noreferrer noopener"&gt;Eric Dobbs&lt;/a&gt; (ED), and &lt;a href="https://ferd.ca/notes/paper-four-concepts-for-resilience-engineering.html" rel="nofollow noreferrer noopener"&gt;Fred Hebert&lt;/a&gt; (FH). Each provides their own perspective on Woods' proposed concepts, and additionally Dobbs provides some excellent software-centric examples of how you might see these concepts manifest themselves in your systems.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ED&lt;/strong&gt;: Resilience&lt;span style="font-size: 15px;"&gt; ≠ &lt;/span&gt;Reliability. Resilience is human skills and human relationships. Reliability is what we build into our software. &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ER&lt;/strong&gt;: Resilience has been used a number of different ways in Resilience Engineering literature. This conflation of different definitions makes it difficult to parse and understand what that literature is arguing for and why.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: The paper opens by admitting that the popularity of the term has led to confusion regarding what it means in the first place. I recall seeing other papers which held the ill-defined term as one of the biggest weaknesses of a discipline named after it. All the different uses seen around the place have been categorized into four groups by Woods: rebound, robustness, graceful extensibility, and sustained adaptability.&lt;/p&gt;
&lt;h3&gt;Rebound&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;ER&lt;/strong&gt;: Rebound is about returning to normal functioning after a disruption. This ability to rebound seems to depend directly on the conditions before the disrupting event. How was the system prepared before the chaos began?&lt;/p&gt;
&lt;p&gt;While literature on rebound tends to focus on individual disruptions or traumas, the more interesting thing to study is the idea of surprises. A surprise is, to quote Woods:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style="font-family: 'Times new roman'; font-size: 16px;"&gt;the event is a surprise when it falls outside the scope of variations and disturbances that the system in question is capable of handling.&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A surprise is an event that challenges an existing model and forces the system to learn or adapt the model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ED&lt;/strong&gt;: There are many common examples of rebound. Roll back a deploy. Restore lost data from a backup. Reboot a server. Restart a container. Truncate log files to free up disk space. Follow the instructions in a runbook. Basically, this is anything you do to put some sub-system more or less back the way it was.&lt;/p&gt;
&lt;h3&gt;Robustness&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: This is generally perceived to be a conflation of resilience with another term—the ability to absorb disruptions—robustness. More robustness means your system can tolerate a broader range of disturbances while still responding effectively. Generally, robust control works, and only works, for cases where the disturbances are well-modelled.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ED&lt;/strong&gt;: Monitoring and alerting are the most basic measures. We build one component to monitor another and call in the humans if some threshold is crossed. Kubernetes comes with built-in behavior that kills containers that run out of memory and other behavior which restarts containers when they fail. Common practice for databases includes having read-only replicas, hot-standby replicas, or automated failover. Load balancers include built-in health-checks for the servers they’re balancing and will adapt to send traffic only to the healthy servers. One of the most common reasons people want to move to the cloud is to enable autoscaling, where the systems can adapt to extra traffic by spinning up more containers and then spin down those extras when the surge in traffic subsides. More sophisticated examples of robustness include bulkheads, circuit breakers, and automated chaos experiments.&lt;/p&gt;
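&lt;p&gt;To make one of those mechanisms concrete, here is a minimal, illustrative circuit breaker (a sketch, not any particular library’s API): after repeated failures of a dependency it fails fast for a cooldown period, absorbing a well-modelled disturbance instead of piling more load onto a struggling component.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time

class CircuitBreaker:
    """Minimal sketch: open after max_failures, probe again after a cooldown."""

    def __init__(self, max_failures=3, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at &amp;lt; self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: fall through and let one probe call happen.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures &amp;gt;= self.max_failures:
                self.opened_at = now  # (re)open: probe failed or threshold hit
                self.failures = 0
            raise
        self.failures = 0
        self.opened_at = None  # success closes the breaker again
        return result
&lt;/code&gt;&lt;/pre&gt;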
&lt;h3&gt;Graceful Extensibility&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: Graceful extensibility is a sort of play on the idea of graceful degradation. Rather than asking the question how or why do people, systems, organizations bounce back, this line of approach asks: how do systems stretch to handle surprises? Resources are finite, environments are changing, and their boundaries shift in ways that require stretching and elasticity. A tenet here is that without the ability to stretch and adjust, your system is far more brittle than normal operations suggest, and that brittleness is generally exposed through extremely rapid collapses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ER&lt;/strong&gt;: Another aspect of concern is &lt;em&gt;decompensation&lt;/em&gt;, where exhaustion of a system under sustained disruption reduces the capacity of the system to adapt to new disruptions. Think on-call fatigue in an operations team, or deformation of a material under stress that changes its properties. (&lt;span class="note-right note sidenote"&gt;&lt;span class="footnote-p"&gt;An effective way to “cut” paper without scissors is to fold it back and forth across a joint until the paper gives way with little force, allowing you to tear the paper straight by hand.)&lt;/span&gt;&lt;/span&gt; As Woods writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style="font-family: 'Times new roman'; font-size: 16px;"&gt;When the time to recovery increases and/or the level recovered to decreases, this pattern indicates that a system is exhausting its ability to handle growing or repeated challenges, in other words, the system is nearing saturation of its range of adaptive behavior.&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: The idea here is influenced by Safety-I (studying and preventing failures) vs. Safety-II (studying and enhancing successes), such that graceful extensibility can be seen as a positive attribute: how do we create a readiness-to-respond that is a strength and can be leveraged in all sorts of situations, rather than narrowing it to being the avoidance of negative effects?&lt;/p&gt;
&lt;p&gt;Contrasted with rebounds, the approach to this is to look at past challenges, and see them as a way to gauge the potential to adapt to new surprises in the future. It also allows the idea of studying sequences and series of rebounds on a longer-term view of the system. How do they succeed and how do they fail?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ED&lt;/strong&gt;: The best example we have of graceful extensibility in the software business is incident response. Once we detect that some part of our system is getting overwhelmed or otherwise misbehaving, some group of us drop what we’re doing to prevent the problem from getting worse and to remediate.&lt;/p&gt;
&lt;h3&gt;Sustained Adaptability&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;ER&lt;/strong&gt;: Sustained adaptability asks three questions of a resilience engineer:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What characteristics explain the difference between systems that have sustained adaptability vs. those that don't?&lt;/li&gt;
&lt;li&gt;What design principles and techniques allow you to engineer sustained adaptability?&lt;/li&gt;
&lt;li&gt;How would you know if you succeeded in engineering sustained adaptability in a system?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: Expected challenges to sociotechnical systems over their life cycle include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Surprises will keep challenging boundaries&lt;/li&gt;
&lt;li&gt;Conditions and contexts will keep changing and shifting the boundaries&lt;/li&gt;
&lt;li&gt;Adaptive shortfalls will happen and people will have to step in&lt;/li&gt;
&lt;li&gt;The factors that provide adaptability and the needs for them will shift over time&lt;/li&gt;
&lt;li&gt;Classes of changes will happen and the system as a whole will need to readjust itself and its relationships&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ED&lt;/strong&gt;: This is Woods’ most demanding concept of resilience. All systems reach their own previously known limits, and this happens almost continuously. People in the systems continually stretch them to adapt to new circumstances. Successful components induce demand that exceeds their original design. It is completely predictable that something will fail under the changing conditions; the specifics of what will fail, and when, are less predictable.&lt;/p&gt;
&lt;p&gt;We have a few examples that address sustained adaptability. Many teams are adopting operational review meetings, an excellent practice that helps them monitor how the ecosystem around them is changing and how their services are responding to that relentless change. Another is creating a Learning From Incidents (LFI) team to conduct cognitive interviews, facilitate learning reviews, and generally help other teams broaden and deepen what we learn when circumstances overwhelm our sub-systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ER&lt;/strong&gt;: Architecting a system to have sustained adaptability relies on understanding that all adaptive systems are constrained by trade-offs, and that certain architectures allow for adjustment of those trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FH&lt;/strong&gt;: A whole lot of the discipline [of RE] is therefore interested in all the trade-offs people make, and that biological systems (or ecosystems) make, and particularly in which trade-offs are fundamental and how they apply to other systems as well. An agenda of this type of resilience work is managing the capacities dedicated to resilience. In this perspective, it makes sense to say a system is resilient, or not, based on how well it balances all those trade-offs.&lt;/p&gt;
&lt;p&gt;Woods states that the yield from the first two types of resilience has been low. The latter two approaches, the most positive ones, tend to provide better lines of inquiry, though the discipline is still young.&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;img src="/storage/3756894/download" alt="" width="88" height="88"&gt;&lt;a href="https://erikarow.land/notes/paper-four-concepts-resilience" rel="nofollow noreferrer noopener"&gt;Erika Rowland&lt;/a&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;img src="https://ca.slack-edge.com/T07QB109673-U07Q2120PD2-2272fae08ba6-512" alt="Profile photo for eric" width="85" height="85"&gt; &lt;a href="https://dobbse.net/thinair/2023/05/resilience-and-reliability.html" rel="nofollow noreferrer noopener"&gt;Eric Dobbs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;img src="/storage/3756928/download" alt="" width="85" height="85"&gt;&lt;a href="https://ferd.ca/notes/paper-four-concepts-for-resilience-engineering.html" rel="nofollow noreferrer noopener"&gt;Fred Hebert&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Wed, 19 Feb 2025 21:28:20 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1149720</guid>
      <link>https://resilienceinsoftware.org/news/1149720</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1149720/compressed-8467713c9d8344da33620ded5bc095cf.webp"/>
    </item>
    <item>
      <title>You Can't Build More Nines</title>
      <description>&lt;p&gt;(&lt;em&gt;Originally published &lt;a href="https://medium.com/@Spamaps/you-cant-build-more-nines-5fd0321c2422" rel="nofollow noreferrer noopener"&gt;on Medium&lt;/a&gt;&lt;/em&gt;)&lt;/p&gt;
&lt;p id="898e"&gt;Software teams are built to.. well.. build. We have design processes, RFC processes, change management processes.. lots of processes. All of them tend to be optimized for building. &lt;/p&gt;
&lt;p&gt;But, inevitably after building enough complexity, we start to realize that our systems are not reliable enough. We start to measure uptime and, lo, there are not enough nines!&lt;/p&gt;
&lt;p id="97f1"&gt;Of course, our first inclination is to build our way to more nines. Build CI/CD pipelines. Build canary deployments. Build a platform. Build synthetic testing.&lt;/p&gt;
&lt;p id="6501"&gt;It’s usually at this point that the dissolution sets in. Why aren’t we getting more nines? We built stuff for that!&lt;/p&gt;
&lt;p id="9c14"&gt;But what are we measuring when we measure uptime anyway?&lt;/p&gt;
&lt;p id="bd4e"&gt;We are, in effect, measuring how often we see what we want to see when we look. Is it up now? Yes. How about now? Yes. Did our users succeed mostly?&lt;/p&gt;
&lt;p id="e9ce"&gt;Well if that’s what we’re measuring, and we’re trying to build more nines, we have to ask: what is a nine composed of?&lt;/p&gt;
&lt;p id="a1c0"&gt;One might say it is composed of time slices in which we’re up, or successful events. So, could we say then that if we build a system that is up, that we built the nines?&lt;/p&gt;
&lt;p id="d9b8"&gt;Unfortunately, math would like a word. Those nines are a percentage, so we’re always subject to everything in the denominator.&lt;/p&gt;
&lt;p id="73af"&gt;Or, to paraphrase a common military euphemism: “the entropy gets a vote.” No matter how bullet-proof you build the components of your system, the only way to make nines go up is to be ready to deal with the host of surprises that take them back down. By definition a percentage is a zero sum game. So, really, to add nines to your target, you have to subtract something else. You have to subtract the faults.&lt;/p&gt;
&lt;h2 id="ef42" class="wt wu qv bf wv ww wx dv mj wy wz dx mn wf xa xb xc wj xd xe xf wn xg xh xi xj bk"&gt;But, but, I’ve built systems to add nines!&lt;/h2&gt;
&lt;p id="7436" class="pw-post-body-paragraph vu vv qv vw b vx xk vz wa wb xl wd we wf xm wh wi wj xn wl wm wn xo wp wq wr gb bk"&gt;You’ve probably built mostly two things: Fault avoidance, and redundancy.&lt;/p&gt;
&lt;p id="5f7e" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;Fault avoidance is the easy part. End-to-end tests that run pre-merge avoid some faults. Canary deploys avoid another class. Type checkers, linters, unit tests, all avoiding classes of faults. These will certainly increase the nines in the components of your system where they are deployed.&lt;/p&gt;
&lt;p id="cc49" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;But again, this doesn’t do much to avoid the surprises of a complex system meeting the entropy of the real world. And since you’ve now optimized the velocity of changes entering your components by making a very powerful, confidence-building CI/CD pipeline, you also increase the velocity of change. No matter how good your pre-merge and post-deploy automated testing and rollback system is, it will always be supporting the change process. And changes are a source of faults. So while having great fault-avoidance automation will certainly subtract faults, it will also add some new ones back in.&lt;/p&gt;
&lt;p id="1a9b" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;So, you did the easy thing, but now you need to subtract more faults. Now you need to think about making the collections of components more redundant. After all, it’s predictable to have a broad class of problems like “ran out of computers” or “Network suddenly stopped networking.” or “Back-hoe cut fiber connection.”&lt;/p&gt;
&lt;p id="6268" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;For this, you build global database replication, leader elections, sharding, tombstones, write forwarding, queuing, eventual-consistency, etc. etc.&lt;/p&gt;
&lt;p id="bf39" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;All of this fancy redundancy subtracts those big, obvious, predictable failures. So surely you’ll get the precious nines you’ve been longing for from this. Finally, you’ve done it. You’ve built the nines!&lt;/p&gt;
&lt;p id="7958" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;Except, you will also note some new faults. Global DB replication runs out of transaction log space. Leader elections take too long. Shards get flaky. Tombstones build up faster than they can be reaped. Metrics stop flowing. Logs are lost. Etc. etc. The denominator for good and bad events has added so many things, you might even make it worse before it gets better.&lt;/p&gt;
&lt;h2 id="b766" class="wt wu qv bf wv ww wx dv mj wy wz dx mn wf xa xb xc wj xd xe xf wn xg xh xi xj bk"&gt;So, it’s goat herding for me then?&lt;/h2&gt;
&lt;p id="bb01" class="pw-post-body-paragraph vu vv qv vw b vx xk vz wa wb xl wd we wf xm wh wi wj xn wl wm wn xo wp wq wr gb bk"&gt;Don’t give up here. More nines are achievable, and sustainable, obviously. Many of us have done it. But, whether we consciously know this or not, we didn’t do it just by building software. Whether we hit three or six nines, and whether or not we realized it at the time, we built something a bit more, something a bit harder to measure than uptime, or redundancy or fault avoidance:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="al"&gt;&lt;span class="xp"&gt;We built our organization’s resilience.&lt;/span&gt;&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p id="f490" class="pw-post-body-paragraph vu vv qv vw b vx xk vz wa wb xl wd we wf xm wh wi wj xn wl wm wn xo wp wq wr gb bk"&gt;This didn’t happen by accident though. Somebody committed to driving the faults down. Somebody gave cover for those down at the sharp end feeling the pain of those faults. It went best when it was our leaders.&lt;/p&gt;
&lt;p id="1b2f" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;While all this redundancy was rolling out, somebody had the time and space to draw a map of all the faults, and to tell everyone else about it.&lt;/p&gt;
&lt;p id="c589" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;As we were talking to our customers we probably listened to them, and made sure that everyone understood what they expected the system to do, and vice-versa, being clear about what we do and don’t promise.&lt;/p&gt;
&lt;p id="cc3b" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;When alerts went off, I hope we took them seriously, and that we made sure they were representative of a real signal, with real plans for what to do with that signal.&lt;/p&gt;
&lt;p id="b5ef" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;If we were lucky, we made sure people were prepared by sending them to incident command training, giving them time and space to practice, run game days, devise role-playing exercises, and complete disaster recovery testing.&lt;/p&gt;
&lt;p id="f18a" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;And finally, after all that, it’s very likely that when entropy reminded us that you really can’t predict what it’s going to do, we assembled an incident response team that professionally and efficiently worked to a resolution. A team that wrote down weird things they saw, from odd log messages to frustrating interrupts from outside the response team. And we made sure that they learned from those stories, and built up our collective wisdom.&lt;/p&gt;
&lt;p id="dfd6" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;Most importantly though, I would posit that nobody gets to any real, sustainable reliability, any honest version of more nines, without making space for everyone to feel safe to fail, listening to their experiences, and promoting the pockets of resilience and safety that inevitably exist in every organization.&lt;/p&gt;
&lt;p id="c01e" class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;We didn’t build those nines. We built our organization’s resilience.&lt;/p&gt;
&lt;p class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt; &lt;/p&gt;
&lt;p class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;&lt;img src="/storage/3756729/download" alt="" width="179" height="179"&gt; &lt;strong&gt;Clint Byrum&lt;/strong&gt;&lt;/p&gt;
&lt;p class="pw-post-body-paragraph vu vv qv vw b vx vy vz wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr gb bk"&gt;SRE&lt;/p&gt;</description>
      <pubDate>Wed, 19 Feb 2025 05:23:21 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1148335</guid>
      <link>https://resilienceinsoftware.org/news/1148335</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1148335/compressed-281e246d6a0202b3157e9817d7a613d1.webp"/>
    </item>
    <item>
      <title>You're Missing Your Near Misses</title>
      <description>&lt;p&gt;(&lt;em&gt;Originally posted at &lt;a href="https://surfingcomplexity.blog/2025/02/01/youre-missing-your-near-misses/" rel="nofollow noreferrer noopener"&gt;Surfing Complexity&lt;/a&gt;&lt;/em&gt;)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.npr.org/2025/01/30/nx-s1-5280427/near-collisions-reagan-national-airport-military-aircraft-faa" rel="nofollow noreferrer noopener"&gt;FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The amount of attention an incident gets is proportional to the &lt;em&gt;severity &lt;/em&gt;of the incident: the greater the impact to the organization, the more attention that post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling it is to people: they worry very specifically about that incident recurring, and want to prevent that from happening again.&lt;/p&gt;
&lt;p&gt;Here’s the problem: most of your incidents aren’t going to be repeat incidents. Nobody wants an incident to recur, so there’s a natural, built-in mechanism for engineering teams to put in the effort to do preventative work. The real challenge is preventing and quickly mitigating &lt;em&gt;novel&lt;/em&gt; future incidents, which make up the overwhelming majority of your incidents.&lt;/p&gt;
&lt;p&gt;And that brings us to near misses, those operational surprises that have no actual impact, but could have been a major incident if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, &lt;em&gt;omens&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Because most of our incidents are novel, and because near misses are a source of insight about novel future incidents, if we are serious about wanting to improve reliability, we should be treating our near misses as first-class entities, the way we do with incidents. Yet I’d wager that there are no tech companies out there today that would put the same level of effort into a &lt;em&gt;near miss&lt;/em&gt; as they would into a real incident. I’d love to hear about a tech company that holds &lt;em&gt;near miss reviews&lt;/em&gt;, but I haven’t heard of any yet.&lt;/p&gt;
&lt;p&gt;There are real challenges to treating near misses as first-class. We can generally afford to spend a lot of post-incident effort on each high-severity incident, because there generally aren’t that many of them. I’m quite confident that your org encounters many more near misses than it does high-severity incidents, and nobody has the cycles to put in the same level of effort for every near-miss as they do for every high severity incident. This means that we need to use &lt;em&gt;judgment&lt;/em&gt;. We can’t use severity of impact to guide us here, because these near misses are, by definition, zero severity. We need to identify which near misses are worth examining further, and which ones to let go. It’s going to be a judgment call about how much we think we could potentially learn from looking further.&lt;/p&gt;
&lt;p&gt;The other challenge is simply surfacing these near misses. Because they have zero impact, it’s likely that only a handful of people in the organization are aware when a near miss happens. Treating near misses as first-class events requires a cultural shift in an organization, where the people who are aware of them highlight the near miss as a potential source of insight for improving reliability. People have to see the value in sharing when these happen; it has to be rewarded or it won’t happen.&lt;/p&gt;
&lt;p&gt;These near misses are happening in your organization right now. Some of them will eventually blossom into full-blown high-severity incidents. If you’re not looking for them, you won’t see them.&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;img src="/storage/3755714/download" alt="" width="160" height="160"&gt;&lt;strong&gt;Lorin Hochstein&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Surfing Complexity&lt;/p&gt;</description>
      <pubDate>Thu, 13 Feb 2025 18:32:14 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1142817</guid>
      <link>https://resilienceinsoftware.org/news/1142817</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1142817/compressed-13081fe45a9043b147a9b88f0ea2192a.webp"/>
    </item>
    <item>
      <title>An Incident Review of an Incident Review</title>
<description>&lt;p&gt;&lt;span style="font-size: 12px;"&gt;(&lt;em&gt;Originally posted at&lt;/em&gt; &lt;a href="https://willgallego.com/2025/01/11/an-incident-review-of-an-incident-review/" rel="nofollow noreferrer noopener"&gt;An Incident Review of an Incident Review&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;So I bombed an incident review this week. More specifically, the facilitating.&lt;/p&gt;
&lt;p&gt;I’ve run post mortems/retrospectives/PIRs, whatever you want to call them, for over a decade. Just felt my arthritis kick in a bit as I typed that. It’s hard to quantify, or even qualify, the subtle nuances and questions I’ve developed as handy go-to’s to get folks to speak up during interviews and meetings. My friend &lt;a href="https://surfingcomplexity.blog/" rel="nofollow noreferrer noopener"&gt;Lorin Hochstein&lt;/a&gt; said that facilitation is the hardest part of the work, which feels pretty on the money. You can always take another swing in an interview, prep what questions you’re likely to ask, or come back around during the PIR (post incident review) with everyone. Walking through timelines and dashboards is toilsome, but rarely more than an inconvenience of time and energy. I could see an argument made for summarization and write-ups (“Tell me everything we will need to know, all the tasks to make sure this ‘doesn’t happen again’ – but make it short enough so folks will want to read it”).&lt;/p&gt;
&lt;p&gt;But running the meeting, yeah, that can be sneakily hard. You mostly have one shot at it, and before an audience you’ve convinced to spend their time in yet another meeting instead of “the real work” (aside – &lt;em&gt;incidents are part of the real work&lt;/em&gt;). It’s very easy to lose folks, say the wrong thing, let emotions run high.&lt;/p&gt;
&lt;p&gt;Funny thing is I typically think of myself as &lt;em&gt;worse&lt;/em&gt; at the parts outside of the meeting. I’ve got golden retriever energy when it comes to helping folks out, and the PIR meeting is where I shine. It’s my job to care about folks, to make sure they’re heard? And you’re going to pay me to see folks do the “aha!” moment when the parts click? Sign me up, that’s entirely my jam. I’m fairly loquacious and have a knack for vulnerability-as-means-of-disarming folks, getting them to feel that yes it’s ok to say “I don’t know”. I consider that last bit a personal superpower.&lt;/p&gt;
&lt;p&gt;So what went wrong? The humor of analyzing the analysis, finding the fault when we’re hunting through the pieces of an outage, isn’t lost on me. It’s also an easy slide into &lt;em&gt;over&lt;/em&gt;analyzing everything we do, like some college sophomore philosophy student who suddenly falls into a nihilistic hole and tries to debate everyone with their newfound enlightenment. To spoil the ending: I leaned too heavily on my tropes and enthusiasm, with admittedly a bit of weariness from the week laying on top of the meeting. I’m also trying to build momentum for more PIR meetings, and while I know a surefire way to poison that is to set up a ton of very long and dry discussions, I condensed the review to a half hour to entice folks into joining. “That’ll surely be enough!” he lied to himself.&lt;/p&gt;
&lt;p&gt;I tend to talk. I probably say in twenty words what can be said in five. That can be comforting to some folks, vamping while they gather ideas. It’s my crutch as I over-explain to &lt;em&gt;really&lt;/em&gt; make sure folks understand. That was heavily present in this latest one. I got a nudge, “Hey, let people talk more,” in the meeting. Twice, actually, which is fairly impressive for only 30 minutes. That’s one of my focal points for PIR meetings too – don’t just repeat the narrative of events, let the participants state what happened. Folks will nod and say “Yup!” and agree with facilitators, that small modicum of power within the virtual walls of that meeting, because that’s what we’re inclined to do. That’s a surefire way to get people &lt;em&gt;not&lt;/em&gt; to share their expertise.&lt;/p&gt;
&lt;p&gt;I was bummed for a few hours, because I felt it immediately after. No one had to mention it; I could see it clear as day. I try to leave five to ten minutes at the end of a meeting as a free space – action items, sure, but “what did we miss?” more so. There were at least two or three areas we failed to cover that feel pretty core to the learning. “Yeah, we still don’t understand the source of the problematic requests, and…”, etc.&lt;/p&gt;
&lt;p&gt;But the world didn’t end. It (typically) doesn’t when we have a major outage, and I’m fairly confident we’ll be ok here. It’s good to recognize that, even with a ton of experience, facilitators &lt;em&gt;do&lt;/em&gt; have tried-and-true methods that can &lt;em&gt;hinder&lt;/em&gt; if overused. I’ll also say, in retrospect, I had a question I was drilling down on for at least 15 minutes that I wanted answered, one likely fixed in my head before the meeting started. Checking bias at the door, notably when it’s your team in the driver’s seat, is &lt;em&gt;hard&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;If nothing else, incidents &lt;em&gt;are&lt;/em&gt; surprises. “Well that went wrong and caught me off guard” feels akin to that. I’ll grab this post another day in the future and appreciate it, a few more reviews under my belt that hopefully turn more my way.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Photo:&lt;/em&gt; &lt;a href="https://www.flickr.com/photos/cogdog/8761308672" rel="nofollow noreferrer noopener"&gt;2012/366/364 Driving Off the Edge&lt;/a&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;&lt;img src="/storage/3752347/download" alt="" width="200" height="200"&gt;Will Gallego&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;Secretary, Resilience in Software Foundation&lt;/p&gt;
</description>
      <pubDate>Tue, 11 Feb 2025 20:25:03 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1124004</guid>
      <link>https://resilienceinsoftware.org/news/1124004</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1124004/compressed-48513460fc65ef5ff204fa1c71656f11.webp"/>
    </item>
    <item>
      <title>Building a Safer Software Industry, Together</title>
      <description>&lt;p dir="ltr"&gt;The software industry is on a precipice: In 2024, with rising interest rates driven by macroeconomic forces, many firms made decisions to cut back on staff and look for more efficiencies in their operations. While said macroeconomic (and sociological) forces putatively drove those cuts, the scientific realities around stretched, highly complex socio-technical systems remain the same: &lt;a href="https://how.complexsystems.fail/" rel="nofollow noreferrer noopener"&gt;they will fail&lt;/a&gt;. It is in this world that we choose to shine a light on resilience—the adaptive (and fundamentally human) capacity that we have available to us in unprecedented, unanticipated situations. Any organization capable of this form of resilience will have a distinctive competitive advantage in the decades to come. &lt;/p&gt;
&lt;p&gt;Today’s &lt;a class="link-regular" href="https://resilienceinsoftware.org/news/1092580" rel="nofollow noreferrer noopener"&gt;announcement&lt;/a&gt; that we’ve formed the &lt;a class="link-regular" href="https://resilienceinsoftware.org/" rel="nofollow noreferrer noopener"&gt;Resilience in Software Foundation&lt;/a&gt; (RSF) has been months in the making, tactically speaking, but it has been forming in the corners and crevices of our industry for decades. Multiple people who are passionate about Resilience Engineering (RE) in the software industry have been meeting regularly to ideate on what a democratically-led, inclusive, supportive community space could look and feel like for us. It would be easy to assume that simply some meetings and filing of paperwork were behind the creation of RSF (and there has indeed been a lot of paperwork!). Dig a bit deeper, though, and the work by various leaders in the field of Resilience Engineering in the software industry has brought us to this point, when we are ready for an official Foundation. At its core, RSF is a place where we can continuously nourish and grow a vibrant community to discuss the practical world of applying resilience concepts in action, and advocate for these practices in our industry.&lt;/p&gt;
&lt;p dir="ltr"&gt;The Resilience in Software Foundation aims to transform our industry by supporting and growing Resilience Engineering throughout the industry, becoming a home to expertise, knowledge sharing, and research. We’re excited to launch with individual, student, and corporate memberships available. We have a strong Code of Conduct, borrowed from &lt;a href="https://randsinrepose.com/welcome-to-rands-leadership-slack/" rel="nofollow noreferrer noopener"&gt;Rands Leadership Slack&lt;/a&gt;, and a solid initial set of bylaws heavily borrowed from our friends at &lt;a href="https://www.usenix.org/" rel="nofollow noreferrer noopener"&gt;Usenix&lt;/a&gt;. We have articles of incorporation filed in Delaware, and have submitted our paperwork so we’re officially a 501c3. We have a board of directors, and a group of moderators for our rapidly growing Slack community. &lt;/p&gt;
&lt;p dir="ltr"&gt;Not too shabby for a bunch of safety nerds (mostly) holding down day jobs.&lt;/p&gt;
&lt;p dir="ltr"&gt;We would love for you to join us! For access to our Resilience in Software Slack community, and discounts to future events, please &lt;a href="https://resilienceinsoftware.org/memberships" rel="nofollow noreferrer noopener"&gt;join us as a member&lt;/a&gt;. If you have questions about our group and aren’t in our Slack yet, you can email us at &lt;a class="link-regular" href="mailto:info@resilienceinsoftware.org" rel="nofollow noreferrer noopener"&gt;info@resilienceinsoftware.org&lt;/a&gt;.&lt;/p&gt;
&lt;p dir="ltr"&gt; &lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;&lt;img src="/storage/3746909/download" alt="" width="261" height="193"&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong&gt;Colette Alexander&lt;/strong&gt;&lt;/p&gt;
&lt;p dir="ltr"&gt;President, The Resilience in Software Foundation&lt;/p&gt;</description>
      <pubDate>Wed, 18 Dec 2024 01:07:03 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1093612</guid>
      <link>https://resilienceinsoftware.org/news/1093612</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1093612/compressed-1c352bc6c7b9feeb580e3f930016ec42.webp"/>
    </item>
    <item>
      <title>Introducing the Resilience in Software Foundation</title>
      <description>&lt;p dir="ltr"&gt;Software failures are inevitable. No matter how hard we try, we can’t make our systems flawless, nor can we predict every possible problem. The systems we’re building today are beyond the mental model of any one person or team—the complexity outpaces our ability to grasp it fully, especially when those systems are under stress or operating in unexpected conditions. And with the rapid acceleration of user demands and organizational requirements, the complexity of software continues to grow. We can patch, repair, and plan for more seemingly robust systems, but it often feels like we’re walking a tightrope, one misstep away from disaster.&lt;/p&gt;
&lt;p dir="ltr"&gt;Resilience Engineering (RE) offers a way forward. It’s a science-backed framework that helps organizations cope with the inherent complexity of high-pressure systems by building adaptive capacity. Rooted in research from industries like aviation, medicine, and energy, RE brings a crucial insight to the software industry: the people who design, build, and operate these systems are the key to resilience. They’re the ones who adapt to the unexpected, learning from incidents and failure, and developing strategies to return to stability when things go wrong. &lt;/p&gt;
&lt;p dir="ltr"&gt;As software becomes a cornerstone of modern life, ensuring that it can withstand and adapt to failure is no longer optional. It’s with this idea in mind that we created the &lt;a href="https://resilienceinsoftware.org" rel="nofollow noreferrer noopener"&gt;Resilience in Software Foundation&lt;/a&gt;—a community where software practitioners can share, learn, and improve our industry together.&lt;/p&gt;
&lt;p dir="ltr"&gt;Our goal is simple: to make systems safer by making them more resilient. We are a collective of software practitioners and academics committed to embedding resilience engineering principles into the core of software design, development, and operations. Through research, education, and industry collaboration, we aim to set a new standard in software engineering—one that embraces complexity, prioritizes safety, and builds systems that not only survive but adapt under pressure. &lt;/p&gt;
&lt;p dir="ltr"&gt;We are a community of front-line practitioners who build and maintain complex software, alongside researchers who analyze trends and patterns across these systems. We continue to work on sharing ideas around Resilience Engineering because we know how demanding it is, how fundamentally scary it can be at times, and the relief that is the light at the end of the incident.&lt;/p&gt;
&lt;p dir="ltr"&gt;The tech industry has been ready to embrace these concepts for a long time. Many of us have been pushing for change over the past decade from the perspective of learning from incidents. We’ve moved beyond blameless postmortems and accounts of human error or root cause to a view of complex systems that posits humans as the central creators of safety and resilience. Members of this community have been at the forefront of these efforts for a while—with this Foundation we're looking to solidify these ideas, expand upon them, and reach out to those in tech looking for a better way. Notably, we're an inclusive group excited to introduce these concepts to folks who may be unsure how to start, or maybe are even skeptical of these approaches.&lt;/p&gt;
&lt;p dir="ltr"&gt;&lt;strong id="docs-internal-guid-180dfd3f-7fff-1174-45df-e478b165c5fa"&gt;&lt;a class="link-bold" href="https://resilienceinsoftware.org/memberships" rel="nofollow noreferrer noopener"&gt;Join us!&lt;/a&gt; &lt;/strong&gt;&lt;/p&gt;</description>
      <pubDate>Wed, 18 Dec 2024 01:06:34 +0000</pubDate>
      <guid>https://resilienceinsoftware.org/news/1092580</guid>
      <link>https://resilienceinsoftware.org/news/1092580</link>
      <media:content url="https://d21hwc2yj2s6ok.cloudfront.net/shrine_store/uploads/networks/3057/communication_news/1092580/compressed-04b378e958cacabf163094083885111f.webp"/>
    </item>
  </channel>
</rss>
