Three Takes on Four Concepts for Resilience Engineering

Published on February 19, 2025

Ed note: The first time I read Dr. David Woods' paper Four Concepts for Resilience Engineering, I felt so many things click in my brain. While the field of Resilience Engineering is not new, those of us in the software industry are relatively new to RE, and seeing a paper that is so clearly applicable to software systems left a strong impression on me. When I reached out to the RISF community for someone to contribute their thoughts on this paper, multiple hands went up in Slack—and they were all so good I couldn't pick just one. So what follows is three software engineers' takes on Woods' paper: Erika Rowland (ER), Eric Dobbs (ED), and Fred Hebert (FH). Each provides their own perspective on Woods' proposed concepts, and additionally Dobbs provides some excellent software-centric examples of how you might see these concepts manifest themselves in your systems.

Introduction

ED: Resilience ≠ Reliability. Resilience is human skills and human relationships. Reliability is what we build into our software.

ER: Resilience has been used in a number of different ways in Resilience Engineering literature. This conflation of different definitions makes it difficult to parse and understand what that literature is arguing for and why.

FH: The paper opens by admitting that the popularity of the term has led to confusion regarding what it means in the first place. I recall seeing other papers which held the ill-defined term to be one of the biggest weaknesses of a discipline named after it. All the different uses seen around the place have been categorized into four groups by Woods: rebound, robustness, graceful extensibility, and sustained adaptability.

Rebound

ER: Rebound is about returning to normal functioning after a disruption. This ability to rebound seems to depend directly on the conditions before the disrupting event. How was the system prepared before the chaos began?

While literature on rebound tends to focus on individual disruptions or traumas, the more interesting thing to study is the idea of surprises. A surprise is, to quote Woods:

the event is a surprise when it falls outside the scope of variations and disturbances that the system in question is capable of handling.

A surprise is an event that challenges an existing model and forces the system to learn or adapt the model.

ED: There are many common examples of rebound. Roll back a deploy. Restore lost data from a backup. Reboot a server. Restart a container. Truncate log files to free up disk space. Follow the instructions in a runbook. Basically, this is anything you do to put some sub-system more or less back the way it was.

Robustness

FH: This usage is generally perceived to be a conflation of resilience with another term, robustness: the ability to absorb disruptions. More robustness means your system can tolerate a broader range of disturbances while still responding effectively. Generally, robust control works, and only works, for cases where the disturbances are well-modelled.

ED: Monitoring and alerting are the most basic measures. We build one component to monitor another and call in the humans if some threshold is crossed. Kubernetes comes with built-in behavior that kills containers that run out of memory and other behavior which restarts containers when they fail. Common practice for databases includes having read-only replicas, hot-standby replicas, or automated failover. Load balancers include built-in health-checks for the servers they’re balancing and will adapt to send traffic only to the healthy servers. One of the most common reasons people want to move to the cloud is to enable autoscaling, where the systems can adapt to extra traffic by spinning up more containers and then spin down those extras when the surge in traffic subsides. More sophisticated examples of robustness include bulkheads, circuit breakers, and automated chaos experiments.
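To make one of those examples concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration under stated assumptions, not how any particular library implements the pattern: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("dependency marked unhealthy; call skipped")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Note that the breaker only handles the disturbance it was designed for, a dependency that keeps failing. That is the sense in which robustness works for well-modelled disruptions and not for surprises.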

Graceful Extensibility

FH: Graceful extensibility is a sort of play on the idea of graceful degradation. Rather than asking the question how or why do people, systems, organizations bounce back, this line of approach asks: how do systems stretch to handle surprises? Resources are finite, environments are changing, and their boundaries shift in ways that require stretching and elasticity. A tenet here is that without the ability to stretch and adjust, your brittleness is far more severe than expected during normal operations, and is generally exposed through extremely rapid collapses.

ER: Another aspect of concern is decompensation, where exhaustion of a system under sustained disruption reduces the capacity of the system to adapt to new disruptions. Think of on-call fatigue in an operations team, or deformation of a material under stress that changes its properties. (An effective way to “cut” paper without scissors is to fold it back and forth across a joint until the paper gives way with little force, allowing you to tear the paper straight by hand.) As Woods writes:

When the time to recovery increases and/or the level recovered to decreases, this pattern indicates that a system is exhausting its ability to handle growing or repeated challenges, in other words, the system is nearing saturation of its range of adaptive behavior.
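As a rough illustration of the pattern Woods describes, here is a small Python sketch that flags a run of incidents where recovery keeps getting slower or shallower. The `Recovery` record, its two fields, and the three-incident window are assumptions made for this example; they do not come from the paper, and a real signal would need far more context than two numbers per incident.

```python
from dataclasses import dataclass


@dataclass
class Recovery:
    minutes_to_recover: float  # how long the system took to stabilize
    level_recovered: float     # fraction of normal capacity regained, 0.0 to 1.0


def shows_decompensation(history, window=3):
    """Return True if each of the last `window` recoveries was slower, or each
    was shallower, than the one before it."""
    if window < 2:
        raise ValueError("window must cover at least two recoveries")
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough incidents to call it a trend
    pairs = list(zip(recent, recent[1:]))
    slower = all(b.minutes_to_recover > a.minutes_to_recover for a, b in pairs)
    shallower = all(b.level_recovered < a.level_recovered for a, b in pairs)
    return slower or shallower


# Hypothetical data: three incidents, each taking longer to recover from.
incidents = [Recovery(10, 1.0), Recovery(20, 1.0), Recovery(45, 0.9)]
print(shows_decompensation(incidents))
```

A check like this is at best a prompt for a conversation about how stretched a team or system is, not a measurement of its adaptive capacity.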

FH: The idea here is influenced by Safety-I (studying and preventing failures) vs. Safety-II (studying and enhancing successes), such that graceful extensibility can be seen as a positive attribute: how do we create a readiness-to-respond that is a strength and can be leveraged in all sorts of situations, rather than narrowing it to being the avoidance of negative effects?

Contrasted with rebounds, the approach to this is to look at past challenges, and see them as a way to gauge the potential to adapt to new surprises in the future. It also allows the idea of studying sequences and series of rebounds on a longer-term view of the system. How do they succeed and how do they fail?

ED: The best example we have of graceful extensibility in the software business is incident response. Once we detect that some part of our system is getting overwhelmed or otherwise misbehaving, some group of us drop what we’re doing to prevent the problem from getting worse and to remediate.

Sustained Adaptability

ER: Sustained adaptability asks three questions of a resilience engineer:

What characteristics explain the difference between systems that have sustained adaptability vs. those that don't?
What design principles and techniques allow you to engineer sustained adaptability?
How would you know if you succeeded in engineering sustained adaptability in a system?

FH: Expected challenges to sociotechnical systems over their life cycle include:

Surprises will keep challenging boundaries
Conditions and contexts will keep changing and shifting the boundaries
Adaptive shortfalls will happen and people will have to step in
The factors that provide adaptability and the needs for them will shift over time
Classes of changes will happen and the system as a whole will need to readjust itself and its relationships

ED: This is Woods’ most demanding concept of resilience. All systems reach their own previously known limits. This happens almost continuously. People in the systems continually stretch the systems to adapt to new circumstances. Successful components in the system induce demand that exceeds their original design. It is completely predictable that something will fail under the changing conditions. The specifics of what will fail and when are less predictable.

We have a few examples that address sustained adaptability. Many teams are adopting operational review meetings, which are an excellent practice to help monitor how the ecosystem around them is changing and how their services are responding to the relentless change. Another is creating a Learning From Incidents (LFI) team to conduct cognitive interviews, facilitate learning reviews, and generally help other teams broaden and deepen what we learn when circumstances overwhelm our sub-systems.

ER: Architecting a system to have sustained adaptability relies on understanding that all adaptive systems are constrained by trade-offs, and that certain architectures allow for adjustment of those trade-offs.

FH: A whole lot of the discipline [of RE] is therefore interested in all the tradeoffs people make, that biological systems (or ecosystems) make, and particularly which are fundamental and how they apply to other systems as well. An agenda of this type of resilience is in managing capacities dedicated to resilience. In this perspective, it makes sense to say a system is resilient, or not, based on how well it balances all the tradeoffs, or not.

Woods states that the yield from the first two types of resilience has been low. The latter two approaches, the most positive ones, tend to provide better lines of inquiry, though the discipline is still young.

Erika Rowland

Eric Dobbs

Fred Hebert
