Resilience in Software Foundation

    Blog

    Results (13)

    Superficial Blamelessness

    In 2012, John Allspaw (then CTO of Etsy) wrote a seminal blog post on the need for what he called Blameless Postmortems. Built off the notion of a “Just Culture” from the research of Sidney Dekker, he summarized Etsy’s approach to balancing accountability in post-incident reviews with a need for avoiding blaming individuals who were involved in the incident: “Having a "Just Culture" means that you
    April 15, 2026
    Saturation

    Every system has its limits. If you keep throwing more and more at a system, at some point it will no longer be able to function properly. The term saturation refers to the state of a system where the demands on the system are high enough that the system is at the limits of its normal performance. Electrical engineering provides a helpful visualization of saturation. Imagine you want to
    March 2, 2026
    Four Responses to Overload

    “A pattern is a way to generalize and transfer findings from one situation to others.” —Woods et al., Patterns in How People Think and Work: The Importance of Pattern Discovery for Understanding Complex Adaptive Systems
    Overload is when the demands on a system, including its human operators, exceed their capacity to effectively manage those demands. It is so pervasive that people adapt to it witho
    February 21, 2026
    Expertise and Overload

    Resilience engineering views incidents through a different frame than the conventional approach in the software industry, which tends to treat incidents as an irritating interruption to our work. The unrelenting pressure of the roadmap hangs over retrospectives. The clock is always ticking. We focus on the quickest fixes that might prevent the failure in the future so we can “get back to work.” To
    February 21, 2026
    Decompensation and Cascading Failures

    Consider the following scenario: A set of automated tasks has somehow failed to run to completion. Because a thorough fix will take some time and the tasks you need run are time-sensitive, you complete the work manually to keep things moving forward. If this sounds familiar, then you may have helped provide compensation for that system. Compensation is a very interesting mechanism in software syst
    January 22, 2026
    Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader

    Author’s note: I need to credit my friend and former colleague Tim Nicholas for his review and contribution to this article and the insights within it. Tim has been pivotal in shaping how I understand failures in complex systems and Resilience Engineering, and how I incorporate this into my overall approach as an engineering leader. (Also, the spelling is New Zealand English.) So you’re an Enginee
    July 29, 2025
    MTTR Is (Still) Lying to You

    Ed note: This is largely a re-post of what was in the 2022 VOID report. Since then, I continue to see ongoing conversations (and product marketing pages) touting the benefits of MTTR, so I thought it was time to dust off these data that show how unreliable and unhelpful MTTR is to software organizations. Software organizations tend to value measurement, iteration, and improvement based on data. The
    February 27, 2025
    Three Takes on Four Concepts for Resilience Engineering

    Ed note: The first time I read Dr. David Woods' paper Four Concepts for Resilience Engineering, I felt so many things click in my brain. While the field of Resilience Engineering is not new, those of us in the software industry are relatively new to RE, and seeing a paper that is so clearly applicable to software systems left a strong impression on me. When I reached out to the RISF community for
    February 19, 2025
    You Can't Build More Nines

    (Originally published on Medium) Software teams are built to.. well.. build. We have design processes, RFC processes, change management processes.. lots of processes. All of them tend to be optimized for building. But, inevitably after building enough complexity, we start to realize that our systems are not reliable enough. We start to measure uptime and, lo, there are not enough nines! Of course, o
    February 19, 2025
    You're Missing Your Near Misses

    (Originally posted at Surfing Complexity) FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025. The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling it is to people: they
    February 13, 2025
    An Incident Review of an Incident Review

    (Originally posted at An Incident Review of an Incident Review) So I bombed an incident review this week. More specifically, the facilitating. I’ve run post mortems/retrospectives/PIRs, whatever you want to call them, for over a decade. Just felt my arthritis kick in a bit as I typed that. It’s hard to quantify, to even qualify, subtle nuances and questions I’ve developed as handy go-to’s to get fo
    February 11, 2025
    Building a Safer Software Industry, Together

    The software industry is on a precipice: In 2024, with rising interest rates driven by macroeconomic forces, many firms made decisions to cut back on staff and look for more efficiencies in their operations. While said macroeconomic (and sociological) forces putatively drove those cuts, the scientific realities around stretched, highly complex socio-technical systems remain the same: they will fai
    December 18, 2024
    Introducing the Resilience in Software Foundation

    Software failures are inevitable. No matter how hard we try, we can’t make our systems flawless, nor can we predict every possible problem. The systems we’re building today are beyond the mental model of any one person or team—the complexity outpaces our ability to grasp it fully, especially when those systems are under stress or operating in unexpected conditions. And with the rapid acceleration
    December 18, 2024