“Human Error” Is an Illusion: A Response to Om Malik

Published on April 27, 2026

Dear Om,

I want to begin by stating that I have an immense amount of respect for you—I followed your work at GigaOm while I was a rising editor and conference chair at O'Reilly Media in the 2010s. I subscribe to your newsletter and appreciate your well-earned, often biting, takes on the current state of our industry. That said, I was sadly disappointed in your recent article "Human Error is OK! Machine Madness is a No-No! Why?" My goal here is to present enough information to convince you to update your mental model when it comes to "human error" because the views you've espoused continue to permeate our industry and have real, significant consequences for how we understand and investigate failure in software systems.

You begin with a perfectly reasonable-sounding question: why do we accept human error but not machine/AI error? You assert that "we forgive human error as if it were weather." But as someone who researches incidents and their aftermaths, what I've seen over 20+ years in this industry is that this is categorically untrue. In general, our industry treats human error as a scapegoat: a way to avoid looking any deeper into the more thorny aspects of managing complex systems (goal conflicts, financial pressures, organizational dysfunction, etc.). And not far into your own post, you demonstrate exactly this: rather than accepting human error like wind or rain, you use it to pin large-scale outages on people. "Fat fingers, credential theft, misconfiguration, copy-paste mistakes…" This is important because scapegoating can lead to people getting punished (or even fired) for complex events in which their behavior was almost always an attempt to resolve the situation, not make it worse. This has a knock-on effect of creating cultures of fear and shame where latent issues don’t get reported out of concern for retribution, a pattern that figures among the many damning conclusions about the culture that led to the Challenger disaster.

It’s also worth noting that such claims are only possible after the fact, when a large serving of hindsight bias makes a clean, linear account seem obvious. Ask anyone who is regularly involved in these kinds of events and they will quickly disabuse you of the notion that incidents are this simply explained.

We have to consider that there is something to the idea that we do accept human error more so than automation error: collectively, we have internalized that people are fallible and that automation, machines, and AI are reliable and correct. People can feel guilt, be blamed, be removed from the situation (whether this is a good idea or not); blaming individuals provides a therapeutic illusion of removing the one “bad apple” and moving on.

Given this instinct, humans are often framed as chaotic elements in an otherwise safe system (wrongly so; they’re an adaptive element), and machine processes as the embodiment of system priorities and goals. This means that using human error as an explanation short-circuits our thinking and prevents questioning the system. Forgiving AI errors, perhaps ironically, therefore also short-circuits our thinking and prevents questioning the system.

Most dangerous, though, is this idea: "...more than AI, it is human error that is a clear and present danger." This is a claim we've heard repeatedly to justify automation, AI, self-driving cars, and more. If only we could cut those pesky humans and all their mistakes out of the loop entirely, all would be well. To this point, you invoke the research of James Reason and his Swiss Cheese model. However, the broader field of Resilience Engineering (to which Reason was an early contributor) has long debated the merits of that model. Even if one fully agreed with Reason, your interpretation of his Swiss Cheese model is inaccurate: Reason himself considered human error only as a triggering element, and insisted that it was an inadequate stopping point for an investigation. He considered errors to be outcomes of the environment and not an explanation by themselves, and urged us to recognise that human variability is consistently a force to harness in averting errors.

Dr. Richard Cook, in his foundational summary How Complex Systems Fail, is unambiguous: post-incident attribution to any root cause (including "human error") is fundamentally wrong. Because overt failure in complex systems requires multiple faults co-occurring, there is no isolated cause of an incident. The search for a single cause (and "human error" is almost always the answer that search produces) does not reflect a technical understanding of failure. It reflects, in Cook's words, "the social, cultural need to blame specific, localized forces or events for outcomes."

This matters because complex systems don't operate in a safe, neutral state waiting to be disrupted by careless people (or mischievous machines). They are, as Cook puts it, always running in degraded mode: containing ever-changing mixtures of latent failures at any given moment, held together not despite their human operators but because of them. Practitioners are the adaptable element of these complex systems, constantly making real-time adjustments that regularly prevent those latent failures from compounding into catastrophe. The person who seemingly "caused" the incident is almost certainly the same person who prevented a dozen others that nobody ever wrote about, because all practitioner actions are gambles taken under uncertainty, and we only call them errors when the outcome is bad.

This is what system safety researcher Sidney Dekker calls the Old View: the assumption that systems work fine until a person breaks them, and that the fix is to find that person and correct or replace them. In his words: “There is an almost irresistible belief that we are custodians of already safe systems that need protection from unreliable, erratic human beings.” The failures you catalogued (the exposed password, the unreviewed command, the dormant VPN account), some of which didn’t even invoke human error in their own analyses, were not human intrusions into healthy systems. They were signals that those systems were already operating under conditions that made those outcomes possible, even predictable, but only in hindsight. Those are organizational and systemic failures. “An old VPN account with no multi-factor authentication, used by a human who had left the company” is not human error. Calling that human error doesn't explain or fix it. It obscures the deeper issues.

I’m also caught off-guard by your framing of things like post-incident reports as “failure theater.” This is disingenuous at best. Only if people involved in these incidents keep blaming surface elements and not looking deeper at their systems would I be willing to label these efforts as such. Many people involved in reviewing and analyzing incidents are deeply committed to looking beyond surface elements and temptingly simple “root causes” to better understand how their system as a whole led to what happened. And yes, while our industry has at times been ham-fisted in its approach to learning from failure and incidents, many of these practices come from other industries (aviation, healthcare, nuclear power) that have spent significantly more time and effort than we have grappling with the notion of failure in complex systems. I will concede that most publicly available incident reports contain an element of theater, and are often written more to appease internal marketing/PR departments and shareholders, but let’s not throw the baby out with the bathwater. The true danger in your argument is creating a false equivalence between humans and AI, which leads to philosophies (and, increasingly, real-world products) based on the question “how could we tolerate so many human errors, and can’t we see that technology would let us get rid of their frustrating lack of reliability?”

There is a real and important conversation to be had about the novel risks of AI failure: its potential to replicate errors at scale, its opacity, and our current lack of AI-specific incident history. You're right that we're early in understanding how AI-generated failure behaves. But the solution is not to weigh our AI anxiety against a supposedly mature tolerance for human failure. That tolerance was never maturity. It was, and remains, a habit of naming things instead of understanding them. Safety, Cook reminds us, is not a property of any component of a system, human or machine. It is an emergent property of the system as a whole. Until we investigate both successes and failures at that level, we will keep producing comfortable narratives and calling them explanations.

Courtney Nash, Vice President, Resilience in Software Foundation (RISF)

With contribution and review by:

Fred Hebert, Board Member, RISF

John Allspaw, Principal, Adaptive Capacity Labs

Maria Jackson, RISF Member

Rachel Silber, Complex Systems Group

Sheeri Cabral, Director of Enterprise Architecture, myKaarma

Kurt Andersen

Robert Barron

Thomas Depierre
