
Expertise and Overload
Resilience engineering views incidents through a different frame than the conventional approach in the software industry, which tends to treat incidents as an irritating interruption to our work. The unrelenting pressure of the roadmap hangs over retrospectives. The clock is always ticking. We focus on the quickest fixes that might prevent the failure in the future so we can “get back to work.” To apply resilience engineering in the software business we must recognize that incidents are part of everyday work in complex systems, not merely interruptions. In fact, incidents are a recurring opportunity for learning and discovery.
There are two features of (almost) all incidents that usually go unnoticed or ignored: expertise and overload. Both of these are pervasive in incidents, yet also elusive.
Expertise by its very nature conceals the difficulty of incident response work. It is deep, well-practiced know-how, judgement, and experience for effective problem-solving and effective coordination under pressure. Similarly, overload is so pervasive that people adapt to it without even noticing. The cognitive capacity of incident responders can be exhausted by the demands on attention and communication, contributing to distraction, impaired decision-making, delayed actions, and increased errors or stress.
In order to draw these elusive and important things into the open, I’m going to dig into a short example from a particularly difficult moment in an incident drill in which two experienced incident commanders deftly managed the overload. The high pressure of this moment is a perfect laboratory to examine these subtle and elusive features of both expertise and overload.
Consider this section of the video transcript from the incident drill. Let's get a first impression without commentary. (I’m deliberately not expanding the acronyms yet; bear with me.) Sarah is leading the response and Alex is her deputy.
A few minutes later, Sarah is managing a few other threads of investigation.
These exchanges happen quickly and casually. Why bother pointing to this at all? There is a lot to learn about both expertise and overload hiding between the lines of this dialog. Let's dig in.
First off, Sarah's claim during that exchange turns out to be a misunderstanding of Hamed's role. In my interviews with her she reported feeling embarrassed to have gotten this wrong during the drill. Experts must frequently make effective decisions even with incomplete and inaccurate understanding. It is also notable that despite her misunderstanding of Hamed’s specific role, she correctly assessed that Tanya has deeper technical understanding than Hamed. This is a signal to me that Sarah also has expertise in quickly recognizing and contrasting the expertise of others.
Next up, the acronyms BCP and CAN. They are useful glimpses of how expertise conceals the difficulty of work. Do you know these terms? I didn’t know either of them. Sarah and Alex are fluent in a shared vocabulary. For them, BCP is a business continuity plan and CAN is a conditions-actions-needs report. Does the expansion of the acronyms expand your understanding of the underlying complexity? This is the very point of acronyms. They encapsulate complex ideas. Experts use them to communicate efficiently, especially under pressure.
Many readers will have at least heard of business continuity planning and related ideas of disaster recovery. Fewer readers will know that conditions-actions-needs reports are a communication framework for reporting progress during an incident. Both of these are deep rabbit holes that don’t add enough to this discussion, so I will avoid them for brevity.
To an observer who is unfamiliar with the jargon, these terms are opaque. Sarah and Alex know what they’re talking about, but the underlying complexity of this situation is concealed from the uninformed observer. Their fluency with shared jargon conceals the complexity they are tackling. There’s another subtlety on this point. Imagine someone who fluently understands BCP but has never heard of CAN. They would feel the elevated sense of urgency knowing that whatever else is going on, Sarah thinks they may need to engage the business continuity plan. They might view the delay in Alex’s response as lacking an appropriate level of urgency.1
At this point in the incident, Sarah has previously asked four different times in Slack to learn more about the BCP. The only answer has been a link to the document. These are subtle clues of her expertise in recognizing and managing overload.
Take note of what she is doing and what she is not doing. She isn’t reading the document herself. This hints that she is adapting to overload. A less experienced person might adapt differently, trying to still do all the things in front of them but reducing thoroughness on each. Another option would be to just drop it—she could shed load by ignoring the BCP. I can tell that she hasn’t just dropped the work because she keeps asking for more details. And each request for information is itself an attempt to recruit resources to help cope with the overload. Let’s look at how she recruits Alex to do this work.
Sarah's expertise with overload goes beyond understanding her own limits. Even while she delegates the reading to Alex, she demonstrates awareness that he may also be overloaded.
Knowing the BCP will demand careful attention, she asks about his capacity (“do you have bandwidth…”) even before she makes the specific request for help (“...to read it?”).
What's more, we see a moment of investing in common ground 2 with Alex, and reducing his cognitive work for this task. She shares her screen to show exactly which document. This gesture is loaded with insight about how expertise tackles overload:
I now know that she’s opened the doc even if she hasn’t read it. Moreover, she’s keeping track of it to be able to quickly pull it into view. This is an example of time-shifting work. She’s got the document close to hand when delegating to Alex. Showing the doc reduces the scope of work for Alex by priming his perception to recognize the document. Alex won't have to spend any extra energy wondering if he is looking at the correct document. It also subtly signals the importance of the request.
Sarah's demonstration of her expertise at managing overload and maintaining common ground happens in the space of a single breath.
Expertise hides what makes work difficult.3
Because overload is pervasive in incidents, experienced responders must develop tacit expertise in managing overload. But because that expertise is tacit, they won't even notice the subtle ways they help everyone around them cope with overload.
If you have ever had the experience of relief when one of your best responders shows up, you have felt the tacit knowledge. You don't know why, but somehow incidents just go better when they are in the room. And maybe the main thing you have learned is to call for their help if things get bad.
By contrast, this example features two responders who know overload and are explicitly familiar with a pattern of four responses to overload (for more on this, see the companion article on the Four Responses to Overload). When I interviewed them individually about their experience in the drill, each brought up this pattern on their own without any prompting from me. Given that the pattern was that front-of-mind for them, when they participate in incident retros, they almost certainly bring these topics up for discussion and deliberately spread the explicit knowledge in their respective organizations.
What’s more, we can see Sarah repeatedly choose the two more strategic adaptations to her own overload: recruiting resources and time-shifting work. Unless the observer knows to look for it, and knows how to recognize it, this expertise in managing overload is invisible.
With similar expertise, Alex responds to Sarah’s request by time-shifting work: “Yeah, once I get this CAN out then I'll read it. I'm almost done.”
He responds, signalling he understands the importance of the request. He adapts to his own overload by momentarily deferring this request. He also explicitly states what he has prioritized instead—this is extra energy Alex spends to maintain common ground with Sarah. This provides the briefest update on her previous request for the CAN, and also offers her a chance to override that prioritization decision.
Once again, this display of expertise happens in the space of a single breath. Sarah’s short reply (“Awesome. Thank you.”) serves to close this short conversation—implicitly validating Alex’s decision to finish the CAN first.
There are more examples of overload and expertise in the next exchange.
...busily typing and voicing a complicated group of questions...
Sarah: Oh man it takes me so long to type.
Alex: There are some execution steps for you in the BCP for failing over.
Sarah: Hold just a second while I get these... uh...
...this train of thought trails off as she switches to voice what she's typing...
In the video recording we can see and hear many ways Sarah tries to phrase the questions. Alex could also see and hear her working. He nevertheless chooses to interrupt, increasing Sarah's cognitive load. But he also gives just enough indication to justify the interruption: this is about the BCP.
Sarah explicitly time-shifts work again (“Hold just a second.”). She also tacitly sheds load (...this train of thought trails off...). This moment is the first we’ve seen of Sarah shedding load.
This example of shedding load is particularly instructive. We frequently adapt to overload without any conscious decision. Under these overloading demands, Sarah focuses attention on the most important or most urgent work—in this case, typing the questions, voicing them, and explicitly targeting them at specific responders.
Notice again how expertise hides the difficulty of the work. Some of the work disappears completely. Sarah normally does extra work to provide context for her thinking. That extra work just drops as her voice tracks exactly what she's typing. The absence of that work is only apparent by contrasting what "normal" sounds like for her under less intense circumstances.
Let’s look at one more observation of expertise with managing overload in the complicated group of questions Sarah typed into Slack:
- Can we move part of the traffic? Or is it all or nothing?
- Are there servers in the area of the DC we can shut down? Is our UAT there and able to be shut down for instance?
- Can the DC vendor give us an ETA? Is there anything else they can do to cool the room they are not already doing? Are we the only tenant there?
What we learn here is that the model of four responses to overload applies to machines just as it applies to people. Her first question (“Can we move part of the traffic?”) is recruiting resources in the form of an alternate data center.
Her second question (“Are there servers ... we can shut down?”) is working the same problem from the opposite direction. It would shed load on the network and also remove heat-generating computation from the data center.
Her third question (“Anything else they can do to cool the room?”) is also recruiting resources. She also explicitly directs those questions to specific responders. Each question is demanding on its own and Sarah is spreading that load across the team. Notice how she is managing overload at many levels simultaneously: her own, the incident response team, and the machines and resources related to the data center.
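One way to make this reading concrete is to tag each of Sarah's questions with the response it represents. The sketch below is only an illustration of the interpretation given in this article: the `Response` enum, the `QUESTION_TAGS` list, and the `count_by_response` helper are hypothetical names, and the tags are this article's reading of the questions, not data from the drill.

```python
from enum import Enum


class Response(Enum):
    """The four responses to overload named in this article."""
    SHED_LOAD = "shed load"
    REDUCE_THOROUGHNESS = "reduce thoroughness"
    RECRUIT_RESOURCES = "recruit resources"
    TIME_SHIFT = "time-shift work"


# Assumption: these tags restate the article's interpretation of each question.
QUESTION_TAGS = [
    ("Can we move part of the traffic?", Response.RECRUIT_RESOURCES),
    ("Are there servers in the area of the DC we can shut down?", Response.SHED_LOAD),
    ("Is there anything else they can do to cool the room they are not already doing?",
     Response.RECRUIT_RESOURCES),
]


def count_by_response(tags):
    """Count how often each response appears in a list of (question, response) pairs."""
    counts = {response: 0 for response in Response}
    for _question, response in tags:
        counts[response] += 1
    return counts


if __name__ == "__main__":
    for response, count in count_by_response(QUESTION_TAGS).items():
        print(f"{response.value}: {count}")
```

Tagging the questions this way only covers the machine level of her overload management. The human levels (her own load and the team's) would need their own observations.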
Finally, she is able to turn her attention back to Alex (“Alex, sorry, I put you in a buffer. What was that about the BCP?”). This subtly signals that Alex had interrupted skillfully. Sarah shows her understanding that he has the details about the BCP and that there is work in that plan that she must execute.
It is again notable what we do not see here. Whatever else Alex has been doing, he’s managed his own attention and overload to be able to respond to Sarah at this point. Unlike the earlier exchange when he was working on the CAN and deferred her request, here he immediately engages by reading the six steps aloud.
Putting Overload & Expertise Insights Into Action
This part of the incident drill illuminates three things that are difficult to even perceive: expertise, overload, and explicit expertise about overload. Instead of observing overload and expertise directly (which is often difficult or impossible), we can infer their presence by looking at what is observable: the adaptations themselves.
Overload is pervasive. Knowing this model of four responses to overload can help us to recognize its presence, to anticipate ways overload can cascade from one part of the system to another, and to choose better adaptations when we know there’s overload.
If you suspect you have hidden overload, look for any of the four adaptations. If you see time-shifting work or recruiting resources, you can infer that responders are anticipating the risk of overload and may be adequately managing it. Even then, look for the costs. Resources that have been recruited to support local overload do not come for free, so make sure the tradeoffs that pay for those additional resources are also under consideration. If many tasks are getting time-shifted, there may be hidden context-switching costs to mitigate. Shedding load and reducing thoroughness suggest acute overload that deserves more aggressive intervention.
If you already know overload is present, then review the four adaptations to weigh your options for managing it. If the overload is an immediate threat, focus on ways to shed load or reduce thoroughness. If you feel some breathing room, look around for resources you could recruit or find ways to time-shift the work.
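The two rules of thumb above can be written down as a small decision sketch. This is a minimal illustration, assuming the four adaptations can be treated as simple labels. The `Response` enum and the `read_signal` and `options` functions are hypothetical names introduced here, and real incidents will rarely sort this cleanly.

```python
from enum import Enum


class Response(Enum):
    """The four responses to overload named in this article."""
    SHED_LOAD = "shed load"
    REDUCE_THOROUGHNESS = "reduce thoroughness"
    RECRUIT_RESOURCES = "recruit resources"
    TIME_SHIFT = "time-shift work"


# Assumption: the split into "acute" and "anticipating" follows the
# heuristic in the text above, not a validated diagnostic.
ACUTE = {Response.SHED_LOAD, Response.REDUCE_THOROUGHNESS}
ANTICIPATING = {Response.RECRUIT_RESOURCES, Response.TIME_SHIFT}


def read_signal(observed):
    """Infer what the observed adaptations suggest about hidden overload."""
    observed = set(observed)
    if observed & ACUTE:
        return "acute"          # deserves more aggressive intervention
    if observed & ANTICIPATING:
        return "anticipated"    # likely managed; still check hidden costs
    return "none observed"


def options(immediate_threat):
    """Suggest which adaptations to weigh once overload is known to be present."""
    if immediate_threat:
        return [Response.SHED_LOAD, Response.REDUCE_THOROUGHNESS]
    return [Response.RECRUIT_RESOURCES, Response.TIME_SHIFT]


if __name__ == "__main__":
    print(read_signal([Response.TIME_SHIFT, Response.RECRUIT_RESOURCES]))
    print([r.value for r in options(immediate_threat=True)])
```

Any acute adaptation takes priority in `read_signal`, because the text treats shedding load or reducing thoroughness as the stronger warning sign even when other adaptations are also present.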
Expertise can be even more elusive than overload. It is certainly much more difficult to categorize. So much expertise develops tacitly that experts themselves have a difficult time explaining how they do what they do. Although I have shown many examples of how subtle and invisible expertise is, I have not offered any specific advice about how to recognize expert performance in the wild. There are learnable skills for discovering this expertise but we will tackle those in a different article.
Learning in Addition to Fixing
I’ve made another subtle move in this article by featuring an incident drill. Drills by their nature focus attention on how the incident responders do their work, instead of focusing on what needs to be fixed in the code, documentation, or monitoring. Humans develop skills and expertise through practice and experience. Conventional retrospectives that focus only on fixing miss all the places where we could be sharing expertise across the organization or improving the work of running incidents.
Remember our resilience engineering frame: incidents are first-class work. We can gain a competitive advantage by rewarding ongoing professional development in incident response. Through this know-how we can continually earn the trust of our customers.
One cool trick you can now apply to all of your future retrospectives is to ask how people adapted to overload in this incident. These patterns of adaptations to overload happen at a wide range of tempos and scales. Your teams and your services have unique expressions of overload and associated adaptations. Incidents provide an opportunity to look closely at those specific adaptations to specific overload and where specific circumstances exceeded the capacity of the system to adapt. The general pattern is useful, but your people need the grungy local details about how overload shows up in your world. When reflecting on incidents after the fact, these observations can reveal improvements to the operational practices of your teams or the operational tooling of the services in question.
Sharing the model of four responses to overload across your teams, and regularly discovering how people apply that model in real incidents, is a powerful way to turn incident retros into a community of practice with ongoing transfer of expertise and know-how. In this way, you can create leverage from the pervasive combination of expertise and overload in incidents and spread the more specific expertise about overload.
Further Reading
Interested readers can learn a lot more in Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems by Marisa Bigelow. There are four case studies of different software and network incidents with thoughtful attention to overload. You may be surprised to notice that the responses to overload are present equally in the software services and in the people running those services. What’s more, Bigelow traces connections between the overload of the software and the overload among the responders—an example of how overload in one part of a system can cascade to other parts.
See also: Patterns in How People Think and Work: The Importance of Pattern Discovery for Understanding Complex Adaptive Systems. Woods, David & Licu, Tony & Leonhardt, Jörg & Rayo, Michael & Balkin, E. & Cioponea, Radu. (2021).
Thanks to Fred Hebert, Greg Holecek, and James Boyd for reviewing early drafts. Special thanks to Courtney Nash, Sarah Butt, Alex Elman, Hamed Silatani, and Uptime Labs. This article has been a long time coming. It started with a thematic analysis of many production incidents when I was a member of Alex’s team at a previous employer. That became a foundation for conference talks on the multi-party dilemma given by Sarah and Alex in 2023. The three of us shared the insights from the multi-party dilemma with Uptime Labs, which informed the structure of the drill. Then Courtney had the brilliant idea that I should do an analysis of Sarah and Alex practicing that drill. Beth Adele Long’s workshops on regenerative productivity helped me strongly narrow the focus of this article to something manageable. John Allspaw clarified an important distinction between tacit expertise and the Law of Fluency: tacit expertise is why we need the many tools from Naturalistic Decision Making in order to discover expertise, whereas the Law of Fluency concerns the way experts, while actively adapting, conceal the difficulty of their work from all the people around them. Those are tightly related, but also not the same thing. And I am particularly grateful to Brian Marick for probably the best demonstration of constructive feedback I’ve ever seen, especially the demonstrations of putting specific examples first, then explaining the theory from the concrete.
1 Investigating the jargon in use between experts under pressure helps to reveal the complexity hidden in plain sight in day-to-day work. See Chapter 3 Being Bumpable: Consequences of Resource Saturation and Near-Saturation for Cognitive Demands on ICU Practitioners.
2 A colloquial understanding of common ground will be good enough to let us focus on the subtleties of expertise and overload. That said, there is a deeper understanding of common ground and common grounding that I have not covered in this article but nevertheless fits perfectly in this context. Klein, Gary & Feltovich, Paul J. & Bradshaw, Jeffrey & Woods, David. (2005). Common Ground and Coordination in Joint Activity. 10.1002/0471739448.ch6.
3 This section heading is how Dr. Richard Cook paraphrased The Law of Fluency from Dr. David Woods: “Well”-adapted work occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced. Woods, David & Hollnagel, Erik. (2006). Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. 10.1201/9781420005684. p.20.

Eric Dobbs
