Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader

Published on July 29, 2025

Author’s note: I need to credit my friend and former colleague Tim Nicholas for his review and contribution to this article and the insights within it. Tim has been pivotal in shaping how I understand failures in complex systems and Resilience Engineering, and how I incorporate this into my overall approach as an engineering leader. (Also, the spelling is New Zealand English.) 

So you’re an Engineering Manager or a Director, or perhaps a Staff Engineer. You’ve discovered the world of Resilience Engineering and had some pretty significant "aha" moments. Now you want to bring Resilience Engineering into your organisation, and set the world to rights.

I’ve been exactly where you are right now, and I’m going to suggest you hold your horses. Let me share my insights and recommendations from my experience as an Engineering Leader working to make Resilience Engineering a reality. There are pitfalls on this journey that I can help you avoid.

As an SRE (or adjacent) leader, you will find yourself in a challenging position when it comes to Resilience Engineering. You’re the layer in the middle. You're stuck between the practitioners, who are passionate but may not have good insight into the realities and challenges of leaders, and the executive layers, who have different goals and concerns and don’t have the motivation to deviate from things they think are working. 

Most people working in tech and IT, no matter the type of company or organisation, will be familiar with various practices around incidents. A lot of these are summed up in regular incident reporting. Often produced monthly or existing in a dashboard, this reporting includes data and metrics such as MTTR, incident counts by severity and root cause type, and compliance with completing post-incident review (PIR) action items.

These are often non-negotiable: they are the default standards and expectations that have permeated deeply and widely throughout our industry. In many cases, Boards and Executives expect to see these reports as indicators of risk and operational effectiveness, to ensure they fulfill their corporate governance obligations. Engineering Leaders and Managers simultaneously use them as a purported lens to measure team and engineering system performance. Producing a Root Cause Analysis (RCA) is a deeply embedded expectation of both leadership and customers following a high-severity incident, as is tracking completion of PIR action items. You can’t explore a monitoring or incident response tool without being sold the ability to drive down your MTTR.

However, these are all things that those of us in the Resilience in Software community know and accept to be not just misleading and outdated, but also counterproductive and potentially harmful to the actual improvement of safety in an organisation.

That leaves those of us in Resilience Engineering with quite the dilemma. Do we speak out against these deeply embedded practices and expectations and seek to explicitly educate our colleagues and leaders, or is it all a necessary evil in making space for resilience work? 

The Paradox

The fact is, we are faced with a paradox. A contradiction that reveals deeper truths and a reality that is difficult to reconcile. 

For an engineering leader to build trust and credibility, and to be perceived as competent, they need to meet or exceed the expectations of leaders and executives by carrying out these default standards and practices. You believe this to be contrary to your goals; however, my experience tells me you should do these things anyway in pursuit of bigger-picture resilience and adaptive capacity.

Before you get upset and stop reading, hear me out. I’ve been the leadership layer in the middle, sandwiched between the practitioners who really understand Resilience Engineering and see the need for it every day, and the executives with different concerns, goals and motivations. 

My goal when it comes to Resilience Engineering is to legitimise the work, the practices and the concepts. I want to make space for engineers to conduct incident analysis to better understand our systems and the conditions that enable incidents to occur. I also want to build understanding and acknowledgement of complexity and Resilience Engineering in leadership to enable better awareness and decision making. 

Very specifically, my goals no longer include the death of root cause analysis, incident metrics, MTTR and PIR action tracking. 

Telling executive leadership that incident counts, MTTR, and root cause analysis are wrong, incorrect or misleading gets you nowhere. I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations. This narrative is harming our ability to do resilience work, and undermines the credibility of our ideas, our practices and our movement. It is very difficult to educate people against their will, especially if their view of competence on a subject is different to your own. We must make Resilience Engineering appealing and useful to executives if we are to get their support and buy-in.

Board and Executive pressure around incidents, and the need to present something that is viewed as tangible and sound, should never be underestimated. Once you are in the pressure cooker of “too many incidents” or “poor reliability,” incident data and reports give leaders the tools not only to feel in control but also to demonstrate that they are in control. Many of the traditional incident practices and metrics help leaders communicate that they are controlling things, and also create opportunities for them to exert control over what people do, even when they don’t have control over the actual system or system outcomes.

The closer you are to the sharp end—the point where the most direct and often challenging work of a particular activity takes place—the more apparent it is that uncertainty exists. Many of the above-mentioned default standards and expectations around incidents are designed to help manage and control uncertainty. However, Resilience Engineering is essentially about embracing uncertainty as an unavoidable reality of complex systems. As a result we tend to want everyone to accept uncertainty as unavoidable. In the Boardroom, far removed from the sharp end, the perspective is very different. Uncertainty is something that should be managed and reduced. It can become a risk with acceptable probability, but it can't be unmanaged. From this different perspective, the system and its outcomes must be controlled, or have the illusion of being controlled, even if we know this to be at odds with the system’s reality. 

My Recommendations

You need to play the game that others think they are playing. As an engineering leader, you only have so much social capital and so many cards to play in what you can change and influence. You want these cards to be used for your agenda, and you want to make meaningful progress. This might be influencing for additional headcount or prioritising a particular piece of system improvement work. I don’t think those limited cards should be used for debating concepts like MTTR and incident counts. 

Give Them What They’re Asking For

If leaders and executives are asking, then continue to produce that incident data and show those PIR actions. If you’re not doing it then someone else will, and it’s much better to be in control of the data and the narrative. Position yourself and your SRE leaders as credible, in control and taking action, even if it’s not by your definition. If your leaders aren’t currently asking for typical incident reports and metrics, I strongly suggest you have the data on hand to be in a position to provide it should you be asked. You need your perceived competence and resilience work to ride the wave of leadership change and organisational evolution. Even if you succeed with one set of leaders in building their understanding that these are not the metrics or approaches they need, leaders come and go. If you are perceived as too far from industry norms you will have a hard time establishing trust and credibility with a new set of leaders who haven't been on the same journey.

Proceed with Caution When Educating and Influencing

Rather than seeking to educate directly in a way that can be perceived as contrary, model the Resilience Engineering concepts, language and practices you want to see others adopt. Once your stakeholders become familiar with the concepts through a more implicit approach, you’ll find them more receptive to being introduced to new ways of understanding reliability and resilience, and different types of recommendations from their teams as a result. 

Educating and influencing leadership on Resilience Engineering is not without risks, so I recommend that you proceed with caution. This approach assumes leaders have the willingness to learn and the conceptual grounding to make informed decisions. In reality, this is rarely the case. When it comes to incidents, leaders tend to believe they know how things should be done and are resistant to alternative approaches and perspectives. Even if you do manage to succeed in educating leadership, the outcome is uncertain. At best they gain only a surface-level grasp of new concepts; at worst they misinterpret and misapply what they’ve learned, causing further harm.

A more effective strategy is to empower people closer to the work (those who have done the analysis and understand the tradeoffs) to provide recommendations for executive action. These recommendations should narrow the range of options and enable leadership to make decisions confidently. They are more likely to be accepted if they are backed by the credibility and trust gained through giving leaders what they’re asking for.

Anecdotally, attempting to influence leadership towards more of our Resilience Engineering methods and practices has a notable rate of burnout for SRE leaders and senior individual contributors. I suspect the phenomenon we are currently experiencing in this regard is part of some broader industry cycle due to factors outside of our control. Either way, we need our perceived competence and resilience work to ride out the wave of this industry cycle, just as we need to ride out the wave of leadership change and organisational evolution. And we need to be able to do that with our mental and emotional wellbeing intact. 

Acknowledge and Negotiate the Paradox

So what does this look like in practice? Think of it like putting spinach in a kid’s smoothie—give them what they think they want, and include the good, healthy stuff they also need in the process. If someone states human error as a cause for an incident, you can mention that this tends to be an indicator of underlying systemic challenges worth taking a closer look at. If you can, offer to provide SRE resources to conduct a more useful PIR. If leaders are getting caught up on incident counts and MTTR, you can use SLOs to provide a more meaningful representation of the quality of the customer experience, while providing context in the form of themes and patterns relevant to your audience. 

It’s here in these conversations, with the background of outputs and metrics you’re not enthusiastic about, that you’re best placed to use those valuable cards to make recommendations based on insights from incident learning and resilience work. That might mean highlighting systemic challenges and areas of underinvestment, or making recommendations that support better prioritisation and improve the engineered system.
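To make the SLO suggestion above a little more concrete, here is a minimal sketch of how a request-based SLO can be turned into a one-line summary for a leadership report. Everything in it is an assumption for illustration: the SloWindow shape, the "checkout" service, the request counts and the 99.9% target are all hypothetical, and your own SLO tooling will almost certainly define these differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one reporting window (hypothetical shape)."""
    name: str
    good_requests: int   # requests that met the customer-facing success criteria
    total_requests: int  # all valid requests in the window


def attainment(window: SloWindow) -> float:
    """Fraction of requests that were good; 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def error_budget_consumed(window: SloWindow, target: float) -> float:
    """Share of the allowed bad requests used so far (1.0 means exhausted)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1, exclusive")
    allowed_bad = (1.0 - target) * window.total_requests
    if allowed_bad == 0:
        return 0.0
    actual_bad = window.total_requests - window.good_requests
    return actual_bad / allowed_bad


def summarise(window: SloWindow, target: float) -> str:
    """One sentence suitable for sitting alongside incident counts and MTTR."""
    return (
        f"{window.name}: {attainment(window):.2%} of requests good "
        f"against a {target:.1%} target, "
        f"{error_budget_consumed(window, target):.0%} of error budget used"
    )


if __name__ == "__main__":
    # Hypothetical numbers for a 30-day window.
    checkout = SloWindow("checkout", good_requests=999_500, total_requests=1_000_000)
    print(summarise(checkout, target=0.999))
```

A summary like this can sit next to the incident counts and MTTR your leaders already expect, with the themes and patterns from incident analysis supplying the context behind the number.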

In all of this, you are constantly negotiating the paradox we face in Resilience Engineering. These opposing perspectives and methods will always coexist. One might be more dominant than the other at certain points in time, but there will always be that push and pull. You need to be ready for it. By building the perception of competence and playing along with short-term optics, you can pursue the bigger picture of Resilience Engineering.

I still hope that one day we will build broad industry recognition for Resilience Engineering concepts and practices, while outdated methods and metrics fade off into the sunset. Until that day comes, we play the game while working to change the rules.

 

Michelle Casey

SRE/Engineering Leader