One change we are seeing in our industry is the wider adoption of the belief that being able to distill an incident down to a single root cause is a myth[1][2]. As the complexity of our systems grows, so does the complexity of our incidents, and trying to isolate an incident to one item doesn't produce the kind of learnings we need from those incidents.
The truth is that each incident is unique because of the multiple factors that contributed to it, and if any one of those factors had been different, it would have been a completely different incident. Unless we give each of those factors the same care, we miss the opportunity to address each of them.
While pluralizing "root cause" to "root causes" can get you a good part of the way there, in my experience the change in wording from "root causes" to "contributing factors" does much more to change how people think about incidents and to drive the learnings we want. I was initially skeptical that such a minimal language change would make a difference, and I can happily admit I was wrong.
At Netflix we've started to change our internal language around this, and have found that teams come away with a much richer set of learnings after an incident. Since I was a responder at PagerDuty when we started to form these practices and when this documentation was first conceived, I feel it would be a miss if we didn't iterate on these documents to keep up with learnings from our industry.
We, and others, have started to talk about Contributing Factors instead. We still identify what was traditionally called the "root cause", but we list it as one of the factors (often called out as the trigger).
What are your thoughts on updating the verbiage of this documentation to align with our industry shifting its way of thinking?
[1] https://medium.com/@jpaulreed/dev-ops-and-determinism-966a57e3a5cc
[2] https://en.wikipedia.org/wiki/Fallacy_of_the_single_cause