Root Cause Analysis? You're Doing It Wrong
Key topics
The article 'Root cause analysis? You're doing it wrong' challenges traditional approaches to root cause analysis, sparking a discussion on the effectiveness of different methods and the complexities of analyzing accidents in complex systems.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3h after posting
Peak period: 61 comments in the 54-60h window
Average per period: 13.1 comments
Based on 92 loaded comments
Key moments
- Story posted: Oct 11, 2025 at 9:39 AM EDT (3 months ago)
- First comment: Oct 11, 2025 at 12:53 PM EDT (3h after posting)
- Peak activity: 61 comments in the 54-60h window, the hottest stretch of the conversation
- Latest activity: Oct 15, 2025 at 12:40 AM EDT (3 months ago)
Want the full context? Read the primary article or dive into the live Hacker News thread when you're ready.
https://how.complexsystems.fail/
I forget where I heard this, but: "you manage risk, but risk cannot be managed." I.e., there is no terminal state where it's been "solved." It's much like "culture of safety."
Though unsatisfying, it feels like a lot boils down to "shit rolls downhill" or "a fish rots from the head down".
Any RCA that doesn't provide useful feedback to management up and down the chain is missing pieces, but there's lots of discussion elsewhere in this thread about that by people better at elaborating than I am.
If you get a bunch of analyses that point to underinvestment in x in order to achieve y, and you can measure that this is losing money, then the top level recalibrates.
It's not about blame, it's about course-correcting including at the top level. They can't do that course-correcting without these analyses.
Before that I hated it when people confessed to me that they knew a problem was going to happen and they did nothing to stop it. But the problem with doing things right the first time is that nobody appreciates how hard it is. You spend all your political capital trying to stop the company from making a mistake it desperately wants to make. You get no credit when it doesn't happen, because they already figured it wouldn't. And you don't get any buy-in for keeping it from repeating in the future. So now you have this plate you have to spin all by yourself, and nobody who can hand you more control gives a shit, other than the fact that you seem like an asshole, so we aren't going to promote you.
[1]: https://entropicthoughts.com/hidden-cost-of-heroics
I often hear people misinterpret "risk management" as if it means "risk minimisation", but this is the first time I hear of "risk elimination"!
Risk management is about finding an appropriate level of risk. Could be lower, could be higher.
That is sort of how Agile got out there in the first place. Open feedback loops let management be stupid for so long before the consequences were apparent that they could duck their vast contributions to it.
[0] https://en.wikipedia.org/wiki/Normal_Accidents
If the result/accident is bad enough, though, you need to find all the different faults and mitigate as many as possible the first time.
This sounds like continuously firefighting to paper over symptoms rather than address the problems at a deeper level.
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.
My 'favorite' is when we implement stupid, self-limiting, corner-painting features for a customer who leaves us anyway. Or who we never manage to make money from.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was mostly about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five why process blame any team you find convenient by choosing the question. I quit not too long after that.
My first thought is: why is rolling out a new system to prod that is not used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand-new service should not be on a tier where its teething issues count as an incident.
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
I would be interested to see the doc, but I imagine you'd branch off the causes. One branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? ... (you'd need that team's help).
I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!
So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot, though.
The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"
Whereas
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
Is more emotive. Sounds like a cross-examiner's question, which isn't the vibe you'd want to go for. 5 whys should be 5 drys. Nothing spicy!
It took a while for all the teams to embrace the RCA process without fear and finger pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best-viewed of our streams and processes :-)
Edit: You can see others commenting on precisely this. Examples:
https://news.ycombinator.com/item?id=45573027
https://news.ycombinator.com/item?id=45573101
https://news.ycombinator.com/item?id=45572561
If you do this, know that there be dragons. You have to be very careful here, because for any sufficiently large company, misaligned incentives are largely defined by the org chart and its boundaries. You will be adding fuel to politics that is likely above your pay grade, and the fallout can be career-changing. I was lucky to have a neutral reputation, as someone who cared more about the product than personal gain. So I got a lot of leeway when I said tone-deaf things. Even still, I ended up in the crosshairs once or twice in the 10 years I was at the company for having opinions about systemic problems.
I'm not disagreeing. I'm saying they should phrase it this way (and some do), instead of masking it with an insincere request for root causing.
> Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting
Occasionally this is the right thing to do. And often this results in a very long checklist that slows the whole development down, because they don't want to do a cost-benefit analysis of whether not having an occasional escape is worth the decrease in productivity. And this is because the incentives for the manager are such that the occasional escape is not OK.
In reality, though, he will insist on an ever growing checklist without a compromise in velocity. And that's a great recipe for more escapes.
That's the problem with root cause analyses. Sometimes the occasional escape is totally OK if you actually analyze the costs. But they want us to "analyze" while turning a blind eye to certain things.
I've worked at places that understood this and didn't have this attitude. And places that didn't. Don't work for the latter.
BTW, I should add that I'm not trying to give a cynical take. When I first learned the five whys, I applied it to my own problems (work and otherwise). And I found it to be wholly unsatisfying. For one thing, there usually isn't a root cause. There are multiple causes at play, and you need a branching algorithm to explore the space.
More importantly, 5 (or 3) is an arbitrary number. If you keep at it, you'll almost always end up with "human nature" or "capitalism". Deciding when to stop the search is relatively arbitrary, and most people will pick a convenient endpoint.
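A minimal sketch of that branching exploration (all names and the depth cutoff are invented for illustration): instead of a single linear chain of whys, you keep a tree of contributing causes and choose an explicit, admittedly arbitrary, point to stop digging.

    # Hypothetical sketch: walking a tree of contributing causes instead of
    # a single linear chain of five whys. The depth limit is a chosen cutoff,
    # not a law of nature.
    from dataclasses import dataclass, field

    @dataclass
    class Cause:
        description: str
        children: list["Cause"] = field(default_factory=list)

    def collect_causes(node: Cause, depth: int = 0, max_depth: int = 4) -> list[str]:
        """Depth-first walk of the cause tree, stopping at max_depth."""
        lines = [f"{'  ' * depth}- {node.description}"]
        if depth < max_depth:
            for child in node.children:
                lines.extend(collect_causes(child, depth + 1, max_depth))
        return lines

    # Toy incident with more than one branch worth following.
    incident = Cause("Feature did nothing at launch", [
        Cause("Flag name was mistyped the day before handoff", [
            Cause("No freeze or review gate on late config changes"),
        ]),
        Cause("Multi-week test pass never exercised the feature end to end", [
            Cause("Test plan verified the UI, not the effect"),
        ]),
    ])

    print("\n".join(collect_causes(incident)))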
Much simpler is:
1. What can we do to prevent this from happening again?
2. Should we solve this problem?
Expanding on the latter, I once worked at a place where my manager and his manager were militant about not polluting the codebase to solve problems caused by other tools. We'd sternly tell the customers that they need to go to the problematic tool's owner and fix them, and were ready to have that battle at senior levels.
This was in a factory environment, so there were real $$ associated with bugs. And our team had a reputation for quality, and this was one of the reasons we had that reputation. All too often people use software as a workaround, and over the years there accumulate too many workarounds. We were, in a sense, a weapon upper management wielded to ensure other tools maintained their quality.
An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.
So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Similarly, with "incentives are misaligned", that's fuzzy. A real root cause might be around managing time spent on bugfixes vs new features, and the root cause is not dedicating enough time to bugfixes; and if that's because people aren't being promoted for those, it's about fixing the promotion process in a concrete way.
You can't usually just stop at fuzzy cultural/management things because you want to blame others.
That's not an inflexible timeline. That's just a timeline.
> So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Because management didn't want to drop features. Hence "inflexible".
I'm not saying this is always the reason, or even most of the times. But if we can't invoke it when it is the problem, then the exercise is pointless.
> Similarly, with "incentives are misaligned", that's fuzzy.
Any generic statement is inherently fuzzy.
> You can't usually just stop at fuzzy cultural/management things because you want to blame others.
Did you really think I was advocating responding to a whole process with one liners?
The examples you gave are often ones they will not accept as root causes.
This was and is torture to me. I'm not going to fuck something up on purpose just to make the paperwork look good if I can tell ten minutes in that this is a stupid way to do it and I should be doing something else first.
Why did the testing team not catch that the feature was not functional?
This is covered by LINK
This is a classic "limit the scope of the feature." You want the document to be written and constrained to someone that is in a position to impact everything they talk about. If you think there was something more holistic, push for that, as well.
Note you can discuss what other teams are doing. But do that in a way that is strictly factual. And then ask why that led your team to the failure that your team owns.
If you're wondering what anyone "could have done", you've already missed the point of the article completely.
This is literally a "the right people aren't in the room" issue.
I should also have stated that based on the context I assumed this was talking about an incident report meant to be consumed internally, which I believe should be one per team. Incident reports published externally should be one single document, combining all the RCAs from each individual report.
The idea being that if it was obvious another team messed up, you'd just ignore the problem until it got audited by a cross-functional team, with all of the time and effort and materials spent between the two wasted because nobody spoke up.
Otherwise when an airplane crashes because of a defect in the aluminum, the design team's RCA will have to conclude that the root cause is "a lack of a redundant set of wings in the plane design", because they don't want to pin the blame on the materials quality inspection team mixing up some batch numbers.
If you were relying on something not to fail and it failed, your RCA should state as much. At best in the GP's case they could say "it's our fault for trusting the testing team to test the feature".
Is the process set up so that it's literally "throw it over the wall and you're done unless the test team contacts you"? Then arguably not. You did your job e2e and there was nothing you could've done. Doing more would've disrupted the process that's in place and taken time from other things you were assigned to. The test team should've contacted you.
BUT, well now the director has egg on his face and makes it your problem, so "should" is irrelevant; you will be thinking about it. And you ask yourself, was there something I could've done? And you know the answer is "probably".
Then, the more you think about it, you wonder, why on earth is the process set up to be so "throw it over the wall"? Isn't that stupid? All my hard work, and I don't even get to follow its progress through to production? Is this maybe also why my morale is so low? And the morale of everyone else on the team? And why testing always takes so long and misses so many bugs?
And then as you start putting things together, you realize that your director didn't assign this to you out of spite. He assigned it to you to make things better. That this isn't a form of punishment, but an opportunity to make a difference. It's something that is ultimately a director-level question: why is the process set up like it is? The director could put it together and solve it with adequate time, but at that level time is in short supply, and he's putting his trust in you to analyze and flesh out: what really is the root cause for this incredibly asinine (and frightening) failure, and how can we improve as a result?
That said, in an org so broken that something like this could happen, I'm guessing the director is wanting you to do the RCA and the ten other firedrills that you're currently fighting as well, in which case, eh, fuck it. Blame the other team and move on.
You keep using that word. I do not think it means what you think it means.
Hell, people even tried to legislate the value of pi that one time.
Reportedly Toyota has organizational mitigations for that problem; or, reportedly, the working culture there isn't so great after all. The bottom line is, it's a double-edged sword, to say the very least.
Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.
Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)
Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.
Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing team's tests?
And more to the point, how do you write a five whys that explains how you typo'd a flag to turn a feature on, and another team validated that the feature worked?
It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.
Once you find out the heart surgeon shows up drunk to the operating room, you make sure there is an additional nurse there to hold his arm steady.
(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)
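For what it's worth, the kind of extra mechanical check being described can be very small. A minimal sketch (the URL and marker string are made-up assumptions, not anything from the story above): a post-deploy probe that fails loudly if the feature appears to do nothing at all.

    # Hypothetical post-deploy smoke check: does the feature observably do
    # anything? The endpoint and the marker it should emit are invented.
    import sys
    import urllib.request

    def feature_leaves_fingerprint(url: str, marker: str) -> bool:
        """Fetch a page the feature should affect and look for its fingerprint."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        return marker in body

    if __name__ == "__main__":
        ok = feature_leaves_fingerprint(
            "https://staging.example.com/checkout",  # assumed staging endpoint
            "data-new-checkout-flow",                # assumed marker the feature renders
        )
        if not ok:
            print("Smoke check failed: feature appears to be inert")
            sys.exit(1)
        print("Feature fingerprint found")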
Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…
- Why are you adding/changing feature flags the day before handoff? Is there a process for a development freeze before handoff, e.g. only showstopper changes are made after freeze? Yes, but sales asked for it so they could demo at a conference. Why don't we have a special build/deployment pipeline for experimental features that our sales / marketing engineers are asking for?
- Was it tested by the developer before pushing? Yes. Why did it succeed at that point and fail in prod? The environment was different. Why do we not have a dev environment that matches prod? Money? Time? Politics?
- Was it code reviewed? Did it get an actual review, or rubber-stamped? Reviewed, but only skimmed the important parts. Why was it not reviewed more carefully? Not enough time. Why is there not enough time to do code reviews? Oh, the feature flag name used an underscore instead of a hyphen. Why did this not get flagged by a style checker? Oh, so there are no clear style conventions for feature flags and each team does their own thing…? Interesting… (a sketch of such a check follows after this list)
Etc etc.
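To make the style-checker branch concrete, here is a hedged sketch of what such a check could look like in CI (the naming convention, the registry, and the flag names are all assumptions for illustration): reject flags that break the convention or were never registered, which is exactly where an underscore-for-hyphen typo would surface.

    # Hypothetical feature-flag lint: enforce a naming convention and check
    # used flags against a registry. Convention and names are invented.
    import re
    import sys

    FLAG_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")  # lowercase, hyphen-separated

    def check_flags(used_flags: set[str], registered_flags: set[str]) -> list[str]:
        problems = []
        for flag in sorted(used_flags):
            if not FLAG_PATTERN.match(flag):
                problems.append(f"{flag}: does not match the naming convention")
            if flag not in registered_flags:
                problems.append(f"{flag}: not in the flag registry (typo?)")
        return problems

    if __name__ == "__main__":
        # In CI these would be parsed from source and a registry file;
        # hard-coded here to keep the sketch self-contained.
        used = {"new_checkout_flow"}        # underscore typo, never registered
        registered = {"new-checkout-flow"}
        problems = check_flags(used, registered)
        for p in problems:
            print(p)
        sys.exit(1 if problems else 0)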
I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".
Just as you should never take responsibility over something you are given no power over, you should move responsibility to where the power is (and if they won't take responsibility, you start carving out the edges of their power and hand it over to adjacent groups who will).
I learned pretty early how to hack the 5 Why's in order to make sure something actionable but neither trivial nor overwhelming gets chosen as the response. And I often do it early enough in the meeting that I'm difficult to catch doing it.
If I don't get invited I will sometimes crash the party, especially if the last analysis resulted only in performative work and no real progress. You get one, maybe two, and then I'm up in your business because mistakes don't mean you're stupid, but learning nothing from them does.
...5 billion years ago, the Earth coalesced from the dust cloud around the Sun...
That being said, in this case, I agree with your manager. Both the QA team and your team had fundamental problems.
Your team (I assume) verified the functionality which included X set of changes, and then your team made Y more changes (the flag) which were not verified. Ideally, once the functionality is verified, no more changes would be permitted in order to prevent this type of problem.
The fundamental problems on the QA team were…extensive.
A few problems I faced:
- culturally, a lack of deeper understanding or care around "safety" topics. The forces that be are inherently motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser-focused on the bare minimum of action items.
- security folks co-opting the safety things, because removing access to things can be misconstrued to mean making things safer. While somewhat true, it also makes doing jobs more difficult if not replaced with adequate tooling. What this meant was taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and ticketing security, you’re probably failing to address both security and safety..
- related to the last point, but a lack of introspection as to the means of making the changes which led to the incident. For example: a user uses ssh to run a command and ran the wrong command -> we should eliminate ssh. Rather than asking: why was ssh the best / only way the user could effect change on the system? Could we build an API for this, with tooling and safeguards, before cutting off ssh?
When you move thinking about reliability or safety outside of the teams that generate the problems, you replace self reflection with scolding, and you have to either cajole people to make changes for you or jump into code you're not spending enough time with to truly understand. And then if you make a mistake this is evidence that you shouldn't be touching it at all. See we told you this would end badly.
Someone said the quiet part loud! :
"""
Common circumstances missing from accident reports are:
Pressures to cut costs or work quicker,
Competing requests from colleagues,
Unnecessarily complicated systems,
Broken tools,
Biological needs (e.g. sleep or hunger),
Cumbersome enforced processes,
Fear of the consequences of doing something out of the ordinary, and
Shame of feeling in over one’s head.
"""
There were tools I wrote at my last job because basically an incident could, IMO, be traced back to "I was on step 5 and George interrupted me to ask a question about some other high-priority effort, and when I got back I forgot to finish step 5". The more familiar you are with a runbook, the more your brain will confuse memories from an hour or a day ago with memories of the same routine at a different time.
It's literally "did I turn off the stove". You remember turning off the stove. Many, many times. But did you turn off both burners you turned on for lunch? You're certain about one. But what about the other? That's a blur.
But we're software developers. The more mundane a task, the more likely that we can replace it with a program to do the same thing. And as you get more familiar with the task you keep adding more to the automatic parts until one day it's just a couple buttons that can be reliably pushed during peak traffic on your services. Just make sure there isn't an active outage going on when you push the buttons.
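The tools in question usually end up looking something like this minimal sketch (step names and commands are placeholders): run each step, record completion to a state file, and resume from the recorded state, so an interruption at step 5 cannot silently turn into a skipped step 5.

    # Minimal interruption-proof runbook runner. Completed steps are recorded
    # to a state file so a distraction never has to rely on memory. Steps are
    # toy placeholders.
    import json
    import subprocess
    from pathlib import Path

    STATE_FILE = Path("runbook_state.json")

    STEPS = [
        ("drain-traffic",   ["echo", "draining traffic"]),
        ("snapshot-db",     ["echo", "snapshotting database"]),
        ("deploy-build",    ["echo", "deploying build"]),
        ("verify-health",   ["echo", "verifying health checks"]),
        ("restore-traffic", ["echo", "restoring traffic"]),
    ]

    def load_done() -> set[str]:
        return set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

    def mark_done(done: set[str], name: str) -> None:
        done.add(name)
        STATE_FILE.write_text(json.dumps(sorted(done)))

    if __name__ == "__main__":
        done = load_done()
        for name, cmd in STEPS:
            if name in done:
                print(f"skip {name} (already done)")
                continue
            print(f"run  {name}")
            subprocess.run(cmd, check=True)
            mark_done(done, name)
        print("Runbook complete. Delete the state file before the next run.")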
1. Create a document from template (xyz), put it in this share location (abc) and fill it in as you perform each step.
I am blessed with a bad memory so I do this anyway! But not everyone has that "advantage".
I do this! I have a common checklist that I copy/paste into the project notes. It's all plain text and I just X items off as I go. Indeed, a missing X has caused problems before.
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broke and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best-case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no out there, but the analysis flow looks a bit different for things that should-have worked versus things that could-never-have worked. And, yes, it will roll up on management, but... still.
Sure, more is always better. Practically, though, we are trading depth for breadth. In my experience, many problems that look dissimilar after a shallow analysis turn out to be caused by the same thing when analysed in depth. In that case, it is more economical to analyse fewer incidents in greater depth and actually find their common factors, rather than make a shallow pass over many incidents and continue to paper over symptoms of the undiscovered deeper problem.
But my experience is also that you cannot ignore anything. Even the little stuff. The number of difficult system-level bugs I have resolved by remembering "you know, two weeks ago, it briefly did this weird thing that really shouldn't have been possible, but this might be related to that if only..." is crazy. It's been a superpower for me through the years.
However, I mostly work on hardware. Hardware's complexity envelope is straight-up different to software's. So that might explain some of the difference in our perspectives. Hardware absolutely never randomly misbehaves (which is to say that all its bad behavior has some kind of cause one might reasonably be able to ascertain, and that cause isn't from a level unreachable for a hardware engineer), but software carries enough state, and state from other levels of the stack, that I would not make the same statement. Thus the fault-chasing priorities aren't quite the same.
So I did what they wanted and the root cause was:
On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.
I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith, if he hadn't been born we wouldn't have had this problem and I just went back to the actual conception date."
I got asked how I was able to pin it to that date and said "I asked Bob what his father's birthday was and extrapolated that info"
I was never asked to do a RCA again.
They're both flawed but often replace something that works 3x worse than my caricature of both.
The findings should always result in a material change that is worth at least the effort of having done it. Not just a checkbox that proves we did something. The investment in the mitigation should honor the consequences of the failure, and the uniqueness of the failure. Or rather, the lack of uniqueness. As a failure repeats in kind (e.g., a bunch of 737 MAXes crashing), trust in the system is put in jeopardy. By the time a problem has happened three times, the response should begin to resemble penance.
So how do we get the problem not to hit production again, or how do we at least keep it from happening due to the exact same error?
And for some failure modes, we need to project the consequences going forward. Let's say you find your app is occasionally crashing over a weekend because of memory leaks, plus the lack of Continuous Deployment forcibly restarting the services. We can predict this problem will happen reliably on Memorial Day, and Labor Day. So we need to do something relatively serious now.
But it'll also get much worse on Thanksgiving weekend, and just stupid around Christmas, when we have code freezes. So we do something to get us through Memorial Day but we also need a second story near the top of the backlog that needs to be done by Labor Day, Thanksgiving at the latest. But we don't necessarily have to do that story next sprint.
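A back-of-the-envelope version of that projection, with invented numbers (the leak rate, headroom, and freeze windows are assumptions, not measurements from any real incident): compare the hours a process can survive without a restart against the longest stretches with no deploys.

    # Rough projection of when a memory leak bites, using made-up numbers.
    LEAK_MB_PER_HOUR = 40          # estimated leak rate
    HEADROOM_MB = 2048             # free memory right after a restart

    hours_to_crash = HEADROOM_MB / LEAK_MB_PER_HOUR   # ~51 hours

    # Longest stretches without a deploy-triggered restart (assumed):
    windows_hours = {
        "normal weekend": 2 * 24,                 # borderline: explains occasional crashes
        "Memorial Day / Labor Day weekend": 3 * 24,
        "Thanksgiving weekend": 4 * 24,
        "Christmas code freeze": 10 * 24,
    }

    for window, hours in windows_hours.items():
        verdict = "crashes" if hours > hours_to_crash else "survives (barely)"
        print(f"{window:34s} {hours:4d}h -> {verdict} (limit ~{hours_to_crash:.0f}h)")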
Systems do not have to facilitate operators in building accurate mental models. In fact, safe systems disregard mental models, because a mental model is a human thing, and humans are fallible. Remove the human and you have a safer system.
Safety is not a dynamic control problem, it's a quality-state management problem. You need to maintain a state of quality assurance/quality control to ensure safety. When it falters, so does safety. Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Also fwiw, there's often not a root cause, but instead multiple causes (or coincidental states). For getting better at tracking down causes of failures (and preventing them), I recommend learning the Toyota Production System, then reading Out Of The Crisis. That'll kickstart your brain enough to be more effective than 99.99% of people.
It should be more clear in the article that this term is used more broadly. The "mental model" is the function that converts feedback into control actions. In that sense, even simple automated controllers have "mental models".
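A toy illustration of that broader sense of the term (everything here is invented): even a bare thermostat holds an internal estimate of the process state, and its control actions come from that estimate rather than from reality directly, which is exactly where model drift turns into hazard.

    # Toy controller: the "mental model" is just an internal estimate of the
    # process state that converts feedback into control actions.
    class Thermostat:
        def __init__(self, setpoint_c: float):
            self.setpoint_c = setpoint_c
            self.believed_temp_c = setpoint_c  # the controller's process model

        def observe(self, sensor_reading_c: float) -> None:
            # Feedback updates the model; a stuck sensor means the model
            # quietly drifts away from the real process state.
            self.believed_temp_c = sensor_reading_c

        def control_action(self) -> str:
            # The action is chosen from the model, not from reality directly.
            if self.believed_temp_c < self.setpoint_c - 0.5:
                return "heat on"
            if self.believed_temp_c > self.setpoint_c + 0.5:
                return "heat off"
            return "hold"

    t = Thermostat(setpoint_c=21.0)
    t.observe(18.2)
    print(t.control_action())  # prints "heat on"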
> Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Sometimes it isn't but often it is. Yes, that makes the system more complex, along with other properties such as organicism, non-linearity, parallelism, interactivity, long-term recurrences, etc. These properties are inconvenient but we cannot just wish them away. We have to design for the system we have.
> Toyota, Deming, SPC
I have read more books in this area than most people, and I don't really see how you came away without an appreciation for the importance of humans in the loop, and the dangers of overautomation. Could you illustrate more clearly what you mean?
Fantastic RCA: Remove the requirement that caused the action that resulted in the problem occurring.
Bad RCA: Let's get 12 non-technical people on a call to ask the on-call engineer, who is tired from 6 hours of managing the fault, a bunch of technical questions they don't understand the answers to anyway.
(Worst possible fault practice is to bring in a bunch of stakeholders and force the engineer to be on a call with them while they try and investigate the fault)
Worst RCA: A half paragraph describing the problem in the most general terms to meet a contractual RCA requirement.
Causal Analysis based on Systems Theory - my notes - https://github.com/joelparkerhenderson/causal-analysis-based...
The full handbook by Nancy G. Leveson at MIT is free here: http://sunnyday.mit.edu/CAST-Handbook.pdf