On January 28th, 1986, seven astronauts boarded the Challenger for its tenth launch into space. Its previous missions had included the first spacewalk of the shuttle program and, at various times, the first American woman, African-American, Canadian and Dutchman in space. 73 seconds after lift-off, the Challenger broke apart, killing the entire crew. Why? Because the fuel tank exploded. So, the solution, of course, is to send the next shuttle up with a fuel tank that doesn’t explode, right?
Only if you assume that the exploding fuel tank was a 100% isolated incident, completely unrelated to any other events. If that sounds fishy to you, it should. And this is where root cause analysis comes into play, a practice colloquially known as “the five whys”.
By Reading This Post, You Will Learn:
- What a “root cause analysis” is
- How root cause analysis can help you identify and resolve the systemic issues that caused an acute, incidental problem
- How to conduct a root cause analysis using “the five whys” format
What Is A Root Cause Analysis?
A root cause analysis assumes that any critical failure is not the result of a single event, but a chain of events starting with a root cause. You can think of each of these contributing factors as the branches of a tree. A tree of EVIL. Naturally, you want to destroy this tree, it being evil and all. But, if you only address the acute issue (like a fuel tank exploding), you are only cutting off one branch. Its preceding branches will all live on to spawn other catastrophes. And killing any plant means getting at its roots.
And you get to the roots by asking “Why?”. A lot. Why did this critical issue occur? Because of Event A. Well, let’s fix Event A, but why did that occur? Event B. Then let’s sort out Event B, but in the meantime, why did that event occur?
A root cause analysis isn’t scientific in the sense of being supported by experimentation and peer-reviewed studies. But I’m including it in “Game Planning With Science!” for two important reasons. First, it is a form of rigorous analysis, which is the foundation of all credible science. Second, it’s very much a lean way of thinking: you are trying to eliminate waste and failure not topically, but at the source. You are not just seeking to treat the symptom, but the disease itself.
Case In Point: The Challenger
Here’s an example of a real-world root cause analysis:
- Why did the Challenger’s external fuel tank explode?
- Answer: the booster rockets warped during liftoff and started leaking flames and propellant which, in turn, burned a hole in the side of the fuel tank.
- Solution: redesign boosters to prevent or better tolerate this sort of warping (called extrusion)
- Why did extrusion cause the boosters to leak flames and propellant?
- Answer: the (now infamous) O-rings on the rockets did not shift to contain the flames and propellant after extrusion occurred
- Solution: redesign boosters to contend with O-ring failure
- Why did the O-rings fail to shift with the extrusion?
- Answer: the frigid temperatures in January exceeded what the O-rings’ seals could tolerate; specifically, the O-rings hardened in the cold and were too stiff to shift with the extrusion
- Solution: redesign O-ring seals for a wider tolerance of environmental stressors
- Why did Challenger launch with faulty O-rings?
- Answer: the launch was approved for temperatures below those for which Challenger had been certified to fly
- Solution: redefine mission criteria to prevent launches in weather conditions for which the spacecraft in question has not previously been tested/certified
- Why did NASA approve the launch during untested weather conditions?
- Answer: internal NASA politics bypassed safety protocols and forced a launch over the objections of the engineers
- Solution: increase transparency and accountability for observing established protocols
Implications Of The Five Whys
There are three crucial observations to make about a root cause analysis.
First, every “why?” results in a distinct issue (and implied fix) AND a path for further discovery. You are trimming the problematic branch (and all the other branches that might spring from that node), and then following the branch further along to the next deepest node.
Second, it’s crucial to observe how different the first “why?” is from the last. Why #1 is specific, while #5 is systemic. Why #1 is a technical problem – the barn door after the horse has left. #5 is organizational – why did we make the bad decisions that led to leaving the barn door open?
Third, consider the potential for further catastrophic failures if you had only fixed the boosters. Or even if you had stopped with using better O-rings in future missions. The root cause of inadequate launch safety criteria and reckless politicking at NASA would still linger to spawn future problems. By way of example, consider the (admittedly simplistic) flow chart below. Even assuming that any point of failure had, in the worst-case analysis, only two potential points of further failure (an optimistic assessment to say the least), you’re still looking at 15 possible negative outcome scenarios in addition to the actual Challenger disaster.
In other words, if the politics at NASA hadn’t forced a launch in unfavorable weather conditions, none of the other failures would have occurred, and the Challenger would not have exploded. This is not to say that there aren’t other root causes that could result in the Challenger’s demise, but this specific loss of life would not have happened.
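To see where that count of 15 comes from, here is a rough sketch of the arithmetic. It assumes a chain five levels deep (one level per why) in which every failure can lead to exactly two further failures. The function name and its parameters are made up for illustration.

```python
def count_end_outcomes(levels: int, branching: int = 2) -> int:
    """Count the end-of-chain outcomes in a failure tree.

    Assumes one root cause at level 1 and that every failure
    leads to exactly `branching` possible further failures.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return branching ** (levels - 1)


# Five whys -> five levels, from the root cause down to the acute failure.
total = count_end_outcomes(levels=5)
other_outcomes = total - 1  # everything except the actual disaster

print(f"{total} possible end outcomes, {other_outcomes} besides the real one")
```

With two branches per failure, the tree ends in 16 possible outcomes: the Challenger disaster plus 15 others. Raise the branching factor to three and the same five levels end in 81, which is why fixing only the last branch leaves so much room for trouble.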
The Five Whys In Game Development
Root cause analysis can be applied to any ex post facto investigation of an operational failure. For instance, let’s say you own a game studio and are running an online multiplayer game. The server goes down for an afternoon, costing your company an estimated $10k in lost micro-transactions. You dig through Git and find the offending submission, submitted by one Bobby McBork. Now, you can begin and end your follow-up on the issue with reprimanding and/or firing Mr. McBork. Or, you can take a more holistic view and try to identify the sequence of events that led to the server failure.
- Why did the server fail?
- Answer: Bobby McBork merged faulty code directly into the production code base
- Solution: revert the offending code change
- Why did Bobby merge his code directly to production?
- Answer: his manager, Jimmy Nosense, instructed him to do so
- Solution: establish a new protocol that NO submissions should be merged into the production code base unless they are designed to fix an ongoing, red alert issue in live service; make sure Jimmy understands the new protocol
- Why did Jimmy direct Bobby to merge to production instead of the QA testing server?
- Answer: the QA server had been down all week
- Solution: bring the QA server back online
- Why was the QA testing server down all week?
- Answer: because no one took the time to fix it
- Solution: assign an engineer to bring the QA server back online
- Why didn’t anyone take the time to fix the server?
- Answer: no single person was responsible for fixing it, thus no one prioritized the work against his/her own to-dos
- Solution: establish a protocol for dealing with QA server failures and assign first responder/s for such incidents
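If you want to keep a record of an investigation like this one, a simple list of question/answer/solution entries does the job. The sketch below is one possible way to write it down, not a standard format; the `Why` class and the two helper functions are hypothetical names chosen for this example, and the entries are shortened versions of the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    question: str
    answer: str
    solution: str


# The McBork incident, one entry per why, in the order they were asked.
chain = [
    Why("Why did the server fail?",
        "Faulty code was merged directly into production",
        "Revert the offending code change"),
    Why("Why was the code merged directly to production?",
        "The manager instructed the engineer to do so",
        "Only red alert fixes may be merged straight to production"),
    Why("Why did the manager skip the QA testing server?",
        "The QA server had been down all week",
        "Bring the QA server back online"),
    Why("Why was the QA server down all week?",
        "No one took the time to fix it",
        "Assign an engineer to bring the QA server back online"),
    Why("Why didn't anyone take the time to fix it?",
        "No single person was responsible for fixing it",
        "Establish a protocol and first responders for QA server failures"),
]


def action_items(chain: list[Why]) -> list[str]:
    """Every why yields its own fix, not just the last one."""
    return [why.solution for why in chain]


def root_cause(chain: list[Why]) -> Why:
    """The last why asked is the deepest cause found so far."""
    return chain[-1]


for number, item in enumerate(action_items(chain), start=1):
    print(f"Fix #{number}: {item}")
print(f"Root cause: {root_cause(chain).answer}")
```

Keeping the solution next to each answer makes it hard to forget that every level gets its own fix, while the last entry points at the systemic problem.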
There’s Always A Catch
The above illustrates a catch in the Five Whys protocol. As a manager, if you ask why enough times, the path may very well lead up to your own door. What appears to be the failure of an individual (poor Bobby McBork) was in fact a failure on your part to establish clear protocol both for deployments and for handling server failures.
That’s not an argument against root cause analysis, mind you. It’s a recommendation that you need to check your ego at the door if you want to fix systemic, root cause issues rather than topical ones.
Does It Have To Be Five? No More, No Less?
As Emerson said, “A foolish consistency is the hobgoblin of little minds.” Don’t ask why five times just for the sake of asking it. And don’t stop at five if there is further discovery to be had. If you have reached the root cause, stop. If you haven’t, continue. The number five is simply a rule of thumb.
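Put in code terms, the stopping condition is "no deeper cause is known", not "the counter hit five". The sketch below illustrates that idea under a big simplifying assumption: that your findings fit in a plain dictionary mapping each problem to the one thing that caused it. The function name and the sample findings are hypothetical.

```python
def trace_to_root(problem: str, caused_by: dict[str, str]) -> list[str]:
    """Follow the chain of causes until no deeper cause is known."""
    chain = [problem]
    while chain[-1] in caused_by:
        deeper = caused_by[chain[-1]]
        if deeper in chain:  # guard against circular blame
            break
        chain.append(deeper)
    return chain


# Hypothetical findings: each key was caused by its value.
findings = {
    "server went down": "untested code merged to production",
    "untested code merged to production": "QA server was offline",
    "QA server was offline": "nobody owned QA server outages",
}

chain = trace_to_root("server went down", findings)
whys_asked = len(chain) - 1
print(f"Root cause after {whys_asked} whys: {chain[-1]}")
```

Here the trail runs out after three whys, so the loop stops at three. Add more findings and it keeps going, whether that takes five whys or seven.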
Even the leanest, most efficient process will experience errors and failures. A mistake is a learning experience. Repetitions of that mistake are waste. So, if you want to avoid recurrences of the same issue, don’t just identify the cause. Find and correct the source. The savings, in terms of waste and lost time, will be orders of magnitude greater.
- Root cause analysis is the process of moving through the chain of events of a specific, acute problem in order to identify the underlying systemic cause/s
- The typical format for root cause analysis is the five whys – literally asking why five times
- When using the five whys, solve each contributing issue you come across as you move toward the root cause
- Five whys is a rule of thumb, but root cause analysis doesn’t LITERALLY have to consist of exactly five whys
- The point is to try to find the underlying systemic problem that facilitated the acute issue
- If you can find the problem in 3 or 4 whys, fine; and if it takes 7, so be it