Monday, November 12, 2012

Solving the Crime of Unreliability: Elements of a Process for RCA

I was recently watching a popular crime drama on TV and noticed that the detectives follow a process very similar to the one I use when solving a reliability problem in a facility.
The first thing the detectives do is identify the questions they want answered and then collect all the evidence they can to begin answering them. Next they build a timeline to understand where things fit around the crime. Then they combine the evidence and the timeline to identify the motives and finally the suspects. I have oversimplified what they do, but the core process steps are still there.
Solving the Crime of Unreliability in a facility starts with identifying the questions to be answered and the evidence to be collected. Then, just as for the investigator, the next step is collecting that evidence. I suggest folks use collection kits to help categorize and capture the data in its entirety. There is a blog post here about the kits I use and what they contain.
Early on I skipped the element of time and did not complete the timeline, or sequence of events, before using other tools. Over time I learned that this was a mistake in many cases and caused me to miss details. In two recent RCA investigations that were completed by others and later reviewed and refined by me, we discovered whole new causal chains and missed causes related to rebuild and maintenance execution that were not identified in the initial investigation. This happened because the original RCA team focused on their preconceived notions and did not look at what happened just before the failure in the sequence of events. Completing the sequence of events opened their eyes. It will do two things for you: first, it identifies other potential causes, and second, it clarifies the causes that you have already identified. Just as the crime scene investigators take the timelines and evidence and begin to look at the relationships, I do the same. I choose a tool such as a fault tree or logic tree, among others, to work through the evidence in sequence and to draw the connections and the causal chains.
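For readers who keep their evidence in a spreadsheet or script, here is a minimal sketch of the sequence-of-events step in Python. Everything in it is an assumption made for illustration: the Event fields, the function names build_sequence and events_before, the three-day window, and the sample pump events are all hypothetical and do not come from a real investigation. It only shows the idea of putting the evidence in time order and then looking closely at what happened just before the failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    when: datetime      # when the event happened
    description: str    # what happened
    source: str         # where the evidence came from (hypothetical labels)


def build_sequence(events):
    """Return the events ordered from earliest to latest."""
    return sorted(events, key=lambda e: e.when)


def events_before(sequence, failure_time, window):
    """Return the events that fall inside the window leading up to the failure."""
    start = failure_time - window
    return [e for e in sequence if start <= e.when < failure_time]


# Made-up evidence, listed in the order it might have been collected.
failure_time = datetime(2012, 11, 5, 14, 30)
evidence = [
    Event(datetime(2012, 11, 5, 14, 30), "Pump bearing failed", "operator log"),
    Event(datetime(2012, 10, 1, 9, 0), "Routine vibration reading taken", "PdM route"),
    Event(datetime(2012, 11, 3, 16, 0), "Pump rebuilt and reinstalled", "work order"),
    Event(datetime(2012, 11, 4, 8, 0), "Pump returned to service", "operator log"),
]

sequence = build_sequence(evidence)

# Look at what happened in the days just before the failure.
recent = events_before(sequence, failure_time, timedelta(days=3))
for e in recent:
    print(e.when.isoformat(sep=" "), "|", e.description, "|", e.source)
```

In this made-up data set the rebuild and the return to service both land inside the window, which is the kind of detail that is easy to miss when the team goes straight to a fault tree or logic tree without the timeline in front of them.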
If you find that the crime of unreliability has been committed in your facility, you may want to make sure each of these steps is included in your RCA process.

