Failure Analysis Experts
Introduction – The following is a brief description of how ENGINEERING SERVICES, L.P. proceeds when performing a Root Cause Failure Analysis (RCFA).
ENGINEERING SERVICES, L.P. proposes that the Root Cause Failure Analysis (RCFA) be divided into three major work phases: COLLECTION, ANALYSIS, and SOLUTION. Each of these phases is described below. It is best to proceed through the phases in order (i.e., one should not consider solutions until the analysis is complete), but it is not a one-way path. There are plenty of reasons to back up and repeat or revisit earlier steps before proceeding further. The failure is a complicated collection of events, and identifying the root cause will likely prove to be a challenge.
The RCFA proposed by ENGINEERING SERVICES, L.P. is not a single, sharply defined methodology; there are many different tools, processes, and philosophies of RCFA. However, most of these can be classed into five very broadly defined “schools” that are named here by their basic fields of origin: SAFETY-BASED, PRODUCTION-BASED, PROCESS-BASED, FAILURE-BASED, and SYSTEMS-BASED. All of these must be considered by ENGINEERING SERVICES, L.P. in the analysis of this complex failure.
Phase 1 Collection
Building a Team
Collection describes all of the work necessary to prepare for the analysis phase. Naturally, the first step is to form a team that will participate in the RCFA. Team members should have ownership of the problem, and will therefore include engineers, technicians, and operators. These team members are considered the natural team, as they have a firsthand interest in the results of the RCFA.
There are two main reasons why other team members will be added over the course of the investigation. First, it will be necessary to bring expertise into the team to help resolve key questions or assist in the development of viable solutions. These “expert” team members do not need to be permanent members, and can be released once their contribution is complete. Their role is to support the investigation such that it is not halted for technical reasons.
The second reason to add team members is to increase the circle of influence of the team. As the investigation matures, it may become apparent that the real root cause lies outside of the current team’s influence. For example, the investigation may point to an issue in manufacturing. In such a case, it is important to add a team member who has the desired influence (i.e., manufacturing background), so that the investigation is not prematurely halted by organizational boundaries.
Defining the Problem
ENGINEERING SERVICES, L.P. will work closely with the client to define the problem as a team activity, usually requiring some amount of brainstorming to come up with just the right definition. The quality of the investigation depends heavily on the quality of the problem definition. A good problem definition is short, simple, and easy to understand. In fact, if a problem statement is complicated, it merely reflects a poor understanding of the real problem. It is important that everyone on the team understands and agrees with the problem statement.
The problem statement must also not be biased toward a specific solution. A biased statement risks completely missing the real root cause or, at a minimum, missing some important contributing causes.
Data Collection
The final portion of the collection phase is the actual data collection. There are three common types of data:
- physical evidence,
- recorded evidence, and
- personal testimony.
The most critical aspect of collecting the physical evidence is to resist the urge to clean. Although it may seem desirable to provide clean, easy-to-handle samples to the various technical experts for review, the odds are that valuable data will be lost in the cleaning process. Cleaning these parts before the completion of the investigation will add uncertainty to the metallurgical analysis and may eliminate the evidence of a potential corrosive source. There are times when it is necessary to further damage evidence just to remove it from the scene. In this case, care should be taken not to disturb the actual damaged portions of the evidence.
In addition to the failed components, it is important to the investigation to provide good, undamaged components for study as well. Undamaged parts can also be important for extracting geometric information to be used in a computer simulation (finite element analysis [FEA], computational fluid dynamics [CFD], etc.). Depending on the type of failure, it may also be important to capture physical evidence such as hydraulic fluid, water samples, deposit samples, etc.
It is experience that will determine what other physical evidence needs to be retained. If there is doubt, it is certainly better to retain the samples. They can always be discarded once the investigation is over.
Recorded evidence is the next significant type of data to be collected by members of the team. Pictures are clearly necessary for the investigation. The tendency is to take too few pictures, because at the time, it seems impossible to forget what is being witnessed. However, experience at ENGINEERING SERVICES, L.P. has shown that there cannot be too many pictures. One good practice to keep in mind when taking pictures: for each detail picture, include a series of pictures that starts from a very large view and then gradually (perhaps in three steps) zooms into the desired level of detail. This technique is vital to maintaining perspective and orientation.
The other forms of recorded data (operator logs, supervisory control and data acquisition [SCADA] logs, pressure logs, daily well logs, etc.) can be critical to a complete understanding of operating conditions at the time of failure. These data assist in the analysis of the failure. Because these records are typically dated, they are ideal for generating a timeline of events. Therefore, it is critical to capture this data as soon as possible.
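As a rough illustration of how dated records can be turned into a timeline, the following Python sketch merges entries from several sources and sorts them by time. The Record structure, the build_timeline function, and the sample entries are all hypothetical and are not taken from any actual investigation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """One dated entry from a recorded data source (hypothetical structure)."""
    timestamp: datetime
    source: str
    note: str


def build_timeline(*sources):
    """Merge records from several sources into one chronological list."""
    merged = [record for source in sources for record in source]
    return sorted(merged, key=lambda record: record.timestamp)


# Hypothetical entries, for illustration only.
operator_log = [
    Record(datetime(2020, 1, 1, 8, 15), "operator log", "unusual vibration noted"),
    Record(datetime(2020, 1, 1, 9, 5), "operator log", "unit shut down"),
]
scada_log = [
    Record(datetime(2020, 1, 1, 8, 40), "SCADA", "pressure alarm"),
]

timeline = build_timeline(operator_log, scada_log)
for record in timeline:
    print(record.timestamp.isoformat(), record.source, record.note)
```

Keeping the source name on every entry makes it easier to trace a conflict between two records back to where each one came from.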
The other important category of data to collect is the personal testimony. Theoretically, since everyone involved is discussing the same event, all of the various stories should converge. If the information from the various personnel does not agree, it may be a sign of multiple failures. Obviously, there is significant potential for finger pointing, or at least perceived finger pointing, during this phase of data collection. To minimize this perception, it is important, first, that the interviews be conducted by a rational, cool-headed person. Sending in an irritated and irrational person to collect personal testimony will almost certainly degrade the quality of the testimony. Second, it is important to stay focused only on data collection, building a consistent timeline, etc. Any premature discussion of the cause of failure will likely adversely impact the interview process. ENGINEERING SERVICES, L.P. assumes that the client will participate in this activity.
It is up to the investigating team (ENGINEERING SERVICES, L.P.) to resolve all conflicts in the data, whether it is personal testimony, operator logs, etc. Unfortunately, due to the human influence, none of the data sources will be pristine. ENGINEERING SERVICES, L.P. will compare all of the data, filling in the gaps, and resolving the conflicts, so that a clear and consistent picture of the failure can be obtained.
Phase 2 Analysis
Building the Cause Chain
ENGINEERING SERVICES, L.P. will analyze the collected data to build the cause chain and determine the immediate, contributing, and root causes of the failure. The immediate cause is typically the first one in the cause chain, thus directly leading to the failure. The root cause is the last one in the cause chain, while the contributing causes are the ones in between the immediate and the root. Although the process is referred to as root cause failure analysis, it is important to identify all of the causes.
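The ordering described above can be sketched in a few lines of Python. This is only a minimal illustration: the classify_causes function and the example chain are hypothetical, and the chain is assumed to be listed from the failure backward to its origin.

```python
def classify_causes(cause_chain):
    """Split a cause chain (ordered from the failure backward) into
    immediate, contributing, and root causes."""
    if not cause_chain:
        raise ValueError("the cause chain must contain at least one cause")
    return {
        "immediate": cause_chain[0],              # directly leads to the failure
        "contributing": list(cause_chain[1:-1]),  # everything in between
        "root": cause_chain[-1],                  # last link in the chain
    }


# Hypothetical chain, for illustration only.
chain = [
    "seal leaked",
    "seal overheated",
    "cooling flow was restricted",
    "filter change was missing from the maintenance plan",
]

causes = classify_causes(chain)
print(causes["immediate"])
print(causes["contributing"])
print(causes["root"])
```

With a chain of only one cause, the same entry is reported as both the immediate and the root cause, and the list of contributing causes is empty.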
ENGINEERING SERVICES, L.P.’s experience indicates that there are several other key points to remember during the analysis phase.
Follow the data—the most difficult aspect of the analysis phase is avoiding preconceived notions regarding the root cause. It is up to the team members to protect each other from this trap. The investigation team must stick to the data and exclude “gut feel” from the investigation.
Consider both technical and organizational causes—finding the technical answer is often difficult, but the investigation should not stop there. Organizational influences can be just as significant and must also be included in the investigation. What appears to be operator error is most likely a broken process, missing checks, or unclear expectations.
Concentrate on analysis—save the problem solving for the next phase. The key at this point is to identify the immediate, contributing, and root causes.
The analysis phase is complete once the immediate, contributing, and root causes are identified. The root cause is dependent on the reach of the team. If the last contributing cause exists at a boundary that cannot be crossed (by either adding technical or organizational influence), then it is effectively the root cause. This is where the solution phase should focus. It is only of academic value to identify a root cause over which the team has no influence.
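The idea that the root cause depends on the reach of the team can be illustrated with a short Python sketch. It walks a cause chain from the failure backward and stops at the first cause the team cannot influence. The function name, the (description, area) pairs, and the sample areas are assumptions made for illustration only.

```python
def effective_root_cause(cause_chain, within_reach):
    """Return the deepest cause in the chain that the team can still
    influence, or None if even the first cause is out of reach."""
    effective = None
    for cause in cause_chain:
        if not within_reach(cause):
            break  # boundary that the team cannot cross
        effective = cause
    return effective


# Hypothetical chain of (description, responsible area) pairs,
# ordered from the failure backward.
chain = [
    ("seal leaked", "operations"),
    ("seal overheated", "operations"),
    ("cooling flow was restricted", "maintenance"),
    ("filter was out of specification", "supplier"),
]

team_areas = {"operations", "maintenance"}
result = effective_root_cause(chain, lambda cause: cause[1] in team_areas)
print(result)

# Adding a team member with influence over the supplier extends the reach.
wider_areas = team_areas | {"supplier"}
wider_result = effective_root_cause(chain, lambda cause: cause[1] in wider_areas)
print(wider_result)
```

The second call shows why adding a team member with the needed influence matters: the same chain then yields a deeper root cause.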
Phase 3 Solution
Breaking the Cause Chain
ENGINEERING SERVICES, L.P.’s fundamental objective in the solution phase is to break the cause chain. This means that the quality of the solution depends heavily on the quality of the cause chain developed in the analysis phase. Another important feature of the cause chain is that, since most failures are the result of both root and contributing causes, there are usually multiple areas that can be addressed. This is important to recognize in the solution phase, as it increases the number of possible solutions to the original problem. Preventing some of the contributing causes may also improve reliability in other areas not presently considered.
SUMMARY
The failures suggest both technical and organizational flaws. Although it is unreasonable to expect perfect performance with perfect reliability from these systems, it is just as unreasonable to allow the same failure to occur multiple times. Therefore, the objective of this project is to provide a practical approach to an RCFA and determine the appropriate corrective/preventive action recommendations necessary to avoid the same failure in the future.
RCFA starts with the COLLECTION PHASE, consisting of team forming, problem definition, and of course data collection. Next is the ANALYSIS PHASE, determining the immediate, contributing, and root causes of the defined problem. Finally, the SOLUTION PHASE consists of determining the appropriate corrective/preventive action plan that will effectively break the cause chain. In each phase of the process, there are critical steps and simple guidelines to consider that will keep the investigation focused and practical. These are the key characteristics of a successful RCFA that ENGINEERING SERVICES, L.P. is proposing.
From the very beginning of the investigation, ENGINEERING SERVICES, L.P. would like agreement on a Table of Contents for the final report. As data is collected and evaluated, it should immediately be incorporated into the final report, similar to a DATA BOOK. For example, in this industry, as equipment is built, the design and manufacturing information is added to a single volume referred to as a DATA BOOK. When the equipment is ready for shipment, the DATA BOOK is reviewed for completeness, and only then is the equipment released for shipment.

