A vicious ransomware attack hit the City of Atlanta in March 2018. It ended up being one of the costliest breaches in the last decade. Public services were disrupted. Departments were forced to do essential paperwork by hand. And it cost the city government $17 million.
When the dust settled, one thing was clear: The city hadn’t been prepared for this disaster. An audit done two months before the breach found 2,000 vulnerabilities in the city’s IT system. It was a good step, but it was too little, too late.
What happened in Atlanta is a cautionary tale with one big lesson—risk can never be eliminated, but it can be managed. This is especially true in maintenance. Equipment failure is inevitable. But knowing how to reduce the likelihood of failure, and how to react when it happens, is critical to success.
A failure mode and effects analysis (FMEA) is a tool for understanding and anticipating failure so you can limit its impact. In this article, you’ll learn:
- What an FMEA is
- The different types of FMEAs
- How to create an FMEA
- And how maintenance teams can use an FMEA
What is an FMEA?
A failure mode and effects analysis, or FMEA, identifies and documents all the ways a piece of equipment can fail and the potential impact of these failures. It outlines:
- Failure modes for individual components
- The consequences of failure on productivity and safety
- A plan for preventing or reacting to these issues
Building FMEAs is a key component of reliability-centered maintenance (RCM).
There are three main goals of an FMEA:
- Prevent future breakdowns by reducing the likelihood of common and critical failures using planned maintenance and standard operating procedures
- Reduce response times, decrease downtime, and improve health and safety when asset failure occurs
- Prioritize preventive and corrective maintenance in non-emergency situations
What are the different types of FMEA?
FMEAs can be categorized into subtypes based on the type of risk they’re assessing and the impact they have. Here’s a quick rundown of each type of FMEA:
- FFMEA (functional failure mode and effects analysis): An FFMEA analyzes the risks that affect the way a system functions. The goal of an FFMEA (sometimes called a system failure mode and effects analysis) is to prevent these failures before they happen.
- DFMEA (design failure mode and effects analysis): A DFMEA assesses the risks of an asset in the design stage. The purpose of this analysis is to find and correct potential issues with an asset before it’s deployed to increase its reliability, reduce the amount of maintenance needed, and extend the asset lifecycle.
- PFMEA (process failure mode and effects analysis): A PFMEA seeks out possible failures within a process. The difference between a PFMEA and other types of FMEAs is that it focuses on what can go wrong during the operation and maintenance of a system.
- FMECA (failure mode, effects, and criticality analysis): An FMECA (or criticality analysis) analyzes both failure modes and the level of risk associated with those failure modes.
What’s the difference between a failure mode and a failure code?
A failure mode is an error or defect that causes a system to malfunction. An example of a failure mode on a variable speed transfer conveyor might be a bearing seizing. A seized bearing will cause the conveyor to slow down or stop functioning.
A failure code is a failure mode represented by an alphanumeric tag. Failure codes are often used in CMMS software as a way to convey information quickly and sort or report on failure. Failure codes are usually supported by three pieces of contextual information. An example of a failure code looks like this:
What’s the difference between FMEA and FRACAS?
A failure mode and effects analysis outlines possible failure, its causes, and its impact. It’s a process that lists possible future incidents and their likely root causes. It’s a proactive measure.
A failure reporting, analysis, and corrective action system (FRACAS) is a closed-loop reporting system that analyzes failures that have already occurred. It examines past failures to find out why they happened and what impact they had so they can be prevented in the future. It’s a reactive measure.
Creating FMEAs for maintenance is a key part of building a FRACAS. In fact, it’s the first step of the process. An FMEA is a baseline for failure as well as team and equipment performance. You can base decisions, like what reports to create or what failures to target, on this information.
How you can use FMEAs for maintenance
There are three main ways maintenance teams can use a failure mode and effects analysis:
- To create a preventive maintenance schedule that reduces the likelihood of asset failure and optimizes resources
- To prepare for emergency maintenance so assets can be repaired quickly and downtime can be minimized
- To prioritize corrective maintenance and maintenance backlog
Using FMEAs to build a preventive maintenance schedule
There are three ways you can use FMEAs to run a world-class preventive maintenance program:
- Create new preventive maintenance tasks
- Prioritize preventive maintenance
- Optimize preventive maintenance
The first step in creating a preventive maintenance program is to understand what failures can occur and how often they occur. An FMEA outlines this information. For example, if a new asset is being designed, an FMEA allows you to figure out what PMs are necessary to prevent possible failure modes and how often they need to be done. That will allow you to map inputs that go into creating new PMs, including who’ll be assigned to the work, what’ll trigger the work, what it’ll cost, and how long it’ll take.
The success of your preventive maintenance program depends on both how many failures you find and stop and the impact of those failures. If you prevent 100 small breakdowns, but don’t catch the five or six failures that cost your company millions of dollars, your program is flawed. An FMEA has all the information you need to prioritize PMs and target the most likely and disruptive breakdowns.
While FMEAs give you a baseline for creating a preventive maintenance schedule, your plans won’t last forever. Your operation is changing all the time. Your PM schedule needs to change with it. Using work order and repair histories to update FMEAs helps you optimize your schedule and keep pace with other changes. For example, a failure mode may not happen as frequently as you predicted. This data may lead you to reduce the frequency of the PM meant to prevent this failure mode. You can then put those resources toward another maintenance task.
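One way to picture this review is a short Python sketch that compares the failures recorded in work orders against what the FMEA predicted for the same period. Everything here is hypothetical: the sample history, the predicted counts, the `review_pm_frequency` helper, and the rule that more failures than predicted call for more frequent PMs (the text above only describes the opposite case).

```python
from collections import Counter

# Hypothetical work order history: one entry per recorded failure, tagged by failure mode.
work_orders = [
    "bearing misalignment",
    "bearing corrosion",
    "bearing misalignment",
    "nozzle clog",
    "nozzle clog",
]

# Hypothetical FMEA predictions: expected failures over the same period.
predicted = {"bearing misalignment": 2, "bearing corrosion": 4, "nozzle clog": 1}


def review_pm_frequency(history, expected_counts):
    """Compare observed failure counts with the FMEA's predictions."""
    observed = Counter(history)
    suggestions = {}
    for mode, expected in expected_counts.items():
        actual = observed.get(mode, 0)
        if actual < expected:
            suggestions[mode] = "consider less frequent PMs"
        elif actual > expected:
            # Assumption: the reverse case calls for more frequent PMs.
            suggestions[mode] = "consider more frequent PMs"
        else:
            suggestions[mode] = "keep current schedule"
    return suggestions


for mode, suggestion in review_pm_frequency(work_orders, predicted).items():
    print(f"{mode}: {suggestion}")
```

A result like this is a prompt for discussion with technicians and operators, not an automatic schedule change.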
Using FMEAs to prepare for emergency maintenance
No amount of maintenance will totally eliminate failure. The best you can do is plan for high-risk, high-impact breakdowns so your team can fix them in one hour instead of two. An FMEA is a valuable tool for putting these emergency measures in place.
Start by looking at failures that have the biggest impact and happen most often. From this list, pick out failure modes that are hard to detect. You’ll end up with failure modes that are hard to spot and cause the biggest mess. Build an emergency response plan for these breakdowns.
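The two-stage filter described above could be sketched in Python as follows. This is only an illustration: the `FailureMode` record, the threshold of 7, and the sample scores are assumptions, and the sketch assumes a higher detection score means the failure is harder to spot.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int   # 1-10, impact on production and safety
    frequency: int  # 1-10, how often the failure occurs
    detection: int  # 1-10, assumed here: higher = harder to detect


def emergency_plan_candidates(modes, impact_threshold=7, detection_threshold=7):
    """Keep high-impact, frequent failure modes, then those that are hard to detect."""
    high_risk = [
        m for m in modes
        if m.severity >= impact_threshold and m.frequency >= impact_threshold
    ]
    return [m for m in high_risk if m.detection >= detection_threshold]


# Made-up scores for illustration only.
modes = [
    FailureMode("bearing seizure", severity=9, frequency=8, detection=8),
    FailureMode("nozzle clog", severity=4, frequency=9, detection=3),
    FailureMode("gearbox wear", severity=8, frequency=7, detection=2),
]

for mode in emergency_plan_candidates(modes):
    print(mode.name)  # bearing seizure
```

The thresholds are a judgment call. A team might lower them for critical assets or raise them if the resulting list is too long to plan for.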
Your emergency response plan should include any information that reduces response and repair times. It should also take health and safety into account. This might include the following:
- Kitting parts to cut down on time spent retrieving critical spares and personal protective equipment
- Creating a detailed task list or troubleshooting tips
- Attaching diagrams, manuals, photos, and other visual aids to work orders
- Outlining a list of technicians or contractors that can complete the repair
- Establishing a way to communicate with technicians quickly, like CMMS software
Using FMEAs to prioritize corrective maintenance and maintenance backlog
Detecting failure early is helpful, but it doesn’t mean anything if you don’t have a process for correcting that failure quickly and effectively. An FMEA helps you build this process.
The first step is to identify failure modes with a high severity score (i.e., failures on assets that’ll cost your company the most if they go down). Corrective action should be done on this equipment as soon as possible. This list will allow you to build training materials and response plans so everyone is aware of how to react to failure.
You can employ a similar approach when prioritizing maintenance backlog, except for one extra step. After ranking deferred work by severity, look at leftover work by failure frequency. Compare this to how late the work is. If a failure mode has a frequency rate exceeding the number of missed inspections, this work should take priority as the likelihood of failure is higher.
How to create an FMEA
The FMEA template below will help you spot risk in your operation and take action to prevent it.
But first, here’s how to get data for your FMEA
Good FMEAs depend on good data. Without data, you’ll build your maintenance program on guesswork and assumptions. But how do you find the required information for an FMEA? The three sources below give you a great foundation:
- OEM guidelines: This is your starting point. These guidelines give you a baseline for filling in an FMEA if you have no other data.
- Interviews with operators and technicians: Tap into the experience of those who work with equipment every day. They’ll give you insight you can’t find elsewhere, like if a component needs twice as much lubrication as suggested or if the frequency of a failure has increased because machine specs have changed.
- Work order data: Your work orders reveal how equipment is performing and are a great source of information for tweaking and improving your FMEAs. Look for common failures, what actions were taken to find and fix the root cause, what delayed a repair, and how easy it was to detect a failure.
None of these sources work alone. Combine them to get a full picture of how your equipment operates, how it can fail, the impact of failure, and what should be done about it.
An FMEA template
1. Identify asset components
Document each asset component that can break or degrade. For example, the components of a bottling line may include gearboxes, motors, sprockets, bearings, and nozzles.
Start with your most critical equipment and work down from there. This is a great time to create clear naming conventions and an asset hierarchy if you don’t have these already.
2. Identify potential failure modes
It’s time to identify how those components can fail. If a single component has multiple failure modes, list each failure mode accordingly. For example, a bearing’s failure modes can include misalignment, corrosion, or contamination.
3. List potential effects of failure
Describe the result of a failure, and how it affects production and the safety of staff. For example, a misaligned bearing will shut down a line until it can be replaced (about three hours), with a potential loss of 1,800 units.
4. Severity score
This is a measurement of a failure’s impact on production and safety. It’s scored on a scale of 1 to 10 with 1 being a low-impact event and 10 being a high-impact event. Account for the state of the asset when scoring. For example, a car whose tire blows out at low speed will experience minor steering issues, but a blowout at high speed is far more dangerous.
5. List potential causes
List all possible reasons a failure may have occurred. Go beyond a direct cause. For example, a corroded bearing may happen because supplies were mislabeled or instructions were unclear, leading to improper lubrication.
6. Expected frequency score
This is a measure of how common a failure mode is. It’s scored on a scale of 1 to 10 where 1 represents an event that rarely occurs and 10 represents an event that occurs very frequently.
7. List current process controls
Document all the measures in place to prevent or detect a failure. Process controls may include weekly preventive maintenance inspections, monthly parts replacements, and the use of sensors to detect dangerously high levels of vibration.
8. Detection score
This number rates how difficult it is to detect an issue before it causes total failure. It’s scored on a scale of 1 to 10. A score of 1 is given to an event that can be detected almost every time. A score of 10 is given to an event that can never be detected. For example, a flat tire can sometimes be detected in its early stages, so it would score a 5. A chipped windshield is often caused by unpredictable events, which means it’s difficult to detect and would score a 9.
9. Risk priority number
The risk priority number (RPN) identifies the failure modes that have the highest impact and are the most preventable. To find the RPN, multiply the severity, frequency, and detection scores. For example, if a failure mode has a severity score of 8, a frequency score of 5, and a detection score of 10, the RPN would be 400. The higher the number, the more resources should be put into preventing that failure.
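The multiplication can be sketched in a few lines of Python. The function name `risk_priority_number` and the range check are assumptions made for this illustration, not part of any standard tool.

```python
def risk_priority_number(severity: int, frequency: int, detection: int) -> int:
    """Multiply the three 1-10 scores to get the RPN (hypothetical helper)."""
    scores = {"severity": severity, "frequency": frequency, "detection": detection}
    for name, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{name} score must be between 1 and 10, got {score}")
    return severity * frequency * detection


# The example from the text: severity 8, frequency 5, detection 10.
print(risk_priority_number(8, 5, 10))  # 400
```

Because each score runs from 1 to 10, the RPN falls between 1 and 1,000.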
10. Determine recommended action
Establish a plan for reducing the likelihood of failure or increasing the chances of early detection. This can include increasing the frequency of PMs on a component or investing in condition-monitoring equipment.
| Column | Question to answer |
| --- | --- |
| Asset component | What component is affected? |
| Potential failure mode | In what ways can this component fail? |
| Potential failure effects | What is the impact on production and/or safety from this failure? |
| Severity score | How severe is the impact of the failure? |
| Potential causes | What could have caused the component to fail? |
| Expected frequency score | How often is the cause expected to occur? |
| Current process controls | What are the existing controls that either prevent or detect the failure mode? |
| Detection score | How likely is it that the failure mode and its cause will be detected and identified? |
| Risk priority number | What is the total risk of this failure mode (Severity x Frequency x Detection)? |
| Recommended action | What are the actions for reducing the occurrence of the cause or improving its detection? |
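If you keep the worksheet in a script or spreadsheet export rather than on paper, the fields from steps 1 to 10 map onto a simple record. The Python sketch below is one possible layout, not a prescribed format: the `FmeaRow` class, its field names, and the sample rows and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    component: str
    failure_mode: str
    effects: str
    severity: int    # 1-10
    causes: str
    frequency: int   # 1-10
    controls: str
    detection: int   # 1-10
    recommended_action: str

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x frequency x detection."""
        return self.severity * self.frequency * self.detection


# Made-up rows for illustration only.
rows = [
    FmeaRow("Bearing", "Corrosion", "Gradual slowdown of the line", 6,
            "Improper lubrication", 3, "Monthly lubrication", 7,
            "Relabel lubricant supplies"),
    FmeaRow("Bearing", "Misalignment", "Line shuts down until replaced", 8,
            "Improper installation", 5, "Weekly PM inspection", 4,
            "Add vibration sensor"),
]

# Rank the worksheet so the highest-risk failure modes come first.
for row in sorted(rows, key=lambda r: r.rpn, reverse=True):
    print(f"{row.component} / {row.failure_mode}: RPN {row.rpn}")
```

Sorting by RPN gives a quick first pass at priorities. Severity on its own is still worth checking, since a rare but dangerous failure can end up with a modest RPN.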
How to manage FMEAs
FMEAs are living documents that should be regularly reviewed and updated. Here are some events that could trigger a review of an FMEA:
- A new asset is designed or installed at your facility
- A new technician or operator joins the team
- A change is made to a machine’s operating mode (i.e., it’s run more often or specs change)
- A failure mode is occurring more frequently
- New technology is implemented that helps you detect or prevent failure more easily
- You find a new failure mode or reason for an existing failure
- The impact of failure changes (e.g., a new product using more expensive material is being produced)
Both maintenance and operations staff should be involved in modifying and adding to a failure mode and effects analysis. The diversity of perspectives and experiences with equipment helps to avoid gaps in your FMEAs.
FMEAs are a long-term investment in success
A failure mode and effects analysis is not a band-aid fix or a troubleshooting tool. It’s a continuous activity with the goal of preventing failure where possible and mitigating its effects when it’s not. It’s a planning resource and a safeguard against financial loss and safety risks.
While creating FMEAs involves a considerable time investment, it will pay you back in the long term by helping you plan ahead, prevent reactive maintenance, and track team success.