System availability (also known as equipment availability or asset availability) is a metric that measures the probability that a system is not failed or undergoing a repair action when it needs to be used.
There are three qualifications that need to be met for a system to be available:
Functioning equipment: Not out of service for repairs or inspections
Functioning under normal conditions: Operates in an ideal setting at an expected rate
Functioning when needed: Operational at any time production is scheduled
System availability is used to gauge if an asset’s production potential is being maximized, which has a direct impact on the financial health of a business.
System availability is calculated by dividing uptime by the total sum of uptime and downtime.
Availability = Uptime ÷ (Uptime + Downtime)
For example, let’s say you’re trying to calculate the availability of a critical production asset. That asset ran for 200 hours in a single month. It also had two hours of unplanned downtime because of a breakdown and eight hours of planned downtime for weekly preventive maintenance tasks (PMs). That equals 10 hours of total downtime.
Here is how to calculate the availability of that asset:
Availability = 200 ÷ (200 + 10)
Availability = 200 ÷ 210
Availability = 0.952
Availability = 95.2%
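The same calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a standard implementation: the function name `availability` and its parameter names are assumptions made for this example, and the inputs are the figures from the example above.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total_hours = uptime_hours + downtime_hours
    if total_hours <= 0:
        raise ValueError("uptime plus downtime must be positive")
    return uptime_hours / total_hours


# Figures from the example: 200 hours of uptime, 2 hours of unplanned
# downtime, and 8 hours of planned downtime for weekly PMs.
unplanned_hours = 2.0
planned_hours = 8.0
result = availability(200.0, unplanned_hours + planned_hours)
print(f"Availability = {result:.1%}")  # Availability = 95.2%
```

Keeping planned and unplanned hours as separate inputs until the last step makes it easier to reuse the same numbers when investigating where downtime comes from.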
World-class availability is considered to be 90% or higher.
System availability has a direct impact on the bottom line. When equipment is running as much as possible, it means more products are made and more money is made. In other words, when system availability is high, revenue is also likely to grow.
Because availability is so tied to the financial health of a company, it is commonly used as a key business metric in production-heavy organizations. However, it’s also heavily connected to what several other departments do, including maintenance. Availability is impacted by reliability and maintainability, which are influenced by the processes and tools of the maintenance team. Therefore, availability is used to measure and investigate the effectiveness of these processes and tools, and how they can be improved.
Downtime has the biggest impact on availability and is something maintenance has a lot of control over. Downtime can be broken down into planned vs. unplanned and frequency vs. length. Each component can be further broken down until an anomaly is identified. Once issues are pinpointed, they can be addressed to improve availability.
It’s easy to see which type of downtime (unplanned or planned) is causing an issue with availability.
If unplanned downtime makes up the lion’s share of total downtime, you can start to analyze what is causing this unplanned downtime. It may be due to a lack of preventive maintenance, the age of the machine, or even a severe case of pencil whipping.
If planned downtime seems to be dragging availability downward, you can investigate how your PMs can get more efficient. Are you constantly waiting on parts? Are regular inspections taking longer because there are no checklists or SOPs available? How about the frequency of your PMs — can the asset function properly with fewer routine checkups?
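One way to see which type of downtime dominates is to total the hours per category and compare the shares. The sketch below assumes a simple downtime log of `(category, hours)` pairs. The log format, the function name `downtime_share`, and the split of the eight PM hours into four two-hour entries are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical downtime log for one month: (category, hours).
# The split of planned downtime into four 2-hour PMs is assumed.
events = [
    ("unplanned", 2.0),
    ("planned", 2.0),
    ("planned", 2.0),
    ("planned", 2.0),
    ("planned", 2.0),
]


def downtime_share(events):
    """Return each category's share of total downtime as a fraction."""
    totals = defaultdict(float)
    for category, hours in events:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: hours / grand_total for category, hours in totals.items()}


shares = downtime_share(events)
for category, share in sorted(shares.items()):
    print(f"{category}: {share:.0%}")
```

With these assumed figures, planned downtime accounts for most of the total, which would point the investigation toward PM efficiency rather than breakdowns.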
The same logic applies to the frequency and length of downtime. If an asset breaks down a lot, but is fixed quickly, you can focus your efforts on finding why failure is occurring so often, such as too few PMs, age, or a broken PM process. It’s also possible that you may be doing too much preventive maintenance on an asset.
If an asset isn’t down as often, but takes a long time to fix or inspect, it’s time to take a closer look at your maintenance processes. There are dozens of different ways preventive and reactive maintenance can get more efficient. For example, if technicians have to keep walking back and forth from an office to an asset to retrieve paper files, it can cost precious minutes or even hours. If there’s a lack of failure codes, or if they aren’t clear, this can prolong downtime and shrink availability.
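The frequency vs. length comparison can be sketched the same way. The example below uses made-up stoppage durations for two hypothetical assets that lose the same total hours in very different patterns. The function name `frequency_and_length` and all of the data are assumptions for illustration.

```python
from statistics import mean


def frequency_and_length(durations_hours):
    """Return (number of stoppages, mean hours per stoppage)."""
    if not durations_hours:
        return 0, 0.0
    return len(durations_hours), mean(durations_hours)


# Hypothetical stoppage durations in hours.
asset_a = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # fails often, fixed quickly
asset_b = [4.0]                                      # fails rarely, slow to fix

for name, durations in (("Asset A", asset_a), ("Asset B", asset_b)):
    count, average = frequency_and_length(durations)
    print(f"{name}: {count} stoppages, {average:.1f} h each, {sum(durations):.1f} h total")
```

Both assets lose four hours in total, so they have the same effect on availability, but Asset A calls for a look at why failures keep occurring while Asset B calls for a look at the repair process.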
System availability and asset reliability are often used interchangeably, but they refer to different things. System availability is affected by both planned and unplanned downtime. Asset reliability, by contrast, refers to the probability of an asset performing without failure under normal operating conditions over a given period of time. It does not account for planned downtime.
For example, an asset that never experiences unplanned downtime is 100 percent reliable, but if it is shut down for one hour out of every 10 for routine maintenance, it is only 90 percent available. System availability and asset reliability go hand in hand: a more reliable asset has less unplanned downtime, so it tends to be more available.
Another factor that impacts system availability is maintainability, which refers to how quickly technicians can detect and locate a fault and restore asset functionality. Just like with asset reliability, the higher the maintainability, the higher the availability. This characteristic is commonly measured using a KPI called mean time to repair (MTTR). MTTR is a maintenance metric that measures the average time required to troubleshoot and repair failed equipment. It reflects how quickly an organization can respond to unplanned breakdowns and repair them.
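As a rough sketch, MTTR can be computed as total repair time divided by the number of repairs. The repair durations below are invented for illustration, and the function name `mean_time_to_repair` is an assumption, not part of any particular CMMS.

```python
def mean_time_to_repair(repair_hours):
    """Return the average hours spent per repair."""
    if not repair_hours:
        raise ValueError("at least one repair is needed to compute MTTR")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical repair durations, in hours, for three unplanned breakdowns.
repairs = [1.5, 0.5, 4.0]
mttr = mean_time_to_repair(repairs)
print(f"MTTR: {mttr:.1f} hours")  # MTTR: 2.0 hours
```

Tracking this number over time shows whether changes to maintenance processes are actually shortening repairs.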
System availability problems can happen when you least expect them or at the most inconvenient time. What’s worse is that some of the most serious system availability problems can originate from preventable or originally benign sources. No amount of testing will find all preventable issues, but there are several ways to improve system availability to avoid unexpected downtime and costly repairs. We’ve highlighted five ways to build a system and identify problems for optimized system availability.
Everything fails at some point, so the best way to optimize system availability is to plan for when and how your assets will fail. When building your system, consider availability concerns during all aspects of your system design and construction.
For example, what design constructs and patterns have you considered or are you using that will help improve the availability of your equipment? What do you do when a component you depend on fails? How do you retry? What do you do if the problem is an unrecoverable (hard) failure, rather than a recoverable (soft) failure?
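For systems with a software or controls layer, one possible answer to the retry question is to retry soft failures a limited number of times and escalate hard failures immediately. The sketch below illustrates that idea under stated assumptions: the exception classes `SoftFailure` and `HardFailure`, the helper `call_with_retry`, and the simulated `flaky_read` operation are all hypothetical names invented for this example.

```python
import time


class SoftFailure(Exception):
    """Recoverable problem that is worth retrying (hypothetical)."""


class HardFailure(Exception):
    """Unrecoverable problem where retrying will not help (hypothetical)."""


def call_with_retry(operation, max_attempts=3, base_delay_s=0.01):
    """Run operation, retrying soft failures with a growing delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except HardFailure:
            raise  # escalate immediately, do not retry
        except SoftFailure:
            if attempt == max_attempts:
                raise  # out of attempts, give up
            # Wait longer after each failed attempt before trying again.
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Simulated component that fails twice, then recovers.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SoftFailure("sensor timeout")
    return "ok"


outcome = call_with_retry(flaky_read)
print(outcome, "after", attempts["count"], "attempts")
```

Separating the two failure types keeps a hard failure from wasting time on retries that cannot succeed, so the problem reaches a technician sooner.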
Keeping a system highly available requires removing the risk of the system failing. In many situations, the reason for the failure could have been identified beforehand as a risk and addressed accordingly. Identifying risk is one of the best ways to ensure availability.
Keeping a system available requires removing risk. But as systems become larger and more complicated, it becomes more challenging and time-consuming to proactively identify and address every risk. Keeping a large system available should therefore focus more on risk management and mitigation. That means understanding what your risks are, deciding how much risk is acceptable, working out what you can do to mitigate it, and knowing what to do when a problem occurs.
It’s difficult to know if there is a problem in your system unless you can see the consequences of the problem. Make sure your assets are properly tested and monitored so that you can see how they perform from internal and external perspectives throughout the production process.
Monitoring systems aren’t much use if action isn’t taken to fix the issues identified. To be most effective in maintaining system availability, establish processes and procedures that your team can follow to help diagnose issues and easily fix common failure scenarios. For example, if an asset becomes unresponsive, you might have a set of steps for workers to go through that might include tasks such as running a test to help diagnose where the problem is or rebooting the equipment.
Having standard processes in place for handling common failure scenarios will decrease the amount of time your system is unavailable. Additionally, they can provide useful follow-up diagnosis information to your engineering teams to help them deduce the root cause of common ailments.
Preventive maintenance is regular and routine maintenance performed on physical assets to reduce the chances of equipment failure and unplanned machine downtime. Effective preventive maintenance is planned and scheduled based on real-time data insights, often using software like a CMMS.
Having a solid preventive maintenance program in place reduces asset failures and the need to take equipment out of production unexpectedly. You can optimize preventive maintenance processes by identifying and prioritizing tasks and determining how often they should be performed to maximize asset and system availability.
A big part of your business’s bottom line revolves around system availability. Although asset availability is bigger than maintenance, knowing how your team can influence this maintenance metric is incredibly important to keeping equipment working and production on schedule. Doing a system availability analysis allows you to explore new ways to decrease downtime and make your operation more efficient.