How to interpret reliability charts

Reliability Engineering

Reliability engineering has a statistical foundation, and although reliability engineers need to master the basic tenets of statistics, they tend to analyse and interpret reliability data visually, in charts. A statistician, by contrast, may rely more on formal formulas and proofs. The history matters too: reliability engineering was developed before cheap computing was widely available.

The historical method reliability engineers used for fitting data to a Weibull distribution was “ranked regression”: manually plotting failure events on graph paper and using visual methods to determine goodness of fit. Statistically, the MLE (Maximum Likelihood Estimation) method usually provides superior results, especially for larger data samples. In general, the statistician would prefer MLE with a formulaic goodness-of-fit test. However, the MLE calculation is too long and tedious to be usefully conducted by hand, and so was not extensively used until cheaper personal computers became available. A third general Weibull fitting method exists, called Method of Moments (MM), but it has the disadvantage that it cannot include censored data.
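To make the ranked regression idea concrete, here is a minimal sketch for complete (uncensored) data. It assumes Bernard's approximation for the median ranks and regresses the plotted y values on ln(age); both are common choices rather than the only ones, and the function names are illustrative.

```python
import math

def median_ranks(n):
    """Bernard's approximation to the median rank of the i-th of n ordered failures."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def rank_regression(failure_ages):
    """Fit a 2-parameter Weibull by least squares on the probability plot.

    Regresses y = ln(-ln(1 - F)) on x = ln(age). The slope is the shape and
    the intercept is -shape * ln(scale). Complete (uncensored) data only.
    """
    ages = sorted(failure_ages)
    xs = [math.log(t) for t in ages]
    ys = [math.log(-math.log(1.0 - f)) for f in median_ranks(len(ages))]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    shape = sxy / sxx
    scale = math.exp(mx - my / shape)
    r_squared = sxy ** 2 / (sxx * syy)  # straightness of the plotted points
    return shape, scale, r_squared
```

The returned R-squared is only a rough numerical stand-in for the "by eye" straightness check on graph paper.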

Censored data arises where a component has been replaced with a new one after accumulating operating age without failing. This may occur when components are changed as part of a planned replacement task in a Preventative Maintenance regime. Adding censored data to failure data during Weibull fitting contributes evidence of the survivor age achieved, which leads to a more representative fit. Each change event must be marked as either a failure or censored so that the fitting algorithm can treat it correctly.
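As a sketch of how the failure/censored marking feeds into a fit, the function below estimates a 2-parameter Weibull by maximum likelihood with right-censored events. It is an illustration under simple assumptions (bisection on the standard profile-likelihood equation for the shape, with a fixed search bracket), not the algorithm of any particular tool, and the names are hypothetical.

```python
import math

def weibull_mle(ages, is_failure, lo=0.05, hi=50.0, tol=1e-9):
    """MLE for a 2-parameter Weibull with right-censored data.

    ages: age at each change event (failure or planned removal).
    is_failure: True for a failure, False for a censored event.
    """
    fail_logs = [math.log(t) for t, failed in zip(ages, is_failure) if failed]
    r = len(fail_logs)
    if r < 2:
        raise ValueError("need at least two failure events")
    mean_fail_log = sum(fail_logs) / r
    t_max = max(ages)  # ages are rescaled by t_max to avoid overflow

    def score(beta):
        # Profile-likelihood equation for the shape; the sums run over ALL
        # events, so censored ages contribute their survival evidence.
        w = [(t / t_max) ** beta for t in ages]
        weighted_log = sum(wi * math.log(t) for wi, t in zip(w, ages)) / sum(w)
        return weighted_log - 1.0 / beta - mean_fail_log

    if score(lo) > 0 or score(hi) < 0:
        raise ValueError("shape estimate lies outside the search bracket")
    while hi - lo > tol:  # score() increases with beta, so bisection works
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = t_max * (sum((t / t_max) ** shape for t in ages) / r) ** (1.0 / shape)
    return shape, scale
```

Calling it once with failures only and once with some survivor ages added shows how censored events pull the estimated scale upwards.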

There are also robust statistical algorithms to test goodness of fit, such as the Anderson-Darling or Hollander-Proschan tests, but an expert’s visual check of a Weibull probability chart is often more intuitive to non-statisticians. We are not saying these goodness-of-fit algorithms should not be used, merely that visual inspection of the charts is powerful.

This blog provides some insights into how to read and interpret reliability charts derived from fitting a Weibull distribution to failure data. We will also cover how to interpret Crow AMSAA charts and how these may be used in lieu of Weibull analysis when the number of failure events is too small to fit a Weibull confidently. Although Weibull is the most widely used, other distributions exist: the discrete distributions include binomial and Poisson, and the continuous distributions include exponential, Weibull, log-normal, chi-squared and gamma.

The discrete distributions are rarely used, but may be applied to problems such as the probability of picking healthy components from a store that is known to contain defective components (especially in electronics). Detailed use of discrete distributions will not be discussed here.

The Weibull distribution was popularised by Waloddi Weibull's 1951 paper and is particularly good for fitting age-failure data. It generalises the exponential distribution and, when the shape parameter is below 1, has a “thick” right-hand tail, meaning failures may occur much further from the middle of the distribution than they would under a Normal (Gaussian) distribution. Where the Weibull ‘shape’ parameter is equal to 1, the distribution is the exponential distribution. Weibull also comes close to approximating the normal distribution when the shape parameter is about 3.6.

The types of charts we apply to Weibull analysis, and the insights we gain from them, are broadly analogous to those for other continuous distributions. We will consider some typical failure data sets and use the Weibull probability plot, the probability density function (PDF) and the cumulative distribution function (CDF).
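For reference, the two curves behind the density and cumulative charts can be written down directly. This is a small sketch of the standard 2-parameter Weibull formulas (the function names are illustrative), with a check that a shape of 1 reduces to the exponential distribution.

```python
import math

def weibull_pdf(t, shape, scale):
    """Probability density at age t (t > 0)."""
    z = t / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))

def weibull_cdf(t, shape, scale):
    """Cumulative fraction expected to have failed by age t."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# With shape = 1 the CDF matches an exponential distribution with mean = scale.
for t in (10.0, 100.0, 300.0):
    exponential_cdf = 1.0 - math.exp(-t / 100.0)
    assert abs(weibull_cdf(t, 1.0, 100.0) - exponential_cdf) < 1e-12
```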

Another historical barrier to applying statistical methods to failure analysis is that data acquisition can be difficult and expensive. However, many complex machines now have digital controls and sensing systems, and planned or corrective changes of defective parts are recorded in Computerised Maintenance Management Systems (CMMS). The cost of data acquisition is now lower than before. IronMan® solves this issue through automating data collection and graph output.

If we take a set of failure data for a component where we know the age of the component at the time of failure, we could plot a histogram to get an idea of the distribution. We can use either a “frequency histogram” or a “probability density histogram”, where the bar heights are scaled so that the total area of the bars equals 1.

In this blog we will use probability density histograms along with a continuous probability density function. See the left-hand examples in figure 2. The number of failures and the choice of the number of ‘bins’ are important factors as too many or few bins will mask the general shape of the distribution. A general formula you may apply is the Sturges rule:

Number of Bins = 1 + 3.322 * log10(N), where N is the number of failures.
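A short sketch of the bin rule and the density scaling follows. Since 3.322 is approximately 1/log10(2), the rule is equivalent to 1 + log2(N); rounding up to a whole number of bins and using equal-width bins are assumptions made here for illustration.

```python
import math

def sturges_bins(n_failures):
    """Sturges rule, 1 + log2(N), rounded up to a whole number of bins."""
    return math.ceil(1 + math.log2(n_failures))

def density_histogram(ages, n_bins):
    """Equal-width histogram scaled so the total bar area equals 1.

    Assumes the ages are not all identical (bin width must be > 0).
    """
    lo, hi = min(ages), max(ages)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for t in ages:
        idx = min(int((t - lo) / width), n_bins - 1)  # put the max age in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    densities = [c / (len(ages) * width) for c in counts]
    return edges, densities

print(sturges_bins(20))  # bins suggested for 20 failure events
```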


Basic Weibull Charts

There are three basic cases for using Weibull estimates on failure data for helping decide which type of maintenance tasks are applicable. We will use failure data with the same Weibull Scale parameter, but vary the Weibull Shape Parameter to see the differences.
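If you want to generate comparable data sets yourself, the standard library can sample from a Weibull directly. This sketch uses `random.weibullvariate(alpha, beta)`, where alpha is the scale and beta is the shape; the seed and helper name are arbitrary assumptions, so the samples will not reproduce the figures in this blog.

```python
import random

def sample_failures(shape, scale, n=20, seed=1):
    """Draw n failure ages from a 2-parameter Weibull, sorted by age."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return sorted(rng.weibullvariate(scale, shape) for _ in range(n))

# Same scale, three shapes: premature, random and wear-out patterns.
for shape in (0.8, 1.0, 3.0):
    ages = sample_failures(shape, scale=100.0)
    print(shape, round(min(ages), 1), round(max(ages), 1))
```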


Premature Failure

20 data points randomly sampled from a 2 parameter Weibull with Shape = 0.8, Scale = 100 are shown in Figure 1.

This first example shows a component that is suffering from Premature Failure (sometimes called Infant Mortality). This is where the Weibull Shape parameter is < 0.9. The failures are occurring much earlier than we or the manufacturer expected. The standard maintenance response for components suffering premature failure is to:

Temporarily stop any planned removal PM maintenance that may be in force as this is likely to be ineffective, given that the components are likely to fail long before any maintenance is initiated.
Initiate Root-Cause Analysis (RCA). The causes of premature failure are likely to be quality issues such as overloading or bad maintenance practices. Less frequently, premature failure may indicate a manufacturing or materials fault, and warranty claims may be appropriate.

If the consequences of failure are high, it may be desirable to see if frequent “on condition” maintenance is applicable to contain the problem (sometimes called Palliative actions).


Figure 1: Premature failure pattern

If we look at the Weibull probability plot for the 20 failure events, by eye we see a reasonable fit, and the estimated alpha (scale) and beta (shape) are reasonably close to those of the original distribution we sampled from. We notice that the confidence bands are broad, but with an increasing number of data points the bands would get narrower.

Looking at the probability density function, we can see that the data is squeezed into the bottom left-hand corner. Note that the x axis extends beyond 1500 time units. This is because the Weibull right-hand tail, when the shape is < 1, is relatively thick. Even though the age of all of the failure events is less than 500 (see the top right event on the probability chart), the nature of the distribution allows a small minority of events of relatively extreme age.

This thick-tailed property of the distribution is important to remember. Sometimes we will notice that a small number of our failure events (or ages at planned removal) seem to have unreasonably long lives, especially if they are multiples of the underlying scale. This may skew the Weibull fit towards a premature failure pattern. If we observe this, as reliability engineers we should be suspicious and consider whether we have missed a failure or change event for those components with extremely large ages. It is likely to be an input data quality issue and we should investigate. In the short term, we could delete unreasonably long age events and refit the data.

We do need to keep in mind that if we are overzealous in excluding what we think are erroneous outlier data, we may be guilty of trying to make the data fit our picture of what we think it should be. Investigating the possible cause of data errors is a must to justify this action. It emphasises the importance of data quality, especially for components that have significant impacts of failure.

An example of a premature failure mode is misalignment of rotating equipment, introduced when a pump or motor is changed or through errors in the alignment procedure between an engine and shaft. This can lead to uneven loading or, in the worst case, violent oscillations and excessive forces. Small misalignments between rotating machines can lead to significant reductions in machine life.


Random failure (with a constant hazard rate)

20 data points randomly sampled from a 2 parameter Weibull with Shape = 1, Scale = 100 are shown in Figure 2.

This example is similar in look to a premature failure, but has a shorter and thinner right-hand tail (premature extends to 1500 and random to 600) in the probability density distribution compared to the premature failure example.

Figure 2: Random Failure pattern (Hazard rate is constant)

The fit of the data and width of the confidence indicators seem to be reasonable with the number of events. The estimation of shape being 1.019 sits inside our boundary of random failure patterns (0.9 < shape < 1.3). Notice that the scale of the x axis for the probability density is limited to 600, a significant reduction from the premature chart.

Weibull distributions with a reasonable fit and a shape parameter between 0.9 and 1.3 have an approximately constant hazard or failure rate, so the likelihood of failure is roughly constant throughout life. This could be caused by environmental stressors or operational factors that occur randomly. The only candidate maintenance task that should be applied is “on condition” or predictive maintenance. Scheduled replacement or repair only makes sense if there is a useful life followed by an increase in the likelihood of failure, which is not present with a constant failure rate.
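The link between the shape parameter and the hazard rate can be checked numerically. Below is a minimal sketch of the standard Weibull hazard function (the function name is illustrative); the hazard is exactly constant only at a shape of 1, and the 0.9 to 1.3 band above is a practical tolerance around that.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at age t (t > 0) for a 2-parameter Weibull."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

# Compare the three patterns at a few ages, all with scale = 100.
for shape in (0.8, 1.0, 3.0):
    rates = [weibull_hazard(t, shape, 100.0) for t in (10.0, 50.0, 100.0, 150.0)]
    print(shape, [round(r, 4) for r in rates])
```

The hazard falls with age for a shape of 0.8, stays flat for 1.0 and rises for 3.0, which is why scheduled replacement only helps in the last case.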

An example of random failure is a rolling element bearing that is part of a motor providing power to a pump. If the motor bearings have been properly designed for the expected load, and appropriately greased upon installation, they should continue in operation over a very long life. One of the ways a failure can initiate is the bearing suffering shock loading events that may cause hardening and cracking in the bearing races. Shock conditions are environmentally random events. Because such a failure could occur at any point in time after installation, condition monitoring helps to detect potential failures early. In this case, condition monitoring may show a significant increase in vibration, and an oil sample profile with metal particulates being shed from the bearing. At a later point in time, an increase in bearing temperature may also be detected.


Wear out failure

20 data points randomly sampled from a 2 parameter Weibull with Shape = 3, Scale = 100 are shown in Figure 3.

This is where there is a period of useful life before the probability of failure accelerates. This characteristic is a prerequisite for scheduled replacement or repair of the underlying components. On-condition tasks are also applicable, as is a mixture of scheduled replacement and on-condition maintenance if the consequences of failure justify it.


Figure 3: Wear-out failure pattern

The fit seems reasonable and, although the histogram shows two peaks, it is still reasonably consistent with a single failure mode, given the number of events. The estimated values for scale and shape are within 5% of the values we sampled from, which is satisfactory. Notice now that the x scale of the probability density chart only extends to 200, compared with 600 and 1500 for the random and premature examples above. When the shape increases to a wear-out characteristic, the spread of the data is much tighter than with smaller shape values. This observation should also remind us that a small number of outliers with unusually large ages can skew the fit.

Notice how the cumulative distribution function (CDF) now has an “S” shape, showing that the left tail provides us with a period of “useful life” before the probability of failure increases as the slope gets steeper.

Another thing we should take note of is the “B20” value. The “B” values correspond to the CDF fraction failed (0 to 1). In many CDF charts the y-axis is expressed in percentages (0.2 equates to 20%, which equates to “B20”). To read the B20 age by eye, mentally draw a horizontal line at 0.2 until it meets the CDF line, and then draw a line downwards to read off the age on the x-axis. Consider these shapes:


Shape = 0.8, B20~ 10
Shape = 1.0, B20~ 15
Shape = 3.0, B20~ 55


You can see how widely the B20 age varies with different Weibull shape parameters and similar scales. This graphically shows how undesirable premature failure is compared with wear-out failure, and why its causes need rapid RCA action. For this reason, the B20 is a very useful measure of reliability. If we consider Defect Elimination, any components displaying premature failure with severe consequences of failure are prime candidates for Root Cause Analysis. This shows Weibull analysis can be a powerful enabler for Defect Elimination.

As a rule of thumb, B20 may be a default scheduled replacement age for wear-out if the consequence of failure is economic, but where component failure has more severe consequences, lower values such as B10 or B5 may be more appropriate. For safety-associated failures the tolerance for failure will be zero; many equipment items may be subject to legislation or standards that dictate when and what maintenance must be done. The B20 and scale may be reliability measures that are far more satisfactory than MTBF. Bearing manufacturers very frequently quote B10 values for bearing life.

Note, the scale parameter is the B-63.2 line on the cumulative distribution function for any Weibull regardless of the shape parameter.
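B values can also be computed rather than read off a chart, by inverting the Weibull CDF. The sketch below assumes the parameters are known exactly (the function name is illustrative). For a scale of 100 it gives B20 ages of roughly 15, 22 and 61 for shapes of 0.8, 1.0 and 3.0, so values read by eye from charts fitted to only 20 sampled events can differ noticeably from these.

```python
import math

def b_life(percent_failed, shape, scale):
    """Age by which the given percentage of items is expected to have failed."""
    p = percent_failed / 100.0
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

for shape in (0.8, 1.0, 3.0):
    print(shape, round(b_life(20, shape, 100.0), 1))

# The scale is the B63.2 age whatever the shape: 1 - exp(-1) is about 0.632.
b632 = 100.0 * (1.0 - math.exp(-1.0))
for shape in (0.8, 1.0, 3.0):
    assert abs(b_life(b632, shape, 100.0) - 100.0) < 1e-9
```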

An example of wear-out failure is corrosion slowly acting on a steel tank used to hold slurry. Over time, the tank walls corrode to the point that the tank leaks or becomes structurally unstable. Periodic visual inspections, measuring wall thickness at repeatable points, or replacing the tank after a set time (for a predictable rate of corrosion) can all be performed in response.


Proportions of failures fitting the RCM failure patterns

In the late 1960s the Reliability Centred Maintenance (RCM) framework for analysing failure and defining a maintenance regime was developed. Extensive peer-reviewed empirical studies were carried out in the civil aerospace industry, comparing maintenance done and failure patterns on complex machinery. This research was later repeated in other independent empirical studies conducted in the US Navy and the nuclear industry. These other studies broadly confirmed the original aerospace findings. Many people are surprised or even shocked when they first learn of the results of these studies, as most people think the probability of failure increases with age, or that all failures conform to the bathtub curve.

The RCM studies identified 6 different failure patterns, seen in Figure 4, but these can be broken down into the premature, random and wear-out patterns we use in Weibull. If we break down the RCM failure patterns in this way, the approximate apportionment of Weibull failure patterns from the RCM patterns is:

Premature is involved with ~ 70% of all failures.

Random is involved with ~ 90% of all failures.

Wear-out is involved with ~ 11% of all failures.


Figure 4: Breakdown of RCM Failure Patterns for complex machinery


Multiple competing failure modes

The received reliability engineering wisdom is that Weibull analysis must be conducted on single failure modes. There are occasions where this is not true: we can mix failure modes where the failure modes are not competing. However, this is an exception to the general rule.

Here is an example where two competing failure modes are held in the same data set. The line fit is bad for the combined data, and there is a characteristic S curve in the probability plot that is a strong indication of two competing failure modes. The middle of the S curve may be due to the data points that fall in the region of the left-hand probability density chart where the tails of the distributions overlap (between ~100 and 400 on the x scale).

Quite frequently we can see a mixed failure mode occurring when there is a mixture of premature failure driven by a quality issue, combined with data from the next dominant failure mode. If there is a low-quality issue it may be associated with a subset of operators or maintainers. Some of the components they are associated with may fail prematurely and others may last longer and fail as broadly expected.

We can plot probability distributions that show the individual distributions that contribute to the probability chart. We can see the characteristic S curve in the probability chart that shows there are competing failure modes. We can also observe that the curve fit parameters for the whole data set are very different to the scale and shape of the two contributing distributions.

We can also observe that many of the data points fall outside the 95% confidence bands. It is important to know how to read confidence bands, because they can be calculated with respect to the x axis (the time or age) or the y axis (reliability, or the proportion likely to fail at a certain time). Depending on the calculation used, the bands will differ. In the charts below, the confidence bands are calculated with respect to the x-axis.


Figure 5. Mixed competing failure modes

How could we deal with this? We could split the data about Time = 330 and replot two probability charts. We can see on the left-hand chart that there is a crossover between Time ~ 100 and just over 400 where both distributions are contributing. We would be unsure which events are derived from which distribution, but at 330 we can see from the left-hand chart that the instantaneous probability of failure is equal (where the edges of the distributions cross), and this seems a reasonable point to make the split. We can also see that 330 in the right-hand chart forms a ‘cusp’ (where the ‘trend’ changes) in the middle of the S curve, and this supports it as a reasonable split point for the data.
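The effect of splitting can be illustrated numerically with idealised data. In this sketch the two Weibull populations, their parameters and the split age of 250 are all hypothetical (chosen so the two groups separate cleanly), and the straightness of the probability plot is summarised by an R-squared value as a stand-in for the visual check.

```python
import math

def median_rank_quantiles(shape, scale, n):
    """Idealised 'failure ages' placed exactly on the median-rank quantiles."""
    return [scale * (-math.log(1.0 - (i - 0.3) / (n + 0.4))) ** (1.0 / shape)
            for i in range(1, n + 1)]

def plot_r_squared(ages):
    """R-squared of the Weibull probability plot for complete data."""
    ages = sorted(ages)
    n = len(ages)
    xs = [math.log(t) for t in ages]
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical populations: an early failure mode and a later wear-out mode.
early = median_rank_quantiles(shape=1.5, scale=100.0, n=10)
late = median_rank_quantiles(shape=5.0, scale=500.0, n=10)
combined = early + late

split_age = 250.0  # hypothetical split point for this synthetic data
below = [t for t in combined if t < split_age]
above = [t for t in combined if t >= split_age]

print("combined:", round(plot_r_squared(combined), 3))
print("below split:", round(plot_r_squared(below), 3))
print("above split:", round(plot_r_squared(above), 3))
```

Real data will not separate this cleanly, which is why the split point needs to be supported by inspection of the charts and the failed components.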

There is a theory that as equipment becomes more complex and suffers from many more failure modes the tendency moves towards more random failure. This belief should be treated with caution as in the field, it is likely that two or three dominant failure modes mask the effects of other failure modes, especially when components are swapped for new.

To try and tease out competing failure modes, the reliability engineer needs to study the transactional data in notifications about the failure events, speak with the front-line maintainers about what they think are the causes of failure, and intercept and inspect removed defective components to understand the failure mechanisms involved.

On occasion the probability chart may show a bent line with a distinct kink or cusp that indicates a mix of 2 or more competing failure modes. This needs to be distinguished from a ‘bow’ in the probability chart, seen below in Figure 6.


Indications of missed age during utilisation

If the bow is concave when viewed from the x axis, as on the top right-hand probability plot, it suggests a positive location parameter. This is where the Weibull distribution is shifted right along the x-axis, and it means we have over-estimated age. This may occur if we are using calendar time and start the age clock ticking for an asset that then remains idle for a lot of the time. This is a strong reason to use asset running time as the age by default. One reason for using elapsed calendar time instead is when we are dealing with corrosion, where the main failure mechanism is influenced by 24/7 exposure to the environment. In the top left-hand chart, we can see that the histogram is shifted ~ 50 time units to the right, and that the probability density (blue line) from a two parameter Weibull estimation is not a good fit at the left-hand tail and the peak of the PDF.

In the lower left chart, we now fit using a three parameter Weibull, and the fit of the PDF to the histogram is much closer. In the lower right-hand probability plot, the 3 parameter Weibull fit straightens the bow, giving a much better fit.
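One simple way to estimate the location is to try candidate shifts and keep the one that makes the probability plot straightest. The sketch below is a heuristic under stated assumptions (grid search over positive shifts only, straightness measured by R-squared, complete data), not necessarily how any particular tool fits a 3 parameter Weibull; maximum likelihood is a common alternative.

```python
import math

def plot_r_squared(ages):
    """R-squared of the Weibull probability plot (ln age vs ln(-ln(1 - median rank)))."""
    ages = sorted(ages)
    n = len(ages)
    xs = [math.log(t) for t in ages]
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def best_location(ages, steps=200):
    """Grid-search the shift (0 <= location < youngest age) that straightens the plot."""
    t_min = min(ages)
    candidates = [t_min * k / steps for k in range(steps)]
    return max(candidates, key=lambda g: plot_r_squared([t - g for t in ages]))
```

Subtracting the chosen location from every age and refitting a 2 parameter Weibull then gives the shape and scale of the shifted distribution.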


Figure 6: When to use a 3 parameter Weibull fit for missing age

The points to take away are that the probability density function drawn over the probability density histogram should show a reasonable correspondence on visual inspection, and that the Weibull probability chart gives an excellent visual representation of goodness of fit against the confidence boundaries.


How can we deal with a small number of component failure events?

Many Weibull texts say that the method works well with 15 or more failure events. It is still worth fitting data with fewer failure events, especially if censored data is included (ages of survivors: ages at planned removal, and the ages of the currently fitted parts). There is another family of techniques based on cumulative counts: Duane and Crow AMSAA. These techniques are usually used within product development to monitor reliability growth. They can easily be applied to monitoring in-service reliability, where they are extremely useful in two areas:

Where there are a low number of failure events and Weibull confidence bands are too large

As an early warning trigger, or alarm, to warn of a change in the failure rate, that may be a first indication that a material problem exists or that maintenance or operations have an emerging low-quality issue.

The simplest method is to list the failure event ages in the order in which they occurred, then cumulatively add the ages from earliest to latest. Plot the log of the event number (1, 2 etc.) on the y-axis against the log of the accumulated age on the x-axis. The slope of the data points may then be inspected. If the slope is constant, the failure rate is not changing. If there is a cusp in the data and the slope increases, the failure rate is getting worse and the component becomes a candidate for investigation. If the slope becomes smaller, the failure rate is improving.
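The cumulative plot described above can be sketched in a few lines. The ages here are hypothetical, and the example assumes the log of the event number is plotted against the log of the accumulated age, so that a steeper slope means events are arriving faster; the slope is a plain least-squares line through a group of points.

```python
import math
from itertools import accumulate

def loglog_points(ages_in_order):
    """(log accumulated age, log event number) for each event, in order."""
    cumulative = accumulate(ages_in_order)
    return [(math.log(total), math.log(i)) for i, total in enumerate(cumulative, start=1)]

def slope(points):
    """Least-squares slope through a group of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx

# Hypothetical ages at failure, in order of occurrence: lives shorten at the end.
ages = [100, 110, 95, 105, 100, 98, 40, 35, 30, 25]
pts = loglog_points(ages)
early, recent = slope(pts[:6]), slope(pts[-4:])
print("early slope:", round(early, 2), "recent slope:", round(recent, 2))
```

A recent slope clearly steeper than the early one is the cusp that should trigger an investigation.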

We need some basic rules to identify a true trend after a change in the failure rate. From experience, the minimum number of aligned consecutive data points is 3, but more points provide increased confidence that the trend is true.

It is worth remembering that if hundreds of failure events are plotted, the data points become bunched up at the end of the trend and it becomes very difficult to spot changes in reliability trends. The advice would be to zoom into the last 10 to 20 data points and look for changes of trend in this data.

A simple example of worsening reliability rates is shown below:

[Chart: example cumulative plot showing a worsening reliability rate]

Some reliability metrics show a moving window of MTBF. Simple moving averages have a lag before the trend changes, depending on the size of the window selected. The simple cumulative sum (Cum-Sum) plot has no ‘lag’: it shows changes in a trend as they happen, and for the earliest warning of a change this is an improvement over MTBF moving averages.


Why Weibull is the Gold standard

The reason that Weibull has become the gold-standard tool for understanding reliability is that it shows the probability of failure over a very large range of time, which is critical to determining the type of failure mode and hence the correct maintenance regime to directly address it. Without this information it is more difficult to make an accurate assessment. For this reason, other forms of reliability measures, like mean time between failure (MTBF), can be helpful for providing information at a glance but don’t provide the amount of detail that a Weibull does.

Some people argue that MTBF has only one parameter, and is easier and more convenient to remember. Weibull has two parameters: scale (which broadly equates to the ‘mean’, being a measure of the central tendency of a distribution) and shape. It is not hard to remember two parameters and, as this blog has explained, much more meaningful information can be gleaned from the two Weibull parameters than from the single-parameter MTBF.
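How closely the scale tracks the mean can be checked with the standard formula for the Weibull mean, scale * Gamma(1 + 1/shape). The sketch below (function name illustrative) shows that the two coincide only at a shape of 1, with the mean a little above the scale for a shape of 0.8 and a little below it for a shape of 3.

```python
import math

def weibull_mean(shape, scale):
    """Mean life of a 2-parameter Weibull: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)

for shape in (0.8, 1.0, 3.0):
    print(shape, round(weibull_mean(shape, 100.0), 1))
```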


In conclusion

This blog has shown that an experienced reliability engineer can glean a lot of latent information from visually inspecting the various forms of chart that can be plotted using the Weibull distribution. Goodness of fit and mixed competing failure modes can be observed visually using probability charts. The quality of data may be questioned, and the shape parameter is a determinant of what types of maintenance task are applicable. The insights the reliability engineer brings to verifying the maintenance regime and taking an active part in Defect Elimination add considerable value to their organisation.
