Post Service Failure

In the last two blogs we covered the concept of Defect Elimination, highlighting the two broad methodologies that can be applied through quality and reliability centric approaches. In this blog we are going to focus on ‘Post service failure’, which is an important subset of defect elimination, using the reliability centric approach.

To set the context, defect elimination recognises that a significant proportion of in-service failures can be caused by:

Maintenance induced failure
Mal-operations induced failure (such as overload)
Environmental stressing factors (such as abrasive dust)
Low quality input and output factors (such as fuel quality)

Defect elimination attempts to find common causes of critical failures, where the root cause(s) can be avoided or eliminated by the operating companies, independent of manufacturers. This effort is targeted at improving achievable reliability in service to the upper limit of intrinsic reliability set by machinery design and manufacturing.

‘Post Service failure’ is a direct method of addressing maintenance induced failures. The method relies on recognising that any failure occurring shortly after planned or corrective maintenance is highly likely to have been caused by poor quality in the conduct of that maintenance. Low-quality maintenance does not always result in a subsequent failure event, but it may still require avoidable rework. For example, failing to tighten fasteners to the required torque may require an asset to be taken out of service for a short period to rectify the issue.

The measurement and identification of Post Service Failures requires a method for us to track the age of components, resetting the age clock to zero after each maintenance intervention in our Computerised Maintenance Management System (CMMS). Tracking the component age, in order to identify and differentiate early-age failures, puts this method into the reliability approach to defect elimination.
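As a minimal sketch of the age clock described above, the following assumes a simplified event log exported from a CMMS. The `Event` record, its field names and the "maintenance" / "failure" labels are hypothetical and do not come from any particular CMMS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    running_hours: float  # cumulative asset running hours when the event was logged
    kind: str             # "maintenance" or "failure" (hypothetical labels)


def ages_at_failure(events: List[Event]) -> List[float]:
    """Return the component age (hours since the last maintenance) at each failure."""
    ages = []
    last_reset: Optional[float] = None  # running hours at the last maintenance
    for ev in sorted(events, key=lambda e: e.running_hours):
        if ev.kind == "maintenance":
            # Any maintenance intervention resets the age clock to zero.
            last_reset = ev.running_hours
        elif ev.kind == "failure" and last_reset is not None:
            ages.append(ev.running_hours - last_reset)
        # Failures logged before any maintenance are skipped: their age is unknown.
    return ages


log = [
    Event(0, "maintenance"),
    Event(40, "failure"),
    Event(40, "maintenance"),    # corrective maintenance resets the clock
    Event(1240, "failure"),
    Event(1240, "maintenance"),
    Event(1300, "failure"),
]
print(ages_at_failure(log))
```

The sketch assumes that the corrective maintenance following each failure is logged as its own maintenance event, so that the clock restarts from the repair rather than from the failure.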

For more mature reliability systems, Weibull analysis can be used to identify post-service failures: the ages at failure of a component are fitted to a Weibull distribution, and the fit returns a shape parameter between 0 and 0.9, that is, below 1, where the failure rate is highest immediately after the component enters service. Such distributions fall into a ‘failure pattern’ of ‘premature failure’, which is exactly what we are looking to target in post service defect elimination. We can also use Crow AMSAA analysis to determine whether a component’s reliability trend has recently changed for the worse. Crow AMSAA is generally used by manufacturers when developing new products, to grow reliability before the product goes on sale. There is no reason why we cannot re-use this technique in service to spot when reliability trends change.
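To illustrate how a shape parameter can be obtained from a set of ages at failure, here is a sketch using median rank regression with Bernard's approximation. It is one common fitting approach and is not necessarily the one used by IronMan®. It assumes complete data (every age is a failure, with no suspensions), and the function name and sample ages are illustrative only.

```python
import math
from typing import List, Tuple


def fit_weibull_mrr(ages: List[float]) -> Tuple[float, float]:
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes complete data: every age is an observed failure.
    """
    t = sorted(ages)
    n = len(t)
    xs = [math.log(v) for v in t]
    # Bernard's approximation to the median rank of the i-th ordered failure,
    # transformed onto the Weibull plot's vertical axis.
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the fitted line = shape
    intercept = y_bar - beta * x_bar       # intercept = -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


# Illustrative ages at failure (hours since last maintenance), mostly early.
sample_ages = [2, 5, 11, 30, 90, 400, 1500]
beta, eta = fit_weibull_mrr(sample_ages)
pattern = "premature failure" if beta < 1 else "not premature"
print(f"shape={beta:.2f}, scale={eta:.0f} h, pattern: {pattern}")
```

With only a handful of points the estimate is rough, so the shape value is better read as an indicator for further investigation than as a precise figure.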

Example Weibull in IronMan® for a component suffering post service failure, with most failures occurring at an early age (few data points)

Example Merged Crow AMSAA in IronMan® for a component suffering post service failure with an increasingly poor failure rate
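For readers who want to see the trend test behind a Crow AMSAA chart, the sketch below fits the power-law model N(t) = λt^β to cumulative failure times using the standard maximum likelihood estimates for a time-terminated observation period. It handles a single asset only, so it does not reproduce the merged analysis shown in IronMan®, and the function name and failure times are illustrative assumptions.

```python
import math
from typing import List, Tuple


def fit_crow_amsaa(failure_times: List[float], end_time: float) -> Tuple[float, float]:
    """Fit N(t) = lam * t**beta to cumulative failure times of one asset.

    failure_times are cumulative operating times (> 0) at each failure and
    end_time is the total time observed. beta > 1 means failures are arriving
    faster over time (reliability deteriorating); beta < 1 means improving.
    """
    n = len(failure_times)
    beta = n / sum(math.log(end_time / t) for t in failure_times)
    lam = n / end_time ** beta
    return beta, lam


# Illustrative data: failures bunch up towards the end of the observation period.
times = [1000, 1800, 2400, 2800, 3100, 3300, 3450]
beta, lam = fit_crow_amsaa(times, end_time=3500)
trend = "deteriorating" if beta > 1 else "stable or improving"
print(f"beta={beta:.2f}, trend: {trend}")
```

Note that the times used here are cumulative operating times of the asset, not component ages reset at each maintenance, because the method looks at how the failure rate changes over calendar or running time.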

If we track individual component operating ages, we should be able to define what an acceptable B-5 age should be. The B-5 figure, which can be derived from a Weibull cumulative distribution chart, is the component age by which 5% of components are likely to have failed.

5% means that if we have 20 failures, only one of them should have an age at failure under the B-5 age. This gives a rule of thumb (or Key Performance Indicator threshold) for events that may be considered as post service failures and opens these up for root cause analysis.
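The B-5 age follows directly from the two-parameter Weibull cumulative distribution F(t) = 1 - exp(-(t/η)^β): setting F(t) = 0.05 and solving for t gives the age. The sketch below computes it and then checks what share of observed failures fell under it. The shape and scale values and the list of ages are illustrative assumptions, standing in for whatever reference distribution is judged acceptable for the component.

```python
import math
from typing import List


def b_life(beta: float, eta: float, fraction: float = 0.05) -> float:
    """Age by which `fraction` of components are expected to have failed."""
    return eta * (-math.log(1.0 - fraction)) ** (1.0 / beta)


def early_failure_share(ages: List[float], threshold_age: float) -> float:
    """Share of observed failures whose age at failure is under the threshold."""
    return sum(1 for a in ages if a < threshold_age) / len(ages)


# Hypothetical acceptable distribution for the component: shape 1.0, scale 1000 h.
b5 = b_life(beta=1.0, eta=1000.0)

# Hypothetical ages at failure observed in service (hours since maintenance).
observed = [10, 30, 200, 900, 1500]
share = early_failure_share(observed, b5)
if share > 0.05:
    print(f"{share:.0%} of failures fell under the B-5 age of {b5:.0f} h: investigate")
```

A share well above 5% suggests more early-age failures than the reference distribution allows for, which is the cue to open those events for root cause analysis.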

If Weibull data is unavailable then a set of ages may be selected using expert knowledge from the senior maintenance practitioners, those who have years of experience in maintaining the equipment. We may select age or elapsed times since last maintenance such as 100 hours operating time, or a week elapsed time after maintenance. Any events that occur under these age or time limits can be regarded as post service failures.
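Where only expert-set limits are available, the screening rule can be as simple as the sketch below. The record fields are hypothetical, the default limits are the 100 hours and one week quoted above, and treating a failure as post service when it falls under either limit is an assumption that a site may wish to tighten.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailureRecord:
    failed_at: datetime
    last_maintained_at: datetime
    hours_run_since_maintenance: float


def is_post_service_failure(
    rec: FailureRecord,
    max_hours: float = 100.0,
    max_elapsed: timedelta = timedelta(days=7),
) -> bool:
    """Flag a failure that occurred under either expert-set limit (assumed rule)."""
    elapsed = rec.failed_at - rec.last_maintained_at
    return rec.hours_run_since_maintenance < max_hours or elapsed < max_elapsed


soon_after = FailureRecord(datetime(2024, 1, 10), datetime(2024, 1, 8), 30.0)
long_after = FailureRecord(datetime(2024, 3, 1), datetime(2024, 1, 8), 900.0)
print(is_post_service_failure(soon_after), is_post_service_failure(long_after))
```

Keeping both an operating-hours limit and an elapsed-time limit helps with equipment that runs intermittently, where a week can pass with very few running hours.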

When taking action, a useful angle is to consider the vulnerabilities (or hazards) inherent in doing maintenance that can themselves cause failures. For example:

If you need to break any seals in equipment, you are more vulnerable to ingress of dirt
If equipment is difficult to access, the sheer awkwardness of conducting the maintenance creates a risk that it is not done properly
You may need to remove other equipment to gain access to the equipment you intend to maintain (in the Royal Navy, we termed this kind of task “work in wake”)
Slight misalignment of rotating machinery can cause considerable deterioration of shafts and couplings

We can also take inspiration from James Reason’s models of organisational accidents, whose principles can be applied to post service failures. The ideas focus on human and organisational causes of failure. Reason introduces the concept of latent failures, one of which we touched on above with accessibility, and we can expand this idea to cover the fitness for purpose of procedures, training and experience. Other human factors, such as fatigue associated with working practices, may also be considered.

Other general quality focused processes, such as ‘Five S’, are also best practices that can improve maintenance performance: keeping workspaces clean and clear for work, and using shadow boards to ensure that all tools are accounted for and that none are left in machinery.
