Defect Elimination Part II

June 25, 2023

There can be multiple approaches to Defect Elimination.


In a previous blog we explored the concept, benefits and main causes of defects. In many industries the majority of defects and unexpected failures are caused by quality issues in day-to-day maintenance and equipment operations.  The good news is that this lack of quality is within an organisation's power to fix, and substantial improvements can be made.  This is the bedrock on which Defect Elimination is built.

Defect Elimination’s power derives from identifying and eliminating common preventable causes of failure across many different assets and machines.  We already know the biggest causes centre on low-quality maintenance and operations.  This simple insight means that solving the most common causes of failure improves all machinery and maximises the benefits.


There are two approaches to implementing Defect Elimination.  The first is based on an organisational review of quality processes and staff, with a simple view of which machinery is failing and causing the most pain.  This top-down approach is not limited by a lack of reliability data and can be run by small organisations.  This quality-driven approach was discussed in the previous blog here.


One beneficial consequence of the quality approach is that it incentivises an organisation to gather more, higher-quality data and to structure it so it can be exploited more efficiently.  This emergent consequence leads naturally to the second approach, which is more bottom-up and uses reliability and other related data to enhance the effectiveness of Defect Elimination.  The two approaches are compatible, and both are complementary to frameworks such as RCM (especially RCM age exploration) and Total Quality Maintenance (TQM), which seeks to empower front line staff to improve quality and processes.


We may have existing data in our CMMS in the form of work orders or notifications that we can mine for clues about failure modes, component condition and failure symptoms. Some CMMS systems also support failure cause, effect and damage labels (or codes) associated with the work order data. However, simple systems that allow codes to be picked from lists can have their own problems; this will be the subject of a future blog covering data quality.


We may have access to FMEA data, but we need to be cautious about using design FMEAs: they are rightly focused on safety, but can miss some of the operating context and may give limited coverage of the operational and economic consequences of failure.  A good quality Maintenance FMEA of the kind used for the RCM process is the most appropriate.  We will cover what makes a good quality FMEA in a future blog.  We may also have access to existing Root Cause Analysis (RCA) data.


If we think about how data may be combined, it is easy to imagine RCA data being linked to, and extending, the FMEA ‘failure mode’ data, capturing the cause-and-effect chains that RCA produces.
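As a minimal sketch of this linkage, the snippet below attaches an RCA cause-and-effect chain to an FMEA failure-mode record. All field and record names here are illustrative assumptions, not from any specific CMMS or FMEA tool.

```python
# Sketch: linking an RCA cause-and-effect chain to an FMEA failure-mode record.
# All names and fields are illustrative, not a standard schema.

fmea_failure_mode = {
    "id": "FM-017",
    "component": "HP fuel pump",
    "failure_mode": "loss of containment (external leak)",
    "rca_chains": [],  # RCA results attached to this failure mode
}

# An RCA cause-and-effect chain, ordered from root cause to observed failure.
rca_chain = [
    "seal face contaminated during reassembly",
    "accelerated seal wear",
    "weeping at pump casing",
    "external fuel leak (functional failure)",
]

fmea_failure_mode["rca_chains"].append(rca_chain)

# The failure-mode record now carries the causal chain that explains it.
print(len(fmea_failure_mode["rca_chains"]))  # → 1
```

In a real system each chain element would be its own node, so that common root causes can be queried across many failure modes.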


An important success factor for running a data-driven Defect Elimination process is the quality of the data used to populate it.  This can be significantly enhanced by meaningful communication with front line maintainers and operators.  It is the front line staff who provide the majority of the data that makes DE a success.  The organisation needs to value their input, and also explain to them how the data is used in the DE process and how it ultimately benefits them. Otherwise, front line maintainers may not regard data entry as important, as they may perceive little value in providing it.


So what type of data are we talking about?  The following explains.


Information may be extracted from the frontline data and broken down into entities such as defects, failures, failure modes, failure mechanisms, symptoms and failure effects.  This is what we mean by those terms:


A defect is where a component is damaged or has degraded performance, but is still delivering its required functions.

A failure is where a component is no longer delivering its required functions. The failed component may be repairable, or may have to be scrapped.

The failure mode is the observable state of the failure – for example, a leak (or more formally ‘loss of containment’).

The failure mechanism is the mechanical, electrical or chemical process that leads to failure (for example erosion, corrosion or cyclic fatigue).

Symptoms are observable changes of state of a deteriorating component.  For example: a leak may progress from ‘weeping’ to ‘slow drip’ to ‘leaking’ at an unacceptable rate over time.

Failure effects can range from local effects on the affected component, to higher-level effects on the component’s system or asset, and even to impacts on operations (such as loss of production) or economics.
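The entity model above can be sketched as a simple data structure. This is only one possible shape, with illustrative field names, not a standard taxonomy:

```python
# A minimal sketch of the failure-event entities described above.
# Field names are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class FailureEvent:
    component: str                 # e.g. "HP fuel pump"
    failure_mode: str              # observable state, e.g. "loss of containment"
    failure_mechanism: str         # process leading to failure, e.g. "corrosion"
    symptoms: list[str] = field(default_factory=list)  # e.g. ["weeping", "slow drip"]
    local_effect: str = ""         # effect on the component itself
    system_effect: str = ""        # effect on the parent system or asset
    operational_effect: str = ""   # e.g. "loss of production"

event = FailureEvent(
    component="HP fuel pump",
    failure_mode="loss of containment",
    failure_mechanism="corrosion",
    symptoms=["weeping", "slow drip", "leaking"],
    local_effect="fuel leak at pump casing",
    system_effect="reduced fuel pressure to one bank",
    operational_effect="engine derated",
)
print(event.failure_mode)  # → loss of containment
```

Keeping mode, mechanism, symptoms and effects as separate fields (rather than one free-text description) is what later makes Pareto and Weibull analysis possible.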


The ‘position’ and ‘operating age’ of the component at the time it is changed should be captured.  Position is very important where more than one of the same type of component is fitted.  For example: a V-configured diesel engine may have an HP fuel pump for each bank of cylinders, and a fuel injector per cylinder.  We need to identify which of the pumps and injectors are changed each time.  Assets, their systems and components should also have criticality scores, based on their safety and operational importance, intrinsic costs (material and recovery costs), and logistic availability.
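A short sketch of why position matters, with invented change-out records: tallying change-outs per position can reveal one position failing repeatedly (and early), which a component-level count would hide.

```python
# Sketch: capturing 'position' and 'operating age' at component change-out,
# then tallying change-outs per position. All data is illustrative.
from collections import Counter

changeouts = [
    {"component": "fuel injector", "position": "cylinder 3", "operating_age_h": 1100},
    {"component": "fuel injector", "position": "cylinder 3", "operating_age_h": 4200},
    {"component": "fuel injector", "position": "cylinder 6", "operating_age_h": 7900},
    {"component": "HP fuel pump", "position": "bank A", "operating_age_h": 12500},
]

by_position = Counter(
    r["position"] for r in changeouts if r["component"] == "fuel injector"
)
print(by_position.most_common(1))  # → [('cylinder 3', 2)]
```

Here cylinder 3 stands out: two change-outs, one at a very low operating age, suggesting a position-specific cause worth investigating.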


Components could be labelled with vulnerabilities where maintenance-induced failures may be more prevalent.  We wouldn’t expect front line staff to label this data, but someone in the reliability department could.  For example: electrical equipment may be subject to insulation resistance breakdown, and short circuit or short to earth/ground failures.  Any component that has to be disassembled, especially where containment boundaries need to be breached (any maintenance involving seals), may be subject to contamination ingress and leaks caused by low-quality reassembly.  Such a component’s vulnerability would be labelled ‘invasive maintenance’.  The ‘vulnerabilities’ mark where failures due to lower quality maintenance are more probable.


The symptoms provide clues to the nature of the failure and how it presents.  For example: if weeping and slow dripping are observable as symptoms, the onset of failure is gradual, and this opens up the possibility of on-condition maintenance as a viable choice of maintenance task.  The time from first observable symptoms to functional failure also allows us to identify the P-F interval and set the inspection periodicity.  The criticality of the component increases if the failure is sudden, with no observable pre-warnings.
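As a worked sketch, the P-F interval can be estimated from the recorded dates of the first symptom and the functional failure. The dates are invented, and dividing the P-F interval by two for the inspection interval is a common rule of thumb (inspect at least twice within the interval), not a universal standard:

```python
# Sketch: estimating a P-F interval from recorded symptom and failure dates.
# Dates are illustrative; the half-P-F inspection rule is a rule of thumb.
from datetime import date

first_symptom = date(2023, 1, 10)       # 'weeping' first observed (potential failure, P)
functional_failure = date(2023, 3, 11)  # leak rate unacceptable (functional failure, F)

pf_interval_days = (functional_failure - first_symptom).days
inspection_interval_days = pf_interval_days // 2  # inspect at least twice within P-F

print(pf_interval_days, inspection_interval_days)  # → 60 30
```

With several such events for the same failure mode, a conservative (shortest observed) P-F interval would normally be used.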


Another emergent effect: when the maintenance and reliability group discusses reported failures or defects with the frontline staff, and shares how it relies on and actively uses this data, people are incentivised to provide more detailed data that is meaningful for these downstream processes.


All of this data about the component failure events can be structured and linked to components, and any existing FMEA.  Indeed, if no FMEA exists, then this data can be used to dynamically construct an FMEA as more work order data is generated.


If we capture and record failure modes, we can build up a Pareto chart of the most predominant failure modes for each type of component.  We can also use Weibull analysis to identify which component failure modes are failing prematurely (where the Weibull shape parameter is < 1).  We discussed Weibull analysis in a previous blog.


In another previous blog we discussed the use of graph-based databases, which are well suited to relationship-rich data that we want to ask sophisticated queries of. A suggested schema for a combined FMEA and RCA graph is shown below.



This is a baseline for querying the database for similar types of equipment that share common failure modes.  The reliability engineer could query the database asking questions such as:


“What are the numbers of electrical short circuit faults for all rotating electrical machinery over the last five years, in separate six-monthly time spans, compared with all failures over the same periods?”
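The shape of such a query can be sketched over a simple in-memory record set. In practice this would be a query (e.g. Cypher) against a real graph store; all data and names here are illustrative:

```python
# Sketch: bucketing short-circuit faults on rotating electrical machinery into
# six-monthly time spans. Data and names are illustrative assumptions.
from datetime import date

failures = [
    {"asset_class": "rotating electrical", "mode": "short circuit", "date": date(2022, 3, 4)},
    {"asset_class": "rotating electrical", "mode": "short circuit", "date": date(2022, 9, 15)},
    {"asset_class": "rotating electrical", "mode": "bearing seizure", "date": date(2022, 10, 2)},
    {"asset_class": "static", "mode": "short circuit", "date": date(2022, 5, 20)},
]

def half_year(d: date) -> str:
    """Label a date with its six-monthly span, e.g. '2022-H1'."""
    return f"{d.year}-H{1 if d.month <= 6 else 2}"

buckets: dict[str, int] = {}
for f in failures:
    if f["asset_class"] == "rotating electrical" and f["mode"] == "short circuit":
        buckets[half_year(f["date"])] = buckets.get(half_year(f["date"]), 0) + 1

print(buckets)  # → {'2022-H1': 1, '2022-H2': 1}
```

The same bucketing applied to all failures gives the denominator for the comparison in the question above.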


The reliability engineer can mine the graph data to find areas with common failure modes that are significant, and which form candidates for prioritised Root Cause Analysis (RCA).


By conducting the detailed analysis, a set of common preventable causes may be arrived at that would improve many components at a single stroke.


The traditional way of initiating RCA is where a single significant component failure triggers the need for analysis.  This is valid and remains powerful, but compared with the DE philosophy of fixing common causes across many components it may not be as productive and beneficial.


All the data from the RCA, including rich media (audio, video and images) as well as text, along with the cause-and-effect chain, should be captured in the graph representation of the FMEA-RCA tool.  This makes the data richer and ever more valuable to the owning organisation.


Downstream of the RCA, a number of palliative and preventative actions (which may be big enough to be separate projects) may be initiated.  Palliative actions may be temporary changes that help contain the problem until the preventative action can be rolled out across the organisation.


The effects of the palliative and preventative actions should be measured to verify the required improvements are being made.  This data should also be recorded in the FMEA-RCA data store.


This blog has discussed the data-driven Defect Elimination process and the data systems that enable it.  We have shown why the breakdown of the influences on failure, and the aim of fixing many common causes of failure, make this approach so valuable.  DE is not a replacement for RCM, but actually improves its concept of ‘age exploration’. It also fits extremely well with TQM, where front line staff are empowered to incrementally improve quality.


We would love to hear about your experiences with Defect Elimination.  What are some of the barriers you had to overcome?  What are some of the pitfalls?  Does the need to become more sophisticated in managing and collecting good quality data concern you?  Does the cost of doing this concern you?  Please share your experience.


The next blog will take a look at some of the issues of managing data quality in the maintenance domain. How can we avoid building a data empire that may not deliver our expectations?  How can the maintenance domain become smarter at exploiting data and information systems?


Optimise your maintenance strategies with IronMan®
Get the most from your data.
Book a demo