Availability & Delay Attribution

Reliability, Availability and Maintainability analysis is often rolled up into the acronym RAM in the US or AR&M in the UK. A European standard, EN 13306:2017, defines what these terms mean.

Availability is the ability of an asset to operate when required, in a defined context with resources being available.

Reliability is the ability of an asset to deliver required functions in a defined context for a given period.

Maintainability is the ability of an asset, in a defined context, to be retained in or restored to a state where it can deliver required functions, when maintenance is carried out under the right conditions and with the right resources.

Acceptable availability implies acceptable reliability, coupled with ease of maintenance and short maintenance turn-around times.

We also need to consider availability within our operating context and environment. We may not be able to profitably utilise a proportion of the availability we already have, and in that case it may not be profitable to improve availability.

To understand our required availability, we must combine availability with asset utilisation, and with whether that utilisation is profitable in our operating context.

In an operation that runs 24/7, where demand and selling price for the product are high, utilisation may be profitable for the full 24/7. Running complex machinery 24/7 indefinitely is not practical, however, as machinery wears and requires maintenance with the assets shut down. We need to agree with operations planned periods in which the machine can be released for maintenance with minimum disruption.

The amount of time needed for maintenance depends on:

1. Amount of work to be done.

2. How the maintenance is packaged (when work falls due).

3. How the detailed work can be sequenced, given the spatial constraints around the asset.

4. The number of maintenance staff and the facilities available.

It is worth thinking about how reliability and maintainability affect availability. We can look at two extreme cases, in both of which the asset system is sold out:

1. The first system has a dominant failure mode with low reliability that stops production, so it suffers frequent failures. Maintainability is good, however, and it takes 10 minutes (median) to recover from a failure.

2. The second system has higher reliability for its dominant failure mode, which also stops production. Maintainability is poor, and it takes 2 hours (median) to recover.

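Each failure of the second system costs twelve times as much production time as a failure of the first, so unless the first system fails more than about twelve times as often, the loss of availability is worse in the second case and costs more, despite the lower reliability of the first. There may be other costs associated with the disruption caused by unexpected failure, but these need to be quantified and added to the equation.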
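The comparison between the two cases can be sketched in a few lines of Python. The recovery times come from the two cases above; the failure counts per month are purely illustrative assumptions, and multiplying a count by a median recovery time is only a rough estimate of lost time (a mean would be needed for an exact total).

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    failures_per_month: float     # ASSUMED figure, not taken from the text
    median_repair_minutes: float  # from the two cases described above

    def downtime_minutes(self) -> float:
        # Rough estimate: number of failures times the median recovery time.
        return self.failures_per_month * self.median_repair_minutes


def breakeven_failure_ratio(fast: FailureMode, slow: FailureMode) -> float:
    """How many times more often the fast-to-repair system can fail
    before it loses as much production time as the slow-to-repair one."""
    return slow.median_repair_minutes / fast.median_repair_minutes


system_1 = FailureMode("low reliability, good maintainability", 8, 10)
system_2 = FailureMode("higher reliability, poor maintainability", 1, 120)

for system in (system_1, system_2):
    print(f"{system.name}: {system.downtime_minutes():.0f} min lost per month")
print(f"Break-even failure ratio: {breakeven_failure_ratio(system_1, system_2):.0f}x")
```

With these assumed failure counts the second system loses more time, but the break-even ratio shows how sensitive that conclusion is to the real failure frequencies.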

When assessing through-life cost, we need to account for all three constituents of AR&M.

Another consideration is whether assets operate with, or on, shared infrastructure. Where the infrastructure limits the number of assets that can operate, a breakdown may also stop other assets from using it. An easy example to visualise is a train breaking down on a single-track railway: no other trains can run until that train is fixed or moved to a siding. This imposes delays on the whole system.

We should consider dependencies between assets: a single breakdown can have cascading effects that make other dependent assets effectively unavailable.

It is possible to mitigate some of these effects with redundancy. Redundancy requires investment, and the amount invested needs to be carefully balanced against the probability that the standby capability will be needed to avoid lost production.

Another way of avoiding production stoppage is to stockpile manufactured material downstream of critical assets. Stockpiling also ties up cash, and there is a finance cost attached to capital held in non-earning stockpiles or redundant assets.

This raises questions about how availability is measured:

* Do we include planned outage time in the overall calculation or not?

* How do we factor in delay and delay attribution?

Many systems do not include planned outage time when calculating availability (or lost production).
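To show how much the treatment of planned outage time can move the headline figure, here is a minimal sketch. The function name, its parameters and the hours used are assumptions made for illustration; they are not a standard definition.

```python
def availability(calendar_hours: float,
                 planned_outage_hours: float,
                 unplanned_downtime_hours: float,
                 include_planned_outage: bool) -> float:
    """Availability as a fraction of the time the asset was required.

    If planned outages are included, they count as downtime against the
    full calendar period. If not, they are removed from the period first
    and only unplanned downtime counts against what remains.
    """
    if include_planned_outage:
        required = calendar_hours
        downtime = planned_outage_hours + unplanned_downtime_hours
    else:
        required = calendar_hours - planned_outage_hours
        downtime = unplanned_downtime_hours
    return (required - downtime) / required


# Illustrative week: 168 h, with 8 h planned maintenance and 4 h of breakdowns.
with_planned = availability(168, 8, 4, include_planned_outage=True)
without_planned = availability(168, 8, 4, include_planned_outage=False)
print(f"Including planned outage: {with_planned:.1%}")
print(f"Excluding planned outage: {without_planned:.1%}")
```

The same week reports roughly 93% or 97.5% depending on the convention, which is why the convention needs to be stated alongside any availability figure.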

Part of the solution may be to define a classification model for how we attribute time for a working asset. The following diagram is an example that may be used:

Some of the classifications may be contextual. For example, a steam plant may take an appreciable amount of time to start up and shut down, to allow controlled temperature transients that do not stress the plant. Other assets take no appreciable time to start up or shut down. The breakdown of the classifications needs to work for the organisation and should highlight improvement opportunities.

The divisions and classifications of time may be given targets, which can then be compared against actual performance.

The selection of the classification breakdown also needs to be pragmatic, and data gathering needs to be practical and achievable.

The measurement of these time slices may also depend on whether the recording of time is manual or automatic:

* Manual collection implies a much simpler set of classifications.

* Automatic collection could be facilitated by fixed sensors utilising data surveillance (such as SCADA/DCS associated networks) or data historians.

* Simple calculations to record time stamps for machinery state changes can be scripted into these systems.

* From this data, elapsed and cumulative times can be calculated automatically, so that actual performance can be compared to the target metrics.
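As a rough illustration of the automatic route, the sketch below turns a list of time-stamped state changes into cumulative hours per classification and compares them with targets. The state names, timestamps and target values are hypothetical; a real SCADA or historian export would have its own tags and format.

```python
from collections import defaultdict
from datetime import datetime


def cumulative_time_by_state(events, end_time):
    """Sum the hours spent in each state.

    events: list of (timestamp, state) tuples sorted by time. Each state
    is assumed to last until the next event, or until end_time.
    """
    totals = defaultdict(float)
    following = events[1:] + [(end_time, None)]
    for (start, state), (next_start, _) in zip(events, following):
        totals[state] += (next_start - start).total_seconds() / 3600.0
    return dict(totals)


def compare_to_targets(actual_hours, target_hours):
    """Return actual minus target for each targeted state."""
    return {state: actual_hours.get(state, 0.0) - target
            for state, target in target_hours.items()}


# Hypothetical state changes for one asset over one day.
events = [
    (datetime(2024, 1, 1, 0, 0), "running"),
    (datetime(2024, 1, 1, 6, 0), "breakdown"),
    (datetime(2024, 1, 1, 6, 30), "running"),
    (datetime(2024, 1, 1, 14, 0), "planned_maintenance"),
    (datetime(2024, 1, 1, 16, 0), "running"),
]
actual = cumulative_time_by_state(events, end_time=datetime(2024, 1, 2, 0, 0))
targets = {"running": 22.0, "breakdown": 0.25, "planned_maintenance": 2.0}
variance = compare_to_targets(actual, targets)
print(actual)
print(variance)
```

Keeping the raw state-change events, rather than only the totals, means the classification can be refined later without re-collecting data.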

We can also look at capturing metrics within the maintenance periods. This would rely more on manual data collection, and overseeing it and ensuring proper accounting can be complicated. For example, a set of PM tasks may be done over a period of allocated PM time. Many of the tasks can be done concurrently. One or two of the tasks may be delayed because of a lack of resources, but this might not affect the allocated PM time because those tasks are not on the critical path of the PM plan (critical path in the project management sense). The delay times are absorbed.
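The PM example above can be sketched as a small task network. The task names, durations and dependencies below are invented for illustration; the function simply finds the longest path through the network, which is the critical path that sets the PM window.

```python
def pm_window_hours(durations, predecessors):
    """Earliest finish of the whole PM package, in hours.

    durations: {task: hours}; predecessors: {task: [tasks that must
    finish first]}. The result is the length of the longest (critical)
    path through the task network.
    """
    finish = {}

    def earliest_finish(task):
        if task not in finish:
            start = max((earliest_finish(p) for p in predecessors.get(task, [])),
                        default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(earliest_finish(task) for task in durations)


# Hypothetical PM package: three jobs run concurrently between
# isolating and reinstating the machine.
durations = {"isolate": 1.0, "inspect_gearbox": 4.0, "lubricate": 1.0,
             "replace_filter": 1.5, "reinstate": 1.0}
predecessors = {
    "inspect_gearbox": ["isolate"],
    "lubricate": ["isolate"],
    "replace_filter": ["isolate"],
    "reinstate": ["inspect_gearbox", "lubricate", "replace_filter"],
}

baseline = pm_window_hours(durations, predecessors)
# A 2 h delay to a task off the critical path is absorbed.
delayed_off_path = pm_window_hours({**durations, "lubricate": 3.0}, predecessors)
# A 1 h delay to a task on the critical path extends the outage.
delayed_on_path = pm_window_hours({**durations, "inspect_gearbox": 5.0}, predecessors)
print(baseline, delayed_off_path, delayed_on_path)
```

Here the delay is modelled simply as a longer task duration. The point is that the same amount of lost labour time may or may not show up as lost asset time, depending on where in the plan it falls.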

To get a handle on how productive maintenance is, we need to measure each task. We then also need to use the Gantt chart for the maintenance to understand the impact of delays on the whole timeline. On many occasions the work conducted by the maintenance staff is likely to differ from the Gantt plan, but the times when tasks were started and finished can still be captured simply. The actual sequence may also give planners better sequencing knowledge. Perhaps a front-line system that allows staff to tick off when tasks are started and finished, with notes available for explaining delays, would help capture data in a semi-automated way.

The collection of detailed work data should be done openly with staff, who should also have some delegated responsibility for improvement. This could be part of a Kaizen system used in Total Quality Maintenance.

Measuring each task is also necessary because the accuracy of planning estimation depends on it. If we see consistent divergence of actual times or resources from our estimated times in our task lists or definitions, we can correct the estimates. This allows us to adjust the target times in our asset timeline.
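One simple way to act on consistent divergence is sketched below. The 10% tolerance, the use of the median and the sample times are all assumptions chosen for illustration; an organisation would set its own rule.

```python
from statistics import median


def revised_estimate(estimated_hours, actual_hours, tolerance=0.10):
    """Suggest a new task estimate from recorded actual times.

    If the median of the actual times diverges from the current estimate
    by more than `tolerance` (as a fraction of the estimate), return the
    median; otherwise keep the current estimate.
    """
    typical = median(actual_hours)
    if abs(typical - estimated_hours) / estimated_hours > tolerance:
        return typical
    return estimated_hours


# Hypothetical task planned at 2.0 h that consistently takes longer.
filter_change = revised_estimate(2.0, [2.6, 2.4, 2.5, 3.1, 2.5])
# Hypothetical task whose estimate is already about right.
lubrication = revised_estimate(1.0, [0.9, 1.0, 1.1])
print(filter_change, lubrication)
```

The median is used here so that a single unusually long job does not drag the estimate upwards; whether that is appropriate depends on how the planner wants to treat outliers.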

Below is an example of a time classification breakdown for maintenance:

We have discussed availability and delay attribution, showing that it is not as simple as many may imagine. We have suggested classification systems as a baseline for measuring availability, which may be tailored to the context of any organisation adopting them. We have also covered the data needed and how we should look to exploit automated means of collecting it.
