6.0. Fault Tolerance and Plant Ageing

A few further points affect hazard analysis and the SIS. This clause briefly discusses how fault tolerance and plant ageing affect safety systems in a plant.

6.1. Fault Tolerance

This part discusses the basics of faults and fault tolerance vis-a-vis system requirements. Because fault tolerance is closely connected with PEs, it is treated separately in Chapter XI.

6.1.1. Fault and Fault Tolerance

What is a fault? From the system point of view, a fault can be viewed as a deviation from the expected behavior of the system, that is, a malfunction of the system. A fault may be due to human error, hardware failure, software failure, and/or a design problem. Depending on timing, faults are mainly of three types:
• Intermittent fault: A fault that occurs for some time, vanishes, and then reappears. These faults are mainly due to hardware. Typical examples are a system that malfunctions when a part of it is hot but works again once it has cooled down, or a system that malfunctions because of a loose connection.
• Permanent fault: A persistent fault, for example one caused by failure of a component/subsystem such as an input short circuit.
• Transient fault: A fault that occurs once and then disappears.
Depending on the nature of the fault, it can again be of two types: silent and Byzantine.
• Silent fault: The output stops, that is, the failure results in no output.
• Byzantine fault: The system still produces output, but the output is incorrect. Byzantine faults are obviously the more difficult to tackle.
Unless a system is fault tolerant, a small failure of one component can cause total breakdown of the system. Tolerance stands for capacity of endurance, so fault tolerance can be conceived as the capacity to endure faults in the system. Fault tolerance is therefore the property that enables a system to continue operating properly in the event of failure of (or one or more faults within) some subsystem or component. Operation may continue without degradation or with decreased performance; where performance does decrease, the decrease is proportional to the severity of the fault.
For a fault tolerant design, the designer needs to anticipate the fault or deviation from expected behavior and develop the necessary measures to cope with it. This is not practicable in all situations, because many such measures are costly and may not be worth considering; for example, one would not normally duplicate the engine of a car or the generator in a power station. A fault tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than break down completely.
System fault tolerance can be of the resilient type (e.g., masking a process failure by replication), where the system works uninterrupted despite a sensor failure, for example in a 2 of 3 sensor configuration, or despite a single link failure in a redundant communication network. In some systems there is no degradation at all and only the operator's attention is drawn; such systems are transparent to the fault. This is to some extent possible with triple modular redundant (TMR) system logic or with complete redundancy in sensor, logic element, and final control element (FCE).
Fault tolerance is directly connected with dependability, which consists of “availability” (ready for immediate use), “reliability” (how long the system can run without failure), “safety” (how safe a failure is, i.e., nothing serious happens as a result of a single failure), and “maintainability” (ease of repair).

Table I/6.1.1-1

Fault Tolerance Characteristics

Name | Explanation | Example
No single point of failure/repair | The system continues to operate uninterrupted through a single failure and/or during the repair of that failure. | With an uninterruptible power supply (UPS) of suitable capacity connected to it, a computer continues to operate through a main supply failure or an input supply fuse failure, and during the repair of the same.
Fault isolation for the failing component | Ability of the system to isolate itself from the failed component and continue its operation. The necessary fault detection and isolating devices must therefore be provided. | It is quite common to isolate a part of the grid in case of a massive grid failure. PLC I/Os with galvanic/optical isolation perform a similar function.
Fault containment to prevent propagation of failure | In some cases the failure of one subsystem may propagate. Suitable measures shall therefore be built in to prevent such propagation. | A firewall in a network is the classical example. An explosion proof enclosure also performs the function of containment.
Availability of recovery | A process by which the system recovers from the failure. | Recovery is possible in two ways. In forward recovery the system is taken to a new correct state rather than the last correct state. In backward recovery it is brought back to the last correct state. Both are often found in network communications. When a computer OS fails, it is possible to restore the system to an earlier date at which it was functioning, the simplest example from day-to-day experience.
The basic characteristics of fault tolerance are given in Table I/6.1.1-1, with examples to aid understanding.
At this point it is worth noting that the fault tolerance of a system may be implemented in both the hardware and the software of the system. Fault tolerant software is normally part of the real time operating system (OS) interface and has some special characteristics, such as:
• Reaction to power failure, even if only for a graceful shutdown
• Immediate backup on system failure
• For multiple processors working in tandem, comparison of data/output for error detection and correction
• Where backup storage is provided, immediate switchover to the backup when the primary storage fails.

6.1.2. Redundancy and Replica

In most cases, fault tolerance is achieved through redundancy, so it is worth looking at redundancy a little more closely. Replica and redundancy are often used synonymously, but there is a small difference between the two.
• Replica: Normally multiple identical components/subsystems operate in parallel, and the correct output is chosen via voting/quorum, for example, the use of 1 of 2 or 2 of 3 transmitters (though in C&I engineering this is loosely termed 2 of 3 transmitter redundancy). In some cases, depending on how the replica is implemented, this can also be termed diversity of application.
• Redundancy: Normally multiple identical components/subsystems are kept ready for operation. In this case, one operates and the others come into operation when it fails (failure changeover/hot standby). For example, in a programmable logic controller (PLC) with two processors, one is working and the other is on standby (but tracking the I/Os), so that if one fails the other takes over. Three types of redundancy are applied:
Information: IT/network applications, viz. error correction code, parity, etc.
Time: Retrying persistently until there is success, as in network transactions.
Physical: Hardware/software, for example, the process replica discussed previously.
An obvious question is how many replicas are necessary to get a fault tolerant system. The answer is:
• For silent faults, the redundancy shall be k + 1, so that the system can withstand k faults with one survivor.
• For Byzantine faults, it shall be 2k + 1, so that k + 1 correct components remain to outvote the k faulty ones.
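The two rules above reduce to simple arithmetic. The following is a minimal sketch only; the function name `replicas_needed` and the string labels for the fault types are assumptions made for illustration and do not come from any standard.

```python
def replicas_needed(k: int, fault_type: str) -> int:
    """Minimum number of replicas needed to tolerate k faulty components.

    Hypothetical helper: 'silent' and 'byzantine' are illustrative labels.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if fault_type == "silent":
        # k components may stop producing output; one survivor is enough.
        return k + 1
    if fault_type == "byzantine":
        # k components may give wrong output; k + 1 correct ones outvote them.
        return 2 * k + 1
    raise ValueError(f"unknown fault type: {fault_type!r}")


if __name__ == "__main__":
    for k in (1, 2):
        print(k, replicas_needed(k, "silent"), replicas_needed(k, "byzantine"))
```

With k = 1 the Byzantine rule gives three replicas, which is the familiar 2 of 3 voting arrangement.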
In instrumentation and control, triple modular redundancy is very important for fail safe operation; Fig. I/6.1.2-1 shows the arrangement. Here, in each stage the outputs of the three elements are voted three times, by three separate voters, to get the output. In network communications, especially remote communication, there are a few other problems, known as the Two Army problem, the Byzantine generals problem, etc.
Figure I/6.1.2-1 Triple modular redundancy.
The issues discussed so far basically belong to fault masking, which hides a hardware fault. Another approach is dynamic recovery, in which a special mechanism detects the hardware fault, isolates the faulty hardware, and replaces it with a good one. An example makes this clear. Say a process control system has two processors, one working and the other on standby. A further processor acts as a diagnostic processor that checks the health of the others; when it finds a fault in the working processor, it switches control to the standby processor (which was also tracking the I/Os) and removes the faulty processor from service. This is sometimes found to be more cost-effective than the voting type.
In software, fault tolerance is achieved by:
• Static redundancy: n versions of the program are written and run to perform the same function. At a few checkpoints, the outputs of the different versions are voted on to select the final output.
• In another approach, the program is divided into blocks, which are run and tested. If a block fails its test, it is replaced by a redundant one.
• Another option is design diversity, with each version using a different set of hardware and software. All of these are detailed in Chapter XI.
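The voting used in triple modular redundancy and in n-version software can be sketched as a strict majority vote. This is an illustration under stated assumptions, not a safety-rated implementation: the function name `majority_vote` is hypothetical, and channel outputs are assumed to be discrete values (for example, trip/no-trip signals) that can be compared for equality.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value given by a strict majority of channels, else None.

    With three channels this is 2 of 3 voting: one faulty channel is masked.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all channels agree.
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    # One channel gives a wrong result; the other two outvote it.
    print(majority_vote([True, True, False]))
    # No two channels agree, so the voter cannot select an output.
    print(majority_vote(["A", "B", "C"]))
```

Returning `None` when there is no majority leaves the decision on what to do next (for example, going to a safe state) to the caller, which is a design choice of this sketch.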

6.2. Plant Ageing

The fault tolerance discussed in the previous clause relates mainly to instrumentation and control systems and to how faulty conditions in such systems are combated; in most of those cases the situations can be defined. Similarly, mechanical systems, especially vessels containing hazardous or corrosive materials, can withstand or tolerate adverse conditions only up to a certain extent, after which they tend to degrade. For example, corrosion can reduce wall thickness, or there can be structural deformity. The majority of such things happen due to plant ageing, and they can result in a major accident and/or a catastrophic situation. This part discusses plant ageing.
In day-to-day experience, some older people are fitter than younger ones, both physically and mentally. Similarly, plant ageing does not depend solely on chronological age, but also on many other factors. According to the petroleum safety authority, “ageing is not about how old your equipment is; it's about its condition and how it's changing over time.” Ageing is thus more about the degradation of the equipment with time and use.

6.2.1. Contributing Factors

Corrosion, erosion, mechanical fatigue, operation at design limit and/or beyond it, calibration failure, design fault (such as wrong material selection), obsolescence, etc. are major contributing factors for degradation of equipment/system, and/or electrical control and instrumentation (EC&I) items.

6.2.2. Ageing Management

Plant ageing management needs to be linked with the integrated asset management of the enterprise, and depends heavily on how effectively site assets are inspected, tested, calibrated, and maintained (in terms of both frequency and level of detail).

6.2.3. Various Plant Ageing Components

The major equipment, structures, and other items to be managed under a plant ageing system are shown in Table I/6.2.3-1. This is an indicative list only; there can be others, depending on plant type.

Table I/6.2.3-1

Major Items in Plant Ageing (Category and Example)

Category | Examples | Remarks
Structural and civil | Building structure, secondary/tertiary supporting foundation & structure for containment, structure for external impact (flooding, etc.), loading/unloading point & structure. |
Process containment equipment: mechanical items (non-rotary) | Pressure/process vessels, reactors, boiler & steam system, heat exchanger, pipe, piping network, flexible hoses, utilities, column distillation system, to name a few. | In the majority of cases and countries, regular inspection is mandatory.
Mechanical rotating items | Pump, compressor, electric generator, fans, turbine, etc. | In some cases regular inspection is mandatory.
Safeguard items (mainly non-EC&I for BPCS) | Pressure relief valve, safety valve, associated circuit, alarm & communication system, safety instrumentation, overfill protection, flare stack, chimney, etc. | Many of these call for regular inspection as a rule.
Electrical/C&I items | Level gauges, transmitters, and switches for BPCS & general instrumentation, power distribution, fixed hazard detection system. | In most cases these are subject to regular inspection.

6.2.4. Plant Ageing Indicator

Table I/6.2.4-1 can be used as an early warning of plant ageing. It is an indicative list only, for looking at plant ageing in general terms.

Table I/6.2.4-1

Plant Ageing Indicator

Category | Item | Discussions
Physical condition | Damaged surface condition of equipment, poor surface painting, corrosion status, trending in leakage, trending from inspection results | Corrosion and erosion can damage surface areas. High leakage found on vessel inspection could be a result of poor maintenance, cracking, or gasket damage. The results of repeated inspections can also indicate a deteriorating condition (e.g., a damaged bearing, hence vibration).
System availability and reliability | Frequent breakdown, loss of availability, trending in mean time between failure (MTBF), need for frequent repair, unstable BPCS | Frequent breakdown, repair, and MTBF trends clearly indicate a problem with ageing, whose cause must then be found. Instability in instrumentation could suggest that the BPCS instrumentation is obsolete or in poor condition, and/or a problem due to equipment ageing. It could also be due to poor maintenance (discussed below). Further inspection is needed to arrive at a conclusion.
Maintenance | Higher budget towards maintenance & repair, trends in mean time to repair (MTTR) | More and more attention needed for either EC&I or equipment, and a higher MTTR, all indicate plant ageing.
Operational performance | Lower grade/poor product quality, high rejection, deteriorating plant operational performance | Poor efficiency, high pumping cost, inability to cope with the product quality requirement, unstable operation.
Hazard potential | Action taken reports from PHA, design and operation reports, incident reports | Cases where not all of the suggested actions could be implemented, or where no improvement was noted after implementation. Design reports, operation reports, and incident reports could also be an eye opener.
Energy and environmental impact | Higher energy consumed per unit production, more pollution | Machine efficiency may have degraded, hence higher energy consumption or lower pollution control efficiency, for example, in an ESP.

6.2.5. Factors

Plant ageing factors shall include, but are not limited to, the following:
• Equipment age: Time
• Obsolescence of equipment/C&I
• Operation beyond designed temperature range
• Operation beyond design limits
• Bad material selection/fatigue test not carried out
• Sustained operation beyond corrosion limit
• Change of services
• Extremely corrosive atmosphere
• Departure from the environmental conditions assumed at design stage
• Poor maintenance of equipment/control
• Poor painting and protective materials used
• Wrong operational practice, high cycling rate of load or temperature
• Predictable ageing
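The MTBF and MTTR trends named as ageing indicators in Table I/6.2.4-1 can be computed from a maintenance log. The sketch below assumes a hypothetical log of reporting periods and uses the common definitions MTBF = operating time / number of failures and MTTR = repair time / number of failures; the class name, field names, and the rule of flagging any steady fall in MTBF or rise in MTTR are illustrative assumptions. A flag only prompts further inspection and is not a conclusion.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Period:
    """One reporting period from a (hypothetical) maintenance log."""
    operating_hours: float  # hours the equipment was actually running
    repair_hours: float     # total hours spent on repairs
    failures: int           # number of failures, each repaired once


def mtbf(p: Period) -> float:
    """Mean time between failures: operating time per failure."""
    return p.operating_hours / p.failures if p.failures else float("inf")


def mttr(p: Period) -> float:
    """Mean time to repair: repair time per failure."""
    return p.repair_hours / p.failures if p.failures else 0.0


def steadily(values: Sequence[float], rising: bool) -> bool:
    """True if values rise (or fall) at every step; needs two or more values."""
    pairs = list(zip(values, values[1:]))
    return bool(pairs) and all((b > a) if rising else (b < a) for a, b in pairs)


def ageing_warning(periods: List[Period]) -> bool:
    """Flag a steadily falling MTBF or a steadily rising MTTR."""
    falling_mtbf = steadily([mtbf(p) for p in periods], rising=False)
    rising_mttr = steadily([mttr(p) for p in periods], rising=True)
    return falling_mtbf or rising_mttr


if __name__ == "__main__":
    # Illustrative figures only, not plant data.
    log = [Period(8000, 40, 4), Period(7900, 72, 6), Period(7800, 120, 8)]
    print([round(mtbf(p)) for p in log], [mttr(p) for p in log])
    print(ageing_warning(log))
```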

6.2.6. Operator Action

An experienced operator understands the problem better and can therefore initiate actions such as:
• Recognizing ageing and trying to identify the points where it is occurring or is expected to occur.
• Advising/suggesting frequent, rigorous inspection, with due prioritization in the inspection plan.
• De-rating equipment and/or suggesting its replacement.

6.2.7. Inspection Teamwork

A good team of inspectors from various disciplines shall be formed to take up the issue efficiently. For better results, its work shall include, but is not limited to, the following points:
• Good coordination, leadership, and strong support from management
• Maintenance of an asset register and regular updating of the same
• Well-planned maintenance schedule and routine testing of equipment for condition monitoring (e.g., vibration for rotating equipment)
• Regular inspection for corrosion data, pressure vessels, etc.
• Regular electrical equipment testing and inspection
• Regular calibration and tuning (if necessary) of C&I items, and replacement of obsolete C&I equipment with a compatible latest device

6.2.8. Plant Ageing Control Discussions

There are a few options by which plant ageing control can be achieved:
• One option is to carry on the business with periodic checks, and to reactivate or rejuvenate the system during a turnaround so as to minimize the ageing effect to a great extent. This is common practice in many chemical and process industries. The approach is not only costly, but also gives no foresight about the remaining plant life. Some other industries take a similar approach but place more emphasis on inspection, in order to find the accurate reason for ageing.
• In industries such as offshore operations and nuclear plants the issue is more critical, so they undertake a risk management review or ageing management review at regular intervals during the design plant life, so that the plant always runs at a level well within ALARP. For risk control and management they use the risk matrix discussed earlier. In this method the degradation mechanisms, as discussed above, are identified first and then listed in a table. Against each degradation mechanism, the probable consequences, safeguards/mitigation, controls, etc. are listed. Likelihood is determined from the remaining lifetime (see CRR 363 Risk Based Inspection [RBI], @hse.uk); for example, when the remaining lifetime is, say, more than 10 years, the event is unlikely and may be ignored. Severity is assessed based on guide words, which are later converted to a number scale. The rating depends very much on factors such as impact on the project, location, population, etc., so the number scale is specific to the project. Each degradation mechanism thus has one severity value and one likelihood value, which are placed in the risk matrix. The steps are as follows:
Data collection for each component, with identification of degradation mechanisms
Criticality selection
Risk assessment and ranking (risk matrix) based on remaining life
Risk mitigation and control, asset life management
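The risk assessment and ranking step can be sketched as follows. Everything numeric here is an assumption made for illustration: only the "more than 10 years is unlikely" example comes from the text above, while the other remaining-life bands, the 1 to 4 severity scale, the use of likelihood times severity as the matrix score, and the example mechanisms are hypothetical. As noted above, a real number scale is specific to the project.

```python
from typing import List, NamedTuple


class Mechanism(NamedTuple):
    """One degradation mechanism for one component (illustrative data)."""
    name: str
    remaining_life_years: float
    severity: int  # assumed project-specific scale: 1 (minor) to 4 (catastrophic)


def likelihood(remaining_life_years: float) -> int:
    """Map remaining life to a likelihood rank, 1 (unlikely) to 4 (likely).

    Only the 10-year band follows the text; the other bands are assumptions.
    """
    if remaining_life_years > 10:
        return 1
    if remaining_life_years > 5:
        return 2
    if remaining_life_years > 2:
        return 3
    return 4


def risk_score(m: Mechanism) -> int:
    """Position in the risk matrix expressed as likelihood times severity."""
    return likelihood(m.remaining_life_years) * m.severity


def rank(mechanisms: List[Mechanism]) -> List[Mechanism]:
    """Highest risk first, so mitigation effort can be prioritized."""
    return sorted(mechanisms, key=risk_score, reverse=True)


if __name__ == "__main__":
    register = [
        Mechanism("transmitter obsolescence", 12, 3),
        Mechanism("vessel wall corrosion", 4, 4),
        Mechanism("pump bearing fatigue", 1.5, 2),
    ]
    for m in rank(register):
        print(m.name, likelihood(m.remaining_life_years), m.severity, risk_score(m))
```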

6.2.9. Progressive Ageing

This brief discussion of plant ageing concludes with an outline of progressive plant ageing, taken from [9], in Table I/6.2.9-1.

Table I/6.2.9-1

Progressive Ageing (Brief) [9]

Stage | Name | Characteristics
Stage 1 | Post-commissioning | Design, manufacturing, installation & commissioning issues; early life issues (training/trial); identification of potential ageing sites
Stage 2 | Risk-based | Operation within design limit; routine maintenance; extended operating period; upgraded risk analysis
Stage 3 | Ageing | Design limit approaching; evidence of active deterioration; degradation rate increasing; repair, refit, modification
Stage 4 | Terminal | Accelerating and accumulating damage; beyond design limit operating in known experience; monitoring; major repair, refit, replacement required