7
Identifying and Managing Engineering Risks

7.1 Introduction

Chapters 2 and 6 drew attention to two fundamental, and sometimes opposing, aspects of product and technology development work:

  • The creative element of the process, developing new ideas to solve problems and to improve people's well‐being.
  • The element of risk which those new engineering solutions might introduce in terms of reliability, robustness, or creation of other forms of harm or danger.

This chapter is concerned with the second of these points – the identification of risks that require attention. Chapter 8 then deals with how those risks might be reduced or eliminated through engineering development work.

Risk is a widely used term, and its simplest definition is perhaps given by the Cambridge English Dictionary: ‘the possibility of something bad happening’. Within the engineering and technology community, it is essentially ‘the possibility of something going wrong’. A more technical description can be found in the Risk Management Guide for DoD Acquisition (Ref. 3): ‘Risk is a measure of the potential inability to achieve overall program objectives within defined cost, schedule, and technical constraints’. Risk in this context is normally considered to have two elements: the frequency of a potentially hazardous event and the severity of its consequences.

It should be noted that the concern in this book is with engineering or technical risks that can be addressed through the engineering development process. Other forms of risk, such as business or financial risks, are not covered here, although the same principles can be used for their management – see Refs. 1–3.

7.2 Identification of Risks

It might be argued that the separate identification of risks is a relatively new concept in the engineering world. Engineers have, over a 200‐ to 250‐year period, always performed experiments and calculations to ensure that their products will work. Initially, knowledge was limited and failure was commonplace despite engineers' best efforts. For example, nineteenth‐century railways were plagued with problems and several royal commissions were set up in the United Kingdom looking, for example, at iron bridges on railways (1847). Unlike today, that was a period when failure in service was commonplace and was one of the primary learning mechanisms – an approach that would be unacceptable nowadays.

Explicitly listing out risks and following them through to close‐out is a more recent phenomenon. The ‘failure mode and effects analysis’ (FMEA) methodology, for example, which is one of several ways of analysing risks, was developed in the 1950s. Along with other approaches, it is used as a structured method of analysing potential failures and their effect on overall system reliability. This approach started in the military and aerospace industries, spread to automotive, and is now used widely in a range of engineering sectors. See Refs. 4, 5 for further material.

This attention to potential failure has been effective and has produced results. On a lighter note, consider the famous 1896 London to Brighton car run to celebrate the lifting in the United Kingdom of the 4 mph speed limit for ‘light locomotives’ (cars, in today's parlance). There were 58 entries, of which 32 or 33 made it to the start line and somewhere between 13 and 20 finished the approximately 60‐mile journey. The doubt about the numbers revolves around what constitutes finishing – does arriving a couple of days later after a major rebuild count as having finished? Compare this with today's cars, 120 years later. Reliability is measured in ‘failures per hundred vehicles’ and the best achieve about 0.7 fphv over a full year. Failures here are defined as any problem requiring attention, of which breakdowns are a small proportion. Rough arithmetic suggests that reliability has improved by a factor of about 10⁵ over this period.

On a more serious note, the other major driver of improvement has been the investigation of major accidents. Public investigations into disasters such as Flixborough (1974), Seveso (1976), Three Mile Island (1979), Challenger space shuttle (1986), Kings Cross fire (1987), and Piper Alpha oil rig (1988) have fundamentally changed the approach to safety engineering, particularly in high‐hazard industries where there is the possibility of major societal losses as well as the losses to the operator.

Subsequent sections of this chapter explain some of the approaches that have made these improvements possible. The overriding dictum is that successful design should fully anticipate all the possible and relevant ways in which failure can occur.

7.3 Risk‐Based Approach

The risk‐based approach is a response to the need for improved reliability and safety driven by the following:

  • In the consumer sector, reliability as a competitive advantage, automotive products being a specific example where Japanese manufacturers set new standards that others then had to meet
  • In other, high‐risk industrial sectors, the need for absolute safety – nuclear power, space exploration, high‐speed rail, chemical and process industries, oil and gas exploration, and passenger aircraft being examples
  • In the military field, the need for reliability of weapon systems (and obviously safety, in the case of nuclear weapons)
  • Allied to these points, the increasing complexity of products and systems, plus their increasing level of automation and reliance on software‐based control systems
  • An increasing public unwillingness to tolerate failures and their consequences

Engineering arguments might also be added to the list above. For example, the development of designs is increasingly reliant on different forms of computer simulation and modelling, in turn leading to reduced physical testing. There is then less opportunity for the application of the practical engineering instincts that have, for decades or centuries, been the backstop of common sense. Risk analysis is, in some ways, also a response to this situation.

A similar, risk‐based approach is also widely used in a range of fields of human activity: financial, business, health and safety, medicine, and project management being just some examples. An international standard, ISO31000:2009, and the associated document ISO Guide 73:2009 (Refs. 1, 2), give very general guidance, which is intended to be applicable to ‘any public, private or community enterprise’.

At the core of the risk‐based approach, irrespective of the application, are three straightforward activities:

  1. Identification of potential hazards or risks
  2. Estimation of the likelihood of each of these
  3. Estimation of the severity of the consequences of potential failures, should they occur

The first of these is often conducted as a group exercise, in which brainstorming is used to identify as many potential risks as possible, using the collective experience of the group involved – bringing in as far as possible the lessons of the past. The second and third are more analytical and, in some instances, may be quantified in terms of probability of occurrence over a defined period of time and in terms of the monetary value or the injuries/loss of life as a consequence of the failure.

Likelihoods and consequences are often described qualitatively on a 1–5 scale, such as the example shown in Figure 7.1, which is taken from MIL‐STD‐1629 Rev. A. Other standards and guidance documents, which are numerous, have something similar; some extend the scales to 10 steps or more.

  Level   Likelihood / probability of occurrence   Severity of consequences
  1       Extremely unlikely                        None
  2       Remote                                    Minor
  3       Occasional                                Marginal
  4       Reasonably probable                       Critical
  5       Frequent                                  Catastrophic

Figure 7.1 Likelihood and consequence categories.

At the simplest level, a straightforward multiplication of likelihood and consequences (Figure 7.2) is often used to give an initial prioritisation to identified potential problems. This specific model has just three categories, usually designated red, amber, and green. In engineering practice, a somewhat more complex approach is often used; matrices of up to 14 × 14 have been seen, with individual prioritisation instructions in each of the 196 cells, but this is perhaps a little too complicated and prescriptive.

[Figure: 5 × 5 risk assessment matrix, with rows for likelihood and columns for consequences; each cell contains the likelihood × consequence score, with the highest values (16, 20, and 25) shaded in the top right corner.]

Figure 7.2 Risk assessment matrix.
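To make this prioritisation arithmetic concrete, the short Python sketch below builds the 5 × 5 matrix of likelihood × consequence scores and assigns each cell to a red, amber, or green band. The category descriptions follow Figure 7.1, but the numerical thresholds used for the bands (15 and 5) are illustrative assumptions rather than values taken from any particular standard.

```python
# Illustrative 5 x 5 risk matrix: score = likelihood x consequence.
# The red/amber/green thresholds below are assumptions made for this sketch,
# not values prescribed by MIL-STD-1629 or any other standard.

LIKELIHOOD = {1: "Extremely unlikely", 2: "Remote", 3: "Occasional",
              4: "Reasonably probable", 5: "Frequent"}
CONSEQUENCE = {1: "None", 2: "Minor", 3: "Marginal", 4: "Critical", 5: "Catastrophic"}

def category(score: int) -> str:
    """Map a likelihood x consequence score to a red/amber/green band."""
    if score >= 15:        # assumed threshold for 'red' (act immediately)
        return "red"
    if score >= 5:         # assumed threshold for 'amber' (plan mitigation)
        return "amber"
    return "green"         # accept and monitor

if __name__ == "__main__":
    # Header row: consequence categories, lowest to highest severity.
    print(" " * 21 + "  ".join(f"{CONSEQUENCE[c]:<14}" for c in range(1, 6)))
    # Matrix rows, most likely events at the top.
    for lik in range(5, 0, -1):
        cells = [f"{lik * con:2d} ({category(lik * con):<5})" for con in range(1, 6)]
        print(f"{LIKELIHOOD[lik]:<20} " + "  ".join(cells))
```

Running the sketch reproduces the pattern of Figure 7.2, with the largest scores (and the ‘red’ band) concentrated in the high-likelihood, high-consequence corner.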

It should be noted that all risks, once identified, must be actively managed and not simply dismissed. It is, however, acceptable to judge a risk to be too unlikely to warrant attention, or to conclude that the cost of dealing with a risk is out of proportion to its likelihood and consequences, but these must be conscious decisions. Such decisions can be controversial when the safety of workers or the general public is involved, but there are guidelines for decisions of this type – see Section 7.10.

A further development of this approach is to introduce a third variable: the likelihood that a failure mode will be detected at the design stage by the designer or those overseeing the design, or at the manufacturing stage, or during the operational phase. These three parameters (likelihood, consequence severity, and detectability) are then ranked on a 1–10 scale and a ‘risk priority number’ (RPN) calculated by multiplying the three parameters, resulting in an RPN between 1 and 1000.

7.4 Sources of Engineering Risk

One of the early standards in this field, MIL‐STD‐1629, defines the mode of a failure, which is effectively a risk that has materialised, as:

the physical or chemical processes, design defects, quality defects, part misapplication, or other processes which are the basic reason for failure or which initiate the physical process by which deterioration proceeds to failure.

For any engineering product, simple or complex, there are many potential forms of failure, although their root causes may derive from a relatively small number of basic causes such as:

  • Design‐related:
    • Failure of the design to achieve the specified performance
    • Progressive loss of performance over the life of a product
    • Failure of the design to comply with legislative requirements
  • Manufacturing‐related:
    • Failure due to out‐of‐specification manufacture
  • Mechanical failures:
    • Breakage due to overload during operation
    • Breakage due to fatigue damage accumulated over a period of time
    • Failure due to creep
    • Wear out
    • Corrosion, erosion, UV, or chemical attack
    • Delamination of composite material
  • Electronic component failure
  • Software design malfunctions

Potential failures, or risks, may be described in terms which relate more closely to the product, e.g. brakes ‘stick on’. However, the underlying cause might be one of the above, such as corrosion. It is the latter information that will point to a solution which, in this instance, might be some form of protective coating, lubrication, or maintenance action.

7.5 Qualitative Risk Assessment Methodologies

Several methodologies exist for identifying, prioritising, and managing risks. The emphasis for all of them is to provide structure to the risk management process so that nothing is missed and so that there is an audit trail through to close‐out of potential problems. This information is also a valuable record of learning and experience for use on future projects and can be used as the basis of troubleshooting during service operation.

In product‐based industries (as opposed to process industries), ‘failure mode and effects analysis’ (FMEA) is the most widely used methodology. The name is sometimes extended to ‘failure mode, effects and criticality analysis’ (FMECA). Figure 7.3 shows a typical template for an FMEA analysis.

[Figure: FMEA worksheet with columns for item/function, potential failure mode(s), potential effect(s) of failure, severity, potential cause(s)/mechanism(s) of failure, probability, current design controls, and so on.]

Figure 7.3 Typical FMEA worksheet.

The worksheet can be used to illustrate the step‐by‐step process of an FMEA study:

  • Description of the design function that is required
  • Identification of the potential mode of failure
  • Description of the likely effect of the failure
  • Ranking of the severity of the effect of the failure on a scale of 1–5 or 1–10
  • Description of the mechanism of the potential failure
  • Ranking of the likelihood or probability of the failure, also on a 1–5 or 1–10 scale
  • Description of the means by which a fault would be detected in service
  • Ranking of the detectability on a 1–5 or 1–10 scale
  • Calculation of an RPN, obtained by multiplying severity × likelihood × detectability, resulting in an RPN between 1 and 125 or 1 and 1000
  • Agreement on the course of action to be taken
  • Agreement on who will accept responsibility and the timescale for completion
  • Summary of the action taken
  • Revised severity/likelihood/detectability/RPN based on the action completed

A simplified version would omit the detectability ranking and just rely on severity and likelihood.
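As an illustration of the RPN arithmetic described in the steps above, the sketch below represents a single FMEA worksheet row and recalculates the RPN after a corrective action. The failure mode, the 1–10 rankings, and the corrective action are all invented for the purpose of the example.

```python
from dataclasses import dataclass

@dataclass
class FmeaEntry:
    """One row of an FMEA worksheet, with rankings on a 1-10 scale."""
    item: str
    failure_mode: str
    severity: int      # 1 (no effect) .. 10 (catastrophic)
    occurrence: int    # 1 (remote) .. 10 (almost inevitable)
    detection: int     # 1 (certain to be detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Risk priority number: severity x occurrence x detection (1..1000).
        return self.severity * self.occurrence * self.detection

# Hypothetical worksheet row: a fuel-pipe corrosion failure mode.
before = FmeaEntry("Fuel pipe", "Leak due to external corrosion",
                   severity=8, occurrence=6, detection=5)

# After an assumed corrective action (protective coating plus a scheduled
# inspection), the occurrence and detection rankings improve; the severity
# of the failure itself is unchanged.
after = FmeaEntry(before.item, before.failure_mode,
                  severity=8, occurrence=3, detection=3)

print(f"RPN before action: {before.rpn}")   # 240
print(f"RPN after action:  {after.rpn}")    # 72
```

The before/after comparison mirrors the final two steps of the worksheet: the action is recorded and the revised severity/likelihood/detectability figures produce a lower RPN.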

The FMEA process should be initiated as soon as schematic drawings or descriptions are available and revisited or updated periodically as a living document in much the same way as drawings or technical specifications. The later the start is made to an FMEA, the less will be its influence. Any modifications to a product should be evaluated through the FMEA process.

Early‐stage FMEA analyses will point towards analysis, modelling, or test work that should be used to examine potential failures and reduce their likelihood. They may also be used as a method of evaluating alternative solutions.

The FMEA method, as a ground‐up approach, works well at the level of components or subassemblies and with single modes of failure. It is sometimes described as an ‘inductive’ approach and depends on experience of similar products and systems. For those designing and supplying these elements of a product or system, FMEA is often mandated by the system integrator, who may use a combination of individual FMEAs to identify critical characteristics and assess system reliability and failure consequences.

The same approach may also be used in a manufacturing environment to analyse potential problems with manufacturing processes – so‐called process FMEAs, or PFMEAs. The structure and approach follow the same principles, including the listing of critical characteristics at each step of the process and putting in place an appropriate control plan to make sure that the process is generating good product.

It should be noted, however, that FMEA is less effective with complex systems and with multiple, interacting modes of failure. For these more complex situations, other methodologies such as fault tree analysis (FTA) can be used in combination with individual FMEAs.

7.6 Fault Tree Analysis

Whereas FMEA is a bottom‐up approach, FTA looks at complex systems from the top of a system downwards.

The method was developed initially within the US military in the 1960s as a way of improving the reliability of missile systems. Its use since then has extended into the nuclear power, oil and gas, chemicals, aerospace, rail, and automotive sectors – all applications involving complex and critical systems.

The starting point is the definition of a significant system failure, such as failure of an engine to start, often referred to as an ‘undesired event’. From there, potential causes of failure, located more deeply in the system, are analysed and documented – the potential sources of failure, the causes of the causes, and so on, in the form of a hierarchy or tree of events. Thus, in the example suggested above, the ‘failure of engine to start’ event might be broken down, very simplistically, as follows:

  • Lack of fuel
    • Pipe blockage
    • Pump failure
    • Pipe disconnected
    • Run out of fuel
  • Failure of starter motor to operate
    • Battery failure
    • Motor jammed
    • Motor open circuit
    • Cable disconnected
    • Electrical supply failure
  • Failure of ignition system
    • Damaged/dirty plugs
    • Leads disconnected
    • Coil failure
    • Electrical supply failure

Figure 7.4 shows diagrammatically the layout of a typical fault tree.

[Figure: example fault tree in which the top event is linked through an OR gate to basic events 1 and 2 and to an AND gate; the AND gate is linked to basic events 3 and 4 and to an XOR gate, which in turn is linked to basic events 5 and 6.]

Figure 7.4 Example fault tree.

As can be seen, a complete tree covers the top event down to basic causes – which might already have been identified in an FMEA‐type study. Numerical failure probabilities may be estimated for the basic events, from which the overall system reliability can be calculated. Software is available to support the analysis of complex systems.

The fault tree is essentially built around Boolean logic, and there is a standard notation for such trees with agreed symbols. Examples of the latter include: basic (bottom‐level) events, ‘AND gates’, where two failures have to combine to produce the higher event, and ‘OR gates’, where either of two events can cause the higher event.
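The following minimal sketch shows how basic-event probabilities can be propagated up through OR and AND gates, using the common simplifying assumption that the basic events are independent. The tree structure follows the ‘engine fails to start’ example above, but the probability values are invented for illustration; real analyses of complex systems would normally use dedicated FTA software and minimal cut sets.

```python
from functools import reduce

def p_and(*probs: float) -> float:
    """AND gate: the output event occurs only if all inputs occur (independent events)."""
    return reduce(lambda acc, p: acc * p, probs, 1.0)

def p_or(*probs: float) -> float:
    """OR gate: the output event occurs if any input occurs (independent events)."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)

# Invented basic-event probabilities (per attempted start) for the
# 'engine fails to start' example in the text.
lack_of_fuel   = p_or(1e-4, 2e-4, 1e-5, 5e-3)        # blockage, pump, pipe, run out
starter_fails  = p_or(2e-3, 1e-4, 1e-4, 5e-5, 1e-4)  # battery, jam, open circuit, cable, supply
ignition_fails = p_or(1e-3, 2e-4, 1e-4, 1e-4)        # plugs, leads, coil, supply

top_event = p_or(lack_of_fuel, starter_fails, ignition_fails)
print(f"P(engine fails to start) per attempt ~ {top_event:.2e}")

# Redundancy enters through AND gates: a duplicated fuel pump contributes to
# 'lack of fuel' only if both pumps fail.
print(f"P(both redundant pumps fail) ~ {p_and(2e-4, 2e-4):.1e}")
```

The final line illustrates why redundancy is effective: combining two independent failures through an AND gate drives the probability down by several orders of magnitude, provided there is no common cause linking them.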

The approach is good at exploring complex systems and multiple sources of potential failure. It can examine the effectiveness of redundancy in a system and the interaction of failures, and it can identify common causes (one fault, such as loss of electrical power, leading to multiple failures). Its basic purpose is to draw attention to critical risks, which will then be the subject of redesign or development effort either to eliminate them, to reduce their impact, or to include them in system monitoring. The approach is also relevant to maintenance work and fault diagnosis.

7.7 Hazard and Operability Reviews – HAZOP

The examples described above relate both to products, such as aircraft or rail vehicles, and to fixed, safety‐critical plants. A related approach, now widely used, comes originally from the heavy chemicals sector but is applied to other forms of fixed plant as well as, occasionally, to products such as rail vehicles. Known universally as HAZOP, its full title is ‘hazard and operability’ studies and the emphasis, as the name implies, is on issues that might arise during operation – see Refs. 6–9.

The method breaks the complex plant, process, or asset into a series of ‘nodes’, each of which is examined systematically and in detail, again using a team‐based, multidisciplinary approach in a series of workshops spread over a period of time. Each node is examined for potential deviations from normal operation which could cause hazards or operability problems. There is a lexicon of standard guide words which is used to prompt the team to identify issues; these are generally very simple words such as ‘more’, ‘less’, and ‘late’. An experienced scribe is usually appointed to the team and a typical worksheet is shown in Figure 7.5. As this worksheet shows, the basic elements of the methodology are very similar in principle to FMEA, with its concentration on identifying potential problems and then ensuring that they are managed.

[Figure: HAZOP worksheet with cells for date, node, design intent of the system, HAZOP team chairman, and HAZOP team members, and columns for item, process parameters, guide word, deviation, possible cause, and so on.]

Figure 7.5 Typical HAZOP worksheet.
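Purely to illustrate the mechanics of the guide-word prompt, the sketch below combines a handful of guide words with process parameters at a single node to generate candidate deviations. The node name, the parameters, and the guide-word subset are assumptions for the example; the real value of a HAZOP lies in the team's structured discussion of each prompt, not in the list itself.

```python
from itertools import product

# A small, illustrative subset of HAZOP guide words and process parameters;
# a real study uses the full lexicon appropriate to the node being examined.
GUIDE_WORDS = ["no", "more", "less", "reverse", "late"]
PARAMETERS = ["flow", "pressure", "temperature"]

def deviation_prompts(node: str) -> list:
    """Generate candidate deviations (guide word applied to parameter) for one node."""
    return [f"{node}: {word.upper()} {param}"
            for word, param in product(GUIDE_WORDS, PARAMETERS)]

# 'Feed line to reactor R-101' is an invented node name.
for prompt in deviation_prompts("Feed line to reactor R-101"):
    print(prompt)
```

Each prompt (‘NO flow’, ‘MORE pressure’, and so on) would then be worked through on the worksheet: possible cause, consequence, existing safeguards, and any action required.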

Whilst this method can be applied retrospectively to existing plants, its real value comes from application during detailed design; a HAZOP study can be run once a process definition and piping and instrumentation diagrams (P&IDs) are available. The method is effectively mandated in many situations where regulatory approval will only be given if HAZOP studies, amongst other things, have been completed. It is, however, a time‐consuming and expensive process, and there may be a temptation to restrict the scope of a HAZOP analysis to safety and environmental concerns, thus excluding reliability and efficient plant operation. Its application is most relevant at the detailed design stage. Similar but somewhat simpler methods, such as hazard identification studies (HAZID), can and should be applied at earlier stages – the concept design and ‘front‐end engineering design’ (FEED) stages.

7.8 Quantitative Risk Assessment

The methods described above are essentially qualitative although some, such as FTA, can be used either qualitatively or quantitatively. Quantitative risk assessments (QRAs), or probabilistic risk assessments (PRAs), are normal practice in certain complex, high‐hazard industries, such as oil and gas, chemicals, and nuclear power generation. Their origins are again in these fixed, one‐off plants or installations, rather than volume products. However, similar quantitative methods are growing in use in the latter field and arguably have been used for many years in aerospace and rail, albeit in different forms from the process industries and usually described as ‘reliability studies’.

The start point for quantitative methods is the same risk identification approach described above, listing potential risks and assigning likelihoods and consequences in a qualitative manner. The more serious risks are then analysed quantitatively, the objective being to establish numerically the likelihood of major incidents and the consequences, again numerically, of such incidents. This then permits an evaluation of the figures against those that are considered acceptable to society (which is a subject in its own right).

Likelihoods are usually expressed as the probability of occurrence over a defined period, such as a year, or the probability per mission – a typical figure might be 1 in 10⁵, or 0.001%. Consequences could be measured in terms of injuries, deaths, financial cost, environmental damage, or loss of aircraft. Various modelling techniques are used to evaluate consequences, e.g. blast damage, atmospheric dispersion, radiation, toxicity, and environmental pollution.
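A very simple sketch of how quantified frequencies and consequences might be aggregated is given below. Each scenario, and every figure in it, is invented for illustration; the aggregate calculated is a crude expected-fatalities measure of the kind sometimes referred to as potential loss of life (PLL).

```python
# Invented scenarios for illustration: each has an estimated frequency
# (events per year) and an estimated consequence (fatalities per event).
scenarios = [
    {"name": "Small leak and local fire",      "per_year": 1e-3, "fatalities": 0.01},
    {"name": "Vessel rupture and explosion",   "per_year": 1e-5, "fatalities": 2.0},
    {"name": "Toxic release affecting public", "per_year": 1e-6, "fatalities": 10.0},
]

# Crude aggregate: expected fatalities per year (potential loss of life, PLL).
pll = sum(s["per_year"] * s["fatalities"] for s in scenarios)

for s in scenarios:
    contribution = s["per_year"] * s["fatalities"]
    print(f"{s['name']:<32} frequency {s['per_year']:.0e}/yr  "
          f"contribution {contribution:.1e}/yr")

print(f"Estimated PLL: {pll:.1e} fatalities per year")
```

Even this toy example shows a characteristic feature of QRA: low-frequency, high-consequence scenarios can contribute as much to the aggregate risk as much more frequent minor events, which is why they attract so much of the analytical effort.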

The methods for analysis of this type are essentially empirical and have been built up by experience over a number of years, on an industry‐by‐industry basis. Unlike other forms of engineering modelling, simulation methods cannot be verified by running experiments and correlating results. Accidents and incidents are studied avidly, but there is an inherent level of uncertainty in most quantitative safety analyses that then leads, rightly, to a cautious approach being taken by the application of a ‘disproportion factor’, a form of safety factor, to the analysis.

The sectors where quantitative safety analysis is performed are generally those where regulatory approval is based on a safety case (see below). The quantitative analysis is therefore one of several means of justifying that approval and not an activity conducted in isolation.

7.9 Functional Safety

An increasing number, if not the majority, of products, machines, plant, and equipment rely heavily on electronic control systems and software for their normal functioning. At one extreme, some systems are wholly reliant on active, automated control for normal operation; for example, some military aircraft are aerodynamically unstable (to increase their manoeuvrability) and are hence completely reliant on automated control systems for safe flight. Similarly, many complex plants are beyond the stage where they can readily be controlled manually. Looking ahead, there is much interest (e.g. in cars and ships) in developments towards autonomous operation. In all these situations, safety is dependent on the components, equipment, and control systems working correctly.

‘Functional safety’ is the term used internationally to describe the situation where safety has this dependence on correct operation. A series of international standards cover the topic, with IEC 61508 as the master standard. There are then separate and specific standards for the automotive, rail, process, and nuclear industries – see Refs. 10, 11.

Within these standards, a ‘safety function’ is a process in which an active control system detects the development of potentially dangerous conditions and triggers protective or corrective actions, either as a separate override system or as part of the wider functionality of the core system.

As indicated above, the topic is relevant to a wide range of industries: process plants, oil and gas, nuclear, medical, rail signalling, automotive (especially as autonomous technologies develop), aerospace, and machining. Even simple devices, such as automatic door opening systems in shopping malls, have a degree of safety criticality.

As with all potential hazards, the starting point for determining what should be done to address the safety of such systems is an initial risk analysis. This may or may not show the need for active functional safety (other forms of safety, such as mechanically interlocked guarding, may be sufficient). The IEC61508 standard (Refs. 10,11) accepts that zero risk cannot be achieved, but it does promote the consideration of safety from the outset of system design and make the point that nontolerable risks must be reduced to as low a level as reasonably practicable (see Section 7.10). The standard is ‘end‐to‐end’ in the sense that it applies to all points in the life cycle from concept to disposal.

The International Electrotechnical Commission (IEC) standard uses hazard and risk analyses based on six categories of frequency and four categories of consequence severity, giving a 24‐cell risk assessment matrix. From this matrix, four categories of risk are then identified:

  • Class I. Unacceptable in any circumstance
  • Class II. Undesirable: tolerable only if risk reduction is impracticable or if the costs are grossly disproportionate to the improvement gained
  • Class III. Tolerable if the cost of risk reduction would exceed the improvement
  • Class IV. Acceptable as it stands, though it may need to be monitored

These categories can then be used to decide whether a safety function is needed to achieve a sufficiently low overall risk. If a safety function is needed, it must then be decided what form it should take in order to achieve and maintain a ‘safe state’ within the system being controlled (which could include shutdown) in the event of problems arising. The performance criteria for this safety function are in turn defined in terms of safety integrity levels (SILs) (Ref. 12), which specify probabilistically the likelihood of not achieving the defined function under defined conditions within the required timescale – see Figure 7.6. These levels in effect specify the amount of risk reduction that the safety function must provide relative to an unprotected situation.

  SIL level   Probability of failure on demand (i)   Probability of failure per hour (ii)
  1           10⁻¹ – 10⁻²                             10⁻⁵ – 10⁻⁶
  2           10⁻² – 10⁻³                             10⁻⁶ – 10⁻⁷
  3           10⁻³ – 10⁻⁴                             10⁻⁷ – 10⁻⁸
  4           10⁻⁴ – 10⁻⁵                             10⁻⁸ – 10⁻⁹

Figure 7.6 Safety integrity levels.

Two types of system are considered: those (‘i’ above) that switch in when a problem is detected (e.g. an emergency shut‐down system) and those (‘ii’ above) that operate continuously.
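As a small worked illustration of Figure 7.6, the sketch below maps an assessed average probability of failure on demand to a SIL band for a demand-mode (type ‘i’) safety function. The band boundaries are those in the figure; the treatment of the exact boundary values is a convention choice, and the function shown is a sketch rather than a substitute for the assignment methodology of Ref. 12.

```python
from typing import Optional

def sil_for_pfd(pfd: float) -> Optional[int]:
    """Return the SIL band for a demand-mode safety function from its average
    probability of failure on demand (column (i) of Figure 7.6).
    Returns None if the PFD is too high to meet even SIL 1."""
    bands = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
    for sil, (low, high) in bands.items():
        if low <= pfd < high:
            return sil
    return None

print(sil_for_pfd(3e-3))   # -> 2
print(sil_for_pfd(0.5))    # -> None (does not meet SIL 1)
```

In practice the target SIL is set first, from the amount of risk reduction the safety function must provide, and the design is then developed and verified to show that its achieved failure probability falls within the corresponding band.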

The material above is intended as a very broad summary of the concept of functional safety and how it should be assured. There are significant differences in how different industries address the topic, reflecting their different needs and the different consequences of failure. For example, machine tools have much less capacity to cause large‐scale damage and injury than a major process plant. There is also no uniform definition of SILs across all industries. However, the risk‐based approach and the concepts of functional safety are widely accepted and will apply to a wider range of engineering products and systems as they adopt an increasing level of built‐in ‘intelligence’.

[Figure: ALARP boundaries depicted as an inverted triangle divided into three zones – unacceptable at the top, the ALARP zone in the middle, and the acceptable zone at the bottom – with the width of the triangle representing the frequency of the risk.]

Figure 7.7 ALARP boundaries.

7.10 As Low as Reasonably Practicable

The question was raised earlier as to what level of risk is considered acceptable when the safety of users, operators, employees, or the general public is concerned. The UK health and safety regime operates under the legal umbrella of ‘as low as reasonably practicable’, abbreviated to the ugly but widely used acronym of ‘ALARP’. This approach developed from workplace safety considerations but the principle can be applied more widely to hazardous plants and to other safety or risk‐related situations – see Ref. 13.

The approach recognises that absolute safety cannot be achieved, and it is acceptable (and legally defensible) to choose not to adopt certain safety measures. Thus, if certain measures are not technically feasible (i.e. not practicable) or are not proportionate to the benefits (i.e. not reasonably practicable), then they can be rejected.

However, the test of what is reasonable can be strict. It must be shown that good practice, as it exists in other similar or related situations, has been adopted. To support this argument, it would be normal to carry out a full qualitative risk assessment and probably a quantitative assessment. The latter, in particular, would show which safety improvement measures should be adopted and which do not bring a proportionate benefit. It should also be noted that the state of the art is always moving forward: new technologies are being developed and learning will be acquired from accidents or incidents. Hence, what might be considered acceptable in one decade may not be acceptable in the next.

The ALARP concept is applicable within certain boundaries – see Figure 7.7, where the width of the inverted triangle represents the frequency of the risk. If, for example, after adopting all reasonably practicable measures, the level of quantified risk for a product, facility, or activity is still above certain boundaries, then the proposal will have to be abandoned as unacceptable or an alternative solution developed. Conversely, once risk reduction measures have reduced risks to negligible levels, the ALARP principle no longer applies. In this context, the ALARP zone is considered to cover approximately the range from 1 in 10³ to 1 in 10⁶ risk of a fatality in one year.
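The numerical boundaries quoted above can be turned into a simple screening check, as sketched below. The boundary values are the broad figures from the text; in practice the applicable limits differ between workers and members of the public and must be taken from the relevant regulatory guidance.

```python
def alarp_zone(individual_risk_per_year: float) -> str:
    """Classify an individual risk of fatality per year against the broad
    boundaries quoted in the text (~1e-3 upper, ~1e-6 lower). Real limits
    differ between workers and the public and come from the regulator."""
    if individual_risk_per_year > 1e-3:
        return "unacceptable - redesign or abandon the proposal"
    if individual_risk_per_year > 1e-6:
        return "ALARP zone - reduce further unless cost is grossly disproportionate"
    return "broadly acceptable - no further reduction required"

print(alarp_zone(5e-4))   # falls in the ALARP zone
print(alarp_zone(2e-7))   # broadly acceptable
```

A result in the ALARP zone is not an end point: it triggers the demonstration, described above, that all reasonably practicable further measures have been considered and either adopted or justifiably rejected.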

7.11 Safety Cases

The points above relate mainly to single risks or failures, each of which is considered in isolation. With every product, system, or plant, multiple risks will accumulate to an overall level of risk, which may or may not be acceptable. For complex systems or plants, it is normal to compile a ‘safety case’ that examines all aspects of risk and puts forward arguments as to why the system overall is adequately safe. Such a safety case is required for regulatory approval in a growing range of industries: nuclear, oil and gas, chemicals, rail, medicine, aerospace, and automotive as examples.

In this situation, a safety case is a compilation of the arguments that the system in question is adequately safe when operating for a specific purpose under specified operating conditions. This is sometimes described as an evidence‐based approach where the evidence supports the arguments that the system is safe. The alternative is a more prescriptive approach where, by complying with certain standards (national or international) or test methods, something is considered safe. The latter approach is perfectly acceptable for small‐scale products, such as small domestic appliances, where international standards or test codes are the normal means by which safety is assured. The safety case approach is applicable to more complex systems where there are multiple potential failure modes and the possibility of failures interacting with each other. The case considers not just the design of the system but how it is operated and maintained, how staff are trained, what emergency procedures exist, and how these factors are all managed.

Because of the arguments‐based nature of safety cases, and hence the potential for different interpretations, regulatory approval may require an independent peer review before approval is given.

7.12 Stretching the Boundaries

Chapter 2 introduced the concept of risk in engineering and identified four categories of risk, which are repeated below:

  1. Obvious. Those that both the developer of the technology and outside parties agree upon and where therefore consensus can be easily achieved
  2. Experience. Those that the developer may have missed but that peers and grey‐haired engineers might identify from their experience
  3. Hidden. Those for which the signs are present but the developer might be ignoring or dismissing, e.g. because test material is ‘nonrepresentative’
  4. Danger zone. Those that lie beyond the experience or expectation of all concerned

The approaches described so far should bring out risks in the first three of these categories, provided they are done thoroughly and involve experienced people. But this then raises the question: how can risks that are beyond normal experience be identified? Clearly, the more thorough and rigorous the risk analysis, the smaller is the danger zone. But is there any guidance that could be suggested to identify the ‘unknown unknowns’? Logically, if they are unknowable, then they are destined to be hidden until they emerge, taking everyone by surprise. However, there is merit in any risk review in stepping back at some point and asking: ‘what have we missed?’ Areas to consider might include:

  • Taking an established technology but stretching its application just too far
  • Subjecting an established product to a duty cycle with which it cannot cope
  • Not foreseeing potential software and control system problems
  • Malevolent hacking of control and communication systems
  • Use and abuse of products outside the operating envelope originally envisaged

The book Design Paradigms by Henry Petroski (Ref. 14) contains some instructive examples, over a long period of time, from the field of bridge‐building. It is interesting to see how particular forms of bridge design move from conservatism, in the early stages, to overconfidence and failure, then back to conservatism. For example, the Tacoma Narrows suspension bridge, which broke up in high winds in 1940 as a result of aerodynamic forces, is often regarded as the revelatory event for this phenomenon. However, history shows that at least 10 suspension bridges were destroyed or severely damaged by wind in the nineteenth century alone. Another instructive book is Major Hazards and Their Management (Ref. 15).

In the twenty‐first century, it could be argued that software, control, and communication systems present the greatest challenges in terms of stepping into uncharted territory. The technologies of the Fourth Industrial Revolution undoubtedly present tremendous opportunities, but their downsides and risks need also to be given careful thought.

Hence, good practice in risk management should always include a brain‐wracking activity to try to identify such issues. At the same time, care must be taken to ensure that the culture of the organisation concerned allows people to raise concerns without fear of ridicule or retribution.

7.13 Concluding Points

Identifying, reducing, and managing technical risks is one of the most fundamental aspects of the engineering development process. It has been an integral aspect of this process from time immemorial, but it has been an explicitly managed topic since around the 1960s. Partly driven by the increasing complexity of products and systems and partly by greater market pressure for reliability and safety, it is a topic in its own right with an extensive literature which this chapter can summarise only briefly.

Risk identification and management is one of the primary mechanisms for embodying the lessons of the past. As George Santayana said, ‘Those who cannot remember the past are condemned to repeat it’, and this is particularly true of engineering work. Learning from the failures of the past is central to the engineering process. It might also be considered legally negligent not to do so.

Much of the basic thinking and many of the methods of analysis in this field have come from two sources:

  • Aerospace and defence. Driven by the needs of reliability and, in the case of passenger aircraft, basic safety
  • High‐hazard process industries. Driven by the potential consequences of major accidents

The basic methodologies of risk management all revolve around the simple concepts of identifying potential hazards, estimating their likely frequency and consequences, and then adopting courses of action appropriate to the level of risk. It is accepted that zero risk is impossible to achieve and problems or accidents do still happen. However, all long‐term measures of reliability and safety show a strongly improving trend and what was considered acceptable some decades ago would be intolerable today. This trend will continue into the future.

At the same time, the future will present challenges. In particular, most products, and not just complicated items such as aircraft or major plants, are incorporating an increasing level of intelligence. Predicting how they will operate and what could go wrong will stretch the minds of both engineers and regulators.

For engineers wishing to educate themselves more widely on this important topic, formal investigations into major accidents or problems (which go back to the 1840s) are an interesting source of learning for the enquiring mind. The older case studies have the advantage of not being clouded by political or social factors, litigation, or vested interests. Although the technologies involved have been superseded, the general points of learning are still valid, especially as most of them revolve around human factors and human error. These case studies are concentrated in industries that can have a wide societal impact if something goes wrong, but the lessons of the past apply to all fields of engineering.

References

International standards and public procurement documents provide overall frameworks for risk management:

  1. ISO 31000:2009 – Risk Management. International Organization for Standardization, 2009.
  2. ISO Guide 73:2009 – Risk Management – Vocabulary. International Organization for Standardization, 2009.
  3. Risk Management Guide for DoD Acquisition, Sixth Edition (Version 1.0). US Department of Defense.

FMEA methods are covered by a number of documents including:

  4. MIL‐STD‐1629 Rev. A – Procedures for Performing a Failure Mode, Effects and Criticality Analysis. US Department of Defense, November 1980.
  5. SAE J1739_200901 (Revision 4) – Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA). SAE International, 2009.

HAZOP methods are covered in these four references:

  6. BS IEC 61882:2002 – Hazard and Operability Studies (HAZOP Studies) – Application Guide.
  7. Kletz, T. (2006). HAZOP and Hazan, 4e. Taylor & Francis.
  8. European Process Safety Centre (2008). HAZOP: Guide to Best Practice, 2e. IChemE.
  9. IEC 61882:2016 – Hazard and Operability Studies (HAZOP Studies) – Application Guide. International Electrotechnical Commission, 2016.

Functional safety and safety integrity levels are covered in IEC61508 and related standards:

  10. IEC 61508 (2016). Functional Safety of Electrical/Electronic/Programmable Electronic Safety‐Related Systems (E/E/PE, or E/E/PES). International Electrotechnical Commission.
  11. Related standards: IEC 61511 (process industries), IEC 61513 (nuclear power plants), IEC 62061 (machinery systems), IEC 62425 (railway signalling systems), and ISO 26262 (road vehicles).
  12. Health and Safety Executive (2004). A Methodology for the Assignment of Safety Integrity Levels (SILs) to Safety‐Related Control Functions Implemented by Safety‐Related Electrical, Electronic and Programmable Electronic Control Systems of Machines. UK.

This paper gives a good overview of the application of ALARP principles:

  13. David, R. and Wilkinson, G. (2009). Back to Basics: Risk Matrices and ALARP.

Finally, these books are well worth reading:

  14. Petroski, H. (1994). Design Paradigms: Case Studies of Error and Judgement in Engineering. Cambridge University Press.
  15. Wells, G. (1997). Major Hazards and Their Management. IChemE.