5
RESILIENCE ENGINEERING

Resilience Engineering represents a way of thinking about safety that departs from conventional risk management approaches (e.g., error tabulation, violations, calculation of failure probabilities). It looks for ways to enhance the ability of organizations to monitor and revise risk models, to create processes that are robust yet flexible, and to use resources proactively in the face of disruptions or ongoing production and economic pressures. Accidents, according to Resilience Engineering, do not represent a breakdown or malfunctioning of normal system functions, but rather represent breakdowns in the adaptations necessary to cope with real-world complexity. As control theory suggested with its emphasis on dynamic stability, individuals and organizations must always adjust their performance to current conditions, and because resources and time are finite, such adjustments are inevitably approximate. Success has been ascribed to the ability of groups, individuals, and organizations to anticipate the changing shape of risk before damage occurs; failure is the temporary or permanent absence of that ability.

CASE 5.1 NASA ORGANIZATIONAL DRIFT INTO THE COLUMBIA ACCIDENT

While the final breakup sequence of the space shuttle Columbia could be captured by a sequence-of-events model, the organizational background behind it calls for an entirely different form of analysis, and has formed a rich trove of inspiration for thinking about how to engineer resilience into organizations.

A critical precursor to the mission was the reclassification of foam events from in-flight anomalies to maintenance and turnaround issues, something that significantly degraded the safety status of foam strikes. Foam loss was increasingly seen as an accepted risk or even, as one pre-launch briefing put it, “not a safety of flight issue” (CAIB, 2003, p. 126). This shift in the status of foam events is an important part of explaining the limited and fragmented evaluation of the Columbia foam strike and how analysis of that foam event never reached the problem-solving groups that were practiced at investigating anomalies, their significance and consequences, that is, Mission Control.

What was behind this reclassification? How could it make sense for the organization at the time? Pressure on schedule issues produced a mindset centered on production goals. There are several ways in which this could have played a role: schedule pressure magnifies the importance of activities that affect turnaround; when events are classified as in-flight anomalies a variety of formal work steps and checks are invoked; the work to assess anomalies diverts resources from the tasks to be accomplished to meet turnaround pressures. In fact the rationale for the reclassification was quite weak and flawed. The CAIB’s examination reveals that no cross-checks were in place to detect, question, or challenge the specific flaws in the rationale. Managers used what on the surface looked like technical analyses to justify previously reached conclusions, rather than the robust cognitive process of using technical analyses to test tentative hypotheses.

It would be very important to know more about the mindset and stance of different groups toward this shift in classification. For example, one would want to consider: Was the shift due to the salience of the need to improve maintenance and turnaround? Was this an organizational structure issue (which organization focuses on what aspects of problems)? What was Mission Control’s reaction to the reclassification? Was it heard about by other groups? Did reactions to this shift remain underground relative to formal channels of communication?

Interestingly, the organization had three categories of risk: in-flight anomalies, accepted risks, and non-safety issues. As the organization began to view foam events as an accepted risk, there was no formal means for follow-up with a re-evaluation of an “accepted” risk to assess if it was in fact acceptable as new evidence built up or as situations changed. For all practical purposes, there was no difference between how the organization was handling non-safety issues and how it was handling accepted risks (i.e., accepted risks were being thought of and acted on no differently than non-safety issues). Yet the organization acted as if items placed in the accepted risk category were being evaluated and handled appropriately (i.e., as if the assessment of the hazard was accurate and up to date and as if the countermeasures deployed were still shown to be effective).

Foam events were only one source of debris strikes that threaten different aspects of the orbiter structure. Debris strikes carry very different risks depending on where and what they strike. The hinge in considering the response to the foam strike on STS-107 is that the debris struck the leading-edge structure (RCC panels and seals) and not the tiles. Did concern and progress on improving tiles block the ability to see risks to other structures? Did NASA regard the leading edge as much less vulnerable to damage than tiles? This is important because the damage in a previous mission (STS-45) provided an opportunity to focus on the leading-edge structure and reconsider the margins to failure of that structure given strikes by various kinds of debris. Did this mission create a sense that the leading-edge structure was less vulnerable than tiles? Did this mission fail to revise a widely held belief that the RCC leading-edge panels were more robust to debris strikes than they really were? Who followed up the damage to the RCC panel and what did they conclude? Who received the results? How were risks to non-tile structures evaluated and considered – including landing gear door structures? More information about the follow-up to leading-edge damage in STS-45 would shed light on how this opportunity was missed.

A management stance that downplayed the significance of the strike emerged early in the Columbia mission. The initial and very preliminary assessments of the foam strike created a stance toward further analysis that this was not a critical or important issue for the mission. The stance developed and took hold before there were results from any technical analyses. This indicates that preliminary judgments were biasing data evaluation, instead of following a proper engineering evaluation process where data evaluation points teams and management to conclusions.

Indications that the event was outside of boundary conditions for NASA’s understanding of the risks of debris strikes seemed to go unrecognized. When events fall outside of boundaries of past data and analysis tools and when the data available includes large uncertainties, the event is by definition anomalous and of high risk. While personnel noted the specific indications in themselves, no one was able to use these indicators to trigger any deeper or wider recognition of the nature of the anomaly in this situation. This pattern of seeing the details but being unable to recognize the big picture is commonplace in accidents.
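The boundary-condition reasoning above can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the class `ValidatedRange`, the function `classify_event`, the input names, the numeric ranges, and the uncertainty threshold are all hypothetical and are not taken from the CAIB report or from any NASA tool. It also adopts a deliberately conservative reading in which either condition alone (an input outside the validated range, or a large uncertainty) is enough to flag the event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedRange:
    """Range of one input over which an analysis tool has been validated (hypothetical)."""
    name: str
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def classify_event(inputs, envelope, relative_uncertainty, max_uncertainty=0.25):
    """Return (status, reasons); status is "anomalous" when any reason is found.

    Conservative assumption: one out-of-range input, or uncertainty above the
    illustrative threshold, is sufficient to treat the event as anomalous.
    """
    reasons = []
    for rng in envelope:
        value = inputs[rng.name]
        if not rng.contains(value):
            reasons.append(
                f"{rng.name}={value} is outside the validated range "
                f"[{rng.low}, {rng.high}]"
            )
    if relative_uncertainty > max_uncertainty:
        reasons.append(
            f"relative uncertainty {relative_uncertainty:.2f} exceeds "
            f"{max_uncertainty:.2f}"
        )
    status = "anomalous" if reasons else "within-envelope"
    return status, reasons


# Illustrative envelope in arbitrary units; these numbers are made up.
ENVELOPE = [
    ValidatedRange("debris_volume", 0.0, 1.0),
    ValidatedRange("impact_angle_deg", 0.0, 15.0),
]

status, reasons = classify_event(
    {"debris_volume": 600.0, "impact_angle_deg": 20.0},
    ENVELOPE,
    relative_uncertainty=0.10,
)
```

The point of returning the list of reasons alongside the status is that the specific indications stay attached to the overall judgment, rather than being noted individually without triggering wider recognition of the anomaly.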

As the Debris Assessment Team (DAT) was formed after the strike was detected and began to work, the question arose: “Is the size of the debris strike ‘out-of-family’ or ‘in-family’ given past experience?” While the team looked at past experience, it was unable to get a consistent or informative read on how past events indicated risk for this event. It appears no other groups or representatives of other technical areas were brought into the picture. This absence of any cross-checks is quite notable and inconsistent with how Mission Control groups evaluate in-flight anomalies. Past studies indicate that a review or interaction with another group would have provided broadening checks which help uncover inconsistencies and gaps as people need to focus their analysis, conclusions, and justifications for consideration and discussion with others.

Evidence that the strike posed a risk of serious damage kept being encountered – RCC panel impacts at angles greater than 15 degrees predicted coating penetration (CAIB, 2003, p. 145), foam piece 600 times larger than ice debris previously analyzed (CAIB, 2003, p. 143), models predicting tile damage deeper than tile thickness (CAIB, 2003, p. 143). Yet a process of discounting evidence discrepant with the current assessment went on several times (though eventually the DAT concerns seem to focus on the landing gear doors rather than the leading-edge structure).

Given the concerns about potential damage that arose in the DAT and given its desire to determine the location more definitively, the question arises: did the team conduct contingency analyses of damage and consequences across the different candidate sites – leading edge, landing gear door seals, tiles? Based on the evidence compiled in the CAIB report, there was no contingency analysis or follow-through on the consequences if the leading-edge structure (RCC) was the site damaged. This is quite puzzling as this was the team’s first assessment of location and in hindsight their initial estimate proved to be reasonably accurate.

This lack of follow-through, coupled with the DAT’s growing concerns about the landing gear door seals, seems to indicate that the team may have viewed the leading-edge structures as more robust to strikes than other orbiter structures. The CAIB report fails to provide critical information about how different groups viewed the robustness or vulnerability of the leading-edge structure to damage from debris strikes (of course, post-accident these beliefs can be quite hard to determine, but various memos/analyses may indicate more about the perception of risks to this part of the orbiter). Insufficient data is available to understand why RCC damage was under-pursued by the Debris Assessment Team.

There was a fragmented view of what was known about the strike and its potential implications across time, people, and groups. There was no place, artifact, or person that had a complete and coherent view of the analysis of the foam strike event (note that a coherent view includes understanding the gaps and uncertainties in the data or analysis to that point). This contrasts dramatically with how Mission Control works to investigate and handle anomalies, where there are clear lines of responsibility to have a complete, coherent view of the evolving analysis vested in the relevant flight controllers and in the flight director. Mission Control has mechanisms to keep different people in the loop (via monitoring voice loops, for example) so that all are up to date on the current picture of the situation. Mission Control also has mechanisms for correcting assessments as analysis proceeds, whereas in this case the fragmentation and partial views seemed to block reassessment and freeze the organization in an erroneous assessment. As the DAT worked at the margins of knowledge and data, its partial assessments did not benefit from cross-checks through interactions with other technical groups with different backgrounds and assumptions. There is no report of a technical review process that accompanied its work. Interaction with people or groups with different knowledge and assumptions is one of the best ways to improve assessments and to aid revision of assessments. Mission Control anomaly response includes many opportunities for cross-checks to occur. In general, it is quite remarkable that the groups practiced at anomaly response – Mission Control – never became involved in the process.

The process of analyzing the foam strike by the DAT broke down in many ways. The fact that this group also advocated steps that we now know would have been valuable (the request for imagery to locate the site of the foam strike) can lead us to overlook the generally fragmented distributed problem-solving process. The fragmentation also occurred across organizational levels (DAT to Mission Management Team (MMT)). Effective collaborative problem-solving requires more direct participation by members of the analysis team in the overall decision-making process. This is not sufficient, of course; for example, the MMT’s stance already defined the situation as, “Show me that the foam strike is an issue” rather than “Convince me the anomaly requires no response or contingencies.” Overall, the evidence points to a broken distributed problem-solving process – playing out across organizational boundaries. The fragmentation in this case indicates the need for a senior technical focal point to integrate and guide the anomaly analysis process (e.g., the flight director role). And this role requires real authority. The MMT and the MMT chair were in principle in a position to supply this role, but: Was the MMT practiced at providing the integrative problem-solving role? Were there other cases where significant analysis for in-flight anomalies was guided by the MMT or were they all handled by the Mission Control team? The problem-solving process in this case has the odd quality of being stuck in limbo: not dismissed or discounted completely, yet unable to get traction as an in-flight anomaly to be thoroughly investigated with contingency analyses and re-planning activities. The dynamic appears to be a management stance that puts the event outside of safety of flight (e.g., conclusions drove, or eliminated, the need for analysis and investigation, rather than investigations building the evidence from which one would draw conclusions). In addition, the DAT exhibited a fragmented problem-solving process that failed to integrate partial and uncertain data to generate the big picture: that the situation was outside the understood risk boundaries and carried significant uncertainties.

The Columbia case reveals a number of classic patterns that have helped shape the ideas behind resilience engineering – some of these patterns have part of their basis in the earlier models described in this chapter:

• Drift toward failure as defenses erode in the face of production pressure.

• An organization that takes past success as a reason for confidence instead of investing in anticipating the changing potential for failure.

• Fragmented distributed problem-solving process that clouds the big picture.

• Failure to revise assessments as new evidence accumulates.

• Breakdowns at the boundaries of organizational units that impede communication and coordination.

The Columbia case provides an example of a tight squeeze on production goals, which created strong incentives to downplay schedule disruptions. With shrinking time/resources available, safety margins were likewise shrinking in ways that the organization couldn’t see. Goal tradeoffs often proceed gradually as pressure leads to a narrowing of focus on some goals while obscuring the tradeoff with other goals. This process usually happens when acute goals like production/efficiency take precedence over chronic goals like safety. The dilemma of production/safety conflicts is this: if organizations never sacrifice production pressure to follow up warning signs, they are acting much too riskily. On the other hand, if uncertain “warning” signs always lead to sacrifices on acute goals, can the organization operate within reasonable parameters or stakeholder demands? It is precisely at points of intensifying production pressure that extra safety investments need to be made in the form of proactive searching for side-effects of the production pressure and in the form of reassessing the risk space – safety investments are most important when least affordable. This raises the following questions:

• How does a safety organization monitor for drift and its associated signs, and in particular, how does it recognize when the side-effects of production pressure may be increasing safety risks?

• What indicators should be used to monitor the organization’s model of itself, how it is vulnerable to failure, and the potential effectiveness of the countermeasures it has adopted?

• How does production pressure create or exacerbate tradeoffs between some goals and chronic concerns like safety?

• How can an organization add investment in safety issues at the very time when the organization is most squeezed? For example, how does an organization note a reduction in margins and follow through by rebuilding margin to boundary conditions in new ways?

Another general pattern identified in Columbia is that an organization takes past success as a reason for confidence instead of digging deeper to see underlying risks. During the drift toward failure leading to the Columbia accident a misassessment took hold that resisted revision (that is, the misassessment that foam strikes pose only a maintenance problem and not a risk to orbiter safety). It is not simply that the assessment was wrong; what is troubling is the inability to re-evaluate the assessment and re-examine evidence about the vulnerability.

The absence of failure was taken as positive indication that hazards are not present or that countermeasures are effective. In this context, it is very difficult to gather or see if evidence is building up that should trigger a re-evaluation and revision of the organization’s model of vulnerabilities. If an organization is not able to change its model of itself unless and until completely clear-cut evidence accumulates, that organization will tend to learn late, that is, it will revise its model of vulnerabilities only after serious events occur. On the other hand, high-reliability organizations assume their model of risks and countermeasures is fragile and even seek out evidence about the need to revise and update this model (Rochlin, 1999). They do not assume their model is correct and then wait for evidence of risk to come to their attention, for to do so will guarantee an organization that acts more riskily than it desires.

The missed opportunities to revise and update the organization’s model of the riskiness of foam events seem to be consistent with what has been found in other cases of failure of foresight. We can describe this discounting of evidence as “distancing through differencing,” whereby those reviewing new evidence or incidents focus on differences, real and imagined, between the place, people, organization, and circumstances where an incident happens and their own context. By focusing on the differences, people see no lessons for their own operation and practices (or only extremely narrow, well-bounded responses). This contrasts with what has been noted about more effective safety organizations which proactively seek out evidence to revise and update this model, despite the fact that this risks exposing the organization’s blemishes.

The distancing through differencing that occurred throughout the build-up to the final Columbia mission can be repeated in the future as organizations and groups look at the analysis and lessons from this accident and the CAIB report. Others in the future can easily look at the CAIB conclusions and deny their relevance to their situation by emphasizing differences (e.g., my technical topic is different, my managers are different, we are more dedicated and careful about safety, we have already addressed that specific deficiency). This is one reason avoiding hindsight bias is so important – when one starts with the question, “How could they have missed what is now obvious?” – one is enabling future distancing through differencing rationalizations. The distancing through differencing process that contributes to this breakdown also indicates ways to change the organization to promote learning. One general principle which could be put into action is – do not discard other events because they appear on the surface to be dissimilar. At some level of analysis all events are unique, while at other levels of analysis they reveal common patterns. Every event, no matter how dissimilar to others on the surface, contains information about underlying general patterns that help create foresight about potential risks before failure or harm occurs. To focus on common patterns rather than surface differences requires shifting the analysis of cases from surface characteristics to deeper patterns and more abstract dimensions. Each kind of contributor to an event can then guide the search for similarities.

This suggests that organizations need a mechanism to generate new evaluations that question the organization’s own model of the risks it faces and the countermeasures deployed. Such review and reassessment can help the organization find places where it has underestimated the potential for trouble and revise its approach to create safety. A quasi-independent group is needed to do this – independent enough to question the normal organizational decision-making but involved enough to have a finger on the pulse of the organization (keeping statistics from afar is not enough to accomplish this).

Another general pattern identified in Columbia is a fragmented problem-solving process that clouds the big picture. During Columbia there was a fragmented view of what was known about the strike and its potential implications. There was no place or person who had a complete and coherent view of the analysis of the foam-strike event including the gaps and uncertainties in the data or analysis to that point. It is striking that people used what looked like technical analyses to justify previously reached conclusions, instead of using technical analyses to test tentative hypotheses.

Discontinuities and internal handovers of tasks increase the risk of fragmented problem-solving (Patterson, Roth, Woods, Chow, and Gomez, 2004). With information incomplete, disjointed and patchy, nobody may be able to recognize the gradual erosion of safety constraints on the design and operation of the original system. High reliability organization researchers have found that the importance of free-flowing information cannot be overestimated. A spontaneous and continuous exchange of information relevant to normal functioning of the system offers a background from which signs of trouble can be spotted by those with the experience to do so (Weick, 1993; Rochlin, 1999). Research on handovers, which are one coordinative device to avert the fragmentation of problem-solving (Patterson, Roth, Woods, Chow, and Gomez, 2004), has identified some of the potential costs of failing to be told, forgetting, or misunderstanding information communicated. These costs, for the incoming crew, include:

• having an incomplete model of the system’s state;

• being unaware of significant data or events;

• being unprepared to deal with impacts from previous events;

• failing to anticipate future events;

• lacking knowledge that is necessary to perform tasks safely;

• dropping or reworking activities that are in progress or that the team has agreed to do;

• creating an unwarranted shift in goals, decisions, priorities or plans.

Such problems could also have played a role in the Helios accident, described above. In Columbia, the breakdown or absence of cross-checks between disjointed departments and functions is also striking. Cross-checks on the rationale for decisions are a critical part of good organizational decision-making. Yet no cross-checks were in place to detect, question, or challenge the specific flaws in the rationale, and no one noted that cross-checks were missing. The breakdown in basic engineering judgment stands out as well. In Columbia the initial evidence available already placed the situation outside the boundary conditions of engineering data and analysis. The only available analysis tool was not designed to predict under these conditions, the strike event was hundreds of times the scale of what the model was designed to handle, and the uncertainty bounds were very large with limited ability to reduce the uncertainty (CAIB, 2003). Being outside the analyzed boundaries should not be confused with not being confident enough to provide definitive answers. In this situation basic engineering judgment calls for large efforts to extend analyses, find new sources of expertise, and cross-check results as Mission Control both practices and does. Seasoned pilots and ship commanders well understand the need for this ability to capture the big picture and not to get lost in a series of details. The issue is how to train for this judgment. For example, the flight director and his or her team practice identifying and handling anomalies through simulated situations. Note that shrinking budgets led to pressure to reduce training investment (the amount of practice, the quality of the simulated situations, and the number or variety of people who go through the simulation sessions can all decline).

What about making technical judgments? Relevant decision-makers did not seem able to notice when they needed more expertise, data, and analysis in order to have a proper evaluation of an issue. NASA’s evaluation prior to STS-107 that foam debris strikes do not pose risks of damage to the orbiter demands a technical base. Instead their “resolution” was based on very shaky or absent technical grounds, often with shallow, offhand assessments posing as and substituting for careful analysis.

The fragmentation of problem-solving also illustrates Weick’s points about how effective organizations exhibit a “deference to expertise,” “reluctance to simplify interpretations,” and “preoccupation with potential for failure,” none of which was in operation in NASA’s organizational decision-making leading up to and during Columbia (Weick et al., 1999). A safety organization must ensure that adequate technical grounds are established and used in organizational decision-making. To accomplish this, in part, the safety organization will need to define the kinds of anomalies to be practiced as well as who should participate in simulation training sessions. The value of such training depends critically on designing a diverse set of anomalous scenarios with detailed attention to how they unfold. By monitoring performance in these simulated training cases, safety personnel will be better able to assess the quality of decision-making across levels in the organization.

The fourth pattern in Columbia is a failure to revise assessments as new evidence accumulates. The accident shows how difficult it is to revise a misassessment or to revise a once plausible assessment as new evidence comes in. This finding has been reinforced in other studies in different settings (Feltovich et al., 1997; Johnson et al., 1991). Research consistently shows that revising assessments successfully requires a new way of looking at previous facts. Organizations can provide this “fresh” view:

• by bringing in people new to the situation;

• through interactions across diverse groups with diverse knowledge and tools;

• through new visualizations which capture the big picture and reorganize data into different perspectives.

One constructive action is to develop the collaborative interchanges that generate fresh points of view or that produce challenges to basic assumptions. This cross-checking process is an important part of how NASA Mission Control and other organizations successfully respond to anomalies (for a case where these processes break down see Patterson et al., 2004). One can also capture and display indicators of safety margin to help people see when circumstances or organizational decisions are pushing the system closer to the edge of the safety envelope. This idea is something that Jens Rasmussen, one of the pioneers of the new results on error and organizations, has been promoting for two decades (Rasmussen, 1997).

The crux is to notice the information that changes past models of risk and calls into question the effectiveness of previous risk reduction actions, without having to wait for completely clear-cut evidence. If revision only occurs when evidence is overwhelming, there is a grave risk of an organization acting too riskily and finding out only from near-misses, serious incidents, or even actual harm. Instead, the practice of revising assessments of risk needs to be an ongoing process. In this process of continuing re-evaluation, the working assumption is that risks are changing or evidence of risks has been missed.

What is particularly interesting about NASA’s organizational decision-making is that the correct diagnosis of production/safety tradeoffs and useful recommendations for organizational change were noted in 2000. The Mars Climate Orbiter report of March 13, 2000, depicts how the pressure for production and to be “better” on several dimensions led to management accepting riskier and riskier decisions. This report recommended many organizational changes similar to those in the CAIB report. A slow and weak response to the previous independent board report was a missed opportunity to improve organizational decision-making in NASA. The lessons of Columbia should lead organizations of the future to develop a safety organization that provides “fresh” views on risks to help discover the parent organization’s own blind spots and question its conventional assumptions about safety risks.

Finally, the Columbia accident brings to the fore another pattern – breakdowns at the boundaries of organizational units. The CAIB analysis notes how a kind of Catch-22 was operating in which the people charged to analyze the anomaly were unable to generate any definitive traction and in which the management was trapped in a stance shaped by production pressure that views such events as turnaround issues. This effect of an “anomaly in limbo” seems to emerge at the boundaries of different organizations that do not have mechanisms for constructive interplay. It is here that we see the operation of the generalization that in risky judgments we have to defer to those with technical expertise and the necessity to set up a problem-solving process that engages those practiced at recognizing anomalies in the event.

This pattern points to the need for mechanisms that create effective overlap across different organizational units and the need to avoid simply staying inside the chain-of-command mentality (though such overlap can be seen as inefficient when the organization is under severe cost pressure). This issue is of particular concern to many organizations as communication technology has linked together disparate groups as a distributed team. This capability for connectivity is leading many to work on how to support effective coordination across these distributed groups, for example in military command and control. A safety organization must have the technical expertise and authority to enhance coordination across the normal chain of command.

ENGINEERING RESILIENCE IN ORGANIZATIONS

The insights derived from the above five patterns and other research results on safety in complex systems point to the need to monitor and manage risk continuously throughout the life-cycle of a system, and in particular to find ways of maintaining a balance between safety and the often considerable pressures to meet production and efficiency goals (Reason, 1997; Weick et al., 1999). These results indicate that safety management in complex systems should focus on resilience – in the face of potential disturbances, changes and surprises, the system’s ability to anticipate (knowing what to expect), ability to address the critical (knowing what to look for), ability to respond (knowing what to do), and ability to learn (knowing what can happen). A system’s resilience captures the result that failures are breakdowns in the normal adaptive processes necessary to cope with the complexity of the real world (Rasmussen, 1990; Sutcliffe and Vogus, 2003; Hollnagel, Woods, and Leveson, 2006).

A system’s resilience includes properties such as:

• buffering capacity: the size or kinds of disruptions the system can absorb or adapt to without a fundamental breakdown in performance or in the system’s structure;

• flexibility: the system’s ability to restructure itself in response to external changes or pressures;

• margin: how closely the system is currently operating relative to one or another kind of performance boundary;

• tolerance: whether the system gracefully degrades as stress/pressure increase, or collapses quickly when pressure exceeds adaptive capacity.

Cross-scale interactions are another important factor, as the resilience of a system defined at one scale depends on influences from scales above and below: downward in terms of how organizational context creates pressures/goal conflicts/dilemmas and upward in terms of how adaptations by local actors in the form of workarounds or innovative tactics reverberate and influence more strategic issues. Managing resilience, or resilience engineering, then, focuses on what sustains or erodes the adaptive capacities of human-technical systems in a changing environment (Hollnagel et al., 2006). The focus is on monitoring organizational decision-making to assess the risk that the organization is operating nearer to safety boundaries than it realizes (or, more generally, that the organization’s adaptive capacity is degrading or lower than the adaptive demands of its environment).

Resilience engineering seeks to develop engineering and management practices to measure sources of resilience, provide decision support for balancing production/safety tradeoffs, and create feedback loops that enhance the organization’s ability to monitor/revise risk models and to target safety investments. For example, resilience engineering would monitor evidence that effective cross-checks are well integrated when risky decisions are made, or would serve as a check on how well the organization prepares to handle anomalies by checking on how it practices handling of simulated anomalies (what kind of anomalies, who is involved in making decisions). The focus on system resilience emphasizes the need for proactive measures in safety management: tools to support agile, targeted, and timely investments to defuse emerging vulnerabilities and sources of risk before harm occurs.

To achieve resilience, organizations need support for decisions about production/safety tradeoffs. Resilience engineering should help organizations decide when to relax production pressure to reduce risk, or, in other words, develop tools to support sacrifice decisions across production/safety tradeoffs. When operating under production and efficiency pressures, evidence of increased safety risk may be missed or discounted. As a result, organizations act in ways that are riskier than they realize or want, until an accident or failure occurs. This is one of the factors that creates the drift-toward-failure signature in complex system breakdowns.

Making risk a proactive part of management decision-making means knowing when to relax the pressure on throughput and efficiency goals, that is, when to make a sacrifice decision. The challenge is how to help organizations decide when to relax production pressure to reduce risk. These tradeoff decisions can be referred to as sacrifice judgments because acute production- or efficiency-related goals are temporarily sacrificed, or the pressure to achieve these goals is relaxed, in order to reduce the risk of approaching too near to safety boundary conditions. Sacrifice judgments occur in many settings: when to convert from laparoscopic surgery to an open procedure (e.g., Cook et al., 1998; Woods, 2006), when to break off an approach to an airport during weather that increases the risk of wind shear, or when to have a local slowdown in production operations to avoid risks as complications build up. Ironically, it is at the very times of higher organizational tempo and focus on acute goals that we require extra investment in sources of resilience to keep production/safety tradeoffs in balance – valuing thoroughness despite the potential for sacrifices on efficiency required to meet stakeholder demands.

CONCLUSION

The various models that try to understand safety and “human error” are always works in progress, and their language evolves constantly to accommodate new empirical results, new methods, and new concepts. It has now become obvious, though, that traditional, reductive engineering notions of reliability (that safety can be maintained by keeping system component performance inside acceptable and pre-specified bandwidths) have very little to do with what makes complex systems highly resilient. “Human error,” as a label that would indicate a lack of such traditional reliability on the part of human components in a complex system, has no analytical leverage whatsoever. Through the various generations of models, “human error” has evolved from cause, to effect, to a mere attribution that has more to do with those who struggle with a failure in hindsight than with the people caught up in a failing system at the time.

Over the past two decades, research has begun to show how organizations can manage acute pressures of performance and production in a constantly dynamic balance with chronic concern for safety. Safety is not something that these organizations have; it is something that organizations do. Practitioners and organizations, as adaptive systems, continually assess and revise their work so as to remain sensitive to the possibility of failure. Efforts to create safety are ongoing, but not always successful. An organization usually is unable to change its model of itself unless and until overwhelming evidence accumulates that demands revising the model. This is a guarantee that the organization will tend to learn late, that is, revise its model of risk only after serious events occur. The crux is to notice the information that changes past models of risk and calls into question the effectiveness of previous risk reduction actions, without having to wait for complete, clear-cut evidence. If revision only occurs when evidence is overwhelming, the organization will act too riskily and experience shocks from near misses, serious incidents, or even actual harm. The practice of revising assessments of risk needs to be continuous.

Resilience Engineering, the latest addition to thinking about safety and human performance in complex organizations, is built on insights derived in part from HRO work, control theory, Perrowian complexity and even man-made disaster theory. It is concerned with assessing organizational risk, that is, the risk that organizational decision making will produce unrecognized drift toward failure boundaries. While assessing technical hazards is one kind of input into Resilience Engineering, the goal is to monitor organizational decision making. For example, Resilience Engineering would monitor evidence that effective cross-checks are well integrated when risky decisions are made, or would serve as a check on how well the organization is practicing the handling of simulated anomalies (what kind of anomalies, who is involved in making decisions).

Other dimensions of organizational risk include the commitment of management to balance the acute pressures of production with the chronic pressures of protection. Management’s willingness to invest in safety and to allocate resources to safety improvement in a timely, proactive manner, despite pressures on production and efficiency, is a key factor in ensuring a resilient organization. The degree to which the reporting of safety concerns and problems is truly open and encouraged provides another significant source of resilience within the organization. Assessing the organization’s response to incidents indicates whether there is a learning culture or a culture of denial. Other dimensions include:

• Preparedness/Anticipation: is the organization proactive in picking up on evidence of developing problems versus only reacting after problems become significant?

• Opacity/Observability: does the organization monitor safety boundaries and recognize how close it is to ‘the edge’ in terms of degraded defenses and barriers? To what extent is information about safety concerns widely distributed throughout the organization at all levels versus closely held by a few individuals?

• Flexibility/Stiffness: how does the organization adapt to change, disruptions, and opportunities?

Successful organizations in the future will have become skilled at the three basics of Resilience Engineering:

1. detecting signs of increasing organizational risk, especially when production pressures are intense or increasing;

2. having the resources and authority to make extra investments in safety at precisely these times when it appears least affordable;

3. having a means to recognize when and where to make targeted investments to control rising signs of organizational risk and re-balance the safety and production tradeoff.

These mechanisms may help produce an organization that creates foresight about changing risks before failures occur.
