4
COMPLEXITY, CONTROL AND SOCIOLOGICAL MODELS

NORMAL ACCIDENT THEORY

Highly technological systems such as aviation, air traffic control, telecommunications, nuclear power, space missions, and medicine include potentially disastrous failure modes. These systems, consistent with the barrier idea in the previous chapter, usually have multiple redundant mechanisms, safety systems, and elaborate policies and procedures to keep them from failing in ways that produce bad outcomes. The results of combined operational and engineering measures make these systems relatively safe from single point failures; that is, they are protected against the failure of a single component or procedure directly leading to a bad outcome. But the paradox, says Perrow (1984), is that such barriers and redundancy can actually add complexity and increase opacity so that, when even small things start going wrong, it becomes exceptionally difficult to get off an accelerating pathway to system breakdown. The need to make these systems reliable, in other words, also makes them very complex. They are large systems, semantically complex (it generally takes a great deal of time to master the relevant domain knowledge), with tight couplings between various parts, and operations are often carried out under time pressure or other resource constraints.

Perrow (1984) promoted the idea of system accidents. Rather than being the result of a few or a number of component failures, accidents involve the unanticipated interaction of a multitude of events in a complex system – events and interactions whose combinatorial explosion can quickly outwit people’s best efforts at predicting and mitigating disaster. The scale and coupling of these systems create a different pattern for disaster, where incidents develop or evolve through a conjunction of several small failures. Yet to normal accident theory, analytically speaking, such accidents need not be surprising at all (not even in a fundamental sense). The central thesis of what has become known as normal accident theory (Perrow, 1984) is that accidents are the structural and virtually inevitable product of systems that are both interactively complex and tightly coupled. Interactive complexity and coupling are two presumably different dimensions along which Perrow plotted a number of systems (from manufacturing to military operations to nuclear power plants). This separation into two dimensions has spawned a lot of thinking and discussion (including whether they are separable at all), and has offered new ways of looking at how to manage and control complex, dynamic technologies, as well as suggesting what may lie behind the label “human error” if things go wrong in a tightly coupled, interactively complex system. Normal accident theory predicts that the more tightly coupled and complex a system is, the more prone it is to suffering a “normal” accident.

Interactive complexity refers to component interactions that are non-linear, unfamiliar, unexpected or unplanned, and either not visible or not immediately comprehensible for people running the system. Linear interactions are those in expected and familiar production or maintenance sequences, and those that are quite visible and understandable even if unplanned. Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or not immediately comprehensible (Perrow, 1984). An electrical power grid is an example of an interactively complex system. Failures, when they do occur, can cascade through these systems in ways that may confound the people managing them, making it difficult to stop the progression of failure (this would also go for the phone company AT&T’s Thomas Street outage, even if stakeholders implicated “human error”).

In addition to being either linearly or complexly interactive, systems can be loosely or tightly coupled. They are tightly coupled if they have more time-dependent processes (meaning they can’t wait or stand by until attended to), sequences that are invariant (the order of the process cannot be changed) and little slack (e.g., things cannot be done twice to get them right). Dams, for instance, are rather linear systems, but very tightly coupled. Rail transport is too. In contrast, an example of a system that is interactively complex but not very tightly coupled is a university education. It is interactively complex because of specialization, limited understanding, number of control parameters and so forth. But the coupling is not very tight. Delays or temporary halts in education are possible, different courses can often be substituted for one another (as can a choice of instructors), and there are many ways of achieving the goal of getting a degree.

CASE 4.1 A COFFEE MAKER ONBOARD A DC-8 AIRLINER

During a severe winter in the US (1981-1982), a DC-8 airliner was delayed at Kennedy airport in New York (where the temperature was a freezing 2°F or minus 17°C) because mechanics needed to exchange a fuel pump (they suffered frostbite, which caused further delay) (Perrow, 1984, p. 135).

After the aircraft finally got airborne after midnight, headed for San Francisco, passengers were told that there would be no coffee because the drinking water was frozen. Then the flight engineer discovered that he could not control the cabin pressure (which is held at a higher pressure than the thin air the aircraft is flying in so as to make the air breathable). Later investigation showed that the frozen drinking water had cracked the airplane’s water tank. Heat from ducts to the tail section of the aircraft then melted the ice in the tank, and because of the crack in the tank, and the pressure in it, the newly melted water near the heat source sprayed out. It landed on the outflow valve that controls the cabin pressurization system (by allowing pressurized cabin air to vent outside). Once on the valve, the water turned to ice again because of the temperature of the outside air (minus 50°F or minus 45°C), which caused the valve to leak. The compressors for the cabin air could not keep up, leading to depressurization of the aircraft.

The close proximity of parts that have no functional relationship, packed inside a compact airliner fuselage, can create the kind of interactive complexity and tight coupling that makes it hard to understand and control a propagating failure. Substituting broken parts was not possible (meaning tight coupling): the outflow valve is not reachable when airborne and a water tank cannot be easily replaced either (nor can a leak in it be easily fixed when airborne). The crew response to the pressurization problem, however, was rapid and effective – independent of their lack of understanding of the source of their pressurization problem. As trained, they got the airplane down to a breathable level in just three minutes and diverted to Denver for an uneventful landing there.

To Perrow, the two dimensions (interactive complexity and coupling) presented a serious dilemma. A system with high interactive complexity can only be effectively controlled by a decentralized organization. The reason is that highly interactive systems generate the sorts of non-routine situations that resist standardization (e.g., through procedures, which is a form of centralized control fed forward into the operation). Instead, the organization has to allow lower-level personnel considerable discretion and leeway to act as they see fit based on the situation, as well as encouraging direct interaction among lower-level personnel, so as to bring together the different kinds of expertise and perspective necessary to understand the problem.

A system with tight couplings, on the other hand, can in principle only be effectively controlled by a highly centralized organization, because tight coupling demands quick and coordinated responses. Disturbances that cascade through a system cannot be stopped quickly if a team with the right mix of expertise and backgrounds needs to be assembled first. Centralization, for example through procedures, emergency drills, or even automatic shut-downs or other machine interventions, is necessary to arrest such cascades quickly. Also, a conflict between different well-meaning interventions can make the situation worse, which means that activities oriented at arresting the failure propagation need to be extremely tightly coordinated.

To Perrow, an organization cannot be centralized and decentralized at the same time. So a dilemma arises if a system is both interactively complex and tightly coupled (e.g., nuclear power generation). A necessary conclusion for normal accidents theory is that systems that are both tightly coupled and interactively complex can therefore not be controlled effectively. This, however, is not the whole story. In the tightly coupled and interactively complex pressurization case above, the crew may not have been able to diagnose the source of the failure (which would indeed have involved decentralized multiple different perspectives, as well as access to various systems and components). Yet through centralization (procedures for dealing with pressurization problems are often trained, well-documented, brief and to the point) and extremely tight coordination (who does and says what in an emergency depressurization descent is very firmly controlled and goes unquestioned during execution of the task), the crew was able to stop the failure from propagating into a real disaster. Similarly, even if nuclear power plants are both interactively complex and tightly coupled, a mix of centralization and decentralization is applied so as to make propagating problems more manageable (e.g., thousands of pages of procedures and standard protocols exist, but so does the co-location of different kinds of expertise in one control room, to allow spontaneous interaction; and automatic shutdown sequences that get triggered in some situations can rule out the need for human intervention for up to 30 minutes).
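The dilemma can be laid out as a small lookup over the two dimensions. The sketch below is only an illustration of the argument as summarized above, not a tool Perrow proposed. The names `System` and `control_demand` are hypothetical, the placements come from the examples given in this chapter, and the wording for the linear, loosely coupled quadrant is an assumption, since that quadrant is not discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """Hypothetical record placing a system on Perrow's two dimensions."""
    name: str
    interactively_complex: bool  # True: complex interactions, False: linear
    tightly_coupled: bool        # True: tight coupling, False: loose


def control_demand(system: System) -> str:
    """Return the form of organizational control the argument calls for."""
    if system.interactively_complex and system.tightly_coupled:
        # Decentralization and centralization are demanded at the same time.
        return "dilemma: centralized and decentralized at once"
    if system.interactively_complex:
        # Non-routine situations resist standardization.
        return "decentralized"
    if system.tightly_coupled:
        # Cascades demand quick, coordinated responses.
        return "centralized"
    # Assumption: this quadrant is not covered in the text above.
    return "no conflict (not discussed in the text)"


# Placements taken from the examples in this chapter.
EXAMPLES = [
    System("dam", interactively_complex=False, tightly_coupled=True),
    System("rail transport", interactively_complex=False, tightly_coupled=True),
    System("university education", interactively_complex=True, tightly_coupled=False),
    System("nuclear power generation", interactively_complex=True, tightly_coupled=True),
]

for s in EXAMPLES:
    print(f"{s.name}: {control_demand(s)}")
```

Treating both dimensions as fixed yes/no properties is itself a simplification: the mix of centralization and decentralization seen in the pressurization case and in nuclear control rooms is exactly what such a static lookup cannot express.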

NORMAL ACCIDENT THEORY AND “HUMAN ERROR”

At the sharp end of complex systems, normal accidents theory sees human error as a label for some of the effects of interactive complexity and tight coupling. Operators are the inheritors of a system that structurally conspires against their ability to make sense of what is going on and to recover from a developing failure. Investigations, infused with the wisdom of hindsight, says Perrow (1984), often turn up places where human operators should have zigged instead of zagged, as if that alone would have prevented the accident. Perrow invokes the idea of the fundamental surprise error when he comments on official inability to deal with the real structural nature of failure (e.g., through the investigations that are commissioned). The cause they find may sometimes be no more than the “cause” people are willing or able to afford. Indeed, to Perrow, the reliance on labels like “human error” has little to do with explanation and more with politics and power, something even formal or independent investigations are not always immune to:

Formal accident investigations usually start with an assumption that the operator must have failed, and if this attribution can be made, that is the end of serious inquiry. Finding that faulty designs were responsible would entail enormous shutdown and retrofitting costs; finding that management was responsible would threaten those in charge, but finding that operators were responsible preserves the system, with some soporific injunctions about better training. (1984, p. 146)

Human error, in other words, can be a convenient and cheap label to use so as to control sunk costs and avoid having to upset elite interests. Behind the label, however, lie the real culprits: structural interactive complexity and tight coupling – features of risky technological systems such as nuclear power generation that society as a whole should be thinking critically about (Perrow, 1984).

That said, humans can hardly be the recipient victims of complexity and coupling alone. The very definition of Perrowian complexity actually involves both human and system, to the point where it becomes hard to see where one ends and the other begins. For example, interactions cannot be unfamiliar, unexpected, unplanned, or not immediately comprehensible in some system independent of the people who need to deal with them (and to whom they are either comprehensible or not). One hallmark of expertise, after all, is a reduction of the degrees of freedom that a decision presents to the problem-solver (Jagacinski and Flach, 2002), and an increasingly refined ability to recognize patterns of interactions and knowing what to do primed by such situational appreciation (Klein, Orasanu, and Calderwood, 1993). Perrowian complexity can thus not be a feature of a system by itself, but always has to be understood in relation to the people (and their expertise) who have to manage that system (e.g., Pew et al., 1981; Wagenaar and Groeneweg, 1987). This also means that the categories of complexity and coupling are not as independent as normal accident theory suggests.

Another problem arises when complexity and coupling are treated as stable properties of a system, because it misses the dynamic nature of much safety-critical work and the ebb and flow of cognitive and coordinative activity to manage it. During periods of crisis, or high demand, a system can become more difficult to control as couplings tighten and interactive complexity momentarily deepens. It renders otherwise visible interactions less transparent, less linear, creating interdependencies that are harder to understand and more difficult to correct. This can become especially problematic when important routines get interrupted, coordinated action breaks down and misunderstandings occur (Weick, 1990). The opposite goes too. Contractions in complexity and coupling can be met in centralized and de-centralized ways by people responsible for the safe operation of the system, creating new kinds of coordinated action and newly invented routines.

CASE 4.2 THE MAR KNOCKOUT CASE (COOK AND CONNOR, 2004)

During the Friday night shift in a large, tertiary care hospital, a nurse called the pharmacy technician on duty to report a problem with the medications just delivered for a ward patient in the unit dose cart. The call itself was not unusual; occasionally there would be a problem with the medication delivered to the floor, especially if a new order was made after the unit dose fill list had been printed. In this case, however, the pharmacy had delivered medicines to the floor that had never been ordered for that patient. More importantly, the medicines that were delivered to the floor matched the newly printed medication administration record (MAR). This was discovered during routine reconciliation of the previous day’s MAR with the new one. The MAR that had just been delivered was substantially different from the one from the previous day but there was no indication in the patient’s chart that these changes had been ordered. The pharmacy technician called up a computer screen that showed the patient’s medication list. This list corresponded precisely to the new MAR and the medications that had been delivered to the ward.

While the technician was trying to understand what had happened to this patient’s medication, the telephone rang again. It was a call from another ward where the nurses had discovered something wrong. For some patients, the unit dose cart contained drugs they were not taking; for others, the cart did not contain drugs they were supposed to get. Other calls came in from other areas in the hospital, all describing the same situation. The problem seemed to be limited to the unit dose cart system; the intravenous medications were correct. In each case, the drugs that were delivered matched the newly printed MAR, but the MAR itself was wrong. The pharmacy technician notified the on-call pharmacist who realized that, whatever its source, the problem was hospital-wide. The MAR as a common mode created the kind of Perrowian complexity that made management of the problem extremely difficult: its consequences were showing up throughout the entire hospital, often in different guises and with different implications.

Consistent with normal accident theory, a technology that was introduced to improve safety, such as the dose checking software in this case, actually made it harder to achieve safety, for example, by making it difficult to upgrade to new software. Information technology makes it possible to perform work efficiently by speeding up much of the process. But the technology also makes it difficult to detect failures and recover from them. It introduces new forms of failure that are hard to appreciate before they occur. These failures are foreseeable but not foreseen. This was an event with system-wide consequences that required decisive and immediate action to limit damage and potential damage. This action was expensive and potentially damaging to the prestige and authority of those who were in charge. The effective response required simultaneous, coordinated activity by experienced, skilled people.

Like many accidents, it was not immediately clear what had happened, only that something was wrong. It was now early Saturday morning and the pharmacy was confronting a crisis. First, the pharmacy computer system was somehow generating an inaccurate fill list. Neither MARs nor the unit dose carts already delivered to the wards could be trusted. There was no pharmacy computer-generated fill list that could be relied upon. Second, the wards were now without the right medications for the hospitalized patients and the morning medication administration process was about to begin. No one yet knew what was wrong with the pharmacy computer. Until it could be fixed, some sort of manual system was needed to provide the correct medications to the wards. Across the hospital, the unit dose carts were sent back to the pharmacy.

A senior pharmacist realized that the previous day’s hard copy MARs as they were maintained on the wards were the most reliable available information about what medicines patients were supposed to receive. By copying the most recent MARs, the pharmacy could produce a manual fill list for each patient. For security reasons, there were no copying machines near the wards. There was a fax machine for each ward, however, and the pharmacy staff organized a ward-by-ward fax process to get hand-updated copies of each patient’s MAR. Technicians used these faxes as manual fill lists to stock unit dose carts with correct medications. A decentralized response, in other words, that coordinated different kinds of expertise and background, making fortuitous use of substitutions (fax machines instead of copiers) helped people in the hospital manage the problem. A sudden contraction in interactive complexity through a common mode failure (MAR in this case) with a lack of centralized response capabilities (no central back-up) did not lead to total system breakdown because of the spontaneously organized response of practitioners throughout the system.

Ordinarily, MARs provided a way to track and reconcile the physician orders and medication administration process on the wards. In this instance they became the source of information about what medications were needed. Because the hospital did not yet have computer-based physician-order entry, copies of handwritten physician-orders were available. These allowed the satellite pharmacies to interact directly with the ward nurses to fill the gaps. Among the interesting features of the event was the absence of typewriters in the pharmacy. Typewriters, discarded years before in favor of computer-label printers, would have been useful for labeling medications. New technology displaces old technology, making it harder to recover from computer failures by reverting to manual operations.

The source of the failure remained unclear, as it often does, but that does not need to hamper the effectiveness of the coordinated response to it. There had been some problem with the pharmacy computer system during the previous evening. The pharmacy software detected a fault in the database integrity. The computer specialist had contacted the pharmacy software vendor and they had worked together through a fix to the problem. This fix proved unsuccessful so they reloaded a portion of the database from the most recent backup tape. After this reload, the system had appeared to work perfectly. The computer software had been purchased from a major vendor. After a devastating cancer chemotherapy accident in the institution, the software had been modified to include special dose-checking programs for chemotherapy. These modifications worked well but the pharmacy management had been slow to upgrade the main software package because it would require rewriting the dose-checking add-ons. Elaborate backup procedures were in place, including both frequent “change” backups and daily “full” backups onto magnetic tapes.

Working with the software company throughout the morning, the computer technicians were able to discover the reason that the computer system had failed. The backup tape was incomplete, and reloading it had internally corrupted the database. The backup was incomplete because of a complex interlocking process related to the database management software that was used by the pharmacy application. Under particular circumstances, tape backups could be incomplete in ways that remained hidden from the operator. The problem was not related to the fault for which the backup reloading was necessary. The immediate solution to the problem facing the pharmacy was to reload the last “full” backup (now over a day and a half old) and to re-enter all the orders made since that time. The many pharmacy technicians now collected all the handwritten order slips from the past 48 hours and began to enter these. (The process was actually considerably more complex: to bring the computer’s view of the world up to date, its internal clock had to be set back, the prior day’s fill list regenerated, the day’s orders entered, the clock time set forward and the current day’s morning fill list re-run.) The manual system was used all Saturday. The computer system was restored by the end of the day. The managers and technicians examined the fill lists produced for the nightly fill closely and found no errors. The system was back “on-line”.
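The logic of this recovery (return to the last trusted state, then re-enter everything recorded on paper since) can be sketched in a few lines. This is a deliberately simplified illustration and not the pharmacy system’s actual procedure: it leaves out the clock manipulation and fill-list regeneration, and the function name `restore_and_replay`, the record layout, and the patient and drug labels are all hypothetical.

```python
from datetime import datetime


def restore_and_replay(full_backup, paper_orders, backup_time):
    """Rebuild each patient's medication list from a backup plus paper orders.

    full_backup  -- dict: patient -> set of drugs as of backup_time
    paper_orders -- list of (timestamp, patient, action, drug),
                    where action is "start" or "stop"
    """
    # Begin from a copy of the last trusted state.
    state = {patient: set(drugs) for patient, drugs in full_backup.items()}
    # Re-enter, in chronological order, only orders written after the backup.
    for when, patient, action, drug in sorted(paper_orders):
        if when <= backup_time:
            continue  # already reflected in the backup
        meds = state.setdefault(patient, set())
        if action == "start":
            meds.add(drug)
        elif action == "stop":
            meds.discard(drug)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return state


# Hypothetical data: one backup and three handwritten order slips.
backup_time = datetime(2000, 1, 6, 0, 0)
backup = {"patient_1": {"drug_a"}, "patient_2": {"drug_b"}}
orders = [
    (datetime(2000, 1, 6, 9, 30), "patient_1", "start", "drug_c"),
    (datetime(2000, 1, 5, 14, 0), "patient_2", "start", "drug_b"),  # predates backup
    (datetime(2000, 1, 7, 8, 15), "patient_1", "stop", "drug_a"),
]
current = restore_and_replay(backup, orders, backup_time)
print(current)
```

The sketch only works because an independent record of every order exists outside the computer, which is the role the handwritten order slips played in this case.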

As far as pharmacy and nursing management could determine, no medication misadministration occurred during this event. Some doses were delayed, although no serious consequences were identified. Several factors contributed to the hospital’s ability to recover from the event. First, the accident occurred on a Friday night so that the staff had all day Saturday to recover and all day Sunday to observe the restored system for new failures. Few new patients are admitted on Saturday and the relatively slow tempo of operations allowed the staff to concentrate on recovering the system. Tight coupling, in other words, was averted fortuitously by the time of the week of the incident. Second, the hospital had a large staff of technicians and pharmacists who came in to restore operations. In addition, the close relationship between the software vendor and hospital information technical staff made it possible for the staff to diagnose the problem and devise a fix with little delay. The ability to quickly bring a large number of experts with operational experience together was critical to success, as normal accidents theory predicts is necessary in highly interactively complex situations. Third, the availability of the manual, paper records allowed these experts to “patch-up” the system and make it work in an unconventional but effective way. The paper MARs served as the basis for new fill lists and the paper copies of physician orders provided a “paper trail” that made it possible to replay the previous day’s data entry, essentially fast forwarding the computer until its “view” of the world was correct. Substitution of parts, in other words, was possible, thereby reducing coupling and arresting a cascade of failures. Fourth, the computer system and technical processes contributed. The backup process, while flawed in some ways, was essential to recovery: it provided the “full” backup needed. In other words, a redundancy existed that had not been deliberately designed-in (as is normally the case in tightly coupled systems according to normal accident theory).

The ability of organizations to protect themselves against system accidents (such as the MAR knockout close call) can, in worse cases than the one described above, fall victim to the very interactive complexity and tight coupling it must contain. Plans for emergencies, for example, are intended to help the organization deal with unexpected problems and developments, but they are often designed to be maximally persuasive to regulators, board members, surrounding communities, lawmakers and opponents of the technology, and as a result can become wildly unrealistic. Clarke and Perrow (1996) call them “fantasy documents” that fail to cover most possible accidents, lack any historical record that may function as a reality check, and quickly come to rest on obsolete contact details, organizational designs, function descriptions and divisions of responsibility. The problem with such fantasy documents is that they can function as an apparently legitimate placeholder that suggests that everything is under control. This inhibits the organization’s commitment to continually reviewing and re-assessing its ability to deal with hazard. In other words, fantasy documents can impede organizational learning as well as organizational preparedness.

CONTROL THEORY

In response to the limitations of event chain models and their derivatives, such as the latent failure model, models based on control theory have been proposed for accident analysis instead. Accident models based on control theory explicitly look at accidents as emerging from interactions among system components. They usually do not identify single causal factors, but rather look at what may have gone wrong with the system’s operation or organization of the hazardous technology that allowed an accident to take place. Safety, or risk management, is viewed as a control problem (Rasmussen, 1997), and accidents happen when component failures, external disruptions or interactions between layers and components are not adequately handled; when safety constraints that should have applied to the design and operation of the technology have loosened, or become badly monitored, managed or controlled. Control theory tries to capture these imperfect processes, which involve people, societal and organizational structures, engineering activities, and physical parts. It sees the complex interactions between those – as did man-made disaster theory – as eventually resulting in an accident (Leveson, 2002).

Control theory sees the operation of hazardous technology as a matter of keeping many interrelated components in a state of dynamic equilibrium (which means that control inputs, even if small, are continually necessary for the system to stay safe: it cannot be left on its own as could a statically stable system). Keeping a dynamically stable system in equilibrium happens through the use of feedback loops of information and control. Accidents are not the result of an initiating (root cause) event that triggers a series of events, which eventually leads to a loss. Instead, accidents result from interactions among components that violate the safety constraints on system design and operation, by which feedback and control inputs can grow increasingly at odds with the real problem or processes to be controlled. Unsurprisingly, concern with those control processes (how they evolve, adapt and erode) forms the heart of control theory as applied to organizational safety (Rasmussen, 1997; Leveson, 2002).
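A toy numerical example may help make the notion of dynamic equilibrium concrete. It is a minimal sketch under assumed numbers, not a model taken from Rasmussen or Leveson. The function name `simulate` and the parameters `growth`, `disturbance` and `gain` are hypothetical. The process amplifies any deviation by 10 percent per step and is nudged by a small constant disturbance, so it stays near its target only while a feedback correction is applied at every step.

```python
def simulate(gain, steps=50, growth=1.1, disturbance=0.1, x0=0.0):
    """Return the deviation from the safe state over time.

    gain -- strength of the feedback correction (0.0 means no control)
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        control = -gain * x  # feedback: push against the observed deviation
        x = growth * x + control + disturbance
        trajectory.append(x)
    return trajectory


uncontrolled = simulate(gain=0.0)  # left on its own
controlled = simulate(gain=0.6)    # continual small corrections

print(f"final deviation without feedback: {uncontrolled[-1]:.2f}")
print(f"final deviation with feedback:    {controlled[-1]:.2f}")
```

Without feedback the deviation grows without bound, whereas with feedback it settles at a small, constant offset. Erosion of the control structure corresponds, in this toy setting, to the gain drifting toward zero or the correction being computed from a stale or wrong reading of the process.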

Degradation of the safety-control structure over time can be due to asynchronous evolution, where one part of a system changes without the related necessary changes in other parts. Changes to subsystems may have been carefully planned and executed in isolation, but consideration of their effects on other parts of the system, including the role they play in overall safety control, may remain neglected or inadequate. Asynchronous evolution can occur too when one part of a properly designed system deteriorates independent of other parts. In both cases, erroneous expectations of users or system components about the behavior of the changed or degraded subsystem may lead to accidents (Leveson, 2002). The more complex a system (and, by extension, the more complex its control structure), the more difficult it can become to map out the reverberations of changes (even carefully considered ones) throughout the rest of the system. Control theory embraces a much more complex idea of causation, taken from complexity theory. Small changes somewhere in the system, or small variations in the initial state of a process, can lead to huge consequences elsewhere. The Newtonian symmetry between cause and effect (still assumed in other models discussed in this chapter) no longer applies.

CASE 4.3 THE LEXINGTON COMAIR 5191 ACCIDENT (SEE NELSON, 2008)

Flight 5191 was a scheduled passenger flight from Lexington, Kentucky to Atlanta, Georgia, operated by Comair. On the morning of August 27, 2006, the Regional Jet that was being used for the flight crashed while attempting to take off. The aircraft was assigned runway 22 for the takeoff, but used runway 26 instead. Runway 26 was too short for a safe takeoff. The aircraft crashed just past the end of the runway, killing all 47 passengers and two of the three crew. The flight’s first officer was the only survivor. At the time of the 5191 accident the LEX airport was in the final construction phases of a five year project. The First Officer had given the takeoff briefing and mentioned that “lights were out all over the place” (NTSB, 2007, p. 140) when he had flown in two nights before. He also gave the taxi briefing, indicating they would take taxiway Alpha to runway 22 and that it would be a short taxi. Unbeknownst to the crew, the airport signage was inconsistent with their airport diagram charts as a result of the construction. Various taxiway and runway lighting systems were out of operation at the time.

After a short taxi from the gate, the captain brought the aircraft to a stop short of runway 22, except, unbeknownst to him, they were actually short of runway 26. The control tower controller scanned runway 22 to assure there was no conflicting traffic, then cleared Comair 191 to take off. The view down runway 26 provided the illusion of some runway lights. By the time they approached the intersection of the two runways, the illusion was gone and the only light illuminating the runway was from the aircraft lights. This prompted the First Officer to comment “weird with no lights” and the captain responded “yeah” (NTSB, 2007, p. 157). During the next 14 seconds, they traveled the last 2,500 ft of remaining runway. In the last 100 feet of runway, the captain called “V1, Rotate, Whoa.” The jet became momentarily airborne but then impacted a line of oak trees approximately 900 feet beyond the end of runway 26. From there, the aircraft erupted into flames and came to rest approximately 1,900 feet off the west end of runway 26.

Runway 26 was only 3,500 feet long and not intended for aircraft heavier than 12,000 pounds. Yet each runway had a crossing runway located approximately 1,500 feet from threshold. They both had an increase in elevation at the crossing runway. The opposite end of neither runway was visible during the commencement of the takeoff roll. Each runway had a dark-hole appearance at the end, and both had 150 foot wide pavement (runway 26 was edge striped to 75 feet). Neither runway had lighting down the center line, as that of runway 22 had been switched off as part of the construction (which the crew knew). Comair had no specified procedures to confirm compass heading with the runway. Modern Directional Gyros (DG) automatically compensate for precession, so it is no longer necessary to cross-check the DG with runway heading and compass indication. Many crews have abandoned the habit of checking this, as airlines have abandoned procedures for it. The 5191 crew was also fatigued, having accumulated sleep loss over the preceding duty period.

Comair had operated accident-free for almost 10 years when the 5191 accident occurred. During those 10 years, Comair approximately doubled its size, was purchased by Delta Air Lines Inc., became an all jet operator and, at the time of the 5191 accident, was in the midst of its first bankruptcy reorganization. As is typical with all bankruptcies, anything management believed was unnecessary was eliminated, and everything else was pushed to maximum utilization. In the weeks immediately preceding the 5191 accident, Comair had demanded large wage concessions from the pilots. Management had also indicated the possibility of furloughs and threatened to reduce the number of aircraft, thereby reducing the available flight hours and implying reduction of work force.

Data provided by Jeppesen, a major flight navigation and chart company, for NOTAMs (Notices to Airmen), did not contain accurate local information about the closure of taxiway Alpha north of runway 26. Neither Comair nor the crew had any way to get this information other than a radio broadcast at the airport itself, but there was no system in place for checking the completeness and accuracy of these broadcasts either. According to the airport, the last phase of construction did not require a change in the route used to access runway 22; Taxiway A5 was simply renamed Taxiway A, but this change was not reflected on the crew’s chart (indeed, asynchronous evolution). It would eventually become Taxiway A7.

Several crews had acknowledged difficulty dealing with the confusing aspects of the north end taxi operations to runway 22, following the changes which affected a seven-day period prior to the 5191 accident. One captain, who flew in and out of LEX numerous times a month, stated that after the changes “there was not any clarification about the split between old alpha taxiway and the new alpha taxiway and it was confusing.” A First Officer, who also regularly flew in and out of LEX, expressed that on their first taxi after the above changes, he and his captain “were totally surprised that taxiway Alpha was closed between runway 26 and runway 22.” The week before, he used taxiway Alpha (old Alpha) to taxi all the way to runway 22. It “was an extremely tight area around runway 26 and runway 22 and the chart did not do it justice.” Even though these and, undoubtedly, other instances of crew confusion occurred during the seven-day period of August 20-27, 2006, there were no effective communication channels to provide this information to LEX, or anyone else in the system. After the 5191 accident, a small group of aircraft maintenance workers expressed concern that they, too, had experienced confusion when taxiing to conduct engine run-ups. They were worried that an accident could happen, but did not know how to effectively notify people who could make a difference.

The regulator had not approved the publishing of interim airport charts that would have revealed the true nature of the situation. It had concluded that changing the chart over multiple revision cycles would create a high propensity for inaccuracies to occur, and that, because of the multiple chart changes, the possibilities for pilot confusion would be magnified.

Control theory has part of its background in control engineering, which helps the design of control and safety systems in hazardous industrial or other processes, particularly with software applications (e.g., Leveson and Turner, 1993). The models, as applied to organizational safety, are concerned with how a lack of control allows a migration of organizational activities towards the boundary of acceptable performance, and there are several ways to represent the mechanisms by which this occurs. Systems dynamics modeling does not see an organization as a static design of components or layers. It readily accepts that a system is more than the sum of its constituent elements. Instead, it sees an organization as a set of constantly changing and adaptive processes focused on achieving the organization’s multiple goals and adapting around its multiple constraints. The relevant units of analysis in control theory are therefore not components or their breakage (e.g., holes in layers of defense), but system constraints and objectives (Rasmussen, 1997; Leveson, 2002):

Human behavior in any work system is shaped by objectives and constraints which must be respected by the actors for work performance to be successful. Aiming at such productive targets, however, many degrees of freedom are left open which will have to be closed by the individual actor by an adaptive search guided by process criteria such as workload, cost effectiveness, risk of failure, joy of exploration, and so on. The work space within which the human actors can navigate freely during this search is bounded by administrative, functional and safety-related constraints. The normal changes found in local work conditions lead to frequent modifications of strategies and activity will show great variability … During the adaptive search the actors have ample opportunity to identify ‘an effort gradient’ and management will normally supply an effective ‘cost gradient’. The result will very likely be a systematic migration toward the boundary of functionally acceptable performance and, if crossing the boundary is irreversible, an error or an accident may occur. (Rasmussen, 1997, p. 189)

image

Figure 4.1 The difference between the crew’s chart on the morning of the accident, the actual situation (center) and the eventual result of the reconstruction (NFDC or National Flight Data Center chart to the right). From Nelson, 2008

image

Figure 4.2 The structure responsible for safety-control during airport construction at Lexington, and how control deteriorated. Lines going into the left of a box represent control actions, lines from the top or bottom represent feedback

The dynamic interplay between these different constraints and objectives is illustrated in Figure 4.3.

image

Figure 4.3 A space of possible organizational action is bounded by three constraints: safety, workload and economics. Multiple pressures act to move the operating point of the organization in different directions. (Modified from Cook and Rasmussen, 2005)

CONTROL THEORY AND “HUMAN ERROR”

Control theory sees accidents as the result of normal system behavior, as organizations try to adapt to the multiple, normal pressures that operate on them every day. Reserving a place for “inadequate” control actions, as some models do, of course does re-introduce human error under a new label (accidents are not the result of human error, but the result of inadequate control – what exactly is the difference then?). Systems dynamics modeling must deal with that problem by recursively modeling the constraints and objectives that govern the control actions at various hierarchical levels, thereby explaining the “inadequacy” as a normal result of normal pressures and constraints operating on that level from above and below, and in turn influencing the objectives and constraints for other levels. Rasmussen (1997) does this by depicting control of a hazardous technology as a nested series of reciprocally constraining hierarchical levels, down from the political and governmental level, through regulators, companies, management, staff, all the way to sharp-end workers. This nested control structure is also acknowledged by Leveson (2002).

In general, systems dynamics modeling is not concerned with individual unsafe acts or errors, or even individual events that may have helped trigger an accident sequence. Such a focus does not help, after all, in identifying broader ways to protect the system against similar migrations towards risk in the future. Systems dynamics modeling also rejects the depiction of accidents in the traditionally physical way as the latent failure model does, for example. Accidents are not about particles, paths of traveling or events of collision between hazard and process-to-be-protected (Rasmussen, 1997). The reason for rejecting such language (even visually) is that removing individual unsafe acts, errors or singular events from a presumed or actual accident sequence only creates more space for new ones to appear if the same kinds of systemic constraints and objectives are left similarly ill-controlled in the future. The focus of control theory is therefore not on erroneous actions or violations, but on the mechanisms that help generate such behaviors at a higher level of functional abstraction – mechanisms that turn these behaviors into normal, acceptable and even indispensable aspects of an actual, dynamic, daily work context.

Fighting violations or other deviations from presumed ways of operating safely – as implicitly encouraged by other models discussed above – is not very useful according to control theory. A much more effective strategy for controlling behavior is to make the boundaries of system performance explicit and known, and to help people develop skills at coping with the edges of those boundaries. Ways proposed by Rasmussen (1997) include increasing the margin from normal operation to the loss-of-control boundary. This, however, is only partially effective because of risk homeostasis and the law of stretched systems – the tendency for a system under goal pressures to gravitate back to a certain level of risk acceptance, even after interventions to make it safer. In other words, if the boundary of safe operations is moved further away, then normal operations will likely follow not long after – under pressure, as they always are, from the objectives of efficiency and less effort.

CASE 4.4 RISK HOMEOSTASIS

One example of risk homeostasis is the introduction of anti-lock brakes and center-mounted brake lights on cars. Both these interventions serve to push the boundary of safe operations further out, enlarging the space in which driving can be done safely (by notifying drivers better when a preceding vehicle brakes, and by improving the vehicle’s own braking performance independent of road conditions). However, this gain is eaten up by the other pressures that push on the operating point: drivers will compensate by closing the distance between them and the car in front (after all, they can see better when it brakes now, and they may feel their own braking performance has improved). The distance between the operating point and the boundary of safe operations closes up.
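The migration described in this case can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions, not a model taken from Rasmussen or from the risk homeostasis literature: the function name `simulate_operating_point`, the fixed `target_margin` that operators are assumed to tolerate, the `adapt_rate`, and all numbers are hypothetical.

```python
def simulate_operating_point(boundary, target_margin, start, steps=50, adapt_rate=0.2):
    """Toy model: each step, the operating point closes part of the gap
    to the position operators are assumed to tolerate, which sits
    `target_margin` short of the safety boundary (hypothetical rule)."""
    point = start
    history = [point]
    for _ in range(steps):
        desired = boundary - target_margin
        point += adapt_rate * (desired - point)
        history.append(point)
    return history


# Before the intervention: the boundary sits at 100 (arbitrary units).
before = simulate_operating_point(boundary=100.0, target_margin=10.0, start=50.0)
margin_before = 100.0 - before[-1]

# The intervention (e.g., better brakes) pushes the boundary out to 120.
after = simulate_operating_point(boundary=120.0, target_margin=10.0, start=before[-1])
margin_after_initial = 120.0 - after[0]   # margin right after the intervention
margin_after_final = 120.0 - after[-1]    # margin once behavior has adapted

print(f"margin before intervention:  {margin_before:.2f}")
print(f"margin just after:           {margin_after_initial:.2f}")
print(f"margin after adaptation:     {margin_after_final:.2f}")
```

In this sketch the margin jumps when the boundary is moved outward and then shrinks back to roughly its earlier value, because the pressures acting on the operating point (here collapsed into a single adaptation rule) are left unchanged.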

Another way is to increase people’s awareness that the system may be drifting towards the boundary, and then to launch a safety campaign to push back in the opposite direction (Rasmussen, 1997).

CASE 4.5 TAKE-OFF CHECKLISTS AND THE PRESSURE TO DEPART ON-TIME

Airlines frequently struggle with on-time performance, particularly in heavily congested parts of the world, where so-called slot times govern when aircraft may become airborne. Making a slot time is critical, as it can take hours for a new slot to open up if the first one is missed. This push for speed can lead to problems with, for example, pre-take-off checklists, and airlines regularly have problems with attempted take-offs in airplanes that are not correctly configured (particularly the wing flaps, which help the aircraft fly at slower speeds such as in take-off and landing).

One airline published a flight safety newsletter that was distributed to all its pilots. The letter counted seven such configuration events in half a year, where aircraft did not have wing flaps selected before taking off, even when the item “flaps” on the before take-off checklist was read and responded to by the pilots. Citing no change in procedures (so that could not be the explanation), the safety letter went on to speculate whether stress or complacency could be a factor, particularly as it related to the on-time performance goals (which are explicitly stated by the airline elsewhere). Slot times played a role in almost half the events. While acknowledging that slot times and on-time performance were indeed important goals for the airline, the letter went on to say that flight safety should not be sacrificed for those goals. In an attempt to help crews develop their skills at coping with the boundaries, the letter also suggested that crew members should act on ‘gut’ feelings and speak out loudly as soon as something was detected that was amiss, particularly in high workload situations.

Leaving both pressures in place (a push for greater efficiency and a safety campaign pressing in the opposite direction) does little to help operational people (pilots in the case above) cope with the actual dilemma at the boundary. Also, a reminder to try harder and watch out better, particularly during times of high workload, is a poor substitute for actually developing skills to cope at the boundary. Raising awareness, however, can be meaningful in the absence of other possibilities for safety intervention, even if the effects of such campaigns tend to wear off quickly. Greater safety returns can be expected only if something more fundamental changes in the behavior-shaping conditions or the particular process environment (e.g., less traffic due to industry slow-down, leading to less congestion and fewer slot times). In this sense, it is important to raise awareness about the migration toward boundaries throughout the organization, at various managerial levels, so that a fuller range of countermeasures is available beyond telling front-line operators to be more careful. Organizations that are able to do this effectively have sometimes been dubbed high-reliability organizations.

HIGH-RELIABILITY THEORY

High reliability theory describes the extent and nature of the effort that people, at all levels in an organization, have to engage in to ensure consistently safe operations despite its inherent complexity and risks. Through a series of empirical studies, high-reliability organizational (HRO) researchers found that through leadership safety objectives, the maintenance of relatively closed systems, functional decentralization, the creation of a safety culture, redundancy of equipment and personnel, and systematic learning, organizations could achieve the consistency and stability required to effect failure-free operations (LaPorte and Consolini, 1991). Some of these categories were very much inspired by the worlds studied – naval aircraft carriers, for example (Rochlin, LaPorte and Roberts, 1987). There, in a relatively self-contained and disconnected closed system, systematic learning was an automatic by-product of the swift rotations of naval personnel, turning everybody into instructor and trainee, often at the same time. Functional decentralization meant that complex activities (like landing an aircraft and arresting it with the wire at the correct tension) were decomposed into simpler and relatively homogeneous tasks, delegated down into small workgroups with substantial autonomy to intervene and stop the entire process independent of rank. HRO researchers found many forms of redundancy – in technical systems, supplies, even decision-making and management hierarchies, the latter through shadow units and multi-skilling.

When HRO researchers first set out to examine how safety is created and maintained in such complex systems, they focused on errors and other negative indicators, such as incidents, assuming that these were the basic units that people in these organizations used to map the physical and dynamic safety properties of their production technologies, ultimately to control risk (Rochlin, 1999). The assumption was wrong: they were not. Operational people, those who work at the sharp end of an organization, hardly defined safety in terms of risk management or error avoidance. Ensuing empirical work by HRO researchers, stretching across decades and a multitude of high-hazard, complex domains (aviation, nuclear power, utility grid management, navy) would paint a more complex picture. Operational safety – how it is created, maintained, discussed, mythologized – is much more than the control of negatives. As Rochlin (1999, p. 1549) put it:

The culture of safety that was observed is a dynamic, intersubjectively constructed belief in the possibility of continued operational safety, instantiated by experience with anticipation of events that could have led to serious errors, and complemented by the continuing expectation of future surprise.

The creation of safety, in other words, involves a belief about the possibility to continue operating safely. This belief is built up and shared among those who do the work every day. It is moderated or even held up in part by the constant preparation for future surprise – preparation for situations that may challenge people’s current assumptions about what makes their operation risky or safe. It is a belief punctuated by encounters with risk, but it can become sluggish by overconfidence in past results, blunted by organizational smothering of minority viewpoints, and squelched by acute performance demands or production concerns. But that also makes it a belief that is, in principle, open to organizational or even regulatory intervention so as to keep it curious, open-minded, complexly sensitized, inviting of doubt, and ambivalent toward the past (e.g., Weick, 1993).

HIGH RELIABILITY AND “HUMAN ERROR”

An important point for the role of “human error” in high reliability theory is that safety is not the same as reliability. A part can be reliable, but in and of itself it can’t be safe. It can perform its stated function to the expected level or amount, but it is context, the context of other parts, of the dynamics and the interactions and cross-adaptations between parts, that makes things safe or unsafe. Reliability as an engineering property is expressed as a component’s failure rate over a period of time. In other words, it addresses the question of whether a component lives up to its pre-specified performance criteria. Organizationally, reliability is often associated with a reduction in variability, and an increase in replicability: the same process, narrowly guarded, produces the same predictable outcomes. Becoming highly reliable may be a desirable goal for unsafe or moderately safe operations (Amalberti, 2001). The guaranteed production of standard outcomes through consistent component performance is a way to reduce failure probability in those operations, and it is often expressed as a drive to eliminate “human errors” and technical breakdowns.

In moderately safe systems, such as chemical industries or driving or chartered flights, approaches based on reliability can still generate significant safety returns (Amalberti, 2001). Regulations and safety procedures have a way of converging practice onto a common basis of proven performance. Collecting stories about negative near-miss events (errors, incidents) has the benefit that the same encounters with risk show up in real accidents that happen to that system. There is, in other words, an overlap between the ingredients of incidents and the ingredients of accidents: recombining incident narratives has predictive (and potentially preventive) value. Finally, developing error-resistant and error-tolerant designs helps cut down on the number of errors and incidents.

The monitoring of performance through operational safety audits, error counting, process data collection, and incident tabulations has become institutionalized and in many cases required by legislation or regulation. As long as an industry can assure that components (parts, people, companies, countries) can comply with pre-specified and auditable criteria, it affords the belief that it has a safe system. Quality assurance and safety management within an industry are often mentioned in the same sentence or used under one department heading. The relationship is taken as non-problematic or even coincident. Quality assurance is seen as a fundamental activity in risk management. Good quality management will help ensure safety.

Such beliefs may well have been sustained by models such as the latent failure model discussed above, which posited that accidents are the result of a concatenation of factors, a combination of active failures at the sharp end with latent failures from the blunt end (the organizational, regulatory, societal part) of an organization. Accidents represent opportunistic trajectories through imperfectly sealed or guarded barriers that had been erected at various levels (procedural, managerial, regulatory) against them. This structuralist notion plays into the hand of reliability: the layers of defense (components) should be checked for their gaps and holes (failures) so as to guarantee reliable performance under a wide variety of conditions (the various line-ups of the layers with holes and gaps). People should not violate rules, process parameters should not exceed particular limits, acme nuts should not wear beyond this or that thread, a safety management system should be adequately documented, and so forth.

This model also sustains decomposition assumptions that are not really applicable to complex systems (see Leveson, 2002). For example, it suggests that each component or sub-system (layer of defense) operates reasonably independently, so that the results of a safety analysis (e.g., inspection or certification of people or components or sub-systems) are not distorted when we start putting the pieces back together again. It also assumes that the principles that govern the assembly of the entire system from its constituent sub-systems or components are straightforward. And that the interactions, if any, between the sub-systems will be linear: not subject to unanticipated feedback loops or non-linear interactions.
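How much rests on the independence assumption can be shown with a short calculation. The sketch below is only an illustration: the function `p_all_layers_fail`, the per-layer failure probability, and the probability of a shared condition that defeats every layer at once are all hypothetical values chosen for the example, not figures from the sources cited here. It is also itself a simple probabilistic model, so it shows only what happens when independence is relaxed, not the non-linear interactions discussed in the text.

```python
def p_all_layers_fail(p_layer: float, n_layers: int, p_common: float = 0.0) -> float:
    """Probability that every layer of defense fails at the same time.

    p_layer:  assumed failure probability of each layer on its own
    n_layers: number of layers
    p_common: assumed probability of a shared condition (e.g., the same
              production pressure acting on every layer) that defeats
              all layers together; 0.0 means fully independent layers
    """
    independent = p_layer ** n_layers
    return p_common + (1.0 - p_common) * independent


# Three layers, each certified to fail only 1 time in 100 (hypothetical).
p_independent = p_all_layers_fail(0.01, 3)

# Same layers, plus a 1-in-10,000 shared condition that defeats them all.
p_coupled = p_all_layers_fail(0.01, 3, p_common=1e-4)

ratio = p_coupled / p_independent
print(f"independent layers: {p_independent:.2e}")
print(f"with common cause:  {p_coupled:.2e}")
print(f"ratio:              {ratio:.0f}x")
```

With these made-up numbers, a rare shared condition raises the probability of total failure by roughly a factor of one hundred, even though every individual layer still meets its own criterion. Inspecting the layers one at a time would not reveal the difference.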

The assumptions baked into that reliability approach mean that aviation should continue to strive for systems with high theoretical performance and a high safety potential – that the systems it designs and certifies are essentially safe, but that they are undermined by technical breakdowns and human errors. The elimination of this residual reliability “noise” is still a widely-pursued goal, as if industries are the custodian of an already safe system that merely needs protection from unpredictable, erratic components that are the remaining sources of unreliability. This common sense approach, says Amalberti (2001), which indeed may have helped some systems progress to their safety levels of today, is beginning to lose its traction. This is echoed by Vaughan (1996, p. 416):

We should be extremely sensitive to the limitations of known remedies. While good management and organizational design may reduce accidents in certain systems, they can never prevent them … technical system failures may be more difficult to avoid than even the most pessimistic among us would have believed. The effect of unacknowledged and invisible social forces on information, interpretation, knowledge, and – ultimately – action, are very difficult to identify and to control.

Many systems, even after progressing beyond being moderately safe, are still embracing this notion of reliability with vigor – not just to maintain their current safety level (which would logically be non-problematic, in fact, it would even be necessary) but also as a basis for increasing safety even further. But as progress on safety in more mature systems (e.g., commercial aviation) has become asymptotic, further optimization of this approach is not likely to generate significant safety returns. In fact, there could be indications that continued linear extensions of a traditional-componential reliability approach could paradoxically help produce a new kind of system accident at the border of almost totally safe practice (Amalberti, 2001, p. 110):

The safety of these systems becomes asymptotic around a mythical frontier, placed somewhere around 5 × 10⁻⁷ risks of disastrous accident per safety unit in the system. As of today, no man-machine system has ever crossed this frontier, in fact, solutions now designed tend to have devious effects when systems border total safety.

The accident described below illustrates how the reductionist reliability model applied to understanding safety and risk (taking systems apart and checking whether individual components meet prespecified criteria) may no longer work well, and may in fact have contributed to the accident. Through a concurrence of functions and events, of which a language barrier was a product as well as constitutive, the flight of a Boeing 737 out of Cyprus in 2005 may have been pushed past the edge of chaos, into that area in nonlinear dynamics where new system behaviors emerge that cannot be anticipated using reductive logic, and negate the Newtonian assumption of symmetry between cause and consequence.

CASE 4.6 HELIOS AIRWAYS B737, AUGUST 2005

On 13 August 2005, on the flight prior to the accident, a Helios Airways Boeing 737-300 flew from London to Larnaca, Cyprus. The cabin crew noted a problem with one of the doors, and convinced the flight crew to write that the “Aft service door requires full inspection” in the aircraft logbook. Once in Larnaca, a ground engineer performed an inspection of the door and carried out a cabin pressurization-leak check during the night. He found no defects. The aircraft was released from maintenance at 03:15 and scheduled for flight 522 at 06:00 via Athens, Greece to Prague, Czech Republic (AAISASB, 2006).

A few minutes after taking off from Larnaca, the captain called the company in Cyprus on the radio to report a problem with his equipment cooling and the take-off configuration horn (which warns pilots that the aircraft is not configured properly for take-off, even though it evidently had taken off successfully already). A ground engineer was called to talk with the captain, the same ground engineer who had worked on the aircraft in the night hours before. The ground engineer may have suspected that the pressurization switches could be in play (given that he had just worked on the aircraft’s pressurization system), but his suggestion to that effect to the captain was not acted on. Instead, the captain wanted to know where the circuit breakers for his equipment cooling were so that he could pull and reset them.

During this conversation, the oxygen masks deployed in the passenger cabin, as they are designed to do when cabin altitude exceeds 14,000 feet. The conversation with the ground engineer ended, and would be the last that was heard from flight 522. Hours later, the aircraft finally ran out of fuel and crashed in hilly terrain north of Athens. Everybody on board had been dead for hours, except for one cabin attendant who held a commercial pilot’s license. Probably using medical oxygen bottles to survive, he had finally made it into the cockpit, but his efforts to save the aircraft were too late. The pressurization system had been set to manual so that the engineer could carry out the leak check. It had never been set back to automatic (which is done in the cockpit), which meant the aircraft would not pressurize during its ascent unless a pilot manually controlled the pressurization outflow valve during the entire climb. Passenger oxygen had been available for no more than 15 minutes, the captain had left his seat, and the co-pilot had not put on an oxygen mask.

Helios 522 is unsettling and illustrative, because nothing was “wrong” with the components. They all met their applicable criteria. “The captain and First Officer were licensed and qualified in accordance with applicable regulations and Operator requirements. Their duty time, flight time, rest time, and duty activity patterns were according to regulations. The cabin attendants were trained and qualified to perform their duties in accordance with existing requirements” (AAISASB, 2006, p. 112). Moreover, both pilots had been declared medically fit, even though postmortems revealed significant arterial clogging that may have accelerated the effects of hypoxia. And while there are variations in what JAR-compliant means as one travels across Europe, the Cypriot regulator (Cyprus DCA, or Department of Civil Aviation) complied with the standards in JAR OPS 1 and Part 145. This was seen to with help from the UK CAA, who provided inspectors for flight operations and airworthiness audits by means of contracts with the DCA. Helios and the maintenance organization were both certified by the DCA.

The German captain and the Cypriot co-pilot met the criteria set for their jobs. Even when it came to English, they passed. They were within the bandwidth of quality control within which we think system safety is guaranteed, or at least highly likely. That layer of defense – if you choose to speak that language – had no holes as far as our system for checking and regulation could determine in advance. And we thought we could line these sub-systems up linearly, without complicated interactions. A German captain, backed up by a Cypriot co-pilot. In a long-since certified airframe, maintained by an approved organization. The assembly of the total system could not be simpler. And it must have, should have, been safe.

Yet the brittleness of having individual components meet prespecified criteria became apparent when compounding problems pushed demands for crew coordination beyond the routine. As the AAISASB observed, “Sufficient ease of use of English for the performance of duties in the course of a normal, routine flight does not necessarily imply that communication in the stress and time pressure of an abnormal situation is equally effective. The abnormal situation can potentially require words that are not part of the ‘normal’ vocabulary (words and technical terms one used in a foreign tongue under normal circumstances), thus potentially leaving two pilots unable to express themselves clearly. Also, human performance, and particularly memory, is known to suffer from the effects of stress, thus implying that in a stressful situation the search and choice of words to express one’s concern in a non-native language can be severely compromised … In particular, there were difficulties due to the fact that the captain spoke with a German accent and could not be understood by the British engineer. The British engineer did not confirm this, but did claim that he was also unable to understand the nature of the problem that the captain was encountering.” (pp. 122-123).

The irony is that the regulatory system designed to standardize aviation safety across Europe has, through its harmonization of crew licensing, also legalized the blending of a large number of crew cultures and languages inside of a single airliner, from Greek to Norwegian, from Slovenian to Dutch. On the 14th of August 2005, this certified and certifiable system was not able to recognize, adapt to, and absorb a disruption that fell outside the set of disturbances it was designed to handle. The “stochastic fit” (see Snook, 2000) that put together this crew, this engineer, from this airline, in this airframe, with these system anomalies, on this day, outsmarted how we all have learned to create and maintain safety in an already very safe industry. Helios 522 testifies that the quality of individual components or subsystems predicts little about how they can stochastically and non-linearly recombine to outwit our best efforts at anticipating pathways to failure.
