Chapter 4. Causes and Effects

The causes and the effects
are not always
distinguishable.

In Section 4.1, we consider the nature of causative factors associated with risks, including the roles of weak links and multiple simultaneously acting causes. In Section 4.2, we examine the distinctions and similarities between accidental and intentional causes.

4.1 Weak Links and Multiple Causes

A weak link in a system is a point at which a failure or an attack can render the system incapable of continuing to satisfy its requirements (for security, integrity, reliability, survivability, or whatever). As in a chain, in which loss of a single link destroys the entire chain, a weak link can represent the potential for a disaster triggered at a single point. We seek to design reliable systems and secure systems with no (single-point) weak links. However, it is also desirable to protect against multiple causes, because we can never guarantee that only single failures or attacks will occur at any one time.

The cable cuts in Chicago, the New York area, and White Plains noted in Section 2.1 involved single-point weak links whose severance resulted in major consequences because of correlated effects. The RISKS archives contain many other cases in which a single event appeared to be sufficient to trigger a disaster. In many of those cases, however, it was actually the confluence of several events that triggered the problem, as in the case of a Trojan horse written by one person, planted by another, and accidentally triggered by a third. In some of those cases involving multiple causes, the causative events were independent, whereas in other cases they were interrelated. This section discusses some of the underlying causes and their effects, and ways in which the causes can be interrelated.1

4.1.1 Single-Event Causes

We begin with a few cases of single weak links.

Cable cuts with widespread effects

The severance of the cable near White Plains, New York, that disconnected New England from the ARPAnet in 1986 (see Section 2.1) can be viewed as a single-point failure with respect to the cable, but as a highly correlated set of seven events with respect to the seven supposedly independent links. Whereas the intent of the ARPAnet configuration was clearly to provide multipath redundancy, the implementation was not faithful to that intent, transforming the redundant design into one vulnerable to a single-point weakness. The Chicago and New York cable cuts (also noted in Section 2.1) further illustrate multiple correlated effects with respect to the abstraction of individual telephone calls or logical links, but with a single weak-link event at the level of the physical implementation (that is, the cable). In general, events at a particular layer of abstraction may seem to be unrelated, whereas they are interrelated at lower layers.

DC-10 crash blamed on failed indicator

An American Airlines DC-10 crashed because its stall-warning indicator failed. However, the weak-link cause was that the power supposedly driving the warning device was expected to come from the engine that had failed (SEN 11, 5).

Other cases

Of the cases introduced in Section 2.2.2, the Atlas-Agena (missing hyphen), Mariner I (missing overbar in a handwritten equation), Aries (wrong resistor), Gemini V (incorrect programming shortcut), and Phobos 1 (missing character) failures all resulted from human error. It is clear that a minor human error can cause inordinate effects.

Digital versus analog technologies

In general, the technology related to digital computers is substantively different from that of continuous mechanical systems. In mechanical systems, a small fault is usually transformed into a small error. In computer systems, the alteration of a single bit may be enough to cause devastation. The purpose of fault-tolerant computing, discussed in Section 7.7, is to prevent such far-reaching effects resulting from single causes.
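As a concrete illustration of that sensitivity, the following minimal Python sketch (with purely hypothetical values) contrasts the effect of a single flipped bit in a stored integer with a comparably small disturbance in a continuous quantity.

# Illustrative sketch: a one-bit error in a digital value versus a small
# perturbation of an analog (continuous) quantity. All values are hypothetical.

altitude_m = 10_000                  # a digitally stored altitude, in meters

corrupted = altitude_m ^ (1 << 16)   # a single-bit memory error in bit 16
print(corrupted)                     # 75536: one flipped bit, an error of 65,536 meters

voltage = 5.000                      # an analog sensor reading, in volts
drifted = voltage * 1.001            # a 0.1-percent drift
print(drifted)                       # 5.005: a small fault yields a small error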

4.1.2 Independent Multiple Causes

Next, we consider several examples that were caused (or at least reported to be caused) by independent events. In the first case, three analyses were all independently wrong!

Independent tests err harmoniously

In the Handley-Page Victor aircraft, three independent test methods were used to analyze the tailplane stability. Each of the three tests had an error, but coincidentally they all came up with roughly the same numbers—each wrong. This case is considered in greater detail in Section 7.5, during a discussion of the role of modeling and simulation (SEN 11, 2, 12, April 1986, plus erratum in SEN 11, 3, 25, July 1986).

Davis-Besse nuclear-power plant

The Davis-Besse nuclear-power plant emergency shutdown in June 1985 (noted in Section 2.10) involved the loss of cooling water, the breakdown of 14 separate components, and multiple human errors.2

Three Mile Island 2

The Three Mile Island near-meltdown (noted in Section 2.10) involved at least four equipment failures and a design flaw that prevented the correct temperature from being identified, plus serious misjudgment.

Simultaneous multiple disk failures down Toronto Stock Exchange

Even though the Tandem primary and backup computers both operated in the desired nonstop mode, three different and apparently independent disk-drive failures prevented the required data access and caused the Toronto Stock Exchange to shut down for 3 hours on August 16, 1989. (See SEN 14, 6, 8-9, Oct 1989.)

In each of these cases, a combination of supposedly independent circumstances was required. Several other coincidental cases are noted elsewhere, such as the multiple problems with the San Francisco BART and Muni Metro systems all occurring on December 9, 1986 (Section 2.5), and lightning striking twice in the same place (see Section 5.5.2).

4.1.3 Correlated Multiple Causes

In this subsection, we recall several cases from Chapter 2 in which multiple causes were involved, but in which there were correlations among the different causes.

The 1980 ARPAnet collapse, revisited

The four-hour collapse of the ARPAnet [139], which was noted in Section 2.1, required the confluence of four events: the absence of parity checking in memory (an implementation deficiency—parity checking existed in network communications); a flawed garbage-collection algorithm (a design oversimplification); and the concurrent existence of two bogus versions of a status message (resulting from dropped bits) and the legitimate message. These problems seem independent in origin, except that the two bit droppings might have been due to the same transient hardware memory problem. However, they were clearly interrelated in triggering the consequences.
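To suggest how such a confluence can keep a network busy indefinitely, here is a minimal Python sketch based on published accounts of the incident [139], in which status messages carried wraparound sequence numbers and each node retained whichever version compared as newer. The comparison rule and the specific values below are illustrative simplifications, not the actual IMP software.

# Illustrative sketch: with wraparound sequence numbers, "newer than" is not
# transitive, so three coexisting versions of one status message (one genuine,
# two with dropped bits) can supersede one another in a cycle forever.

MOD = 64  # assumed sequence-number space

def newer(a, b):
    """Assumed rule: a supersedes b if it is ahead of b by at most half the space."""
    return (a > b and a - b <= MOD // 2) or (a < b and b - a > MOD // 2)

versions = [44, 40, 8]   # 44 is 101100 in binary; dropped bits yield 40 and 8

for a in versions:
    for b in versions:
        if a != b and newer(a, b):
            print(a, "supersedes", b)
# Prints a cycle (44 over 40, 40 over 8, 8 over 44), so no version ever wins
# and the updates keep propagating until purged.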

The 1990 AT&T system runaway, revisited

The AT&T problem noted in Section 2.1 involved the confluence of several events: an obscure load- and time-dependent flaw in the switch-recovery software that had escaped detection despite extensive testing before installation; the presence of that same flaw in the recovery software of every switching system of that type; an untolerated hardware fault mode that caused one switch to crash and recover; the inability of neighboring switches to adjust to the resumption of activity from that switch, because the recovery occurred during a period of heavy telephone traffic (two signals arriving from the previously dormant switch within about 10 milliseconds triggered the obscure flaw); and the triggering of the same effect in other switches, and later again in the first switch, propagating repeatedly throughout the network. (The effect finally died out when the load dwindled.)

In each of these two cases, there were triggering events (two different status words with dropped bits and a hardware crash, respectively) initiating global effects. However, the startling consequences depended on other factors as well.

Shuttle Columbia return delayed

As noted in Section 2.2.1, the shuttle mission STS-9 lost multiple computers in succession at the beginning of reentry, delaying the reentry. However, one loose piece of solder may have been responsible for all of the different processor failures.

Discovery landing-gear design ignored the possibility of simultaneous failures

The shuttle Discovery had landing-gear problems when returning to earth on April 19, 1985. One of the main wheel sets locked, and then another—blowing out one tire and shredding another. The Aerospace Safety Advisory Panel warned in January 1982 that the “design is such that should a tire fail, its mate (almost certainly) would also fail—a potential hazard.” Indeed, both right tires had problems at the same time. Further, the ASAP again warned in January 1983 that “the landing gear tires and brakes have proven to be marginal and constitute a possible hazard to the shuttle.” (All three of the earlier ASAP recommendations had been ignored.)3

Simultaneous engine failures

The case of the Transamerica Airlines DC-8/73 that had three of its four engines fail at the same time is noted in Section 2.4. In this case, there was a highly correlated common cause, although not one that would normally be considered as a weak link.

Topaz in the rough

An extraordinary case occurred at Digital Equipment Corporation’s System Research Center in Palo Alto, California, on February 16, 1990, when the internal research environment Topaz became unusable. The in-house distributed name service at first seemed to be the culprit, but prolonged system difficulties were all attributable to a remarkable confluence of other interacting events. The description of this case by John DeTreville [36] is a classic, and is recommended as essential reading for system developers.

Patriot missiles

As noted in Section 2.3, the Patriot missile problems can be attributed to a variety of causes, including the clock problem and the nonadherence to the 14-hour maximum duty cycle. But those two causes are correlated, in that the latter would not necessarily have been a problem if the clock problem had not existed.

Common flaws in n-version programming

The technique known as n-version programming involves having several programs written independently from the same specification and then executing them in parallel, with conflicts resolved by majority voting. Brilliant and colleagues [16] have shown that, even when the different versions appear to be independent, they may exhibit common fault modes or have logically related flaws. In one startling case, two separately programmed versions of a more carefully developed and verified program were both wrong; because those two versions coincidentally happened to be consistent with each other, they were able to outvote the correct version, as described by Brunelle and Eckhardt [18].
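The voting arrangement itself is simple; the following Python sketch uses three hypothetical stand-in versions, two of which share a flaw, to show how correlated errors can outvote a correct version.

# Illustrative sketch of 3-version majority voting. The "versions" are
# hypothetical stand-ins; in practice each would be independently implemented
# from a common specification.
from collections import Counter

def version_a(x):            # the carefully developed, correct version
    return x * x

def version_b(x):            # independently written, but flawed for negative x
    return x * x if x >= 0 else -(x * x)

def version_c(x):            # coincidentally shares version_b's flaw
    return x * abs(x)

def vote(x):
    results = [version_a(x), version_b(x), version_c(x)]
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None   # None signals no majority

print(vote(3))     # 9: all three versions agree
print(vote(-3))    # -9: the two flawed versions outvote the correct answer, 9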

Two wrongs make a right (sometimes)

Multiple faults can be self-canceling under certain conditions. Two somewhat obscure wiring errors (perhaps attributable to a single confusion) remained undetected for many years in the pioneering Harvard Mark I electromechanical computer in the 1940s. Each decimal memory register consisted of twenty-three 10-position stepping switches, plus a sign switch. Registers were used dually as memory locations and as adders. The wires into (and out of) the least significant two digits of the last register were crossed, so that the least significant position was actually the second-to-least position, and vice versa, with respect to memory. No errors were detected for many years during which that register was presumably used for memory only, such as in the computation of tables of Bessel functions of the nth kind; the error on read-in corrected itself on read-out. The problem finally manifested itself on the (n+1)st set of tables when that register was used as an adder and the carry went into the wrong position. The problem was detected only because it was standard practice in those days to take repeated differences of successive computed values of the resulting tables; for example, kth differences of kth-order polynomials should be constant. The differencing was done by hand, using very old adding machines.
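The differencing check that exposed the error is easy to reproduce; the Python sketch below (using a hypothetical third-order polynomial) shows that third differences of its tabulated values are constant, so a single misplaced carry disturbs the otherwise uniform difference table.

# Illustrative sketch of the repeated-differencing check: kth differences of a
# kth-order polynomial are constant, so one arithmetic error stands out.

def differences(values):
    return [b - a for a, b in zip(values, values[1:])]

poly = lambda n: n**3 - 2*n + 5               # hypothetical third-order polynomial
table = [poly(n) for n in range(10)]

d = table
for _ in range(3):                            # take third differences
    d = differences(d)
print(d)                                      # [6, 6, 6, 6, 6, 6, 6]: constant

table[5] += 100                               # inject a misplaced-carry-style error
d = table
for _ in range(3):
    d = differences(d)
print(d)                                      # the constant run is visibly disturbed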

There are other examples as well, such as loss of nonredundant power sources in otherwise redundant systems, and replicated program variables (parameters) derived from a common source, as in the case of pilots entering erroneous data once and then being able to copy the input instead of having to reenter it. Further cases are considered in Section 5.5; they relate specifically to accidental denials of service.

4.1.4 Reflections on Weak Links

Networked systems provide an opportunity for identical flaws to exist throughout. In the case of the ARPAnet collapse, each node had identical software, and used the same overly permissive garbage-collection algorithm. In the case of the AT&T blockage, each switch had the identical flawed recovery software. In the case of the Internet Worm [35, 57, 138, 150, 159] discussed in Sections 3.1 and 5.1.1, the worm threaded its way through the same flaws wherever they existed on similar systems throughout the ARPAnet and Milnet. The commonality of mechanism throughout a network or distributed system provides an opportunity for pervasive fault modes and security vulnerabilities.

One of the lessons of this section is that, when a persistent failure occurs, even though it may not yet be critical, the nature of the reliability (and security) problem has changed and the risks must be reevaluated. Indeed, if seemingly noncritical problems are ignored, the risks may escalate dramatically; when further faults occur subsequently, the results can be disastrous. Consequently, individual faults should be repaired as rapidly as is commensurate with the increased risks.

4.2 Accidental versus Intentional Causes

This section considers the potential distinctions between events that are caused accidentally and those that are caused intentionally. We see that weak links may cause serious problems in both cases.

A system may fail to do what was expected of it because of intentional misuse or unintended factors. Intentional system misbehavior may be caused by individuals or teams, acting in a variety of roles such as requirements specifiers, designers, implementers, maintainers, operators, users, and external agents, taking advantage of vulnerabilities such as those enumerated in Section 3.1. Accidental system misbehavior may be caused by the same variety of people plus innocent bystanders, as well as by external events not caused by people (for example, “acts of God”). Types of problems and their accompanying risks are suggested in Sections 1.2 and 1.3.

We next consider the similarities and differences between intentionally caused problems and unintentionally caused ones.

4.2.1 Accidents That Could Have Been Triggered Intentionally

If an event can happen accidentally, it often could be caused intentionally.

Where existing systems have suffered accidental calamities, similar or identical disasters might have been caused intentionally, either by authorized users or by penetrators exploiting security vulnerabilities. Examples of past accidental events follow:

The 1980 ARPAnet collapse. The ARPAnet collapse discussed in Sections 2.1 and 4.1 required a confluence of a hardware-implementation weakness (the lack of parity checking in memory) and a weak software garbage-collection algorithm, and was triggered by hardware faults in memory. The global shutdown was clearly an unintentional problem. However, given the hardware weakness and the software weakness, this fault mode could have been triggered intentionally by the malicious insertion into the network of two or more bogus status messages, which would have had the identical effect. (See [139].)

The 1986 ARPAnet Northeast disconnect. The ARPAnet severance of New England from the rest of the network, discussed in Section 2.1, was caused by a backhoe cutting seven supposedly independent links, all in the same conduit. Knowledge of such a commonality in a more critical network could lead to an attractive target for a malicious perpetrator. Similar conclusions can be drawn about the many other cable cuts noted in Section 2.1.

The 1990 AT&T collapse. The collapse of the AT&T long-distance network depended on the existence of flawed switch-recovery software in each of the switches. The global slowdown was triggered accidentally by a crash of a single switch. However, a maliciously caused crash of a single switch would have produced exactly the same results, perhaps exploiting external development and maintenance routes into the switch controllers. This example illustrates the interrelations between reliability and security threats.

The 1988 Internet Worm. Inherent weaknesses in system software (finger), passwords (for example, vulnerability to dictionary-word preencryptive password attacks), and networking software (the sendmail debug option and .rhosts tables) resulted in severe degradation of Berkeley Unix systems attached to the ARPAnet and Milnet. Although the penetrations were intentional (but not malicious), the rampant proliferation within each attacked system was accidental. However, it is now clear that far worse consequences could have been caused intentionally by a malicious attack exploiting those vulnerabilities. This case is discussed in greater detail in Section 3.1 and in [138, 150, 159].

Chernobyl, 1986. Intentional Chernobyl-like disasters could be triggered by utility operators or disgruntled employees, particularly in control systems in which shutting off the safety controls is an administrative option. In some control systems for other applications, there are remote access paths for diagnosis and maintenance whose misuse could result in similarly serious disasters.

Interference. Various cases of electronic, electromagnetic, and other forms of accidental interference have been observed, and are noted in Section 5.5. Many of those could have been caused intentionally. Indeed, several cases of malicious interference are noted in Section 5.4. Some of these have involved intentional piggybacking onto and spoofing of existing facilities, as has occurred in several takeovers of cable-TV satellite up-links.

Accidental security problems. Numerous accidental security violations such as the unintended broadcasting of a password file can also lead to serious reliability and integrity problems. Accidental security problems are considered in Section 5.2. Many of these could have been caused intentionally.

4.2.2 Potential Malicious Acts That Could Also Be Triggered Accidentally

If an event can be caused intentionally, it often could happen accidentally (although generally with lower likelihood).

Some intentional perpetrations may require combinations or sequences of events, or possibly collusion among several people. In cases requiring such complex circumstances, an accidental occurrence is generally much less likely than an intentional one. For simple events, however, the likelihoods of accidental and intentional occurrence are more nearly comparable. Examples of potential malicious acts follow.

Intentional deletion

On a logged-in Unix keyboard, a malicious passer-by could type rm * (rm is an abbreviation for remove, and the asterisk denotes any name in the current directory); as a result (once the command is completed by a carriage return), every file in the directory would be deleted. A legitimate user of the same directory might accidentally type the same sequence of keystrokes. Alternatively, that user might attempt to delete all files whose names end with the number 8, intending to type rm *8 but instead typing rm *9 by mistake, and then, in attempting to press the delete key to erase the 9, instead hitting the terminating carriage-return key and executing the command. (The delete key and the carriage-return key are adjacent on many keyboards.)
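The slip is easy to see in terms of pattern matching; the following Python sketch (with hypothetical file names) shows how the mistyped pattern, or the bare asterisk, selects entirely different sets of files for deletion.

# Illustrative sketch (hypothetical file names): a one-character slip in a
# wildcard pattern selects a completely different set of files to delete.
from fnmatch import filter as match

directory = ["report8", "report9", "draft8", "draft9", "notes"]

print(match(directory, "*8"))   # ['report8', 'draft8']: the intended targets
print(match(directory, "*9"))   # ['report9', 'draft9']: the mistyped pattern
print(match(directory, "*"))    # every file: the effect of a bare rm *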

Covert activities

Intentional planting of Trojan horses and exploitation of covert channels are examples of covert threats. An accidental occurrence of an equivalent coherent event seems far-fetched in general. However, the triggering of an already planted Trojan horse (as opposed to its implanting) is usually an accidental event, done by an unsuspecting person or by a normal computer-system activity. Such accidental triggering of an intentionally planted pest program bridges the gap between the two types of activity. Furthermore, personal-computer viruses have accidentally been included in vendor software deliveries on several occasions.

Maliciously caused outages

Milnet and other similar networks can be intentionally crashed by subverting the network control. This type of vulnerability existed for many years in the old ARPAnet, in which the nodes (Interface Message Processors, or IMPs) were maintained by Bolt, Beranek and Newman (BBN) from Cambridge, Massachusetts, using the network itself for remote debug access and for downloading new software. Subverting the network control would have been relatively easy, although, to the best of my knowledge, this vulnerability was never attacked. Malicious remote downloading of bogus node software that subsequently prevented remote downloading would be a particularly nasty form of this attack. Such a collapse would seem to be relatively unlikely to occur accidentally. However, a trapdoor that was intentionally installed could be triggered accidentally. For example, a trapdoor is sometimes installed for constructive emergency use in debugging or maintenance (although that is a dangerous practice). Section 5.1.2 gives examples of pest programs that were installed for malicious purposes, such as subsequent attempts at blackmail or at gaining employment to repair the problem; those programs also could have been triggered accidentally. In addition, penetration of telephone network-control systems has occurred, and may have resulted in outages.

4.3 Summary of the Chapter

Considerable commonality exists between reliability and security. Both are weak-link phenomena. For example, certain security measures may be desirable to hinder malicious penetrators, but do relatively little to reduce hardware and software faults. Certain reliability measures may be desirable to provide hardware fault tolerance, but do not increase security. On the other hand, properly chosen system architectural approaches and good software-engineering practice can enhance both security and reliability. Thus, it is highly advantageous to consider both reliability and security within a common framework, along with other properties such as application survivability and application safety.

From the anecdotal evidence, there are clearly similarities among certain of the causes of accidental and intentional events. However, some accidents would have been difficult to trigger intentionally, and some malicious acts would have been difficult to reproduce accidentally. There are no essential functional differences between accidental and intentional threats, with respect to their potential effects; the consequences may be similar—or indeed the same. However, there are some differences in the techniques for addressing the different threats. The obvious conclusion is that we must anticipate both accidental and intentional types of user and system behavior; neither can be ignored.

In many cases of system disasters, whether accidentally or intentionally caused, a confluence of multiple events was involved that resulted in significant deviations from expected system behavior. This phenomenon is discussed in Section 4.1, although that discussion centers on unintentional events—that is, cases of unreliability. The discussion is immediately extensible to intentional misuse—that is, to violations of the intended security or integrity.

• Some accidental system problems that seemed attributable to single weak links were actually due to multiple events (independent or correlated). The 1980 ARPAnet collapse is an example.

• Some system problems that seemed attributable to multiple events were actually due to single weak links. The 1986 ARPAnet Northeast disconnect noted in Section 2.1 is an example. Many of these cases also could have been triggered by a single malicious act.

• Some system problems that seemed attributable to independent events were in reality caused by correlated events. The 1980 ARPAnet collapse noted in Section 2.1 illustrates this correlation, as does the case in which two separately programmed versions of a thoroughly developed module exhibited a common flaw that permitted them to outvote the correct version [18].

Thus, it becomes imperative for system designers to anticipate the occurrence of multiple deleterious events, whether accidentally or intentionally caused, and to anticipate problems of both reliability and security. Problems that are seemingly completely unrelated to security may nevertheless have security implications. Examples, such as the Sun integer divide that could be used to gain root privileges (noted in Section 3.7), are scattered throughout this book.

A system cannot be secure unless it is adequately reliable, and cannot be either reliable or available if it is not adequately secure. There are many further references to system reliability (see Chapter 2) and to system security (see Chapters 3 and 5). However, there is only sparse literature on the subject of systems that must be both reliable and secure.4

Challenges

C4.1 Identify how security and reliability interact, in hardware, in system software, in applications, and in environments transcending computer and communication systems. Include confidentiality, integrity, and availability among your security requirements.

C4.2 Pick a system with which you are familiar. Find and describe its main weak links with respect to any of the sources of risks in Section 1.2.

C4.3 Reconsider your design for Challenge C2.4. Having read this chapter, can you find additional sources of failure, including multiple weak links?
