Chapter 9. Implications and Conclusions

Data without generalization
is just gossip.

PHAEDRUS, IN LILA,
AN INQUIRY INTO MORALS,
ROBERT M. PIRSIG [133]

THUS FAR, WE HAVE RECOUNTED a large number of strange cases relating to computer-communication systems, including many that involve people who have either misused those systems or been victimized by them. Interspersed throughout are various analyses of the underlying causes and the nature of the effects, along with some philosophical and technological views on what might be done to lessen the risks. In this chapter, we draw some general conclusions about the underlying problems, and assess our chances for the future.

9.1 Where to Place the Blame

To err is human.
To blame it on a computer is even more so.1

Increasingly, we hear of blame being allocated rather simplistically and conveniently to one particular cause, such as the operator, the computer, or some external factor. In most cases, however, there are multiple factors that contribute. We have discussed attempts to place blame, such as in various air accidents (Section 2.4) and in the Exxon Valdez oil spill (Section 8.2). We have also considered many cases with identifiable multiple causes (Section 4.1).

Often, people are blamed — perhaps to perpetuate the myth of the technology's infallibility. For example, the blame for Chernobyl was attributed in Soviet reports to a few errant people; the long-term health effects were suppressed completely, reflecting an attempt to downplay the real risks. In the Vincennes’ shootdown of Iran Air 655, blame was directed at a single crew member (and at the Iranians for flying over the ship), irrespective of whether the Aegis computer system was appropriate for the job at hand. In the Patriot case, the clock-drift problem was seemingly critical; had the program fix not taken so long to arrive (by plane), the Dhahran disaster might have been averted. On the other hand, the analysis of Ted Postol noted in Section 2.3 suggests that the system was fundamentally flawed and would have been deficient even if it had been rebooted every 14 hours.

We noted in Section 8.1 that most system problems are ultimately and legitimately attributable to people. However, human failings are often blamed on “the computer” — perhaps to protect the individuals. This misdirection seems especially common in consumer-oriented applications, where human shortcomings are routinely written off as “a computer glitch.” Yet computer-system malfunctions usually have underlying causes attributable to people; if the technology is faulty, the faults frequently lie with the people who create and use it.

Systems must be designed to withstand reasonable combinations of multiple hardware faults, system failures, and human mistakes. An obvious problem with this conclusion is that, no matter how defensive a design is, there will be combinations of events that have not been anticipated. In general, it is easy to ignore the possibility of a devastating event. Furthermore, Murphy’s Law suggests that even seemingly ridiculous combinations of improbable events will ultimately occur. Not surprisingly, many of the cases cited here illustrate that suggestion.

9.2 Expect the Unexpected!

What we anticipate seldom occurs; what we least expect generally happens.

BENJAMIN DISRAELI2

One of the most difficult problems in designing, developing, and using complex hardware and software systems involves the detection of and reaction to unusual events, particularly in situations that were not understood completely by the system developers. There have been many surprises. Furthermore, many of the problems have arisen as a combination of different events whose confluence was unanticipated.

Prototypical programming problems seem to be changing. Once upon a time, a missing bounds check enabled reading off the end of an array into the password file that followed it in memory. Virtual memory, better compilers, and strict stack disciplines have more or less resolved that problem; however, similar problems continue to reappear in different guises. Better operating systems, specification languages, programming languages, compilers, object orientation, formal methods, and analysis tools have helped to reduce certain problems considerably. However, some of the difficulties are getting more subtle — especially in large systems with critical requirements. Furthermore, program developers seem to stumble onto new ways of failing to foresee all of the lurking difficulties.
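
To make that classic failure mode concrete, here is a minimal C sketch of how a read with no bounds check can spill from one buffer into sensitive data that happens to follow it in memory. The structure, names, and sizes are invented for illustration; they are not drawn from any of the historical systems mentioned above.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical illustration only: two buffers happen to be adjacent in
       memory, and an unbounded read walks off the end of the first buffer
       into the second.  Exact layout is compiler-dependent; this sketches
       the failure mode, not any particular historical system.              */

    struct region {
        char user_input[8];      /* externally influenced buffer            */
        char secret[16];         /* sensitive data that follows it          */
    };

    int main(void) {
        struct region r;
        memset(&r, 0, sizeof r);
        strcpy(r.secret, "hunter2");          /* stands in for the password file */
        memcpy(r.user_input, "AAAAAAAA", 8);  /* fills the buffer exactly, no NUL */

        /* Flawed: %s keeps reading until it finds a NUL byte, running past
           user_input into secret and printing it.                          */
        printf("echoing input: %s\n", r.user_input);

        /* Defensive version: bound the read explicitly.                    */
        printf("echoing input: %.8s\n", r.user_input);
        return 0;
    }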

Here are a few cases of particular interest from this perspective, most of which are by now familiar.

Air New Zealand

The Air New Zealand crash of November 28, 1979 (Section 2.4) might be classified as a partially anticipated event—in that a serious database error in course data had been detected and corrected locally, but had not yet been reported to the pilots.

British Midland

The crash of the British Midland 737-400 on January 8, 1989 (Section 2.4) illustrates the possibility of an event that was commonly thought to be impossible—namely, both engines being shut down at the same time during flight. The chances of simultaneous failure of both engines are commonly estimated at somewhere between one in 10 million and one in 100 million—which to many people is equivalent to “it can’t happen.” In this case, the pilot mistakenly shut down the only working engine.

F-18

The F-18 crash due to an unrecoverable spin (Section 2.3) was attributed to a wild program transfer.

Rail ghosts

As noted in Section 2.5, the San Francisco Muni Metro light rail system had a ghost train that appeared in the computer system (but not on the tracks), apparently blocking a switch at the Embarcadero station for 2 hours on May 23, 1983. The problem was never diagnosed. The same ghost reappeared on December 9, 1986.

BoNY

The Bank of New York (BoNY) accidental $32-billion overdraft, due to an unchecked counter overflow in its transaction-processing software, was certainly a surprise; BoNY had to fork over $5 million for a day’s interest. (See Section 5.7 and SEN 11, 1, 3-7.)
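
The sketch below shows how a fixed-width counter can wrap around silently. The 16-bit width, the names, and the defensive check are assumptions for illustration, not details taken from the BoNY incident reports.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A silent 16-bit wraparound of the general kind blamed for the BoNY
       overdraft; widths and names here are assumed for illustration.      */

    static uint16_t next_id;          /* transaction index, wraps at 65535 */

    static uint16_t issue_id_unchecked(void) {
        return next_id++;             /* 65535 + 1 silently becomes 0      */
    }

    static uint16_t issue_id_checked(void) {
        if (next_id == UINT16_MAX) {  /* refuse to wrap around             */
            fprintf(stderr, "transaction counter exhausted; stop intake\n");
            exit(EXIT_FAILURE);
        }
        return next_id++;
    }

    int main(void) {
        for (unsigned long n = 0; n < 70000; n++)
            (void)issue_id_unchecked();
        /* Prints 4464: later transactions now collide with earlier ones.  */
        printf("after 70000 issues, next_id = %u\n", (unsigned)next_id);
        (void)issue_id_checked();
        return 0;
    }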

Reinsurance loop

A three-step reinsurance cycle was reported in which one firm reinsured with a second, which reinsured with a third, which unknowingly reinsured with the first; the first firm was thus reinsuring itself and paying commissions for accepting its own risk. The computer program checked only for shorter cycles (SEN 10, 5, 8-9).
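
A check limited to cycles of a fixed length misses exactly this kind of case. The sketch below, with an invented four-firm example, uses a depth-first search to detect a reinsurance chain of any length that returns to its starting firm.

    #include <stdio.h>

    #define NFIRMS 4

    /* reinsures[a][b] = 1 means firm a reinsures with firm b.
       Hypothetical data containing a three-step cycle: 0 -> 1 -> 2 -> 0.   */
    static const int reinsures[NFIRMS][NFIRMS] = {
        {0, 1, 0, 0},
        {0, 0, 1, 0},
        {1, 0, 0, 0},
        {0, 0, 0, 0},
    };

    /* Does any chain of reinsurance starting at 'firm' lead back to
       'origin'?  Unlike a check for two-step cycles only, this catches
       cycles of any length.                                                */
    static int leads_back(int firm, int origin, int visited[NFIRMS]) {
        for (int next = 0; next < NFIRMS; next++) {
            if (!reinsures[firm][next])
                continue;
            if (next == origin)
                return 1;
            if (!visited[next]) {
                visited[next] = 1;
                if (leads_back(next, origin, visited))
                    return 1;
            }
        }
        return 0;
    }

    int main(void) {
        for (int f = 0; f < NFIRMS; f++) {
            int visited[NFIRMS] = {0};
            if (leads_back(f, f, visited))
                printf("firm %d is (indirectly) reinsuring itself\n", f);
        }
        return 0;
    }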

Age = 101

In Section 2.11, we note the Massachusetts man whose insurance rate tripled when he turned 101. He had been converted into a youthful driver (age 1 year!) by a program that mishandled ages over 100.
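
The report does not say precisely how the program mangled the age; a two-digit age field is a plausible and typical cause. The sketch below illustrates that assumed mechanism, along with a simple range check that would have caught the nonsense value.

    #include <stdio.h>

    /* Assumed mechanism: a two-digit age field drops the hundreds digit,
       so 101 is stored as 101 % 100 = 1 and the policyholder is rated as
       a 1-year-old "youthful driver".                                     */
    int main(void) {
        int true_age = 101;
        int stored_age = true_age % 100;   /* two-digit field truncation   */
        printf("true age %d, stored age %d\n", true_age, stored_age);

        /* A simple plausibility check would catch the bad value.          */
        if (stored_age < 16)
            printf("warning: implausible age for a licensed driver\n");
        return 0;
    }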

Monkey business

Chimpanzee Number 85 (who, for the launch, was named Enos, Hebrew for man) had some hair-raising adventures aboard the Mercury-Atlas MA-5 mission on November 29, 1961. His experimental station malfunctioned, rewarding proper behavior with electric shocks instead of banana pellets. In response, he tore up his restraint suit and its instrumentation.3

Another monkey made the news in mid-October 1987, aboard Cosmos 1887, a Soviet research satellite. Yerosha (which is Russian for troublemaker) created havoc by freeing its left arm from a restraining harness, tinkering with the controls, and playing with its cap fitted with sensing electrodes.

A monkey slipped out of its cage aboard a China Airlines 747 cargo plane before landing at Kennedy Airport, and was at large in the cabin. After landing, the pilot and copilot remained in the cockpit until the animal-control officer thought that he had the monkey cornered at the rear of the plane. After the pilot and copilot left, the monkey then entered the cockpit (macaquepit?) and was finally captured while sitting on the instrument panel between the pilot and copilot seats.4

Stock-price swing

Mark Brader reported a case of a sanity check that backfired. Following speculation on Canadian Pacific Railway stock over disputed land ownership, the stock dropped $14,100 (Canadian) per share in 1 day when the lawsuit failed. The stock was reported as “not traded” for the day because a program sanity check rejected the large drop as an error (SEN 12, 4, 4-5).

The ozone hole

As noted briefly in Section 8.2, the ozone hole over the South Pole could have been recognized from the computer-analyzed data years before it was actually accepted as fact, but the telltale readings were ignored by the software for 8 years because the deviations from the expected norm were so great that the program rejected them as errors (SEN 11, 5). Years later, when the Discovery shuttle was monitoring the ozone layer and other atmospheric gases, a high-rate data-channel problem blocked transmission; because the shuttle computer system did not have enough onboard storage, much of the mission data could not be recorded.5
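
Both the stock-price and ozone cases follow the same pattern: a plausibility check silently discards a surprising value instead of flagging it for human attention. The sketch below contrasts the two policies; the threshold and data values are invented for illustration.

    #include <math.h>
    #include <stdio.h>

    #define MAX_PLAUSIBLE_CHANGE 50.0   /* assumed plausibility bound       */

    /* Policy 1: silently discard the outlier (reproduces the failures
       described above -- the surprising value vanishes without a trace).  */
    static double silently_clean(double previous, double reading) {
        if (fabs(reading - previous) > MAX_PLAUSIBLE_CHANGE)
            return previous;
        return reading;
    }

    /* Policy 2: keep the value but flag it for human review.              */
    static double flag_and_keep(double previous, double reading) {
        if (fabs(reading - previous) > MAX_PLAUSIBLE_CHANGE)
            fprintf(stderr, "suspect reading %.1f (previous %.1f); review needed\n",
                    reading, previous);
        return reading;
    }

    int main(void) {
        double prev = 300.0, huge_drop = 120.0;    /* invented values       */
        printf("silent policy keeps:   %.1f\n", silently_clean(prev, huge_drop));
        printf("flagging policy keeps: %.1f\n", flag_and_keep(prev, huge_drop));
        return 0;
    }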

ARPAnet collapse

The ARPAnet network collapse (Section 2.1) was triggered by dropped bits in a status message. That problem had been partially anticipated: checksums had been provided to detect single-bit errors in transmission, but not in memory. A single parity-check bit would have detected the first bit dropped in memory, so that the second dropped bit could not have occurred undetected. Indeed, newer node hardware now checks for an altered bit.
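
The sketch below shows the principle of a single even-parity bit over a memory word, which suffices to detect any single altered bit. It is meant only as an illustration of the principle and is not modeled on the actual IMP hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Even parity: the stored parity bit makes the total number of 1-bits
       even, so any single altered (or dropped) bit changes the parity and
       is caught on the next check.                                         */
    static unsigned parity16(uint16_t w) {
        unsigned p = 0;
        while (w) { p ^= (w & 1u); w >>= 1; }
        return p;                      /* 1 if the word has odd parity      */
    }

    int main(void) {
        uint16_t word = 0x5AC3;
        unsigned stored_parity = parity16(word);   /* computed when stored  */

        uint16_t corrupted = word ^ (1u << 7);     /* one bit drops in memory */
        if (parity16(corrupted) != stored_parity)
            printf("memory error detected: word 0x%04X fails parity\n", corrupted);
        return 0;
    }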

Other unexpected examples

We can also note the AT&T outage, the Therac-25, and the clock overflows, discussed in Sections 2.1, 2.9, and 2.11, respectively. Before the Internet Worm, the sendmail debug option was fairly widely known, although the fingerd trapdoor was not (Section 5.1). However, the massive exploitation of those system flaws was clearly unexpected.

9.2.1 Good Practice

Systematic use of good software-engineering practice, careful interface design, strong typing, and realistic sanity checks are examples of techniques that can help us to overcome many such problems. We must strive much harder to develop better systems that avoid serious consequences from unexpected events. (Section 7.8 gives further examples of systems that did not adhere to good software engineering practice.)

There are wide ranges in the degree of anticipation. In some cases, events occur that were completely unanticipated. In some cases, events were considered but ignored because they were thought to be impossible or highly unlikely. In other cases, an event might have been anticipated, and an attempt made to defend against it, but nevertheless that event was handled improperly (for example, incompletely or erroneously). In other cases, great care may have gone into anticipating every eventuality and planning appropriate actions. Even then, there are cases of unexpected events. In some of these cases, the designers and implementers were like ostriches with their heads in the sand. In others, their eyes were open, but they mistakenly assumed the presence of another ostrich instead of a charging lion.

The distinction among different degrees of anticipation is often not clear, especially a priori, which suggests the need for extensive analytical techniques above and beyond the routine testing that is done today. As Section 7.10 makes clear, however, there are many pitfalls in risk analysis.

Although some of the occurrences of inadequately anticipated events may be noncritical, others may be devastating. There are many examples of serious unanticipated events and side effects given here—for example, the loss of the Sheffield in the Falklands War due to an unanticipated interference problem, cancer radiation-therapy deaths due to a programming flaw, two deaths due to microwaves interfering with pacemakers, and so on.

The paradoxical maxim, “Expect the Unexpected!” relates to the anticipation of unusual events with potentially critical side effects, and their avoidance or accommodation. The paradox, of course, arises in trying to know that all of the previously unanticipated events have indeed now been anticipated. Approaching fulfillment of this maxim is decidedly nontrivial. In the context of a single program, it may be attainable; involving the entire environment in which the computing systems operate, and including all of the people in the loop, it is essentially impossible.

9.2.2 Uncertainty

Perhaps the most frustrating of all are those cases in which uncertainty and doubts remain long after the event. Missing evidence is a contributing problem—for example, when an aircraft flight recorder goes down with the aircraft or when the appropriate audit trail is not even recorded. As with people, whose behavior may be altered as a result of being observed (as noted in Section 8.1), computer systems are also subject to the Heisenberg phenomenon. It may be difficult to conduct experiments that permit adequate diagnostics (barring system redesign), or to simulate such events before they happen in real systems. Asynchronous systems tend to compound the diagnostic problems, as we observe from Sections 2.1 and 7.3.

Much greater care and imagination are warranted in defending against computers that go bump in the night due to unexpected events. Ensuring adequate diagnostic ability before, during, and after such events is particularly desirable.

Informed speculation is a necessary part of the detective and preventive processes, particularly in the absence of discernible causes. It permits the true ghostbusters to emerge. Press reports tend toward gross oversimplifications, which are especially annoying when the facts are not adequately ascertainable. Risks Forum discussions remind us that blame is too often (mis)placed on “computer error” and “acts of God” (which serve as euphemisms for “human error”—particularly in system development and in use). On the other hand, “human error” by users often has as a cause “bad interface design” by developers. The blame must not be placed solely on people when the interface itself is unwieldy, confusing, or erroneous. Too much effort is spent on arguing the chicken-and-egg problem rather than on building systems better able to cope with uncertainty; see Section 9.1.

A somewhat extreme example of how we might learn to expect the unexpected happened to me in the early 1960s. I had just boarded a local two-car train in Peapack, New Jersey, to return to Bell Telephone Laboratories in Murray Hill after attending a funeral, and was standing in the center aisle of the train holding my wallet while waiting for the conductor to make change. It was a hot and sultry day. Suddenly a gust of wind swept across the aisle and blew my wallet out of my hand through the nearest window. The conductor kindly asked the engineer to back up the train until he found the wallet at track-side. I learned to realize that something seemingly ridiculous (the wallet disappearing) and something extremely unusual (an engineer backing up a train to pick up an object) could actually happen. That event significantly colored my perspective of anticipating what might happen.

Herb Caen noted another unusual occurrence of a man pitching horseshoes who missed. The spark from the horseshoe hitting a rock caused a 15-acre grass fire.6 High-tech or low-tech, the risks abound.

9.3 Avoidance of Weak Links

[A]ll mechanical or electrical or quantum-mechanical or hydraulic or even wind, steam or piston-driven devices, are now required to have a certain legend emblazoned on them somewhere. It doesn’t matter how small the object is, the designers of the object have got to find a way of squeezing the legend in somewhere, because it is their attention which is being drawn to it rather than necessarily that of the user’s. The legend is this:

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.

DOUGLAS ADAMS7

Attaining security, safety, high availability, and other critical requirements may be viewed as a weak-link problem, as discussed in Section 4.1; potentially, weak links may be broken or combinations of events may exceed the defensive capabilities of the system, no matter how defensively the system is designed and implemented. Systems can be designed to minimize the presence of such vulnerabilities, but there are inevitably circumstances in which the defenses can be overcome, accidentally or intentionally, and, in the latter case, with or without malice.

In general, it is desirable to cover all likely modes of faults, malicious misuses, and external circumstances, but also to ensure that the system will do something reasonable in the case of less likely events. Too often, however, unanticipated events may actually be much more likely than they seemed, with unpredictable consequences. Furthermore, the inability to anticipate unforeseen breakdown modes often results in those occurrences being difficult to diagnose and repair.

9.3.1 Lessons

Allegorically speaking, there are various morals to this story, with a meta(eu)phoria of (o)variety.

Do not put all your eggs in one basket. Centralized solutions are inherently risky if a weak-link failure can knock out the entire system. Even if the centralized mechanism is designed to be highly robust, it can still fail. If everything depends on it, the design is a poor one.

Do not have too many baskets. A risk of distributed systems (see Section 7.3) is that physical dispersion, intercommunications, distributed control, configuration control, and redundancy management may all become so complex that the new global fault modes cannot be accommodated adequately, as exemplified by the 1980 ARPAnet and 1990 AT&T collapses discussed in Section 2.1. There is also a risk of overdependence on individual baskets, as suggested by the Lamport quote at the beginning of Section 7.3.

Do not have too many eggs. There are saturation effects of trying to manage too many objects all at the same time, irrespective of whether the eggs are all in one basket or are dispersed widely.

Do not have too few eggs. There are problems relating to multiplexing among inadequate resources, recovering from malfunctions, and providing timely alternatives.

Avoid baskets and eggs of poor quality. Poor quality may result from improper care in the production process and shabby quality control. One rotten egg can spoil the whole crate, just as entire networks have had great falls because of uncovered fault modes. Furthermore, a rotten egg in one basket rather surprisingly may be able to spoil the other baskets as well.

Know in advance what you are trying to do. Look at the big picture. Are you trying to make a cake, an omelet, an operating system, or a worldwide network?

Be careful how you choose and organize your roosters. Chief programmers and chief designers can be very effective, but only if they are genuinely competent. Otherwise, they can lead to more problems than they can solve.

Be careful how you choose and organize your hens. Structuring your system into layers can help, or am I egging you on too much?

9.3.2 Defenses

Mechanisms for increasing system dependability (for example, security controls such as access controls and authentication; reliability measures such as fault avoidance, fault masking, and recovery, where faults may be introduced accidentally or maliciously; real-time checking; and proofs of algorithmic, design, or code correctness) may require vastly more complexity than the mechanisms they attempt to make dependable. Consequently, increased complexity may actually introduce greater risks.

The attainment of human safety typically may depend on security, availability, real-time performance, and reliability (among other requirements), and on the hardware, software, and operating environment behaving appropriately. Interactions among reliability, availability, and security are particularly intricate. Security flaws often may be exploited in combination with one another, and in combination with correlated fault modes. Exploitation of security flaws may also accidentally trigger reliability problems. In general, it is desirable to plan defensively to protect a system against both accidental and malicious correlated events.

It is risky to make oversimplified design assumptions, and even more dangerous to build systems that critically depend on the perpetual validity of those assumptions. Defensive design must be layered to accommodate failures of underlying mechanisms through higher-layer recovery, alternative strategies, or backups.

It is often self-defeating to pin the blame on a donkey—for example, on a designer, programmer, system operator, or penetrator, or worse yet, on the computer. Blame is often simplistically misplaced. Several factors may be involved. Especially in distributed systems, blame is typically distributable. (See Section 9.1.)

The faithful reader may by now be numbed by the continual reappearance of several of the examples. In fact, cases such as the ARPAnet and AT&T outages are multipurpose parables for our time. They serve to illustrate many different points, most notable of which is that seemingly isolated events are often related and—in combination with other events—may induce serious problems that were not adequately anticipated during system development and operation. They suggest that much greater care is needed in developing critical systems.

9.4 Assessment of the Risks

If I don’t know I don’t know, I think I know.
If I don’t know I know, I think I don’t know.

R.D. LAING, KNOTS, VINTAGE BOOKS, RANDOM HOUSE, 1972.8

In the presence of computer-related risks as well as the difficulties inherent in risk analysis (see Section 7.10), risk assessment and its subsequent implications may necessitate tradeoffs among technical, social, economic, and political factors (among others). These tradeoffs are generally difficult to quantify, because each person’s basis for judgment is likely to be different. Some risks are intrinsic in the use of technology; others can be surmounted with suitable effort and cost. The basic question to be addressed is this: Under what circumstances are the risks acceptable? An honest assessment of that question involves an analysis of all of the relevant factors, as well as an analysis of the analysis. But the decisions must be based on more than just narrowly conceived reasons such as bottom-line short-term profit.

It is exceedingly difficult to predict the future based only on our (incomplete) knowledge of the past and the present. Worse yet, in an apparently sound system, there always exists some risk of a malfunction or misuse that never previously occurred. Even afterward it may be impossible to find out with certainty what caused the problem.

In making supposedly rational arguments about why hardware and software will operate correctly or might continue to operate correctly, we make all sorts of implicit underlying assumptions that may be true most of the time, but which may indeed break down—particularly in times of stressed operation. Some of these are noted here. At a lower level of detail, some nasty problems seem to involve unanticipated conditions in timing and sequencing, in both synchronous and asynchronous systems, and particularly in distributed systems that rely on interdependencies among diverse components. We have various examples of such problems in the past—once again, for example, the 1980 ARPAnet collapse (due to what might be called an accidentally propagated data retrovirus after years of successful operation, noted in Section 2.1) and the delay in the first shuttle launch (Section 2.2.1).

In our lives as in our computer systems, we tend to make unjustified or oversimplified assumptions, often that adverse things will not happen. In life, such assumptions make it possible for us to go on living without inordinate worry (paranoia). In computer systems, however, greater concern is often warranted — especially if a system must meet critical requirements under all possible operating conditions to avoid serious repercussions. Thus, it seems essential that we try to make our assumptions about computer systems and their use both more explicit and more realistic. Basing a system design on assumptions that are almost always but not quite always true may seem like a close approximation, but may imply the presence of serious unanticipated risks. In systems such as those involved in the Strategic Defense Initiative, for example, many potentially critical assumptions have to be sufficiently correct. Certain erroneous assumptions could be catastrophic. Besides, we understand relatively little about long-term effects (whether they eventually manifest themselves as obvious effects or remain undetected as invisible side effects).

Many of the computer-related problems noted here involve unanticipated interactions between components that appeared to work correctly in isolation, or short-sighted assumptions that certain events could never happen. Furthermore, scapegoats are often found for someone else’s problem.

These cases illustrate the dangers of relying on a set of assumptions that supposedly underlie proper system behavior. There are also lessons that we can learn from environmental risks such as toxic substances in our food, drink, and environments; some of those risks were known in advance but were ignored — for example, for commercial reasons; others came as “surprises” (thalidomide, for example), but probably represented a lack of due care and long-term testing. In some cases, the risks were thought of, but were considered minimal (Bhopal, Chernobyl, Three Mile Island). In other cases, the risks were simply never considered.

Greater humility is required. Science (especially computer science) does not have all the answers. Furthermore, the absence of any one answer (and indeed suppression of a question that should have been asked) can be damaging. But, as we see from the nature of the problems observed here, some people keep their heads in the sand — even after being once (or multiply) burned. Eternal vigilance is required of all of us. Some bureaucrats and technocrats may be fairly naïve about the technology, making statements such as “Don’t worry, we can do it, and nothing can go wrong.” Further, after a disaster has happened, the common response is that “We’ve fixed that problem, and it will never happen again.”

The cases presented here may help to cast doubts. But technocrats who say “we can’t do it at all” also need to be careful in their statements; in the small, anything is possible. (Both the Cassandra and the Pollyanna attitudes are unwise—although the original Cassandra ultimately turned out to be correct!)

We must remember that systems often tend to break down when operating under conditions of stress—particularly when global system integration is involved. Furthermore, it is under similar conditions that people tend to break down.

9.5 Assessment of the Feasibility of Avoiding Risks

What is the use of running when we are not on the right road?

OLD PROVERB

By now you may have detected a somewhat schizoid alternation between pessimism and optimism in this book—for example, concerning the issue of trust discussed in Section 8.2 or the issue of attaining reliability, safety, and security. There is optimism that we can do much better in developing and operating computer-related systems with less risk. There is pessimism as to how much we can justifiably trust computer systems and people involved in the use of those systems. People active in the forefronts of software engineering, testing, and formal methods tend to have greater optimism regarding their own individual abilities to avoid or at least identify flaws in systems and to provide some real assurances that a system will do exactly what is expected (no more and no less, assuming the expectations are correct), at least most of the time and under most realistic anticipated circumstances. However, serious flaws abound in systems intended to be secure or reliable or, worse yet, correct! Further, we have seen repeatedly that the operating environments are not always what had been expected, and that the expectations of human and system behavior are not always accurate either.

9.5.1 Progress

Is our ability to develop dependable computer-related systems generally improving? A retrospective perusal of this book suggests that there is a continual increase in the frequency of horror stories being reported and relatively few successes; new problems continue to arise, as well as new instances of old problems. One Risks Forum reader noted that the cases being reported seem to include many repetitions of earlier cases, and that the situation indeed does not seem to be improving.

Several comments are in order:

• Unqualified successes are rare.

• Failures are easier to observe than are successes, and are more eye-catching. Besides, the Risks Forum is explicitly designed for reporting and analyzing the difficulties.

• Henry Petroski’s oft-quoted observation that we tend to learn much more from our failures than from our successes suggests that the reporting of failures may be a more effective educational process.

• The people who desperately need to learn how to avoid repetitive recurrences are not doing so. Robert Philhower suggested that the longevity of the Risks Forum itself attests to the fact that not enough people are properly assimilating its lessons.

• Experience is the best teacher, but once you have had the experience, you may not want to do it again (and again)—or even to teach others.

• We still do not appreciate sufficiently the real difficulties that lurk in the development, administration, and use of systems with critical requirements.

• Developers badly underestimate the discipline and effort required to get systems to work properly—even when there are no critical requirements.

• Many computer systems are continuing to be built without their developers paying consistent attention to good software-engineering and system-engineering practice. However, reducing one risk may increase others, or at best leave them unaffected.

• Technological solutions are often sought for problems whose solutions require reasonable human actions and reactions more than sound technology.

For such reasons, the same kinds of mistakes tend to be made repeatedly, in system development, in operation, and in use. We desperately need more true success stories, carefully executed, carefully documented, and carefully evaluated, so that they can be more readily emulated by others with similar situations.

It is interesting to note that we have improved in certain areas, whereas there has been no discernible improvement in others. If you consider the sagas relating to shuttles, rockets, and satellites discussed in Section 2.2, and defense-related systems in Section 2.3, the incidence and severity of cases are both seemingly dwindling. In other areas — particularly those that are highly dependent on people rather than technology—the incidence and severity seem to be increasing.

9.5.2 Feasibility

The foregoing discussion brings us to familiar conclusions. Almost all critical computer-related systems have failure modes or security flaws or human interface problems that can lead to serious difficulties. Even systems that appear most successful from a software viewpoint have experienced serious problems. For example, the shuttle program used such extraordinarily stringent quality control that the programmer productivity was something on the order of a few lines of debugged and integrated code per day overall. Nevertheless, there have been various difficulties (noted in Section 2.2.1) — the synchronization problem with the backup computer before the first Columbia launch, multiple computer outages on a subsequent Columbia mission (STS-9), the output misreading that caused liquid oxygen to be drained off just before a scheduled Columbia launch (STS-24), Discovery’s positioning problem in a laser-beam experiment over Hawaii (STS-18), Discovery’s having the shutdown procedure for two computers reversed (STS-19), Endeavour’s difficulties in its attempted rendezvous with Intelsat (due to an exact-equality test applied to values that were not quite identical), and others. The variety seems endless.

Systems with critical requirements demand inordinate care throughout their development cycle and their operation. But there are no guaranteed assurances that a given system will behave properly all of the time, or even at some particularly critical time. We must be much more skeptical of claimed successes. Nevertheless, we need to try much harder to engineer systems that have a much greater likelihood of success.

9.6 Risks in the Information Infrastructure

The future seems to be leading us to an emerging national information infrastructure (NII) (for example, see [174]) and its global counterpart, which we refer to here as the worldwide information infrastructure, or WII. The WII represents many exciting constructive opportunities. (This concept is sometimes referred to as the information superhighway or the Infobahn—although those metaphors introduce problems of their own.) The WII also encompasses a wide range of risks. In a sense, the emerging global interconnection of computers brings with it the prototypical generic RISKS problems; it encompasses many now familiar risks and increasingly involves human safety and health, mental well-being, peace of mind, personal privacy, information confidentiality and integrity, proprietary rights, and financial stability, particularly as internetworking continues to grow.

On September 19, 1988, the National Research Council’s Computer Science and Technology Board (now the Computer Science and Telecommunications Board) held a session on computer and communication security. At that meeting, Bob Morris of the National Computer Security Center noted that “To a first approximation, every computer in the world is connected with every other computer.” Bob, K. Speierman, and I each gave stern warnings—particularly that the state of the art in computer and network security was generally abysmal and was not noticeably improving, and that the risks were considerable. Ironically, about six weeks later, a single incident clearly demonstrated the vulnerability of many computer systems on the network—the Internet Worm of November 1988 (see Sections 3.3, 3.6, and 5.1.1). Since that time, numerous innovative system penetrations (including wholesale password capturing that began anew in the fall of 1993) have continued. The 1980 ARPAnet collapse and the 1990 AT&T long-distance collapse (Section 2.1) illustrate reliability problems. The security and reliability problems of the existing Internet can be expected to continue to manifest themselves in the WII even with improved systems and networking, and some of those problems could actually be more severe.9

In addition, many social risks can be expected. National cultural identities may become sublimated. Governments will attempt to regulate their portions of the WII, which attempts could have chilling effects. Commercial interests will attempt to gain control. Aggregation of geographically and logically diverse information enables new invasions of privacy. Huge volumes of information make it more difficult to ensure accuracy. The propagation of intentional disinformation and unintentionally false information can have many harmful effects. Rapid global communications may encourage overzealous or misguided reactions, such as instantaneous voting on poorly understood issues. Hiding behind anonymous or masqueraded identities can result in bogus, slanderous, or threatening E-mail and Trojan horseplay. Although modern system design and authentication can reduce some of these risks, there will still be cases in which a culprit is completely untraceable or where irrevocable damage is caused before the situation can be corrected.

Enormous communication bandwidths are certainly appealing, particularly with the voracious appetite of modern graphics capabilities and commercial entrepreneurs for consuming resources. However, past experience suggests that Parkinsonian saturation (of bandwidth, computational capacities, or human abilities) will be commonplace, with massive junk E-mail advertising tailored to patterns of each individual’s observed behavior, hundreds of video channels of home shopping, interactive soap operas and community games, movies and old television programs on demand, live access to every sporting event in the world, and virtual aerobic exercise for armchair participants. Virtual reality may subliminally encourage some people to withdraw from reality further, providing additional substitutes for human contacts and real experiences. Personal disassociation and technological addictions are likely, along with diminished social skills. Cafés and other meeting places may qualify for the endangered species list, becoming the last refuge of antitechnologists. Risks of the technology being subverted by short-sighted commercializations are considerable. Recent studies linking on-line living with depression and obesity may also be relevant. Many of Jerry Mander’s 1978 concerns [87] about the potential evils that might result from television, computers, and related technologies are still valid today (if not already demonstrated), and can be readily applied to the WII. (See also Section 9.8.1.)

The interconnective technologies may further divide people into haves and have-nots according to whether they are able to participate. There is great potential for improving the lives of those who are physically handicapped. However, the hopes for other minorities—such as the homeless, impoverished, elderly, or chronically unemployed—may be bleak in the absence of significant social change. Furthermore, even if everyone could participate (which is most unlikely), people who choose not to participate should not be penalized.

Educational uses could easily degenerate to lowest common denominators, unless educators and learners are keenly aware of the pitfalls. The WII can be much more than a public library, although it may not be free to all. It will clearly present unparalleled opportunities for the sharing of knowledge and for creative education. However, it will require educators and facilitators who truly understand the medium and its limitations, and who can resist the cookie-cutter phenomenon that tends to stamp out people with identical abilities.

The information-superhighway metaphor suggests analogies of traffic jams (congestion), crashes (system wipeouts), roadkill (bystanders), drunken drivers (with accidents waiting to happen), car-jackers (malicious hackers), drag racers and joyriders, switched vehicle identifiers (bogus E-mail), speed limits (on the Infobahn as well as the Autobahn), onramps and gateways (controlling access, authentication, and authorization), toll bridges (with usage fees), designated drivers (authorized agents), drivers’ licenses, system registration, and inspections. All of these analogies are evidently applicable. Although use of the superhighway metaphor may deemphasize certain risks that other metaphors might illuminate, it serves to illustrate some of the technological, social, economic, and political dangers that we must acknowledge and strive to overcome. The risks noted here are genuine. I urge you all to increase your awareness and to take appropriate actions wherever possible.

9.7 Questions Concerning the NII

Section 9.7 originally appeared as an Inside Risks column, CACM 37, 7, 170, July 1994, written by Barbara Simons, who chairs the U.S. Public Policy Committee of the ACM (USACM).

The previous section considers risks in the U.S. national information infrastructure and its international counterpart. We now pose some questions that need to be asked about both the NII and the WII. These questions are generally open-ended, and intended primarily to stimulate further discussion.

1. What should the NII be? Will it be primarily passive (500 movie and home-shopping channels) or active (two-way communications)? Will it support both commercial applications (consumers) and civic applications (citizens)? Should it have both government and private involvement? Who should oversee it?

2. Guiding principles. How can socially desirable services be provided if they are not directly profitable? Who determines standards and compliance for interconnectivity, quality, and so on?

3. Accessibility. Does universal access imply a connection, a terminal, and knowledge of how to access the NII? What is the NII goal? How can the government ensure that goal, and fund the NII adequately? Should providers be required to subsidize those who cannot afford it? Who determines what should be the minimum service? How can the NII be affordable and easily usable, especially for the underprivileged and disadvantaged? How can social risks be dealt with, including sexual harassment and character defamation? Should equal-access opportunity be facilitated and actively encouraged, independent of gender, race, religion, wealth, and other factors? Are fundamental changes in the workplace required? The French Minitel system is ubiquitous and cost-effective; it gives a free terminal to anyone who requests one. Is this approach effective, and could it work elsewhere? Should open access be guaranteed? If so, how? Will the existing notion of common carriers for communications services apply in the emerging union of telephones, computers, cable, and broadcast television?

4. Censorship. Should there be censorship of content? If so, who establishes the rules? Will there be protected forms of free speech? How do libel laws apply? Should offensive communications be screened? What about controls over advertising and bulk mailing?

5. Access to public information. Should the NII support the government mandate to improve the availability of information? If so, should NII users be charged for accessing this information?

6. Privacy. Should the NII protect an individual’s privacy? Can it? Is additional legislation required for medical, credit, and other databases? Who owns the use rights for data? What permissions will be required? What should be the role of encryption and escrowed keys? Should/could the government determine that the use of nonauthorized encryption algorithms is illegal? Who should be able to monitor which activities, and under what conditions and authority? What recourse should individuals have after violations?

7. International cooperation. Are we studying and evaluating the activities of other countries, such as the French Minitel system, to emulate their successes and avoid their mistakes? What are we or should we be doing to ensure cooperation with other countries, while allowing them to maintain local control?

8. Electronic democracy. The NII could be a form of electronic democracy—an electronic town hall. Given that the current Internet users are far from representative of the entire population, to what extent should the NII be used to allow more people to have input into policy-making decisions? Do we have to be concerned about the development of an electronic oligarchy?

9. Usability. Legislation and research support have focused on advanced users and network technology, with little discussion about the design of the user interface. What is being done to study usability issues and to develop basic standards about the complexity of interfaces, the ease in authoring new materials, capacity for access in multiple languages, design for the elderly population, support for cooperation among users, and design methods to reduce learning times and error rates? What should be done, and who should provide the funding? Should we have testbeds to study what people would like and what constitutes a user-friendly interface?

10. Libraries. Should there be a knowledge base of electronic libraries? If so, who should be responsible? How should it be funded? How can the financial interests of writers and publishers be protected? With declining support for conventional libraries, how can electronic libraries become realistic?

11. Public schools. How can equitable NII access be provided for public schools when funding for public education varies greatly from wealthy areas to impoverished ones? How can these institutions be brought on-line when many classrooms don’t even have telephone jacks? How will access to pornography be dealt with?

12. Emergencies. What should be the NII’s role in dealing with natural and political emergencies?

9.8 Avoidance of Risks

I hate quotations. Tell me what you know.

RALPH WALDO EMERSON10

There is a widespread and growing disillusionment with technology. Fewer young people are aspiring to technology-oriented careers. Doubts are increasing as to technology’s ability to provide lasting solutions for human problems. For example, heavy-industry technology has become a major polluter throughout the world. The use of chlorofluorocarbons in refrigeration, propulsion systems, and aerosols is threatening the ozone layer. Networked information that affects the activities of people is routinely sold and distributed without the knowledge or consent of those to whom it pertains, irrespective of its accuracy.11

9.8.1 Jerry Mander’s Aphorisms

Jerry Mander has recently published a remarkable book, In the Absence of the Sacred: The Failure of Technology & the Survival of the Indian Nations [88]. He offers a list of aphorisms about technology, quoted below. From within our present ways of thinking, these aphorisms might appear as Luddite anti-technology. But they point the way to a new common sense that needs cultivation. Please read them in that spirit.12

1. Since most of what we are told about new technology comes from its proponents, be deeply skeptical of all claims.

2. Assume all technology “guilty until proven innocent.”

3. Eschew the idea that technology is neutral or “value free.” Every technology has inherent and identifiable social, political, and environmental consequences.

4. The fact that technology has a natural flash and appeal is meaningless. Negative attributes are slow to emerge.

5. Never judge a technology by the way it benefits you personally. Seek a holistic view of its impacts. The operative question is not whether it benefits you but who benefits most? And to what end?

6. Keep in mind that an individual technology is only one piece of a larger web of technologies, “metatechnology.” The operative question here is how the individual technology fits the larger one.

7. Make distinctions between technologies that primarily serve the individual or the small community (for example, solar energy) and those that operate on a scale outside of community control (for example, nuclear energy).

8. When it is argued that the benefits of the technological lifeway are worthwhile despite harmful outcomes, recall that Lewis Mumford referred to these alleged benefits as “bribery.” Cite the figures about crime, suicide, alienation, drug abuse, as well as environmental and cultural degradation.

9. Do not accept the homily that “once the genie is out of the bottle, you cannot put it back,” or that rejecting a technology is impossible. Such attitudes induce passivity and confirm victimization.

10. In thinking about technology within the present climate of technological worship, emphasize the negative. This brings balance. Negativity is positive.

9.8.2 A New Common Sense

Peter Denning [34] proposes that it is time to start cultivating a new common sense about technology. By common sense Denning means the general, instinctive way of understanding the world that we all share. The disillusionment suggests that our current common sense is not working. Questions such as “If we can get a man to the moon, why can’t we solve the urban crisis?” are overly simplistic; technology is not fundamentally relevant to the answer.

An example of an understanding in the old common sense is that we must obtain a precise description of a problem and then apply systematic methods to design a computing system that meets the specifications. In the new common sense, we must identify the concerns and the network of commitments people make in their organizations and lives, and then design computing systems that assist them in carrying out their commitments. The new common sense does not throw out precise specifications and systematic design; it simply regards these techniques as tools and does not make them the center of attention.

Many of the cases that are noted in the RISKS archives corroborate both Mander’s aphorisms and the need for Denning’s new common sense. In the new common sense, we would see organizations as networks of commitments, not hierarchical organization charts. Daily human interactions would be mini-instances of the customer-provider loop, and attention would be focused on whether the customer of every loop is satisfied with the provider’s work. Technology would help people to fulfill their promises and commitments. Communication would serve for successful coordination of action rather than merely for exchanges of messages.

To begin the needed rethinking about technology, we can ask ourselves questions such as Mander asks. This introspection requires a rethinking of not merely military versus nonmilitary budgets but also the proper role of technology as a whole. Technology by itself is not the answer to any vital social questions. Ultimately, more fundamental human issues must be considered. Nevertheless, technology can have a constructive role to play—if it is kept in perspective.

9.9 Assessment of the Future

The future isn’t what it used to be.

ARTHUR CLARKE, 1969, LAMENTING THAT IT WAS BECOMING MUCH MORE
DIFFICULT TO WRITE GOOD SCIENCE FICTION

The past isn’t what it used to be, either.

PETER NEUMANN, 1991, SPECULATING ON WHY HISTORY IS SO QUICKLY
FORGOTTEN AND WHY PROGRESS BECOMES SO DIFFICULT

As stated at the outset, this book presents a tremendously varied collection of events, with diverse causes and diverse effects. Problems are evidenced throughout all aspects of system development and use. The diversity of causes and effects suggests that we must be much more vigilant whenever stringent requirements are involved. It also suggests that it is a mistake to look for easy answers; there are no easy answers overall, and generally no easy answers even with respect to any of the causative factors taken in isolation (which, as we have noted, is itself a dangerous practice).

There are many techniques noted in this book that could contribute to systems better able to meet their requirements for confidentiality, integrity, availability, reliability, fault tolerance, and human safety—among other requirements. However, the case must be made on each development effort and each operational system as to which of these techniques is worthwhile, commensurate with the risks. A global perspective is essential in determining which are the real risks and which techniques can have the most significant payoffs. Short-term benefits are often canceled out by long-term problems, and taking a long-term view from the outset can often avoid serious problems later on.

There are many lessons that must be learned by all of us—including system developers and supposedly innocent bystanders alike. And we must remember these lessons perpetually. However, history suggests that the lessons of the past are forgotten quickly. I sincerely hope that this book will provide a basis for learning, and that, in the future, we will be able to do much better—not only in avoiding past mistakes but also in avoiding potential new mistakes that are waiting to emerge. Nevertheless, we can expect the types of problems described in this book to continue. Some will result in serious disasters.

System-oriented (Chapter 7) and human-oriented (Chapter 8) viewpoints are both fundamental. In attempting to develop, operate, and use computer-related systems in critical environments, we do indeed essay a difficult task, as the Ovid quote suggests at the beginning of Chapter 1; the challenge confronting us is imposing. As the extent of our dependence on computer-based systems increases, the risks must not be permitted to increase accordingly.

9.10 Summary of the Chapter and the Book

Computers make excellent and efficient servants, but I have no wish to serve under them. Captain, a starship also runs on loyalty to one man. And nothing can replace it or him.

SPOCK, THE ULTIMATE COMPUTER, STARDATE 4729.4 (STAR TREK)

This chapter provides several perspectives from which to consider the problems addressed in the foregoing chapters.

Chapter 1 gives an introductory summary of potential sources of problems arising in system development (Section 1.2.1) and use (Section 1.2.2), as well as a summary of the types of adverse effects that can result (Section 1.3). Essentially all of the causes and effects outlined there are illustrated by the cases presented in this book.

Overall, this book depicts a risk-filled world in which we live, with risks arising sometimes from the technology and the surrounding environment, and frequently from the people who create and use the technology. We see repeatedly that the causes and the effects of problems with computers, communications, and related technologies are extraordinarily diverse. Simple solutions generally fall short, whereas complex ones are inherently risky. Extraordinary care is necessary throughout computer-system development, operation, and use, especially in critical applications. However, even in seemingly benign environments, serious problems can occur and — as seen here—have occurred.

Table 9.1 gives a rough indication of the types of risks according to their applications. The entries represent the number of incidents included in the RISKS archives through August 1993, classified according to the Software Engineering Notes index [113]. Many of those cases are included in this book. The numbers reflect individual incidents, and each incident is counted only once in the table. That is, a number in the column headed “Cases with deaths” is the number of cases, not the number of deaths. Where an incident resulted in death or risk to life and also caused material damage, it is counted under death or risk to life only. Other risks include all other cases except for those few beneficial ones in which the technology enabled risks to be avoided. The table was prepared by Peter Mellor13 for the use of one of his classes. It is included here primarily for illustrative purposes.

Table 9.1 Number of cases of various types (as of August 1993)


I hope that this book will inspire many people to delve more deeply into this collection of material, and to address subsequent disasters with a scrutiny that has not generally been present up to now. The cited articles by Eric Rosen [139] and Jack Garman [47] are notable exceptions. Nancy Leveson’s book [81] pursues various safety-related cases in considerable detail, and also provides techniques for increasing system safety. Ivars Peterson [127] provides some greater details on several cases discussed here. Appendix A gives further relevant references.

The potential for constructive and humanitarian uses of computer- and communication-related technologies is enormous. However, to be realistic, we must accept the negative implications regarding the abilities and inabilities of our civilization to wisely develop and apply these technologies. We must continually endeavor to find a balance that will encourage socially beneficial applications in which the risks are realistically low. Above all, we must remember that technology is ideally of the people, for the people, and by the people. For the protection of the people and the global environment, constraints need to be placed on the development and use of systems whose misuse or malfunction could have serious consequences. However, in all technological efforts we must always remember that people are simultaneously the greatest strength and the greatest weakness.

Challenges

C9.1 Jim Horning notes that there is a risk in using the information superhighway metaphor (Section 9.6), because doing so tends to limit people’s thinking to only highway-related risks. Describe additional risks that would be exposed by an alternative metaphor, such as the global village, the global library, the information monopoly, the net of a thousand lies (Verner Vinge), the global role-playing game (multiuser dungeon), the megachannel boob tube (these suggestions are Horning’s), or another metaphor of your own choosing.

C9.2 Choose one of the questions on the information infrastructure given in Section 9.7, and discuss in detail the issues it raises.

C9.3 Based on your intuition and on what you have read here or learned elsewhere, examine the technologies related to the development of computer-based systems. Identify different aspects of the development process that correspond to artistic processes, scientific method, and engineering discipline. What conclusions can you draw about improving the development process?

C9.4 Identify and describe the most essential differences between ordinary systems and critical systems, or else explain why there are no such differences. Include in your analysis issues relevant to system development and operation.

C9.5 Choose an application with widespread implications, such as the Worldwide Information Infrastructure, domestic or worldwide air-traffic control, the Internal Revenue Service Tax Systems Modernization, or the National Crime Information Center system NCIC-2000. Without studying the application in detail, what potential risks can you identify? What would you recommend should be done to avoid or minimize those risks, individually and globally, as relevant to your application?

C9.6 Contrast the various types of application areas (for example, as in the various summary tables) with respect to shifts in causes and effects over time.14 Is the nature of the causes changing? Is the nature of the consequences changing? Is the nature of the risks changing? Do you think that the risks are diminishing, increasing, or remaining more or less the same? In your analysis of any changes, consider such factors as (among others) the effects of the rapidly advancing technologies; the tendencies for increased interconnectivity; vulnerabilities, threats, and risks; education, training, and experiential learning; governmental and other bureaucracy; technical skill levels in your country and throughout the world, particularly with respect to the risks and how to combat them; the roles of technology in contrast with geosocioeconomic policy; and the levels of awareness necessary for ordinary mortals having to cope with technology. If you have chosen to view the problem areas that you understand best, estimate the potential significance of interactions with other aspects that you have not considered.

C9.7 Assess the relative roles of various factors noted in Challenge C9.6. Are people and organizations (including developers, operators, end-users, governments, corporations, educational organizations, and innocent bystanders) adequately aware of the real risks? How could their awareness be increased? Can social responsibility be legislated or taught? Why? Can the teaching and practice of ethical behavior have a significant effect? If so, who must do what?

C9.8 In what ways is the technology causing an increase in the separation between the haves and the have-nots? Is this gap intrinsic? If it is not, what can be done to shrink it?

C9.9 What are your expectations and hopes for our civilization in the future with respect to our uses of computer-related technologies?

C9.10 Describe changes in policies, laws, and system-development practices that would significantly improve the situation regarding risks to society at large and to individuals. Which changes would be most effective? What, in your opinion, are the limiting factors?

C9.11 Do you still agree with your responses to the challenges in earlier chapters? What, if anything, has changed about your earlier perceptions? Do you agree with my conclusions in this chapter? Explain your answers. Detail the issues on which you disagree (if any), and provide a rationale for your viewpoint.
