18.2 Software

We have all encountered horror stories about software that contained errors; they make for very interesting reading. Are software errors in running programs really common occurrences? Can’t we do something to make software more error free? To answer the first question, a web search for “software bugs” retrieved 261,000,000 hits. To answer the second, software developers are trying. In the next few sections, we examine why error-free software is difficult—if not impossible—to produce. We also discuss current approaches to software quality, and we end with a collection of interesting bugs.

Complexity of Software

If we accept the premise that commercial software contains errors, the logical question is “Why?” Don’t software developers test their products? The problem is actually not a lack of diligence but rather our old nemesis: complexity. As our machines have grown increasingly powerful, the problems they can tackle have become increasingly complex. A single programmer with a problem moved to a programming team with a problem and finally graduated to a team of teams with a problem.

Software testing can demonstrate the presence of bugs but cannot prove their absence. We can test software, find errors and fix them, and then test the software some more. As we find problems and fix them, we raise our confidence that the software performs as it should. But we can never guarantee that we have removed all of the bugs. There is always the possibility of another bug lurking in the software that we haven’t found yet.

Given that we can never know for sure if we have found all the problems, when do we stop testing? It becomes a question of risk. How much are you willing to risk that your software may hold yet another bug? If you’re writing a game, you might take that risk a lot sooner than you would if you’re writing airplane control software in which lives are on the line.

As Nancy Leveson points out in Communications of the ACM, a branch of computing known as software engineering emerged in the 1960s with the goal of introducing engineering discipline into the development of software.7 Great strides toward this goal have been made over the past half-century, including a greater understanding of the role of abstraction, the introduction of modularity, and the notion of the software life cycle, which we discuss in detail later.

Most of these ideas come from engineering, but they had to be adapted to the unique problems that arose when working with more abstract materials. Hardware designs are guided and limited by the nature of the materials used to implement them. Software appears to have limits that are related more closely to human abilities than to physical limitations. Leveson continues, “Thus, the first 50 years may be characterized as our learning about the limits of our field, which are intimately bound up with the limits of complexity with which humans can cope.”

Building software has changed. The early days were filled with the drive to write new software, but more and more the problems of maintaining and evolving existing software have taken center stage. As our systems have grown bigger and required large teams of designers, we have started to examine the ways humans collaborate and to devise ways to assist them to work together more effectively.

Current Approaches to Software Quality

Although the complexity of large software systems makes error-free products almost an impossibility, that doesn’t mean we should just give up. There are strategies that, if adopted, improve the quality of software.

Software Engineering

In Chapter 7, we outlined four stages of computer problem solving: write the specification, develop the algorithm, implement the algorithm, and maintain the program. When we move from small, well-defined tasks to large software projects, we need to add two extra layers on top of these: software requirements and specifications. Software requirements are broad, but precise, statements outlining what is to be provided by the software product. Software specifications are a detailed description of the function, inputs, processing, outputs, and special features of a software product. The specifications tell what the program does, but not how it does it.

Leveson mentions the software life cycle as part of the contributions of software engineering. The software life cycle is the concept that software is developed, not just coded, and evolves over time. Thus the life cycle includes the following phases:

  • Requirements

  • Specifications

  • Design (high-level and lower-level)

  • Implementation

  • Maintenance

Verification activities must be carried out during all of these phases. Do the requirements accurately reflect what is needed? Do the specifications accurately reflect the functionality needed to meet the requirements? Does the high-level design accurately reflect the functionality of the specifications? Does each succeeding level of design accurately implement the level above? Does the implementation accurately code the designs? Do changes implemented during the maintenance phase accurately reflect the desired changes? Are the implementations of these changes correct?

In Chapters 6 and 7, we discussed the testing of the designs and code for the relatively small problems presented in this book. Clearly, as the problems get larger, verification activities become more important and more complex. (Yes, that word again.) Testing the design and finished code is only a small, albeit important, part of the process. Half the errors in a typical project occur in the design phase and half occur in the implementation phase. This statistic is somewhat misleading, however: in terms of the cost to fix an error, the earlier in the development process an error is caught, the cheaper it is to correct.8

Teams of programmers produce large software products. Two verification techniques effectively used by programming teams are design or code walk-throughs and inspections. These are formal team activities, the intention of which is to move the responsibility for uncovering errors from the individual programmer to the group. Because testing is time-consuming and errors cost more the later they are discovered, the goal is to identify errors before testing begins.

In a walk-through, the team performs a manual simulation of the design or program with sample test inputs, keeping track of the program’s data by hand on paper or a blackboard. Unlike thorough program testing, the walk-through is not intended to simulate all possible test cases. Instead, its purpose is to stimulate discussion about the way the programmer chose to design or implement the program’s requirements.

At an inspection, a reader (never the program’s author) goes through the requirements, design, or code line by line. The inspection participants are given the material in advance and are expected to have reviewed it carefully. During the inspection, the participants point out errors, which are recorded on an inspection report. Team members, during their pre-inspection preparation, will have noted many of the errors. Just the process of reading aloud will uncover other errors. As with the walk-through, the chief benefit of the team meeting is the discussion that takes place among team members. This interaction among programmers, testers, and other team members can uncover many program errors long before the testing stage begins.

At the high-level design stage, the design should be compared to the program requirements to make sure that all required functions have been included and that this program or module works correctly in conjunction with other software in the system. At the low-level design stage, when the design has been filled out with more details, it should be inspected yet again before it is implemented. When the coding has been completed, the compiled listings should be inspected again. This inspection (or walk-through) ensures that the implementation is consistent with both the requirements and the design. Successful completion of this inspection means that testing of the program can begin.

Walk-throughs and inspections should be carried out in as nonthreatening a way as possible. The focus of these group activities is on removing defects in the product, not on criticizing the technical approach of the author of the design or the code. Because these activities are led by a moderator who is not the author, the focus is on the errors, not the people involved.

Over the past 10 to 20 years, the Software Engineering Institute at Carnegie Mellon University has played a major role in supporting research into formalizing the inspection process in large software projects, including sponsoring workshops and conferences. A paper presented at the SEI Software Engineering Process Group (SEPG) Conference reported on a project that was able to reduce product defects by 86.6% using a two-tiered inspection process of group walk-throughs and formal inspections. The process was applied to packets of requirements, design, or code at every stage of the life cycle. TABLE 18.1 shows the defects per 1000 source lines of code (KSLOC) that were found in the different phases of the software life cycle in a maintenance project.9 During the maintenance phase, 40,000 lines of source code were added to a program with more than half a million lines of code. The formal inspection process was used in all of the phases except testing activities.

TABLE 18.1 Errors found during a maintenance project

Stage                   Defects per KSLOC
System design            2
Software requirements    8
Design                  12
Code inspection         34
Testing activities       3

We have talked about large software projects. Before we leave this section, let’s quantify what we mean by “large.” The Space Shuttle Ground Processing System has more than 500,000 lines of code; Windows Vista has 50 million lines of code. Most large projects fall somewhere in between.

We have pointed out that the complexity of large projects makes the goal of error-free code almost impossible to attain. The following guideline suggests how many errors can be expected per lines of code:10

  • Standard software: 25 bugs per 1000 lines of code

  • Good software: 2 errors per 1000 lines of code

  • Space Shuttle software: less than 1 error per 10,000 lines of code

Formal Verification

It would be nice if we could use a tool to locate the errors in a design or code automatically, without even having to run the program. That sounds unlikely, but consider an analogy from geometry. We wouldn’t try to establish the Pythagorean theorem by checking it on one right triangle after another; that would only demonstrate that the theorem holds for the triangles we happened to try. Instead, we prove theorems in geometry mathematically. Why can’t we do the same for computer programs?

The verification of program correctness, independent of data testing, is an important area of theoretical computer science research. The goal of this research is to establish a method for proving programs that is analogous to the method for proving theorems in geometry. The necessary techniques exist for proving that code meets its specifications, but the proofs are often more complicated than the programs themselves. Therefore, a major focus of verification research is the attempt to build automated program provers—verifiable programs that verify other programs.
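
To make the idea of a specification concrete, here is a small sketch of our own (the function, its names, and its assertions are invented for illustration and are not drawn from any particular verification tool). A routine’s specification can be written down as a precondition and a postcondition. Testing checks these conditions only for the inputs we happen to supply; a program prover attempts to show that the postcondition follows from the precondition for every possible input.

#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a specification expressed as assertions.
   Tests check these conditions only for the inputs we run;
   a formal proof would establish them for all valid inputs. */
int array_max(const int *a, size_t n) {
    assert(n > 0);                        /* precondition: the array is nonempty */

    int max = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] > max) {
            max = a[i];
        }
    }

    /* postcondition: max is at least as large as every element */
    for (size_t i = 0; i < n; i++) {
        assert(max >= a[i]);
    }
    return max;
}

int main(void) {
    int data[] = { 3, 9, 2, 7 };
    assert(array_max(data, 4) == 9);      /* one test case; a proof covers them all */
    return 0;
}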

Formal methods have been used successfully in verifying the correctness of computer chips. One notable example is the verification of a chip to perform real-number arithmetic, which won the Queen’s Award for Technological Achievement. Formal verification to prove that the chip met its specifications was carried out by C. A. R. Hoare, head of the Programming Research Group at Oxford University, together with Inmos Ltd. In parallel, a more traditional testing approach was taking place. As reported in Computing Research News:

The race [between the two groups] was won by the formal development method—it was completed an estimated 12 months ahead of what otherwise would have been achievable. Moreover, the formal design pointed to a number of errors in the informal one that had not shown up in months of testing. The final design was of higher quality, cheaper, and was completed quicker.11

It is hoped that success with formal verification techniques at the hardware level will eventually lead to success at the software level. However, software is far more complex than hardware, so we do not anticipate any major breakthroughs within the near future.

Open-Source Movement12

In the early days of computing, software came bundled with the computer, including the source code for the software. Programmers adjusted and adapted the programs and happily shared the improvements they made. In the 1970s, firms began withholding the source code, and software became big business.

With the advent of the Internet, programmers from all over the world can collaborate at almost no cost. A simple version of a software product can be made available via the Internet, and programmers interested in extending or improving the program can do so. Most open-source projects are governed by a “benevolent dictator” who keeps track of what is going on. If a change or improvement passes the peer review of fellow developers and gets incorporated into the next version, it is a great coup.

Linux is the best-known open-source project. Linus Torvalds wrote the first simple version of the operating system using UNIX as a blueprint and continued to oversee its development. IBM spent $1 billion on Linux in 2001 with the objective of making it a computing standard. As The Economist says,

Some people like to dismiss Linux as nothing more than a happy accident, but the program looks more like a textbook example of an emerging pattern. . . . Open source is certainly a mass phenomenon, with tens of thousands of volunteer programmers across the world already taking part, and more joining in all the time, particularly in countries such as China and India. SourceForge, a website for developers, now hosts more than 18,000 open-source projects that keep 145,000 programmers busy.14

Now, 14 years later, open source is still going strong. Some companies consider it one of several design choices; others consider it critical to their operations. As of May 2013, SourceForge had more than 300,000 registered software projects and more than 3 million registered users. By 2014, OpenSSL, an open-source collection of encryption tools whose project was founded in 1998, was used by two-thirds of all web servers. A 2013 report by Coverity, a company specializing in software quality and security testing, showed that open-source software had fewer errors per thousand lines of code (KLOC) than proprietary software.15,16

Unfortunately, in April 2014 a bug, popularly known as Heartbleed, was found in OpenSSL. It was quickly fixed, but it brought attention to the volunteer programmers who drive the open-source movement. In theory, these volunteers check one another’s work, leading to better software. Clearly, that did not happen in the case of Heartbleed.17

Notorious Software Errors

Everyone involved in computing has a favorite software horror story. We include only a small sample here.

AT&T Down for Nine Hours

In January 1990, AT&T’s long-distance telephone network came to a screeching halt for nine hours because of a software error in the electronic switching systems. Of the 148 million long-distance and toll-free calls placed with AT&T that day, only 50% got through. This failure caused untold collateral damage:

  • Hotels lost bookings.

  • Rental car agencies lost rentals.

  • American Airlines’ reservation system traffic fell by two-thirds.

  • A telemarketing company lost $75,000 in estimated sales.

  • MasterCard couldn’t process its typical 200,000 credit approvals.

  • AT&T lost some $60 million to $75 million.

As AT&T Chairman Robert Allen said, “It was the worst nightmare I’ve had in 32 years in the business.”18

How did this happen? Earlier versions of the switching software worked correctly. The software error was in the code that upgraded the system to make it respond more quickly to a malfunctioning switch. The error involved a break statement in the C code.19 As Henry Walker points out in The Limits of Computing, this breakdown illustrates several points common to many software failures. The software had been tested extensively before its release, and it worked correctly for about a month. In addition to testing, code reviews had been conducted during development. One programmer made the error, but many others reviewed the code without noticing the error. The failure was triggered by a relatively uncommon sequence of events, difficult to anticipate in advance. And the error occurred in code designed to improve a correctly working system—that is, it arose during the maintenance phase. E. N. Adams in the IBM Journal of Research and Development estimates that 15 to 50% of attempts to remove an error from a large program result in the introduction of additional errors.
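
Published accounts trace the outage to a misplaced break statement. The following is a minimal sketch of our own (the message types and print statements are invented; the actual switching software was vastly larger) showing the kind of control-flow mistake involved: a break that the programmer intends to exit only an if statement instead exits the enclosing switch, so the statements after the if are silently skipped.

#include <stdio.h>

/* Hypothetical sketch of a misplaced break inside a switch statement.
   When the if branch is taken, the break exits the entire switch,
   so the status update below is silently skipped. */
static void handle_message(int type, int congested) {
    switch (type) {
        case 1:
            if (congested) {
                printf("congestion noted\n");
                break;                    /* exits the switch, not just the if */
            }
            printf("normal processing\n");
            /* intentional fall-through to the shared status update */
        case 2:
            printf("update switch status\n");
            break;
        default:
            printf("ignore message\n");
            break;
    }
}

int main(void) {
    handle_message(1, 1);   /* "update switch status" never runs */
    handle_message(1, 0);   /* both messages print, as intended */
    return 0;
}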

Therac-25

One of the most widely cited software-related accidents involved a computerized radiation therapy machine called the Therac-25. Between June 1985 and January 1987, six known accidents involved massive overdoses by the Therac-25, leading to deaths and serious injuries. These accidents have been described as the worst series of radiation accidents in the 35-year history of medical accelerators.

We look more closely at these incidents in the “Ethical Issues” section at the end of this chapter.

Bugs in Government Projects

On February 25, 1991, during the first Gulf War, a Scud missile struck a U.S. Army barracks, killing 28 soldiers and injuring roughly 100 other people. A U.S. Patriot Missile battery in Dhahran, Saudi Arabia, failed to track and intercept the incoming Iraqi Scud missile because of a software error. This error, however, was not a coding error but a design error. A calculation involved a multiplication by 1/10, a value whose binary representation is nonterminating. The resulting rounding error, accumulated over the battery’s 100 hours of continuous operation, amounted to 0.34 second, enough for the tracking system to lose the incoming missile.20

The General Accounting Office concluded:

The Patriot had never before been used to defend against Scud missiles nor was it expected to operate continuously for long periods of time. Two weeks before the incident, Army officials received Israeli data indicating some loss in accuracy after the system had been running for 8 consecutive hours. Consequently, Army officials modified the software to improve the system’s accuracy. However, the modified software did not reach Dhahran until February 26, 1991—the day after the Scud incident.21
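
To see where the 0.34 second comes from, here is a small sketch of our own (the variable names are invented; the numbers follow the widely published analysis of the failure). The system counted time in tenths of a second, and 1/10 chopped to the precision of the 24-bit register falls short of the true value by about 0.000000095 on every tick. Over 100 hours of continuous operation, those tiny losses add up.

#include <stdio.h>

/* Minimal illustration of the Patriot clock drift: 1/10 chopped to the
   precision of a 24-bit fixed-point register differs slightly from the
   true value, and the difference grows with every tenth-of-a-second tick. */
int main(void) {
    double chopped_tenth = 838860.0 / 8388608.0;   /* 1/10 as stored: 838860 / 2^23 */
    double error_per_tick = 0.1 - chopped_tenth;   /* about 0.000000095 second */

    long ticks = 100L * 60 * 60 * 10;              /* 100 hours of 0.1-second ticks */
    double drift = error_per_tick * ticks;

    printf("error per tick: %.10f seconds\n", error_per_tick);
    printf("drift after 100 hours: %.2f seconds\n", drift);   /* roughly 0.34 */
    return 0;
}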

The Gemini V missed its expected landing point by about 100 miles. The reason? The design of the guidance system did not take into account the need to compensate for the motion of the Earth around the Sun.22

In October 1999, the Mars Climate Orbiter entered the Martian atmosphere about 100 kilometers lower than expected, causing the craft to burn up. Arthur Stephenson, chairman of the Mars Climate Orbiter Mission Failure Investigation Board, concluded:

The “root cause” of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software, as NASA has previously announced . . . The failure review board has identified other significant factors that allowed this error to be born, and then let it linger and propagate to the point where it resulted in a major error in our understanding of the spacecraft’s path as it approached Mars.23

Launched in July 1962, the Mariner 1 Venus probe veered off course almost immediately and had to be destroyed. The problem was traced to the following line of Fortran code:

DO 5 K = 1. 3

The period should have been a comma. Because Fortran ignores blanks, the compiler read the line not as the start of a loop but as an assignment of the value 1.3 to a variable named DO5K. An $18.5 million space exploration vehicle was lost because of this typographical error.

Errors in software are not only the province of the U.S. government. An unmanned Ariane 5 rocket launched by the European Space Agency exploded on June 4, 1996, just 40 seconds after lift-off. The rocket had taken a decade and $7 billion to develop; the rocket and its cargo were valued at $500 million. What happened? A 64-bit floating-point number, representing the horizontal velocity of the rocket with respect to the platform, was converted to a 16-bit signed integer, but the value was larger than 32,767, the largest value a 16-bit signed integer can hold. The resulting error caused the launcher to veer off its flight path, break up, and explode.
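
What made the conversion fatal was that nothing checked whether the value fit. As a rough illustration (our own C sketch; the actual flight software was written in Ada and was far more elaborate), a range check before narrowing a 64-bit floating-point value to a 16-bit integer is the kind of guard that would have caught the out-of-range value instead of letting the conversion fail.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: narrowing a 64-bit floating-point value to a
   16-bit signed integer is only safe when the value fits in that range,
   so the conversion is guarded by an explicit check. */
static int convert_checked(double horizontal_velocity, int16_t *out) {
    if (horizontal_velocity > INT16_MAX || horizontal_velocity < INT16_MIN) {
        return -1;                            /* out of range: report an error */
    }
    *out = (int16_t)horizontal_velocity;      /* safe: the value fits in 16 bits */
    return 0;
}

int main(void) {
    int16_t v;
    if (convert_checked(40000.0, &v) != 0) {  /* 40000 exceeds 32,767 */
        printf("value out of range for a 16-bit signed integer\n");
    }
    return 0;
}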
