The various quality-assurance practices don't all have the same effectiveness. Many techniques have been studied, and their effectiveness at detecting and removing defects is known. This and several other aspects of the "effectiveness" of the quality-assurance practices are discussed in this section.
> If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization. (Gerald Weinberg)
Some practices are better at detecting defects than others, and different methods find different kinds of defects. One way to evaluate defect-detection methods is to determine the percentage of defects they detect out of the total defects that exist at that point in the project. Table 20-2 shows the percentages of defects detected by several common defect-detection techniques.
Table 20-2. Defect-Detection Rates

| Removal Step | Lowest Rate | Modal Rate | Highest Rate |
|---|---|---|---|
| Informal design reviews | 25% | 35% | 40% |
| Formal design inspections | 45% | 55% | 65% |
| Informal code reviews | 20% | 25% | 35% |
| Formal code inspections | 45% | 60% | 70% |
| Modeling or prototyping | 35% | 65% | 80% |
| Personal desk-checking of code | 20% | 40% | 60% |
| Unit test | 15% | 30% | 50% |
| New function (component) test | 20% | 30% | 35% |
| Integration test | 25% | 35% | 40% |
| Regression test | 15% | 25% | 30% |
| System test | 25% | 40% | 55% |
| Low-volume beta test (<10 sites) | 25% | 35% | 40% |
| High-volume beta test (>1,000 sites) | 60% | 75% | 85% |

Source: Adapted from Programming Productivity (Jones 1986a), "Software Defect-Removal Efficiency" (Jones 1996), and "What We Have Learned About Fighting Defects" (Shull et al. 2002).
The most interesting facts this data reveals are that the modal rates don't rise above 75 percent for any single technique and that the techniques average only about 40 percent. Moreover, for the most common kinds of defect detection, unit testing and integration testing, the modal rates are only 30–35 percent. The typical organization uses a test-heavy defect-removal approach and achieves only about 85 percent defect-removal efficiency. Leading organizations use a wider variety of techniques and achieve defect-removal efficiencies of 95 percent or higher (Jones 2000).
The strong implication is that if project developers are striving for a higher defect-detection rate, they need to use a combination of techniques. A classic study by Glenford Myers confirmed this implication (1978b). Myers studied a group of programmers with a minimum of 7 and an average of 11 years of professional experience. Using a program with 15 known errors, he had each programmer look for errors by using one of these techniques:
- Execution testing against the specification
- Execution testing against the specification with the source code
- Walk-through/inspection using the specification and the source code
Myers found a huge variation in the number of defects detected in the program, ranging from 1.0 to 9.0 defects found. The average number found was 5.1, or about a third of those known.
When used individually, no method had a statistically significant advantage over any of the others. The variety of errors people found was so great, however, that any combination of two methods—including having two independent groups using the same method—increased the total number of defects found by a factor of almost 2. Studies at NASA's Software Engineering Laboratory, Boeing, and other companies have reported that different people tend to find different defects. Only about 20 percent of the errors found by inspections were found by more than one inspector (Kouchakdjian, Green, and Basili 1989; Tripp, Struck, and Pflug 1991; Schneider, Martin, and Tsai 1992).
Glenford Myers points out that human processes (inspections and walk-throughs, for instance) tend to be better than computer-based testing at finding certain kinds of errors and that the opposite is true for other kinds of errors (1979). This result was confirmed in a later study, which found that code reading detected more interface defects and functional testing detected more control defects (Basili, Selby, and Hutchens 1986). Test guru Boris Beizer reports that informal test approaches typically achieve only 50–60 percent test coverage unless you're using a coverage analyzer (Johnson 1994).
The upshot is that defect-detection methods work better in combination than they do singly. Jones made the same point when he observed that cumulative defect-detection efficiency is significantly higher than that of any individual technique. The outlook for the effectiveness of testing used by itself is bleak. Jones points out that a combination of unit testing, functional testing, and system testing often results in a cumulative defect detection of less than 60 percent, which is usually inadequate for production software.
This data can also be used to understand why programmers who begin working with a disciplined defect-removal technique such as Extreme Programming experience higher defect-removal levels than they have experienced previously. As Table 20-3 illustrates, the set of defect-removal practices used in Extreme Programming would be expected to achieve about 90 percent defect-removal efficiency in the average case and 97 percent in the best case, which is far better than the industry average of 85 percent defect removal. Although some people have linked this effectiveness to synergy among Extreme Programming's practices, it is really just a predictable outcome of using these specific defect-removal practices. Other combinations of practices can work equally well or better, and the determination of which specific defect-removal practices to use to achieve a desired quality level is one part of effective project planning.
Table 20-3. Extreme Programming's Estimated Defect-Detection Rate

| Removal Step | Lowest Rate | Modal Rate | Highest Rate |
|---|---|---|---|
| Informal design reviews (pair programming) | 25% | 35% | 40% |
| Informal code reviews (pair programming) | 20% | 25% | 35% |
| Personal desk-checking of code | 20% | 40% | 60% |
| Unit test | 15% | 30% | 50% |
| Integration test | 25% | 35% | 40% |
| Regression test | 15% | 25% | 30% |
| Expected cumulative defect-removal efficiency | ∼74% | ∼90% | ∼97% |
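The cumulative figures in the bottom row of Table 20-3 follow from compounding the individual rates: if each practice is assumed to remove its quoted fraction of whatever defects remain after the previous practice (a simplifying assumption, since in practice techniques overlap, as the inspection studies above show), the combined efficiency is 1 minus the product of the miss rates. A short Python sketch reproduces the table's modal figure:

```python
from functools import reduce

def cumulative_efficiency(rates):
    """Combined removal efficiency, assuming each practice removes its
    quoted fraction of the defects still remaining after the one before."""
    return 1 - reduce(lambda remaining, r: remaining * (1 - r), rates, 1.0)

# Modal rates for the six practices listed in Table 20-3
modal = [0.35, 0.25, 0.40, 0.30, 0.35, 0.25]
print(f"{cumulative_efficiency(modal):.0%}")  # prints 90%, matching the table
```

The same calculation applied to the lowest and highest columns yields the table's ∼74% and ∼97%, and applied to unit, function, and system testing alone it shows why a testing-only strategy tops out well below what production software usually requires.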
Some defect-detection practices cost more than others. The most economical practices result in the least cost per defect found, all other things being equal. The qualification that all other things must be equal is important because per-defect cost is influenced by the total number of defects found, the stage at which each defect is found, and other factors besides the economics of a specific defect-detection technique.
Most studies have found that inspections are cheaper than testing. A study at the Software Engineering Laboratory found that code reading detected about 80 percent more faults per hour than testing (Basili and Selby 1987). Another organization found that it cost six times as much to detect design defects by using testing as by using inspections (Ackerman, Buchwald, and Lewski 1989). A later study at IBM found that only 3.5 staff hours were needed to find each error when using code inspections, whereas 15–25 hours were needed to find each error through testing (Kaplan 1995).
The cost of finding defects is only one part of the cost equation. The other is the cost of fixing defects. It might seem at first glance that how the defect is found wouldn't matter—it would always cost the same amount to fix.
That isn't true because the longer a defect remains in the system, the more expensive it becomes to remove. A detection technique that finds the error earlier therefore results in a lower cost of fixing it. Even more important, some techniques, such as inspections, detect the symptoms and causes of defects in one step; others, such as testing, find symptoms but require additional work to diagnose and fix the root cause. The result is that one-step techniques are substantially cheaper overall than two-step ones.
For details on the fact that defects become more expensive the longer they stay in a system, see "Appeal to Data" in Importance of Prerequisites. For an up-close look at errors themselves, see Typical Errors.
Microsoft's applications division has found that it takes three hours to find and fix a defect by using code inspection, a one-step technique, and 12 hours to find and fix a defect by using testing, a two-step technique (Moore 1992). Collofello and Woodfield reported on a 700,000-line program built by over 400 developers (1989). They found that code reviews were several times as cost-effective as testing—a 1.38 return on investment vs. 0.17.
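The one-step versus two-step difference compounds across a whole project. As a back-of-the-envelope illustration using the Microsoft figures just cited (the defect count here is hypothetical, chosen only to show the scale of the difference):

```python
defects = 200  # hypothetical number of defects, for illustration only

# Find-and-fix effort per defect, from the Microsoft applications
# division figures cited above (Moore 1992)
inspection_hours_each = 3   # one-step: symptom and cause found together
testing_hours_each = 12     # two-step: find symptom, then diagnose cause

inspection_total = defects * inspection_hours_each  # 600 staff hours
testing_total = defects * testing_hours_each        # 2,400 staff hours
print(testing_total - inspection_total)             # prints 1800
```

At these rates, relying on testing alone costs three person-months more than inspection-based detection for the same 200 defects, before counting the quality benefit of finding them earlier.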
The bottom line is that an effective software-quality program must include a combination of techniques that apply to all stages of development. Here's a recommended combination for achieving higher-than-average quality:
- Formal inspections of all requirements, all architecture, and designs for critical parts of a system
- Modeling or prototyping
- Code reading or inspections
- Execution testing