Overview of the Prediction Model

Fault prediction for a given release N of a system is based on a model built by analyzing the code properties, process properties, and fault counts from earlier releases. The fundamental assumption is that properties associated with faulty files in earlier releases will also be associated with faulty files in the next release. A model consists of an equation whose variables are the code and process properties for a file of release N, and whose value is a predicted number of faults for the file. Creation of the model requires data from at least two prior releases, but can make use of data from as many prior releases as are available. The data used to construct a model are referred to as training data.

Because we have fault data available for the entire history of these systems, we can generate predictions for release N using information from releases 1 through N – 1, and then check the quality of those predictions against the numbers of faults actually detected in the files of release N. Our fundamental measure of prediction quality for release N is the percentage of the faults actually detected in release N that occurred in the files appearing in the first 20% of the prediction list. Table 9-2 shows that the top 20% always includes at least 75% of the faults, and usually more than 80%. The table provides compelling evidence that the prediction method will be applicable to other large systems.

You might be wondering why we have chosen the admittedly arbitrary cutoff value of 20% to assess our predictions. There are two basic reasons. First, the evidence that we and others have collected indicates that the top 20% of the files usually contains the large majority of the defects. If our predictions identify files that together contain around 80% of the faults, then we are doing an excellent job of pinpointing the right files.

Second, 20% of the files often represents a small enough number of files to make it practical to target them for special attention. We are certainly not advocating that other files should not be tested or carefully scrutinized, only that the first 20% are likely to warrant increased attention.

The prediction models we use are based on negative binomial regression (NBR) [McCullagh and Nelder 1989], which we have found to be the most effective statistical approach for our purpose. We have developed a single form of the NBR model and have applied it successfully to many different systems. Negative binomial regression is a well-known statistical modeling method that extends linear regression. It is especially suitable for modeling situations in which the outcomes are nonnegative integers, such as the number of faults expected in each file of a system. With NBR, the logarithm of the expected outcome (here, the expected number of faults) is modeled as a linear combination of the independent variables. Equivalently, the expected number of faults is a multiplicative function of the independent variables.
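
To make the log-link idea concrete, here is a minimal, self-contained sketch in Python. It assumes the statsmodels library is available; the predictor names and the simulated data are purely illustrative and are not taken from our studies.

    # Minimal NBR sketch: the log of the expected count is linear in the
    # predictors, so each coefficient acts multiplicatively on the prediction.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    data = pd.DataFrame({
        "log_loc": rng.normal(6.0, 1.0, n),    # hypothetical: log of file size
        "prior_faults": rng.poisson(1.0, n),   # hypothetical: faults in release N-1
    })

    # Simulate negative binomial counts as a gamma-mixed Poisson whose
    # log-mean is a linear combination of the predictors.
    mu = np.exp(-4.0 + 0.6 * data["log_loc"] + 0.3 * data["prior_faults"])
    alpha = 0.5                                            # dispersion parameter
    data["faults"] = rng.poisson(rng.gamma(1.0 / alpha, alpha * mu))

    result = smf.negativebinomial("faults ~ log_loc + prior_faults", data).fit(disp=False)

    # exp(coefficient) is the multiplicative change in the expected fault
    # count per one-unit increase in that predictor.
    print(np.exp(result.params[["log_loc", "prior_faults"]]))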

The following variables make up what we call the standard model for our predictions. They are input to the model’s equation for each file, and produce a predicted fault count for the file in release N (a sketch of the corresponding regression formula appears below the list):

  • The size of the file in number of lines of code (LOC)

  • Whether the file was part of the system in the previous release or was new to this release (a binary variable)

  • The number of changes made to the file in release N – 1

  • The number of changes made to the file in release N – 2

  • The number of faults detected in the file in release N – 1, either during testing or during operation in the field

  • The programming language in which the file was written

A detailed technical description of our standard NBR model can be found in [Weyuker et al. 2009].
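
To illustrate, the sketch below encodes these six variables as an NBR formula and ranks the files of a simulated release N by predicted fault count. The column names, the simulated data, and the use of statsmodels are assumptions made for this example; this is a sketch, not the implementation used in our studies.

    # Hypothetical encoding of the standard model's predictors as an NBR
    # formula; the real feature extraction and data are not shown here.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def simulate_release(n, rng):
        """Toy stand-in for one release's per-file measurements."""
        files = pd.DataFrame({
            "loc": rng.integers(50, 5000, n),        # lines of code
            "is_new": rng.integers(0, 2, n),         # new in this release?
            "changes_prev1": rng.poisson(2.0, n),    # changes in release N-1
            "changes_prev2": rng.poisson(2.0, n),    # changes in release N-2
            "faults_prev1": rng.poisson(0.5, n),     # faults in release N-1
            "language": rng.choice(["c", "c++", "sql"], n),
        })
        mu = np.exp(-4.0 + 0.5 * np.log(files["loc"]) + 0.4 * files["is_new"]
                    + 0.2 * files["changes_prev1"] + 0.1 * files["changes_prev2"]
                    + 0.3 * files["faults_prev1"])
        files["faults"] = rng.poisson(rng.gamma(2.0, mu / 2.0))  # NB-like counts
        return files

    rng = np.random.default_rng(1)
    training = simulate_release(600, rng)    # stands in for releases 1 .. N-1
    release_n = simulate_release(200, rng)   # the release we want to predict

    formula = ("faults ~ np.log(loc) + is_new + changes_prev1 + changes_prev2"
               " + faults_prev1 + C(language)")
    model = smf.negativebinomial(formula, training).fit(disp=False)

    # Predicted fault counts for release N, ranked from most to least risky.
    release_n["predicted"] = model.predict(release_n)
    ranked = release_n.sort_values("predicted", ascending=False)
    top_20pct = ranked.head(int(0.20 * len(ranked)))  # files to scrutinize first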

In our empirical studies, we evaluate a model’s success by adding up the faults actually detected in the files that the model ranks in the top 20%, and then dividing that number by the total number of faults identified in all of the files of that release. Thus, if the files ranked in the top 20% contain 80 faults, and there are 100 faults in total across all the files of release N, we say that the model correctly identifies 80% of the faults in the top 20% of the files for release N. Of course, when the models are used in practice in an ongoing development environment, the developers don’t know in advance which files of release N will turn out to be faulty.
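
The computation itself is simple. The following sketch shows it, with made-up numbers that mirror the 80-out-of-100 example above.

    # Evaluation sketch: what fraction of the actual faults falls in the
    # files occupying the top 20% of the prediction list? (Toy numbers.)
    import numpy as np

    def faults_in_top_20pct(predicted, actual):
        """Share of actual faults in the 20% of files ranked highest by prediction."""
        predicted, actual = np.asarray(predicted), np.asarray(actual)
        k = max(1, int(0.20 * len(predicted)))   # size of the top-20% file list
        top = np.argsort(-predicted)[:k]         # indices of the k highest predictions
        return actual[top].sum() / actual.sum()

    # 10 files; the two files ranked highest contain 80 of the 100 actual faults.
    predicted = [9.1, 0.2, 5.5, 0.1, 0.3, 7.8, 0.2, 0.1, 0.4, 0.6]
    actual    = [45,  1,   8,   0,   2,   35,  3,   0,   2,   4]
    print(faults_in_top_20pct(predicted, actual))   # -> 0.8, i.e., 80%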
