258 Simple Statistical Methods for Software Engineering
Illes-Seifert and Paech [1] set up four Pareto hypotheses and confirm each of them:
1. Pareto distribution of defects in files: a small number of files accounts for the
majority of defects
2. Pareto distribution of defects in files across releases
3. Pareto distribution of defects in code: a small part of the system’s code size
accounts for the majority of defects
4. Pareto distribution of defects in code across releases
In another example of the Pareto principle, Ostrand and Weyuker [2] studied the
distribution of defects over different files in 13 releases of a large industrial inventory
tracking system. For each release, the faults were heavily concentrated in
a relatively small number of files. For example, they found that in one release,
10% of the files accounted for 68% of the faults. Interestingly, they found a similar
pattern in code size: a small number of files contained a disproportionate share of the code.
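A concentration figure of this kind can be computed directly from per-file fault counts. The following is a minimal Python sketch; the function name `top_share` and the fault counts are hypothetical and are not taken from the study. The sample data are chosen only so that the top 10% of files hold 68% of the faults, mirroring the figure quoted above.

```python
def top_share(fault_counts, file_fraction):
    """Return the share of all faults found in the most fault-prone files.

    fault_counts  -- number of faults recorded for each file
    file_fraction -- fraction of files to inspect, e.g. 0.10 for the top 10%
    """
    if not 0 < file_fraction <= 1:
        raise ValueError("file_fraction must be in (0, 1]")
    ranked = sorted(fault_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Number of files in the top slice (always at least one file).
    top_n = max(1, round(len(ranked) * file_fraction))
    return sum(ranked[:top_n]) / total


# Hypothetical fault counts for 20 files (illustrative data only).
faults_per_file = [40, 28, 9, 6, 5, 4, 3, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

share = top_share(faults_per_file, 0.10)
print(f"Top 10% of files hold {share:.0%} of faults")  # prints 68%
```

Running the same function for several fractions (for example 0.05, 0.10, and 0.20) gives the points of a Pareto chart for a release.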
In yet another example, Murgia et al. [3] studied the defects in two open source Java
projects, both developed following agile practices, and found, “There are few Compilation
Units hosting most bugs, and most other Compilation Units are with a very few bugs,
Data 16.1 Pareto Distribution of Usage

Parameters
  m        1
  α        1.2

Statistics
  Mean     6
  Median   1.8

No. of Features   PDF of Usage   CDF of Usage
 1                0.600          0.000
 2                0.364          0.565
 3                0.253          0.732
 4                0.191          0.811
 5                0.152          0.855
 6                0.125          0.884
 7                0.106          0.903
 8                0.091          0.918
 9                0.080          0.928
10                0.071          0.937
11                0.064          0.944
12                0.058          0.949
13                0.053          0.954
14                0.049          0.958
15                0.045          0.961
16                0.042          0.964
17                0.039          0.967
18                0.036          0.969
19                0.034          0.971
20                0.032          0.973

PDF = probability density function; CDF = cumulative distribution function.
The Law of Life 259
supporting the 80/20 phenomenon.” They report that 80% of bugs are contained in
8% to 20% of the compilation units; clearly, the Pareto law holds. The researchers
observe, “This is an important result from the software engineering point of view. In
fact, a review of a small fraction of faulty Compilation Unit may have an exponential
impact on the overall amount of software defects detectable and fixable.”
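The CDF column and the summary statistics of Data 16.1 can be reproduced from the standard Pareto formulas. The following is a minimal Python sketch, assuming the usual parameterization with minimum value m and shape α; the function names are illustrative and do not come from the text. It checks only the CDF, the mean, and the median, and makes no attempt to reproduce the PDF column.

```python
def pareto_cdf(x, m, alpha):
    """P(X <= x) for a Pareto distribution with minimum m and shape alpha."""
    return 0.0 if x < m else 1.0 - (m / x) ** alpha


def pareto_mean(m, alpha):
    """Mean of the Pareto distribution; finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return alpha * m / (alpha - 1)


def pareto_median(m, alpha):
    """Median of the Pareto distribution: the x where the CDF equals 0.5."""
    return m * 2 ** (1 / alpha)


m, alpha = 1.0, 1.2  # parameters listed in Data 16.1

print("mean   =", round(pareto_mean(m, alpha), 1))    # 6.0
print("median =", round(pareto_median(m, alpha), 1))  # 1.8
for n in (1, 2, 5, 10, 20):
    print(n, round(pareto_cdf(n, m, alpha), 3))       # 0.0, 0.565, 0.855, 0.937, 0.973
```

With α this close to 1 the mean (6) sits far above the median (1.8), which is the numerical signature of a heavy tail.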
Box 16.3 The 80/20 Laws
The 80/20 principle, that 80% of results flow from just 20% of the causes,
is the one true principle of highly effective people. It has become a
management law. Its effect on all facets of life may be seen from the following
compilation:
Pareto’s historic observations:
80% of Italy’s land was owned by 20% of the population
20% of the pea pods in his garden contained 80% of the peas
In software development,
80% of errors and crashes come from 20% of bugs
20% of software components contain 80% of defects
20% of defects cause 80% of down time
20% of test cases capture 80% of defects
In problem solving,
20% of problems cause 80% of damage
20% of causes are responsible for 80% of problems
20% of hazards account for 80% of injuries
20% of customers take up 80% of one’s time
80% of crimes are committed by 20% of criminals
In the Internet,
1% of the users of a website create new content, 99% lurk
In general,
20% of humans hold power over the remaining 80%
20% of patients use 80% of health care resources
10% of expenditure on health helps 90% of poor people
20% of the world’s population controls 82.7% of the world’s income
Generalized Pareto Distribution
Open source projects have a different DNA. They follow the Pareto distribution.
Simmons and Dillon [4] note a Pareto distribution in the number of developers
participating in open source projects: most projects have only one developer,
and a much smaller percentage have larger, ongoing involvement.
Kolassa et al. [5] have proposed a generalized Pareto distribution to describe
commit sizes in open source projects.
The equation of the model is as follows:

\[
f(x) = \frac{1}{\sigma}\left(1 + \xi\,\frac{x-\theta}{\sigma}\right)^{-\frac{1}{\xi}-1} \quad \text{for } \xi \neq 0 \tag{16.6}
\]

where θ is the location parameter (controls how much the distribution is shifted), σ
is the scale parameter (controls the dispersion of the distribution), and ξ is the shape
parameter (controls the shape).
Kolassa et al. [5] have fitted the generalized Pareto distribution with the following
parameters derived from commit size data:
Box 16.3 The 80/20 Laws (continued)
In business,
80% of a company’s profits come from 20% of its customers
80% of a company’s complaints come from 20% of its customers
80% of a company’s profits come from 20% of the time spent by its staff
80% of a company’s sales come from 20% of its products
80% of a company’s sales are made by 20% of its sales staff
80% of your sales come from 20% of your clients
Related special findings:
A small number of flows carry most Internet traffic, and the remainder
consists of a large number of flows that carry very little Internet traffic
(elephant flows and mice flows).
The first 90% of the code accounts for the first 90% of the development
time. The remaining 10% of the code accounts for the other 90% of the
development time (Tom Cargill).
Ninety percent of everything is crap (Theodore Sturgeon).
80% of your benefits come from 20% of your efforts (Tim Ferriss in The
4-Hour Workweek).
ξ shape = 1.4617
θ location = 0.5
σ scale = 13.854
The model above describes how contributions from open source developers can
vary in size.
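To see what the fitted parameters imply, the density in Equation 16.6 can be evaluated directly. The following is a minimal Python sketch, assuming the form f(x) = (1/σ)(1 + ξ(x − θ)/σ)^(−1/ξ − 1) for ξ ≠ 0; the function names are illustrative. The survival function used here is the standard closed form for this distribution and is not given in the text.

```python
def gpd_pdf(x, xi, theta, sigma):
    """Generalized Pareto density for shape xi != 0, location theta, scale sigma."""
    if xi == 0:
        raise ValueError("this form of the density requires xi != 0")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - theta) / sigma
    base = 1.0 + xi * z
    if z < 0 or base <= 0:
        return 0.0  # x lies outside the support of the distribution
    return (1.0 / sigma) * base ** (-1.0 / xi - 1.0)


def gpd_survival(x, xi, theta, sigma):
    """P(X > x), from the standard closed-form CDF of the generalized Pareto."""
    if xi == 0:
        raise ValueError("this form requires xi != 0")
    z = (x - theta) / sigma
    if z <= 0:
        return 1.0
    base = 1.0 + xi * z
    if base <= 0:
        return 0.0  # beyond the upper end of the support (only when xi < 0)
    return base ** (-1.0 / xi)


# Parameters reported in the text for commit size data.
XI, THETA, SIGMA = 1.4617, 0.5, 13.854

for size in (1, 10, 100, 1000):
    print(size,
          f"pdf={gpd_pdf(size, XI, THETA, SIGMA):.6f}",
          f"P(X>size)={gpd_survival(size, XI, THETA, SIGMA):.4f}")
```

Because ξ is greater than 1, the tail decays slowly, so very large commits remain far more probable than an exponential model would suggest.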
Duane’s Model
J. T. Duane developed a reliability growth model based on the Pareto distribution
(power law), which has long been in use at the National Institute of Standards and
Technology (NIST).
This model implies that the reliability during any specific interval can be
represented by a power law with a negative exponent:

\[
\lambda_C = k\,t_T^{-\alpha} \tag{16.7}
\]

where λ_C is the average estimate of the cumulative failure rate, t_T is the total
accumulated operating hours, k is the constant representing the cumulative failure
rate at t_T = 1, and α is the improvement rate constant.
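The behavior of Duane's model is easy to tabulate. The following is a minimal Python sketch, assuming the standard Duane form λ_C = k · t_T^(−α), in which the exponent is negative so that the cumulative failure rate falls as operating hours accumulate. The function name and the values of k and α are hypothetical and chosen only for illustration.

```python
import math


def duane_cumulative_failure_rate(t_total, k, alpha):
    """Cumulative failure rate lambda_C = k * t_total ** (-alpha).

    t_total -- total accumulated operating hours (must be positive)
    k       -- cumulative failure rate at t_total = 1
    alpha   -- improvement rate constant
    """
    if t_total <= 0:
        raise ValueError("t_total must be positive")
    return k * t_total ** (-alpha)


# Hypothetical values, for illustration only.
K, ALPHA = 0.5, 0.4

for hours in (1, 10, 100, 1000):
    rate = duane_cumulative_failure_rate(hours, K, ALPHA)
    print(f"{hours:>5} h  lambda_C = {rate:.4f} failures per hour")

# On log-log axes the model is a straight line with slope -alpha.
r1 = duane_cumulative_failure_rate(10, K, ALPHA)
r2 = duane_cumulative_failure_rate(1000, K, ALPHA)
slope = (math.log(r2) - math.log(r1)) / (math.log(1000) - math.log(10))
print(f"log-log slope = {slope:.2f}")
```

The straight line on log-log axes is what makes the model convenient in practice: plotting observed cumulative failure rates against accumulated hours and reading off the slope gives an estimate of α.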
Tailing a Body
In a novel attempt, Herraiz et al. [6] fit a Pareto tail to a log-normal body with
object-oriented software metrics. The Pareto distribution, known for its prominent
tail, fits the larger values better, while the smaller values follow the log-normal
distribution. Herraiz et al. [6] called the mixture model a double Pareto distribution.
Not all OO metrics need a double Pareto. Two metrics, the number of
children and the lack of cohesion in methods, are better described using a power
law for the entire range of values. Three metrics, weighted methods per class,
coupling between object classes, and response for a class, are better described by
a double Pareto.
In another study, Herraiz et al. [7] examined a large quantity of open source
code, approximately 700,000 C files. In particular, the following metrics were studied:
source lines of code, lines of code, number of blank lines, number of comment
lines, number of comments, number of C functions, McCabe’s cyclomatic complexity,
number of function returns, and Halstead metrics. All the metrics were
found to follow a double Pareto distribution.
The Pareto distribution is among the simplest distributions available. It is also
flexible and easy to adapt. The 80/20 laws derived from the Pareto principle are used
extensively in business management, project management, software development, and
problem solving. The Pareto distribution is also used in reliability modelling.
Review Questions
1. What is the meaning of the 80/20 law?
2. What is the Pareto principle?
3. Why is the Pareto distribution called a fat-tailed distribution?
4. Which distribution has the fattest tail: Gaussian, exponential, or Pareto?
5. Give three examples of Pareto laws.
Exercises
1. Calculate the mean value of a Pareto distribution if mode = 7 and shape = 4.
2. Calculate the median value for the previous case.
3. Assume the defect density distribution of a certain application follows a
Gaussian tail. If the threshold of defect density is six units (relative value), what is
the reliability of the application? (Clue: make use of the calculation shown in
Box 16.2: A Story of Tails.)
4. Assume the defect density distribution of a certain application follows an
exponential tail. If the threshold of defect density is six units (relative value), what
is the reliability of the application? (Clue: make use of the calculation shown
in Box 16.2: A Story of Tails.)
5. Assume the defect density distribution of a certain application follows a Pareto
tail. If the threshold of defect density is six units (relative value), what is the
reliability of the application? (Clue: make use of the calculation shown in Box
16.2: A Story of Tails.)
References
1. T. Illes-Seifert and B. Paech, The Vital Few and Trivial Many: An Empirical Analysis
of the Pareto Distribution of Defects. Software Engineering, Kaiserslautern, Germany,
2009, pp. 151–164.
2. T. J. Ostrand and E. J. Weyuker, The distribution of faults in a large industrial
software system, In: Proceedings of the 2002 ACM SIGSOFT International Symposium on
Software Testing and Analysis (ISSTA), ACM Press, Roma, Italy, 2002, pp. 55–64.
3. A. Murgia, G. Concas, S. Pinna, R. Tonelli and I. Turnu, Empirical Study of Software
Quality Evolution in Open Source Projects Using Agile Practices.
4. G. L. Simmons and T. S. Dillon, Towards an ontology for open source software devel-
opment, International Federation for Information Processing, 203, 65–75, 2006.