Chapter 16 - The Law of Life: Pareto Distribution

258 ◾ Simple Statistical Methods for Software Engineering

Illes-Seifert and Paech [1] set up four Pareto hypotheses and report that each one holds.

1. Pareto distribution of defects in files: a small number of files accounts for the majority of defects
2. Pareto distribution of defects in files across releases
3. Pareto distribution of defects in code: a small part of the system’s code size accounts for the majority of defects
4. Pareto distribution of defects in code across releases

In another example of Pareto, Ostrand and Weyuker [2] studied the distribution of
defects over different files in 13 releases of a large industrial inventory tracking
system. For each release, the faults were always heavily concentrated in a relatively
small number of files. For example, they find that in a certain release, 10% of files
account for 68% of the faults. Interestingly, they find a similar pattern in code size:
a small number of files contain a disproportionate share of the code.
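
A concentration figure such as "10% of files account for 68% of the faults" can be computed from per-file fault counts. The following is a minimal sketch; the function name `top_share` and the fault counts are hypothetical and are not taken from Ostrand and Weyuker's data.

```python
def top_share(counts, top_fraction):
    """Return the share of the total held by the largest `top_fraction` of items."""
    if not counts:
        raise ValueError("counts must not be empty")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ordered = sorted(counts, reverse=True)
    # Always include at least one item, even for very small fractions.
    n_top = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n_top]) / total


# Hypothetical fault counts for 20 files (illustrative only).
fault_counts = [34, 21, 8, 5, 4, 3, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

share = top_share(fault_counts, 0.10)
print(f"Top 10% of files hold {share:.0%} of the faults")
```

The same function applies to lines of code per file, which is how the code-size pattern mentioned above could be checked.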

In yet another example, Murgia et al. [3] studied the defects in two open source Java
projects, both developed following agile practices, and find, “There are few Compilation
Units hosting most bugs, and most other Compilation Units are with a very few bugs,”

Data 16.1 Pareto Distribution of Usage

Parameters
  m = 1
  α = 1.2

Statistics
  Mean   = 6
  Median = 1.8

No. of     Probability Density        Cumulative Distribution
Features   Function (PDF) of Usage    Function (CDF) of Usage
 1         0.600                      0.000
 2         0.364                      0.565
 3         0.253                      0.732
 4         0.191                      0.811
 5         0.152                      0.855
 6         0.125                      0.884
 7         0.106                      0.903
 8         0.091                      0.918
 9         0.080                      0.928
10         0.071                      0.937
11         0.064                      0.944
12         0.058                      0.949
13         0.053                      0.954
14         0.049                      0.958
15         0.045                      0.961
16         0.042                      0.964
17         0.039                      0.967
18         0.036                      0.969
19         0.034                      0.971
20         0.032                      0.973


supporting the 80/20 phenomenon. They report that 80% of bugs are contained in 8% to
20% of the compilation units; clearly, the Pareto law holds. The researchers observe,
“This is an important result from the software engineering point of view. In fact, a
review of a small fraction of faulty Compilation Unit may have an exponential impact
on the overall amount of software defects detectable and fixable.”
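
The summary statistics and the CDF column of Data 16.1 can be cross-checked with the standard formulas for a Pareto distribution with minimum value m and shape α. The sketch below assumes those standard (Type I) formulas apply to Data 16.1; the function names are illustrative, and the sketch covers only the mean, the median, and the CDF column.

```python
def pareto_mean(m, alpha):
    """Mean of a Pareto distribution; finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return alpha * m / (alpha - 1)


def pareto_median(m, alpha):
    """Median of a Pareto distribution with minimum m and shape alpha."""
    return m * 2 ** (1 / alpha)


def pareto_cdf(x, m, alpha):
    """P(X <= x) for a Pareto distribution with minimum m and shape alpha."""
    if x < m:
        return 0.0
    return 1.0 - (m / x) ** alpha


# Parameters from Data 16.1.
M, ALPHA = 1, 1.2

print(f"mean   = {pareto_mean(M, ALPHA):.1f}")
print(f"median = {pareto_median(M, ALPHA):.1f}")
for features in (1, 2, 3, 10, 20):
    print(features, f"{pareto_cdf(features, M, ALPHA):.3f}")
```

With m = 1 and α = 1.2, these functions agree with the mean, the median, and the CDF values listed in Data 16.1 to the precision shown there.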

Box 16.3 The 80/20 Laws

The 80/20 principle, that 80% of results flow from just 20% of the causes, is the one
true principle of highly effective people. It has become a management law. Its effect
on all facets of life may be seen from the following compilation:

Pareto’s historic observations:
80% of Italy’s land was owned by 20% of the population
20% of the pea pods in his garden contained 80% of the peas

In software development,
80% of errors and crashes come from 20% of bugs
20% of software components contain 80% of defects
20% of defects cause 80% of down time
20% of test cases capture 80% of defects

In problem solving,
20% of problems cause 80% of damage
20% of causes are responsible for 80% of problems
20% of hazards account for 80% of injuries
20% of customers take up 80% of one’s time
80% of crimes are committed by 20% of criminals

On the Internet,
1% of the users of a website create new content; 99% lurk

In general,
20% of humans hold power over the remaining 80%
20% of patients use 80% of health care resources
10% of expenditure on health helps 90% of poor people
20% of the world’s population controls 82.7% of the world’s income


Generalized Pareto Distribution

Open source projects have a different DNA. They follow the Pareto distribution.
Simmons and Dillon [4] note a Pareto distribution in the number of developers
participating in open source projects, with most projects having only one developer
and a much smaller percentage having larger, ongoing involvement.

Kolassa et al. [5] have proposed a generalized Pareto distribution to describe commit
sizes in open source projects.

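The equation of the model is as follows:

$$ f(x) = \frac{1}{\sigma}\left(1 + \xi\,\frac{x - \theta}{\sigma}\right)^{-\frac{1}{\xi} - 1} \quad \text{for } \xi \neq 0 \qquad (16.6) $$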

where θ is the location parameter (controls how much the distribution is shifted), σ

is the scale parameter (controls the dispersion of the distribution), and ξ is the shape

parameter (controls the shape).

Kolassa et al. [5] have fitted the generalized Pareto with the following parameters
derived from commit size data:

In business,
80% of a company’s profits come from 20% of its customers
80% of a company’s complaints come from 20% of its customers
80% of a company’s profits come from 20% of the time spent by its staff
80% of a company’s sales come from 20% of its products
80% of a company’s sales are made by 20% of its sales staff
80% of your sales come from 20% of your clients

Related special findings:
A small number of flows carry most Internet traffic, and the remainder consists of a
large number of flows that carry very little Internet traffic (elephant flows and
mice flows).
The first 90% of the code accounts for the first 90% of the development time. The
remaining 10% of the code accounts for the other 90% of the development time
(Tom Cargill).
Ninety percent of everything is crap (Theodore Sturgeon).
80% of your benefits come from 20% of your efforts (Tim Ferriss in The 4-Hour Workweek).


ξ shape = 1.4617

θ location = 0.5

σ scale = 13.854

The previously mentioned model explains how contributions from open source
developers can vary.
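
To get a feel for the fitted model, the generalized Pareto density and its cumulative form can be evaluated directly. This is a minimal sketch that assumes the standard three-parameter generalized Pareto form with location θ, scale σ, and shape ξ ≠ 0; the function names are illustrative and do not come from Kolassa et al.

```python
def gpd_pdf(x, xi, theta, sigma):
    """Density of the generalized Pareto distribution (requires xi != 0)."""
    if xi == 0 or sigma <= 0:
        raise ValueError("this form requires xi != 0 and sigma > 0")
    z = (x - theta) / sigma
    base = 1 + xi * z
    if z < 0 or base <= 0:
        return 0.0  # outside the support
    return (1 / sigma) * base ** (-1 / xi - 1)


def gpd_cdf(x, xi, theta, sigma):
    """P(X <= x) for the generalized Pareto distribution (requires xi != 0)."""
    if xi == 0 or sigma <= 0:
        raise ValueError("this form requires xi != 0 and sigma > 0")
    z = (x - theta) / sigma
    if z < 0:
        return 0.0
    base = 1 + xi * z
    if base <= 0:
        return 1.0  # beyond the upper endpoint, which exists only when xi < 0
    return 1 - base ** (-1 / xi)


def gpd_quantile(p, xi, theta, sigma):
    """Value x such that P(X <= x) = p, for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return theta + sigma * ((1 - p) ** (-xi) - 1) / xi


# Parameters reported in the text for commit sizes.
XI, THETA, SIGMA = 1.4617, 0.5, 13.854

median_size = gpd_quantile(0.5, XI, THETA, SIGMA)
print(f"median commit size under the fitted model: {median_size:.2f}")
print(f"P(size <= 100) = {gpd_cdf(100, XI, THETA, SIGMA):.3f}")
```

A shape parameter above 1, as here, means the distribution has no finite mean, so quantiles such as the median are the more useful summaries.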

Duane’s Model

J. T. Duane has developed a reliability growth model based on the Pareto distribution
(power law), which, according to the National Institute of Standards and Technology
(NIST), has long been in use.

This model implies that the reliability during any specific interval can be
represented by a power law with a negative exponent:

$$ C = k\,t^{-\alpha} \qquad (16.7) $$

where C is the average estimate of the cumulative failure rate, t is the total
accumulated operating hours, k is the constant representing the cumulative failure
rate at t = 1, and α is the improvement rate constant.
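
Taking logarithms of the Duane relation gives log C = log k − α log t, a straight line on log-log axes, so k and α can be estimated with an ordinary least-squares fit. The sketch below illustrates this under that assumption; the function names are illustrative and the observations are synthetic, generated from known constants rather than taken from any real system.

```python
import math


def duane_rate(t, k, alpha):
    """Cumulative failure rate C = k * t**(-alpha) at accumulated hours t."""
    return k * t ** (-alpha)


def fit_duane(hours, rates):
    """Estimate (k, alpha) by least squares on log C = log k - alpha * log t."""
    if len(hours) != len(rates) or len(hours) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(t) for t in hours]
    ys = [math.log(c) for c in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("operating hours must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic observations generated from k = 2.0 and alpha = 0.5 (illustrative only).
hours = [10, 50, 100, 500, 1000]
rates = [duane_rate(t, 2.0, 0.5) for t in hours]

k_hat, alpha_hat = fit_duane(hours, rates)
print(f"k = {k_hat:.3f}, alpha = {alpha_hat:.3f}")
```

Because the synthetic data lie exactly on the model curve, the fit recovers the generating constants; real failure data would scatter around the line.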

Tailing a Body

In a novel attempt, Herraiz et al. [6] fit a Pareto tail to a log-normal body for
object-oriented software metrics. The Pareto distribution, known for its prominent
tail, fits the larger values better, while the smaller values follow the log-normal
distribution. Herraiz et al. [6] call the mixture model a double Pareto distribution.
Not all OO metrics need a double Pareto. Two metrics, the number of children and the
lack of cohesion in methods, are better described using a power law for the entire
range of values. Three metrics, weighted methods per class, coupling between object
classes, and requests for a class, are better described by a double Pareto.

In another study, Herraiz et al. [7] have studied a large quantity of open source
code, approximately 700,000 C files. In particular, the following metrics were
studied: source lines of code, lines of code, number of blank lines, number of comment
lines, number of comments, number of C functions, McCabe’s cyclomatic complexity,
number of function returns, and Halstead metrics. All the metrics were found to follow
a double Pareto distribution.

The Pareto distribution is among the simplest distributions available. It is also
flexible and easy to adapt. The 80/20 law, derived from the Pareto principle, is used
extensively in business management, project management, software development, and
problem solving. It is also used in reliability modelling.

Review Questions

1. What is the meaning of the 80/20 law?
2. What is the Pareto principle?
3. Why is the Pareto distribution called a fat-tailed distribution?
4. Which distribution has the fattest tail: Gaussian, exponential, or Pareto?
5. Give three examples of Pareto laws.

Exercises

1. Calculate the mean value of a Pareto distribution if mode = 7 and shape = 4.
2. Calculate the median value for the previous case.
3. Assume the defect density distribution of a certain application follows a Gaussian
   tail. If the threshold of defect density is six units (relative value), what is the
   reliability of the application? (Clue: make use of the calculation shown in
   Box 16.2: A Story of Tails.)
4. Assume the defect density distribution of a certain application follows an
   exponential tail. If the threshold of defect density is six units (relative value),
   what is the reliability of the application? (Clue: make use of the calculation
   shown in Box 16.2: A Story of Tails.)
5. Assume the defect density distribution of a certain application follows a Pareto
   tail. If the threshold of defect density is six units (relative value), what is the
   reliability of the application? (Clue: make use of the calculation shown in
   Box 16.2: A Story of Tails.)

References

1. T. Illes-Seifert and B. Paech, The Vital Few and Trivial Many: An Empirical Analysis
   of the Pareto Distribution of Defects. Software Engineering, Kaiserslautern, Germany,
   2009, pp. 151–164.
2. T. J. Ostrand and E. J. Weyuker, The distribution of faults in a large industrial
   software system, In: Proceedings of the 2002 ACM SIGSOFT International Symposium on
   Software Testing and Analysis (ISSTA), ACM Press, Roma, Italy, 2002, pp. 55–64.
3. A. Murgia, G. Concas, S. Pinna, R. Tonelli and I. Turnu, Empirical Study of Software
   Quality Evolution in Open Source Projects Using Agile Practices.
4. G. L. Simmons and T. S. Dillon, Towards an ontology for open source software
   development, International Federation for Information Processing, 203, 65–75, 2006.
