Chapter 16
The Law of Life:
Pareto Distribution
80/20 Aphorism
The Pareto distribution is a fat-tailed, skewed distribution introduced by Vilfredo Pareto. A
brief biography of Pareto is given in Box 16.1. The distribution was originally used to
describe the distribution of wealth in society, where larger wealth is controlled by fewer people.
Box 16.1 Vilfredo Pareto, the Economist Who Discovered Management (1848–1923)
Vilfredo Pareto was an Italian sociologist, engineer, economist, philosopher,
political scientist, and mathematician.
Between 1859 and 1864, Vilfredo changed schools several times. From 1864
to 1867, Vilfredo studied mathematics and physics at the Università di Torino.
In 1869, he earned a doctor’s degree in engineering from what is now the
Polytechnic University of Turin. His dissertation was titled “The Fundamental
Principles of Equilibrium in Solid Bodies.” His later interest in equilibrium
analysis in economics and sociology can be traced back to this paper.
After his studies, Pareto worked for some years at the Italian Railway
Company and traveled to Germany, England, Belgium, Switzerland, and
Austria. In the field of statistics, Pareto worked on insurance and the
calculation of pensions.
254 Simple Statistical Methods for Software Engineering
Structure of Pareto
Pareto is known as a fat-tailed distribution. Gaussian, exponential, and Pareto tails
are compared in Box 16.2. It is shown that Pareto has the largest tail.
A graph of the Pareto distribution is plotted in Figure 16.1. The metric plotted is the
probability of usage of software features. The distribution begins
at its mode and extends asymptotically to the right. The decline of usage is gradual.
The Pareto probability density function (PDF) depends on two parameters,
mode m and shape factor α. The equation for the PDF is as follows:
$$\mathrm{PDF} = \frac{\alpha\, m^{\alpha}}{x^{\alpha + 1}} \tag{16.1}$$
The equation can be rewritten by marking the constant term separately and
bringing the variable term to the numerator, as follows:
$$f(x) = \left(\alpha\, m^{\alpha}\right) x^{-(\alpha + 1)} \tag{16.2}$$
The equation is clearly a form of the power law, x^b, with a negative exponent, b = −(α + 1).
The power law is one of the favorite curves used in data mining.
Pareto became famous through the Pareto optimum in economics and the
Pareto distribution. In 1896, he found that the distribution of income does
not follow the normal distribution but is skewed to the right. His
discovery of the “distribution curve for wealth and incomes” of 1895 made
Pareto famous as a statistician.
The Pareto principle was named after him and built on observations of his,
such as that 80% of the land in Italy was owned by 20% of the population.
Pareto was the first to realize that utility was a preference ordering. With this,
Pareto not only inaugurated modern microeconomics but also demolished the
alliance of economics and utilitarian philosophy. Pareto said “good” cannot be
measured. He replaced it with the notion of Pareto optimality, the idea that a
system is enjoying maximum economic satisfaction when no one can be made better
off without making someone else worse off. Pareto optimality is widely used in
welfare economics and game theory. A standard theorem is that a perfectly
competitive market creates distributions of wealth that are Pareto optimal.
His legacy as an economist was profound. Partly because of him, the field
evolved from a branch of social philosophy as practiced by Adam Smith
into a data-intensive field of scientific research and mathematical equations.
(http://en.wikipedia.org/wiki/Pareto_principle; http://en.wikipedia.org/wiki/Pareto_distribution)
Box 16.2 A Story of Tails
The Gaussian tail dies soon. The exponential tail stretches longer but is
limited. The Pareto tail, resulting from the power law, is unlimited. We can
compare the standard forms of these three tail equations:
Gaussian, standard form = $e^{-x^{2}/2}$
Exponential, standard form = $e^{-x}$
Pareto, standard form = $x^{-1}$
In the previously mentioned expressions, scale factor = 1 and location = 0.
If we check the value of tails at x = 6, we find
Gaussian tail = 0.0000000152
Exponential tail = 0.00248
Pareto tail = 0.167
At x = 6, the Gaussian tail is nearly zero, and the exponential tail is
162,755 times bigger. In turn, the Pareto tail is 67 times stronger than the
Figure 16.1 Pareto distribution of features usage (m = 1, α = 1.2; x-axis: number of features, y-axis: usage probability).
The cumulative distribution function is shown in Figure 16.2. The y-axis
directly reads usage probability, while the x-axis reads the number of features.
Using this model, we can quickly find the usage probability of n features
in a software product.
The equation for the cumulative distribution is rather simple and is shown as follows:
exponential. For larger values of x, divergence among the three tails increases
further. The Gaussian tail will be dead, the exponential tail will slide toward
zero, and the Pareto tail will still have significant values for a long distance.
These three tails represent three aspects of engineering and management.
Gaussian is drawn to its center; its body is accentuated and its tail attenuated,
a true model of process behavior. The Gaussian tails are either process defects
or rejection areas.
The exponential curve represents decay or defects in a product. There seem to
be special mechanisms in a product that cause decay or vulnerabilities that
cause defects. By definition, the exponential tail represents failure, not
performance, of products.
Pareto is often a model for external factors that influence a product or a
process from outside the organization.
Business comprises effects represented by these three tails.
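The tail values quoted in Box 16.2 at x = 6 can be reproduced with a few lines of code. This is a rough sketch that assumes the three standard forms are exp(−x²/2), exp(−x), and 1/x, as the quoted figures imply; the function names are illustrative, and the extra points x = 12 and x = 24 are not from the text.

```python
import math


def gaussian_tail(x):
    """Standard Gaussian tail form, exp(-x^2 / 2)."""
    return math.exp(-x * x / 2)


def exponential_tail(x):
    """Standard exponential tail form, exp(-x)."""
    return math.exp(-x)


def pareto_tail(x):
    """Standard Pareto (power-law) tail form, 1 / x."""
    return 1.0 / x


if __name__ == "__main__":
    # x = 6 is the point used in Box 16.2; 12 and 24 are added for illustration.
    for x in (6, 12, 24):
        g, e, p = gaussian_tail(x), exponential_tail(x), pareto_tail(x)
        print(f"x = {x:2d}: Gaussian = {g:.3e}, exponential = {e:.3e}, Pareto = {p:.3e}")
        print(f"        exponential/Gaussian = {e / g:.3e}, Pareto/exponential = {p / e:.3e}")
```

Running the comparison at larger x shows the ratios between the tails growing, which is the divergence the box describes.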
Figure 16.2 Cumulative Pareto distribution of features usage (m = 1, α = 1.2; x-axis: number of features, y-axis: usage probability; the 80/20 law is marked on the curve).
$$\mathrm{CDF} = 1 - \left(\frac{m}{x}\right)^{\alpha} \tag{16.3}$$
It may be noted that the previously mentioned equations are defined for values
of x greater than mode m.
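The cumulative form in Equation 16.3 is equally easy to compute. The sketch below is illustrative only: `pareto_cdf` is an assumed name, the defaults m = 1 and α = 1.2 are those quoted for Figure 16.2, and the function returns zero below the mode because the equation is defined only from the mode onward.

```python
def pareto_cdf(x, m=1.0, alpha=1.2):
    """Cumulative Pareto probability per Equation 16.3: 1 - (m / x)**alpha.

    Name and defaults are illustrative assumptions; values of x below
    the mode m are given a cumulative probability of zero.
    """
    if m <= 0 or alpha <= 0:
        raise ValueError("m and alpha must be positive")
    if x < m:
        return 0.0
    return 1.0 - (m / x) ** alpha


if __name__ == "__main__":
    # Cumulative usage probability for the first n features.
    for n in (1, 2, 4, 8, 16):
        print(f"n = {n:2d}: CDF = {pareto_cdf(n):.3f}")
```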
Key statistics of the distribution are given as follows:
$$\text{Mean} = \frac{\alpha\, m}{\alpha - 1} \tag{16.4}$$
$$\text{Median} = m \cdot 2^{1/\alpha} \tag{16.5}$$
The mean is defined only for values of shape factor α > 1.
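Both statistics follow from the density and cumulative forms given earlier. The derivation below is a standard worked sketch added for clarity, not reproduced from the book; it also shows why the mean requires α > 1.

```latex
% Median: set the CDF (16.3) equal to 1/2 and solve for x.
1 - \left(\frac{m}{x}\right)^{\alpha} = \frac{1}{2}
\;\Longrightarrow\;
\left(\frac{m}{x}\right)^{\alpha} = \frac{1}{2}
\;\Longrightarrow\;
x = m \cdot 2^{1/\alpha}

% Mean: integrate x f(x) from the mode m to infinity, with f(x) from (16.2).
\int_{m}^{\infty} x \,\alpha m^{\alpha} x^{-(\alpha+1)} \, dx
= \alpha m^{\alpha} \int_{m}^{\infty} x^{-\alpha} \, dx
= \alpha m^{\alpha} \cdot \frac{m^{1-\alpha}}{\alpha - 1}
= \frac{\alpha m}{\alpha - 1},
\qquad \alpha > 1
```

For α ≤ 1 the integral of x^{−α} diverges, so the tail is too heavy for a finite mean.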
An Example
A Pareto model has been established with mode m = 1 and shape factor α = 1.2 in
Data 16.1. The mean for this model turns out to be 6, while the median is 1.8. The
fact that the mean lies so far from the median reflects the skew of the model: the mean
has shifted toward the tail. The PDF and cumulative distribution function (CDF)
have been computed, and the values are shown in Data 16.1. Pareto calculations are easy
and can be managed with basic Excel.
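The same calculations can be sketched in a few lines of Python instead of a spreadsheet. The function names are assumptions, and the table is computed from Equations 16.2 through 16.5 with m = 1 and α = 1.2; the range x = 1 to 10 is an illustrative choice, and the output is not copied from Data 16.1.

```python
def pareto_pdf(x, m, alpha):
    """Density per Equation 16.2; zero below the mode m."""
    return alpha * m ** alpha * x ** (-(alpha + 1)) if x >= m else 0.0


def pareto_cdf(x, m, alpha):
    """Cumulative probability per Equation 16.3; zero below the mode m."""
    return 1.0 - (m / x) ** alpha if x >= m else 0.0


def pareto_mean(m, alpha):
    """Mean per Equation 16.4; only defined for alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is undefined for alpha <= 1")
    return alpha * m / (alpha - 1)


def pareto_median(m, alpha):
    """Median per Equation 16.5."""
    return m * 2 ** (1 / alpha)


if __name__ == "__main__":
    m, alpha = 1.0, 1.2  # parameter values stated in the example
    print(f"mean   = {pareto_mean(m, alpha):.1f}")
    print(f"median = {pareto_median(m, alpha):.1f}")
    print(" x     PDF     CDF")
    for x in range(1, 11):  # illustrative range, not taken from Data 16.1
        print(f"{x:2d}  {pareto_pdf(x, m, alpha):.4f}  {pareto_cdf(x, m, alpha):.4f}")
```

The guard in `pareto_mean` mirrors the condition that the mean exists only when the shape factor exceeds 1.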
The 80/20 Law: Vital Few and Trivial Many
The CDF shown in Figure 16.2 allows us to think of the famous 80/20 law due to
Pareto. It may be seen that 20% of features have 80% usage probability. This is a
basic principle used in statistical testing. This model is also called the operational
profile of the product. There are many 80/20 laws that rule life. A brief list is given
in Box 16.3.
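One way to locate the 80% point on the cumulative curve is to invert Equation 16.3, which gives x = m(1 − p)^(−1/α). The sketch below assumes m = 1 and α = 1.2 as in Figure 16.2; the name `pareto_quantile` and the total of 20 features are hypothetical choices for illustration and do not come from the text.

```python
def pareto_quantile(p, m=1.0, alpha=1.2):
    """Invert Equation 16.3: the x at which the CDF equals p.

    Solving 1 - (m / x)**alpha = p gives x = m * (1 - p)**(-1 / alpha).
    Name and defaults are illustrative assumptions.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return m * (1.0 - p) ** (-1.0 / alpha)


if __name__ == "__main__":
    x80 = pareto_quantile(0.80)
    total_features = 20  # hypothetical product size, not from the text
    print(f"80% cumulative usage is reached at x = {x80:.2f} features")
    print(f"share of a {total_features}-feature product: {x80 / total_features:.0%}")
```

With these parameters the 80% point falls at roughly 3.8 features, so whether that amounts to "20% of features" depends on the total feature count assumed for the product.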
The 80/20 law depicts the phenomenon of “vital few and trivial many.” Illes-Seifert
and Paech [1] have analyzed application of this principle to software defects.
They report,
The distribution of about 430 defects over about 500 modules has been
analysed and confirms the Pareto Principle, i.e. approximately 80% of
the defects were contained in 20% of the modules.