Data Dispersion 45
Skewness and Kurtosis
Pearson’s Skewness
Skewness is a measure of asymmetry in data. Pearson’s formula for skewness is
based on the difference between mean and mode. The larger the difference, the
greater the asymmetry. The formula is given as follows:

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{SD}} \qquad (3.4)$$
Applying this formula to bug repair time data, we obtain a skewness of 0.666.
The value is positive, indicating a longer tail toward large values. Data are said to be
skewed to the right. If the skewness is negative, data would be negatively skewed,
or skewed to the left.
If the mode is ill defined, then we can use the following modified formula based
on the difference between mean and median:

$$\text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{SD}} \qquad (3.5)$$
For bug repair time data, this formula yields a skewness of 0.669.
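As a quick check of Equations 3.4 and 3.5, the following minimal sketch plugs in the summary statistics that this chapter reports for the bug repair time data (mean 20.517, median 18, SD 11.284). The function names are illustrative and do not come from the text.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Equation 3.4: (mean - mode) / SD."""
    return (mean - mode) / sd


def pearson_median_skewness(mean, median, sd):
    """Equation 3.5: 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd


# Summary statistics quoted in this chapter for the bug repair time data.
skew = pearson_median_skewness(mean=20.517, median=18, sd=11.284)
print(round(skew, 3))
```

The rounded result agrees with the value of 0.669 quoted above.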
Bowley’s Skewness
A robust estimate of skewness is based on quartiles and median. This is also known
as quartile skewness or Bowley’s skewness. The formula is given as follows:

$$\text{Bowley’s skewness} = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1} \qquad (3.6)$$
For bug repair time data, Bowley’s skewness was calculated as 0.259. This value
is much smaller than Pearson’s value. Bowley’s skewness is on a different scale; it
varies from −1 to +1. Pearson’s skewness varies from −3 to +3.
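Equation 3.6 can be sketched in a few lines. The quartiles and median used here (Q1 = 13, median = 18, Q3 = 26.5) are the values this chapter reports for the bug repair time data. The function name is an illustrative choice.

```python
def bowley_skewness(q1, median, q3):
    """Equation 3.6: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("Q3 equals Q1; Bowley's skewness is undefined")
    return (q3 + q1 - 2 * median) / iqr


# Quartiles and median quoted in this chapter for the bug repair time data.
bowley = bowley_skewness(q1=13, median=18, q3=26.5)
print(round(bowley, 3))
```

Because the numerator can never exceed the interquartile range in magnitude, the result always stays within −1 and +1.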
Third Standardized Moment
Skewness can be considered using the method of moments. Skewness is the third
standardized moment. This convention is followed in Excel in the function SKEW
that uses the following formula:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3 \qquad (3.7)$$

where n is the sample size, $\bar{x}$ is the sample mean, and s is the sample standard deviation.
46 Simple Statistical Methods for Software Engineering
It may be noted that in the formula, deviations are raised to the third power.
Also, like in all skewness calculations, the value is normalized or standardized with
a division by standard deviation.
The moment-based calculation of skewness, using the Excel function SKEW, for
bug repair time data is 1.271. This is a more sensitive measure of skewness.
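For readers working outside Excel, Equation 3.7 might be implemented as in the sketch below. The function name and the sample values are hypothetical; the sample is not the bug repair data set, which is not listed in this section.

```python
import math


def sample_skewness(data):
    """Equation 3.7: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(data) / n
    # Sample standard deviation (divisor n - 1), as used in Equation 3.7.
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("all values are equal; skewness is undefined")
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical repair times in days with one large value on the right.
hypothetical_days = [1, 2, 3, 4, 10]
print(sample_skewness(hypothetical_days))
```

A single large value dominates the cubed deviations, which is why this measure reacts more strongly to a long tail than the quartile-based estimate does.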
Kurtosis
e atness of data is measured as kurtosis. e lower the value of kurtosis, the
atter the data distribution. ere are dierent conventions in computing kurtosis.
e Excel function KURT uses a formula for kurtosis given as follows:
Kurtosis =
n
n n n
x x
s
n
i
( )( )( )
( )
(1 2 3
3 1
4
2
nn n
2 3)( )
(3.8)
where n is the sample size,
x
is the sample mean, and s is the sample standard deviation.
The formula has been adjusted to make the kurtosis of normally distributed
data equal to zero. The Pearson method of calculating kurtosis yields a value of 3
for normal distribution. If we subtract 3 from the Pearson result, we will obtain
excess kurtosis. Hence, the Excel KURT formula gives “excess kurtosis,” the value
in excess of normal kurtosis. If this “excess kurtosis” value is positive, data are more
peaked; if it is negative, data are broader.
Kurtosis for bug repair time data has been calculated. It is +1.676; hence, data
are peaked.
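Equation 3.8 can be sketched in the same way. The function name is illustrative, and the sample values are hypothetical rather than the bug repair data.

```python
import math


def excess_kurtosis(data):
    """Equation 3.8: sample excess kurtosis (zero for normal data)."""
    n = len(data)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = sum(data) / n
    # Sample standard deviation (divisor n - 1), as used in Equation 3.8.
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("all values are equal; kurtosis is undefined")
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical, evenly spread values: flatter than normal, so negative.
print(excess_kurtosis([1, 2, 3, 4, 5]))
```

Evenly spread values give a negative result, in line with the reading that negative excess kurtosis indicates broader data.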
BOX 3.4 SKEWED LIFE
A good amount of software project data are skewed. Symmetrical and normal
data are an exception. Data from simple processes show symmetry. Data from
complex processes are skewed. Software development is certainly a collection
of several processes and is expected to produce skewed data. If data collection
is a process of observation, then we must recognize skew in data and learn to
accept skew as a reality of life. The transformation of skewed data into
symmetrical data is an artificial step performed often to apply some statistical
tests. The untransformed raw data from software projects is often skewed.
An outstanding example is the skew in complexity data. Another is skew in
TAT data. In such cases, skew is the DNA of a process. Skew may restrict the
application of several classic statistical methods while testing the data, but
that is a secondary issue.
Coefficient of Dispersion
e term coecient is commonly used in algebra. e coecient of a variable
tells us the magnitude of the eect of the variable on the result. In metallurgy, the
coecient of expansion of metals can be used to calculate the expansion of metals.
Here the coecient is a metal property. e design of a coecient of dispersion has
a dierent purpose, although the connotations of the term are not entirely strange.
Coefficient of Range
e simplest coecient of dispersion is the coecient of range (COR), calculated
as follows:
COR
Max Min
Max Min
=
+
( )
( )
(3.9)
For the bug repair data, COR can be computed as follows:
Max 58 days
Min 6 days
Max − Min 52 days
Max + Min 64 days
COR 0.8125 (dimensionless ratio)
Coefficient of Quartile Deviation
COR is based on extreme values and hence is not robust. Coefficient of Quartile
Deviation (CQD) is based on quartiles and hence is not influenced by extreme
values. The formula for CQD is given as follows:

$$\text{CQD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \qquad (3.10)$$
For the bug repair time data, CQD is computed as follows:
Q3 26.5 days
Q1 13 days
Q3 − Q1 13.5 days
Q3 + Q1 39.5 days
CQD 0.342 (dimensionless ratio)
It may be seen that using quartiles gives a favorable value for the process of
repairing bugs. CQD is much better than (smaller than) COR.
Coefficient of Mean Deviation
This is the ratio of average absolute deviation to mean value. For bug repair time
data, the ratio is computed as follows:
Average absolute deviation 8.669
Mean 20.517
Ratio 0.423
Coefficient of MAD
This is the ratio of MAD to median. For bug repair time data, the ratio is computed
as follows:
MAD 6
Median 18
Ratio 0.333
Coefficient of Standard Deviation
This is the ratio of standard deviation to mean. For bug repair time data, the ratio
is calculated as follows:
SD 11.284
Mean 20.517
Ratio 0.550
This ratio is commonly known as coefficient of variation (COV). It can be
expressed as a percentage. For bug repair time data, COV can be expressed as 55%.
This is also called relative standard deviation (RSD).
Summary of Coefficients of Dispersion
For bug repair time data, the coefficient of dispersion has been studied using five
different conventions, summarized as follows:
1. COR 0.8125
2. CQD 0.342
3. Coefficient of mean deviation 0.423
4. Coefficient of median deviation (coefficient of MAD) 0.333
5. Coefficient of standard deviation 0.550
Higher values of this coefficient indicate problems because variation is seen as a
risk. Estimates 1, 3, and 5 have been influenced by extreme values. Estimates 2 and
4 are robust, without any influence from extreme values. The true capability of the
bug repair process is indicated by estimates 2 and 4.
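The five coefficients can be reproduced from the summary statistics quoted in this chapter for the bug repair time data. The sketch below is only an arithmetic check; the variable names and dictionary labels are illustrative.

```python
# Summary statistics quoted in this chapter for the bug repair time data (days).
maximum, minimum = 58, 6
q1, q3 = 13, 26.5
mean, median = 20.517, 18
avg_abs_dev, mad, sd = 8.669, 6, 11.284

coefficients = {
    "range (COR)": (maximum - minimum) / (maximum + minimum),
    "quartile deviation (CQD)": (q3 - q1) / (q3 + q1),
    "mean deviation": avg_abs_dev / mean,
    "MAD": mad / median,
    "standard deviation (COV)": sd / mean,
}

for name, value in coefficients.items():
    print(f"{name:26s} {value:.4f}")
```

Each ratio divides a spread measure by a location measure in the same units, so all five are dimensionless and can be compared directly.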
Application Contexts
The statistic “dispersion measure” is most sensitive to context. Measures of
dispersion can be applied in three prominent contexts: process control, experiments, and
risk management.
Variation is unavoidable in software processes. In the manufacturing context,
variation is the least in machine-controlled processes. Manual processes of hardware
production have a few orders of magnitude more variation. Software processes
have several orders of magnitude more variation. Software processes first have
human variation; next, most software processes are of a problem-solving nature and
thus reflect variation in the complexity of the problem. Hence, Shewhart’s common
and special cause variations do not completely represent software process variation.
In software processes, variation has subtler components, including genetic variation
of agents and entropy of the problem scenario. We would rather attempt to
understand variation before we classify variation, in tune with the philosophy of Deming
[1], which propounded that understanding variation is part of profound
knowledge. Categorizing variation into types is divisive, whereas finding a numerical
expression for dispersion is integrative. The numerical expression, robust enough
to deal with nonnormal data, is MAD and can be used as a measure of process
performance in performance scorecards. For instance, in the case of bug repair, the
following two values represent the process:
Median 18 days
MAD 6 days
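A median and MAD pair of this kind could be computed from raw data as sketched below. The sketch assumes that MAD denotes the median of absolute deviations from the median; the function name and the sample values are hypothetical and are not the bug repair data.

```python
from statistics import median


def median_abs_deviation(data):
    """Median of absolute deviations from the median (assumed meaning of MAD)."""
    center = median(data)
    return median(abs(x - center) for x in data)


# Hypothetical repair times in days, including one extreme value.
hypothetical_days = [5, 8, 9, 12, 40]
print(median(hypothetical_days), median_abs_deviation(hypothetical_days))
```

Replacing the extreme value 40 with a far larger number leaves both statistics unchanged, which illustrates why the pair suits nonnormal data.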
If we study variation in experimental data, we will have a different context. In
experiments, variation is treated as error. Truth is in the center. The standard
deviation is a good measure to represent error. If the measured value is positive, we will
benefit from using coefficient of standard deviation. When we do an experiment to
measure productivity, we can express the experimental result as a mean ± % RSD
(relative standard deviation or coefficient of standard deviation). For example, the
mean productivity of 120 LOC per day ±30% RSD could be a good expression of
experimental study.
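Reporting a result as mean ± % RSD might look like the following sketch. The function name and the productivity readings are hypothetical, and the sample standard deviation (divisor n − 1) is assumed.

```python
from statistics import mean, stdev


def mean_with_rsd(data):
    """Format measurements as 'mean ± RSD%', with RSD = 100 * SD / mean."""
    m = mean(data)
    if m <= 0:
        raise ValueError("RSD is only meaningful for a positive mean")
    rsd = 100 * stdev(data) / m
    return f"{m:.0f} ± {rsd:.0f}% RSD"


# Hypothetical productivity readings in LOC per day.
print(mean_with_rsd([90, 120, 150]))
```

The guard on the mean reflects the condition stated above that the measured value should be positive.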
Risk managers need a mathematical expression for variation. Of all the options,
the standard deviation is a close enough approximation that works well for risk
assessments.
The measures of dispersion given in this chapter provide a basic entry into the
subject. For a cohesive understanding, variation should be modeled by methods
given in Section II of this book.