Chapter 3 - Data Dispersion (3/4)


Data Dispersion ◾ 45

Skewness and Kurtosis

Pearson’s Skewness

Skewness is a measure of asymmetry in data. Pearson’s formula for skewness is based on the difference between mean and mode. The larger the difference, the greater the asymmetry. The formula is given as follows:

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \tag{3.4}$$

Applying this formula to bug repair time data, we obtain a skewness of 0.666.

The value is positive, indicating a longer tail of large values. Such data are said to be skewed to the right. If the skewness is negative, data would be negatively skewed, or skewed to the left.

If the mode is ill defined, then we can use the following modified formula based on the difference between mean and median:

$$\text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \tag{3.5}$$

For bug repair time data, this formula yields a skewness of 0.669.
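As a rough illustration of Equations 3.4 and 3.5, the sketch below computes both Pearson coefficients with Python's standard `statistics` module. The function names and the `sample` list are assumptions made for this example; the sample is hypothetical and is not the book's bug repair data set. The final line uses only the summary values quoted in this chapter (mean 20.517, median 18, standard deviation 11.284).

```python
import statistics


def pearson_mode_skewness(data):
    """Equation 3.4: (mean - mode) / sample standard deviation."""
    mean = statistics.mean(data)
    mode = statistics.mode(data)  # first most common value; unreliable if multimodal
    sd = statistics.stdev(data)
    return (mean - mode) / sd


def pearson_median_skewness(data):
    """Equation 3.5: 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)
    return 3 * (mean - median) / sd


def pearson_median_skewness_from_stats(mean, median, sd):
    """Equation 3.5 applied to already-computed summary statistics."""
    return 3 * (mean - median) / sd


# Hypothetical repair times in days, for illustration only.
sample = [6, 9, 12, 12, 15, 18, 21, 30, 58]
print(pearson_mode_skewness(sample) > 0)    # right-skewed sample
print(pearson_median_skewness(sample) > 0)  # right-skewed sample

# Summary statistics quoted in the chapter for the bug repair time data.
print(round(pearson_median_skewness_from_stats(20.517, 18, 11.284), 3))  # 0.669
```

The median-based form is the safer default when several values tie for the mode, because `statistics.mode` then simply returns the first of them.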

Bowley’s Skewness

A robust estimate of skewness is based on quartiles and median. This is also known as quartile skewness or Bowley’s skewness. The formula is given as follows:

$$\text{Bowley's skewness} = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1} \tag{3.6}$$

For bug repair time data, Bowley’s skewness was calculated as 0.259. This value is much smaller than Pearson’s value. Bowley’s skewness is on a different scale; it varies from −1 to +1. Pearson’s skewness varies from −3 to +3.
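The quartile formula can be checked with a few lines of Python. This is a minimal sketch; the function name `bowley_skewness` is an assumption for illustration. The inputs are the quartiles and median reported in this chapter for the bug repair time data (Q1 = 13, median = 18, Q3 = 26.5 days).

```python
def bowley_skewness(q1, median, q3):
    """Quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 are equal; quartile skewness is undefined")
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Quartiles and median quoted in the chapter for the bug repair time data.
print(round(bowley_skewness(13, 18, 26.5), 3))  # 0.259
```

Because only the quartiles and the median enter the calculation, a single extreme repair time cannot move the result, which is why the measure is described as robust.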

Third Standardized Moment

Skewness can be considered using the method of moments. Skewness is the third standardized moment. This convention is followed in Excel in the function SKEW that uses the following formula:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 \tag{3.7}$$

where n is the sample size, $\bar{x}$ is the sample mean, and s is the sample standard deviation.

46 ◾ Simple Statistical Methods for Software Engineering

It may be noted that in the formula, deviations are raised to the third power.

Also, like in all skewness calculations, the value is normalized or standardized with

a division by standard deviation.

The moment-based skewness for bug repair time data, calculated with the Excel function SKEW, is 1.271. This is a more sensitive measure of skewness.
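For readers working outside Excel, Equation 3.7 can be written directly in Python. The sketch below is an illustration under stated assumptions: the function name `sample_skewness` and the two small data lists are hypothetical, and the raw bug repair data are not reproduced here, so the value 1.271 is not recomputed.

```python
import statistics


def sample_skewness(data):
    """Equation 3.7: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    if s == 0:
        raise ValueError("standard deviation is zero; skewness is undefined")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


# Hypothetical data, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]

print(abs(sample_skewness(symmetric)) < 1e-12)  # symmetric data: skewness of zero
print(sample_skewness(right_skewed) > 0)        # long right tail: positive skewness
```

Cubing the standardized deviations is what makes this measure sensitive: one value far from the mean contributes far more than many values close to it.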

Kurtosis

The flatness of data is measured as kurtosis. The lower the value of kurtosis, the flatter the data distribution. There are different conventions in computing kurtosis. The Excel function KURT uses a formula for kurtosis given as follows:

$$\text{Kurtosis} = \left[ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^4 \right] - \frac{3(n-1)^2}{(n-2)(n-3)} \tag{3.8}$$

where n is the sample size, $\bar{x}$ is the sample mean, and s is the sample standard deviation.

The formula has been adjusted to make the kurtosis of normally distributed data equal to zero. The Pearson method of calculating kurtosis yields a value of 3 for normal distribution. If we subtract 3 from the Pearson result, we will obtain excess kurtosis. Hence, the Excel KURT formula gives “excess kurtosis,” the value in excess of normal kurtosis. If this “excess kurtosis” value is positive, data are more peaked; if it is negative, data are broader.

Kurtosis for bug repair time data has been calculated. It is +1.676; hence, data

are peaked.
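Equation 3.8 translates into Python in the same way as the skewness formula. This is a sketch only; the function name `excess_kurtosis` and the test data are assumptions for illustration, and the bug repair value of +1.676 is not recomputed because the raw data are not listed here.

```python
import statistics


def excess_kurtosis(data):
    """Equation 3.8: bias-adjusted excess kurtosis (zero for normal data)."""
    n = len(data)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    if s == 0:
        raise ValueError("standard deviation is zero; kurtosis is undefined")
    fourth_powers = sum(((x - mean) / s) ** 4 for x in data)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading * fourth_powers - correction


# Evenly spaced (flat) hypothetical data give a negative value.
print(round(excess_kurtosis([1, 2, 3, 4, 5]), 3))  # -1.2
```

The subtracted correction term plays the role of the "minus 3" described above, adjusted for sample size.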

BOX 3.4 SKEWED LIFE

A good amount of software project data are skewed. Symmetrical and normal

data are an exception. Data from simple processes show symmetry. Data from

complex processes are skewed. Software development is certainly a collection

of several processes and is expected to produce skewed data. If data collection

is a process of observation, then we must recognize skew in data and learn to

accept skew as a reality of life. The transformation of skewed data into symmetrical data is an artificial step, often performed to apply some statistical tests. The untransformed raw data from software projects are often skewed.

An outstanding example is the skew in complexity data. Another is skew in

TAT data. In such cases, skew is the DNA of a process. Skew may restrict the

application of several classic statistical methods while testing the data, but

that is a secondary issue.


Coefficient of Dispersion

The term coefficient is commonly used in algebra. The coefficient of a variable tells us the magnitude of the effect of the variable on the result. In metallurgy, the coefficient of expansion of metals can be used to calculate the expansion of metals. Here the coefficient is a metal property. The design of a coefficient of dispersion has a different purpose, although the connotations of the term are not entirely strange.

Coefficient of Range

The simplest coefficient of dispersion is the coefficient of range (COR), calculated as follows:

$$\text{COR} = \frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}} \tag{3.9}$$

For the bug repair data, COR can be computed as follows:

Max = 58 days
Min = 6 days
Max − Min = 52 days
Max + Min = 64 days
COR = 0.8125 (dimensionless ratio)

Coefficient of Quartile Deviation

COR is based on extreme values and hence is not robust. Coefficient of Quartile Deviation (CQD) is based on quartiles and hence is not influenced by extreme values. The formula for CQD is given as follows:

$$\text{CQD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \tag{3.10}$$

For the bug repair time data, CQD is computed as follows:

Q3 = 26.5 days
Q1 = 13 days
Q3 − Q1 = 13.5 days
Q3 + Q1 = 39.5 days
CQD = 0.342 (dimensionless ratio)

It may be seen that using quartiles gives a favorable value for the process of

repairing bugs. CQD is much better than (smaller than) COR.
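The difference in robustness between COR and CQD can be seen by replacing one value in a small data set with an extreme one. The sketch below is illustrative: the function names and both data lists are hypothetical, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is only one of several quartile conventions and may differ from the one used for the book's figures.

```python
import statistics


def coefficient_of_range(data):
    """Equation 3.9: (max - min) / (max + min), for positive data."""
    lo, hi = min(data), max(data)
    return (hi - lo) / (hi + lo)


def coefficient_of_quartile_deviation(data):
    """Equation 3.10: (Q3 - Q1) / (Q3 + Q1), for positive data."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)


# Hypothetical repair times in days; the second list swaps the largest
# value for an extreme one.
base = [6, 9, 12, 13, 15, 18, 21, 24, 27]
with_outlier = base[:-1] + [58]

print(coefficient_of_range(base), coefficient_of_range(with_outlier))
print(coefficient_of_quartile_deviation(base),
      coefficient_of_quartile_deviation(with_outlier))
```

In this example COR rises when the extreme value is introduced, while CQD is unchanged because neither quartile moves.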


Coefficient of Mean Deviation

This is the ratio of average absolute deviation to mean value. For bug repair time data, the ratio is computed as follows:

Average absolute deviation = 8.669
Mean = 20.517
Ratio = 0.423

Coefficient of MAD

This is the ratio of MAD to median. For bug repair time data, the ratio is computed as follows:

MAD = 6
Median = 18
Ratio = 0.333

Coefficient of Standard Deviation

This is the ratio of standard deviation to mean. For bug repair time data, the ratio is calculated as follows:

SD = 11.284
Mean = 20.517
Ratio = 0.550

This ratio is commonly known as coefficient of variation (COV). It can be expressed as a percentage. For bug repair time data, COV can be expressed as 55%. This is also called relative standard deviation (RSD).

Summary of Coefficients of Dispersion

For bug repair time data, the coefficient of dispersion has been studied using five different conventions, summarized as follows:

1. Coefficient of range (COR): 0.8125
2. Coefficient of quartile deviation (CQD): 0.342
3. Coefficient of mean deviation: 0.423
4. Coefficient of MAD: 0.333
5. Coefficient of standard deviation: 0.550

Higher values of this coefficient indicate problems because variation is seen as a risk. Estimates 1, 3, and 5 have been influenced by extreme values. Estimates 2 and 4 are robust, without any influence from extreme values. The true capability of the bug repair process is indicated by estimates 2 and 4.
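The five estimates can be reproduced from the summary statistics quoted in this chapter. The following sketch assumes nothing beyond those published values; the dictionary keys and variable names are chosen for illustration.

```python
# Summary statistics quoted in the chapter for the bug repair time data (days).
stats = {
    "max": 58, "min": 6,
    "q1": 13, "q3": 26.5,
    "mean": 20.517, "median": 18,
    "avg_abs_dev": 8.669, "mad": 6, "sd": 11.284,
}

coefficients = {
    "1. COR": (stats["max"] - stats["min"]) / (stats["max"] + stats["min"]),
    "2. CQD": (stats["q3"] - stats["q1"]) / (stats["q3"] + stats["q1"]),
    "3. Coefficient of mean deviation": stats["avg_abs_dev"] / stats["mean"],
    "4. Coefficient of MAD": stats["mad"] / stats["median"],
    "5. Coefficient of standard deviation": stats["sd"] / stats["mean"],
}

for name, value in coefficients.items():
    print(f"{name}: {value:.3f}")
```

All five are dimensionless ratios, so they can be compared with each other and across processes measured in different units.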

Application Contexts

The statistic “dispersion measure” is most sensitive to context. Measures of dispersion can be applied in three prominent contexts: process control, experiments, and risk management.

Variation is unavoidable in software processes. In the manufacturing context, variation is the least in machine-controlled processes. Manual processes of hardware production have a few orders of magnitude more variation. Software processes have several orders of magnitude more variation. Software processes first have human variation; next, most software processes are of a problem-solving nature and thus reflect variation in the complexity of the problem. Hence, Shewhart’s common and special cause variations do not completely represent software process variation. In software processes, variation has subtler components, including genetic variation of agents and entropy of the problem scenario. We would rather attempt to understand variation before we classify it, in tune with the philosophy of Deming [1], who propounded that understanding variation is part of profound knowledge. Categorizing variation into types is divisive, whereas finding a numerical expression for dispersion is integrative. The numerical expression, robust enough to deal with nonnormal data, is MAD, and it can be used as a measure of process performance in performance scorecards. For instance, in the case of bug repair, the following two values represent the process:

Median = 18 days
MAD = 6 days
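A scorecard pair of this kind can be produced with a short function. The sketch below assumes that MAD means the median of absolute deviations from the median; the function name and both data lists are hypothetical and are not the book's data.

```python
import statistics


def median_and_mad(data):
    """Return (median, MAD), taking MAD as the median absolute deviation
    from the median (an assumption made for this sketch)."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    return med, mad


# Hypothetical repair times in days; the second list contains one extreme value.
typical = [10, 12, 18, 24, 30]
with_extreme = [10, 12, 18, 24, 70]

print(median_and_mad(typical))       # (18, 6)
print(median_and_mad(with_extreme))  # (18, 6)
```

Both lists give the same pair, which illustrates why these two values suit nonnormal process data: a single extreme repair time leaves them unchanged.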

If we study variation in experimental data, we will have a different context. In experiments, variation is treated as error. Truth is in the center. The standard deviation is a good measure to represent error. If the measured value is positive, we will benefit from using coefficient of standard deviation. When we do an experiment to measure productivity, we can express the experimental result as a mean ± % RSD (relative standard deviation or coefficient of standard deviation). For example, the mean productivity of 120 LOC per day ±30% RSD could be a good expression of experimental study.
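Reporting a result as mean ± % RSD is a small formatting step once the mean and standard deviation are known. The sketch below is illustrative only: the function name `mean_with_rsd` and the three productivity readings are hypothetical.

```python
import statistics


def mean_with_rsd(values):
    """Format measurements as 'mean ± RSD%', where RSD = 100 * SD / mean."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("RSD is meaningful only for a positive mean")
    rsd_percent = 100 * statistics.stdev(values) / mean
    return f"{mean:.0f} ± {rsd_percent:.0f}% RSD"


# Hypothetical productivity readings in LOC per day.
print(mean_with_rsd([100, 120, 140]))  # 120 ± 17% RSD
```

The guard on the mean reflects the condition stated above that the measured value should be positive for the coefficient to be useful.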

Risk managers need a mathematical expression for variation. Of all the options,

the standard deviation is a close enough approximation that works well for risk

assessments.

The measures of dispersion given in this chapter provide a basic entry into the subject. For a cohesive understanding, variation should be modeled by methods given in Section II of this book.
