Chapter 3 - Data Dispersion (4/4)

50 ◾ Simple Statistical Methods for Software Engineering

In a Nutshell

Dispersion denitions used in chapter, in a nutshell, are presented as follows:

Measures of Dispersion

1. Range: maximum–minimum

2. Percentile range: 97th–3rd percentile

3. IQR: $Q_3 - Q_1$

4. Average deviation: average deviation from mean

5. Average absolute deviation: average absolute deviation from mean

6. MAD: median value of absolute deviations from median

7. Sum of squares: sum of squares of deviations from mean

8. Variance: square of standard deviation

9. Standard deviation: $SD = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
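The nine measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `percentile` and `dispersion_measures` are hypothetical, the percentile rule (linear interpolation between closest ranks) is one of several conventions in use and may differ from the one used elsewhere in the chapter, and the variance and standard deviation use the sample form with an n − 1 denominator.

```python
import statistics as st


def percentile(data, p):
    """Percentile (p in 0..100) by linear interpolation between closest ranks.

    Assumption: other tools may use a different interpolation convention.
    """
    xs = sorted(data)
    k = (len(xs) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def dispersion_measures(data):
    """Return measures 1 to 9 from the list above as a dictionary."""
    data = list(data)
    mean = st.mean(data)
    median = st.median(data)
    deviations = [x - mean for x in data]
    return {
        "range": max(data) - min(data),
        "percentile_range": percentile(data, 97) - percentile(data, 3),
        "iqr": percentile(data, 75) - percentile(data, 25),
        # Deviations from the mean cancel out, so this is zero up to rounding.
        "average_deviation": st.mean(deviations),
        "average_absolute_deviation": st.mean([abs(d) for d in deviations]),
        "mad": st.median([abs(x - median) for x in data]),
        "sum_of_squares": sum(d * d for d in deviations),
        "variance": st.variance(data),          # sample variance, n - 1
        "standard_deviation": st.stdev(data),   # sample SD, n - 1
    }


if __name__ == "__main__":
    for name, value in dispersion_measures([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(f"{name}: {value:.3f}")
```

Note that the average deviation (item 4) comes out as zero for any data set, because positive and negative deviations from the mean cancel; this is why the absolute and squared forms are the ones used in practice.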

Nature of Dispersion

10. Pearson’s skewness: $\text{Skewness} = \dfrac{3(\text{Mean} - \text{Median})}{SD}$

11. Quartile skewness: $\text{Bowley's skewness} = \dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

12. Third standardized moment: $\text{Skewness} = \dfrac{n}{(n-1)(n-2)} \sum \left( \dfrac{x - \bar{x}}{SD} \right)^3$

13. $\text{Kurtosis} = \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \left( \dfrac{x - \bar{x}}{SD} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}$
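The four shape measures can be sketched as follows. The function names are hypothetical, the quartiles rely on an assumed linear-interpolation percentile rule, and SD is taken to be the sample standard deviation (n − 1 denominator), so results may differ slightly from tools that use other conventions.

```python
import statistics as st


def percentile(data, p):
    """Percentile (p in 0..100) by linear interpolation (assumed convention)."""
    xs = sorted(data)
    k = (len(xs) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def pearson_skewness(data):
    """3 * (mean - median) / SD."""
    return 3 * (st.mean(data) - st.median(data)) / st.stdev(data)


def bowley_skewness(data):
    """(Q3 + Q1 - 2 * median) / (Q3 - Q1); undefined when Q3 equals Q1."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    return (q3 + q1 - 2 * st.median(data)) / (q3 - q1)


def moment_skewness(data):
    """Third standardized moment with small-sample correction (needs n >= 3)."""
    n, mean, sd = len(data), st.mean(data), st.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in data)


def kurtosis(data):
    """Excess kurtosis with small-sample correction (needs n >= 4)."""
    n, mean, sd = len(data), st.mean(data), st.stdev(data)
    fourth = sum(((x - mean) / sd) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 20]  # made-up, right-skewed illustration
    print(pearson_skewness(sample), bowley_skewness(sample))
    print(moment_skewness(sample), kurtosis(sample))
```

For perfectly symmetric data such as `[1, 2, 3, 4, 5]`, all three skewness measures return zero, while a long right tail pushes the Pearson and moment-based values positive.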

Coefficients of Dispersion

14. Coecient of range: COR

Max Min

−

( )

15. Coecient of quartile deviation

CQD

Quartile Quartile

−

3 1

16. Coecient of mean deviation: ratio of average absolute deviation to mean

17. Coecient of MAD: ratio of MAD to median

18. Coecient of standard deviation: ratio of standard deviation to mean

Data Dispersion ◾ 51

Case Study: Dispersion Analysis of Data Sample

This case study is from a support project. The data volume is large: around 15,000 incidents are logged every week. The turnaround time (TAT) for resolving the issues is taken for our study. We take a random sample of 30 data points from the database for dispersion analysis. Using a sample has its own risks: we may obtain a limited view of reality, and the dispersion seen in the entire database may be quite large. However, analysis of a small sample is easy and provides a perspective and guidance for further analysis. The range of the 30-point sample is 199 days, which seems odd given that there are tightly controlled service level agreements. The analyst remembers a 7-day service level agreement and is prompted to form clusters in the data. Three clusters emerge by visual analysis. The first cluster agrees with the analyst's recollection: the data are around 7 days. There seems to be a second cluster around 50 days. Two data points show extreme values of 80 and 200 days. A better evaluation of dispersion is possible if we use coefficients of dispersion, and the analyst chooses the coefficient of MAD (CMAD). For the raw data, CMAD is high. After forming clusters and creating categories, the CMAD values become small and reasonable, as shown in Figure 3.1.
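The effect of categorizing before measuring dispersion can be sketched as follows. Everything specific here is an assumption made for illustration: the turnaround times are invented (they are not the case-study sample), the cut-offs of 20 and 70 days are hypothetical stand-ins for the analyst's visual clustering, and `cmad` and `categorize` are hypothetical names.

```python
import statistics as st
from collections import defaultdict


def cmad(data):
    """Coefficient of MAD: median absolute deviation divided by the median."""
    median = st.median(data)
    mad = st.median([abs(x - median) for x in data])
    return mad / median


def categorize(tat, a_max=20, b_max=70):
    """Assign a turnaround time (days) to a category using hypothetical cut-offs."""
    if tat <= a_max:
        return "A"
    if tat <= b_max:
        return "B"
    return "C"


# Invented turnaround times in days, NOT the data from the case study.
tat_days = [5, 6, 6, 7, 7, 7, 8, 8, 9, 45, 48, 50, 52, 55, 80, 200]

groups = defaultdict(list)
for t in tat_days:
    groups[categorize(t)].append(t)

if __name__ == "__main__":
    print(f"Raw: CMAD = {cmad(tat_days):.3f}")
    for name in ("A", "B"):  # Figure 3.1 reports only Raw, A, and B
        print(f"Category {name}: CMAD = {cmad(groups[name]):.3f}")
```

With mixed clusters the median sits between groups and the absolute deviations are inflated; within a single category the deviations are measured against a representative center, so the coefficient shrinks.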

Category C seems to consist of special cases; perhaps those events were put on a low-priority queue and were taken up very late. There is no information regarding this in the database; the only data logged are the time stamps of entry and release. A dispersion analysis of the data sample brought out the problem in the database and prompted the analyst to create categories. Lessons learned from the dispersion analysis of the data sample help in designing a framework for the big job: analysis of the total database.

Figure 3.1 Dispersion analysis of data sample. (Bar chart of the coefficient of median absolute deviation, on a vertical axis from 0.000 to 3.500, for Raw, Category A, and Category B.)

Review Questions

1. When will you use IQR instead of full range?

2. When will you use 3% trimmed range instead of full range?

3. When will you use range calculation based on standard deviation?

4. What is the benet of using coecients of dispersion?

5. Why do we prefer absolute deviations instead of plain deviations?

Exercises

1. Calculate skew in the data provided in Figure 3.1 in the case study using Pearson’s and Bowley’s methods. Compare the results.

2. Use the bug repair time data from Data 3.4 and calculate the coefficient of standard deviation. What is your judgment on dispersion? Is it high or normal?

