50 Simple Statistical Methods for Software Engineering
In a Nutshell
The dispersion definitions used in this chapter are summarized below.
Measures of Dispersion
1. Range: maximum − minimum
2. Percentile range: 97th percentile − 3rd percentile
3. IQR: $Q_3 - Q_1$
4. Average deviation: average deviation from mean
5. Average absolute deviation: average absolute deviation from mean
6. MAD: median value of absolute deviations from median
7. Sum of squares: sum of squares of deviations from mean
8. Variance: square of standard deviation
9. Standard deviation: $SD = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
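The nine measures above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own implementation: the function name `dispersion_measures`, the dictionary keys, and the sample values are made up here, and the percentiles use the linear-interpolation ("inclusive") convention of the standard library, which is only one of several conventions in use.

```python
import math
import statistics


def dispersion_measures(values):
    """Return the dispersion measures listed above for a numeric sample."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)

    # Percentiles and quartiles by linear interpolation (an assumed convention).
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

    deviations = [x - mean for x in values]
    sum_of_squares = sum(d * d for d in deviations)
    variance = sum_of_squares / (n - 1)  # sample variance, n - 1 divisor

    return {
        "range": max(values) - min(values),
        "percentile_range": percentiles[96] - percentiles[2],  # 97th - 3rd
        "iqr": q3 - q1,
        "average_deviation": sum(deviations) / n,  # signed deviations cancel
        "average_absolute_deviation": sum(abs(d) for d in deviations) / n,
        "mad": statistics.median(abs(x - median) for x in values),
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": math.sqrt(variance),
    }


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustrative values
for name, value in dispersion_measures(sample).items():
    print(f"{name}: {value:.3f}")
```

The average deviation comes out at essentially zero because positive and negative deviations from the mean cancel, which is why the absolute and squared variants are the ones used in practice.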
Nature of Dispersion
10. Pearson's skewness: $\text{Skewness} = \dfrac{3(\text{Mean} - \text{Median})}{SD}$
11. Quartile skewness: $\text{Bowley's skewness} = \dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
12. Third standardized moment: $\text{Skewness} = \dfrac{n}{(n-1)(n-2)} \sum \left(\dfrac{x_i - \bar{x}}{s}\right)^3$
13. $\text{Kurtosis} = \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \left(\dfrac{x_i - \bar{x}}{s}\right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}$
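The four shape measures can be sketched the same way. The function names below are illustrative, $s$ is taken to be the sample standard deviation with the $n - 1$ divisor, and the quartiles again use the standard library's "inclusive" interpolation, so results may differ slightly from tools that use another quartile convention.

```python
import statistics


def pearson_skewness(values):
    """3 * (mean - median) / SD."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


def bowley_skewness(values):
    """(Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * median) / (q3 - q1)


def _standardized(values):
    """Deviations from the mean divided by the sample standard deviation."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def moment_skewness(values):
    """Third standardized moment with the small-sample correction (needs n >= 3)."""
    n = len(values)
    z = _standardized(values)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)


def kurtosis(values):
    """Excess kurtosis with the small-sample correction (needs n >= 4)."""
    n = len(values)
    z = _standardized(values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second


right_skewed = [1, 1, 2, 2, 3, 10]  # made-up values with a long right tail
print(pearson_skewness(right_skewed), moment_skewness(right_skewed))
```

Pearson's and Bowley's measures depend only on the mean, median, quartiles and SD, so they react less to a single extreme value than the moment-based skewness does.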
Coefficients of Dispersion
14. Coefficient of range: $COR = \dfrac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$
15. Coefficient of quartile deviation: $CQD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
16. Coefficient of mean deviation: ratio of average absolute deviation to mean
17. Coefficient of MAD: ratio of MAD to median
18. Coefficient of standard deviation: ratio of standard deviation to mean
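Because each coefficient is a ratio of a dispersion measure to a measure of location, all five can be computed together. The sketch below assumes positive-valued data (the ratios are not meaningful when the mean or median is zero); the function name, dictionary keys, and sample values are illustrative.

```python
import statistics


def coefficients_of_dispersion(values):
    """Return the five coefficients of dispersion for a positive-valued sample."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    # Quartiles by linear interpolation (an assumed convention).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    avg_abs_dev = sum(abs(x - mean) for x in values) / len(values)
    mad = statistics.median(abs(x - median) for x in values)
    return {
        "COR": (max(values) - min(values)) / (max(values) + min(values)),
        "CQD": (q3 - q1) / (q3 + q1),
        "coefficient_of_mean_deviation": avg_abs_dev / mean,
        "CMAD": mad / median,
        "coefficient_of_standard_deviation": statistics.stdev(values) / mean,
    }


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustrative values
for name, value in coefficients_of_dispersion(sample).items():
    print(f"{name}: {value:.3f}")
```

The coefficients are dimensionless, so they can be compared across data sets measured in different units or on different scales.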
Data Dispersion 51
Case Study: Dispersion Analysis of Data Sample
This case study is from a support project. The data volume is large: around
15,000 incidents are logged every week. The turnaround time (TAT) for resolving
the issues is taken for our study. We take a random sample of 30 data points
from the database for dispersion analysis. Using a sample has its own risks: we may
obtain a limited view of reality, and the dispersion seen in the entire database may be
quite large. However, analysis of a small sample is easy and provides a perspective
and guidance for further analysis. The range of the 30-point sample is 199 days, which
seems odd given that there are tightly controlled service level agreements.
The analyst remembers a 7-day service level agreement and is prompted to form
clusters in the data. Three clusters emerge by visual analysis. The first cluster agrees
with the analyst's recollection: the data are around 7 days. There seems to
be a second cluster around 50 days. Two data points show extreme values of 80 and
200 days. A better evaluation of dispersion is possible if we use coefficients of
dispersion, and the analyst chooses the coefficient of MAD (CMAD). For the raw data,
CMAD is high. After forming clusters and creating categories, the CMAD values
become small and reasonable, as shown in Figure 3.1.
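The workflow described above (compute CMAD on the raw sample, split the sample into categories, then recompute CMAD per category) might be sketched as follows. The turnaround times and the cut points of 20 and 70 days are hypothetical, chosen only to mimic clusters near 7 and 50 days plus two extreme values; they are not the case study's data, and the function names are illustrative.

```python
import statistics


def cmad(values):
    """Coefficient of MAD: median absolute deviation divided by the median."""
    median = statistics.median(values)
    mad = statistics.median(abs(x - median) for x in values)
    return mad / median


def categorize(tat_days, low_cut=20, high_cut=70):
    """Split turnaround times into three categories at hypothetical cut points."""
    categories = {"A": [], "B": [], "C": []}
    for days in tat_days:
        if days < low_cut:
            categories["A"].append(days)
        elif days < high_cut:
            categories["B"].append(days)
        else:
            categories["C"].append(days)
    return categories


# Hypothetical turnaround times in days; NOT the sample from the case study.
tat_days = [5, 6, 6, 7, 7, 8, 8, 9, 44, 46, 48, 50, 52, 54, 80, 200]

print(f"Raw: CMAD = {cmad(tat_days):.3f}")
for name, members in categorize(tat_days).items():
    print(f"Category {name}: n = {len(members)}, CMAD = {cmad(members):.3f}")
```

With these made-up values, the raw CMAD comes out well above the CMAD of categories A and B, which mirrors the qualitative pattern the case study reports.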
Category C seems to consist of special cases; perhaps those events were put on a
low-priority queue and were taken up very late. There is no information regarding
this in the database; the only data logged are the time stamps of entry and
release. A dispersion analysis of the data sample brought out the problem in the
database and prompted the analyst to create categories. Lessons learned from
the dispersion analysis of the data sample help in designing a framework for the big job:
analysis of the total database.
[Figure: bar chart of the coefficient of median absolute deviation (vertical axis, 0.000 to 3.500) for Raw, Category A, and Category B.]
Figure 3.1 Dispersion analysis of data sample.
Review Questions
1. When will you use IQR instead of full range?
2. When will you use 3% trimmed range instead of full range?
3. When will you use range calculation based on standard deviation?
4. What is the benefit of using coefficients of dispersion?
5. Why do we prefer absolute deviations instead of plain deviations?
Exercises
1. Calculate skew in the data provided in Figure 3.1 in the case study using
Pearson's and Bowley's methods. Compare the results.
2. Use the bug repair time data from Data 3.4 and calculate the coefficient of
standard deviation. What is your judgment on dispersion? Is it high or normal?