Chapter 3
Data Dispersion
Data dispersion arises because of sources of variation, including variations in
measurements and results due to changes in the underlying process. Which one varies
more, measurement or processes under measurement? We proceed in this chapter with the
assumption that there is sufficient measurement capability behind the data; that means
the measurement errors are very small compared with process variation. If this condition
is met, dispersion in data will represent dispersion in the process under measurement.
Range-Based Empirical Representation
Dispersion or variation in a process is viewed as uncertainty in the process outcome.
To deal with uncertainty, we need to measure, express, and understand it. Range is
a good old way of measuring dispersion. It has been used in the traditional X-Bar
and R control charts, where R stands for range and X-Bar stands for sample mean.
Range is the difference between the maximum and minimum values in the data.
There is another convention: leave out the extreme values and then consider the range.
Typically, the values below the 3rd percentile and the values above the 97th
percentile are disregarded. To understand this, we need to construct the data array,
sort the data in ascending order, and then chop off the upper 3% and the lower 3%.
(If the length of the array is L, the 3rd percentile point will rest at a
point on the array at a distance of 0.03L from the origin. Similarly, the point of the
97th percentile will rest at a distance of 0.97L from the origin.) This range is the
empirical difference between the 3rd and the 97th percentile values.
The two calculations previously mentioned are conservative. In a third approach,
the interquartile range (IQR) is taken as the core variation of the process. IQR is the
difference between Q3 and Q1. It may be noted that 50% of observations lie within the IQR.
36 Simple Statistical Methods for Software Engineering
Here is a summary of the three empirical expressions of dispersion:
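\[
\begin{aligned}
\text{Range} &= \text{maximum} - \text{minimum} \\
\text{Percentile range} &= \text{97th percentile} - \text{3rd percentile} \\
\text{IQR} &= Q_3 - Q_1
\end{aligned}
\]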
Example 3.1: Studying Variation in Design Effort
Design effort data as a percentage of project effort have been collected. The data
are shown in Data 3.1.
The dispersion analysis of the data using the above-mentioned formulas is
shown as follows:
                     % Design Effort
Max                  100.00
Min                  2.87
Full range           97.13
3rd percentile       4.478
97th percentile      91.616
Percentile range     87.139
1st quartile (Q1)    9.932
3rd quartile (Q3)    24.933
IQR                  15.001
Data 3.1 Design Effort Data
% Design Effort
2.87
11.55
11.11
18.55
21.08
100.00
83.56
6.75
6.33
9.71
21.25
6.02
30.63
13.14
26.16
18.85
32.14
10.59
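The three ranges reported in Example 3.1 can be reproduced approximately with a short Python sketch. It assumes the percentiles are found by linear interpolation over the sorted data (the `inclusive` method of the standard-library `statistics.quantiles`). That is an assumption about how the example's figures were computed, and the variable names are illustrative. Because the listed values are rounded to two decimals, small differences in the last digit are expected.

```python
import statistics

# % design effort values as listed in Data 3.1
design_effort = [
    2.87, 11.55, 11.11, 18.55, 21.08, 100.00, 83.56, 6.75, 6.33,
    9.71, 21.25, 6.02, 30.63, 13.14, 26.16, 18.85, 32.14, 10.59,
]

# Assumed convention: linear interpolation between sorted values.
percentiles = statistics.quantiles(design_effort, n=100, method="inclusive")
quartiles = statistics.quantiles(design_effort, n=4, method="inclusive")

p3, p97 = percentiles[2], percentiles[96]   # 3rd and 97th percentiles
q1, q3 = quartiles[0], quartiles[2]         # 1st and 3rd quartiles

full_range = max(design_effort) - min(design_effort)
percentile_range = p97 - p3
iqr = q3 - q1

print(f"Full range:       {full_range:.2f}")
print(f"Percentile range: {percentile_range:.2f}")
print(f"IQR:              {iqr:.2f}")
```

Under this interpolation rule the results land close to the figures reported in the example, which suggests, but does not prove, that a similar convention was used there.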
The full range, 97.13, is uncomplicated. The other two ranges have been trimmed.
The percentile range (obtained by cutting off 3% of data on either side) is 87.139.
The IQR is 15.001. Choosing the trimming rules has an inherent trouble: it can
tend to be arbitrary unless we exercise caution. Trimming the range is a practical
requirement because we do not want extreme values to misrepresent the process. The
untrimmed range is so large that it is impractical. There could be data outliers; one
would suspect a wrong entry or a wrong computation of the percentage of design effort.
It is obvious that the very high values, such as 100% design effort, are impractical.
Because the data have not been validated by the team, we can cautiously trim and
clean the data. The 3rd-to-97th percentile range seems to be clean, but it still admits
a high value of 91.616%, again an impractical value. Perhaps we can tighten the
trimming rule, say a cutoff of 20% on either end of the data. If we want such tight
trimming, we might as well use the IQR, which trims off 25% of the data on either end.
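How the trimmed range shrinks as the cutoff tightens can be sketched as follows, again using the Data 3.1 values. The helper `trimmed_range` is hypothetical, takes a whole-number cutoff between 1 and 49, and assumes the same linear-interpolation percentile rule (`statistics.quantiles` with `method="inclusive"`). It is offered as an illustration of the trade-off, not as a recommended trimming rule.

```python
import statistics

# % design effort values as listed in Data 3.1
design_effort = [
    2.87, 11.55, 11.11, 18.55, 21.08, 100.00, 83.56, 6.75, 6.33,
    9.71, 21.25, 6.02, 30.63, 13.14, 26.16, 18.85, 32.14, 10.59,
]


def trimmed_range(values, cut_percent):
    """Range between the cut_percent and (100 - cut_percent) percentiles.

    Hypothetical helper; cut_percent must be a whole number from 1 to 49.
    Assumes linear interpolation between sorted values.
    """
    points = statistics.quantiles(values, n=100, method="inclusive")
    # points[k - 1] holds the k-th percentile
    return points[100 - cut_percent - 1] - points[cut_percent - 1]


for cut in (3, 20, 25):
    width = trimmed_range(design_effort, cut)
    print(f"{cut:>2}% cutoff on either end: range = {width:.2f}")
```

A 25% cutoff is the IQR by definition, so the last line of output shows how close a 20% rule already comes to it for this data set.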
Example 3.2: Analyzing Effort Variance Data from Two Processes
Let us take a look at effort variance data from two types of projects, following two
different types of estimation processes. The data are shown in Data 3.2.
A summary of the range analysis for Estimation A and Estimation B is shown as follows:
                    Estimation A    Estimation B
Full range          86.482          63.640
Percentile range    69.466          45.181
IQR                 19.995          8.075
The three ranges individually confirm that the second set of data, from projects
using a different estimation model, has less dispersion.
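The comparison can be sketched in Python using the two columns of Data 3.2. The function name `range_summary` is hypothetical, and the percentile rule (linear interpolation, via `statistics.quantiles` with `method="inclusive"`) is an assumption about how the summary figures were obtained. The listed values are rounded, so agreement with the summary is approximate.

```python
import statistics

# Effort Variance - 1 (Estimation A) and Effort Variance - 2 (Estimation B), from Data 3.2
effort_variance_1 = [
    51.41, 23.20, -7.67, 0.19, 8.31, 4.32, 22.58, 8.23, 28.57, 7.24, -1.39,
    -1.31, -11.32, 1.27, 6.51, 0.00, 48.15, 1.38, -2.67, 23.02, -35.07, 4.67,
]
effort_variance_2 = [
    0.07, 9.00, -7.67, 0.19, 0.79, 4.32, 22.58, 8.23, 28.57, 7.24, -1.39,
    -1.31, -11.32, 1.27, 6.51, 0.00, 6.67, 1.38, -2.67, 23.02, -35.07, 4.67,
]


def range_summary(values):
    """Full range, 3rd-to-97th percentile range, and IQR (assumed interpolation rule)."""
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    quartiles = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "full_range": max(values) - min(values),
        "percentile_range": percentiles[96] - percentiles[2],
        "iqr": quartiles[2] - quartiles[0],
    }


summary_a = range_summary(effort_variance_1)
summary_b = range_summary(effort_variance_2)
for name in summary_a:
    print(f"{name:<17} A = {summary_a[name]:6.2f}   B = {summary_b[name]:6.2f}")
```

Each of the three measures comes out smaller for the second data set, in line with the conclusion drawn in the example.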
In both of the previously mentioned examples, we have studied dispersion from
three different angles. We have not looked into the messages derived from the extreme
values. Extreme values are used elsewhere in risk management and hazard analysis.
Range calculations approach dispersion from the extreme values, the ends
of data. These calculations do not use the central tendencies. In fact, these are
independent of central tendencies.
Next we are going to see expressions of dispersion that consider the central
tendency of data.
BOX 3.1 CROSSING A RIVER
There was a man who believed in averages. He had to cross a river but did
not know how to swim. He obtained statistics about the depth and figured
out that the average depth was just 4 feet. This information comforted him
because he was 6 feet tall and thought he could cross the river. Midway in the
river, he encountered a 9-foot-deep pit and never came out. This story is often
cited to caution about averages. This story also reminds us that we should
register the extreme values in data for survival.
Data 3.2 Dispersion in Effort Variance Data

Effort Variance – 1    Effort Variance – 2
51.41                  0.07
23.20                  9.00
-7.67                  -7.67
0.19                   0.19
8.31                   0.79
4.32                   4.32
22.58                  22.58
8.23                   8.23
28.57                  28.57
7.24                   7.24
-1.39                  -1.39
-1.31                  -1.31
-11.32                 -11.32
1.27                   1.27
6.51                   6.51
0.00                   0.00
48.15                  6.67
1.38                   1.38
-2.67                  -2.67
23.02                  23.02
-35.07                 -35.07
4.67                   4.67

Range
                     Effort Variance – 1    Effort Variance – 2
Range                86.48                  63.64

Percentile Range
                     Effort Variance – 1    Effort Variance – 2
3rd Percentile       -20.11                 -20.11
97th Percentile      49.36                  25.07
Percentile Range     69.47                  45.18

Inter Quartile Range (IQR)
                     Effort Variance – 1    Effort Variance – 2
Q1                   -0.98                  -0.98
Q3                   19.01                  7.09
IQR                  19.99                  8.07

Dispersion as Deviation from Center
The range of data is a fairly good measure of dispersion. However, if we look at the
scatter of data around a center, we will obtain a better and more complete picture.
To visualize this, let us look at hits from a field gun on an enemy target. The hits are
scattered around the target. If the gun is biased, the hits may be scattered around
a center slightly away from the target. In both scenarios, the hits are scattered
around a center, the mean value.
Dispersion is measured in terms of deviations from the mean. Data with larger
dispersion show larger deviations from the mean. The following are different
expressions making use of this basic idea.
Average Deviation
If we calculate the deviations of all data points (all the N hits) from the mean
$\bar{x}$ and take the average, we will obtain a measure of scatter, or dispersion. The
average deviation for N hits from the field gun is a fairly representative estimate of
dispersion. Every deviation is measured in meters, and the average deviation is also in
meters. The formula is given as follows:
\[
D_{\bar{x}} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})}{N} \tag{3.1}
\]
When we try the same formula to obtain the average deviation in bug repair
times (data shown in Table 3.1), we encounter a problem. The average deviation
turns out to be zero.
The deviation values have a direction; the value can be positive or negative. In
the process of calculating the average deviation, the positive values have cancelled
out the negative values.
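The cancellation is not an accident of the bug repair data; it follows from the definition of the mean. A short derivation, using the symbols of Equation 3.1 and the standard definition $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$:

```latex
\[
\sum_{i=1}^{N} (x_i - \bar{x})
  = \sum_{i=1}^{N} x_i - N\bar{x}
  = N\bar{x} - N\bar{x}
  = 0
\quad\Longrightarrow\quad
D_{\bar{x}} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}) = 0 .
\]
```

So the average signed deviation from the mean is zero for any data set, whatever its scatter.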
Average Absolute Deviation
To avoid the sign problem, we can take absolute values of the deviations, as shown in
this modified formula. This is a workaround.
\[
D_{\bar{x}} = \frac{\sum_{i=1}^{N} \lvert x_i - \bar{x} \rvert}{N} \tag{3.2}
\]
Absolute deviations have been calculated and shown in Table 3.1. The average
absolute deviation from the mean is 8.669 days. This number defines the dispersion of
bug repair time around the data mean of 20.517.
We can use this measure to compare scatter in 2-month bug repair data.
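The contrast between Equations 3.1 and 3.2 can be sketched in Python. Table 3.1 is not reproduced in this excerpt, so the sketch uses the % design effort values of Data 3.1 as stand-in data. The variable names are illustrative, and the numbers it prints are not the bug repair figures quoted above.

```python
import statistics

# Stand-in data: % design effort values from Data 3.1
# (the bug repair times of Table 3.1 are not available here).
values = [
    2.87, 11.55, 11.11, 18.55, 21.08, 100.00, 83.56, 6.75, 6.33,
    9.71, 21.25, 6.02, 30.63, 13.14, 26.16, 18.85, 32.14, 10.59,
]

mean = statistics.fmean(values)
deviations = [x - mean for x in values]

# Equation 3.1: signed deviations cancel, leaving zero up to rounding error.
average_deviation = sum(deviations) / len(values)

# Equation 3.2: absolute deviations do not cancel.
average_absolute_deviation = sum(abs(d) for d in deviations) / len(values)

print(f"Mean:                       {mean:.3f}")
print(f"Average deviation:          {average_deviation:.6f}")
print(f"Average absolute deviation: {average_absolute_deviation:.3f}")
```

The signed average prints as zero to within floating-point rounding, while the absolute version gives a usable measure of scatter in the same units as the data.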
Median Absolute Deviation
The scale of scatter of data can be computed with respect to any center. Normally,
we use the mean as the center, as we have done in computing the average absolute