Chapter 3 - Data Dispersion (1/4)

Chapter 3

Data Dispersion

Data dispersion arises from sources of variation, including variations in measurements and in results due to changes in the underlying process. Which one varies more, the measurement or the process under measurement? We proceed in this chapter with the assumption that there is sufficient measurement capability behind the data; that is, the measurement errors are very small compared with process variation. If this condition is met, dispersion in data will represent dispersion in the process under measurement.

Range-Based Empirical Representation

Dispersion or variation in a process is viewed as uncertainty in the process outcome. To deal with uncertainty, we need to measure, express, and understand it. Range is a good old way of measuring dispersion. It has been used in the traditional X-Bar and R control charts, where R stands for range and X-Bar stands for sample mean. Range is the difference between the maximum and minimum values in the data.

There is another convention: leave out the extreme values and then consider the range. Typically, the values below the 3rd percentile and the values above the 97th percentile are disregarded. To understand this, we need to construct the data array, sort the data in some order, and then chop off the part above the 97th percentile and the part below the 3rd percentile. (If the length of the array is L, the 3rd percentile point will rest at a point on the array at a distance of 0.03L from the origin. Similarly, the point of the 97th percentile will rest at a distance of 0.97L from the origin.) This range is the empirical difference between the 3rd and the 97th percentile values.

The two calculations previously mentioned are conservative. In a third approach, the interquartile range (IQR) is taken as the core variation of the process. IQR is the difference between Q3 and Q1. It may be noted that 50% of the observations lie within the IQR.

36 ◾ Simple Statistical Methods for Software Engineering

Here is a summary of the three empirical expressions of dispersion:

Range = maximum − minimum
Percentile range = 97th percentile − 3rd percentile
IQR = Q3 − Q1

Example 3.1: Studying Variation in Design Effort

Design effort data as a percentage of project effort have been collected. The data are shown in Data 3.1.

The dispersion analysis of the data using the above-mentioned formulas is shown as follows:

                    % Design Effort
Max                 100.00
Min                   2.87
Range                97.13

3rd percentile        4.478
97th percentile      91.616
Percentile range     87.139

1st quartile          9.932
3rd quartile         24.933
IQR                  15.001
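
As a minimal sketch, the three ranges in this example can be recomputed in Python from the values in Data 3.1. The helper names `percentile` and `dispersion_summary` are illustrative, and the sketch assumes linear interpolation between neighbouring sorted values, with the p-th percentile located at fraction p/100 along the sorted array. The text does not name its interpolation rule, but this assumption reproduces the three ranges to within rounding.

```python
# Design effort values (% of project effort) copied from Data 3.1.
design_effort = [
    2.87, 11.55, 11.11, 18.55, 21.08, 100.00, 83.56, 6.75, 6.33,
    9.71, 21.25, 6.02, 30.63, 13.14, 26.16, 18.85, 32.14, 10.59,
]


def percentile(values, p):
    """Return the p-th percentile (0-100) of values.

    Assumption: linear interpolation between neighbouring sorted values,
    with the percentile located at fraction p/100 of the way along the
    sorted array (index 0 to n - 1).
    """
    ordered = sorted(values)
    position = (p / 100) * (len(ordered) - 1)
    lower = int(position)                      # index just below the position
    upper = min(lower + 1, len(ordered) - 1)   # index just above, clamped
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def dispersion_summary(values):
    """Full range, 3rd-97th percentile range, and IQR of values."""
    return {
        "range": max(values) - min(values),
        "percentile_range": percentile(values, 97) - percentile(values, 3),
        "iqr": percentile(values, 75) - percentile(values, 25),
    }


summary = dispersion_summary(design_effort)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Changing the arguments 3 and 97 to other cutoffs gives a quick way to see how sensitive the trimmed range is to the trimming rule.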

Data 3.1 Design Effort Data

% Design Effort

2.87

11.55

11.11

18.55

21.08

100.00

83.56

6.75

6.33

9.71

21.25

6.02

30.63

13.14

26.16

18.85

32.14

10.59

The full range, 97.13, is uncomplicated. The other two ranges have been trimmed. The 3rd percentile range (obtained by cutting off 3% of the data on either side) is 87.139. The IQR is 15.001. Choosing a trimming rule has an inherent problem: it tends to be arbitrary unless we exercise caution. Trimming the range is a practical requirement because we do not want extreme values to misrepresent the process. The untrimmed range is so large that it is impractical. There could be data outliers; one would suspect a wrong entry or a wrong computation of the percentage of design effort. It is obvious that very high values, such as 100% design effort, are impractical. Because the data have not been validated by the team, we can cautiously trim and “clean” the data. The 3rd percentile range seems to be clean, but it still admits a high value of 91.616%, again an impractical value. Perhaps we can tighten the trimming rule, say a cutoff of 20% on either end of the data. If we want such tight trimming, we might as well use the IQR, which trims off 25% of the data on either end.

Example 3.2: Analyzing Effort Variance Data from Two Processes

Let us take a look at effort variance data from two types of projects, following two different types of estimation processes. The data are shown in Data 3.2.

A summary of range analysis for Estimation A and Estimation B is shown as follows:

                    Estimation A    Estimation B
Full range          86.482          63.640
Percentile range    69.466          45.181
IQR                 19.995           8.075

The three ranges individually confirm that the second set of data, from projects using a different estimation model, has less dispersion.
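
The comparison can be checked with a short Python sketch. It uses `statistics.quantiles` from the standard library with `method="inclusive"`, which is an assumption about the interpolation rule because the text does not name one. The values are copied from Data 3.2, and `three_ranges` is an illustrative helper name.

```python
from statistics import quantiles

# Effort variance values copied from Data 3.2.
variance_1 = [51.41, 23.20, -7.67, 0.19, 8.31, 4.32, 22.58, 8.23, 28.57, 7.24, -1.39,
              -1.31, -11.32, 1.27, 6.51, 0.00, 48.15, 1.38, -2.67, 23.02, -35.07, 4.67]
variance_2 = [0.07, 9.00, -7.67, 0.19, 0.79, 4.32, 22.58, 8.23, 28.57, 7.24, -1.39,
              -1.31, -11.32, 1.27, 6.51, 0.00, 6.67, 1.38, -2.67, 23.02, -35.07, 4.67]


def three_ranges(values):
    """Return (full range, 3rd-97th percentile range, IQR)."""
    # quantiles(..., n=100) returns 99 cut points; index k - 1 is the k-th percentile.
    pct = quantiles(values, n=100, method="inclusive")
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), pct[96] - pct[2], q3 - q1


ranges_a = three_ranges(variance_1)
ranges_b = three_ranges(variance_2)
for label, a, b in zip(("Full range", "Percentile range", "IQR"), ranges_a, ranges_b):
    print(f"{label:<17}{a:>8.3f}{b:>8.3f}")

# True when every measure is smaller for the second data set.
print(all(b < a for a, b in zip(ranges_a, ranges_b)))
```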

In both of the previously mentioned examples, we have studied dispersion from three different angles. We have not looked into the messages derived from the extreme values. Extreme values are used elsewhere in risk management and hazard analysis.

Range calculations approach dispersion from the extreme values (the ends) of data. These calculations do not use the central tendencies. In fact, these are independent of central tendencies.

Next we are going to see expressions of dispersion that consider central tendency of data.

BOX 3.1 CROSSING A RIVER

There was a man who believed in averages. He had to cross a river but did not know how to swim. He obtained statistics about the depth and figured out that the average depth was just 4 feet. This information comforted him because he was 6 feet tall and thought he could cross the river. Midway in the river, he encountered a 9-foot-deep pit and never came out. This story is often cited to caution about averages. This story also reminds us that we should

Dispersion as Deviation from Center

The range of data is a fairly good measure of dispersion. However, if we look at the scatter of data around a center, we will obtain a better and more complete picture. To visualize this, let us look at hits from a field gun on an enemy target. The hits are scattered around the target. If the gun is biased, the hits may be scattered around a center slightly away from the target. In both the scenarios, the hits are scattered around a center, the mean value.

Data 3.2 Dispersion in Effort Variance Data

Effort Variance – 1    Effort Variance – 2
 51.41                   0.07
 23.20                   9.00
 -7.67                  -7.67
  0.19                   0.19
  8.31                   0.79
  4.32                   4.32
 22.58                  22.58
  8.23                   8.23
 28.57                  28.57
  7.24                   7.24
 -1.39                  -1.39
 -1.31                  -1.31
-11.32                 -11.32
  1.27                   1.27
  6.51                   6.51
  0.00                   0.00
 48.15                   6.67
  1.38                   1.38
 -2.67                  -2.67
 23.02                  23.02
-35.07                 -35.07
  4.67                   4.67

                    Effort Variance – 1    Effort Variance – 2
Range                86.48                  63.64

3rd Percentile      -20.11                 -20.11
97th Percentile      49.36                  25.07
Percentile Range     69.47                  45.18

Q1                   -0.98                  -0.98
Q3                   19.01                   7.09
IQR                  19.99                   8.07

Dispersion is measured in terms of deviations from the mean. Data with larger dispersion show larger deviations from the mean. The following are different expressions making use of this basic idea.

Average Deviation

If we calculate the deviations of all data points (all the N hits) from the mean and take the average, we will obtain a measure of scatter, or dispersion. The average deviation for N hits from the field gun is a fairly representative estimate of dispersion. Every deviation is measured in meters, and the average deviation is also in meters. The formula is given as follows:

$$\text{Average deviation} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})}{N} \tag{3.1}$$

When we try the same formula to obtain the average deviation in bug repair times (data shown in Table 3.1), we encounter a problem. The average deviation turns out to be zero.

The deviation values have a direction; the value can be positive or negative. In the process of calculating the average deviation, the positive values have cancelled out the negative values.
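
This cancellation is not specific to the bug repair data. Assuming the center is the arithmetic mean of the N values, the signed deviations sum to zero for any data set:

$$\sum_{i=1}^{N} (x_i - \bar{x}) = \sum_{i=1}^{N} x_i - N\bar{x} = N\bar{x} - N\bar{x} = 0$$

so the average of the signed deviations carries no information about scatter.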

Average Absolute Deviation

To avoid the sign problem, we can take absolute values of the deviations, as shown in this modified formula. This is a workaround.

$$\text{Average absolute deviation} = \frac{\sum_{i=1}^{N} \lvert x_i - \bar{x} \rvert}{N} \tag{3.2}$$

Absolute deviations have been calculated and shown in Table 3.1. The average absolute deviation from the mean is 8.669 days. This number defines the dispersion of bug repair time around the data mean of 20.517.

We can use this measure to compare scatter in 2-month bug repair data.
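
A minimal Python sketch of both deviation measures follows. Table 3.1 is not reproduced in this section, so the sketch uses the design effort values from Data 3.1 as stand-in data; it therefore does not reproduce the 8.669 days quoted above. The function names are illustrative.

```python
from statistics import fmean

# Stand-in data: % design effort values from Data 3.1 (not the bug repair times).
values = [
    2.87, 11.55, 11.11, 18.55, 21.08, 100.00, 83.56, 6.75, 6.33,
    9.71, 21.25, 6.02, 30.63, 13.14, 26.16, 18.85, 32.14, 10.59,
]


def average_deviation(data):
    """Mean of the signed deviations from the mean."""
    center = fmean(data)
    return sum(x - center for x in data) / len(data)


def average_absolute_deviation(data):
    """Mean of the absolute deviations from the mean."""
    center = fmean(data)
    return sum(abs(x - center) for x in data) / len(data)


print(f"mean = {fmean(values):.3f}")
# The signed deviations cancel, so this is zero up to floating-point rounding.
print(f"average deviation = {average_deviation(values):.6f}")
print(f"average absolute deviation = {average_absolute_deviation(values):.3f}")
```

Both results are in the same unit as the data, which is what makes the average absolute deviation easy to compare across data sets measured on the same scale.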

Median Absolute Deviation

The scale of scatter of data can be computed with respect to any center. Normally, we use the mean as the center, as we did in computing the average absolute
