Chapter 2 - Truth and Central Tendency (1/3)

Chapter 2

Truth and Central Tendency

We have seen three statistical expressions for central tendency: mean, median, and mode. Mean is the arithmetic average of all observations. Each data point contributes to the mean. Median is the middle value of the data array when data are arranged in order, either increasing or decreasing. It is the value at the middle position of the ordered array and does not enjoy contribution from all observations as the mean does. Mode is the most often repeated value. The three are equal for symmetrical distributions such as the normal distribution. In fact, equality of the three values can be used to test whether the data are skewed. Skew is proportional to the difference between mean and mode.
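As a minimal sketch of this check, the snippet below compares the three statistics on two small data sets and computes a skew indicator as (mean - mode) divided by the sample standard deviation. The data values are invented for illustration, and dividing by the standard deviation is an assumption (it follows Pearson's mode skewness); the chapter itself only states the proportionality.

```python
import statistics


def pearson_mode_skew(data):
    """Skew indicator: (mean - mode) / sample standard deviation."""
    spread = statistics.stdev(data)
    return (statistics.mean(data) - statistics.mode(data)) / spread


# Hypothetical data sets, chosen only to illustrate the comparison.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 5, 9, 12]

for name, data in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
        "skew indicator:", round(pearson_mode_skew(data), 3),
    )
```

For the symmetric set the three statistics coincide and the indicator is zero; for the right-skewed set the mean exceeds the median, which exceeds the mode, and the indicator is positive.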

Mean

Use of mean as the central tendency of data is most common. The mean is taken as the true value when making repeated measurements of an entity. The way to obtain truth is to repeat the observation several times and take the mean value. The influence of random errors in the observations cancels out, and the true value appears as the mean. The mean is used in the normal distribution to represent data, even if it is an approximation. Mean is the basis for the normal distribution; it is one of the two parameters of the normal distribution (the other parameter is standard deviation). One would expect the mean value of project variance data such as effort variance, schedule variance, and size variance to reveal the true error in estimation.

22 ◾ Simple Statistical Methods for Software Engineering

Once the true error is known, the estimation can be calibrated as a measurement process.

It is customary to take a sample of data and consider the mean of the sample as the true observation. It makes no statistical sense to judge based on a single observation. We need to think with “sample mean” and not with stray single points. “Sample mean” is more reliable than any individual observation. “Sample mean” dominates statistical analysis.
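The claim that random errors cancel out in the mean can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the noise level, and the Gaussian error model are all hypothetical choices, not taken from the chapter.

```python
import random
import statistics

TRUE_VALUE = 100.0  # hypothetical true value of the measured entity
NOISE_SD = 5.0      # hypothetical spread of the random measurement error

rng = random.Random(42)  # fixed seed so the run is repeatable
observations = [TRUE_VALUE + rng.gauss(0.0, NOISE_SD) for _ in range(1000)]

sample_mean = statistics.mean(observations)

# Error of the sample mean versus the average error of a single observation.
mean_error = abs(sample_mean - TRUE_VALUE)
typical_single_error = statistics.mean([abs(x - TRUE_VALUE) for x in observations])

print("error of sample mean:", round(mean_error, 3))
print("typical error of one observation:", round(typical_single_error, 3))
```

Under these assumptions the sample mean lands much closer to the true value than a typical single observation does.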

Uncertainty in Mean: Standard Error

The term “sample mean” must be seen with more care; it simply refers to the mean of observed data. Say we collect data about effort variance from several releases in a development project. These data form a sample from which we can compute the mean effort variance in the project. Individual effort variance data are used to measure and control events; sample mean is used to measure and control central capability. Central tendency is used to judge process capability.

Now the Software Engineering Process Group (SEPG) would be interested in estimating process capability from an organizational perspective. They can collect sample means from several projects and construct a grand mean. We can call the grand mean by another term, the population mean. Here population refers to the collective experience of all projects in the organization. The population mean represents the true capability of the organization.

If we go back to the usage of the term truth, we find there are several discoveries of truth; each project discovers its true effort variance using the sample mean. The organization discovers truth from the population mean.

Now we can estimate the population mean (the central tendency of the organizational process) from the sample mean from one project (the central tendency of the local process). We cannot pinpoint the population mean, but we can fix a band of values where the population mean may reside. There is an uncertainty associated with this estimation. It is customary to define this uncertainty by a statistic called standard error. Let us look further into this concept.

It is known that the mean values gathered from different projects (the sample means) vary approximately according to the normal distribution when the samples are reasonably large. The theorem that propounds this is known as the central limit theorem. The standard deviation of this normal distribution is known as the standard error.

If we have just collected sample data from one project with n data points and a standard deviation s, then we can estimate the standard error with reasonable accuracy using the relation

$$SE = \frac{s}{\sqrt{n}}$$

Defining an uncertainty interval for the mean is explained further in Chapter 25.
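The standard error of a single project sample can be sketched as follows, using the sample standard deviation s divided by the square root of n. The function name `standard_error` and the effort variance figures are hypothetical, chosen only for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two data points to estimate s")
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(n)


# Hypothetical effort variance (%) from six releases of one project.
effort_variance_pct = [4, 8, 6, 5, 3, 7]

sample_mean = statistics.mean(effort_variance_pct)
se = standard_error(effort_variance_pct)

print("sample mean:", sample_mean)
print("standard error:", round(se, 3))
```

Because n sits under a square root, quadrupling the number of data points only halves the standard error.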

Median

The physical median divides a highway into two, and the statistical median divides data into two halves. One half of the data have values greater than the median. The other half of the data have values smaller than the median. It is a rule of thumb that if data are nonnormal, use median as the central tendency. If data are normally distributed, median is equal to mean in any case. Hence, median is a robust expression of the central tendency, true for all kinds of data. For example, customer satisfaction data (known as CSAT data) are usually obtained on an ordinal scale known as the Likert scale. One should not take the mean value of CSAT data; median is the right choice. (It is a commonly made mistake to take the mean of CSAT data.) In fact, only median is a relevant expression of central tendency for all subjective data. Median is a truer expression of central tendency than mean in engineering data, such as data obtained from measurements of software complexity, productivity, and defect density.
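A minimal sketch of summarizing CSAT responses by the median follows. The responses and the five-point label mapping are hypothetical. `statistics.median_low` is used here as a design choice so that the result is always one of the actual scale categories rather than an interpolated value such as 3.5.

```python
import statistics

# Hypothetical five-point Likert scale used for a CSAT survey.
LIKERT_LABELS = {
    1: "very dissatisfied",
    2: "dissatisfied",
    3: "neutral",
    4: "satisfied",
    5: "very satisfied",
}

# Hypothetical responses from ten customers.
csat_responses = [5, 4, 4, 5, 1, 4, 3, 5, 4, 2]

# median_low returns a value that exists in the data, which suits ordinal codes.
median_code = statistics.median_low(csat_responses)
median_label = LIKERT_LABELS[median_code]

print("median response:", median_code, "-", median_label)
```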

While the mean is used in the design of normal distribution, the median is

used in the design of skewed distributions such as the Weibull distribution.

Median value is used to develop the scale parameter that controls width.

Box 2.1 Hanging a Beam

Think of the mean as a center of gravity. In Figure 2.1, the center of gravity coincides with the geometric center, which is analogous to the median of the beam, and as a result, the beam achieves equilibrium. In Figure 2.2, the center of gravity shifts because of asymmetrical load distribution; the beam tilts in the direction of the center of gravity. The median, however, is still the same old point. The distance between median and center of gravity is like the difference between median and mean. Such a difference makes the beam tilt; in the case of a data array, the difference between median and mean is a signal of data “skew” or asymmetry.


[Figure 2.1 Geometric middle point and center of gravity coincide and the beam is balanced. Labels: geometric middle point (analogous to median); center of gravity (analogous to mean). Note: The uniform beam balances at the middle point. The center of gravity (analogous to mean) and the middle point (analogous to median) coincide.]

[Figure 2.2 Asymmetry is introduced by additional weight on the right side of the beam. The mean shifts to the right. Labels: geometric middle point (analogous to median); center of gravity (analogous to mean); rider upsets balance of the beam. Note: The asymmetrically loaded beam tilts. This is analogous to data skew.]


Mode

Mode, the most often repeated value in data, appears as the peak in the data distribution. Certain judgments are best made with mode. The arrival time of an employee varies, and the arrival data are skewed, as indicated by the three expressions of central tendency: mean = 10:00 a.m., confidence interval of the mean = 10:00 a.m. ± 20 minutes, median = 9:30 a.m., and mode = 9:00 a.m. The expected arrival time is 9:00 a.m. Let us answer the question, is the employee on time? The question presumes that we have already decided not to bother with individual arrival data but wish to respond to the central tendency. Extreme values are not counted in the judgment. We choose the mode for some good reasons. Mean is biased by extremely late arrivals. Median is insensitive to best performances. Mode is more appropriate in this case.
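The arrival example can be sketched with a small data set. The nine arrival times below are invented so that they reproduce the mean, median, and mode quoted above; they are not data from the book, and the helper `as_clock` is a hypothetical name.

```python
import statistics

# Hypothetical arrival times in minutes after midnight (540 = 9:00 a.m.).
arrivals = [540, 540, 540, 540, 570, 600, 630, 690, 750]


def as_clock(minutes):
    """Format minutes after midnight as HH:MM."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours:02d}:{mins:02d}"


summary = {
    "mean": as_clock(statistics.mean(arrivals)),
    "median": as_clock(statistics.median(arrivals)),
    "mode": as_clock(statistics.mode(arrivals)),
}

for name, value in summary.items():
    print(name, value)
```

A few very late arrivals pull the mean an hour past the most frequent arrival time, while the mode stays at the employee's usual behavior.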

Geometric Mean

When the data are positive, as is the case with bug repair time, we have a more rigorous way of avoiding the influence of extreme values. We can use the concept of geometric mean.

The geometric mean of n numbers is the nth root of the product of the n numbers, that is,

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
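A minimal sketch of the geometric mean follows. It averages logarithms instead of multiplying the values directly, which is mathematically the same nth root of the product but avoids overflow on long data sets; that implementation choice and the repair times are assumptions made for illustration.

```python
import math
import statistics


def geometric_mean(values):
    """nth root of the product of n positive numbers, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical bug repair times in hours, with one extreme value.
repair_hours = [2, 4, 8, 16, 1024]

gm = geometric_mean(repair_hours)
am = statistics.mean(repair_hours)

print("geometric mean:", round(gm, 2))
print("arithmetic mean:", round(am, 2))
```

On this data the single extreme repair time dominates the arithmetic mean, while the geometric mean stays near the bulk of the values. Python 3.8 and later also provide `statistics.geometric_mean`, which can be used in place of the hand-written function.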



Box 2.2 A Robust Reference

Median is a robust reference that can serve as a baseline much better than mean does. If we wish to monitor a process, say test effectiveness, first we need to establish a baseline value that is fair. Median value is a fair central line of the process, although many tend to use mean. Mean is already influenced by extreme values and is “prejudiced.” Median reflects true performance of the process. Untrimmed mean reflects the exact location of the process without any discrimination. Median effectively filters away prejudices and offers a fair and robust judgment of process tendency. For example, the median score of a class in a given subject is the true performance of the class, and the mean score does not reflect the true performance.
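The robustness of the median as a baseline can be sketched by adding one extreme value to a data set and measuring how far each statistic moves. The test effectiveness figures are hypothetical.

```python
import statistics

# Hypothetical test effectiveness (%) from five test cycles.
baseline_runs = [78, 80, 81, 82, 84]

# The same data with one extreme cycle added.
with_outlier = baseline_runs + [15]

mean_shift = statistics.mean(with_outlier) - statistics.mean(baseline_runs)
median_shift = statistics.median(with_outlier) - statistics.median(baseline_runs)

print("shift in mean:", mean_shift)
print("shift in median:", median_shift)
```

One extreme cycle moves the mean by many points but barely moves the median, which is why the median makes the steadier central line.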
