Chapter 4 - Tukey’s Box Plot: Exploratory Analysis (2/4)

58 ◾ Simple Statistical Methods for Software Engineering

Seeing Process Drift

If the median is not on the process target value, we can say that the process has drifted. The amount of drift can be easily seen if we draw a target line on the box plot. Figure 4.3 shows a box plot of bug repair time. The corporate goal is to fix bugs within a maximum of 16 days. The goal is marked on the box plot for easy interpretation.
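The drift described above can also be read off numerically. The following is a minimal sketch, not the book's procedure: the function name `median_drift` and the repair times are hypothetical, and only the 16-day goal comes from the text.

```python
from statistics import median


def median_drift(samples, target):
    """Return median minus target; a positive value means the median sits above the target."""
    return median(samples) - target


# Hypothetical bug repair times in days (illustrative only, not the data behind Figure 4.3).
repair_days = [6, 9, 12, 14, 18, 21, 25, 30, 45]

drift = median_drift(repair_days, target=16)
print(f"median = {median(repair_days)} days, drift from goal = {drift:+} days")
```

A drift of zero would mean the median sits exactly on the target line; the sign tells us on which side of the line the process has settled.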

Detecting Skew

The box plot is an eloquent way of expressing problems in a process. One can see clearly if the process results are skewed. If the median is in the middle of the box, the data are not skewed. If the median shifts to the right, the data are left skewed. If the median shifts to the left, the data are right skewed. Another sign of skew is the length of the whiskers. If the right whisker is longer, as seen in Figure 4.3, the process is skewed to the right.
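The position of the median inside the box can be turned into a single number. The sketch below uses the quartile skewness coefficient (often attributed to Bowley), which this chapter does not mention; it is offered only as one way to express the visual rule above. The function name and the sample values are assumptions, and quartiles are computed with Python's `statistics.quantiles` using the inclusive method, which may differ slightly from the hinges used to draw a particular box plot.

```python
from statistics import quantiles


def quartile_skew(samples):
    """Quartile skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1).

    Zero     -> median in the middle of the box (no skew seen in the box).
    Positive -> median shifted toward Q1 (left of the box): right skew.
    Negative -> median shifted toward Q3 (right of the box): left skew.
    """
    q1, q2, q3 = quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 0.0  # degenerate box; no skew can be read from it
    return ((q3 - q2) - (q2 - q1)) / iqr


# Hypothetical samples, for illustration only.
print(quartile_skew([1, 2, 3, 4, 5]))       # symmetric
print(quartile_skew([1, 2, 3, 10, 20]))     # median near the left edge of the box
print(quartile_skew([1, 11, 18, 19, 20]))   # median near the right edge of the box
```

The coefficient looks only at the box. Whisker lengths, the second sign of skew mentioned above, would need a separate comparison.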

Seeing Variation

The width of the box is a measure of process variation. The box spans the interquartile range, that is, the middle 50% of the results. The whisker-to-whisker width also expresses variation, perhaps with better clarity and more dramatic effect; it typically covers more than 90% of the results. In Figure 4.3, the whisker-to-whisker range is from 6 to 45 days. The variation is far in excess of what is anticipated.
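Both measures of variation can be computed directly from the data. The sketch below assumes the common convention that each whisker ends at the most extreme data point lying within 1.5 × IQR of the box; the function name `spread_summary` and the sample are hypothetical, chosen so that the whiskers run from 6 to 45 days as in the text.

```python
from statistics import quantiles


def spread_summary(samples, k=1.5):
    """Return the box width (IQR) and the whisker-to-whisker range.

    Assumes whiskers end at the last data points within k * IQR of the box.
    """
    data = sorted(samples)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "whisker_range": inside[-1] - inside[0],
    }


# Hypothetical bug repair times in days (illustrative only).
repair_days = [6, 9, 12, 14, 18, 21, 25, 30, 45, 80]
summary = spread_summary(repair_days)
print(summary)
```

In this sample the value 80 falls beyond the upper fence, so it is excluded from the whisker and would be drawn as an outlier.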

Figure 4.3 Box plot of bug repair time, days. (Horizontal axis: 0 to 80 days; the corporate goal, "Bugs must be repaired within 16 days," is marked on the plot.)

Tukey’s Box Plot ◾ 59

Risk Measurement

If we plot specification lines on the box plot, we can easily see the risk element in the process. If an entire whisker lies outside the specification lines, the process is very risky. Risk is proportional to the portion of the box plot that stays outside the process specifications. Bug repair, shown in Figure 4.3, surely has a schedule risk. We cannot quantify risk using a box plot, but we can qualitatively say the risk is very high. Risk management is one area where qualitative judgment is good enough and often more dependable than sophisticated quantitative analysis.

Outlier Detection

A very useful result from the box plot is the detection of outliers. The rules applied in the box plot do not assume any mathematical distribution function for the process. The box plot way of detecting outliers differs from the control chart way of detecting outliers. In control charts, we use the probability density function that corresponds to the inherent distribution of data. The box plot rules do not apply any distribution formula. The box plot uses a distribution-free judgment that is more universal and robust enough to engage all kinds of data.
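The distribution-free rule can be sketched in a few lines. The version below assumes the usual Tukey fences, with inner fences at 1.5 × IQR and outer fences at 3 × IQR beyond the quartiles; the function name `tukey_outliers`, the labels "mild" and "extreme", and the sample are assumptions made for illustration.

```python
from statistics import quantiles


def tukey_outliers(samples):
    """Split outliers into mild (between inner and outer fences) and extreme (beyond outer fences).

    No distribution is assumed; only the quartiles of the data are used.
    """
    q1, _, q3 = quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    mild = [x for x in samples
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in samples if x < outer_low or x > outer_high]
    return mild, extreme


# Hypothetical bug repair times in days (illustrative only).
repair_days = [6, 9, 12, 14, 18, 21, 25, 30, 45, 80, 120]
mild, extreme = tukey_outliers(repair_days)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Because the fences are built from quartiles rather than from a mean and standard deviation, a few extreme values do not move the fences much, which is what makes the rule robust.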

Comparison of Processes

Box plots are used to compare process results. All three elements of a process can be visually compared:

◾ Central tendency

◾ Dispersion

◾ Outliers

This visual comparison performs the functions of three tests: the t test for process mean, the F test for process variation, and control chart tests for process outliers. This comparison is discussed later in this chapter.

Improvement Planning

Process improvement planning is well supported by box plot analysis. A box plot defines the problem with a picturesque essay of three dimensions of the process: central tendency, dispersion, and outliers. A box plot is an empirical problem statement. If we think that a well-defined problem is a problem half solved, then we stand to gain immensely by the box plot way of problem definition.

An approximate answer to the right problem is worth a good

deal more than an exact answer to an approximate problem.

John W. Tukey

Box plots help us to identify and define the right problems.

The productivity box plot shown in Figure 4.1 highlights three opportunities for innovation:

1. Removal of outliers: This is the easiest innovation. There is no outlier on the lower side of the plot. That is good news. The outliers with higher values might appear to be welcome outcomes. Here is the good old question of specifying an upper limit even for the better side of events. It may be suspected that an extreme value of productivity is the result of a compromise, a slow-acting fuse, that might show up later somewhere as an issue. Although we need to understand all the outliers, the outliers beyond the outer fence may be studied in detail. The presence of more outliers on the right side also indicates the possible existence of a tail or skew in the distribution.

2. Shifting the median toward higher levels: This means the expected value of productivity can be improved.

3. Reduction of IQR as well as whisker-to-whisker width: Process variation, depicted both by the IQR and by the whisker-to-whisker width, can be reduced.

The three innovations could coexist. Improving the median may be accompanied by a reduction in outliers, and vice versa. It is a good strategy to take up one at a time, in the previously mentioned order, and take the beneficial side effects in the other two.

Core Benefits of Box Plot

Tukey’s box plot contains sufficient statistical strategies and yet retains its intended simplicity. Many attempts are being made to enhance the information content in box plots and make them colorful as well. We focus on the simple box plot in this chapter and find that it has great potential. The box plot can be applied to the following:

◾ Provide a visual summary of data

◾ See process variation

◾ Detect outliers

◾ Detect skew in data

◾ See process drift

◾ Compare processes

◾ Plan process improvement

Twin Box Plot

Let us take the case of reestimating software development effort. Teams are reluctant to do a second estimate and are in a hurry to move forward with development. However, it is a best practice to do a second estimate a fortnight into the project, when many project details become visible. We get to know the requirements better, teams communicate better, risks are seen with clarity, and we are enlightened by the early lessons. The second estimate is expected to be more accurate. We wish to compare the second estimates with the first estimates and study the improvement.

The box plot is well suited to comparing the two results. In Figure 4.4, a twin box plot is shown comparing two sets of effort variance data.

The twin box plot offers what might be called a visual test, a preliminary analysis before we start rigorous tests. Visual judgment of the following can provide vital clues regarding the differences between two results:

1. Is there a difference between the whisker-to-whisker widths?

2. Is there a difference between the box widths?

3. Is there a relative shift in the position of the median?

If the answer is yes to any one of these questions, we need to take a deeper look at the box plots. Sometimes the presence or absence of outliers could make a difference. Sometimes an off-center median line inside a box could provide a clue.

If the difference is significant, the boxes in the two plots may be completely disjoint. They may not overlap. Using the box plot representation, it is rather easy to see if the new result is different from the old.
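The three visual questions, together with the overlap check, can be mirrored in code as a first pass. The sketch below is an illustration under stated assumptions, not a substitute for looking at the plot: the helper names `box_summary` and `compare_boxes`, the 1.5 × IQR whisker convention, and both samples of effort variance are hypothetical.

```python
from statistics import quantiles


def box_summary(samples, k=1.5):
    """Quartiles, box width, and whisker range (whiskers assumed to end within k * IQR of the box)."""
    data = sorted(samples)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inside = [x for x in data if q1 - k * iqr <= x <= q3 + k * iqr]
    return {"q1": q1, "median": med, "q3": q3, "box_width": iqr,
            "whisker_range": inside[-1] - inside[0]}


def compare_boxes(first, second):
    """Answer the three visual questions as differences (second minus first)."""
    a, b = box_summary(first), box_summary(second)
    return {
        "whisker_range_change": b["whisker_range"] - a["whisker_range"],
        "box_width_change": b["box_width"] - a["box_width"],
        "median_shift": b["median"] - a["median"],
        # Boxes are disjoint when one box ends before the other begins.
        "boxes_overlap": a["q1"] <= b["q3"] and b["q1"] <= a["q3"],
    }


# Hypothetical effort variance data in percent (not the data behind Figure 4.4).
first_estimate = [-40, -20, -10, 0, 5, 10, 20, 30, 45]
second_estimate = [-8, -4, -2, 0, 1, 3, 5, 7, 10]

result = compare_boxes(first_estimate, second_estimate)
print(result)
```

Negative changes in whisker range and box width indicate that the second set of results is tighter than the first; the overlap flag shows whether the two boxes are disjoint.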

Figure 4.4 Comparison of two estimates using box plots. (Box plots labeled EVA 1 and EVA 2; horizontal axis: EVA, -80 to 80.)

If results due to innovation show improvement, one or more of the following visual clues may be present:

◾ The overall length of the box plot would have decreased

◾ Outliers might have disappeared

◾ The central line might show a favorable shift

◾ The box might have shrunk

◾ The box might be relocated in a favorable region

◾ The unfavorable whisker might have diminished

If an improvement is not visible in a box plot, it may not be an improvement in the first place. The question of looking for significance does not arise.

However, in most cases, people take pains to go through lengthy procedures to execute significance tests to check differences, without box plot visual checks. In some cases, even after the box plot has rejected the improvement, people go through the ritual of significance tests.

Holistic Test

The twin box plot test is a holistic approach; it can compare two populations (two groups) in a complete, balanced fashion that no other test can offer. The price we pay for completeness is loss of rigor. It so happens that rigorous tests have narrower scope than robust tests; approximate analysis can sweep more terrain than precise analysis. We need such a holistic test before we go into more sophisticated tests.

The twin box plot shown in Figure 4.4 offers a holistic comparison described in the following paragraphs.

First, it compares the median values. The median of the first estimate is 4.67%, and the median of the second estimate is 1.27%. Comparing medians is more robust than comparing means and makes sense even with nonnormal data. This is a comment on central tendency.

Then dispersion is compared at two levels. The first IQR is 23.45, and the second, improved value is 8.54. It is evident that the core of the estimation process, covering 50% of the results, shows less dispersion: the new IQR is roughly one-third of the old. The old whisker-to-whisker range is 86.48, whereas the new whisker-to-whisker range is 20.32, less than one-quarter of the old. It is evident that the dispersion is reduced in the new estimation technique; it is more reliable. The box plot provides a rough order-of-magnitude check before we resort to p values for judgment.

The box plot identifies outliers in the second group; not every estimate has been well performed. The best practice must spread. The second process has philosophical problems called statistical outliers. However, in a practical sense, even the outliers are better than the first process.
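The size of the improvement can be checked with simple arithmetic on the figures quoted for Figure 4.4. The short sketch below uses only those reported values; the dictionary layout and variable names are assumptions made for illustration.

```python
# Summary values quoted in the text for the two estimates in Figure 4.4.
first = {"median": 4.67, "iqr": 23.45, "whisker_range": 86.48}
second = {"median": 1.27, "iqr": 8.54, "whisker_range": 20.32}

# Ratio of old to new for each measure: values above 1 mean the second estimate is tighter.
ratios = {key: first[key] / second[key] for key in first}

for key, ratio in ratios.items():
    print(f"{key}: old / new = {ratio:.2f}")
```

The IQR shrinks by a factor of about 2.7 and the whisker-to-whisker range by a factor of about 4.3, so both measures of dispersion improve by a factor of a few rather than by a factor of ten.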
