Data, Data Quality, and Descriptive Statistics 13
for drawing conclusions. We can do with 14, keeping in mind that there could be
small but tolerable errors in our judgment.
Two statistics are of signicant consequencethe mean value is 10.414 and
the maximum value is 30. We are going to apply business rules to evaluate these
statistics and not statistical rules. e mean value of variance, when the estimation
process is mature, should be close to zero. e ideal behavior of estimation errors is
like that of measurement errors; both should be symmetrically distributed with the
center at zero. After all, estimation is also a measurement. e current mean vari-
ance of 10.414 is high, suggesting that the project consistently loses approximately
10% of manpower. is is what Juran called chronic waste.
The second problem is that the maximum value of variance stretches as far as
30%. From a practical angle, this is not terribly bad. Once in a while, projects have
reported far higher extremes, as much as 80%. This is, in any case, a less serious
problem than the mean value.
Both kurtosis and skewness are not alarming.
The median stays close to the mean, as expected.
There is no clear mode in the data.
The range is 33, but the standard deviation is approximately 8.5, suggesting
a mathematical process width of six times the standard deviation, equal to 51. The
mathematical model thus predicts larger process variation. However, even this larger
forecast is not as alarming as the mean value.
Overall, the project team has reasonable discipline in complying with plans,
indicated by the acceptable range. The estimation process requires improvement, and it
looks as if it could be fine-tuned to achieve a mean error of zero.
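These summary statistics are easy to recompute. The sketch below is illustrative only, using hypothetical effort-variance figures rather than the chapter's raw data, and shows how the mean, the standard deviation, the range, and the six-sigma process width are obtained:

```python
import statistics

# Hypothetical effort-variance values (%), one per task; illustrative only.
variance_pct = [3, 18, 7, 12, 22, -3, 15, 9, 30, 5, 11, 8]

mean = statistics.mean(variance_pct)
stdev = statistics.stdev(variance_pct)           # sample standard deviation (n - 1)
value_range = max(variance_pct) - min(variance_pct)
process_width = 6 * stdev                        # mathematical process width

print(f"mean          = {mean:.2f}")
print(f"stdev         = {stdev:.2f}")
print(f"range         = {value_range}")
print(f"6-sigma width = {process_width:.2f}")
```

A mean well above zero in such a run would be the "chronic waste" symptom discussed above, even when the range looks acceptable.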
Box 1.3 Small Is Big
The maintenance projects had to deal with 20,000 bugs every week pouring
in from globally located customer service centers. The product was huge, and
multiple updates happened every month, delivered to different users in
different parts of the world. The maintenance engineers were busy fixing the
bugs and had no inclination to look at and learn from maintenance data. The
very thought of a database with millions of data points deterred them from
taking a dip into the data. Managers were helpless in this regard because they
had no case to persuade people to take large chunks of time and pore over
data. Data were unpopular until people came to know about five-point sum-
maries. A month's data can be reduced to Tukey's five statistics: minimum,
first quartile, median, third quartile, and maximum. People found it easy,
with merely five statistics, to understand a month's performance.
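A five-point summary like the one the engineers adopted takes only a few lines to compute. This is a minimal sketch with made-up repair times; note that quartile conventions vary slightly between tools (Python's "inclusive" method is used here):

```python
import statistics

# Illustrative repair times (days) for one month; not the project's actual data.
repair_days = [12, 35, 48, 60, 72, 85, 96, 110, 130, 155, 190, 240, 300]

# Quartiles via linear interpolation over the ordered data.
q1, q2, q3 = statistics.quantiles(repair_days, n=4, method="inclusive")

five_point = {
    "minimum": min(repair_days),
    "quartile 1": q1,
    "median": q2,
    "quartile 3": q3,
    "maximum": max(repair_days),
}
print(five_point)
```

Five numbers replace thousands of raw observations, which is precisely why the summary made the data approachable.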
14 Simple Statistical Methods for Software Engineering
Application Notes
A primary application of the ideas we have seen in this chapter is in presenting data
summaries. The design of summary tables deserves attention.
First, avoid presenting too many metrics in a single table. Beyond
seven metrics, the brain cannot process parallel data; a summary table with 40
metrics goes over the reader's head. Such data can be grouped under five categories: business,
project, process, subprocess, and product. If such a categorization is not favored,
the summary table can use any of the following groupings:
Long term–short term
Business–process
Project–process
Project–process–product
What is important is that the table is partitioned into tiles; the parts may
be presented separately, connected by digital links. This way, different stakeholders
may read different tables. Whoever picks up a table will find the data relevant and
hence interesting.
Next, for every metric, the five-point summary may be presented instead of the
usual mean and sigma, for one good reason: most engineering data are nonnormal.
The five-point summary is robust and can handle both normal and nonnormal
data.
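The robustness claim is easy to demonstrate with a toy example (hypothetical numbers): one extreme value drags the mean and sigma far from the bulk of the data, while the median barely moves.

```python
import statistics

# Skewed, hypothetical effort data with one extreme outlier.
data = [5, 6, 7, 8, 9, 10, 11, 12, 13, 200]

print("mean   =", statistics.mean(data))    # pulled far to the right by 200
print("sigma  =", statistics.stdev(data))   # inflated by the outlier
print("median =", statistics.median(data))  # stays near the bulk of the data
```

The mean here lands at 28.1 although nine of the ten values lie below 14; the median of 9.5 tells the truer story, which is why the five-point summary is the safer default for nonnormal data.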
Concluding Remarks
It is important to understand the context of data to make both data collection and
interpretation effective.
Time to Repair Analysis
Tukey's Five-Point Summary

Statistic               Zone 1    Zone 2    Zone 3    Zone 4
N (bugs reported)       22,000    12,000    23,600    32,000
Time to repair, days
  Minimum                   12         6        45        25
  Quartile 1                70        44        57        63
  Median                   120        66       130        89
  Quartile 3               190        95       154       165
  Maximum                  300       126       200       223
Before analyzing data, we must determine their scale. Permissible statistical
methods change with scale. For example, we use median and percentiles for ordi-
nal data.
Errors in data undermine our confidence in the data. We should not unwit-
tingly repose undue confidence in data. We must seek to find the data sources and
assess the likely percentage of error in the data. For example, customer
perception data are likely to be inconsistent and subjective. In this case, we would
trust the central tendency expressions rather than the dispersion figures. Machine-
collected data, such as bug repair time, are likely to be accurate.
We should learn to summarize data visually as well as numerically. We can
make use of Excel graphs for the former and descriptive statistics in Excel for the
latter. These summaries also constitute first-level rudimentary analyses without
which data collection is incomplete.
Data have the power to change paradigms. Old paradigms that do not fit fresh
data are replaced by new paradigms that fit. Data have the power to renew business
management continually. Data are also a fertile ground for innovation, new dis-
coveries, and improvement. All these advantages can be gained with rudimentary
analyses of data.
Review Questions
1. What are data?
2. What are scales of measurement?
3. What is a statistic? How is it different from data?
4. What are the most commonly used descriptive statistics?
5. What is Tukey’s ve-point summary?
6. How do data contribute to self-improvement?
Box 1.4 Analogy: Biofeedback
There was a boy who stammered and went to a speech therapist. The treat-
ment given was simple: he had to watch his speech waveform on an oscillo-
scope as he spoke into a microphone. He practiced for 5 days, half an
hour a day, and walked away cured of stammering. The way he gained normal
speech is ascribed to biofeedback. Human systems correct themselves if they
happen to see their own performance. That is precisely what data usage in a software
development project achieves. When programmers see data about their code
defects, the human instinctive capability is to rectify the problems and offer
defect-free code. This principle has universal application and is relevant to all
software processes, from requirement gathering to testing.
Exercises
1. If you are engaged in writing code for a mission-critical software application,
and if you wish to control the quality of the code to ensure delivery of defect-
free components, what data will you collect? Design a data collection table.
2. During the testing of 5000 LOC of code, what data will you collect for the pur-
pose of assessing code stability?
Appendix 1.1: Definition of Descriptive Statistics
Number of Data Points
When we see a metric value, we should also know the size of the sample used in
the calculation.
Number of data points (observations) = n
Sum
This is a plain total of all values, useful as a meta-calculation:

Sum = \sum_{i=1}^{n} x_i
Variance
This is a mathematical calculation of data dispersion obtained from the following
formula:

Variance = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

where n is the sample size and \bar{x} is the sample mean. Variance is the average squared
deviation from the mean.
Standard Deviation
The square root of variance is equal to the standard deviation. This is the mathematical
expression of dispersion. It is also a parameter of the normal distribution.
The symbol σ, read as sigma, denotes the standard deviation:

σ = \sqrt{\text{variance}}
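The variance and standard deviation formulas can be verified directly. A minimal sketch, cross-checked against Python's statistics module, which uses the same n − 1 convention:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]
n = len(data)
mean = sum(data) / n

# Variance: average squared deviation from the mean, with n - 1 in the denominator.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: the square root of variance.
sigma = math.sqrt(variance)

# Cross-check against the library implementations.
assert abs(variance - statistics.variance(data)) < 1e-12
assert abs(sigma - statistics.stdev(data)) < 1e-12
print(variance, sigma)
```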
Maximum
This is the largest value in the sample. Large values of effort variance indicate a
special problem and are worth scrutiny. The questions here are “How bad is the
worst value? Is it beyond practical limits?” This statistic is a simple recognition of a
serious characteristic of the data.
Minimum
This is the other end of the data values. The question is similar: “How low is the min-
imum value?” In effort variance, the minimum value can have a negative sign,
suggesting cost compression. Usually, cost compression is good news, but process
managers become cautious when the value is deeply negative. The questions that
bother them are as follows: Has there been some compromise? Will the cost saving have
a boomerang effect?
Range
Range is obtained by subtracting the minimum from the maximum. Range repre-
sents process variation in an empirical sense. This statistic is widely used in process
control. It is simple to compute, yet sensitive enough to give an alert if processes vary
too much.
Range is just the difference between the largest and smallest values:

Range = maximum − minimum
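Minimum, maximum, and range take one line each to compute. A small sketch with hypothetical effort-variance figures, including a negative minimum that would signal cost compression:

```python
# Hypothetical effort-variance sample (%); the negative value is cost compression.
effort_variance = [-8, 2, 5, 11, 17, 25]

maximum = max(effort_variance)
minimum = min(effort_variance)
value_range = maximum - minimum   # range = maximum - minimum

print(minimum, maximum, value_range)
```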
Mode
Mode is the most often repeated value. It is an expression of central tendency.
Median
Median is the value that divides data, organized into an ordered array, into two
equal halves. This is another expression of central tendency.
In simple words, the median is the middle value in the list of numbers. The list
should first be arranged in ascending order. If the total number of values, n, is odd,
then the median is given by

Median = ((n + 1)/2)th term
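The ((n + 1)/2)th-term rule for odd n can be checked against Python's built-in median. A sketch with illustrative numbers:

```python
import statistics

# Odd number of values: the median is the ((n + 1)/2)th term of the sorted list.
values = [7, 1, 9, 3, 5]            # n = 5, so the median is the 3rd sorted term
ordered = sorted(values)            # [1, 3, 5, 7, 9]
n = len(ordered)
median = ordered[(n + 1) // 2 - 1]  # 1-based position (n + 1)/2 -> 0-based index

assert median == statistics.median(values)
print(median)
```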