Data, Data Quality, and Descriptive Statistics 13
for drawing conclusions. We can do with 14, keeping in mind that there could be
small but tolerable errors in our judgment.
Two statistics are of signicant consequencethe mean value is 10.414 and
the maximum value is 30. We are going to apply business rules to evaluate these
statistics and not statistical rules. e mean value of variance, when the estimation
process is mature, should be close to zero. e ideal behavior of estimation errors is
like that of measurement errors; both should be symmetrically distributed with the
center at zero. After all, estimation is also a measurement. e current mean vari-
ance of 10.414 is high, suggesting that the project consistently loses approximately
10% of manpower. is is what Juran called chronic waste.
The second problem is that the maximum value of variance stretches as far as
30%. From a practical angle, this is not terribly bad. Once in a while, projects have
reported far higher extremes, as much as 80%. This is, in any case, a less serious
problem than the mean value.
Both kurtosis and skewness are not alarming.
The median stays close to the mean, as expected.
There is no clear mode in the data.
The range is 33, but the standard deviation is approximately 8.5, suggesting
a mathematical process width of six times the standard deviation, equal to 51. The
mathematical model thus predicts larger process variation. However, even this larger
forecast is not as alarming as the mean value.
Overall, the project team has reasonable discipline in complying with plans,
indicated by the acceptable range. The estimation process requires improvement, and it
looks as if it could be fine-tuned to achieve a mean error of zero.
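These summary statistics are easy to recompute. The sketch below is illustrative only, using hypothetical effort-variance figures rather than the chapter's raw data, and shows how the mean, the standard deviation, the range, and the six-sigma process width are obtained:

```python
import statistics

# Hypothetical effort-variance values (%), one per task; illustrative only.
variance_pct = [3, 18, 7, 12, 22, -3, 15, 9, 30, 5, 11, 8]

mean = statistics.mean(variance_pct)
stdev = statistics.stdev(variance_pct)           # sample standard deviation (n - 1)
value_range = max(variance_pct) - min(variance_pct)
process_width = 6 * stdev                        # mathematical process width

print(f"mean          = {mean:.2f}")
print(f"stdev         = {stdev:.2f}")
print(f"range         = {value_range}")
print(f"6-sigma width = {process_width:.2f}")
```

A mean well above zero in such a run would be the "chronic waste" symptom discussed above, even when the range looks acceptable.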
Box 1.3 Small Is Big
The maintenance projects had to deal with 20,000 bugs every week pouring
in from globally located customer service centers. The product was huge, and
multiple updates happened every month, delivered to different users in
different parts of the world. The maintenance engineers were busy fixing the
bugs and had no inclination to look at and learn from maintenance data. The
very thought of a database with millions of data points deterred them from
taking a dip into the data. Managers were helpless in this regard because they
had no case to persuade people to take large chunks of time and pore over
data. Data were unpopular until people came to know about five-point sum-
maries. A month's data can be reduced to Tukey's five statistics: minimum,
first quartile, median, third quartile, and maximum. People found it easy,
with merely five statistics, to understand a month's performance.
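A five-point summary like the one the engineers adopted takes only a few lines to compute. This is a minimal sketch with made-up repair times; note that quartile conventions vary slightly between tools (Python's "inclusive" method is used here):

```python
import statistics

# Illustrative repair times (days) for one month; not the project's actual data.
repair_days = [12, 35, 48, 60, 72, 85, 96, 110, 130, 155, 190, 240, 300]

# Quartiles via linear interpolation over the ordered data.
q1, q2, q3 = statistics.quantiles(repair_days, n=4, method="inclusive")

five_point = {
    "minimum": min(repair_days),
    "quartile 1": q1,
    "median": q2,
    "quartile 3": q3,
    "maximum": max(repair_days),
}
print(five_point)
```

Five numbers replace thousands of raw observations, which is precisely why the summary made the data approachable.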
14 Simple Statistical Methods for Software Engineering
Application Notes
A primary application of the ideas we have seen in this chapter is in presenting data
summaries. The design of summary tables deserves attention.
First, avoid presenting too many metrics in a single table. Beyond
seven metrics, the brain cannot process parallel data; a summary table with 40
metrics goes over the reader's head. Such data can be grouped under five categories: business,
project, process, subprocess, and product. If such a categorization is not favored,
the summary table can use any of the following groupings:
Long term–short term
Business–process
Project–process
Project–process–product
What is important is that the table is partitioned into tiles; the parts may
be presented separately, connected by digital links. This way, different stakeholders
may read different tables. Whoever picks up a table will find the data relevant and
hence interesting.
Next, for every metric, the five-point summary may be presented instead of the
usual mean and sigma, for one good reason: most engineering data are nonnormal.
The five-point summary is robust and can handle both normal and nonnormal
data.
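The robustness claim is easy to demonstrate with a toy example (hypothetical numbers): one extreme value drags the mean and sigma far from the bulk of the data, while the median barely moves.

```python
import statistics

# Skewed, hypothetical effort data with one extreme outlier.
data = [5, 6, 7, 8, 9, 10, 11, 12, 13, 200]

print("mean   =", statistics.mean(data))    # pulled far to the right by 200
print("sigma  =", statistics.stdev(data))   # inflated by the outlier
print("median =", statistics.median(data))  # stays near the bulk of the data
```

The mean here lands at 28.1 although nine of the ten values lie below 14; the median of 9.5 tells the truer story, which is why the five-point summary is the safer default for nonnormal data.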
Concluding Remarks
It is important to understand the context of data to make both data collection and
interpretation effective.
Time to Repair Analysis
Tukey's Five-Point Summary

Statistic               Zone 1    Zone 2    Zone 3    Zone 4
N (bugs reported)       22,000    12,000    23,600    32,000
Time to repair, days
  Minimum                   12         6        45        25
  Quartile 1                70        44        57        63
  Median                   120        66       130        89
  Quartile 3               190        95       154       165
  Maximum                  300       126       200       223
Before analyzing data, we must determine their scale. Permissible statistical
methods change with scale. For example, we use median and percentiles for ordi-
nal data.
Errors in data undermine our confidence in the data. We should not unwit-
tingly repose undue confidence in data. We must seek to find the data sources and
assess the likely percentage of error in the data. For example, customer
perception data are likely to be inconsistent and subjective. In this case, we would
trust the central tendency expressions rather than the dispersion figures. Machine-
collected data, such as bug repair time, are likely to be accurate.
We should learn to summarize data visually as well as numerically. We can
make use of Excel graphs for the former and descriptive statistics in Excel for the
latter. These summaries also constitute first-level rudimentary analyses without
which data collection is incomplete.
Data have the power to change paradigms. Old paradigms that do not fit fresh
data are replaced by new paradigms that fit. Data have the power to renew business
management continually. Data are also a fertile ground for innovation, new dis-
coveries, and improvement. All these advantages can be gained with rudimentary
analyses of data.
Review Questions
1. What are data?
2. What are scales of measurement?
3. What is a statistic? How is it different from data?
4. What are the most commonly used descriptive statistics?
5. What is Tukey’s ve-point summary?
6. How do data contribute to self-improvement?
Box 1.4 Analogy: Biofeedback
There was a boy who stammered and went to a speech therapist. The treat-
ment given was simple: he had to watch his speech waveform on an oscillo-
scope as he spoke into a microphone. He practiced for 5 days, half an
hour a day, and walked away cured of stammering. The way he gained normal
speech is ascribed to biofeedback. Human systems correct themselves if they
happen to see their own performance. That is precisely what data usage in a software
development project achieves. When programmers see data about their code
defects, the human instinctive capability is to rectify the problems and offer
defect-free code. This principle has universal application and is relevant to all
software processes, from requirement gathering to testing.
Exercises
1. If you are engaged in writing code for a mission-critical software application,
and if you wish to control the quality of the code to ensure delivery of defect-
free components, what data will you collect? Design a data collection table.
2. During the testing of 5000 LOC of code, what data will you collect for the pur-
pose of assessing code stability?
Appendix 1.1: Definition of Descriptive Statistics
Number of Data Points
When we see a metric value, we should also know the size of the sample used in
the calculation.
Number of data points (observations) = n
Sum
This is a plain total of all values, useful as a meta-calculation:

Sum = \sum_{i=1}^{n} x_i
Variance
This is a mathematical calculation of data dispersion obtained from the following
formula:

Variance = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

where n is the sample size and \bar{x} is the sample mean. Variance is the average squared
deviation from the mean.
Standard Deviation
The square root of variance is equal to the standard deviation. This is the mathematical
expression of dispersion. It is also a parameter of the normal distribution.
The symbol σ, read as sigma, denotes the standard deviation:

σ = \sqrt{\text{variance}}
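The variance and standard deviation formulas can be verified directly. A minimal sketch, cross-checked against Python's statistics module, which uses the same n − 1 convention:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]
n = len(data)
mean = sum(data) / n

# Variance: average squared deviation from the mean, with n - 1 in the denominator.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: the square root of variance.
sigma = math.sqrt(variance)

# Cross-check against the library implementations.
assert abs(variance - statistics.variance(data)) < 1e-12
assert abs(sigma - statistics.stdev(data)) < 1e-12
print(variance, sigma)
```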
Maximum
This is the largest value in the sample. Large values of effort variance indicate a
special problem and are worth scrutiny. The questions here are “How bad is the
worst value? Is it beyond practical limits?” This statistic is a simple recognition of a
serious characteristic of the data.
Minimum
This is the other end of the data values. The question is similar: “How low is the min-
imum value?” In effort variance, the minimum value can have a negative sign,
suggesting cost compression. Usually, cost compression is good news, but process
managers become cautious when the value is deeply negative. The questions that
bother them are as follows: Has there been some compromise? Will the cost saving have
a boomerang effect?
Range
Range is obtained by subtracting the minimum from the maximum. Range repre-
sents process variation in an empirical sense. This statistic is widely used in process
control. It is simple to compute, yet sensitive enough to give an alert if processes vary
too much.
Range is just the difference between the largest and smallest values:

Range = maximum − minimum
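Minimum, maximum, and range take one line each to compute. A small sketch with hypothetical effort-variance figures, including a negative minimum that would signal cost compression:

```python
# Hypothetical effort-variance sample (%); the negative value is cost compression.
effort_variance = [-8, 2, 5, 11, 17, 25]

maximum = max(effort_variance)
minimum = min(effort_variance)
value_range = maximum - minimum   # range = maximum - minimum

print(minimum, maximum, value_range)
```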
Mode
Mode is the most often repeated value. It is an expression of central tendency.
Median
Median is the value that divides data, organized into an ordered array, into two
equal halves. This is another expression of central tendency.
In simple words, the median is the middle value in the list of numbers. The list
should first be arranged in ascending order. If the total number of values, n, is odd,
then the median is given by

Median = ((n + 1)/2)th term
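The ((n + 1)/2)th-term rule for odd n can be checked against Python's built-in median. A sketch with illustrative numbers:

```python
import statistics

# Odd number of values: the median is the ((n + 1)/2)th term of the sorted list.
values = [7, 1, 9, 3, 5]            # n = 5, so the median is the 3rd sorted term
ordered = sorted(values)            # [1, 3, 5, 7, 9]
n = len(ordered)
median = ordered[(n + 1) // 2 - 1]  # 1-based position (n + 1)/2 -> 0-based index

assert median == statistics.median(values)
print(median)
```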