Chapter 10
Pattern Extraction
Using Histogram
The histogram is easily the most commonly used statistical method. It is used to
detect frequency patterns in data.
A histogram is a way to count the number of data points in data intervals. First,
the data range that extends from the minimum to maximum value is divided into
a certain number of equal intervals. Then we count the number of data points in
each interval and tabulate the counts in a table called a tally. This frequency table
is converted into a bar graph known as a histogram. An example of creating
a tally from data, constructing a histogram, and deriving a frequency diagram as well as an
ogive from the histogram is illustrated in Appendix 10.1.
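The tallying steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the worked example of Appendix 10.1: the function names `tally` and `print_histogram` and the sample values are our own assumptions, and the sketch assumes the data are not all identical.

```python
def tally(data, num_intervals):
    """Count data points in equal-width intervals spanning min..max."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all data points are equal; no range to divide")
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for x in data:
        # Index of the interval that contains x; the maximum value is
        # placed in the last interval instead of opening a new one.
        i = min(int((x - lo) / width), num_intervals - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(num_intervals + 1)]
    return edges, counts


def print_histogram(edges, counts):
    """Print the tally as a text bar graph, one row per interval."""
    for k, c in enumerate(counts):
        print(f"{edges[k]:7.1f} to {edges[k + 1]:7.1f}  {'#' * c} {c}")


# Hypothetical sample (not from the book): % requirements volatility
# observed in 16 projects.
sample = [-12, -5, -3, 0, 1, 2, 2, 3, 4, 5, 5, 6, 8, 10, 14, 20]
edges, counts = tally(sample, 4)
print_histogram(edges, counts)
```

Each printed row is one line of the tally; the row of `#` marks is the bar that a charting tool would draw for that interval.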
A histogram of requirements volatility is shown in Figure 10.1. This histogram
is similar to the one obtained by Kulk and Verhoef [1], who studied requirements
volatility in 84 IT projects comprising 16,500 function points. Requirement
changes do vary beyond the traditional limit of 10%.
What draws our attention first in the histogram is its peak. Stable processes
have strong peaks. The peak represents the mode of the process. The core process is
denoted by the body of the histogram. Almost all of the process results fall within
the core. Outside the core, we can see outliers. Unlike in the box plot, outliers are
distinguished by contrasts in the pattern and not by any rules. This histogram is
symmetrical. Many metrics exhibit nearly symmetrical shapes; effort variance and
schedule variance are well-known examples.
Not every histogram is symmetrical. For example, the histogram of complexity
of object-oriented structures, measured as weighted methods per class, is shown in
Figure 10.2, modeled after the findings of Rosenberg [2].
146 Simple Statistical Methods for Software Engineering
Figure 10.1 Symmetrical histogram of requirements volatility. (Horizontal axis: % requirements volatility, –20 to 30; vertical axis: frequency, 0 to 30.)
Figure 10.2 Skewed histogram of object complexity. (Horizontal axis: weighted methods per class, 5 to 105; vertical axis: frequency, 0 to 120.)
This histogram is skewed. There are few classes with large weighted methods
per class. The larger the number of methods in a class, the greater the potential
effect on children; children inherit all the methods defined in the parent class.
Classes with many methods are likely to limit the possibility of reuse. They are also
difficult to understand and test. Those few complex objects require review and a
relook. A lot more metrics show skewed shapes; time to repair, defect density, and
productivity are known for their skew.
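The chapter judges symmetry and skew by eye. As a complementary numerical check, which is our own addition and not a method from the text, we can compute the moment coefficient of skewness: it is near zero for a symmetrical sample and positive when a long tail extends to the right, as in Figure 10.2. The function name `skewness` and both samples below are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples (not from the book).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]  # a few large values form a right tail

print(f"symmetric sample:    {skewness(symmetric):.3f}")
print(f"right-skewed sample: {skewness(right_skewed):.3f}")
```

A single number cannot replace the histogram, since it hides the peak, the body, and the outliers, but it can help flag which metrics deserve a closer look.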
BOX 10.1 HISTOGRAM IN CAMERAS
A histogram is a digital signature of reality. It is used in modern digital
cameras to present a pictorial view of light intensity in the field of view. The light
profile of objects seen through the lens is scanned, digitized, and converted
into a histogram. The photographer can derive clues from this histogram about
the settings required to get a good picture.
Understanding image histograms is probably the single most important
concept to become familiar with when working with pictures from a digital
camera. A histogram can tell you whether or not your image has been properly
exposed, whether the lighting is harsh or flat, and what adjustments will
work best. It will not only improve your skills on the computer, but as a
photographer as well. (http://www.cambridgeincolour.com/tutorials/histograms1.htm)
Before the histogram, photography enthusiasts had to go through a lot
more effort to get good exposures.
Image editors typically have provisions to create a histogram of the
image being edited. The histogram plots the number of pixels in the
image (vertical axis) with a particular brightness value (horizontal axis).
Algorithms in the digital editor allow the user to visually adjust the
brightness value of each pixel and to dynamically display the results as
adjustments are made. Improvements in picture brightness and contrast can thus
be obtained. (http://en.wikipedia.org/wiki/Image_editing)
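The image histogram described in the box, the number of pixels at each brightness value, can be sketched as follows. This is only an illustration under simple assumptions: the image is a small grid of integer brightness values, and the function name `brightness_histogram` and the 3 x 4 image are hypothetical. Real cameras and image editors typically work with 256 levels per channel.

```python
def brightness_histogram(image, levels=256):
    """Count how many pixels take each brightness value 0..levels-1."""
    counts = [0] * levels
    for row in image:
        for pixel in row:
            counts[pixel] += 1
    return counts


# Hypothetical 3 x 4 grayscale image with 4 brightness levels
# (0 = darkest, 3 = brightest).
image = [
    [0, 0, 1, 3],
    [0, 1, 1, 3],
    [0, 0, 2, 3],
]
hist = brightness_histogram(image, levels=4)
for level, count in enumerate(hist):
    print(f"brightness {level}: {'#' * count} {count}")
```

A pile-up of pixels at the darkest or brightest level is the kind of clue a photographer reads as under- or overexposure.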
Choosing the Number of Intervals
Square Root Formula
The elements of a histogram, namely, the peak, the body, and the outliers, can
change if we change the number of intervals N. Normally, we choose the number of
intervals to be the square root of the number of data points in the sample, n.
$$N = n^{0.5} \tag{10.1}$$
This square root formula is taken as the default in histogram analysis.
Alternate Approaches
There are three other conventions for selecting the number of intervals.
1. The Sturges rule,
$$N = \log_2 n + 1 \tag{10.2}$$
2. The Freedman–Diaconis rule,
$$N = \frac{\text{Range}}{2\,\mathrm{IQR}/n^{1/3}} \tag{10.3}$$
3. The Scott rule, which suggests fewer intervals,
$$N = \frac{\text{Range}}{3.5\,\sigma/n^{1/3}} \tag{10.4}$$
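The four rules can be compared side by side on one data set. The sketch below is illustrative only: the function name `interval_counts` and the sample are hypothetical, IQR is taken as the interquartile range and σ as the sample standard deviation, and the quartiles use the default convention of Python's `statistics.quantiles`, which is one of several in use. The formulas return real numbers; rounding up to a whole number of intervals is our choice here.

```python
import math
import statistics


def interval_counts(data):
    """Number of intervals N suggested by Equations 10.1 to 10.4 (unrounded)."""
    n = len(data)
    data_range = max(data) - min(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    sigma = statistics.stdev(data)  # sample standard deviation (assumption)
    cube_root_n = n ** (1 / 3)
    return {
        "square root": math.sqrt(n),                                    # 10.1
        "Sturges": math.log2(n) + 1,                                    # 10.2
        "Freedman-Diaconis": data_range / (2 * iqr / cube_root_n),      # 10.3
        "Scott": data_range / (3.5 * sigma / cube_root_n),              # 10.4
    }


# Hypothetical sample (not from the book): 64 evenly spread values.
sample = list(range(1, 65))
rules = interval_counts(sample)
for name, value in rules.items():
    print(f"{name:18s} N = {value:5.2f} -> {math.ceil(value)} intervals")
```

The sketch assumes the sample has a nonzero IQR and standard deviation; otherwise Equations 10.3 and 10.4 divide by zero.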
Exploratory Iterations
We can see how the histogram varies when the number of bin intervals (and,
as a direct consequence, the bin size) varies according to the four rules previously
mentioned. Our trials need not be limited to these four rules. We can try our
own choice of bin size and extract patterns to suit our inquiry. Bin size reduction
increases the resolution of the histogram graph. For example, in the first iteration, with
just nine bins, effort variance data yield the histogram shown in Figure 10.3. The
histogram shows stability and a single peak. We can explore further by improving
the resolution of the histogram: choosing 20 bins, we get the histogram shown in
Figure 10.4.
This histogram has three modes, or three clusters. This is merely an estimate.
Extracting different histogram estimates from the same data set is known as
nonparametric density function estimation.
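The effect of such iterations can be reproduced on a small scale. The sketch below bins one hypothetical sample twice, coarsely and finely, and counts the peaks in each tally. The data, the bin counts of 3 and 12, and the function names `bin_counts` and `count_peaks` are assumptions made for illustration; they are not the effort variance data behind Figures 10.3 and 10.4.

```python
def bin_counts(data, num_bins):
    """Tally data into equal-width bins spanning min..max."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The maximum value is placed in the last bin.
        counts[min(int((x - lo) / width), num_bins - 1)] += 1
    return counts


def count_peaks(counts):
    """Count bins strictly taller than both neighbours (outside counts as 0)."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical sample (not from the book) with three loose clusters.
sample = [0, 7, 11, 12, 13, 14, 17,
          22, 26, 27, 31, 32, 32, 33, 34, 36, 38,
          43, 47, 51, 52, 53, 54, 60]

for num_bins in (3, 12):
    counts = bin_counts(sample, num_bins)
    print(f"{num_bins:2d} bins: {counts} -> {count_peaks(counts)} peak(s)")
```

With this sample the coarse tally shows a single peak, while the finer one separates three. Counting strict local maxima is a crude stand-in for visual judgment: with very small bins it would also count noise as peaks, which is why each histogram should be read as an estimate.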
Figure 10.3 Histogram of effort variance (EVA%) with nine bins. (Horizontal axis: EVA %, in bins ≤–100, (–100, –70], (–70, –40], (–40, –10], (–10, 20], (20, 50], (50, 80], (80, 110], >110; vertical axis: 0 to 120.)
Figure 10.4 Histogram of effort variance (EVA%) with 20 bins. (Horizontal axis: EVA %, in bins ≤–100, then intervals of width 10 from (–100, –90] to (70, 80], then >80; vertical axis: 0 to 120.)