Software Size Growth 277
Application of the Log-Normal Model 2
The next application refers to developing a control chart for design complexity. We
cannot apply the Shewhart control chart with the mean μ as the central line and
μ ± 3σ as the control limits. The Shewhart chart assumes a normal distribution, whereas
design complexity is log-normal. Shewhart limits are symmetrical; design complexity
limits cannot be symmetrical, because we have established design complexity as a
skewed distribution. Besides, we do not need a lower control limit for complexity.
All we need is an upper control limit.
Shewhart limits include 99.73% of the process inside the limits and keep only
0.27% as outliers. Shewhart limits apply better to manufacturing processes.
For creative processes such as software design, the authors suggest a different
rule for control limits. The proposed limits include 95% (0.95) of processes
inside the limits and mark 5% of processes as outliers. This upper control limit
is obtained from the CDF shown in Figure 17.6 as the x value corresponding to
a y value of 0.95.
Upper control limit for y = 0.95
= LOGNORMINV(y, scale, shape)
= LOGNORMINV(0.95, 0.693, 0.93)
= 9.2324

This sets the statistical upper control limit for design complexity.
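The same upper control limit can be sketched with Python's standard library, using the inverse CDF of the log-normal directly (exp of the normal quantile); the parameter values are those given in the text.

```python
import math
from statistics import NormalDist

# Parameters from the text: scale = Ln(median) = 0.693, shape (sigma) = 0.93
scale, shape = 0.693, 0.93

# Inverse CDF of the log-normal: exp(scale + shape * z),
# where z is the standard normal quantile for the chosen probability
z = NormalDist().inv_cdf(0.95)
ucl = math.exp(scale + shape * z)
print(round(ucl, 4))  # close to the 9.2324 reported in the text
```

Small differences from the tabulated 9.2324 come from rounding the scale parameter (0.693 ≈ Ln 2).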
Feature Addition in Software Enhancement
In Chapter 13, we treated requirement volatility as a Gaussian with a standard
deviation of approximately 3.3% and a mean value of 4%, for full life cycle devel-
opment projects with stringent business control on requirements. In large enhance-
ment projects, the Gaussian model does not hold; here changes are far more
common. Features added after requirements are "finalized" can reach values
as high as 50%. The growth of features is log-normal. The pattern of growth varies.
Three examples, A, B, and C, are presented in Figure 17.7.
Model C has the largest scale of 20 and the fattest tail. Model B has a scale of 10
and a medium-sized tail. Model A has a scale of 4 and has an early finishing point.
All three models represent the customer's processes, over which the maintenance
team has no direct control. In such cases, statistical management reduces to building
an empirical understanding of the process from data and creating the appropriate PDFs.
Recognizing whether variation is Gaussian or log-normal is the first step; histograms
enable this. Fitting the appropriate PDF by parameter extraction is the next step.
Applying the model to solve problems and make decisions is the goal.
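The parameter-extraction step can be sketched with the estimators this chapter uses elsewhere (scale = Ln of the data median, shape = standard deviation of Ln(x)); the sample data here are hypothetical.

```python
import math
import statistics

# Hypothetical sample of enhancement observations (percent features added)
data = [2.1, 3.8, 4.0, 5.5, 7.2, 9.9, 14.3, 21.0]

logs = [math.log(x) for x in data]
scale = math.log(statistics.median(data))  # scale = Ln(median of data)
shape = statistics.stdev(logs)             # shape = standard deviation of Ln(x)
```

With the parameters in hand, the fitted PDF can be overlaid on the histogram to judge the fit visually.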
278 Simple Statistical Methods for Software Engineering
A Log-Normal PDF for Change Requests
The process of change requests in a support project is a mixture; it is composed
of assorted tasks, including bug fixes, feature additions, and patchwork. A PDF of
change requests is modeled with the following parameters:

Shape σ = 1
Scale β = 7

The scale factor is set at the median of the data. The shape factor is chosen by an
iterative search for best fit. The log-normal PDF for change requests is plotted in
Figure 17.8.
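The plotted PDF can be sketched numerically as follows, taking the scale parameter as the median (so μ = Ln 7), consistent with how the scale factor is set above; the evaluation points are arbitrary.

```python
import math

def lognormal_pdf(x, median, sigma):
    """Log-normal density, with the scale parameter given as the data median."""
    mu = math.log(median)
    coeff = 1.0 / (x * sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2))

# Parameters from the text: shape sigma = 1, scale (median) = 7
densities = {x: lognormal_pdf(x, 7, 1.0) for x in (1, 3, 7, 10, 20)}
```

The density peaks well below the median (at the mode, 7/e ≈ 2.6) and decays slowly to the right, which is the long tail visible in Figure 17.8.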
However, this is merely curve fitting. This model does not benefit from
an ideological context such as that present in the models for feature addition
and design complexity. Despite this limitation, the model can still be used for
forecasting.
A better approach, beyond the scope of this book, would be to create a mixture
model, combining inherent probabilistic characteristics of the components.
Bug fixes may be denoted by a Weibull distribution, patchwork by a beta dis-
tribution, and feature addition by a log-normal distribution.
Mathematically combining the three would require a series of approxi-
mations and special analytical treatments demanding a specialist's
knowledge. However, such a combination can also be achieved digitally by
simulation.
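The simulation route can be sketched with the standard library samplers; the component distributions follow the text, but the mixture weights and every parameter value are illustrative assumptions.

```python
import math
import random

random.seed(17)

def change_request_sample():
    """One draw from a hypothetical three-component mixture of change requests."""
    u = random.random()
    if u < 0.5:    # bug fixes ~ Weibull (assumed weight 0.5)
        return random.weibullvariate(5.0, 1.5)
    elif u < 0.8:  # patchwork ~ scaled beta (assumed weight 0.3)
        return 10.0 * random.betavariate(2.0, 5.0)
    else:          # feature additions ~ log-normal (assumed weight 0.2)
        return random.lognormvariate(math.log(7), 1.0)

samples = [change_request_sample() for _ in range(10_000)]
```

A histogram of the simulated samples approximates the mixture PDF without any analytical treatment.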
Figure 17.7 Log-normal PDF of software enhancements. (Three curves: A, shape 1, scale 4; B, shape 1, scale 10; C, shape 1, scale 20. x-axis: features added %; y-axis: probability.)
From Pareto to Log-Normal
The Pareto distribution is a power law and is known for its fat tail (see Chapter 16 for
more details). The log-normal is a growth model and has a limited-sized tail. One can
switch to Pareto if bigger tails are needed. However, the similarity does not end with tails.
When examining income distribution data, Aitchison and
Brown (1954) observe that for lower incomes a lognormal dis-
tribution appears a better fit, while for higher incomes a power
law distribution appears better [4].
Power law distributions and log-normal distributions are quite natural models
and can be generated intuitively. They are also intrinsically connected.
Both the power law and the log-normal have been applied to file size distributions on
the Internet; they fare equally well. From a pragmatic point of view, it might be
reasonable to use whichever distribution makes it easier to obtain results.
Some Properties of Log-Normal Distribution
Some properties of the log-normal distribution can come in handy while analyzing
data. The following median-related formulas are given in NIST:

Mean = Median × e^(σ²/2)  (17.5)
Figure 17.8 Log-normal PDF of change requests. (Threshold 0, shape 1, scale 7. x-axis: change requests; y-axis: probability.)
Variance = Median² × e^(σ²) × (e^(σ²) − 1)  (17.6)
The central tendencies are defined in terms of the parameters as follows:

Mean = e^(β + σ²/2)  (17.7)

Median = e^β  (17.8)

Mode = e^(β − σ²)  (17.9)
We can see from the previous equations that the mean is always larger than the
median. Similarly, the mode is the smallest.
When β = 0 and σ = 1, so that Ln(x) follows the standard normal, the log-normal
distribution is called the standard log-normal distribution.
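Equations 17.5 through 17.9 can be checked numerically; the values of β and σ below are arbitrary illustrative choices.

```python
import math

beta, sigma = 2.0, 0.9  # illustrative parameters; beta = Ln(median)

median = math.exp(beta)                   # Equation 17.8
mean = math.exp(beta + sigma**2 / 2)      # Equation 17.7
mode = math.exp(beta - sigma**2)          # Equation 17.9
variance = median**2 * math.exp(sigma**2) * (math.exp(sigma**2) - 1)  # Eq. 17.6

# Equation 17.5 and the ordering mode < median < mean both hold
assert math.isclose(mean, median * math.exp(sigma**2 / 2))
assert mode < median < mean
```

The ordering mode < median < mean is a quick diagnostic: sample statistics in that order are consistent with a right-skewed, log-normal-like process.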
Case Study—Analysis of Failure Interval
The log-normal distribution is widely used in reliability studies. NIST presents several
models for reliability analysis, and the log-normal is one of them. The choice depends
on interpretation of the famous bathtub curve. Initially, mechanical systems show
infant mortality, with a high failure rate that decreases as the system stabilizes,
eventually reaching a flat low level. When the failure rate is constant,
the exponential distribution is enough.
When the failure rate is changing, the log-normal, the Weibull, or other models capa-
ble of handling change are required.
Reliability Analysis Centre [5] illustrates an example of log-normal distribution
with a scale of 10.3 and a shape of 1.0196 to represent infant mortality and speedy
recovery, although Weibull is their favorite model for reliability analysis.
In mechanical systems, reliability decreases with time, whereas in software
products reliability increases with usage, bug discovery, and bug fixing.
Failure mechanisms propagate and grow in physical systems; in software, they
are located, confined, and eliminated. We need to bear this in mind while
developing a probabilistic model for software reliability.
Failure models also draw on the theory of the product to ensure relevance. For exam-
ple, Varde [6] developed a log-normal model based on the physics of failure involving
electromigration. Varde, ardently supporting physics-based reasoning and appar-
ently reluctant to use mindless statistical models, observed,
Nevertheless, statistics still forms the part of physics-of-failure
approach. This is because prediction of time to failure is still
modeled employing probability distribution. Traditionally log-
normal failure distribution has been used to estimate failure
time due to electromigration related failure.
Varde used median time to fail as the scale parameter and standard deviation as
the shape parameter, exactly as in NIST guidelines.
We have studied failure times of software after release, using data made avail-
able by the Cyber Security and Information Systems Information Analysis Center
(CSIAC) [7]. CSIAC is a Department of Defense (DoD) Information Analysis
Center (IAC) sponsored by the Defense Technical Information Center (DTIC). The
CSIAC is a consolidation of three predecessor IACs: the Data and Analysis Center
for Software (DACS), the Information Assurance Technology IAC (IATAC), and
the Modeling and Simulation IAC (MSIAC), with the addition of the Knowledge
Management and Information Sharing technical area.
The software reliability data set has 111 records of failure intervals. With
time, the failure intervals grow, increasing software reliability. We consider
time between failures (TBF) as the key indicator of a complex process involving
usage and maintenance. Growth of TBF is expected to follow a smooth log-normal
with a clear peak and a distinct tail (see Box 17.3 for an analogy for software
TBF).
However, the histogram of TBF, shown in Figure 17.9, reveals two peaks
belonging to two separate clusters, suggesting two growth processes. The second
cluster could arise from a second release; it could also arise from a newly
introduced pattern of usage.
We have fitted two log-normal curves to the clusters. The first has a scale of
15.5, Ln(median), and a shape of 0.8 (the standard deviation of Ln(x)). The second has
a scale of 16.4 and a shape of 0.1. The graphs are shown in Figure 17.10. This is a
composite model.
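The composite model can be sketched as a two-component log-normal mixture; the fitted scales and shapes come from the text, but the equal 50/50 weighting is an assumption, since the cluster proportions are not stated.

```python
import math

def lognormal_pdf(x, scale, shape):
    """Log-normal density; scale = Ln(median), shape = std deviation of Ln(x)."""
    coeff = 1.0 / (x * shape * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((math.log(x) - scale) ** 2) / (2 * shape ** 2))

def composite_pdf(x, w=0.5):
    """Equal-weight mixture of the two fitted clusters (an assumed weighting)."""
    return w * lognormal_pdf(x, 15.5, 0.8) + (1 - w) * lognormal_pdf(x, 16.4, 0.1)

# The second, narrow cluster dominates the density near its median exp(16.4)
peak2 = composite_pdf(math.exp(16.4))
```

In practice the weight w would be estimated from the relative sizes of the two clusters in the 111-record data set.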
The second log-normal curve in Figure 17.10 resembles a Gaussian, but we still
prefer the log-normal equation because it is median based.