Chapter 1
Data, Data Quality, and
Descriptive Statistics
The Challenge That Persists
Data refer to facts, in contrast to opinion or conjecture. Data are evidence, results, an expression of reality, and all such concrete realizations. Data are the result of observation and measurement, of life in general and, in software development, of processes and products. We use the term data to represent both basic (raw) measures and derived (manipulated) metrics.
Data collection remains a challenge even after fifty years of history. The challenge engulfs both types of data collection: routine data collection and special purpose data collection, such as in improvement programs and experiments. Problems are more frequent in the first kind. A summary of the problems in data collection was presented by Goethert and Siviy [1], who find that “inconsistent definitions of measures, meaningless charts” are among the reasons for poor data. They complain that “the big picture is lost. We miss the forest for the trees.”
They also point fingers at a serious problem: the context of the indicators is not understood. Not finding a context for data is becoming a major crisis.
Data have no meaning apart from their context.
Shewhart
The routine data collection can be studied from five contexts, viewing from five management layers: business management, project management, process management, product management, and the recently introduced subprocess management. When data lose their context, they are considered irrelevant and are thus dismissed. When managers lose interest, data are buried at the source. The solution is to categorize metrics according to context and assure relevance to the stakeholders. The periodical metrics data report should be divided into interlinked and context-based sections. Different stakeholders read different sections of the report with interest. Context setting should be performed before the goal question metric (GQM) paradigm is applied to the metrics design.
Several software development organizations prefer to define “mandatory data” and call the rest “optional data.” Mandatory metrics are chosen from the context of the organization, whereas optional metrics are for local consumption. Mandatory metrics are like the metrics in a car dashboard display; the industry needs them to run the show. The industry chooses mandatory metrics to know the status of the project and to assess the situation, departing from the confines of the GQM paradigm in response to operational requirements.
SEI’s GQ(I)M framework [2] improved the GQM methodology in several ways. Using mental models and including charts as part of the measurement process are noteworthy. Instant data viewing using charts connects data with decision making and triggers biofeedback. Creating charts is a commendable achievement of statistical methods. Spreadsheets are available with tools to make adequate charts.
Mapping is frequently used in engineering measurements. The mapping phase of software size measurement in COSMIC Function Points is a brilliant exposition of this mapping. The International Function Point Users Group defines counting rules in a similar vein. Counting lines of code (LOC) is a long established method. Unambiguous methods are available to measure complexity. These are all product data regarded as “optional.” Despite the clarity provided by measurement technologies, product data are still not commonly available.
Moving up, business data include key performance indicators, best organized under the balanced scorecard scheme. These data are driven by strategy and vision and used in a small number of organizations as a complement to regular data collection.
Data availability has been, and remains, a problem. The severity of the problem varies according to the category of data. A summary is presented in the following table:
Category                Data Availability
1. Business data        Medium availability
2. Project data         High availability
3. Process data         Low availability
4. Subprocess data      Extremely low availability
5. Product data         Very low availability
Collecting data in the last two categories meets with maximum resistance from teams because this data collection is considered micromanagement. The previously mentioned profile of data availability is typical of the software business and contrasts with manufacturing, where, for example, product data are easily available.
Bringing Data to the Table Requires Motivation
A strong sense of purpose and motivation is required to compile relevant data for perusal, study, and analysis. Different stakeholders see different sections of data as pertinent. Business managers would like to review performance data. Project managers would like to review delivery related data, including some performance data they are answerable for. Process managers would like to review process data. Model builders and problem solvers dig into subprocess data. An engineering team would be interested in looking at product data.
Data are viewed by different people playing different roles from different windows. Making data visible in each window is the challenge. The organizational purpose of data collection should be translated into data collection objectives for different stakeholders and different users. Plurality of usage breeds plurality in usage perspectives. Plurality is a postmodern phenomenon. Single-track data compiling initiatives fail to satisfy these multiple users, resulting in dismally poor metric usage across the organization.
The mechanics of compiling data and maintaining them in a database that caters to diverse users are now feasible; data warehouse technology shows the method. What is still needed, however, is a common, structured, goal-driven process to bring data into the data warehouse.
Data Quality
On Scales
Software data have several sources, as there are several contexts; these data come in different qualities. A very broad way of classifying data quality is to divide data into qualitative and quantitative kinds. Verbal descriptions and subjective ratings are qualitative data. Numerical values are quantitative data. Stevens [3] developed scales for data while working on psychometrics: nominal, ordinal, interval, and ratio scales. The first two scales address qualitative data; the remaining two address quantitative data. Stevens restored legitimacy for qualitative data and identified permissible statistical analyses for each scale. Each scale is valuable in its own way, although most analysts prefer the higher scales because they carry data with better quality and transfer richer information.
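As an illustration of how the scale dictates the permissible summary statistic, consider the following sketch. It is not from the book; the project data in it are invented, and Python is used merely as a convenient notation.

```python
# A minimal sketch of Stevens' four scales and the summary statistics
# conventionally permitted on each. All data below are hypothetical.
from statistics import mode, median, mean

defect_types = ["logic", "interface", "logic", "data", "logic"]  # nominal
severity = [2, 3, 1, 2, 5, 4]        # ordinal (1 = cosmetic ... 5 = critical)
build_temps_c = [21.0, 23.5, 22.0]   # interval (no true zero)
effort_hours = [120.0, 95.5, 143.0]  # ratio (true zero exists)

# Nominal: only counting and the mode are meaningful.
print("Most frequent defect type:", mode(defect_types))

# Ordinal: order is meaningful, so the median and percentiles apply.
print("Median severity:", median(severity))

# Interval: differences are meaningful, so the mean applies.
print("Mean temperature:", mean(build_temps_c))

# Ratio: ratios are also meaningful (e.g., "twice the effort").
print("Mean effort:", mean(effort_hours),
      "| max/min ratio:", max(effort_hours) / min(effort_hours))
```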
When data quality is low, we change the rules of analyses; we do not discard the data.
Stevens’ measurement theory has had a lasting influence on statistical methods.
The lower scales, with allegedly inferior data quality, have found several applications in market research and customer satisfaction (CSAT) measurement. CSAT data are collected in almost every software project, and an ordinal scale designed by Likert [4] is extensively used at present for this purpose. We can improve CSAT data quality by switching over to the ratio scale, as in the Net Promoter Score approach invented by Frederick [5] to measure CSAT. CSAT data quality is of our own making. With better quality, CSAT data manifest better resolution, which in turn supports a comprehensive and dependable analysis.
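For readers who wish to try the calculation, the sketch below shows the usual Net Promoter Score arithmetic on a made-up set of 0 to 10 recommendation ratings: promoters score 9 or 10, detractors score 0 to 6, and the score is the percentage difference.

```python
# A minimal sketch of the Net Promoter Score calculation, assuming the
# usual 0-10 "would you recommend us?" scale; the ratings are invented.
ratings = [10, 9, 8, 7, 9, 6, 10, 4, 9, 8]  # hypothetical CSAT survey data

promoters = sum(1 for r in ratings if r >= 9)   # 9-10: promoters
detractors = sum(1 for r in ratings if r <= 6)  # 0-6: detractors

# NPS = % promoters - % detractors, a single score from -100 to +100.
nps = 100.0 * (promoters - detractors) / len(ratings)
print(f"NPS = {nps:+.0f}")   # here: (5 - 2) / 10 -> +30
```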
The advent of artificial intelligence has increased the scope of lower scale data. In these days of fuzzy logic, even text can be analyzed, fulfilling the vision of the German philosopher Frege, who strove to establish the mathematical properties of text. Today, the lower scales have proved to be equally valuable in their ability to capture truth.
Error
All data contain measurement errors, whether the data are from a scientific laboratory or from a field survey. Errors are smallest in a laboratory and largest in a field survey. If we repeat the measurement of a product in an experiment, we may get results that vary from trial to trial; this is the “repeatability” error. If many experimenters from different locations repeat the measurement, additional errors may appear because of person to person variation and environmental variation, known as the “reproducibility” error. These errors in experiments, collectively called noise, can be minimized by replication.
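A rough feel for the two noise components can be had from a small sketch. The measurements below are invented; the decomposition shown is a crude one (within-measurer spread as repeatability, spread of measurer averages as reproducibility), whereas a full gauge R&R study would use ANOVA.

```python
# A minimal sketch separating repeatability noise from reproducibility
# noise in repeated measurements. All data are hypothetical.
from statistics import mean, pstdev

# Three measurers each measure the same module size three times.
trials = {
    "measurer_A": [412, 409, 414],
    "measurer_B": [421, 418, 423],
    "measurer_C": [408, 411, 407],
}

# Repeatability: trial-to-trial spread for the same measurer,
# pooled here by averaging the within-measurer standard deviations.
within = mean(pstdev(v) for v in trials.values())

# Reproducibility: spread of the per-measurer averages.
between = pstdev([mean(v) for v in trials.values()])

print(f"repeatability (within-person) sd  = {within:.1f}")
print(f"reproducibility (between-person) sd = {between:.1f}")
```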
The discrepancy between the mean value of measured data and the true value denotes “bias.” Bias due to measuring devices can be corrected by calibrating the devices. Bias in estimation can be reduced by adopting the wideband Delphi method. Bias in regularly collected data is difficult to correct by statistical methods.
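The arithmetic of bias, and of the calibration that corrects it, takes only a few lines. The true value and the measurements below are hypothetical.

```python
# A minimal sketch of bias: the gap between the average measurement and
# the true value. The "true" size is an assumed reference value.
from statistics import mean

true_size = 400                      # assumed known reference value
measured = [412, 409, 414, 421, 418]

bias = mean(measured) - true_size
print(f"bias = {bias:+.1f}")         # positive: systematic over-count

# Calibration: subtract the estimated bias from future measurements.
corrected = [m - bias for m in measured]
print("corrected mean:", mean(corrected))  # now centered on the true value
```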
Both bias and noise are present in all data; only the magnitude varies. Special purpose data, such as those collected in experiments and improvement programs, have the least; data regularly collected from processes and products have the most. If the collected data could be validated by team leaders or managers, most of the human errors could be reduced. Statistical cleaning of data is possible, to some extent, by using data mining approaches, as shown by Han and Kamber [6]. Hundreds of tools are available to clean data by using standard procedures such as auditing, parsing, standardization, record matching, and householding. However, data validation by team leaders is far more effective than automated data mining technology. Even better is to analyze data, spot outliers and odd patterns, and let these data anomalies be corrected by process owners. Simple forms of analysis such as line graphs, scatter plots, and box plots can help in spotting bad data.
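As a sketch of the last idea, the box plot’s fences can be computed directly: values beyond 1.5 times the interquartile range from the quartiles are flagged for the process owners to review. The effort figures below are invented.

```python
# A minimal sketch of spotting candidate bad data with Tukey's box-plot
# fences (values beyond 1.5 * IQR from the quartiles). Data are made up.
from statistics import quantiles

effort_hours = [12, 14, 13, 15, 14, 13, 98, 12, 16, 15, 14, 2]

q1, _, q3 = quantiles(effort_hours, n=4)   # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in effort_hours if x < low or x > high]
print("fences:", (round(low, 1), round(high, 1)))
print("flag for process owners to review:", outliers)
```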
Cleaned data can be kept in a separate database called a data warehouse. Using data warehouse techniques also helps in collecting data from heterogeneous sources and giving data a structure that makes further analysis easy. The need for a commonly available database is felt strongly in the software industry. More and more data get locked into the personal databases of team members. Even where data collection is automated and data quality is free from bias and noise, the situation can be worse: data are quietly logged into huge repositories with access available only to privileged managers, who do not have the time for data related work. The shoemaker syndrome seems to be at work.
Data Stratification
This is one of the earliest known methods. Data must be grouped, categorized, or stratified before analysis. Data categories are decided from engineering and management standpoints. This should not be left to statistical routines such as clustering or principal component analysis.
In real life, stratification is performed neither with the right spirit nor with the required seriousness. A common example is the attempt to gather software productivity data and arrive at an organizational baseline without stratification. Productivity (function points per person-month) depends on the programming language; Capers Jones [7] has published programming language tables indicating how productivity increases with the level of the language.
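The effect of stratification can be demonstrated with a small sketch. The languages and productivity figures below are illustrative only, not taken from Jones’ tables.

```python
# A minimal sketch of stratifying productivity data by programming
# language before baselining. All figures are invented.
from collections import defaultdict
from statistics import mean

# (language, productivity in function points per person-month)
projects = [
    ("COBOL", 6.5), ("COBOL", 7.1), ("COBOL", 5.9),
    ("Java", 12.4), ("Java", 13.0), ("Java", 11.8),
    ("SQL", 21.0), ("SQL", 19.5),
]

by_language = defaultdict(list)
for lang, fp_pm in projects:
    by_language[lang].append(fp_pm)

# One pooled baseline hides the language effect ...
print(f"pooled baseline: {mean(p for _, p in projects):.1f} FP/PM")

# ... stratified baselines preserve it.
for lang, values in by_language.items():
    print(f"{lang:6s} baseline: {mean(values):.1f} FP/PM")
```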
Visual Summary
Descriptive statistics is used to describe and depict collected data in the form of charts and tables. Data are summarized to facilitate reasoning and analysis. The first depiction is the visual display of data, a part of indicators in the GQ(I)M paradigm [2]. The second depiction is a numerical summary of data.
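A numerical summary of the second kind can be produced in a few lines; the defect densities below are invented for illustration.

```python
# A minimal sketch of a numerical summary of collected data.
from statistics import mean, median, pstdev

defects_per_kloc = [4.2, 3.8, 5.1, 4.6, 9.7, 4.0, 3.9]  # hypothetical

print("n      :", len(defects_per_kloc))
print("mean   :", round(mean(defects_per_kloc), 2))
print("median :", round(median(defects_per_kloc), 2))
print("sd     :", round(pstdev(defects_per_kloc), 2))
print("range  :", (min(defects_per_kloc), max(defects_per_kloc)))
```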
Visual display is an effective way of presenting data; it is also called statistical charting. Graphical form communicates to the human brain better and faster, allowing the brain to do visual reasoning, a crucial process for engineers and managers. Park and Kim [8] proposed a model for visual reasoning in the creative design process. There is growing evidence to show that information visualization augments mental models in engineering design (Liu and Stasko [9]). Data visualization is emerging as a sophisticated discipline in its own right.
Let us consider, as examples, two simple graphs. The first is a radar chart of project risks, shown in Figure 1.1.
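Figure 1.1 itself is not reproduced here, but a radar chart of this kind is easy to sketch with matplotlib; the risk dimensions and scores below are invented.

```python
# A minimal sketch of a project risk radar chart (cf. Figure 1.1).
import matplotlib.pyplot as plt
import numpy as np

risks = ["Requirements", "Technology", "Schedule", "People", "Quality"]
scores = [3, 4, 2, 5, 3]   # hypothetical ratings, 1 (low) to 5 (high)

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(risks), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(risks)
ax.set_ylim(0, 5)
ax.set_title("Project risk profile (illustrative)")
plt.show()
```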
This provides a risk profile of the project environment at a glance. The radar chart presents an integrated view of risk; it is also an elegant summary. This chart can be refreshed every month, showing project managers the reality. Reflecting upon the chart, managers can make decisions for action. The second chart is a line graph