Chapter 1
Data, Data Quality, and
Descriptive Statistics
The Challenge That Persists
Data refer to facts, in contrast to opinion or conjecture. Data are evidence, results, an expression of reality, and all such concrete realizations. Data are the result of observation and measurement, of life in general and, in software development, of processes and products. We use the term data to represent both basic (raw) measures and derived (manipulated) metrics.
Data collection remains a challenge even after fifty years of history. The challenge engulfs both types of data collection: routine data collection and special purpose data collection, such as in improvement programs and experiments. Problems are more frequent in the first kind. A summary of the problems in data collection was presented by Goethert and Siviy [1], who find that “inconsistent definitions of measures, meaningless charts” are among the reasons for poor data. They complain that “the big picture is lost. We miss the forest for the trees.”
They also point fingers at a serious problem: the context of the indicators is not understood. Not finding a context for data is becoming a major crisis.
Data have no meaning apart from their context.
Shewhart
The routine data collection can be studied from five contexts, viewing from five management layers: business management, project management, process management, product management, and the recently introduced subprocess management. When data lose their context, they are considered irrelevant and are thus dismissed. When managers lose interest, data are buried at the source. The solution is to categorize metrics according to context and assure relevance to the stakeholders. The periodical metrics data report should be divided into interlinked and context-based sections. Different stakeholders read different sections of the report with interest. Context setting should be performed before the goal question metric (GQM) paradigm is applied to the metrics design.
Several software development organizations prefer to define “mandatory data” and call the rest “optional data.” Mandatory metrics are chosen from the context of the organization, whereas optional metrics are for local consumption. Mandatory metrics are like the metrics in a car dashboard display; the industry needs them to run the show. The industry chooses mandatory metrics to know the status of the project and to assess the situation, departing from the confines of the GQM paradigm in response to operational requirements.
SEI’s GQ(I)M framework [2] improved the GQM methodology in several ways. Using mental models and including charts as part of the measurement process are noteworthy. Instant data viewing using charts connects data with decision making and triggers biofeedback. Creating charts is a commendable achievement of statistical methods. Spreadsheets are available with tools to make adequate charts.
Mapping is frequently used in engineering measurements. The mapping phase of software size measurement in COSMIC Function Points is a brilliant exposition of this mapping. The International Function Point Users Group defines counting rules in a similar vein. Counting lines of code (LOC) is a long established method. Unambiguous methods are available to measure complexity. These are all product data regarded as “optional.” Despite the clarity provided by measurement technologies, product data are still not commonly available.
Moving up, business data include key performance indicators, best organized under the balanced scorecard scheme. These data are driven by strategy and vision and used in a small number of organizations as a complement to regular data collection.
Data availability has been, and remains, a problem. The severity of the problem varies according to the category of data. A summary is presented in the following table:
Category                Data Availability
1. Business data        Medium availability
2. Project data         High availability
3. Process data         Low availability
4. Subprocess data      Extremely low availability
5. Product data         Very low availability
Collecting data in the last two categories meets with maximum resistance from teams because this data collection is considered micromanagement. The previously mentioned profile of data availability is typical of the software business and contrasts with manufacturing, where, for example, product data are easily available.
Bringing Data to the Table Requires Motivation
A strong sense of purpose and motivation is required to compile relevant data for perusal, study, and analysis. Different stakeholders see different sections of data as pertinent. Business managers would like to review performance data. Project managers would like to review delivery related data, including some performance data they are answerable for. Process managers would like to review process data. Model builders and problem solvers dig into subprocess data. An engineering team would be interested in looking at product data.
Data are viewed by different people playing different roles from different windows. Making data visible in each window is the challenge. The organizational purpose of data collection should be translated into data collection objectives for different stakeholders and different users. Plurality of usage breeds plurality in usage perspectives. Plurality is a postmodern phenomenon. Single-track data compiling initiatives fail to satisfy these multiple users, resulting in dismally poor metric usage across the organization.
The mechanics of compiling data and maintaining them in a database that caters to diverse users are now feasible; data warehouse technology shows the method. What is still needed, however, is a common, structured, goal-driven process to bring data into the data warehouse.
Data Quality
On Scales
Software data have several sources, as there are several contexts; these data come in different qualities. A very broad way of classifying data quality is to divide data into qualitative and quantitative kinds. Verbal descriptions and subjective ratings are qualitative data. Numerical values are quantitative data. Stevens [3] developed scales for data while working on psychometrics: nominal, ordinal, interval, and ratio scales. The first two scales address qualitative data; the remaining two address quantitative data. Stevens restored legitimacy for qualitative data and identified permissible statistical analyses for each scale. Each scale is valuable in its own way, although most analysts prefer the higher scales because they carry data with better quality and transfer richer information.
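As an illustration of how the scale dictates the permissible summary statistic, consider the following sketch. It is not from the book; the project data in it are invented, and Python is used merely as a convenient notation.

```python
# A minimal sketch of Stevens' four scales and the summary statistics
# conventionally permitted on each. All data below are hypothetical.
from statistics import mode, median, mean

defect_types = ["logic", "interface", "logic", "data", "logic"]  # nominal
severity = [2, 3, 1, 2, 5, 4]        # ordinal (1 = cosmetic ... 5 = critical)
build_temps_c = [21.0, 23.5, 22.0]   # interval (no true zero)
effort_hours = [120.0, 95.5, 143.0]  # ratio (true zero exists)

# Nominal: only counting and the mode are meaningful.
print("Most frequent defect type:", mode(defect_types))

# Ordinal: order is meaningful, so the median and percentiles apply.
print("Median severity:", median(severity))

# Interval: differences are meaningful, so the mean applies.
print("Mean temperature:", mean(build_temps_c))

# Ratio: ratios are also meaningful (e.g., "twice the effort").
print("Mean effort:", mean(effort_hours),
      "| max/min ratio:", max(effort_hours) / min(effort_hours))
```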
When data quality is low, we change the rules of analyses; we do not discard the data.
Stevens’ measurement theory has had a lasting influence on statistical methods.
The lower scales, with allegedly inferior data quality, have found several applications in market research and customer satisfaction (CSAT) measurement. CSAT data are collected in almost every software project, and an ordinal scale designed by Likert [4] is extensively used at present for this purpose. We can improve CSAT data quality by switching over to the ratio scale, as in the Net Promoter Score approach invented by Frederick [5] to measure CSAT. CSAT data quality is of our own making. With better quality, CSAT data manifest better resolution, which in turn supports a comprehensive and dependable analysis.
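For readers who wish to try the calculation, the sketch below shows the usual Net Promoter Score arithmetic on a made-up set of 0 to 10 recommendation ratings: promoters score 9 or 10, detractors score 0 to 6, and the score is the percentage difference.

```python
# A minimal sketch of the Net Promoter Score calculation, assuming the
# usual 0-10 "would you recommend us?" scale; the ratings are invented.
ratings = [10, 9, 8, 7, 9, 6, 10, 4, 9, 8]  # hypothetical CSAT survey data

promoters = sum(1 for r in ratings if r >= 9)   # 9-10: promoters
detractors = sum(1 for r in ratings if r <= 6)  # 0-6: detractors

# NPS = % promoters - % detractors, a single score from -100 to +100.
nps = 100.0 * (promoters - detractors) / len(ratings)
print(f"NPS = {nps:+.0f}")   # here: (5 - 2) / 10 -> +30
```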
The advent of artificial intelligence has increased the scope of lower scale data. In these days of fuzzy logic, even text can be analyzed, fulfilling the vision of the German philosopher Frege, who strove to establish the mathematical properties of text. Today, the lower scales have proved to be equally valuable in their ability to capture truth.
Error
All data contain measurement errors, whether the data are from a scientific laboratory or from a field survey. Errors are smallest in a laboratory and largest in a field survey. If we repeat the measurement of a product in an experiment, we may get results that vary from trial to trial; this is the “repeatability” error. If many experimenters from different locations repeat the measurement, additional errors may appear because of person to person variation and environmental variation, known as the “reproducibility” error. These errors in experiments, collectively called noise, can be minimized by replication.
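A rough feel for the two noise components can be had from a small sketch. The measurements below are invented; the decomposition shown is a crude one (within-measurer spread as repeatability, spread of measurer averages as reproducibility), whereas a full gauge R&R study would use ANOVA.

```python
# A minimal sketch separating repeatability noise from reproducibility
# noise in repeated measurements. All data are hypothetical.
from statistics import mean, pstdev

# Three measurers each measure the same module size three times.
trials = {
    "measurer_A": [412, 409, 414],
    "measurer_B": [421, 418, 423],
    "measurer_C": [408, 411, 407],
}

# Repeatability: trial-to-trial spread for the same measurer,
# pooled here by averaging the within-measurer standard deviations.
within = mean(pstdev(v) for v in trials.values())

# Reproducibility: spread of the per-measurer averages.
between = pstdev([mean(v) for v in trials.values()])

print(f"repeatability (within-person) sd  = {within:.1f}")
print(f"reproducibility (between-person) sd = {between:.1f}")
```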
The discrepancy between the mean value of measured data and the true value denotes “bias.” Bias due to measuring devices can be corrected by calibrating the devices. Bias in estimation can be reduced by adopting the wideband Delphi method. Bias in regularly collected data is difficult to correct by statistical methods.
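The arithmetic of bias, and of the calibration that corrects it, takes only a few lines. The true value and the measurements below are hypothetical.

```python
# A minimal sketch of bias: the gap between the average measurement and
# the true value. The "true" size is an assumed reference value.
from statistics import mean

true_size = 400                      # assumed known reference value
measured = [412, 409, 414, 421, 418]

bias = mean(measured) - true_size
print(f"bias = {bias:+.1f}")         # positive: systematic over-count

# Calibration: subtract the estimated bias from future measurements.
corrected = [m - bias for m in measured]
print("corrected mean:", mean(corrected))  # now centered on the true value
```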
Both bias and noise are present in all data; only the magnitude varies. Special purpose data, such as those collected in experiments and improvement programs, have the least; data regularly collected from processes and products have the most. If the collected data could be validated by team leaders or managers, most of the human errors could be reduced. Statistical cleaning of data is possible, to some extent, by using data mining approaches, as shown by Han and Kamber [6]. Hundreds of tools are available to clean data by using standard procedures such as auditing, parsing, standardization, record matching, and householding. However, data validation by team leaders is far more effective than automated data mining technology. Even better is to analyze data, spot outliers and odd patterns, and let these data anomalies be corrected by process owners. Simple forms of analysis such as line graphs, scatter plots, and box plots can help in spotting bad data.
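As a sketch of the last idea, the box plot’s fences can be computed directly: values beyond 1.5 times the interquartile range from the quartiles are flagged for the process owners to review. The effort figures below are invented.

```python
# A minimal sketch of spotting candidate bad data with Tukey's box-plot
# fences (values beyond 1.5 * IQR from the quartiles). Data are made up.
from statistics import quantiles

effort_hours = [12, 14, 13, 15, 14, 13, 98, 12, 16, 15, 14, 2]

q1, _, q3 = quantiles(effort_hours, n=4)   # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in effort_hours if x < low or x > high]
print("fences:", (round(low, 1), round(high, 1)))
print("flag for process owners to review:", outliers)
```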
Cleaned data can be kept in a separate database called a data warehouse. Using data warehouse techniques also helps in collecting data from heterogeneous sources and giving data a structure that makes further analysis easy. The need for a commonly available database is felt strongly in the software industry. More and more data get locked into the personal databases of team members. Even where data collection is automated and data quality is free from bias and noise, the situation can be worse: data are quietly logged into huge repositories with access available only to privileged managers, who do not have the time for data related work. The shoemaker syndrome seems to be at work.
Data Stratification
This is one of the earliest known methods. Data must be grouped, categorized, or stratified before analysis. Data categories are decided from engineering and management standpoints. This should not be left to statistical routines such as clustering or principal component analysis.
In real life, stratification is performed neither with the right spirit nor with the required seriousness. A common example is the attempt to gather software productivity data and arrive at an organizational baseline without stratification. Productivity (function points per person-month) depends on the programming language; Capers Jones [7] has published programming language tables indicating how productivity increases with the level of the language.
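The effect of stratification can be demonstrated with a small sketch. The languages and productivity figures below are illustrative only, not taken from Jones’ tables.

```python
# A minimal sketch of stratifying productivity data by programming
# language before baselining. All figures are invented.
from collections import defaultdict
from statistics import mean

# (language, productivity in function points per person-month)
projects = [
    ("COBOL", 6.5), ("COBOL", 7.1), ("COBOL", 5.9),
    ("Java", 12.4), ("Java", 13.0), ("Java", 11.8),
    ("SQL", 21.0), ("SQL", 19.5),
]

by_language = defaultdict(list)
for lang, fp_pm in projects:
    by_language[lang].append(fp_pm)

# One pooled baseline hides the language effect ...
print(f"pooled baseline: {mean(p for _, p in projects):.1f} FP/PM")

# ... stratified baselines preserve it.
for lang, values in by_language.items():
    print(f"{lang:6s} baseline: {mean(values):.1f} FP/PM")
```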
Visual Summary
Descriptive statistics is used to describe and depict collected data in the form of charts and tables. Data are summarized to facilitate reasoning and analysis. The first depiction is the visual display of data, a part of indicators in the GQ(I)M paradigm [2]. The second depiction is a numerical summary of data.
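A numerical summary of the second kind can be produced in a few lines; the defect densities below are invented for illustration.

```python
# A minimal sketch of a numerical summary of collected data.
from statistics import mean, median, pstdev

defects_per_kloc = [4.2, 3.8, 5.1, 4.6, 9.7, 4.0, 3.9]  # hypothetical

print("n      :", len(defects_per_kloc))
print("mean   :", round(mean(defects_per_kloc), 2))
print("median :", round(median(defects_per_kloc), 2))
print("sd     :", round(pstdev(defects_per_kloc), 2))
print("range  :", (min(defects_per_kloc), max(defects_per_kloc)))
```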
Visual display is an effective way of presenting data; it is also called statistical charting. Graphical form communicates to the human brain better and faster, allowing the brain to do visual reasoning, a crucial process for engineers and managers. Park and Kim [8] proposed a model for visual reasoning in the creative design process. There is growing evidence to show that information visualization augments mental models in engineering design (Liu and Stasko [9]). Data visualization is emerging as a sophisticated discipline in its own right.
Let us consider, as examples, two simple graphs. The first is a radar chart of project risks, shown in Figure 1.1.
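Figure 1.1 itself is not reproduced here, but a radar chart of this kind is easy to sketch with matplotlib; the risk dimensions and scores below are invented.

```python
# A minimal sketch of a project risk radar chart (cf. Figure 1.1).
import matplotlib.pyplot as plt
import numpy as np

risks = ["Requirements", "Technology", "Schedule", "People", "Quality"]
scores = [3, 4, 2, 5, 3]   # hypothetical ratings, 1 (low) to 5 (high)

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(risks), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(risks)
ax.set_ylim(0, 5)
ax.set_title("Project risk profile (illustrative)")
plt.show()
```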
This provides a risk profile of the project environment at a glance. The radar chart presents an integrated view of risk; it is also an elegant summary. This chart can be refreshed every month, showing project managers the reality. Reflecting upon the chart, managers can make decisions for action. The second chart is a line graph