Data is at the heart of statistics. Put simply, data is information:
any type of information about the things you are trying to analyze.
It may be information about customers, companies, or shares, like the
Accu-Phi sales and other customer data. It may come to you as numbers,
words, phrases, sentences, pictures, or other formats. If you can
record it in a consistent and retrievable way, it is data.

For instance, say you are the manager of an automobile manufacturing
plant. You might want to understand your production efficiencies
better. To do so you need information, perhaps the speed of production
of each car, the number of defects in each car, and the like. This is
raw data.

Data gathering and cleaning is a huge step, because gathering the
wrong information means you will get the wrong answers. (You have
doubtless heard the expression “GIGO,” which stands for “Garbage in,
Garbage out.” This is especially true in statistics, where wrong data
means your study may well be nonsense.)

Continuing with the automobile manufacturing example, if your data on
the speeds of production is inaccurate, then any further analysis will
give the wrong answers, and your decisions will be based on this wrong
information.

I discuss the critical data step in far more detail in Chapter 3 and
Chapter 4. For now, I summarize the major data challenges as follows:

- Data challenge 1: Focusing on the right observations (your population and samples). For instance, who or what are you studying? Which people, companies, and the like?
- Data challenge 2: Choosing issues to analyze (constructs). Are you interested in demographics of people, profitability of companies, economic variables of countries? It’s important to pick the right constructs, namely those that really matter.
- Data challenge 3: Once you have gathered your data, making sure it has been cleaned, that is, that it has no major faults that could derail your analysis.
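
To make data challenge 3 a little more concrete, here is a minimal sketch of a cleaning check for the automobile plant example. Everything in it is an assumption made for illustration: the field names (`car_id`, `production_hours`, `defects`), the three records, and the rules for what counts as a fault are all invented, and a real study would need checks suited to its own data.

```python
# Hypothetical raw records for three cars. The field names and values
# are invented for illustration; they do not come from real plant data.
raw_records = [
    {"car_id": "A1", "production_hours": 18.5, "defects": 2},
    {"car_id": "A2", "production_hours": None, "defects": 0},  # missing value
    {"car_id": "A3", "production_hours": -4.0, "defects": 1},  # impossible value
]


def find_faults(record):
    """Return a list of problems found in one record (empty if none)."""
    faults = []
    hours = record.get("production_hours")
    defects = record.get("defects")

    # Assumed rule: production time must be present and positive.
    if hours is None:
        faults.append("missing production_hours")
    elif hours <= 0:
        faults.append("non-positive production_hours")

    # Assumed rule: the defect count must be present and not negative.
    if defects is None:
        faults.append("missing defects")
    elif defects < 0:
        faults.append("negative defects")

    return faults


# Separate the records that pass every check from those that need attention.
clean = [r for r in raw_records if not find_faults(r)]
flagged = {r["car_id"]: find_faults(r) for r in raw_records if find_faults(r)}

print(len(clean), "clean record(s)")
for car_id, faults in flagged.items():
    print(car_id, "->", ", ".join(faults))
```

The sketch only flags suspect records; it does not decide what to do with them. Whether to correct, re-collect, or drop a faulty observation is a judgment call that depends on the study.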