1 Introduction to information quality

1.1 Introduction

Suppose you are conducting a study on online auctions and consider purchasing a dataset from eBay, the online auction platform, for the purpose of your study. The data vendor offers you four options that are within your budget:

  1. Data on all the online auctions that took place in January 2012
  2. Data on all the online auctions, for cameras only, that took place in 2012
  3. Data on all the online auctions, for cameras only, that will take place in the next year
  4. Data on a random sample of online auctions that took place in 2012

Which option would you choose? Perhaps none of these options are of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):

Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.

While those experienced with data analysis will find this dilemma familiar, the statistics literature and related literatures offer little guidance on how to approach this question in a methodical fashion or how to evaluate the value of a dataset in such a scenario.

Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing hypotheses of interest, predicting new observations, quantifying population effects, and summarizing data efficiently. In these empirical fields, measurable data is used to derive knowledge. Yet, a clean, exact, and complete dataset, which is analyzed professionally, might contain no useful information for the problem under investigation. In contrast, a very “dirty” dataset, with missing values and incomplete coverage, can contain useful information for some goals. In some cases, available data can even be misleading (Patzer, 1995, p. 14):

Data may be of little or no value, or even negative value, if they misinform.

The focus of this book is on assessing the potential of a particular dataset for achieving a given analysis goal by employing data analysis methods and considering a given utility. We call this concept information quality (InfoQ). We propose a formal definition of InfoQ and provide guidelines for its assessment. Our objective is to offer a general framework that applies to empirical research. Such a framework has not received much attention in the body of knowledge of the statistics profession, and developing it can be considered a contribution to both the theory and the practice of applied statistics (Kenett, 2015).

A framework for assessing InfoQ is needed both when designing a study to produce findings of high InfoQ and at the postdesign stage, after the data has been collected. Questions regarding the value of data to be collected, or that have already been collected, have important implications both in academic research and in practice. With this motivation in mind, we construct the concept of InfoQ and then operationalize it so that it can be implemented in practice.

In this book, we address and tackle a high‐level issue at the core of any data analysis. Rather than concentrate on a specific set of methods or applications, we consider a general concept that underlies any empirical analysis. The InfoQ framework therefore contributes to the literature on statistical strategy, also known as metastatistics (see Hand, 1994).

1.2 Components of InfoQ

Our definition of InfoQ involves four major components that are present in every data analysis: an analysis goal, a dataset, an analysis method, and a utility (Kenett and Shmueli, 2014). The discussion and assessment of InfoQ require examining and considering the complete set of its components as well as the relationships between the components. In such an evaluation we also consider eight dimensions that deconstruct the InfoQ concept. These dimensions are presented in Chapter 3. We start our introduction of InfoQ by defining each of its components.

Before describing each of the four InfoQ components, we introduce the following notation and definitions to help avoid confusion:

  • g denotes a specific analysis goal.
  • X denotes the available dataset.
  • f is an empirical analysis method.
  • U is a utility measure.

We use subscript indices to indicate alternatives. For example, to convey K different analysis goals, we use g1, g2,…, gK; J different methods of analysis are denoted f1, f2,…, fJ.
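This notation maps naturally onto a small data structure. The following sketch is purely illustrative: the class name, fields, and example entries are hypothetical stand-ins, not part of the InfoQ framework itself:

```python
# Hypothetical representation of the InfoQ components and their alternatives.
from dataclasses import dataclass

@dataclass
class StudyDesign:
    goals: list      # g_1, ..., g_K : candidate analysis goals
    datasets: list   # alternative datasets X under consideration
    methods: list    # f_1, ..., f_J : candidate analysis methods
    utilities: list  # candidate utility measures U

design = StudyDesign(
    goals=["forecast auction price", "explain price determinants"],
    datasets=["all Jan 2012 auctions", "random 2012 sample"],
    methods=["regression", "functional data analysis"],
    utilities=["RMSE", "R^2"],
)

K = len(design.goals)  # K = number of alternative analysis goals, here 2
```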

Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we can think of the InfoQ framework as one for evaluating the application of a technology (data analysis) to a resource (data) for a given purpose.

1.2.1 Goal (g)

Data analysis is used for a variety of purposes in research and in industry. The term “goal” refers to two levels: the high‐level goal of the study (the “domain goal”) and the empirical goal (the “analysis goal”). One starts from the domain goal and then converts it into an analysis goal. A classic example is translating a hypothesis driven by a theory into a set of statistical hypotheses.

There are various classifications of study goals; some classifications span both the domain and analysis goals, while other classification systems focus on describing different analysis goals.

One classification approach divides the domain and analysis goals into three general classes: causal explanation, empirical prediction, and description (see Shmueli, 2010; Shmueli and Koppius, 2011).

Causal explanation is concerned with establishing and quantifying the causal relationship between inputs and outcomes of interest. Lab experiments in the life sciences are often intended to establish causal relationships, and academic research in the social sciences is typically focused on causal explanation. In the social science context, the causality structure is based on a theoretical model that establishes the causal effect of some constructs (abstract concepts) on other constructs. The data collection stage is therefore preceded by a construct operationalization stage, where the researcher establishes which measurable variables can represent the constructs of interest. An example is investigating the causal effect of parents’ intelligence on their children’s intelligence; the construct “intelligence” can be measured in various ways, such as via IQ tests.

The goal of empirical prediction differs from causal explanation. It includes forecasting future values of a time series and predicting the output value for new observations given a set of input variables. Examples include recommendation systems on websites, which aim to predict the services or products that a user is most likely to be interested in, and forecasts of particular economic measures or indices.

Finally, descriptive goals include quantifying and testing for population effects by using data summaries, graphical visualizations, statistical models, and statistical tests.

A different, but related goal classification approach (Deming, 1953) introduces the distinction between enumerative studies, aimed at answering the question “how many?,” and analytic studies, aimed at answering the question “why?”

A third classification (Tukey, 1977) classifies studies into exploratory and confirmatory data analysis.

Our use of the term “goal” includes all these different types of goals and goal classifications. For examples of such goals in the context of customer satisfaction surveys, see Chapter 7 and Kenett and Salini (2012).

1.2.2 Data (X)

Data is a broadly defined term that includes any type of data intended to be used in the empirical analysis. Data can arise from different collection instruments: surveys, laboratory tests, field experiments, computer experiments, simulations, web searches, mobile recordings, observational studies, and more. Data can be primary, collected specifically for the purpose of the study, or secondary, collected for a different reason. Data can be univariate or multivariate, discrete, continuous, or mixed. Data can contain semantic unstructured information in the form of text, images, audio, and video. Data can have various structures, including cross‐sectional data, time series, panel data, networked data, geographic data, and more. Data can include information from a single source or from multiple sources. Data can be of any size (from a single observation in case studies to “big data” with zettabytes) and any dimension.

1.2.3 Analysis (f)

We use the general term data analysis to encompass any empirical analysis applied to data. This includes statistical models and methods (parametric, semiparametric, nonparametric, Bayesian and classical, etc.), data mining algorithms, econometric models, graphical methods, and operations research methods (such as simplex optimization). Methods can be as simple as summary statistics or complex multilayer models, computationally simple or computationally intensive.

1.2.4 Utility (U)

The extent to which the analysis goal is achieved is typically measured by some performance measure. We call this measure “utility.” As with the study goal, utility has two levels: the utility from the domain point of view and its operationalization as a measurable utility measure. As with the goal, the linkage between the domain utility and the analysis utility measure should be properly established so that the analysis utility can be used to draw inferences about the domain utility.

In predictive studies, popular utility measures are predictive accuracy, lift, and expected cost per prediction. In descriptive studies, utility is often assessed based on goodness‐of‐fit measures. In causal explanatory modeling, statistical significance, statistical power, and strength‐of‐fit measures (e.g., R2) are common.
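Two of these utility measures can be computed with standard formulas. The following sketch uses toy data chosen purely for illustration:

```python
# Illustrative computation of two common utility measures on toy data.
# All numbers here are made up for demonstration only.

def predictive_accuracy(actual, predicted):
    """Fraction of predictions that match the actual class labels."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def r_squared(y, y_hat):
    """Strength-of-fit measure: R^2 = 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

# Predictive goal: classification accuracy (3 of 4 correct -> 0.75)
acc = predictive_accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Explanatory goal: R^2 of fitted values against observations
r2 = r_squared([2.0, 4.0, 6.0], [2.1, 3.9, 6.2])
```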

1.3 Definition of information quality

Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we consider the utility of applying a technology f to a resource X for a given purpose g. In particular, we focus on the question: What is the potential of a particular dataset to achieve a particular goal using a given data analysis method and utility? To formalize this question, we define the concept of InfoQ as

InfoQ(f, X, g) = U(f(X | g))

The quality of information, InfoQ, is determined by the quality of its components g (“quality of goal definition”), X (“data quality”), f (“analysis quality”), and U (“quality of utility measure”) as well as by the relationships between them. (See Figure 1.1 for a visual representation of InfoQ components.)
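The relation "utility of applying analysis f to data X for goal g" can be made concrete with a toy sketch. All four components below are hypothetical stand-ins (a mean-based forecaster, a small numeric series, a prediction goal, and an absolute-error utility); none of them come from the book itself:

```python
# Toy illustration of InfoQ as the utility of applying f to X for goal g.
# All four components are hypothetical stand-ins chosen for illustration.

X = [10.0, 12.0, 11.0, 13.0]      # available dataset: past values of a series
g = "predict the next value"      # analysis goal

def f(data, goal):
    """A deliberately simple analysis method: forecast with the mean."""
    return sum(data) / len(data)

def U(prediction, actual=12.0):
    """Utility: negative absolute prediction error (higher is better)."""
    return -abs(prediction - actual)

info_q = U(f(X, g))               # InfoQ of this data/method/goal/utility combination
```

A different dataset, method, or utility measure would yield a different InfoQ value, which is exactly the comparison the framework is meant to support.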

Figure 1.1 The four InfoQ components: analysis goal (g), available data (X), data analysis method (f), and utility measure (U), depicted as interlocking puzzle pieces.

1.4 Examples from online auction studies

Let us recall the four options of eBay datasets we described at the beginning of the chapter. In order to evaluate the InfoQ of each of these datasets, we would have to specify the study goal, the intended data analysis, and the utility measure.

To better illustrate the role that the different components play, let us examine four studies in the field of online auctions, each using data to address a particular goal.

1.5 InfoQ and study quality

We defined InfoQ as a framework for answering the question: What is the potential of a particular dataset to achieve a particular goal using a given data analysis method and utility? In each of the four studies in Section 1.4, we examined the four InfoQ components and then evaluated the InfoQ based on examining the components. In Chapter 3 we introduce an InfoQ assessment approach, which is based on eight dimensions of InfoQ. Examining each of the eight dimensions assists researchers and analysts in evaluating the InfoQ of a dataset and its associated study.

In addition to using the InfoQ framework for evaluating the potential of a dataset to generate information of quality, the InfoQ framework can be used for retrospective evaluation of an empirical study. By identifying the four InfoQ components and assessing the eight InfoQ dimensions introduced in Chapter 3, one can determine the usefulness of a study in achieving its stated goal. In part II of the book, we take this approach and examine multiple studies in various domains. Chapter 12 in part III describes how the InfoQ framework can provide a more guided process for authors, reviewers and editors of scientific journals and publications.

1.6 Summary

In this chapter we introduced the concept of InfoQ and its four components. In the following chapters, we discuss how InfoQ differs from the common concepts of data quality and analysis quality. Moving from a concept to a framework that can be applied in practice requires a methodology for assessing InfoQ. In Chapter 3, we break down InfoQ into eight dimensions, to facilitate quantitative assessment of InfoQ. The final chapters (Chapters 4 and 5) in part I examine existing statistical methodology aimed at increasing InfoQ at the study design stage and at the postdata collection stage. Structuring and examining various statistical approaches through the InfoQ lens creates a clearer picture of the role of different statistical approaches and methods, often taught in different courses or used in separate fields. In summary, InfoQ is about assessing and improving the potential of a dataset to achieve a particular goal using a given data analysis method and utility. This book is about structuring and consolidating such an approach.

References

  1. Bapna, R., Jank, W. and Shmueli, G. (2008) Consumer surplus in online auctions. Information Systems Research, 19, pp. 400–416.
  2. Deming, W.E. (1953) On the distinction between enumerative and analytic studies. Journal of the American Statistical Association, 48, pp. 244–255.
  3. Ghani, R. and Simmons, H. (2004) Predicting the End‐Price of Online Auctions. International Workshop on Data Mining and Adaptive Modelling Methods for Economics and Management, Pisa, Italy.
  4. Hand, D.J. (1994) Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society, Series A, 157(3), pp. 317–356.
  5. Hand, D.J. (2008) Statistics: A Very Short Introduction. Oxford University Press, Oxford.
  6. Jank, W. and Shmueli, G. (2010) Modeling Online Auctions. John Wiley & Sons, Inc., Hoboken.
  7. Katkar, R. and Reiley, D.H. (2006) Public versus secret reserve prices in eBay auctions: results from a Pokemon field experiment. Advances in Economic Analysis and Policy, 6(2), article 7.
  8. Kenett, R.S. (2015) Statistics: a life cycle view (with discussion). Quality Engineering, 27(1), pp. 111–129.
  9. Kenett, R.S. and Salini, S. (2012) Modern analysis of customer surveys: comparison of models and integrated analysis (with discussion). Applied Stochastic Models in Business and Industry, 27, pp. 465–475.
  10. Kenett, R.S. and Shmueli, G. (2014) On information quality (with discussion). Journal of the Royal Statistical Society, Series A, 177(1), pp. 3–38.
  11. Marshall, A. (1920) Principles of Economics, 8th edition. MacMillan, London.
  12. Patzer, G.L. (1995) Using Secondary Data in Marketing Research. Praeger, Westport, CT.
  13. Shmueli, G. (2010) To explain or to predict? Statistical Science, 25, pp. 289–310.
  14. Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. Management Information Systems Quarterly, 35, pp. 553–572.
  15. Tukey, J.W. (1977) Exploratory Data Analysis. Addison‐Wesley, Reading, MA.
  16. Wang, S., Jank, W. and Shmueli, G. (2008) Explaining and forecasting online auction prices and their dynamics using functional data analysis. Journal of Business and Economics Statistics, 26, pp. 144–160.