2 Quality of goal, data quality, and analysis quality

2.1 Introduction

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

John Tukey, 1962

At the most basic level, the quality of a goal under investigation depends on whether the stated goal is of interest and relevant either scientifically or practically. At the next level, the quality of a goal is derived from translating a scientific or practical goal into an empirical goal. This challenging step requires knowledge of both the problem domain and data analysis and necessitates close collaboration between the data analyst and the domain expert. A well‐defined empirical goal is one that properly reflects the scientific or practical goal. Although a dataset can be useful for one scientific goal g1, it can be completely useless for a second scientific goal g2. For example, monthly average temperature data for a city can be utilized to quantify and understand past trends and seasonal patterns, goal g1, but cannot be used effectively for generating future daily weather forecasts, goal g2. The challenge is therefore to define the right empirical question under study in order to avoid what Kimball (1957) calls “error of the third kind” or “giving the right answer to the wrong question.”

The task of goal definition is often more difficult than any of the other stages in a study. Hand (1994) says:

It is clear that establishing the mapping from the client’s domain to a statistical question is one of the most difficult parts of a statistical analysis.

Moreover, Mackay and Oldford (2000) note that this important step is rarely mentioned in introductory statistics textbooks:

Understanding what is to be learned from an investigation is so important that it is surprising that it is rarely, if ever, treated in any introduction to statistics. In a cursory review, we could find no elementary statistics text that provided a structure to understand the problem.

Several authors have indicated that the act of finding and formulating a problem is a key aspect of creative thought and performance, an act that is distinct from, and perhaps more important than, problem solving (see Jay and Perkins, 1997).

Quality issues of goal definition often arise when translating a stakeholder’s language into empirical jargon. An example is a marketing manager who asks an analyst to use the company’s existing data to “understand what makes customers respond positively or negatively to our advertising.” The analyst might translate this statement into an empirical goal of identifying the causal factors that affect customer responsiveness to advertising, which could then lead to designing and conducting a randomized experiment. However, in‐depth discussions with the marketing manager may lead the analyst to discover that the analysis results are intended to be used for targeting new customers with ads. While the manager used the English term “understand,” his/her goal in empirical language was to “predict future customers’ ad responsiveness.” Thus, the analyst should develop and evaluate a predictive model rather than an explanatory one. To avoid such miscommunication, a critical step for analysts is to learn how to elicit the required information from the stakeholder and to understand how the stakeholder’s goal translates into empirical language.

2.1.1 Goal elicitation

One useful approach for framing the empirical goal is scenario building, where the analyst presents different scenarios to the stakeholder of how the analysis results might be used. The stakeholder’s feedback helps narrow the gap between the intended goal and its empirical translation. Another approach, used in developing integrated information technology (IT) systems, is to conduct goal elicitation using organizational maps. A fully developed discipline, sometimes called goal‐oriented requirements engineering (GORE), was designed to do just that (Dardenne et al., 1993; Regev and Wegmann, 2005).

2.1.2 From theory to empirical hypotheses

In academic research, different disciplines have different methodologies for translating a scientific question into an empirical goal. In the social sciences, such as economics or psychology, researchers start from a causal theory and then translate it into statistical hypotheses by a step of operationalization. This step, where abstract concepts are mapped into measurable variables, allows the researcher to translate a conceptual theory into an empirical goal. For example, in quantitative linguistics, one translates scientific hypotheses about the human language faculty and its use in the world into statistical hypotheses.

2.1.3 Goal quality, InfoQ, and goal elicitation

Defining the study goal inappropriately, or translating it incorrectly into an empirical goal, will obviously negatively affect information quality (InfoQ). InfoQ relies on, but does not assess the quality of, the goal definition. The InfoQ framework offers an approach that helps assure the alignment of the study goal with the other components of the study. Since goal definition is directly related to the data, the data analysis and the utility, the InfoQ definition U(f(X|g)) depends on the goal, thereby requiring a clear goal definition and keeping it in view at every step. By directly considering the goal, the InfoQ framework raises awareness of the stated goal, thereby presenting opportunities for detecting challenges or issues with it.
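
In the notation used here, with g the goal, X the data, f the analysis method and U the utility measure, this dependence can be written compactly (following Kenett and Shmueli, 2013) as

InfoQ(f, X, g) = U(f(X|g)),

that is, the quality of the information generated by applying analysis f to data X, conditional on the goal g, as assessed by the utility U.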

Moreover, the InfoQ framework can be used to enhance the process of goal elicitation and hypothesis generation. It is often the case that researchers formulate their goals once they see and interact with the data. In his commentary on the paper “On Information Quality” (Kenett and Shmueli, 2013), Schouten (2013) writes about the importance and difficulty of defining the study goal and the role that the InfoQ framework can play in improving the quality of goal definition. He writes:

A decisive ingredient to information quality is the goal or goals that researchers have set when starting an analysis. From my own experience and looking at analyses done by others, I conclude that research goals may not be that rigorously defined and/or stated beforehand. They should of course be well defined in order to assess the fitness for use of data, but often data exploration and analysis sharpen the mind of the researcher and goals get formed interactively. As such I believe that an assessment of the InfoQ dimensions may actually be helpful in deriving more specific and elaborated analysis goals. Still, I suspect that the framework is only powerful when researchers have well defined goals.

2.2 Data quality

Raw data, like raw potatoes, usually require cleaning before use.

Thisted, in Hand, 2008

It is rare to meet a data set which does not have quality problems of some kind.

Hand, 2008

Data quality is a critically important subject. Unfortunately, it is one of the least understood subjects in quality management and, far too often, is simply ignored.

Godfrey, 2008

Data quality has long been recognized by statisticians and data analysts as a serious challenge. Almost all data requires some cleaning before it can be further used for analysis. However, the level of cleanliness and the approach to data cleaning depend on the goal. Using the InfoQ notation, data quality typically concerns U(X|g). The same data can contain high‐quality information for one purpose and low‐quality information for another. This has been recognized and addressed in several fields. Mallows (1998) posed the zeroth problem, asking “How do the data relate to the problem, and what other data might be relevant?” In the following we briefly examine several approaches to data quality in different fields and point out how they differ from InfoQ.

2.2.1 MIS‐type data quality

In database engineering and management information systems (MIS), the term “data quality” refers to the usefulness of queried data to the person querying it. Wang et al. (1993) gave the following example:

Acceptable levels of data quality may differ from one user to another. An investor loosely following a stock may consider a ten minute delay for share price sufficiently timely, whereas a trader who needs price quotes in real time may not consider ten minutes timely enough.

Another aspect sometimes attributable to data quality is conformance to specifications or standards. Wang et al. (1993) define data quality as “conformance to requirements.” For the purpose of evaluating data quality, they use “data quality indicators.” These indicators are based on objective measures such as data source, creation time, collection method and subjective measures such as the level of credibility of a source as assigned by a researcher. In the United Kingdom, for instance, the Department of Health uses an MIS type of definition of data quality with respect to the quality of medical and healthcare patient data in the National Health Service (UK Department of Health, 2004).

Lee et al. (2002) propose a methodology for information quality assessment and benchmarking, called AIMQ. Their focus is on the usefulness of organizational data to its users, specifically data from IT systems. The authors define four categories of information quality: intrinsic, contextual, representational and accessibility. While the intrinsic category refers to “information [that] has quality in its own right,” the contextual category takes into account the task at hand (from the perspective of a user), and the last two categories relate to qualities of the information system. Lee et al.’s use of the term “information quality” indicates that they consider the data in the context of the user, rather than in isolation (as the term data quality might imply). The AIMQ methodology is used for assessing and benchmarking an organization’s own data usage.

A main approach to InfoQ implemented in the context of MIS is the application of entity resolution (ER) analysis. ER is the process of determining whether two references to real‐world objects are referring to the same object or two different objects. The degree of completeness, accuracy, timeliness, believability, consistency, accessibility and other aspects of reference data can affect the operation of ER processes and produce better or worse outcomes. This is one of the reasons that ER is so closely related to the MIS field of information quality (IQ), an emerging discipline concerned with maximizing the value of an organization’s information assets and assuring that the information products it produces meet the expectations of the customers who use them. Improving the quality of reference sources dramatically improves ER process results, and conversely, integrating the references through ER improves the overall quality of the information in the system. ER systems generally use four basic techniques for determining that references are equivalent and should be linked: direct matching, association analysis, asserted equivalence and transitive equivalence. For an introduction to ER, see Talburt (2011). For a case study using open source software to conduct ER analysis in the context of healthcare systems, see Zhou et al. (2010).
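
To make two of these techniques concrete, the following Python sketch (with hypothetical records and field names) links records by direct matching on exact key values and then merges the links by transitive equivalence using a union‐find structure; real ER systems layer approximate matching, association analysis and asserted equivalence on top of such a core.

```python
from collections import defaultdict

def resolve_entities(records, keys=("email", "phone")):
    """Toy entity resolution: direct matching on exact key values,
    combined by transitive equivalence via union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Direct matching: records sharing the same value on any key are linked
    for key in keys:
        by_value = defaultdict(list)
        for idx, rec in enumerate(records):
            if rec.get(key):
                by_value[rec[key]].append(idx)
        for idxs in by_value.values():
            for i in idxs[1:]:
                union(idxs[0], i)

    # Transitive equivalence: records linked through a chain share one root
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())

records = [
    {"name": "J. Smith",   "email": "js@example.com"},
    {"name": "John Smith", "email": "js@example.com", "phone": "555-0100"},
    {"name": "Jon Smith",  "phone": "555-0100"},
    {"name": "Jane Doe",   "email": "jd@example.com"},
]
print(resolve_entities(records))  # [[0, 1, 2], [3]]
```

Records 0 and 2 are never compared directly; they end up in the same cluster only because both link to record 1, which is the essence of transitive equivalence.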

Gackowski (2005) reviews popular MIS textbooks and states:

Current MIS textbooks are deficient with regard to the role of end users and even more so about the information disseminators. The texts are overly technology laden, with oversimplified coverage of the fundamentals on data, information, and particularly the role of informing in business.

The treatment of InfoQ in this book addresses this void and, in some sense, links to data quality communities such as the International Association for Information and Data Quality (IAIDQ).

In a broader context, technology can improve data quality. For example, in manual data entry to an automated system, automatic data validation can provide immediate feedback so that data entry errors can be corrected on the spot. Technological advances in electronic recording, scanners, RFID, electronic entry, electronic data transfer, data verification technologies and robust data storage, as well as more advanced measurement instruments, have produced over time much “cleaner” data (Redman, 2007).

These data quality issues focus on U(X|g), which differs from InfoQ by excluding the data analysis component f. In addition, the MIS treatment of utility is typically qualitative rather than quantitative: utility is viewed as the value of the information provided to the receiver in the context of its intended use. In InfoQ, utility is considered from a quantitative perspective and consists of statistical measures such as prediction error or estimation bias.

2.2.2 Quality of statistical data

A similar concept is the quality of statistical data which has been developed and used in official statistics and international organizations that routinely collect data. The concept of quality of statistical data refers to the usefulness of summary statistics that are produced by national statistics agencies and other producers of official statistics. This is a special case of InfoQ where f is equivalent to computing summary statistics (although this operation might seem very simple, it is nonetheless considered “analysis,” because it in fact involves estimation).

Such organizations have created frameworks for assessing the quality of statistical data. The International Monetary Fund (IMF) and the Organisation for Economic Co‐operation and Development (OECD) each developed an assessment framework. Aspects that they assess are relevance, accuracy, timeliness, accessibility, interpretability, coherence and credibility. These different dimensions are each assessed separately—either subjectively or objectively. For instance, the OECD’s definition of relevance of statistical data refers to a qualitative assessment of the value contributed by the data. Other aspects are more technical in nature. For example, accessibility refers to how readily the data can be located and accessed. See Chapter 3 for further details on the data quality dimensions used by government and international agencies.

In the context of survey quality, official agencies such as Eurostat, the National Center for Science and Engineering Statistics, and Statistics Canada have created quality dimensions for evaluating a survey against the goal g of obtaining “accurate survey data,” with U equivalent to the mean square error (MSE) (see Biemer and Lyberg, 2003). Such agencies have also defined sets of data quality dimensions. For example, Eurostat’s quality dimensions are relevance of statistical concept, accuracy of estimates, timeliness and punctuality in disseminating results, accessibility and clarity of the information, comparability, coherence, and completeness (see www.nsf.gov/statistics for the National Center for Science and Engineering Statistics guidelines and standards).
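
For reference, the MSE criterion combines sampling variability and systematic error. Writing T for a survey estimator of a population quantity θ, the standard decomposition is

MSE(T) = E[(T − θ)²] = Var(T) + [Bias(T)]²,

so that a survey with small sampling variance can still have a large MSE if, for example, nonresponse or coverage problems introduce bias.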

2.2.3 Data quality in statistics

In the statistics literature, discussions of data quality mostly focus on the cleanliness of the data in terms of data entry errors, missing values, measurement errors and so on. These different aspects of data quality can be classified into different groups using different criteria. For example, Hand (2008) distinguishes between two types of data quality problems: incomplete data (including missing values and sampling bias) and incorrect data.

In the InfoQ framework, we distinguish between strictly data quality issues and InfoQ issues on the basis of whether they relate only to X or to one or more of the InfoQ components. An issue is a “data quality” issue if it characterizes a technical aspect of the data that can be “cleaned” with adequate technology and without knowledge of the goal. Aspects such as data entry errors, measurement errors and corrupted data are therefore classified as “data quality.” Data issues that involve the goal, analysis and/or utility of the study are classified as “InfoQ” issues. These include sampling bias and missing values, which are not simply technical errors: their definition or impact depends on the study goal g. Sampling bias, for example, is relative to the population of interest: the same sample can be biased for one goal and unbiased for another goal. Missing values can add uncertainty for achieving one goal, but reduce uncertainty for achieving another goal (e.g., missing information in financial reports can be harmful for assessing financial performance, but helpful for detecting fraudulent behavior).
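
A small simulation can make this goal dependence concrete. In the hypothetical sketch below, a sample drawn only from purchasing customers is approximately unbiased for estimating mean spending among purchasers (goal g1) but badly biased for estimating mean spending across all customers (goal g2); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 customers, about 30% make a purchase
n = 100_000
purchased = rng.random(n) < 0.3
spending = np.where(purchased, rng.gamma(2.0, 50.0, n), 0.0)

# Sample drawn only from purchasers (e.g., loyalty-program data)
sample = rng.choice(spending[purchased], size=2_000, replace=False)

print("g1: mean spend per purchaser")
print("  population:", spending[purchased].mean().round(1),
      "  sample:", sample.mean().round(1))   # approximately unbiased

print("g2: mean spend per customer (all customers)")
print("  population:", spending.mean().round(1),
      "  sample:", sample.mean().round(1))   # biased upward for this goal
```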

Other classifications of data quality issues are also possible. Schouten (2013) distinguishes between data quality and InfoQ, saying “data quality is about the data that one intended to have and InfoQ is about the data that one desired to have.” According to his view, different methods are used to improve data quality and InfoQ. “Data processing, editing, imputing and weighting [aim at] reducing the gap between the data at hand and the data that one intended to have. These statistical methods … aim at improving data quality. Data analysis is about bridging the gap between intended and desired data.”

In the remainder of the book, we use the classifications of “data quality” and “InfoQ” based on whether the issue pertains to the data alone (X) or to at least one more InfoQ component.

2.3 Analysis quality

All models are wrong, but some are useful.

Box, 1979

Analysis quality refers to the adequacy of the empirical analysis in light of the data and goal at hand. Analysis quality reflects the adequacy of the modeling with respect to the data and for answering the question of interest. Godfrey (2008) described low analysis quality as “poor models and poor analysis techniques, or even analyzing the data in a totally incorrect way.” We add to that the ability of the stakeholder to use the analysis results. Let us consider a few aspects of analysis quality, so that it becomes clear how it differs from InfoQ and how the two are related.

2.3.1 Correctness

Statistics education as well as education in other related fields such as econometrics and data mining is aimed at teaching high‐quality data analysis. Techniques for checking analysis quality include graphic and quantitative methods such as residual analysis and cross validation as well as qualitative evaluation such as consideration of endogeneity (reverse causation) in causal studies. Analysis quality depends on the expertise of the analyst and on the empirical methods and software available at the time of analysis.
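
As a minimal sketch of two such checks, assuming scikit-learn is available and using simulated data, the code below examines residuals from a linear regression fit and computes a five‐fold cross‐validated error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)

# Residual analysis: residuals should behave like unstructured noise
residuals = y - model.predict(X)
print("residual mean:", residuals.mean().round(3))
print("correlation of residuals with fitted values:",
      np.corrcoef(model.predict(X), residuals)[0, 1].round(3))

# Cross-validation: out-of-sample error rather than in-sample fit
cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
print("5-fold CV MSE:", cv_mse.mean().round(3))
```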

Analysis quality greatly depends on the goal at hand. The coupling of analysis and goal allows a broader view of analysis adequacy, since “textbook” approaches often consider the suitability of a method for use with specific data types for a specific goal. But usage might fall outside that scope and still be useful. As an example, the textbook use of linear regression models requires data that adhere to the assumption of independent observations. Yet linear regression is widely used in practice for forecasting time series, where the observations are typically auto‐correlated, because it meets the goal of sufficiently accurate forecasts. Naive Bayes classifiers are built on the assumption of conditional independence of the predictors, yet despite the violation of this assumption in most applications, naive Bayes provides excellent classification performance.

Analysis quality refers not only to the statistical model used but also to the methodology. For example, comparing a predictive model to a benchmark is a necessary methodological step.
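
A hedged sketch of this step, using a simulated autocorrelated series as a stand‐in for real data: a simple trend regression used for forecasting is compared on a holdout period against a naive “last observed value” benchmark.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Simulated AR(1) noise around a linear trend (stand-in for a real series)
n = 120
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.7 * noise[t - 1] + rng.normal(scale=1.0)
y = 0.1 * np.arange(n) + noise

train, test = y[:100], y[100:]
t_train = np.arange(100).reshape(-1, 1)
t_test = np.arange(100, 120).reshape(-1, 1)

# Trend regression forecast (the textbook independence assumption is violated)
reg = LinearRegression().fit(t_train, train)
forecast = reg.predict(t_test)

# Naive benchmark: repeat the last observed value over the holdout period
benchmark = np.repeat(train[-1], len(test))

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

print("regression RMSE:", rmse(test, forecast).round(3))
print("naive benchmark RMSE:", rmse(test, benchmark).round(3))
```

Whether the regression forecast is “good enough” is then judged relative to the benchmark and the goal, not against the violated textbook assumption.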

2.3.2 Usability of the analysis

The usability of the analysis method to the stakeholder is another facet of analysis quality. In applications such as credit risk, regulations exist regarding information that must be conveyed to customers whose credit requests are denied. In such cases, using models that are not transparent in terms of the variables used (“blackbox” models) will have low analysis quality, even if they adequately deny/grant credit to prospects.

2.3.3 Culture of data analysis

Finally, there is a subjective aspect of analysis quality. Different academic fields maintain different “cultures” and norms of what is considered acceptable data analysis. For example, regression models are often used for causal inference in the social sciences, while in other fields such a use is deemed unacceptable. Hence, analysis quality also depends on the culture and environment of the researcher or analysis team.

When considering InfoQ, analysis quality is not examined against textbook assumptions or theoretical properties. Instead, it is judged against the specific goal g and the utility U of using the analysis method f with the specific dataset X for that particular goal.

2.4 Quality of utility

Like the study goal, the utility of a study provides the link between the domain and the empirical worlds. The analyst must understand the intended use and purpose of the analysis in order to choose the correct empirical utility or performance measures. The term “utility” refers to the overall usefulness of what the study aims to achieve, as well as to the set of measures and metrics used to assess the empirical results.

2.4.1 Statistical performance

A model may be useful along one dimension and worse than useless along another.

Berk et al., 2013

The field of statistics offers a wealth of utility measures, tests, and charts aimed at gauging the performance of a statistical model. Methods range from classical to Bayesian; loss functions range from L1‐distance to L2‐distance metrics; metrics are based on in‐sample or out‐of‐sample data. They include measures of goodness of fit (e.g., residual analysis) and strength‐of‐relationship tests and measures (e.g., R2 and p‐values in regression models).

Predictive performance measures include penalized metrics such as the Akaike information criterion (AIC) and Bayes information criterion (BIC) and out‐of‐sample measures such as root mean square error (RMSE), mean absolute percentage error (MAPE) and other aggregations of prediction errors. One can use symmetric cost functions on prediction errors or asymmetric ones that more heavily penalize over‐ or underprediction. Even within predictive modeling, there exists a variety of metrics depending on the exact predictive task and data type: for classification (predicting a categorical outcome), one can use a classification matrix, overall error, measures of sensitivity and specificity, recall and precision, receiver operating characteristic (ROC) curves, and area under the curve (AUC) metrics. For predicting numerical records, various aggregations of prediction errors exist that weigh the direction and magnitude of the error differently. For ranking new records, lift charts are most common.
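
The sketch below computes a few of these measures with scikit-learn on a small, purely hypothetical holdout set (it assumes a scikit-learn version that includes mean_absolute_percentage_error).

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, recall_score, precision_score,
                             roc_auc_score, mean_squared_error,
                             mean_absolute_percentage_error)

# Hypothetical holdout results for a classifier
y_true_class = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.1, 0.7])
y_pred_class = (y_prob >= 0.5).astype(int)

print("confusion matrix:\n", confusion_matrix(y_true_class, y_pred_class))
print("recall:", recall_score(y_true_class, y_pred_class))
print("precision:", precision_score(y_true_class, y_pred_class))
print("AUC:", roc_auc_score(y_true_class, y_prob))

# Hypothetical holdout results for a numerical prediction task
y_true_num = np.array([100.0, 150.0, 90.0, 120.0])
y_pred_num = np.array([110.0, 140.0, 100.0, 115.0])
print("RMSE:", np.sqrt(mean_squared_error(y_true_num, y_pred_num)))
print("MAPE:", mean_absolute_percentage_error(y_true_num, y_pred_num))
```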

As a side note, Akaike originally called his approach an “entropy maximization principle,” because the approach is founded on the concept of entropy in information theory. Minimizing AIC in a statistical model is equivalent to maximizing entropy in a thermodynamic system; in other words, the information‐theoretic approach in statistics essentially applies the second law of thermodynamics. As such, AIC generalizes the work of Boltzmann on entropy to model selection in the context of generalized regression (GR). We return to the important dimension of generalization in the context of InfoQ dimensions in the next chapter.

With such a rich set of potential performance measures, utility quality greatly depends on the researcher’s ability and knowledge to choose adequate metrics.

2.4.2 Decision‐theoretical and economic utility

In practical applications, there are often costs and gains associated with decisions to be made based on the modeling results. For example, in fraud detection, misclassifying a fraudulent case is associated with some costs, while misclassifying a nonfraudulent case is associated with other costs. In studies of this type, it is therefore critical to use cost‐based measures for assessing the utility of the model. In the field of statistical process control (SPC), the parameters of classic control charts are based on the sampling distribution of the monitored statistic (typically the sample mean or standard deviation). A different class of control charts is based on “economic design” (see Montgomery, 1980; Kenett et al., 2014), where chart limits, sample size and time between samples are set based on a cost minimization model that takes into account costs due to sampling, investigation, repair and producing defective products (Serel, 2009).
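
As an illustration of a cost‐based utility measure (the costs, scores and fraud rate below are entirely hypothetical), the classification threshold is chosen to minimize expected misclassification cost rather than overall error rate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fraud labels and model scores on a validation set (~2% fraud)
y = rng.random(5000) < 0.02
scores = np.clip(rng.normal(0.2 + 0.5 * y, 0.15), 0, 1)

COST_MISSED_FRAUD = 500.0   # cost of misclassifying a fraudulent case
COST_FALSE_ALARM = 10.0     # cost of investigating a legitimate case

def expected_cost(threshold):
    flagged = scores >= threshold
    missed = np.sum(y & ~flagged)
    false_alarms = np.sum(~y & flagged)
    return (missed * COST_MISSED_FRAUD
            + false_alarms * COST_FALSE_ALARM) / len(y)

thresholds = np.linspace(0.05, 0.95, 91)
costs = [expected_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print("cost-minimizing threshold:", round(best, 2))
print("expected cost per case:", round(min(costs), 3))
```

Because missing a fraudulent case is far costlier than a false alarm in this hypothetical setup, the cost‐minimizing threshold is typically much lower than the error‐minimizing one.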

Decision theory provides a rational framework for choosing between alternative courses of action when the consequences resulting from this choice are imperfectly known. In his foreword to an edited volume by Di Bacco et al. (2004), Lindley addresses the question of what is meant by statistics by referring to those he considers the founding fathers: Harold Jeffreys, Bruno de Finetti, Frank Ramsey, and Jimmie Savage:

Both Jeffreys and de Finetti developed probability as the coherent appreciation of uncertainty, but Ramsey and Savage looked at the world rather differently. Their starting point was not the concept of uncertainty but rather decision‐making in the face of uncertainty. They thought in terms of action, rather than in the passive contemplation of the uncertain world. Coherence for them was not so much a matter of how your beliefs hung together but of whether your several actions, considered collectively, make sense….

2.4.3 Computational performance

In industrial applications, computation time is often of great importance. Google’s search engine must return query results with very little lag time; Amazon and Facebook must choose the product or ad to be presented immediately after a user clicks. Even in academic studies, in some cases research groups cannot wait for months to complete a run, thereby resorting to shortcuts, approximations, parallel computing and other solutions to speed run time. In such cases, the utility also includes computational measures such as run time and computational resources, as well as scalability.

2.4.4 Other types of utility

In some applications, interpretability of the analysis results is considered critical to the utility of a model, so that interpretable models are preferred over “blackbox” models; in other cases the user might be agnostic to interpretation, so that interpretability is not part of the utility function.

For academics, a critical utility of a data analysis is publication! Hence, choices of performance metrics might be geared toward attaining the requirements in their field. These can vary drastically by field. For example, the journal Basic and Applied Social Psychology recently announced that it will not publish p‐values, statistical tests, significance statements or confidence intervals in submitted manuscripts (Trafimow and Marks, 2015).

2.4.5 Connecting empirical utility and study utility

When considering the quality of the utility U, two dangers can lower quality: (i) the absence of a stated study utility, which limits the study to statistical utility measures, and (ii) a mismatch between the chosen utility measure and the actual utility of the study.

With respect to focusing solely on statistical utility, we can quote again Lindley (2004) who criticized current publications that use Bayesian methods for neglecting to consider utility. He writes:

If one looks today at a typical statistical paper that uses the Bayesian method, copious use will be made of probability, but utility, or maximum expected utility (MEU), will rarely get a mention… When I look at statistics today, I am astonished at the almost complete failure to use utility…. Probability is there but not utility. This failure has to be my major criticism of current statistics; we are abandoning our task half‐way, producing the inference but declining to explain to others how to act on that inference. The lack of papers that provide discussions on utility is another omission from our publications.

Choosing the right measures depends on correctly identifying the underlying study utility as well as correctly translating the study utility into empirical metrics. This is similar to the case of goal definition and its quality.

Performance measures must depend on the goal at hand, on the nature of the data and on the analysis method. For example, a common error in various fields is using the R2 statistic for measuring predictive accuracy (see Shmueli, 2010; Shmueli and Koppius, 2011). Recall our earlier example of a marketing manager who tells the analyst the analysis goal is to “understand customer responsiveness to advertising,” while effectively the model is to be used for targeting new customers with ads. If the analyst pursues (incorrectly) an explanatory modeling path, then their choice of explanatory performance measures, such as R2 (“How well does my model explain the effect of customer information on their ad responsiveness?”), would lower the quality of utility. This low‐quality utility would typically be discovered in the model deployment stage, where the explanatory model’s predictive power would be observed for the first time.
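
A purely illustrative simulation shows how this mismatch plays out: a regression with many candidate predictors attains a high in‐sample R2 yet predicts poorly on a holdout set, which is exactly the gap that explanatory performance measures fail to reveal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)

# Many candidate predictors, only two of them truly related to the outcome
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=2.0, size=n)

X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

model = LinearRegression().fit(X_train, y_train)
print("in-sample R2:",
      round(r2_score(y_train, model.predict(X_train)), 2))   # typically high
print("holdout R2:",
      round(r2_score(y_test, model.predict(X_test)), 2))      # much lower
print("holdout RMSE:",
      round(float(np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))), 2))
```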

Another error that lowers utility quality is relying solely on p‐values for testing hypotheses with very large samples, a common practice in several fields that now use hundreds of thousands or even millions of observations. Because p‐values are a function of sample size, with very large samples one can obtain tiny p‐values (highly statistically significant) for even minuscule effects. One must therefore examine the effect size and consider its practical relevance (see Lin et al., 2013).
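
The dependence of p‐values on sample size is easy to reproduce with simulated data (the difference in means below is deliberately negligible in practical terms; scipy is assumed available).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
effect = 0.01  # a practically negligible difference in means

for n in (1_000, 100_000, 10_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:>10,}   p-value = {p:.4g}")
```

The same tiny effect moves from clearly nonsignificant to highly significant as n grows, which is why effect size and practical relevance must be examined alongside the p‐value.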

With the proliferation of data mining contests, hosted on public platforms such as kaggle.com, there has been a strong emphasis on finding a model that optimizes a specific performance measure, such as RMSE or lift. However, in real‐life studies, it is rarely the case that a model is chosen on the basis of a single utility measure. Instead, the analyst considers several measures and examines the model’s utility under various practical considerations, such as adequacy of use by the stakeholder, cost of deployment and robustness under different possible conditions. Similarly, in academic research, model selection is based not on optimizing a single utility measure but rather on additional criteria such as parsimony and robustness and, importantly, on supporting meaningful discoveries.

The quality of the utility therefore directly impacts InfoQ. As with goal quality, the InfoQ framework raises awareness of the relationship between the domain and empirical worlds, thereby helping to avoid disconnects between analysis and reality, as is the case in data mining competitions.

2.5 Summary

This chapter lays the foundation for the remainder of the book by examining each of the four InfoQ components (goal, data, analysis, and utility) from the perspective of quality. We consider the intrinsic quality of these components, thereby differentiating the single components’ quality from the overall notion of InfoQ. The next chapter introduces the eight InfoQ dimensions used to deconstruct the general concept of InfoQ. InfoQ combines the four components treated here with the eight dimensions discussed in Chapter 3. The examples in this and other chapters show how InfoQ combines data collection and organization with data analytics and operationalization, designed to achieve specific goals reflecting particular utility functions. In a sense, InfoQ expands on the domain of decision theory by considering modern implications of data availability, advanced and accessible analytics and data‐driven systems with operational tasks. Following Chapter 3, we devote specific chapters to the data collection and study design phase and to the post‐data‐collection phase, from the perspective of InfoQ. Examples in a wide range of applications are provided in Part II of the book.

References

  1. Berk, R.A., Brown, L., George, E., Pitkin, E., Traskin, M., Zhang, K. and Zhao, L. (2013) What You Can Learn from Wrong Causal Models, in Handbook of Causal Analysis for Social Research, Morgan, S.L. (editor), Springer, Dordrecht.
  2. Biemer, P. and Lyberg, L. (2003) Introduction to Survey Quality. John Wiley & Sons, Inc., Hoboken, NJ.
  3. Box, G.E.P. (1979) Robustness in the Strategy of Scientific Model Building, in Robustness in Statistics, Launer, R.L. and Wilkinson, G.N. (editors), Academic Press, New York, pp. 201–236.
  4. Dardenne, A., van Lamsweerde, A. and Fickas, S. (1993) Goal‐directed requirements acquisition. Science of Computer Programming, 20, pp. 3–50.
  5. Di Bacco, M., d’Amore, G. and Scalfari, F. (editors) (2004) Applied Bayesian Statistical Studies in Biology and Medicine. Springer, Boston, MA.
  6. Gackowski, Z. (2005) Informing systems in business environments: a purpose‐focused view. Informing Science Journal, 8, pp. 101–122.
  7. Godfrey, A.B. (2008) Eye on data quality. Six Sigma Forum Magazine, 8, pp. 5–6.
  8. Hand, D.J. (1994) Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society, Series A, 157(3), pp. 317–356.
  9. Hand, D.J. (2008) Statistics: A Very Short Introduction. Oxford University Press, Oxford.
  10. Jay, E.S. and Perkins, D.N. (1997) Creativity’s Compass: A Review of Problem Finding, in Creativity Research Handbook, vol. 1, Runco, M.A. (editor), Hampton, Cresskill, NJ, pp. 257–293.
  11. Kenett, R.S. and Shmueli, G. (2013) On information quality. Journal of the Royal Statistical Society, Series A, 176(4), pp. 1–25.
  12. Kenett, R., Zacks, S. and Amberti, D. (2014) Modern Industrial Statistics: With Applications in R, MINITAB and JMP, 2nd edition. John Wiley & Sons, Chichester, West Sussex, UK.
  13. Kimball, A.W. (1957) Errors of the third kind in statistical consulting. Journal of the American Statistical Association, 52, 133–142.
  14. Lee, Y., Strong, D., Kahn, B. and Wang, R. (2002) AIMQ: a methodology for information quality assessment. Information & Management, 40, pp. 133–146.
  15. Lin, M., Lucas, H. and Shmueli, G. (2013) Too big to fail: large samples and the p‐value problem. Information Systems Research, 24(4), pp. 906–917.
  16. Lindley, D.V. (2004) Some Reflections on the Current State of Statistics, in Applied Bayesian Statistical Studies in Biology and Medicine, di Bacco, M., d’Amore, G. and Scalfari, F. (editors), Springer, Boston, MA.
  17. Mackay, R.J. and Oldford, R.W. (2000) Scientific method, statistical method, and the speed of light. Statistical Science, 15(3), pp. 254–278.
  18. Mallows, C. (1998) The zeroth problem. The American Statistician, 52, pp. 1–9.
  19. Montgomery, D.C. (1980) The economic design of control charts: a review and literature survey. Journal of Quality Technology, 12, pp. 75–87.
  20. Redman, T. (2007) Statistics in Data and Information Quality, in Encyclopedia of Statistics in Quality and Reliability, Ruggeri, F., Kenett, R.S. and Faltin, F. (editors in chief), John Wiley & Sons, Ltd, Chichester, UK.
  21. Regev, G. and Wegmann, A. (2005) Where Do Goals Come from: The Underlying Principles of Goal‐Oriented Requirements Engineering. Proceedings of the 13th IEEE International Requirements Engineering Conference (RE’05), Paris, France.
  22. Schouten, B. (2013) Comments on ‘on information quality’. Journal of the Royal Statistical Society, Series A, 176(4), pp. 27–29.
  23. Serel, D.A. (2009) Economic design of EWMA control charts based on loss function. Mathematical and Computer Modelling, 49(3–4), pp. 745–759.
  24. Shmueli, G. (2010) To explain or to predict? Statistical Science, 25(3), pp. 289–310.
  25. Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. MIS Quarterly, 35(3), pp. 553–572.
  26. Talburt, J.R. (2011) Entity Resolution and Information Quality. Morgan Kaufmann, Burlington, VT.
  27. Trafimow, D. and Marks, M. (2015) Editorial. Basic and Applied Social Psychology, 37(1), pp. 1–2.
  28. Tukey, J.W. (1962) The future of data analysis. Annals of Mathematical Statistics, 33(1), pp. 1–67.
  29. UK Department of Health (2004) A Strategy for NHS Information Quality Assurance – Consultation Draft. Department of Health, London. http://webarchive.nationalarchives.gov.uk/20130107105354/http://www.dh.gov.uk/prod_consum_dh/groups/dh_digitalassets/@dh/@en/documents/digitalasset/dh_4087588.pdf (accessed May 2, 2016).
  30. Wang, R.Y., Kon, H.B. and Madnick, S.E. (1993) Data Quality Requirements Analysis and Modeling. 9th International Conference on Data Engineering, Vienna.
  31. Zhou, Y., Talburt, J., Su, Y. and Yin, L. (2010) OYSTER: A Tool for Entity Resolution in Health Information Exchange. Proceedings of the Fifth International Conference on the Cooperation and Promotion of Information Resources in Science and Technology (COINFO10), pp. 356–362.