1.6 Herald of the Controversy
Although Bacon could clearly be considered an ancestor of the debate, we will now confine
ourselves to a specific and contemporary statistical understanding and elucidation of the
issue, and the key figure is undoubtedly the American mathematician John W. Tukey
(1915–2000). In the annals of the history of science, he is primarily chronicled for his
contributions in the field of statistics and data analysis. In 1977, his now classic monograph
Exploratory Data Analysis (EDA) was published (Tukey, 1977); by then, however, he had already
been successful in many areas inside and outside of statistics. Originally a chemist, he shifted
his attention to mathematics and earned his PhD at Princeton on a topological subject.
His interest in metamathematics led to contributions to logic and model theory (Tukey's
lemma), and his investigations in the spectral analysis of time series resulted, among other
things, in the famous fast Fourier transform algorithm. In statistics, his name is associated
with post hoc tests in the analysis of variance (e.g., the Tukey range test), resampling methods
such as the jackknife, and of course with many descriptive, often graphical techniques, ranging
from box plots and stem-and-leaf plots to smoothing techniques and all kinds of new descriptive
metrics. Tukey worked with various renowned scientists, such as the physicist Richard Feynman,
the computing pioneers John von Neumann and Claude Shannon, Turing Award laureate Richard
Hamming, and the statistician Samuel Wilks. Among
the many concepts and notions introduced by Tukey, the terms bit and software are undoubtedly
the most famous. Of the many other neologisms attributed to him, some are probably
apocryphal, but they indisputably illustrate Tukey's influence on the times in which he
lived. With respect to the themes addressed in this chapter, an essential aspect of his legacy
concerns the fact that he appears as the herald of the current controversy. In fact, he ushered
in a new phase in the development of statistics and data analysis, which, as a Hegelian
process, is unfolding now and is to be completed in the era of big data (Starmans,
2013). This triptych of thesis–antithesis–synthesis will be briefly discussed in the next
sections.
1.6.1 Exploratory Data Analysis
We start with the publication of EDA, a remarkable and in many ways unorthodox book,
which is very illustrative of the underlying debate. First, it contains no axioms, theorems,
lemmas, or proofs, and barely even any formulas. Given Tukey's earlier theoretical
mathematical contributions, this is striking, though he certainly was not the first. His
predecessor here was Ronald Fisher, whose Statistical Methods for Research Workers (1925)
went through many reprints and can in fact be considered the first methodological and
statistical manual. That book provided many practical tips for conducting research,
required only modest mathematical knowledge, and contained relatively few formulas.
Partly as a result, Fisher's methods quickly became known and canonized in biology,
agronomy, and psychology, well before mathematical statistics had codified these insights.
However, Tukey was in many ways different from Fisher, who of course was a theoretical
statistician at heart. EDA hardly seems comparable to traditional statistical handbooks. It
contains no theoretical distributions, significance tests, p values, hypotheses, parameter
estimates, or confidence intervals. There is no sign of confirmatory or inferential
statistics; the book is purely about understanding data, looking for patterns, relationships, and
structures in the data, and visualizing the results. According to Tukey, the data analyst has to
go to work like a detective, a contemporary Sherlock Holmes, looking for clues, signs, or hints. Tukey
maintains this metaphor consistently throughout the book and provides the data analyst
with a toolbox full of methods for the understanding of frequency distributions, smoothing
techniques, scale transformations, and, above all, many graphical techniques for the exploration,
storage, abstraction, and illustration of data. Strikingly, EDA contains a large number
of neologisms, which Tukey sometimes uses tactically or polemically, but which may
have an alienating effect on the uninitiated reader. He chose these words out of his
dissatisfaction with conventional techniques and nomenclature, which, according to him, his
statistical colleagues habitually presented to the researcher as unshakable and sacred.
Finally, the bibliography of EDA contains remarkably few references for a voluminous study
of nearly 700 pages: only two, a joint article by Mosteller and Tukey and the Bible.
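To make concrete what such a tool looks like, the following short Python sketch (an illustration of ours, not code from EDA) computes the ingredients of one of Tukey's best-known devices, the box plot: the five-number summary and the 1.5 × IQR fences used to flag potential outliers. Note that NumPy's percentile routine uses interpolated quartiles rather than Tukey's original hinges.

```python
import numpy as np

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    x = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, med, q3, x.max()

def tukey_fences(values, k=1.5):
    """Flag points lying more than k * IQR beyond the quartiles (Tukey's rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return (lower, upper), outliers

data = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 9.8]  # toy numbers for illustration
print("five-number summary:", five_number_summary(data))
print("fences and flagged points:", tukey_fences(data))
```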
1.6.2 Thesis
The unorthodox approach of Tukey in EDA shows a rather fundamental dissatisfaction with
the prevailing statistical practice and the underlying paradigm of inferential/confirmatory
statistics. The contours of this tradition began to take shape when the probabilistic
revolution during the early twentieth century was well under way: first, with the prob-
ability distributions and goodness-of-fit procedures developed by Karl Pearson, then the
significance tests and maximum likelihood estimation (MLE) developed by Ronald Fisher,
the paradigm of hypothesis testing by Jerzy Neyman and Egon Pearson, and the confidence
intervals of Neyman. These techniques, combined with analysis of variance and regression
analysis, made a veritable triumphal procession through the empirical sciences, and often not
just methodologically. Biology and psychology sometimes seemed to be transformed into, or
reduced to, applied statistics. According to some researchers, inferential/testing statistics
became virtually synonymous with the scientific method and appeared philosophically justified
from the 1930s onward, especially with the advancement of the hypothetico-deductive method
and Popper's concept of falsifiability. The success inevitably caused a backlash, not
only because many psychologists and biologists did not accept the annexation of their field by
the statisticians without a fight but also for intrinsic reasons. The embrace of the
heritage of Pearson, Fisher, and Neyman could not conceal the fact that various techniques
were mutually contradictory, yet were presented and used as a seemingly coherent whole.
In spite of their universal application, there were many theoretical and conceptual
problems regarding the interpretation and use of concepts such as probability, confidence
intervals, and other (allegedly) counterintuitive notions, causing unease and backlash that
continue to this day. The emergence of nonparametric and robust statistics formed a
prelude to a countermovement, propagated by Tukey since the 1960s.
1.6.3 Antithesis
Although Tukey in EDA does his best to emphasize the importance of the confirmatory
tradition, the antagonism becomes manifest to the reader at the beginning of the book.
Moreover, he had already put his cards on the table in 1962, in the famous opening passage
of The Future of Data Analysis.
For a long time I have thought that I was a statistician, interested in inferences
from the particular to the general. But as I have watched mathematical statistics
evolve, I have had cause to wonder and to doubt. And when I have pondered
about why such techniques as the spectrum analysis of time series have proved
so useful, it has become clear that their ‘dealing with fluctuations’ aspects are,
in many circumstances, of lesser importance than the aspects that would already
have been required to deal effectively with the simpler case of very extensive data
where fluctuations would no longer be a problem. All in all, I have come to feel
that my central interest is in data analysis, which I take to include, among other
things: procedures for analyzing data, techniques for interpreting the results of
such procedures, ways of planning the gathering of data to make its analysis easier,
more precise or more accurate, and all the machinery and results of mathematical
statistics which apply to analyzing data. (...) Data analysis is a larger and more
varied field than inference, or allocation.
In other writings, too, Tukey makes a sharp distinction between statistics and data analysis,
and this approach largely determines the status quo. First, Tukey unmistakably contributed
to the emancipation of the descriptive/visual approach, which had been pioneered by William
Playfair (eighteenth century) and Florence Nightingale (nineteenth century) but had been
overshadowed by the rise of inferential statistics in the intervening period. In addition, it can
hardly be surprising that many see in Tukey a pioneer of computational disciplines such as data
mining and machine learning, although he himself assigned only a modest place to the computer
in his analyses. More importantly, however, because of his alleged antitheoretical stance, Tukey
is often regarded as the person who tried to undo the Fisherian revolution. He could thus be
regarded as an exponent or precursor of today's erosion of the model concept, with the view
that all models are wrong, that the classical concept of truth is obsolete, and that pragmatic
criteria such as predictive success should come first in data analysis.
1.6.4 Synthesis
This very succinctly sketched opposition has many aspects that cannot all be discussed in this
chapter; we focus on three considerations. Although it almost sounds like a platitude,
it must first be noted that EDA techniques are nowadays implemented in all statistical
packages, either on their own or sometimes in hybrid form alongside inferential methods. In
current empirical research methodology, EDA has been integrated into different phases of the
research process.
In the second place, it could be argued that Tukey did not undermine the revolution
initiated and carried forward by Galton and Pearson, but on the contrary drew the
ultimate consequences of this position. Indeed, it was Galton who had shown 100 years
earlier that variation and change are intrinsic in nature, urging a search for the deviant,
the special, or the idiosyncratic, and indeed it was Pearson who realized that the straitjacket
of the normal distribution (Laplace and Quetelet) had to be abandoned and replaced
by many (classes of) skewed distributions. Galton's heritage suffered somewhat from the
successes of parametric Fisherian statistics, with its strong model assumptions, and was
partly restored by Tukey. Finally, the contours of a Hegelian triad become visible. The
nineteenth-century German philosopher G.W.F. Hegel postulated that history proceeds through
a process of formation or development, in which a thesis evokes an antithesis, both
of which are then brought to completion at a higher level in a synthesis. Applied to the
less metaphysically oriented present issue, this dialectical principle seems very apparent
in the era of big data, which evokes a convergence between data analysis and statistics,
creating similar problems for both. It places high demands on data management, storage,
and retrieval; has a great influence on research into the efficiency of machine learning
algorithms; and also involves new challenges for statistical
inference and the underlying mathematical theory. These include the consequences of
misspecified models, the problem of small, high-dimensional datasets (microarray data), the
search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency
theory, and so on. The fact that many data-intensive empirical sciences depend heavily on
machine learning algorithms and statistics makes the need for bridging the gap compelling
for practical reasons. The gap also appears in a long-standing philosophical opposition, which
manifests itself in the scientific realism debate. Roughly put, the scientific realist believes in a
mind-independent objective reality that can be known and about which true statements
can be made. This applies equally to postulated theoretical entities. In contrast, the
empiricist/instrumentalist, who accepts no reality behind the phenomena, doubts causality
and has a more pragmatic vision of truth. Thus considered, Fisher belonged to the first
tradition. He shows himself to be a dualist in the explicit distinction between sample and
population: the statistic is calculated in the sample and has a distribution of its own,
based on the parameter, a fixed but unknown quantity that can be estimated. He focuses
on the classic problems of specification, estimation, and distribution. From this point of
view, the empiricist and anticausalist Karl Pearson belonged to the second tradition. He,
too, replaced material reality by the probability distribution, but according to him this
distribution was observable in the data, not an expression of, or stand-in for, an underlying
real world. Although Pearson was far from antitheoretical, Tukey and his conception of data
analysis are closer to the anticausalist, goodness-of-fit-oriented, monist Pearsonian tradition
than to the causalist, estimation-oriented, dualist Fisherian approach.
1.7 Plea for Reconciliation
In the previous sections, we showed how the issue of the current practice of data analysis
has certain aspects that can be traced back to classical epistemic positions that have
traditionally played a role in the quest for knowledge, the philosophy of science, and the
rise of statistics: the ancient dichotomy between empiricism and rationalism, the Baconian
stance, the Pearson-Fisher controversy, Tukey's sharpening of the debate, and most recently
the Wigner-Google controversy. It also appeared that not all controversies lie along the
same dimension, that there is no one-to-one correspondence among them, and that many seem
to suffer from the fallacy of the false dilemma, as may be concluded from the many attempts
to rephrase the dilemma. Perhaps the coexistence of positions is a precondition (in the Hegelian sense)
for progress/synthesis, but the latter is not always manifest in practice. Due to the growing
dependency on data and the fact that all sciences, including epistemology, have experienced
a probabilistic revolution, a reconciliation is imperative. We have shown that the debate has
many aspects, including, among other things, the erosion of models and the predicament
of truth, the obscure role of the dualist concept of estimation in current data analysis, the
disputed need for a notion of causality, the problem of a warranted choice for an appropriate
machine learning algorithm, and the methodology required to obtain data. Here, we will
discuss one such attempt at reconciliation, restricting ourselves to a few aspects that are
typical of the current practice of data analysis and are also rooted in the tradition we have
sketched. We then outline briefly how such a step toward reconciliation can be achieved with
the proposed methodology of targeted maximum likelihood estimation (TMLE) combined
with super learning (SL) algorithms (Van der Laan and Rose, 2011; Starmans and van der
Laan, 2013; Van der Laan, 2013).
Typically, the current practice of statistical data analysis relies heavily on parametric
models and MLE as an estimation method. The unbiasedness of MLE is of course determined
by the correct specification of the model. An important assumption is that the probability
distribution that generated the data is known up to a finite number of parameters.
Violation of this assumption, that is, misspecification of the model, may lead to biased
and extremely difficult-to-interpret estimators, often identified with coefficients in a
(logistic) regression model. This cannot be repaired by a larger sample size or by big data.
In this respect, George Box’s famous dictum that “Essentially, all models are wrong,
but some are useful” is often quoted, but there really is a clear erosion of the model
concept in statistics, sometimes making the classic concept of truth obsolete. Models often
demonstrably do not contain (an approximation of) the true data-generating distribution
and ignore the available realistic background knowledge. The models must therefore be made
bigger, which makes the MLE problematic. Essential here is the fact that maximum
likelihood estimators are typically nontargeted: the full distribution does not have to be
estimated to answer almost any conceivable research question, which typically requires only
a low-dimensional target parameter of the distribution. In a nontargeted approach, the
evaluation criterion focuses on the fit of the entire distribution, and the estimation error is
spread over the entire distribution. The MLE of the target parameter is then not necessarily unbiased, especially in
high-dimensional datasets (e.g., microarray data) and/or data with thousands of potential
covariates or interaction terms. The larger the statistical model, the more problematic the
nontargeted approach.
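To illustrate the last point, the following small simulation is a hypothetical example of ours (the data-generating process and the target are chosen purely for illustration): the true regression function is nonlinear, the analyst fits a misspecified linear model by least squares (the MLE under Gaussian errors), and the estimate of a low-dimensional target, here the regression function at a single point, remains biased however large the sample becomes, whereas a simple local-averaging estimator does not suffer from this.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Nonlinear truth: E[Y | X = x] = exp(x); the analyst will assume a straight line."""
    x = rng.uniform(0.0, 3.0, n)
    y = np.exp(x) + rng.normal(0.0, 1.0, n)
    return x, y

x0 = 2.8                     # target parameter: psi = E[Y | X = x0]
truth = np.exp(x0)

for n in [200, 2_000, 20_000, 200_000]:
    x, y = simulate(n)
    # Misspecified parametric fit: ordinary least squares for a linear mean,
    # i.e., the MLE under the (wrong) assumption of linearity and Gaussian errors.
    slope, intercept = np.polyfit(x, y, 1)
    mle_error = (intercept + slope * x0) - truth
    # Flexible alternative: simply average the outcomes observed near x0.
    local_error = y[np.abs(x - x0) < 0.1].mean() - truth
    print(f"n={n:>7}  misspecified-MLE error={mle_error:+.2f}  "
          f"local-average error={local_error:+.2f}")
```

The point is not that linear models are bad, but that when the model is wrong, more data only makes the estimator converge faster to the wrong value.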
Targeted learning starts with the specification of a nonparametric or semiparametric
model that contains only the realistic background knowledge and focuses on the parameter
of interest, which is considered a property of the as yet unknown, true data-generating
distribution. This methodology has a clear imperative: the model and the parameter of interest
must be specified in advance. The (empirical) research question needs to be translated into
terms of the parameter of interest. In this way, a rehabilitation of the model concept is realized.
Subsequently, targeted learning involves an estimation procedure that takes place in a data-
adaptive, flexible way in two steps. First, an initial estimate is sought of the relevant part
of the true distribution, namely the part needed to evaluate the target parameter.
This initial estimator is found using the SL algorithm. In short, this is based on a library of
many diverse analytical techniques, ranging from logistic regression to ensemble techniques,
random forests, and support vector machines. Because the choice of any one of these techniques
is generally subjective and the variation in the results of the different techniques is usually
considerable, a weighted combination of their results is calculated by means of cross-validation.
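The following Python sketch illustrates this cross-validation step in the spirit of super learning; it is a simplified illustration of ours, not the authors' implementation. A small library of learners is fit, out-of-fold predictions are collected, and nonnegative weights summing to one are chosen to minimize the cross-validated squared error of the combined prediction.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Toy data standing in for covariates and a binary outcome.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# A small library of diverse learners, echoing the examples in the text.
library = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=1),
    "svm": SVC(probability=True, random_state=1),
}

# Out-of-fold predicted probabilities for every learner (V-fold cross-validation).
Z = np.column_stack([
    cross_val_predict(learner, X, y, cv=5, method="predict_proba")[:, 1]
    for learner in library.values()
])

# Choose nonnegative weights summing to one that minimize the
# cross-validated squared error of the weighted combination.
def cv_risk(w):
    return np.mean((y - Z @ w) ** 2)

k = len(library)
result = minimize(
    cv_risk,
    x0=np.full(k, 1.0 / k),
    bounds=[(0.0, 1.0)] * k,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    method="SLSQP",
)
weights = result.x

# Refit each learner on all data; the combined predictor uses the chosen weights.
fitted = [learner.fit(X, y) for learner in library.values()]

def combined_predict(X_new):
    preds = np.column_stack([f.predict_proba(X_new)[:, 1] for f in fitted])
    return preds @ weights

print(dict(zip(library.keys(), np.round(weights, 3))))
```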
On the basis of this initial estimator, the second stage of the estimation procedure can then
be started, wherein the initial fit is updated with the objective of an optimal bias-variance
trade-off for the parameter of interest. This is accomplished with a maximum likelihood
estimator of the fluctuation parameter of a parametric submodel selected through the initial
estimator. Statistical inference is then completed by calculating standard errors on the basis
of influence-curve theory or resampling techniques.
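What such a targeting step can look like is sketched below for one specific, commonly studied target parameter: the average treatment effect of a binary treatment A on a binary outcome Y given covariates W. This is a minimal illustration of ours, assuming simulated data and plain logistic regressions as initial estimators; it is not the authors' implementation, and in practice the initial fit would come from the super learner of the previous step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Simulated observational data (purely illustrative).
n = 5000
W = rng.normal(size=(n, 2))                                            # covariates
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))              # treatment
Y = rng.binomial(1, expit(-0.5 + A + 0.6 * W[:, 0] + 0.4 * W[:, 1]))   # outcome

# Step 1: initial estimators (in practice the super learner fit would be used here).
Q_fit = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)  # E[Y | A, W]
g_fit = LogisticRegression(max_iter=1000).fit(W, A)                        # P(A = 1 | W)
Q1 = Q_fit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
Q0 = Q_fit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]
QA = np.where(A == 1, Q1, Q0)
g1 = np.clip(g_fit.predict_proba(W)[:, 1], 0.025, 0.975)

# Step 2: targeting. Fluctuate the initial fit along a one-parameter logistic
# submodel with "clever covariate" H; estimate the fluctuation parameter eps by MLE.
H = A / g1 - (1 - A) / (1 - g1)
eps = 0.0
for _ in range(25):                            # Newton-Raphson for the scalar MLE
    p = expit(logit(QA) + eps * H)
    eps += np.sum(H * (Y - p)) / np.sum(H**2 * p * (1 - p))

Q1_star = expit(logit(Q1) + eps / g1)
Q0_star = expit(logit(Q0) - eps / (1 - g1))
psi = np.mean(Q1_star - Q0_star)               # targeted plug-in estimate of the ATE

# Standard error from the estimated influence curve.
QA_star = np.where(A == 1, Q1_star, Q0_star)
IC = H * (Y - QA_star) + (Q1_star - Q0_star) - psi
se = IC.std(ddof=1) / np.sqrt(n)
print(f"ATE estimate {psi:.3f}  (influence-curve SE {se:.3f})")
```

The update adds a single fluctuation direction chosen for the target parameter rather than for the overall fit, and the influence curve then delivers the standard error mentioned above.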
Thus, parameter estimation keeps, or rather regains, a crucial place in data analysis. If one
wants to do justice to variation and change in the phenomena, one cannot deny Fisher's
unshakable insight that randomness is intrinsic and implies that the estimator of the parameter
of interest itself has a distribution. Big data, and even census surveys or other attempts to
capture the whole of reality in the dataset, cannot replace this. After doing justice to the
notion of a model and restoring the dualist concept of estimation in the practice of data
analysis, two methodological criteria are at stake: a precisely formulated research question
and a choice of algorithm that is less dependent on personal preferences. Finally,
some attention has to be paid to the notion of causality, which has always been a difficult
area in statistics, but is now associated with this discipline and, of course, in the presence of
big data is by some considered to be unnecessary and outdated. (Correlations are sufficient!) It cannot
be overemphasized that the experience of cause–effect relationships in reality is inherent to
the human condition, and many attempts to exclude it, including those of Bertrand Russell
and Karl Pearson, have dramatically failed. Most of the data analyses include impact studies
or have other causal connotations. The TMLE parameter can be interpreted statistically