1.6 Herald of the Controversy
Although Bacon could clearly be considered an ancestor of the debate, we will now confine
ourselves to a specific and contemporary statistical understanding and elucidation of the
issue, and the key figure is undoubtedly the American mathematician John W. Tukey
(1915–2000). In the annals of the history of science, he is primarily chronicled for his
contributions in the field of statistics and data analysis. In 1977, his now classic monograph
Exploratory Data Analysis (EDA) was published (Tukey, 1977); by then, however, he had already
been successful in many areas inside and outside of statistics. Originally a chemist, he shifted
his attention to mathematics and earned his PhD at Princeton on a topological subject.
His interest in metamathematics led to contributions to logic and model theory (Tukey's
lemma), and his investigations in the spectral analysis of time series resulted, among other
things, in the famous fast Fourier transform algorithm. In statistics, his name is associated
with post hoc tests in the analysis of variance (e.g., the Tukey range test), resampling methods
such as the jackknife, and of course with many descriptive, often graphical techniques, ranging
from box plots and stem-and-leaf plots to smoothing techniques and all kinds of new descriptive
metrics. Tukey worked with various renowned scientists, such as the physicist Richard Feynman,
the computing pioneers John von Neumann and Claude Shannon, Turing Award laureate Richard
Hamming, and the statistician Samuel Wilks. Among
the many concepts and notions introduced by Tukey, the terms bit and software are undoubtedly
the most famous. Of the many other neologisms attributed to him, some are probably
apocryphal, but they indisputably illustrate Tukey's influence on the times in which he
lived. With respect to the themes addressed in this chapter, an essential aspect of his legacy
concerns the fact that he appears as the herald of the current controversy. In fact, he ushered
in a new phase in the development of statistics and data analysis, which, as a Hegelian
process, is unfolding now and is to be completed in the era of big data (Starmans,
2013). This triptych of thesis–antithesis–synthesis will be briefly discussed in the next
sections.
1.6.1 Exploratory Data Analysis
We start with the publication of EDA, a remarkable and in many ways unorthodox book,
which is very illustrative of the underlying debate. First, it contains no axioms, theorems,
lemmas, or proofs, and barely even any formulas. Given Tukey's earlier theoretical
mathematical contributions, this is striking, though he certainly was not the first. His
predecessor here was Ronald Fisher, whose Statistical Methods for Research Workers (1925)
went through many reprints and can in fact be considered the first methodological and
statistical manual. That book provided many practical tips for conducting research,
required only modest mathematical knowledge, and contained relatively few formulas.
Partly as a result, Fisher's methods quickly became known and canonized in biology,
agronomy, and psychology, well before mathematical statistics had codified these insights.
However, Tukey was in many ways different from Fisher, who of course was a theoretical
statistician at heart. EDA hardly seems comparable to traditional statistical handbooks. It
contains no theoretical distributions, significance tests, p values, hypotheses, parameter
estimates, or confidence intervals. There is no sign of confirmatory or inferential
statistics; the book is purely about understanding data, looking for patterns, relationships, and
structures in the data, and visualizing the results. According to Tukey, the data analyst has to
go to work like a detective, a contemporary Sherlock Holmes, looking for clues, signs, or hints. Tukey
maintains this metaphor consistently throughout the book and provides the data analyst
with a toolbox full of methods for the understanding of frequency distributions, smoothing
techniques, scale transformations, and, above all, many graphical techniques for the exploration,
storage, abstraction, and illustration of data. Strikingly, EDA contains a large number
of neologisms, which Tukey sometimes uses tactically or polemically, but which may
have an alienating effect on the uninitiated reader. He chose these words out of his
dissatisfaction with conventional techniques and nomenclature, which, according to him, his
statistical colleagues habitually presented to the researcher as unshakable and sacred.
Finally, the bibliography of EDA contains remarkably few references for a voluminous study
of nearly 700 pages: only two, a joint article by Mosteller and Tukey and the Bible.
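To make concrete what such a tool looks like, the following short Python sketch (an illustration of ours, not code from EDA) computes the ingredients of one of Tukey's best-known devices, the box plot: the five-number summary and the 1.5 × IQR fences used to flag potential outliers. Note that NumPy's percentile routine uses interpolated quartiles rather than Tukey's original hinges.

```python
import numpy as np

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    x = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, med, q3, x.max()

def tukey_fences(values, k=1.5):
    """Flag points lying more than k * IQR beyond the quartiles (Tukey's rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return (lower, upper), outliers

data = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 9.8]  # toy numbers for illustration
print("five-number summary:", five_number_summary(data))
print("fences and flagged points:", tukey_fences(data))
```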
1.6.2 Thesis
The unorthodox approach of Tukey in EDA shows a rather fundamental dissatisfaction with
the prevailing statistical practice and the underlying paradigm of inferential/confirmatory
statistics. The contours of this tradition began to take shape when the probabilistic
revolution during the early twentieth century was well under way: first, with the prob-
ability distributions and goodness-of-fit procedures developed by Karl Pearson, then the
significance tests and maximum likelihood estimation (MLE) developed by Ronald Fisher,
the paradigm of hypothesis testing by Jerzy Neyman and Egon Pearson, and the confidence
intervals of Neyman. These techniques, combined with analysis of variance and regression
analysis, made a veritable triumphal procession through the empirical sciences, and often not
just methodologically. Biology and psychology sometimes seemed to be transformed into, or
reduced to, applied statistics. According to some researchers, inferential/testing statistics
became virtually synonymous with the scientific method and appeared philosophically justified
from the 1930s onward, especially with the advancement of the hypothetico-deductive method
and Popper's concept of falsifiability. The success inevitably caused a backlash, not
only because many psychologists and biologists did not accept the annexation of their field by
the statisticians without a fight but also for intrinsic reasons. The embrace of the
heritage of Pearson, Fisher, and Neyman could not conceal the fact that various techniques
were mutually contradictory, yet were presented and used as a seemingly coherent whole.
In spite of their universal application, there were many theoretical and conceptual
problems regarding the interpretation and use of concepts such as probability, confidence
intervals, and other (allegedly) counterintuitive notions, causing unease and backlash that
continue to this day. The emergence of nonparametric and robust statistics formed a
prelude to a countermovement, propagated by Tukey since the 1960s.
1.6.3 Antithesis
Although Tukey in EDA does his best to emphasize the importance of the confirmatory
tradition, the antagonism becomes manifest to the reader at the beginning of the book.
Moreover, he had already put his cards on the table in 1962, in the famous opening passage
of The Future of Data Analysis.
For a long time I have thought that I was a statistician, interested in inferences
from the particular to the general. But as I have watched mathematical statistics
evolve, I have had cause to wonder and to doubt. And when I have pondered
about why such techniques as the spectrum analysis of time series have proved
so useful, it has become clear that their ‘dealing with fluctuations’ aspects are,
in many circumstances, of lesser importance than the aspects that would already
have been required to deal effectively with the simpler case of very extensive data
where fluctuations would no longer be a problem. All in all, I have come to feel
that my central interest is in data analysis, which I take to include, among other
things: procedures for analyzing data, techniques for interpreting the results of
such procedures, ways of planning the gathering of data to make its analysis easier,
more precise or more accurate, and all the machinery and results of mathematical
statistics which apply to analyzing data. (...) Data analysis is a larger and more
varied field than inference, or allocation.
In other writings, too, Tukey makes a sharp distinction between statistics and data analysis,
and this approach largely determines the status quo. First, Tukey unmistakably contributed
to the emancipation of the descriptive/visual approach, which had been pioneered by William
Playfair (eighteenth century) and Florence Nightingale (nineteenth century) but had been
overshadowed by the rise of inferential statistics in the intervening period. In addition, it can
hardly be surprising that many see in Tukey a pioneer of computational disciplines such as data
mining and machine learning, although he himself assigned only a modest place to the computer
in his analyses. More importantly, however, because of his alleged antitheoretical stance, Tukey
is often regarded as the person who tried to undo the Fisherian revolution. He could thus be
regarded as an exponent or precursor of today's erosion of the model concept, with the view
that all models are wrong, that the classical concept of truth is obsolete, and that pragmatic
criteria such as predictive success should come first in data analysis.
1.6.4 Synthesis
This very succinctly sketched opposition has many aspects that cannot all be discussed in this
chapter; we focus on three considerations. Although it almost sounds like a platitude,
it must first be noted that EDA techniques are nowadays implemented in all statistical
packages, either on their own or sometimes in hybrid form alongside inferential methods. In
current empirical research methodology, EDA has been integrated into different phases of the
research process.
In the second place, it could be argued that Tukey did not undermine the revolution
initiated and carried forward by Galton and Pearson, but on the contrary drew the
ultimate consequences of this position. Indeed, it was Galton who had shown 100 years
earlier that variation and change are intrinsic in nature, urging a search for the deviant,
the special, or the idiosyncratic, and indeed it was Pearson who realized that the straitjacket
of the normal distribution (Laplace and Quetelet) had to be abandoned and replaced
by many (classes of) skewed distributions. Galton's heritage suffered somewhat from the
successes of parametric Fisherian statistics, with its strong model assumptions, and was
partly restored by Tukey. Finally, the contours of a Hegelian triad become visible. The
nineteenth-century German philosopher G.W.F. Hegel postulated that history proceeds through
a process of formation or development, in which a thesis evokes an antithesis, both
of which are then brought to completion at a higher level in a synthesis. Applied to the
less metaphysically oriented present issue, this dialectical principle seems very apparent
in the era of big data, which evokes a convergence between data analysis and statistics,
creating similar problems for both. It places high demands on data management, storage,
and retrieval; has a great influence on research into the efficiency of machine learning
algorithms; and also involves new challenges for statistical
inference and the underlying mathematical theory. These include the consequences of
misspecified models, the problem of small, high-dimensional datasets (microarray data), the
search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency
theory, and so on. The fact that many data-intensive empirical sciences depend heavily on
machine learning algorithms and statistics makes the need for bridging the gap compelling
for practical reasons. The gap also appears in a long-standing philosophical opposition, which
manifests itself in the scientific realism debate. Roughly put, the scientific realist believes in a
mind-independent objective reality that can be known and about which true statements
can be made. This applies equally to postulated theoretical entities. In contrast, the
empiricist/instrumentalist, who accepts no reality behind the phenomena, doubts causality
and has a more pragmatic vision of truth. Thus considered, Fisher belonged to the first
tradition. He shows himself to be a dualist in the explicit distinction between sample and
population: the statistic is calculated in the sample and has a distribution of its own,
based on the parameter, a fixed but unknown quantity that can be estimated. He focuses
on the classic problems of specification, estimation, and distribution. From this point of
view, the empiricist and anticausalist Karl Pearson belonged to the second tradition. He,
too, replaced material reality by the probability distribution, but according to him this
distribution was observable in the data, not an expression of, or stand-in for, an underlying
real world. Although Pearson was far from antitheoretical, Tukey and his conception of data
analysis are closer to the anticausalist, goodness-of-fit-oriented, monist Pearsonian tradition
than to the causalist, estimation-oriented, dualist Fisherian approach.
1.7 Plea for Reconciliation
In the previous sections, we showed how the issue of the current practice of data analysis
has certain aspects that can be traced back to classical epistemic positions that have
traditionally played a role in the quest for knowledge, the philosophy of science, and the
rise of statistics: the ancient dichotomy between empiricism and rationalism, the Baconian
stance, the Pearson-Fisher controversy, Tukey's sharpening of the debate, and most recently
the Wigner-Google controversy. It also appeared that not all controversies lie along the
same dimension, that there is no one-to-one correspondence among them, and that many seem
to suffer from the fallacy of the false dilemma, as may be concluded from the many attempts
to rephrase the dilemma. Perhaps the coexistence of positions is a precondition (in the Hegelian sense)
for progress/synthesis, but the latter is not always manifest in practice. Due to the growing
dependency on data and the fact that all sciences, including epistemology, have experienced
a probabilistic revolution, a reconciliation is imperative. We have shown that the debate has
many aspects, including, among other things, the erosion of models and the predicament
of truth, the obscure role of the dualist concept of estimation in current data analysis, the
disputed need for a notion of causality, the problem of a warranted choice for an appropriate
machine learning algorithm, and the methodology required to obtain data. Here, we will
discuss one such attempt at reconciliation, restricting ourselves to a few aspects that are
typical of the current practice of data analysis and are also rooted in the tradition we have
sketched. We then outline briefly how such a step toward reconciliation can be achieved with
the proposed methodology of targeted maximum likelihood estimation (TMLE) combined
with super learning (SL) algorithms (Van der Laan and Rose, 2011; Starmans and van der
Laan, 2013; Van der Laan, 2013).
Typically, the current practice of statistical data analysis relies heavily on parametric
models and MLE as an estimation method. The unbiasedness of MLE is of course determined
by the correct specification of the model. An important assumption is that the probability
distribution that generated the data is known up to a finite number of parameters.
Violation of this assumption, that is, misspecification of the model, may lead to biased
and extremely difficult-to-interpret estimators, often identified with coefficients in a
(logistic) regression model. This cannot be repaired by a larger sample size or by big data.
In this respect, George Box’s famous dictum that “Essentially, all models are wrong,
but some are useful” is often quoted, but there really is a clear erosion of the model
concept in statistics, sometimes making the classic concept of truth obsolete. Models often
demonstrably do not contain (an approximation of) the true data-generating distribution
and ignore the available realistic background knowledge. The models must therefore be made
bigger, which makes the MLE problematic. Essential here is the fact that maximum
likelihood estimators are typically nontargeted: the full distribution does not have to be
estimated to answer almost any conceivable research question, which typically requires only
a low-dimensional target parameter of the distribution. In a nontargeted approach, the
evaluation criterion focuses on the fit of the entire distribution, and the estimation error is
spread over the entire distribution. The MLE of the target parameter is then not necessarily unbiased, especially in
high-dimensional datasets (e.g., microarray data) and/or data with thousands of potential
covariates or interaction terms. The larger the statistical model, the more problematic the
nontargeted approach.
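To illustrate the last point, the following small simulation is a hypothetical example of ours (the data-generating process and the target are chosen purely for illustration): the true regression function is nonlinear, the analyst fits a misspecified linear model by least squares (the MLE under Gaussian errors), and the estimate of a low-dimensional target, here the regression function at a single point, remains biased however large the sample becomes, whereas a simple local-averaging estimator does not suffer from this.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Nonlinear truth: E[Y | X = x] = exp(x); the analyst will assume a straight line."""
    x = rng.uniform(0.0, 3.0, n)
    y = np.exp(x) + rng.normal(0.0, 1.0, n)
    return x, y

x0 = 2.8                     # target parameter: psi = E[Y | X = x0]
truth = np.exp(x0)

for n in [200, 2_000, 20_000, 200_000]:
    x, y = simulate(n)
    # Misspecified parametric fit: ordinary least squares for a linear mean,
    # i.e., the MLE under the (wrong) assumption of linearity and Gaussian errors.
    slope, intercept = np.polyfit(x, y, 1)
    mle_error = (intercept + slope * x0) - truth
    # Flexible alternative: simply average the outcomes observed near x0.
    local_error = y[np.abs(x - x0) < 0.1].mean() - truth
    print(f"n={n:>7}  misspecified-MLE error={mle_error:+.2f}  "
          f"local-average error={local_error:+.2f}")
```

The point is not that linear models are bad, but that when the model is wrong, more data only makes the estimator converge faster to the wrong value.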
Targeted learning starts with the specification of a nonparametric or semiparametric
model that contains only the realistic background knowledge and focuses on the parameter
of interest, which is considered a property of the as yet unknown, true data-generating
distribution. This methodology has a clear imperative: the model and the parameter of interest
must be specified in advance. The (empirical) research question needs to be translated into
terms of the parameter of interest. In this way, a rehabilitation of the model concept is realized.
Subsequently, targeted learning involves an estimation procedure that takes place in a data-
adaptive, flexible way in two steps. First, an initial estimate is sought of the relevant part
of the true distribution, namely the part needed to evaluate the target parameter.
This initial estimator is found using the SL algorithm. In short, this is based on a library of
many diverse analytical techniques, ranging from logistic regression to ensemble techniques,
random forests, and support vector machines. Because the choice of any one of these techniques
is generally subjective and the variation in the results of the different techniques is usually
considerable, a weighted combination of their results is calculated by means of cross-validation.
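The following Python sketch illustrates this cross-validation step in the spirit of super learning; it is a simplified illustration of ours, not the authors' implementation. A small library of learners is fit, out-of-fold predictions are collected, and nonnegative weights summing to one are chosen to minimize the cross-validated squared error of the combined prediction.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Toy data standing in for covariates and a binary outcome.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# A small library of diverse learners, echoing the examples in the text.
library = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=1),
    "svm": SVC(probability=True, random_state=1),
}

# Out-of-fold predicted probabilities for every learner (V-fold cross-validation).
Z = np.column_stack([
    cross_val_predict(learner, X, y, cv=5, method="predict_proba")[:, 1]
    for learner in library.values()
])

# Choose nonnegative weights summing to one that minimize the
# cross-validated squared error of the weighted combination.
def cv_risk(w):
    return np.mean((y - Z @ w) ** 2)

k = len(library)
result = minimize(
    cv_risk,
    x0=np.full(k, 1.0 / k),
    bounds=[(0.0, 1.0)] * k,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    method="SLSQP",
)
weights = result.x

# Refit each learner on all data; the combined predictor uses the chosen weights.
fitted = [learner.fit(X, y) for learner in library.values()]

def combined_predict(X_new):
    preds = np.column_stack([f.predict_proba(X_new)[:, 1] for f in fitted])
    return preds @ weights

print(dict(zip(library.keys(), np.round(weights, 3))))
```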
On the basis of this initial estimator, the second stage of the estimation procedure can then
be started, wherein the initial fit is updated with the objective of an optimal bias-variance
trade-off for the parameter of interest. This is accomplished with a maximum likelihood
estimator of the fluctuation parameter of a parametric submodel selected through the initial
estimator. Statistical inference is then completed by calculating standard errors on the basis
of influence-curve theory or resampling techniques.
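What such a targeting step can look like is sketched below for one specific, commonly studied target parameter: the average treatment effect of a binary treatment A on a binary outcome Y given covariates W. This is a minimal illustration of ours, assuming simulated data and plain logistic regressions as initial estimators; it is not the authors' implementation, and in practice the initial fit would come from the super learner of the previous step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Simulated observational data (purely illustrative).
n = 5000
W = rng.normal(size=(n, 2))                                            # covariates
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))              # treatment
Y = rng.binomial(1, expit(-0.5 + A + 0.6 * W[:, 0] + 0.4 * W[:, 1]))   # outcome

# Step 1: initial estimators (in practice the super learner fit would be used here).
Q_fit = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)  # E[Y | A, W]
g_fit = LogisticRegression(max_iter=1000).fit(W, A)                        # P(A = 1 | W)
Q1 = Q_fit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
Q0 = Q_fit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]
QA = np.where(A == 1, Q1, Q0)
g1 = np.clip(g_fit.predict_proba(W)[:, 1], 0.025, 0.975)

# Step 2: targeting. Fluctuate the initial fit along a one-parameter logistic
# submodel with "clever covariate" H; estimate the fluctuation parameter eps by MLE.
H = A / g1 - (1 - A) / (1 - g1)
eps = 0.0
for _ in range(25):                            # Newton-Raphson for the scalar MLE
    p = expit(logit(QA) + eps * H)
    eps += np.sum(H * (Y - p)) / np.sum(H**2 * p * (1 - p))

Q1_star = expit(logit(Q1) + eps / g1)
Q0_star = expit(logit(Q0) - eps / (1 - g1))
psi = np.mean(Q1_star - Q0_star)               # targeted plug-in estimate of the ATE

# Standard error from the estimated influence curve.
QA_star = np.where(A == 1, Q1_star, Q0_star)
IC = H * (Y - QA_star) + (Q1_star - Q0_star) - psi
se = IC.std(ddof=1) / np.sqrt(n)
print(f"ATE estimate {psi:.3f}  (influence-curve SE {se:.3f})")
```

The update adds a single fluctuation direction chosen for the target parameter rather than for the overall fit, and the influence curve then delivers the standard error mentioned above.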
Thus, parameter estimation keeps, or rather regains, a crucial place in data analysis. If one
wants to do justice to variation and change in the phenomena, one cannot deny Fisher's
unshakable insight that randomness is intrinsic and implies that the estimator of the parameter
of interest itself has a distribution. Big data, and even census surveys or other attempts to
capture the whole of reality in the dataset, cannot replace this. After doing justice to the
notion of a model and restoring the dualist concept of estimation in the practice of data
analysis, two methodological criteria are at stake: a precisely formulated research question
and a choice of algorithm that is less dependent on personal preferences. Finally,
some attention has to be paid to the notion of causality, which has always been a difficult
area in statistics, but is now associated with this discipline and, of course, in the presence of
big data is by some considered to be unnecessary and outdated. (Correlations are sufficient!) It cannot
be overemphasized that the experience of cause–effect relationships in reality is inherent to
the human condition, and many attempts to exclude it, including those of Bertrand Russell
and Karl Pearson, have dramatically failed. Most of the data analyses include impact studies
or have other causal connotations. The TMLE parameter can be interpreted statistically