The Advent of Data Science 13
1.6 Herald of the Controversy
Although Bacon could clearly be considered an ancestor of the debate, we will now confine
ourselves to a specific and contemporary statistical understanding and elucidation of the
issue, and the key figure is undoubtedly the American mathematician John W. Tukey
(1915–2000). In the annals of the history of science, he is primarily chronicled for his
contributions in the field of statistics and data analysis. In 1977, his now classic monograph
Exploratory Data Analysis (EDA) saw the light (Tukey, 1977); however, he had already been
successful in many areas inside and outside of statistics. Originally a chemist, he shifted
his attention to mathematics and earned his PhD at Princeton on a topological subject.
His interest in metamathematics led to contributions to logic and model theory (Tukey’s
lemma), and his investigations in the spectral analysis of time series resulted, among other
things, in the famous fast Fourier transform algorithm. In statistics, his name is
associated with post hoc tests in the analysis of variance (e.g., the Tukey range test),
resampling methods such as the jackknife, and of course many descriptive, often graphical
techniques, ranging from box plots and stem-and-leaf plots to smoothing techniques and
all kinds of new descriptive metrics. Tukey worked with various renowned scientists such
as the physicist Richard Feynman, the computing pioneers John von Neumann and Claude
Shannon, the Turing Award laureate Richard Hamming, and the statistician Samuel Wilks. Among
the many concepts and notions introduced by Tukey, the terms bit and software are undoubtedly
the most famous. Of the many other neologisms attributed to him, some are probably
apocryphal, but they indisputably illustrate Tukey's influence on the times in which he
lived. With respect to the themes addressed in this chapter, an essential aspect of his legacy
is that he appears as the herald of the current controversy. In fact, he ushered
in a new phase in the development of statistics and data analysis, which, like a Hegelian
process, is still unfolding and is to be completed in the era of big data (Starmans,
2013). This triptych of thesis–antithesis–synthesis will be briefly discussed in the next
sections.
1.6.1 Exploratory Data Analysis
We start with the publication of EDA, a remarkable and in many ways unorthodox book,
which is very illustrative of the underlying debate. First, it contains no axioms, theorems,
lemmas, or proofs, and hardly even any formulas. Given Tukey's previous theoretical
mathematical contributions, this is striking, though he certainly was not the first. Ronald Fisher,
who wrote Statistical Methods for Research Workers (1925), which went through many
reprints and in fact can be considered as the first methodological and statistical manual,
was his predecessor. This book provided many practical tips for conducting research,
required only slight mathematical knowledge, and contained relatively few formulas.
Partly as a result, Fisher's methods quickly became known and canonized in biology,
agronomy, and psychology, well before mathematical statistics had codified these insights.
However, Tukey was in many ways different from Fisher, who of course was a theoretical
statistician at heart. EDA hardly seems comparable to traditional statistical handbooks. It
contains no theoretical distributions, significance tests, p values, hypotheses, parameter
estimates, or confidence intervals. There is no sign of confirmatory or inferential
statistics; the focus is purely on understanding data, looking for patterns, relationships,
and structures, and visualizing the results. According to Tukey, the data analyst has to go
to work like a detective, a contemporary Sherlock Holmes, looking for clues, signs, or hints. Tukey
maintains this metaphor consistently throughout the book and provides the data analyst
with a toolbox full of methods for the understanding of frequency distributions, smoothing