Chapter 2

Brief Review of the Literature and Background Materials

A picture shows me at a glance what it takes dozens of pages of a book to expound.

Ivan Turgenev (Fathers and Sons)

Summary

Chapter 2 reviews some of the available literature on graphics for data analysis and statistics, gives a brief overview of software alternatives to R for graphics (as at the time of writing), and explores what relevant material is available on the web: discussions, graphics, and datasets. Finally, there is a list of texts covering statistical models that might be used in conjunction with the graphical approach described in the book.

2.1 Literature review

Progress in graphics is bedevilled by the lack of theory. Bertin's classic “Semiology of Graphics”, originally published in French in 1967, recently reissued by ESRI Press [Bertin, 2010], and Wilkinson's “Grammar of Graphics” [Wilkinson, 2005] are two of the few works that attempt to make substantial contributions. Bertin is not an easy read and is now old-fashioned in many ways, being essentially from a pre-computing age, but the book contains many interesting ideas. It has a strong geographic emphasis. Wilkinson describes a formal structure and includes many attractively drawn displays, although the reasons for drawing the graphics are seldom discussed. Hadley Wickham developed his R package ggplot2 using the “Grammar of Graphics” and has shown how useful Wilkinson's grammar can be in practice, that it is more than a theoretical construct.

There are far too many books to mention every one that offers good advice on how to draw graphics (let alone to mention all that offer any advice). Tufte's books, especially his first one [Tufte, 2001], are an excellent starting point, containing many attractive and instructive examples as well as some cautionary ones. Tufte offers general principles and discusses both graphics for data analysis and statistics and also what is now called Information Visualisation. Cleveland's advice ([Cleveland, 1993] and [Cleveland, 1994]) is more specific and directed at statistical graphics. Wainer has written a number of enlightening books ([Wainer, 1997], [Wainer, 2004] and [Wainer, 2009]) illustrating with a range of real examples how graphics can be used to reveal information in data. [Robbins, 2005] is worth looking through for practical advice and [Few, 2012], while taking a business-oriented line, supplies forceful opinions in a strongly argued text. For analysts working in the Life Sciences [Krause and McConnell, 2012] offers sound advice and discussion of many real applications.

For graphics in R there is the book [Murrell, 2005], which appeared in an extensively revised second edition in 2011. It covers the full range of R graphics and in particular it explains the details of Murrell's grid graphics package, which he implemented as a more structured alternative to the original graphics system in R. The book is fairly technical, but essential for anyone wanting to write graphics using grid and valuable for anyone wanting guidance on how graphics in R work. Wickham's ggplot2 package and Sarkar's lattice package are based on grid. The additional power and flexibility of grid comes at the expense of occasional slowness. Wickham [Wickham, 2009] and Sarkar [Sarkar, 2008] have both written useful books on their packages, although given the continual improvements and changes being made, it is probably best to consult the packages' websites rather than the books.
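To see how the grammar works in practice, here is a minimal ggplot2 sketch, using the built-in mtcars data purely as an example, that builds a display from layered components:

```r
# A layered-grammar sketch: data, aesthetic mappings, and geoms combined with "+".
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +   # data and aesthetic mappings
  geom_point() +                         # a layer of points
  geom_smooth(method = "lm")             # a second layer: a linear fit
```

Each component can be varied independently of the others, which is part of what makes the grammar more than a theoretical construct.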

If you want a specific form of graphic that is not directly available or want to amend or embellish a particular graphic, and these books are not enough, then one of the R cookbooks may help. There are at least two for graphics, [Mittal, 2011] and [Chang, 2012]. In addition, there is the German book [Rahlf, 2014], which provides extensive code examples for producing elegant images. Finally, given that drawing graphics often requires restructuring the data first, it may be helpful to consult one of the many textbooks and resources available for the R language itself.
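As a small example of the kind of restructuring often needed, converting a wide data frame to long form before plotting can be sketched in base R (the data frame here is invented purely for illustration):

```r
# Hypothetical wide-format data: one column per measurement occasion.
wide <- data.frame(id = 1:3,
                   before = c(5.1, 4.8, 5.6),
                   after  = c(6.0, 5.2, 6.3))

# Reshape to long form: one row per (id, phase) combination.
long <- reshape(wide, direction = "long",
                varying = c("before", "after"), v.names = "value",
                timevar = "phase", times = c("before", "after"))

boxplot(value ~ phase, data = long)   # now straightforward to plot by group
```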

For graphics not specifically tied to R there have been a number of interesting books in recent years. The visualization volume [Chen et al., 2008] in the Handbook of Computational Statistics series offers a collection of articles from authors with many different views of data visualization. The book “Graphics of Large Datasets” [Unwin et al., 2006] considers the problems of visualising datasets that are a lot bigger than the datasets graphics texts usually discuss. [Inselberg, 2009] treats parallel coordinate plots in detail, emphasising their geometric properties. [Cook and Swayne, 2007] covers dynamic graphics with special reference to the GGobi software for rotating plots. In [Theus and Urbanek, 2007] the authors give an overview of interactive graphics for data analysis including several in-depth case studies. To see how much progress has been made, it is worth looking at “Graphical Exploratory Data Analysis” [DuToit et al., 1986]. Both the ease of producing graphics and the quality of graphics produced have improved enormously in the last twenty-five years.

As this book is about graphics for data analysis and statistics, the emphasis is on the relevant statistical literature. There is additionally an extensive literature on Information Visualisation, which overlaps in interest, although not always in approach, with the books mentioned already. The books [Spence, 2007] and [Bederson and Shneiderman, 2003] are both worth consulting.

There are important factors affecting graphical displays that have nothing to do with data analysis per se. Colour, perception, and psychology all play critical roles, and effective design should take account of them [Ware, 2008]. There are valuable books on design in general, such as [Norman, 1988]. Design is difficult, and much of the theory in these areas is still too far removed from practice to have a direct influence. Theories can explain why something does not work well and they can provide rational support for principles. That does not mean that they can offer specific advice on how to tackle particular tasks. Graphics are still very much a matter of taste.

2.2 Interactive graphics

It would be misleading to talk about GDA without at least some discussion of interactive graphics. Interaction means being able to engage directly with the statistical components of a graphic display. This makes graphical analyses faster and more flexible. The corresponding disadvantages are the difficulty of recording what has been done and the lack of presentation-quality reproduction. This will change, and the increasing use of the web for presenting analyses will likely have a big influence.

There are interesting problems to be solved, notably how to formalise interactive graphics, how to provide an intuitive interface for the software, and how to provide results of analyses interactively in a structured form for occasional users. Querying values, zooming in, and reformatting graphics are sensible starting points and much can be achieved with RStudio's Shiny. More is possible, in particular the linking of several associated graphics for the same dataset, which the RCloud project of a group at AT&T provides [Urbanek et al., 2014].

Dynamic graphics, where the displays are animated in some way, for instance rotating through low-dimensional projections of high-dimensional data, is an attractive subset of interactive graphics.

2.3 Other graphics software

Lots of people draw their graphics in Excel, and why not? For many purposes it is a very useful tool. It offers a large variety of alternative graphic forms and considerable numbers of options for amending the appearance of the graphics. Since it is easy to record data in a spreadsheet, Excel is a natural tool for small projects, even though it offers only a limited range of statistical methods. As with all software, it is better for some graphics than others (histograms are difficult to draw in Excel).

The ‘big’ statistics packages like SAS and SPSS provide extensive graphics capabilities and, like Excel, aim more for preparing presentation graphics than exploratory graphics. They are capable of handling large and complex datasets, provide extensive and substantial statistical capabilities, and can be used to set up regularly operating analysis systems as well as to carry out one-off studies. There are also commercial packages for presentation graphics, some more closely associated with statistical tools than others.

Although R offers interactive graphics resources through the iplots, rggobi, and ggvis packages, it is primarily a tool for static graphics. There have been a number of commercial interactive graphics packages, notably Data Desk, JMP, Spotfire, and Tableau. They all have their strengths and weaknesses. Data Desk was very innovative for its time and is still impressive. JMP has the most powerful statistical tools of the group and offers a tightly integrated system. Spotfire arose out of work in Shneiderman's group in Maryland. Tableau is the newest of the group and originated from work at Stanford. On the research side of interactive graphics there are tools like Mondrian and GGobi. All these software packages have been for general applications. Recently some tools designed for particular applications have been made available on the web, such as Gapminder for displaying scatterplots animated over time or Wordle for word clouds.

The main advantage of concentrating on R is the integrated access to R's extensive range of statistical models and tools.

2.4 Websites

The web changes so quickly that anything written today is liable to be out of date tomorrow. Nevertheless, several of these websites have been around a while and there are many crosslinks between them. New sites that are any good will doubtless quickly be linked, and once you have found a starting point you should be able to find further interesting sites. Failing that, there is always Google or whatever may take over from it.

A number of websites have sprung up around the theme of data visualization, encouraging contributions from readers. Both the FlowingData [Yau, 2011] and Junk Charts [Fung, 2011] websites discuss visualisations of data and have many interesting examples, often taken from the media. Statistical Graphics and More [Theus, 2013] is a similar site. Many Eyes [IBM, 2007] lets users create visualisations of datasets on their site and upload their own datasets for others to visualise. Unsurprisingly the resulting graphics are of a very mixed standard. Visualizing.org [Viz, 2011] sees itself as a forum, where readers can discuss visualisation, contribute graphics, and find data. It appears to be more for designers than statisticians. The British site Improving data visualisation for the public sector [OCSI, 2009] wants to do just that and includes examples, case studies, and guides.

Tufte includes a discussion page on his website [Tufte, 2013], covering visualisation issues amongst other topics. Gelman's blog Statistical Modelling, Causal Inference, and Social Science [Gelman, 2011] often includes debates about graphics and about which alternatives readers prefer. There are sites that are more Infovis oriented than statistics oriented, such as eagereyes [Kosara, 2011] and information aesthetics [Vande Moere, 2011].

The Gallery of Data Visualisation [Friendly, 2011] has many classic graphics and helpful supporting historical information. The choice of graphics on display is mildly idiosyncratic, although none the worse for that, and there are enough splendid graphics to suit everyone's taste. The R Graph Gallery used to provide a large number of graphics drawn in R and the code used to draw them. Visitors to the site could vote on how good they thought the individual graphics were and it was curious to see which were rated highly and which poorly. The voting probably said more about the voters than about R graphics. The site is no longer maintained. So many websites have sprung up around R that it is impossible to keep track and it would be inadvisable to make firm recommendations given the speed with which things change. All that can be said is that if you are looking for good advice, it is almost certainly available somewhere, just be cautious and check carefully any advice you intend to use.

2.5 Datasets

Many datasets are used in this book, often several times. You need to gain experience in using various kinds of graphics with the same data, and in using graphics for different contexts with different kinds of data. All the datasets are already available in R, in one of its packages, or in the package GDAdata accompanying the book. The sources of the datasets and definitions of the variables in them can be found on the corresponding R help pages. To see where a dataset is referred to or analysed by other users of R, you can try one of the search functions on CRAN (http://cran.r-project.org).

There are some excellent datasets in R and readers can easily find them to experiment with the data themselves. Unfortunately, the datasets are not always fully documented and they are sometimes provided only in a cut-down form, for no apparent reason. It is a pity that some R examples serve more to show how commands work than to illustrate how and why you might want to use those commands. More could be done [Unwin et al., 2013]. On the other hand, sorting out the necessary background information and sources for a dataset can involve a lot of hard work (e.g., see the discussions in §3.3 on Galton's and Pearson's family height datasets). Successful data analysis requires sound knowledge of context, so that you can make sense of your results.

A number of datasets are available in several different packages, occasionally under different names and in various versions or formats. For instance, the Titanic dataset used in the book is the one from the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, and on formatting. Versions with the same names in different packages are not identical.
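To see which version you are working with, it helps to inspect the dataset's structure. The copy in the datasets package, for instance, is a four-way contingency table (crew included) rather than one row per passenger:

```r
# Titanic from the datasets package: a 4 x 2 x 2 x 2 table
# with dimensions Class, Sex, Age, and Survived.
dim(Titanic)

# Flatten to a data frame with one row per cell and a Freq column,
# the form many plotting and modelling functions expect.
titanic_df <- as.data.frame(Titanic)
head(titanic_df)

mosaicplot(Titanic, color = TRUE)   # a quick overview of the whole table
```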

Collecting data requires considerable effort. The individual experiments carried out by Michelson to estimate the speed of light over one hundred years ago are a classic example (cf. Exercise 4 in Chapter 1). There used to be plenty of effort involved in loading datasets into a computer for analysis as well (although far less than needed for gathering the original data). It is amusing to read the instructions for the 1983 DataExpo competition organised by the ASA's Committee on Statistical Graphics: “Because of the Committee's limited (zero) budget for the Exposition, we are forced to provide the data in hardcopy form only (enclosed). (Sorry!) There are 406 observations on the following 8 variables...”

In the early years of the web there were a few sites like Statlib at Carnegie Mellon University or UCI's Machine Learning repository that collected datasets for free use by others. The datasets tended to be small and primarily for teaching purposes (Statlib) or for studying algorithms (UCI). In recent years the situation has totally changed. It is astonishing how much data is now provided on the web. Statistical offices, health departments, election offices, and other official bodies make substantial amounts of data available (for instance, the US Government's site www.data.gov). Sports data are often collected and put up on the web (e.g., the splendid Estonian decathlon webpage www.decathlon2000.com). The British newspaper the Guardian runs a Datastore [Guardian, 2011] where they publish many datasets of public interest ranging from AIDS statistics round the world to more parochial issues like the expenses claimed by British MPs.

Some academic journals require contributors to make their data publicly available. For instance, in 2014 the American Statistical Association's information for authors for its main journal JASA included this rule: “Whenever a dataset is used, its source should be fully documented and the data should be made available as an online supplement. Exceptions for reasons of security or confidentiality may be granted by the Editor.” [ASA, 2014] We can expect further progress in this direction.

There are R packages to assist you in downloading data from websites that offer their data in particular formats. The book “Data Technologies” [Murrell, 2009] gives advice on gathering and organising web data. When working with web datasets (or indeed with any dataset collected by someone else), it is always a good idea to check the source of the data and to ensure you have sufficient background information. Ideally you should know the aims of the original study from which the data came, how the variables are defined, how the data were collected, and what editing or cleaning of the data has been carried out. Investing time in analysing data that turn out to be flawed is a bad use of your time.
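Reading such data into R is often straightforward, as this sketch shows; the URL below is a placeholder, not a real source:

```r
# Hypothetical URL: read.csv accepts a web address directly.
df <- read.csv("https://example.org/survey-data.csv")

# Before analysing, inspect the structure and summary statistics
# to catch missing background information or obvious data problems.
str(df)
summary(df)
```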

2.6 Statistical texts

Although this book is about graphics for data analysis, there is no claim being made that graphics are enough on their own, far from it. Statistical models are essential for checking ideas found through graphics just as graphics are important for checking results from models.

There are many excellent statistics textbooks that can be recommended for statistical modelling (although perhaps not always for statistical graphics). The following is a personal selection that should cover the models mentioned in this book and more besides.

Sound introductory texts include [Freedman et al., 2007], [De Veaux et al., 2011], and [Maindonald and Braun, 2010]. Other books like [Gelman and Hill, 2006], [Davison, 2008], and [Fahrmeir et al., 2013] assume more knowledge on the part of the reader. All are excellent books. The classic “Modern Applied Statistics with S” [Venables and Ripley, 2002] is still worth reading for its smooth integration of theory, software, and application. Overlaps of statistics with machine learning are well handled by the splendid [Hastie et al., 2001]. Recommended texts covering specific areas of statistics are [van Belle et al., 2004] (Biostatistics), [Tutz, 2012] (Categorical Data Analysis), [Kleiber and Zeileis, 2008] (Econometrics), [Lumley, 2010] (Survey Analysis), [Ord and Fildes, 2013] (Time Series), and, for those wanting a Bayesian approach, [Gelman et al., 2013].
