Chapter 1

Introduction

1.1 What This Book Is About

A data graphic is not only a static image: it also tells a story about the data. It activates cognitive processes that are able to detect patterns and discover information not readily available from the raw data. This is particularly true for time series, spatial, and space-time datasets.

There are several excellent books about data graphics and visual perception theory, with guidelines and advice for displaying information, including visual examples. Among them are The Elements of Graphing Data (Cleveland 1994) and Visualizing Data (Cleveland 1993) by W. S. Cleveland, Envisioning Information (Tufte 1990) and The Visual Display of Quantitative Information (Tufte 2001) by E. Tufte, The Functional Art by A. Cairo (Cairo 2012), and Visual Thinking for Design by C. Ware (Ware 2008). Ordinarily, they do not include the code or software tools to produce those graphics.

On the other hand, there is a collection of books that provides code and detailed information about the graphical tools available with R. Commonly they do not use real data in the examples and do not provide advice for improving graphics according to visualization theory. Three books are the unquestioned representatives of this group: R Graphics by P. Murrell (Murrell 2011), Lattice: Multivariate Data Visualization with R by D. Sarkar (Sarkar 2008), and ggplot2: Elegant Graphics for Data Analysis by H. Wickham (Wickham 2009).

This book proposes methods to display time series, spatial, and space-time data using R. It aims to be a synthesis of both groups, providing code and detailed information to produce high-quality graphics with practical examples.

1.2 What You Will Not Find in This Book

  • This is not a book to learn R.

    Readers should have a fair knowledge of programming with R to understand the book. In addition, previous experience with the zoo, sp, raster, lattice, ggplot2, and grid packages is helpful.

    If you need to improve your R skills, consider these information sources:

    • Introduction to R [1]
    • Official manuals [2]
    • Contributed documents [3]
    • Mailing lists [4]
    • R-bloggers [5]
    • Books related to R [6], particularly Software for Data Analysis by John M. Chambers (Chambers 2008).
  • This book does not provide an exhaustive collection of visualization methods.

    Instead, it illustrates what I found to be the most useful and effective methods. Nevertheless, each part includes a section titled “Further Reading” with bibliographic proposals for additional information.

  • This book does not include a complete review or discussion of R packages.

    Their most useful functions, classes, and methods regarding data and graphics are outlined in the introductory chapter of each part, and illustrated with the help of examples. However, if you need detailed information about a certain aspect of a package, you should read the corresponding package manual or vignette. Moreover, if you want to explore additional alternatives, you can browse the CRAN Task Views about Time Series [7], Spatial Data [8], Spatiotemporal Data [9], and Graphics [10].

  • Finally, this book is not a handbook of data analysis, geostatistics, point pattern analysis, or time series theory.

    Instead, this book is focused on the exploration of data with visual methods, so it may be framed in the Exploratory Data Analysis approach. Therefore, this book may be a useful complement to the superb bibliographic references where you will find plenty of information about those subjects, for example (Chatfield 2003), (Cressie and Wikle 2011), (Slocum 2005), and (R. S. Bivand, E. J. Pebesma, and Gomez-Rubio 2008).

1.3 How to Read This Book

This book is organized into three parts, each devoted to different types of data. Each part comprises several chapters according to the various visualization methods or data characteristics. The chapters are structured as independent units so readers can jump directly to a certain chapter according to their needs. Of course, there are several dependencies and redundancies between the sets of chapters that have been conveniently signaled with cross-references.

The content of each chapter illustrates how to display a dataset starting with an easy and direct approach. Often this first result is not entirely satisfactory, so additional improvements are progressively added. Each step involves additional complexity which, in some cases, can be overwhelming during a first reading. Thus, some sections, marked with a dedicated sign, can be safely skipped for later reading.

Although I have done my best to help readers understand the methods and code, you should not expect to understand them after one reading. The key is practical experience, and the best way is to try out the code with the provided data and modify it to suit your needs with your own data. There is a website and a code repository to help you in this task.

1.3.1 Website and Code Repository

The book website with the main graphics of this book is located at

http://oscarperpinan.github.com/spacetime-vis/

The full code is freely available from the repository:

https://github.com/oscarperpinan/spacetime-vis

The datasets used in the examples are either available at the repository or can be freely obtained from other websites. Because both code and data are freely available, this book is fully reproducible.

I have chosen the datasets according to two main criteria:

  1. They are freely available without restrictions for public use.
  2. They cover different scientific and professional fields (meteorology and climate research, economy and social sciences, energy and engineering, environmental research, epidemiology, etc.).

The repository and the website can be downloaded as compressed files [11], and if you use git, you can clone the repository with

git clone https://github.com/oscarperpinan/spacetime-vis.git

1.4 R Graphics

There are two distinct graphics systems built into R, referred to as traditional and grid graphics. Grid graphics are produced with the grid package (Murrell 2011), a flexible low-level graphics toolbox. Compared with the traditional graphics model, it provides more flexibility to modify or add content to an existing graphical output, better support for combining different outputs easily, and more possibilities for interaction. All the graphics in this book have been produced with the grid graphics model.

Other packages are constructed over it to provide high-level functions, most notably the lattice and ggplot2 packages.
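As a rough illustration of what "low-level" means here, the following sketch draws directly with grid and then modifies the existing output by name. The viewport size, the grob names ("frame", "label"), and the text are illustrative choices, not taken from the book's repository.

```r
library(grid)

# Start a fresh page and open a viewport covering the central 80% of it
grid.newpage()
pushViewport(viewport(width = 0.8, height = 0.8, name = "main"))

# Draw a frame and a label inside the viewport, naming each element
grid.rect(gp = gpar(col = "grey50"), name = "frame")
grid.text("A grid viewport", y = 0.9, name = "label")

# Existing output can be modified afterwards by referring to its name
grid.edit("label", gp = gpar(col = "red", fontface = "bold"))

# Return to the top-level viewport
popViewport()
```

Naming the elements is what makes later modification possible; lattice and ggplot2 build their output from the same kind of named grid objects.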

1.4.1 lattice

The lattice package (Sarkar 2008) is an independent implementation of Trellis graphics, which were mostly influenced by The Elements of Graphing Data (Cleveland 1994). Trellis graphics often consist of a rectangular array of panels. The lattice package uses a formula interface to define the structure of the array of panels with the specification of the variables involved in the plot. The result of a lattice high-level function is a trellis object.

For bivariate graphics, the formula is generally of the form y ~ x representing a single panel plot with y versus x. This formula can also involve expressions. The main function for bivariate graphics is xyplot.
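A minimal sketch of a formula involving expressions, assuming the mtcars dataset shipped with R (the choice of variables and transformations is only illustrative):

```r
library(lattice)

# Single-panel scatterplot; both sides of the formula are expressions
p <- xyplot(log(mpg) ~ sqrt(hp), data = mtcars)

# A high-level lattice call returns a trellis object;
# printing it is what actually draws the plot
class(p)
print(p)
```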

Optionally, the formula may be y ~ x | g1 * g2 and y is represented against x conditional on the variables g1 and g2. Each unique combination of the levels of these conditioning variables determines a subset of the variables x and y. Each subset provides the data for a single panel in the Trellis display, an array of panels laid out in columns, rows, and pages.

For example, in the following code, the variable wt of the dataset mtcars is represented against mpg, with a panel for each level of the categorical variable am. The points are grouped by the values of the cyl variable.

library(lattice)
xyplot(wt ~ mpg | am, data = mtcars, groups = cyl)
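The two-variable conditioning form y ~ x | g1 * g2 can be sketched in the same way. Here gear is used as a second conditioning variable purely for illustration, and both variables are converted to factors so that the panel strips show their levels. Some combinations contain no cars, so their panels are empty.

```r
library(lattice)

# One panel per combination of am (2 levels) and gear (3 levels);
# auto.key adds a legend for the groups defined by cyl
p <- xyplot(wt ~ mpg | factor(am) * factor(gear),
            data = mtcars, groups = cyl,
            auto.key = list(columns = 3))
print(p)
```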

For trivariate graphics, the formula is of the form z ~ x * y, where z is a numeric response, and x and y are numeric values evaluated on a rectangular grid. Once again, the formula may include conditioning variables, for example z ~ x * y | g1 * g2. The main function for these graphics is levelplot.
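A small sketch of the trivariate form follows. The grid and the response surface are made up for illustration; only levelplot and the formula interface come from the text above.

```r
library(lattice)

# Regular grid of x and y values; expand.grid returns every combination
grd <- expand.grid(x = seq(-2, 2, length.out = 50),
                   y = seq(-2, 2, length.out = 50))

# Hypothetical response surface, used only for illustration
grd$z <- with(grd, exp(-(x^2 + y^2)))

# z is displayed as color levels over the x-y grid
p <- levelplot(z ~ x * y, data = grd)
print(p)
```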

The plotting of each panel is performed by the panel function, specified in a high-level function call as the panel argument. Each high-level lattice function has a default panel function, although the user can create new Trellis displays with custom panel functions.
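As a hedged sketch of a custom panel function, the code below combines the default scatterplot panel with a per-panel linear fit. The name panel.fit is hypothetical; panel.xyplot and panel.lmline are functions provided by lattice.

```r
library(lattice)

# Custom panel function: the default scatterplot plus a linear fit
panel.fit <- function(x, y, ...) {
    panel.xyplot(x, y, ...)
    panel.lmline(x, y, col = "black")
}

# The panel function is called once per panel with that panel's subset
p <- xyplot(wt ~ mpg | factor(am), data = mtcars, panel = panel.fit)
print(p)
```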

lattice is a member of the recommended packages list so it is commonly distributed with R itself. There are more than 250 packages depending on it, and the most important packages for our purposes (zoo, sp, and raster) define methods to display their classes using lattice.

On the other hand, the latticeExtra package (Sarkar and Andrews 2012) provides additional flexibility for the somewhat rigid structure of the Trellis framework implemented in lattice. This package complements lattice with the implementation of layers via the layer function, and superposition of trellis objects and layers with the +.trellis function. Using both packages, you can define a graphic with the formula interface (under the lattice model) and overlay additional content as layers (following the ggplot2 model).
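The combination of both packages might look like the following sketch, which assumes latticeExtra is installed. The reference line at the mean weight is an arbitrary choice made for illustration.

```r
library(lattice)
library(latticeExtra)

# Base graphic defined with the formula interface
p <- xyplot(wt ~ mpg | factor(am), data = mtcars)

# Overlay a layer: a dashed horizontal line at the overall mean weight,
# drawn in every panel
p2 <- p + layer(panel.abline(h = mean(mtcars$wt), lty = 2))
print(p2)
```

The result of the + operation is again a trellis object, so further layers can be added in the same way.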

1.4.2 ggplot2

The ggplot2 package (Wickham 2009) is an implementation of the system proposed in The Grammar of Graphics (Wilkinson 1999), a general scheme for data visualization that breaks up graphs into semantic components such as scales and layers. Under this framework, the definition of the graphic with ggplot2 is done with a combination of several functions that provide the components, instead of the formula interface of lattice.

With ggplot2, a graphic is composed of

  • A dataset, data, and a set of mappings from variables to aesthetics, aes.
  • One or more layers, each composed of: a geometric object, geom_*, to control the type of plot you create (points, lines, etc.); a statistical transformation, stat_*; and a position adjustment (and optionally, additional dataset and aesthetic mappings).
  • A scale, scale_*, to control the mapping from data to aesthetic attributes. Scales are common across layers to ensure a consistent mapping from data to aesthetics.
  • A coordinate system, coord_*.
  • Optionally, a faceting specification, facet_*, the equivalent of Trellis graphics with panels.

The function ggplot is typically used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object. For instance, the following code (equivalent to the previous lattice example) uses mtcars as the dataset, and maps the mpg variable on the x-axis and the wt variable on the y-axis. The geometric object is the point using the cyl variable to control the color. Finally, the levels of the am variable define the panels of the graphic.

library(ggplot2)

ggplot(mtcars, aes(mpg, wt)) +
   geom_point(aes(colour = factor(cyl))) +
   facet_grid(. ~ am)
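To relate the code to the list of components above, the same graphic can be written with the scale and the coordinate system stated explicitly instead of left to their defaults. This is only a sketch: the palette, the legend title, and the axis limits are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, wt)) +              # dataset and aesthetic mappings
    geom_point(aes(colour = factor(cyl))) +      # layer with a geometric object
    scale_colour_brewer(palette = "Dark2",
                        name = "cyl") +          # scale for the colour aesthetic
    coord_cartesian(xlim = c(10, 35)) +          # coordinate system
    facet_grid(. ~ am)                           # faceting specification
print(p)
```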

This package is increasingly popular, with a list of more than ninety packages depending on it. On the other hand, few packages provide method definitions based on ggplot2 to display their classes. In our context, only the zoo package defines the autoplot function based on it.
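A brief sketch of the zoo method mentioned above, assuming both zoo and ggplot2 are installed. The series is simulated and serves only as an illustration.

```r
library(zoo)
library(ggplot2)

# Hypothetical daily series with two variables, for illustration only
set.seed(1)
days <- seq(as.Date("2013-01-01"), by = "day", length.out = 100)
z <- zoo(cbind(a = cumsum(rnorm(100)), b = cumsum(rnorm(100))), days)

# autoplot dispatches on the zoo class and returns a ggplot object,
# which can be extended with + like any other ggplot2 graphic
p <- autoplot(z)
print(p)
```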

1.4.3 Comparison between lattice and ggplot2

Which package to choose is, for a wide range of datasets, a question of personal preferences. You may be interested in a comparison between them published in a series of blog posts [12]. However, the major drawback of ggplot2 is its considerably slower speed when dealing with large datasets [13], so you should be cautious with large spatial and spatiotemporal data.

Consequently, most of the code in Part I contains alternatives defined both with lattice and with ggplot2. However, because of the speed problem and the absence of ggplot2 functions in the corresponding packages, only a minor fraction of the code in Parts II and III contains graphics defined with ggplot2.

1.5 Packages

Throughout the book, several R packages are used. All of them are available from CRAN, and you must install them before using the code. Most of them are loaded at the start of the code of each chapter, although some of them are loaded later if they are used only inside optional sections (marked with a dedicated sign). You should install the latest version available at CRAN to ensure correct functioning of the code.

Although the introductory chapter of each part includes a section with an outline of the most relevant packages, some of them deserve to be highlighted here:

  • zoo (Zeileis and Grothendieck 2005) provides infrastructure for time series using arbitrary classes for the time stamps (Section 2.1.1).
  • sp (E. Pebesma 2012) provides a coherent set of classes and methods for the major spatial data types: points, lines, polygons, and grids (Section 7.1.1). spacetime (E. Pebesma 2012) defines classes and methods for spatiotemporal data, and methods for plotting data as map sequences or multiple time series (Section 11.1.1).
  • raster (R. J. Hijmans 2013) is a major extension of gridded spatial data classes. It provides a unified access method to different raster formats, permitting large objects to be analyzed with the definition of basic and high-level processing functions (Sections 7.1.2 and 11.1.2). rasterVis (Oscar Perpiñán and R. Hijmans 2013) provides enhanced visualization of raster data with methods for spatiotemporal rasters (Sections 7.1.3 and 11.1.3).
  • gridSVG (Murrell and Potter 2013) converts any grid scene to an SVG document. The grid.hyperlink function allows a hyperlink to be associated with any component of the scene, the grid.animate function can be used to animate any component of a scene, and the grid.garnish function can be used to add SVG attributes to the components of a scene. By setting event handler attributes on a component, plus possibly using the grid.script function to add JavaScript to the scene, it is possible to make the component respond to user input such as mouse clicks.
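As a small illustration of the first item in this list, the sketch below indexes the same values with two different time classes. The values and dates are made up; zoo, index, and as.yearmon are functions from the zoo package.

```r
library(zoo)

# The same values indexed by two different time classes
x <- c(10, 12, 9, 14)
zDate <- zoo(x, as.Date("2013-01-01") + 0:3)         # daily, Date index
zMonth <- zoo(x, as.yearmon("2013-01") + 0:3 / 12)   # monthly, yearmon index

# The class of the index is preserved by the zoo object
class(index(zDate))
class(index(zMonth))
```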

1.6 Software Used to Write This Book

This book has been written using different computers running Debian GNU/Linux and using several gems of open-source software:

  • org-mode for authoring text and code (Schulte et al. 2012).
  • R (R Development Core Team 2013) with Emacs Speaks Statistics (Rossini et al. 2004).
  • LaTeX with AUCTeX to produce the final document.
  • GNU Emacs as development environment.

1.7 About the Author

During the past 15 years, my main area of expertise has been photovoltaic solar energy systems, with a special interest in solar radiation.

Initially I worked as an engineer for a private company and I was involved in several commercial and research projects. The project teams were partly made up of people with low technical skills who relied on the input from engineers to complete their work. I learned how a good visualization output eased the communication process.

Now I work as a professor and researcher at the university. Data visualization is one of the most important tools I have available. It helps me embrace and share the steps, methods, and results of my research. With students, it is an inestimable partner in helping them understand complex concepts.

I have been using R to simulate the performance of photovoltaic energy systems and to analyze solar radiation data, both as time series and spatial data. As a result, I have developed packages that include several graphical methods to deal with multivariate time series (namely, solaR (Oscar Perpiñán 2012)) and space-time data (rasterVis).

1.8 Acknowledgments

Writing a book is often described as a solitary activity. It is certainly difficult to write when you are with friends or spending time with your family... although with three little children at home I have learned to write prose and code while my baby wants to learn typing and my daughters need help to share a family of dinosaurs.

Seriously speaking, solitude is the best partner of a writer. But when I am writing or coding I feel I am immersed in a huge collaborative network of past and present contributors. Piotr Kropotkin described it with the following words (Kropotkin 1906):

Thousands of writers, of poets, of scholars, have laboured to increase knowledge, to dissipate error, and to create that atmosphere of scientific thought, without which the marvels of our century could never have appeared. And these thousands of philosophers, of poets, of scholars, of inventors, have themselves been supported by the labour of past centuries. They have been upheld and nourished through life, both physically and mentally, by legions of workers and craftsmen of all sorts.

And Lewis Mumford claimed (Mumford 1934):

Socialize Creation! What we need is the realization that the creative life, in all its manifestations, is necessarily a social product.

I want to express my deepest gratitude and respect to all those women and men who have contributed and contribute to strengthening the communities of free software, open data, and open science. My special thanks go to the people of the R community: users, members of the R Core Development Team, and package developers.

With regard to this book in particular, I would like to thank John Kimmel for his constant support, guidance, and patience.

Last, and most importantly, thanks to Candela, Marina, and Javi, my crazy little shorties, my permanent source of happiness, imagination, and love. Thanks to María, mi amor, mi cómplice y todo.

1 http://cran.r-project.org/doc/manuals/R-intro.html

2 http://cran.r-project.org/manuals.html

3 http://cran.r-project.org/other-docs.html

4 http://www.r-project.org/mail.html

5 http://www.r-bloggers.com

6 http://www.r-project.org/doc/bib/R-books.html

7 http://cran.r-project.org/web/views/TimeSeries.html

8 http://cran.r-project.org/web/views/Spatial.html

9 http://cran.r-project.org/web/views/SpatioTemporal.html

10 http://cran.r-project.org/web/views/Graphics.html

11 Repository: https://github.com/oscarperpinan/spacetime-vis/archive/master.zip, Website: https://github.com/oscarperpinan/spacetime-vis/archive/gh-pages.zip

12 http://learnr.wordpress.com/2009/06/28/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-1/

13 Take a look at the time comparison published as the final result of the previous series of blog posts, http://learnr.files.wordpress.com/2009/08/latbook.pdf
