CHAPTER 9


Working with Graphics

We have been working with graphics since the beginning of this book, because it is difficult, and perhaps impossible, to separate statistics from graphics. In this chapter, we will fill in some gaps and introduce the ggplot2 package as an effective alternative to the graphics package distributed with R.1

The ggplot2 package takes some adjustment in one’s thinking, as it works very differently from the graphics commands in base R. The payoff is that it can produce beautiful, even stunning, graphics. Although we will occasionally return to base R’s graphics package, we will accomplish the majority of what we do from here on with ggplot2.

To paraphrase John Tukey, descriptive numerical indexes help us see the expected in a set of numbers, but graphics help us see the unexpected. In this chapter, we will tie graphics to the visual revelation of data in traditional ways. Later we will broaden the discussion to data visualization and include maps and other ways to allow users to derive meaning and information beyond the traditional graphs and plots.

9.1 Creating Effective Graphics

Graphics can be informative or misleading. Over many years, Yale professor emeritus Edward Tufte has championed the effective visual display of information. Tufte is a statistician and an artist, so his books are worth studying for those who would like to learn and observe the principles of excellent graphical design.2

Tufte provides the following principles for good graphics. Graphical displays should

  • show the data
  • induce the viewer to think about substance rather than methodology, graphic design, the technology of graphic production, or something else
  • avoid distorting what the data have to say
  • present many numbers in a small space
  • make large datasets coherent
  • encourage the eye to compare different pieces of data
  • reveal the data at several levels of detail, from a broad overview to the fine structure
  • serve a reasonably clear purpose: description, exploration, tabulation, or decoration
  • be closely integrated with the statistical and verbal descriptions of a data set.1

In addition to Tufte’s excellent principles, I will add that it is inappropriate to add three-dimensional effects to two-dimensional graphs, as that introduces both confusion and a false third dimension. Certain packages such as Microsoft Excel will readily allow you to make 3-D bar plots and pie charts, but only data with three actual dimensions should be plotted as such.

9.2 Graphing Nominal and Ordinal Data

Data that are categorical in nature should be represented in bar plots rather than histograms. The visual separation of the bars corresponds to the discrete nature of the categories. Pie charts can also be used to display the relative frequencies or proportions of nominal and ordinal data, but the pie chart is often maligned. Research at Bell Labs indicated that humans are more easily able to judge relative length than relative area, and thus bar plots (also called bar charts) are typically more informative than pie charts. Base R graphics provides a dedicated pie chart function; ggplot2 has no pie geom, although a pie chart can be built there by drawing a stacked bar in polar coordinates. Bar plots are available in both. In addition to plotting the frequencies for categorical variables, bar plots are also useful for summarizing quantitative data for two or more factors in what are known as clustered bar plots.

The default in base R for the pie chart is a pastel color palette. Producing a pie chart is simple in base R. We will use socioeconomic status (SES) in our hsb data for that purpose; these data code SES as 1 (low), 2 (medium), and 3 (high). First, we will summarize the data using the table function, and then we will produce the pie chart from the table. The completed pie chart appears in Figure 9-1.

> ses <- table(hsb$ses)
> pie(ses, main = "Pie Chart")


Figure 9-1. Pie chart of the three levels of student SES
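Although ggplot2 has no dedicated pie geom, a pie chart can be approximated there by wrapping a single stacked bar around a circle. The following is only a sketch under stated assumptions: the ses_codes vector is hypothetical and merely stands in for hsb$ses, and the object names are illustrative.

```r
library(ggplot2)

# Hypothetical SES codes standing in for hsb$ses (1 = low, 2 = medium, 3 = high)
ses_codes <- c(1, 2, 2, 3, 2, 1, 3, 2, 2, 3, 1, 2)
ses_df <- data.frame(
  ses = factor(ses_codes, levels = 1:3, labels = c("low", "medium", "high"))
)

# One stacked bar (constant x) whose segments are the category counts
stacked <- ggplot(ses_df, aes(x = "", fill = ses)) +
  geom_bar(width = 1)

# Map the counts to the angle so that each segment becomes a slice
pie_plot <- stacked +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, title = "Pie chart built from a stacked bar")

print(pie_plot)
```

Removing the coord_polar() layer turns the same object back into a stacked bar, which shows that the two displays differ only in their coordinate system.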

As an exercise, without looking further at the table or the actual data, try to estimate the relative percentages of the three levels. We will now produce a bar plot using ggplot2. As mentioned, ggplot2 requires a different approach from that of base R. In particular, ggplot2 makes use of geometric objects (abbreviated as geom in code) and aesthetics (abbreviated as aes in code). To make the bar plot in ggplot2, we must identify our data source, the variable to summarize, and the type of geometric object that should be used to represent the data. In this case we want bars, so we use the geometric object bar or, in code, geom_bar(). Note that you can build up the plot sequentially and then produce it by typing the name you have given the plot object. The completed bar plot appears in Figure 9-2. Most people would find it easier to estimate the relative percentages from the length of the bars rather than from relative areas in a pie chart. Notice that we wrap the variable in factor(ses) to treat the numeric levels as factors. This would not be required if the three levels were listed as text (e.g., low, medium, and high).

> library(ggplot2)
> bar <- ggplot(hsb, aes(x = factor(ses))) + geom_bar()
> bar


Figure 9-2. Bar plot of the three levels of student SES
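To check a visual estimate against exact figures, and to replace the numeric codes on the axis with text labels, something along the following lines may help. It is a minimal sketch: the ses_codes vector is hypothetical data standing in for hsb$ses, so the percentages it prints are not those of the hsb data.

```r
library(ggplot2)

# Hypothetical SES codes standing in for hsb$ses
ses_codes <- c(1, 2, 2, 3, 2, 1, 3, 2, 2, 3, 1, 2)

# Attach text labels to the numeric codes
ses_factor <- factor(ses_codes, levels = 1:3,
                     labels = c("low", "medium", "high"))

# Exact percentages to compare with a visual estimate
ses_pct <- round(100 * prop.table(table(ses_factor)), 1)
print(ses_pct)

# With a labelled factor, no factor() call is needed inside aes()
labelled_bar <- ggplot(data.frame(ses = ses_factor), aes(x = ses)) +
  geom_bar() +
  labs(x = "SES", y = "Count")
print(labelled_bar)
```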

9.3 Graphing Scale Data

Scale data (interval and ratio) lend themselves to many more kinds of graphics than do nominal and ordinal data. We can generate boxplots, histograms, dotplots, smooth density plots, frequency polygons, scatterplots, and other graphical representations of one or more quantitative variables. We discussed the boxplot in Chapter 8, and we used the base version of R to produce a serviceable boxplot. The boxplot geom in ggplot2 is designed to use a factor with two or more levels to produce side-by-side boxplots, and it is quite good for that purpose. With a little adjustment of the parameters, you can also make an attractive boxplot for a single variable in ggplot2. For that purpose, you would have to use a theme in which certain elements are blank. We will look at the side-by-side boxplots first, and then at how to make a boxplot for a single variable should we need to do that.

9.3.1 Boxplots Revisited

Let’s produce side-by-side boxplots of the math scores of the students from the three SES levels. The position of the variable in the aesthetic identifies it as x or y on the plot, so you can omit those labels in most cases. The defaults in ggplot2 are adequate for this purpose, and Figure 9-3 shows the finished boxplots.

> boxplots <- ggplot(hsb, aes(factor(ses), math)) + geom_boxplot()
> boxplots


Figure 9-3. Boxplots of the math scores for low, medium, and high SES

Although it is a little extra work, we can make a boxplot for a single variable in ggplot2. Let us use a different dataset for some variety. We will use the cars data from the openintro package.

> install.packages("openintro")
> library(openintro)
> head(cars)
     type price mpgCity driveTrain passengers weight
1   small  15.9      25      front          5   2705
2 midsize  33.9      18      front          5   3560
3 midsize  37.7      19      front          6   3405
4 midsize  30.0      22       rear          4   3640
5 midsize  15.7      22      front          6   2880
6   large  20.8      19      front          6   3470

Recall that the boxplot geom in ggplot2 is designed to use a factor for the x axis. When we have a single variable, we must provide a dummy x factor. We can remove the unwanted x axis labeling by using a theme with element_blank() for the axis title, text, and tick marks. Let’s do this for the mpgCity variable, again building up the graphic in steps.

> mpgBox <- ggplot(cars, aes(factor(0), mpgCity)) + geom_boxplot()
> mpgBox <- mpgBox + theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
> mpgBox

Figure 9-4 shows the completed boxplot. The appearance is slightly different from that of the boxplot we produced using the graphics package in base R.


Figure 9-4. Boxplot of city miles per gallon for 54 cars
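The same single-variable idea can be tried without installing anything by using a dataset that ships with R. The sketch below assumes the built-in mtcars data, with mpg standing in for mpgCity, and uses an empty string as the dummy x value, which is one possible alternative to factor(0).

```r
library(ggplot2)

# mtcars ships with R; mpg stands in here for the mpgCity variable
single_box <- ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_boxplot(width = 0.4) +
  labs(x = NULL, y = "Miles per gallon") +
  theme(axis.ticks.x = element_blank())
print(single_box)

# Quartiles to read alongside the box
mpg_summary <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
print(mpg_summary)
```

Because the dummy label is empty, only the tick marks need to be blanked in the theme, and labs(x = NULL) removes the axis title.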

9.3.2 Histograms and Dotplots

In ggplot2, the default is to create histograms with 30 bins, so that the bin width equals the range of the data divided by 30. For larger datasets, this default may be appropriate, but for smaller datasets, it is unlikely to be the best choice. Let’s continue with the city MPG data from the cars dataset and create a histogram using the defaults, and then adjust the bin width to something more reasonable. The default number of bins is clearly too many for a dataset with only 54 observations. We can adjust the bin width and change the bars to white bars with black borders by changing our options in ggplot2. In the base version of R, one can use the par() function to create multiple plots in the same graphic viewer window. The ggplot2 package uses a different strategy: facets, which arrange rows or matrices of plots for subsets of the same variables. When the plots are potentially unrelated, we can use the grid.arrange() function from the gridExtra package to easily combine multiple ggplot2 graphs into one. We will create one histogram with the default bin width, and another with a bin width of 5, and then display them side by side (see Figure 9-5 for the finished plots).

> install.packages("gridExtra")
> library(gridExtra)
> myHist1 <- ggplot(cars, aes(mpgCity)) + geom_histogram(fill = "white", color = "black")
> myHist2 <- ggplot(cars, aes(mpgCity)) + geom_histogram(binwidth = 5, fill = "white", color = "black")
> grid.arrange(myHist1, myHist2, ncol = 2)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


Figure 9-5. Comparison of histograms

Examination of the side-by-side plots reveals that the default produced too many bars, whereas setting the bin width to 5 made the histogram more effective. The data are clearly right skewed, as both histograms make obvious.
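Faceting, mentioned above as the ggplot2 counterpart to par(), splits one plot into panels by the levels of a grouping variable. As a rough sketch, assuming the built-in mtcars data rather than the cars data used in this section, histograms of mpg for each number of cylinders could be drawn like this:

```r
library(ggplot2)

# mtcars ships with R; cyl (4, 6, or 8 cylinders) is the grouping variable
faceted_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  facet_wrap(~ cyl, ncol = 3)   # one panel per level of cyl, in a single row
print(faceted_hist)
```

All panels share the same axes by default, which makes them easier to compare than independent plots combined with grid.arrange().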

When datasets are relatively small, a dotplot can be an effective alternative to the histogram. Like the stem-and-leaf display we discussed in Chapter 8, a dotplot preserves the “granularity” of the data, so that a single dot can represent a single data point. Let’s use the same city mileage data to create a dotplot using geom_dotplot() in ggplot2. In this case, we will not be concerned with reducing the bin width, as we would want each bin to represent a single data value. We use the table function discussed in Chapter 7 to create a frequency distribution of the data, in order to compare that distribution with the dotplot.

> table(cars$mpgCity)

16 17 18 19 20 21 22 23 25 28 29 31 32 33 39 42 46
 3  3  6  8  5  4  4  3  3  2  6  2  1  1  1  1  1
> dot <- ggplot(cars, aes(mpgCity)) + geom_dotplot()
> dot
stat_bindot: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Figure 9-6 shows the dotplot. As you can see, the dotplot resembles a simple frequency histogram, with each dot representing a single data point. The numbers on the y axis for dotplots are not meaningful in ggplot2.


Figure 9-6. Dotplot of city miles per gallon
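Because the y axis of a dotplot carries no useful numbers, it can be hidden, and an explicit bin width of 1 suits integer-valued data such as miles per gallon. The following sketch uses a small hypothetical set of mileage values (not the cars data) so that it can run on its own.

```r
library(ggplot2)

# Hypothetical mileage values; repeated values should stack as dots
mileage <- data.frame(
  mpg = c(16, 17, 17, 18, 18, 18, 19, 19, 21, 25, 25, 31)
)

dot_plot <- ggplot(mileage, aes(x = mpg)) +
  geom_dotplot(binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL)   # drop the uninformative y axis
print(dot_plot)

# Frequency table to compare with the stacks of dots
mpg_counts <- table(mileage$mpg)
print(mpg_counts)
```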

9.3.3 Frequency Polygons and Smoothed Density Plots

A frequency polygon is a type of line graph in which straight-line segments are used to connect the frequencies of data values. Best practice is to anchor such plots to the x axis of the plot rather than having them “float” above the axis, as some statistical software packages do. In ggplot2, geom_freqpoly() is used to produce frequency polygons. We will revert to the bin width of 5 to make the frequency polygon more useful to the reader. Continuing with the city miles per gallon data, we have the following commands to produce the frequency polygon shown in Figure 9-7.

> polygon <- ggplot(cars, aes(mpgCity)) + geom_freqpoly(binwidth = 5)
> polygon


Figure 9-7. Frequency polygon of city miles per gallon
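One way to see how a frequency polygon relates to a histogram is to draw both layers on the same plot with the same bin width. This is only an illustrative sketch that assumes the built-in mtcars data, with mpg standing in for mpgCity.

```r
library(ggplot2)

# Histogram and frequency polygon share the same bins (width 5),
# so the polygon's vertices sit at the tops of the bars
overlay <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  geom_freqpoly(binwidth = 5)
print(overlay)
```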

Sometimes, smoothing a plot gives us a better idea of the real shape of the distribution. We can create a smoothed density plot of our data as follows. Let’s fill the density plot with a gray color just to make it a bit more visually appealing (see Figure 9-8).

> density <- ggplot(cars, aes(mpgCity)) + geom_density(fill = "gray")
> density


Figure 9-8. Smoothed density plot of city miles per gallon
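The amount of smoothing is controlled by the kernel bandwidth, which geom_density() lets us scale through its adjust argument. The sketch below assumes the built-in mtcars data and uses the base R density() function only to show the bandwidths involved.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg))

# adjust multiplies the default bandwidth:
# values below 1 give a wigglier curve, values above 1 a smoother one
rough_density  <- base_plot + geom_density(fill = "gray", adjust = 0.5)
smooth_density <- base_plot + geom_density(fill = "gray", adjust = 2)
print(rough_density)
print(smooth_density)

# The same scaling in base R's density(), shown numerically
bw_default <- density(mtcars$mpg)$bw
bw_doubled <- density(mtcars$mpg, adjust = 2)$bw
print(c(default = bw_default, doubled = bw_doubled))
```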

9.3.4 Graphing Bivariate Data

Up to this point, we have been working mostly with a single variable at a time, or with different levels of the same variable used as factors. Often, we have the opportunity to explore the relationships among and between two or more variables, and graphical visualizations of such data are quite helpful. We will limit ourselves here to bivariate data, but you should be aware that R can produce both 2-D and 3-D plots as needed.

When the x variable is time, or some index based on time, we can plot the values of y over time in a line graph. When both x and y are continuous variables, we can use points to represent their relationship as a series of (x, y) pairs. If the relationship is perfectly linear, the points will fit along a straight line. When the relationship is not perfect, our points will produce scatter around a best-fitting line. Let us first examine scatterplots and see that we have the ability in ggplot2 to add the regression line and a confidence region to the scatterplot. We will then examine a hexbin plot in which bivariate data are grouped into hexagonal bins, with shading used to show the overlap in the data, something about which scatterplots do not convey much information.

Let’s use ggplot2 to make a scatterplot of the gas mileage of the car and the weight of the car. This should plot a negative relationship, with heavier cars getting lower mileage. To produce the scatterplot, we use geom_point(). As before, we can build up the chart by adding specifications and then plot the final version. The method = lm setting will add the linear regression line, and by default will also add a shaded 95% confidence region. Here is the R code to produce the scatterplot shown in Figure 9-9.

> scatter <- ggplot(cars, aes(weight, mpgCity)) + geom_point()
> scatter <- scatter + geom_smooth(method = lm)
> scatter


Figure 9-9. Scatterplot of city miles per gallon and car weight
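The line drawn by geom_smooth() with method = lm is an ordinary least-squares fit, so its direction can be confirmed numerically with lm(). The following sketch assumes the built-in mtcars data, where wt (weight in thousands of pounds) and mpg stand in for weight and mpgCity.

```r
library(ggplot2)

# Fit the same simple regression that geom_smooth() will draw
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]
print(slope)   # a negative slope means heavier cars get lower mileage

fit_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
print(fit_plot)
```

Setting se = FALSE in geom_smooth() would suppress the shaded confidence region, and level changes its coverage.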

If there is more than one observation with the same (x, y) coordinates, the points will overlap on the scatterplot. Such overplotting can make seeing the data difficult. If there is only a small amount of overplotting, adding some transparency to the points can help. We can do this in ggplot2 using the alpha argument, which ranges from 0 for completely transparent to 1 for completely opaque (see Figure 9-10 for the results).

> scatter <- ggplot(cars, aes(weight, mpgCity)) + geom_point(alpha = .5)
> scatter <- scatter + geom_smooth(method = lm)
> scatter


Figure 9-10. Scatterplot of city miles per gallon and car weight with transparency added for the points
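Before choosing an alpha value, it can be useful to count how many points share coordinates with an earlier point. The sketch below uses a small hypothetical data frame, pts, built only for illustration.

```r
library(ggplot2)

# Hypothetical (x, y) pairs with some exact repeats
pts <- data.frame(
  x = c(1, 1, 1, 2, 2, 3, 4, 4),
  y = c(5, 5, 5, 6, 6, 7, 8, 9)
)

# Rows that repeat an earlier (x, y) pair are hidden by overplotting
n_overlapping <- sum(duplicated(pts))
print(n_overlapping)

# With partial transparency, stacked points appear darker than single ones
alpha_plot <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(alpha = 0.3, size = 4)
print(alpha_plot)
```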

Another very effective way to show bivariate data is a hexagonal bin plot. Hexagonal bins are shaded to show the frequency of bivariate data in each bin. This plot is a very good way to determine where the data are stacked or overlapped. As before, we can build up the plot by adding elements to the plot object as follows. To show how difficult seeing the data can be with a scatterplot, we use the diamonds dataset built into ggplot2, which has 53,940 observations. We will also use the grid.arrange() function as before to put two plots side by side (shown in Figure 9-11), one a scatterplot and the other a hexagonal bin plot. Note how we can reuse the basic ggplot2 graph data and variable mapping and just present it differently with geom_point() or geom_hex().

> install.packages("hexbin")
> library(hexbin)
> dplot <- ggplot(diamonds, aes(price, carat))
> splot <- dplot + geom_point(alpha = .25)
> hex <- dplot + geom_hex()
> grid.arrange(splot, hex, ncol = 2)


Figure 9-11. Scatterplot with transparency (left) and hexagonal bin plot (right) of the price and carat size of 53,940 diamonds
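The resolution of a hexagonal bin plot can be tuned with the bins argument of geom_hex(). The following sketch uses the diamonds data from ggplot2; the value 40 is an arbitrary choice for illustration, and the plot is drawn only if the hexbin package is installed.

```r
library(ggplot2)

# Size of the dataset being binned
n_diamonds <- nrow(diamonds)
print(n_diamonds)

# More bins give smaller hexagons and finer detail
hex_plot <- ggplot(diamonds, aes(x = price, y = carat)) +
  geom_hex(bins = 40)

# geom_hex() relies on the hexbin package at drawing time
if (requireNamespace("hexbin", quietly = TRUE)) {
  print(hex_plot)
}
```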

References

1. See Leland Wilkinson, The Grammar of Graphics (New York: Springer, 2005). Hadley Wickham’s ggplot2 package is an R implementation of the grammar that Wilkinson describes.

2. See, for example, Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, CT: Graphics Press, 2001), which is considered a classic in the field.

________________________

1. Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, CT: Graphics Press, 2001), 13.
