Discovering other lattice plots

We have just discovered one type of plot in lattice as well as multipanel conditioning. Lattice is a rich package which features diverse plots. We have already encountered the multi-paneled scatterplot obtained using xyplot(). We will have a look at some more lattice multi-paneled graphs in this section: histograms, stacked bars, dotplots, as well as a customization of the scatterplot, where points are replaced by text.

Histograms

In the previous chapter, we examined the overall distribution of an attribute using the hist() function. The distribution of some measures can vary between groups, that is, it can be more or less skewed in some groups compared to others. The histogram() function in the lattice package allows for a visual inspection of this. We will examine variability in temperatures by month using the airquality dataset. This dataset has six attributes (Ozone, Solar.R, Wind, Temp, Month, and Day), of which you will find a description by typing:

?airquality

To generate a lattice graphic with a histogram of Temp for each month, type:

histogram(~Temp | factor(Month), data = airquality,
   xlab = "Temperature", ylab = "Percent of total")

Omitting the left-hand side of the formula tells histogram() to plot the distribution of the right-hand side variable: the values of Temp are binned along the x axis, and the y axis shows the percentage of observations in each bin rather than the values of another variable, as in the previous example. Notice that we use the factor() function here. It tells R to treat the values of the variable Month as categories, instead of as quantities, which is how it would treat them by default.

The effect is that one histogram is produced for each month (from May to September). The output is provided below. We can see that temperatures rise from May to July, remain high in August, and fall in September:

Histograms of temperature by month
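
As a quick numerical check on what the histograms suggest, the monthly means can be computed directly. The sketch below is an illustration rather than part of the original example: the copy aq and the column MonthName are names introduced here, and month.name is R's built-in vector of month names. Labelling the factor levels also gives the panels readable titles instead of the numbers 5 to 9.

```r
library(lattice)

# Work on a copy so that the built-in dataset is left untouched
aq <- airquality

# Turn the numeric month into a labelled factor (May to September)
aq$MonthName <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Mean temperature per month, to compare with the histograms
monthly_mean <- tapply(aq$Temp, aq$MonthName, mean)
print(round(monthly_mean, 1))

# Same histogram as before, now with month names as panel titles
p <- histogram(~ Temp | MonthName, data = aq,
               xlab = "Temperature", ylab = "Percent of total")
print(p)
```

Because MonthName is a factor, lattice draws one panel per level, in the order given by the levels argument.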

Stacked bars

Stacked bar graphs are very useful representations of multiway data, that is, data in which three or more factors are plotted or analyzed together. In this example, we will use fictitious sales data from a company specializing in selling DVDs and Blu-ray discs. The company has five branches, each with five departments: movies, TV series, documentary, music, and instructional. The following code creates the salesdata data frame containing sales records for the years 2004 and 2014 (in hundreds of thousands). We will then examine the sales as a function of branch, department, and year.

1  salesdata = data.frame(matrix(0, nrow = 50, ncol = 4))
2  colnames(salesdata) = c("Year", "Branch", "Dept", "Sales")
3  salesdata$Year = c(rep(2004, 25), rep(2014, 25))  # assumes 2004 rows come first
4  salesdata$Branch = rep(c("Branch 1", "Branch 2",
5     "Branch 3", "Branch 4", "Branch 5"), 5)
6  salesdata$Dept = c(rep("Movies", 5), rep("TVSeries", 5),
7     rep("Documentary", 5), rep("Music", 5), rep("Instructional", 5))
8  salesdata$Sales = c(50.795, 25.469, 30.241, 100.658, 36.412,
9     45.632, 30.541, 31.421, 70.212, 25.412, 5.124, 3.124, 4.065,
10    10.258, 0.82, 10.658, 5.474, 6.541, 10.698, 76.584, 1.021,
11    0.504, 0.76, 0.15, 0.3, 203.18, 101.876, 120.964, 402.632,
12    145.648, 182.528, 122.164, 125.684, 280.848, 101.648, 20.496,
13    12.496, 16.26, 41.032, 3.28, 42.632, 21.896, 26.164, 42.792,
14    306.336, 4.084, 2.016, 3.04, 0, 0)

On line 1, we build a matrix of 50 rows and 4 columns filled with zeroes, and coerce it to a data frame before populating it. As a reminder, a data frame is a list of vectors that can be of different types but must all be of the same length. A matrix can only contain elements of the same type. On line 2, we name the columns of the data frame: Year, Branch, Dept (for department), and Sales (for sales volume). From line 3 to 14, we populate the data frame. Note that Dept is given only 25 values for 50 rows; R recycles them, so the same sequence of departments is used for both years.
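
The difference between a matrix and a data frame can be seen in a small experiment. This is only an illustrative sketch: the objects m and df and their contents are made up for the purpose and are not part of the sales example.

```r
# A matrix holds a single type: assigning one string coerces every cell
m <- matrix(0, nrow = 2, ncol = 2)
m[1, 1] <- "Branch 1"
print(m)                 # all four cells are now character strings
print(is.character(m))

# A data frame is a list of columns, so each column keeps its own type
df <- data.frame(matrix(0, nrow = 2, ncol = 2))
colnames(df) <- c("Branch", "Sales")
df$Branch <- c("Branch 1", "Branch 2")
print(sapply(df, class)) # Branch is character, Sales stays numeric
```

This is why the matrix of zeroes is coerced to a data frame before the character columns Branch and Dept are filled in.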

The following code will produce a stacked bar chart of the data:

barchart(Dept ~ Sales | Branch, groups = Year, data = salesdata,
   auto.key = list(space = 'right'), stack = TRUE)

A stacked bar chart of yearly sales by department and branch

We used the barchart() function to generate the graph. In the formula, we put the department (attribute Dept) on the left because we want it displayed on the y axis, and the sales (attribute Sales) after the tilde (~) because we want them on the x axis. We then conditioned the graph on Branch (the vertical bar means conditioned on), so that each branch gets its own panel. Because we want each panel to distinguish between the years, we assigned Year to the groups argument. Setting stack = TRUE produces stacked bars instead of side-by-side bars, and the auto.key argument places the key of the graph on the right.
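
To see what the stack argument changes, it can help to draw the same formula both ways on a very small dataset. The data frame toy below is hypothetical (two departments, two branches, made-up sales figures) and only serves as a sketch of the two layouts.

```r
library(lattice)

# Hypothetical toy data: 2 departments x 2 branches x 2 years
toy <- data.frame(
  Dept   = rep(c("Movies", "Music"), times = 4),
  Branch = rep(rep(c("Branch A", "Branch B"), each = 2), times = 2),
  Year   = rep(c(2004, 2014), each = 4),
  Sales  = c(10, 4, 8, 6, 30, 12, 20, 25)
)

# stack = FALSE: one bar per year, drawn side by side within each department
side_by_side <- barchart(Dept ~ Sales | Branch, groups = Year, data = toy,
                         stack = FALSE, auto.key = list(space = "right"))

# stack = TRUE: the yearly bars are placed end to end, so the total bar
# length is the sum of the sales over the two years
stacked <- barchart(Dept ~ Sales | Branch, groups = Year, data = toy,
                    stack = TRUE, auto.key = list(space = "right"))

print(side_by_side)
print(stacked)

# The totals that the stacked bars represent
totals <- aggregate(Sales ~ Dept + Branch, data = toy, FUN = sum)
print(totals)
```

Stacked bars make totals easy to compare, whereas side-by-side bars make it easier to compare the years within a department.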

On this graph, we can see that sales were much higher in 2014 than in 2004. Whereas branch 4 grew mostly in movies and TV series sales, branch 5 apparently came to specialize in music DVDs and Blu-rays and generated increased income in this department. Branches 1, 2, and 3 were not as lucky and didn't increase their sales as much, but still managed to survive the crisis.

Dotplots

The dotplot() function is another useful graphing function in lattice. It represents the relationship between a numeric attribute and one or more factor attributes. We will use the salesdata dataset again to illustrate its use. We will reuse the formula we used for the stacked bar chart, as well as most of the argument settings. This is possible because most lattice functions accept the same options. We will also add a title using the argument main, and increase the size of the dots a bit using the argument cex:

dotplot(Dept ~ Sales | Branch, groups = Year, data = salesdata, 
   cex = 2, auto.key = list(space = "right"), 
   main = "Sales by department, branch and year") 

The following is the output:

A dot plot of yearly sales by department and branch

We can interpret this graph the same way we did for the stacked bar graph, as it represents the same data. It is sometimes useful to use multiple visualizations for a better understanding of the data.

Displaying data points as text

An interesting feature of xyplot() is the possibility of displaying data points as text in multi-paneled scatterplots conditioned on one attribute or on a combination of attributes. In what follows, we will examine the relationship between fertility (y axis) and education (x axis) in Swiss districts. We will use multi-paneled scatterplots conditioned on high versus low infant mortality and on whether the district is rural or non-rural. Each observation will be displayed as the name of the district.

The following code starts by creating a new data frame from the swiss dataset (line 1). We then add three additional attributes. The first is Mortality, which records whether Infant.Mortality is higher than the mean value across the dataset (lines 2 to 5). Remember that in Chapter 2, Visualizing and Manipulating Data Using R, we used the subset() function combined with if statements to subset data based on a condition. Here we rely on an alternative solution: placing the condition in brackets. For instance, lines 2 and 3 mean that fertility$Mortality (the attribute Mortality of the data frame fertility) takes the value High infant mortality in observations where fertility$Infant.Mortality is higher than the mean of swiss$Infant.Mortality.

The second additional attribute is Rural, which records whether Agriculture is higher than the mean value across the dataset (lines 6 to 9). The third is District, the district where the data was collected, which is simply the row names of the dataset (line 10). Mortality and Rural are coded with descriptive labels so that the values are understandable in the graph. Here is the code for this data preparation:

1  fertility = swiss
2  fertility$Mortality[(fertility$Infant.Mortality >
3     mean(swiss$Infant.Mortality)) == TRUE] = "High infant mortality"
4  fertility$Mortality[(fertility$Infant.Mortality >
5     mean(swiss$Infant.Mortality)) == FALSE] = "Low infant mortality"
6  fertility$Rural[(fertility$Agriculture >
7     mean(swiss$Agriculture)) == TRUE] = "Rural"
8  fertility$Rural[(fertility$Agriculture >
9     mean(swiss$Agriculture)) == FALSE] = "Non-rural"
10 fertility$District = rownames(fertility)
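
For comparison, the same recoding can be written more compactly with ifelse(), which returns one value where the condition is true and another where it is false. This is an alternative sketch rather than the approach used in this chapter; the name fert is introduced here so that the fertility data frame is not overwritten.

```r
# Start from a fresh copy of the built-in swiss dataset
fert <- swiss

# Label each district by comparing it with the mean across all districts
fert$Mortality <- ifelse(fert$Infant.Mortality > mean(fert$Infant.Mortality),
                         "High infant mortality", "Low infant mortality")
fert$Rural <- ifelse(fert$Agriculture > mean(fert$Agriculture),
                     "Rural", "Non-rural")

# The district names are stored as the row names of the dataset
fert$District <- rownames(fert)

# Number of districts in each of the four combinations
print(table(fert$Mortality, fert$Rural))
```

The cross-tabulation at the end shows how many districts fall into each combination of the two new attributes.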

In the plotting section below, we first include the formula, which specifies the relationship between attributes to plot, including the conditioning on Mortality * Rural (the combination of which results in 4 panels; line 1). We then configure the groups (line 2). An important part of the graph is the configuration of the panel (lines 3 and 4).

We create a panel function in which we call ltext(), which prints text in lattice graphics, in this case the district names. This replaces the default plotting of data points as dots. We then configure the main title of the graph (line 5):

1  xyplot(Fertility ~ Education | Mortality * Rural, data = fertility,
2     groups = as.character(District),
3     panel = function(x, y, subscripts, groups) {
4        ltext(x, y, labels = groups[subscripts], cex = .4)},
5     main = "Fertility and education in 1888 Occidental Switzerland")

On the resulting graph, presented in the figure below, we can see a negative relationship between education and fertility, especially in non-rural areas. We can also see that fertility is lowest in non-rural areas where infant mortality is low:

A multi-panel plot of the relationship between fertility and education, conditioning on infant mortality and agriculture
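
The role of subscripts in a panel function is easier to see in a reduced example. In the sketch below, which is a variation and not the code used above, the names d and p are introduced for illustration, the conditioning is limited to Rural, and the points are kept and labelled instead of being replaced. Inside each panel, x and y contain only the observations of that panel, and subscripts gives their row positions in the full data frame, which is how the matching district names are found.

```r
library(lattice)

# Minimal preparation on a copy of the swiss dataset
d <- swiss
d$District <- rownames(d)
d$Rural <- ifelse(d$Agriculture > mean(d$Agriculture), "Rural", "Non-rural")

p <- xyplot(Fertility ~ Education | Rural, data = d,
            panel = function(x, y, subscripts, ...) {
              # Draw the usual points first
              panel.xyplot(x, y, col = "grey60", pch = 16, ...)
              # Then add the district names just above each point;
              # subscripts indexes the rows of d shown in this panel
              ltext(x, y, labels = d$District[subscripts],
                    cex = 0.5, pos = 3)
            },
            main = "Fertility and education by type of district")
print(p)

# Overall correlation between the two plotted attributes
print(cor(d$Fertility, d$Education))
```

Keeping the points makes the exact position of each observation visible, at the cost of a slightly busier plot.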
