We have just discovered one type of lattice plot as well as multipanel conditioning. Lattice is a rich package that features diverse plots. We have already encountered the multi-paneled scatterplot obtained using xyplot(). In this section, we will look at some more multi-paneled lattice graphs: histograms, stacked bars, dotplots, and a customization of the scatterplot in which points are replaced by text.
In the previous chapter, we examined the overall distribution of an attribute using the hist() function. The distribution of some measures can vary between groups; that is, it can be more or less skewed in some groups than in others. The histogram() function in the lattice package allows for a visual inspection of this. We will examine variability in temperatures by month using the airquality dataset. This dataset has six attributes (Ozone, Solar.R, Wind, Temp, Month, and Day), of which you will find a description by typing:
?airquality
To generate a lattice graphic with a histogram of Temp for each month, type:
histogram(~Temp | factor(Month), data = airquality, xlab = "Temperature", ylab = "Percent of total")
Leaving the left-hand side of the formula empty means that only one variable is plotted: the values of the right-hand side (Temp) are binned along the x axis, and the y axis shows the percentage of observations in each bin, rather than the values of another variable as in the previous example. Notice that we use the factor() function here. It tells R to treat the values of the variable Month as categories rather than as quantities, as it would by default.
The effect is that the plot is produced for each month (from May to September). The output is provided below. We can notice that the temperature increases from May to July and then decreases:
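To check the visual impression numerically, the monthly means can be computed alongside the plot. The following is a minimal sketch, not part of the original example: the names aq, MonthName, monthly_mean and temp_hist are hypothetical, and the month labels rely on the assumption that months 5 to 9 correspond to May to September, as stated above.

```r
library(lattice)

# Work on a copy of the built-in data (aq and MonthName are illustrative names)
aq <- airquality
aq$MonthName <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Mean temperature per month, to compare with the shape of the histograms
monthly_mean <- tapply(aq$Temp, aq$MonthName, mean)
print(round(monthly_mean, 1))

# Same histogram as above, with month names in the panel strips
temp_hist <- histogram(~ Temp | MonthName, data = aq,
                       xlab = "Temperature", ylab = "Percent of total")
print(temp_hist)
```

Labelling the factor levels is optional; it only changes the text shown in the strip above each panel.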
Stacked bar graphs are very useful representations of multiway data. Data are called multiway when three or more factors are plotted or analyzed together. In this example, we will use fictitious sales data from a company specializing in selling DVDs and Blu-ray discs in 2006. The company has five branches, each with five departments: Movies, TV series, Documentary, Music, and Instructional. The following code will create the salesdata data frame containing sales records for the years 2004 and 2014 (in hundreds of thousands). We will then examine the sales as a function of branch, department, and year.
5 "Branch 3", "Branch 4", "Branch 5"), 5)
6 salesdata$Dept = c(rep("Movies",5),rep("TVSeries",5),
7 rep("Documentary",5),rep("Music", 5), rep("Instructional", 5))
8 salesdata$Sales = c(50.795, 25.469, 30.241, 100.658, 36.412,
9 45.632, 30.541, 31.421, 70.212, 25.412, 5.124, 3.124, 4.065,
10 10.258, 0.82, 10.658, 5.474, 6.541, 10.698, 76.584, 1.021,
11 0.504, 0.76, 0.15, 0.3, 203.18, 101.876, 120.964, 402.632,
12 145.648, 182.528, 122.164, 125.684, 280.848, 101.648, 20.496,
13 12.496, 16.26, 41.032, 3.28, 42.632, 21.896, 26.164, 42.792,
14 306.336, 4.084, 2.016, 3.04, 0, 0)
On line 1, we build a matrix of 50 rows and 4 columns filled with zeroes, and coerce it to a data frame before populating it. As a reminder, a data frame is a list of vectors that can be of different types but must all be of the same length, whereas a matrix can only contain elements of the same type. On line 2, we name the columns of the data frame. We will have the attributes Year, Branch, Dept (for department), and Sales (for sales volume). From line 3 to line 14, we populate the data frame.
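The steps just described can be sketched as follows. This is an illustrative reconstruction rather than the original listing: the Sales values are copied from the code above, but the assignment of the first 25 rows to 2004 and the last 25 to 2014, and the helper names branches and depts, are assumptions.

```r
# Step 1: a 50 x 4 matrix of zeroes coerced to a data frame
salesdata <- data.frame(matrix(0, nrow = 50, ncol = 4))

# Step 2: name the columns
names(salesdata) <- c("Year", "Branch", "Dept", "Sales")

# ASSUMPTION: the first 25 rows are 2004 and the last 25 are 2014
salesdata$Year <- rep(c(2004, 2014), each = 25)

# Branches cycle fastest; each department covers five consecutive rows
branches <- paste("Branch", 1:5)
depts <- c("Movies", "TVSeries", "Documentary", "Music", "Instructional")
salesdata$Branch <- rep(branches, times = 10)
salesdata$Dept <- rep(rep(depts, each = 5), times = 2)

# Sales values as given in the listing above
salesdata$Sales <- c(
  50.795, 25.469, 30.241, 100.658, 36.412,
  45.632, 30.541, 31.421, 70.212, 25.412,
  5.124, 3.124, 4.065, 10.258, 0.82,
  10.658, 5.474, 6.541, 10.698, 76.584,
  1.021, 0.504, 0.76, 0.15, 0.3,
  203.18, 101.876, 120.964, 402.632, 145.648,
  182.528, 122.164, 125.684, 280.848, 101.648,
  20.496, 12.496, 16.26, 41.032, 3.28,
  42.632, 21.896, 26.164, 42.792, 306.336,
  4.084, 2.016, 3.04, 0, 0)

str(salesdata)
```

Calling str() is a quick way to confirm that the data frame has 50 rows and that each column has the expected type.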
The following code will produce a stacked bar chart of the data:
barchart(Dept ~ Sales | Branch, groups = Year, data = salesdata, auto.key = list(space = 'right'), stack = TRUE)
We used the barchart() function to generate the graph. In the formula argument, we put the department (attribute Dept) on the left of the tilde (~), as we want it displayed on the y axis, and the sales (attribute Sales) on the right, as we want them on the x axis. We continued the formula by conditioning on Branch (after the vertical bar, which means conditioned on). We want each panel to discriminate between the years, so we assigned Year to the groups argument. We asked for stacked bars by setting the stack argument to TRUE, and finally we asked for the key of the graph to be placed on the right using the auto.key argument.
On this graph, we can see that the sales were much higher in 2014 than in 2004. Whereas branch 4 increased mostly in movies and TV series sales, branch 5 became apparently specialized in musical DVDs and Blu-rays and generated an increased income in this department. Branches 1, 2 and 3 were not as lucky and didn't increase their sales as much, but still managed to survive the crisis.
The dotplot() function is a useful graphing function in the lattice package. It allows for the representation of the relationship between a numeric attribute and one or more factor attributes. We will use the salesdata dataset again to illustrate the use of the dotplot() function. We will reuse the formula we used for producing the stacked bar chart, as well as most of the argument assignments. This is possible because the same options can be used for generating most lattice objects. We will also add a title using the argument main, and increase the size of the dots a bit using the argument cex:
dotplot(Dept ~ Sales | Branch, groups = Year, data = salesdata, cex = 2, auto.key = list(space = "right"), main = "Sales by department, branch and year")
The following is the output:
We can interpret this graph the same way we did for the stacked bar graph, as it represents the same data. It is sometimes useful to use multiple visualizations for a better understanding of the data.
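The reuse of one formula across plot types can also be tried on a dataset that ships with lattice. The sketch below is an illustration only and is not part of the sales example: it uses the barley data (yield by variety, site, and year), and the object names bars and dots are hypothetical.

```r
library(lattice)

# Stacked bars: variety on the y axis, yield on the x axis, one panel per site
bars <- barchart(variety ~ yield | site, groups = year, data = barley,
                 stack = TRUE, auto.key = list(space = "right"))

# Same formula, groups and key, drawn as a dotplot with a title
dots <- dotplot(variety ~ yield | site, groups = year, data = barley,
                auto.key = list(space = "right"),
                main = "Barley yield by variety, site and year")

print(bars)
print(dots)
```

Only the function name and the plot-specific arguments (stack for bars, main for the title) differ between the two calls.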
An interesting feature of xyplot() is the possibility to display data points as text in multi-paneled scatterplots conditioned on one attribute or a combination of attributes. In what follows, we will examine the relationship between Fertility (y axis) and Education (x axis) in Swiss districts. We will use multi-paneled scatterplots conditioned on high versus low infant mortality and on whether the district is rural or non-rural. Each observation will be displayed as the name of its district.
The following code starts by creating a new data frame from the swiss dataset (line 1). We then add three attributes. The first is Mortality, which records whether Infant.Mortality is higher than the mean value across the dataset (lines 2 to 5). Remember that in Chapter 2, Visualizing and Manipulating Data Using R, we used the subset() function combined with if statements to subset data based on a condition. Here we rely on an alternative solution: placing the condition inside square brackets. For instance, lines 2 and 3 mean that fertility$Mortality (the attribute Mortality of data frame fertility) takes the value High infant mortality in observations where fertility$Infant.Mortality is higher than the mean of swiss$Infant.Mortality.
The second additional attribute is Rural, which records whether Agriculture is higher than the mean value across the dataset (lines 6 to 9). Finally, the attribute District (that is, the district where the data was collected) is simply the row names of the dataset (line 10). Mortality and Rural are stored as descriptive labels rather than as TRUE and FALSE so that the values are understandable in the graph. Here is the code for this data preparation:
1 fertility=swiss
2 fertility$Mortality[(fertility$Infant.Mortality >
3 mean(swiss$Infant.Mortality))==TRUE]="High infant mortality"
4 fertility$Mortality[(fertility$Infant.Mortality >
5 mean(swiss$Infant.Mortality))==FALSE]="Low infant mortality"
6 fertility$Rural[(fertility$Agriculture >
7 mean(swiss$Agriculture)) == TRUE] = "Rural"
8 fertility$Rural[(fertility$Agriculture>
9 mean(swiss$Agriculture)) == FALSE] = "Non-rural"
10 fertility$District = rownames(fertility)
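The same three attributes can be derived more compactly with ifelse(), which picks one of two labels for each element of a logical vector. This is an alternative sketch, not the approach used in the listing above; the intermediate names high_mortality and rural are hypothetical.

```r
fertility <- swiss

# TRUE where infant mortality is above the dataset mean
high_mortality <- fertility$Infant.Mortality > mean(fertility$Infant.Mortality)
fertility$Mortality <- ifelse(high_mortality,
                              "High infant mortality", "Low infant mortality")

# TRUE where the share of agriculture is above the dataset mean
rural <- fertility$Agriculture > mean(fertility$Agriculture)
fertility$Rural <- ifelse(rural, "Rural", "Non-rural")

# District names are the row names of the dataset
fertility$District <- rownames(fertility)

# Number of districts in each of the four combinations
print(table(fertility$Mortality, fertility$Rural))
```

The cross-tabulation shows how many districts will appear in each panel of the conditioned plot.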
In the plotting code below, we first give the formula, which specifies the relationship between attributes to plot, including the conditioning on Mortality * Rural (the combination of which results in four panels; line 1). We then configure the groups (line 2). An important part of the graph is the configuration of the panel (lines 3 and 4). We create a panel function in which we call ltext(), which prints text in lattice graphics, in this case the district names. This replaces the default printing of data points as dots. We then configure the main title of the graph (line 5):
1 xyplot(Fertility ~ Education | Mortality * Rural, data = fertility,
2 groups = as.character(District),
3 panel = function(x, y, subscripts, groups) {
4 ltext(x, y, labels = groups[subscripts], cex=.4)},
5 main = "Fertility and education in 1888 Occidental Switzerland")
On the resulting graph, presented in the figure below, we can notice the relationship between education and fertility, especially in non-rural areas. We can also see that fertility is lowest in non-rural areas where infant mortality is low:
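To see how the panel function picks the right labels, it can help to try it on a simpler, self-contained case. The sketch below is an illustration under assumptions: it conditions on a single derived attribute only, and the names d and label_plot are hypothetical. Inside the panel function, subscripts holds the row numbers of the observations drawn in the current panel, so groups[subscripts] selects the matching district names.

```r
library(lattice)

# Simplified data: one conditioning attribute and the district names
d <- swiss
d$Rural <- ifelse(d$Agriculture > mean(d$Agriculture), "Rural", "Non-rural")
d$District <- rownames(d)

label_plot <- xyplot(Fertility ~ Education | Rural, data = d,
  groups = District,
  panel = function(x, y, subscripts, groups, ...) {
    # subscripts: row numbers of the observations in this panel
    ltext(x, y, labels = as.character(groups[subscripts]), cex = 0.5)
  },
  main = "Fertility and education by district")

print(label_plot)
```

Adding ... to the panel function's arguments lets it accept any further arguments that lattice passes along without raising an error.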