What You Will Learn in this Chapter:
The world can be a complicated place, and the data you have can be correspondingly complicated. You saw in the previous chapter how to use analysis of variance (ANOVA) via the aov() command to help make sense of complicated data. This chapter builds on this knowledge by walking you through the process of creating data objects prior to carrying out a complicated analysis.
This chapter has two main themes. To start, you look at ways to create and manipulate data to produce the objects you require to carry out these complex analyses. Later in the chapter you look at methods to extract the various components of a complicated data object. You have seen some of these commands before and others are new.
To begin with, you need to have some data to work on. You can construct your data in a spreadsheet and have it ready for analysis in R, or you may have to construct the data from various separate elements. This section covers the latter scenario.
When you need to carry out a complex analysis, the likelihood is that you will have to make a complex data object. The more complicated the situation you are examining, the more important it is that your data are arranged in a sensible fashion. In general, this means that you should have a column for each variable that you are dealing with—usually this means a column containing the response variable and additional columns each containing a predictor variable.
You have already seen various ways to create data items:
If you read data from another application, like a spreadsheet, it is likely that your data are already in the layout you require. If you have individual vectors of data, you need to construct data frames and matrix objects before you can carry out the business of complex analysis.
The data frame is probably the most useful kind of data object for complex analyses because you can have columns containing a variety of data types. For example, you can have columns containing numeric data and other columns containing factor data. This is unlike a matrix where all the data must be of one type (for example, all numeric or all text).
To make a data frame you simply use the data.frame() command, and type the names of the objects that will form the columns into the parentheses. However, you need to ensure that all the objects are of the same length. The following example contains two simple vectors of numerical data that you want to make into a data frame. They have different lengths, so you need to alter the shorter one and add NA items to pad it out:
> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9
> length(unmow) = length(mow)
> unmow
[1] 8 9 7 9 NA
> grassy = data.frame(mow, unmow)
> grassy
mow unmow
1 12 8
2 15 9
3 17 7
4 11 9
5 15 NA
The length() command is usually used to query the length of an object, but here you use it to alter the original data by setting its length to be the same as the longer item. If you use a value that turns out to be shorter than the current length, your object is truncated and the extra data are removed.
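The padding and truncating behavior described above can be sketched as follows. This is a minimal example that reuses the mow and unmow values from the text; the object name short is only an illustrative assumption.

```r
# Two samples of unequal length (values from the example above)
mow <- c(12, 15, 17, 11, 15)
unmow <- c(8, 9, 7, 9)

# Setting a longer length pads the vector with NA
length(unmow) <- length(mow)
grassy <- data.frame(mow, unmow)
print(grassy)

# Setting a shorter length truncates: only the first three values remain
short <- mow            # 'short' is a hypothetical name used for illustration
length(short) <- 3
print(short)
```

Working on a copy, as with short here, is a cautious habit because truncation discards the extra values permanently.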
You can use a variety of other commands to set the names of the columns, and also add names for the individual rows. The following example looks at the main column names using the names() command:
> names(grassy)
[1] "mow" "unmow"
> names(grassy) = c('mown', 'unmown')
> names(grassy)
[1] "mown" "unmown"
Here, you query the column names and then set them to new values. You can do something similar with row names. In the following example you create a vector of names first and then set them using the row.names() command:
> grn = c('Top', 'Middle', 'Lower', 'Set aside', 'Verge')
> row.names(grassy)
[1] "1" "2" "3" "4" "5"
> row.names(grassy) = grn
> row.names(grassy)
[1] "Top" "Middle" "Lower" "Set aside" "Verge"
Notice that the original row names are a simple index and appear as characters when you query them. The newly renamed data frame appears like this:
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
You may prefer to have your data frame in a different layout, with one column for the response variable and one for the predictor (in most cases this is preferable). In the current example you would have one column for the numerical values, and one to hold the treatment names (mown or unmown). You can do this in several ways, depending on where you start.
In this case you already have a data frame and can convert it using the stack() command:
> stack(grassy)
values ind
1 12 mown
2 15 mown
3 17 mown
4 11 mown
5 15 mown
6 8 unmown
7 9 unmown
8 7 unmown
9 9 unmown
10 NA unmown
Now you have the result you want, but you have an NA item that you do not really need. You can use na.omit() to strip out the NA items that may occur:
> na.omit(stack(grassy))
values ind
1 12 mown
2 15 mown
3 17 mown
4 11 mown
5 15 mown
6 8 unmown
7 9 unmown
8 7 unmown
9 9 unmown
The column names are set to the defaults of values and ind. You can use the names() command to alter them afterward. The stack() command really only works when you have a simple situation with all samples being related to a single predictor variable. When you need multiple columns with several predictor variables, you need a different approach.
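As a rough sketch, the stacking, NA removal, and renaming steps can be combined like this. The data frame is rebuilt from the values shown earlier; the new column names rich and cutting are assumptions chosen for illustration.

```r
# Rebuild the two-sample data frame from the text
grassy <- data.frame(mown = c(12, 15, 17, 11, 15),
                     unmown = c(8, 9, 7, 9, NA))

# Stack into response/predictor layout and drop the NA row
grass.long <- na.omit(stack(grassy))

# Replace the default names (values, ind) with more meaningful ones
names(grass.long) <- c("rich", "cutting")   # illustrative names
str(grass.long)
```

The second column produced by stack() is a factor, so the result is ready to use as a predictor variable.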
When you need to create vectors of treatment names you are repeating the same names over and over according to how many replicates you have. You can use the rep() command to generate repeating items and take some of the tedium out of the process. In the following example and subsequent steps, you use the rep() command to make labels to match up with the two samples you have (mow and unmow):
> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9
> trt = c(rep('mow', length(mow)), rep('unmow', length(unmow)))
> trt
[1] "mow" "mow" "mow" "mow" "mow" "unmow" "unmow" "unmow" "unmow"
> rich = c(mow, unmow)
> data.frame(rich, trt)
rich trt
1 12 mow
2 15 mow
3 17 mow
4 11 mow
5 15 mow
6 8 unmow
7 9 unmow
8 7 unmow
9 9 unmow
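The general form of the command is rep(what, times). To keep the combined data for later use, assign the data frame to a named object: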
> grass.dat = data.frame(rich, trt)
The rep() command is useful to help you create repeating elements (like factors) and you will see it again shortly. Before then, you look at creating matrix objects.
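A matrix can be thought of as a single vector of data that is conveniently split up into rows and columns. You can make a matrix object in two main ways: by joining separate vectors as columns or rows with the cbind() and rbind() commands, or by splitting a single vector with the matrix() command.
The following examples illustrate the two methods, starting with cbind() and rbind():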
> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9
> length(unmow) = length(mow)
> cbind(mow, unmow)
mow unmow
[1,] 12 8
[2,] 15 9
[3,] 17 7
[4,] 11 9
[5,] 15 NA
> rbind(mow,unmow)
[,1] [,2] [,3] [,4] [,5]
mow 12 15 17 11 15
unmow 8 9 7 9 NA
If you have your data as one single vector, you can use an alternative method to make a matrix using the matrix() command. This command takes a single vector and splits it into a matrix with the number of rows or columns that you specify. This means that the length of your vector should be divisible by the number of rows or columns that you require; if it is not, R recycles values to fill the gaps and issues a warning. In the following example and subsequent steps you have a single vector of values that you use to create a matrix:
> rich
[1] 12 15 17 11 15 8 9 7 9
> length(rich) = 10
> rich
[1] 12 15 17 11 15 8 9 7 9 NA
> matrix(rich, ncol = 2)
[,1] [,2]
[1,] 12 8
[2,] 15 9
[3,] 17 7
[4,] 11 9
[5,] 15 NA
> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9 NA
> matrix(rich, nrow = 2)
[,1] [,2] [,3] [,4] [,5]
[1,] 12 17 15 9 9
[2,] 15 11 8 7 NA
> matrix(rich, nrow = 2, byrow = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] 12 15 17 11 15
[2,] 8 9 7 9 NA
With the first method the names of the joined vectors label one margin, but when you use the matrix() command none of the margin names are set; you need to use the rownames() or colnames() commands to set them.
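A short sketch of setting the margin names follows. The row labels are assumptions borrowed from labels used elsewhere in the chapter, and the dimnames argument is shown as an alternative way to set both margins when the matrix is created.

```r
# Single vector padded to an even length, as in the text
rich <- c(12, 15, 17, 11, 15, 8, 9, 7, 9, NA)

# Make the matrix, then name the margins afterward
grass.m <- matrix(rich, ncol = 2)
colnames(grass.m) <- c("mow", "unmow")
rownames(grass.m) <- c("top", "upper", "mid", "lower", "bottom")  # illustrative labels

# Alternatively, supply both sets of names when the matrix is created
grass.m2 <- matrix(rich, ncol = 2,
                   dimnames = list(c("top", "upper", "mid", "lower", "bottom"),
                                   c("mow", "unmow")))
print(grass.m)
```

Once the names are set you can pick out values by name, for example grass.m["mid", "mow"].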
When you create data for complex analysis, like analysis of variance, you create vectors for both the response variables and the predictor variables. The response variables are generally numeric, but the predictor variables may well be characters and refer to names of treatments. Alternatively, they may be simple numeric values with each number representing a separate treatment. When you create a data frame that contains numeric and character vectors, versions of R before 4.0.0 (such as the one that produced the output shown here) convert the character vectors to factors; from R 4.0.0 onward they stay as characters unless you add stringsAsFactors = TRUE to the data.frame() command. In the following example you can see a simple data frame created from a numeric vector and a character vector:
> rich ; graze
[1] 12 15 17 11 15 8 9 7 9
[1] "mow" "mow" "mow" "mow" "mow" "unmow" "unmow" "unmow" "unmow"
> grass.df = data.frame(rich, graze)
> str(grass.df)
'data.frame': 9 obs. of 2 variables:
$ rich : int 12 15 17 11 15 8 9 7 9
$ graze: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
When you use the str() command to examine the structure of the data frame that was created, you see that the character vector has been converted into a factor. If you add a character vector to an existing data frame, it will remain as a character vector unless you use the data.frame() command as your means of adding the new vector; you see this in a moment.
You can force a numeric or character vector to be a factor by using the factor() command:
> graze
[1] "mow" "mow" "mow" "mow" "mow" "unmow" "unmow" "unmow" "unmow"
> graze.f = factor(graze)
> graze.f
[1] mow mow mow mow mow unmow unmow unmow unmow
Levels: mow unmow
Here you see that the original characters are made into factors, and you see the list of levels when you look at the object (note that the data are not in quotes as they were when they were a character object). If you want to add a character vector to an existing data frame and require the new vector to be a factor, you can use the as.factor() command to convert the vector to a factor. In the following example you see the result of adding a vector of characters without using as.factor() and then with the as.factor() command:
> grass.df$graze2 = graze
> grass.df
rich graze graze2
1 12 mow mow
2 15 mow mow
3 17 mow mow
4 11 mow mow
5 15 mow mow
6 8 unmow unmow
7 9 unmow unmow
8 7 unmow unmow
9 9 unmow unmow
> str(grass.df)
'data.frame': 9 obs. of 3 variables:
$ rich : int 12 15 17 11 15 8 9 7 9
$ graze : Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
$ graze2: chr "mow" "mow" "mow" "mow" ...
> grass.df$graze2 = as.factor(graze)
> str(grass.df)
'data.frame': 9 obs. of 3 variables:
$ rich : int 12 15 17 11 15 8 9 7 9
$ graze : Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
$ graze2: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
In the first instance you see that the character vector appears in the data frame without quotes, but the str() command reveals it is still made up of characters. In the second case you use the as.factor() command, and the new column is successfully transferred as a factor variable. You can, of course, set a column to be a factor afterward, as you can see in the following example:
> grass.df$graze2 = factor(grass.df$graze2)
In this case you convert the graze2 column of the data frame into a factor using the factor() command. If you use the data.frame() command then any character vectors are converted to factors (in R 4.0.0 and later, only when you add stringsAsFactors = TRUE), as the following example shows:
> grass.df = data.frame(grass.df, graze2 = graze)
Notice how the name of the column created is set as part of the command; the graze2 object is created on the fly and added to the data frame as a factor.
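The following sketch shows one way to make the conversions explicit so that the result does not depend on the version of R in use. The column name graze3 is a hypothetical addition for illustration.

```r
rich <- c(12, 15, 17, 11, 15, 8, 9, 7, 9)
graze <- rep(c("mow", "unmow"), times = c(5, 4))

# Ask for factor conversion explicitly rather than relying on the default
grass.df <- data.frame(rich, graze, stringsAsFactors = TRUE)

# Adding with $ keeps the vector as characters...
grass.df$graze2 <- graze
# ...unless you convert it yourself
grass.df$graze3 <- factor(graze)    # 'graze3' is an illustrative name

# Check the class of every column
print(sapply(grass.df, class))
```

Stating stringsAsFactors explicitly costs a few keystrokes but removes any doubt about what kind of column you end up with.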
At some point you may want to examine how your factor vector is split up, because the factor represents the predictor variable and shows you how many treatments are applied. You can use the levels() command in two ways: to query an object and find out what levels it possesses, or to set the levels. The following examples query a character vector and then a factor:
> graze
[1] "mow" "mow" "mow" "mow" "mow" "unmow" "unmow" "unmow" "unmow"
> levels(graze)
NULL
Here the data are plain characters and no levels are set; when you examine the data with the levels() command you get NULL as a result.
> graze.f
[1] mow mow mow mow mow unmow unmow unmow unmow
Levels: mow unmow
> levels(graze.f)
[1] "mow" "unmow"
Here you see the names of the levels that you created earlier. If you have a numeric variable that represents codes for treatments, you can make the variable into a factor using the factor() command as you have already seen, but you can also assign names to the levels. In the following example you create a simple numeric vector to represent two treatments:
> graze.nf = c(1,1,1,1,1,2,2,2,2)
You can now assign names to each of the levels in the vector like so:
> levels(graze.nf)[1] = 'mown'
> levels(graze.nf)[2] = 'unmown'
> levels(graze.nf)
[1] "mown" "unmown"
> graze.nf
[1] 1 1 1 1 1 2 2 2 2
attr(,"levels")
[1] "mown" "unmown"
> class(graze.nf)
[1] "numeric"
You can set each level to have a name; now your plain numeric values have a more meaningful label. However, the vector still remains a numeric variable rather than a factor. You can set all the labels in one command with a slight variation, as the following example shows:
> graze.nf = factor(c(1,1,1,1,1,2,2,2,2))
> graze.nf
[1] 1 1 1 1 1 2 2 2 2
Levels: 1 2
> levels(graze.nf) = list(mown = '1', unmown = '2')
> graze.nf
[1] mown mown mown mown mown unmown unmown unmown unmown
Levels: mown unmown
In this case you create your factor object directly using numeric values but wrap these in a factor() command; you can see that you get your two levels, corresponding to the two values. This time you use the levels() command to set the names by listing how you want the numbers to be replaced.
You can also apply level names to a vector as you convert it to a factor via the factor() command:
> graze.nf = c(1,1,1,1,1,2,2,2,2)
> graze.nf
[1] 1 1 1 1 1 2 2 2 2
> factor(graze.nf, labels = c('mown', 'unmown'))
[1] mown mown mown mown mown unmown unmown unmown unmown
Levels: mown unmown
In this instance you have a simple numeric vector and use the labels = instruction to apply labels to the levels as you make your factor object.
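If you want to be certain which code receives which label, you can give the levels instruction alongside labels. The sketch below assumes the same numeric codes as above; the object names are illustrative.

```r
codes <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)

# State the codes and their labels together so the pairing is explicit
cut.f <- factor(codes, levels = c(1, 2), labels = c("mown", "unmown"))

# The order of the levels can differ without changing which label each value gets
cut.r <- factor(codes, levels = c(2, 1), labels = c("unmown", "mown"))

print(cut.f)
print(levels(cut.r))
```

Both objects label the first five values mown and the last four unmown; only the order in which the levels are stored differs, which affects things like the order of groups in tables and plots.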
You can use the nlevels() command to give you a numeric result for the number of levels in a vector:
> graze
[1] "mow" "mow" "mow" "mow" "mow" "unmow" "unmow" "unmow" "unmow"
> nlevels(graze)
[1] 0
> graze.f
[1] mow mow mow mow mow unmow unmow unmow unmow
Levels: mow unmow
> nlevels(graze.f)
[1] 2
You can also use the class() command to check what sort of object you are dealing with like so:
> class(graze)
[1] "character"
> class(graze.f)
[1] "factor"
In the first case you can see clearly that the data are characters, whereas in the second case you see that you have a factor object. The class() command is useful because, as you have seen, it is possible to apply levels to vectors of data without making them into factor objects. Take the following for example:
> nlevels(graze.nf)
[1] 2
> class(graze.nf)
[1] "numeric"
In the preceding example you have set two levels to your vector, but it remains a numeric object.
If you want to view a factor variable as numeric values rather than as level names, you can use the as.numeric() command, which returns the underlying integer codes (1 for the first level, 2 for the second, and so on):
> as.numeric(graze.nf)
[1] 1 1 1 1 1 2 2 2 2
Now you can switch between character, factor, and numeric quite easily.
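One point to watch when switching is that as.numeric() gives the position of each level, not the original number. The following sketch uses a made-up factor called dose (an assumption for illustration only) whose levels are themselves numbers.

```r
# A factor whose level names happen to be numbers
dose <- factor(c(10, 20, 10, 30))   # 'dose' is a hypothetical example

# as.numeric() returns the level positions
print(as.numeric(dose))                  # 1 2 1 3

# Convert via character to recover the original values
print(as.numeric(as.character(dose)))    # 10 20 10 30
```

When the levels are simple codes such as 1 and 2 the two results coincide, which is why the issue does not show up in the example above.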
You have already seen how to create vectors of levels using the rep() command. The basic form of the command is:
rep(what, times)
You can use this command to create repeating labels that you can use to create a vector of characters that will become a factor object.
> trt = factor(c(rep('mown', 5), rep('unmown', 4)))
> trt
[1] mown mown mown mown mown unmown unmown unmown unmown
Levels: mown unmown
In this instance you make a factor object directly from five lots of mown and four lots of unmown, which correspond to the two treatments you require.
When you have a balanced design with an equal number of replications, you can use the each instruction like so:
> factor(rep(c('mown', 'unmown'), each = 5))
[1] mown mown mown mown mown unmown unmown unmown unmown unmown
Levels: mown unmown
The each instruction repeats every element the specified number of times before moving on to the next element. You can use the times and each instructions together to create more complicated repeated patterns.
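As a brief sketch of these options, the times instruction can take a vector of counts for unbalanced designs, and each can be combined with times. The object names trt and block are used here purely for illustration.

```r
# Unbalanced design: five mown and four unmown replicates in one call
trt <- factor(rep(c("mown", "unmown"), times = c(5, 4)))
print(table(trt))

# each is applied first, then the whole pattern is repeated times over
block <- rep(c("early", "late"), each = 2, times = 3)
print(block)
```

In the second command every label is doubled up and the resulting four-item pattern is then repeated three times, giving twelve items in all.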
You can also create factor objects using the gl() command. The general form of the command is:
gl(n, k, length = n*k, labels = 1:n)
In this command, n is the number of levels you require and k is the number of replications for each of these levels. You can also set the overall length of the vector you create and add specific text labels to your treatments. For example:
> gl(2, 5, labels = c('mown', 'unmown'))
[1] mown mown mown mown mown unmown unmown unmown unmown unmown
Levels: mown unmown
> gl(2, 1, 10, labels = c('mown', 'unmown'))
[1] mown unmown mown unmown mown unmown mown unmown mown unmown
Levels: mown unmown
> gl(2, 2, 10, labels = c('mown', 'unmown'))
[1] mown mown unmown unmown mown mown unmown unmown mown mown
Levels: mown unmown
In the first case you set two levels and require five replicates; you get five of one level and then five of the other. In the second case you set the number of replicates to 1, but also set the overall length to 10; the result is alternation between the levels until you reach the length required. In the third case you set the number of replicates to be two, and now you get two of each treatment until you reach the required length.
When you have a lot of data you will generally find it more convenient to create it in a spreadsheet and save it as a CSV file. However, for data with relatively few replicates it is useful to be able to make up data objects directly in R. In the following activity, you practice making a fairly simple data object comprising a numeric response variable and two predictor variables.
> higher = c(12, 15, 17, 11, 15)
> lower = c(8, 9, 7, 9)
> middle = c(12, 14, 17, 21, 17)
> daisy = c(higher, lower, middle)
> cutting = c(rep('mow', 5), rep('unmow', 4), rep('sheep', 5))
> time = rep(gl(2, 1, length = 5, labels = c('early', 'late')), 3)[-10]
> flwr = data.frame(daisy, cutting, time)
> rm(higher, lower, middle, daisy, cutting, time)
> flwr
daisy cutting time
1 12 mow early
2 15 mow late
3 17 mow early
4 11 mow late
5 15 mow early
6 8 unmow early
7 9 unmow late
8 7 unmow early
9 9 unmow late
10 12 sheep early
11 14 sheep late
12 17 sheep early
13 21 sheep late
14 17 sheep early
When it comes to adding data to an existing data frame or matrix, you have various options. The following examples illustrate some of the ways you can add data:
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
> grazed
[1] 11 14 17 10 8
> grassy$grazed = grazed
> grassy
mown unmown grazed
Top 12 8 11
Middle 15 9 14
Lower 17 7 17
Set aside 11 9 10
Verge 15 NA 8
In the preceding example you have a new sample and want to add this as a column to your data frame. The sample is the same length as the others so you can add it simply by using the $ syntax. In the next example you use the data.frame() command, but this time you are combining an existing data frame with a vector; this works fine as long as the new vector is the same length as the existing columns:
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
> grassy = data.frame(grassy, grazed)
> grassy
mown unmown grazed
Top 12 8 11
Middle 15 9 14
Lower 17 7 17
Set aside 11 9 10
Verge 15 NA 8
You add a row to a data frame using the [row, column] syntax. In the following example you have a new vector of values that you want to add as a row in your data frame:
> Midstrip
[1] 10 10 12
> grassy['Midstrip',] = Midstrip
> grassy
mown unmown grazed
Top 12 8 11
Middle 15 9 14
Lower 17 7 17
Set aside 11 9 10
Verge 15 NA 8
Midstrip 10 10 12
You have now assigned the appropriate row of the data frame to your new vector of values; note that you give the name in the brackets using quotes.
If the new data are longer than the original data frame, you must expand the data frame to “make room” for the new items; you can do this by assigning NA to new rows as required. In the following example you have a simple data frame and want to add a new column, but this is longer than the original data:
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
> grazed
[1] 11 14 17 10 8 9
> grassy$grazed = grazed
Error in `$<-.data.frame`(`*tmp*`, "grazed", value = c(11, 14, 17, 10, :
replacement has 6 rows, data has 5
When you try to add the new data, you get an error message; there are not enough existing rows to accommodate the new column. In this instance the data frame has named rows; you require only one extra row so you can name the row as you create it:
> grassy['Midstrip',] = NA
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
Midstrip NA NA
> grassy$grazed = grazed
> grassy
mown unmown grazed
Top 12 8 11
Middle 15 9 14
Lower 17 7 17
Set aside 11 9 10
Verge 15 NA 8
Midstrip NA NA 9
Once you have the additional row you can add the new column as before. In this case you added a column that required only a single additional row, but if you needed more you could do this easily:
> grassy[6:10,] = NA
> grassy
mown unmown
Top 12 8
Middle 15 9
Lower 17 7
Set aside 11 9
Verge 15 NA
6 NA NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
You added rows six to ten and set all the values to be NA. Notice, however, that the row names of the additional rows are unset and have a plain numerical index value. You have to reset the names of the rows using the row.names() command:
> row.names(grassy) = c(row.names(grassy)[1:6], "A", "B", "C", "D")
In this case you take the names from the first six rows and add to them the new names you require (in this case, uppercase letters).
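The steps for making room and then adding a longer column can be sketched in one go. This is only one possible approach, and the object names extra and new.rows are assumptions; the data values come from the examples above.

```r
grassy <- data.frame(mown = c(12, 15, 17, 11, 15),
                     unmown = c(8, 9, 7, 9, NA),
                     row.names = c("Top", "Middle", "Lower", "Set aside", "Verge"))
grazed <- c(11, 14, 17, 10, 8, 9)

# Work out how many rows are missing and pad the data frame with NA
extra <- length(grazed) - nrow(grassy)
if (extra > 0) {
  new.rows <- nrow(grassy) + seq_len(extra)
  grassy[new.rows, ] <- NA
}

# Now the new column fits; tidy up the name of the added row
grassy$grazed <- grazed
row.names(grassy)[6] <- "Midstrip"
print(grassy)
```

Calculating the shortfall rather than typing the row numbers means the same lines work however many extra rows are needed.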
When you have a matrix you can add additional rows or columns using the rbind() or cbind() commands as appropriate:
> grassy.m
top upper mid lower bottom
mow 12 15 17 11 15
unmow 8 9 7 9 NA
> grazed
[1] 11 14 17 10 8
> grassy.m = rbind(grassy.m, grazed)
> grassy.m
top upper mid lower bottom
mow 12 15 17 11 15
unmow 8 9 7 9 NA
grazed 11 14 17 10 8
> grassy.m
mow unmow
[1,] 12 8
[2,] 15 9
[3,] 17 7
[4,] 11 9
[5,] 15 NA
> grassy.m = cbind(grassy.m, grazed)
> grassy.m
mow unmow grazed
[1,] 12 8 11
[2,] 15 9 14
[3,] 17 7 17
[4,] 11 9 10
[5,] 15 NA 8
In the first case you use rbind() to add the extra row to the matrix, and in the second case you use cbind() to add an extra column.
You cannot use the $ syntax or square brackets to add columns or rows like you did for the data frame. If you try to add a row, for example, you get an error:
> grassy.m
mown unmown
[1,] 12 8
[2,] 15 9
[3,] 17 7
[4,] 11 9
[5,] 15 NA
> grassy.m[6,] = NA
Error in grassy.m[6, ] = NA : subscript out of bounds
You have to use the rbind() or cbind() commands to add to a matrix. You can, however, create a blank matrix and fill in the blanks later, as the following example shows:
> extra = matrix(nrow = 2, ncol = 2)
> extra
[,1] [,2]
[1,] NA NA
[2,] NA NA
> rbind(grassy.m, extra)
mown unmown
[1,] 12 8
[2,] 15 9
[3,] 17 7
[4,] 11 9
[5,] 15 NA
[6,] NA NA
[7,] NA NA
Here you create a blank matrix by omitting the data, which is filled in with NA items. You give the dimensions, as rows and columns, for the matrix and then use the rbind() command to add this to your existing matrix.
You can also specify the data explicitly like so:
matrix(data = NA, ncol = 2, nrow = 2)
matrix(NA, ncol = 2, nrow = 2)
matrix(data = 0, ncol = 2, nrow = 2)
matrix(data = 'X', ncol = 2, nrow = 2)
In the first two cases you use NA as your data, in the third case you fill the new matrix with the number zero, and in the final case you use an uppercase character X.
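Filling in the blanks later can be done with square brackets once the rows exist. The sketch below is a minimal illustration; it rebuilds the matrix from the values in the text and fills one of the new rows with the first two Midstrip values used earlier.

```r
grassy.m <- cbind(mown = c(12, 15, 17, 11, 15),
                  unmown = c(8, 9, 7, 9, NA))

# Add two blank rows, keeping the result this time
extra <- matrix(NA, nrow = 2, ncol = 2)
grassy.m <- rbind(grassy.m, extra)

# Rows six and seven now exist, so you can assign to them
grassy.m[6, ] <- c(10, 10)
print(grassy.m)
```

Assignment with square brackets fails only when the row or column does not yet exist; after rbind() has extended the matrix, the new cells can be filled like any others.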
Adding rows and columns of data to existing objects is useful, especially when you are dealing with fairly small data sets. You do not always want to resort to your spreadsheet for minor alterations. In the following activity you get a bit of extra practice by adding a column and then a row to a data frame you created in the previous activity.
> flwr
> poa = c(8, 9, 11, 12, 10, 15, 17, 16, 16, 7, 8, 8, 5, 9)
> flwr$poa = poa
> flwr = flwr[c(1,4,2,3)]
> row15 = data.frame(10,18,'mow','early')
> flwr[15,] = row15
> rm(poa, row15)
> flwr
daisy poa cutting time
1 12 8 mow early
2 15 9 mow late
3 17 11 mow early
4 11 12 mow late
5 15 10 mow early
6 8 15 unmow early
7 9 17 unmow late
8 7 16 unmow early
9 9 16 unmow late
10 12 7 sheep early
11 14 8 sheep late
12 17 8 sheep early
13 21 5 sheep late
14 17 9 sheep early
15 10 18 mow early
Summarizing data is an important element of any statistical or analytical process. However complex the statistical process is, you always need to summarize your data in terms of means or medians, and generally break up the data into more manageable chunks. In the simplest of cases you merely need to summarize rows or columns of data, but as the situation becomes more complex, you need to prepare summary information based on combinations of factors.
The summary statistics that you extract can be used to help visualize the situation or to check replication and experimental design. The statistics can also be used as the basis for graphical summaries of the data.
You have various commands at your disposal, and this section starts with simple row/column summaries and builds toward more complex commands.
When you only require a really simple column sum or mean, you can use the colSums() and colMeans() commands. Equivalent commands exist for the rows, too. These are all used in the following example:
> fw
count speed
Taw 9 2
Torridge 25 3
Ouse 15 5
Exe 2 9
Lyn 14 14
Brook 25 24
Ditch 24 29
Fal 47 34
> colMeans(fw)
count speed
20.125 15.000
> colSums(fw)
count speed
161 120
> rowMeans(fw)
Taw Torridge Ouse Exe Lyn Brook Ditch Fal
5.5 14.0 10.0 5.5 14.0 24.5 26.5 40.5
> rowSums(fw)
Taw Torridge Ouse Exe Lyn Brook Ditch Fal
11 28 20 11 28 49 53 81
In the example, the data frame has row names set so the rowMeans() and rowSums() commands show you the means and sums for the named rows. When row names are not set, you end up with a simple numeric vector like so:
> rowSums(mf)
[1] 274.25 262.15 215.75 240.95 227.95 228.75 197.85 264.75 247.95 262.35 267.35
[12] 264.35 259.05 245.85 229.75 247.45 275.35 253.05 201.25 295.05 275.55 176.85
[23] 204.95 218.85 208.75
If you have NA items, you end up with NA as a result; to avoid this you can use the na.rm = TRUE instruction to ignore NA items in the sum or mean calculation like so:
> colSums(bf)
Grass Heath Arable
82 NA NA
> colSums(bf, na.rm = TRUE)
Grass Heath Arable
82 72 90
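Because the bf data are not listed here, the following sketch shows the same effect using the small mown and unmown data frame from earlier in the chapter, which contains a single NA item.

```r
grassy <- data.frame(mown = c(12, 15, 17, 11, 15),
                     unmown = c(8, 9, 7, 9, NA))

# The NA item makes the unmown total NA
print(colSums(grassy))

# Ignoring NA items gives totals of 70 and 33
print(colSums(grassy, na.rm = TRUE))

# The mean of unmown is then based on four values, not five
print(colMeans(grassy, na.rm = TRUE))
```

Keep in mind that with na.rm = TRUE each column may be summarized over a different number of values.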
If some of your data are not numeric, you must specify either the rows (or columns) you want to include or what you want to exclude; the following examples all produce the same result:
> str(mf)
'data.frame': 25 obs. of 6 variables:
$ Length: int 20 21 22 23 21 20 19 16 15 14 ...
$ Speed : int 12 14 12 16 20 21 17 14 16 21 ...
$ Algae : int 40 45 45 80 75 65 65 65 35 30 ...
$ NO3 : num 2.25 2.15 1.75 1.95 1.95 2.75 1.85 1.75 1.95 2.35 ...
$ BOD : int 200 180 135 120 110 120 95 168 180 195 ...
$ site : Factor w/ 5 levels "Exe","Lyn","Ouse",..: 4 4 4 4 4 5 5 5 5 5 ...
> colMeans(mf[-6])
> colMeans(mf[1:5])
> colMeans(mf[c(1,2,3,4,5)])
Length Speed Algae NO3 BOD
19.640 15.800 58.400 2.046 145.960
At the beginning you see that the data frame has five columns of numeric data and one character vector (actually, it is a factor). In the first case you exclude the factor column using [-6]; in the second case you specify columns one to five using [1:5]. In the last case you list all five columns you require explicitly.
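If you would rather not count columns by hand, one possible alternative is to let R pick out the numeric columns for you. This sketch uses sapply() with is.numeric(), which are not covered in the text above, and rebuilds part of the flwr data from the earlier activity; the name num.cols is an assumption.

```r
flwr <- data.frame(
  daisy = c(12, 15, 17, 11, 15, 8, 9, 7, 9, 12, 14, 17, 21, 17),
  poa = c(8, 9, 11, 12, 10, 15, 17, 16, 16, 7, 8, 8, 5, 9),
  cutting = factor(rep(c("mow", "unmow", "sheep"), times = c(5, 4, 5)))
)

# TRUE for each numeric column, FALSE for the factor
num.cols <- sapply(flwr, is.numeric)
print(num.cols)

# Use the logical vector to select the columns to summarize
print(colMeans(flwr[num.cols]))
```

This approach keeps working if columns are added or reordered, whereas index numbers such as [1:5] would need to be updated.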
Although these commands are useful, they are somewhat limited and are intended as convenience commands. They are also only useful when your data are all numeric, and you may very well have data comprising numeric response variables and factor predictor variables. In these cases you can call upon a range of other summary commands, as you see shortly.
When you have complicated data you often have a mixture of numeric and factor variables. The simple colMeans() and colSums() commands are not sufficient to extract information from these data. Fortunately, you have a variety of commands that you can use to summarize your data, and you have seen some of these before. Here you see an overview of some of these methods.
To help illustrate the options, start by taking a numeric data frame and adding a factor, in this case a simple vector of site names:
> mfnames = c(rep('Taw',5), rep('Torridge',5), rep('Ouse',5), rep('Exe',5),
rep('Lyn',5))
> mf$site = factor(mfnames)
> str(mf)
'data.frame': 25 obs. of 6 variables:
$ Length: int 20 21 22 23 21 20 19 16 15 14 ...
$ Speed : int 12 14 12 16 20 21 17 14 16 21 ...
$ Algae : int 40 45 45 80 75 65 65 65 35 30 ...
$ NO3 : num 2.25 2.15 1.75 1.95 1.95 2.75 1.85 1.75 1.95 2.35 ...
$ BOD : int 200 180 135 120 110 120 95 168 180 195 ...
$ site : Factor w/ 5 levels "Exe","Lyn","Ouse",..: 4 4 4 4 4 5 5 5 5 5 ...
Now that you have a suitable practice sample, it is time to look at some of the complex summary functions that you can use.
You can calculate the sums of rows in a data frame or matrix and group the sums according to some factor or grouping variable. In the following example, you use the rowsum() command to determine the sums for each of the sites that are listed in the site column:
> rowsum(mf[1:5], group = mf$site)
Length Speed Algae NO3 BOD
Exe 88 83 235 7.15 859
Lyn 110 73 355 12.95 534
Ouse 102 76 325 10.35 753
Taw 107 74 285 10.05 745
Torridge 84 89 260 10.65 758
The result shows all the sites listed, and for each numeric variable you have the sum. Note that you specified the columns in the data using [1:5]; these are the numeric columns. You could also eliminate the non-numeric column like so:
> rowsum(mf[-6], mf$site)
In this case the sixth column contained the grouping variable. You can also specify a single column using its name (in quotes):
> rowsum(mf['Length'], mf$site)
Length
Exe 88
Lyn 110
Ouse 102
Taw 107
Torridge 84
When you have a matrix, your grouping variable must be separate because all the data in a matrix are of the same type. In the following example, you create a simple vector specifying the groupings:
> bird
Garden Hedgerow Parkland Pasture Woodland
Blackbird 47 10 40 2 2
Chaffinch 19 3 5 0 2
Great Tit 50 0 10 7 0
House Sparrow 46 16 8 4 0
Robin 9 3 0 0 2
Song Thrush 4 0 6 0 0
> grp = c(1,1,1,2,2,3)
> rowsum(bird, grp)
Garden Hedgerow Parkland Pasture Woodland
1 116 13 55 9 4
2 55 19 8 4 2
3 4 0 6 0 0
The group vector must be the same length as the number of rows in your matrix; in this case, six rows of data. You might also create a character vector as in the following example:
> grp = c('black', 'color', 'color', rep('brown', 3))
> grp
[1] "black" "color" "color" "brown" "brown" "brown"
> rowsum(bird, grp)
Garden Hedgerow Parkland Pasture Woodland
black 47 10 40 2 2
brown 59 19 14 4 2
color 69 3 15 7 2
It is also possible to specify part of the matrix using a grouping contained within the original matrix:
> rowsum(bird[,1:4], bird[,5])
Garden Hedgerow Parkland Pasture
0 100 16 24 11
2 75 16 45 2
Here you use the last column as the grouping, and the result shows the group labels (as numbers). A grouping taken from within a numeric matrix is necessarily numeric, because a matrix can contain only data of a single type.
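So that you can try the rowsum() examples yourself, the following sketch rebuilds the bird matrix from the values printed above and repeats the grouping by a character vector. The result name res is an assumption.

```r
bird <- matrix(c(47, 10, 40, 2, 2,
                 19,  3,  5, 0, 2,
                 50,  0, 10, 7, 0,
                 46, 16,  8, 4, 0,
                  9,  3,  0, 0, 2,
                  4,  0,  6, 0, 0),
               nrow = 6, byrow = TRUE,
               dimnames = list(
                 c("Blackbird", "Chaffinch", "Great Tit",
                   "House Sparrow", "Robin", "Song Thrush"),
                 c("Garden", "Hedgerow", "Parkland", "Pasture", "Woodland")))

# One group label per row of the matrix
grp <- c("black", "color", "color", "brown", "brown", "brown")

res <- rowsum(bird, grp)
print(res)
```

Notice that the groups in the result are sorted alphabetically rather than appearing in the order they occur in grp.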
You can use the apply() command to apply a function over all the rows or columns of a data frame (or matrix). To use it, you specify the rows or columns that you require, whether you want to apply the function to the rows or columns, and finally, the actual function you want, like so:
apply(X, MARGIN, FUN, ...)
You replace the MARGIN part with a numeric value: 1 = rows and 2 = columns. You can also add other instructions if they are related to the function you are going to use; for example, you can exclude NA items, using na.rm = TRUE. In the following case you use the apply() command to apply the median to the first five columns of your data frame:
> apply(mf[1:5], 2, median)
Length Speed Algae NO3 BOD
20.00 16.00 65.00 1.95 145.00
You put the columns you require in the square brackets; in this case you used [1:5]. Because your object is a data frame, you can give the column indices on their own; more properly you should use [row, col] syntax:
> apply(mf[,1:5], 2, median)
Here you added the comma, saying in effect that you want all the rows but only columns one through five. If you want to apply your function to the rows, you simply switch the numeric value in the MARGIN part:
> apply(mf[,1:5], 1, median)
[1] 20 21 22 23 21 21 19 16 16 21 21 26 21 20 19 18 17 19 21 21 22 25 24 23 22
Notice that you have not specified MARGIN or FUN in the command, but have used a shortcut. R commands have a default order for instructions; so as long as you put the arguments in the default order you do not need to name them. If you do name them then the instructions can appear in any order. The full version for the preceding example would be written like so:
> apply(X = mf[,1:5], MARGIN = 1, FUN = median)
The apply() command enables you to use a wider variety of commands on rows and columns than the rowSums() or colMeans() commands, which obviously are limited to sum() and mean(). However, you can use apply() only on entire rows or columns that are discrete samples. When you have grouping variables, you need a different approach.
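As a brief sketch of the extra instructions mentioned earlier, the following example passes na.rm = TRUE through apply() to the summary function, and then uses a small function written on the spot. The data frame dat and its values are hypothetical and were made up for this illustration.

```r
# Hypothetical data frame with one missing value
dat = data.frame(a = c(1, 2, NA, 4),
                 b = c(10, 20, 30, 40),
                 c = c(5, 5, 5, 5))

# Without na.rm the NA item carries through to the result for column a
apply(dat, 2, median)

# Arguments placed after FUN are handed on to the function itself
col.med = apply(dat, MARGIN = 2, FUN = median, na.rm = TRUE)
col.med

# Any function that accepts a vector can be used, including your own;
# here each row is reduced to its range (largest minus smallest value)
row.range = apply(dat, 1, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
row.range
```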
The summary commands you have looked at so far have enabled you to look at entire rows or columns; only the rowsum() command lets you take into account a grouping variable. When you have grouping variables in the form of predictor variables, for example, you can use the tapply() command to take into account one or more factors as grouping variables.
The following illustrates a fairly simple example where you have a data frame comprising several numeric columns, and a single column that is a grouping variable (a factor):
> tapply(mf$Length, mf$site, FUN = sum)
Exe Lyn Ouse Taw Torridge
88 110 102 107 84
The tapply() command works only on a single vector at a time; in this instance you choose the Length column using the $ syntax. Next you specify the INDEX that you want to use; in other words, the grouping variable. Finally, you select the function that you want to apply; here you choose the sum. The general form of the command is as follows:
tapply(X, INDEX, FUN = NULL, ...)
If you omit the FUN part, or set it to NULL, you get a vector that relates to the INDEX. This is easiest to see in an example:
> tapply(mf$Length, mf$site, FUN = NULL)
[1] 4 4 4 4 4 5 5 5 5 5 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
If you refer to the original data you will see that the fourth block of rows (rows 16 to 20) belongs to the Exe site; because Exe is alphabetically first, it is level 1 of the factor and so those rows are numbered 1. The result is a vector with one value per row of the original data, giving the number of the group (the factor level) to which that row belongs.
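A small sketch may make the relationship between the rows and the level numbers easier to see. The site and len vectors here are hypothetical stand-ins, deliberately entered in non-alphabetical order.

```r
# Hypothetical data: three sites entered in non-alphabetical order
site = factor(c("Taw", "Taw", "Exe", "Exe", "Lyn"))
len  = c(20, 22, 17, 18, 23)

# Factor levels are sorted alphabetically by default
levels(site)

# With FUN = NULL you get one level number per row of the data
idx = tapply(len, site, FUN = NULL)
idx

# For a single grouping factor this matches the factor's integer codes
all(as.integer(idx) == as.integer(site))
```

The Taw rows come first in the data but are numbered 3, because Taw is the third level alphabetically.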
When you have more than one grouping variable, you can list several factors to be your INDEX. In the following example you have a data frame comprising a column of numeric data and two factor columns:
> str(pw)
'data.frame': 18 obs. of 3 variables:
$ height: int 9 11 6 14 17 19 28 31 32 7 ...
$ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
$ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...
> tapply(pw$height, list(pw$plant, pw$water), mean)
hi lo mid
sativa 39.66667 6.000000 15.33333
vulgaris 30.33333 8.666667 16.66667
This time you specify the columns you want to use as grouping variables in a list() command; there are only two variables here and the first one becomes the rows of the result and the second becomes the columns.
If you have more than two grouping variables, the result is subdivided into more tables as required. In the following example you have an extra factor column and use all three factors as grouping variables:
> str(pw)
'data.frame': 18 obs. of 4 variables:
$ height: int 9 11 6 14 17 19 28 31 32 7 ...
$ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
$ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...
$ season: Factor w/ 2 levels "spring","summer": 1 2 2 1 2 2 1 2 2 1 ...
> tapply(pw$height, list(pw$plant, pw$water, pw$season), mean)
, , spring
hi lo mid
sativa 44 7 14
vulgaris 28 9 14
, , summer
hi lo mid
sativa 37.5 5.5 16
vulgaris 31.5 8.5 18
In this case the third grouping variable has two levels, which results in two tables, one for spring and one for summer. The result is presented as a kind of R object called an array; this can have any number of dimensions, but in this case you have three. If you look at the structure of the result using the str() command, you can see how the dimensions are set:
> pw.tap = tapply(pw$height, list(pw$plant, pw$water, pw$season), mean)
> str(pw.tap)
num [1:2, 1:3, 1:2] 44 28 7 9 14 14 37.5 31.5 5.5 8.5 ...
- attr(*, "dimnames")=List of 3
..$ : chr [1:2] "sativa" "vulgaris"
..$ : chr [1:3] "hi" "lo" "mid"
..$ : chr [1:2] "spring" "summer"
You can see that the first dimension is related to the plant variable, the second is related to the water variable, and the third is related to the season variable; in other words, the dimensions are in the same order as you specified in the tapply() command.
You can use the square brackets to extract parts of your result object, but now you have the extra dimension to take into account. To extract part of the result object you need to specify three values in the square brackets (corresponding to each of the three dimensions, plant, water, and season). In the following example you select a single item from the pw.tap result object by specifying a single value for each of the three dimensions.
> pw.tap[1,1,1]
[1] 44
The item you selected corresponds to the first plant (sativa), the first water treatment (hi), and the first season (spring), and you see the result, 44. If you want to see several items you can specify multiple values for any dimension. In the following example you select two values for the plant dimension (1:2), which will display a result for both sativa and vulgaris.
> pw.tap[1:2,1,1]
sativa vulgaris
44 28
The two result values (44 and 28) correspond to the first water treatment (hi) and the first season (spring). In the following example you select multiple values for the first two dimensions (plant and water) but only a single value for the third (season).
> pw.tap[1:2,1:3,1]
hi lo mid
sativa 44 7 14
vulgaris 28 9 14
Now you can see that you have selected all of the plant and water treatments but only a single season (spring).
The result is an array object, and as you have seen, it can have multiple dimensions. You can use the class() command to determine that the result is indeed an array object:
> class(pw.tap)
[1] "array"
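You can also index an array by name rather than by position, which is often easier to read. The following sketch rebuilds the result with the array() command so that it runs on its own; the values are copied from the tables shown above, but building the object this way is only a stand-in for the tapply() call.

```r
# Stand-in for the tapply() result, using the means displayed above
pw.tap = array(c(44, 28, 7, 9, 14, 14,          # spring
                 37.5, 31.5, 5.5, 8.5, 16, 18), # summer
               dim = c(2, 3, 2),
               dimnames = list(c("sativa", "vulgaris"),
                               c("hi", "lo", "mid"),
                               c("spring", "summer")))

# Use the dimension labels instead of numbers
pw.tap["sativa", "hi", "spring"]

# Leaving a dimension blank selects everything in that dimension
pw.tap[, , "summer"]

# Both seasons for a single plant and water treatment
pw.tap["vulgaris", "mid", ]
```

Indexing by name keeps working even if the order of the factor levels changes, whereas numeric positions would then point at different items.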
Summarizing data using grouping variables is an important task that you will need to undertake often. Most often you will need to check means or medians for the groups, but many other functions can be useful. In the following activity you practice using the mean, but you could try out some other functions (for example, median, sum, sd, or length).
> flwr
> tapply(flwr$daisy, flwr$cutting, mean)
mow sheep unmow
13.33333 16.20000 8.25000
> tapply(flwr$daisy, list(flwr$cutting, flwr$time), mean)
early late
mow 13.50000 13.0
sheep 15.33333 17.5
unmow 7.50000 9.0
> tapply(flwr$daisy, list(flwr$time, flwr$cutting), mean)
mow sheep unmow
early 13.5 15.33333 7.5
late 13.0 17.50000 9.0
> with(flwr, tapply(poa, list(cutting, time), mean))
early late
mow 11.75 10.5
sheep 8.00 6.5
unmow 15.50 16.5
> with(flwr, tapply(poa, list(time, cutting), mean))
mow sheep unmow
early 11.75 8.0 15.5
late 10.50 6.5 16.5
The tapply() command is very useful, but you may only want to have a simple table/matrix as your result rather than a complicated array. It is possible to summarize a data object using multiple grouping factors using other commands, as you see next.
The aggregate() command enables you to compute summary statistics for subsets of a data frame or matrix; the result comes out as a single data frame rather than an array item, even with multiple grouping factors. The general form of the command is as follows:
aggregate(x, by, FUN, ...)
You specify the data you want followed by a list() of the grouping variables and the function you want to use. In the following example you use a single grouping variable and the sum() function:
> aggregate(mf[1:5], by = list(mf$site), FUN = sum)
Group.1 Length Speed Algae NO3 BOD
1 Exe 88 83 235 7.15 859
2 Lyn 110 73 355 12.95 534
3 Ouse 102 76 325 10.35 753
4 Taw 107 74 285 10.05 745
5 Torridge 84 89 260 10.65 758
In this case you specify all the numeric columns in the data frame, and the result shows the sum of each of the groups represented by the grouping variable (site).
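The Group.1 heading in the result is a default label. As a minimal sketch, naming the element inside the list() gives the grouping column a more meaningful heading. The data frame dat is a hypothetical, cut-down stand-in for mf with invented values.

```r
# Hypothetical stand-in for the mf data frame (values invented)
dat = data.frame(Length = c(20, 21, 17, 18, 23, 22),
                 Speed  = c(12, 14, 17, 16, 15, 13),
                 site   = factor(c("Taw", "Taw", "Exe", "Exe", "Lyn", "Lyn")))

# Naming the list element replaces the default Group.1 heading
res = aggregate(dat[1:2], by = list(site = dat$site), FUN = sum)
res

# The grouping column now carries the name you supplied
names(res)
```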
You can also use the aggregate() command with a formula syntax; in this case you specify the response variable to the left of the ~ and the predictor variables to the right, like so:
> aggregate(Length ~ site, data = mf, FUN = mean)
site Length
1 Exe 17.6
2 Lyn 22.0
3 Ouse 20.4
4 Taw 21.4
5 Torridge 16.8
This allows a slightly simpler command because you do not need the $ signs as long as you specify where the data are to be found. In this case you chose a single response variable (Length) and the mean() as your summary function. You can select several response variables at once by wrapping them in a cbind() command:
> aggregate(cbind(Length, BOD) ~ site, data = mf, FUN = mean)
site Length BOD
1 Exe 17.6 171.8
2 Lyn 22.0 106.8
3 Ouse 20.4 150.6
4 Taw 21.4 149.0
5 Torridge 16.8 151.6
Here you chose two response variables (Length and BOD), which are given in the cbind() command (thus making a temporary matrix). You can select all the variables by using a period instead of any names to the left of the ~:
> aggregate(. ~ site, data = mf, FUN = mean)
site Length Speed Algae NO3 BOD
1 Exe 17.6 16.6 47 1.43 171.8
2 Lyn 22.0 14.6 71 2.59 106.8
3 Ouse 20.4 15.2 65 2.07 150.6
4 Taw 21.4 14.8 57 2.01 149.0
5 Torridge 16.8 17.8 52 2.13 151.6
Here you use all the variables in the data frame. This works only if the remaining variables are all numeric; if you have other character variables, you need to specify the columns you want explicitly.
Because of the nature of the output/result, some people may find the aggregate() command more useful in presenting summary statistics than the tapply() command discussed earlier. In the following example, you have a data frame with one response variable and three predictor variables:
> str(pw)
'data.frame': 18 obs. of 4 variables:
$ height: int 9 11 6 14 17 19 28 31 32 7 ...
$ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
$ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...
$ season: Factor w/ 2 levels "spring","summer": 1 2 2 1 2 2 1 2 2 1 ...
> pw.agg = aggregate(height ~ plant * water * season, data = pw, FUN = mean)
> pw.agg
plant water season height
1 sativa hi spring 44.0
2 vulgaris hi spring 28.0
3 sativa lo spring 7.0
4 vulgaris lo spring 9.0
5 sativa mid spring 14.0
6 vulgaris mid spring 14.0
7 sativa hi summer 37.5
8 vulgaris hi summer 31.5
9 sativa lo summer 5.5
10 vulgaris lo summer 8.5
11 sativa mid summer 16.0
12 vulgaris mid summer 18.0
The result is a simple data frame, and this can make it easier to extract the components than for the array result you had previously with the tapply() command.
You could have achieved the same result by using the period instead of the grouping variable names like so:
> aggregate(height ~ ., data = pw, FUN = mean)
So, like the previous example, the period means, “everything else not already named.” If you replace the period with a number 1, you get quite a different result:
> aggregate(height ~ 1, data = pw, FUN = mean)
height
1 19.44444
You get the overall mean value here; essentially you have said, “don’t use any grouping variables.”
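The following sketch places the three forms of the formula side by side. The data frame dat is hypothetical (a small invented data set with one response and two factors), so the numbers will not match those from pw.

```r
# Hypothetical data: one response variable and two grouping factors
dat = data.frame(height = c(9, 11, 6, 14, 17, 19, 28, 31),
                 plant  = factor(rep(c("vulgaris", "sativa"), each = 4)),
                 water  = factor(rep(c("lo", "hi"), times = 4)))

# Grouping variables named explicitly
a1 = aggregate(height ~ plant + water, data = dat, FUN = mean)

# The period stands for every column not already named
a2 = aggregate(height ~ ., data = dat, FUN = mean)
all.equal(a1, a2)

# The number 1 means no grouping at all, giving the overall mean
a3 = aggregate(height ~ 1, data = dat, FUN = mean)
a3$height
mean(dat$height)

# The result is a data frame, so ordinary subsetting picks out rows
a1[a1$water == "hi", ]
```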
The aggregate() command is very powerful, partly because you can use the formula syntax and partly because of the output, which is a single data frame. In the following activity you practice the command with the mean as the summary function, but you could try some others (for example, median, sum, sd, or length).
> flwr
> aggregate(flwr$daisy, by = list(flwr$cutting), FUN = mean)
Group.1 x
1 mow 13.33333
2 sheep 16.20000
3 unmow 8.25000
> aggregate(flwr$daisy, by = list(flwr$cutting, flwr$time), FUN = mean)
Group.1 Group.2 x
1 mow early 13.50000
2 sheep early 15.33333
3 unmow early 7.50000
4 mow late 13.00000
5 sheep late 17.50000
6 unmow late 9.00000
> aggregate(flwr[1:2], by = list(flwr$cutting, flwr$time), FUN = mean)
Group.1 Group.2 daisy poa
1 mow early 13.50000 11.75
2 sheep early 15.33333 8.00
3 unmow early 7.50000 15.50
4 mow late 13.00000 10.50
5 sheep late 17.50000 6.50
6 unmow late 9.00000 16.50
> aggregate(poa ~ cutting + time, data = flwr, FUN = mean)
cutting time poa
1 mow early 11.75
2 sheep early 8.00
3 unmow early 15.50
4 mow late 10.50
5 sheep late 6.50
6 unmow late 16.50
> aggregate(. ~ cutting + time, data = flwr, FUN = mean)
cutting time daisy poa
1 mow early 13.50000 11.75
2 sheep early 15.33333 8.00
3 unmow early 7.50000 15.50
4 mow late 13.00000 10.50
5 sheep late 17.50000 6.50
6 unmow late 9.00000 16.50
> aggregate(cbind(poa, daisy) ~ cutting + time, data = flwr, FUN = mean)
> aggregate(cbind(poa, daisy) ~ ., data = flwr, FUN = mean)
> aggregate(cbind(poa, daisy) ~ 1, data = flwr, FUN = mean)
poa daisy
1 11.26667 12.93333
Exercises
You can find answers to these exercises in Appendix A.
What You Learned In This Chapter
Topic | Key Points |
Making data items: length() | Vectors need to be the same length if they are to be added to a data frame or matrix. The length() command can query the current length or alter it. Setting the length to shorter than current truncates the item and making it longer adds NA items to the end. |
Making data items: names() row.names() rownames() colnames() | You can use several commands to query and alter names of columns or rows. The rownames() and colnames() commands are used for matrix objects, whereas the names() and row.names() commands work on data frames. |
Stacking separate vectors: stack() | You can use the stack() command to combine vectors and so form a data frame suitable for complex analysis. This really only works when you have a single response variable and a single predictor. |
Removing NA items: na.omit() | The na.omit() command strips out unwanted NA items from vectors and data frames. |
Repeated elements: rep() gl() | You can generate repeated elements, such as character labels that will form a predictor variable, by using the rep() or gl() commands. Both commands enable you to generate multiple levels of a variable based on a repeating pattern. |
Factor elements: factor() as.factor() levels() nlevels() as.numeric() | A factor is a special sort of character variable. You can force a vector to be treated as a factor by using the factor() or as.factor() commands. The levels() command shows the different levels of a factor variable, and the nlevels() command returns the number of discrete levels of a factor variable. A factor can be returned as a list of numbers by using the as.numeric() command. |
Constructing a data frame: data.frame() | Several objects can be combined to form a data frame using the data.frame() command. |
Constructing a matrix: matrix() cbind() rbind() | You can create a matrix in two main ways: by assembling a vector into rows and columns using the matrix() command or by combining other elements. You can combine elements by column using the cbind() command or by row using the rbind() command. |
Simple row or column sums or means: rowSums() colSums() rowMeans() colMeans() | Numerical objects (data frames and matrix objects) can be examined using simple row/column sums or means using rowSums(), colSums(), rowMeans(), or colMeans() commands as appropriate. These cannot take into account any grouping variable. |
Simple sum using a grouping variable: rowsum() | The rowsum() command enables you to add up columns based on a grouping variable. The result is a series of rows of sums (hence the command name). |
Apply a command to rows or columns: apply() | The apply() command enables you to give a command across rows or columns of a data frame or matrix. |
Use a grouping variable with any command: tapply() | The tapply() command enables the use of grouping variables and can utilize any command (for example, mean, median, sd), which is applied to a single vector (or element of a data frame or matrix). |
Array objects: object[x, y, z, ...] | An array object has more than two dimensions; that is, it cannot be described simply by rows and columns. An array is typically generated by using the tapply() command with more than two grouping variables. The resulting array has several dimensions, each one relating to a grouping variable. The array itself can be subdivided using the square brackets and identifying the appropriate dimensions. |
Summarize using a grouping variable: aggregate() | The aggregate() command can utilize any command and number of grouping variables. The result is a two-dimensional data frame; regardless of how many grouping variables are used. |