Chapter 9

Manipulating Data and Extracting Components

What You Will Learn in this Chapter:

  • How to create data frames and matrix objects ready for complex analyses
  • How to create or set factor data
  • How to add rows and columns to data objects
  • How to use simple summary commands to extract column or row information
  • How to extract summary statistics from complex data objects

The world can be a complicated place, and the data you have can also be correspondingly complicated. You saw in the previous chapter how to use analysis of variance (ANOVA) via the aov() command to help make sense of complicated data. This chapter builds on this knowledge by walking you through the process of creating data objects prior to carrying out a complicated analysis.

This chapter has two main themes. To start, you look at ways to create and manipulate data to produce the objects you require to carry out these complex analyses. Later in the chapter you look at methods to extract the various components of a complicated data object. You have seen some of these commands before and others are new.

Creating Data for Complex Analysis

To begin with, you need to have some data to work on. You can construct your data in a spreadsheet and have it ready for analysis in R, or you may have to construct the data from various separate elements. This section covers the latter scenario.

When you need to carry out a complex analysis, the likelihood is that you will have to make a complex data object. The more complicated the situation you are examining, the more important it is that your data are arranged in a sensible fashion. In general, this means that you should have a column for each variable that you are dealing with—usually this means a column containing the response variable and additional columns each containing a predictor variable.

You have already seen various ways to create data items:

  • Using the c() command to create simple vectors
  • Using the scan() command to create vectors from the keyboard, clipboard, or a file from disk
  • Using the read.table() command to read data previously prepared in a spreadsheet or some other program

If you read data from another application, like a spreadsheet, it is likely that your data are already in the layout you require. If you have individual vectors of data, you need to construct data frames and matrix objects before you can carry out the business of complex analysis.

Data Frames

The data frame is probably the most useful kind of data object for complex analyses because you can have columns containing a variety of data types. For example, you can have columns containing numeric data and other columns containing factor data. This is unlike a matrix where all the data must be of one type (for example, all numeric or all text).

To make a data frame you simply use the data.frame() command, and type the names of the objects that will form the columns into the parentheses. However, you need to ensure that all the objects are of the same length. The following example contains two simple vectors of numerical data that you want to make into a data frame. They have different lengths, so you need to alter the shorter one and add NA items to pad it out:

> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9
> length(unmow) = length(mow)
> unmow
[1]  8  9  7  9 NA
> grassy = data.frame(mow, unmow)
> grassy
  mow unmow
1  12     8
2  15     9
3  17     7
4  11     9
5  15    NA

The length() command is usually used to query the length of an object, but here you use it to alter the original data by setting its length to be the same as the longer item. If you use a value that turns out to be shorter than the current length, your object is truncated and the extra data are removed.
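
The padding step can also be written so that the target length is worked out rather than typed in. The following is a minimal sketch using the same sample values; the name n is an assumption introduced only for this illustration:

```r
# Sample values from the example above
mow = c(12, 15, 17, 11, 15)
unmow = c(8, 9, 7, 9)

# n is a hypothetical helper name: the length of the longer vector
n = max(length(mow), length(unmow))

# Setting the length pads the shorter vector with NA;
# a vector that is already n items long is left unchanged
length(mow) = n
length(unmow) = n

grassy = data.frame(mow, unmow)
grassy
```

Because both vectors are set to the same length, it does not matter which of the two was originally the shorter one.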

You can use a variety of other commands to set the names of the columns, and also add names for the individual rows. The following example looks at the main column names using the names() command:

> names(grassy)
[1] "mow"   "unmow"
> names(grassy) = c('mown', 'unmown')
> names(grassy)
[1] "mown"   "unmown"

Here, you query the column names and then set them to new values. You can do something similar with row names. In the following example you create a vector of names first and then set them using the row.names() command:

> grn = c('Top', 'Middle', 'Lower', 'Set aside', 'Verge')
> row.names(grassy)
[1] "1" "2" "3" "4" "5"
> row.names(grassy) = grn
> row.names(grassy)
[1] "Top"       "Middle"    "Lower"     "Set aside" "Verge"   

Notice that the original row names are a simple index and appear as characters when you query them. The newly renamed data frame appears like this:

> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA

You may prefer to have your data frame in a different layout, with one column for the response variable and one for the predictor (in most cases this is preferable). In the current example you would have one column for the numerical values, and one to hold the treatment names (mown or unmown). You can do this in several ways, depending on where you start.

In this case you already have a data frame and can convert it using the stack() command:

> stack(grassy)
   values    ind
1      12   mown
2      15   mown
3      17   mown
4      11   mown
5      15   mown
6       8 unmown
7       9 unmown
8       7 unmown
9       9 unmown
10     NA unmown

Now you have the result you want, but you have an NA item that you do not really need. You can use na.omit() to strip out the NA items that may occur:

> na.omit(stack(grassy))
  values    ind
1     12   mown
2     15   mown
3     17   mown
4     11   mown
5     15   mown
6      8 unmown
7      9 unmown
8      7 unmown
9      9 unmown

The column names are set to the defaults of values and ind. You can use the names() command to alter them afterward. The stack() command really only works when you have a simple situation with all samples being related to a single predictor variable. When you need multiple columns with several predictor variables, you need a different approach.
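
The stacking, NA removal, and renaming steps can be combined as in the following sketch. The replacement column names rich and trt are illustrative choices, not something stack() provides:

```r
grassy = data.frame(mown = c(12, 15, 17, 11, 15),
                    unmown = c(8, 9, 7, 9, NA))

# Stack the columns into one, then drop any rows holding NA
grass.long = na.omit(stack(grassy))

# Replace the default names (values and ind) with more meaningful ones
names(grass.long) = c('rich', 'trt')
grass.long
```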

When you need to create vectors of treatment names you are repeating the same names over and over according to how many replicates you have. You can use the rep() command to generate repeating items and take some of the tedium out of the process. In the following example and subsequent steps, you use the rep() command to make labels to match up with the two samples you have (mow and unmow):

> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9

> trt = c(rep('mow', length(mow)), rep('unmow', length(unmow)))
> trt
[1] "mow"   "mow"   "mow"   "mow"   "mow"   "unmow" "unmow" "unmow" "unmow"

> rich = c(mow, unmow)
> data.frame(rich, trt)
  rich   trt
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow
1. To begin, create a new object to hold your predictor variable, and use the rep() command to repeat the names for the two treatments as many times as is necessary to match the number of observations. The basic form of the rep() command is:
rep(what, times)
2. In this case you want a character name, so enclose the name in quotation marks. You could also use a numerical value for the number of repeats, but here you use the length() command to work out how many times to repeat the labels for each of the two samples.
3. Create the final data object by joining together the response vectors as one column and the new vector of names representing the treatments (the predictor variable). The data.frame() command does the actual joining. Notice that in this example a name is not specified for the final data frame; if you want to use the data frame for some analysis (quite likely), you should give the new frame a name like so:
> grass.dat = data.frame(rich, trt)

The rep() command is useful to help you create repeating elements (like factors) and you will see it again shortly. Before then, you look at creating matrix objects.

Matrix Objects

A matrix can be thought of as a single vector of data that is conveniently split up into rows and columns. You can make a matrix object in several ways:

  • If you have vectors of data you can assemble them in rows or columns using the rbind() or cbind() commands.
  • If you have a single vector of values you can use the matrix() command.

The following examples and subsequent steps illustrate the two methods:

> mow ; unmow
[1] 12 15 17 11 15
[1] 8 9 7 9
> length(unmow) = length(mow)
> cbind(mow, unmow)
     mow unmow
[1,]  12     8
[2,]  15     9
[3,]  17     7
[4,]  11     9
[5,]  15    NA
1. Begin with two vectors of numeric values, and because they are of unequal length, use the length() command to extend the shorter one.
2. Next use the cbind() command to bind together the vectors as columns in a matrix. If you want your vectors to be the rows, you use the rbind() command like so:
> rbind(mow,unmow)
      [,1] [,2] [,3] [,4] [,5]
mow     12   15   17   11   15
unmow    8    9    7    9   NA
3. Notice that you end up with names for one margin in your matrix but not the other; in the first example the row names are not set, and in the second example the column names are not set. You can set the row names or column names using the rownames() or colnames() commands.

If you have your data as one single vector, you can use an alternative method to make a matrix using the matrix() command. This command takes a single vector and splits it into a matrix with the number of rows or columns that you specify. This means that the length of your vector should be exactly divisible by the number of rows or columns that you require; if it is not, R recycles values to fill the gaps and issues a warning. In the following example and subsequent steps you have a single vector of values that you use to create a matrix:

> rich
[1] 12 15 17 11 15  8  9  7  9
> length(rich) = 10
> rich
 [1] 12 15 17 11 15  8  9  7  9 NA

> matrix(rich, ncol = 2)
     [,1] [,2]
[1,]   12    8
[2,]   15    9
[3,]   17    7
[4,]   11    9
[5,]   15   NA
1. Start by making sure your original data are the correct length for your matrix and, as before, use the length() command to extend it.
2. Next use the matrix() command to create a matrix with two columns. The command reads along the vector and splits it at intervals appropriate to create the columns you asked for. This has consequences for how the data finally appear; if you use the nrow = instruction to specify how many rows you require (rather than ncol), the data will not end up in their original samples because the matrix is populated column by column:
> mow ; unmow
[1] 12 15 17 11 15
[1]  8  9  7  9 NA
> matrix(rich, nrow = 2)
     [,1] [,2] [,3] [,4] [,5]
[1,]   12   17   15    9    9
[2,]   15   11    8    7   NA
3. If you wish to create a matrix in rows, use the byrow = TRUE instruction:
> matrix(rich, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,]   12   15   17   11   15
[2,]    8    9    7    9   NA

As with the first method, none of the margin names are set when you use the matrix() command; you need to use the rownames() or colnames() commands to set them.
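
As a brief sketch of setting both sets of margin names on a matrix made with matrix(), consider the following; the column names used here (top, upper, and so on) are simply labels chosen for the five sampling positions:

```r
rich = c(12, 15, 17, 11, 15, 8, 9, 7, 9, NA)

# Fill by row so that each sample occupies one row
grassy.m = matrix(rich, nrow = 2, byrow = TRUE)

# Neither margin has names yet, so set both
rownames(grassy.m) = c('mow', 'unmow')
colnames(grassy.m) = c('top', 'upper', 'mid', 'lower', 'bottom')
grassy.m
```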

Creating and Setting Factor Data

When you create data for complex analysis, like analysis of variance, you create vectors for both the response variables and the predictor variables. The response variables are generally numeric, but the predictor variables may well be characters and refer to names of treatments. Alternatively, they may be simple numeric values with each number representing a separate treatment. When you create a data frame that contains numeric and character vectors, the character vectors are regarded as being factors (this is the default behavior in versions of R before 4.0.0; from R 4.0.0 onward you need to add the stringsAsFactors = TRUE instruction to the data.frame() command to get the same result). In the following example you can see a simple data frame created from a numeric vector and a character vector:

> rich ; graze
[1] 12 15 17 11 15  8  9  7  9
[1] "mow"   "mow"   "mow"   "mow"   "mow"   "unmow" "unmow" "unmow" "unmow"
> grass.df = data.frame(rich, graze)
> str(grass.df)
'data.frame': 9 obs. of  2 variables:
 $ rich : int  12 15 17 11 15 8 9 7 9
 $ graze: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2

When you use the str() command to examine the structure of the data frame that was created, you see that the character vector has been converted into a factor. If you add a character vector to an existing data frame, it will remain as a character vector unless you use the data.frame() command as your means of adding the new vector; you see this in a moment.

You can force a numeric or character vector to be a factor by using the factor() command:

> graze
[1] "mow"   "mow"   "mow"   "mow"   "mow"   "unmow" "unmow" "unmow" "unmow"
> graze.f = factor(graze)
> graze.f
[1] mow   mow   mow   mow   mow   unmow unmow unmow unmow
Levels: mow unmow

Here you see that the original character vector is made into a factor, and you see the list of levels when you look at the object (note that the data are not in quotes as they were when they were a character object). If you want to add a character vector to an existing data frame and require the new vector to be a factor, you can use the as.factor() command to convert the vector to a factor. In the following example you see the result of adding a vector of characters without using as.factor() and then with the as.factor() command:

> grass.df$graze2 = graze
> grass.df
  rich graze graze2
1   12   mow    mow
2   15   mow    mow
3   17   mow    mow
4   11   mow    mow
5   15   mow    mow
6    8 unmow  unmow
7    9 unmow  unmow
8    7 unmow  unmow
9    9 unmow  unmow
> str(grass.df)
'data.frame': 9 obs. of  3 variables:
 $ rich  : int  12 15 17 11 15 8 9 7 9
 $ graze : Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
 $ graze2: chr  "mow" "mow" "mow" "mow" ...

> grass.df$graze2 = as.factor(graze)
> str(grass.df)
'data.frame': 9 obs. of  3 variables:
 $ rich  : int  12 15 17 11 15 8 9 7 9
 $ graze : Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2
 $ graze2: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2

In the first instance you see that the character vector appears in the data frame without quotes, but the str() command reveals it is still composed of characters. In the second case you use the as.factor() command, and the new column is stored as a factor variable. You can, of course, set a column to be a factor afterward, as you can see in the following example:

> grass.df$graze2 = factor(grass.df$graze2)

In this case you convert the graze2 column of the data frame into a factor using the factor() command. If you use the data.frame() command then any character vectors are converted to factors as the following example shows:

> grass.df = data.frame(grass.df, graze2 = graze)

Notice how the name of the column created is set as part of the command; the graze2 object is created on the fly and added to the data frame as a factor.
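
Whether data.frame() converts character vectors to factors depends on the stringsAsFactors instruction, whose default changed from TRUE to FALSE in R 4.0.0. The following sketch sets the instruction explicitly so that the result does not depend on the version of R in use; the object name grass.chr is introduced only for this illustration:

```r
rich = c(12, 15, 17, 11, 15, 8, 9, 7, 9)
graze = c(rep('mow', 5), rep('unmow', 4))

# Ask for the conversion explicitly rather than relying on the default
grass.df = data.frame(rich, graze, stringsAsFactors = TRUE)
class(grass.df$graze)     # the column is a factor

# With the conversion switched off the column stays as characters
grass.chr = data.frame(rich, graze, stringsAsFactors = FALSE)
class(grass.chr$graze)    # the column is a character vector
```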

At some point you may want to examine how your factor vector is split up, because the factor represents the predictor variable and shows you how many treatments are applied. You can use the levels() command in two ways: to query an object and find out what levels it possesses, or to set the levels. The following examples use a character vector and then a factor:

> graze
[1] "mow"   "mow"   "mow"   "mow"   "mow"   "unmow" "unmow" "unmow" "unmow"
> levels(graze)
NULL

Here the data are plain characters and no levels are set; when you examine the data with the levels() command you get NULL as a result.

> graze.f
[1] mow   mow   mow   mow   mow   unmow unmow unmow unmow
Levels: mow unmow
> levels(graze.f)
[1] "mow"   "unmow"

Here you see the names of the levels that you created earlier. If you have a numeric variable that represents codes for treatments, you can make the variable into a factor using the factor() command as you have already seen, but you can also assign names to the levels. In the following example you create a simple numeric vector to represent two treatments:

> graze.nf = c(1,1,1,1,1,2,2,2,2)

You can now assign names to each of the levels in the vector like so:

> levels(graze.nf)[1] = 'mown'
> levels(graze.nf)[2] = 'unmown'

> levels(graze.nf)
[1] "mown"   "unmown"

> graze.nf
[1] 1 1 1 1 1 2 2 2 2
attr(,"levels")
[1] "mown"   "unmown"

> class(graze.nf)
[1] "numeric"

You can set each level to have a name; now your plain numeric values have a more meaningful label. However, the vector still remains a numeric variable rather than a factor. You can set all the labels in one command with a slight variation, as the following example shows:

> graze.nf = factor(c(1,1,1,1,1,2,2,2,2))
> graze.nf
[1] 1 1 1 1 1 2 2 2 2
Levels: 1 2

> levels(graze.nf) = list(mown = '1', unmown = '2')
> graze.nf
[1] mown   mown   mown   mown   mown   unmown unmown unmown unmown
Levels: mown unmown

In this case you create your factor object directly using numeric values but wrap these in a factor() command; you can see that you get your two levels, corresponding to the two values. This time you use the levels() command to set the names by listing how you want the numbers to be replaced.

You can also apply level names to a vector as you convert it to a factor via the factor() command:

> graze.nf = c(1,1,1,1,1,2,2,2,2)
> graze.nf
[1] 1 1 1 1 1 2 2 2 2

> factor(graze.nf, labels = c('mown', 'unmown'))
[1] mown   mown   mown   mown   mown   unmown unmown unmown unmown
Levels: mown unmown

In this instance you have a simple numeric vector and use the labels = instruction to apply labels to the levels as you make your factor object.

You can use the nlevels() command to give you a numeric result for the number of levels in a vector:

> graze
[1] "mow"   "mow"   "mow"   "mow"   "mow"   "unmow" "unmow" "unmow" "unmow"
> nlevels(graze)
[1] 0
> graze.f
[1] mow   mow   mow   mow   mow   unmow unmow unmow unmow
Levels: mow unmow
> nlevels(graze.f)
[1] 2

You can also use the class() command to check what sort of object you are dealing with like so:

> class(graze)
[1] "character"
> class(graze.f)
[1] "factor"

In the first case you can see clearly that the data are characters, whereas in the second case you see that you have a factor object. The class() command is useful because, as you have seen, it is possible to apply levels to vectors of data without making them into factor objects. Take the following for example, which uses the graze.nf vector as it was when you assigned level names to the plain numeric values with the levels() command:

> nlevels(graze.nf)
[1] 2
> class(graze.nf)
[1] "numeric"

In the preceding example you have set two levels to your vector, but it remains a numeric object.

If you want to examine a factor variable but only want to view the levels as numeric values rather than as characters (assuming they have been set), you can use the as.numeric() command like so:

> as.numeric(graze.nf)
[1] 1 1 1 1 1 2 2 2 2

Now you can switch between character, factor, and numeric quite easily.
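
One point to watch is that as.numeric() applied to a factor returns the internal level codes (1, 2, 3, and so on), not the level names. In the example above the two happen to coincide. The following sketch uses hypothetical treatment codes (the dose object is an assumption made for illustration) to show the difference, and one common way to recover the original numbers:

```r
# Hypothetical treatment codes that do not start at 1
dose = factor(c(10, 10, 20, 20, 50))

as.numeric(dose)                 # internal codes: 1 1 2 2 3
as.numeric(as.character(dose))   # original values: 10 10 20 20 50
```

Converting to characters first means that the level names, rather than their positions, are turned into numbers.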

Making Replicate Treatment Factors

You have already seen how to create vectors of levels using the rep() command. The basic form of the command is:

rep(what, times)

You can use this command to create repeating labels that you can use to create a vector of characters that will become a factor object.

> trt = factor(c(rep('mown', 5), rep('unmown', 4)))
> trt
[1] mown   mown   mown   mown   mown   unmown unmown unmown unmown
Levels: mown unmown

In this instance you make a factor object directly from five lots of mown and four lots of unmown, which correspond to the two treatments you require.

When you have a balanced design with an equal number of replications, you can use the each instruction like so:

> factor(rep(c('mown', 'unmown'), each = 5))
 [1] mown   mown   mown   mown   mown   unmown unmown unmown unmown unmown
Levels: mown unmown

The each instruction repeats the elements the specified number of times. You can use the times and each instructions together to create more complicated repeated patterns.
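
As a small sketch of combining the two instructions, the following repeats each label twice and then repeats the whole pattern three times. The object names block and time.f are assumptions made for this illustration:

```r
# each repeats every element in turn; times then repeats the whole pattern
block = rep(c('early', 'late'), each = 2, times = 3)
block    # early early late late, three times over (12 items)

# Convert the repeated labels to a factor
time.f = factor(block)
nlevels(time.f)
```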

You can also create factor objects using the gl() command. The general form of the command is:

gl(n, k, length = n*k, labels = 1:n)

In this command, n is the number of levels you require and k is the number of replications for each of these levels. You can also set the overall length of the vector you create and add specific text labels to your treatments. For example:

> gl(2, 5, labels = c('mown', 'unmown'))
 [1] mown   mown   mown   mown   mown   unmown unmown unmown unmown unmown
Levels: mown unmown

> gl(2, 1, 10, labels = c('mown', 'unmown'))
 [1] mown   unmown mown   unmown mown   unmown mown   unmown mown   unmown
Levels: mown unmown

> gl(2, 2, 10, labels = c('mown', 'unmown'))
 [1] mown   mown   unmown unmown mown   mown   unmown unmown mown   mown  
Levels: mown unmown

In the first case you set two levels and require five replicates; you get five of one level and then five of the other. In the second case you set the number of replicates to 1, but also set the overall length to 10; the result is alternation between the levels until you reach the length required. In the third case you set the number of replicates to be two, and now you get two of each treatment until you reach the required length.
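
The gl() command is convenient when you need two predictor variables that are fully crossed. The following sketch assumes a hypothetical balanced layout of three cutting regimes, two times, and two replicates of each combination (12 rows in all); the design object is introduced only for this illustration:

```r
# Three regimes, four consecutive rows each
cutting = gl(3, 4, labels = c('mow', 'unmow', 'sheep'))

# Two times in runs of two, cycling until 12 items are made
time = gl(2, 2, length = 12, labels = c('early', 'late'))

design = data.frame(cutting, time)

# Cross-tabulate to check that every combination appears twice
table(design$cutting, design$time)
```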

When you have a lot of data you will generally find it more convenient to create it in a spreadsheet and save it as a CSV file. However, for data with relatively few replicates it is useful to be able to make up data objects directly in R. In the following activity, you practice making a fairly simple data object comprising a numeric response variable and two predictor variables.

Try It Out: Make a Complex Data Frame
In this activity you will make a data frame that represents some numerical sample data and character predictor variables. This is the kind of thing that you might analyze using the aov() command.
1. Start by creating some numerical response data. These relate to the abundance of a plant at three sites:
> higher = c(12, 15, 17, 11, 15)
> lower = c(8, 9, 7, 9)
> middle = c(12, 14, 17, 21, 17)
2. Now join the separate vectors to make one variable:
> daisy = c(higher, lower, middle)
3. Make a predictor variable (the cutting regime) by creating a character vector:
> cutting = c(rep('mow', 5), rep('unmow', 4), rep('sheep', 5))
4. Create a second predictor variable (time of cutting):
> time = rep(gl(2, 1, length = 5, labels = c('early', 'late')), 3)[-10]
5. Assemble the data frame:
> flwr = data.frame(daisy, cutting, time)
6. Tidy up:
> rm(higher, lower, middle, daisy, cutting, time)
7. View the final data:
> flwr
How It Works
You start by making the numerical response variable. In this case you have three sites and you create three vectors using the c() command; you could have used the scan() command instead. Next, you join the three vectors together. You could have done this right at the start, but this way you can see more easily that the three are different lengths.
The first predictor variable (how the meadows were cut) is created as a simple character vector. You can see that you need five replicates for the first and third, but only four for the second. You use the rep() command to generate the required number of replicates.
The next predictor variable (time of year) is more difficult because each site was monitored early and late alternately. The solution is to create the alternating variable and remove the “extra.” The gl() command creates the variable and is wrapped in a rep() command to make an alternating variable with length of five repeated three times. The tenth item is not required and is removed using the [-10] instruction.
Now the final data frame can be assembled using the data.frame() command, and the unwanted preliminary variables can be tidied away using the rm() command. You can view the final result by typing its name:
> flwr
   daisy cutting  time
1     12     mow early
2     15     mow  late
3     17     mow early
4     11     mow  late
5     15     mow early
6      8   unmow early
7      9   unmow  late
8      7   unmow early
9      9   unmow  late
10    12   sheep early
11    14   sheep  late
12    17   sheep early
13    21   sheep  late
14    17   sheep early

Adding Rows or Columns

When it comes to adding data to an existing data frame or matrix, you have various options. The following examples illustrate some of the ways you can add data:

> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA

> grazed
[1] 11 14 17 10  8

> grassy$grazed = grazed
> grassy
          mown unmown grazed
Top         12      8     11
Middle      15      9     14
Lower       17      7     17
Set aside   11      9     10
Verge       15     NA      8

In the preceding example you have a new sample and want to add this as a column to your data frame. The sample is the same length as the others so you can add it simply by using the $. In the next example you use the data.frame() command, but this time you are combining an existing data frame with a vector; this works fine as long as the new vector is the same length as the existing columns:

> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA

> grassy = data.frame(grassy, grazed)
> grassy
          mown unmown grazed
Top         12      8     11
Middle      15      9     14
Lower       17      7     17
Set aside   11      9     10
Verge       15     NA      8

You add a row to a data frame using the [row, column] syntax. In the following example you have a new vector of values that you want to add as a row in your data frame:

> Midstrip
[1] 10 10 12

> grassy['Midstrip',] = Midstrip
> grassy
          mown unmown grazed
Top         12      8     11
Middle      15      9     14
Lower       17      7     17
Set aside   11      9     10
Verge       15     NA      8
Midstrip    10     10     12

You have now assigned the appropriate row of the data frame to your new vector of values; note that you give the name in the brackets using quotes.

If the new data are longer than the original data frame, you must expand the data frame to “make room” for the new items; you can do this by assigning NA to new rows as required. In the following example you have a simple data frame and want to add a new column, but this is longer than the original data:

> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA
> grazed
[1] 11 14 17 10  8  9

> grassy$grazed = grazed
Error in `$<-.data.frame`(`*tmp*`, "grazed", value = c(11, 14, 17, 10,  : 
  replacement has 6 rows, data has 5

When you try to add the new data, you get an error message; there are not enough existing rows to accommodate the new column. In this instance the data frame has named rows; you require only one extra row so you can name the row as you create it:

> grassy['Midstrip',] = NA
> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA
Midstrip    NA     NA

> grassy$grazed = grazed
> grassy
          mown unmown grazed
Top         12      8     11
Middle      15      9     14
Lower       17      7     17
Set aside   11      9     10
Verge       15     NA      8
Midstrip    NA     NA      9

Once you have the additional row you can add the new column as before. In this case you added a column that required only a single additional row, but if you needed more you could do this easily:

> grassy[6:10,] = NA
> grassy
          mown unmown
Top         12      8
Middle      15      9
Lower       17      7
Set aside   11      9
Verge       15     NA
6           NA     NA
7           NA     NA
8           NA     NA
9           NA     NA
10          NA     NA

You added rows six to ten and set all the values to be NA. Notice, however, that the row names of the additional rows are unset and have a plain numerical index value. You have to reset the names of the rows using the row.names() command:

> row.names(grassy) = c(row.names(grassy)[1:5], "A", "B", "C", "D", "E")

In this case you keep the names from the first five rows and add to them the new names you require for rows six to ten (in this case, uppercase letters).
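
The process of making room for a longer column can be sketched as follows, working out the number of extra rows rather than typing it in. The names extra and new.rows are hypothetical helpers introduced for this illustration:

```r
grassy = data.frame(mown = c(12, 15, 17, 11, 15),
                    unmown = c(8, 9, 7, 9, NA),
                    row.names = c('Top', 'Middle', 'Lower', 'Set aside', 'Verge'))
grazed = c(11, 14, 17, 10, 8, 9)

# How many rows is the data frame short by?
extra = length(grazed) - nrow(grassy)

if (extra > 0) {
  # Positions of the rows to create, counted from the current last row
  new.rows = nrow(grassy) + seq_len(extra)
  grassy[new.rows, ] = NA
}

# The new column now fits
grassy$grazed = grazed
grassy
```

The added row takes a plain index as its name, so you would still reset the row names afterward if you need something more meaningful.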

When you have a matrix you can add additional rows or columns using the rbind() or cbind() commands as appropriate:

> grassy.m
      top upper mid lower bottom
mow    12    15  17    11     15
unmow   8     9   7     9     NA
> grazed
[1] 11 14 17 10  8

> grassy.m = rbind(grassy.m, grazed)
> grassy.m
       top upper mid lower bottom
mow     12    15  17    11     15
unmow    8     9   7     9     NA
grazed  11    14  17    10      8

> grassy.m
     mow unmow
[1,]  12     8
[2,]  15     9
[3,]  17     7
[4,]  11     9
[5,]  15    NA

> grassy.m = cbind(grassy.m, grazed)
> grassy.m
     mow unmow grazed
[1,]  12     8     11
[2,]  15     9     14
[3,]  17     7     17
[4,]  11     9     10
[5,]  15    NA      8

In the first case you use rbind() to add the extra row to the matrix, and in the second case you use cbind() to add an extra column.

You cannot use the $ syntax, or square brackets that point beyond the existing dimensions, to add columns or rows as you did for the data frame. If you try to add a row, for example, you get an error:

> grassy.m
     mown unmown
[1,]   12      8
[2,]   15      9
[3,]   17      7
[4,]   11      9
[5,]   15     NA

> grassy.m[6,] = NA
Error in grassy.m[6, ] = NA : subscript out of bounds

You have to use the rbind() or cbind() commands to add to a matrix. You can, however, create a blank matrix and fill in the blanks later, as the following example shows:

> extra = matrix(nrow = 2, ncol = 2)
> extra
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA

> rbind(grassy.m, extra)
      mown unmown
 [1,]   12      8
 [2,]   15      9
 [3,]   17      7
 [4,]   11      9
 [5,]   15     NA
 [6,]   NA     NA
 [7,]   NA     NA

Here you create a blank matrix by omitting the data, which is filled in with NA items. You give the dimensions, as rows and columns, for the matrix and then use the rbind() command to add this to your existing matrix.

You can also specify the data explicitly like so:

matrix(data = NA, ncol = 2, nrow = 2)
matrix(NA, ncol = 2, nrow = 2)
matrix(data = 0, ncol = 2, nrow = 2)
matrix(data = 'X', ncol = 2, nrow = 2)

In the first two cases you use NA as your data, in the third case you fill the new matrix with the number zero, and in the final case you use an uppercase character X.
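
Filling in the blanks later can be done with square brackets, because the cells already exist. The following is a minimal sketch that creates a blank matrix of the required size and then fills it a row at a time, using the sample values from earlier in the chapter:

```r
# Start with an empty 2 x 5 matrix of NA items
grassy.m = matrix(NA, nrow = 2, ncol = 5)
rownames(grassy.m) = c('mow', 'unmow')

# Fill existing cells by row name or by [row, column] position
grassy.m['mow', ] = c(12, 15, 17, 11, 15)
grassy.m['unmow', 1:4] = c(8, 9, 7, 9)
grassy.m
```

The last cell of the unmow row is never assigned, so it remains NA.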

Adding rows and columns of data to existing objects is useful, especially when you are dealing with fairly small data sets. You do not always want to resort to your spreadsheet for minor alterations. In the following activity you get a bit of extra practice by adding a column and then a row to a data frame you created in the previous activity.

Try It Out: Add Rows and Columns to Existing Data
In this activity you will add some rows and columns to an existing data frame (the one you created in an earlier activity).
1. Look at the flwr data frame that you created earlier:
> flwr
2. Now create a new numerical vector:
> poa = c(8, 9, 11, 12, 10, 15, 17, 16, 16, 7, 8, 8, 5, 9)
3. Add the new response variable to the previous data frame:
> flwr$poa = poa
4. Rearrange the columns so that the two response variables are on the left:
> flwr = flwr[c(1,4,2,3)]
5. Create data that will form a new row (an additional replicate; note that the values used here assign it to the mow cutting regime):
> row15 = data.frame(10,18,'mow','early')
6. Add the extra row to the existing data:
> flwr[15,] = row15
7. Tidy up:
> rm(poa, row15)
8. View the final result:
> flwr
How It Works
Any new column of data must be the same length as the existing data frame; pad out with NA items if necessary. In this case the new vector is the same length and is created using the c() command (although scan() would be more efficient). Because the target data frame exists, you can add the new variable using the $. The alternative would have been to use the data.frame() command, which is fine but requires more typing.
It is always useful to have the response variables on the left and predictors on the right, so the data are rearranged simply by specifying the column order in square brackets.
Adding a row is more complex because you need a combination of numeric and character values. The easiest way is to make a simple data frame containing the row data you require; this enables you to mix numbers and characters. The new row is added to the bottom of the existing data using the square brackets; row 15 does not exist so you create it, and assign the new data to this row.
You remove the working data to tidy up the workspace and view the final result simply by typing the name of the data:
> flwr
   daisy poa cutting  time
1     12   8     mow early
2     15   9     mow  late
3     17  11     mow early
4     11  12     mow  late
5     15  10     mow early
6      8  15   unmow early
7      9  17   unmow  late
8      7  16   unmow early
9      9  16   unmow  late
10    12   7   sheep early
11    14   8   sheep  late
12    17   8   sheep early
13    21   5   sheep  late
14    17   9   sheep early
15    10  18     mow early
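
The two points above (padding a short vector with NA, and building a new row as a small data frame) can be sketched as follows. The data frame dat is a hypothetical stand-in for flwr with made-up values, and rbind() is shown as one possible alternative to assigning into a row that does not yet exist; it assumes the column names of the new row match those of the target.

```r
# Hypothetical stand-in for the flwr data frame (values are illustrative only)
dat = data.frame(daisy = c(12, 15, 8),
                 cutting = c("mow", "mow", "unmow"),
                 stringsAsFactors = FALSE)

# A new variable that is one item short of the three rows in dat
grass = c(8, 9)
length(grass) = nrow(dat)   # extending the length pads the end with NA
dat$grass = grass

# Build the new row as a one-row data frame with matching column names,
# then append it with rbind()
new.row = data.frame(daisy = 10, cutting = "unmow", grass = 18,
                     stringsAsFactors = FALSE)
dat = rbind(dat, new.row)
dat
```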

Summarizing Data

Summarizing data is an important element of any statistical or analytical process. However complex the statistical process is, you always need to summarize your data in terms of means or medians, and generally break up the data into more manageable chunks. In the simplest of cases you merely need to summarize rows or columns of data, but as the situation becomes more complex, you need to prepare summary information based on combinations of factors.

The summary statistics that you extract can be used to help visualize the situation or to check replication and experimental design. The statistics can also be used as the basis for graphical summaries of the data.

You have various commands at your disposal, and this section starts with simple row/column summaries and builds toward more complex commands.

Simple Column and Row Summaries

When you only require a really simple column sum or mean, you can use the colSums() and colMeans() commands. Equivalent commands exist for the rows, too. These are all used in the following example:

> fw
         count speed
Taw          9     2
Torridge    25     3
Ouse        15     5
Exe          2     9
Lyn         14    14
Brook       25    24
Ditch       24    29
Fal         47    34

> colMeans(fw)
 count  speed 
20.125 15.000 

> colSums(fw)
count speed 
  161   120 

> rowMeans(fw)
     Taw Torridge     Ouse      Exe      Lyn    Brook    Ditch      Fal 
     5.5     14.0     10.0      5.5     14.0     24.5     26.5     40.5

> rowSums(fw)
     Taw Torridge     Ouse      Exe      Lyn    Brook    Ditch      Fal 
      11       28       20       11       28       49       53       81

In the example, the data frame has row names set, so the rowMeans() and rowSums() commands show you the means and sums for the named rows. When row names are not set, you end up with a simple numeric vector, like so:

> rowSums(mf)
 [1] 274.25 262.15 215.75 240.95 227.95 228.75 197.85 264.75 247.95 262.35 267.35
[12] 264.35 259.05 245.85 229.75 247.45 275.35 253.05 201.25 295.05 275.55 176.85
[23] 204.95 218.85 208.75

If you have NA items, you end up with NA as a result; to avoid this you can use the na.rm = TRUE instruction to ignore NA items in the sum or mean calculation like so:

> colSums(bf)
 Grass  Heath Arable 
    82     NA     NA 

> colSums(bf, na.rm = TRUE)
 Grass  Heath Arable 
    82     72     90 
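
If you want to try these commands yourself, the following sketch rebuilds the fw data frame from the values printed earlier. The copy called fw.na, with one value replaced by NA, is a made-up addition used only to show the effect of na.rm = TRUE.

```r
# Rebuild the fw data frame from the values printed above
fw = data.frame(count = c(9, 25, 15, 2, 14, 25, 24, 47),
                speed = c(2, 3, 5, 9, 14, 24, 29, 34),
                row.names = c("Taw", "Torridge", "Ouse", "Exe",
                              "Lyn", "Brook", "Ditch", "Fal"))

colSums(fw)     # column totals
rowMeans(fw)    # named by river because row names are set

# Hypothetical copy with one missing value, to show the effect of na.rm
fw.na = fw
fw.na["Exe", "speed"] = NA
colSums(fw.na)                 # the speed total becomes NA
colSums(fw.na, na.rm = TRUE)   # the speed total ignores the missing item
```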

If some of your data are not numeric, you must specify either the rows (or columns) you want to include or what you want to exclude; the following examples all produce the same result:

> str(mf)
'data.frame': 25 obs. of  6 variables:
 $ Length: int  20 21 22 23 21 20 19 16 15 14 ...
 $ Speed : int  12 14 12 16 20 21 17 14 16 21 ...
 $ Algae : int  40 45 45 80 75 65 65 65 35 30 ...
 $ NO3   : num  2.25 2.15 1.75 1.95 1.95 2.75 1.85 1.75 1.95 2.35 ...
 $ BOD   : int  200 180 135 120 110 120 95 168 180 195 ...
 $ site  : Factor w/ 5 levels "Exe","Lyn","Ouse",..: 4 4 4 4 4 5 5 5 5 5 ...

> colMeans(mf[-6])
> colMeans(mf[1:5])
> colMeans(mf[c(1,2,3,4,5)])
 Length   Speed   Algae     NO3     BOD 
 19.640  15.800  58.400   2.046 145.960 

At the beginning you see that the data frame has five columns of numeric data and one character vector (actually, it is a factor). In the first case you exclude the factor column using [-6]; in the second case you specify columns one to five using [1:5]. In the last case you list all five columns you require explicitly.
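
When a data frame has many columns, counting the numeric ones by hand can be error prone. One possible alternative, sketched here with a small hypothetical data frame (dat), is to let R work out which columns are numeric and use the result as the column selection.

```r
# Hypothetical data frame with two numeric columns and one factor column
dat = data.frame(Length = c(20, 21, 22, 23),
                 Speed = c(12, 14, 12, 16),
                 site = factor(c("Taw", "Taw", "Exe", "Exe")))

# TRUE for each numeric column, FALSE for the factor
num.cols = sapply(dat, is.numeric)
num.cols

# Use the logical vector to keep only the numeric columns
colMeans(dat[num.cols])
```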

Although these commands are useful, they are somewhat limited and are intended as convenience commands. They are also only useful when your data are all numeric, and you may very well have data comprising numeric response variables and factor predictor variables. In these cases you can call upon a range of other summary commands, as you see shortly.

Complex Summary Functions

When you have complicated data you often have a mixture of numeric and factor variables. The simple colMeans() and colSums() commands are not sufficient to extract information from these data. Fortunately, you have a variety of commands that you can use to summarize your data, and you have seen some of these before. Here you see an overview of some of these methods.

To help illustrate the options, start by taking a numeric data frame and adding a factor made from a simple vector of site names:

> mfnames = c(rep('Taw',5), rep('Torridge',5), rep('Ouse',5), rep('Exe',5), 
rep('Lyn',5))
> mf$site = factor(mfnames)

> str(mf)
'data.frame': 25 obs. of  6 variables:
 $ Length: int  20 21 22 23 21 20 19 16 15 14 ...
 $ Speed : int  12 14 12 16 20 21 17 14 16 21 ...
 $ Algae : int  40 45 45 80 75 65 65 65 35 30 ...
 $ NO3   : num  2.25 2.15 1.75 1.95 1.95 2.75 1.85 1.75 1.95 2.35 ...
 $ BOD   : int  200 180 135 120 110 120 95 168 180 195 ...
 $ site  : Factor w/ 5 levels "Exe","Lyn","Ouse",..: 4 4 4 4 4 5 5 5 5 5 ...
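
The site labels were typed above as five separate rep() calls. As a sketch of two shorter ways to build the same labels, the following uses the each instruction of rep() and the gl() command; the object names sites, site1, and site2 are illustrative. The labels come out in the same sequence either way, but the order of the factor levels differs.

```r
sites = c("Taw", "Torridge", "Ouse", "Exe", "Lyn")

# each = 5 repeats every name five times in turn
site1 = factor(rep(sites, each = 5))

# gl() generates 5 levels, each repeated 5 times, labelled in the order given
site2 = gl(5, 5, labels = sites)

levels(site1)   # alphabetical order
levels(site2)   # order of the labels as supplied
```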

Now that you have a suitable practice sample, it is time to look at some of the complex summary functions that you can use.

The rowsum() Command

You can calculate the sums of rows in a data frame or matrix and group the sums according to some factor or grouping variable. In the following example, you use the rowsum() command to determine the sums for each of the sites that are listed in the site column:

> rowsum(mf[1:5], group = mf$site)
         Length Speed Algae   NO3 BOD
Exe          88    83   235  7.15 859
Lyn         110    73   355 12.95 534
Ouse        102    76   325 10.35 753
Taw         107    74   285 10.05 745
Torridge     84    89   260 10.65 758

The result shows all the sites listed, and for each numeric variable you have the sum. Note that you specified the columns in the data using [1:5]; these are the numeric columns. You could also eliminate the non-numeric column like so:

> rowsum(mf[-6], mf$site)

In this case the sixth column contained the grouping variable. You can also specify a single column using its name (in quotes):

> rowsum(mf['Length'], mf$site)
         Length
Exe          88
Lyn         110
Ouse        102
Taw         107
Torridge     84

When you have a matrix, your grouping variable must be separate because a matrix consists of data that are all of the same type. In the following example, you create a simple vector specifying the groupings:

> bird
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

> grp = c(1,1,1,2,2,3)

> rowsum(bird, grp)
  Garden Hedgerow Parkland Pasture Woodland
1    116       13       55       9        4
2     55       19        8       4        2
3      4        0        6       0        0

The group vector must be the same length as the number of rows in your matrix; in this case, six rows of data. You might also create a character vector as in the following example:

> grp = c('black', 'color', 'color', rep('brown', 3))

> grp
[1] "black" "color" "color" "brown" "brown" "brown"

> rowsum(bird, grp)
      Garden Hedgerow Parkland Pasture Woodland
black     47       10       40       2        2
brown     59       19       14       4        2
color     69        3       15       7        2

It is also possible to specify part of the matrix using a grouping contained within the original matrix:

> rowsum(bird[,1:4], bird[,5])
  Garden Hedgerow Parkland Pasture
0    100       16       24      11
2     75       16       45       2

Here you use the last column as the grouping, and the result shows the group labels (as numbers). A grouping taken from within the matrix in this way can only be numeric, of course, because the matrix can contain only data of a single type.
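
To experiment with these groupings, you can rebuild the bird matrix from the values printed earlier, as in the following sketch. The grouping vector is held outside the matrix, which is why it is free to contain character labels; bird.sum is simply an illustrative name for the result.

```r
# Rebuild the bird matrix from the values printed above
bird = matrix(c(47, 10, 40, 2, 2,
                19,  3,  5, 0, 2,
                50,  0, 10, 7, 0,
                46, 16,  8, 4, 0,
                 9,  3,  0, 0, 2,
                 4,  0,  6, 0, 0),
              nrow = 6, byrow = TRUE,
              dimnames = list(c("Blackbird", "Chaffinch", "Great Tit",
                                "House Sparrow", "Robin", "Song Thrush"),
                              c("Garden", "Hedgerow", "Parkland",
                                "Pasture", "Woodland")))

# A separate character vector, one label per row of the matrix
grp = c("black", "color", "color", rep("brown", 3))

# Sum the rows within each group; the groups are returned in sorted order
bird.sum = rowsum(bird, grp)
bird.sum
```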

The apply() Command

You can use the apply() command to apply a function over all the rows or columns of a data frame (or matrix). To use it, you specify the rows or columns that you require, whether you want to apply the function to the rows or columns, and finally, the actual function you want, like so:

apply(X, MARGIN, FUN, ...)

You replace the MARGIN part with a numeric value: 1 = rows and 2 = columns. You can also add other instructions if they are related to the function you are going to use; for example, you can exclude NA items, using na.rm = TRUE. In the following case you use the apply() command to apply the median to the first five columns of your data frame:

> apply(mf[1:5], 2, median)
Length  Speed  Algae    NO3    BOD 
 20.00  16.00  65.00   1.95 145.00 

You put the columns you require in the square brackets; in this case you used [1:5]. Because your object is a data frame, you can simply give the columns on their own; more properly you should use [row, col] syntax:

> apply(mf[,1:5], 2, median)

Here you added the comma, saying in effect that you want all the rows but only columns one through five. If you want to apply your function to the rows, you simply switch the numeric value in the MARGIN part:

> apply(mf[,1:5], 1, median)
 [1] 20 21 22 23 21 21 19 16 16 21 21 26 21 20 19 18 17 19 21 21 22 25 24 23 22

Notice that you have not specified MARGIN or FUN in the command, but have used a shortcut. R commands have a default order for instructions; so as long as you put the arguments in the default order you do not need to name them. If you do name them then the instructions can appear in any order. The full version for the preceding example would be written like so:

> apply(X = mf[,1:5], MARGIN = 1, FUN = median)

The apply() command enables you to use a wider variety of commands on rows and columns than the rowSums() or colMeans() commands, which obviously are limited to sum() and mean(). However, you can use apply() only on entire rows or columns that are discrete samples. When you have grouping variables, you need a different approach.
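
The following sketch pulls these points together using a small hypothetical matrix (m) that contains one missing value. It shows the named and the positional forms of the command, and how an extra instruction such as na.rm = TRUE is passed on to the summary function.

```r
# Hypothetical numeric matrix with one missing value
m = matrix(c(1,  4, 7,
             2, NA, 8,
             3,  6, 9),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("a", "b", "c"), c("x", "y", "z")))

# Column medians with named arguments; na.rm = TRUE is passed on to median()
col.med = apply(X = m, MARGIN = 2, FUN = median, na.rm = TRUE)
col.med

# Row maxima with positional arguments; na.rm = TRUE is passed on to max()
row.max = apply(m, 1, max, na.rm = TRUE)
row.max
```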

Using tapply() to Summarize Using a Grouping Variable

The summary commands you have looked at so far have enabled you to look at entire rows or columns; only the rowsum() command lets you take into account a grouping variable. When you have grouping variables in the form of predictor variables, for example, you can use the tapply() command to take into account one or more factors as grouping variables.

The following illustrates a fairly simple example where you have a data frame comprising several numeric columns, and a single column that is a grouping variable (a factor):

> tapply(mf$Length, mf$site, FUN = sum)
     Exe      Lyn     Ouse      Taw Torridge 
      88      110      102      107       84 

The tapply() command works only on a single vector at a time; in this instance you choose the Length column using the $ syntax. Next you specify the INDEX that you want to use; in other words, the grouping variable. Finally, you select the function that you want to apply; here you choose the sum. The general form of the command is as follows:

tapply(X, INDEX, FUN = NULL, ...)

If you omit the FUN part, or set it to NULL, you get a vector that relates to the INDEX. This is easiest to see in an example:

> tapply(mf$Length, mf$site, FUN = NULL)
 [1] 4 4 4 4 4 5 5 5 5 5 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2

If you refer to the original data you will see that Exe is the fourth site listed, but because it is alphabetically first it becomes level 1 of the factor. The resulting vector shows, for each row of the original data, the number of the factor level (that is, the group) to which that row belongs.
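
The difference between the two forms is perhaps easier to see on a very small example. The vectors len and site below are hypothetical and chosen only so that the groups can be checked by eye.

```r
# Hypothetical response values and grouping factor
len = c(20, 21, 16, 15, 22, 23)
site = factor(c("Taw", "Taw", "Exe", "Exe", "Lyn", "Lyn"))

levels(site)   # levels are held in alphabetical order

# With FUN = NULL the result gives the level number for every item
idx = tapply(len, site, FUN = NULL)
idx

# With a function the result has one value per level
grp.sum = tapply(len, site, FUN = sum)
grp.sum
```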

When you have more than one grouping variable, you can list several factors to be your INDEX. In the following example you have a data frame comprising a column of numeric data and two factor columns:

> str(pw)
'data.frame': 18 obs. of  3 variables:
 $ height: int  9 11 6 14 17 19 28 31 32 7 ...
 $ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
 $ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...

> tapply(pw$height, list(pw$plant, pw$water), mean)
               hi       lo      mid
sativa   39.66667 6.000000 15.33333
vulgaris 30.33333 8.666667 16.66667

This time you specify the columns you want to use as grouping variables in a list() command; there are only two variables here and the first one becomes the rows of the result and the second becomes the columns.

If you have more than two grouping variables, the result is subdivided into more tables as required. In the following example you have an extra factor column and use all three factors as grouping variables:

> str(pw)
'data.frame': 18 obs. of  4 variables:
 $ height: int  9 11 6 14 17 19 28 31 32 7 ...
 $ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
 $ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...
 $ season: Factor w/ 2 levels "spring","summer": 1 2 2 1 2 2 1 2 2 1 ...

> tapply(pw$height, list(pw$plant, pw$water, pw$season), mean)
, , spring

         hi lo mid
sativa   44  7  14
vulgaris 28  9  14

, , summer

           hi  lo mid
sativa   37.5 5.5  16
vulgaris 31.5 8.5  18

In this case the third grouping variable has two levels, which results in two tables, one for spring and one for summer. The result is presented as a kind of R object called an array; this can have any number of dimensions, but in this case you have three. If you look at the structure of the result using the str() command, you can see how the dimensions are set:

> pw.tap = tapply(pw$height, list(pw$plant, pw$water, pw$season), mean)
> str(pw.tap)
 num [1:2, 1:3, 1:2] 44 28 7 9 14 14 37.5 31.5 5.5 8.5 ...
 - attr(*, "dimnames")=List of 3
  ..$ : chr [1:2] "sativa" "vulgaris"
  ..$ : chr [1:3] "hi" "lo" "mid"
  ..$ : chr [1:2] "spring" "summer"

You can see that the first dimension is related to the plant variable, the second is related to the water variable, and the third is related to the season variable; in other words, the dimensions are in the same order as you specified in the tapply() command.

You can use the square brackets to extract parts of your result object, but now you have the extra dimension to take into account. To extract part of the result object you need to specify three values in the square brackets (corresponding to each of the three dimensions, plant, water, and season). In the following example you select a single item from the pw.tap result object by specifying a single value for each of the three dimensions.

> pw.tap[1,1,1]
[1] 44

The item you selected corresponds to the first plant (sativa), the first water treatment (hi), and the first season (spring), and you see the result, 44. If you want to see several items you can specify multiple values for any dimension. In the following example you select two values for the plant dimension (1:2), which will display a result for both sativa and vulgaris.

> pw.tap[1:2,1,1]
  sativa vulgaris 
      44       28 

The two result values (44 and 28) correspond to the first water treatment (hi) and the first season (spring). In the following example you select multiple values for the first two dimensions (plant and water) but only a single value for the third (season).

> pw.tap[1:2,1:3,1]
         hi lo mid
sativa   44  7  14
vulgaris 28  9  14

Now you can see that you have selected all of the plant and water treatments but only a single season (spring).

The result is an array object, and as you have seen, it can have multiple dimensions. You can use the class() command to determine that the result is indeed an array object:

> class(pw.tap)
[1] "array"
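
Because the dimensions of the array carry names, you can also index by name instead of by position, which tends to be easier to read. The following sketch uses a hypothetical data frame (dat) laid out like pw but with made-up values and only two water levels; the expand.grid() command is used here simply to generate every combination of the factor levels.

```r
# Hypothetical data in the same layout as pw (values are illustrative only)
dat = expand.grid(plant = c("sativa", "vulgaris"),
                  water = c("hi", "lo"),
                  season = c("spring", "summer"))
dat$height = c(44, 28, 7, 9, 38, 32, 5, 8)

arr = tapply(dat$height, list(dat$plant, dat$water, dat$season), mean)
dim(arr)

# Index by name instead of by position
arr["sativa", "hi", "spring"]

# Leave a dimension empty to keep all of it: every plant and water level in summer
arr[, , "summer"]
```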

Summarizing data using grouping variables is an important task that you will need to undertake often. Most often you will need to check means or medians for the groups, but many other functions can be useful. In the following activity you practice using the mean, but you could try out some other functions (for example, median, sum, sd, or length).

Try It Out: Use Grouping Variables to Summarize Complex Data
In this activity you will examine the data object you created in an earlier activity by using some grouping variables to split the data in various ways and summarize using the mean.
1. Look back at the flwr data object that you created earlier:
> flwr
2. You have two response variables (daisy and poa) and two predictor variables (cutting and time). Summarize the daisy variable by obtaining the mean for the grouping variable cutting:
> tapply(flwr$daisy, flwr$cutting, mean)
     mow    sheep    unmow 
13.33333 16.20000  8.25000
3. Now add the second grouping variable:
> tapply(flwr$daisy, list(flwr$cutting, flwr$time), mean)
         early late
mow   13.50000 13.0
sheep 15.33333 17.5
unmow  7.50000  9.0
4. Alter the result by changing the order of the grouping variables:
> tapply(flwr$daisy, list(flwr$time, flwr$cutting), mean)
       mow    sheep unmow
early 13.5 15.33333   7.5
late  13.0 17.50000   9.0
5. Look at the other response variable:
> with(flwr, tapply(poa, list(cutting, time), mean))
      early late
mow   11.75 10.5
sheep  8.00  6.5
unmow 15.50 16.5
6. Alter the order of the result:
> with(flwr, tapply(poa, list(time, cutting), mean))
        mow sheep unmow
early 11.75   8.0  15.5
late  10.50   6.5  16.5
How It Works
The tapply() command requires three instructions: the data to summarize, a grouping variable (which can be a list of several), and the summary function to apply. If the data are not separate vectors, you need to use $, attach(), or with() to “read” the variables. Here you start by using $ and later switch to with().
When you have two grouping variables, use the list() command to specify them. The first grouping variable listed forms the rows of the result and the second forms the columns. Any additional variables result in separate tables and an array object.
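
If you do not have the flwr object to hand, the following sketch rebuilds it from the final listing shown earlier in the chapter and then swaps the summary function. Using length as the function is one simple way to check the replication in each combination of the grouping variables; the names means and reps are illustrative.

```r
# Rebuild the flwr data frame from the listing printed earlier
flwr = data.frame(
  daisy = c(12, 15, 17, 11, 15, 8, 9, 7, 9, 12, 14, 17, 21, 17, 10),
  poa = c(8, 9, 11, 12, 10, 15, 17, 16, 16, 7, 8, 8, 5, 9, 18),
  cutting = c(rep("mow", 5), rep("unmow", 4), rep("sheep", 5), "mow"),
  time = c("early", "late", "early", "late", "early",
           "early", "late", "early", "late",
           "early", "late", "early", "late", "early",
           "early"),
  stringsAsFactors = TRUE)

# Mean of daisy for each cutting and time combination
means = with(flwr, tapply(daisy, list(cutting, time), mean))
means

# Number of replicates in each combination
reps = with(flwr, tapply(daisy, list(cutting, time), length))
reps
```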

The tapply() command is very useful, but you may only want to have a simple table/matrix as your result rather than a complicated array. It is possible to summarize a data object using multiple grouping factors using other commands, as you see next.

The aggregate() Command

The aggregate() command enables you to compute summary statistics for subsets of a data frame or matrix; the result comes out as a single data frame rather than an array item, even with multiple grouping factors. The general form of the command is as follows:

aggregate(x, by, FUN, ...)

You specify the data you want followed by a list() of the grouping variables and the function you want to use. In the following example you use a single grouping variable and the sum() function:

> aggregate(mf[1:5], by = list(mf$site), FUN = sum)
   Group.1 Length Speed Algae   NO3 BOD
1      Exe     88    83   235  7.15 859
2      Lyn    110    73   355 12.95 534
3     Ouse    102    76   325 10.35 753
4      Taw    107    74   285 10.05 745
5 Torridge     84    89   260 10.65 758

In this case you specify all the numeric columns in the data frame, and the result shows the sum of each of the groups represented by the grouping variable (site).
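
The Group.1 heading is not very informative. If you name the items in the list() that you give to the by instruction, those names are used as the column headings instead. The following sketch shows this with a small hypothetical data frame (dat) standing in for mf.

```r
# Hypothetical data frame standing in for mf (values are illustrative only)
dat = data.frame(Length = c(20, 21, 16, 15),
                 BOD = c(200, 180, 168, 180),
                 site = factor(c("Taw", "Taw", "Exe", "Exe")))

# Unnamed list: the grouping column is labelled Group.1
aggregate(dat[1:2], by = list(dat$site), FUN = sum)

# Named list: the grouping column takes the name you supply
res = aggregate(dat[1:2], by = list(site = dat$site), FUN = sum)
res
```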

You can also use the aggregate() command with a formula syntax; in this case you specify the response variable to the left of the ~ and the predictor variables to the right, like so:

> aggregate(Length ~ site, data = mf, FUN = mean)
      site Length
1      Exe   17.6
2      Lyn   22.0
3     Ouse   20.4
4      Taw   21.4
5 Torridge   16.8

This allows a slightly simpler command because you do not need the $ signs as long as you specify where the data are to be found. In this case you chose a single response variable (Length) and the mean() as your summary function. You can select several response variables at once by wrapping them in a cbind() command:

> aggregate(cbind(Length, BOD) ~ site, data = mf, FUN = mean)
      site Length   BOD
1      Exe   17.6 171.8
2      Lyn   22.0 106.8
3     Ouse   20.4 150.6
4      Taw   21.4 149.0
5 Torridge   16.8 151.6

Here you chose two response variables (Length and BOD), which are given in the cbind() command (thus making a temporary matrix). You can select all the variables by using a period instead of any names to the left of the ~:

> aggregate(. ~ site, data = mf, FUN = mean)
      site Length Speed Algae  NO3   BOD
1      Exe   17.6  16.6    47 1.43 171.8
2      Lyn   22.0  14.6    71 2.59 106.8
3     Ouse   20.4  15.2    65 2.07 150.6
4      Taw   21.4  14.8    57 2.01 149.0
5 Torridge   16.8  17.8    52 2.13 151.6

Here you use all the variables in the data frame. This works only if the remaining variables are all numeric; if you have other character variables, you need to specify the columns you want explicitly.
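
One point worth checking when you use the formula syntax with several variables is how missing values are treated. With its default settings, the formula form of aggregate() uses na.action = na.omit, so a row with an NA in any variable named in the formula is dropped before the summary is made. The sketch below uses a hypothetical data frame (dat) to compare this with the by = list() form, where each column is summarized separately.

```r
# Hypothetical data with one missing value in column b
dat = data.frame(g = c("x", "x", "y", "y"),
                 a = c(1, 3, 5, 7),
                 b = c(2, NA, 6, 8))

# Formula form: row 2 is removed entirely because b is NA there,
# so the mean of a for group x is based on a single value
form.res = aggregate(cbind(a, b) ~ g, data = dat, FUN = mean)
form.res

# by = list() form: each column is summarized on its own, and
# na.rm = TRUE is passed on to mean()
by.res = aggregate(dat[c("a", "b")], by = list(g = dat$g),
                   FUN = mean, na.rm = TRUE)
by.res
```

The two results differ for column a in group x, so it is sensible to decide which behavior you want before summarizing data that contain NA items.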

Because of the nature of the output/result, some people may find the aggregate() command more useful in presenting summary statistics than the tapply() command discussed earlier. In the following example, you have a data frame with one response variable and three predictor variables:

> str(pw)
'data.frame': 18 obs. of  4 variables:
 $ height: int  9 11 6 14 17 19 28 31 32 7 ...
 $ plant : Factor w/ 2 levels "sativa","vulgaris": 2 2 2 2 2 2 2 2 2 1 ...
 $ water : Factor w/ 3 levels "hi","lo","mid": 2 2 2 3 3 3 1 1 1 2 ...
 $ season: Factor w/ 2 levels "spring","summer": 1 2 2 1 2 2 1 2 2 1 ...

> pw.agg = aggregate(height ~ plant * water * season, data = pw, FUN = mean)
> pw.agg
      plant water season height
1    sativa    hi spring   44.0
2  vulgaris    hi spring   28.0
3    sativa    lo spring    7.0
4  vulgaris    lo spring    9.0
5    sativa   mid spring   14.0
6  vulgaris   mid spring   14.0
7    sativa    hi summer   37.5
8  vulgaris    hi summer   31.5
9    sativa    lo summer    5.5
10 vulgaris    lo summer    8.5
11   sativa   mid summer   16.0
12 vulgaris   mid summer   18.0

The result is a simple data frame, and this can make it easier to extract the components than for the array result you had previously with the tapply() command.
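
Because the result is an ordinary data frame, the usual row and column tools apply to it. The following is a minimal sketch using a hypothetical data frame (dat) with made-up values in the same layout as pw; it selects the rows for one season, finds the combination with the largest mean, and sorts the summary.

```r
# Hypothetical data in the same layout as pw (values are illustrative only)
dat = expand.grid(plant = c("sativa", "vulgaris"),
                  water = c("hi", "lo"),
                  season = c("spring", "summer"))
dat$height = c(44, 28, 7, 9, 38, 32, 5, 8)

agg = aggregate(height ~ plant + water + season, data = dat, FUN = mean)

# Rows for one season only
spring = agg[agg$season == "spring", ]
spring

# The combination with the largest mean
top = agg[which.max(agg$height), ]
top

# The whole summary sorted from smallest to largest mean
sorted = agg[order(agg$height), ]
sorted
```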

You could have achieved the same result by using the period instead of the grouping variable names like so:

> aggregate(height ~ . , data = pw, FUN = mean)

So, like the previous example, the period means, “everything else not already named.” If you replace the period with a number 1, you get quite a different result:

> aggregate(height ~ 1 , data = pw, FUN = mean)
    height
1 19.44444

You get the overall mean value here; essentially you have said, “don’t use any grouping variables.”

The aggregate() command is very powerful, partly because you can use the formula syntax and partly because of the output, which is a single data frame. In the following activity you practice using the command using the mean as the summary function, but you could try some others (for example, median, sum, sd, or length).

Try It Out: Use the aggregate() Command to Make Summary Results
In this activity you look at a data object you created in an earlier activity. You use the aggregate() command to explore the means of various groupings.
1. Look again at the flwr data object that you created earlier:
> flwr
2. You have two response variables (daisy and poa) and two predictor variables (cutting and time). Begin by summarizing the mean of the daisy variable grouped by cutting:
> aggregate(flwr$daisy, by = list(flwr$cutting), FUN = mean)
  Group.1        x
1     mow 13.33333
2   sheep 16.20000
3   unmow  8.25000
3. Now add the time variable and group using both predictors:
> aggregate(flwr$daisy, by = list(flwr$cutting, flwr$time), FUN = mean)
  Group.1 Group.2        x
1     mow   early 13.50000
2   sheep   early 15.33333
3   unmow   early  7.50000
4     mow    late 13.00000
5   sheep    late 17.50000
6   unmow    late  9.00000
4. Summarize both response variables using both the predictors as grouping variables:
> aggregate(flwr[1:2], by = list(flwr$cutting, flwr$time), FUN = mean)
  Group.1 Group.2    daisy   poa
1     mow   early 13.50000 11.75
2   sheep   early 15.33333  8.00
3   unmow   early  7.50000 15.50
4     mow    late 13.00000 10.50
5   sheep    late 17.50000  6.50
6   unmow    late  9.00000 16.50
5. Now use the formula syntax to summarize the poa response variable by using both grouping variables:
> aggregate(poa ~ cutting + time, data = flwr, FUN = mean)
  cutting  time   poa
1     mow early 11.75
2   sheep early  8.00
3   unmow early 15.50
4     mow  late 10.50
5   sheep  late  6.50
6   unmow  late 16.50
6. Summarize both response variables using both predictor variables:
> aggregate(. ~ cutting + time, data = flwr, FUN = mean)
  cutting  time    daisy   poa
1     mow early 13.50000 11.75
2   sheep early 15.33333  8.00
3   unmow early  7.50000 15.50
4     mow  late 13.00000 10.50
5   sheep  late 17.50000  6.50
6   unmow  late  9.00000 16.50
7. Obtain the same result by specifying the response variables explicitly:
> aggregate(cbind(poa, daisy) ~ cutting + time, data = flwr, FUN = mean)
8. This time use all the variables, but save some typing and specify the grouping variables with a period:
> aggregate(cbind(poa, daisy) ~ ., data = flwr, FUN = mean)
9. Ignore all the grouping variables and obtain an overall mean for the two response variables:
> aggregate(cbind(poa, daisy) ~ 1, data = flwr, FUN = mean)
       poa    daisy
1 11.26667 12.93333
How It Works
Using the aggregate() command without the formula syntax requires a list() even if you are summarizing using a single grouping variable. The result is a single data frame with the first columns ordered in the same order in which you specify the grouping variables, with the final column containing the response. You can specify multiple response variables from a data frame by using the square brackets.
The formula syntax enables you to specify the data in a logical manner, and the result also contains more appropriate column headings. Multiple response variables can be given by using the cbind() command to name them explicitly. You can use a period to state that you require all variables not already named in the command. That works here because you do not have any other variables. You can use the period to specify the predictor variables, too.
Using a 1 instead of any grouping variables gives an overall value, essentially ignoring the groupings.

Summary

  • Vector objects need to be of equal length before they can be made into a data frame or matrix.
  • You can use the rep() command to create replicate labels as factors. You can also use the gl() command to generate factor levels.
  • The levels of a factor can be examined using the levels() and nlevels() commands.
  • A character vector can be converted to a factor using the as.factor() and factor() commands.
  • Data frames can be constructed using the data.frame() command. Matrix objects can be constructed using matrix(), cbind(), or rbind() commands.
  • Simple summary commands can be applied to rows or columns using commands such as rowSums() and colMeans().
  • The apply() command can apply a function to rows or columns.
  • The rowsum() command can use a grouping variable to sum data across rows. Grouping variables can be used in the tapply() and aggregate() commands along with any function.
  • If more than two grouping variables are used with the tapply() command, a multi-dimensional array object is the result. In contrast, the aggregate() command always produces a single data frame as the result and can use the formula syntax.

Exercises

You can find answers to these exercises in Appendix A.

1. Look at the bees data object from the Beginning.RData file for this exercise. The data are a matrix, and the columns relate to individual bee species and the rows to different flowers. The numerical data are the number of bees observed visiting each flower. Create a new factor variable (call it flcol) that could be used as a grouping variable. Here you require the general color type to be represented; you can think of the first two flowers as being equivalent (blue) and the last three as equivalent (yellow).
2. Take the bees matrix and add the grouping variable you just created to it to form a new matrix.
3. Use the flcol grouping variable you just created to summarize the Buff.tail column in the bees data; use any sensible summarizing command. Can you produce a summary for all the bee species in one go?
4. Look at the ChickWeight data item, which comes built into R. The data comprise a data frame (although it also has other attributes) with a single response variable and some predictor variables. Look at median values for weight broken down by Diet. Now add the Time variable as a second grouping factor.
5. Access the mtcars data, which are built in to R. The data are in a data frame with several columns. Summarize the miles-per-gallon variable (mpg) as a mean for the three grouping variables cyl, gear, and carb.

What You Learned In This Chapter

Topic: Making data items
Commands: length()
Key points: Vectors need to be the same length if they are to be added to a data frame or matrix. The length() command can query the current length or alter it. Setting the length shorter than the current value truncates the item, and making it longer adds NA items to the end.

Topic: Making data items
Commands: names(), row.names(), rownames(), colnames()
Key points: You can use several commands to query and alter names of columns or rows. The rownames() and colnames() commands are used for matrix objects, whereas the names() and row.names() commands work on data frames.

Topic: Stacking separate vectors
Commands: stack()
Key points: You can use the stack() command to combine vectors and so form a data frame suitable for complex analysis. This really only works when you have a single response variable and a single predictor.

Topic: Removing NA items
Commands: na.omit()
Key points: The na.omit() command strips out unwanted NA items from vectors and data frames.

Topic: Repeated elements
Commands: rep(), gl()
Key points: You can generate repeated elements, such as character labels that will form a predictor variable, by using the rep() or gl() commands. Both commands enable you to generate multiple levels of a variable based on a repeating pattern.

Topic: Factor elements
Commands: factor(), as.factor(), levels(), nlevels(), as.numeric()
Key points: A factor is a special sort of character variable. You can force a vector to be treated as a factor by using the factor() or as.factor() commands. The levels() command shows the different levels of a factor variable, and the nlevels() command returns the number of discrete levels of a factor variable. A factor can be returned as a list of numbers by using the as.numeric() command.

Topic: Constructing a data frame
Commands: data.frame()
Key points: Several objects can be combined to form a data frame using the data.frame() command.

Topic: Constructing a matrix
Commands: matrix(), cbind(), rbind()
Key points: You can create a matrix in two main ways: by assembling a vector into rows and columns using the matrix() command or by combining other elements. You can combine elements by column using the cbind() command or by row using the rbind() command.

Topic: Simple row or column sums or means
Commands: rowSums(), colSums(), rowMeans(), colMeans()
Key points: Numerical objects (data frames and matrix objects) can be examined using simple row/column sums or means using rowSums(), colSums(), rowMeans(), or colMeans() commands as appropriate. These cannot take into account any grouping variable.

Topic: Simple sum using a grouping variable
Commands: rowsum()
Key points: The rowsum() command enables you to add up columns based on a grouping variable. The result is a series of rows of sums (hence the command name).

Topic: Apply a command to rows or columns
Commands: apply()
Key points: The apply() command enables you to give a command across rows or columns of a data frame or matrix.

Topic: Use a grouping variable with any command
Commands: tapply()
Key points: The tapply() command enables the use of grouping variables and can utilize any command (for example, mean, median, sd), which is applied to a single vector (or element of a data frame or matrix).

Topic: Array objects
Commands: object[x, y, z, ...]
Key points: An array object has more than two dimensions; that is, it cannot be described simply by rows and columns. An array is typically generated by using the tapply() command with more than two grouping variables. The resulting array has several dimensions, each one relating to a grouping variable. The array itself can be subdivided using the square brackets and identifying the appropriate dimensions.

Topic: Summarize using a grouping variable
Commands: aggregate()
Key points: The aggregate() command can utilize any command and any number of grouping variables. The result is a two-dimensional data frame, regardless of how many grouping variables are used.