Appendix

Answers to Exercises

This appendix provides the solutions for the end-of-chapter exercises located in Chapters 1–12.

Chapter 1

Exercise 1 Solution

To install the coin library/package you need to type the following command:

> install.packages('coin')

Note that the name must be in quotes. Then you select the closest mirror site and the package is downloaded and installed for you.

Exercise 2 Solution

To load the coin library and make it ready to use you type the following:

>library(coin)

Once the library is active its commands are available for you to use. You can try to bring up the help entry for the package using the following:

>help(coin)

This does not work for all packages; a better alternative is to use the HTML help system by typing:

>help.start()

This opens the main R help files in your default browser. You can now follow the links to the packages and then find coin from the list.

Exercise 3 Solution

The MASS package is already installed as part of the standard R installation, but it is not ready to use until you type:

>library(MASS)

Once you have the library available you can open the help entry using:

>help(bcv)

Exercise 4 Solution

You can use the search() command to see which packages and other objects (such as attached data frames) are currently loaded and available for use.
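
For example, a typical result in a fresh session looks something like the following (your own search path will differ depending on what you have loaded):

> search()
 [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
 [4] "package:grDevices" "package:utils"     "package:datasets" 
 [7] "package:methods"   "Autoloads"         "package:base"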

Exercise 5 Solution

To clear the coin package from memory (and remove it from the search() path), type the following:

>detach(package:coin)

Chapter 2

Exercise 1 Solution

You can use either the c() command or the scan() command to enter these data. The problem is that the bee names contain spaces, which are not allowed. You must alter the names to remove the spaces; the period is the simplest solution.

If you decided to use the c() command then the first vector would be created like so:

> Buff.tail = c(10, 1, 37, 5, 12)

If you decided to use the scan() command then the process is in two parts. The first part is to initiate the data entry like so:

> Buff.tail = scan()

The second part is to enter the data:

1: 10 1 37 5 12

To finish the data entry process you must enter a blank line.

If you decide to use the c() command, the entire data entry process would look like the following:

> Buff.tail = c(10, 1, 37, 5, 12)
> Garden.bee = c(8, 3, 19, 6, 4)
> Red.tail = c(18, 9, 1, 2, 4)
> Honey.bee = c(12, 13, 16, 9, 10)
> Carder.bee = c(8, 27, 6, 32, 23)

Exercise 2 Solution

You can use the ls() command to list all the objects currently in memory. However, there will often be quite a few other objects so you can narrow your display by using a regular expression like so:

> ls(pattern = 'tail|bee')

Note that you must not include a space on either side of the | pipe character. This listing shows all objects that contain “tail” or “bee”. You could also list objects that ended with “tail” or “bee” by adding the dollar sign as a suffix like so:

> ls(pattern = '.tail$|.bee$')

To save the data objects you can use the save() command. The names of the objects could be typed into the command to make a long listing, but this is tedious. The regular expression you typed earlier can be used to produce the list of objects to be saved like so:

> save(list = ls(pattern = 'tail|bee'), file = 'bee data all.RData')

To remove the unwanted individual vectors you need to use the rm() command. The names of the objects could be typed into the command as a list or the regular expression could be used once again. This time you must ensure that you do not remove the bees data frame so type the ls() command first to check:

> ls(pat = 'tail$|bee$')

Now the $ suffix ensures that you select only those objects whose names end with the specified text. You can use the up arrow to recall the command and edit it to form the rm() command like so:

> rm(list = ls(pat = 'tail$|bee$'))

Note that you must use the list = instruction to ensure that the result of the ls() part is treated like a list.

Now quit R using the q() command and select “No” when asked if you want to save the workspace. Restart R and use the ls() command once again. All the bee data are gone. To retrieve the data you use the load() command.

> load('bee data all.RData')

This command retrieves the data that you saved earlier; the individual vectors that you made are all included in the one file. If you are using Windows or Macintosh you can also use the file.choose() instruction rather than the explicit filename like so:

> load(file.choose())

You can also load the data by double clicking on the appropriate file from a Windows Explorer or Mac Finder window.

Chapter 3

Exercise 1 Solution

You can use either the c() command or the scan() command to enter these data. The problem is that the bee names contain spaces, which are not allowed. You must alter the names to remove the spaces; the period is the simplest solution.

If you decided to use the c() command then the first vector would be created like so:

> Buff.tail = c(10, 1, 37, 5, 12)

If you decided to use the scan() command then the process is in two parts. The first part is to initiate the data entry like so:

> Buff.tail = scan()

The second part is to enter the data:

1: 10 1 37 5 12

To finish the data entry process you must enter a blank line.

If you decide to use the c() command the entire data entry process would look like the following:

> Buff.tail = c(10, 1, 37, 5, 12)
> Garden.bee = c(8, 3, 19, 6, 4)
> Red.tail = c(18, 9, 1, 2, 4)
> Honey.bee = c(12, 13, 16, 9, 10)
> Carder.bee = c(8, 27, 6, 32, 23)

To create a data frame you must decide on a name for the object and then use the data.frame() command like so:

> bees = data.frame(Buff.tail, Garden.bee, Red.tail, Honey.bee, Carder.bee)

To create row names you can use either the row.names() or rownames() command. The plant names also contain spaces, which need to be dealt with as before by replacing with a full stop. The shortest method is to assign the names as a simple list of data like so:

> row.names(bees) = c('Thistle', 'Vipers.bugloss', 'Golden.rain', 
'Yellow.alfalfa', 'Blackberry')

You might also decide to create a vector to hold the plant names as a separate object:

> plant.names = c('Thistle', 'Vipers.bugloss', 'Golden.rain', 
'Yellow.alfalfa', 'Blackberry')
> row.names(bees) = plant.names

This last method is slightly longer but the vector of plant names can be useful for other purposes.

Exercise 2 Solution

To make a matrix you need the data as separate columns (or rows) or as a single vector of values. You already have the separate vectors for the different bees so begin by using the cbind() command to join them column by column into a new matrix:

> beematrix = cbind(Buff.tail, Garden.bee, Red.tail, Honey.bee, Carder.bee)

Your new matrix will not contain any row names so to include them you need to use the rownames() command:

> plant.names = c('Thistle', 'Vipers.bugloss', 'Golden.rain', 
'Yellow.alfalfa', 'Blackberry')
> rownames(beematrix) = plant.names

The second way to create a matrix is to use a single vector of values and use the matrix() command. To create a single vector you could combine the individual bee vectors:

> bee.data = c(Buff.tail, Garden.bee, Red.tail, Honey.bee, Carder.bee)

Now you can create a new matrix; you will need five columns (one for each bee species):

> beematrix2 = matrix(bee.data, ncol = 5)

The new matrix does not contain row or column names. You already created a vector of plant names, which can be used with the rownames() command:

> rownames(beematrix2) = plant.names

In order to make column names you must either type the names directly into the colnames() command or create a vector of bee names to use:

> bee.names = c('Buff.tail', 'Garden.bee', 'Red.tail', 'Honey.bee', 'Carder.bee')

Now you can use the colnames() command (remember that the names() command does not work with a matrix):

> colnames(beematrix2) = bee.names

To convert a matrix to a data frame you can use the as.data.frame() command:

> mat.to.frame = as.data.frame(beematrix2)

To convert a data frame into a matrix you can use the as.matrix() command:

> frame.to.mat = as.matrix(bees)

To make a list you can use the list() command:

> bee.list = list(bees, plant.names, bee.names)

If you look at the new list you see that the elements are not named so you must add names using the names() command.

> names(bee.list) = c('bees', 'plant.names', 'bee.names')

Now you have a single item that contains the data as well as separate items with the row and column names.

Exercise 3 Solution

To tidy up you will need to use the rm() command. You can type the names into the command individually or you can use the ls() command along with a regular expression to remove several items at once.

> rm(list = ls(pat = 'bee$|tail|^beem'))

The preceding command creates a list of objects ending with “bee” as well as those containing “tail”. It also lists the two matrix objects as they begin with “beem”. The command removes all the individual bee vectors as well as the two matrix objects. You can use a separate command to remove the two vectors of names:

> rm(bee.names, plant.names)

To display the data for the Blackberry only you need to determine which row you need and then use the [row, column] syntax; there are two options:

> bees[5,]
> bees['Blackberry',]

In the first case the number of the row was used whilst in the second example the name of the row was used.

To display the data for Golden rain and Yellow alfalfa requires a bit more thought. You could simply determine the appropriate rows and use these values like so:

> bees[3:4, ]
> bees[c(3, 4), ]

You might have thought about creating an index to work out the appropriate rows:

> ii = which(rownames(bees)=='Golden.rain'|rownames(bees)=='Yellow.alfalfa')
> bees[ii, ]

To display the data for the Red tail bee only you can use the $ syntax like so:

> bees$Red.tail

However, all you see are the plain values; the plant names are not shown alongside the values. You could achieve the same result using the [row, column] syntax too:

> bees[, 3]
> bees[, 'Red.tail']

Once again the numerical values are shown without any labels. With a slight modification you can display the appropriate column along with the labels by omitting the row designation like so:

> bees[3]
> bees['Red.tail']

In either case, you will see the data for the Red tail bee as well as the appropriate row names as labels.

Exercise 4 Solution

The first step is to create an index to re-order the rows. You can use the order() command to achieve this using the Buff tail column like so:

> ii = order(bees$Buff.tail, bees$Red.tail, decreasing = TRUE)

Remember that you need to specify decreasing = TRUE as the default is for ascending (that is, decreasing = FALSE). The index can now be used to create a new data frame with the new row order:

> bees.r = bees[ii, ] 

Now that you have re-ordered the rows (Golden rain should be the top row), you can create an index to re-order the columns. You can use the order() command again to alter the order of the first row:

> ii = order(bees.r[1,], decreasing = TRUE)

Once again you need to specify decreasing = TRUE as an instruction. Now you can apply the new index to the data to alter the order of the columns:

> bees.rc = bees.r[, ii ]

You can tidy up by removing any unwanted data frames using the rm() command.
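For example, assuming you want to keep only the original bees data frame and the final bees.rc version, you might remove the intermediate data frame like so:

> rm(bees.r)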

Chapter 4

Exercise 1 Solution

If you type its name you see the mf data. At first glance this appears to be a data frame because it is a two-dimensional object with rows and columns. However, the object might be a matrix or even a table object. To examine the structure and type more closely you can use the str() and class() commands:

> str(mf)
'data.frame': 25 obs. of  5 variables:
 $ Length: int  20 21 22 23 21 20 19 16 15 14 ...
 $ Speed : int  12 14 12 16 20 21 17 14 16 21 ...
 $ Algae : int  40 45 45 80 75 65 65 65 35 30 ...
 $ NO3   : num  2.25 2.15 1.75 1.95 1.95 2.75 1.85 1.75 1.95 2.35 ...
 $ BOD   : int  200 180 135 120 110 120 95 168 180 195 ...
> class(mf)
[1] "data.frame"

You can now see that this is a data frame. You can get a basic summary of the entire data frame using the summary() command:

> summary(mf)
     Length          Speed          Algae           NO3             BOD       
 Min.   :13.00   Min.   : 9.0   Min.   :25.0   Min.   :1.050   Min.   : 55.0  
 1st Qu.:18.00   1st Qu.:12.0   1st Qu.:40.0   1st Qu.:1.750   1st Qu.:110.0  
 Median :20.00   Median :16.0   Median :65.0   Median :1.950   Median :145.0  
 Mean   :19.64   Mean   :15.8   Mean   :58.4   Mean   :2.046   Mean   :146.0  
 3rd Qu.:21.00   3rd Qu.:20.0   3rd Qu.:75.0   3rd Qu.:2.350   3rd Qu.:180.0  
 Max.   :25.00   Max.   :26.0   Max.   :85.0   Max.   :2.950   Max.   :235.0

Because all the columns are numerical, you see a numerical summary for each one. You can select a single column and summarize it by using the $ syntax or include the with() command to allow R to read the columns inside the data frame:

> mean(mf$Speed)
[1] 15.8
> with(mf, median(Algae))
[1] 65

You can summarize all the columns at once using the colMeans() command or apply any summary function using the apply() command:

> colMeans(mf)
 Length   Speed   Algae     NO3     BOD 
 19.640  15.800  58.400   2.046 145.960 
> apply(mf, 2, sd)
   Length     Speed     Algae       NO3       BOD 
 3.080584  4.681524 19.457218  0.504546 44.954125

Exercise 2 Solution

The bfs object is a data frame. You can determine this by using the str() or class() commands. These data comprise only two columns: a response variable (count) and a predictor variable (site). The basic table() command produces a contingency table:

> table(bfs)
     site
count Arable Grass Heath
   3       1     2     0
   4       0     3     0
   5       0     2     0
   6       0     1     1
   7       0     1     1
   8       2     1     2
   9       2     0     1
   11      2     0     2
   12      1     1     1
   19      1     0     0
   21      0     1     0

You can produce an identical table by specifying the columns explicitly:

> with(bfs, table(count, site))

The ftable() command can also produce the same result like so:

> ftable(bfs)
> ftable(site ~ count, data = bfs)

Either command produces the same result. The table can be produced in a different configuration by specifying the columns in a different order:

> with(bfs, table(site, count))
        count
site     3 4 5 6 7 8 9 11 12 19 21
  Arable 1 0 0 0 0 2 2  2  1  1  0
  Grass  2 3 2 1 1 1 0  0  1  0  1
  Heath  0 0 0 1 1 2 1  2  1  0  0

The ftable() command can also produce this result:

> ftable(count ~ site, data = bfs)

The difference between the two commands is that the result of the table() command has a class of “table”, whereas the ftable() command produces a result with class of “ftable”.
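
You can confirm this with the class() command:

> class(with(bfs, table(count, site)))
[1] "table"
> class(ftable(bfs))
[1] "ftable"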

Exercise 3 Solution

The invert object is a simple data frame. To create a cross-tabulated contingency table you need to use the xtabs() command like so:

> invert.tab = xtabs(Qty ~ Taxa + Habitat, data = invert)
> invert.tab
        Habitat
Taxa     Upper Lower Stem
  Aphid    230   175  321
  Bug       34    31   35
  Beetle    72    23  101
  Spider    11     3    5
  Ant       12     9   15

The resulting object holds two classes, as you can see when using the class() command:

> class(invert.tab)
[1] "xtabs" "table"

To reconstruct the original data you need to use the as.data.frame() command. Because the object holds a table class this will work adequately:

> as.data.frame(invert.tab)
     Taxa Habitat Freq
1   Aphid   Upper  230
2     Bug   Upper   34
3  Beetle   Upper   72
4  Spider   Upper   11
5     Ant   Upper   12
6   Aphid   Lower  175
7     Bug   Lower   31
8  Beetle   Lower   23
9  Spider   Lower    3
10    Ant   Lower    9
11  Aphid    Stem  321
12    Bug    Stem   35
13 Beetle    Stem  101
14 Spider    Stem    5
15    Ant    Stem   15

Chapter 5

Exercise 1 Solution

The predictor column of the orchis data frame comprises three different sites (that is, levels), as you can see if you use the summary() command. To produce a histogram of the data for just the sprayed site, you need to extract the data from the main data frame. One way to do this is to use the $ syntax:

> orchis$flower[which(orchis$site=='sprayed')]
 [1] 1 2 5 4 7 6 4 3 4 5

The histogram can be drawn using these data but you need to remember to set freq = FALSE so that the y-axis displays the density:

> hist(orchis$flower[which(orchis$site=='sprayed')], freq = FALSE)

Now that the y-axis is set to density, you can use the lines() command to add the density plot over the existing histogram. The density() command itself calculates where the lines will be plotted and the lines() command actually draws them. You can alter the appearance of the density line; in this case the line is drawn slightly wider than standard using the lwd = 2 instruction:

> lines(density(orchis$flower[which(orchis$site=='sprayed')]), lwd = 2)

You might also have used the unstack() command to create the data; in this case the third column represents the sprayed sample. The following commands would all have created the same result, a new vector of numeric values:

> sprayed = unstack(orchis)[,3]
> sprayed = unstack(orchis)[,'sprayed']

You could then have used the hist() and density() commands on the new vector.
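
For example, a sketch of that alternative (assuming you created the sprayed vector as just shown):

> hist(sprayed, freq = FALSE)
> lines(density(sprayed), lwd = 2)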

Exercise 2 Solution

The Wilcoxon statistic is examined via the pwilcox(), dwilcox(), rwilcox(), and qwilcox() commands. You might have to examine the help files to find out more. The following help commands would get you to the right place:

help(Distributions)
help(dwilcox)
help(Wilcoxon)

To obtain a critical value you need to use the qwilcox() command. The first argument is the cumulative probability; because the test is two-tailed, you halve the significance level and subtract it from 1, giving 0.975 for p = 0.05 and 0.995 for p = 0.01. The command also requires the number of replicates in the two samples:

> qwilcox(c(0.975, 0.995), 8, 8)
[1] 50 56

You now know that if you get a value of 50 or greater the result would be significant at p < 0.05. If you obtained a value of 56 or greater, then p < 0.01.

Assuming your result of 77 represents the larger of the two calculated U values, you can determine the significance using the pwilcox() command like so:

> 2*(1-pwilcox(77, 10, 10))
[1] 0.03546299

The test is two-tailed so the result must be multiplied by two. You can also use the “other end” of the distribution in the command to avoid having to use the 1- part:

> pwilcox(77, 10, 10, lower.tail = FALSE)*2
[1] 0.03546299

Chapter 6

Exercise 1 Solution

You can view the InsectSprays data by simply typing its name. It will not appear if you use ls(), although you can make the item visible using data(InsectSprays). Start with a summary() command to see what you are dealing with:

> summary(InsectSprays)
     count       spray 
 Min.   : 0.00   A:12  
 1st Qu.: 3.00   B:12  
 Median : 7.00   C:12  
 Mean   : 9.50   D:12  
 3rd Qu.:14.25   E:12  
 Max.   :26.00   F:12

To run a t-test you will need to use the subset instruction to select the A and B samples from the spray variable.

> t.test(count ~ spray, data = InsectSprays, subset = spray %in% c('A', 'B'))

       Welch Two Sample t-test

data:  count by spray 
t = -0.4535, df = 21.784, p-value = 0.6547
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.646182  2.979515 
sample estimates:
mean in group A mean in group B 
       14.50000        15.33333

Exercise 2 Solution

The data frame contains two separate samples, therefore you must use the vector syntax; a formula is not appropriate. You can use the attach() command to allow the individual vectors to be “read” from the data frame, or the with() command to achieve the same result. The wilcox.test() command can be used like so:

> with(hogl, wilcox.test(fast, slow))

       Wilcoxon rank sum test with continuity correction

data:  fast and slow 
W = 12.5, p-value = 0.02651
alternative hypothesis: true location shift is not equal to 0 

Warning message:
In wilcox.test.default(fast, slow) : cannot compute exact p-value with ties

The $ syntax can also be used to pick out the separate vectors. Because there are tied ranks you might turn off the attempt to use an exact p-value:

> wilcox.test(hogl$fast, hogl$slow, exact = FALSE)

       Wilcoxon rank sum test with continuity correction

data:  hogl$fast and hogl$slow 
W = 12.5, p-value = 0.02651
alternative hypothesis: true location shift is not equal to 0

Now you do not see the warning message.

Exercise 3 Solution

The sleep data can be examined simply by typing its name. Because the variables are inside a data frame, the with() command is useful to allow R to “read” them. Because you have a grouping variable (called group) you need to split the data into two parts. You can do this using a conditional statement and the t-test can be run like so:

> with(sleep, t.test(extra[group==1], extra[group==2], paired = TRUE))

       Paired t-test

data:  extra[group == 1] and extra[group == 2] 
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.4598858 -0.7001142 
sample estimates:
mean of the differences 
                  -1.58

The [group==1] part extracts the data relating to group 1. A more long-winded way would be to unstack() the data into a new form and then run the test:

> sleep2 = unstack(sleep, form = extra ~ group)
> names(sleep2) = c('Grp1', 'Grp2')
> t.test(sleep2$Grp1, sleep2$Grp2, paired = T)

Exercise 4 Solution

You can view the mtcars data simply by typing its name. A complete correlation matrix can be created by using the cor() command and giving the data name in the parentheses:

cor(mtcars)

To narrow the focus you can type the name of a single variable in addition to the overall data. This is analogous to cor(x, y). For the mpg variable you would type:

cor(mtcars$mpg, mtcars)

Note that you need the $ so that the variable can be “read.” To conduct a correlation test you can use either the vector or the formula syntax. The $ sign can be used like so:

> cor.test(mtcars$mpg, mtcars$qsec)

       Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$qsec 
t = 2.5252, df = 30, p-value = 0.01708
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.08195487 0.66961864 
sample estimates:
     cor 
0.418684

The formula syntax can also be used and gives the same result:

> cor.test(~ mpg + qsec, data = mtcars)

Exercise 5 Solution

You can use the chisq.test() command to carry out a goodness-of-fit test. In this case you use the visit column as the main data and the ratio variable as the expected proportions for the test:

> chisq.test(bv$visit, p = bv$ratio, rescale = TRUE)

       Chi-squared test for given probabilities

data:  bv$visit 
X-squared = 191.9482, df = 7, p-value < 2.2e-16

You might also have considered using the Kolmogorov-Smirnov test that you met in the previous chapter:

> ks.test(bv$visit, bv$ratio)

Chapter 7

Exercise 1 Solution

The simplest way to create the box-whisker plot is to use the formula syntax to state how the graph should be constructed. You can use the range = 0 instruction to force the whiskers to extend to the max and min. The horizontal = TRUE instruction forces the plot to be displayed horizontally. To select a single wool type you need to use the subset instruction that you met earlier. The full command is shown here:

> boxplot(breaks ~ tension, data = warpbreaks, horizontal = TRUE, 
range = 0, subset = wool %in% 'A', col = 'cornsilk')
> title(xlab = 'Number of breaks', ylab = 'Tension', main = 'Wool type "A"')

The title() command has been used to add titles to the plot, but you could have specified the text as part of the main boxplot() command.
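
A sketch of that single-command alternative might look like the following:

> boxplot(breaks ~ tension, data = warpbreaks, horizontal = TRUE, range = 0, 
subset = wool %in% 'A', col = 'cornsilk', xlab = 'Number of breaks', 
ylab = 'Tension', main = 'Wool type "A"')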

Exercise 2 Solution

The plot() command is best used in this case. The formula notation is the simplest way to specify the data to plot. The axes are scaled to fit the points into the plot area, and this will not show the origin of the graph. The line of best-fit will not cross the y-axis unless you modify the scales of the axes. The following command shows how to rescale and draw the axes to show the line to its best advantage. You may have to experiment to get the best values. The xaxs = 'i' and yaxs = 'i' instructions have been used to remove extra space at the ends of the axes.

> plot(Girth ~ Volume, trees, ylim = c(5, 20), xlim = c(0,70), pch = 17, 
col = 'darkgreen', xaxs = 'i', yaxs = 'i', cex = 1.5, 
xlab = 'Volume (cubic ft.)', ylab = 'Girth (inches)')
> abline(lm(Girth ~ Volume, trees), lty = 2, lwd = 2, col = 'green')

The abline() command is used to add a trend line using the lm() command to determine slope and intercept from a linear model.

Exercise 3 Solution

These data are in a 3D table, so you need the table[row, col, group] syntax to get the parts you require. The dot chart is produced using the following command lines:

> dotchart(HairEyeColor[,,1], gdata = colMeans(HairEyeColor[,,1]), 
gpch = 16, gcolor = 'blue')
> title(xlab = 'Number of individuals', main = 'Males Hair and Eye color')
> mtext('Grouping = mean', side = 3, adj = 1)

You get the main table using HairEyeColor[,,1], which selects all rows and columns of the first group (male). To get the column means you use the colMeans() command on the same data. The other instructions set the plotting character and color for the group results. The mtext() command is optional because you could have given this information in a caption. In this case, the reader is informed that the mean is the grouping summary used, and the title is placed at the top (side = 3) and right justified (adj = 1).

Exercise 4 Solution

The table is a 3D table with rows, columns, and a grouping (male, female). The following command makes a grouped bar chart with appropriate colors for the bars:

> barplot(HairEyeColor[,,2], legend = TRUE, col = c('black', 'tan', 
'tomato', 'cornsilk'), beside = TRUE)
> title(xlab = 'Hair Color', ylab = 'Frequency')

The main command uses [, , 2] to select all rows and all columns from the second group (female), which is the part that contains the data required.

You could have specified the colors as a separate object and simply referred to it. The colors are specified in row order; look at the data to see the colors required. You can use the colors() command to see what named colors are available; some experimentation might be required to get the best colors from those at your disposal.
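
For example, assuming you call the vector of colors bar.cols, the commands might look like this:

> bar.cols = c('black', 'tan', 'tomato', 'cornsilk')
> barplot(HairEyeColor[,,2], legend = TRUE, col = bar.cols, beside = TRUE)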

Exercise 5 Solution

To get the summary data you need to use the tapply() command. The bar chart is then simply drawn using the barplot() command like so:

> barplot(tapply(bfs$count, bfs$site, FUN = median))
> abline(h=0)
> title(xlab = 'Habitat', ylab = 'Butterfly abundance')

The $ syntax is used in the preceding code, but you can also use the with() command to achieve the same result. Alternatively, you might create a new data object to hold the result of the tapply() command and then create the bar chart from that:

> with(bfs, barplot(tapply(count, site, FUN = median)))

In any event, the final commands draw a line under the bars to “ground” them and add some axis labels.

Chapter 8

Exercise 1 Solution

The formula syntax enables you to specify complex models. You do not need to use the attach() or with() commands or use the $ syntax because the data = instruction points to the data. In addition, it is easy to create graphs because you can copy the majority of the command for the graph from the command used for the analysis.
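
As a brief sketch of that last point (using the InsectSprays data from Chapter 6; the insect.aov name is just an example), the same formula serves both the analysis and the graph:

> insect.aov = aov(count ~ spray, data = InsectSprays)
> boxplot(count ~ spray, data = InsectSprays)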

Exercise 2 Solution

The chick data comprises six columns, and NA items need to be removed. The first stage is to stack the data into two columns with a response variable and a predictor variable (make this into a new object). These columns will need sensible names.

> chicks = na.omit(stack(chick))
> names(chicks) = c("weight", "feed")

The ANOVA is carried out using the aov() command fairly simply:

> chicks.aov = aov(weight ~ feed, data = chicks)
> summary(chicks.aov)

The result is highly significant; feed does have an effect on the weights of the chicks.

Exercise 3 Solution

The first step is to draw a boxplot of the data. It is fairly easy to use the up arrow to recall the aov() command and edit it to make the graphic.

> boxplot(weight ~ feed, data = chicks)

It looks like there are differences in feeds, but a post-hoc test will show you the significance of the pairwise comparisons:

> TukeyHSD(chicks.aov)
> TukeyHSD(chicks.aov, ordered = TRUE)

The first command is perhaps easier to compare to the boxplot, but the second version will be more useful when you plot the differences in means:

> plot(TukeyHSD(chicks.aov, ordered = TRUE))

You can see where the significant differences lie, but the labels do not fit very well, so you need to modify the margins of the plot and redraw it with different settings:

> oldpar = par(mar = c(5,8,4,2)+0.1)
> plot(TukeyHSD(chicks.aov, ordered = TRUE), las = 1, cex.axis = 0.85)
> abline(v = 0, lty = 3, col = 'gray50')
> par(oldpar)

The first command alters the margins (the second value relates to the left margin), making more room for the labels on the left. The previous plot() command can be edited to rotate the labels (las = 1) and make the axis labels smaller (cex.axis = 0.85). Your computer may have different graphics settings, so you may require slightly different values than these. The abline() command draws a vertical line, which shows where the significant differences lie (any bars that do not cross this line are significant). The final command resets the margins back to the previous settings.

Exercise 4 Solution

To begin with you should look at the data itself and see what you are dealing with. The summary() and str() commands are useful and using the names() command helps to remind you of the column headings. You might have used the tapply() command to check for a balanced design like so:

> tapply(bats$count, list(bats$spp, bats$method), FUN = length)

Because you have two predictor variables, a two-way ANOVA is indicated. This can be carried out using the aov() command:

> bats.aov = aov(count ~ spp * method, data = bats)
> summary(bats.aov)

Neither main effect is significant, but there is a highly significant interaction term.

Exercise 5 Solution

The interaction is highly significant. A good first step is to draw a boxplot of the data. The aov() command can be recalled and edited to save some typing:

> boxplot(count ~ spp * method, data = bats, cex.axis = 0.8, las = 1)

The axis labels need to be made smaller to fit using the cex.axis = 0.8 instruction. The las = 1 instruction makes all labels horizontal.

The TukeyHSD() command will run the post-hoc test. The basic command is useful to compare to the boxplot, but reordering the factors is more helpful when plotting the pairwise comparisons:

> TukeyHSD(bats.aov)
> TukeyHSD(bats.aov, ordered = TRUE)

If you try plotting the post-hoc result, you will see that the labels do not fit, so you will need to modify the margins to make room:

> oldpar = par(mar = c(5,8,4,2)+0.1)
> plot(TukeyHSD(bats.aov, ordered = TRUE), las = 1, cex.axis = 0.75)
> abline(v = 0, lty = 3, col = 'gray50')
> par(oldpar)

The las = 1 instruction forces axis labels to be horizontal and the cex.axis = 0.75 instruction makes the labels smaller. You may need slightly different values to get the best fit.

To create an interaction plot you will need to use the interaction.plot() command. You can make a simple plot tracing the method or the species like so:

> with(bats, interaction.plot(spp, method, count))
> with(bats, interaction.plot(method, spp, count))

Note that you need to use the with() command here unless you use attach() first (remember to use detach() afterwards). Alternatively, you can use the $ syntax instead. Which one you use is up to you. The first plot shows two lines, one for each method; the second plot shows three lines, one for each species. The plot can be jazzed up and made more “interesting” with a few additional instructions:

> with(bats, interaction.plot(spp, method, count, type = 'b', pch = 1:3, 
col = 1:3))

Note that three colors and plotting symbols were used even though there are only two lines. The third is ignored. You could edit this and switch the spp and method variables to trace the species.
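
That edited version would look like the following:

> with(bats, interaction.plot(method, spp, count, type = 'b', pch = 1:3, 
col = 1:3))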

Chapter 9

Exercise 1 Solution

First look at the data:

> bees
               Buff.tail Garden.bee Red.tail Honey.bee Carder.bee
Thistle               10          8       18        12          8
Vipers.bugloss         1          3        9        13         27
Golden.rain           37         19        1        16          6
Yellow.alfalfa         5          6        2         9         32
Blackberry            12          4        4        10         23

Treat the Thistle and Vipers.bugloss rows as being the same color and the others as another color. You can make a simple character variable like so:

> flcol = c(rep('blue',2), rep('yellow', 3))

As it stands this is a character vector, not a factor. You can still use it as a grouping variable, but to make it a true factor you need to re-do the command or convert the result. Either of the following will do:

> flcol = as.factor(flcol)
> flcol = factor(c(rep('blue',2), rep('yellow', 3)))

A third way you can achieve the result is to use the factor command with a vector and set the levels like so:

> flcol = factor(c(1,1,2,2,2), labels = c('blue', 'yellow'))

Exercise 2 Solution

A matrix can contain data only of a single type, either numeric or character. If you need to add something to an existing matrix, the data type must match. Here you have a numeric matrix and a factor (as a vector). The simplest way to achieve the result is to convert the flcol factor variable into a numeric vector; then it can be added to the matrix. You can do this with separate commands or in one go:

> bees2 = cbind(bees, flcolor = as.numeric(flcol))

In this case the flcolor object is created as a temporary object from the original factor variable. The cbind() command adds this to the original matrix.

It is useful to add the new data as a named object, because then the column will take on the name of the data. This saves you having to use the colnames() command afterward.

Exercise 3 Solution

Because the bees object is a matrix, you need to specify the column using square brackets. You can obtain a mean for the grouping variable using either of the following commands:

> tapply(bees[,1], flcol, FUN = mean)
> tapply(bees[,'Buff.tail'], flcol, FUN = mean)
  blue yellow 
   5.5   18.0

If you specify all the columns, you can get a summary for all the bee species. You can name the columns explicitly or simply omit the square brackets entirely (thus specifying the entire data):

> tapply(bees[1:5], flcol, FUN = mean)
> tapply(bees, flcol, FUN = mean)
$blue
 Buff.tail Garden.bee   Red.tail  Honey.bee Carder.bee 
       5.5        5.5       13.5       12.5       17.5 

$yellow
 Buff.tail Garden.bee   Red.tail  Honey.bee Carder.bee 
 18.000000   9.666667   2.333333  11.666667  20.333333

Notice that the result is split into two parts, one for each level of the grouping variable. The result is, in fact, an array, but it has only one dimension! If you assign the result to a named object (bee.sum, for example), you can extract its elements using the dollar sign and square brackets appropriately:

> bee.sum$blue
 Buff.tail Garden.bee   Red.tail  Honey.bee Carder.bee 
       5.5        5.5       13.5       12.5       17.5 
> bee.sum[2]
$yellow
 Buff.tail Garden.bee   Red.tail  Honey.bee Carder.bee 
 18.000000   9.666667   2.333333  11.666667  20.333333 

> bee.sum$blue[1]
Buff.tail 
      5.5

Exercise 4 Solution

The ChickWeight data are built in to R and you can access them simply by typing the name. You can see what data are available by using the data() command:

> data()

The weight variable can be summarized by Diet using the tapply() or aggregate() commands:

> tapply(ChickWeight$weight, ChickWeight$Diet, FUN = median)
> with(ChickWeight, tapply(weight, Diet, median))
    1     2     3     4 
 88.0 104.5 125.5 129.5

> aggregate(ChickWeight$weight, by = list(ChickWeight$Diet), FUN = median)
> aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
  Diet weight
1    1   88.0
2    2  104.5
3    3  125.5
4    4  129.5

Notice that the two approaches produce slightly different output formats, although the values are the same. If you add a second grouping variable, you have similar options:

> tapply(ChickWeight$weight, list(ChickWeight$Diet, ChickWeight$Time), median)
     0    2    4    6   8    10    12    14    16    18    20    21
1 41.0 49.0 56.0 67.0  79  93.0 106.0 120.5 149.0 160.0 160.0 166.0
2 40.5 48.5 59.0 74.0  90 104.5 130.5 141.0 157.0 184.0 198.5 212.5
3 41.0 49.5 62.5 77.5  98 113.5 141.0 160.0 195.0 229.5 265.0 281.0
4 41.0 51.5 64.5 84.0 103 123.5 153.0 161.5 179.5 200.5 231.0 237.0

> aggregate(ChickWeight$weight, by = list(ChickWeight$Diet, 
ChickWeight$Time), FUN = median)

> aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = median)
   Diet Time weight
1     1    0   41.0
2     2    0   40.5
3     3    0   41.0
4     4    0   41.0
5     1    2   49.0
6     2    2   48.5
7     3    2   49.5
8     4    2   51.5
9     1    4   56.0
10    2    4   59.0
11    3    4   62.5
12    4    4   64.5
...

The aggregate() command produces a longer output (the display has been shortened in this example), but the values are the same.

The order in which you specify the grouping variables will affect the order of the result, but the values remain the same.

Exercise 5 Solution

The data are built into R and you can access them simply by typing the name mtcars. You can see all the available data by using the data() command like so:

> data()

You can gain further information by looking at the help for the data item:

> help(mtcars)

The best way to create a summary here is to use the aggregate() command. Because you will be using three grouping variables, its data frame output is easier to deal with than the multi-dimensional array that tapply() would produce:

> aggregate(mpg ~ cyl + gear + carb, data = mtcars, FUN = mean)
   cyl gear carb   mpg
1    4    3    1 21.50
2    6    3    1 19.75
3    4    4    1 29.10
4    8    3    2 17.15
5    4    4    2 24.75
6    4    5    2 28.20
7    8    3    3 16.30
8    8    3    4 12.62
9    6    4    4 19.75
10   8    5    4 15.80
11   6    5    6 19.70
12   8    5    8 15.00

Here the formula syntax is used, which makes a nicer display and is easier to type than the alternative:

> with(mtcars, aggregate(mpg, by = list(cyl, gear, carb), FUN = mean))

The tapply() command is less useful here because the results are not so comprehensible and come out as an array object:

> tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear, mtcars$carb), FUN = mean)
, , 1

      3    4  5
4 21.50 29.1 NA
6 19.75   NA NA
8    NA   NA NA

, , 2

      3     4    5
4    NA 24.75 28.2
6    NA    NA   NA
8 17.15    NA   NA
...

Notice, too, that you get NA items where the combination of grouping variables does not contain a result.

Chapter 10

Exercise 1 Solution

Creating the three models involves using the lm() command and specifying the predictor variables as appropriate. Give each model a name like so:

> mtcars.lm1 = lm(mpg ~ wt, data = mtcars)
> mtcars.lm2 = lm(mpg ~ cyl, data = mtcars)
> mtcars.lm3 = lm(mpg ~ wt + cyl, data = mtcars)

You can compare models using the anova() command by specifying the models to compare:

> anova(mtcars.lm1, mtcars.lm2, mtcars.lm3)
Analysis of Variance Table

Model 1: mpg ~ wt
Model 2: mpg ~ cyl
Model 3: mpg ~ wt + cyl
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
1     30 278.32                                 
2     30 308.33  0   -30.012                    
3     29 191.17  1   117.162 17.773 0.000222 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If you specify the models in a different order, you get results in a different order, but the conclusions are the same. There is no appreciable difference between the single-predictor models, but adding the second predictor variable does make a significant difference.

Exercise 2 Solution

The regression is determined easily enough and you have already created a model for this, like so:

> mtcars.lm1 = lm(mpg ~ wt, data = mtcars)

Draw the relationship using the plot() command. You can add the line of best-fit using the abline() command and taking the instructions from the linear model like so:

> plot(mpg ~ wt, data = mtcars)
> abline(mtcars.lm1)

Here, the axis titles are kept to their defaults and the line of best-fit is also kept at its default, but you could use standard instructions to alter the appearance. To get the values for the confidence intervals, you need to use the predict() command. Here you want 99-percent confidence intervals so you have to specify the level explicitly (the default is 0.95, that is, 95 percent). Once you have the fitted values and their confidence intervals as a result object, you must convert it to a data frame and add the predictor variable. Then you sort the data in order of the predictor variable:

> prd = predict(mtcars.lm1, interval = 'confidence', level = 0.99)
> prd = as.data.frame(prd)
> prd$wt = mtcars$wt
> prd = prd[order(prd$wt),]

The final task is to add the confidence interval bands. The lines() command will do this, and the spline() command will make the lines smooth:

> lines(spline(prd$wt, prd$upr))
> lines(spline(prd$wt, prd$lwr))

The lines here are kept to their defaults, but you could make them appear differently using some simple instructions (for example, lty, lwd, and col).
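
For example, you might draw the bands as dashed blue lines:

> lines(spline(prd$wt, prd$upr), lty = 2, col = 'blue')
> lines(spline(prd$wt, prd$lwr), lty = 2, col = 'blue')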

Exercise 3 Solution

The starting point for a backward deletion model is all the terms. You can do this by using a period in the model formula like so:

> mtcars.lm = lm(mpg ~ ., data = mtcars)

Now you need to use the drop1() command to examine the terms of the regression model and decide which can be dropped. The first time you run the command you see a result that looks like this:

> drop1(mtcars.lm, test = 'F')
Single term deletions

Model:
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
       Df Sum of Sq    RSS    AIC F value   Pr(F)  
<none>              147.49 70.898                  
cyl     1    0.0799 147.57 68.915  0.0114 0.91609  
disp    1    3.9167 151.41 69.736  0.5576 0.46349  
hp      1    6.8399 154.33 70.348  0.9739 0.33496  
drat    1    1.6270 149.12 69.249  0.2317 0.63528  
wt      1   27.0144 174.51 74.280  3.8463 0.06325 .
qsec    1    8.8641 156.36 70.765  1.2621 0.27394  
vs      1    0.1601 147.66 68.932  0.0228 0.88142  
am      1   10.5467 158.04 71.108  1.5016 0.23399  
gear    1    1.3531 148.85 69.190  0.1926 0.66521  
carb    1    0.4067 147.90 68.986  0.0579 0.81218

You need to select the term (that is, predictor variable) with the lowest AIC value; it will also have the smallest F-value (and largest p-value). In this first run you can see that the cyl variable meets these criteria (even though it appeared in the model you built in Exercise 1), and therefore you must remove the term from the model. The simplest way to do this is to copy the formula from the drop1() result and paste it into the previous lm() command (use the up arrow to recall it). You can then edit out the cyl term.
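
For example, the first refit (with cyl edited out) and the next round of deletions might look like the following (output not shown):

> mtcars.lm = lm(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb, 
data = mtcars)
> drop1(mtcars.lm, test = 'F')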

You repeat the process, selecting the predictor with the lowest AIC value each time and removing it from the model. Eventually you will get to a point where all the remaining terms are statistically significant.

> mtcars.lm = lm(mpg ~ wt + qsec + am, data = mtcars)
> drop1(mtcars.lm, test = 'F')
Single term deletions

Model:
mpg ~ wt + qsec + am
       Df Sum of Sq    RSS    AIC F value     Pr(F)    
<none>              169.29 61.307                      
wt      1   183.347 352.63 82.790 30.3258 6.953e-06 ***
qsec    1   109.034 278.32 75.217 18.0343 0.0002162 ***
am      1    26.178 195.46 63.908  4.3298 0.0467155 *

This is the point where you stop. You have now whittled away the non-significant terms.

Exercise 4 Solution

To compare the forward and backward models you need to make sure you know what they are called. Here are the model definitions:

> mtcars.lm3 = lm(mpg ~ wt + cyl, data = mtcars)
> mtcars.lm4 = lm(mpg ~ wt + qsec + am, data = mtcars)

The first one is the forward model and the second is the backward model. To compare them you need the anova() command once again:

> anova(mtcars.lm3, mtcars.lm4, test = 'F')
Analysis of Variance Table

Model 1: mpg ~ wt + cyl
Model 2: mpg ~ wt + qsec + am
  Res.Df    RSS Df Sum of Sq    F  Pr(>F)  
1     29 191.17                            
2     28 169.29  1    21.886 3.62 0.06742 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see that there is little to choose between them and the difference is not statistically significant. If you use the summary() command on each model you will see that the variance explained by both models is similar (that is, the Adjusted R Squared values are similar).
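
For example, you can pick out the adjusted R squared values directly from the summary() results like so:

> summary(mtcars.lm3)$adj.r.squared
> summary(mtcars.lm4)$adj.r.squared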

Exercise 5 Solution

The regression model with the fewest terms in it should be the “best” model to select. You generally aim to produce regression models that are as simple as possible.

> mtcars.lm3 = lm(mpg ~ wt + cyl, data = mtcars)

You are going to end up with a scatter plot that shows the response variable against the fitted values. You need to start by making a result object containing the fitted values and confidence intervals; the predict() command will do this:

> prd = predict(mtcars.lm3, interval = 'confidence')

Now you need to make the data into a data frame, add the response data, and reorder the values in ascending order (of fitted value):

> prd = as.data.frame(prd)
> prd$mpg = mtcars$mpg
> prd = prd[order(prd$fit),]

Now you have all the data you need to make the plot and add the lines of best-fit and confidence intervals:

> plot(mpg ~ fit, data = prd)
> abline(lm(mpg ~ fit, data = prd))
> lines(prd$fit, prd$upr)
> lines(prd$fit, prd$lwr)

Here the commands are using all the default settings, but you could add customized axis titles and make the added lines appear differently by using some of the instructions you saw earlier (for example, lwd, col, and lty). Note that the confidence intervals are not completely smooth!
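
If you want smooth confidence bands you can wrap the coordinates in the spline() command, as you did earlier, for example:

> lines(spline(prd$fit, prd$upr), lty = 2)
> lines(spline(prd$fit, prd$lwr), lty = 2)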

Chapter 11

Exercise 1 Solution

Because the data contain NA items, you need to take them into account using the na.rm instruction. Here you must use the apply() command to get the mean values. The length() command does not use na.rm, so you need to find a different way to get the number of replicates in each sample. The round() command is used in conjunction with max() to work out the size of the y-axis:

> hogl.m = apply(hogl, 2, FUN = mean, na.rm = T)
> hogl.s = apply(hogl, 2, FUN = sum, na.rm = T)
> hogl.sd = apply(hogl, 2, FUN = sd, na.rm = T)
> hogl.l = hogl.s / hogl.m
> hogl.se = hogl.sd / sqrt(hogl.l)
> hogl.y = round(max(hogl.m + hogl.se) + 0.5, 0)

To make a graphics window of a fixed size, you use the windows() command (on a Mac use quartz() and on Linux use X11() instead). The bar chart itself is given a name so that you can use it as coordinates for the arrows():

> windows(width = 4, height = 7)
> bp = barplot(hogl.m, ylim = c(0, hogl.y))
> arrows(bp, hogl.m + hogl.se, bp, hogl.m - hogl.se, length = 0.1, angle = 90, code = 3)

The arrows() command is used to make the error bars; here the line style, width, and color are kept as standard but you might like to experiment!

Exercise 2 Solution

The data are in a data frame and must be a matrix to be dealt with by the barplot() command. You could convert the data into a matrix, but it is just as easy here to do it as part of the barplot() command. The legend can be produced easily here as part of the command, rather than separately. The colors of the plot (and also the legend) are set using the col instruction; the rainbow() command generates nine rainbow colors, one for each row:

> barplot(as.matrix(hoglouse), beside = TRUE, legend = TRUE, col = rainbow(9), 
args.legend = list(x = 'topleft', bty = 'n'))

The colors are rather lurid; experiment with some others (see the help entry for palette). You can also add axis titles with the title() command.
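
For example (these axis labels are only suggestions; use whatever best describes your data):

> title(xlab = 'Water speed', ylab = 'Hoglouse abundance')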

Exercise 3 Solution

This could take some experimentation on your system. The illustration here is based on a 7 inch default graphic size. Start by making the plot and see how much extra room you need. The hoglouse data are a data frame, so you need to make them into a matrix. However, you are required to differentiate between the fast and the slow. This means that you have to transpose the data. The result of t(hoglouse) will be a matrix, so you can simply plot that.

Once you have estimated how much to alter the margin, you can use the par() command and the mar instruction to alter the values; note that you have to specify all the margins (the defaults are c(5, 4, 4, 2) + 0.1). Now you can issue the barplot() command again. You will need to specify the colors explicitly so that you can match them in the separate legend() command:

> oldpar = par(mar = c(5,5,4,2) + 0.1)
> barplot(t(hoglouse), horiz = TRUE, las = 1, cex.names = 1, legend = FALSE, col = c('gray30', 'gray80'))

The legend can be placed with a call to the locator() command. The colors are set to match the previous command. Using your mouse, click the top-left corner of the legend. The final commands simply add a neat line to “ground” the bars and to reset the margins:

> legend(locator(1), legend = c('Fast', 'Slow'), bty = 'n', fill = c('gray30', 'gray80'))
> abline(v=0)
> par(oldpar)

You might also want to add axis titles.

Exercise 4 Solution

You have two ways to tackle this. You could create a separate matrix for the Length variable and another to hold the two series (that is, Speed and Algae), or you could create these matrix data “on the fly.” Here you see the former approach, which is easier to follow. The x and y data are prepared first from the original data frame. Now the matplot() command is used to draw the series using two explicit plotting characters and named colors. This is so that you can match them in the legend. Note that the expression() command is used to make the subscript. Finally, the legend is added using the same plotting characters and colors as the original plot:

> mf.x = as.matrix(mf[,2:3])
> mf.y = as.matrix(mf[,1])
> matplot(mf.x, mf.y, type = 'p', pch = 16:17, col = c('black', 'darkgreen'), ylab = expression(Length[mm]), xlab = 'Speed/Algae', las = 1)
> legend(x = 'bottomright', legend = c('Speed', 'Algae'), col = c('black', 'darkgreen'), pch = 16:17, bty = 'n')

You might also have used a basic plot() command to draw one series and then added the other using the points() command.
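
A sketch of that approach might look like the following; note that you need to set xlim wide enough to accommodate both series:

> plot(mf$Speed, mf$Length, pch = 16, xlim = range(c(mf$Speed, mf$Algae)), 
xlab = 'Speed/Algae', ylab = 'Length', las = 1)
> points(mf$Algae, mf$Length, pch = 17, col = 'darkgreen')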

Chapter 12

Exercise 1 Solution

You can use a simple one-line function() command to do this:

> pwr = function(x, power = 2) (x^power)

Now if you run the new pwr() command and do not specify the power = instruction, the value of 2 will be the default (that is, the square).
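
For example:

> pwr(4)
[1] 16
> pwr(4, power = 3)
[1] 64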

Exercise 2 Solution

You have two ways to save the simple customized function to disk:

> dump('pwr', file = 'power function dump.R')
> save(pwr, file = 'power function save.R')

If you use the dump() command the file is written as text, which you can edit in a text editor and load once more using the source() command.

If you use the save() command, the file is written as binary data, which you cannot edit in a text editor. You can reload this file using the load() command.

In either case, the filename must be in quotes, and is written to the working directory unless you specify otherwise.
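
For example, in a later session you would retrieve the function with whichever command matches the way the file was written:

> source('power function dump.R')
> load('power function save.R')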

Exercise 3 Solution

To incorporate some annotations (using the # symbol) you will need multiple lines. This means you need curly brackets. You could type the following lines directly into R or write them with a text editor:

pwr = function(x, power = 2) { 

      #     x = a number of some kind
      # power = the power to raise x by, defaults to 2

   (x^power)
 }

Exercise 4 Solution

You will need the readline() command to prompt the user to enter a value:

pwr = function(x, power) { 

      #     x = a number of some kind
      # power = user will input a value to raise x by

  power = readline(prompt = 'Enter the required power: ')

   (x^as.numeric(power))
 }

There is no point in specifying a default here because the user will have to give the required power. Note that you must force the input to be numeric using the as.numeric() command.

Exercise 5 Solution

Take the values input by the user and present them as a text summary at the end. You must use the deparse(substitute()) command(s) here:

pwr = function(x, power) { 

      #     x = a number of some kind
      # power = user will input a value to raise x by

 power = readline(prompt = 'Enter the required power: ') # wait for user
  power = as.numeric(power) # make sure this is numeric

 result = (x^power)
  cat(deparse(substitute(x)), '^', power, '=', result)
 }

Note that this time you modified the power value that was input by the user on a separate line; this forced the input to be numeric. The previous method is perfectly acceptable; this merely illustrates an alternative.
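
If you run the finished function, the exchange might look something like the following (here the user types 3 at the prompt; the output is shown approximately):

> pwr(4)
Enter the required power: 3
4 ^ 3 = 64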
