Chapter 6

Simple Hypothesis Testing

What you will learn in this chapter:

  • How to carry out some basic hypothesis tests
  • How to carry out the Student’s t-test
  • How to conduct the U-test for non-parametric data
  • How to carry out paired tests for parametric and non-parametric data
  • How to produce correlation and covariance matrices
  • How to carry out a range of correlations tests
  • How to test for association using chi-squared
  • How to carry out goodness of fit tests

Many statistical analyses are concerned with testing hypotheses. In this chapter you look at methods of testing some simple hypotheses using standard and classic tests. You start by comparing differences between two samples. Then you look at the correlation between two samples, and finally look at tests for association and goodness of fit. Other tests are available in R, but the ones illustrated here will form a good foundation and give you an idea of how R works. Should you require a different test, you will be able to work out how to carry it out for yourself.

Using the Student’s t-test

The Student’s t-test is a method for comparing two samples: it examines the sample means to determine whether the two samples differ. This is a parametric test, and the data should be normally distributed. You looked at the distribution of data previously in Chapter 5.
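If you need a quick check of normality before applying the test, the Shapiro-Wilk test is one option (result not shown):

> shapiro.test(data2)  # a p-value above 0.05 gives no reason to doubt normality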

Several versions of the t-test exist, and R can handle these using the t.test() command, which has a variety of options (see Table 6-1). The test can be pressed into service to deal with two- and one-sample tests as well as paired tests. The latter option is discussed in the later section “Paired t- and U-Tests”; in this section you look at some more basic options.

Table 6-1: The t.test() Command and Some of the Options Available.

Command Explanation
t.test(data.1, data.2) The basic method of applying a t-test is to compare two vectors of numeric data.
var.equal = FALSE If the var.equal instruction is set to TRUE, the variance is considered to be equal and the standard test is carried out. If the instruction is set to FALSE (the default), the variance is considered unequal and the Welch two-sample test is carried out.
mu = 0 If a one-sample test is carried out, mu indicates the mean against which the sample should be tested.
alternative = "two.sided" Sets the alternative hypothesis. The default is two.sided but you can specify greater or less. You can abbreviate the instruction (but you still need quotes).
conf.level = 0.95 Sets the confidence level of the interval (default = 0.95).
paired = FALSE If set to TRUE, a matched pair t-test is carried out.
t.test(y ~ x, data, subset) The required data can be specified as a formula of the form response ~ predictor. In this case, the data should be named and a subset of the predictor variable can be specified.
subset = predictor %in% c("sample.1", "sample.2") If the data is in the form response ~ predictor, the subset instruction can specify which two samples to select from the predictor column of the data.

Two-Sample t-Test with Unequal Variance

The general way to use the t.test() command is to compare two vectors of numeric values. You can specify the vectors in a variety of ways, depending on how your data objects are set out. The default form of the t.test() command does not assume that the samples have equal variance, so the Welch two-sample test is carried out unless you specify otherwise:

> t.test(data2, data3)

     Welch Two Sample t-test

data:  data2 and data3 
t = -2.8151, df = 24.564, p-value = 0.009462
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.5366789 -0.5466544 
sample estimates:
mean of x mean of y 
 5.125000  7.166667

Two-Sample t-Test with Equal Variance

You can override the default and use the classic t-test by adding the var.equal = TRUE instruction, which forces the command to assume that the variance of the two samples is equal. The calculation of the t-value uses pooled variance and the degrees of freedom are unmodified; as a result, the p-value is slightly different from the Welch version:

> t.test(data2, data3, var.equal = TRUE)

     Two Sample t-test

data:  data2 and data3 
t = -2.7908, df = 26, p-value = 0.009718
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.5454233 -0.5379101 
sample estimates:
mean of x mean of y 
 5.125000  7.166667
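You can reproduce this t-value for yourself to see how the pooled variance works (a sketch using the data2 and data3 vectors from above):

> n1 = length(data2) ; n2 = length(data3)
> sp2 = ((n1 - 1) * var(data2) + (n2 - 1) * var(data3)) / (n1 + n2 - 2)  # pooled variance
> (mean(data2) - mean(data3)) / sqrt(sp2 * (1/n1 + 1/n2))  # the t-value
[1] -2.790815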

One-Sample t-Testing

You can also carry out a one-sample t-test. In this version you supply the name of a single vector and the mean to compare it to (this defaults to 0):

> t.test(data2, mu = 5)

     One Sample t-test

data:  data2 
t = 0.2548, df = 15, p-value = 0.8023
alternative hypothesis: true mean is not equal to 5 
95 percent confidence interval:
 4.079448 6.170552 
sample estimates:
mean of x 
    5.125

Using Directional Hypotheses

You can also specify a “direction” to your hypothesis. In many cases you are simply testing to see if the means of two samples are different, but you may want to know if a sample mean is lower (or greater) than another sample mean. You can use the alternative = instruction to switch the emphasis from a two-sided test (the default) to a one-sided test. The choices are two.sided, less, or greater, and your choice can be abbreviated (but you still need the quotes).

> t.test(data2, mu = 5, alternative = 'greater')

     One Sample t-test

data:  data2 
t = 0.2548, df = 15, p-value = 0.4012
alternative hypothesis: true mean is greater than 5 
95 percent confidence interval:
 4.265067      Inf 
sample estimates:
mean of x 
    5.125

Formula Syntax and Subsetting Samples in the t-Test

The t-test is designed to compare two samples (or one sample with a “standard”). So far you have seen how to carry out the t-test on separate vectors of values. However, your data may well be in a more structured form with a column for the response variable and a column for the predictor variable. The following data are set out in this manner:

> grass
  rich graze
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow

This way of setting out data is more sensible and flexible, but you need a new way to deal with the layout. R deals with this by having a “formula syntax.” You create a formula using the tilde (~) symbol. Essentially your response variable goes on the left of the ~ and the predictor goes on the right like so:

> t.test(rich ~ graze, data = grass)

     Welch Two Sample t-test

data:  rich by graze 
t = 4.8098, df = 5.411, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 2.745758 8.754242 
sample estimates:
  mean in group mow mean in group unmow 
              14.00                8.25

If your predictor column contains more than two samples, the t-test cannot be used. However, you can still carry out a test by subsetting this predictor column and specifying which two samples you want to compare. You must use the subset = instruction as part of the t.test() command. The following example illustrates how to do this using the same data as in the previous example.

> t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))

You first specify which column you want to take your subset from (graze in this case) and then type %in%; this tells the command that the list that follows is contained in the graze column. Note that you have to put the levels in quotes; here you compare mow and unmow, and your result (not shown) is identical to the one you obtained before.

Try It Out: Carry Out Student’s t-Tests on Some Data
Use the data on orchids (orchid, orchid2, orchis, and orchis2) from the Beginning.RData file for this activity, on which you will be carrying out a range of t-tests.
1. Use the ls() command to see the data you require; they all begin with “orchi”:
> ls(pattern='^orc')
[1] "orchid"  "orchid2" "orchis"  "orchis2"
2. Look first at the orchid data. This comprises two columns relating to two samples:
> orchid
   closed open
1       7    3
2       8    5
3       6    6
4       9    7
5      10    6
6      11    8
7       7    8
8       8    4
9      10    7
10      9    6
3. Carry out a t-test on these data without making any assumptions about the variance, like so:
> attach(orchid)
> t.test(open, closed)

   Welch Two Sample t-test

data:  open and closed 
t = -3.478, df = 17.981, p-value = 0.002688
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.0102455 -0.9897545 
sample estimates:
mean of x mean of y 
      6.0       8.5 

> detach(orchid)
4. Now carry out another two-sample t-test but use the “classic” version and assume the variance of the two samples is equal:
> with(orchid, t.test(open, closed, var.equal = TRUE))
    Two Sample t-test

data:  open and closed 
t = -3.478, df = 18, p-value = 0.002684
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.0101329 -0.9898671 
sample estimates:
mean of x mean of y 
      6.0       8.5 
5. This time look at the open sample only and carry out a one-sample test to compare the data to a mean of 5:
> t.test(orchid$open, mu = 5)

    One Sample t-test

data:  orchid$open 
t = 1.9365, df = 9, p-value = 0.08479
alternative hypothesis: true mean is not equal to 5 
95 percent confidence interval:
 4.831827 7.168173 
sample estimates:
mean of x 
        6 
6. Now look at the orchis data object. It has two columns, flower and site. Use the str() or summary() command to confirm that there are two samples in the site column:
> str(orchis)
'data.frame':      20 obs. of  2 variables:
 $ flower: num  7 8 6 9 10 11 7 8 10 9 ...
 $ site  : Factor w/ 2 levels "closed","open": 1 1 1 1 1 1 1 1 1 1 ...
7. Carry out a t-test using the formula syntax; you do not need to make assumptions about the variance:
> t.test(flower ~ site, data = orchis)
8. Now look at the orchis2 data object. It has two columns, flower and site. Use the str() or summary() command to confirm that there are three samples in the site column:
> str(orchis2)
'data.frame':       30 obs. of  2 variables:
 $ flower: num  7 8 6 9 10 11 7 8 10 9 ...
 $ site  : Factor w/ 3 levels "closed","open",..: 1 1 1 1 1 1 1 1 1 1 ...
9. Use a subset instruction to carry out a t-test on the open and closed sites:
> t.test(flower ~ site, data = orchis2, subset = site %in% c('open', 'closed'))
10. Now return to the orchid data. Carry out a one-sample test on the open sample to see if it has a mean of less than 7:
> t.test(orchid$open, alternative = 'less', mu = 7)

    One Sample t-test

data:  orchid$open 
t = -1.9365, df = 9, p-value = 0.04239
alternative hypothesis: true mean is less than 7 
95 percent confidence interval:
     -Inf 6.946615 
sample estimates:
mean of x 
        6
11. Look again at the orchis2 data, which has three samples in the site column. Carry out a t-test on the sprayed sample to see if its mean is greater than 3. You can use either of the following commands:
> t.test(orchis2$flower[orchis2$site=='sprayed'], mu = 3, alt = 'greater')
> with(orchis2, t.test(flower[site=='sprayed'], mu = 3, alt = 'g'))

    One Sample t-test

data:  orchis2$flower[orchis2$site == "sprayed"] 
t = 1.9412, df = 9, p-value = 0.04208
alternative hypothesis: true mean is greater than 3 
95 percent confidence interval:
 3.061236      Inf 
sample estimates:
mean of x 
      4.1
How It Works
The first part is simply a way to list the data objects by matching items that begin with the text “orc”. In the first t-test you had to use the attach() command to enable you to specify the column names. Notice that the result begins by telling you that you have carried out the Welch Two-Sample t-test.
In the next case you used the with() command to allow R to access the columns in the orchid data. By adding var.equal = TRUE you carry out the “classic” t-test and treat the variances of the samples as equal. Note that in step 11 you used an abbreviation.
The formula syntax is a convenient way to describe your data; the formula is of the form response ~ predictor. The subset instruction enables you to select two samples from a column variable; the form of the instruction is subset = predictor %in% c(item.1, item.2).
The subset instruction works only in conjunction with the formula syntax.

The t-test is a powerful and flexible tool, as you have seen. You can also carry out paired tests using the t.test() command, but before that you look at the U-test, which you can think of as the non-parametric equivalent to the t-test.

The Wilcoxon U-Test (Mann-Whitney)

When you have two samples to compare and your data are non-parametric, you can use the U-test. This test goes by various names and may be known as the Mann-Whitney U-test or the Wilcoxon rank-sum test. You use the wilcox.test() command to carry out the analysis, and you operate it in much the same way as the t.test() command you used previously.

The wilcox.test() command can conduct two-sample or one-sample tests, and you can add a variety of instructions to carry out the test you want. The main options are shown in Table 6-2.

Table 6-2: The wilcox.test() Command and Some of the Options Available.

Command Explanation
wilcox.test(sample.1, sample.2) Carries out a basic two-sample U-test on the numerical vectors specified.
mu = 0 If a one-sample test is carried out, mu indicates the value against which the sample should be tested.
alternative = "two.sided" Sets the alternative hypothesis. The default is two.sided but you can specify greater or less. You can abbreviate the instruction (but you still need quotes).
conf.int = FALSE Sets whether confidence intervals should be reported.
conf.level = 0.95 Sets the confidence level of the interval (default = 0.95).
correct = TRUE By default the continuity correction is applied. You can turn this off by setting it to FALSE.
paired = FALSE If set to TRUE, a matched pair U-test is carried out.
exact = NULL Sets whether an exact p-value should be computed. The default is to do so for fewer than 50 items.
wilcox.test(y ~ x, data, subset) The required data can be specified as a formula of the form response ~ predictor. In this case the data should be named and a subset of the predictor variable can be specified.
subset = predictor %in% c("sample.1", "sample.2") If the data is in the form response ~ predictor, the subset instruction can specify which two samples to select from the predictor column of the data.

Two-Sample U-Test

The basic way of using the wilcox.test() is to specify the two samples you want to compare as separate vectors, as the following example shows:

> data1 ; data2
 [1] 3 5 7 5 3 2 6 8 5 6 9
 [1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> wilcox.test(data1, data2)

       Wilcoxon rank sum test with continuity correction

data:  data1 and data2 
W = 94.5, p-value = 0.7639
alternative hypothesis: true location shift is not equal to 0 

Warning message:
In wilcox.test.default(data1, data2) :
  cannot compute exact p-value with ties

By default the confidence intervals are not calculated and the p-value is adjusted using the “continuity correction”; a message tells you that the latter has been used. In this case you see a warning message because you have tied values in the data. If you set exact = FALSE, this message would not be displayed because the p-value would be determined from a normal approximation method.
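For example, repeating the test with exact = FALSE gives the same result but without the warning:

> wilcox.test(data1, data2, exact = FALSE)

       Wilcoxon rank sum test with continuity correction

data:  data1 and data2 
W = 94.5, p-value = 0.7639
alternative hypothesis: true location shift is not equal to 0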

One-Sample U-Test

If you specify a single numerical vector, a one-sample U-test is carried out; the default is to set mu = 0, as in the following example:

> wilcox.test(data3, exact = FALSE)

       Wilcoxon signed rank test with continuity correction

data:  data3 
V = 78, p-value = 0.002430
alternative hypothesis: true location is not equal to 0

In this case the p-value is taken from a normal approximation because the exact = FALSE instruction is used. The command has assumed mu = 0 because it is not specified explicitly.

Using Directional Hypotheses

Both one- and two-sample tests use an alternative hypothesis that the location shift is not equal to 0 as their default. This is essentially a two-sided hypothesis. You can change this by using the alternative = instruction, where you can select two.sided, less, or greater as your alternative hypothesis (an abbreviation is acceptable but you still need quotes, single or double). You can also specify mu, the location shift. By default mu = 0. In the following example the hypothesis is set to something other than 0:

> data3
 [1]  6  7  8  7  6  3  8  9 10  7  6  9
> summary(data3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   6.000   7.000   7.167   8.250  10.000 

> wilcox.test(data3, mu = 8, exact = FALSE, conf.int = TRUE, alt = 'less')

       Wilcoxon signed rank test with continuity correction

data:  data3 
V = 13.5, p-value = 0.08021
alternative hypothesis: true location is less than 8 
95 percent confidence interval:
     -Inf 8.000002 
sample estimates:
(pseudo)median 
      6.999956

In this example a one-sample test is carried out on the data3 sample vector. The test looks to see if the sample median is less than 8. The instructions also specify to display the confidence interval and not to use an exact p-value.

Formula Syntax and Subsetting Samples in the U-test

It is generally a good idea to have your data arranged into a data frame where one column represents the response variable and another represents the predictor variable. In this case you can use the formula syntax to describe the situation and carry out the wilcox.test() on your data. This is much the same method you used for the t-test previously. The basic form of the command becomes:

wilcox.test(response ~ predictor, data = my.data)

You can also use additional instructions as you could with the other syntax. If your predictor variable contains more than two samples, you cannot conduct a U-test and must use a subset that contains exactly two samples. The subset instruction works like so:

wilcox.test(response ~ predictor, data = my.data, subset = predictor %in% c("sample1", "sample2"))

Notice that you use a c() command to group the samples together, and their names must be in quotes.

The U-test is one of the most widely used statistical methods, so it is important to be comfortable using the wilcox.test() command. In the following activity you try conducting a range of U-tests for yourself.

Try It Out: Carry Out U-Tests on Some Data
Use the butterfly abundance data called bfc, bf2, and bfs, which are all contained in the Beginning.RData file. Use the ls() command to remind you of the names:
> ls(pattern='^bf')
1. Look at the bfc data object; there are two columns in this data frame, one for each sample:
> bfc
   grass heath
1      3     6
2      4     7
3      3     8
4      5     8
5      6     9
6     12    11
7     21    12
8      4    11
9      5    NA
10     4    NA
11     7    NA
12     8    NA
2. Carry out a two-sample U-test on the two samples in the bfc data object. There is no need to use any additional instructions:
> wilcox.test(bfc$grass, bfc$heath)

    Wilcoxon rank sum test with continuity correction

data:  bfc$grass and bfc$heath 
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0 

Warning message:
In wilcox.test.default(bfc$grass, bfc$heath) :
  cannot compute exact p-value with ties
3. Now look at the grass sample of the bfc data using the summary() command:
> summary(bfc$grass)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   5.000   6.833   7.250  21.000
4. Carry out a one-sample test on the grass sample of the bfc data. Set a hypothesis that the location shift is less than 7.5:
> with(bfc, wilcox.test(grass, mu = 7.5, exact = F, alt = 'less'))

    Wilcoxon signed rank test with continuity correction

data:  grass 
V = 23.5, p-value = 0.1188
alternative hypothesis: true location is less than 7.5
5. Look at the bf2 data object. It comprises two columns, with a response variable count and a predictor variable site.
> str(bf2)
'data.frame':       20 obs. of  2 variables:
 $ count: int  3 4 3 5 6 12 21 4 5 4 ...
 $ site : Factor w/ 2 levels "Grass","Heath": 1 1 1 1 1 1 1 1 1 1 ...
6. Conduct a two-sample U-test on the bf2 data. This time you will need to use the formula syntax:
> wilcox.test(count ~ site, data = bf2, exact = FALSE)

    Wilcoxon rank sum test with continuity correction

data:  count by site 
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0
7. Look at the bf2 data object again. This time look at the Heath sample and carry out a one-sided U-test. Set an alternative hypothesis that the location shift is greater than the first quartile:
> with(bf2, summary(count[which(site=='Heath')]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00    7.75    8.50    9.00   11.00   12.00

> with(bf2, wilcox.test(count[which(site=='Heath')], exact = F, 
alt = 'greater', mu = 7.75))


    Wilcoxon signed rank test with continuity correction

data:  count[which(site == "Heath")] 
V = 28, p-value = 0.09118
alternative hypothesis: true location is greater than 7.75
8. Now look at the bfs data object. This time you have a predictor variable with three samples. Carry out a two-sample U-test between the Grass and Arable samples:
> wilcox.test(count ~ site, data = bfs, subset = site %in% c('Grass', 'Arable'), 
exact = F)

    Wilcoxon rank sum test with continuity correction

data:  count by site 
W = 81.5, p-value = 0.05375
alternative hypothesis: true location shift is not equal to 0
How It Works
The basic form of the command requires the numerical vectors to be specified. If the data are inside a data frame you must use attach(), with(), or the $ syntax to enable R to “read” them.
If a single vector is specified, a one-sample test is carried out. The mu = instruction gives the location shift to test and the alternative = instruction sets the direction of the alternative hypothesis, with two.sided being the default.
The formula syntax enables you to specify response ~ predictor for when you have data in that format. This also allows you to specify the data so that you do not need to use attach() or with() commands or the $ syntax.
If you have a response variable column you have to use a more complex method to extract the sample you require. Here you used a conditional statement to select the sample and used the summary() command to determine the first quartile. This value was used as the mu = instruction along with an alternative = greater instruction to make a one-sided test.
When you have more than two samples in a predictor variable the subset instruction enables you to select two samples to compare; the subset instruction works only with the formula syntax and you specify the samples to compare in the following way: subset = predictor %in% c("sample1", "sample2").

The U-test is a useful tool for comparing two samples and is one of the most widely used of all simple statistical tests. Both the t.test() and wilcox.test() commands can also deal with matched pair data, which you have not seen yet. This is the subject of the next section.

NOTE
The results of the t.test() and wilcox.test() commands are displayed when you run the command. However, not all of the results are displayed. If you create a new object to “hold” the result of a test, you can view the elements of the result by using the names() command. You can then access the various elements using the $ syntax.
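For example, you might store a test result and extract individual elements (a minimal sketch; the exact set of elements can vary slightly between versions of R):

> t.result = t.test(data2, data3)  # store the Welch test from earlier
> names(t.result)                  # list the elements of the result
> t.result$conf.int                # extract just the confidence interval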

Paired t- and U-Tests

If you have a situation in which you have paired data, you can use matched pair versions of the t-test and the U-test by adding one simple extra instruction, paired = TRUE, to your command. It does not matter if the data are in two separate sample columns or are represented as response and predictor, as long as you indicate what is required using the appropriate syntax. In fact, R will carry out a paired test even if the data do not really match up as pairs; it is up to you to ensure that the pairing is sensible. You can use all the regular syntax and instructions, so you can use subsetting and directional hypotheses as you like. In the following activity you try a few paired tests for yourself.

Try It Out: Conduct Paired t and U Tests on Some Data
You will need the Beginning.RData file for this activity; it contains several data objects, which you will use to carry out some paired tests.
1. Look at the mpd data; you can see two samples, white and yellow. These data are matched pair data and each row represents a bi-colored target. The values are for numbers of whitefly attracted to each half of the target.
> mpd
  white yellow
1     4      4
2     3      7
3     4      2
4     1      2
5     6      7
6     4     10
7     6      5
8     4      8
2. Use a paired U-test (Wilcoxon matched pair test) on these data like so:
> wilcox.test(mpd$white, mpd$yellow, exact = FALSE, paired = TRUE)

   Wilcoxon signed rank test with continuity correction

data:  mpd$white and mpd$yellow 
V = 6, p-value = 0.2008
alternative hypothesis: true location shift is not equal to 0
3. Look at the means for the two samples in the mpd data (the colMeans() command gives the mean of each column). Round the difference between them up to a whole number (here, 2) and then carry out a paired t-test, setting an alternative hypothesis that the difference in means is less than this value:
> colMeans(mpd)
 white yellow 
 4.000  5.625
> with(mpd, t.test(white, yellow, paired = TRUE, mu = 2, alt = 'less'))

    Paired t-test

data:  white and yellow 
t = -3.6958, df = 7, p-value = 0.003849
alternative hypothesis: true difference in means is less than 2 
95 percent confidence interval:
      -Inf 0.2332847 
sample estimates:
mean of the differences 
                 -1.625
4. Look at the mpd.s data object. This comprises two columns. One is the response variable count and the other is the predictor variable trap. These are the same data as the mpd and are paired (the only difference is the form of the data object). Carry out a paired U-test on these data:
> wilcox.test(count ~ trap, data = mpd.s, paired = TRUE, exact = F)

    Wilcoxon signed rank test with continuity correction

data:  count by trap 
V = 6, p-value = 0.2008
alternative hypothesis: true location shift is not equal to 0
5. Carry out a two-sided and paired t-test on the mpd.s data. Set the alternative hypothesis that the difference in means is 1 and show the 99 percent confidence intervals:
> t.test(count ~ trap, data = mpd.s, paired = TRUE, mu = 1, conf.level = 0.99)

    Paired t-test

data:  count by trap 
t = -2.6763, df = 7, p-value = 0.03171
alternative hypothesis: true difference in means is not equal to 1 
99 percent confidence interval:
 -5.057445  1.807445 
sample estimates:
mean of the differences 
                 -1.625
6. Look at the orchis2 data. Here you have a response variable flower and a predictor variable site. The predictor variable has three samples (open, closed, and sprayed). Carry out a paired t-test on the open and sprayed samples:
> t.test(flower ~ site, data = orchis2, subset = site %in% c('open', 'sprayed'), 
paired = TRUE)

    Paired t-test

data:  flower by site 
t = 4.1461, df = 9, p-value = 0.002499
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.8633494 2.9366506 
sample estimates:
mean of the differences 
                    1.9
How It Works
Simply adding paired = TRUE as an instruction to a t.test() or wilcox.test() command will carry out a paired version of the test. If the sample vectors are inside a data frame you must use attach(), with(), or the $ syntax to allow R to read the variables.
All the regular instructions can be used, so you can carry out a directional hypothesis, for example, using alternative = less (or greater).
If the predictor variable has more than two levels (samples), you can use the subset instruction exactly as you did for the unpaired version.
WARNING
Paired tests are useful and more sensitive than their unpaired cousins. However, you must be careful to make sure a paired test is appropriate: any two columns of a data frame will appear to be paired. R checks that the vectors used are the same length, but if you have NA items, the incomplete pairs are removed by default and your result may not be what you expect.
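As a quick illustration of the NA behavior, consider two short hypothetical vectors in which one pair is incomplete:

> x = c(4, 3, 4, NA)
> y = c(4, 7, 2, 5)
> t.test(x, y, paired = TRUE)  # the incomplete pair is dropped, leaving 3 pairs (df = 2)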

Correlation and Covariance

When you have two continuous variables you can look for a link between them; this link is called a correlation. You can go about finding this several ways using R. The cor() command determines correlations between two vectors, all the columns of a data frame (or matrix), or two data frames (or matrix objects). The cov() command examines covariance. By default the Pearson product moment (that is, regular parametric correlation) is used, but the Spearman (rho) and Kendall (tau) methods (both non-parametric correlations) can be specified instead. The cor.test() command carries out a test of significance of the correlation.

You can add a variety of additional instructions to these commands. Table 6-3 gives a brief summary of them.

Table 6-3: Correlation Commands and Main Options.

Command Explanation
cor(x, y = NULL) Carries out a basic correlation between x and y. If x is a matrix or data frame, y can be omitted.
cov(x, y = NULL) Determines covariance between x and y. If x is a matrix or data frame, y can be omitted.
cov2cor(V) Takes a covariance matrix V and calculates the correlations.
method = The default is pearson, but spearman or kendall can be specified as the methods for correlation or covariance. These can be abbreviated but you still need the quotes, and note that they are lowercase.
var(x, y = NULL) Determines the variance of x. If x is a matrix or data frame or y is specified, the covariance is also determined.
cor.test(x, y) Carries out a significance test of the correlation between x and y.
alternative = "two.sided" The default is for a two-sided test but the alternative hypothesis can be given as two.sided, greater, or less and abbreviations are permitted.
conf.level = 0.95 If the method = pearson and n > 3, the confidence intervals will be shown. This instruction sets the confidence level and defaults to 0.95.
exact = NULL For Kendall or Spearman, should an exact p-value be determined? Set this to TRUE or FALSE (the default NULL is equivalent to FALSE).
continuity = FALSE For Spearman or Kendall tests setting this to TRUE carries out a continuity correction.
cor.test( ~ x + y, data) If the data are in a data frame, a formula syntax can be used. This is of the form ~ x + y where x and y are two variables. The data frame can be specified. All other instructions can be used including subset.
subset = group %in% "sample" If the data includes a grouping variable, the subset instruction can be used to select one or more samples from this grouping.

The commands summarized in Table 6-3 enable you to carry out a range of correlation tasks. In the following sections you see a few of these options illustrated, and you can then try some correlations yourself in the activity that follows.

Simple Correlation

Simple correlations are between two continuous variables and you can use the cor() command to obtain a correlation coefficient like so:

> count = c(9, 25, 15, 2, 14, 25, 24, 47)
> speed = c(2, 3, 5, 9, 14, 24, 29, 34)

> cor(count, speed)
[1] 0.7237206

The default for R is to carry out the Pearson product moment, but you can specify other correlations using the method = instruction, like so:

> cor(count, speed, method = 'spearman')
[1] 0.5269556

This example used the Spearman rho correlation, but you can also apply Kendall’s tau by specifying method = 'kendall'. Note that you can abbreviate this, but you still need the quotes, and the name must be lowercase.
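For example:

> cor(count, speed, method = 'kendall')
[1] 0.4000661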

If your vectors are contained within a data frame or some other object, you need to extract them in a different fashion. Look at the women data frame. This comes as example data with your distribution of R.

> data(women)
> str(women)
'data.frame':   15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

You need to use attach() or with() commands to allow R to “read inside” the data frame and access the variables within. You could also use the $ syntax so that the command can access the variables as the following example shows:

> cor(women$height, women$weight)
[1] 0.9954948

In this example the cor() command has calculated the Pearson correlation coefficient between the height and weight variables contained in the women data frame.
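The with() command achieves the same result and can be neater:

> with(women, cor(height, weight))
[1] 0.9954948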

You can also use the cor() command directly on a data frame (or matrix). If you use the data frame women that you just looked at, for example, you get the following:

> cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

Now you have a correlation matrix that shows you all combinations of the variables in the data frame. When you have more columns the matrix can be much more complex. The following example contains five columns of data:

> head(mf)
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120

> cor(mf)
           Length       Speed      Algae         NO3        BOD
Length  1.0000000 -0.34322968  0.7650757  0.45476093 -0.8055507
Speed  -0.3432297  1.00000000 -0.1134416  0.02257931  0.1983412
Algae   0.7650757 -0.11344163  1.0000000  0.37706463 -0.8365705
NO3     0.4547609  0.02257931  0.3770646  1.00000000 -0.3751308
BOD    -0.8055507  0.19834122 -0.8365705 -0.37513077  1.0000000

The correlation matrix can be helpful, but you may not always want to see all the possible combinations; in the mf data frame, for instance, the first column (Length) is the response variable and the others are predictor variables. You can select a single variable and compare it to all the others, using the default Pearson coefficient, like so:

> cor(mf$Length, mf)
     Length      Speed     Algae       NO3        BOD
[1,]      1 -0.3432297 0.7650757 0.4547609 -0.8055507

Covariance

The cov() command uses syntax similar to the cor() command to examine covariance. The women data are used with the cov() command in the following example:

> cov(women$height, women$weight)
[1] 69
> cov(women)
       height   weight
height     20  69.0000
weight     69 240.2095
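Covariance and correlation are closely related; dividing the covariance by the product of the two standard deviations recovers the correlation coefficient:

> cov(women$height, women$weight) / (sd(women$height) * sd(women$weight))
[1] 0.9954948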

In the following example, the cov2cor() command is used to determine the correlations from a covariance matrix:

> women.cv = cov(women)
> cov2cor(women.cv)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

Significance Testing in Correlation Tests

You can apply a significance test to your correlations using the cor.test() command. In this case you can compare only two vectors at a time as the following example shows:

> cor.test(women$height, women$weight)

       Pearson's product-moment correlation

data:  women$height and women$weight 
t = 37.8553, df = 13, p-value = 1.088e-14
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.9860970 0.9985447 
sample estimates:
      cor 
0.9954948

In the previous example you can see that the Pearson correlation has been carried out between height and weight in the women data and the result also shows the statistical significance of the correlation.

Formula Syntax

If your data are contained in a data frame, using the attach() or with() commands is tedious, as is using the $ syntax. A formula syntax is available as an alternative, which provides a neater representation of your data:

> data(cars)
> cor.test(~ speed + dist, data = cars, method = 'spearman', exact = F)

       Spearman's rank correlation rho

data:  speed and dist 
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.8303568

Here you examine the cars data, which comes built into R. The formula is slightly different from the one that you met previously. Here you specify both variables to the right of the ~. You also give the name of the data as a separate instruction.

All the additional instructions are available when using the formula syntax as well as the subset instruction. If your data contain a separate grouping column, you can specify the samples to use from it using an instruction along the following lines:

subset = grouping %in% "sample"

Correlation is a common method used widely in many areas of study. In the following activity you will be able to practice carrying out correlation and covariance of some data.

Try It Out: Carry Out Correlation and Covariance
Use the fw, fw2, and fw3 data from the Beginning.RData file for this activity. The other data items are built into R.
1. Look at the fw data object; this contains two columns, count and speed. Conduct a Pearson correlation on these two variables:
> cor(fw$count, fw$speed)
[1] 0.7237206
2. Now look at the swiss data object; this is built into R. Use Kendall’s tau correlation to create a matrix of correlations:
> cor(swiss, method = 'kendall')
3. The swiss data produced a sizeable matrix. Simplify this by looking at the Fertility variable and correlating that to the others in the dataset. This time use the Spearman rho correlation.
> cor(swiss$Fertility, swiss, method = 'spearman')
     Fertility Agriculture Examination  Education  Catholic Infant.Mortality
[1,]         1   0.2426643   -0.660903 -0.4432577 0.4136456        0.4371367
4. Now look at the fw data object. It has two variables, count and speed. Create a covariance matrix:
> (fw.cov = cov(fw))
         count    speed
count 185.8393 123.0000
speed 123.0000 155.4286
5. Convert the covariance matrix into a correlation:
> cov2cor(fw.cov)
          count     speed
count 1.0000000 0.7237206
speed 0.7237206 1.0000000
6. Look at the fw2 data object. This has the same number of rows as the fw object. It also has two columns, abund and flow. Carry out a correlation between the columns of one data frame and the other:
> cor(fw, fw2)
          abund      flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997
7. Carry out a Spearman rho test of significance on the count and speed variables from the fw data:
> with(fw, cor.test(count, speed, method = 'spearman'))

    Spearman's rank correlation rho

data:  count and speed 
S = 39.7357, p-value = 0.1796
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.5269556 

Warning message:
In cor.test.default(count, speed, method = "spearman") :
  Cannot compute exact p-values with ties
8. Now look at the fw2 data again. Conduct a Pearson correlation between the abund and flow variables. Set the confidence intervals to the 99 percent level and use an alternative hypothesis that the correlation is greater than 0:
> cor.test(fw2$abund, fw2$flow, conf = 0.99, alt = 'greater')

      Pearson's product-moment correlation

data:  fw2$abund and fw2$flow 
t = 2.0738, df = 6, p-value = 0.04173
alternative hypothesis: true correlation is greater than 0 
99 percent confidence interval:
 -0.265223  1.000000 
sample estimates:
      cor 
0.6461473
9. Use the formula syntax to carry out a Kendall tau correlation significance test between the Length and NO3 variables from the mf data object:
> cor.test(~ Length + NO3, data = mf, method = 'k', exact = F)

    Kendall's rank correlation tau

data:  Length and NO3 
z = 1.969, p-value = 0.04895
alternative hypothesis: true tau is not equal to 0 
sample estimates:
      tau 
0.2959383
10. Look at the fw3 data object. This is the same as fw, except that there is an additional grouping variable called cover. Use a subset of the data that corresponds to the open group and carry out a Pearson correlation significance test:
> cor.test(~ count + speed, data = fw3, subset = cover %in% 'open')

    Pearson's product-moment correlation

data:  count and speed 
t = -1.1225, df = 2, p-value = 0.3783
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.9907848  0.8432203 
sample estimates:
       cor 
-0.6216869
How It Works
The basic form of the cor() command requires two vectors, but if you have a data frame or numeric matrix all the columns will be used to form a correlation matrix. Any object can be correlated against any other object as long as the length of the individual vectors matches up. This works for the cov() command too, which determines covariance.
The cor.test() command enables you to carry out a significance test on the correlation. In this case you can now specify only two data vectors, but you can use a formula syntax, which makes it easier when the variables are contained within a data frame or matrix. The Pearson product moment is the default, but Spearman’s rho or Kendall’s tau tests can also be used. You can use the subset command to select data based on a grouping variable.
NOTE
Often you will create a new object to “hold” the result of a command. This will not be displayed until you type its name. However, if you enclose the entire command in parentheses you can force R to display your result immediately. Compare the following examples:
> fw.cor = cor(fw, fw2)
> fw.cor
          abund      flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997

> (fw.cor = cor(fw, fw2))
          abund      flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997
In the first example you have to type the name of the result to display it but in the second example the result is displayed immediately as well as stored (to fw.cor).

Tests for Association

When you have categorical data you can look for associations between categories by using the chi-squared test. Routines to achieve this are accessed using the chisq.test() command. You can add various additional instructions to the basic command to suit your requirements. These are summarized in Table 6-4.

Table 6-4: The Chi-Squared Test and its Various Options.

Command Explanation
chisq.test(x, y = NULL) A basic chi-squared test is carried out on a matrix or data frame. If x is provided as a vector, a second vector can be supplied. If x is a single vector and y is not given, a goodness of fit test is carried out.
correct = TRUE If the data form a 2 × 2 contingency table the Yates’ correction is applied.
p = A vector of probabilities for use with a goodness of fit test. If p is not given, the goodness of fit tests that the probabilities are all equal.
rescale.p = FALSE If TRUE, p is rescaled to sum to 1. For use with goodness of fit tests.
simulate.p.value = FALSE If set to TRUE, a Monte Carlo simulation is used to calculate p-values.
B = 2000 The number of replicates to use in the Monte Carlo simulation.

Multiple Categories: Chi-Squared Tests

The most common use for a chi-squared test is where you have multiple categories and want to see if associations exist between them. In the following example you can see some categorical data set out in a data frame. You have seen these data before:

> bird.df
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

The data here are already in a contingency table and each cell represents a unique combination of the two categories; here you have several habitats and several species. You run the chisq.test() command simply by giving the name of the data to the command like so:

> bird.cs = chisq.test(bird.df)
Warning message:
In chisq.test(bird.df) : Chi-squared approximation may be incorrect
> bird.cs

       Pearson's Chi-squared test

data:  bird.df 
X-squared = 78.2736, df = 20, p-value = 7.694e-09

In this case you give the result a name and set it up as a new object, which you examine in more detail in a moment. You get a warning message in this example; this is because you have some small values for your observed data and the expected values will probably include some that are smaller than 5. When you issue the name of the result object you see a very brief result that contains the salient points.

Your original data were in the form of a data frame, but you might also have used a matrix; if so, the result would be exactly the same. You can also use a table object, perhaps the result of using the xtabs() command on the raw data. In any event you end up with a result object, which you can examine in more detail. You might start by trying a summary() command:

> summary(bird.cs)
          Length Class  Mode     
statistic  1     -none- numeric  
parameter  1     -none- numeric  
p.value    1     -none- numeric  
method     1     -none- character
data.name  1     -none- character
observed  30     -none- numeric  
expected  30     -none- numeric  
residuals 30     -none- numeric  

This does not produce the result that you may have expected. However, it does show that the result object you created contains several parts. A simpler way to see what you are dealing with is to use the names() command:

> names(bird.cs)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals"

You can access the various parts of your result object by using the $ syntax and adding the part you want to examine. For example:

> bird.cs$stat
X-squared 
 78.27364 
> bird.cs$p.val
[1] 7.693581e-09

Here you select the statistic (the X-squared value) and the p-value; notice that you do not need to use the full name here, an abbreviation is fine as long as it is unambiguous. You can see the calculated expected values as well as the Pearson residuals by using the appropriate abbreviation. In the following example you look at the expected values:

> bird.cs$exp
                 Garden  Hedgerow  Parkland   Pasture  Woodland
Blackbird     59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch     17.203390  3.145763  6.783051 1.2779661 0.5898305
Great Tit     39.745763  7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305  8.027119 17.308475 3.2610169 1.5050847
Robin          8.305085  1.518644  3.274576 0.6169492 0.2847458
Song Thrush    5.932203  1.084746  2.338983 0.4406780 0.2033898
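Each expected value is simply the row total multiplied by the column total and divided by the grand total; you can verify this for yourself (a small sketch):

> outer(rowSums(bird.df), colSums(bird.df)) / sum(bird.df)  # reproduces bird.cs$exp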

You can see in this example that you have some expected values < 5, and this is the reason for the warning message. You might prefer to display the values as whole numbers; you can adjust the output “on the fly” by using the round() command to choose how many decimal places to display, like so:

> round(bird.cs$exp, 0)
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         60       11       24       4        2
Chaffinch         17        3        7       1        1
Great Tit         40        7       16       3        1
House Sparrow     44        8       17       3        2
Robin              8        2        3       1        0
Song Thrush        6        1        2       0        0

In this instance you chose to display no decimals at all, and so used 0 as the digits instruction in the round() command.

Monte Carlo Simulation

You can decide to determine the p-value by a slightly different method and can use a Monte Carlo simulation to do this. You add an extra instruction to the chisq.test() command, simulate.p.value = TRUE, like so:

> chisq.test(bird.df, simulate.p.value = TRUE, B = 2500)

       Pearson's Chi-squared test with simulated p-value (based on 2500
       replicates)

data:  bird.df 
X-squared = 78.2736, df = NA, p-value = 0.0003998

The default is that simulate.p.value = FALSE and that B = 2000. The latter is the number of replicates to use in the Monte Carlo test, which is set to 2500 for this example.

Yates’ Correction for 2 × 2 Tables

When you have a 2 × 2 contingency table it is common to apply the Yates’ correction. By default this is used if the contingency table has two rows and two columns. You can turn off the correction using the correct = FALSE instruction in the command. In the following example you can see a 2 × 2 table:

> nd
          Urt.dio.y Urt.dio.n
Rum.obt.y        96        41
Rum.obt.n        26        57

> chisq.test(nd)

       Pearson's Chi-squared test with Yates' continuity correction

data:  nd 
X-squared = 29.8653, df = 1, p-value = 4.631e-08

> chisq.test(nd, correct = FALSE)

       Pearson's Chi-squared test

data:  nd 
X-squared = 31.4143, df = 1, p-value = 2.084e-08

At the top you see the data and when you run the chisq.test() command you see that Yates’ correction is applied automatically. In the second example you force the command not to apply the correction by setting correct = FALSE. Yates’ correction is applied only when the matrix is 2 × 2, and even if you tell R to apply the correction explicitly it will do so only if the table is 2 × 2.

Single Category: Goodness of Fit Tests

You can use the chisq.test() command to carry out a goodness of fit test. In this case you must have two vectors of numerical values, one representing the observed values and the other representing the expected ratio of values. The goodness of fit test compares the data against the ratios (probabilities) you specify. If you do not specify any, the data are tested against equal probabilities.

In the following example you have a simple data frame containing two columns; the first column contains values relating to an old survey. The second column contains values relating to a new survey. You want to see if the proportions of the new survey match the old one, so you perform a goodness of fit test:

> survey
        old new
woody    23  19
shrubby  34  30
tall    132 111
short    98 101
grassy   45  52
mossy    53  26

To run the test you use the chisq.test() command, but this time you must specify the test data as a single vector and also point to the vector that contains the probabilities:

> survey.cs = chisq.test(survey$new, p = survey$old, rescale.p = TRUE)
> survey.cs

       Chi-squared test for given probabilities

data:  survey$new 
X-squared = 15.8389, df = 5, p-value = 0.00732

In this example you did not have the probabilities as true probabilities but as frequencies; you use the rescale.p = TRUE instruction to make sure that these are converted to probabilities (this instruction is set to FALSE by default).

The result contains all the usual items for a chi-squared result object, but if you display the expected values, for example, you do not automatically get to see the row names, even though they are present in the data:

> survey.cs$exp
[1]  20.25195  29.93766 116.22857  86.29091  39.62338  46.66753

You can get the row names from the original data using the row.names() command. You could set the names of the expected values in the following way:

> names(survey.cs$expected) = row.names(survey)
> survey.cs$exp
    woody   shrubby      tall     short    grassy     mossy 
 20.25195  29.93766 116.22857  86.29091  39.62338  46.66753

You could do something similar for the residuals and then when you inspected your result it would be easier to keep track of which value was related to which category.
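For example, naming the residuals makes them much easier to interpret (the values follow from the observed and expected values shown earlier):

> names(survey.cs$residuals) = row.names(survey)
> round(survey.cs$residuals, 2)
  woody shrubby    tall   short  grassy   mossy 
  -0.28    0.01   -0.48    1.58    1.97   -3.03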

In the following activity you can get a chance to practice the chi-squared test for association as well as goodness of fit by using a simple data example.

Try It Out: Carry Out Chi-Squared Tests on Some Data
Use the bees data object from the Beginning.RData file for this activity, which you will use to carry out a range of association and goodness of fit tests. The data are in a data frame and represent visits by various bee species to different plant species.
1. Carry out a basic chi-squared test on these data and save the result as a named object:
> bees
               Buff.tail Garden.bee Red.tail Honey.bee Carder.bee
Thistle               10          8       18        12          8
Vipers.bugloss         1          3        9        13         27
Golden.rain           37         19        1        16          6
Yellow.alfalfa         5          6        2         9         32
Blackberry            12          4        4        10         23

> (bees.cs = chisq.test(bees))

    Pearson's Chi-squared test

data:  bees 
X-squared = 120.6531, df = 16, p-value < 2.2e-16
2. Look at the result you just obtained—it contains several parts. Display the Pearson residuals for the result:
> names(bees.cs)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"  "expected" 
[8] "residuals"
> bees.cs$resid
                 Buff.tail Garden.bee  Red.tail   Honey.bee Carder.bee
Thistle        -0.66586684  0.1476203  4.544647  0.18079727  -2.394918
Vipers.bugloss -3.12467558 -1.5616655  1.169932  0.67626472   2.348309
Golden.rain     4.69620024  2.5323534 -2.686059 -0.01691336  -3.887003
Yellow.alfalfa -1.99986117 -0.4885699 -1.693054 -0.59837350   3.441582
Blackberry      0.09423625 -1.1886361 -0.853104 -0.23746700   1.385152
3. Now run the chi-squared test again but this time use a Monte Carlo simulation with 3000 replicates to determine the p-value:
> (bees.cs = chisq.test(bees, simulate.p.value = TRUE, B = 3000))

    Pearson's Chi-squared test with simulated p-value (based on 3000 replicates)

data:  bees 
X-squared = 120.6531, df = NA, p-value = 0.0003332
4. Look at a portion of the data as a 2 × 2 contingency table. Examine the effect of Yates’ correction on this subset:
> bees[1:2, 4:5]
               Honey.bee Carder.bee
Thistle               12          8
Vipers.bugloss        13         27

> chisq.test(bees[1:2, 4:5], correct = FALSE)

    Pearson's Chi-squared test

data:  bees[1:2, 4:5] 
X-squared = 4.1486, df = 1, p-value = 0.04167

> chisq.test(bees[1:2, 4:5], correct = TRUE)

    Pearson's Chi-squared test with Yates' continuity correction

data:  bees[1:2, 4:5] 
X-squared = 3.0943, df = 1, p-value = 0.07857
5. Look at the last two columns, representing two bee species. Carry out a goodness of fit test to determine if the proportions of visits are the same:
> with(bees, chisq.test(Honey.bee, p = Carder.bee, rescale = T))

    Chi-squared test for given probabilities

data:  Honey.bee 
X-squared = 58.088, df = 4, p-value = 7.313e-12

Warning message:
In chisq.test(Honey.bee, p = Carder.bee, rescale = T) :
  Chi-squared approximation may be incorrect
6. Carry out the same goodness of fit test but use a simulation to determine the p-value (you can abbreviate the command):
> with(bees, chisq.test(Honey.bee, p = Carder.bee, rescale = T, sim = T))

    Chi-squared test for given probabilities with simulated p-value (based on 2000
    replicates)

data:  Honey.bee 
X-squared = 58.088, df = NA, p-value = 0.0004998
7. Now look at a single column and carry out a goodness of fit test. This time omit the p = instruction to test the fit to equal probabilities:
> chisq.test(bees$Honey.bee)

    Chi-squared test for given probabilities

data:  bees$Honey.bee 
X-squared = 2.5, df = 4, p-value = 0.6446
How It Works
The basic form of the chisq.test() command will operate on a matrix or data frame. By enclosing the entire command in parentheses you can get the result object to display immediately. The results of many commands are stored as a list containing several elements, and you can see what is available using the names() command and view them using the $ syntax.
The p-value can be determined using a Monte Carlo simulation by using the simulate.p.value and B instructions. If the data form a 2 × 2 contingency, then Yates’ correction is automatically applied but only if the Monte Carlo simulation is not used.
To conduct a goodness of fit test you must specify p, the vector of probabilities; if this does not sum to 1 you will get an error unless you use rescale.p = TRUE. You can use a Monte Carlo simulation on a goodness of fit test. If a single vector is specified, a goodness of fit test is carried out but the probabilities are assumed to be equal.

Summary

  • A variety of simple statistical tests are built into R.
  • The t-test can be carried out using the t.test() command. This can conduct one- or two-sample tests and a range of options allow one-tailed and two-tailed tests.
  • The U-test is accessed via the wilcox.test() command. This non-parametric test of differences can be applied as one-sample or two-sample versions.
  • Matched paired data can be analyzed using t-test or U-test by the simple addition of the paired = TRUE instruction in the t.test() or wilcox.test() commands.
  • The subset instruction can be used to select one or more samples from a variable containing several groups.
  • Correlation and covariance can be carried out on pairs of vectors, or on entire data frames or matrix objects using the cor() and cov() commands. A single variable can be specified to produce a targeted correlation or covariance matrix.
  • Three types of correlation can be used: Pearson’s product moment, Spearman’s rho, or Kendall’s tau.
  • Correlation hypothesis tests can be carried out using Pearson, Spearman, or Kendall methods via the cor.test() command. Two variables can be specified as separate vectors or using the formula syntax.
  • Tests using categorical data can be carried out via the chisq.test() command. This can conduct standard tests of association (chi-squared tests) or goodness of fit tests. Monte Carlo simulation can be used to produce the p-value.

Exercises


You can find answers to these exercises in Appendix A.

Use the hogl and bv data objects in the Beginning.RData file for these exercises. The sleep, InsectSprays, and mtcars data objects are part of the regular distribution of R.

1. Look at the InsectSprays data. Compare the effectiveness of spray types A and B using a t-test.
2. Look at the hogl data. This data frame contains two columns, representing the abundance of a freshwater invertebrate (hoglouse) at two habitats (slow and fast). Use a U-test to compare the abundance.
3. Look at the sleep data; you will see that it has three columns. The extra column represents time of additional sleep induced by a drug. The group column gives a numeric value; this is the drug (1 or 2). The final column, ID, is simply the patient identification. Each patient was given both drugs (on different occasions) and the time of additional sleep recorded. Carry out a paired t-test on the additional sleep times and the different drugs.
4. Look at the mtcars data that gives data on the fuel consumption and other features of some automobiles from the 1970s. First look at a correlation matrix of these data, then focus on the correlation between mpg and the other variables. Finally, carry out a correlation test on the mpg and qsec (time taken to travel a quarter mile) variables.
5. Look at the bv data. Here you can see a column, visit, which relates to numbers of bees visiting various colors of flowers. The ratio column refers to the relative numbers of visits from a previous experiment. Carry out a goodness of fit test to see if the two experiments have given the same results.

What You Learned in This Chapter

Topic Key Points
T-test:
t.test(data1, data2 = NULL)
t.test(y ~ x, data)
Student’s t-test can be carried out using the t.test() command. You must specify two vectors if you want a two-sample test; otherwise a one-sample test is conducted. A formula can be specified if the data are in the appropriate layout. You can use various additional instructions to specify the test you require.
U-test:
wilcox.test(data1, data2 = NULL)
wilcox.test(y ~ x, data)
The U-test (Mann-Whitney or Wilcoxon test) can be carried out using the wilcox.test() command. One-sample or two-sample tests can be executed and a formula can be used if the data are in an appropriate layout. You can use various additional instructions to specify the test you require.
Paired tests:
t.test(x, y, paired = TRUE)
wilcox.test(x, y, paired = TRUE)
Paired versions of the t-test and the U-test can be carried out by adding the paired = TRUE instruction to the command. Pairs containing NA items are dropped. You get an error if you try to run a paired test on two vectors of unequal length.
Subsetting:
subset = group %in% c("grp1", "grp2")
If your data are in a form where you have a response variable and a predictor variable you can select a subset of the data using the subset instruction.
Covariance:
cov(x, y)
Pearson, Spearman, Kendall
cov2cor(matrix)
Covariance can be examined using the cov() command. You can specify two objects, which can be vector, data frame, or matrix. All objects must be of equal length. You can specify one of "pearson" (default), "spearman", or "kendall" (can be abbreviated). A covariance matrix can be converted to a correlation matrix using the cov2cor() command.
Correlation:
cor(x, y)
Pearson, Spearman, Kendall
Correlation can be carried out using the cor() command. You can specify two objects, which can be vector, data frame, or matrix. All objects must be of equal length. You can specify one of "pearson" (default), "spearman", or "kendall" (can be abbreviated).
Correlation hypothesis tests:
cor.test(x, y)
cor.test(~ y + x, data)
Correlation hypothesis tests can be carried out using the cor.test() command. You can specify two vectors or use the formula syntax. Unlike the cov() or cor() commands, you can compare only two variables at a time. You can specify one of "pearson" (default), "spearman", or "kendall" (can be abbreviated) as the method to use.
Association tests:
chisq.test(x, y = NULL)
Chi-squared tests of association can be carried out using the chisq.test() command. If x is a data frame or matrix, y is ignored. Yates’ correction is applied by default to 2 × 2 contingency tables.
Goodness of fit tests:
chisq.test(x, p = , rescale.p = FALSE)
Chi-squared goodness of fit tests can be carried out using the chisq.test() command. A single vector must be given for the test data and the probabilities to test against are given as p. If they do not sum to 1, you can use the rescale.p instruction. If p is not supplied the probabilities are taken as equal.
Monte Carlo simulation:
simulate.p.value = FALSE
B = 2000
For chi-squared tests of association or goodness of fit you can determine the p-value by Monte Carlo simulation using the simulate.p.value instruction. The number of trials is set at 2000, which you can alter.
Rounding values:
round(object, digits = 6)
The level of precision of displayed results can be altered using the round() command. You specify the numerical results to use and the number of digits to use, which defaults to 6.