What you will learn in this chapter:
How to carry out the Student's t-test and the non-parametric U-test
How to use paired versions of these tests
How to examine correlation and covariance
How to carry out chi-squared tests of association and goodness of fit
Many statistical analyses are concerned with testing hypotheses. In this chapter you look at methods of testing some simple hypotheses using standard and classic tests. You start by comparing differences between two samples. Then you look at the correlation between two samples, and finally look at tests for association and goodness of fit. Other tests are available in R, but the ones illustrated here will form a good foundation and give you an idea of how R works. Should you require a different test, you will be able to work out how to carry it out for yourself.
The Student's t-test is a method for comparing two samples: it examines the sample means to determine whether the samples are different. It is a parametric test, so the data should be normally distributed. You looked at the distribution of data previously in Chapter 5.
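One quick way to check the normality assumption before testing is the Shapiro-Wilk test, available in base R as the shapiro.test() command; for example, for a sample vector such as the data2 object used later (output not shown), a large p-value gives no reason to reject normality:
> shapiro.test(data2)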
Several versions of the t-test exist, and R can handle these using the t.test() command, which has a variety of options (see Table 6-1), and the test can be pressed into service to deal with two- and one-sample tests as well as paired tests. The latter option is discussed in the later section “Paired T- and U-Tests”; in this section you look at some more basic options.
Command | Explanation |
t.test(data.1, data.2) | The basic method of applying a t-test is to compare two vectors of numeric data. |
var.equal = FALSE | If the var.equal instruction is set to TRUE, the variance is considered to be equal and the standard test is carried out. If the instruction is set to FALSE (the default), the variance is considered unequal and the Welch two-sample test is carried out. |
mu = 0 | If a one-sample test is carried out, mu indicates the mean against which the sample should be tested. |
alternative = "two.sided" | Sets the alternative hypothesis. The default is "two.sided" but you can specify "greater" or "less". You can abbreviate the instruction (but you still need quotes). |
conf.level = 0.95 | Sets the confidence level of the interval (default = 0.95). |
paired = FALSE | If set to TRUE, a matched pair t-test is carried out. |
t.test(y ~ x, data, subset) | The required data can be specified as a formula of the form response ~ predictor. In this case, the data should be named and a subset of the predictor variable can be specified. |
subset = predictor %in% c("sample.1", "sample.2") | If the data is in the form response ~ predictor, the subset instruction can specify which two samples to select from the predictor column of the data. |
The general way to use the t.test() command is to compare two vectors of numeric values. You can specify the vectors in a variety of ways, depending on how your data objects are set out. By default the t.test() command does not assume that the samples have equal variance, so the Welch two-sample test is carried out unless you specify otherwise:
> t.test(data2, data3)
Welch Two Sample t-test
data: data2 and data3
t = -2.8151, df = 24.564, p-value = 0.009462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.5366789 -0.5466544
sample estimates:
mean of x mean of y
5.125000 7.166667
You can override the default and use the classic t-test by adding the var.equal = TRUE instruction, which forces the command to assume that the variance of the two samples is equal. The calculation of the t-value uses pooled variance and the degrees of freedom are unmodified; as a result, the p-value is slightly different from the Welch version:
> t.test(data2, data3, var.equal = TRUE)
Two Sample t-test
data: data2 and data3
t = -2.7908, df = 26, p-value = 0.009718
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.5454233 -0.5379101
sample estimates:
mean of x mean of y
5.125000 7.166667
You can also carry out a one-sample t-test. In this version you supply the name of a single vector and the mean to compare it to (this defaults to 0):
> t.test(data2, mu = 5)
One Sample t-test
data: data2
t = 0.2548, df = 15, p-value = 0.8023
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.079448 6.170552
sample estimates:
mean of x
5.125
You can also specify a "direction" to your hypothesis. In many cases you are simply testing to see if the means of two samples are different, but you may want to know whether a sample mean is lower (or greater) than another. You can use the alternative = instruction to switch the emphasis from a two-sided test (the default) to a one-sided test. The choices are "two.sided", "less", or "greater", and your choice can be abbreviated.
> t.test(data2, mu = 5, alternative = 'greater')
One Sample t-test
data: data2
t = 0.2548, df = 15, p-value = 0.4012
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
4.265067 Inf
sample estimates:
mean of x
5.125
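The alternative = instruction works for two-sample tests in the same way; for example, to test whether the mean of data2 is less than the mean of data3 you might run (output not shown):
> t.test(data2, data3, alternative = 'less')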
The t-test is designed to compare two samples (or one sample with a “standard”). So far you have seen how to carry out the t-test on separate vectors of values. However, your data may well be in a more structured form with a column for the response variable and a column for the predictor variable. The following data are set out in this manner:
> grass
rich graze
1 12 mow
2 15 mow
3 17 mow
4 11 mow
5 15 mow
6 8 unmow
7 9 unmow
8 7 unmow
9 9 unmow
This way of setting out data is more sensible and flexible, but you need a new way to deal with the layout. R deals with this by having a “formula syntax.” You create a formula using the tilde (~) symbol. Essentially your response variable goes on the left of the ~ and the predictor goes on the right like so:
> t.test(rich ~ graze, data = grass)
Welch Two Sample t-test
data: rich by graze
t = 4.8098, df = 5.411, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.745758 8.754242
sample estimates:
mean in group mow mean in group unmow
14.00 8.25
If your predictor column contains more than two samples, the t-test cannot be used directly. However, you can still carry out a test by subsetting this predictor column and specifying which two samples you want to compare. You must use the subset = instruction as part of the t.test() command. The following example illustrates how to do this using the same data as in the previous example.
> t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))
You first specify which column you want to take your subset from (graze in this case) and then type %in%; this tells the command that the list that follows is contained in the graze column. Note that you have to put the levels in quotes; here you compare "mow" and "unmow", and your result (not shown) is identical to the one you obtained before.
> ls(pattern='^orc')
[1] "orchid" "orchid2" "orchis" "orchis2"
> orchid
closed open
1 7 3
2 8 5
3 6 6
4 9 7
5 10 6
6 11 8
7 7 8
8 8 4
9 10 7
10 9 6
> attach(orchid)
> t.test(open, closed)
Welch Two Sample t-test
data: open and closed
t = -3.478, df = 17.981, p-value = 0.002688
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.0102455 -0.9897545
sample estimates:
mean of x mean of y
6.0 8.5
> detach(orchid)
> with(orchid, t.test(open, closed, var.equal = TRUE))
Two Sample t-test
data: open and closed
t = -3.478, df = 18, p-value = 0.002684
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.0101329 -0.9898671
sample estimates:
mean of x mean of y
6.0 8.5
> t.test(orchid$open, mu = 5)
One Sample t-test
data: orchid$open
t = 1.9365, df = 9, p-value = 0.08479
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.831827 7.168173
sample estimates:
mean of x
6
> str(orchis)
'data.frame': 20 obs. of 2 variables:
$ flower: num 7 8 6 9 10 11 7 8 10 9 ...
$ site : Factor w/ 2 levels "closed","open": 1 1 1 1 1 1 1 1 1 1 ...
> t.test(flower ~ site, data = orchis)
> str(orchis2)
'data.frame': 30 obs. of 2 variables:
$ flower: num 7 8 6 9 10 11 7 8 10 9 ...
$ site : Factor w/ 3 levels "closed","open",..: 1 1 1 1 1 1 1 1 1 1 ...
> t.test(flower ~ site, data = orchis2, subset = site %in% c('open', 'closed'))
> t.test(orchid$open, alternative = 'less', mu = 7)
One Sample t-test
data: orchid$open
t = -1.9365, df = 9, p-value = 0.04239
alternative hypothesis: true mean is less than 7
95 percent confidence interval:
-Inf 6.946615
sample estimates:
mean of x
6
> t.test(orchis2$flower[orchis2$site=='sprayed'], mu = 3, alt = 'greater')
> with(orchis2, t.test(flower[site=='sprayed'], mu = 3, alt = 'g'))
One Sample t-test
data: orchis2$flower[orchis2$site == "sprayed"]
t = 1.9412, df = 9, p-value = 0.04208
alternative hypothesis: true mean is greater than 3
95 percent confidence interval:
3.061236 Inf
sample estimates:
mean of x
4.1
The t-test is a powerful and flexible tool, as you have seen. You can also carry out paired tests using the t.test() command, but before that you look at the U-test, which you can think of as the non-parametric equivalent to the t-test.
When you have two samples to compare and your data are non-parametric, you can use the U-test. This test goes by various names: the two-sample form is often called the Mann-Whitney U-test or the Wilcoxon rank-sum test (the one-sample and paired forms are Wilcoxon signed-rank tests). You use the wilcox.test() command to carry out the analysis, and you operate it very much like the t.test() command you used previously.
The wilcox.test() command can conduct two-sample or one-sample tests, and you can add a variety of instructions to carry out the test you want. The main options are shown in Table 6-2.
Command | Explanation |
wilcox.test(sample.1, sample.2) | Carries out a basic two-sample U-test on the numerical vectors specified. |
mu = 0 | If a one-sample test is carried out, mu indicates the value against which the sample should be tested. |
alternative = "two.sided" | Sets the alternative hypothesis. The default is "two.sided" but you can specify "greater" or "less". You can abbreviate the instruction (but you still need quotes). |
conf.int = FALSE | Sets whether confidence intervals should be reported. |
conf.level = 0.95 | Sets the confidence level of the interval (default = 0.95). |
correct = TRUE | By default the continuity correction is applied. You can turn this off by setting it to FALSE. |
paired = FALSE | If set to TRUE, a matched pair U-test is carried out. |
exact = NULL | Sets whether an exact p-value should be computed. By default an exact p-value is computed when the samples contain fewer than 50 values and there are no ties. |
wilcox.test(y ~ x, data, subset) | The required data can be specified as a formula of the form response ~ predictor. In this case the data should be named and a subset of the predictor variable can be specified. |
subset = predictor %in% c("sample.1", "sample.2") | If the data is in the form response ~ predictor, the subset instruction can specify which two samples to select from the predictor column of the data. |
The basic way of using the wilcox.test() is to specify the two samples you want to compare as separate vectors, as the following example shows:
> data1 ; data2
[1] 3 5 7 5 3 2 6 8 5 6 9
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> wilcox.test(data1, data2)
Wilcoxon rank sum test with continuity correction
data: data1 and data2
W = 94.5, p-value = 0.7639
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(data1, data2) :
cannot compute exact p-value with ties
By default the confidence intervals are not calculated and the p-value is adjusted using the “continuity correction”; a message tells you that the latter has been used. In this case you see a warning message because you have tied values in the data. If you set exact = FALSE, this message would not be displayed because the p-value would be determined from a normal approximation method.
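For example, re-running the previous comparison with the exact p-value turned off suppresses the warning (output not shown):
> wilcox.test(data1, data2, exact = FALSE)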
If you specify a single numerical vector, a one-sample U-test is carried out; the default is to set mu = 0, as in the following example:
> wilcox.test(data3, exact = FALSE)
Wilcoxon signed rank test with continuity correction
data: data3
V = 78, p-value = 0.002430
alternative hypothesis: true location is not equal to 0
In this case the p-value is taken from a normal approximation because the exact = FALSE instruction is used. The command has assumed mu = 0 because it is not specified explicitly.
Both one- and two-sample tests use an alternative hypothesis that the location shift is not equal to 0 as their default. This is essentially a two-sided hypothesis. You can change this by using the alternative = instruction, where you can select “two.sided”, “less”, or “greater” as your alternative hypothesis (an abbreviation is acceptable but you still need quotes, single or double). You can also specify mu, the location shift. By default mu = 0. In the following example the hypothesis is set to something other than 0:
> data3
[1] 6 7 8 7 6 3 8 9 10 7 6 9
> summary(data3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 6.000 7.000 7.167 8.250 10.000
> wilcox.test(data3, mu = 8, exact = FALSE, conf.int = TRUE, alt = 'less')
Wilcoxon signed rank test with continuity correction
data: data3
V = 13.5, p-value = 0.08021
alternative hypothesis: true location is less than 8
95 percent confidence interval:
-Inf 8.000002
sample estimates:
(pseudo)median
6.999956
In this example a one-sample test is carried out on the data3 sample vector. The test looks to see if the sample median is less than 8. The instructions also specify to display the confidence interval and not to use an exact p-value.
It is generally a good idea to have your data arranged into a data frame where one column represents the response variable and another represents the predictor variable. In this case you can use the formula syntax to describe the situation and carry out the wilcox.test() on your data. This is much the same method you used for the t-test previously. The basic form of the command becomes:
wilcox.test(response ~ predictor, data = my.data)
You can also use additional instructions as you could with the other syntax. If your predictor variable contains more than two samples, you cannot conduct a U-test and must use a subset that contains exactly two samples. The subset instruction works like so:
wilcox.test(response ~ predictor, data = my.data, subset = predictor %in% c("sample1", "sample2"))
Notice that you use a c() command to group the samples together, and their names must be in quotes.
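As a concrete illustration, the orchis2 data frame used earlier has three levels in its site column, so you might select two of them like this (output not shown):
> wilcox.test(flower ~ site, data = orchis2, subset = site %in% c('open', 'closed'), exact = FALSE)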
The U-test is one of the most widely used statistical methods, so it is important to be comfortable using the wilcox.test() command. In the following activity you try conducting a range of U-tests for yourself.
> ls(pattern='^bf')
> bfc
grass heath
1 3 6
2 4 7
3 3 8
4 5 8
5 6 9
6 12 11
7 21 12
8 4 11
9 5 NA
10 4 NA
11 7 NA
12 8 NA
> wilcox.test(bfc$grass, bfc$heath)
Wilcoxon rank sum test with continuity correction
data: bfc$grass and bfc$heath
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(bfc$grass, bfc$heath) :
cannot compute exact p-value with ties
> summary(bfc$grass)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 4.000 5.000 6.833 7.250 21.000
> with(bfc, wilcox.test(grass, mu = 7.5, exact = F, alt = 'less'))
Wilcoxon signed rank test with continuity correction
data: grass
V = 23.5, p-value = 0.1188
alternative hypothesis: true location is less than 7.5
> str(bf2)
'data.frame': 20 obs. of 2 variables:
$ count: int 3 4 3 5 6 12 21 4 5 4 ...
$ site : Factor w/ 2 levels "Grass","Heath": 1 1 1 1 1 1 1 1 1 1 ...
> wilcox.test(count ~ site, data = bf2, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: count by site
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0
> with(bf2, summary(count[which(site=='Heath')]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 7.75 8.50 9.00 11.00 12.00
> with(bf2, wilcox.test(count[which(site=='Heath')], exact = F,
alt = 'greater', mu = 7.75))
Wilcoxon signed rank test with continuity correction
data: count[which(site == "Heath")]
V = 28, p-value = 0.09118
alternative hypothesis: true location is greater than 7.75
> wilcox.test(count ~ site, data = bfs, subset = site %in% c('Grass', 'Arable'),
exact = F)
Wilcoxon rank sum test with continuity correction
data: count by site
W = 81.5, p-value = 0.05375
alternative hypothesis: true location shift is not equal to 0
The U-test is a useful tool for comparing two samples and is one of the most widely used of all simple statistical tests. Both the t.test() and wilcox.test() commands can also deal with matched pair data, which you have not seen yet. This is the subject of the next section.
If you have paired data, you can use matched pair versions of the t-test and the U-test by adding one simple extra instruction to your command: paired = TRUE. It does not matter whether the data are in two separate sample columns or are represented as response and predictor, as long as you indicate what is required using the appropriate syntax. In fact, R will carry out a paired test even if the data do not really match up as pairs; it is up to you to ensure that pairing makes sense. You can use all the regular syntax and instructions, so you can apply subsetting and directional hypotheses as you like. In the following activity you try a few paired tests for yourself.
> mpd
white yellow
1 4 4
2 3 7
3 4 2
4 1 2
5 6 7
6 4 10
7 6 5
8 4 8
> wilcox.test(mpd$white, mpd$yellow, exact = FALSE, paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: mpd$white and mpd$yellow
V = 6, p-value = 0.2008
alternative hypothesis: true location shift is not equal to 0
> mean(mpd)
white yellow
4.000 5.625
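Note that in recent versions of R the mean() command no longer returns column means for a data frame; the colMeans() command gives the same result:
> colMeans(mpd)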
> with(mpd, t.test(white, yellow, paired = TRUE, mu = 2, alt = 'less'))
Paired t-test
data: white and yellow
t = -3.6958, df = 7, p-value = 0.003849
alternative hypothesis: true difference in means is less than 2
95 percent confidence interval:
-Inf 0.2332847
sample estimates:
mean of the differences
-1.625
> wilcox.test(count ~ trap, data = mpd.s, paired = TRUE, exact = F)
Wilcoxon signed rank test with continuity correction
data: count by trap
V = 6, p-value = 0.2008
alternative hypothesis: true location shift is not equal to 0
> t.test(count ~ trap, data = mpd.s, paired = TRUE, mu = 1, conf.level = 0.99)
Paired t-test
data: count by trap
t = -2.6763, df = 7, p-value = 0.03171
alternative hypothesis: true difference in means is not equal to 1
99 percent confidence interval:
-5.057445 1.807445
sample estimates:
mean of the differences
-1.625
> t.test(flower ~ site, data = orchis2, subset = site %in% c('open', 'sprayed'),
paired = TRUE)
Paired t-test
data: flower by site
t = 4.1461, df = 9, p-value = 0.002499
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8633494 2.9366506
sample estimates:
mean of the differences
1.9
When you have two continuous variables you can look for a link between them; this link is called a correlation. You can go about finding this several ways using R. The cor() command determines correlations between two vectors, all the columns of a data frame (or matrix), or two data frames (or matrix objects). The cov() command examines covariance. By default the Pearson product-moment correlation (that is, regular parametric correlation) is used, but the Spearman rho and Kendall tau methods (both non-parametric correlations) can be specified instead. The cor.test() command carries out a test of significance of the correlation.
You can add a variety of additional instructions to these commands. Table 6-3 gives a brief summary of them.
Command | Explanation |
cor(x, y = NULL) | Carries out a basic correlation between x and y. If x is a matrix or data frame, y can be omitted. |
cov(x, y = NULL) | Determines covariance between x and y. If x is a matrix or data frame, y can be omitted. |
cov2cor(V) | Takes a covariance matrix V and calculates the correlations. |
method = | The default is "pearson", but "spearman" or "kendall" can be specified as the methods for correlation or covariance. These can be abbreviated but you still need the quotes, and note that they are lowercase. |
var(x, y = NULL) | Determines the variance of x. If x is a matrix or data frame or y is specified, the covariance is also determined. |
cor.test(x, y) | Carries out a significance test of the correlation between x and y. |
alternative = "two.sided" | The default is for a two-sided test but the alternative hypothesis can be given as "two.sided", "greater", or "less" and abbreviations are permitted. |
conf.level = 0.95 | If the method = "pearson" and n > 3, the confidence intervals will be shown. This instruction sets the confidence level and defaults to 0.95. |
exact = NULL | For Kendall or Spearman, should an exact p-value be determined? Set this to TRUE or FALSE (with the default NULL, R decides based on the sample size and the presence of ties). |
continuity = FALSE | For Spearman or Kendall tests setting this to TRUE carries out a continuity correction. |
cor.test( ~ x + y, data) | If the data are in a data frame, a formula syntax can be used. This is of the form ~ x + y where x and y are two variables. The data frame can be specified. All other instructions can be used including subset. |
subset = group %in% "sample" | If the data includes a grouping variable, the subset instruction can be used to select one or more samples from this grouping. |
The commands summarized in Table 6-3 enable you to carry out a range of correlation tasks. In the following sections you see a few of these options illustrated, and you can then try some correlations yourself in the activity that follows.
Simple correlations are between two continuous variables and you can use the cor() command to obtain a correlation coefficient like so:
> count = c(9, 25, 15, 2, 14, 25, 24, 47)
> speed = c(2, 3, 5, 9, 14, 24, 29, 34)
> cor(count, speed)
[1] 0.7237206
The default for R is to carry out the Pearson product moment, but you can specify other correlations using the method = instruction, like so:
> cor(count, speed, method = 'spearman')
[1] 0.5269556
This example used the Spearman rho correlation but you can also apply Kendall’s tau by specifying method = “kendall”. Note that you can abbreviate this but you still need the quotes. You also have to use lowercase.
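For example, using the same vectors (output not shown):
> cor(count, speed, method = 'kendall')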
If your vectors are contained within a data frame or some other object, you need to extract them in a different fashion. Look at the women data frame. This comes as example data with your distribution of R.
> data(women)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
You need to use attach() or with() commands to allow R to “read inside” the data frame and access the variables within. You could also use the $ syntax so that the command can access the variables as the following example shows:
> cor(women$height, women$weight)
[1] 0.9954948
In this example the cor() command has calculated the Pearson correlation coefficient between the height and weight variables contained in the women data frame.
You can also use the cor() command directly on a data frame (or matrix). If you use the data frame women that you just looked at, for example, you get the following:
> cor(women)
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
Now you have a correlation matrix that shows you all combinations of the variables in the data frame. When you have more columns the matrix can be much more complex. The following example contains five columns of data:
> head(mf)
Length Speed Algae NO3 BOD
1 20 12 40 2.25 200
2 21 14 45 2.15 180
3 22 12 45 1.75 135
4 23 16 80 1.95 120
5 21 20 75 1.95 110
6 20 21 65 2.75 120
> cor(mf)
Length Speed Algae NO3 BOD
Length 1.0000000 -0.34322968 0.7650757 0.45476093 -0.8055507
Speed -0.3432297 1.00000000 -0.1134416 0.02257931 0.1983412
Algae 0.7650757 -0.11344163 1.0000000 0.37706463 -0.8365705
NO3 0.4547609 0.02257931 0.3770646 1.00000000 -0.3751308
BOD -0.8055507 0.19834122 -0.8365705 -0.37513077 1.0000000
The correlation matrix can be helpful, but you may not always want to see all the possible combinations; here, for instance, the first column is the response variable and the others are predictor variables. You can select a single variable and compare it to all the others, for example comparing Length to every column of the mf data frame using the default Pearson coefficient:
> cor(mf$Length, mf)
Length Speed Algae NO3 BOD
[1,] 1 -0.3432297 0.7650757 0.4547609 -0.8055507
The cov() command uses syntax similar to the cor() command to examine covariance. The women data are used with the cov() command in the following example:
> cov(women$height, women$weight)
[1] 69
> cov(women)
height weight
height 20 69.0000
weight 69 240.2095
The cov2cor() command is used to determine the correlations from a covariance matrix, as in the following example:
> women.cv = cov(women)
> cov2cor(women.cv)
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
You can apply a significance test to your correlations using the cor.test() command. In this case you can compare only two vectors at a time as the following example shows:
> cor.test(women$height, women$weight)
Pearson's product-moment correlation
data: women$height and women$weight
t = 37.8553, df = 13, p-value = 1.088e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9860970 0.9985447
sample estimates:
cor
0.9954948
In the previous example you can see that the Pearson correlation has been carried out between height and weight in the women data and the result also shows the statistical significance of the correlation.
If your data are contained in a data frame, using the attach() or with() commands is tedious, as is using the $ syntax. A formula syntax is available as an alternative, which provides a neater representation of your data:
> data(cars)
> cor.test(~ speed + dist, data = cars, method = 'spearman', exact = F)
Spearman's rank correlation rho
data: speed and dist
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8303568
Here you examine the cars data, which comes built into R. The formula is slightly different from the one that you met previously. Here you specify both variables to the right of the ~. You also give the name of the data as a separate instruction.
All the additional instructions, including subset, are available when using the formula syntax. If your data contain a separate grouping column, you can specify the samples to use from it using an instruction along the following lines:
subset = grouping %in% "sample"
Correlation is a common method used widely in many areas of study. In the following activity you will be able to practice carrying out correlation and covariance of some data.
> cor(fw$count, fw$speed)
[1] 0.7237206
> cor(swiss, method = 'kendall')
> cor(swiss$Fertility, swiss, method = 'spearman')
Fertility Agriculture Examination Education Catholic Infant.Mortality
[1,] 1 0.2426643 -0.660903 -0.4432577 0.4136456 0.4371367
> (fw.cov = cov(fw))
count speed
count 185.8393 123.0000
speed 123.0000 155.4286
> cov2cor(fw.cov)
count speed
count 1.0000000 0.7237206
speed 0.7237206 1.0000000
> cor(fw, fw2)
abund flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997
> with(fw, cor.test(count, speed, method = 'spearman'))
Spearman's rank correlation rho
data: count and speed
S = 39.7357, p-value = 0.1796
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5269556
Warning message:
In cor.test.default(count, speed, method = "spearman") :
Cannot compute exact p-values with ties
> cor.test(fw2$abund, fw2$flow, conf = 0.99, alt = 'greater')
Pearson's product-moment correlation
data: fw2$abund and fw2$flow
t = 2.0738, df = 6, p-value = 0.04173
alternative hypothesis: true correlation is greater than 0
99 percent confidence interval:
-0.265223 1.000000
sample estimates:
cor
0.6461473
> cor.test(~ Length + NO3, data = mf, method = 'k', exact = F)
Kendall's rank correlation tau
data: Length and NO3
z = 1.969, p-value = 0.04895
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.2959383
> cor.test(~ count + speed, data = fw3, subset = cover %in% 'open')
Pearson's product-moment correlation
data: count and speed
t = -1.1225, df = 2, p-value = 0.3783
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9907848 0.8432203
sample estimates:
cor
-0.6216869
> fw.cor = cor(fw, fw2)
> fw.cor
abund flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997
> (fw.cor = cor(fw, fw2))
abund flow
count 0.9905759 0.7066437
speed 0.6527244 0.9889997
When you have categorical data you can look for associations between categories by using the chi-squared test. Routines to achieve this are accessed using the chisq.test() command. You can add various additional instructions to the basic command to suit your requirements. These are summarized in Table 6-4.
Command | Explanation |
chisq.test(x, y = NULL) | A basic chi-squared test is carried out on a matrix or data frame. If x is provided as a vector, a second vector can be supplied. If x is a single vector and y is not given, a goodness of fit test is carried out. |
correct = TRUE | If the data form a 2 × 2 contingency table the Yates’ correction is applied. |
p = | A vector of probabilities for use with a goodness of fit test. If p is not given, the goodness of fit test assumes equal probabilities. |
rescale.p = FALSE | If TRUE, p is rescaled to sum to 1. For use with goodness of fit tests. |
simulate.p.value = FALSE | If set to TRUE, a Monte Carlo simulation is used to calculate p-values. |
B = 2000 | The number of replicates to use in the Monte Carlo simulation. |
The most common use for a chi-squared test is where you have multiple categories and want to see if associations exist between them. In the following example you can see some categorical data set out in a data frame. You have seen these data before:
> bird.df
Garden Hedgerow Parkland Pasture Woodland
Blackbird 47 10 40 2 2
Chaffinch 19 3 5 0 2
Great Tit 50 0 10 7 0
House Sparrow 46 16 8 4 0
Robin 9 3 0 0 2
Song Thrush 4 0 6 0 0
The data here are already in a contingency table and each cell represents a unique combination of the two categories; here you have several habitats and several species. You run the chisq.test() command simply by giving the name of the data to the command like so:
> bird.cs = chisq.test(bird.df)
Warning message:
In chisq.test(bird.df) : Chi-squared approximation may be incorrect
> bird.cs
Pearson's Chi-squared test
data: bird.df
X-squared = 78.2736, df = 20, p-value = 7.694e-09
In this case you give the result a name and set it up as a new object, which you examine in more detail in a moment. You get a warning message in this example; this is because you have some small values in your observed data, and the expected values will probably include some that are smaller than 5. When you issue the name of the result object you see a very brief result that contains the salient points.
Your original data were in the form of a data frame, but you might also have used a matrix; if so, the result would be exactly the same. You can also use a table, perhaps the result of using the xtabs() command on the raw data. In any event you end up with a result object, which you can examine in more detail. You might start by trying a summary() command:
> summary(bird.cs)
Length Class Mode
statistic 1 -none- numeric
parameter 1 -none- numeric
p.value 1 -none- numeric
method 1 -none- character
data.name 1 -none- character
observed 30 -none- numeric
expected 30 -none- numeric
residuals 30 -none- numeric
This does not produce the result that you may have expected. However, it does show that the result object you created contains several parts. A simpler way to see what you are dealing with is to use the names() command:
> names(bird.cs)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals"
You can access the various parts of your result object by using the $ syntax and adding the part you want to examine. For example:
> bird.cs$stat
X-squared
78.27364
> bird.cs$p.val
[1] 7.693581e-09
Here you select the statistic (the X-squared value) and the p-value; notice that you do not need to use the full name here because an abbreviation is fine as long as it is unambiguous. You can see the calculated expected values as well as the Pearson residuals by using the appropriate abbreviation. In the following example you look at the expected values:
> bird.cs$exp
Garden Hedgerow Parkland Pasture Woodland
Blackbird 59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch 17.203390 3.145763 6.783051 1.2779661 0.5898305
Great Tit 39.745763 7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305 8.027119 17.308475 3.2610169 1.5050847
Robin 8.305085 1.518644 3.274576 0.6169492 0.2847458
Song Thrush 5.932203 1.084746 2.338983 0.4406780 0.2033898
You can see in this example that you have some expected values < 5, and this is the reason for the warning message. You might prefer to display the values as whole numbers; you can adjust the output "on the fly" by using the round() command to choose how many decimal places to display, like so:
> round(bird.cs$exp, 0)
Garden Hedgerow Parkland Pasture Woodland
Blackbird 60 11 24 4 2
Chaffinch 17 3 7 1 1
Great Tit 40 7 16 3 1
House Sparrow 44 8 17 3 2
Robin 8 2 3 1 0
Song Thrush 6 1 2 0 0
In this instance you chose to use no decimals at all and so used 0 as the digits instruction in the round() command.
You can decide to determine the p-value by a slightly different method and can use a Monte Carlo simulation to do this. You add an extra instruction to the chisq.test() command, simulate.p.value = TRUE, like so:
> chisq.test(bird.df, simulate.p.value = TRUE, B = 2500)
Pearson's Chi-squared test with simulated p-value (based on 2500
replicates)
data: bird.df
X-squared = 78.2736, df = NA, p-value = 0.0003998
The default is that simulate.p.value = FALSE and that B = 2000. The latter is the number of replicates to use in the Monte Carlo test, which is set to 2500 for this example.
When you have a 2 × 2 contingency table it is common to apply the Yates’ correction. By default this is used if the contingency table has two rows and two columns. You can turn off the correction using the correct = FALSE instruction in the command. In the following example you can see a 2 × 2 table:
> nd
Urt.dio.y Urt.dio.n
Rum.obt.y 96 41
Rum.obt.n 26 57
> chisq.test(nd)
Pearson's Chi-squared test with Yates' continuity correction
data: nd
X-squared = 29.8653, df = 1, p-value = 4.631e-08
> chisq.test(nd, correct = FALSE)
Pearson's Chi-squared test
data: nd
X-squared = 31.4143, df = 1, p-value = 2.084e-08
At the top you see the data and when you run the chisq.test() command you see that Yates’ correction is applied automatically. In the second example you force the command not to apply the correction by setting correct = FALSE. Yates’ correction is applied only when the matrix is 2 × 2, and even if you tell R to apply the correction explicitly it will do so only if the table is 2 × 2.
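You can confirm this with the earlier bird.df example (a 6 × 5 table): adding correct = TRUE makes no difference to the result (output not shown):
> chisq.test(bird.df, correct = TRUE)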
You can use the chisq.test() command to carry out a goodness of fit test. In this case you must have two vectors of numerical values, one representing the observed values and the other representing the expected ratio of values. The goodness of fit test compares the data against the ratios (probabilities) you specify. If you do not specify any, the data are tested against equal probabilities.
In the following example you have a simple data frame containing two columns; the first column contains values relating to an old survey. The second column contains values relating to a new survey. You want to see if the proportions of the new survey match the old one, so you perform a goodness of fit test:
> survey
old new
woody 23 19
shrubby 34 30
tall 132 111
short 98 101
grassy 45 52
mossy 53 26
To run the test you use the chisq.test() command, but this time you must specify the test data as a single vector and also point to the vector that contains the probabilities:
> survey.cs = chisq.test(survey$new, p = survey$old, rescale.p = TRUE)
> survey.cs
Chi-squared test for given probabilities
data: survey$new
X-squared = 15.8389, df = 5, p-value = 0.00732
In this example you did not have the probabilities as true probabilities but as frequencies; you use the rescale.p = TRUE instruction to make sure that these are converted to probabilities (this instruction is set to FALSE by default).
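Rescaling simply divides each frequency by the total, so in this case the tested probabilities are equivalent to (output not shown):
> survey$old / sum(survey$old)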
The result contains all the usual items for a chi-squared result object, but if you display the expected values, for example, you do not automatically get to see the row names, even though they are present in the data:
> survey.cs$exp
[1] 20.25195 29.93766 116.22857 86.29091 39.62338 46.66753
You can get the row names from the original data using the row.names() command. You could set the names of the expected values in the following way:
> names(survey.cs$expected) = row.names(survey)
> survey.cs$exp
woody shrubby tall short grassy mossy
20.25195 29.93766 116.22857 86.29091 39.62338 46.66753
You could do something similar for the residuals and then when you inspected your result it would be easier to keep track of which value was related to which category.
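For example, you might label the residuals in the same way (output not shown):
> names(survey.cs$residuals) = row.names(survey)
> survey.cs$resid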
In the following activity you can get a chance to practice the chi-squared test for association as well as goodness of fit by using a simple data example.
> bees
Buff.tail Garden.bee Red.tail Honey.bee Carder.bee
Thistle 10 8 18 12 8
Vipers.bugloss 1 3 9 13 27
Golden.rain 37 19 1 16 6
Yellow.alfalfa 5 6 2 9 32
Blackberry 12 4 4 10 23
> (bees.cs = chisq.test(bees))
Pearson's Chi-squared test
data: bees
X-squared = 120.6531, df = 16, p-value < 2.2e-16
> names(bees.cs)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed" "expected"
[8] "residuals"
> bees.cs$resid
Buff.tail Garden.bee Red.tail Honey.bee Carder.bee
Thistle -0.66586684 0.1476203 4.544647 0.18079727 -2.394918
Vipers.bugloss -3.12467558 -1.5616655 1.169932 0.67626472 2.348309
Golden.rain 4.69620024 2.5323534 -2.686059 -0.01691336 -3.887003
Yellow.alfalfa -1.99986117 -0.4885699 -1.693054 -0.59837350 3.441582
Blackberry 0.09423625 -1.1886361 -0.853104 -0.23746700 1.385152
> (bees.cs = chisq.test(bees, simulate.p.value = TRUE, B = 3000))
Pearson's Chi-squared test with simulated p-value (based on 3000 replicates)
data: bees
X-squared = 120.6531, df = NA, p-value = 0.0003332
> bees[1:2, 4:5]
Honey.bee Carder.bee
Thistle 12 8
Vipers.bugloss 13 27
> chisq.test(bees[1:2, 4:5], correct = FALSE)
Pearson's Chi-squared test
data: bees[1:2, 4:5]
X-squared = 4.1486, df = 1, p-value = 0.04167
> chisq.test(bees[1:2, 4:5], correct = TRUE)
Pearson's Chi-squared test with Yates' continuity correction
data: bees[1:2, 4:5]
X-squared = 3.0943, df = 1, p-value = 0.07857
> with(bees, chisq.test(Honey.bee, p = Carder.bee, rescale = T))
Chi-squared test for given probabilities
data: Honey.bee
X-squared = 58.088, df = 4, p-value = 7.313e-12
Warning message:
In chisq.test(Honey.bee, p = Carder.bee, rescale = T) :
Chi-squared approximation may be incorrect
> with(bees, chisq.test(Honey.bee, p = Carder.bee, rescale = T, sim = T))
Chi-squared test for given probabilities with simulated p-value (based on 2000
replicates)
data: Honey.bee
X-squared = 58.088, df = NA, p-value = 0.0004998
> chisq.test(bees$Honey.bee)
Chi-squared test for given probabilities
data: bees$Honey.bee
X-squared = 2.5, df = 4, p-value = 0.6446
Exercises
You can find answers to these exercises in Appendix A.
Use the hogl and bv data objects in the Beginning.RData file for these exercises. The sleep, InsectSprays, and mtcars data objects are part of the regular distribution of R.
What You Learned in This Chapter
Topic | Key Points |
T-test: t.test(data1, data2 = NULL); t.test(y ~ x, data) |
Student's t-test can be carried out using the t.test() command. You must specify two vectors if you want a two-sample test; otherwise a one-sample test is conducted. A formula can be specified if the data are in the appropriate layout. You can use various additional instructions to specify the test you require. |
U-test: wilcox.test(data1, data2 = NULL); wilcox.test(y ~ x, data) |
The U-test (Mann-Whitney or Wilcoxon rank-sum test) can be carried out using the wilcox.test() command. One-sample or two-sample tests can be executed and a formula can be used if the data are in an appropriate layout. You can use various additional instructions to specify the test you require. |
Paired tests: t.test(x, y, paired = TRUE); wilcox.test(x, y, paired = TRUE) |
Paired versions of the t-test and the U-test can be carried out by adding the paired = TRUE instruction to the command. Pairs containing NA items are dropped. You get an error if you try to run a paired test on two vectors of unequal length. |
Subsetting: subset = group %in% c("grp1", "grp2") |
If your data are in a form where you have a response variable and a predictor variable, you can select a subset of the data using the subset instruction. |
Covariance: cov(x, y); cov2cor(matrix) |
Covariance can be examined using the cov() command. You can specify two objects, which can be vector, data frame, or matrix. All objects must be of equal length. You can specify a method of "pearson" (default), "spearman", or "kendall" (can be abbreviated). A covariance matrix can be converted to a correlation matrix using the cov2cor() command. |
Correlation: cor(x, y) |
Correlation can be carried out using the cor() command. You can specify two objects, which can be vector, data frame, or matrix. All objects must be of equal length. You can specify a method of "pearson" (default), "spearman", or "kendall" (can be abbreviated). |
Correlation hypothesis tests: cor.test(x, y); cor.test(~ y + x, data) |
Correlation hypothesis tests can be carried out using the cor.test() command. You can specify two vectors or use the formula syntax. Unlike the cov() and cor() commands, you can compare only two variables at a time. You can specify a method of "pearson" (default), "spearman", or "kendall" (can be abbreviated). |
Association tests: chisq.test(x, y = NULL) |
Chi-squared tests of association can be carried out using the chisq.test() command. If x is a data frame or matrix, y is ignored. Yates' correction is applied by default to 2 × 2 contingency tables. |
Goodness of fit tests: chisq.test(x, p = , rescale.p = FALSE) |
Chi-squared goodness of fit tests can be carried out using the chisq.test() command. A single vector must be given for the test data and the probabilities to test against are given as p. If they do not sum to 1, you can use the rescale.p instruction. If p is not supplied the probabilities are taken as equal. |
Monte Carlo simulation: simulate.p.value = FALSE; B = 2000 |
For chi-squared tests of association or goodness of fit you can determine the p-value by Monte Carlo simulation using the simulate.p.value instruction. The number of trials defaults to 2000, which you can alter with the B instruction. |
Rounding values: round(object, digits = 0) |
The level of precision of displayed results can be altered using the round() command. You specify the numerical results to use and the number of digits to display, which defaults to 0 (whole numbers). |