An even more common hypothesis test is the independent samples t-test. You would use this to check the equality of two samples' means. Concretely, you might use this test in an experiment testing whether a new drug lowers blood pressure: you would give one group a placebo and the other group the real medication. If the mean improvement in blood pressure was significantly greater than the improvement with the placebo, you might infer that the blood pressure medication works. Outside of more academic uses, web companies use this test all the time to test the effectiveness of, for example, different internet ad campaigns; they expose random users to one of two types of ads and test whether one is more effective than the other. In web-business parlance, this is called an A-B test, but that's just business-ese for controlled experiment.
The term independent means that the two samples are separate, and that data from one sample doesn't affect data in the other. For example, if instead of having two different groups in the blood pressure trial, we used the same participants to test both conditions (randomizing the order in which we administer the placebo and the real medication), we would violate independence.
The dataset we will be using for this is the mtcars dataset that we first met in Chapter 2, The Shape of Data, and saw again in Chapter 3, Describing Relationships. Specifically, we are going to test the hypothesis that the mileage is better for manual cars than it is for cars with automatic transmission. Let's compare the means and produce a boxplot:
> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> mean(mtcars$mpg[mtcars$am==1])
[1] 24.39231
>
> mtcars.copy <- mtcars
> # make new column with better labels
> mtcars.copy$transmission <- ifelse(mtcars$am==0, "auto", "manual")
> mtcars.copy$transmission <- factor(mtcars.copy$transmission)
> qplot(transmission, mpg, data=mtcars.copy,
+       geom="boxplot", fill=transmission) +
+   # no legend
+   guides(fill=FALSE)
Hmm, looks different… but let's check that hypothesis formally. Our hypotheses are:

H0: μ_automatic ≥ μ_manual (the mean mpg of the automatic cars is greater than or equal to that of the manuals)
H1: μ_automatic < μ_manual (the mean mpg of the automatic cars is less than that of the manuals)
To do this, we use the t.test function, too; only this time, we provide two vectors, one for each sample. We also specify our directional hypothesis in the same way:
> automatic.mpgs <- mtcars$mpg[mtcars$am==0]
> manual.mpgs <- mtcars$mpg[mtcars$am==1]
> t.test(automatic.mpgs, manual.mpgs, alternative="less")

	Welch Two Sample t-test

data:  automatic.mpgs and manual.mpgs
t = -3.7671, df = 18.332, p-value = 0.0006868
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -3.913256
sample estimates:
mean of x mean of y
 17.14737  24.39231

p < .05. Yippee!
There is an easier way to use the t-test for independent samples that doesn't require us to make two vectors:
> t.test(mpg ~ am, data=mtcars, alternative="less")
This reads, roughly: perform a t-test of the mpg column, grouping by the am column, in the data frame mtcars. Confirm for yourself that these incantations are equivalent.
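One quick way to confirm it is to compare the p-values the two invocations return; a short sketch:

> t1 <- t.test(automatic.mpgs, manual.mpgs, alternative="less")
> t2 <- t.test(mpg ~ am, data=mtcars, alternative="less")
> # both calls run the same Welch test on the same groups,
> # so the p-values should be identical
> t1$p.value == t2$p.value
[1] TRUE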
Remember when I said that statistical significance was not synonymous with importance and that we can use very large sample sizes to achieve statistical significance without any clinical relevance? Check this snippet out:
> set.seed(16)
> t.test(rnorm(1000000, mean=10), rnorm(1000000, mean=10))

	Welch Two Sample t-test

data:  rnorm(1e+06, mean = 10) and rnorm(1e+06, mean = 10)
t = -2.1466, df = 1999998, p-value = 0.03183
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0058104638 -0.0002640601
sample estimates:
mean of x mean of y
 9.997916 10.000954
Here, two vectors of one million normal deviates each are created with a mean of 10. When we use a t-test on these two vectors, it should indicate that the two vectors' means are not significantly different, right?
Well, we got a p-value of less than .05. Why? If you look carefully at the last line of the R output, you might see why: the mean of the first vector is 9.997916, and the mean of the second vector is 10.000954. This tiny difference, a meagre .003, is enough to tip the scale into significant territory. However, I can think of very few applications of statistics where .003 of anything is noteworthy, even though it is, technically, statistically significant.
The larger point is that the t-test tests for equality of means, and if the means aren't exactly the same in the population, the t-test will, with enough power, detect this. Not all tiny differences in population means are important, though, so it is important to frame the results of a t-test and the p-value in context.
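To get a feel for the sample sizes involved, we can ask R's built-in power.t.test function how many observations per group are needed to reliably detect a difference as tiny as the one above:

> # observations needed per group to detect a mean difference
> # of .003 (sd = 1) at the .05 level with 80% power
> power.t.test(delta=.003, sd=1, sig.level=.05, power=.80)

It reports an n of roughly 1.7 million per group; with samples that size, virtually any nonzero difference in population means will reach significance.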
As mentioned earlier in the chapter, a salient strategy for putting the differences in context is to use an effect size. The effect size commonly used in association with the t-test is Cohen's d. Cohen's d is, conceptually, pretty simple: it is a ratio of the effect (the difference in means) to the variability in the data itself. Concretely, Cohen's d is the difference in means divided by the sample standard deviation. A high d indicates that there is a big effect (difference in means) relative to the internal variability of the data.
I mentioned that to calculate d, you have to divide the difference in means by the sample standard deviation. But which one? Although Cohen's d is conceptually straightforward (even elegant!), it is also sometimes a pain to calculate by hand, because the standard deviations of both samples have to be pooled. Fortunately, there's an R package that lets us calculate Cohen's d (and other effect size metrics, to boot) quite easily. Let's use it on the auto vs. manual transmission example:
> install.packages("effsize")
> library(effsize)
> cohen.d(automatic.mpgs, manual.mpgs)

Cohen's d

d estimate: -1.477947 (large)
95 percent confidence interval:
       inf        sup
-2.3372176 -0.6186766
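For the curious, here is the hand computation the package spares us: a sketch of the standard pooled standard deviation formula, with each sample's variance weighted by its degrees of freedom. It reproduces the estimate above:

> n1 <- length(automatic.mpgs)
> n2 <- length(manual.mpgs)
> # pool the two sample variances, weighting by degrees of freedom
> pooled.sd <- sqrt(((n1-1)*var(automatic.mpgs) +
+                    (n2-1)*var(manual.mpgs)) / (n1 + n2 - 2))
> (mean(automatic.mpgs) - mean(manual.mpgs)) / pooled.sd
[1] -1.477947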
Cohen's d is -1.478, which is considered a very large effect size. The cohen.d function even tells you this by using canned interpretations of effect sizes. If you try this with the two one-million-element vectors from the preceding example, the cohen.d function will indicate that the effect was negligible.
Although these canned interpretations were on target these two times, make sure you evaluate your own effect sizes in context.
One more assumption to be aware of is homogeneity of variance (or homoscedasticity, a scary-sounding word), which, in this case, simply means that the variance in the miles per gallon of the automatic cars is the same as the variance in the miles per gallon of the manual cars. In practice, this assumption can be violated as long as you use Welch's t-test, like we did, instead of Student's t-test. You can still use Student's t-test with the t.test function by specifying the optional parameter var.equal=TRUE. You can test for homoscedasticity formally using var.test (built into R) or leveneTest from the car package. If you are sure that the assumption of homoscedasticity is not violated, you may want to use Student's test, because it is the more powerful one (fewer Type II errors). Nevertheless, I usually use Welch's t-test to be on the safe side. Also, always use Welch's test if the two samples' sizes are different.
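As a sketch, checking this assumption on the transmission data might look like the following (leveneTest assumes you have the car package installed):

> # F test for equality of the two variances (built into R)
> var.test(automatic.mpgs, manual.mpgs)
>
> # Levene's test is less sensitive to departures from normality
> library(car)
> leveneTest(mpg ~ factor(am), data=mtcars)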
By the way, if your samples are not independent, as in the version of the blood pressure trial that tests the same participants under both conditions, you can use a paired samples t-test instead:

t.test(<vector1>, <vector2>, paired=TRUE)
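For example, with hypothetical before-and-after readings from the same ten participants (made-up numbers, purely for illustration):

> # hypothetical systolic blood pressure for the same ten
> # participants before and after taking the medication
> before <- c(140, 152, 138, 145, 155, 142, 148, 160, 136, 150)
> after  <- c(135, 147, 136, 140, 149, 138, 145, 152, 134, 144)
> # directional hypothesis: pressure is higher before than after
> t.test(before, after, paired=TRUE, alternative="greater")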