The bread and butter of inferential statistics is estimation and hypothesis testing. In the last chapter, we talked about estimation and using it to make certain inferences about the world. In this chapter, we will talk about how to test hypotheses about how the world works and how to evaluate those hypotheses using only sample data.
In the last chapter, I promised that this would be a very practical chapter, and I'm a man of my word; this chapter goes over a broad range of the most popular methods in modern data analysis at a relatively high level. Even so, this chapter might have a little more detail than the lazy and impatient would want. At the same time, it will have far less detail than the extremely curious and mathematically inclined would want. In fact, some statisticians would have a heart attack at the degree to which I skip over the math involved with these subjects, but I won't tell if you don't!
Nevertheless, certain complicated concepts and math are beyond the scope of this book. The good news is that once you, dear reader, have the general concepts down, it is easy to deepen your knowledge of these techniques and their intricacies—and I advocate that you do before making any major decisions based on the tests introduced in these chapters.
For better or worse, Null Hypothesis Significance Testing (NHST) is the most popular hypothesis testing framework in modern use. So, even though there are competing approaches that—at least in some cases—are better, you need to know this stuff up and down!
Okay—Null Hypothesis Significance Testing—those are a bunch of big words. What do they mean?
NHST is a lot like being a prosecutor in the United States' or Great Britain's justice system. In these two countries (and a few others), the person being charged is presumed innocent, and the burden of proving the defendant's guilt is placed on the prosecutor. The prosecutor then has to argue that the evidence is inconsistent with the defendant being innocent. Only after it is shown that the extant evidence is unlikely if the person is innocent does the court rule a guilty verdict. If the extant evidence is weak, or is likely to be observed even if the defendant is innocent, then the court rules not guilty. That doesn't mean the defendant is innocent (the defendant may very well be guilty!); it means that either the defendant was innocent or there was not sufficient evidence to prove guilt.
With simple NHST, we are testing two competing hypotheses: the null and the alternative hypotheses. The default hypothesis is called the null hypothesis—it is the hypothesis that our observation occurred from chance alone. In the justice system analogy, this is the hypothesis that the defendant is innocent. The alternative hypothesis is the opposite (or complementary) hypothesis; this would be like the prosecutor's hypothesis.
The null hypothesis terminology was introduced by a statistician named R. A. Fisher in regard to the curious case of Muriel Bristol: a woman who claimed that she could discern, just by tasting it, whether the milk was added to a teacup before the tea or whether the tea was poured before the milk. She is more commonly known as the lady tasting tea.
Her claim was put to the test! The lady tasting tea was given eight cups; four had milk added first, and four had tea added first. Her task was to correctly identify the four cups that had tea added first. The null hypothesis was that she couldn't tell the difference and would just be choosing four teacups at random. The alternative hypothesis was, of course, that she had the ability to discern whether the tea or the milk was poured first.
It turned out that she correctly identified the four cups. The chance of randomly choosing the correct four cups is 1 in 70, or about 1.4%. In other words, the chance of that happening under the null hypothesis is 1.4%. Given that it is so very unlikely to have occurred under the null hypothesis, we may choose to reject the null hypothesis. If the null and alternative hypotheses are mutually exclusive and collectively exhaustive, then a rejection of the null hypothesis is tantamount to an acceptance of the alternative hypothesis.
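If you want to verify that figure, the counting is easy enough to do in R; this is just a quick check using the built-in choose function:

# the number of ways to choose 4 cups out of 8
choose(8, 4)        # 70
# the probability of guessing the correct four cups by chance alone
1 / choose(8, 4)    # about 0.0143, or 1.4%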
We can't say anything for certain, but we can work with probabilities. In this example, we wanted to evaluate the lady tasting tea's claim. We did not try to directly estimate the probability that she could tell the difference; instead, we assumed that she could not, and then showed that her stellar performance on the assessment would be very unlikely under that assumption.
So, here's the basic idea behind NHST as we know it so far:

1. Assume the opposite of what you are testing for; this is the null hypothesis, the hypothesis that the observations are due to chance alone.
2. Work out how likely the observations would be if the null hypothesis were true.
3. If the observations would be sufficiently unlikely under the null hypothesis, reject it in favor of the alternative hypothesis.
We have heretofore been rather hand-wavy about what constitutes sufficient unlikelihood to reject the null hypothesis and how we determine the probability in the first place. We'll discuss this now.
In order to quantify how likely or unlikely the results we receive are, we need to define a test statistic, some measure of the sample. The sampling distribution of the test statistic tells us which values of the test statistic are most likely to occur by chance (under the null hypothesis) with repeated trials of the experiment. Once we know what the sampling distribution of the test statistic looks like, we can determine the probability of getting a result at least as extreme as the one we actually got. This probability is called a p-value. If it is equal to or below some pre-specified boundary, called an alpha level (α level), we decide that the null hypothesis is a bad hypothesis and embrace the alternative hypothesis. Largely as a matter of tradition, an alpha level of .05 is used most often, though other levels are occasionally used as well. So, if the observed result would only occur 5% of the time or less under the null hypothesis (p-value ≤ .05), we consider it a sufficiently unlikely event and reject the null hypothesis. If the .05 cut-off sounds rather arbitrary, it's because it is.
So, here's our updated and expanded basic idea behind NHST:

1. Formulate a null hypothesis (the observations are due to chance alone) and an alternative hypothesis.
2. Choose a test statistic and work out its sampling distribution under the null hypothesis.
3. Compute the p-value: the probability of observing a test statistic at least as extreme as the one we actually got, assuming the null hypothesis is true.
4. If the p-value is at or below our pre-specified alpha level, reject the null hypothesis and accept the alternative hypothesis; otherwise, fail to reject the null hypothesis.
The illustrative example that's going to make sense out of all of this is none other than the gambit of Larry the Untrustworthy Knave that we met in Chapter 4, Probability. If you recall, Larry, who can only be trusted some of the time, gave us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. Let's hypothesize that the coin is unfair; let's formalize our hypotheses:

H0 (the null hypothesis): the probability of the coin landing on heads is .5 (the coin is fair)
H1 (the alternative hypothesis): the probability of the coin landing on heads is not .5 (the coin is unfair)
Let's just use the number of heads in our sample as the test statistic. What is the sampling distribution of this test statistic? In other words, if the coin were fair, and you repeated the flipping-30-times experiment many times, what is the relative frequency of observing particular numbers of heads? We've seen it already! It's the binomial distribution. A binomial distribution with parameters n=30 and p=0.5 describes the number of heads we should expect in 30 flips.
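If you'd like to draw this sampling distribution yourself, a few lines of R will do it; this is just a quick sketch using the built-in dbinom and barplot functions:

# the probability of every possible number of heads in 30 flips of a fair coin
heads <- 0:30
probs <- dbinom(heads, size=30, prob=.5)
barplot(probs, names.arg=heads, xlab="number of heads", ylab="probability")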
As you can see, the outcome that is the most likely is getting 15 heads (as you might imagine). Can you see what the probability of getting 10 heads is? Fairly unlikely, right?
So, what's the p-value, and is it less than our pre-specified alpha level? Well, we have already worked out the probability of observing 10 or fewer heads in Chapter 4, Probability, as follows:
> pbinom(10, size=30, prob=.5)
[1] 0.04936857
It's less than .05. We can conclude the coin is unfair, right? Well, yes and no. Mostly no. Allow me to explain.
You may reject the null hypothesis if the test statistic falls within a region under the curve of the sampling distribution that covers 5% of the area (if the alpha level is .05). This is called the critical region. Do you remember, in the last chapter, we constructed 95% confidence intervals that covered 95% of the sampling distribution? Well, the 5% critical region is like the opposite of this. Recall that, in order to cover a symmetric 95% of the area under the curve, we had to start at the .025 quantile and end at the .975 quantile, leaving 2.5% on the left tail and 2.5% on the right tail uncovered.
Similarly, in order for the critical region of a hypothesis test to cover the 5% most extreme areas under the curve, the region must include everything to the left of the .025 quantile and everything to the right of the .975 quantile.
So, in order to determine that the 10 heads out of 30 flips is statistically significant, the probability that you would observe 10 or fewer heads has to be less than .025.
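As a quick sanity check (and only because the sampling distribution of a fair coin is symmetric), you can double the one-tailed probability we just computed to get the two-tailed p-value:

# doubling the one-tailed probability works here because prob=.5
# makes the binomial distribution symmetric
2 * pbinom(10, size=30, prob=.5)    # about .0987, comfortably above .05

This is the same value, after rounding, that the binom.test function reports below.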
There's a function built right into R, called binom.test, which will perform the calculations that we have, until now, been doing by hand. In the most basic incantation of binom.test, the first argument is the number of successes in the Bernoulli trials (the number of heads), and the second argument is the total number of trials in the sample (the number of coin flips).
> binom.test(10,30)

	Exact binomial test

data:  10 and 30
number of successes = 10, number of trials = 30, p-value = 0.09874
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1728742 0.5281200
sample estimates:
probability of success 
             0.3333333 
If you study the output, you'll see that the p-value does not cross the significance threshold.
Now, suppose that Larry said that the coin was not biased towards tails. To see if Larry was lying, we only want to test the alternative hypothesis that the probability of heads is less than .5. In that case, we would set up our hypotheses like this:

H0 (the null hypothesis): the probability of the coin landing on heads is greater than or equal to .5
H1 (the alternative hypothesis): the probability of the coin landing on heads is less than .5
This is called a directional hypothesis, because we have a hypothesis that asserts that the deviation from chance goes in a particular direction. In this hypothesis suite, we are only testing whether the observed probability of heads falls into a critical region on only one side of the sampling distribution of the test statistic. The statistical test that we would perform in this case is, therefore, called a one-tailed test—the critical region only lies on one tail. Since the area of the critical region no longer has to be divided between the two tails (like in the two-tailed test we performed earlier), the critical region only contains the area to the left of the .05 quantile.
As you can see from the figure, for the directional alternative hypothesis that heads has a probability less than .5, 10 heads is now included in the green critical region.
We can use the binom.test function to test this directional hypothesis, too. All we have to do is specify the optional parameter alternative and set its value to "less" (its default is "two.sided" for a two-tailed test).
> binom.test(10,30, alternative="less")

	Exact binomial test

data:  10 and 30
number of successes = 10, number of trials = 30, p-value = 0.04937
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.4994387
sample estimates:
probability of success 
             0.3333333 
If we wanted to test the directional hypothesis that the probability of heads was greater than .5, we would use alternative="greater".
Take note of the fact that the p-value is now less than .05. In fact, it is identical to the probability we got from the pbinom function.
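You can verify this yourself by pulling the p.value element out of the object that binom.test returns:

# the one-tailed p-value is exactly the cumulative probability from pbinom
binom.test(10, 30, alternative="less")$p.value    # 0.04936857
pbinom(10, size=30, prob=.5)                      # 0.04936857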
Certainty is a card rarely used in the deck of a data analyst. Since we make judgments and inferences based on probabilities, mistakes happen. In particular, there are two types of mistakes that are possible in NHST: Type I errors and Type II errors.
Check the following table for errors encountered in the coin example:
| Coin type | Failure to reject the null hypothesis (conclude no detectable effect) | Reject the null hypothesis (conclude that there is an effect) |
| --- | --- | --- |
| Coin is fair | Correct identification (true negative) | Type I error (false positive) |
| Coin is unfair | Type II error (false negative) | Correct identification (true positive) |
In the criminal justice system, Type I errors are considered especially heinous. Legal theorist William Blackstone is famous for his quote: it is better that ten guilty persons escape than one innocent suffer. This is why the court instructs jurors (in the United States, at least) to only convict the defendant if the jury believes the defendant is guilty beyond a reasonable doubt. The consequence is that if the jury favors the hypothesis that the defendant is guilty, but only by a little bit, the jury must give the defendant the benefit of the doubt and acquit.
This line of reasoning holds for hypothesis testing as well. Science would be in a sorry state if we accepted alternative hypotheses on rather flimsy evidence willy-nilly; it is better that we err on the side of caution when making claims about the world, even if that means that we make fewer discoveries of honest-to-goodness, real-world phenomena because our statistical tests failed to reach significance.
This sentiment underlies the decision to use an alpha level like .05. An alpha level of .05 means that, when the null hypothesis is true, we will commit a Type I error (false positive) only 5% of the time. If the alpha level were higher, we would make fewer Type II errors, but at the cost of making more Type I errors, which are more dangerous in most circumstances.
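If you're skeptical of that claim, a quick simulation can check it. The following sketch uses made-up settings (10,000 simulated experiments, each consisting of 30 flips of a truly fair coin) and counts how often the binomial test falsely rejects the null hypothesis; the proportion comes out at, or a little under, .05, because the discreteness of the binomial distribution makes the exact test slightly conservative.

# simulate many experiments in which the null hypothesis is true,
# and count how often we (wrongly) reject it at alpha = .05
set.seed(1)
p.values <- replicate(10000, {
  heads <- rbinom(1, size=30, prob=.5)    # 30 flips of a genuinely fair coin
  binom.test(heads, 30)$p.value
})
mean(p.values < .05)    # the proportion of Type I errors; a little under .05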
There is a similar metric to the alpha level, and it is called the beta level (β level). The beta level is the probability that we would fail to reject the null hypothesis if the alternative hypothesis were true. In other words, it is the probability of making a Type II error.
The complement of the beta level, 1 minus the beta level, is the probability of correctly detecting a true effect if one exists. This is called power. This varies from test to test. Computing the power of a test, a technique called power analysis, is a topic beyond the scope of this book. For our purposes, it will suffice to say that it depends on the type of test being performed, the sample size being used, and on the size of the effect that is being tested (the effect size). Greater effects, like the average difference in height between women and men, are far easier to detect than small effects, like the average difference in the length of earthworms in Carlisle and in Birmingham. Statisticians like to aim for a power of at least 80% (a beta level of .2). A test that doesn't reach this level of power (because of a small sample size or small effect size, and so on) is said to be underpowered.
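Although a proper power analysis is beyond the scope of this book, a simulation can give you a feel for it. This sketch assumes, purely for illustration, that Larry's coin really comes up heads only 30% of the time, and estimates how often the two-tailed binomial test with 30 flips would detect that:

# estimate power by simulation: how often do we correctly reject the null
# when the coin's true probability of heads is .3?
set.seed(2)
p.values <- replicate(10000, {
  heads <- rbinom(1, size=30, prob=.3)
  binom.test(heads, 30)$p.value
})
mean(p.values < .05)    # the estimated power; well below the 80% target,
                        # so 30 flips is underpowered for an effect this size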
It's perhaps regrettable that we use the term significance in relation to null-hypothesis testing. When the term was first used to describe hypothesis tests, the word significance was chosen because it signified something, not because the result was necessarily important. As I wrote this chapter, I checked the thesaurus for the word significant, and it listed synonyms such as notable, worthy of attention, and important. This is misleading, because the everyday meaning of the word no longer matches the narrow, vestigial sense it has in statistics. One thing that really confuses people is that they think statistical significance is of great importance in and of itself. This is sadly untrue; there are a few ways to achieve statistical significance without discovering anything of significance, in the colloquial sense.
As we'll see later in the chapter, one way to achieve statistical significance without any practical significance is to use a very large sample size. Very small differences, which make little to no difference in the real world, will nevertheless be considered statistically significant if the sample size is large enough.
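To see this for yourself, consider a made-up example: a coin that comes up heads 50.5% of the time is, for nearly every practical purpose, a fair coin, yet with 100,000 flips the test flags the difference as statistically significant:

# 50,500 heads in 100,000 flips is only 50.5% heads, but the p-value
# comes out well below .05 (on the order of .002)
binom.test(50500, 100000)$p.value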
For this reason, many people make the distinction between statistical significance and practical significance or clinical relevance. Many hold the view that hypothesis testing should only be used to answer the question is there an effect? or is there a discernable difference?, and that the follow-up questions is it important? or does it make a real difference? should be addressed separately. I subscribe to this point of view.
To answer the follow-up questions, many use effect sizes, which, as we know, capture the magnitude of an effect in the real world. We will see an example of determining the effect size in a test later in this chapter.
P-values are, by far, the most talked about metric in NHST. P-values are also notorious for lending themselves to misinterpretation. Of the many criticisms of NHST (and there are many, in spite of its ubiquity), the misinterpretation of p-values ranks highly. The following are two of the most common misinterpretations: