10.5. Beyond α and β: Crucial Type I and Type II Error Rates

Are α and β (or power = 1 − β) the only good ways to quantify the risk of making Type I and Type II errors? While they may be the classical rates to consider and report, they fail to directly address two fundamental questions:

  • If the trial yields traditional statistical significance (p ≤ α), what is the chance this will be an incorrect inference?

  • If the trial does not yield traditional statistical significance (p > α), what is the chance this will be an incorrect inference?

To answer these in some reasonable way, we need to go beyond using just α and β.

10.5.1. A Little Quiz: Which Study Provides the Strongest Evidence?

Table 10.2 summarizes outcomes from three possible QCA trials. Which study has the strongest evidence that QCA is effective? Studies #1 and #2 have N = 150 + 300 subjects, whereas #3 has N = 700 + 1400 subjects. Studies #1 and #3 have identical 0.79 estimates of relative risk, but with p = 0.36, Study #1 does not adequately support QCA efficacy. Choosing between Studies #2 and #3 is harder. They have the same p-value, so many people would argue that they have the same inferential support. If so, then #2 is the strongest result, because its relative risk of 0.57 is substantially lower than the relative risk of 0.79 found in Study #3. However, Study #3 has nearly 5 times the sample size, so it has greater power. How should that affect our assessment?

Table 10-2. Which Study Has the Strongest Evidence That QCA is Effective?
Study    Deaths/N             Mortality         Relative risk             LR test
         UCO       QCA        UCO      QCA      RR     [95% CI]           p-value
#1       21/150    33/300     14.0%    11.0%    0.79   [0.47, 1.31]       0.36
#2       21/150    24/300     14.0%    8.0%     0.57   [0.33, 0.992]      0.05
#3       98/700    154/1400   14.0%    11.0%    0.79   [0.62, 0.995]      0.05
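
Readers who want to verify a row of Table 10-2 can do so with a few lines of Python. This is only a sketch: it uses the familiar Wald (log relative risk) approximation, whereas the table's p-values come from likelihood-ratio tests, so the agreement is close but not exact. The function rr_summary is our own illustrative name, not part of any package.

from math import log, exp, sqrt, erf

def rr_summary(d_uco, n_uco, d_qca, n_qca, z=1.96):
    # Relative risk of death, QCA vs. usual care only (UCO)
    rr = (d_qca / n_qca) / (d_uco / n_uco)
    # Delta-method standard error of log(RR)
    se = sqrt(1/d_qca - 1/n_qca + 1/d_uco - 1/n_uco)
    ci = (exp(log(rr) - z*se), exp(log(rr) + z*se))   # 95% Wald CI
    z_stat = log(rr) / se
    p = 2 * (1 - 0.5*(1 + erf(abs(z_stat)/sqrt(2))))  # two-sided p-value
    return rr, ci, p

print(rr_summary(21, 150, 24, 300))  # Study #2: RR 0.57, CI [0.33, 0.99], p = 0.047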

10.5.2. To Answer the Quiz: Compare the Studies' Crucial Error Rates

Suppose that Mother Nature has set the true usual care mortality rate at 0.15 and the QCA relative risk at 0.67, the most powerful scenario we considered above. We have already seen (Figure 10.3, Output 10.3) that with N = 700 + 1400 subjects and using α = 0.05 (two-sided), the power is 90%. With 150 subjects getting usual care and 300 getting QCA, the power is only about 33%.
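
These power figures are easy to approximate. The sketch below uses a plain normal-theory (z-test) power formula for comparing two proportions; it is our own minimal illustration (power_two_props is not a library function), and because the chapter's 90% and 33% figures come from likelihood-ratio-based software, this approximation lands slightly lower, at about 0.88 and 0.31.

from math import sqrt, erf

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_props(p1, n1, p2, n2, z_alpha=1.96):
    # Approximate power of a two-sided z-test at alpha = 0.05
    se = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)  # SE under the alternative
    return phi(abs(p1 - p2)/se - z_alpha)

# Mother Nature's scenario: UCO mortality 0.15, QCA relative risk 0.67
print(power_two_props(0.15, 700, 0.15*0.67, 1400))  # ~0.88 (vs. 90% above)
print(power_two_props(0.15, 150, 0.15*0.67, 300))   # ~0.31 (vs. ~33% above)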

Now, in addition, suppose that Dr. Capote and his team are quite optimistic that QCA is effective. This does not mean they have lost their ordinary scientific skepticism and already believe that QCA is effective. Consider another Feynman-ism (1999, p. 200):

The thing that's unusual about good scientists is that they're not so sure of themselves as others usually are. They can live with steady doubt, think "maybe it's so" and act on that, all the time knowing it's only "maybe."

Dr. Capote's team understands that even for the most promising experimental treatments, the clear majority fail to work when tested extensively. In fact, Lee and Zelen (2000) estimated that among 87 trials completed and reported out by the Eastern Cooperative Oncology Group at Harvard from 1980 to 1995, only about 30% seem to have been testing therapies that had some clinical efficacy.

Let us suppose that Dr. Capote's team conducted 1000 independent trials looking for significant treatment effects, but Mother Nature had set things up so that 700 effects were actually null. What would we expect to happen if Dr. Capote ran all 1000 trials at average powers of 33%? 90%? Table 10.3 presents some straightforward computations that illustrate what we call the crucial Type I and Type II error rates. With 700 null tests, we would expect to get 35 (5%) Type I errors (false positives). From the 300 non-null hypotheses tested with 33% power, we would expect to get 99 true positives. Thus, each "significant" test (p ≤ 0.05) has an α* = 35/134 = 0.26 chance of being misleading. Note how much larger this is than α = 0.05. Some people (including authors of successful statistics books) confuse α and α*, and hence they also misinterpret what p-values are. A p-value of 0.032 does not imply that there is a 0.032 chance that the null hypothesis is true.

Table 10-3. Expected Results for 1000 Tests Run at α = 0.05. Note: The true hypothesis is null in 700 tests. For the 300 non-null tests, the average power is 33% or 90%.
                      Result of hypothesis test
                      p ≤ 0.05 ("significant")        p > 0.05 ("not significant")

33% average power
  700 true null       5% of 700 = 35                  95% of 700 = 665
  300 true non-null   33% of 300 = 99                 67% of 300 = 201
                      Crucial Type I error rate:      Crucial Type II error rate:
                      α* = 35/134 = 0.26              β* = 201/866 = 0.23

90% average power
  700 true null       5% of 700 = 35                  95% of 700 = 665
  300 true non-null   90% of 300 = 270                10% of 300 = 30
                      Crucial Type I error rate:      Crucial Type II error rate:
                      α* = 35/305 = 0.11              β* = 30/695 = 0.04

The crucial Type II error rate, β*, is defined similarly. With 33% power, we would expect to get 201 Type II errors (false negatives) to go with 665 true negatives; thus β* = 201/866 = 0.23. Note that this is not equal to β = 0.67.
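
The whole of Table 10-3 can be generated from four expected counts. Here is a minimal Python sketch (crucial_rates is our own illustrative function) that reproduces both halves of the table:

def crucial_rates(n_null=700, n_nonnull=300, alpha=0.05, power=0.33):
    false_pos = alpha * n_null             # expected Type I errors
    true_pos  = power * n_nonnull          # expected correct rejections
    false_neg = (1 - power) * n_nonnull    # expected Type II errors
    true_neg  = (1 - alpha) * n_null       # expected correct acceptances
    alpha_star = false_pos / (false_pos + true_pos)  # P(H0 true | p <= alpha)
    beta_star  = false_neg / (false_neg + true_neg)  # P(H0 false | p > alpha)
    return alpha_star, beta_star

print(crucial_rates(power=0.33))  # (0.26, 0.23), the top half of Table 10-3
print(crucial_rates(power=0.90))  # (0.11, 0.04), the bottom half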

10.5.3. Greater Power Reduces Both Types of Crucial Error Rates

A key point illustrated in Table 10.3 is that greater power reduces both types of crucial error rates. In other words, statistical inferences are generally more trustworthy when the underlying power is greater. Let us return to Table 10.2. Again, which study has the strongest evidence that QCA is effective? Even under our most powerful scenario, a p ≤ 0.05 result has a 0.26 chance of being misleading when using N = 150 + 300, as per Study #2. This falls to 0.11 using N = 700 + 1400 (Study #3). Both studies may have yielded p = 0.05, but they do not provide the same level of support for inferring that QCA is effective. Study #3 provides the strongest evidence that QCA has some degree of efficacy. This concept is poorly understood throughout all of science.

10.5.4. The March of Science and Sample-Size Analysis

Consistent with Lee and Zelen (2000), we think that investigators designing clinical trials are well served by considering α* and β*. (Note that Lee and Zelen's definition is reversed from ours in that our α* and β* correspond to their β* and α*, respectively.) Ioannidis (2005b) used the same logic in arguing "why most published research findings are false." Wacholder et al. (2004) described the same methodology to more carefully infer whether a genetic variant is really associated with a disease. Their "false positive report probability" is identical to α*. Also, many readers familiar with accuracy statistics for medical tests will see that 1 − α and 1 − β are isomorphic to the specificity and sensitivity of the diagnostic method, and that 1 − α* and 1 − β* are isomorphic to the positive and negative predictive values.
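
This isomorphism is easy to check numerically. In the sketch below (ppv_npv is our own illustrative function), we treat the proportion of truly non-null hypotheses (0.30 in Table 10-3) as "prevalence," power = 0.33 as sensitivity, and 1 − α = 0.95 as specificity; the resulting predictive values match 1 − α* and 1 − β* from Table 10-3:

def ppv_npv(prevalence, sensitivity, specificity):
    tp = sensitivity * prevalence               # true positives
    fp = (1 - specificity) * (1 - prevalence)   # false positives
    tn = specificity * (1 - prevalence)         # true negatives
    fn = (1 - sensitivity) * prevalence         # false negatives
    return tp / (tp + fp), tn / (tn + fn)       # (PPV, NPV)

print(ppv_npv(0.30, 0.33, 0.95))  # ~(0.739, 0.768) = (1 - 0.26, 1 - 0.23)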

Formally, let γ be the probability that the null hypothesis is false. We like to think of γ as measuring where the state of knowledge currently stands in confirming the non-null hypothesis; in short, its location along its March of Science (Figure 10.1). Thus, for novel research hypotheses, γ will be nearer to 0. For mature hypotheses that stand ripe for solid confirmation with, say, a large Phase III trial, γ will be markedly greater than 0. We might regard γ = 0.5 as scientific equipoise, saying that the hypothesis is halfway along its path to absolute confirmation in that we consider the null and non-null hypotheses as equally viable. Lee and Zelen's (2000) calculations put γ around 0.3 for Phase III trials coordinated in the Eastern Cooperative Oncology Group.

Given γ, α, and some β set by some particular design, sample size, and non-null scenario, we can apply Bayes' Theorem to get

  α* = Pr[H0 true | p ≤ α] = α(1 − γ) / [α(1 − γ) + (1 − β)γ]

and

  β* = Pr[H0 false | p > α] = βγ / [βγ + (1 − α)(1 − γ)]
To be precise, "H0 false" really means "H0 false, as conjectured in some specific manner." For the example illustrated first in Table 10.3, we have γ = 0.30, α = 0.05 and β = 0.67, thus

  α* = (0.05)(1 − 0.30) / [(0.05)(1 − 0.30) + (1 − 0.67)(0.30)] = 0.035/0.134 = 0.26

and

  β* = (0.67)(0.30) / [(0.67)(0.30) + (1 − 0.05)(1 − 0.30)] = 0.201/0.866 = 0.23
In Bayesian terminology, γ = 0.3 is the prior probability that QCA is effective, and 1 − α* = 0.739 is the posterior probability given that p ≤ α. However, nothing here involves Bayesian data analysis methods, which have much to offer in clinical research, but are not germane to this chapter. Some people are bothered by the subjectivity involved in specifying prior probabilities like γ, but we counter by pointing out that there are many other subjectivities involved in sample-size analysis for study planning, especially the conjectures made in defining the infinite dataset. Indeed, we find that most investigators are comfortable specifying γ, at least with a range of values, and that computing various α* and β* values of interest gives them a much better insight into the true inferential strength of their proposed (frequentist) analysis.
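
To make such a sensitivity analysis concrete, the sketch below (our own illustration, assuming α = 0.05 and 90% power, i.e., β = 0.10) applies the Bayes' Theorem formulas above across a range of γ values:

def alpha_star(gamma, alpha=0.05, beta=0.10):
    # P(H0 true | p <= alpha)
    return alpha*(1 - gamma) / (alpha*(1 - gamma) + (1 - beta)*gamma)

def beta_star(gamma, alpha=0.05, beta=0.10):
    # P(H0 false | p > alpha)
    return beta*gamma / (beta*gamma + (1 - alpha)*(1 - gamma))

for gamma in (0.1, 0.3, 0.5):  # novel -> maturing -> equipoise
    print(f"gamma = {gamma}: alpha* = {alpha_star(gamma):.3f}, "
          f"beta* = {beta_star(gamma):.3f}")

Even at 90% power, a novel hypothesis (γ = 0.1) yields α* = 0.33: a "significant" result would have a one-in-three chance of being misleading.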
