10.3. p-Values, α, β and Power

The primary efficacy analysis will yield a p-value comparing the mortality rates of control versus QCA. Smaller p-values indicate greater statistical separation between the two samples, but how that p-value is computed is not critical to understanding the essential concepts of sample-size analysis. Here, the p-value may come from any of the many methods for comparing two independent proportions, including the likelihood ratio chi-square test used in this chapter, or from a logistic regression or hazard model that includes co-predictors. Regardless of the test used, if p is small enough ("significant") and the QCA mortality rate is lower, Dr. Capote will report that the study supported the hypothesis that QCA reduces mortality in children with severe malaria complicated by lactic acidosis. If the p-value is not small enough ("not significant"), he will report that the data provided insufficient evidence to support the hypothesis.
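
To make the mechanics concrete, here is a minimal sketch of the likelihood ratio chi-square test for two independent proportions, written in Python with SciPy (not the software used for the trial); the function name and the mortality counts are hypothetical and serve only to illustrate the calculation.

    import numpy as np
    from scipy.stats import chi2

    def lr_chisq_two_proportions(deaths_a, n_a, deaths_b, n_b):
        """Likelihood ratio (G) chi-square test comparing two independent proportions."""
        observed = np.array([[deaths_a, n_a - deaths_a],
                             [deaths_b, n_b - deaths_b]], dtype=float)
        # Expected counts under the null hypothesis of equal mortality rates
        expected = (observed.sum(axis=1, keepdims=True)
                    * observed.sum(axis=0) / observed.sum())
        # G = 2 * sum(O * ln(O / E)); empty cells contribute nothing
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(observed > 0,
                             observed * np.log(observed / expected), 0.0)
        g = 2.0 * terms.sum()
        return g, chi2.sf(g, df=1)  # non-directional test, 1 degree of freedom

    # Hypothetical counts: 40/150 deaths under control, 55/300 under QCA
    g, p = lr_chisq_two_proportions(40, 150, 55, 300)
    print(f"G = {g:.2f}, p = {p:.4f}")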

10.3.1. Null and Non-Null Distributions of p-Values; Type I and Type II Errors

Dr. Capote's quest here is to answer: Does QCA decrease mortality in children with severe malaria? While Mother Nature knows the correct answer, we mortals in science could figure it out ourselves only by gathering an infinitely large, perfectly clean dataset. Instead, we must design a study or, usually, a series of studies, that will yield sample datasets giving us a solid chance of inferring what Mother Nature knows. Unfortunately, Lady Luck builds randomness into those sample datasets, and thus even the best studies can deliver misleading answers.

Figure 10.2. Distribution of the p-value under the null hypothesis and two non-null scenarios

Please study the top distribution in Figure 10.2. Here, there is no difference between the two groups' mortality in the infinite dataset, so regardless of the sample size, the p-value is uniformly distributed: all values of 0 < p < 1 are equally likely. Accordingly, there is a 5% chance that p ≤ 0.05 and, more generally, a 100α% chance that p ≤ α (in practice, these percentages are rarely exact, because the data are discrete or they fail to perfectly meet the test's underlying mathematical assumptions). If there is no true effect but p ≤ α indicates otherwise, a Type I error is triggered, which is why α is called the Type I error rate. α should be chosen after some thought; it should not be automatically set at 0.05.
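
A quick simulation illustrates this near-uniformity. The sketch below repeatedly generates data under the null hypothesis and checks how often p ≤ 0.05; the shared true mortality rate of 0.25 is an arbitrary assumption of convenience, and the group sizes are those of the scenario described next (any values would do).

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(2024)
    n_sims = 10_000
    p_true = 0.25            # arbitrary shared mortality rate: the null is true
    n_uco, n_qca = 350, 700

    pvals = np.empty(n_sims)
    for i in range(n_sims):
        d_uco = rng.binomial(n_uco, p_true)
        d_qca = rng.binomial(n_qca, p_true)
        table = [[d_uco, n_uco - d_uco], [d_qca, n_qca - d_qca]]
        # lambda_="log-likelihood" selects the likelihood ratio (G) statistic
        pvals[i] = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")[1]

    print("P(p <= 0.05) =", (pvals <= 0.05).mean())

The estimated proportion hovers near, but rarely exactly at, 0.05, echoing the caveat about discrete data.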

What if QCA has some true effect, good or bad? Then the non-null (non-central) distribution of the p-value will be skewed toward 0.0, as in the middle and bottom plots of Figure 10.2. The middle plot comes from presuming (1) true mortality rates of 0.28 for UCO and 0.21 for QCA, a 25% reduction in mortality; (2) 350 patients randomized to UCO versus 700 to QCA; and (3) a p-value arising from testing whether the two mortality proportions differ (non-directional) using the likelihood ratio chi-square statistic. The bottom plot corresponds to presuming that QCA cuts mortality by 33%.
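
The same simulation strategy reveals the non-null skew. Assuming the middle plot's scenario (true rates 0.28 and 0.21, with 350 and 700 patients), this sketch estimates how often p falls at or below two candidate α levels; apart from Monte Carlo error, the estimates should land near the powers quoted in the next paragraph.

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(7)
    n_sims = 10_000
    p_uco, p_qca = 0.28, 0.21       # presumed true rates: a 25% reduction
    n_uco, n_qca = 350, 700

    pvals = np.empty(n_sims)
    for i in range(n_sims):
        d_uco = rng.binomial(n_uco, p_uco)
        d_qca = rng.binomial(n_qca, p_qca)
        table = [[d_uco, n_uco - d_uco], [d_qca, n_qca - d_qca]]
        pvals[i] = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")[1]

    # The distribution is skewed toward 0; these tail areas estimate power
    print("P(p <= 0.05) ~", (pvals <= 0.05).mean())
    print("P(p <= 0.20) ~", (pvals <= 0.20).mean())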

Inferential power is the chance that p ≤ α when the null hypothesis is false; because that same chance equals α when the null hypothesis is true, α could be called the "null power." If there is some true effect but p > α, then a Type II error is triggered. Consider the middle plot, which is based on a 25% reduction in mortality. Using the common Type I error rate, α = 0.05, the power is 0.68, so the Type II error rate is β = 0.32. By tolerating a higher α level, we can increase power (decrease β): here, using α = 0.20, the power is 0.87, so β = 0.13. If QCA is more effective (bottom plot, 33% reduction in mortality), then the power rises to 0.91 with α = 0.05 and to 0.98 with α = 0.20. Again, we will never know the true power, because Mother Nature will never tell us the true mortality rates in the two groups, and Lady Luck will always add some natural randomness to our outcome data.
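
For readers who want to check these figures without simulating, the sketch below uses a standard normal approximation to the power of a two-sided comparison of two proportions; it is not the exact method behind Figure 10.2, but its results agree with the values above to within about 0.01.

    from math import sqrt
    from scipy.stats import norm

    def approx_power(p1, p2, n1, n2, alpha):
        """Normal-approximation power for a two-sided test of two proportions."""
        # Standard error of the difference under the alternative hypothesis
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        z_crit = norm.ppf(1 - alpha / 2)
        # Ignores the negligible opposite-tail rejection region
        return norm.sf(z_crit - abs(p1 - p2) / se)

    for p_qca, label in [(0.21, "25% reduction"), (0.28 * 2 / 3, "33% reduction")]:
        for alpha in (0.05, 0.20):
            print(f"{label}, alpha = {alpha:.2f}: "
                  f"power ~ {approx_power(0.28, p_qca, 350, 700, alpha):.2f}")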

10.3.2. Balancing Type I and II Error Rates

Recall that Topol et al. (1997) advocated that power should be around 90%, which puts the Type II error rate around 10%. We generally agree, but stress that no standard power threshold should be accepted blindly as satisfactory across all situations. Why do so many people routinely perform power analyses using α = 0.05 and 80% power (β = 0.20)? Rarely do they give it any thought.

Consider the middle plot in Figure 10.2. We could achieve a much better Type II error rate of 13% if we are willing to accept a substantially greater Type I error rate of 20%. Investigators should seek α and β values that are in line with the consequences of making a Type I error versus a Type II error. In some cases, making a Type II error may be far more costly than making a Type I error. In particular, in the early stages of the March of Science, making a Type I error may only extend the research to another stage. This is undesirable, of course, as are all mistaken inferences in science, but making a Type II error may be far more problematic, because it may halt a line of research that would ultimately be successful. So it might be justified to use α = 0.20 (maybe more) in order to reduce β as much as possible. Using such high α values is not standard, so investigators adopting this philosophy must make a convincing argument.
