What if my assumptions are unfounded?

The t-test and ANOVA are both considered parametric statistical tests. The word parametric is used in different contexts to signal different things but, essentially, it means that these tests make certain assumptions about the parameters of the population distributions from which the samples are drawn. When these assumptions are met (with varying degrees of tolerance to violation), the inferences are accurate, powerful (in the statistical sense), and are usually quick to calculate. When those parametric assumptions are violated, though, parametric tests can often lead to inaccurate results.

We've spoken about two main assumptions in this chapter: normality and homogeneity of variance. I mentioned that, even though you can test for homogeneity of variance with the leveneTest function from the car package, the default t.test in R (which uses the Welch correction) removes this restriction. I also mentioned that you could use the oneway.test function in lieu of aov if you don't want to have to adhere to this assumption when performing an ANOVA. Due to these affordances, I'll just focus on the assumption of normality from now on.
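Neither alternative appeared in code in this section, so here is a minimal sketch of both, reusing the WeightLoss data from earlier in the chapter:

```r
library(car)   # provides leveneTest and the WeightLoss data

# Levene's test: a significant p-value suggests unequal variances
leveneTest(wl2 ~ group, data=WeightLoss)

# Welch's one-way ANOVA, which does not assume equal variances
oneway.test(wl2 ~ group, data=WeightLoss)
```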

In a t-test, the assumption that the sample is an approximately normal distribution can be visually verified, to a certain extent. The naïve way is to simply make a histogram of the data. A more proper approach is to use a QQ-plot (quantile-quantile plot). You can view a QQ-plot in R by using the qqPlot function from the car package. Let's use it to evaluate the normality of the miles per gallon vector in mtcars.

  > library(car)
  > qqPlot(mtcars$mpg)

Figure 6.9: A QQ-plot of the miles per gallon vector in mtcars

A QQ-plot can actually be used to compare a sample against any theoretical distribution, but it is most often associated with the normal distribution. The plot depicts the quantiles of the sample against the quantiles of the normal distribution. If the sample were perfectly normal, the points would fall on the solid red diagonal line; divergence from this line signals divergence from normality. Even though it is clear that the quantiles for mpg don't precisely comport with the quantiles of the normal distribution, the divergence is relatively minor.
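To see what a serious violation looks like, try the same plot on a sample that is decidedly not normal. In the following sketch I draw from an exponential (right-skewed) distribution; the points bow away from the diagonal line, especially in the tails:

```r
library(car)
set.seed(1)            # for reproducibility
skewed <- rexp(100)    # 100 draws from a right-skewed distribution
qqPlot(skewed)
```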

The most powerful method for evaluating adherence to the assumption of normality is to use a statistical test. We are going to use the Shapiro-Wilk test, because it's my favorite, though there are a few others.

  > shapiro.test(mtcars$mpg)

          Shapiro-Wilk normality test

  data:  mtcars$mpg
  W = 0.9476, p-value = 0.1229

Since the p-value is above 0.05, we fail to reject the null hypothesis that the sample was drawn from a normal distribution; the observed deviations from normality are not statistically significant.
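Contrast this with the test applied to an obviously non-normal sample. For a right-skewed sample of this size, Shapiro-Wilk should firmly reject normality (a sketch):

```r
set.seed(1)
skewed <- rexp(100)      # 100 draws from a right-skewed distribution
shapiro.test(skewed)     # the p-value here is far below 0.05
```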

For ANOVAs, the assumption of normality applies to the residuals, not the actual values of the data. After performing the ANOVA, we can check the normality of the residuals quite easily:

  > # I'm repeating the set-up
  > library(car)
  > the.anova <- aov(wl2 ~ group, data=WeightLoss)
  >
  > shapiro.test(the.anova$residuals)
  
          Shapiro-Wilk normality test
  
  data:  the.anova$residuals
  W = 0.9694, p-value = 0.4444

We're in the clear!
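As with the t-test, the visual check complements the formal one; a QQ-plot of the residuals should hug the diagonal line. A quick sketch:

```r
library(car)
the.anova <- aov(wl2 ~ group, data=WeightLoss)
qqPlot(the.anova$residuals)
```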

But what if we do violate our parametric assumptions!? In cases like these, many analysts will fall back on using non-parametric tests.

Many statistical tests, including the t-test and ANOVA, have non-parametric alternatives. The appeal of these tests is, of course, that they are resistant to violations of parametric assumptions; they are robust. The drawback is that these tests are usually less powerful than their parametric counterparts: they have a somewhat diminished capacity for detecting an effect if there truly is one to detect. For this reason, if you are going to use NHST, you should use the more powerful tests by default, and switch only if your assumptions are violated.

The non-parametric alternative to the independent t-test is called the Mann-Whitney U test, though it is also known as the Wilcoxon rank-sum test. As you might expect by now, there is a function to perform this test in R. Let's use it on the auto vs. manual transmission example:
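If you no longer have the automatic.mpgs and manual.mpgs vectors from earlier in the chapter, they can be rebuilt from mtcars (in which am == 0 denotes an automatic transmission):

```r
automatic.mpgs <- mtcars$mpg[mtcars$am == 0]
manual.mpgs    <- mtcars$mpg[mtcars$am == 1]
```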

  > wilcox.test(automatic.mpgs, manual.mpgs)
  
          Wilcoxon rank sum test with continuity correction
  
  data:  automatic.mpgs and manual.mpgs
  W = 42, p-value = 0.001871
  alternative hypothesis: true location shift is not equal to 0

Simple!

The non-parametric alternative to the one-way ANOVA is called the Kruskal-Wallis test. Can you see where I'm going with this?

  > kruskal.test(wl2 ~ group, data=WeightLoss)
  
          Kruskal-Wallis rank sum test
  
  data:  wl2 by group
  Kruskal-Wallis chi-squared = 14.7474, df = 2, p-value = 0.0006275

Super!
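One last note: a significant Kruskal-Wallis test tells you that at least one group differs, but not which. If you need pairwise follow-ups, base R's pairwise.wilcox.test can serve as a rough non-parametric post-hoc; a sketch:

```r
library(car)   # for the WeightLoss data

# Pairwise Wilcoxon rank-sum tests with Holm-adjusted p-values (the default)
pairwise.wilcox.test(WeightLoss$wl2, WeightLoss$group)
```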
