Implementing Student's t-test

One of the most famous statistical tests is Student's t-test. You might have heard of it before: it allows us to determine whether two sets of data are significantly different from one another. This was a really important test for William Sealy Gosset, the inventor of the test, who worked at the Guinness brewery and wanted to know whether two batches of stout differed in quality.

Note that "Student" here is capitalized. Although Gosset wasn't allowed to publish his test due to company policy, he did so anyway under his pen name, Student.

In practice, the t-test allows us to determine whether two data samples come from underlying distributions with the same mean or expected value.

For our purposes, this means that we can use the t-test to determine whether the test scores of two independent classifiers have the same mean value. We start by hypothesizing that the two sets of test scores come from distributions with the same mean. We call this the null hypothesis because it is the hypothesis we want to nullify; that is, we are looking for evidence that lets us reject it, since we want to show that one classifier is significantly better than the other.

We accept or reject the null hypothesis based on a parameter known as the p-value that the t-test returns. The p-value takes on values between 0 and 1. A p-value of 0.05 means that, if the null hypothesis were true, we would observe a difference at least as large as the one measured only 5 out of 100 times. A small p-value hence indicates strong evidence against the null hypothesis, which we can then safely reject. It is customary to use p=0.05 as the cut-off value below which we reject the null hypothesis.

If this is all too confusing, think of it this way: when we run a t-test to compare classifier test scores, we are looking to obtain a small p-value because that means that the two classifiers give significantly different results.

We can implement Student's t-test with SciPy's ttest_ind function from the stats module:

In [25]: from scipy.stats import ttest_ind
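
Before we look at concrete numbers, here is a minimal sketch of the cut-off rule described above. The two score lists are hypothetical placeholders chosen purely for illustration; only the comparison against the customary threshold of 0.05 is the point:

from scipy.stats import ttest_ind

# Hypothetical cross-validation scores for two classifiers (illustration only)
scores_a = [0.92, 0.94, 0.91, 0.95, 0.93]
scores_b = [0.89, 0.90, 0.88, 0.91, 0.90]

alpha = 0.05  # customary significance level
statistic, p_value = ttest_ind(scores_a, scores_b)
if p_value < alpha:
    print("Reject the null hypothesis: the classifiers differ significantly.")
else:
    print("Cannot reject the null hypothesis: the difference may be due to chance.")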

Let's start with a simple example. Assume we ran five-fold cross-validation on two classifiers and obtained the following scores:

In [26]: scores_a = [1, 1, 1, 1, 1]
... scores_b = [0, 0, 0, 0, 0]

This means that Model A achieved 100% accuracy in all five folds, whereas Model B got 0% accuracy. In this case, it is clear that the two results are significantly different. If we run the t-test on this data, we should hence find a really small p-value:

In [27]: ttest_ind(scores_a, scores_b)
Out[27]: Ttest_indResult(statistic=inf, pvalue=0.0)

And we do! We actually get the smallest possible p-value, p=0.0.

On the other hand, what if the two classifiers got exactly the same numbers, except during different folds? In this case, we would expect the two classifiers to be equivalent, which is indicated by a really large p-value:

In [28]: scores_a = [0.9, 0.9, 0.9, 0.8, 0.8]
... scores_b = [0.8, 0.8, 0.9, 0.9, 0.9]
... ttest_ind(scores_a, scores_b)
Out[28]: Ttest_indResult(statistic=0.0, pvalue=1.0)

Sure enough, we get the largest possible p-value, p=1.0.
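
To see what ttest_ind computes under the hood, here is a hand-rolled sketch of the two-sample t-statistic under the equal-variance assumption (the default behavior of ttest_ind), applied to the scores from the example above:

import numpy as np
from scipy.stats import ttest_ind

a = np.array([0.9, 0.9, 0.9, 0.8, 0.8])
b = np.array([0.8, 0.8, 0.9, 0.9, 0.9])
n_a, n_b = len(a), len(b)

# Pooled variance under the equal-variance assumption
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))

print(t_stat)                     # essentially 0.0: the two sample means are equal
print(ttest_ind(a, b).statistic)  # SciPy's value, for comparison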

To see what happens in a more realistic example, let's return to our kNN classifier from an earlier example. Using the test scores obtained from ten-fold cross-validation, we can compare two different kNN classifiers in three steps (a consolidated sketch of the whole procedure follows the results below):

  1. Obtain a set of test scores for Model A. We choose Model A to be the kNN classifier from earlier (k=1):
      In [29]: k1 = KNeighborsClassifier(n_neighbors=1)
... scores_k1 = cross_val_score(k1, X, y, cv=10)
... np.mean(scores_k1), np.std(scores_k1)
Out[29]: (0.95999999999999996, 0.053333333333333323)
  2. Obtain a set of test scores for Model B. Let's choose Model B to be a kNN classifier with k=3:
      In [30]: k3 = KNeighborsClassifier(n_neighbors=3)
... scores_k3 = cross_val_score(k3, X, y, cv=10)
... np.mean(scores_k3), np.std(scores_k3)
Out[30]: (0.96666666666666656, 0.044721359549995787)
  3. Apply the t-test to both sets of scores:
      In [31]: ttest_ind(scores_k1, scores_k3)
Out[31]: Ttest_indResult(statistic=-0.2873478855663425, pvalue=0.77712784875052965)

As you can see, this is a good example of two classifiers giving different cross-validation scores (96.0% and 96.7%) that turn out to be not significantly different! Because we get a large p-value (p=0.777), we cannot reject the null hypothesis: a difference this small could easily have arisen by chance, so we have no evidence that one classifier is really better than the other.
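
Putting the three steps together, the whole comparison can be wrapped in a small helper function. This is only a sketch: the helper name compare_knn is made up here, and scikit-learn's Iris dataset is loaded so that the snippet runs on its own, standing in for whatever X and y were used in the earlier kNN example:

from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only so the sketch is self-contained
X, y = load_iris(return_X_y=True)

def compare_knn(k_a, k_b, cv=10):
    """Cross-validate two kNN classifiers and t-test their score distributions."""
    scores_a = cross_val_score(KNeighborsClassifier(n_neighbors=k_a), X, y, cv=cv)
    scores_b = cross_val_score(KNeighborsClassifier(n_neighbors=k_b), X, y, cv=cv)
    return ttest_ind(scores_a, scores_b)

print(compare_knn(1, 3))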
