Implementing McNemar's test

A more advanced statistical technique is McNemar's test. It operates on paired data and determines whether the two samples differ significantly. As in the case of the t-test, we can use McNemar's test to determine whether two models give significantly different classification results.

McNemar's test operates on pairs of data points. This means that we need to know, for both classifiers, how they classified each data point. Based on the number of data points that the first classifier got right but the second got wrong and vice versa, we can determine whether the two classifiers are equivalent.

Let's assume the preceding Model A and Model B were applied to the same five data points. Whereas Model A classified every data point correctly (denoted with a 1), Model B got all of them wrong (denoted with a 0):

In [33]: scores_a = np.array([1, 1, 1, 1, 1])
... scores_b = np.array([0, 0, 0, 0, 0])

McNemar's test wants to know two things:

  • How many data points did Model A get right but Model B got wrong?
  • How many data points did Model A get wrong but Model B got right?

We can check which data points Model A got right but Model B got wrong as follows:

In [34]: a1_b0 = scores_a * (1 - scores_b)
... a1_b0
Out[34]: array([1, 1, 1, 1, 1])

Of course, this applies to all of the data points. The opposite is true for the data points that Model B got right and Model A got wrong:

In [35]: a0_b1 = (1 - scores_a) * scores_b
... a0_b1
Out[35]: array([0, 0, 0, 0, 0])
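
The mcnemar_midp function used below is assumed to have been defined earlier in the chapter. As a point of reference, a minimal sketch of such a mid-p McNemar test, built on scipy.stats.binom, might look as follows (here b and c are the two discordant counts):

from scipy.stats import binom

def mcnemar_midp(b, c):
    # b: data points the first classifier got right but the second got wrong
    # c: data points the first classifier got wrong but the second got right
    n = b + c
    x = min(b, c)
    # two-sided exact binomial p-value (2 * cdf), minus the point probability
    # of the observed count: the "mid-p" correction
    p = 2 * binom.cdf(x, n, 0.5) - binom.pmf(x, n, 0.5)
    return min(p, 1.0)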

Feeding these numbers to McNemar's test should return a small p-value because the two classifiers are obviously different:

In [36]: mcnemar_midp(a1_b0.sum(), a0_b1.sum())
Out[36]: 0.03125

And it does!
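
As an optional cross-check (assuming the statsmodels package is available), the same discordant counts can be arranged into a 2x2 contingency table and passed to statsmodels' exact McNemar test. Note that statsmodels implements the standard exact test without the mid-p correction, so its p-value comes out slightly more conservative (0.0625 instead of 0.03125 for this example):

from statsmodels.stats.contingency_tables import mcnemar

# rows: Model A correct / wrong, columns: Model B correct / wrong
table = [[0, a1_b0.sum()],   # [both correct, A correct but B wrong]
         [a0_b1.sum(), 0]]   # [A wrong but B correct, both wrong]
print(mcnemar(table, exact=True).pvalue)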

We can apply McNemar's test to a more complicated example, but we can no longer operate on fold-averaged cross-validation scores. The reason is that we need to know the classification result for every single data point, not just an average accuracy per fold. Hence, it makes more sense to combine McNemar's test with leave-one-out cross-validation, where every fold consists of exactly one data point.

Going back to kNN with k=1 and k=3, we can calculate their scores as follows:

In [37]: scores_k1 = cross_val_score(k1, X, y, cv=LeaveOneOut())
... scores_k3 = cross_val_score(k3, X, y, cv=LeaveOneOut())
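
(The estimators k1 and k3, the data X and y, and the imports of cross_val_score and LeaveOneOut carry over from earlier in the chapter. Purely as an illustrative stand-in, a self-contained setup could look like the following, using scikit-learn's KNeighborsClassifier and the Iris dataset; the dataset used earlier in the chapter may differ.)

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)         # stand-in dataset
k1 = KNeighborsClassifier(n_neighbors=1)  # kNN with k=1
k3 = KNeighborsClassifier(n_neighbors=3)  # kNN with k=3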

The numbers of data points that one of the classifiers got right but the other got wrong are as follows:

In [38]: np.sum(scores_k1 * (1 - scores_k3))
Out[38]: 0.0
In [39]: np.sum((1 - scores_k1) * scores_k3)
Out[39]: 0.0

We got no differences whatsoever! Now it becomes clear why the t-test led us to believe that the two classifiers are identical. As a result, if we feed the two sums into McNemar's test function, we get the largest possible p-value, p=1.0:

In [40]: mcnemar_midp(np.sum(scores_k1 * (1 - scores_k3)),
... np.sum((1 - scores_k1) * scores_k3))
Out[40]: 1.0

Now that we know how to assess the significance of our results, we can take the next step and improve the model's performance by tuning its hyperparameters.
