Assessing the significance of our results

Assume for a moment that we implemented the cross-validation procedure for two versions of our kNN classifier. The resulting test scores are 92.34% for Model A and 92.73% for Model B. How do we know which model is better?
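To make the setup concrete, here is a minimal sketch of how such a comparison could be produced, assuming scikit-learn and the Iris dataset; the two values of n_neighbors are hypothetical stand-ins for Model A and Model B, and the exact scores will differ from the numbers quoted above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Two hypothetical kNN configurations standing in for Model A and Model B
    model_a = KNeighborsClassifier(n_neighbors=1)
    model_b = KNeighborsClassifier(n_neighbors=3)

    # 10-fold cross-validation yields one test score per fold for each model
    scores_a = cross_val_score(model_a, X, y, cv=10)
    scores_b = cross_val_score(model_b, X, y, cv=10)

    print('Model A: %.2f%%' % (100 * scores_a.mean()))
    print('Model B: %.2f%%' % (100 * scores_b.mean()))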

Following the logic introduced earlier, we might argue for Model B because it has the higher test score. But what if the two models are not significantly different? The apparent difference could have two underlying causes, both of which are consequences of the randomness of our testing procedure:

  • For all we know, Model B just got lucky. Perhaps we chose a really low k for our k-fold cross-validation procedure, or perhaps Model B ended up with a beneficial train-test split, so that it had no trouble classifying the data. After all, we didn't run tens of thousands of iterations, as we did in bootstrapping, to make sure the result holds in general.
  • The variability of the test scores might be so high that we cannot tell whether the two results are essentially the same or genuinely different. This could be the case even after 10,000 bootstrap iterations! If the testing procedure is inherently noisy, the resulting test scores will be noisy, too. (Both points are illustrated in the sketch after this list.)

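Both points can be made visible with a small experiment. The following sketch, again assuming scikit-learn, the Iris dataset, and two hypothetical kNN settings, re-evaluates both models on 100 random train-test splits and counts how often each one comes out on top. If the ranking flips frequently, or the spread of the scores dwarfs the gap between the means, a single pair of test scores cannot settle the question:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 100 random 70/30 train-test splits; the fixed seed makes the splits
    # identical for both models, so the scores are paired
    splits = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)

    scores_a = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=splits)
    scores_b = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=splits)

    print('Model A wins on %d of 100 splits' % np.sum(scores_a > scores_b))
    print('Model B wins on %d of 100 splits' % np.sum(scores_b > scores_a))
    print('Gap between means: %.2f%%' % (100 * abs(scores_a.mean() - scores_b.mean())))
    print('Spread of scores : +/- %.2f%%' % (100 * scores_b.std()))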
So, how can we know for sure that the two models are different?
