Investigating different approaches to handling imbalanced data

A very common issue when working with classification tasks is class imbalance: when one class heavily outnumbers the other (this also extends to the multi-class setting). Strictly speaking, we talk about imbalance whenever the ratio of the two classes is not 1:1. In some cases, a slight imbalance is not much of a problem, but there are industries and problems in which we encounter ratios of 100:1, 1000:1, or even worse.

In this recipe, we show an example of a credit card fraud problem, in which the fraudulent class accounts for only 0.17% of the entire sample. In such cases, gathering more data (especially for the fraudulent class) might simply not be feasible, and we need to resort to techniques that help us understand and avoid the accuracy paradox.

The accuracy paradox refers to a case in which using accuracy as the evaluation metric creates the impression of having a very good classifier (a score of 90%, or even 99.9%), while in reality the score simply reflects the distribution of the classes. That is why, in cases of class imbalance, it is highly advisable to use evaluation metrics that account for it, such as precision/recall, the F1 score, or Cohen's kappa.
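
The following is a minimal sketch (not the recipe's actual code) illustrating the accuracy paradox: a naive "classifier" that always predicts the majority class looks excellent on accuracy, yet is useless on the minority class, which the other metrics immediately reveal. The class counts are hypothetical, chosen to roughly mimic the 0.17% fraud rate.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# hypothetical labels: ~0.17% of observations belong to the positive (fraud) class
y_true = np.array([0] * 9983 + [1] * 17)

# a degenerate "model" that always predicts the majority (non-fraud) class
y_pred = np.zeros_like(y_true)

print(f"Accuracy:      {accuracy_score(y_true, y_pred):.4f}")    # ~0.998
print(f"Precision:     {precision_score(y_true, y_pred, zero_division=0):.4f}")
print(f"Recall:        {recall_score(y_true, y_pred):.4f}")      # 0.0
print(f"F1 score:      {f1_score(y_true, y_pred, zero_division=0):.4f}")
print(f"Cohen's kappa: {cohen_kappa_score(y_true, y_pred):.4f}")  # 0.0
```

Accuracy alone suggests an almost perfect model, while recall, the F1 score, and Cohen's kappa make it clear that not a single fraudulent observation was identified.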

In this recipe, we consider a small selection of methods commonly used for tackling classification problems with imbalanced classes. For a more detailed description of the techniques, please refer to the There's more... section.
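
As a quick preview, the sketch below shows two resampling approaches of the kind covered here, undersampling the majority class and oversampling the minority class with SMOTE. It assumes the imbalanced-learn library (imblearn) and uses a synthetic dataset in place of the credit card data; the recipe's own steps may differ.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# synthetic data with a roughly 99:1 class ratio, standing in for the fraud dataset
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
print("original:    ", Counter(y))

# oversample the minority class by creating synthetic observations (SMOTE)
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:       ", Counter(y_over))

# undersample the majority class by randomly discarding observations
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```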
