There's more...

We begin with a description of the techniques used in this recipe:

Undersampling: A very simple method of dealing with class imbalance is to undersample the majority class, that is, draw random samples from the majority class until we reach a 1:1 (or any other desired) ratio between the target classes. Using this method can lead to some issues, such as lower accuracy of the model trained on the undersampled data (due to the information loss caused by discarding most of the training set). Another possible implication is an increased number of false positives, as the distribution of the training set no longer matches that of the test set after the resampling, which results in a biased classifier.
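As a quick, hedged illustration (not part of the recipe's original code), random undersampling can be performed with imblearn's RandomUnderSampler; X_train, y_train, and RANDOM_STATE are assumed to be defined as in the recipe:

from imblearn.under_sampling import RandomUnderSampler

# randomly discard majority class observations until a 1:1 ratio is reached
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=RANDOM_STATE)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)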

Oversampling: In this approach, we sample multiple times with replacement from the minority class, until the desired ratio is achieved. It often outperforms undersampling, as there is no information loss due to discarding data. However, it comes with the danger of overfitting, caused by replication of observations from the minority class.
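Analogously, a minimal sketch of random oversampling with imblearn's RandomOverSampler (again assuming the recipe's X_train, y_train, and RANDOM_STATE):

from imblearn.over_sampling import RandomOverSampler

# sample the minority class with replacement until a 1:1 ratio is reached
ros = RandomOverSampler(sampling_strategy=1.0, random_state=RANDOM_STATE)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)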

Synthetic Minority Over-sampling Technique (SMOTE): SMOTE is a more advanced oversampling algorithm that creates new, synthetic observations from the minority class. This way, it mitigates the previously mentioned problem of overfitting.

To create the synthetic samples, the algorithm looks at each observation from the minority class, identifies its k-nearest neighbors (using the k-NN algorithm with a distance metric such as the Euclidean distance), and creates new, synthetic observations on the line segments connecting (interpolating between) the observation and its nearest neighbors.
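A minimal sketch of applying SMOTE with imblearn (k_neighbors=5 is the library's default; X_train, y_train, and RANDOM_STATE are assumed from the recipe):

from imblearn.over_sampling import SMOTE

# interpolate new minority observations between each point and its 5 nearest neighbors
smote = SMOTE(k_neighbors=5, random_state=RANDOM_STATE)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)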

There are many variants of the algorithm available in the imblearn library, such as Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC), suitable for a dataset containing both numerical and categorical features.
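For such mixed-type datasets, imblearn's SMOTENC class needs to be told which columns are categorical; the column indices below are purely illustrative:

from imblearn.over_sampling import SMOTENC

# columns 0 and 3 are assumed to be categorical, purely for illustration
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=RANDOM_STATE)
X_res, y_res = smote_nc.fit_resample(X_train, y_train)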

Aside from reducing the problem of overfitting, SMOTE causes no loss of information, as there is no discarding of majority class observations. However, SMOTE can accidentally introduce more noise to the data (and cause overlapping of classes). This is because it does not take into account the observations from the majority class while creating the new synthetic observations. Also, the algorithm is not very effective for high-dimensional data (due to the curse of dimensionality).

Adaptive Synthetic Sampling (ADASYN): This algorithm is a modification of SMOTE. It differs in how it decides how many synthetic observations to create for each minority point, and in how it creates them. In ADASYN, the number of observations to be created for a given point is determined by a density distribution (instead of the uniform weight used for all points in SMOTE). This adaptive behavior lets ADASYN generate more samples for observations that come from hard-to-learn neighborhoods. There are two additional elements worth mentioning:

  • The synthetic points are not limited to linear interpolation between two points, but can also lie on a plane created by three or more observations.
  • After creating the synthetic observations, the algorithm adds a small random noise to increase the variance (that is, make them more scattered), thus making the samples more realistic.

Potential drawbacks of ADASYN include a possible decrease in precision (a side effect of its adaptability) and difficulties when the minority observations are sparsely distributed, as a neighborhood may then contain only one or very few points.
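A minimal sketch of applying ADASYN with imblearn (reusing the recipe's variable names):

from imblearn.over_sampling import ADASYN

# generate more synthetic points for minority observations in hard-to-learn neighborhoods
adasyn = ADASYN(n_neighbors=5, random_state=RANDOM_STATE)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)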

Class weights in ML models: A lot of machine learning models (especially, but not exclusively, those in scikit-learn) have a hyperparameter called class_weight. We can use it to pass specific weights for the classes, thus putting more emphasis on the minority class. Behind the scenes, the class weights are incorporated into the calculation of the loss function, which in practice means that misclassifying minority observations increases the value of the loss function significantly more than misclassifying observations from the majority class.
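As a sketch (not necessarily the recipe's exact code), a plain scikit-learn Random Forest can account for the imbalance via class_weight; 'balanced' sets the weights inversely proportional to the class frequencies:

from sklearn.ensemble import RandomForestClassifier

# weights are computed as n_samples / (n_classes * class_counts)
rf_cw = RandomForestClassifier(class_weight='balanced',
                               random_state=RANDOM_STATE)
rf_cw.fit(X_train, y_train)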

Additionally, the imblearn library also features some modified versions of popular classifiers. We present an example of the modified Random Forest—the BalancedRandomForestClassifier. The API is virtually the same as in the scikit-learn implementation (including the tunable hyperparameters). The difference is that in the balanced Random Forest the algorithm randomly undersamples each bootstrapped sample to balance the classes.

Execute the following steps to estimate the balanced Random Forest classifier.

  1. Import the library:
from imblearn.ensemble import BalancedRandomForestClassifier
  2. Train the balanced Random Forest classifiers:
balanced_rf = BalancedRandomForestClassifier(
    random_state=RANDOM_STATE
)
balanced_rf.fit(X_train, y_train)

balanced_rf_cw = BalancedRandomForestClassifier(
    random_state=RANDOM_STATE,
    class_weight='balanced'
)
balanced_rf_cw.fit(X_train, y_train)
  3. Group the performance results into a DataFrame:
performance_results = {
    'random_forest': rf_perf,
    'undersampled_rf': rf_rus_perf,
    'oversampled_rf': rf_ros_perf,
    'smote': rf_smote_perf,
    'adasyn': rf_adasyn_perf,
    'random_forest_cw': rf_cw_perf,
    'balanced_random_forest': balanced_rf_perf,
    'balanced_random_forest_cw': balanced_rf_cw_perf
}
pd.DataFrame(performance_results).T

Running the code results in a table, summarizing the performance of the models (evaluated on the test set):

As we are dealing with a highly imbalanced problem (the positive class accounts for 0.17% of all the observations), we compare the performance of the models using recall—the number of correctly predicted frauds over all the frauds in the test sample. The best performing model is the balanced Random Forest classifier, while the worst one is the Random Forest with class weights. It is also important to mention that no hyperparameter tuning was performed, which could potentially improve the performance.

Also, we can observe the accuracy paradox: many of the models achieve an accuracy of ~99.99%, yet they still fail to detect the fraudulent cases, which are the ones we care about most.
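To see why accuracy is misleading here, consider a hypothetical classifier that always predicts the negative class: with roughly 0.17% positives it would score about 99.83% accuracy while catching zero frauds. A tiny illustration on synthetic labels (not the recipe's data):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 synthetic labels with 17 positives (~0.17%) and a model that always predicts 0
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1
y_pred = np.zeros(10_000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.9983
print(recall_score(y_true, y_pred))    # 0.0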

Some general notes on tackling problems with imbalanced classes:

  • Do not apply under-/oversampling on the test set.
  • For evaluating problems with imbalanced data, use metrics that account for this, such as precision/recall/F1 score/Cohen's kappa/PR-AUC.
  • Use SMOTE-NC (a modified version of SMOTE) when dealing with a dataset containing categorical features, as the original SMOTE can create illogical values for one-hot encoded variables.
  • Use stratification when creating folds for cross-validation.
  • Introduce under-/oversampling during cross-validation, not before. Doing so before leads to overestimating the model's performance! A sketch of how to set this up with an imblearn pipeline follows this list.
  • Experiment with selecting a different probability threshold than the default 50% to potentially tune the performance.
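As mentioned in the list above, resampling has to happen inside each cross-validation fold. A minimal sketch using imblearn's Pipeline, which applies SMOTE only to the training part of each fold (the estimator, scoring metric, and fold count are illustrative choices, not the recipe's exact setup):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE is fitted and applied only to the training folds; validation folds stay untouched
pipeline = Pipeline([
    ('smote', SMOTE(random_state=RANDOM_STATE)),
    ('rf', RandomForestClassifier(random_state=RANDOM_STATE))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scores = cross_val_score(pipeline, X_train, y_train, scoring='recall', cv=cv)
print(scores.mean())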