How it works...

In Step 1, we loaded the required libraries. Then, we loaded the CSV data (downloaded from Kaggle, available at https://www.kaggle.com/mlg-ulb/creditcardfraud) into Python using the pd.read_csv function.
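A minimal sketch of the loading step; here an in-memory string (with abbreviated columns and illustrative values) stands in for the downloaded creditcard.csv file:

```python
import io

import pandas as pd

# Stand-in for the downloaded creditcard.csv file; the real file has
# columns Time, V1-V28, Amount, and Class
csv_data = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0.0,-1.36,-0.07,149.62,0\n"
    "1.0,1.19,0.27,2.69,0\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 5)
```

In practice, you would simply call pd.read_csv with the path to the downloaded file.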

The dataset we selected for this exercise contains information about transactions made by credit cards in September 2013 by European cardholders. The transactions happened within a period of 2 days. All numerical variables whose names start with V are principal components, obtained by running PCA on the original dataset (the original features are not included, due to confidentiality). The only two features with a clear interpretation are Time (seconds elapsed between each transaction and the first transaction in the dataset) and Amount (transaction amount).

We additionally separated the target from the features using the pop method, created an 80-20 train-test split with stratification (especially important when dealing with imbalanced data), and, lastly, verified that the positive (fraudulent) observations indeed made up only 0.17% of the entire sample.
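These steps can be sketched as follows; a small randomly generated frame stands in for the credit card data (in the Kaggle dataset, the target column is called Class):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the credit card data: a small imbalanced
# frame with a binary "Class" column (1 = fraud)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(scale=100, size=1000),
    "Class": (rng.random(1000) < 0.1).astype(int),
})

# pop removes the target column from df in place and returns it
y = df.pop("Class")
X = df

# 80-20 split; stratify=y keeps the fraud ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Verify the share of fraudulent observations
print(f"Fraud ratio in training set: {y_train.mean():.4%}")
```

Without stratification, a random split of such heavily imbalanced data could easily leave the test set with too few (or even zero) fraudulent cases.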

In this recipe, we only focused on working with imbalanced data. That is why we did not cover any EDA, feature engineering, dropping unnecessary features, and so on. Also, all the features were numerical, which reduced the complexity of preprocessing (no categorical encoding).

In Step 3, we fitted a vanilla Random Forest model, which we will use as a point of reference when comparing the outcomes of the different approaches (for a table summarizing the performance, please refer to the There's more... section).

In Step 4, we used the RandomUnderSampler class from the imblearn library to randomly undersample the majority class in order to match the size of the minority class. Classes from imblearn follow scikit-learn's API style; that is why we first defined the class instance with its arguments (we only set random_state) and then applied the fit_resample method to obtain the undersampled data. We reused the Random Forest object to train the model on the undersampled data and stored the results for later comparison.

Step 5 is analogous to Step 4, with the difference of using RandomOverSampler to randomly oversample the minority class in order to match the size of the majority class.

In Step 6 and Step 7, we applied the SMOTE and ADASYN variants of oversampling. As the imblearn API makes it very easy to apply different sampling methods, we will not describe the process in more detail. It is interesting to note that using SMOTE resulted in a 1:1 ratio between the classes, while ADASYN actually returned more observations from the minority (fraudulent) class than the size of the majority class.

In all of the mentioned resampling methods, we can specify the desired ratio between the classes by passing a float to the sampling_strategy argument. The number represents the ratio of the number of observations in the minority class to the number of observations in the majority class after resampling.

In the last step, we used the class_weight hyperparameter of the RandomForestClassifier. By passing 'balanced', the algorithm automatically assigns weights inversely proportional to class frequencies in the training data.

There are different possible approaches to using the class_weight hyperparameter. We can pass 'balanced_subsample', which results in a weight assignment similar to 'balanced'; however, the weights are computed from the bootstrap sample of every tree. Alternatively, we can pass a dictionary containing the desired weights. One way of determining them is the compute_class_weight function from sklearn.utils.class_weight.
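A sketch of the dictionary-based approach on a toy target; the 'balanced' heuristic assigns each class the weight n_samples / (n_classes * count(class)):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 990 legitimate vs. 10 fraudulent observations
y_train = np.array([0] * 990 + [1] * 10)

# 'balanced' weights: n_samples / (n_classes * count(class))
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
print(dict(zip([0, 1], weights)))  # minority class gets the far larger weight

# Pass the explicit mapping instead of the 'balanced' string
rf = RandomForestClassifier(
    class_weight={0: weights[0], 1: weights[1]}, random_state=42
)
```

Computing the weights explicitly is useful when we want to adjust them manually, for example to penalize missed fraud cases even more heavily than the 'balanced' heuristic would.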