Data sampling

You may encounter datasets with a high level of imbalance in the outcome classes. For instance, if you were working with a dataset on a rare disease, with your outcome variable being true or false, you may find that, due to the rarity of the occurrence, the number of observations marked as false (that is, the person did not have the rare disease) is much higher than the number of observations marked as true (that is, the person did have the rare disease).

Machine learning algorithms attempt to maximize performance, which in many cases could be the accuracy of the predictions. Say, in a sample of 1000 records, only 10 are marked as true and the remaining 990 observations are false.

If someone were to blindly assign all observations as false, the accuracy rate would be:

(990/1000) * 100 = 99% 

But the objective of the exercise was to find the individuals who had the rare disease. We already know that, due to the nature of the disease, most individuals will not belong to that category.
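
The baseline arithmetic above can be checked directly in R. This is a minimal sketch using a simulated outcome vector (the variable names are illustrative):

```r
# Simulate 1000 outcomes: 10 true (rare disease), 990 false
outcomes <- c(rep(TRUE, 10), rep(FALSE, 990))

# A naive "model" that predicts FALSE for every observation
predictions <- rep(FALSE, length(outcomes))

# Accuracy of the naive model
accuracy <- mean(predictions == outcomes)
accuracy * 100
# [1] 99
```

The naive model scores 99% accuracy while identifying none of the 10 cases we actually care about, which is exactly why accuracy alone is misleading here.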

Data sampling, in essence, is the process of re-balancing the classes in a dataset so that models can be built and evaluated using metrics better suited to imbalanced data, such as specificity, sensitivity, precision, recall, and kappa. These will be discussed at a later stage, but for the purposes of this section, we'll show some ways by which you can sample the data so as to produce a more evenly balanced dataset.

In these cases, we need to re-sample the data to get a better distribution of the classes in order to build a more effective model. The R package caret includes several helpful functions to create a balanced distribution of the classes from an imbalanced dataset.

Some of the general methods include:

  • Up-sample: Increase instances of the class with fewer examples
  • Down-sample: Reduce instances of the class with more examples
  • Create synthetic examples (for example, SMOTE (Synthetic Minority Over-sampling Technique))
  • Random oversampling (for example, ROSE (Randomly OverSampling Examples))
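
Before turning to caret's helpers, the first two methods can be sketched with base R's sample() function (this is illustrative only; caret's upSample() and downSample(), shown below, do this work for you):

```r
set.seed(42)
# A toy imbalanced outcome: 990 'neg' cases, 10 'pos' cases
y <- factor(c(rep("neg", 990), rep("pos", 10)))

pos_idx <- which(y == "pos")
neg_idx <- which(y == "neg")

# Up-sample: draw the minority class with replacement
# until it matches the majority class size
up_idx <- c(neg_idx, sample(pos_idx, length(neg_idx), replace = TRUE))
table(y[up_idx])

# Down-sample: draw the majority class without replacement
# down to the minority class size
down_idx <- c(sample(neg_idx, length(pos_idx)), pos_idx)
table(y[down_idx])
```

Up-sampling yields 990 of each class; down-sampling yields 10 of each.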

We will create a simulated dataset using the same data from the prior example where 95% of the rows will be marked as negative:

library(mlbench) 
library(caret) 
diab <- PimaIndiansDiabetes 
 
diabsim <- diab 
diabrows <- nrow(diabsim) 
negrows <- floor(.95 * diabrows) 
posrows <- diabrows - negrows 
 
negrows 
[1] 729 
 
posrows 
[1] 39 
 
diabsim$diabetes[1:negrows]    <- "neg"
diabsim$diabetes[-(1:negrows)] <- "pos"
table(diabsim$diabetes) 
 
neg pos 
729  39 
 
# We observe that in this simulated dataset, we have 729 occurrences of the negative outcome and 39 occurrences of the positive outcome
 
# Method 1: Upsampling, i.e., increasing the number of observations marked as 'pos' (i.e., positive) 
 
upsampled_simdata<- upSample(diabsim[,-ncol(diabsim)], diabsim$diabetes) 
table(upsampled_simdata$Class) 
 
neg pos 
729 729 
 
# NOTE THAT THE OUTCOME COLUMN IS CALLED 'Class' and not 'diabetes' 
# This is because we passed the outcome to upSample as a separate vector 
# We can always rename the column to revert to the original name 
 
# Method 2: Downsampling, i.e., reducing the number of observations marked as 'neg' (i.e., negative), the majority class 
 
downsampled_simdata<- downSample(diabsim[,-ncol(diabsim)], diabsim$diabetes) 

table(downsampled_simdata$Class) 
 
neg pos 
39  39 
  • SMOTE (Synthetic Minority Over-sampling Technique) is a third method that, instead of plain-vanilla up-/down-sampling, creates synthetic records from the nearest neighbors of the minority class. In our simulated dataset, it is obvious that pos is the minority class, that is, the class with the lowest number of occurrences.
    The help file on the SMOTE function explains the concept succinctly:
Unbalanced classification problems cause problems to many learning algorithms. These problems are characterized by the uneven proportion of cases that are available for each class of the problem.

SMOTE (Chawla et al., 2002) is a well-known algorithm to fight this problem. The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of these cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced dataset:

# Method 3: SMOTE 
# The function SMOTE is available in the R package DMwR 
# In order to use it, we first need to install and load DMwR as follows 
 
install.packages("DMwR") 
library(DMwR) 
 
# Once the package has been installed, we will create a synthetic 
# dataset in which we will increase the number of 'pos' records 
# Let us check once again the distribution of neg/pos in the dataset 
 
table(diabsim$diabetes) 
 
neg pos 
729  39 
 
# Using SMOTE we can create synthetic cases of 'pos' as follows 
 
diabsyn<- SMOTE(diabetes ~ ., diabsim, perc.over = 500, perc.under = 150) 
 
# perc.over = 500 means, increase the occurrence of the minority 
# class by 500%, i.e., 39 + 5*39 = 39 + 195 = 234 
 
# perc.under = 150 means that, for each new record generated for the 
# minority class, we will keep 1.5 cases of the majority class 
# In this case, we created 195 new records (500% of 39) and hence 
# we will keep 150% of 195 records = 195 * 1.5 
# = 292.5, or 292 (rounded down) 'neg' records 
 
# We can verify this by running the table command against the newly 
# created synthetic dataset, diabsyn 
 
table(diabsyn$diabetes) 
 
neg pos 
292 234
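
The counts above follow directly from the perc.over/perc.under arithmetic; this quick check uses plain arithmetic only (no SMOTE call needed), with illustrative variable names:

```r
minority   <- 39   # original 'pos' cases in diabsim
perc.over  <- 500
perc.under <- 150

synthetic <- minority * perc.over / 100           # 195 new 'pos' cases
pos_total <- minority + synthetic                 # 39 + 195 = 234
neg_total <- floor(synthetic * perc.under / 100)  # 292 'neg' cases kept
c(neg = neg_total, pos = pos_total)
# neg pos
# 292 234
```
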
  • ROSE (Randomly OverSampling Examples), the final method in this section, is available via the ROSE package in R. Like SMOTE, it is a method for generating synthetic samples. The help file for ROSE states the high-level use of the function as follows:
Generation of synthetic data by Randomly Over Sampling Examples creates a sample of synthetic data by enlarging the features space of minority and majority class examples. Operationally, the new examples are drawn from a conditional kernel density estimate of the two classes, as described in Menardi and Torelli (2013).
install.packages("ROSE") 
library(ROSE) 
 
# Loaded ROSE 0.0-3 
set.seed(1) 
 
diabsyn2 <- ROSE(diabetes ~ ., data=diabsim) 
 
table(diabsyn2$data$diabetes) 
 
# neg pos 
# 395 373 