Dataset rebalancing

As the number of fraud instances is very small compared to the number of non-fraud instances, the learning algorithms struggle with induction. We can help them by supplying a dataset in which the shares of the two classes are comparable. This can be achieved with dataset rebalancing.

Weka has a built-in filter, Resample, which produces a random subsample of a dataset, sampling either with or without replacement. The filter can also bias the class distribution toward a uniform distribution.

We will proceed by manually implementing k-fold cross-validation. First, we will split the dataset into k equal folds. Fold k will be used for testing, while the other folds will be used for learning. To split the dataset into folds, we'll use the StratifiedRemoveFolds filter, which maintains the class distribution within the folds, as follows:

StratifiedRemoveFolds kFold = new StratifiedRemoveFolds();

double[][] measures = new double[models.size()][3];

for (int k = 1; k <= FOLDS; k++) {

  // select fold k as the test set
  kFold.setOptions(new String[]{
    "-N", "" + FOLDS, "-F", "" + k, "-S", "1"});
  kFold.setInputFormat(data); // options must be set before the input format
  Instances test = Filter.useFilter(data, kFold);

  // select the inverse (-V), that is, the remaining folds, as the train set
  kFold.setOptions(new String[]{
    "-N", "" + FOLDS, "-F", "" + k, "-S", "1", "-V"});
  kFold.setInputFormat(data);
  Instances train = Filter.useFilter(data, kFold);

Next, we can rebalance the training dataset, where the -Z parameter specifies the percentage of the dataset to be resampled and -B biases the class distribution toward a uniform distribution:

Resample resample = new Resample();
resample.setOptions(new String[]{"-Z", "100", "-B", "1"}); // sample with replacement
resample.setInputFormat(train);
Instances balancedTrain = Filter.useFilter(train, resample);
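To verify that the filter had the desired effect, we can print the number of instances per class value before and after rebalancing. This is a minimal sketch based on Weka's AttributeStats; the variable names match the code above:

import weka.core.AttributeStats;

// class counts in the current fold, before and after rebalancing
AttributeStats before = train.attributeStats(train.classIndex());
AttributeStats after = balancedTrain.attributeStats(balancedTrain.classIndex());
System.out.println("Before: " + java.util.Arrays.toString(before.nominalCounts));
System.out.println("After:  " + java.util.Arrays.toString(after.nominalCounts));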

Next, we can build classifiers and perform evaluation:

for (ListIterator<Classifier> it = models.listIterator(); it.hasNext();) {
  Classifier model = it.next();
  model.buildClassifier(balancedTrain);
  Evaluation eval = new Evaluation(balancedTrain);
  eval.evaluateModel(model, test);

  // sum the per-fold results so that we can average them later
  measures[it.previousIndex()][0] += eval.recall(FRAUD);
  measures[it.previousIndex()][1] += eval.precision(FRAUD);
  measures[it.previousIndex()][2] += eval.fMeasure(FRAUD);
}
} // end of the cross-validation loop

Finally, we calculate the average and provide the best model as output using the following lines of code:

// calculate average 
for(int i = 0; i < models.size(); i++){ 
  measures[i][0] /= 1.0 * FOLDS; 
  measures[i][1] /= 1.0 * FOLDS; 
  measures[i][2] /= 1.0 * FOLDS; 
} 
 
// output results and select best model 
Classifier bestModel = null;
double bestScore = -1;
for (ListIterator<Classifier> it = models.listIterator(); it.hasNext();) {
  Classifier model = it.next();
  double fMeasure = measures[it.previousIndex()][2];
  System.out.println(model.getClass().getName() + "\n"
    + "  Recall:    " + measures[it.previousIndex()][0] + "\n"
    + "  Precision: " + measures[it.previousIndex()][1] + "\n"
    + "  F-measure: " + fMeasure);
  if (fMeasure > bestScore) {
    bestScore = fMeasure;
    bestModel = model;
  }
}
System.out.println("Best model: " + bestModel.getClass().getName());

Now, the performance of the models has significantly improved, as follows:

    weka.classifiers.trees.J48
      Recall:    0.44204845100610574
      Precision: 0.14570766048577555
      F-measure: 0.21912423640160392
    ...
    weka.classifiers.functions.Logistic
      Recall:    0.7670657247204478
      Precision: 0.13507459756495374
      F-measure: 0.22969038530557626
    Best model: weka.classifiers.functions.Logistic
  

We can see that all of the models have scored significantly better; for instance, the best model, logistic regression, correctly discovers almost 77% of the fraud cases. The price is a high number of false alarms: only about 13.5% of the claims marked as fraud are indeed fraudulent. If an undetected fraud is significantly more expensive than the investigation of a false alarm, then it makes sense to accept the increased number of false alarms.
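If we can quantify these two costs, we can make the trade-off explicit with cost-sensitive evaluation. The following is a minimal sketch using Weka's CostMatrix; the 10:1 cost ratio is an assumption chosen purely for illustration, and the snippet reuses the test fold and rebalanced training set from a single iteration of the loop above, assuming a binary class attribute where FRAUD is the class index 0 or 1:

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;

// hypothetical 2x2 cost matrix: rows are actual classes, columns are predicted
CostMatrix costs = new CostMatrix(2); // initialized to all zeros
costs.setCell(FRAUD, 1 - FRAUD, 10.0); // an undetected fraud costs 10
costs.setCell(1 - FRAUD, FRAUD, 1.0);  // a false alarm costs 1

Evaluation costEval = new Evaluation(balancedTrain, costs);
costEval.evaluateModel(bestModel, test);
System.out.println("Total cost:   " + costEval.totalCost());
System.out.println("Average cost: " + costEval.avgCost());

With such a matrix, model selection could be based on the lowest average cost instead of the highest F-measure.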

The overall performance most likely still has some room for improvement; we could perform attribute selection and feature generation, and apply more complex learning algorithms, which we discussed in Chapter 3, Basic Algorithms – Classification, Regression, Clustering.
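As a first step in that direction, attribute selection is easy to try. The following is a sketch using Weka's CfsSubsetEval evaluator with a BestFirst search; it only reports which attributes look promising and is not wired into the pipeline above:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;

AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluation
selector.setSearch(new BestFirst());        // greedy search with backtracking
selector.SelectAttributes(data);

// indices of the selected attributes; the class attribute is appended last
int[] selected = selector.selectedAttributes();
System.out.println(java.util.Arrays.toString(selected));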
