Basic modeling

In this section, we will implement our own baseline model by following the approach that the KDD Cup organizers took. Before we get to the model, however, let's first implement the evaluation engine that will return the AUC for all three problems.

Evaluating models

Now, let's take a closer look at the evaluation function. It accepts an initialized model, cross-validates it on all three problems, and reports the results as the area under the ROC curve (AUC), as follows:

public static double[] evaluate(Classifier model) throws Exception {

  // three per-problem scores plus the overall average
  double[] results = new double[4];

  String[] labelFiles = new String[]{
    "churn", "appetency", "upselling"};

  double overallScore = 0.0;
  for (int i = 0; i < labelFiles.length; i++) {

First, we call the Instances loadData(String, String) function that we implemented earlier to load the training data and merge it with the selected labels (a minimal sketch of this helper is shown after the complete evaluate() method below):

    // Load data
    Instances train_data = loadData(
      path + "orange_small_train.data",
      path + "orange_small_train_" + labelFiles[i] + ".labels.txt");

Next, we initialize the weka.classifiers.Evaluation class and pass our dataset (the dataset is used only to extract data properties; the actual data are not considered). We call the void crossValidateModel(Classifier, Instances, int, Random) method to begin cross-validation, specifying five folds. As validation is done on random subsets of the data, we need to pass a random seed as well:

    // cross-validate the data
    Evaluation eval = new Evaluation(train_data);
    eval.crossValidateModel(model, train_data, 5,
      new Random(1));

After the evaluation completes, we read the results by calling the double areaUnderROC(int) method. As the metric depends on the target value that we are interested in, the method expects a class value index, which can be extracted by looking up the index of the "1" value in the class attribute:

    // Save results
    results[i] = eval.areaUnderROC(
      train_data.classAttribute().indexOfValue("1"));
    overallScore += results[i];
  }

Finally, the results are averaged and returned:

  // Get average results over all three problems
  results[3] = overallScore / 3;
  return results;
}
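
As a reminder, here is a minimal sketch of what the loadData helper might look like. It assumes the Orange data file is tab-separated with a header row and that each labels file contains one -1/1 value per line; the implementation earlier in the chapter may differ in its details:

public static Instances loadData(String pathData, String pathLabels)
    throws Exception {

  // load the tab-separated attribute data (header row expected)
  CSVLoader loader = new CSVLoader();
  loader.setFieldSeparator("\t");
  loader.setSource(new File(pathData));
  Instances data = loader.getDataSet();

  // load the labels as a single nominal attribute (no header row)
  CSVLoader labelLoader = new CSVLoader();
  labelLoader.setNoHeaderRowPresent(true);
  labelLoader.setNominalAttributes("first-last");
  labelLoader.setSource(new File(pathLabels));
  Instances labels = labelLoader.getDataSet();

  // append the label column and mark it as the class attribute
  Instances merged = Instances.mergeInstances(data, labels);
  merged.setClassIndex(merged.numAttributes() - 1);
  return merged;
}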

Implementing naive Bayes baseline

Now that we have all the ingredients, we can replicate the naive Bayes approach that we are expected to outperform. This approach will not include any additional data pre-processing, attribute selection, or model selection. As we do not have the true labels for the test data, we will apply five-fold cross-validation to evaluate the model on the small training dataset.

First, we initialize a naive Bayes classifier, as follows:

Classifier baselineNB = new NaiveBayes();
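
Note that the snippets in this section rely on a handful of Weka classes. Assuming Weka 3.x, the following imports cover everything used here:

import java.io.File;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.CSVLoader;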

Next, we pass the classifier to our evaluation function, which loads the data and applies cross-validation. The function returns the area under the ROC curve for each of the three problems, along with the overall average:

double[] resNB = evaluate(baselineNB);
System.out.println("Naive Bayes\n" +
  "\tchurn:     " + resNB[0] + "\n" +
  "\tappetency: " + resNB[1] + "\n" +
  "\tup-sell:   " + resNB[2] + "\n" +
  "\toverall:   " + resNB[3] + "\n");

In our case, the model achieves the following results:

Naive Bayes
  churn:     0.5897891153549814
  appetency: 0.630778394752436
  up-sell:   0.6686116692438094
  overall:   0.6297263931170756

These results will serve as a baseline when we tackle the challenge with more advanced modeling. If we process the data with significantly more sophisticated, time-consuming, and complex techniques, we expect the results to be much better; otherwise, we are simply wasting resources. In general, when solving machine learning problems, it is always a good idea to create a simple baseline classifier that serves as an orientation point.
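
Any other Weka classifier can be scored with the same evaluate() harness and compared against this baseline. The following sketch uses a J48 decision tree purely for illustration; it is not necessarily the model we will develop next:

// hypothetical comparison against the naive Bayes baseline
Classifier candidate = new weka.classifiers.trees.J48();
double[] resJ48 = evaluate(candidate);
System.out.println("J48 overall AUC: " + resJ48[3]
  + " vs. naive Bayes baseline: " + resNB[3]);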
