Basic naive Bayes classifier baseline

As per the rules of the challenge, participants had to outperform the basic naive Bayes classifier, which assumes that the features are independent (refer to Chapter 1, Applied Machine Learning Quick Start), in order to qualify for prizes.
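As a quick reminder, this conditional independence assumption means that, for a class $y$ and feature values $x_1, \ldots, x_n$, the posterior probability factorizes as:

$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$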

The KDD Cup organizers ran the vanilla naive Bayes classifier, without any feature selection or hyperparameter adjustments. For the large dataset, the overall scores of naive Bayes on the test set were as follows:

  • Churn problem: AUC = 0.6468
  • Appetency problem: AUC = 0.6453
  • Upselling problem: AUC = 0.7211

Note that the baseline results are reported for the large dataset only. Moreover, while both the training and test datasets are provided at the KDD Cup site, the true labels for the test set are not. Therefore, when we process the data with our models, there is no way to know how well the models will perform on the test set. What we will do is use only the training data and evaluate our models with cross-validation. The results will not be directly comparable, but we will nevertheless have an idea of what a reasonable magnitude of the AUC score is.

Getting the data

At the KDD Cup web page (http://kdd.org/kdd-cup/view/kdd-cup-2009/Data), you should see a page that looks like the following screenshot. First, under the Small version (230 var.) header, download orange_small_train.data.zip. Next, download the three sets of true labels associated with this training data. The following files are found under the Real binary targets (small) header:

  • orange_small_train_appetency.labels
  • orange_small_train_churn.labels
  • orange_small_train_upselling.labels

Save and unzip all the files marked in the red boxes, as shown in the following screenshot:

In the following sections, we will first load the data into Weka and apply basic modeling with naive Bayes to obtain our own baseline AUC scores. Later, we will look into more advanced modeling techniques and tricks.
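As a preview of that step, the following sketch shows how a cross-validated AUC for naive Bayes could be computed with Weka's Evaluation class. The loadData() function is developed in the next section; the file names and the positive-class index are assumptions:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// Estimate a baseline AUC with 10-fold cross-validation
Instances data = loadData("orange_small_train.data", "orange_small_train_churn.labels");
NaiveBayes baseline = new NaiveBayes();
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(baseline, data, 10, new Random(1));
// AUC for the class value at index 1 (assumed here to be the positive label)
System.out.println("AUC: " + eval.areaUnderROC(1));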

Loading the data

We will load the data into Weka directly from the CSV format. For this purpose, we will write a function that accepts the path to the data file and the path to the true labels file. The function will load and merge both datasets and remove empty attributes:

public static Instances loadData(String pathData, String pathLabels) throws Exception {

First, we load the data using the CSVLoader class. Additionally, we specify a tab as the field separator and force the last 40 attributes to be parsed as nominal:

// Load data
CSVLoader loader = new CSVLoader();
loader.setFieldSeparator("\t");
loader.setNominalAttributes("191-last");
loader.setSource(new File(pathData));
Instances data = loader.getDataSet();

Note

The CSVLoader class accepts many additional parameters that specify the column separator, string enclosures, whether a header row is present, and so on. Complete documentation is available here:

http://weka.sourceforge.net/doc.dev/weka/core/converters/CSVLoader.html
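For instance, a loader for a semicolon-separated file with quoted values and no header row could be configured as follows (a sketch assuming a recent Weka version; the option values here are purely illustrative):

CSVLoader custom = new CSVLoader();
custom.setFieldSeparator(";");        // column separator
custom.setEnclosureCharacters("\"");  // string enclosure character
custom.setNoHeaderRowPresent(true);   // the file has no header row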

Next, some of the attributes do not contain any values, and Weka automatically recognizes them as String attributes. We do not actually need them, so we can safely remove them using the RemoveType filter. Additionally, we specify the -T parameter, which means remove attributes of a specific type, followed by the attribute type that we want to remove:

// remove empty attributes identified as String attributes
RemoveType removeString = new RemoveType();
removeString.setOptions(new String[]{"-T", "string"});
removeString.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, removeString);

Alternatively, we could use the void deleteStringAttributes() method implemented within the Instances class, which has the same effect; for example, data.deleteStringAttributes().

Now, we will load and assign class labels to the data. We will again utilize CSVLoader, where we specify that the file does not have any header line, that is, setNoHeaderRowPresent(true):

// Load labels
loader = new CSVLoader();
loader.setFieldSeparator("\t");
loader.setNoHeaderRowPresent(true);
loader.setNominalAttributes("first-last");
loader.setSource(new File(pathLabels));
Instances labels = loader.getDataSet();

Once we have loaded both files, we can merge them together by calling the Instances.mergeInstances(Instances, Instances) static method. The method returns a new dataset that has all the attributes from the first dataset plus the attributes from the second set. Note that the number of instances in both datasets must be the same:

// Append label as class value
Instances labeledData = Instances.mergeInstances(filteredData, labels);
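As this precondition is easy to violate with mismatched files, a defensive check (not part of the original function) could be placed just before the merge:

// Fail early if the data and label files are misaligned
if (filteredData.numInstances() != labels.numInstances()) {
    throw new IllegalArgumentException("Data and labels contain a different number of instances");
}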

Finally, we set the last attribute, that is, the label attribute that we have just added, as the target variable and return the resulting dataset:

// set the label attribute as class 
labeledData.setClassIndex(labeledData.numAttributes() - 1);

System.out.println(labeledData.toSummaryString());
return labeledData;
}
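For completeness, the snippets above rely on the following imports, and the function can be invoked as shown below (the paths are assumed to point at the unzipped KDD Cup files):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveType;

// Example call
Instances churnData = loadData("orange_small_train.data", "orange_small_train_churn.labels");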

The function outputs a summary, as shown in the following, and returns the labeled dataset:

Relation Name:  orange_small_train.data-weka.filters.unsupervised.attribute.RemoveType-Tstring_orange_small_train_churn.labels.txt
Num Instances:  50000
Num Attributes: 215

Name          Type  Nom  Int Real     Missing      Unique  Dist
1 Var1        Num   0%   1%   0% 49298 / 99%     8 /  0%    18 
2 Var2        Num   0%   2%   0% 48759 / 98%     1 /  0%     2 
3 Var3        Num   0%   2%   0% 48760 / 98%   104 /  0%   146 
4 Var4        Num   0%   3%   0% 48421 / 97%     1 /  0%     4
...