Advanced modeling with ensembles

In the previous section, we implemented a baseline to get oriented, so let's now focus on the heavy machinery. We will follow the approach taken by the KDD Cup 2009 winning solution developed by the IBM Research team (Niculescu-Mizil and others, 2009).

Their strategy to address the challenge was to use the Ensemble Selection algorithm (Caruana and Niculescu-Mizil, 2004). This is an ensemble method, which means it constructs a series of models and combines their outputs in a specific way to provide the final classification. It has several desirable properties, listed as follows, that make it a good fit for this challenge:

  • It was proven to be robust, yielding excellent performance
  • It can be optimized for a specific performance metric, including AUC
  • It allows different classifiers to be added to the library
  • It is an anytime method, meaning that, if we run out of time, we have a solution available

In this section, we will loosely follow the steps described in their report. Note that this is not an exact implementation of their approach, but rather a solution overview that includes the steps necessary to dive deeper.

The general overview of steps is as follows:

  1. First, we will preprocess the data by removing attributes that clearly do not bring any value, for example, attributes whose values are all missing or constant; fixing missing values in order to help machine learning algorithms that cannot deal with them; and discretizing numeric attributes into intervals.
  2. Next, we will run an attribute selection algorithm to select only a subset of attributes that can help with prediction.
  3. In the third step, we will instantiate the Ensemble Selection algorithm with a wide variety of models and, finally, evaluate the performance.

Before we start

For this task, we will need an additional Weka package, ensembleLibrary. Weka 3.7.2 and higher versions support external packages, developed mainly by the academic community. A list of Weka packages is available at http://weka.sourceforge.net/packageMetaData, as shown in the following screenshot:


Find and download the latest available version of the ensembleLibrary package at http://prdownloads.sourceforge.net/weka/ensembleLibrary1.0.5.zip?download.

After you unzip the package, locate ensembleLibrary.jar, add it to your classpath, and import the main class in your code, as follows:

import weka.classifiers.meta.EnsembleSelection;
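The remaining snippets in this section rely on a few more Weka classes. For reference, they assume imports along the following lines; the class names are taken from this section, so verify the exact package of EnsembleLibrary against the version of the ensembleLibrary package you installed:

import java.io.File;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveUseless;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.attribute.Discretize;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.EnsembleLibrary;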

Data pre-processing

First, we will utilize Weka's built-in weka.filters.unsupervised.attribute.RemoveUseless filter, which works exactly as its name suggests. It removes attributes that do not vary much, for instance, all constant attributes, as well as attributes that vary too much, almost at random. The maximum variance, which is applied only to nominal attributes, is specified with the -M parameter. The default value is 99%, which means that if more than 99% of all instances have distinct attribute values, the attribute is removed, as follows:

RemoveUseless removeUseless = new RemoveUseless();
removeUseless.setOptions(new String[] { "-M", "99" });// threshold
removeUseless.setInputFormat(data);
data = Filter.useFilter(data, removeUseless);

Next, we will replace all the missing values in the dataset with the modes (nominal attributes) and means (numeric attributes) from the training data by using the weka.filters.unsupervised.attribute.ReplaceMissingValues filter. In general, missing-value replacement should be approached with caution, taking into consideration the meaning and context of the attributes:

ReplaceMissingValues fixMissing = new ReplaceMissingValues();
fixMissing.setInputFormat(data);
data = Filter.useFilter(data, fixMissing);

Finally, we will discretize numeric attributes, that is, we transform numeric attributes into intervals using the weka.filters.unsupervised.attribute.Discretize filter. With the -B option, we split numeric attributes into four bins, and the -R option specifies the range of attributes (only numeric attributes will be discretized):

Discretize discretizeNumeric = new Discretize();
discretizeNumeric.setOptions(new String[] {
    "-B",  "4",  // no of bins
    "-R",  "first-last"}); //range of attributes
discretizeNumeric.setInputFormat(data);
data = Filter.useFilter(data, discretizeNumeric);
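The three preprocessing filters can also be chained and applied in a single pass. The following is a minimal sketch, assuming the removeUseless, fixMissing, and discretizeNumeric objects configured above and a freshly loaded data object, using Weka's weka.filters.MultiFilter:

// requires: import weka.filters.MultiFilter;
// Apply the three preprocessing filters in one pass over the raw dataset.
MultiFilter preprocess = new MultiFilter();
preprocess.setFilters(new Filter[] { removeUseless, fixMissing, discretizeNumeric });
preprocess.setInputFormat(data); // data is the raw, unfiltered dataset here
data = Filter.useFilter(data, preprocess);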

Attribute selection

In the next step, we will select only informative attributes, that is, attributes that are more likely to help with prediction. A standard approach to this problem is to check the information gain carried by each attribute. We will use the weka.attributeSelection.AttributeSelection filter, which requires two additional components: an evaluator, which calculates how useful an attribute is, and a search algorithm, which determines how a subset of attributes is selected.

In our case, we first initialize weka.attributeSelection.InfoGainAttributeEval, which implements the calculation of information gain, together with a Ranker search:

InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();

To select only the top attributes, we configure the weka.attributeSelection.Ranker to keep the attributes with information gain above a specific threshold. We specify the threshold with the -T parameter, keeping its value low in order to retain attributes with at least some information:

search.setOptions(new String[] { "-T", "0.001" });

Tip

The general rule for setting this threshold is to sort the attributes by information gain and pick the threshold where the information gain drops to a negligible value. A sketch that prints these ranked values is shown at the end of this section.

Next, we can initialize the AttributeSelection class, set the evaluator and ranker, and apply the attribute selection to our dataset:

AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);

// apply attribute selection
attSelect.SelectAttributes(data);

Finally, we remove the attributes that were not selected in the last run by calling the reduceDimensionality(Instances) method.

// remove the attributes not selected in the last run
data = attSelect.reduceDimensionality(data);

At the end, we are left with 214 out of 230 attributes.
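Following the earlier tip, one way to choose the -T threshold is to print the ranked information gains and see where they drop to a negligible value. The following is a minimal sketch, assuming attSelect.SelectAttributes(data) has already been called and that it is run before reduceDimensionality(), so the attribute indexes still refer to the full dataset:

// Print each attribute's information gain as ranked by the Ranker search.
double[][] ranked = attSelect.rankedAttributes(); // [i][0] = attribute index, [i][1] = information gain
for (double[] row : ranked) {
  System.out.println(data.attribute((int) row[0]).name() + "\t" + row[1]);
}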

Model selection

Over the years, practitioners in the field of machine learning have developed a wide variety of learning algorithms and improvements to the existing ones. There are so many unique supervised learning methods that it is challenging to keep track of all of them. As the characteristics of datasets vary, no single method is the best in all cases, but different algorithms are able to take advantage of different characteristics and relationships in a given dataset. This is exactly the property that the Ensemble Selection algorithm tries to leverage (Jung, 2005):

Intuitively, the goal of ensemble selection algorithm is to automatically detect and combine the strengths of these unique algorithms to create a sum that is greater than the parts. This is accomplished by creating a library that is intended to be as diverse as possible to capitalize on a large number of unique learning approaches. This paradigm of overproducing a huge number of models is very different from more traditional ensemble approaches. Thus far, our results have been very encouraging.

First, we need to create the model library by initializing the weka.classifiers.EnsembleLibrary class, which will help us define the models:

EnsembleLibrary ensembleLib = new EnsembleLibrary();

Next, we add models and their parameters to the library as configuration strings; for example, we can add two J48 decision tree learners with different parameters, as follows:

ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2");
ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2 -A");

If you are familiar with the Weka graphical interface, you can also explore the algorithms and their configurations there and copy the configuration as shown in the following screenshot: right-click on the algorithm name and navigate to Edit configuration | Copy configuration string:


To complete the example, we added the following algorithms and their parameters:

  • Naive Bayes, which was used as the default baseline:
    ensembleLib.addModel("weka.classifiers.bayes.NaiveBayes");
  • k-nearest neighbors, a lazy learning model:
    ensembleLib.addModel("weka.classifiers.lazy.IBk");
  • Logistic regression, via SimpleLogistic, with default parameters:
    ensembleLib.addModel("weka.classifiers.functions.SimpleLogistic");
  • Support vector machines with default parameters:
    ensembleLib.addModel("weka.classifiers.functions.SMO");
  • AdaBoost, which is an ensemble method itself:
    ensembleLib.addModel("weka.classifiers.meta.AdaBoostM1");
  • LogitBoost, an ensemble method based on additive logistic regression:
    ensembleLib.addModel("weka.classifiers.meta.LogitBoost");
  • DecisionStump, a one-level decision tree, often used as a base learner for ensemble methods:
    ensembleLib.addModel("weka.classifiers.trees.DecisionStump");

As the EnsembleLibrary implementation is primarily focused on GUI and console users, we have to save the models into a file by calling the saveLibrary(File, EnsembleLibrary, JComponent) method, as follows:

EnsembleLibrary.saveLibrary(new File(path+"ensembleLib.model.xml"), ensembleLib, null);
System.out.println(ensembleLib.getModels());

Next, we can initialize the Ensemble Selection algorithm by instantiating the weka.classifiers.meta.EnsembleSelection class. Let's first review the following method options:

  • -L </path/to/modelLibrary>: This specifies the modelLibrary file, containing the list of all models.
  • -W </path/to/working/directory>: This specifies the working directory, where all models will be stored.
  • -B <numModelBags>: This sets the number of bags, that is, the number of iterations to run the Ensemble Selection algorithm.
  • -E <modelRatio>: This sets the ratio of library models that will be randomly chosen to populate each bag of models.
  • -V <validationRatio>: This sets the ratio of the training data set that will be reserved for validation.
  • -H <hillClimbIterations>: This sets the number of hill climbing iterations to be performed on each model bag.
  • -I <sortInitialization>: This sets the ratio of the ensemble library that the sort initialization algorithm will be able to choose from, while initializing the ensemble for each model bag.
  • -X <numFolds>: This sets the number of cross validation folds.
  • -P <hillClimbMetric>: This specifies the metric that will be used for model selection during the hill climbing algorithm. Valid metrics are accuracy, rmse, roc, precision, recall, fscore, and all.
  • -A <algorithm>: This specifies the algorithm to be used for ensemble selection. Valid algorithms are forward (default) for forward selection, backward for backward elimination, both for both forward and backward elimination, best to simply print the top performer from the ensemble library, and library to only train the models in the ensemble library.
  • -R: This flags whether or not the models can be selected more than once for an ensemble.
  • -G: This states whether the sort initialization greedily stops adding models when the performance degrades.
  • -O: This is a flag for verbose output. This prints the performance of all the selected models.
  • -S <num>: This is a random number seed (default 1).
  • -D: If set, the classifier is run in the debug mode and may output additional information to the console.

We initialize the algorithm with the following parameters, where we specify that the ROC metric is to be optimized:

EnsembleSelection ensembleSel = new EnsembleSelection();
ensembleSel.setOptions(new String[]{
  "-L", path+"ensembleLib.model.xml", // </path/to/modelLibrary>
  "-W", path+"esTmp", // </path/to/working/directory>
  "-B", "10", // <numModelBags>
  "-E", "1.0", // <modelRatio>
  "-V", "0.25", // <validationRatio>
  "-H", "100", // <hillClimbIterations>
  "-I", "1.0", // <sortInitialization>
  "-X", "2", // <numFolds>
  "-P", "roc", // <hillClimbMetric>
  "-A", "forward", // <algorithm>
  "-R", "true", // flag: models can be selected more than once
  "-G", "true", // flag: stop adding models when performance degrades
  "-O", "true", // flag: verbose output
  "-S", "1", // <num> - random number seed
  "-D", "true" // flag: run in debug mode
});

Performance evaluation

The evaluation is heavy both computationally and memory-wise, so make sure that you initialize the JVM with extra heap space (for instance, java -Xmx16g). The computation can take a couple of hours or days, depending on the number of algorithms you include in the model library. This example took 4 hours and 22 minutes on a 12-core Intel Xeon E5-2420 CPU with 32 GB of memory, utilizing 10% CPU and 6 GB of memory on average.

We call our evaluation method and output the results, as follows:

double[] resES = evaluate(ensembleSel);
System.out.println("Ensemble Selection\n"
  + "\tchurn:     " + resES[0] + "\n"
  + "\tappetency: " + resES[1] + "\n"
  + "\tup-sell:   " + resES[2] + "\n"
  + "\toverall:   " + resES[3] + "\n");

The specific set of classifiers in the model library achieved the following result:

Ensemble Selection
  churn:     0.7109874158176481
  appetency: 0.786325687118347
  up-sell:   0.8521363243575182
  overall:   0.7831498090978378

Overall, the approach brought a significant improvement of more than 15 percentage points compared to the initial baseline that we designed at the beginning of the chapter. While it is hard to give a definitive answer, the improvement was mainly due to three factors: data pre-processing and attribute selection, the exploration of a large variety of learning methods, and the use of an ensemble-building technique that is able to take advantage of a variety of base classifiers without overfitting. However, the improvement comes at the cost of a significant increase in processing time and working memory.
