Model selection

Over the years, practitioners in the field of machine learning have developed a wide variety of learning algorithms and improvements to existing ones. There are so many unique supervised learning methods that it is challenging to keep track of them all. As dataset characteristics vary, no single method is the best in all cases; rather, different algorithms are able to take advantage of the different characteristics and relationships of a given dataset.

First, we need to create the model library by initializing the weka.classifiers.EnsembleLibrary class, which will help us define the models:

EnsembleLibrary ensembleLib = new EnsembleLibrary(); 

Next, we add the models and their parameters to the library as string values; for example, we can add three decision tree learners with different parameters, as follows:

ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2");
ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2 -A");

If you are familiar with the Weka graphical interface, you can also explore the algorithms and their configurations there and copy the configuration, as shown in the following screenshot. Right-click on the algorithm name and navigate to Edit configuration | Copy configuration string:

To complete this example, we added the following algorithms and their parameters:

  • The Naive Bayes that was used as the default baseline:
ensembleLib.addModel("weka.classifiers.bayes.NaiveBayes"); 
  • The k-nearest neighbors, based on lazy models:
ensembleLib.addModel("weka.classifiers.lazy.IBk"); 
  • Logistic regression as a simple logistic with default parameters:
ensembleLib.addModel("weka.classifiers.functions.SimpleLogistic");
  • Support vector machines with default parameters:
ensembleLib.addModel("weka.classifiers.functions.SMO"); 
  • AdaBoost, which is, in itself, an ensemble method:
ensembleLib.addModel("weka.classifiers.meta.AdaBoostM1"); 
  • LogitBoost, an ensemble method based on logistic regression:
ensembleLib.addModel("weka.classifiers.meta.LogitBoost"); 
  • DecisionStump, a one-level decision tree that is commonly used as a base learner in ensemble methods:
ensembleLib.addModel("weka.classifiers.trees.DecisionStump"); 

As the EnsembleLibrary implementation is primarily focused on GUI and console users, we have to save the models into a file by calling the saveLibrary(File, EnsembleLibrary, JComponent) method, as follows:

EnsembleLibrary.saveLibrary(new File(path + "ensembleLib.model.xml"), ensembleLib, null);
System.out.println(ensembleLib.getModels());

Next, we can initialize the ensemble selection algorithm by instantiating the weka.classifiers.meta.EnsembleSelection class. First, let's review the following method options:

  • -L </path/to/modelLibrary>: This specifies the modelLibrary file, containing the list of all models.
  • -W </path/to/working/directory>: This specifies the working directory, where all models will be stored.
  • -B <numModelBags>: This sets the number of bags, that is, the number of iterations to run the ensemble selection algorithm.
  • -E <modelRatio>: This sets the ratio of library models that will be randomly chosen to populate each bag of models.
  • -V <validationRatio>: This sets the ratio of the training dataset that will be reserved for validation.
  • -H <hillClimbIterations>: This sets the number of hill climbing iterations to be performed on each model bag.
  • -I <sortInitialization>: This sets the ratio of the ensemble library that the sort initialization algorithm will be able to choose from, while initializing the ensemble for each model bag.
  • -X <numFolds>: This sets the number of cross-validation folds.
  • -P <hillclimbMetric>: This specifies the metric that will be used for model selection during the hill climbing algorithm. Valid metrics include the accuracy, rmse, roc, precision, recall, fscore, and all.
  • -A <algorithm>: This specifies the algorithm to be used for ensemble selection. Valid algorithms include forward (default) for forward selection, backward for backward elimination, both for both forward and backward elimination, best to simply print the top performer from the ensemble library, and library to only train the models in the ensemble library.
  • -R: This flags whether the models can be selected more than once for an ensemble.
  • -G: This states whether the sort initialization greedily stops adding models when the performance degrades.
  • -O: This is a flag for verbose output. This prints the performance of all of the selected models.
  • -S <num>: This is a random number seed (the default is 1).
  • -D: If set, the classifier is run in debug mode, and may provide additional information to the console as output.

We initialize the algorithm with the following initial parameters, where we specify optimizing the ROC metric:

EnsembleSelection ensembleSel = new EnsembleSelection();
ensembleSel.setOptions(new String[]{
  "-L", path + "ensembleLib.model.xml", // </path/to/modelLibrary>
  "-W", path + "esTmp", // </path/to/working/directory>
  "-B", "10", // <numModelBags>
  "-E", "1.0", // <modelRatio>
  "-V", "0.25", // <validationRatio>
  "-H", "100", // <hillClimbIterations>
  "-I", "1.0", // <sortInitialization>
  "-X", "2", // <numFolds>
  "-P", "roc", // <hillclimbMetric>
  "-A", "forward", // <algorithm>
  "-R", // flag: models can be selected more than once
  "-G", // flag: stop adding models when performance degrades
  "-O", // flag: verbose output
  "-S", "1", // <num> - random number seed
  "-D" // flag: run in debug mode
});
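With the library and options in place, the ensemble selection classifier is trained and evaluated like any other Weka classifier. The following sketch pulls the steps together; the file name dataset.arff, the assumption that the class attribute is the last one, and the 10-fold cross-validation setup are all illustrative choices, not part of the original example:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.EnsembleSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EnsembleSelectionDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset; "dataset.arff" is a placeholder file name
        Instances data = DataSource.read("dataset.arff");
        // Assume the class attribute is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        EnsembleSelection ensembleSel = new EnsembleSelection();
        // ... set the options as shown above ...

        // Train the ensemble on the full training set
        ensembleSel.buildClassifier(data);

        // Estimate performance with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ensembleSel, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("AUC: " + eval.weightedAreaUnderROC());
    }
}
```

Since we set -P to roc, inspecting the weighted area under the ROC curve after cross-validation is a natural sanity check that the hill-climbing metric and the reported evaluation agree.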