Classification

We will start with the most commonly used machine learning technique, that is, classification. As we reviewed in the first chapter, the main idea is to automatically build a mapping between the input variables and the outcome. In the following sections, we will look at how to load the data, select features, implement a basic classifier in Weka, and evaluate the classifier performance.

Data

For this task, we will have a look at the ZOO database [ref]. The database contains 101 data entries, each describing an animal with the 18 attributes shown in the following table:

animal      aquatic     fins        hair        predator    legs
feathers    toothed     tail        eggs        backbone    domestic
milk        breathes    catsize     airborne    venomous    type

An example entry in the dataset is a lion with the following attributes:

  • animal: lion
  • hair: true
  • feathers: false
  • eggs: false
  • milk: true
  • airborne: false
  • aquatic: false
  • predator: true
  • toothed: true
  • backbone: true
  • breathes: true
  • venomous: false
  • fins: false
  • legs: 4
  • tail: true
  • domestic: false
  • catsize: true
  • type: mammal

Our task will be to build a model to predict the outcome variable, type (that is, the animal's category, such as mammal or bird), given all the other attributes as input.

Loading data

Before we start with the analysis, we will load the data in Weka's ARFF format and print the total number of loaded instances. Each data sample is held within an Instance object, while the complete dataset, accompanied by its meta-information, is handled by the Instances object.

To load the input data, we will use the DataSource object that accepts a variety of file formats and converts them to Instances:

DataSource source = new DataSource(args[0]);
Instances data = source.getDataSet();
System.out.println(data.numInstances() + " instances loaded.");
// System.out.println(data.toString());

This outputs the number of loaded instances, as follows:

101 instances loaded.

We can also print the complete dataset by calling the data.toString() method.
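
Note that the snippet above assumes the usual Weka imports for data loading; a minimal sketch of the relevant ones (class locations as in the Weka distribution):

import weka.core.Instances;                            // the complete dataset with meta-information
import weka.core.converters.ConverterUtils.DataSource; // reads ARFF, CSV, and other supported formats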

Our task is to learn a model that is able to predict the type attribute for future examples for which we know the other attributes but do not know the type label. The animal attribute is simply the name of each animal; it uniquely identifies every instance and carries no predictive value, so we remove it from the training set. We accomplish this by filtering out the animal attribute using the Remove filter.

First, we set a string array of options, specifying that the first attribute must be removed. The remaining attributes will serve as our dataset for training the classifier:

Remove remove = new Remove();
String[] opts = new String[]{ "-R", "1"};

Then, we set the options, initialize the filter with the format of the input data, and call the Filter.useFilter(Instances, Filter) static method to apply the filter to the dataset:

remove.setOptions(opts);
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);
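
One detail worth checking before moving on: the later calls (attribute selection, buildClassifier(), and classAttribute()) assume that Weka knows which attribute is the class. If the loaded ARFF file does not already mark it, the class attribute can be set explicitly; a minimal sketch, assuming type is the last attribute of the filtered dataset:

// Mark the last attribute (type) as the class to be predicted
data.setClassIndex(data.numAttributes() - 1);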

Feature selection

As introduced in Chapter 1, Applied Machine Learning Quick Start, one of the pre-processing steps is focused on feature selection, also known as attribute selection. The goal is to select a subset of relevant attributes that will be used in the learned model. Why is feature selection important? A smaller set of attributes simplifies the model and makes it easier for users to interpret; it also usually shortens training time and reduces overfitting.

Attribute selection can either take the class value into account or not. In the first case, an attribute selection algorithm evaluates different subsets of features and calculates a score that indicates the quality of the selected attributes. We can use different search algorithms, such as exhaustive search and best-first search, and different quality scores, such as information gain, the Gini index, and so on.

Weka supports this process with the AttributeSelection object, which requires two additional components: an evaluator, which computes how informative an attribute is, and a ranker, which sorts the attributes according to the score assigned by the evaluator.

In this example, we will use information gain as an evaluator and rank the features by their information gain score:

InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();

Next, we initialize an AttributeSelection object and set the evaluator, ranker, and data:

AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);
attSelect.SelectAttributes(data);

Finally, we can print the ordered list of attribute indices, as follows:

int[] indices = attSelect.selectedAttributes();
System.out.println(Utils.arrayToString(indices));

The method outputs the following result:

12,3,7,2,0,1,8,9,13,4,11,5,15,10,6,14,16

The most informative attributes are 12 (legs), 3 (milk), 7 (toothed), 2 (eggs), 0 (hair), and so on; the indices are zero-based and refer to the dataset after the animal attribute was removed. Based on this list, we can remove additional, non-informative features in order to help learning algorithms build models that are more accurate and faster to learn.

How do we make the final decision about the number of attributes to keep? There is no rule of thumb for an exact number; it depends on the data and the problem. The purpose of attribute selection is to choose the attributes that serve your model best, so it is better to focus on whether the selected attributes actually improve the model.
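
If you do decide to keep only the highest-ranked attributes, the ranking can be turned into an actual dataset reduction with Weka's attribute selection filter. The following is a minimal sketch, assuming we keep the top five attributes; note that the filter class weka.filters.supervised.attribute.AttributeSelection shares its short name with the weka.attributeSelection.AttributeSelection class used above, so it is referenced by its fully qualified name here, and the value of five is purely illustrative:

// Rank attributes by information gain and keep only the five best (plus the class)
Ranker topRanker = new Ranker();
topRanker.setNumToSelect(5); // illustrative value; tune it to your data and model
weka.filters.supervised.attribute.AttributeSelection attFilter =
    new weka.filters.supervised.attribute.AttributeSelection();
attFilter.setEvaluator(new InfoGainAttributeEval());
attFilter.setSearch(topRanker);
attFilter.setInputFormat(data);
Instances reducedData = Filter.useFilter(data, attFilter);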

Learning algorithms

We have loaded our data, selected the best features, and are ready to learn some classification models. Let's begin with a basic decision tree.

In Weka, decision trees are implemented in the J48 class, which is a re-implementation of Quinlan's famous C4.5 decision tree learner [Quinlan, 1993].

First, we initialize a new J48 decision tree learner. We can pass additional parameters as a string array, for instance, options that control tree pruning and hence the model complexity (refer to Chapter 1, Applied Machine Learning Quick Start). In our case, we will build an unpruned tree; hence we will pass the single -U option:

J48 tree = new J48();
String[] options = new String[1];
options[0] = "-U";

tree.setOptions(options);

Next, we call the buildClassifier(Instances) method to initialize the learning process:

tree.buildClassifier(data);

The built model is now stored in the tree object. We can output the entire J48 unpruned tree by calling the toString() method:

System.out.println(tree);

The output is as follows:

J48 unpruned tree
------------------

feathers = false
|   milk = false
|   |   backbone = false
|   |   |   airborne = false
|   |   |   |   predator = false
|   |   |   |   |   legs <= 2: invertebrate (2.0)
|   |   |   |   |   legs > 2: insect (2.0)
|   |   |   |   predator = true: invertebrate (8.0)
|   |   |   airborne = true: insect (6.0)
|   |   backbone = true
|   |   |   fins = false
|   |   |   |   tail = false: amphibian (3.0)
|   |   |   |   tail = true: reptile (6.0/1.0)
|   |   |   fins = true: fish (13.0)
|   milk = true: mammal (41.0)
feathers = true: bird (20.0)

Number of Leaves  : 9

Size of the tree : 17

The output tree has 17 nodes in total; 9 of these are terminal nodes (leaves).
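
If a more compact tree is preferred, pruning can be enabled instead of the -U option used above. A minimal sketch follows; the option values shown are J48's documented defaults and are given here only for illustration:

J48 prunedTree = new J48();
// -C sets the pruning confidence factor, -M the minimum number of instances per leaf
prunedTree.setOptions(new String[]{"-C", "0.25", "-M", "2"});
prunedTree.buildClassifier(data);
System.out.println(prunedTree);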

Another way to present the tree is to leverage the built-in TreeVisualizer tree viewer, as follows:

TreeVisualizer tv = new TreeVisualizer(null, tree.graph(), new PlaceNode2());
JFrame frame = new javax.swing.JFrame("Tree Visualizer");
frame.setSize(800, 500);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.getContentPane().add(tv);
frame.setVisible(true);
tv.fitToScreen();

The code results in the following frame:

[Figure: the J48 decision tree displayed in the TreeVisualizer frame]

The decision process starts at the top node, also known as the root node. The node label specifies the attribute whose value will be checked. In our example, we first check the value of the feathers attribute. If feathers are present, we follow the right-hand branch, which leads us to the leaf labeled bird, indicating that 20 examples support this outcome. If feathers are not present, we follow the left-hand branch, which leads us to the next node, checking the milk attribute. We again check the value of the attribute and follow the branch that matches it. We repeat the process until we reach a leaf node.

We can build other classifiers by following the same steps: initialize a classifier, pass the parameters controlling the model complexity, and call the buildClassifier(Instances) method.
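
For example, a naive Bayes model could be trained on the same data with exactly the same pattern; a minimal sketch (the choice of classifier here is only an illustration):

// weka.classifiers.bayes.NaiveBayes follows the same build pattern as J48
NaiveBayes nb = new NaiveBayes();
nb.buildClassifier(data);
System.out.println(nb);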

In the next section, we will learn how to use a trained model to assign a class label to a new example whose label is unknown.

Classify new data

Suppose we record the attributes of an animal whose label we do not know; we can then predict its label with the learned classification model.

We first construct a feature vector describing the new specimen, as follows:

double[] vals = new double[data.numAttributes()];
vals[0] = 1.0;  //hair {false, true}
vals[1] = 0.0;  //feathers {false, true}
vals[2] = 0.0;  //eggs {false, true}
vals[3] = 1.0;  //milk {false, true}
vals[4] = 0.0;  //airborne {false, true}
vals[5] = 0.0;  //aquatic {false, true}
vals[6] = 0.0;  //predator {false, true}
vals[7] = 1.0;  //toothed {false, true}
vals[8] = 1.0;  //backbone {false, true}
vals[9] = 1.0;  //breathes {false, true}
vals[10] = 1.0;  //venomous {false, true}
vals[11] = 0.0;  //fins {false, true}
vals[12] = 4.0;  //legs INTEGER [0,9]
vals[13] = 1.0;  //tail {false, true}
vals[14] = 1.0;  //domestic {false, true}
vals[15] = 0.0;  //catsize {false, true}
Instance myUnicorn = new Instance(1.0, vals);

Finally, we call the classifyInstance(Instance) method on the model to obtain the class value. The method returns the class label index, as follows:

double result = tree.classifyInstance(myUnicorn);
System.out.println(data.classAttribute().value((int) result));

This outputs the mammal class label.
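
Note that new Instance(1.0, vals) reflects the older Weka 3.6 API. In Weka 3.7 and later, Instance is an interface, so the concrete DenseInstance class is used instead, and it is usually safest to tell the instance which dataset it belongs to; a minimal sketch under that assumption:

// Weka 3.7+ replacement for new Instance(1.0, vals)
Instance myUnicorn = new DenseInstance(1.0, vals);
myUnicorn.setDataset(data); // lets the instance resolve its attribute and class definitions
double result = tree.classifyInstance(myUnicorn);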

Evaluation and prediction error metrics

We built a model, but we do not know if it can be trusted. To estimate its performance, we can apply a cross-validation technique explained in Chapter 1, Applied Machine Learning Quick Start.

Weka offers an Evaluation class implementing cross validation. We pass the model, data, number of folds, and an initial random seed, as follows:

Classifier cl = new J48();
Evaluation eval_roc = new Evaluation(data);
eval_roc.crossValidateModel(cl, data, 10, new Random(1), new Object[] {});
System.out.println(eval_roc.toSummaryString());

The evaluation results are stored in the Evaluation object.

A summary of the most common metrics can be printed by calling the toSummaryString() method. Note that the output does not differentiate between regression and classification metrics, so pay attention to the metrics that make sense for the task, as follows:

Correctly Classified Instances          93               92.0792 %
Incorrectly Classified Instances         8                7.9208 %
Kappa statistic                          0.8955
Mean absolute error                      0.0225
Root mean squared error                  0.14  
Relative absolute error                 10.2478 %
Root relative squared error             42.4398 %
Coverage of cases (0.95 level)          96.0396 %
Mean rel. region size (0.95 level)      15.4173 %
Total Number of Instances              101  

In classification, we are mainly interested in the number of correctly and incorrectly classified instances.
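
Beyond overall accuracy, the Evaluation object also exposes per-class metrics such as precision, recall, and the F-measure; a minimal sketch of how they could be printed (method names as defined in weka.classifiers.Evaluation):

// Per-class TP rate, FP rate, precision, recall, F-measure, and ROC area
System.out.println(eval_roc.toClassDetailsString());
int mammalIndex = data.classAttribute().indexOfValue("mammal");
System.out.println("Precision (mammal): " + eval_roc.precision(mammalIndex));
System.out.println("Recall (mammal):    " + eval_roc.recall(mammalIndex));
System.out.println("F-measure (mammal): " + eval_roc.fMeasure(mammalIndex));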

Confusion matrix

Furthermore, we can inspect where particular misclassifications were made by examining the confusion matrix. The confusion matrix shows how each class value was predicted:

double[][] confusionMatrix = eval_roc.confusionMatrix();
System.out.println(eval_roc.toMatrixString());

The resulting confusion matrix is as follows:

=== Confusion Matrix ===

  a  b  c  d  e  f  g   <-- classified as
 41  0  0  0  0  0  0 |  a = mammal
  0 20  0  0  0  0  0 |  b = bird
  0  0  3  1  0  1  0 |  c = reptile
  0  0  0 13  0  0  0 |  d = fish
  0  0  1  0  3  0  0 |  e = amphibian
  0  0  0  0  0  5  3 |  f = insect
  0  0  0  0  0  2  8 |  g = invertebrate

The column labels in the first row correspond to the labels assigned by the classification model. Each subsequent row then corresponds to an actual true class value. For instance, the first data row corresponds to instances with the mammal true class label; reading along that row, we see that all 41 mammals were correctly classified as mammals. In the reptile row, we notice that three reptiles were correctly classified, while one was classified as a fish and one as an insect. The confusion matrix hence gives us an insight into the kinds of errors our classification model makes.
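
If you prefer to inspect the errors programmatically rather than reading the printed matrix, the double[][] array obtained above can be traversed directly; a minimal sketch that counts the misclassified instances per true class:

for (int i = 0; i < confusionMatrix.length; i++) {
    double errors = 0;
    for (int j = 0; j < confusionMatrix[i].length; j++) {
        if (i != j) {
            errors += confusionMatrix[i][j]; // instances of class i predicted as class j
        }
    }
    System.out.println(data.classAttribute().value(i) + ": " + errors + " misclassified");
}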

Choosing a classification algorithm

Naive Bayes is one of the simplest, most efficient, and most effective inductive algorithms in machine learning. When features are independent, which is rarely true in the real world, it is theoretically optimal, and even with dependent features, its performance is amazingly competitive (Zhang, 2004). The main disadvantage is that it cannot learn how features interact with each other; for example, even though you like your tea with lemon or with milk, you might hate tea that contains both at the same time.

A decision tree's main advantage is the model itself, that is, a tree, which is easy to interpret and explain, as we saw in our example. It can handle both nominal and numeric features, and you don't have to worry about whether the data is linearly separable.

Some other examples of classification algorithms are as follows:

  • weka.classifiers.rules.ZeroR: This predicts the majority class and is considered a baseline; that is, if your classifier's performance is worse than that of the average-value predictor, it is not worth considering.
  • weka.classifiers.trees.RandomTree: This constructs a tree that considers K randomly chosen attributes at each node.
  • weka.classifiers.trees.RandomForest: This constructs a set (that is, forest) of random trees and uses majority voting to classify a new instance.
  • weka.classifiers.lazy.IBk: This is the k-nearest neighbors classifier that is able to select an appropriate number of neighbors based on cross-validation.
  • weka.classifiers.functions.MultilayerPerceptron: This is a classifier based on neural networks that use back-propagation to classify instances. The network can be built by hand, or created by an algorithm, or both.
  • weka.classifiers.bayes.NaiveBayes: This is a naive Bayes classifier that uses estimator classes, where numeric estimator precision values are chosen based on the analysis of the training data.
  • weka.classifiers.meta.AdaBoostM1: This is the class for boosting a nominal class classifier using the AdaBoost M1 method. Only nominal class problems can be tackled. This often dramatically improves the performance, but sometimes it overfits.
  • weka.classifiers.meta.Bagging: This is the class for bagging a classifier to reduce variance. It can perform classification and regression, depending on the base learner; a short usage sketch of these meta-learners follows this list.
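
As an illustration of the meta-learners above, the following sketch wraps a J48 tree from our example in AdaBoostM1 and cross-validates the result; the number of boosting iterations is an assumed, illustrative value:

// Boost a J48 tree with AdaBoost.M1
AdaBoostM1 boostedTree = new AdaBoostM1();
boostedTree.setClassifier(new J48());   // base learner
boostedTree.setNumIterations(10);       // illustrative number of boosting rounds
boostedTree.buildClassifier(data);

Evaluation boostedEval = new Evaluation(data);
boostedEval.crossValidateModel(boostedTree, data, 10, new Random(1));
System.out.println(boostedEval.toSummaryString());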