Building a classifier

Once sensor samples are represented as feature vectors with assigned class labels, it is possible to apply standard techniques for supervised classification, including feature selection, feature discretization, model learning, k-fold cross-validation, and so on. This chapter will not delve into the details of the machine learning algorithms; any algorithm that supports numerical features can be applied, including SVMs, random forests, AdaBoost, decision trees, neural networks, multilayer perceptrons, and others.

Therefore, let's start with a basic one: decision trees. Here, we will load the dataset, set the class attribute, build a decision tree model, and output the model:

String databasePath = "/Users/bostjan/Dropbox/ML Java Book/book/datasets/chap9/features.arff"; 
 
// Load the data in arff format 
Instances data = new Instances(new BufferedReader(new FileReader(databasePath)));

// Set the last attribute as the class attribute
data.setClassIndex(data.numAttributes() - 1);

// Build a basic decision tree model
String[] options = new String[]{};
J48 model = new J48();
model.setOptions(options);
model.buildClassifier(data);

// Output the decision tree
System.out.println("Decision tree model: " + model);

The algorithm first outputs the model, as follows:

    Decision tree model:
    J48 pruned tree
    ------------------
    
    max <= 10.353474
    |   fft_coef_0000 <= 38.193106: standing (46.0)
    |   fft_coef_0000 > 38.193106
    |   |   fft_coef_0012 <= 1.817792: walking (77.0/1.0)
    |   |   fft_coef_0012 > 1.817792
    |   |   |   max <= 4.573082: running (4.0/1.0)
    |   |   |   max > 4.573082: walking (24.0/2.0)
    max > 10.353474: running (93.0)
    
    Number of Leaves  : 5
    
    Size of the tree : 9

The tree is quite simplistic and seemingly accurate, as the majority-class counts in the terminal nodes are quite high. Let's run a basic classifier evaluation to validate the results, as follows:

// Check accuracy of model using 10-fold cross-validation 
Evaluation eval = new Evaluation(data); 
eval.crossValidateModel(model, data, 10, new Random(1), new String[]{});
System.out.println("Model performance: " + eval.toSummaryString());

This outputs the following model performance:

    Correctly Classified Instances         226               92.623  %
    Incorrectly Classified Instances        18                7.377  %
    Kappa statistic                          0.8839
    Mean absolute error                      0.0421
    Root mean squared error                  0.1897
    Relative absolute error                 13.1828 %
    Root relative squared error             47.519  %
    Coverage of cases (0.95 level)          93.0328 %
    Mean rel. region size (0.95 level)      27.8689 %
    Total Number of Instances              244     
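The summary alone does not tell us which activities are confused with which; for that, Weka's Evaluation class can print a confusion matrix via eval.toMatrixString(). The underlying bookkeeping is simple enough to sketch in plain Java (the class and activity labels below are illustrative, not from the dataset): count how often each actual class (row) is predicted as each class (column).

```java
public class ConfusionMatrix {

    // Count how often each actual class (row) is predicted as each class (column).
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Illustrative labels: 0 = standing, 1 = walking, 2 = running
        int[] actual    = {0, 0, 1, 1, 1, 2, 2};
        int[] predicted = {0, 1, 1, 1, 2, 2, 2};
        int[][] m = confusionMatrix(actual, predicted, 3);
        for (int[] row : m) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) sb.append(v).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}
```

Off-diagonal entries are the errors: in this toy example, one standing sample is mistaken for walking and one walking sample for running.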

The classification accuracy is very high, at 92.62%, which seems an amazing result. However, one important reason why the result is so good lies in our evaluation design: sequential instances are very similar to each other, so if we split them randomly during 10-fold cross-validation, there is a high chance that nearly identical instances are used for both training and testing. Hence, straightforward k-fold cross-validation produces an optimistic estimate of model performance.

A better approach is to use folds that correspond to different sets of measurements or even different people. For example, we can use the application to collect learning data from five people. Then, it makes sense to run k-person cross-validation, where the model is trained on four people and tested on the fifth person. The procedure is repeated for each person and the results are averaged. This will give us a much more realistic estimate of the model performance.
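The grouping logic behind such a leave-one-person-out evaluation is straightforward. A minimal sketch, assuming each instance's person identifier is available in a parallel array (the personIds name is hypothetical, not part of the dataset), might look as follows; each person's indices become the test fold once, while all remaining indices form the training set, and these index lists can then be used to copy the corresponding rows into per-fold Instances objects for Weka.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PersonFolds {

    // Build leave-one-person-out folds: for each person, their instance
    // indices form the test set and all other indices the training set.
    // Each fold is returned as {trainIndices, testIndices}.
    static List<int[][]> leaveOnePersonOut(String[] personIds) {
        Map<String, List<Integer>> byPerson = new LinkedHashMap<>();
        for (int i = 0; i < personIds.length; i++) {
            byPerson.computeIfAbsent(personIds[i], k -> new ArrayList<>()).add(i);
        }
        List<int[][]> folds = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : byPerson.entrySet()) {
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < personIds.length; i++) {
                if (!personIds[i].equals(e.getKey())) {
                    train.add(i);
                }
            }
            folds.add(new int[][]{toArray(train), toArray(e.getValue())});
        }
        return folds;
    }

    static int[] toArray(List<Integer> list) {
        int[] a = new int[list.size()];
        for (int i = 0; i < a.length; i++) a[i] = list.get(i);
        return a;
    }

    public static void main(String[] args) {
        String[] personIds = {"anna", "anna", "bob", "bob", "carol"};
        for (int[][] fold : leaveOnePersonOut(personIds)) {
            System.out.println("train=" + fold[0].length + " test=" + fold[1].length);
        }
    }
}
```

Training a model per fold on the training indices and evaluating on the held-out person, then averaging the per-fold accuracies, gives the realistic estimate described above.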

Leaving evaluation comments aside, let's look at how to deal with classifier errors.
