Feature selection

As introduced in Chapter 1, Applied Machine Learning Quick Start, one of the preprocessing steps is feature selection, also known as attribute selection. The goal is to select a subset of relevant attributes to be used in the learned model. Why is feature selection important? A smaller set of attributes simplifies the model and makes it easier for users to interpret. It also usually shortens training time and reduces overfitting.

Attribute selection can either take the class value into account or ignore it. In the first case, an attribute selection algorithm evaluates different subsets of features and calculates a score that indicates the quality of the selected attributes. We can use different search algorithms, such as exhaustive search and best-first search, and different quality scores, such as information gain, the Gini index, and so on.
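For instance, instead of ranking individual attributes by information gain, we could score whole subsets of attributes and search over them. The following is a minimal sketch, assuming Weka's weka.attributeSelection package; CfsSubsetEval and BestFirst are just one possible evaluator/search pairing, and the resulting objects can be plugged into the same AttributeSelection workflow shown in the steps that follow:

// Both classes live in the weka.attributeSelection package
CfsSubsetEval subsetEvaluator = new CfsSubsetEval(); // scores whole attribute subsets
BestFirst subsetSearch = new BestFirst();            // heuristic search over subsets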

Weka supports this process with an AttributeSelection object, which requires two components: an evaluator, which computes how informative an attribute is, and a search method (here, a ranker), which sorts the attributes according to the scores assigned by the evaluator.

We will use the following steps to perform selection:

  1. In this example, we will use information gain as an evaluator, and we will rank the features by their information gain score:
InfoGainAttributeEval eval = new InfoGainAttributeEval(); 
Ranker search = new Ranker(); 
  2. We will initialize an AttributeSelection object and set the evaluator, ranker, and data:
AttributeSelection attSelect = new AttributeSelection(); 
attSelect.setEvaluator(eval); 
attSelect.setSearch(search); 
attSelect.SelectAttributes(data); 
  3. We will print an ordered list of attribute indices, as follows:
int[] indices = attSelect.selectedAttributes(); 
System.out.println(Utils.arrayToString(indices)); 

This process will provide the following result as output:

12,3,7,2,0,1,8,9,13,4,11,5,15,10,6,14,16 

The most informative attributes are 12 (fins), 3 (eggs), 7 (aquatic), 2 (hair), and so on. Based on this list, we can remove the less informative features, helping the learning algorithms build more accurate models faster.
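Putting the steps together, a minimal end-to-end sketch might look as follows. The ARFF path and the choice of keeping the top five attributes are assumptions made for illustration; reduceDimensionality() returns a copy of the data containing only the retained attributes and the class:

import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionExample {
  public static void main(String[] args) throws Exception {
    // Load the dataset; the file path is an assumption for this sketch
    Instances data = DataSource.read("data/zoo.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Rank attributes by information gain and keep only the top five
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(5); // illustrative choice, not a recommendation

    AttributeSelection attSelect = new AttributeSelection();
    attSelect.setEvaluator(new InfoGainAttributeEval());
    attSelect.setSearch(ranker);
    attSelect.SelectAttributes(data);

    System.out.println(Arrays.toString(attSelect.selectedAttributes()));

    // Produce a copy of the data restricted to the selected attributes
    Instances reduced = attSelect.reduceDimensionality(data);
    System.out.println("Attributes kept: " + reduced.numAttributes());
  }
}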

How do we make the final decision about the number of attributes to keep? There is no rule of thumb for an exact number; it depends on the data and the problem. The purpose of attribute selection is to choose the attributes that serve your model best, so it is best to focus on whether the selected attributes actually improve the model.
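One practical way to make that decision is to measure model performance for different numbers of retained attributes. The following sketch assumes the data variable from the earlier steps and uses a J48 decision tree with 10-fold cross-validation as a stand-in for whatever model you actually plan to train (it additionally uses Evaluation from weka.classifiers, J48 from weka.classifiers.trees, and java.util.Random):

// Assumes 'data' is loaded and its class index is set, as in the steps above
for (int k = 1; k < data.numAttributes(); k++) {
  Ranker ranker = new Ranker();
  ranker.setNumToSelect(k); // keep only the top k ranked attributes

  AttributeSelection selector = new AttributeSelection();
  selector.setEvaluator(new InfoGainAttributeEval());
  selector.setSearch(ranker);
  selector.SelectAttributes(data);
  Instances reduced = selector.reduceDimensionality(data);

  // Estimate accuracy on the reduced data with 10-fold cross-validation
  Evaluation evaluation = new Evaluation(reduced);
  evaluation.crossValidateModel(new J48(), reduced, 10, new Random(1));
  System.out.println(k + " attributes: " + evaluation.pctCorrect() + "% correct");
}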
