In the next step, we will select only the informative attributes, that is, the attributes that are most likely to help with prediction. A standard approach to this problem is to check the information gain carried by each attribute. We will use the weka.attributeSelection.AttributeSelection class, which requires two additional components: an evaluator (how the usefulness of an attribute is calculated) and a search algorithm (how a subset of attributes is selected).
In our case, we first initialize weka.attributeSelection.InfoGainAttributeEval, which implements the calculation of information gain, together with the Ranker search that is configured in the next step:
InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();
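To make the evaluator's score more concrete, the following is a minimal sketch of how information gain could be computed for a single nominal attribute, as the entropy of the class minus the entropy of the class once the attribute value is known. The class InfoGainSketch and its methods are hypothetical names introduced for illustration only; this is not Weka's implementation, and it handles nominal values only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; hypothetical helper, not part of Weka.
final class InfoGainSketch {

    // Shannon entropy (in bits) of a label distribution given as counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) {
                continue;
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of a nominal attribute:
    // H(class) - H(class | attribute).
    static double informationGain(String[] attribute, String[] labels) {
        int n = labels.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> countsPerValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            countsPerValue
                    .computeIfAbsent(attribute[i], k -> new HashMap<>())
                    .merge(labels[i], 1, Integer::sum);
        }
        // Weighted average of the class entropy within each attribute value.
        double conditional = 0.0;
        for (Map<String, Integer> counts : countsPerValue.values()) {
            int size = 0;
            for (int c : counts.values()) {
                size += c;
            }
            conditional += ((double) size / n) * entropy(counts, size);
        }
        return entropy(classCounts, n) - conditional;
    }

    public static void main(String[] args) {
        // Toy data, made up for this example.
        String[] labels = {"yes", "yes", "no", "no"};
        String[] informative = {"a", "a", "b", "b"};   // splits the classes perfectly
        String[] uninformative = {"a", "b", "a", "b"}; // tells us nothing about the class
        System.out.println(informationGain(informative, labels));
        System.out.println(informationGain(uninformative, labels));
    }
}
```

On this toy data, the first attribute separates the two classes completely and so receives the maximum gain of about one bit, while the second attribute receives a gain of about zero. Attributes of the second kind are the ones that the selection step is meant to discard.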
To keep only the attributes whose information gain exceeds a threshold, we use weka.attributeSelection.Ranker, which ranks the attributes by their score. We specify the threshold with the -T parameter and keep its value low, so that any attribute carrying at least some information is retained:
search.setOptions(new String[] { "-T", "0.001" });
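The effect of ranking with a threshold can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than Weka's code: ThresholdRankerSketch and rank are hypothetical names, the scores are made up, and the sketch assumes that an attribute is kept only when its score is strictly greater than the threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; hypothetical helper, not part of Weka.
final class ThresholdRankerSketch {

    // Returns the indices of the attributes whose score is strictly greater
    // than the threshold (an assumption of this sketch), ordered from the
    // highest score to the lowest.
    static List<Integer> rank(double[] scores, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > threshold) {
                kept.add(i);
            }
        }
        kept.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return kept;
    }

    public static void main(String[] args) {
        // Made-up information gain scores for four attributes.
        double[] scores = {0.25, 0.0, 0.0005, 0.4};
        System.out.println(rank(scores, 0.001)); // prints [3, 0]
    }
}
```

With a threshold of 0.001, the attributes at indices 1 and 2 are dropped because their scores are too low, which mirrors the intent of the low -T value used above: only attributes that carry practically no information are removed.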
Next, we can initialize the AttributeSelection class, set the evaluator and ranker, and apply the attribute selection to our dataset, as follows:
AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);
// apply attribute selection
attSelect.SelectAttributes(data);
Finally, we remove the attributes that were not selected in the last run by calling the reduceDimensionality(Instances) method:
// remove the attributes not selected in the last run
data = attSelect.reduceDimensionality(data);
In the end, we are left with 214 out of 230 attributes.