Data preprocessing

First, we will use Weka's built-in weka.filters.unsupervised.attribute.RemoveUseless filter, which works exactly as its name suggests. It removes attributes that do not vary at all (all constant attributes are removed), as well as attributes that vary too much. The maximum variance, which is only applied to nominal attributes, is specified with the -M parameter. The default value is 99%, which means that if the number of distinct values of an attribute exceeds 99% of the number of instances, the attribute is removed, as follows:

RemoveUseless removeUseless = new RemoveUseless(); 
removeUseless.setOptions(new String[] { "-M", "99" }); // maximum variance threshold, in percent
removeUseless.setInputFormat(data); 
data = Filter.useFilter(data, removeUseless);
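
To make the threshold more concrete, the following is a minimal plain-Java sketch of the kind of check described above, applied to a single nominal column. It does not use Weka, and the class UselessAttributeCheck and the method isUseless are hypothetical names introduced only for illustration. The sketch assumes there are no missing values in the column:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Weka: illustrates the variance check only.
class UselessAttributeCheck {

    // Returns true if a nominal column is constant, or if the share of
    // distinct values (in percent) exceeds maxVariancePercent.
    static boolean isUseless(List<String> values, double maxVariancePercent) {
        Set<String> distinct = new HashSet<>(values);
        if (distinct.size() <= 1) {
            return true; // a constant (or empty) column carries no information
        }
        double variancePercent = 100.0 * distinct.size() / values.size();
        return variancePercent > maxVariancePercent;
    }
}
```

With a threshold of 99, a column in which every instance has its own value (such as an identifier) is flagged, while a column with two values spread over four instances (50% distinct) is kept.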

Next, we will replace all of the missing values in the dataset with the modes (nominal attributes) and means (numeric attributes) from the training data, by using the weka.filters.unsupervised.attribute.ReplaceMissingValues filter. In general, missing value replacement should be approached with caution, taking into consideration the meaning and context of the attributes:

ReplaceMissingValues fixMissing = new ReplaceMissingValues(); 
fixMissing.setInputFormat(data); 
data = Filter.useFilter(data, fixMissing);
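
The idea behind mean and mode replacement can be sketched in plain Java, without Weka. The class SimpleImputer and its methods are hypothetical names used only for illustration. The sketch assumes that missing numeric values are encoded as Double.NaN and missing nominal values as null, and that ties for the mode are resolved in favor of the value that reaches the highest count first:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka: illustrates mean/mode imputation.
class SimpleImputer {

    // Returns a copy of the column with NaN entries replaced by the mean
    // of the observed (non-NaN) values.
    static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int observed = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        double[] filled = column.clone();
        if (observed == 0) {
            return filled; // nothing to estimate the mean from
        }
        double mean = sum / observed;
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) {
                filled[i] = mean;
            }
        }
        return filled;
    }

    // Returns a copy of the column with null entries replaced by the most
    // frequent observed value (the mode).
    static String[] fillWithMode(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int best = 0;
        for (String v : column) {
            if (v == null) {
                continue;
            }
            int count = counts.merge(v, 1, Integer::sum);
            if (count > best) {
                best = count;
                mode = v;
            }
        }
        String[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (filled[i] == null) {
                filled[i] = mode;
            }
        }
        return filled;
    }
}
```

For example, the numeric column {1.0, NaN, 3.0} is filled with the mean 2.0, and the nominal column {"a", null, "b", "a"} is filled with the mode "a". Whether such a replacement is meaningful still depends on the attribute in question.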

Finally, we will discretize numeric attributes, that is, we will transform numeric attributes into intervals by using the weka.filters.unsupervised.attribute.Discretize filter. With the -B option, we split each numeric attribute into four intervals, and the -R option specifies the range of attributes to consider (only the numeric attributes in that range will be discretized):

Discretize discretizeNumeric = new Discretize();
discretizeNumeric.setOptions(new String[] {
    "-B", "4",             // number of bins
    "-R", "first-last" }); // range of attributes
discretizeNumeric.setInputFormat(data);
data = Filter.useFilter(data, discretizeNumeric);
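
As an illustration of what splitting a numeric attribute into four intervals means, the following plain-Java sketch performs simple equal-width binning. It does not use Weka, and the class EqualWidthBinner is a hypothetical name. The way values on an interval boundary are assigned is a choice made for this sketch and may differ from what a particular library does:

```java
// Hypothetical helper, not part of Weka: illustrates equal-width binning.
class EqualWidthBinner {

    // Maps each value to a bin index in [0, bins - 1], using intervals of
    // equal width between the smallest and the largest value.
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (width == 0.0) {
                result[i] = 0; // all values are equal, so one bin is enough
                continue;
            }
            int bin = (int) ((values[i] - min) / width);
            result[i] = Math.min(bin, bins - 1); // the maximum goes into the last bin
        }
        return result;
    }
}
```

With four bins, the values {0, 1, 2.5, 5, 7.5, 10} are mapped to the bin indices {0, 0, 1, 2, 3, 3}, since each interval is 2.5 wide.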
