Loading the data

We will load the data into Weka directly from the .csv format. To do this, we will write a function that accepts the path to the data file and the path to the true labels file. The function loads both datasets, removes empty attributes, and merges the data with its labels. We will begin with the following code block:

public static Instances loadData(String pathData, String pathLabeles) throws Exception {

First, we load the data using the CSVLoader class. We specify the tab character as the field separator and force the last 40 attributes (attribute 191 through the last one) to be parsed as nominal:

// Load data 
CSVLoader loader = new CSVLoader(); 
loader.setFieldSeparator("\t"); 
loader.setNominalAttributes("191-last"); 
loader.setSource(new File(pathData)); 
Instances data = loader.getDataSet(); 

The CSVLoader class accepts many additional parameters, such as the column separator, string enclosures, and whether a header row is present. The complete documentation is available at http://weka.sourceforge.net/doc.dev/weka/core/converters/CSVLoader.html.

Some of the attributes contain no values at all, and Weka automatically recognizes them as String attributes. We do not need them, so we can safely remove them by using the RemoveType filter. We pass the -T option followed by the value string, which tells the filter which attribute type to remove:

// Remove empty attributes identified as String attributes 
RemoveType removeString = new RemoveType(); 
removeString.setOptions(new String[]{"-T", "string"}); 
removeString.setInputFormat(data); 
Instances filteredData = Filter.useFilter(data, removeString); 

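Alternatively, we could use the deleteStringAttributes() method, implemented within the Instances class, which has the same effect; for example, data.deleteStringAttributes(). Note that this method returns void and modifies the dataset in place instead of returning a filtered copy.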
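
Before removing anything, it can be useful to see which columns are actually empty. The following is a minimal plain-Java sketch that does not use Weka; the class name EmptyColumns and the method findEmptyColumns are hypothetical helpers introduced only for illustration. It assumes a tab-separated file with a header row, one record per line, and missing values written as empty fields:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Weka.
final class EmptyColumns {

    /** Returns the header names of columns that have no value in any row. */
    public static List<String> findEmptyColumns(Path file) throws IOException {
        // ISO-8859-1 maps every byte to a character, so decoding never fails.
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return Collections.emptyList();
            }
            // A limit of -1 keeps trailing empty fields.
            String[] header = headerLine.split("\t", -1);
            boolean[] hasValue = new boolean[header.length];

            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t", -1);
                int n = Math.min(fields.length, header.length);
                for (int i = 0; i < n; i++) {
                    if (!fields[i].isEmpty()) {
                        hasValue[i] = true;
                    }
                }
            }

            List<String> empty = new ArrayList<>();
            for (int i = 0; i < header.length; i++) {
                if (!hasValue[i]) {
                    empty.add(header[i]);
                }
            }
            return empty;
        }
    }
}
```

Calling EmptyColumns.findEmptyColumns(Paths.get(pathData)) lists the candidate columns by name, which can then be compared with the attributes that the filter drops.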

Now, we will load the class labels and assign them to the data. We will use CSVLoader again, this time specifying that the file does not have a header line, that is, setNoHeaderRowPresent(true):

// Load labels 
loader = new CSVLoader(); 
loader.setFieldSeparator("\t"); 
loader.setNoHeaderRowPresent(true); 
loader.setNominalAttributes("first-last"); 
loader.setSource(new File(pathLabeles)); 
Instances labels = loader.getDataSet(); 

Once we have loaded both files, we can merge them by calling the static Instances.mergeInstances(Instances, Instances) method. The method returns a new dataset that has all of the attributes from the first dataset, plus the attributes from the second dataset. Note that the number of instances in both datasets must be the same:

// Append label as class value 
Instances labeledData = Instances.mergeInstances(filteredData, labels);
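
Because the merge requires both datasets to have the same number of instances, it may help to compare the two files before loading them. The following is a minimal plain-Java sketch, independent of Weka; the class RowCountCheck and its methods are hypothetical names introduced for illustration. It assumes one record per line, a header row in the data file, and no header row in the labels file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Weka.
final class RowCountCheck {

    /** Counts non-empty lines, optionally leaving out one header row. */
    public static long countRows(Path file, boolean hasHeader) throws IOException {
        // ISO-8859-1 maps every byte to a character, so decoding never fails.
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            // A line holding only tabs is a record of missing values, so it is
            // counted; only completely empty lines are skipped.
            long rows = reader.lines().filter(line -> !line.isEmpty()).count();
            return hasHeader ? Math.max(0L, rows - 1) : rows;
        }
    }

    /** Throws IllegalArgumentException if the two files differ in row count. */
    public static void requireSameRowCount(Path dataFile, Path labelsFile)
            throws IOException {
        long dataRows = countRows(dataFile, true);      // data file has a header
        long labelRows = countRows(labelsFile, false);  // labels file has none
        if (dataRows != labelRows) {
            throw new IllegalArgumentException("Row count mismatch: "
                + dataRows + " data rows vs. " + labelRows + " labels");
        }
    }
}
```

Such a check could be called at the start of loadData, for example as RowCountCheck.requireSameRowCount(Paths.get(pathData), Paths.get(pathLabeles)), so that a mismatch is reported with both counts.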

Finally, we set the last attribute, that is, the label attribute that we just added, as a target variable, and return the resulting dataset:

// set the label attribute as class  
labeledData.setClassIndex(labeledData.numAttributes() - 1); 
 
System.out.println(labeledData.toSummaryString()); 
return labeledData; 
} 

The function prints a summary of the dataset, shown in the following output, and returns the labeled dataset:

    Relation Name:  orange_small_train.data-weka.filters.unsupervised.attribute.RemoveType-Tstring_orange_small_train_churn.labels.txt
    Num Instances:  50000
    Num Attributes: 215
    
    Name          Type  Nom  Int Real     Missing      Unique  Dist
    1 Var1        Num   0%   1%   0% 49298 / 99%     8 /  0%    18 
    2 Var2        Num   0%   2%   0% 48759 / 98%     1 /  0%     2 
    3 Var3        Num   0%   2%   0% 48760 / 98%   104 /  0%   146 
    4 Var4        Num   0%   3%   0% 48421 / 97%     1 /  0%     4
    ...
  