Loading the data (Simple)

This task will show you how to load a dataset in the Attribute-Relation File Format (ARFF), which is typically used to store training and testing data. In addition, it will demonstrate how to create a dataset on the fly and save the data into a file. A detailed description of ARFF is available at http://weka.wikispaces.com/ARFF+(book+version).

To demonstrate the recipe, we will use a dataset that describes the fate of the passengers of the ocean liner Titanic. The sinking of the Titanic is a famous event, and many well-known facts—from the proportion of first-class passengers to the "women-and-children-first" policy, and the fact that this policy was not entirely successful in saving the women and children from the third class—are reflected in the survival rates for various classes of passengers.
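For reference, an ARFF file consists of a @relation header, a list of @attribute declarations, and a @data section of comma-separated rows. The following excerpt sketches the shape such a file takes; the attribute names and rows are illustrative, not the actual contents of dataset/titanic.arff:

```
@relation titanic

@attribute class {1st, 2nd, 3rd, crew}
@attribute age {adult, child}
@attribute sex {male, female}
@attribute survived {yes, no}

@data
1st,adult,male,yes
3rd,adult,female,no
```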

How to do it...

Use the following snippet (saved in LoadData.java) to import a dataset to Weka, print the number of examples in the dataset, and output the complete dataset.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
  public static void main(String[] args) throws Exception {
    DataSource source = new DataSource("dataset/titanic.arff");
    Instances data = source.getDataSet();
    System.out.println(data.numInstances() + " instances loaded.");
    System.out.println(data.toString());
  }
}

How it works...

Each example in the dataset, that is, each data row, is handled with an Instance object. The complete dataset is handled with an Instances object, which holds an ordered set of weighted instances (the default weight is 1.0; the weights are used by algorithms that support weighted instances). Hence, we first import the weka.core.Instances class with the following command:

import weka.core.Instances;

Next, import the DataSource class, which can load a variety of file formats; it is not limited to ARFF files, but reads any file format that Weka can import via its converters (for example, CSV).

import weka.core.converters.ConverterUtils.DataSource;

Now, instantiate a DataSource object and specify the location of the ARFF file containing the dataset:

DataSource source = new DataSource("dataset/titanic.arff");

Call getDataSet(), which reads the specified file and returns the loaded dataset; if the file is not a valid ARFF file, an exception is thrown.

Instances data = source.getDataSet();

Finally, we can print the number of records (that is, instances) in our dataset with the numInstances() method or list the complete dataset with the toString() method.

System.out.println(data.numInstances()+" instances loaded.");
System.out.println(data.toString());

There's more...

Sometimes, the data is readily available in real time. In this case, it might be time-consuming to create the ARFF files first. This section first demonstrates how to create a dataset on the fly and then explains how to save a dataset for later use.

Creating a dataset at runtime

The following lines of code show an example of how to create a dataset at runtime. We will create a dataset describing a basic blog post with attributes such as category, number of visits, title, and date when it was posted.

First, import the following objects:

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.FastVector;

The Attribute class contains information about the attribute type (for example, nominal, numeric, or date) and a list of possible values in case the attribute is nominal.

Next, set up the attributes with Weka's FastVector class, a data structure quite similar to java.util.Vector:

FastVector attributes = new FastVector();

To add a nominal attribute such as category, first specify all the possible values in a FastVector object and then initialize an Attribute object with these values:

FastVector catVals = new FastVector(3);
catVals.addElement("sports");
catVals.addElement("finance");
catVals.addElement("news");
attributes.addElement(new Attribute("category (att1)", catVals));
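FastVector is Weka's own container, but its API (addElement, elementAt, size) mirrors java.util.Vector. As a standard-library sketch of the same calls, building the list of nominal category values:

```java
import java.util.Vector;

public class VectorSketch {
    public static void main(String[] args) {
        // Build the list of nominal values exactly as FastVector does,
        // with an initial capacity of 3.
        Vector<String> catVals = new Vector<String>(3);
        catVals.addElement("sports");
        catVals.addElement("finance");
        catVals.addElement("news");

        System.out.println(catVals.size());       // prints 3
        System.out.println(catVals.elementAt(0)); // prints sports
    }
}
```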

A numeric attribute is initialized only with the attribute name:

attributes.addElement(new Attribute("visits (att2)"));

A string attribute is initialized with a null FastVector reference:

attributes.addElement(new Attribute("title (att3)", (FastVector) null));

A date attribute is initialized with a date format string following the ISO 8601 standard (http://www.iso.org/iso/home/standards/iso8601.htm), for example:

attributes.addElement(new Attribute("posted (att4)", "yyyy-MM-dd"));
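The pattern string follows java.text.SimpleDateFormat conventions, and Weka stores date attribute values internally as millisecond timestamps (held in a double). A standard-library sketch of how such a pattern parses and formats dates:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DatePatternDemo {
    public static void main(String[] args) throws Exception {
        // The same pattern string used for the date attribute above.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        // Parsing yields a Date; the equivalent millisecond timestamp
        // is what Weka keeps as the attribute's internal double value.
        Date posted = fmt.parse("2012-07-27");
        System.out.println(posted.getTime());

        // Formatting converts the timestamp back to text.
        System.out.println(fmt.format(posted)); // prints 2012-07-27
    }
}
```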

After the attributes are initialized, create an empty dataset from them, with an initial capacity of zero:

Instances data = new Instances("Runtime dataset", attributes, 0);

Next, add some instances by first specifying the instance attribute values and then appending a new Instance object to the dataset. Note that the attribute order is important; it must match the order in which the attributes were added to the dataset.

double[] vals = new double[data.numAttributes()];
// nominal value
vals[0] = data.attribute(0).indexOfValue("sports");
// numeric value
vals[1] = 8527.0;
// string value
vals[2] = data.attribute(2).addStringValue("2012 Summer Olympics in London");
// date value
vals[3] = data.attribute(3).parseDate("2012-07-27");
data.add(new Instance(1.0, vals));

If there is a missing value, you can specify it as follows:

vals[3] = Instance.missingValue();

Finally, output the data:

System.out.println(data.toString());

Saving the dataset to ARFF file

This example shows how to save a large dataset to a file. The toString() method of the weka.core.Instances class does not scale well for large datasets, since the complete string has to fit into memory. It is better to use a converter that writes the dataset to disk incrementally. Use the ArffSaver class (weka.core.converters.ArffSaver) to save a weka.core.Instances object to a file.

import weka.core.converters.ArffSaver;
import java.io.File;

Instances dataSet = ...
ArffSaver saver = new ArffSaver();
saver.setInstances(dataSet);
saver.setFile(new File("./data/test.arff"));
saver.writeBatch();

Instead of using saver.writeBatch(), you can also write the data incrementally yourself as follows:

ArffSaver saver = new ArffSaver();
saver.setRetrieval(ArffSaver.INCREMENTAL);
saver.setInstances(dataSet);
saver.setFile(new File("./data/test.arff"));
for (int i = 0; i < dataSet.numInstances(); i++) {
  saver.writeIncremental(dataSet.instance(i));
}
saver.writeIncremental(null);

That's it. The dataset is now saved and can be reused, as shown in the recipe.
