Feature generation

We will perform feature generation using the following steps:

We will create a default pipeline, as described previously:

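// Assemble the preprocessing steps in the order they should run
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Read each file as UTF-8 text
pipeList.add(new Input2CharSequence("UTF-8"));
// Tokens are runs of letters, digits, or underscores (backslashes must be escaped in Java strings)
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
// Remove the stop words listed in the file at stopListFilePath
pipeList.add(new TokenSequenceRemoveStopwords(new
   File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
// Convert the feature sequence into a feature vector
pipeList.add(new FeatureSequence2FeatureVector());
// Convert each instance's target into a label
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);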

Note the additional FeatureSequence2FeatureVector pipe, which transforms a feature sequence into a feature vector. Once the data is in feature-vector form, we can apply any classification algorithm, as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.
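To make the transformation concrete, here is a minimal plain-Java sketch of what the pipeline stages do conceptually: tokenize, lowercase, drop stop words, and collapse the token sequence into counts. It does not use Mallet; the class BagOfWordsSketch, its method names, the sample sentence, and the tiny stop-word set are all assumptions made for illustration, and Mallet's own feature vectors are indexed differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; it is not part of Mallet.
class BagOfWordsSketch {

    // Same token pattern as in the pipeline: runs of letters, digits, or underscores.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Tokenize, lowercase, and drop stop words (the "feature sequence").
    static List<String> toTokens(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String token = matcher.group().toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Collapse the ordered token sequence into per-token counts (the "feature vector").
    static Map<String, Integer> toFeatureVector(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample data, not taken from the spam dataset.
        Set<String> stopWords = Set.of("the", "a", "to");
        List<String> tokens = toTokens("Win a FREE prize, free to the winner!", stopWords);
        System.out.println(tokens);                  // [win, free, prize, free, winner]
        System.out.println(toFeatureVector(tokens)); // {free=2, prize=1, win=1, winner=1}
    }
}
```

The feature vector discards word order and keeps only how often each token occurs, which is the representation most classifiers expect.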

We initialize a folder iterator to load the examples from the train folder, which contains emails in the spam and nonspam subfolders; the subfolder names serve as the example labels:

FileIterator folderIterator = new FileIterator( 
    new File[] {new File(dataFolderPath)}, 
    new TxtFilter(), 
    FileIterator.LAST_DIRECTORY);
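The labeling rule can be illustrated without Mallet. The sketch below assumes that LAST_DIRECTORY means "use the name of the directory that directly contains the file", as its name suggests; the class LastDirectoryLabelSketch, the method labelFor, and the file paths are hypothetical and only mimic that behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; it is not Mallet's FileIterator.
class LastDirectoryLabelSketch {

    // Returns the name of the directory that directly contains the file.
    static String labelFor(String filePath) {
        Path parent = Paths.get(filePath).getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("No parent directory in: " + filePath);
        }
        return parent.getFileName().toString();
    }

    public static void main(String[] args) {
        // Assumed example paths following the train/spam and train/nonspam layout.
        System.out.println(labelFor("train/spam/msg001.txt"));    // spam
        System.out.println(labelFor("train/nonspam/msg002.txt")); // nonspam
    }
}
```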

We will construct a new instance list with the pipeline object that we want to use to process the text:

InstanceList instances = new InstanceList(pipeline);

We will process each instance provided by the iterator:

instances.addThruPipe(folderIterator);
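Conceptually, this call pulls every raw item from the iterator, runs it through each pipe in order, and stores the result in the list. The following plain-Java sketch mirrors that flow under simplifying assumptions: the class PipedListSketch is hypothetical, every stage maps a String to a String (Mallet pipes change the data type along the way), and the sample inputs are invented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an instance list; it is not Mallet's InstanceList.
class PipedListSketch {
    private final List<UnaryOperator<String>> stages;
    private final List<String> items = new ArrayList<>();

    PipedListSketch(List<UnaryOperator<String>> stages) {
        this.stages = stages;
    }

    // Pull each raw item from the iterator, apply all stages in order, and store the result.
    void addThruPipe(Iterator<String> source) {
        while (source.hasNext()) {
            String value = source.next();
            for (UnaryOperator<String> stage : stages) {
                value = stage.apply(value);
            }
            items.add(value);
        }
    }

    List<String> items() {
        return items;
    }

    public static void main(String[] args) {
        // Two simple stages standing in for the real pipes.
        List<UnaryOperator<String>> stages = List.of(
                String::trim,
                s -> s.toLowerCase(Locale.ROOT));
        PipedListSketch list = new PipedListSketch(stages);
        list.addThruPipe(List.of("  Free PRIZE ", "Meeting at 10 ").iterator());
        System.out.println(list.items()); // [free prize, meeting at 10]
    }
}
```

Binding the pipeline to the list at construction time means that every item added later is processed the same way.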

We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam classification on the test set.
