We will perform feature generation using the following steps:
- We will create a default pipeline, as described previously:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
// Tokens are runs of Unicode letters, digits, or underscores
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(
    new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);
Note that we added a FeatureSequence2FeatureVector pipe, which transforms a feature sequence into a feature vector. Once the data is in feature vector form, we can use any classification algorithm, as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.
- We initialize a folder iterator to load our examples from the train folder, which contains the emails in the spam and nonspam subfolders; the subfolder names will be used as example labels:
FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
- We will construct a new instance list with the pipeline object that we want to use to process the text:
InstanceList instances = new InstanceList(pipeline);
- We will process each instance provided by the iterator:
instances.addThruPipe(folderIterator);
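To make the role of the file filter and of FileIterator.LAST_DIRECTORY more concrete, here is a minimal plain-Java sketch that does not use Mallet. It assumes that TxtFilter simply accepts files ending in .txt (its definition is not shown in this section) and that the label is the name of the directory directly containing the file. The class FolderLabelSketch, the method labelFromLastDirectory, and the example path are hypothetical names introduced only for illustration.

```java
import java.io.File;
import java.io.FileFilter;

public class FolderLabelSketch {

    // Assumed behavior of the TxtFilter used above: accept only .txt files.
    static class TxtFilter implements FileFilter {
        @Override
        public boolean accept(File file) {
            return file.getName().toLowerCase().endsWith(".txt");
        }
    }

    // The label is the name of the directory that directly contains the file.
    static String labelFromLastDirectory(File file) {
        File parent = file.getParentFile();
        return parent == null ? "" : parent.getName();
    }

    public static void main(String[] args) {
        File example = new File("train/spam/0001.txt"); // hypothetical path
        if (new TxtFilter().accept(example)) {
            System.out.println(labelFromLastDirectory(example)); // prints "spam"
        }
    }
}
```

With this layout, every email under train/spam would be labeled spam and every email under train/nonspam would be labeled nonspam, without any separate label file.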
We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam classification on the test set.
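As a rough illustration of what the pipeline does to a single email, the following plain-Java sketch tokenizes a string with the same pattern, lowercases the tokens, removes stop words, and counts what remains. It is a simplified stand-in, not Mallet's implementation: Mallet stores counts against indices in a shared alphabet, whereas here we assume a sorted map from token to count is enough to convey the idea. The class FeatureVectorSketch, the method toFeatureVector, the stop list, and the sample text are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureVectorSketch {

    // Same token pattern as in the pipeline: runs of letters, digits, or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Tokenize, lowercase, drop stop words, and count the remaining tokens.
    static Map<String, Integer> toFeatureVector(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String token = matcher.group().toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stop list; the pipeline above reads its list from a file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "to"));
        String email = "Win a FREE prize, claim the free prize to win";
        // prints {claim=1, free=2, prize=2, win=2}
        System.out.println(toFeatureVector(email, stopWords));
    }
}
```

Word order is discarded in this representation, which is why the data can be handed to any classifier that expects fixed feature vectors.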