Feature generation

We will perform feature generation using the following steps:

  1. We will create a default pipeline, as described previously:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>(); 
pipeList.add(new Input2CharSequence("UTF-8")); 
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+"); 
pipeList.add(new CharSequence2TokenSequence(tokenPattern)); 
pipeList.add(new TokenSequenceLowercase()); 
pipeList.add(new TokenSequenceRemoveStopwords(
    new File(stopListFilePath), "utf-8", false, false, false)); 
pipeList.add(new TokenSequence2FeatureSequence()); 
pipeList.add(new FeatureSequence2FeatureVector()); 
pipeList.add(new Target2Label()); 
SerialPipes pipeline = new SerialPipes(pipeList);

Note that we added an additional FeatureSequence2FeatureVector pipe that transforms a feature sequence into a feature vector. When we have data in a feature vector, we can use any classification algorithm, as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.
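
To make the difference between a feature sequence and a feature vector concrete, here is a minimal plain-Java sketch. It does not use Mallet and is not how Mallet implements its pipes; the class name, method names, sample sentence, and stop list are all made up for illustration. It mimics the same stages: tokenizing with the pattern above, lowercasing, removing stop words, mapping tokens to indices (the feature sequence), and then counting the indices (the feature vector).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hypothetical stand-in for the ideas behind the pipes, not Mallet code.
class FeatureVectorSketch {

    // Same token pattern as in the pipeline: runs of letters, digits, or underscores.
    static final Pattern TOKEN_PATTERN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Tokenize, lowercase, and drop stop words.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(text);
        while (matcher.find()) {
            String token = matcher.group().toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Feature sequence: every token is replaced by its index in a growing alphabet,
    // so the order of the words is still preserved.
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        int[] sequence = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            Integer index = alphabet.get(token);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(token, index);
            }
            sequence[i] = index;
        }
        return sequence;
    }

    // Feature vector: word order is discarded and only index -> count remains.
    static Map<Integer, Integer> toFeatureVector(int[] sequence) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int index : sequence) {
            counts.merge(index, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stop list and message, chosen only for the demonstration.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "is"));
        Map<String, Integer> alphabet = new LinkedHashMap<>();

        List<String> tokens = tokenize("Win a FREE prize, the prize is free!", stopWords);
        int[] sequence = toFeatureSequence(tokens, alphabet);

        System.out.println(tokens);                     // [win, free, prize, prize, free]
        System.out.println(Arrays.toString(sequence));  // [0, 1, 2, 2, 1]
        System.out.println(toFeatureVector(sequence));  // {0=1, 1=2, 2=2}
    }
}
```

The sequence keeps one entry per token in reading order, while the vector keeps one entry per distinct token. That fixed, order-free representation is what a classifier typically expects as input.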

  2. We initialize a folder iterator to load the examples from the train folder, which contains email examples in the spam and nonspam subfolders. The subfolder names will be used as the example labels:
FileIterator folderIterator = new FileIterator( 
    new File[] {new File(dataFolderPath)}, 
    new TxtFilter(), 
    FileIterator.LAST_DIRECTORY); 
  3. We will construct a new instance list with the pipeline object that we want to use to process the text:
InstanceList instances = new InstanceList(pipeline); 
  4. We will process each instance provided by the iterator:
instances.addThruPipe(folderIterator); 
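
The labeling convention used by the folder iterator, where the last directory on a file's path becomes its label, can be sketched in plain Java as follows. This is only an illustration of the idea under assumed folder names; the class, the method, and the file paths are hypothetical, and the sketch does not reproduce how FileIterator works internally.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration only: derives a label from the directory that directly contains a file.
class LastDirectoryLabelSketch {

    // Returns the name of the file's parent directory, or null if the path has none.
    static String labelFor(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            return null;
        }
        return parent.getFileName().toString();
    }

    public static void main(String[] args) {
        // Hypothetical file names inside the spam and nonspam subfolders.
        System.out.println(labelFor(Paths.get("train", "spam", "0001.txt")));     // spam
        System.out.println(labelFor(Paths.get("train", "nonspam", "0002.txt")));  // nonspam
    }
}
```

With this convention, adding a new class only requires adding a new subfolder; no separate label file is needed.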

We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam classification on the test set.
