Once we initialize an iterator that will go through the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process through pipelines, along with a wide variety of processing steps that can be included in a pipeline, collected in the cc.mallet.pipe package. Some examples are as follows:
- Input2CharSequence: This is a pipe that can read from various kinds of text sources (either URL, file, or reader) into CharSequence
- CharSequenceRemoveHTML: This pipe removes HTML from CharSequence
- MakeAmpersandXMLFriendly: This converts & into &amp; in the tokens of a token sequence
- TokenSequenceLowercase: This converts the text in each token in the token sequence in the data field into lowercase
- TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance into a feature sequence
- TokenSequenceNGrams: This converts the token sequence in the data field into a token sequence of ngrams, that is, a combination of two or more words
The full list of processing steps is available in the following Mallet documentation: http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html.
Now we are ready to build a class that will import our data. We will do that using the following steps:
- Let's build a pipeline, where each processing step is represented as a pipe in Mallet. Pipes can be wired together serially by collecting them in an ArrayList<Pipe>:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
- Let's begin by reading data from a file object and converting all of the characters into lowercase:
pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());
- We will tokenize raw strings with a regular expression. The following pattern matches runs of Unicode letters, Unicode digits, and the underscore character:
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
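To see what this token pattern actually captures, here is a small standalone sketch using only java.util.regex (the class name and sample text are illustrative): punctuation is discarded, while accented letters, digits, and underscores survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPatternDemo {
    // The same pattern used for CharSequence2TokenSequence:
    // one or more Unicode letters (\p{L}), digits (\p{N}), or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());  // each maximal match becomes one token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation acts as a separator; "Café" and "42" are kept whole.
        System.out.println(tokenize("Café, some_word: 42!"));
    }
}
```

Note that an apostrophe splits a word into two tokens, since it matches none of the three character classes.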
- We will now remove stop words, that is, frequent words with no predictive power, using a standard English stop list. Three additional flags indicate whether the default stop list should be included, whether matching should be case-sensitive, and whether deleted words should be marked rather than removed outright. We'll set all of them to false:
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
- Instead of storing the actual words, we can convert them into integers indicating each word's index in the bag-of-words (BoW) dictionary:
pipeList.add(new TokenSequence2FeatureSequence());
- We'll do the same for the class label; instead of the label string, we'll use an integer indicating the label's position in the set of distinct labels:
pipeList.add(new Target2Label());
- We could also print the features and the labels by invoking the PrintInputAndTarget pipe:
pipeList.add(new PrintInputAndTarget());
- We store the list of pipes in a SerialPipes object, which will convert an instance by passing it through the sequence of pipes:
SerialPipes pipeline = new SerialPipes(pipeList);
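The steps above can be assembled into a small import class. The following sketch is illustrative: it assumes a file named documents.txt with one instance per line in the form "name label text..." (both the filename and the line format are assumptions, not part of the original text), and feeds each line through the pipeline using Mallet's CsvIterator and InstanceList:

```java
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Input2CharSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

public class ImportData {
    public static void main(String[] args) throws Exception {
        // Wire the pipes together in processing order.
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(
                Pattern.compile("[\\p{L}\\p{N}_]+")));
        pipeList.add(new TokenSequence2FeatureSequence());
        pipeList.add(new Target2Label());
        SerialPipes pipeline = new SerialPipes(pipeList);

        // An InstanceList runs every incoming instance through the pipeline.
        InstanceList instances = new InstanceList(pipeline);

        // CsvIterator splits each line with a regex; the group numbers map to
        // the data (3), the target label (2), and the instance name (1).
        instances.addThruPipe(new CsvIterator(
                new FileReader(new File("documents.txt")),
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                3, 2, 1));
    }
}
```

After addThruPipe returns, instances holds one fully processed instance per input line, ready for model training.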
Now let's take a look at how to apply this in a text-mining application!