Once we initialize an iterator that will go through the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process through pipelines, along with a wide variety of processing steps that can be included in a pipeline, collected in the cc.mallet.pipe package. Some examples are as follows:
- Input2CharSequence: This is a pipe that can read from various kinds of text sources (either URL, file, or reader) into CharSequence
- CharSequenceRemoveHTML: This pipe removes HTML from CharSequence
- MakeAmpersandXMLFriendly: This converts & into &amp; in the tokens of a token sequence
- TokenSequenceLowercase: This converts the text in each token in the token sequence in the data field into lowercase
- TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance into a feature sequence
- TokenSequenceNGrams: This converts the token sequence in the data field into a token sequence of ngrams, that is, a combination of two or more words
The full list of processing steps is available in the following Mallet documentation: http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html.
Now we are ready to build a class that will import our data. We will do that using the following steps:
- Let's build a pipeline, where each processing step is represented as a pipe in Mallet. Pipes can be wired together serially by collecting them in an ArrayList<Pipe>:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
- Let's begin by reading data from a file object and converting all of the characters into lowercase:
pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());
- We will tokenize raw strings with a regular expression. The following pattern matches runs of Unicode letters, Unicode digits, and the underscore character:
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
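To see what this token pattern actually captures, here is a small standalone sketch using only java.util.regex (the class name and sample text are illustrative): punctuation is discarded, while accented letters, digits, and underscores survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPatternDemo {
    // The same pattern used for CharSequence2TokenSequence:
    // one or more Unicode letters (\p{L}), digits (\p{N}), or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());  // each maximal match becomes one token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation acts as a separator; "Café" and "42" are kept whole.
        System.out.println(tokenize("Café, some_word: 42!"));
    }
}
```

Note that an apostrophe splits a word into two tokens, since it matches none of the three character classes.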
- We will now remove stop words, that is, frequent words with no predictive power, using a standard English stop list. Three additional flags indicate whether the default stop list should be included, whether matching should be case-sensitive, and whether deleted words should be marked rather than removed outright. We'll set all of them to false:
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
- Instead of storing the actual words, we can convert them into integers indicating each word's index in the bag-of-words (BoW) dictionary:
pipeList.add(new TokenSequence2FeatureSequence());
- We'll do the same for the class label; instead of the label string, we'll use an integer indicating the label's position in the set of distinct labels:
pipeList.add(new Target2Label());
- We could also print the features and the labels by invoking the PrintInputAndTarget pipe:
pipeList.add(new PrintInputAndTarget());
- We store the list of pipes in a SerialPipes object, which will convert an instance by passing it through the sequence of pipes:
SerialPipes pipeline = new SerialPipes(pipeList);
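The steps above can be assembled into a small import class. The following sketch is illustrative: it assumes a file named documents.txt with one instance per line in the form "name label text..." (both the filename and the line format are assumptions, not part of the original text), and feeds each line through the pipeline using Mallet's CsvIterator and InstanceList:

```java
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Input2CharSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

public class ImportData {
    public static void main(String[] args) throws Exception {
        // Wire the pipes together in processing order.
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(
                Pattern.compile("[\\p{L}\\p{N}_]+")));
        pipeList.add(new TokenSequence2FeatureSequence());
        pipeList.add(new Target2Label());
        SerialPipes pipeline = new SerialPipes(pipeList);

        // An InstanceList runs every incoming instance through the pipeline.
        InstanceList instances = new InstanceList(pipeline);

        // CsvIterator splits each line with a regex; the group numbers map to
        // the data (3), the target label (2), and the instance name (1).
        instances.addThruPipe(new CsvIterator(
                new FileReader(new File("documents.txt")),
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                3, 2, 1));
    }
}
```

After addThruPipe returns, instances holds one fully processed instance per input line, ready for model training.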
Now let's take a look at how to apply this in a text-mining application!