Pre-processing text data

Once we initialize an iterator that will go through the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process with pipelines and a wide variety of steps that can be included in a pipeline, collected in the cc.mallet.pipe package. Some examples are as follows:

  • Input2CharSequence: This is a pipe that can read from various kinds of text sources (either URL, file, or reader) into a CharSequence
  • CharSequenceRemoveHTML: This pipe removes HTML from a CharSequence
  • MakeAmpersandXMLFriendly: This converts & into &amp; in tokens of a token sequence
  • TokenSequenceLowercase: This converts the text in each token of the token sequence in the data field into lowercase
  • TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance into a feature sequence
  • TokenSequenceNGrams: This converts the token sequence in the data field into a token sequence of ngrams, that is, combinations of two or more consecutive words
The full list of processing steps is available in the following Mallet documentation: http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html.
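
Each of these steps transforms a Mallet Instance, so a single pipe can also be tried out on its own before it is wired into a pipeline. The following is a minimal sketch of that idea; the sample HTML string and the instance name doc1 are made up purely for illustration:

import cc.mallet.pipe.CharSequenceRemoveHTML;
import cc.mallet.types.Instance;

// Wrap a raw HTML string in an Instance: data, target, name, source
Instance doc = new Instance("<p>Hello <b>Mallet</b></p>", null, "doc1", null);
// Run the instance through a single pipe; the data field is replaced
doc = new CharSequenceRemoveHTML().pipe(doc);
System.out.println(doc.getData()); // prints roughly: Hello Mallet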

Now we are ready to build a class that will import our data. We will do that using the following steps:

  1. Let's build a pipeline, where each processing step is denoted as a pipe in Mallet. Pipes can be wired together in a serial fashion in an ArrayList<Pipe>:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>(); 
  2. Let's begin by reading data from a file object and converting all of the characters into lowercase:
pipeList.add(new Input2CharSequence("UTF-8")); 
pipeList.add( new CharSequenceLowercase() );
  3. We will tokenize raw strings with a regular expression. The following pattern includes Unicode letters, numbers, and the underscore character (note that the backslashes must be escaped in the Java string literal):
Pattern tokenPattern = 
    Pattern.compile("[\\p{L}\\p{N}_]+"); 
 
pipeList.add(new CharSequence2TokenSequence(tokenPattern)); 
  4. We will now remove stop words, that is, frequent words with no predictive power, using a standard English stop list. The last two boolean parameters indicate whether stop-word removal should be case-sensitive and whether to mark deletions instead of just deleting the words; we'll set both of them to false:
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
  5. Instead of storing the actual words, we can convert them into integers, indicating a word index in the BoW:
pipeList.add(new TokenSequence2FeatureSequence()); 
  6. We'll do the same for the class label; instead of the label string, we'll use an integer, indicating the position of the label in the set of class labels:
pipeList.add(new Target2Label()); 
  7. We could also print the features and the labels by invoking the PrintInputAndTarget pipe:
pipeList.add(new PrintInputAndTarget()); 
  8. Finally, we store the list of pipes in a SerialPipes class that will convert an instance through the sequence of pipes; a complete end-to-end sketch follows after this step:
SerialPipes pipeline = new SerialPipes(pipeList); 
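
To see how the pieces fit together, here is a minimal end-to-end sketch that assembles the pipes from the preceding steps and pushes data through them with an iterator. The input file data.tsv, its tab-separated name/label/text layout, and the stop-list path are assumptions for illustration only; Mallet's CsvIterator and InstanceList.addThruPipe() do the actual work of feeding each line through the pipeline:

import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

String stopListFilePath = "stoplists/en.txt"; // hypothetical path to an English stop list

// Steps 1-8: wire the individual pipes into one serial pipeline
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("[\\p{L}\\p{N}_]+")));
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);

// An InstanceList built on the pipeline runs every added instance through all pipes
InstanceList instances = new InstanceList(pipeline);

// Hypothetical input: one document per line, "name<TAB>label<TAB>text";
// regex groups 3, 2, 1 are mapped to data, label, and name, respectively
instances.addThruPipe(new CsvIterator(
    new FileReader("data.tsv"),
    Pattern.compile("^(\\S+)\\t(\\S+)\\t(.*)$"),
    3, 2, 1));

System.out.println("Imported " + instances.size() + " instances");

The resulting InstanceList can then be serialized or handed to a Mallet trainer; any iterator that produces Instance objects (for example, a FileIterator over a directory of documents) could replace the CsvIterator shown here.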

Now let's take a look at how to apply this in a text-mining application!
