NLP tokenizer APIs

In this section, we will demonstrate several different tokenization techniques using the OpenNLP, Stanford, and LingPipe APIs. Although there are a number of other APIs available, we restricted the demonstration to these APIs. The examples will give you an idea of what techniques are available.

We will use a string called paragraph to illustrate these techniques. The string includes a newline, which may occur in real text in unexpected places. It is defined here:

private String paragraph = "Let's pause, \nand then "
    + "reflect.";

Using the OpenNLPTokenizer class

OpenNLP possesses a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods:

  • tokenize: This is passed a string to tokenize and returns an array of tokens as strings.
  • tokenizePos: This is passed a string and returns an array of Span objects. The Span class is used to specify the beginning and ending offsets of the tokens.

Each of these classes is demonstrated in the following sections.

Using the SimpleTokenizer class

As the name implies, the SimpleTokenizer class performs simple tokenization of text. The INSTANCE field is used to instantiate the class as shown in the following code sequence. The tokenize method is executed against the paragraph variable and the tokens are then displayed:

SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = simpleTokenizer.tokenize(paragraph);
for(String token : tokens) {
    System.out.println(token);
}

When executed, we get the following output:

Let
'
s
pause
,
and
then
reflect
.

Using this tokenizer, punctuation is returned as separate tokens.

Using the WhitespaceTokenizer class

As its name implies, this class uses whitespace characters as delimiters. In the following code sequence, an instance of the tokenizer is created and its tokenize method is executed using paragraph as input. The for statement then displays the tokens:

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}

The output is as follows:

Let's
pause,
and
then
reflect.

Although this does not separate contractions and similar units of text, it can be useful for some applications. The class also possesses a tokenizePos method that returns the boundaries of the tokens.
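
A minimal sketch of the tokenizePos method is shown next, using the same paragraph string. Each returned Span (opennlp.tools.util.Span) holds a token's beginning and ending offsets:

Span spans[] = WhitespaceTokenizer.INSTANCE.tokenizePos(paragraph);
for (Span span : spans) {
    // Print each token's offsets followed by the text the span covers
    System.out.println(span.getStart() + "-" + span.getEnd() + ": "
        + paragraph.substring(span.getStart(), span.getEnd()));
}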

Using the TokenizerME class

The TokenizerME class uses a model created with maximum entropy (maxent), a statistical technique, to perform tokenization. The maxent model is used to determine the relationships between elements of the data, in our case, text. Some text sources, such as various social media, are not well formatted and use a lot of slang and special symbols, such as emoticons. A statistical tokenizer, such as one based on the maxent model, improves the quality of the tokenization process.

Note

A detailed discussion of this model is not possible here due to its complexity. A good starting point for an interested reader can be found at http://en.wikipedia.org/w/index.php?title=Multinomial_logistic_regression&redirect=no.

A TokenizerModel class hides the model and is used to instantiate the tokenizer. The model must have been previously trained. In the next example, the tokenizer is instantiated using the model found in the en-token.bin file. This model has been trained to work with common English text.

The location of the model file is returned by the method getModelDir, which you will need to implement. The returned value is dependent on where the models are stored on your system. Many of these models can be found at http://opennlp.sourceforge.net/models-1.5/.
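
A minimal getModelDir implementation simply returns the directory that holds the downloaded model files; the path used here is only a placeholder and must be adjusted to match your system:

private File getModelDir() {
    // Placeholder location for the downloaded OpenNLP model files
    return new File("C:/models/opennlp");
}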

After the instance of a FileInputStream class is created, the input stream is used as the argument of the TokenizerModel constructor. The tokenize method will generate an array of strings. This is followed by code to display the tokens:

try {
    InputStream modelInputStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
    TokenizerModel model = new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(paragraph);
    for (String token : tokens) {
        System.out.println(token);
    }
} catch (IOException ex) {
    // Handle the exception
}

The output is as follows:

Let
's
pause
,
and
then
reflect
.

Using the Stanford tokenizer

Tokenization is supported by several Stanford NLP API classes; a few of them are as follows:

  • The PTBTokenizer class
  • The DocumentPreprocessor class
  • The StanfordCoreNLP class as a pipeline

Each of these examples will use the paragraph string as defined earlier.

Using the PTBTokenizer class

This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer (http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and its support for Unicode. The PTBTokenizer class supports several older constructors; however, it is suggested that the three-argument constructor be used. This constructor uses a Reader object, a LexedTokenFactory<T> argument, and a string to specify which of the several options to use.

The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default. We will demonstrate the use of both classes.

The CoreLabelTokenFactory class is used in the following example. A StringReader instance is created using paragraph. The last argument holds the options, which is null for this example. The PTBTokenizer class implements the Iterator interface, allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}

The output is as follows:

Let
's
pause
,
and
then
reflect
.

The same output can be obtained using the WordTokenFactory class, as shown here:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), new WordTokenFactory(), null);

The power of the CoreLabelTokenFactory class is realized with the options parameter of the PTBTokenizer constructor. These options provide a means to control the behavior of the tokenizer. Options include such controls as how to handle quotes, how to map ellipses, and whether British English or American English spellings should be assumed. A list of options can be found at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html.

In the following code sequence, the PTBTokenizer object is created using the CoreLabelTokenFactory variable ctf along with an option of "invertible=true". This option allows us to obtain and use a CoreLabel object which will give us the beginning and ending position of each token:

CoreLabelTokenFactory ctf = new CoreLabelTokenFactory();
PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), ctf, "invertible=true");
while (ptb.hasNext()) {
    CoreLabel cl = (CoreLabel)ptb.next();
    System.out.println(cl.originalText() + " (" + 
        cl.beginPosition() + "-" + cl.endPosition() + ")");
}

The output of this sequence is as follows. The numbers within the parentheses indicate the tokens' beginning and ending positions:

Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)

Using the DocumentPreprocessor class

The DocumentPreprocessor class tokenizes input from an input stream. In addition, it implements the Iterable interface, making it easy to traverse the tokenized sequence. The tokenizer supports the tokenization of simple text and XML data.

To illustrate this process, we will use an instance of the StringReader class that wraps the paragraph string, as defined here:

Reader reader = new StringReader(paragraph);

An instance of the DocumentPreprocessor class is then created:

DocumentPreprocessor documentPreprocessor =
      new DocumentPreprocessor(reader);

The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>> interface. The HasWord interface contains two methods that deal with words: setWord and word. The latter returns a word as a string. In the next code sequence, the DocumentPreprocessor class splits the input text into sentences, each of which is stored as a List<HasWord>. An Iterator object is used to extract a sentence, and then a for-each statement displays the tokens:

Iterator<List<HasWord>> it = documentPreprocessor.iterator();
while (it.hasNext()) {
    List<HasWord> sentence = it.next();
    for (HasWord token : sentence) {
        System.out.println(token);
    }
}

When executed, we get the following output:

Let
's
pause
,
and
then
reflect
.
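
If the token text is needed as a String, rather than relying on the default toString output, the word method can be called on each HasWord instance. A minimal variation of the inner loop is shown here:

for (HasWord token : sentence) {
    // word() returns the token's text as a String
    String tokenText = token.word();
    System.out.println(tokenText);
}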

Using a pipeline

Here, we will use the StanfordCoreNLP class as demonstrated in Chapter 1, Introduction to NLP. However, we use a simpler annotator string to tokenize the paragraph. As shown next, a Properties object is created and assigned the annotators tokenize and ssplit.

The tokenize annotator specifies that tokenization will occur, and the ssplit annotator splits the text into sentences:

Properties properties = new Properties();
properties.put("annotators", "tokenize, ssplit");

The StanfordCoreNLP class and the Annotation classes are created next:

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
Annotation annotation = new Annotation(paragraph);

The annotate method is executed to tokenize the text and then the prettyPrint method will display the tokens:

pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, System.out);

Various statistics are displayed, followed by the tokens marked up with position information. The output is as follows:

Sentence #1 (8 tokens):
Let's pause, 
and then reflect.
[Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3] [Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5] [Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11] [Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12] [Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17] [Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22] [Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30] [Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]
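
As an alternative to prettyPrint, the tokens can be retrieved from the annotation object programmatically. The following sketch assumes the standard CoreAnnotations.TokensAnnotation key and the CoreLabel class, and prints each token along with its character offsets:

for (CoreLabel token
        : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
    // Each CoreLabel carries the token's text and its character offsets
    System.out.println(token.word() + " (" + token.beginPosition()
        + "-" + token.endPosition() + ")");
}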

Using LingPipe tokenizers

LingPipe supports a number of tokenizers. In this section, we will illustrate the use of the IndoEuropeanTokenizerFactory class; later sections will demonstrate other ways in which LingPipe supports tokenization. The IndoEuropeanTokenizerFactory class's INSTANCE field provides an Indo-European tokenizer factory. Its tokenizer method returns an instance of the Tokenizer class based on the text to be processed, as shown here:

char text[] = paragraph.toCharArray();
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
Tokenizer tokenizer = tokenizerFactory.tokenizer(text, 0, text.length);
for (String token : tokenizer) {
    System.out.println(token);
}

The output is as follows:

Let
'
s
pause
,
and
then
reflect
.
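
The Tokenizer instance can also collect the tokens, and the whitespace between them, into lists. The following sketch assumes LingPipe's two-argument tokenize method, which fills a token list and a whitespace list; a new Tokenizer instance is created from the factory for this example:

List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Tokenizer listTokenizer = tokenizerFactory.tokenizer(text, 0, text.length);
// Fill the two lists with the tokens and the intervening whitespace
listTokenizer.tokenize(tokenList, whiteList);
System.out.println(tokenList);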

These tokenizers support the tokenization of "normal" text. In the next section, we will demonstrate how a tokenizer can be trained to deal with unique text.

Training a tokenizer to find parts of text

Training a tokenizer is useful when we encounter text that is not handled well by standard tokenizers. Instead of writing a custom tokenizer, we can create a tokenizer model that can be used to perform the tokenization.

To demonstrate how such a model can be created, we will read training data from a file and then train a model using this data. The data is stored as a series of words separated by whitespace and <SPLIT> fields. These <SPLIT> fields provide further information about how tokens should be identified. They can help identify breaks between numbers, such as 23.6, and punctuation characters, such as commas. The training data we will use is stored in the file training-data.train, and is shown here:

These fields are used to provide further information about how tokens should be identified<SPLIT>. 
They can help identify breaks between numbers<SPLIT>, such as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>.

The data that we use does not represent unique text, but it does illustrate how to annotate text and the process used to train a model.

We will use the OpenNLP TokenizerME class's overloaded train method to create the model. The last two parameters require additional explanation. Maxent is used to determine the relationships between elements of the text.

We can specify a cutoff, the minimum number of times a feature must occur before it is included in the model. These features can be thought of as aspects of the model. The iterations value refers to the number of times the training procedure will iterate when determining the model's parameters. A few of the parameters of the TokenizerME class's train method are as follows:

Parameter                    Usage
String                       A code for the language used
ObjectStream<TokenSample>    An ObjectStream parameter containing the training data
boolean                      If true, then alphanumeric data is ignored
int                          Specifies how many times a feature is processed
int                          The number of iterations used to train the maxent model

In the example that follows, we start by defining a BufferedOutputStream object that will be used to store the new model. Several of the methods used in the example will generate exceptions, which are handled in catch blocks:

BufferedOutputStream modelOutputStream = null;
try {
    …
} catch (UnsupportedEncodingException ex) {
    // Handle the exception
} catch (IOException ex) {
    // Handle the exception
}

An instance of an ObjectStream class is created using the PlainTextByLineStream class, which takes the training file and the character encoding scheme as its constructor arguments. This stream is then used to create a second ObjectStream instance containing TokenSample objects. These objects are text with token span information included:

ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("training-data.train"), "UTF-8");
ObjectStream<TokenSample> sampleStream = 
    new TokenSampleStream(lineStream);

The train method can now be used as shown in the following code. English is specified as the language, alphanumeric data is ignored, and the cutoff and iteration values are set to 5 and 100, respectively:

TokenizerModel model = TokenizerME.train(
    "en", sampleStream, true, 5, 100);

The parameters of the train method are given in detail in the following table:

Parameter                    Meaning
Language code                A string specifying the natural language used
Samples                      The sample text
Alphanumeric optimization    If true, then alphanumeric tokens are skipped
Cutoff                       The number of times a feature is processed
Iterations                   The number of iterations performed to train the model

The next code sequence will create an output stream and then write the model out to the mymodel.bin file. The model is then ready to be used:

modelOutputStream = new BufferedOutputStream(
    new FileOutputStream(new File("mymodel.bin")));
model.serialize(modelOutputStream);

The details of the output will not be discussed here; however, it essentially chronicles the training process. The output of the sequence is as follows, although the last section has been abbreviated, with most of the iteration steps deleted to save space:

Indexing events using cutoff of 5

Dropped event F:[p=2, s=3.6,, p1=2, p1_num, p2=bok, p1f1=23, f1=3, f1_num, f2=., f2_eos, f12=3.]
Dropped event F:[p=23, s=.6,, p1=3, p1_num, p2=2, p2_num, p21=23, p1f1=3., f1=., f1_eos, f2=6, f2_num, f12=.6]
Dropped event F:[p=23., s=6,, p1=., p1_eos, p2=3, p2_num, p21=3., p1f1=.6, f1=6, f1_num, f2=,, f12=6,]
  Computing event counts...  done. 27 events
  Indexing...  done.
Sorting and merging events... done. Reduced 23 events to 4.
Done indexing.
Incorporating indexed data for training...  
done.
  Number of Event Tokens: 4
      Number of Outcomes: 2
    Number of Predicates: 4
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ...loglikelihood=-15.942385152878742  0.8695652173913043
  2:  ...loglikelihood=-9.223608340603953  0.8695652173913043
  3:  ...loglikelihood=-8.222154969329086  0.8695652173913043
  4:  ...loglikelihood=-7.885816898591612  0.8695652173913043
  5:  ...loglikelihood=-7.674336804488621  0.8695652173913043
  6:  ...loglikelihood=-7.494512270303332  0.8695652173913043
Dropped event T:[p=23.6, s=,, p1=6, p1_num, p2=., p2_eos, p21=.6, p1f1=6,, f1=,, f2=bok]
  7:  ...loglikelihood=-7.327098298508153  0.8695652173913043
  8:  ...loglikelihood=-7.1676028756216965  0.8695652173913043
  9:  ...loglikelihood=-7.014728408489079  0.8695652173913043
...
100:  ...loglikelihood=-2.3177060257465376  1.0

We can use the model as shown in the following sequence. This is the same technique we used in the section Using the TokenizerME class. The only difference is the model used here:

try {
    paragraph = "A demonstration of how to train a tokenizer.";
    InputStream modelIn = new FileInputStream(new File(
        ".", "mymodel.bin"));
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(paragraph);
    for (String token : tokens) {
        System.out.println(token);
    }
} catch (IOException ex) {
    ex.printStackTrace();
}

The output is as follows:

A
demonstration
of
how
to
train
a
tokenizer
.

Comparing tokenizers

A brief comparison of the NLP API tokenizers is shown in the following table. The tokens generated are listed under the tokenizer's name. They are based on the same text, "Let's pause, \nand then reflect.". Keep in mind that the output is based on a simple use of the classes. There may be options not included in the examples that will influence how the tokens are generated. The intent is to simply show the type of output that can be expected based on the sample code and data.

SimpleTokenizer  WhitespaceTokenizer  TokenizerME  PTBTokenizer  DocumentPreprocessor  IndoEuropeanTokenizerFactory
Let              Let's                Let          Let           Let                   Let
'                pause,               's           's            's                    '
s                and                  pause        pause         pause                 s
pause            then                 ,            ,             ,                     pause
,                reflect.             and          and           and                   ,
and                                   then         then          then                  and
then                                  reflect      reflect       reflect               then
reflect                               .            .             .                     reflect
.                                                                                      .
