In this section, we will demonstrate several different tokenization techniques using the OpenNLP, Stanford, and LingPipe APIs. Although a number of other APIs are available, we restrict the demonstration to these three. The examples will give you an idea of what techniques are available.
We will use a string called paragraph to illustrate these techniques. The string includes a newline, which may occur in real text in unexpected places. It is defined here:

private String paragraph = "Let's pause, \nand then "
    + "reflect.";
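Before turning to the library tokenizers, it is worth seeing the baseline that plain Java provides. The following sketch (standard Java only, no NLP library; the class name WhitespaceSplitDemo is ours) splits the paragraph on runs of whitespace; note that the embedded newline is treated like any other whitespace:

```java
public class WhitespaceSplitDemo {
    public static String[] split(String text) {
        // \s+ matches runs of spaces, tabs, and newlines alike
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, \nand then reflect.";
        for (String token : split(paragraph)) {
            System.out.println(token);
        }
    }
}
```

Punctuation stays attached to the words (Let's, pause,, reflect.), which is exactly the limitation the tokenizers discussed next address.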
OpenNLP possesses a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods: tokenize, which returns the tokens as an array of strings, and tokenizePos, which returns the spans of the tokens.
Each of these classes is demonstrated in the following sections.
As the name implies, the SimpleTokenizer class performs simple tokenization of text. The INSTANCE field is used to instantiate the class, as shown in the following code sequence. The tokenize method is executed against the paragraph variable and the tokens are then displayed:

SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = simpleTokenizer.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}
When executed, we get the following output:

Let
'
s
pause
,
and
then
reflect
.

Using this tokenizer, punctuation is returned as separate tokens.
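This behavior can be approximated in plain Java by splitting on character-class changes. The following is an illustration of the general technique (a regex over letter runs, digit runs, and individual punctuation characters), not OpenNLP's actual implementation; the class name is ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassTokenizerDemo {
    // One alternative per character class: letter runs, digit runs,
    // or a single non-whitespace punctuation character
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}]+|[\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Let's pause, \nand then reflect."));
    }
}
```

Applied to the paragraph, this yields the same nine tokens, with the apostrophe and the period standing alone.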
As its name implies, this class uses whitespace as the delimiter. In the following code sequence, an instance of the tokenizer is created and its tokenize method is executed using paragraph as input. The for statement then displays the tokens:

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}
The output is as follows:

Let's
pause,
and
then
reflect.
Although this does not separate contractions and similar units of text, it can be useful for some applications. The class also possesses a tokenizePos method that returns the boundaries of the tokens.
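The span-based view that tokenizePos provides can be mimicked in plain Java. The sketch below uses a hypothetical Span record of our own, not OpenNLP's Span class, to record the begin and end offsets of whitespace-delimited tokens:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSpanDemo {
    // Minimal stand-in for a begin/end token span
    public record Span(int begin, int end) { }

    public static List<Span> spans(String text) {
        List<Span> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace, then consume one token
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > begin) result.add(new Span(begin, i));
        }
        return result;
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, \nand then reflect.";
        for (Span s : spans(paragraph)) {
            System.out.println(paragraph.substring(s.begin(), s.end())
                + " (" + s.begin() + "-" + s.end() + ")");
        }
    }
}
```

Keeping offsets rather than strings lets a caller recover the exact original text of each token, which matters when the input contains newlines or unusual spacing.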
The TokenizerME class uses a model created with maximum entropy (maxent), a statistical technique, to perform tokenization. The maxent model is used to determine the relationship between data, in our case, text. Some text sources, such as various social media, are not well formatted and use a lot of slang and special symbols such as emoticons. A statistical tokenizer, such as the maxent model, improves the quality of the tokenization process.

A detailed discussion of this model is not possible here due to its complexity. A good starting point for an interested reader can be found at http://en.wikipedia.org/w/index.php?title=Multinomial_logistic_regression&redirect=no.
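To give a flavor of the idea: a maxent classifier scores each possible outcome (for tokenization, roughly "split here" versus "don't split") and normalizes the scores into probabilities with a softmax. The sketch below shows only that normalization step; the scores are invented for illustration and are not values from any trained model:

```java
public class MaxentScoreDemo {
    // Softmax over outcome scores: the normalization step at the
    // core of a maxent (multinomial logistic regression) classifier
    public static double[] softmax(double[] scores) {
        double sum = 0;
        double[] p = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            p[i] = Math.exp(scores[i]);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical scores for "split" vs "no split"; in a real model
        // each is a dot product of feature values and learned weights
        double[] probs = softmax(new double[] {1.2, -0.4});
        System.out.printf("P(split)=%.3f P(no-split)=%.3f%n", probs[0], probs[1]);
    }
}
```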
A TokenizerModel class hides the model and is used to instantiate the tokenizer. The model must have been previously trained. In the next example, the tokenizer is instantiated using the model found in the en-token.bin file. This model has been trained to work with common English text.

The location of the model file is returned by the getModelDir method, which you will need to implement. The returned value depends on where the models are stored on your system. Many of these models can be found at http://opennlp.sourceforge.net/models-1.5/.
After an instance of the FileInputStream class is created, the input stream is used as the argument of the TokenizerModel constructor. The tokenize method will generate an array of strings. This is followed by code to display the tokens:

try {
    InputStream modelInputStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
    TokenizerModel model = new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(paragraph);
    for (String token : tokens) {
        System.out.println(token);
    }
} catch (IOException ex) {
    // Handle the exception
}
The output is as follows:

Let
's
pause
,
and
then
reflect
.
Tokenization is supported by several Stanford NLP API classes; a few of them are as follows:

The PTBTokenizer class
The DocumentPreprocessor class
The StanfordCoreNLP class used as a pipeline

Each of these examples will use the paragraph string as defined earlier.
This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer (http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and its support for Unicode. The PTBTokenizer class supports several older constructors; however, it is suggested that the three-argument constructor be used. This constructor uses a Reader object, a LexedTokenFactory<T> argument, and a string to specify which of the several options to use.
The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default. We will demonstrate the use of both classes.
The CoreLabelTokenFactory class is used in the following example. A StringReader instance is created using paragraph. The last argument is used for the options, which is null for this example. The Iterator interface is implemented by the PTBTokenizer class, allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}
The output is as follows:

Let
's
pause
,
and
then
reflect
.
The same output can be obtained using the WordTokenFactory class, as shown here:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), new WordTokenFactory(), null);
The power of the CoreLabelTokenFactory class is realized with the options parameter of the PTBTokenizer constructor. These options provide a means to control the behavior of the tokenizer. Options include such controls as how to handle quotes, how to map ellipses, and whether to expect British or American English spellings. A list of options can be found at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html.
In the following code sequence, the PTBTokenizer object is created using the CoreLabelTokenFactory variable ctf along with an option of "invertible=true". This option allows us to obtain and use a CoreLabel object, which will give us the beginning and ending position of each token:

CoreLabelTokenFactory ctf = new CoreLabelTokenFactory();
PTBTokenizer ptb = new PTBTokenizer(
    new StringReader(paragraph), ctf, "invertible=true");
while (ptb.hasNext()) {
    CoreLabel cl = (CoreLabel) ptb.next();
    System.out.println(cl.originalText() + " ("
        + cl.beginPosition() + "-" + cl.endPosition() + ")");
}
The output of this sequence is as follows. The numbers within the parentheses indicate the tokens' beginning and ending positions:

Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)
The DocumentPreprocessor class tokenizes input from an input stream. In addition, it implements the Iterable interface, making it easy to traverse the tokenized sequence. The tokenizer supports the tokenization of simple text and XML data.
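Stanford's implementation aside, the JDK itself can split text into sentences before tokenization. The following is a rough sketch using java.text.BreakIterator (plain Java, unrelated to DocumentPreprocessor's internals; the class name is ours):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        // Locale-aware sentence boundary detection from the JDK
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("Let's pause. And then reflect.")) {
            System.out.println(sentence);
        }
    }
}
```

Each sentence could then be fed to any of the tokenizers shown in this section.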
To illustrate this process, we will use an instance of the StringReader class that uses the paragraph string, as defined here:

Reader reader = new StringReader(paragraph);
An instance of the DocumentPreprocessor class is then instantiated:

DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);
The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>> interface. The HasWord interface contains two methods that deal with words: setWord and word. The latter method returns a word as a string. In the next code sequence, the DocumentPreprocessor class splits the input text into sentences, which are stored as a List<HasWord>. An Iterator object is used to extract a sentence, and then a for-each statement displays the tokens:

Iterator<List<HasWord>> it = documentPreprocessor.iterator();
while (it.hasNext()) {
    List<HasWord> sentence = it.next();
    for (HasWord token : sentence) {
        System.out.println(token);
    }
}
When executed, we get the following output:

Let
's
pause
,
and
then
reflect
.
Here, we will use the StanfordCoreNLP class as demonstrated in Chapter 1, Introduction to NLP. However, we use a simpler annotator string to tokenize the paragraph. As shown next, a Properties object is created and assigned the annotators tokenize and ssplit.

The tokenize annotator specifies that tokenization will occur, and the ssplit annotator results in the sentences being split:

Properties properties = new Properties();
properties.put("annotators", "tokenize, ssplit");
The StanfordCoreNLP and Annotation classes are created next:

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
Annotation annotation = new Annotation(paragraph);
The annotate method is executed to tokenize the text, and then the prettyPrint method displays the tokens:

pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, System.out);
Various statistics are displayed, followed by the tokens marked up with position information, as follows:

Sentence #1 (8 tokens):
Let's pause,
and then reflect.
[Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3]
[Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5]
[Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11]
[Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12]
[Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17]
[Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22]
[Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30]
[Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]
LingPipe supports a number of tokenizers. In this section, we will illustrate the use of the IndoEuropeanTokenizerFactory class. In later sections, we will demonstrate other ways that LingPipe supports tokenization. Its INSTANCE field provides an instance of an Indo-European tokenizer. The tokenizer method returns an instance of a Tokenizer class based on the text to be processed, as shown here:

char text[] = paragraph.toCharArray();
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
Tokenizer tokenizer =
    tokenizerFactory.tokenizer(text, 0, text.length);
for (String token : tokenizer) {
    System.out.println(token);
}
The output is as follows:

Let
'
s
pause
,
and
then
reflect
.
These tokenizers support the tokenization of "normal" text. In the next section, we will demonstrate how a tokenizer can be trained to deal with unique text.
Training a tokenizer is useful when we encounter text that is not handled well by standard tokenizers. Instead of writing a custom tokenizer, we can create a tokenizer model that can be used to perform the tokenization.
To demonstrate how such a model can be created, we will read training data from a file and then train a model using this data. The data is stored as a series of words separated by whitespace and <SPLIT> fields. The <SPLIT> field is used to provide further information about how tokens should be identified. It can help identify breaks between numbers, such as 23.6, and punctuation characters such as commas. The training data we will use is stored in the file training-data.train, and is shown here:

These fields are used to provide further information about how tokens should be identified<SPLIT>. They can help identify breaks between numbers<SPLIT>, such as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>.
The data that we use does not represent unique text, but it does illustrate how to annotate text and the process used to train a model.
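The <SPLIT> markup itself is straightforward to process. The following sketch (plain Java, not OpenNLP's TokenSampleStream; the class name is ours) shows how an annotated line can be reduced to tokens by treating <SPLIT> as a boundary in addition to whitespace:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitMarkerDemo {
    public static List<String> toTokens(String annotated) {
        List<String> tokens = new ArrayList<>();
        // Whitespace always separates tokens; <SPLIT> marks a boundary
        // with no intervening whitespace
        for (String chunk : annotated.split("\\s+")) {
            tokens.addAll(Arrays.asList(chunk.split("<SPLIT>")));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(toTokens("such as 23.6<SPLIT>, punctuation"));
    }
}
```

For example, the chunk 23.6<SPLIT>, becomes the two tokens 23.6 and the comma, which is exactly the distinction the training data is meant to teach.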
We will use the OpenNLP TokenizerME class' overloaded train method to create the model. The last two parameters require additional explanation. Maxent is used to determine the relationship between elements of text.

We can specify the number of times a feature must occur before it is included in the model. These features can be thought of as aspects of the model. Iterations refer to the number of times the training procedure will iterate when determining the model's parameters. A few of the TokenizerME.train parameters are as follows:
Parameter | Usage
---|---
languageCode | A code for the language used
sampleStream | An ObjectStream of sample text
useAlphaNumericOptimization | If true, alphanumeric data is not split
cutoff | The number of times a feature must occur before it is included in the model
iterations | The number of iterations used to train the maxent model
In the example that follows, we start by defining a BufferedOutputStream object that will be used to store the new model. Several of the methods used in the example will generate exceptions, which are handled in catch blocks:

BufferedOutputStream modelOutputStream = null;
try {
    …
} catch (UnsupportedEncodingException ex) {
    // Handle the exception
} catch (IOException ex) {
    // Handle the exception
}
An instance of the ObjectStream class is created using the PlainTextByLineStream class. This uses the training file and the character encoding scheme as its constructor arguments. It is then used to create a second ObjectStream instance of TokenSample objects. These objects are text with token span information included:

ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("training-data.train"), "UTF-8");
ObjectStream<TokenSample> sampleStream =
    new TokenSampleStream(lineStream);
The train method can now be used as shown in the following code. English is specified as the language. The feature cutoff and iteration values are set to 5 and 100, respectively:

TokenizerModel model = TokenizerME.train(
    "en", sampleStream, true, 5, 100);
The parameters of the train method are given in detail in the following table:

Parameter | Meaning
---|---
Language code | A string specifying the natural language used
Samples | The sample text
Alphanumeric optimization | If true, alphanumeric data is not split
Cutoff | The number of times a feature must be seen before it is used
Iterations | The number of iterations performed to train the model
The next code sequence will create an output stream and then write the model out to the mymodel.bin file. The model is then ready to be used:

BufferedOutputStream modelOutputStream = new BufferedOutputStream(
    new FileOutputStream(new File("mymodel.bin")));
model.serialize(modelOutputStream);
The details of the output will not be discussed here. However, it essentially chronicles the training process. The output of the sequence is as follows, but the last section has been abbreviated, where most of the iteration steps have been deleted to save space:

Indexing events using cutoff of 5
Dropped event F:[p=2, s=3.6,, p1=2, p1_num, p2=bok, p1f1=23, f1=3, f1_num, f2=., f2_eos, f12=3.]
Dropped event F:[p=23, s=.6,, p1=3, p1_num, p2=2, p2_num, p21=23, p1f1=3., f1=., f1_eos, f2=6, f2_num, f12=.6]
Dropped event F:[p=23., s=6,, p1=., p1_eos, p2=3, p2_num, p21=3., p1f1=.6, f1=6, f1_num, f2=,, f12=6,]
Computing event counts... done. 27 events
Indexing... done.
Sorting and merging events... done. Reduced 23 events to 4.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 4
Number of Outcomes: 2
Number of Predicates: 4
...done.
Computing model parameters ...
Performing 100 iterations.
1: ...loglikelihood=-15.942385152878742 0.8695652173913043
2: ...loglikelihood=-9.223608340603953 0.8695652173913043
3: ...loglikelihood=-8.222154969329086 0.8695652173913043
4: ...loglikelihood=-7.885816898591612 0.8695652173913043
5: ...loglikelihood=-7.674336804488621 0.8695652173913043
6: ...loglikelihood=-7.494512270303332 0.8695652173913043
Dropped event T:[p=23.6, s=,, p1=6, p1_num, p2=., p2_eos, p21=.6, p1f1=6,, f1=,, f2=bok]
7: ...loglikelihood=-7.327098298508153 0.8695652173913043
8: ...loglikelihood=-7.1676028756216965 0.8695652173913043
9: ...loglikelihood=-7.014728408489079 0.8695652173913043
...
100: ...loglikelihood=-2.3177060257465376 1.0
We can use the model as shown in the following sequence. This is the same technique we used in the section Using the TokenizerME class. The only difference is the model used here:

try {
    paragraph = "A demonstration of how to train a tokenizer.";
    InputStream modelIn = new FileInputStream(new File(
        ".", "mymodel.bin"));
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(paragraph);
    for (String token : tokens) {
        System.out.println(token);
    }
} catch (IOException ex) {
    ex.printStackTrace();
}

The output is as follows:

A
demonstration
of
how
to
train
a
tokenizer
.
A brief comparison of the NLP APIs tokenizers is shown in the following table. The tokens generated are listed under the tokenizer's name. They are based on the same text, "Let's pause, and then reflect.". Keep in mind that the output is based on a simple use of the classes. There may be options not included in the examples that will influence how the tokens are generated. The intent is to simply show the type of output that can be expected based on the sample code and data.
SimpleTokenizer | WhitespaceTokenizer | TokenizerME | PTBTokenizer | DocumentPreprocessor | IndoEuropeanTokenizerFactory
---|---|---|---|---|---
Let | Let's | Let | Let | Let | Let
' | pause, | 's | 's | 's | '
s | and | pause | pause | pause | s
pause | then | , | , | , | pause
, | reflect. | and | and | and | ,
and | | then | then | then | and
then | | reflect | reflect | reflect | then
reflect | | . | . | . | reflect
. | | | | | .