We will demonstrate POS tagging using OpenNLP, the Stanford API, and LingPipe. Each of the examples will use the following sentence. It is the first sentence of Chapter 5, At a Venture, of Twenty Thousand Leagues Under the Sea by Jules Verne:
private String[] sentence = {"The", "voyage", "of", "the", "Abraham", "Lincoln", "was", "for", "a", "long", "time", "marked", "by", "no", "special", "incident."};
The text to be processed may not always be defined in this fashion. Sometimes the sentence will be available as a single string:
String theSentence = "The voyage of the Abraham Lincoln was for a " + "long time marked by no special incident.";
We might need to convert such a string to an array of words. There are numerous techniques for doing this. The following tokenizeSentence method performs this operation:
public String[] tokenizeSentence(String sentence) {
    String words[] = sentence.split("\\s+");
    return words;
}
The following code demonstrates the use of this method:
String words[] = tokenizeSentence(theSentence);
for (String word : words) {
    System.out.print(word + " ");
}
System.out.println();
The voyage of the Abraham Lincoln was for a long time marked by no special incident.
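The split call can be exercised on its own. The following is a minimal, self-contained sketch (no NLP libraries required) showing that the whitespace regex "\\s+" recovers the 16 tokens, with punctuation still attached to the last word:

```java
public class TokenizeDemo {
    // Split a sentence on runs of whitespace. Note the regex must be
    // "\\s+" (whitespace); "S+" would instead split on runs of capital S.
    public static String[] tokenizeSentence(String sentence) {
        return sentence.split("\\s+");
    }

    public static void main(String[] args) {
        String theSentence = "The voyage of the Abraham Lincoln was for a "
                + "long time marked by no special incident.";
        String[] words = tokenizeSentence(theSentence);
        System.out.println(words.length);  // → 16
        System.out.println(words[15]);     // → incident.
    }
}
```

Because the period stays attached to "incident.", a tagger will treat "incident." as a single token, which is why that token's tag differs from runs where the punctuation is tokenized separately.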
Alternately, we could use a tokenizer such as OpenNLP's WhitespaceTokenizer
class, as shown here:
String words[] = WhitespaceTokenizer.INSTANCE.tokenize(theSentence);
OpenNLP provides several classes in support of POS tagging. We will demonstrate how to use the POSTaggerME
class to perform basic tagging and the ChunkerME
class to perform chunking. Chunking involves grouping related words according to their types. This can provide additional insight into the structure of a sentence. We will also examine the creation and use of a POSDictionary
instance.
The OpenNLP POSTaggerME
class uses maximum entropy to process the tags. The tagger determines the type of tag based on the word itself and the word's context. Any given word may have multiple tags associated with it. The tagger uses a probability model to determine the specific tag to be assigned.
POS models are loaded from a file. The en-pos-maxent.bin
model is used frequently and is based on the Penn TreeBank tag set. Various pretrained POS models for OpenNLP can be found at http://opennlp.sourceforge.net/models-1.5/.
We start with a try-catch block to handle any IOException
that might be generated when loading a model, as shown here.
We use the en-pos-maxent.bin
file for the model:
try (InputStream modelIn = new FileInputStream(
        new File(getModelDir(), "en-pos-maxent.bin"))) {
    …
} catch (IOException e) {
    // Handle exceptions
}
Next, create the POSModel
and POSTaggerME
instances as shown here:
POSModel model = new POSModel(modelIn);
POSTaggerME tagger = new POSTaggerME(model);
The tag
method can now be applied to the tagger using the text to be processed as its argument:
String tags[] = tagger.tag(sentence);
The words and their tags are then displayed as shown here:
for (int i = 0; i < sentence.length; i++) {
    System.out.print(sentence[i] + "/" + tags[i] + " ");
}
The output is as follows. Each word is followed by its type:
The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN
With any sentence, there may be more than one possible assignment of tags to words. The topKSequences
method will return a set of sequences based on their probability of being correct. In the next code sequence, the topKSequences
method is executed using the sentence
variable and then displayed:
Sequence topSequences[] = tagger.topKSequences(sentence);
for (int i = 0; i < topSequences.length; i++) {
    System.out.println(topSequences[i]);
}
Its output follows. The first number represents a weighted score, and the tags within the brackets are the sequence of tags scored:
-0.5563571615737618 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, NN]
-2.9886144610050907 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, .]
-3.771930515521527 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, NN, NN]
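These scores behave like log probabilities, so the best sequence is simply the one with the largest (least negative) value. The following self-contained sketch shows that selection; the scores are hard-coded from the output above:

```java
public class BestSequence {
    // Given per-sequence scores (larger is better), return the index
    // of the highest-scoring sequence.
    public static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {-0.556, -2.989, -3.772};
        System.out.println(argmax(scores));  // → 0
    }
}
```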
The Sequence
class has several methods, as detailed in the following table:
Method | Meaning
---|---
getOutcomes() | Returns a list of strings representing the tags for the sentence
getProbs() | Returns an array of probabilities for each tag in the sequence
getScore() | Returns a weighted value for the sequence
In the following sequence, we use several of these methods to demonstrate what they do. For each sequence, the tags and their probability are displayed, separated by a forward slash:
for (int i = 0; i < topSequences.length; i++) {
    List<String> outcomes = topSequences[i].getOutcomes();
    double probabilities[] = topSequences[i].getProbs();
    for (int j = 0; j < outcomes.size(); j++) {
        System.out.printf("%s/%5.3f ", outcomes.get(j), probabilities[j]);
    }
    System.out.println();
}
System.out.println();
The output is as follows. Each pair of lines represents one sequence where the output has been wrapped:
DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860
IN/0.985 DT/0.960 JJ/0.919 NN/0.832
DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860
IN/0.985 DT/0.960 JJ/0.919 ./0.073
DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860
IN/0.985 DT/0.960 NN/0.073 NN/0.419
The process of chunking involves breaking a sentence into parts or chunks. These chunks can then be annotated with tags. We will use the ChunkerME
class to illustrate how this is accomplished. This class uses a model loaded into a ChunkerModel
instance. The ChunkerME
class' chunk
method performs the actual chunking process. We will also examine the use of the chunkAsSpans
method to return information about the span of these chunks. This allows us to see how long a chunk is and what elements make up the chunk.
We will use the en-pos-maxent.bin
file to create a model for the POSTaggerME
instance. We need to use this instance to tag the text as we did in the Using OpenNLP POSTaggerME class for POS taggers section earlier in this chapter. We will also use the en-chunker.bin
file to create a ChunkerModel
instance to be used with the ChunkerME
instance.
These models are created using input streams, as shown in the following example.
We use a try-with-resources block to open and close files and to deal with any exceptions that may be thrown:
try (InputStream posModelStream = new FileInputStream(
        getModelDir() + "/en-pos-maxent.bin");
     InputStream chunkerStream = new FileInputStream(
        getModelDir() + "/en-chunker.bin")) {
    …
} catch (IOException ex) {
    // Handle exceptions
}
The following code sequence creates and uses a tagger to find the POS of the sentence. The sentence and its tags are then displayed:
POSModel model = new POSModel(posModelStream);
POSTaggerME tagger = new POSTaggerME(model);
String tags[] = tagger.tag(sentence);
for (int i = 0; i < tags.length; i++) {
    System.out.print(sentence[i] + "/" + tags[i] + " ");
}
System.out.println();
The output is as follows. We have shown this output so that it will be clear how the chunker works:
The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN
A ChunkerModel
instance is created using the input stream. From this, the ChunkerME
instance is created followed by the use of the chunk
method as shown here. The chunk
method will use the sentence's token and its tags to create an array of strings. Each string will hold information about the token and its chunk:
ChunkerModel chunkerModel = new ChunkerModel(chunkerStream);
ChunkerME chunkerME = new ChunkerME(chunkerModel);
String result[] = chunkerME.chunk(sentence, tags);
Each token in the result array and its chunk tag are displayed as shown here:
for (int i = 0; i < result.length; i++) {
    System.out.println("[" + sentence[i] + "] " + result[i]);
}
The output is as follows. The token is enclosed in brackets followed by the chunk tag. These tags are explained in the following table:
First Part | Meaning
---|---
B | Beginning of a tag
I | Continuation of a tag
E | End of a tag (will not appear if the tag is one word long)

Second Part | Meaning
---|---
NP | Noun chunk
VP | Verb chunk
Multiple words are grouped together such as "The voyage" and "the Abraham Lincoln".
[The] B-NP
[voyage] I-NP
[of] B-PP
[the] B-NP
[Abraham] I-NP
[Lincoln] I-NP
[was] B-VP
[for] B-PP
[a] B-NP
[long] I-NP
[time] I-NP
[marked] B-VP
[by] B-PP
[no] B-NP
[special] I-NP
[incident.] I-NP
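The B-/I- scheme in this output can be reassembled into phrases with a short loop. The following is a self-contained sketch of that regrouping; the token and tag arrays are hard-coded from the output above, and groupChunks is a hypothetical helper, not part of OpenNLP:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkGrouper {
    // Merge tokens whose chunk tags start with "I-" into the preceding
    // "B-" chunk, yielding one string per chunk such as "The voyage".
    public static List<String> groupChunks(String[] tokens, String[] chunkTags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // A "B-" tag starts a new chunk, so flush the previous one.
            if (chunkTags[i].startsWith("B-") && current.length() > 0) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(tokens[i]);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "voyage", "of", "the", "Abraham", "Lincoln"};
        String[] tags = {"B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP"};
        System.out.println(groupChunks(tokens, tags));
        // → [The voyage, of, the Abraham Lincoln]
    }
}
```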
If we are interested in getting more detailed information about the chunks, we can use the ChunkerME
class' chunkAsSpans
method. This method returns an array of Span
objects. Each object represents one span found in the text.
There are several Span class methods available. Here, we will illustrate the use of the getType
, getStart
, and getEnd
methods. The getType
method returns the second part of the chunk tag, and the getStart
and getEnd
methods return the beginning and ending index of the tokens, respectively, in the original sentence
array. The length
method returns the length of the span in number of tokens.
In the following sequence, the chunkAsSpans
method is executed using the sentence
and tags
arrays. The spans
array is then displayed. The outer for loop processes one Span
object at a time displaying the basic span information. The inner for loop displays the spanned text enclosed within brackets:
Span[] spans = chunkerME.chunkAsSpans(sentence, tags);
for (Span span : spans) {
    System.out.print("Type: " + span.getType() + " - "
        + " Begin: " + span.getStart()
        + " End:" + span.getEnd()
        + " Length: " + span.length() + " [");
    for (int j = span.getStart(); j < span.getEnd(); j++) {
        System.out.print(sentence[j] + " ");
    }
    System.out.println("]");
}
The following output clearly shows the span type, its position in the sentence
array, its length, and then the actual spanned text:
Type: NP -  Begin: 0 End:2 Length: 2 [The voyage ]
Type: PP -  Begin: 2 End:3 Length: 1 [of ]
Type: NP -  Begin: 3 End:6 Length: 3 [the Abraham Lincoln ]
Type: VP -  Begin: 6 End:7 Length: 1 [was ]
Type: PP -  Begin: 7 End:8 Length: 1 [for ]
Type: NP -  Begin: 8 End:11 Length: 3 [a long time ]
Type: VP -  Begin: 11 End:12 Length: 1 [marked ]
Type: PP -  Begin: 12 End:13 Length: 1 [by ]
Type: NP -  Begin: 13 End:16 Length: 3 [no special incident. ]
A tag dictionary specifies the valid tags for a word. This can prevent a tag from being applied inappropriately to a word. In addition, some search algorithms execute faster because they do not have to consider other, less probable tags.
In this section, we will demonstrate how to obtain a tag dictionary, determine the tags assigned to a word, and add or change those tags.
As with the previous example, we will use a try-with-resources block to open our input streams for the POS model and then create our model and tagger factory, as shown here:
try (InputStream modelIn = new FileInputStream(
        new File(getModelDir(), "en-pos-maxent.bin"))) {
    POSModel model = new POSModel(modelIn);
    POSTaggerFactory posTaggerFactory = model.getFactory();
    …
} catch (IOException e) {
    // Handle exceptions
}
We used the POSModel
class' getFactory
method to get a POSTaggerFactory
instance. We will use its getTagDictionary
method to obtain its TagDictionary
instance. This is illustrated here:
MutableTagDictionary tagDictionary =
    (MutableTagDictionary) posTaggerFactory.getTagDictionary();
The MutableTagDictionary
interface extends the TagDictionary
interface. The TagDictionary
interface possesses a getTags
method, and the MutableTagDictionary
interface adds a put
method that allows tags to be added to the dictionary. These interfaces are implemented by the POSDictionary
class.
To obtain the tags for a given word, use the getTags
method. This returns an array of tags represented by strings. The tags are then displayed as shown here:
String tags[] = tagDictionary.getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();
The output is as follows:
/NN/VBP/VB
This means that the word "force" can be interpreted in three different ways.
The MutableTagDictionary
interface's put
method allows us to add tags to a word. The method has two arguments: the word, and its new tags. The method returns an array containing the previous tags.
In the following example, we replace the old tags with a new tag. The old tags are then displayed.
String oldTags[] = tagDictionary.put("force", "newTag");
for (String tag : oldTags) {
    System.out.print("/" + tag);
}
System.out.println();
The following output lists the old tags for the word.
/NN/VBP/VB
These tags have been replaced by the new tag as demonstrated here where the current tags are displayed:
tags = tagDictionary.getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();
All we get is the following:
/newTag
To retain the old tags we will need to create an array of strings to hold the old and the new tags and then use the array as the second argument of the put
method as shown here:
String newTags[] = new String[tags.length + 1];
for (int i = 0; i < tags.length; i++) {
    newTags[i] = tags[i];
}
newTags[tags.length] = "newTag";
oldTags = tagDictionary.put("force", newTags);
If we redisplay the current tags as shown here, we can see that the old tags have been retained and the new one added:
/NN/VBP/VB/newTag
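The manual copy loop shown earlier can be written more compactly with Arrays.copyOf. This is a self-contained sketch of merging an existing tag array with one new tag; the MutableTagDictionary.put call itself is omitted here, and appendTag is a hypothetical helper:

```java
import java.util.Arrays;

public class TagMerge {
    // Append one new tag to an existing tag array without losing the
    // old tags, mirroring the loop used before calling put.
    public static String[] appendTag(String[] tags, String newTag) {
        String[] newTags = Arrays.copyOf(tags, tags.length + 1);
        newTags[tags.length] = newTag;
        return newTags;
    }

    public static void main(String[] args) {
        String[] merged = appendTag(new String[]{"NN", "VBP", "VB"}, "newTag");
        System.out.println(Arrays.toString(merged));
        // → [NN, VBP, VB, newTag]
    }
}
```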
A new tag dictionary can be added to a POSTaggerFactory
instance. We will illustrate this process by creating a new POSTaggerFactory
and then adding the tagDictionary
we developed earlier. First, we create a new factory using the default constructor as shown next. This is followed by calling the setTagDictionary
method against the new factory.
POSTaggerFactory newFactory = new POSTaggerFactory();
newFactory.setTagDictionary(tagDictionary);
To confirm that the tag dictionary has been added, we display the tags for the word "force" as shown here:
tags = newFactory.getTagDictionary().getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();
The tags are the same as shown here:
/NN/VBP/VB/newTag
If we need to create a new dictionary, then one approach is to create an XML file containing all of the words and their tags, and then create the dictionary from the file. OpenNLP supports this approach with the POSDictionary
class' create
method.
The XML file consists of the dictionary
root element followed by a series of entry
elements. The entry
element uses the tags
attribute to specify the tags for the word. The word is contained within the entry
element as a token
element. A simple example using two words stored in the file dictionary.txt
is as follows:
<dictionary case_sensitive="false">
    <entry tags="JJ VB">
        <token>strong</token>
    </entry>
    <entry tags="NN VBP VB">
        <token>force</token>
    </entry>
</dictionary>
To create the dictionary, we use the create
method based on an input stream as shown here:
try (InputStream dictionaryIn =
        new FileInputStream(new File("dictionary.txt"))) {
    POSDictionary dictionary = POSDictionary.create(dictionaryIn);
    …
} catch (IOException e) {
    // Handle exceptions
}
The POSDictionary
class has an iterator
method that returns an iterator object. Its next
method returns a string for each word in the dictionary. We can use these methods to display the contents of the dictionary, as shown here:
Iterator<String> iterator = dictionary.iterator();
while (iterator.hasNext()) {
    String entry = iterator.next();
    String tags[] = dictionary.getTags(entry);
    System.out.print(entry + " ");
    for (String tag : tags) {
        System.out.print("/" + tag);
    }
    System.out.println();
}
The output that follows displays what we can expect:
strong /JJ/VB
force /NN/VBP/VB
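As a side check, the dictionary's XML format can be read with the JDK's built-in DOM parser. This sketch only illustrates the file layout; it is not how OpenNLP's POSDictionary.create parses the file, and DictionaryXmlDemo is a hypothetical class:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DictionaryXmlDemo {
    // Map each <token> in the dictionary XML to the value of its
    // enclosing <entry> element's tags attribute.
    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> entries = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("entry");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element entry = (Element) nodes.item(i);
            String token = entry.getElementsByTagName("token")
                .item(0).getTextContent();
            entries.put(token, entry.getAttribute("tags"));
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dictionary case_sensitive=\"false\">"
            + "<entry tags=\"JJ VB\"><token>strong</token></entry>"
            + "<entry tags=\"NN VBP VB\"><token>force</token></entry>"
            + "</dictionary>";
        System.out.println(parse(xml));
        // → {strong=JJ VB, force=NN VBP VB}
    }
}
```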
In this section, we will examine two different approaches supported by the Stanford API to perform tagging. The first technique uses the MaxentTagger
class. As its name implies, it uses maximum entropy to find the POS. We will also use this class to demonstrate a model designed to handle textese-type text. The second approach will use the pipeline approach with annotators. The English taggers use the Penn Treebank English POS tag set.
The MaxentTagger
class uses a model to perform the tagging task. There are a number of models that come bundled with the API, all with the file extension .tagger
. They include English, Chinese, Arabic, French, and German models. The English models are listed here. The prefix, wsj
, refers to models based on the Wall Street Journal. The other terms refer to techniques used to train the model. These concepts are not covered here:
wsj-0-18-bidirectional-distsim.tagger
wsj-0-18-bidirectional-nodistsim.tagger
wsj-0-18-caseless-left3words-distsim.tagger
wsj-0-18-left3words-distsim.tagger
wsj-0-18-left3words-nodistsim.tagger
english-bidirectional-distsim.tagger
english-caseless-left3words-distsim.tagger
english-left3words-distsim.tagger
The example reads in a series of sentences from a file. Each sentence is then processed and various ways of accessing and displaying the words and tags are illustrated.
We start with a try-with-resources block to deal with IO exceptions as shown here. The wsj-0-18-bidirectional-distsim.tagger
file is used to create an instance of the MaxentTagger
class.
A List
instance of List
instances of HasWord
objects is created using the MaxentTagger
class' tokenizeText
method. The sentences are read in from the file sentences.txt. The HasWord
interface represents words and contains two methods: a setWord
and a word
method. The latter method returns a word as a string. Each sentence is represented by a List
instance of HasWord
objects:
try {
    MaxentTagger tagger = new MaxentTagger(getModelDir()
        + "//wsj-0-18-bidirectional-distsim.tagger");
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText(
        new BufferedReader(new FileReader("sentences.txt")));
    …
} catch (FileNotFoundException ex) {
    // Handle exceptions
}
The sentences.txt
file contains the first four sentences of Chapter 5, At a Venture, of the book Twenty Thousand Leagues Under the Sea:
The voyage of the Abraham Lincoln was for a long time marked by no special incident. But one circumstance happened which showed the wonderful dexterity of Ned Land, and proved what confidence we might place in him. The 30th of June, the frigate spoke some American whalers, from whom we learned that they knew nothing about the narwhal. But one of them, the captain of the Monroe, knowing that Ned Land had shipped on board the Abraham Lincoln, begged for his help in chasing a whale they had in sight.
A loop is added to process each sentence of the sentences
list. The tagSentence
method returns a List
instance of TaggedWord
objects as shown next. The TaggedWord
class implements the HasWord
interface and adds a tag
method that returns the tag associated with the word. As shown here, the toString
method is used to display each sentence:
for (List<HasWord> sentence : sentences) {
    List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
    System.out.println(taggedSentence);
}
The output is as follows:
[The/DT, voyage/NN, of/IN, the/DT, Abraham/NNP, Lincoln/NNP, was/VBD, for/IN, a/DT, long/JJ, time/NN, marked/VBN, by/IN, no/DT, special/JJ, incident/NN, ./.]
[But/CC, one/CD, circumstance/NN, happened/VBD, which/WDT, showed/VBD, the/DT, wonderful/JJ, dexterity/NN, of/IN, Ned/NNP, Land/NNP, ,/,, and/CC, proved/VBD, what/WP, confidence/NN, we/PRP, might/MD, place/VB, in/IN, him/PRP, ./.]
[The/DT, 30th/JJ, of/IN, June/NNP, ,/,, the/DT, frigate/NN, spoke/VBD, some/DT, American/JJ, whalers/NNS, ,/,, from/IN, whom/WP, we/PRP, learned/VBD, that/IN, they/PRP, knew/VBD, nothing/NN, about/IN, the/DT, narwhal/NN, ./.]
[But/CC, one/CD, of/IN, them/PRP, ,/,, the/DT, captain/NN, of/IN, the/DT, Monroe/NNP, ,/,, knowing/VBG, that/IN, Ned/NNP, Land/NNP, had/VBD, shipped/VBN, on/IN, board/NN, the/DT, Abraham/NNP, Lincoln/NNP, ,/,, begged/VBN, for/IN, his/PRP$, help/NN, in/IN, chasing/VBG, a/DT, whale/NN, they/PRP, had/VBD, in/IN, sight/NN, ./.]
Alternately, we can use the Sentence
class' listToString
method to convert the tagged sentence to a simple String
object.
When its second parameter is false, the toString method of each HasWord is used to create the resulting string, as shown here:
for (List<HasWord> sentence : sentences) {
    List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
    System.out.println(Sentence.listToString(taggedSentence, false));
}
This produces a more aesthetically pleasing output:
The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.
But/CC one/CD circumstance/NN happened/VBD which/WDT showed/VBD the/DT wonderful/JJ dexterity/NN of/IN Ned/NNP Land/NNP ,/, and/CC proved/VBD what/WP confidence/NN we/PRP might/MD place/VB in/IN him/PRP ./.
The/DT 30th/JJ of/IN June/NNP ,/, the/DT frigate/NN spoke/VBD some/DT American/JJ whalers/NNS ,/, from/IN whom/WP we/PRP learned/VBD that/IN they/PRP knew/VBD nothing/NN about/IN the/DT narwhal/NN ./.
But/CC one/CD of/IN them/PRP ,/, the/DT captain/NN of/IN the/DT Monroe/NNP ,/, knowing/VBG that/IN Ned/NNP Land/NNP had/VBD shipped/VBN on/IN board/NN the/DT Abraham/NNP Lincoln/NNP ,/, begged/VBN for/IN his/PRP$ help/NN in/IN chasing/VBG a/DT whale/NN they/PRP had/VBD in/IN sight/NN ./.
We can use the following code sequence to produce the same results. The word
and tag
methods extract the words and their tags:
List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
for (TaggedWord taggedWord : taggedSentence) {
    System.out.print(taggedWord.word() + "/" + taggedWord.tag() + " ");
}
System.out.println();
If we are only interested in finding specific occurrences of a given tag, we can use a sequence such as the following, which will list only the nouns (tags beginning with NN):
List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
for (TaggedWord taggedWord : taggedSentence) {
    if (taggedWord.tag().startsWith("NN")) {
        System.out.print(taggedWord.word() + " ");
    }
}
System.out.println();
The nouns are displayed for each sentence as shown here:
NN Tagged: voyage Abraham Lincoln time incident
NN Tagged: circumstance dexterity Ned Land confidence
NN Tagged: June frigate whalers nothing narwhal
NN Tagged: captain Monroe Ned Land board Abraham Lincoln help whale sight
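Filtering on a tag prefix generalizes naturally to counting tag frequencies across a sentence. The following self-contained sketch counts tags in a hard-coded array taken from the first sentence's output; TagCounter is a hypothetical class, not part of the Stanford API:

```java
import java.util.Map;
import java.util.TreeMap;

public class TagCounter {
    // Count how often each POS tag occurs; a TreeMap keeps the
    // tags in alphabetical order for readable output.
    public static Map<String, Integer> countTags(String[] tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tags = {"DT", "NN", "IN", "DT", "NNP", "NNP", "VBD",
            "IN", "DT", "JJ", "NN", "VBN", "IN", "DT", "JJ", "NN"};
        System.out.println(countTags(tags));
        // → {DT=4, IN=3, JJ=2, NN=3, NNP=2, VBD=1, VBN=1}
    }
}
```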
We can use a different model to handle Twitter text, which may include textese. GATE (https://gate.ac.uk/wiki/twitter-postagger.html) has developed a model for Twitter text. The model is used here to process textese:
MaxentTagger tagger = new MaxentTagger(getModelDir() + "//gate-EN-twitter.model");
Here, we use the MaxentTagger
class' tagString
method from the What makes POS difficult? section earlier in this chapter to process the textese:
System.out.println(tagger.tagString("AFAIK she H8 cth!"));
System.out.println(tagger.tagString(
    "BTW had a GR8 tym at the party BBIAM."));
The output will be as follows:
AFAIK_NNP she_PRP H8_VBP cth!_NN
BTW_UH had_VBD a_DT GR8_NNP tym_NNP at_IN the_DT party_NN BBIAM._NNP
We have used the Stanford pipeline in several previous examples. In this example, we will use the Stanford pipeline to extract POS tags. As with our previous Stanford examples, we create a pipeline based on a set of annotators: tokenize
, ssplit
, and pos
.
These will tokenize, split the text into sentences, and then find the POS tags:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
To process the text, we will use the theSentence
variable as input to Annotator
. The pipeline's annotate
method is then invoked as shown here:
Annotation document = new Annotation(theSentence);
pipeline.annotate(document);
Since the pipeline can perform different types of processing, a list of CoreMap
objects is used to access the words and tags. The Annotation
class' get
method returns the list of sentences, as shown here.
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
The contents of the CoreMap
objects can be accessed using its get
method. The method's argument is the class for the information needed. As shown in the following code example, tokens are accessed using the TextAnnotation
class, and the POS tags can be retrieved using the PartOfSpeechAnnotation
class. Each word of each sentence and its tags is displayed:
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String pos = token.get(PartOfSpeechAnnotation.class);
        System.out.print(word + "/" + pos + " ");
    }
    System.out.println();
}
The output will be as follows:
The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.
The pipeline can use additional options to control how the tagger works. For example, by default the english-left3words-distsim.tagger
tagger model is used. We can specify a different model using the pos.model
property, as shown here. There is also a pos.maxlen
property to control the maximum sentence size:
props.put("pos.model", "C:/.../Models/english-caseless-left3words-distsim.tagger");
Sometimes it is useful to have a tagged document that is XML formatted. The StanfordCoreNLP
class' xmlPrint
method will write out such a document. The method's first argument is the annotator to be displayed. Its second argument is the OutputStream
object to write to. In the following code sequence, the previous tagging results are written to standard output. It is enclosed in a try-catch block to handle IO exceptions:
try {
    pipeline.xmlPrint(document, System.out);
} catch (IOException ex) {
    // Handle exceptions
}
A partial listing of the results is as follows. Only the first two words and the last word are displayed. Each token tag contains the word, its position, and its POS tag:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>The</word>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>3</CharacterOffsetEnd>
            <POS>DT</POS>
          </token>
          <token id="2">
            <word>voyage</word>
            <CharacterOffsetBegin>4</CharacterOffsetBegin>
            <CharacterOffsetEnd>10</CharacterOffsetEnd>
            <POS>NN</POS>
          </token>
          …
          <token id="17">
            <word>.</word>
            <CharacterOffsetBegin>83</CharacterOffsetBegin>
            <CharacterOffsetEnd>84</CharacterOffsetEnd>
            <POS>.</POS>
          </token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>
The prettyPrint
method works in a similar manner:
pipeline.prettyPrint(document, System.out);
However, the output is not really that pretty, as shown here. The original sentence is displayed followed by each word, its position, and its tag. The output has been formatted to make it more readable:
The voyage of the Abraham Lincoln was for a long time marked by no special incident.
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT]
[Text=voyage CharacterOffsetBegin=4 CharacterOffsetEnd=10 PartOfSpeech=NN]
[Text=of CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=IN]
[Text=the CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=DT]
[Text=Abraham CharacterOffsetBegin=18 CharacterOffsetEnd=25 PartOfSpeech=NNP]
[Text=Lincoln CharacterOffsetBegin=26 CharacterOffsetEnd=33 PartOfSpeech=NNP]
[Text=was CharacterOffsetBegin=34 CharacterOffsetEnd=37 PartOfSpeech=VBD]
[Text=for CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=IN]
[Text=a CharacterOffsetBegin=42 CharacterOffsetEnd=43 PartOfSpeech=DT]
[Text=long CharacterOffsetBegin=44 CharacterOffsetEnd=48 PartOfSpeech=JJ]
[Text=time CharacterOffsetBegin=49 CharacterOffsetEnd=53 PartOfSpeech=NN]
[Text=marked CharacterOffsetBegin=54 CharacterOffsetEnd=60 PartOfSpeech=VBN]
[Text=by CharacterOffsetBegin=61 CharacterOffsetEnd=63 PartOfSpeech=IN]
[Text=no CharacterOffsetBegin=64 CharacterOffsetEnd=66 PartOfSpeech=DT]
[Text=special CharacterOffsetBegin=67 CharacterOffsetEnd=74 PartOfSpeech=JJ]
[Text=incident CharacterOffsetBegin=75 CharacterOffsetEnd=83 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=.]
LingPipe uses the Tagger
interface to support POS tagging. This interface has a single method: tag
. It returns a List
instance of the Tagging
objects. These objects are the words and their tags. The interface is implemented by the ChainCrf
and HmmDecoder
classes.
The ChainCrf
class uses linear-chain conditional random field decoding and estimation for determining tags. The HmmDecoder
class uses an HMM to perform tagging. We will illustrate this class next.
The HmmDecoder
class uses the tag
method to determine the most likely (first best) tags. It also has a tagNBest
method that scores the possible taggings and returns an iterator over these scored taggings. There are three POS models that come with LingPipe, which can be downloaded from http://alias-i.com/lingpipe/web/models.html. These are listed in the following table. For our demonstration, we will use the Brown Corpus model:
Model | File
---|---
English General Text: Brown Corpus | pos-en-general-brown.HiddenMarkovModel
English Biomedical Text: MedPost | pos-en-bio-medpost.HiddenMarkovModel
English Biomedical Text: GENIA | pos-en-bio-genia.HiddenMarkovModel
We start with a try-with-resources block to handle exceptions and the code to create the HmmDecoder
instance, as shown next.
The model is read from the file and then used as the argument of the HmmDecoder
constructor:
try (FileInputStream inputStream = new FileInputStream(getModelDir()
        + "//pos-en-general-brown.HiddenMarkovModel");
     ObjectInputStream objectStream =
        new ObjectInputStream(inputStream)) {
    HiddenMarkovModel hmm =
        (HiddenMarkovModel) objectStream.readObject();
    HmmDecoder decoder = new HmmDecoder(hmm);
    …
} catch (IOException ex) {
    // Handle exceptions
} catch (ClassNotFoundException ex) {
    // Handle exceptions
}
We will perform tagging on theSentence
variable. First, it needs to be tokenized. We will use an Indo-European tokenizer as shown here. The tokenizer
method requires that the text string be converted to an array of chars. The tokenize
method then returns an array of tokens as strings:
TokenizerFactory TOKENIZER_FACTORY =
    IndoEuropeanTokenizerFactory.INSTANCE;
char[] charArray = theSentence.toCharArray();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
    charArray, 0, charArray.length);
String[] tokens = tokenizer.tokenize();
The actual tagging is performed by the HmmDecoder
class' tag
method. However, this method requires a List
instance of String
tokens. This list is created using the Arrays
class' asList
method. The Tagging
class holds a sequence of tokens and tags:
List<String> tokenList = Arrays.asList(tokens);
Tagging<String> tagString = decoder.tag(tokenList);
We are now ready to display the tokens and their tags. The following loop uses the token
and tag
methods to access the tokens and tags, respectively, in the Tagging
object. They are then displayed:
for (int i = 0; i < tagString.size(); ++i) {
    System.out.print(tagString.token(i) + "/"
        + tagString.tag(i) + " ");
}
The output is as follows:
The/at voyage/nn of/in the/at Abraham/np Lincoln/np was/bedz for/in a/at long/jj time/nn marked/vbn by/in no/at special/jj incident/nn ./.
The tagging process considers multiple combinations of tags. The HmmDecoder
class' tagNBest
method returns an iterator of the ScoredTagging
objects that reflect the confidence of different orders. This method takes a token list and a number specifying the maximum number of results desired.
The previous sentence is not ambiguous enough to demonstrate the combination of tags. Instead, we will use the following sentence:
String[] sentence = {"Bill", "used", "the", "force", "to", "force",
    "the", "manager", "to", "tear", "the", "bill", "in", "two."};
List<String> tokenList = Arrays.asList(sentence);
The example using this method starts with a declaration of the number of results:
int maxResults = 5;
Using the decoder object created in the previous section, we apply the tagNBest method to it as follows:

Iterator<ScoredTagging<String>> iterator =
    decoder.tagNBest(tokenList, maxResults);
The iterator allows us to access each of the five different taggings. The ScoredTagging class possesses a score method that returns a value reflecting the model's confidence in that tagging. In the following code sequence, a printf statement displays this score. This is followed by a loop where each token and its tag are displayed. The result is a score followed by the word sequence with the tags attached:
while (iterator.hasNext()) {
    ScoredTagging<String> scoredTagging = iterator.next();
    System.out.printf("Score: %7.3f Sequence: ",
        scoredTagging.score());
    for (int i = 0; i < tokenList.size(); ++i) {
        System.out.print(scoredTagging.token(i) + "/"
            + scoredTagging.tag(i) + " ");
    }
    System.out.println();
}
The output is as follows. Notice that the word "force" can have a tag of nn, jj, or vb:
Score: -148.796 Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn
Score: -154.434 Sequence: Bill/np used/vbn the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn
Score: -154.781 Sequence: Bill/np used/vbd the/at force/nn to/in force/nn the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn
Score: -157.126 Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/jj to/to tear/vb the/at bill/nn in/in two./nn
Score: -157.340 Sequence: Bill/np used/vbd the/at force/jj to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn
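The scores are log probabilities, so the difference between two scores translates into a likelihood ratio between the corresponding taggings. As a quick, self-contained illustration, the following sketch compares the top two hypotheses above. The likelihoodRatio helper and the base-2 assumption are ours, not LingPipe's; consult the ScoredTagging Javadoc for the actual log base:

```java
public class NBestScoreDemo {
    // Hypothetical helper (not part of LingPipe): given two log
    // probability scores, return how many times more likely the
    // first tagging is than the second. Base-2 logs are assumed.
    static double likelihoodRatio(double logScoreA, double logScoreB) {
        return Math.pow(2.0, logScoreA - logScoreB);
    }

    public static void main(String[] args) {
        // Scores copied from the N-best output above
        double best = -148.796;
        double second = -154.434;
        System.out.printf("Best tagging is about %.0f times more likely "
            + "than the runner-up%n", likelihoodRatio(best, second));
    }
}
```

This shows why even "close" log scores correspond to large differences in likelihood.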
Statistical analysis can be performed using a lattice structure, which is useful for analyzing alternative tag assignments. This structure holds forward/backward scores. The HmmDecoder class' tagMarginal method returns an instance of the TagLattice class, which represents such a lattice.
We can examine each token of the lattice using an instance of the ConditionalClassification class. In the following example, the tagMarginal method returns a TagLattice instance. A loop is used to obtain the ConditionalClassification instance for each token in the lattice. We are using the same tokenList instance developed in the previous section:
TagLattice<String> lattice = decoder.tagMarginal(tokenList);
for (int index = 0; index < tokenList.size(); index++) {
    ConditionalClassification classification =
        lattice.tokenClassification(index);
    …
}
The ConditionalClassification class has a score and a category method. The score method returns a relative score for a given category. The category method returns this category, which is the tag. The token, its score, and its category are displayed as shown here:
System.out.printf("%-8s", tokenList.get(index));
for (int i = 0; i < 4; ++i) {
    double score = classification.score(i);
    String tag = classification.category(i);
    System.out.printf("%7.3f/%-3s ", score, tag);
}
System.out.println();
The output is shown as follows:
Bill     0.974/np  0.018/nn  0.006/rb  0.001/nps
used     0.935/vbd 0.065/vbn 0.000/jj  0.000/rb
the      1.000/at  0.000/jj  0.000/pps 0.000/pp$$
force    0.977/nn  0.016/jj  0.006/vb  0.001/rb
to       0.944/to  0.055/in  0.000/rb  0.000/nn
force    0.945/vb  0.053/nn  0.002/rb  0.001/jj
the      1.000/at  0.000/jj  0.000/vb  0.000/nn
manager  0.982/nn  0.018/jj  0.000/nn$ 0.000/vb
to       0.988/to  0.012/in  0.000/rb  0.000/nn
tear     0.991/vb  0.007/nn  0.001/rb  0.001/jj
the      1.000/at  0.000/jj  0.000/vb  0.000/nn
bill     0.994/nn  0.003/jj  0.002/rb  0.001/nns
in       0.990/in  0.004/rp  0.002/nn  0.001/jj
two.     0.960/nn  0.013/np  0.011/nns 0.008/rb
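The kind of forward/backward computation that tagMarginal performs over the lattice can be sketched in plain Java for a toy two-tag HMM. This is a minimal sketch of the forward-backward algorithm, not LingPipe's implementation, and all of the probabilities below are invented for illustration:

```java
import java.util.Arrays;

public class TinyLattice {
    // Compute per-position tag marginals for a toy HMM with
    // initial probabilities init, transition matrix trans,
    // emission matrix emit, and an observation sequence obs.
    static double[][] marginals(double[] init, double[][] trans,
                                double[][] emit, int[] obs) {
        int n = obs.length, s = init.length;
        double[][] fwd = new double[n][s];
        double[][] bwd = new double[n][s];
        // Forward pass: fwd[t][j] = P(obs[0..t], state_t = j)
        for (int j = 0; j < s; j++)
            fwd[0][j] = init[j] * emit[j][obs[0]];
        for (int t = 1; t < n; t++)
            for (int j = 0; j < s; j++)
                for (int i = 0; i < s; i++)
                    fwd[t][j] += fwd[t - 1][i] * trans[i][j] * emit[j][obs[t]];
        // Backward pass: bwd[t][i] = P(obs[t+1..n-1] | state_t = i)
        for (int i = 0; i < s; i++) bwd[n - 1][i] = 1.0;
        for (int t = n - 2; t >= 0; t--)
            for (int i = 0; i < s; i++)
                for (int j = 0; j < s; j++)
                    bwd[t][i] += trans[i][j] * emit[j][obs[t + 1]] * bwd[t + 1][j];
        // Marginal: gamma[t][i] is fwd * bwd, normalized per position
        double[][] gamma = new double[n][s];
        for (int t = 0; t < n; t++) {
            double z = 0;
            for (int i = 0; i < s; i++) z += fwd[t][i] * bwd[t][i];
            for (int i = 0; i < s; i++) gamma[t][i] = fwd[t][i] * bwd[t][i] / z;
        }
        return gamma;
    }

    public static void main(String[] args) {
        double[] init = {0.6, 0.4};                  // P(start tag) for {nn, vb}
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}}; // tag-to-tag transitions
        double[][] emit = {{0.8, 0.2}, {0.3, 0.7}};  // two symbolic words
        int[] obs = {0, 1, 0};                       // observed word indices
        for (double[] row : marginals(init, trans, emit, obs))
            System.out.println(Arrays.toString(row));
    }
}
```

Each printed row corresponds to one token position, and its entries sum to 1, just as the score columns for each word in the table above do.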
Training an OpenNLP POSModel is similar to the previous training examples. A training file is needed and should be large enough to provide a good sample set. Each sentence of the training file must be on a line by itself. Each line consists of a token followed by the underscore character and then the tag.
The following training data was created using the first five sentences of Chapter 5, At A Venture, of Twenty Thousand Leagues Under the Sea. Although this is not a large sample set, it is easy to create and adequate for illustration purposes. It is saved in a file named sample.train:
The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN
But_CC one_CD circumstance_NN happened_VBD which_WDT showed_VBD the_DT wonderful_JJ dexterity_NN of_IN Ned_NNP Land,_NNP and_CC proved_VBD what_WP confidence_NN we_PRP might_MD place_VB in_IN him._PRP$
The_DT 30th_JJ of_IN June,_NNP the_DT frigate_NN spoke_VBD some_DT American_NNP whalers,_, from_IN whom_WP we_PRP learned_VBD that_IN they_PRP knew_VBD nothing_NN about_IN the_DT narwhal._NN
But_CC one_CD of_IN them,_PRP$ the_DT captain_NN of_IN the_DT Monroe,_NNP knowing_VBG that_IN Ned_NNP Land_NNP had_VBD shipped_VBN on_IN board_NN the_DT Abraham_NNP Lincoln,_NNP begged_VBD for_IN his_PRP$ help_NN in_IN chasing_VBG a_DT whale_NN they_PRP had_VBD in_IN sight._NN
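The token_tag format itself is simple enough to parse with a few lines of plain Java. The TrainLineParser class below is our own illustration of the format, not part of OpenNLP, which reads such files through its own sample stream classes:

```java
public class TrainLineParser {
    // Split a "token_tag token_tag ..." line into parallel arrays of
    // words and tags. The tag is everything after the last underscore,
    // so tokens containing underscores are still handled.
    static String[][] parse(String line) {
        String[] pairs = line.trim().split("\\s+");
        String[] words = new String[pairs.length];
        String[] tags = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            int idx = pairs[i].lastIndexOf('_');
            words[i] = pairs[i].substring(0, idx);
            tags[i] = pairs[i].substring(idx + 1);
        }
        return new String[][]{words, tags};
    }

    public static void main(String[] args) {
        String line = "The_DT voyage_NN of_IN the_DT Abraham_NNP";
        String[][] wt = parse(line);
        System.out.println(wt[0][1] + " -> " + wt[1][1]); // voyage -> NN
    }
}
```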
We will demonstrate the creation of the model using the POSModel class' train method and how the model can be saved to a file. We start with the declaration of the POSModel instance variable:

POSModel model = null;
A try-with-resources block opens the sample file:

try (InputStream dataIn = new FileInputStream("sample.train")) {
    …
} catch (IOException e) {
    // Handle exceptions
}
An instance of the PlainTextByLineStream class is created and used with the WordTagSampleStream class to create an ObjectStream<POSSample> instance. This puts the sample data into the format required by the train method:

ObjectStream<String> lineStream =
    new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<POSSample> sampleStream =
    new WordTagSampleStream(lineStream);
The train method uses its parameters to specify the language, the sample stream, the training parameters, and any dictionaries (none, in this case), as shown here:

model = POSTaggerME.train("en", sampleStream,
    TrainingParameters.defaultParams(), null, null);
The output of this process is lengthy. The following output has been shortened to conserve space:
Indexing events using cutoff of 5
Computing event counts... done. 90 events
Indexing... done.
Sorting and merging events... done. Reduced 90 events to 82.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 82
Number of Outcomes: 17
Number of Predicates: 45
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-254.98920096505964 0.14444444444444443
2: ... loglikelihood=-201.19283975630537 0.6
3: ... loglikelihood=-174.8849213436524 0.6111111111111112
4: ... loglikelihood=-157.58164262220754 0.6333333333333333
5: ... loglikelihood=-144.69272379986646 0.6555555555555556
...
99: ... loglikelihood=-33.461128002846024 0.9333333333333333
100: ... loglikelihood=-33.29073273669207 0.9333333333333333
To save the model to a file, we use the following code. The output stream is created, and the POSModel class' serialize method saves the model to the en_pos_verne.bin file:

try (OutputStream modelOut = new BufferedOutputStream(
        new FileOutputStream(new File("en_pos_verne.bin")))) {
    model.serialize(modelOut);
} catch (IOException e) {
    // Handle exceptions
}