Using the NLP APIs

We will demonstrate POS tagging using OpenNLP, the Stanford API, and LingPipe. Each of the examples will use the following sentence, the first sentence of Chapter 5, At A Venture, of Twenty Thousand Leagues Under the Sea by Jules Verne:

private String[] sentence = {"The", "voyage", "of", "the", 
    "Abraham", "Lincoln", "was", "for", "a", "long", "time", "marked", 
    "by", "no", "special", "incident."};

The text to be processed may not always be defined in this fashion. Sometimes the sentence will be available as a single string:

String theSentence = "The voyage of the Abraham Lincoln was for a " 
    + "long time marked by no special incident.";

In that case, we might need to convert the string to an array of words. There are numerous techniques for doing so; the following tokenizeSentence method performs this operation:

public String[] tokenizeSentence(String sentence) {
    String words[] = sentence.split("\\s+");
    return words;
}

The following code demonstrates the use of this method:

String words[] = tokenizeSentence(theSentence);
for(String word : words) {
    System.out.print(word + " "); 
}
System.out.println();

The output is as follows:

The voyage of the Abraham Lincoln was for a long time marked by no special incident.

Alternatively, we could use a tokenizer such as OpenNLP's WhitespaceTokenizer class, as shown here:

String words[] = WhitespaceTokenizer.INSTANCE.tokenize(theSentence);
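Note that splitting on whitespace leaves punctuation attached to words, which is why "incident." appears as a single token. The following sketch (plain Java, with a hypothetical SplitDemo helper class that is not part of OpenNLP) contrasts a bare whitespace split with a rough variant that detaches sentence-final punctuation; a real tokenizer handles many more cases, such as abbreviations, numbers, and quotes:

```java
// Hypothetical helper class; not part of OpenNLP.
public class SplitDemo {
    // Simple whitespace split: punctuation stays glued to words ("incident.").
    public static String[] splitWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    // Rough alternative: insert a space before trailing punctuation first,
    // so "incident." becomes the two tokens "incident" and ".".
    public static String[] splitWithPunctuation(String text) {
        return text.trim()
                   .replaceAll("([.,!?;:])(\\s|$)", " $1$2")
                   .split("\\s+");
    }

    public static void main(String[] args) {
        String s = "The voyage was marked by no special incident.";
        System.out.println(splitWhitespace(s).length);      // 8 tokens
        System.out.println(splitWithPunctuation(s).length); // 9 tokens
    }
}
```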

Using OpenNLP POS taggers

OpenNLP provides several classes in support of POS tagging. We will demonstrate how to use the POSTaggerME class to perform basic tagging and the ChunkerME class to perform chunking. Chunking involves grouping related words according to their types. This can provide additional insight into the structure of a sentence. We will also examine the creation and use of a POSDictionary instance.

Using the OpenNLP POSTaggerME class for POS taggers

The OpenNLP POSTaggerME class uses maximum entropy to process the tags. The tagger determines the type of tag based on the word itself and the word's context. Any given word may have multiple tags associated with it. The tagger uses a probability model to determine the specific tag to be assigned.

POS models are loaded from a file. The en-pos-maxent.bin model is used frequently and is based on the Penn TreeBank tag set. Various pretrained POS models for OpenNLP can be found at http://opennlp.sourceforge.net/models-1.5/.

We start with a try-with-resources block to open the model file and handle any IOException that might be thrown when loading the model, as shown here.

We use the en-pos-maxent.bin file for the model:

try (InputStream modelIn = new FileInputStream(
    new File(getModelDir(), "en-pos-maxent.bin"));) {
    …
}
catch (IOException e) {
    // Handle exceptions
}

Next, create the POSModel and POSTaggerME instances as shown here:

POSModel model = new POSModel(modelIn);
POSTaggerME tagger = new POSTaggerME(model);

The tag method can now be invoked on the tagger, using the text to be processed as its argument:

String tags[] = tagger.tag(sentence);

The words and their tags are then displayed as shown here:

for (int i = 0; i < sentence.length; i++) {
    System.out.print(sentence[i] + "/" + tags[i] + " ");
}

The output is as follows. Each word is followed by its type:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN
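Output in this word/TAG form lends itself to simple post-processing. As a sketch (plain Java, using a hypothetical TagCounter helper that is not an OpenNLP API), the following tallies how often each tag occurs in such a string:

```java
import java.util.*;

// Hypothetical post-processing helper: tally POS tags in "word/TAG" output.
public class TagCounter {
    public static Map<String, Integer> countTags(String tagged) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String pair : tagged.trim().split("\\s+")) {
            int slash = pair.lastIndexOf('/');   // the tag follows the last slash
            String tag = pair.substring(slash + 1);
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String tagged = "The/DT voyage/NN of/IN the/DT Abraham/NNP "
            + "Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN "
            + "by/IN no/DT special/JJ incident./NN";
        System.out.println(countTags(tagged)); // DT occurs 4 times, NN 3 times, ...
    }
}
```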

With any sentence, there may be more than one possible assignment of tags to words. The topKSequences method will return a set of sequences based on their probability of being correct. In the next code sequence, the topKSequences method is executed using the sentence variable and then displayed:

Sequence topSequences[] = tagger.topKSequences(sentence);
for (int i = 0; i < topSequences.length; i++) {
    System.out.println(topSequences[i]);
}

Its output follows; the first number represents a weighted score, and the tags within the brackets are the sequence of tags that was scored:

-0.5563571615737618 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, NN]
-2.9886144610050907 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, .]
-3.771930515521527 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, NN, NN]

Note

Ensure that you include the correct Sequence class. For this example, use import opennlp.tools.util.Sequence;

The Sequence class has several methods, as detailed in the following table:

Method - Meaning

getOutcomes - Returns a list of strings representing the tags for the sentence
getProbs - Returns an array of doubles representing the probability of each tag in the sequence
getScore - Returns a weighted value for the sequence

In the following code sequence, we use several of these methods to demonstrate what they do. For each sequence, the tags and their probabilities are displayed, separated by a forward slash:

for (int i = 0; i < topSequences.length; i++) {
    List<String> outcomes = topSequences[i].getOutcomes();
    double probabilities[] = topSequences[i].getProbs();
    for (int j = 0; j < outcomes.size(); j++) { 
        System.out.printf("%s/%5.3f ", outcomes.get(j),
            probabilities[j]);
    }
    System.out.println();
}
System.out.println();

The output is as follows. Each pair of lines represents one sequence where the output has been wrapped:

DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 NN/0.832 
DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 ./0.073 
DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 NN/0.073 NN/0.419
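Judging by the probabilities just displayed, the weighted score reported by topKSequences appears to be the sum of the natural logarithms of the per-tag probabilities; this is an observation from the numbers above, not documented OpenNLP behavior. A quick plain-Java check (hypothetical SequenceScore helper):

```java
// Assumption: sequence score = sum of ln(p) over the per-tag probabilities.
public class SequenceScore {
    public static double score(double[] probs) {
        double sum = 0.0;
        for (double p : probs) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Rounded probabilities printed for the best sequence above.
        double[] probs = {0.992, 0.990, 0.989, 0.990, 0.996, 0.991, 0.994,
                          0.996, 0.996, 0.991, 0.994, 0.860, 0.985, 0.960,
                          0.919, 0.832};
        // Prints a value close to the reported score of -0.556.
        System.out.println(score(probs));
    }
}
```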

Using OpenNLP chunking

The process of chunking involves breaking a sentence into parts or chunks. These chunks can then be annotated with tags. We will use the ChunkerME class to illustrate how this is accomplished. This class uses a model loaded into a ChunkerModel instance. The ChunkerME class' chunk method performs the actual chunking process. We will also examine the use of the chunkAsSpans method to return information about the span of these chunks. This allows us to see how long a chunk is and what elements make up the chunk.

We will use the en-pos-maxent.bin file to create a model for the POSTaggerME instance. We need to use this instance to tag the text as we did in the Using OpenNLP POSTaggerME class for POS taggers section earlier in this chapter. We will also use the en-chunker.bin file to create a ChunkerModel instance to be used with the ChunkerME instance.

These models are created using input streams, as shown in the following example.

We use a try-with-resources block to open and close files and to deal with any exceptions that may be thrown:

try (
        InputStream posModelStream = new FileInputStream(
            getModelDir() + "/en-pos-maxent.bin");
        InputStream chunkerStream = new FileInputStream(
            getModelDir() + "/en-chunker.bin");) {
    …
} catch (IOException ex) {
    // Handle exceptions
}

The following code sequence creates and uses a tagger to find the POS of the sentence. The sentence and its tags are then displayed:

POSModel model = new POSModel(posModelStream);
POSTaggerME tagger = new POSTaggerME(model);

String tags[] = tagger.tag(sentence);
for(int i=0; i<tags.length; i++) {
    System.out.print(sentence[i] + "/" + tags[i] + " ");
}
System.out.println();

The output is as follows. We have shown this output so that it will be clear how the chunker works:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN

A ChunkerModel instance is created using the input stream. From this, the ChunkerME instance is created, followed by the use of the chunk method, as shown here. The chunk method will use the sentence's tokens and their tags to create an array of strings. Each string will hold information about a token and its chunk:

ChunkerModel chunkerModel = new ChunkerModel(chunkerStream);
ChunkerME chunkerME = new ChunkerME(chunkerModel);
String result[] = chunkerME.chunk(sentence, tags);

Each token in the results array and its chunk tag are displayed as shown here:

for (int i = 0; i < result.length; i++) {
    System.out.println("[" + sentence[i] + "] " + result[i]);
}

The output is shown after the table. Each token is enclosed in brackets, followed by its chunk tag. These tags are explained in the following table:

First part:

B - Beginning of a chunk
I - Continuation of a chunk
E - End of a chunk (will not appear if the chunk is one word long)

Second part:

NP - Noun chunk
VP - Verb chunk
PP - Prepositional chunk

Multiple words are grouped together into chunks, such as "The voyage" and "the Abraham Lincoln":

[The] B-NP
[voyage] I-NP
[of] B-PP
[the] B-NP
[Abraham] I-NP
[Lincoln] I-NP
[was] B-VP
[for] B-PP
[a] B-NP
[long] I-NP
[time] I-NP
[marked] B-VP
[by] B-PP
[no] B-NP
[special] I-NP
[incident.] I-NP

If we are interested in getting more detailed information about the chunks, we can use the ChunkerME class' chunkAsSpans method. This method returns an array of Span objects. Each object represents one span found in the text.

The Span class has several useful methods. Here, we will illustrate the use of the getType, getStart, getEnd, and length methods. The getType method returns the second part of the chunk tag, the getStart and getEnd methods return the beginning and ending index of the span's tokens in the original sentence array, and the length method returns the length of the span in number of tokens.

In the following sequence, the chunkAsSpans method is executed using the sentence and tags arrays. The spans array is then displayed. The outer for loop processes one Span object at a time displaying the basic span information. The inner for loop displays the spanned text enclosed within brackets:

Span[] spans = chunkerME.chunkAsSpans(sentence, tags);
for (Span span : spans) {
    System.out.print("Type: " + span.getType() + " - " 
        + " Begin: " + span.getStart() 
        + " End:" + span.getEnd()
        + " Length: " + span.length() + "  [");
    for (int j = span.getStart(); j < span.getEnd(); j++) {
        System.out.print(sentence[j] + " ");
    }
    System.out.println("]");
}

The following output clearly shows the span type, its position in the sentence array, its length, and then the actual spanned text:

Type: NP -  Begin: 0 End:2 Length: 2  [The voyage ]
Type: PP -  Begin: 2 End:3 Length: 1  [of ]
Type: NP -  Begin: 3 End:6 Length: 3  [the Abraham Lincoln ]
Type: VP -  Begin: 6 End:7 Length: 1  [was ]
Type: PP -  Begin: 7 End:8 Length: 1  [for ]
Type: NP -  Begin: 8 End:11 Length: 3  [a long time ]
Type: VP -  Begin: 11 End:12 Length: 1  [marked ]
Type: PP -  Begin: 12 End:13 Length: 1  [by ]
Type: NP -  Begin: 13 End:16 Length: 3  [no special incident. ]
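The same span grouping can be derived from the B-/I- chunk tags alone. The following sketch (plain Java, using a hypothetical BioSpans helper; this is not OpenNLP's implementation) walks the chunk tags and emits type:start-end entries matching the spans shown above:

```java
import java.util.*;

// Plain-Java sketch: derive (type, start, end) spans from B-/I- chunk tags.
// Not OpenNLP's implementation; any non-B/I tag is treated as a boundary.
public class BioSpans {
    public static List<String> spans(String[] chunkTags) {
        List<String> result = new ArrayList<>();
        int start = -1;
        String type = null;
        for (int i = 0; i < chunkTags.length; i++) {
            String tag = chunkTags[i];
            boolean continues = tag.startsWith("I-") && type != null
                && tag.substring(2).equals(type);
            if (!continues) {
                if (type != null) {                  // close the open span
                    result.add(type + ":" + start + "-" + i);
                }
                if (tag.startsWith("B-")) {          // open a new span
                    start = i;
                    type = tag.substring(2);
                } else {
                    type = null;
                }
            }
        }
        if (type != null) {                          // close the final span
            result.add(type + ":" + start + "-" + chunkTags.length);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tags = {"B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP",
                         "B-VP", "B-PP", "B-NP", "I-NP", "I-NP", "B-VP",
                         "B-PP", "B-NP", "I-NP", "I-NP"};
        System.out.println(spans(tags)); // NP:0-2, PP:2-3, NP:3-6, ...
    }
}
```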

Using the POSDictionary class

A tag dictionary specifies the valid tags for a word. This can prevent a tag from being applied inappropriately to a word. In addition, some search algorithms execute faster, since they do not have to consider other, less probable tags.

In this section, we will demonstrate how to:

  • Obtain the tag dictionary for a tagger
  • Determine what tags a word has
  • Show how to change the tags for a word
  • Add a new tag dictionary to a new tagger factory

As with the previous example, we will use a try-with-resources block to open our input streams for the POS model and then create our model and tagger factory, as shown here:

try (InputStream modelIn = new FileInputStream(
        new File(getModelDir(), "en-pos-maxent.bin"));) {
    POSModel model = new POSModel(modelIn);
    POSTaggerFactory posTaggerFactory = model.getFactory();
    …
} catch (IOException e) {
    //Handle exceptions
}

Obtaining the tag dictionary for a tagger

We used the POSModel class' getFactory method to get a POSTaggerFactory instance. We will use its getTagDictionary method to obtain its TagDictionary instance. This is illustrated here:

MutableTagDictionary tagDictionary = 
  (MutableTagDictionary)posTaggerFactory.getTagDictionary();

The MutableTagDictionary interface extends the TagDictionary interface. The TagDictionary interface possesses a getTags method, and the MutableTagDictionary interface adds a put method that allows tags to be added to the dictionary. These interfaces are implemented by the POSDictionary class.

Determining a word's tags

To obtain the tags for a given word, use the getTags method. This returns an array of tags represented by strings. The tags are then displayed as shown here:

String tags[] = tagDictionary.getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();

The output is as follows:

/NN/VBP/VB

This means that the word "force" can be interpreted in three different ways.

Changing a word's tags

The MutableTagDictionary interface's put method allows us to add tags to a word. The method has two arguments: the word, and its new tags. The method returns an array containing the previous tags.

In the following example, we replace the old tags with a new tag. The old tags are then displayed.

String oldTags[] = tagDictionary.put("force", "newTag");
for (String tag : oldTags) {
    System.out.print("/" + tag);
}
System.out.println();

The following output lists the old tags for the word.

/NN/VBP/VB

These tags have been replaced by the new tag as demonstrated here where the current tags are displayed:

tags = tagDictionary.getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();

All we get is the following:

/newTag

To retain the old tags we will need to create an array of strings to hold the old and the new tags and then use the array as the second argument of the put method as shown here:

String newTags[] = new String[tags.length+1];
for (int i=0; i<tags.length; i++) {
    newTags[i] = tags[i];
}
newTags[tags.length] = "newTag";
oldTags = tagDictionary.put("force", newTags);

If we redisplay the current tags as shown here, we can see that the old tags have been retained and the new one added:

/NN/VBP/VB/newTag

Tip

When adding tags, be careful to assign them in the proper order, as the order will influence which tag is assigned.
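The replace-and-return semantics of put and getTags can be mimicked with a plain map, which may make them easier to see in isolation. The following sketch (plain Java, a hypothetical SimpleTagDictionary class, not OpenNLP's POSDictionary) lowercases keys to approximate a case-insensitive dictionary:

```java
import java.util.*;

// Minimal sketch of MutableTagDictionary-style behavior using a HashMap;
// not the OpenNLP POSDictionary class.
public class SimpleTagDictionary {
    private final Map<String, String[]> entries = new HashMap<>();

    // Returns the tags for a word, or null if the word is unknown.
    // Lowercasing approximates a case-insensitive dictionary.
    public String[] getTags(String word) {
        return entries.get(word.toLowerCase());
    }

    // Replaces the word's tags and returns the previous tags (or null).
    public String[] put(String word, String... tags) {
        return entries.put(word.toLowerCase(), tags);
    }

    public static void main(String[] args) {
        SimpleTagDictionary dict = new SimpleTagDictionary();
        dict.put("force", "NN", "VBP", "VB");
        String[] old = dict.put("force", "newTag"); // replaces, returns old tags
        System.out.println(old.length + " -> " + dict.getTags("force").length);
    }
}
```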

Adding a new tag dictionary

A new tag dictionary can be added to a POSTaggerFactory instance. We will illustrate this process by creating a new POSTaggerFactory and then adding the tagDictionary we developed earlier. First, we create a new factory using the default constructor, as shown next. This is followed by calling the setTagDictionary method on the new factory.

POSTaggerFactory newFactory = new POSTaggerFactory();
newFactory.setTagDictionary(tagDictionary);

To confirm that the tag dictionary has been added, we display the tags for the word "force" as shown here:

tags = newFactory.getTagDictionary().getTags("force");
for (String tag : tags) {
    System.out.print("/" + tag);
}
System.out.println();

The tags are the same as shown here:

/NN/VBP/VB/newTag

Creating a dictionary from a file

If we need to create a new dictionary, then one approach is to create an XML file containing all of the words and their tags, and then create the dictionary from the file. OpenNLP supports this approach with the POSDictionary class' create method.

The XML file consists of the dictionary root element followed by a series of entry elements. The entry element uses the tags attribute to specify the tags for the word. The word is contained within the entry element as a token element. A simple example using two words stored in the file dictionary.txt is as follows:

<dictionary case_sensitive="false">
    <entry tags="JJ VB">
        <token>strong</token>
    </entry>
    <entry tags="NN VBP VB">
        <token>force</token>
    </entry>
</dictionary>

To create the dictionary, we use the create method based on an input stream as shown here:

try (InputStream dictionaryIn = 
      new FileInputStream(new File("dictionary.txt"));) {
    POSDictionary dictionary = POSDictionary.create(dictionaryIn);
    …
} catch (IOException e) {
    // Handle exceptions
}

The POSDictionary class has an iterator method that returns an iterator object. Its next method returns a string for each word in the dictionary. We can use these methods to display the contents of the dictionary, as shown here:

Iterator<String> iterator = dictionary.iterator();
while (iterator.hasNext()) {
    String entry = iterator.next();
    String tags[] = dictionary.getTags(entry);
    System.out.print(entry + " ");
    for (String tag : tags) {
        System.out.print("/" + tag);
    }
    System.out.println();
}

The output that follows displays what we can expect:

strong /JJ/VB
force /NN/VBP/VB
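The XML format is simple enough to read with the JDK alone, which can help when debugging a dictionary file. The following sketch (plain Java, with a hypothetical DictionaryReader helper; the real loading is done by POSDictionary.create) parses the entry elements into a map:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Stdlib-only sketch of reading the dictionary XML format shown above;
// not a replacement for POSDictionary.create.
public class DictionaryReader {
    public static Map<String, String[]> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String[]> dictionary = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("entry");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element entry = (Element) nodes.item(i);
                String word = entry.getElementsByTagName("token")
                    .item(0).getTextContent();
                String[] tags = entry.getAttribute("tags").split(" ");
                dictionary.put(word, tags);
            }
            return dictionary;
        } catch (Exception e) {
            throw new IllegalStateException("Bad dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dictionary case_sensitive=\"false\">"
            + "<entry tags=\"JJ VB\"><token>strong</token></entry>"
            + "<entry tags=\"NN VBP VB\"><token>force</token></entry>"
            + "</dictionary>";
        System.out.println(read(xml).keySet()); // [strong, force]
    }
}
```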

Using Stanford POS taggers

In this section, we will examine two different approaches supported by the Stanford API to perform tagging. The first technique uses the MaxentTagger class. As its name implies, it uses maximum entropy to find the POS. We will also use this class to demonstrate a model designed to handle textese-type text. The second approach will use the pipeline approach with annotators. The English taggers use the Penn Treebank English POS tag set.

Using Stanford MaxentTagger

The MaxentTagger class uses a model to perform the tagging task. There are a number of models that come bundled with the API, all with the file extension .tagger. They include English, Chinese, Arabic, French, and German models. The English models are listed here. The prefix, wsj, refers to models based on the Wall Street Journal. The other terms refer to techniques used to train the model. These concepts are not covered here:

  • wsj-0-18-bidirectional-distsim.tagger
  • wsj-0-18-bidirectional-nodistsim.tagger
  • wsj-0-18-caseless-left3words-distsim.tagger
  • wsj-0-18-left3words-distsim.tagger
  • wsj-0-18-left3words-nodistsim.tagger
  • english-bidirectional-distsim.tagger
  • english-caseless-left3words-distsim.tagger
  • english-left3words-distsim.tagger

The example reads in a series of sentences from a file. Each sentence is then processed and various ways of accessing and displaying the words and tags are illustrated.

We start with a try-with-resources block to deal with IO exceptions as shown here. The wsj-0-18-bidirectional-distsim.tagger file is used to create an instance of the MaxentTagger class.

A List of List instances of HasWord objects is created using the MaxentTagger class' tokenizeText method. The sentences are read in from the file sentences.txt. The HasWord interface represents words and contains two methods: setWord and word. The latter returns a word as a string. Each sentence is represented by a List instance of HasWord objects:

try {
    MaxentTagger tagger = new MaxentTagger(getModelDir() + 
        "/wsj-0-18-bidirectional-distsim.tagger");
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText(
        new BufferedReader(new FileReader("sentences.txt")));
    …
} catch (FileNotFoundException ex) {
    // Handle exceptions
}

The sentences.txt file contains the first four sentences of Chapter 5, At A Venture, of Twenty Thousand Leagues Under the Sea:

The voyage of the Abraham Lincoln was for a long time marked by no special incident.
But one circumstance happened which showed the wonderful dexterity of Ned Land, and proved what confidence we might place in him.
The 30th of June, the frigate spoke some American whalers, from whom we learned that they knew nothing about the narwhal.
But one of them, the captain of the Monroe, knowing that Ned Land had shipped on board the Abraham Lincoln, begged for his help in chasing a whale they had in sight.

A loop is added to process each sentence of the sentences list. The tagSentence method returns a List instance of TaggedWord objects as shown next. The TaggedWord class implements the HasWord interface and adds a tag method that returns the tag associated with the word. As shown here, the toString method is used to display each sentence:

for (List<HasWord> sentence : sentences) {
    List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
    System.out.println(taggedSentence);
}

The output is as follows:

[The/DT, voyage/NN, of/IN, the/DT, Abraham/NNP, Lincoln/NNP, was/VBD, for/IN, a/DT, long/JJ, time/NN, marked/VBN, by/IN, no/DT, special/JJ, incident/NN, ./.]
[But/CC, one/CD, circumstance/NN, happened/VBD, which/WDT, showed/VBD, the/DT, wonderful/JJ, dexterity/NN, of/IN, Ned/NNP, Land/NNP, ,/,, and/CC, proved/VBD, what/WP, confidence/NN, we/PRP, might/MD, place/VB, in/IN, him/PRP, ./.]
[The/DT, 30th/JJ, of/IN, June/NNP, ,/,, the/DT, frigate/NN, spoke/VBD, some/DT, American/JJ, whalers/NNS, ,/,, from/IN, whom/WP, we/PRP, learned/VBD, that/IN, they/PRP, knew/VBD, nothing/NN, about/IN, the/DT, narwhal/NN, ./.]
[But/CC, one/CD, of/IN, them/PRP, ,/,, the/DT, captain/NN, of/IN, the/DT, Monroe/NNP, ,/,, knowing/VBG, that/IN, Ned/NNP, Land/NNP, had/VBD, shipped/VBN, on/IN, board/NN, the/DT, Abraham/NNP, Lincoln/NNP, ,/,, begged/VBN, for/IN, his/PRP$, help/NN, in/IN, chasing/VBG, a/DT, whale/NN, they/PRP, had/VBD, in/IN, sight/NN, ./.]

Alternatively, we can use the Sentence class' listToString method to convert the tagged sentence to a simple String object. Passing false as its second parameter tells the method to retain the tags; each HasWord element's toString method is used to create the resulting string, as shown here:

for (List<HasWord> sentence : sentences) {
    List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
    System.out.println(Sentence.listToString(taggedSentence, false));
}

This produces a more aesthetically pleasing output:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.
But/CC one/CD circumstance/NN happened/VBD which/WDT showed/VBD the/DT wonderful/JJ dexterity/NN of/IN Ned/NNP Land/NNP ,/, and/CC proved/VBD what/WP confidence/NN we/PRP might/MD place/VB in/IN him/PRP ./.
The/DT 30th/JJ of/IN June/NNP ,/, the/DT frigate/NN spoke/VBD some/DT American/JJ whalers/NNS ,/, from/IN whom/WP we/PRP learned/VBD that/IN they/PRP knew/VBD nothing/NN about/IN the/DT narwhal/NN ./.
But/CC one/CD of/IN them/PRP ,/, the/DT captain/NN of/IN the/DT Monroe/NNP ,/, knowing/VBG that/IN Ned/NNP Land/NNP had/VBD shipped/VBN on/IN board/NN the/DT Abraham/NNP Lincoln/NNP ,/, begged/VBN for/IN his/PRP$ help/NN in/IN chasing/VBG a/DT whale/NN they/PRP had/VBD in/IN sight/NN ./. 

We can use the following code sequence to produce the same results. The word and tag methods extract the words and their tags:

List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
for (TaggedWord taggedWord : taggedSentence) {
    System.out.print(taggedWord.word() + "/" + taggedWord.tag() + " ");
}
System.out.println();

If we are only interested in finding specific occurrences of a given tag, we can use a sequence such as the following, which will list only the nouns (tags beginning with NN):

List<TaggedWord> taggedSentence = tagger.tagSentence(sentence);
System.out.print("NN Tagged: ");
for (TaggedWord taggedWord : taggedSentence) {
    if (taggedWord.tag().startsWith("NN")) {
        System.out.print(taggedWord.word() + " ");
    }
}
System.out.println();

The nouns are displayed for each sentence, as shown here:

NN Tagged: voyage Abraham Lincoln time incident 
NN Tagged: circumstance dexterity Ned Land confidence 
NN Tagged: June frigate whalers nothing narwhal 
NN Tagged: captain Monroe Ned Land board Abraham Lincoln help whale sight

Using the MaxentTagger class to tag textese

We can use a different model to handle Twitter text, which may include textese. GATE (https://gate.ac.uk/wiki/twitter-postagger.html) has developed a model for Twitter text. This model is used here to process textese:

MaxentTagger tagger = new MaxentTagger(getModelDir() 
    + "/gate-EN-twitter.model");

Here, we use the MaxentTagger class' tagString method from the What makes POS difficult? section earlier in this chapter to process the textese:

System.out.println(tagger.tagString("AFAIK she H8 cth!"));
System.out.println(tagger.tagString(
    "BTW had a GR8 tym at the party BBIAM."));

The output will be as follows:

AFAIK_NNP she_PRP H8_VBP cth!_NN 
BTW_UH had_VBD a_DT GR8_NNP tym_NNP at_IN the_DT party_NN BBIAM._NNP  

Using Stanford pipeline to perform tagging

We have used the Stanford pipeline in several previous examples. In this example, we will use the Stanford pipeline to extract POS tags. As with our previous Stanford examples, we create a pipeline based on a set of annotators: tokenize, ssplit, and pos.

These will tokenize, split the text into sentences, and then find the POS tags:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

To process the text, we will use the theSentence variable to create an Annotation instance. The pipeline's annotate method is then invoked, as shown here:

Annotation document = new Annotation(theSentence);
pipeline.annotate(document);

Since the pipeline can perform different types of processing, a list of CoreMap objects is used to access the words and tags. The Annotation class' get method returns the list of sentences, as shown here:

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

The contents of the CoreMap objects can be accessed using its get method. The method's argument is the class for the information needed. As shown in the following code example, tokens are accessed using the TextAnnotation class, and the POS tags can be retrieved using the PartOfSpeechAnnotation class. Each word of each sentence and its tags is displayed:

for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String pos = token.get(PartOfSpeechAnnotation.class);
        System.out.print(word + "/" + pos + " ");
    }
    System.out.println();
}

The output will be as follows:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.

The pipeline can use additional options to control how the tagger works. For example, by default the english-left3words-distsim.tagger tagger model is used. We can specify a different model using the pos.model property, as shown here. There is also a pos.maxlen property to control the maximum sentence size:

props.put("pos.model",
"C:/.../Models/english-caseless-left3words-distsim.tagger");

Sometimes it is useful to have a tagged document that is XML formatted. The StanfordCoreNLP class' xmlPrint method will write out such a document. The method's first argument is the Annotation object to be written out. Its second argument is the OutputStream object to write to. In the following code sequence, the previous tagging results are written to standard output. The call is enclosed in a try-catch block to handle IO exceptions:

try {
    pipeline.xmlPrint(document, System.out);
} catch (IOException ex) {
    // Handle exceptions
}

A partial listing of the results is as follows. Only the first two words and the last word are displayed. Each token tag contains the word, its position, and its POS tag:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>The</word>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>3</CharacterOffsetEnd>
<POS>DT</POS>
</token>
<token id="2">
<word>voyage</word>
<CharacterOffsetBegin>4</CharacterOffsetBegin>
<CharacterOffsetEnd>10</CharacterOffsetEnd>
<POS>NN</POS>
</token>

<token id="17">
<word>.</word>
<CharacterOffsetBegin>83</CharacterOffsetBegin>
<CharacterOffsetEnd>84</CharacterOffsetEnd>
<POS>.</POS>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>

The prettyPrint method works in a similar manner:

pipeline.prettyPrint(document, System.out);

However, the output is not really that pretty, as shown here. The original sentence is displayed followed by each word, its position, and its tag. The output has been formatted to make it more readable:

The voyage of the Abraham Lincoln was for a long time marked by no special incident.
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT] 
[Text=voyage CharacterOffsetBegin=4 CharacterOffsetEnd=10 PartOfSpeech=NN] 
[Text=of CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=IN] 
[Text=the CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=DT] 
[Text=Abraham CharacterOffsetBegin=18 CharacterOffsetEnd=25 PartOfSpeech=NNP]
 [Text=Lincoln CharacterOffsetBegin=26 CharacterOffsetEnd=33 PartOfSpeech=NNP]
 [Text=was CharacterOffsetBegin=34 CharacterOffsetEnd=37 PartOfSpeech=VBD]
 [Text=for CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=IN]
 [Text=a CharacterOffsetBegin=42 CharacterOffsetEnd=43 PartOfSpeech=DT]
 [Text=long CharacterOffsetBegin=44 CharacterOffsetEnd=48 PartOfSpeech=JJ]
 [Text=time CharacterOffsetBegin=49 CharacterOffsetEnd=53 PartOfSpeech=NN]
 [Text=marked CharacterOffsetBegin=54 CharacterOffsetEnd=60 PartOfSpeech=VBN]
 [Text=by CharacterOffsetBegin=61 CharacterOffsetEnd=63 PartOfSpeech=IN] 
[Text=no CharacterOffsetBegin=64 CharacterOffsetEnd=66 PartOfSpeech=DT]
 [Text=special CharacterOffsetBegin=67 CharacterOffsetEnd=74 PartOfSpeech=JJ]
 [Text=incident CharacterOffsetBegin=75 CharacterOffsetEnd=83 PartOfSpeech=NN]
 [Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=.]
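The character offsets in this output can be recomputed from the original sentence by scanning for each token in order. The following sketch (plain Java, a hypothetical OffsetDemo helper, not a Stanford API) shows the idea:

```java
// Plain-Java sketch: recompute character offsets like those shown above
// by searching the original text for each token, left to right.
public class OffsetDemo {
    // Returns {begin, end} for each token, resuming each search where
    // the previous token ended.
    public static int[][] offsets(String text, String[] tokens) {
        int[][] result = new int[tokens.length][2];
        int from = 0;
        for (int i = 0; i < tokens.length; i++) {
            int begin = text.indexOf(tokens[i], from);
            result[i][0] = begin;
            result[i][1] = begin + tokens[i].length();
            from = result[i][1];
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The voyage of the Abraham Lincoln was for a long "
            + "time marked by no special incident.";
        String[] tokens = {"The", "voyage", "of", "incident", "."};
        for (int[] o : offsets(text, tokens)) {
            System.out.println(o[0] + ".." + o[1]);
        }
    }
}
```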

Using LingPipe POS taggers

LingPipe uses the Tagger interface to support POS tagging. This interface has a single method: tag. It returns a Tagging object, which holds the tokens and their tags. The interface is implemented by the ChainCrf and HmmDecoder classes.

The ChainCrf class uses linear-chain conditional random field decoding and estimation for determining tags. The HmmDecoder class uses an HMM to perform tagging. We will illustrate this class next.

The HmmDecoder class uses the tag method to determine the most likely (first best) tags. It also has a tagNBest method that scores the possible taggings and returns an iterator over these scored taggings. There are three POS models that come with LingPipe, which can be downloaded from http://alias-i.com/lingpipe/web/models.html. These are listed in the following table. For our demonstration, we will use the Brown Corpus model:

Model - File

English General Text: Brown Corpus - pos-en-general-brown.HiddenMarkovModel
English Biomedical Text: MedPost Corpus - pos-en-bio-medpost.HiddenMarkovModel
English Biomedical Text: GENIA Corpus - pos-en-bio-genia.HiddenMarkovModel

Using the HmmDecoder class with Best_First tags

We start with a try-with-resources block to handle exceptions and the code to create the HmmDecoder instance, as shown next.

The model is read from the file and then used as the argument of the HmmDecoder constructor:

try (
        FileInputStream inputStream = 
            new FileInputStream(getModelDir()
            + "/pos-en-general-brown.HiddenMarkovModel");
        ObjectInputStream objectStream =
            new ObjectInputStream(inputStream);) {
    HiddenMarkovModel hmm = (HiddenMarkovModel)
        objectStream.readObject();
    HmmDecoder decoder = new HmmDecoder(hmm);
    …
} catch (IOException ex) {
    // Handle exceptions
} catch (ClassNotFoundException ex) {
    // Handle exceptions
}

We will perform tagging on theSentence variable. First, it needs to be tokenized. We will use an Indo-European tokenizer as shown here. The tokenizer method requires that the text string be converted to an array of chars. The tokenize method then returns an array of tokens as strings:

TokenizerFactory TOKENIZER_FACTORY = 
    IndoEuropeanTokenizerFactory.INSTANCE;
char[] charArray = theSentence.toCharArray();
Tokenizer tokenizer = 
    TOKENIZER_FACTORY.tokenizer(
      charArray, 0, charArray.length);
String[] tokens = tokenizer.tokenize();

The actual tagging is performed by the HmmDecoder class' tag method. However, this method requires a List instance of String tokens. This list is created using the Arrays class' asList method. The Tagging class holds a sequence of tokens and tags:

List<String> tokenList = Arrays.asList(tokens);
Tagging<String> tagString = decoder.tag(tokenList);

We are now ready to display the tokens and their tags. The following loop uses the token and tag methods to access the tokens and tags, respectively, in the Tagging object. They are then displayed:

for (int i = 0; i < tagString.size(); ++i) {
    System.out.print(tagString.token(i) + "/" 
    + tagString.tag(i) + " ");
}

The output is as follows:

The/at voyage/nn of/in the/at Abraham/np Lincoln/np was/bedz for/in a/at long/jj time/nn marked/vbn by/in no/at special/jj incident/nn ./. 

Using the HmmDecoder class with NBest tags

The tagging process considers multiple combinations of tags. The HmmDecoder class' tagNBest method returns an iterator over ScoredTagging objects that reflect the confidence of different tag sequences. This method takes a token list and a number specifying the maximum number of results desired.

The previous sentence is not ambiguous enough to demonstrate the combination of tags. Instead, we will use the following sentence:

String[] sentence = {"Bill", "used", "the", "force", "to", "force", "the", "manager", "to",
    "tear", "the", "bill", "in", "two."};
List<String> tokenList = Arrays.asList(sentence);

The example using this method is shown here, starting with a declaration of the number of results desired:

int maxResults = 5;

Using the decoder object created in the previous section, we apply the tagNBest method to it as follows:

Iterator<ScoredTagging<String>> iterator = 
    decoder.tagNBest(tokenList, maxResults);

The iterator allows us to access each of the five scored taggings. The ScoredTagging class possesses a score method that returns a value reflecting the model's confidence in a given tagging. In the following code sequence, a printf statement displays this score. This is followed by a loop where each token and its tag are displayed.

The result is a score followed by the word sequence with the tag attached:

while (iterator.hasNext()) {
    ScoredTagging<String> scoredTagging = iterator.next();
    System.out.printf("Score: %7.3f   Sequence: ", scoredTagging.score());
    for (int i = 0; i < tokenList.size(); ++i) {
        System.out.print(scoredTagging.token(i) + "/" 
            + scoredTagging.tag(i) + " ");
    }
    System.out.println();
}

The output is as follows. Notice that the word "force" can have a tag of nn, jj, or vb:

Score: -148.796   Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
Score: -154.434   Sequence: Bill/np used/vbn the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
Score: -154.781   Sequence: Bill/np used/vbd the/at force/nn to/in force/nn the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
Score: -157.126   Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/jj to/to tear/vb the/at bill/nn in/in two./nn 
Score: -157.340   Sequence: Bill/np used/vbd the/at force/jj to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn

Determining tag confidence with the HmmDecoder class

Statistical analysis can be performed using a lattice structure, which is useful for analyzing alternative tag sequences. This structure holds forward/backward scores. The HmmDecoder class' tagMarginal method returns an instance of a TagLattice class, which represents a lattice.

We can examine each token of the lattice using an instance of the ConditionalClassification class. In the following example, the tagMarginal method returns a TagLattice instance. A loop is used to obtain the ConditionalClassification instance for each token in the lattice.

We are using the same tokenList instance developed in the previous section:

TagLattice<String> lattice = decoder.tagMarginal(tokenList);
for (int index = 0; index < tokenList.size(); index++) {
    ConditionalClassification classification = 
        lattice.tokenClassification(index);
    …
}

The ConditionalClassification class has a score and a category method. The score method returns a relative score for a given category. The category method returns this category, which is the tag. The token, its score, and category are displayed as shown here:

System.out.printf("%-8s",tokenList.get(index));
for (int i = 0; i < 4; ++i) {
    double score = classification.score(i);
    String tag = classification.category(i);
    System.out.printf("%7.3f/%-3s ",score,tag);
}
System.out.println();

The output is as follows:

Bill      0.974/np    0.018/nn    0.006/rb    0.001/nps 
used      0.935/vbd   0.065/vbn   0.000/jj    0.000/rb  
the       1.000/at    0.000/jj    0.000/pps   0.000/pp$$ 
force     0.977/nn    0.016/jj    0.006/vb    0.001/rb  
to        0.944/to    0.055/in    0.000/rb    0.000/nn  
force     0.945/vb    0.053/nn    0.002/rb    0.001/jj  
the       1.000/at    0.000/jj    0.000/vb    0.000/nn  
manager   0.982/nn    0.018/jj    0.000/nn$   0.000/vb  
to        0.988/to    0.012/in    0.000/rb    0.000/nn  
tear      0.991/vb    0.007/nn    0.001/rb    0.001/jj  
the       1.000/at    0.000/jj    0.000/vb    0.000/nn  
bill      0.994/nn    0.003/jj    0.002/rb    0.001/nns 
in        0.990/in    0.004/rp    0.002/nn    0.001/jj  
two.      0.960/nn    0.013/np    0.011/nns   0.008/rb  
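The per-token scores above are the kind of quantities the HMM forward-backward algorithm produces. The following self-contained sketch (a toy model with invented probabilities, not LingPipe code) computes per-position tag marginals for a two-tag HMM; note that each row of marginals sums to 1, just as the full set of scores for each token in the lattice does:

```java
public class ToyMarginals {
    // Computes per-position state marginals for a discrete HMM using
    // the forward-backward algorithm. pi: initial probabilities,
    // a: transition matrix, b: emission matrix, obs: observation ids.
    public static double[][] marginals(double[] pi, double[][] a,
                                       double[][] b, int[] obs) {
        int n = obs.length, s = pi.length;
        double[][] alpha = new double[n][s];
        double[][] beta = new double[n][s];
        // Forward pass: alpha[t][i] = P(obs[0..t], state_t = i)
        for (int i = 0; i < s; i++) alpha[0][i] = pi[i] * b[i][obs[0]];
        for (int t = 1; t < n; t++)
            for (int j = 0; j < s; j++) {
                double sum = 0;
                for (int i = 0; i < s; i++) sum += alpha[t - 1][i] * a[i][j];
                alpha[t][j] = sum * b[j][obs[t]];
            }
        // Backward pass: beta[t][i] = P(obs[t+1..n-1] | state_t = i)
        for (int i = 0; i < s; i++) beta[n - 1][i] = 1.0;
        for (int t = n - 2; t >= 0; t--)
            for (int i = 0; i < s; i++) {
                double sum = 0;
                for (int j = 0; j < s; j++)
                    sum += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
                beta[t][i] = sum;
            }
        // Marginal: gamma[t][i] is proportional to alpha[t][i] * beta[t][i]
        double[][] gamma = new double[n][s];
        for (int t = 0; t < n; t++) {
            double norm = 0;
            for (int i = 0; i < s; i++) norm += alpha[t][i] * beta[t][i];
            for (int i = 0; i < s; i++)
                gamma[t][i] = alpha[t][i] * beta[t][i] / norm;
        }
        return gamma;
    }

    public static void main(String[] args) {
        // Two hypothetical tags (say, nn and vb) and two word ids
        double[] pi = {0.7, 0.3};
        double[][] a = {{0.6, 0.4}, {0.5, 0.5}};
        double[][] b = {{0.8, 0.2}, {0.3, 0.7}};
        int[] obs = {0, 1, 0};
        for (double[] row : marginals(pi, a, b, obs))
            System.out.printf("%.3f %.3f%n", row[0], row[1]);
    }
}
```

A trained tagger performs this same computation over its estimated transition and emission probabilities; the displayed lattice output truncates each token's distribution to its top four categories.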

Training the OpenNLP POSModel

Training an OpenNLP POSModel is similar to the previous training examples. A training file is needed and should be large enough to provide a good sample set. Each sentence of the training file must be on a line by itself, with each token followed by an underscore character and then its tag.

The following training data was created using the first few sentences of Chapter 5, At A Venture, of Twenty Thousand Leagues Under the Sea. Although this is not a large sample set, it is easy to create and adequate for illustration purposes.

It is saved in a file named sample.train:

The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN
But_CC one_CD circumstance_NN happened_VBD which_WDT showed_VBD the_DT wonderful_JJ dexterity_NN of_IN Ned_NNP Land,_NNP and_CC proved_VBD what_WP confidence_NN we_PRP might_MD place_VB in_IN him._PRP$ 
The_DT 30th_JJ of_IN June,_NNP the_DT frigate_NN spoke_VBD some_DT American_NNP whalers,_, from_IN whom_WP we_PRP learned_VBD that_IN they_PRP knew_VBD nothing_NN about_IN the_DT narwhal._NN 
But_CC one_CD of_IN them,_PRP$ the_DT captain_NN of_IN the_DT Monroe,_NNP knowing_VBG that_IN Ned_NNP Land_NNP had_VBD shipped_VBN on_IN board_NN the_DT Abraham_NNP Lincoln,_NNP begged_VBD for_IN his_PRP$ help_NN in_IN chasing_VBG a_DT whale_NN they_PRP had_VBD in_IN sight._NN
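Before training, it can be useful to sanity-check data in this format. The following stdlib-only sketch (TrainFormat and splitPair are hypothetical helpers, not part of OpenNLP) splits each whitespace-delimited token_tag pair on its last underscore:

```java
public class TrainFormat {
    // Splits one "token_tag" pair on its last underscore,
    // returning {token, tag}. Using the last underscore keeps
    // any underscores inside the token itself intact.
    public static String[] splitPair(String pair) {
        int idx = pair.lastIndexOf('_');
        return new String[]{pair.substring(0, idx), pair.substring(idx + 1)};
    }

    public static void main(String[] args) {
        String line = "The_DT voyage_NN of_IN the_DT";
        for (String pair : line.split("\\s+")) {
            String[] parts = splitPair(pair);
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```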

We will demonstrate the creation of the model using the POSModel class' train method and how the model can be saved to a file. We start with the declaration of the POSModel instance variable:

POSModel model = null;

A try-with-resources block opens the sample file:

try (InputStream dataIn = new FileInputStream("sample.train");) {
    …
} catch (IOException e) {
    // Handle exceptions
}

An instance of the PlainTextByLineStream class is created and used with the WordTagSampleStream class to create an ObjectStream<POSSample> instance. This puts the sample data into the format required by the train method:

ObjectStream<String> lineStream = 
    new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<POSSample> sampleStream = 
    new WordTagSampleStream(lineStream);

The train method uses its parameters to specify the language, the sample stream, training parameters, and any dictionaries (none) needed, as shown here:

model = POSTaggerME.train("en", sampleStream,
    TrainingParameters.defaultParams(), null, null);

The output of this process is lengthy. The following output has been shortened to conserve space:

Indexing events using cutoff of 5

  Computing event counts...  done. 90 events
  Indexing...  done.
Sorting and merging events... done. Reduced 90 events to 82.
Done indexing.
Incorporating indexed data for training...  
done.
  Number of Event Tokens: 82
      Number of Outcomes: 17
    Number of Predicates: 45
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-254.98920096505964  0.14444444444444443
  2:  ... loglikelihood=-201.19283975630537  0.6
  3:  ... loglikelihood=-174.8849213436524  0.6111111111111112
  4:  ... loglikelihood=-157.58164262220754  0.6333333333333333
  5:  ... loglikelihood=-144.69272379986646  0.6555555555555556
...
 99:  ... loglikelihood=-33.461128002846024  0.9333333333333333
100:  ... loglikelihood=-33.29073273669207  0.9333333333333333

To save the model to a file, we use the following code. The output stream is created and the POSModel class' serialize method saves the model to the en_pos_verne.bin file:

try (OutputStream modelOut = new BufferedOutputStream(
        new FileOutputStream(new File("en_pos_verne.bin")));) {
    model.serialize(modelOut);
} catch (IOException e) {
    // Handle exceptions
}
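For completeness, the saved model can be loaded back and applied to new text. The following is a minimal sketch, assuming OpenNLP is on the classpath and that en_pos_verne.bin was created as shown above; the POSModel constructor accepts an InputStream, and the POSTaggerME class' tag method returns one tag per input token:

```java
String[] sentence = {"The", "voyage", "was", "for", "a", "long", "time"};
try (InputStream modelIn = new FileInputStream("en_pos_verne.bin")) {
    POSModel model = new POSModel(modelIn);
    POSTaggerME tagger = new POSTaggerME(model);
    // tag returns an array of tags parallel to the input tokens
    String[] tags = tagger.tag(sentence);
    for (int i = 0; i < sentence.length; i++) {
        System.out.print(sentence[i] + "/" + tags[i] + " ");
    }
    System.out.println();
} catch (IOException e) {
    // Handle exceptions
}
```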