We will demonstrate the NER process using OpenNLP, the Stanford API, and LingPipe. Each of these provides alternative techniques that can often do a good job of identifying entities in text. The following declaration will serve as the sample text to demonstrate the APIs:
```java
String sentences[] = {
    "Joe was the last person to see Fred. ",
    "He saw him in Boston at McKenzie's pub at 3:00 where he "
        + "paid $2.45 for an ale. ",
    "Joe wanted to go to Vermont for the day to visit a cousin who "
        + "works at IBM, but Sally and he had to look for Fred"};
```
We will demonstrate the use of the TokenNameFinderModel class to perform NER using the OpenNLP API. Additionally, we will demonstrate how to determine the probability that the identified entity is correct.
The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, and then use the find method to identify the entities in the text.
The following example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially and then use multiple sentences. The sentence is defined here:

```java
String sentence = "He was the last person to see Fred.";
```
We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models, respectively. The InputStream objects for these files are opened using a try-with-resources block, as shown here:

```java
try (InputStream tokenStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
     InputStream modelStream = new FileInputStream(
        new File(getModelDir(), "en-ner-person.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}
```
Within the try block, the TokenizerModel and Tokenizer objects are created:

```java
TokenizerModel tokenModel = new TokenizerModel(tokenStream);
Tokenizer tokenizer = new TokenizerME(tokenModel);
```
Next, an instance of the NameFinderME class is created using the person model:

```java
TokenNameFinderModel entityModel =
    new TokenNameFinderModel(modelStream);
NameFinderME nameFinder = new NameFinderME(entityModel);
```
We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method takes the tokenized String array as input and returns an array of Span objects, as shown:

```java
String tokens[] = tokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
```
We discussed the Span class in Chapter 3, Finding Sentences. As you may remember, this class holds positional information about the entities found; the actual string entities are still in the tokens array.

The following for statement displays the person found in the sentence. The positional information and the entity are displayed on separate lines:

```java
for (int i = 0; i < nameSpans.length; i++) {
    System.out.println("Span: " + nameSpans[i].toString());
    // Only the first token of the span is printed here
    System.out.println("Entity: " + tokens[nameSpans[i].getStart()]);
}
```
The output is as follows:

```
Span: [7..9) person
Entity: Fred
```
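The span is half-open: [7..9) covers the tokens from index 7 up to, but not including, index 9, so an entity may cover more than one token. The following plain-Java sketch shows how the full entity text could be rebuilt from such a span; no OpenNLP classes are used, and the SpanText class and entityText method are illustrative names, with start and end standing in for Span.getStart() and Span.getEnd():

```java
// Plain-Java sketch: reconstruct an entity's full text from a
// half-open token span [start, end), as OpenNLP's Span reports it.
public class SpanText {
    public static String entityText(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) sb.append(' ');
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "saw", "him", "in", "Boston"};
        System.out.println(entityText(tokens, 4, 5)); // Boston
        System.out.println(entityText(new String[]{"New", "York", "City"},
            0, 2)); // New York
    }
}
```

Joining the tokens with single spaces is a simplification; the original character offsets would be needed to recover the exact surface text.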
We will often work with multiple sentences. To demonstrate this, we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. The tokenize method is invoked against each sentence, and then the entity information is displayed as earlier:

```java
for (String sentence : sentences) {
    String tokens[] = tokenizer.tokenize(sentence);
    Span nameSpans[] = nameFinder.find(tokens);
    for (int i = 0; i < nameSpans.length; i++) {
        System.out.println("Span: " + nameSpans[i].toString());
        System.out.println("Entity: " + tokens[nameSpans[i].getStart()]);
    }
    System.out.println();
}
```
The output is as follows. There is an extra blank line between the two groups of entities because the second sentence did not contain a person:

```
Span: [0..1) person
Entity: Joe
Span: [7..9) person
Entity: Fred


Span: [0..1) person
Entity: Joe
Span: [19..20) person
Entity: Sally
Span: [26..27) person
Entity: Fred
```
When the TokenNameFinderModel identifies entities in text, it computes a probability for each entity. We can access this information using the probs method, as shown in the following line of code. This method returns an array of doubles whose elements correspond to the elements of the nameSpans array:

```java
double[] spanProbs = nameFinder.probs(nameSpans);
```
Add this statement to the previous example immediately after the use of the find method. Then add the next statement at the end of the nested for statement:

```java
System.out.println("Probability: " + spanProbs[i]);
```
When the example is executed, you will get the following output. The probability fields reflect the confidence level of the entity assignment. For the first entity, the model is 80.529 percent confident that "Joe" is a person:

```
Span: [0..1) person
Entity: Joe
Probability: 0.8052914774025202
Span: [7..9) person
Entity: Fred
Probability: 0.9042160889302772


Span: [0..1) person
Entity: Joe
Probability: 0.9620970782763985
Span: [19..20) person
Entity: Sally
Probability: 0.964568603518126
Span: [26..27) person
Entity: Fred
Probability: 0.990383039618594
```
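One common use of these probabilities is to discard low-confidence entities. The following is a minimal plain-Java sketch of such a filter; the ProbabilityFilter class is an illustrative name, and the parallel arrays mimic the outputs of the find and probs methods:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: keep only entities whose probability meets a
// threshold. The values here are the ones shown in the output above.
public class ProbabilityFilter {
    public static List<String> filter(String[] entities, double[] probs,
            double threshold) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < entities.length; i++) {
            if (probs[i] >= threshold) {
                kept.add(entities[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] entities = {"Joe", "Fred", "Joe", "Sally", "Fred"};
        double[] probs = {0.805, 0.904, 0.962, 0.964, 0.990};
        // Entities at or above the 0.9 threshold
        System.out.println(filter(entities, probs, 0.9));
        // [Fred, Joe, Sally, Fred]
    }
}
```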
OpenNLP supports different models, as listed in the following table. These models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The prefix en specifies English as the language, and ner indicates that the model is for NER.
| English finder models | Filename |
|---|---|
| Location name finder model | en-ner-location.bin |
| Money name finder model | en-ner-money.bin |
| Organization name finder model | en-ner-organization.bin |
| Percentage name finder model | en-ner-percentage.bin |
| Person name finder model | en-ner-person.bin |
| Time name finder model | en-ner-time.bin |
If we modify the statement to use a different model file, we can see how the other models work against the sample sentences. Here, the name finder model is replaced with the time model:

```java
InputStream modelStream = new FileInputStream(
    new File(getModelDir(), "en-ner-time.bin"));
```
The model failed to find the time entities in the sample text: it did not have enough confidence in any candidate time entity to report it.
We can also handle multiple entity types at the same time. This involves creating an instance of the NameFinderME class for each model within a loop, applying each model against every sentence, and keeping track of the entities as they are found.
We will illustrate this process with the following example. It requires rewriting the previous try block to create the InputStream instances within the block, as shown here:

```java
try {
    InputStream tokenStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
    TokenizerModel tokenModel = new TokenizerModel(tokenStream);
    Tokenizer tokenizer = new TokenizerME(tokenModel);
    ...
} catch (Exception ex) {
    // Handle exceptions
}
```
Within the try block, we will define a String array to hold the names of the model files. As shown here, we will use models for people, locations, and organizations:

```java
String modelNames[] = {"en-ner-person.bin",
    "en-ner-location.bin", "en-ner-organization.bin"};
```
An ArrayList instance is created to hold the entities as they are discovered:

```java
ArrayList<String> list = new ArrayList<>();
```
A for-each statement is used to load one model at a time and then to create an instance of the NameFinderME class:

```java
for (String name : modelNames) {
    TokenNameFinderModel entityModel = new TokenNameFinderModel(
        new FileInputStream(new File(getModelDir(), name)));
    NameFinderME nameFinder = new NameFinderME(entityModel);
    ...
}
```
Previously, we did not try to identify which sentences the entities were found in. This is not hard to do, but we need to use a simple for statement instead of a for-each statement to keep track of the sentence indexes. This is shown in the following example, where the previous example has been modified to use the integer variable index to track the current sentence. Otherwise, the code works the same way as earlier:

```java
for (int index = 0; index < sentences.length; index++) {
    String tokens[] = tokenizer.tokenize(sentences[index]);
    Span nameSpans[] = nameFinder.find(tokens);
    for (Span span : nameSpans) {
        list.add("Sentence: " + index
            + " Span: " + span.toString()
            + " Entity: " + tokens[span.getStart()]);
    }
}
```
The entities discovered are then displayed:

```java
for (String element : list) {
    System.out.println(element);
}
```
The output is as follows:

```
Sentence: 0 Span: [0..1) person Entity: Joe
Sentence: 0 Span: [7..9) person Entity: Fred
Sentence: 2 Span: [0..1) person Entity: Joe
Sentence: 2 Span: [19..20) person Entity: Sally
Sentence: 2 Span: [26..27) person Entity: Fred
Sentence: 1 Span: [4..5) location Entity: Boston
Sentence: 2 Span: [5..6) location Entity: Vermont
Sentence: 2 Span: [16..17) organization Entity: IBM
```
We will demonstrate how the CRFClassifier class is used to perform NER. This class implements what is known as a linear chain Conditional Random Field (CRF) sequence model.
To demonstrate the use of the CRFClassifier class, we will start with a declaration of the classifier file string, as shown here:

```java
String model = getModelDir()
    + "/english.conll.4class.distsim.crf.ser.gz";
```
The classifier is then created using the model:

```java
CRFClassifier<CoreLabel> classifier =
    CRFClassifier.getClassifierNoExceptions(model);
```
The classify method takes a single string representing the text to be processed. To use the sentences array, we need to convert it to a single string:

```java
String sentence = "";
for (String element : sentences) {
    sentence += element;
}
```
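Concatenating with += in a loop copies the accumulated string on every iteration; for longer texts, a StringBuilder avoids this. A minimal plain-Java sketch of the same join (the JoinDemo class is an illustrative name):

```java
// Plain-Java sketch: join an array of sentence strings with a
// StringBuilder instead of repeated String concatenation.
public class JoinDemo {
    public static String join(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] sentences = {"Joe was here. ", "So was Fred."};
        System.out.println(join(sentences)); // Joe was here. So was Fred.
    }
}
```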
The classify method is then applied to the text:

```java
List<List<CoreLabel>> entityList = classifier.classify(sentence);
```
The method returns a List whose elements are themselves List instances of CoreLabel objects: each inner list represents one sentence of the text, and each CoreLabel represents a word with additional information attached to it. In the outer for-each statement in the following code sequence, the reference variable internalList represents one sentence. In the inner for-each statement, each word in that sentence is processed. The word method returns the word, and the get method returns the category of the word.
The words and their types are then displayed:

```java
for (List<CoreLabel> internalList : entityList) {
    for (CoreLabel coreLabel : internalList) {
        String word = coreLabel.word();
        String category = coreLabel.get(
            CoreAnnotations.AnswerAnnotation.class);
        System.out.println(word + ":" + category);
    }
}
```
Part of the output follows. It has been truncated because every word is displayed. The O represents the "Other" category:

```
Joe:PERSON
was:O
the:O
last:O
person:O
to:O
see:O
Fred:PERSON
.:O
He:O
...
look:O
for:O
Fred:PERSON
```
To filter out the words that are not relevant, replace the println statement with the following statements. This will eliminate the other categories:

```java
if (!"O".equals(category)) {
    System.out.println(word + ":" + category);
}
```
The output is simpler now:

```
Joe:PERSON
Fred:PERSON
Boston:LOCATION
McKenzie:PERSON
Joe:PERSON
Vermont:LOCATION
IBM:ORGANIZATION
Sally:PERSON
Fred:PERSON
```
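Note that the classifier labels each token separately, so a multi-word name appears as several consecutive tokens with the same category. The following plain-Java sketch merges such runs into single entities; no Stanford classes are used, the EntityGrouper class is an illustrative name, and the word/category arrays mimic the classifier's per-token output:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: merge consecutive tokens that share the same
// non-O category into one entity string.
public class EntityGrouper {
    public static List<String> group(String[] words, String[] categories) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentCat = "O";
        // Iterate one past the end so the final run is flushed
        for (int i = 0; i <= words.length; i++) {
            String cat = (i < words.length) ? categories[i] : "O";
            if (cat.equals(currentCat) && !cat.equals("O")) {
                current.append(' ').append(words[i]); // extend the run
            } else {
                if (!currentCat.equals("O")) {
                    entities.add(current + ":" + currentCat);
                }
                current = new StringBuilder(i < words.length ? words[i] : "");
                currentCat = cat;
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] words = {"Joe", "saw", "New", "York"};
        String[] cats = {"PERSON", "O", "LOCATION", "LOCATION"};
        System.out.println(group(words, cats));
        // [Joe:PERSON, New York:LOCATION]
    }
}
```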
We previously demonstrated the use of LingPipe with regular expressions in the Using regular expressions for NER section earlier in this chapter. Here, we will demonstrate how named entity models and the ExactDictionaryChunker class are used to perform NER analysis.
LingPipe has a few named entity models that we can use with chunking. These files consist of a serialized object that can be read from a file and then applied to text. These objects implement the Chunker interface. The chunking process results in a series of Chunking objects that identify the entities of interest.
A list of the NER models is found in the following table. These models can be downloaded from http://alias-i.com/lingpipe/web/models.html:

| Genre | Corpus | File |
|---|---|---|
| English News | MUC-6 | ne-en-news-muc6.AbstractCharLmRescoringChunker |
| English Genes | GeneTag | ne-en-bio-genetag.HmmChunker |
| English Genomics | GENIA | ne-en-bio-genia.TokenShapeChunker |
We will use the model found in the ne-en-news-muc6.AbstractCharLmRescoringChunker file to demonstrate how this class is used. We start with a try-catch block to deal with exceptions, as shown in the following example. The file is opened and used with the AbstractExternalizable class's static readObject method to create a Chunker instance. This method reads in the serialized model:

```java
try {
    File modelFile = new File(getModelDir(),
        "ne-en-news-muc6.AbstractCharLmRescoringChunker");
    Chunker chunker = (Chunker)
        AbstractExternalizable.readObject(modelFile);
    ...
} catch (IOException | ClassNotFoundException ex) {
    // Handle exception
}
```
The Chunker and Chunking interfaces provide methods that work with a set of chunks of text. The Chunker interface's chunk method returns an object that implements the Chunking interface. The following sequence displays the chunks found in each sentence of the text:

```java
for (int i = 0; i < sentences.length; ++i) {
    Chunking chunking = chunker.chunk(sentences[i]);
    System.out.println("Chunking=" + chunking);
}
```
The output of this sequence is as follows:

```
Chunking=Joe was the last person to see Fred.  : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity]
Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity]
Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity]
```
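Unlike OpenNLP's token-based spans, these chunk positions, such as 0-3:PERSON, are half-open character offsets into the sentence, so extracting the entity text is a substring operation. A small plain-Java sketch (the CharSpanDemo class and chunkText method are illustrative names; start and end stand in for a chunk's reported offsets):

```java
// Plain-Java sketch: extract an entity's text from half-open
// character offsets [start, end) into the original sentence.
public class CharSpanDemo {
    public static String chunkText(String sentence, int start, int end) {
        return sentence.substring(start, end);
    }

    public static void main(String[] args) {
        String sentence = "Joe was the last person to see Fred. ";
        // The chunking above reported 0-3:PERSON and
        // 31-35:ORGANIZATION for this sentence.
        System.out.println(chunkText(sentence, 0, 3));   // Joe
        System.out.println(chunkText(sentence, 31, 35)); // Fred
    }
}
```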
Instead, we can use methods of the Chunk class to extract specific pieces of information, as illustrated here. We will replace the previous for statement with the following for-each statement. This calls a displayChunkSet method developed in the Using LingPipe's RegExChunker class section earlier in this chapter:

```java
for (String sentence : sentences) {
    displayChunkSet(chunker, sentence);
}
```
The output that follows shows the result. However, it does not always match the entity type correctly:

```
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
Type: LOCATION Entity: [Boston] Score: -Infinity
Type: PERSON Entity: [McKenzie] Score: -Infinity
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Vermont] Score: -Infinity
Type: ORGANIZATION Entity: [IBM] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
```
The ExactDictionaryChunker class provides an easy way to create a dictionary of entities and their types, which can then be used to find them in text. It uses a MapDictionary object to store entries; the ExactDictionaryChunker class then extracts chunks based on the dictionary.
The AbstractDictionary interface supports basic operations for entities, categories, and scores. The score is used in the matching process. The MapDictionary and TrieDictionary classes implement the AbstractDictionary interface. The TrieDictionary class stores information using a character trie structure, which uses less memory and is useful when memory is a concern. We will use the MapDictionary class for our example.
To illustrate this approach, we start with a declaration of the MapDictionary class:

```java
private MapDictionary<String> dictionary;
```
The dictionary will contain the entities that we are interested in finding. We need to initialize the dictionary, as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments:

- String: The name of the entity
- String: The category of the entity
- Double: A score for the entity

The score is used when determining matches. A few entities are declared and added to the dictionary:

```java
private static void initializeDictionary() {
    dictionary = new MapDictionary<String>();
    dictionary.addEntry(
        new DictionaryEntry<String>("Joe", "PERSON", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Fred", "PERSON", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Boston", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("pub", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Vermont", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("IBM", "ORGANIZATION", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Sally", "PERSON", 1.0));
}
```
An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker constructor are detailed here:

- Dictionary<String>: The dictionary containing the entities
- TokenizerFactory: The tokenizer used by the chunker
- boolean: If true, the chunker returns all matches
- boolean: If true, matches are case sensitive

Matches can be overlapping. For example, in the phrase "The First National Bank", the entity "bank" could be matched by itself or in conjunction with the rest of the phrase. The third parameter determines whether all of the matches are returned.
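To make the overlap behavior concrete, here is a plain-Java sketch of a naive all-matches search; ExactDictionaryChunker itself is not used, and the OverlapDemo class is an illustrative name:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: report every dictionary phrase found in the
// text, including overlapping matches, in the spirit of the
// "return all matches" constructor argument.
public class OverlapDemo {
    public static List<String> allMatches(String text, String[] phrases) {
        List<String> matches = new ArrayList<>();
        for (String phrase : phrases) {
            int from = 0;
            int at;
            while ((at = text.indexOf(phrase, from)) >= 0) {
                matches.add(phrase);
                from = at + 1; // continue past this match
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] dictionary = {"First National Bank", "Bank"};
        System.out.println(allMatches("The First National Bank", dictionary));
        // Both the full phrase and the overlapping "Bank" are reported:
        // [First National Bank, Bank]
    }
}
```

A real chunker would also report offsets and categories; this only illustrates why a single stretch of text can yield more than one match.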
In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer, returning all matches and ignoring the case of the tokens:

```java
initializeDictionary();
ExactDictionaryChunker dictionaryChunker
    = new ExactDictionaryChunker(dictionary,
        IndoEuropeanTokenizerFactory.INSTANCE, true, false);
```
The dictionaryChunker object is used with each sentence, as shown in the following code sequence. We will use the displayChunkSet method as developed in the Using LingPipe's RegExChunker class section earlier in this chapter:

```java
for (String sentence : sentences) {
    System.out.println(" TEXT=" + sentence);
    displayChunkSet(dictionaryChunker, sentence);
}
```
On execution, we get the following output:

```
TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0
TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
```
This does a pretty good job, but creating the dictionary for a large vocabulary requires a lot of effort.