We will demonstrate the NER process using OpenNLP, the Stanford API, and LingPipe. Each of these provides alternative techniques that can often do a good job of identifying entities in text. The following declaration will serve as the sample text to demonstrate the APIs:
```java
String sentences[] = {
    "Joe was the last person to see Fred. ",
    "He saw him in Boston at McKenzie's pub at 3:00 where he "
        + "paid $2.45 for an ale. ",
    "Joe wanted to go to Vermont for the day to visit a cousin who "
        + "works at IBM, but Sally and he had to look for Fred"};
```
We will demonstrate the use of the TokenNameFinderModel class to perform NER using the OpenNLP API. Additionally, we will demonstrate how to determine the probability that the identified entity is correct.
The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, and then use the find method to identify the entities in the text.
The following example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially and then use multiple sentences. The sentence is defined here:

```java
String sentence = "He was the last person to see Fred.";
```
We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models, respectively. The InputStream objects for these files are opened using a try-with-resources block, as shown here:

```java
try (InputStream tokenStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
     InputStream modelStream = new FileInputStream(
        new File(getModelDir(), "en-ner-person.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}
```
Within the try block, the TokenizerModel and Tokenizer objects are created:

```java
TokenizerModel tokenModel = new TokenizerModel(tokenStream);
Tokenizer tokenizer = new TokenizerME(tokenModel);
```
Next, an instance of the NameFinderME class is created using the person model:

```java
TokenNameFinderModel entityModel =
    new TokenNameFinderModel(modelStream);
NameFinderME nameFinder = new NameFinderME(entityModel);
```
We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method takes the tokenized String array as input and returns an array of Span objects, as shown:

```java
String tokens[] = tokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
```
We discussed the Span class in Chapter 3, Finding Sentences. As you may remember, this class holds positional information about the entities found; the actual string entities are still in the tokens array.

The following for statement displays the person found in the sentence. The positional information and the entity are displayed on separate lines:

```java
for (int i = 0; i < nameSpans.length; i++) {
    System.out.println("Span: " + nameSpans[i].toString());
    // Only the first token of the span is printed here
    System.out.println("Entity: " + tokens[nameSpans[i].getStart()]);
}
```
The output is as follows:

```
Span: [7..9) person
Entity: Fred
```
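The span is half-open: [7..9) covers the tokens from index 7 up to, but not including, index 9, so an entity may cover more than one token. The following plain-Java sketch shows how the full entity text could be rebuilt from such a span; no OpenNLP classes are used, and the SpanText class and entityText method are illustrative names, with start and end standing in for Span.getStart() and Span.getEnd():

```java
// Plain-Java sketch: reconstruct an entity's full text from a
// half-open token span [start, end), as OpenNLP's Span reports it.
public class SpanText {
    public static String entityText(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) sb.append(' ');
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "saw", "him", "in", "Boston"};
        System.out.println(entityText(tokens, 4, 5)); // Boston
        System.out.println(entityText(new String[]{"New", "York", "City"},
            0, 2)); // New York
    }
}
```

Joining the tokens with single spaces is a simplification; the original character offsets would be needed to recover the exact surface text.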
We will often work with multiple sentences. To demonstrate this, we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. The tokenize method is invoked against each sentence, and then the entity information is displayed as earlier:

```java
for (String sentence : sentences) {
    String tokens[] = tokenizer.tokenize(sentence);
    Span nameSpans[] = nameFinder.find(tokens);
    for (int i = 0; i < nameSpans.length; i++) {
        System.out.println("Span: " + nameSpans[i].toString());
        System.out.println("Entity: " + tokens[nameSpans[i].getStart()]);
    }
    System.out.println();
}
```
The output is as follows. There is an extra blank line between the two groups of entities because the second sentence did not contain a person:

```
Span: [0..1) person
Entity: Joe
Span: [7..9) person
Entity: Fred


Span: [0..1) person
Entity: Joe
Span: [19..20) person
Entity: Sally
Span: [26..27) person
Entity: Fred
```
When the TokenNameFinderModel identifies entities in text, it computes a probability for each entity. We can access this information using the probs method, as shown in the following line of code. This method returns an array of doubles whose elements correspond to the elements of the nameSpans array:

```java
double[] spanProbs = nameFinder.probs(nameSpans);
```
Add this statement to the previous example immediately after the use of the find method. Then add the next statement at the end of the nested for statement:

```java
System.out.println("Probability: " + spanProbs[i]);
```
When the example is executed, you will get the following output. The probability fields reflect the confidence level of the entity assignment. For the first entity, the model is 80.529 percent confident that "Joe" is a person:

```
Span: [0..1) person
Entity: Joe
Probability: 0.8052914774025202
Span: [7..9) person
Entity: Fred
Probability: 0.9042160889302772


Span: [0..1) person
Entity: Joe
Probability: 0.9620970782763985
Span: [19..20) person
Entity: Sally
Probability: 0.964568603518126
Span: [26..27) person
Entity: Fred
Probability: 0.990383039618594
```
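One common use of these probabilities is to discard low-confidence entities. The following is a minimal plain-Java sketch of such a filter; the ProbabilityFilter class is an illustrative name, and the parallel arrays mimic the outputs of the find and probs methods:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: keep only entities whose probability meets a
// threshold. The values here are the ones shown in the output above.
public class ProbabilityFilter {
    public static List<String> filter(String[] entities, double[] probs,
            double threshold) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < entities.length; i++) {
            if (probs[i] >= threshold) {
                kept.add(entities[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] entities = {"Joe", "Fred", "Joe", "Sally", "Fred"};
        double[] probs = {0.805, 0.904, 0.962, 0.964, 0.990};
        // Entities at or above the 0.9 threshold
        System.out.println(filter(entities, probs, 0.9));
        // [Fred, Joe, Sally, Fred]
    }
}
```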
OpenNLP supports different models, as listed in the following table. These models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The prefix en specifies English as the language, and ner indicates that the model is for NER.
| English finder models | Filename |
|---|---|
| Location name finder model | en-ner-location.bin |
| Money name finder model | en-ner-money.bin |
| Organization name finder model | en-ner-organization.bin |
| Percentage name finder model | en-ner-percentage.bin |
| Person name finder model | en-ner-person.bin |
| Time name finder model | en-ner-time.bin |
If we modify the statement to use a different model file, we can see how the other models work against the sample sentences. Here, the name finder model is replaced with the time model:

```java
InputStream modelStream = new FileInputStream(
    new File(getModelDir(), "en-ner-time.bin"));
```
The model failed to find the time entities in the sample text: it did not have enough confidence in any candidate time entity to report it.
We can also handle multiple entity types at the same time. This involves creating an instance of the NameFinderME class for each model within a loop, applying each model against every sentence, and keeping track of the entities as they are found.
We will illustrate this process with the following example. It requires rewriting the previous try block to create the InputStream instances within the block, as shown here:

```java
try {
    InputStream tokenStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
    TokenizerModel tokenModel = new TokenizerModel(tokenStream);
    Tokenizer tokenizer = new TokenizerME(tokenModel);
    ...
} catch (Exception ex) {
    // Handle exceptions
}
```
Within the try block, we will define a String array to hold the names of the model files. As shown here, we will use models for people, locations, and organizations:

```java
String modelNames[] = {"en-ner-person.bin",
    "en-ner-location.bin", "en-ner-organization.bin"};
```
An ArrayList instance is created to hold the entities as they are discovered:

```java
ArrayList<String> list = new ArrayList<>();
```
A for-each statement is used to load one model at a time and then to create an instance of the NameFinderME class:

```java
for (String name : modelNames) {
    TokenNameFinderModel entityModel = new TokenNameFinderModel(
        new FileInputStream(new File(getModelDir(), name)));
    NameFinderME nameFinder = new NameFinderME(entityModel);
    ...
}
```
Previously, we did not try to identify which sentences the entities were found in. This is not hard to do, but we need to use a simple for statement instead of a for-each statement to keep track of the sentence indexes. This is shown in the following example, where the previous example has been modified to use the integer variable index to track the current sentence. Otherwise, the code works the same way as earlier:

```java
for (int index = 0; index < sentences.length; index++) {
    String tokens[] = tokenizer.tokenize(sentences[index]);
    Span nameSpans[] = nameFinder.find(tokens);
    for (Span span : nameSpans) {
        list.add("Sentence: " + index
            + " Span: " + span.toString()
            + " Entity: " + tokens[span.getStart()]);
    }
}
```
The entities discovered are then displayed:

```java
for (String element : list) {
    System.out.println(element);
}
```
The output is as follows:

```
Sentence: 0 Span: [0..1) person Entity: Joe
Sentence: 0 Span: [7..9) person Entity: Fred
Sentence: 2 Span: [0..1) person Entity: Joe
Sentence: 2 Span: [19..20) person Entity: Sally
Sentence: 2 Span: [26..27) person Entity: Fred
Sentence: 1 Span: [4..5) location Entity: Boston
Sentence: 2 Span: [5..6) location Entity: Vermont
Sentence: 2 Span: [16..17) organization Entity: IBM
```
We will demonstrate how the CRFClassifier class is used to perform NER. This class implements what is known as a linear chain Conditional Random Field (CRF) sequence model.
To demonstrate the use of the CRFClassifier class, we will start with a declaration of the classifier file string, as shown here:

```java
String model = getModelDir()
    + "/english.conll.4class.distsim.crf.ser.gz";
```
The classifier is then created using the model:

```java
CRFClassifier<CoreLabel> classifier =
    CRFClassifier.getClassifierNoExceptions(model);
```
The classify method takes a single string representing the text to be processed. To use the sentences array, we need to convert it to a single string:

```java
String sentence = "";
for (String element : sentences) {
    sentence += element;
}
```
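Concatenating with += in a loop copies the accumulated string on every iteration; for longer texts, a StringBuilder avoids this. A minimal plain-Java sketch of the same join (the JoinDemo class is an illustrative name):

```java
// Plain-Java sketch: join an array of sentence strings with a
// StringBuilder instead of repeated String concatenation.
public class JoinDemo {
    public static String join(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] sentences = {"Joe was here. ", "So was Fred."};
        System.out.println(join(sentences)); // Joe was here. So was Fred.
    }
}
```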
The classify method is then applied to the text:

```java
List<List<CoreLabel>> entityList = classifier.classify(sentence);
```
The method returns a List whose elements are themselves List instances of CoreLabel objects: each inner list represents one sentence of the text, and each CoreLabel represents a word with additional information attached to it. In the outer for-each statement in the following code sequence, the reference variable internalList represents one sentence. In the inner for-each statement, each word in that sentence is processed. The word method returns the word, and the get method returns the category of the word.
The words and their types are then displayed:

```java
for (List<CoreLabel> internalList : entityList) {
    for (CoreLabel coreLabel : internalList) {
        String word = coreLabel.word();
        String category = coreLabel.get(
            CoreAnnotations.AnswerAnnotation.class);
        System.out.println(word + ":" + category);
    }
}
```
Part of the output follows. It has been truncated because every word is displayed. The O represents the "Other" category:

```
Joe:PERSON
was:O
the:O
last:O
person:O
to:O
see:O
Fred:PERSON
.:O
He:O
...
look:O
for:O
Fred:PERSON
```
To filter out the words that are not relevant, replace the println statement with the following statements. This will eliminate the other categories:

```java
if (!"O".equals(category)) {
    System.out.println(word + ":" + category);
}
```
The output is simpler now:

```
Joe:PERSON
Fred:PERSON
Boston:LOCATION
McKenzie:PERSON
Joe:PERSON
Vermont:LOCATION
IBM:ORGANIZATION
Sally:PERSON
Fred:PERSON
```
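Note that the classifier labels each token separately, so a multi-word name appears as several consecutive tokens with the same category. The following plain-Java sketch merges such runs into single entities; no Stanford classes are used, the EntityGrouper class is an illustrative name, and the word/category arrays mimic the classifier's per-token output:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: merge consecutive tokens that share the same
// non-O category into one entity string.
public class EntityGrouper {
    public static List<String> group(String[] words, String[] categories) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentCat = "O";
        // Iterate one past the end so the final run is flushed
        for (int i = 0; i <= words.length; i++) {
            String cat = (i < words.length) ? categories[i] : "O";
            if (cat.equals(currentCat) && !cat.equals("O")) {
                current.append(' ').append(words[i]); // extend the run
            } else {
                if (!currentCat.equals("O")) {
                    entities.add(current + ":" + currentCat);
                }
                current = new StringBuilder(i < words.length ? words[i] : "");
                currentCat = cat;
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] words = {"Joe", "saw", "New", "York"};
        String[] cats = {"PERSON", "O", "LOCATION", "LOCATION"};
        System.out.println(group(words, cats));
        // [Joe:PERSON, New York:LOCATION]
    }
}
```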
We previously demonstrated the use of LingPipe with regular expressions in the Using regular expressions for NER section earlier in this chapter. Here, we will demonstrate how named entity models and the ExactDictionaryChunker class are used to perform NER analysis.
LingPipe has a few named entity models that we can use with chunking. These files consist of a serialized object that can be read from a file and then applied to text. These objects implement the Chunker interface. The chunking process results in a series of Chunking objects that identify the entities of interest.
A list of the NER models is found in the following table. These models can be downloaded from http://alias-i.com/lingpipe/web/models.html:

| Genre | Corpus | File |
|---|---|---|
| English News | MUC-6 | ne-en-news-muc6.AbstractCharLmRescoringChunker |
| English Genes | GeneTag | ne-en-bio-genetag.HmmChunker |
| English Genomics | GENIA | ne-en-bio-genia.TokenShapeChunker |
We will use the model found in the ne-en-news-muc6.AbstractCharLmRescoringChunker file to demonstrate how this class is used. We start with a try-catch block to deal with exceptions, as shown in the following example. The file is opened and used with the AbstractExternalizable class's static readObject method to create a Chunker instance. This method reads in the serialized model:

```java
try {
    File modelFile = new File(getModelDir(),
        "ne-en-news-muc6.AbstractCharLmRescoringChunker");
    Chunker chunker = (Chunker)
        AbstractExternalizable.readObject(modelFile);
    ...
} catch (IOException | ClassNotFoundException ex) {
    // Handle exception
}
```
The Chunker and Chunking interfaces provide methods that work with a set of chunks of text. The Chunker interface's chunk method returns an object that implements the Chunking interface. The following sequence displays the chunks found in each sentence of the text:

```java
for (int i = 0; i < sentences.length; ++i) {
    Chunking chunking = chunker.chunk(sentences[i]);
    System.out.println("Chunking=" + chunking);
}
```
The output of this sequence is as follows:

```
Chunking=Joe was the last person to see Fred.  : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity]
Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity]
Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity]
```
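Unlike OpenNLP's token-based spans, these chunk positions, such as 0-3:PERSON, are half-open character offsets into the sentence, so extracting the entity text is a substring operation. A small plain-Java sketch (the CharSpanDemo class and chunkText method are illustrative names; start and end stand in for a chunk's reported offsets):

```java
// Plain-Java sketch: extract an entity's text from half-open
// character offsets [start, end) into the original sentence.
public class CharSpanDemo {
    public static String chunkText(String sentence, int start, int end) {
        return sentence.substring(start, end);
    }

    public static void main(String[] args) {
        String sentence = "Joe was the last person to see Fred. ";
        // The chunking above reported 0-3:PERSON and
        // 31-35:ORGANIZATION for this sentence.
        System.out.println(chunkText(sentence, 0, 3));   // Joe
        System.out.println(chunkText(sentence, 31, 35)); // Fred
    }
}
```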
Instead, we can use methods of the Chunk class to extract specific pieces of information, as illustrated here. We will replace the previous for statement with the following for-each statement. This calls a displayChunkSet method developed in the Using LingPipe's RegExChunker class section earlier in this chapter:

```java
for (String sentence : sentences) {
    displayChunkSet(chunker, sentence);
}
```
The output that follows shows the result. However, it does not always match the entity type correctly:

```
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
Type: LOCATION Entity: [Boston] Score: -Infinity
Type: PERSON Entity: [McKenzie] Score: -Infinity
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Vermont] Score: -Infinity
Type: ORGANIZATION Entity: [IBM] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
```
The ExactDictionaryChunker class provides an easy way to create a dictionary of entities and their types, which can then be used to find them in text. It uses a MapDictionary object to store entries; the ExactDictionaryChunker class then extracts chunks based on the dictionary.
The AbstractDictionary interface supports basic operations for entities, categories, and scores. The score is used in the matching process. The MapDictionary and TrieDictionary classes implement the AbstractDictionary interface. The TrieDictionary class stores information using a character trie structure, which uses less memory and is useful when memory is a concern. We will use the MapDictionary class for our example.
To illustrate this approach, we start with a declaration of the MapDictionary class:

```java
private MapDictionary<String> dictionary;
```
The dictionary will contain the entities that we are interested in finding. We need to initialize the dictionary, as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments:

- String: The name of the entity
- String: The category of the entity
- Double: A score for the entity

The score is used when determining matches. A few entities are declared and added to the dictionary:

```java
private static void initializeDictionary() {
    dictionary = new MapDictionary<String>();
    dictionary.addEntry(
        new DictionaryEntry<String>("Joe", "PERSON", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Fred", "PERSON", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Boston", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("pub", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Vermont", "PLACE", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("IBM", "ORGANIZATION", 1.0));
    dictionary.addEntry(
        new DictionaryEntry<String>("Sally", "PERSON", 1.0));
}
```
An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker constructor are detailed here:

- Dictionary<String>: The dictionary containing the entities
- TokenizerFactory: The tokenizer used by the chunker
- boolean: If true, the chunker returns all matches
- boolean: If true, matches are case sensitive

Matches can be overlapping. For example, in the phrase "The First National Bank", the entity "bank" could be matched by itself or in conjunction with the rest of the phrase. The third parameter determines whether all of the matches are returned.
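To make the overlap behavior concrete, here is a plain-Java sketch of a naive all-matches search; ExactDictionaryChunker itself is not used, and the OverlapDemo class is an illustrative name:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: report every dictionary phrase found in the
// text, including overlapping matches, in the spirit of the
// "return all matches" constructor argument.
public class OverlapDemo {
    public static List<String> allMatches(String text, String[] phrases) {
        List<String> matches = new ArrayList<>();
        for (String phrase : phrases) {
            int from = 0;
            int at;
            while ((at = text.indexOf(phrase, from)) >= 0) {
                matches.add(phrase);
                from = at + 1; // continue past this match
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] dictionary = {"First National Bank", "Bank"};
        System.out.println(allMatches("The First National Bank", dictionary));
        // Both the full phrase and the overlapping "Bank" are reported:
        // [First National Bank, Bank]
    }
}
```

A real chunker would also report offsets and categories; this only illustrates why a single stretch of text can yield more than one match.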
In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer, returning all matches and ignoring the case of the tokens:

```java
initializeDictionary();
ExactDictionaryChunker dictionaryChunker
    = new ExactDictionaryChunker(dictionary,
        IndoEuropeanTokenizerFactory.INSTANCE, true, false);
```
The dictionaryChunker object is used with each sentence, as shown in the following code sequence. We will use the displayChunkSet method as developed in the Using LingPipe's RegExChunker class section earlier in this chapter:

```java
for (String sentence : sentences) {
    System.out.println(" TEXT=" + sentence);
    displayChunkSet(dictionaryChunker, sentence);
}
```
On execution, we get the following output:

```
TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0
TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
```
This does a pretty good job, but creating the dictionary for a large vocabulary requires a lot of effort.