Based on our discussion so far in this chapter, we know that building an NER system will start with the following steps:
To help us correctly find tokens at step 2, separate the real named entities from the impostors at step 4, and place the entities into the correct class at step 5, it is common to leverage a machine learning approach, similar to what NLTK and its sentiment mining functions did for us in Chapter 5, Sentiment Analysis in Text. Relying on a large set of pre-classified examples will help us work through some of the more complicated issues we introduced above for recognizing named entities, such as choosing the correct boundary in multi-word noun phrases, recognizing novel approaches to capitalization, or knowing what kind of named entity a phrase refers to.
But even with a flexible machine learning approach, the vagaries and oddities of written language remind us to be careful; some legitimate named entities may slip through the cracks, while other tokens may be called named entities when they really are not. With so many exceptions to the rules for how to find named entities, the risk of generating false positives or false negatives is high. Therefore, just as with earlier chapters, we will need an evaluation plan for whatever machine learning-based NER system we end up choosing.
Because of the potential for NER systems to over- or under-classify words as named entities, we will need to use the kind of false positive and false negative calculations that we first saw in Chapter 3, Entity Matching. However, the calculations will be slightly different with NER due to partial matches. Partial matches happen when our NER system catches, for example, Caribbean but not Pirates of the Caribbean. These are sometimes called boundary errors, since the NER system found some of the token, but messed up its boundaries by being too short or too long. An NER system may also misidentify the entity into the wrong class. For example, it may recognize Fido but call it a GPE rather than a PERSON.
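To make these error types concrete, here is a minimal Python sketch. The `compare_entities` helper and its `(start, end, class)` tuple representation are illustrative assumptions for this example, not part of any NER library:

```python
def compare_entities(expected, guessed):
    """Classify a guessed entity against the expected one.

    Entities are modelled as (start, end, class) tuples over
    character offsets in the source text.
    """
    exp_start, exp_end, exp_class = expected
    gs_start, gs_end, gs_class = guessed
    same_span = (exp_start, exp_end) == (gs_start, gs_end)
    overlaps = gs_start < exp_end and exp_start < gs_end
    if same_span and exp_class == gs_class:
        return "exact match"
    if overlaps and not same_span:
        return "boundary error"   # e.g. Caribbean vs. Pirates of the Caribbean
    if same_span and exp_class != gs_class:
        return "class error"      # e.g. Fido tagged GPE rather than PERSON
    return "no match"

# "Pirates of the Caribbean" spans characters 0-25; the guess
# caught only "Caribbean" (characters 15-24) and mislabeled it
print(compare_entities((0, 25, "ORG"), (15, 24, "GPE")))  # boundary error
print(compare_entities((0, 4, "PER"), (0, 4, "GPE")))     # class error
```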
With these kinds of partial matches, we have three choices for how to handle them:

  * Strict scoring: only count a guess as correct when both the boundary and the class match exactly
  * Lenient scoring: count a guess as correct if it overlaps the expected entity at all
  * Partial credit scoring: award partial scores for partial matches
Strict and lenient scoring are straightforward to understand, but partial credit scoring needs a bit more explanation. How does it work?
One system for handling partial matches came out of the Message Understanding Conference (MUC) series. This seven-conference series was held by the US Government agency DARPA in the late 1980s and 1990s to encourage researchers to devise new techniques for information extraction. One outcome of the sixth conference was a scoring system for named entities that could handle partial matches on either the phrase itself or the assigned class. With a comprehensive scoring system like this, it is possible to evaluate proposed NER matchers and determine whether they perform as well as human annotators. So how does MUC scoring work?
The MUC scoring system works by computing two scores: one for finding the correct entity terms, and another score for classifying them correctly into their category of PERSON, GPE, and so on. The class is scored as correct, as long as some part of the entity term was also found. These two scores are then fed into a precision and recall calculation, similar to what we saw earlier in Chapter 3, Entity Matching.
To show how this works, the table below shows the expected and guessed results from a sample NER system. PERSON, ORGANIZATION, and GPE have been abbreviated as PER, ORG, and GPE, respectively:
| Item | Expected named entity | Guessed named entity | Boundaries correct? | Class correct? |
|---|---|---|---|---|
| 1 | Pirates of the Caribbean / ORG | Caribbean / GPE | No | No |
| 2 | Fido / PER | Fido / GPE | Yes | No |
| 3 | Microsoft Windows / ORG | Microsoft / ORG | No | Yes |
| 4 | Captain America / PER | Captain America / PER | Yes | Yes |
| 5 | Great Bend / GPE | - | - | - |
| 6 | - | Marvel / ORG | No | No |
| 7 | Shaker Heights / GPE | Shaker Heights / GPE | Yes | Yes |
Note that on item 3, the correct class was guessed even though the text boundary was only partially correct. On item 5, an expected named entity was skipped by the system. On item 6, the system found an entity where none was expected.
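Before formalizing the scores, we can tally the table in a short Python sketch. The row data below simply transcribes the Boundaries and Class columns, and the counting scheme treats each entity as contributing two scorable slots, one for the boundary and one for the class:

```python
# Each dict transcribes one row of the table above.
rows = [
    {"expected": True,  "guessed": True,  "boundary_ok": False, "class_ok": False},  # 1
    {"expected": True,  "guessed": True,  "boundary_ok": True,  "class_ok": False},  # 2
    {"expected": True,  "guessed": True,  "boundary_ok": False, "class_ok": True},   # 3
    {"expected": True,  "guessed": True,  "boundary_ok": True,  "class_ok": True},   # 4
    {"expected": True,  "guessed": False, "boundary_ok": False, "class_ok": False},  # 5
    {"expected": False, "guessed": True,  "boundary_ok": False, "class_ok": False},  # 6
    {"expected": True,  "guessed": True,  "boundary_ok": True,  "class_ok": True},   # 7
]

# Booleans sum as 0/1, so each row adds 0, 1, or 2 correct slots.
correct = sum(r["boundary_ok"] + r["class_ok"] for r in rows)
guessed = 2 * sum(r["guessed"] for r in rows)    # two slots per guess made
possible = 2 * sum(r["expected"] for r in rows)  # two slots per expected entity

print(correct, guessed, possible)  # 6 12 12
```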
To figure out the precision and recall, we keep track of the following counts, where each entity contributes two scorable slots, one for the text boundary and one for the class:

  * CORRECT: the number of slots the system got right
  * GUESSED: the total number of slots in the system's guesses (two per guessed entity)
  * POSSIBLE: the total number of slots in the expected answers (two per expected entity)

We can use the example above to calculate each of these measures: CORRECT = 6, GUESSED = 2 x 6 guesses = 12, and POSSIBLE = 2 x 6 expected entities = 12.

To calculate the MUC precision and recall for an NER system, we apply these measures as follows:
MUC_Precision = CORRECT / GUESSED = 6/12 = 50%
MUC_Recall = CORRECT / POSSIBLE = 6/12 = 50%
We can also calculate the F1-measure as the harmonic mean of precision and recall as follows:
F1 = 2 * ((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall)) = 2 * ((.5 * .5) / (.5 + .5)) = 50%
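As a check on the arithmetic, a small Python function (a sketch for this example, not a standard library routine) reproduces these numbers from the three tallies:

```python
def muc_scores(correct, guessed, possible):
    """Compute MUC-style precision, recall, and F1 from slot tallies."""
    precision = correct / guessed
    recall = correct / possible
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Tallies from the worked example: CORRECT=6, GUESSED=12, POSSIBLE=12
precision, recall, f1 = muc_scores(6, 12, 12)
print(precision, recall, f1)  # 0.5 0.5 0.5
```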
If we were operating under a strict scoring protocol, only two named entities were guessed totally correctly in that example: the ones on items 4 and 7. However, there were six guesses by the NER system, and six total expected entities. This yields:
Strict_Precision = CORRECT / GUESSED = 2/6 = 33%
Strict_Recall = CORRECT / POSSIBLE = 2/6 = 33%
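The strict protocol can be sketched over the same seven-row example: a guess counts only when both its boundary and its class are exactly right, which here is true only for items 4 and 7:

```python
# (boundary_ok, class_ok) for items 1-7 of the table; None = no guess made
row_results = [
    (False, False), (True, False), (False, True), (True, True),
    None, (False, False), (True, True),
]

strict_correct = sum(1 for r in row_results if r == (True, True))  # items 4, 7
strict_guessed = sum(1 for r in row_results if r is not None)      # six guesses
strict_possible = 6  # items 1-5 and 7 each had an expected entity

strict_precision = strict_correct / strict_guessed  # 2/6
strict_recall = strict_correct / strict_possible    # 2/6
print(round(strict_precision, 3), round(strict_recall, 3))  # 0.333 0.333
```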
Whether a partial or strict scoring protocol is used depends on the objectives of the work. How important are precise boundaries and classes in that domain? Your answer may vary depending on what you are working on. For example, if you are more interested in counting the named entities that appear in a text, it may be sufficient just to account for partial matches.
In the next section, we will devise a named entity recognition system and calculate the associated accuracy scores for a project using real data.