Building and evaluating NER systems

Based on our discussion so far in this chapter, we know that building an NER system will start with the following steps (sketched in code just after the list):

  1. Separate our document into sentences.
  2. Separate our sentences into tokens.
  3. Tag each token with a part of speech.
  4. Identify named entities from this tagged token set.
  5. Identify the class of each named entity.

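As a rough sketch of how these steps fit together, the following code uses NLTK's built-in sentence tokenizer, word tokenizer, part-of-speech tagger, and named entity chunker. It assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words) have already been downloaded, and the sample text is purely illustrative:

import nltk

text = "Pirates of the Caribbean was filmed in the Caribbean."

# Step 1: separate the document into sentences.
for sentence in nltk.sent_tokenize(text):
    # Step 2: separate each sentence into tokens.
    tokens = nltk.word_tokenize(sentence)
    # Step 3: tag each token with a part of speech.
    tagged = nltk.pos_tag(tokens)
    # Steps 4 and 5: identify named entities and assign each one a class.
    tree = nltk.ne_chunk(tagged)
    for subtree in tree:
        if isinstance(subtree, nltk.Tree):
            entity = " ".join(word for word, pos in subtree.leaves())
            print(entity, subtree.label())
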
To correctly find tokens at step 2, to separate the real named entities from the impostors at step 4, and to ensure that each entity is placed into the correct class at step 5, it is common to leverage a machine learning approach, similar to what NLTK and its sentiment mining functions did for us in Chapter 5, Sentiment Analysis in Text. Relying on a large set of pre-classified examples helps us work through some of the more complicated issues introduced earlier for recognizing named entities, such as choosing the correct boundary in a multi-word noun phrase, recognizing unusual capitalization, or deciding what kind of named entity a phrase is.

But even with a flexible machine learning approach, the vagaries and oddities of written language remind us to be careful; some legitimate named entities may slip through the cracks, while other tokens may be called named entities when they really are not. With so many exceptions to the rules for how to find named entities, the risk of generating false positives or false negatives is high. Therefore, just as with earlier chapters, we will need an evaluation plan for whatever machine learning-based NER system we end up choosing.

NER and partial matches

Because of the potential for NER systems to over- or under-classify words as named entities, we will need to use the kind of false positive and false negative calculations that we first saw in Chapter 3, Entity Matching. However, the calculations will be slightly different with NER due to partial matches. Partial matches happen when our NER system catches, for example, Caribbean but not Pirates of the Caribbean. These are sometimes called boundary errors, since the NER system found part of the entity but got its boundaries wrong, making the match too short or too long. An NER system may also assign an entity to the wrong class; for example, it may recognize Fido but call it a GPE rather than a PERSON.

With these kinds of partial matches, we have three choices for how to handle them:

  • Strict Scoring: We can score the partial match as both a false positive (because Caribbean is not a correct guess) and a false negative (because Pirates of the Caribbean was also missing).
  • Lenient Scoring: We can score the partial match as a true positive. No penalties are given for false negatives or false positives, and we just assume that Caribbean is good enough.
  • Partial Scoring: We can come up with rules that give some credit for matches that are partially correct, for example, finding Caribbean instead of Pirates of the Caribbean.

Strict and lenient scoring are straightforward to understand, but partial credit scoring needs a bit more explanation. How does it work?

Handling partial matches

One system for handling partial matches came out of the Message Understanding Conference (MUC) series. This seven-conference series was sponsored by DARPA, a US government agency, in the late 1980s and 1990s to encourage researchers to devise new techniques for information extraction. One of the outcomes of the sixth conference was a scoring system for named entities that could handle partial matches on either the phrase itself or the assigned class. With a comprehensive scoring system, it becomes possible to evaluate a proposed NER system and determine whether it performs as well as humans doing the same task. So how does MUC scoring work?

The MUC scoring system works by computing two scores: one for finding the correct entity terms (the text boundaries), and another for classifying them into the correct category of PERSON, GPE, and so on. The class is scored as correct as long as at least part of the entity term was also found. These two scores are then fed into a precision and recall calculation, similar to what we saw earlier in Chapter 3, Entity Matching.
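
As a minimal sketch of this bookkeeping, the hypothetical helper below scores a single aligned expected/guessed pair into the two MUC-style flags; the function name and the exact-match rule for boundaries are our own simplifications rather than part of the MUC specification:

def muc_item_score(expected_text, expected_class, guessed_text, guessed_class):
    # Boundary flag: the guessed text must match the expected text exactly.
    boundary_correct = guessed_text == expected_text
    # Class flag: because the pair is already aligned (the guess overlaps the
    # expected entity), a matching class counts even with a partial boundary.
    class_correct = guessed_class == expected_class
    return boundary_correct, class_correct

# Item 3 in the table that follows: partial boundary, but the class still scores.
print(muc_item_score("Microsoft Windows", "ORG", "Microsoft", "ORG"))  # (False, True)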

To show how this works, the table below lists the expected and guessed results from a sample NER system. PERSON and ORGANIZATION have been abbreviated as PER and ORG, respectively:

Item | Expected named entity          | Guessed named entity   | Boundaries correct? | Class correct?
-----|--------------------------------|------------------------|---------------------|---------------
1    | Pirates of the Caribbean / ORG | Caribbean / GPE        | No                  | No
2    | Fido / PER                     | Fido / GPE             | Yes                 | No
3    | Microsoft Windows / ORG        | Microsoft / ORG        | No                  | Yes
4    | Captain America / PER          | Captain America / PER  | Yes                 | Yes
5    | Great Bend / GPE               | -                      | -                   | -
6    | -                              | Marvel / ORG           | No                  | No
7    | Shaker Heights / GPE           | Shaker Heights / GPE   | Yes                 | Yes

Note that for item 3, the correct class was guessed even though the text boundary was only partially correct. For item 5, an expected named entity was missed entirely by the system. For item 6, the system found an entity where none was expected.

To figure out the precision and recall, we keep track of the following:

  • CORRECT: Number of correct guesses, counting the boundaries and the class separately
  • GUESSED: Number of guesses the system actually made, again counting boundaries and class separately
  • POSSIBLE: Number of possible answers in the expected results, counting boundaries and class separately

We can use the example above to calculate each of these measures:

  • CORRECT: The NER system guessed three correct boundaries and three correct classes, so CORRECT = 6.
  • GUESSED: The NER system made six boundary guesses and six class guesses (items 1-4, 6, and 7; it made no guess for item 5), so GUESSED = 12.
  • POSSIBLE: There are six possible boundary answers (items 1-5 and 7) and six possible class answers, so POSSIBLE = 12.

To calculate the MUC precision and recall for an NER system, we apply these measures as follows:

MUC_Precision = CORRECT / GUESSED
              = 6/12
              = 50%
MUC_Recall = CORRECT / POSSIBLE 
           = 6/12 
           = 50%

We can also calculate the F1-measure as the harmonic mean of precision and recall as follows:

F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.5 * .5) / (.5 + .5))
   = 50%
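
To make the bookkeeping concrete, here is a small sketch that reproduces these numbers from the seven items in the table; the tuple layout is our own simplification, recording for each item whether an entity was expected, whether one was guessed, and whether the boundary and class guesses were correct:

# (expected?, guessed?, boundary correct?, class correct?) for items 1-7
items = [
    (True,  True,  False, False),  # 1: Caribbean vs. Pirates of the Caribbean
    (True,  True,  True,  False),  # 2: Fido tagged GPE instead of PER
    (True,  True,  False, True),   # 3: Microsoft vs. Microsoft Windows
    (True,  True,  True,  True),   # 4: Captain America
    (True,  False, False, False),  # 5: Great Bend missed entirely
    (False, True,  False, False),  # 6: Marvel guessed where nothing was expected
    (True,  True,  True,  True),   # 7: Shaker Heights
]

correct = sum(boundary + cls for _, _, boundary, cls in items)        # 3 + 3 = 6
guessed = sum(2 for _, was_guessed, _, _ in items if was_guessed)     # 6 guesses x 2 = 12
possible = sum(2 for was_expected, _, _, _ in items if was_expected)  # 6 expected x 2 = 12

muc_precision = correct / guessed                                     # 6/12 = 50%
muc_recall = correct / possible                                       # 6/12 = 50%
f1 = 2 * (muc_precision * muc_recall) / (muc_precision + muc_recall)  # 50%
print(muc_precision, muc_recall, f1)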

Under a strict scoring protocol, only two named entities in this example were guessed completely correctly: items 4 and 7. The NER system still made six guesses, and there were six possible entities to find. This yields:

Strict_Precision = CORRECT / GUESSED = 2/6 (33%)
Strict_Recall = CORRECT / POSSIBLE = 2/6 (33%)
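
Continuing the same example, a strict scorer only gives credit when both the boundary and the class are right at once; here is a small self-contained sketch, again using our own simplified representation of the six guesses:

# (boundary correct?, class correct?) for the six guesses the system made
guesses = [
    (False, False),  # 1: Caribbean / GPE
    (True,  False),  # 2: Fido / GPE
    (False, True),   # 3: Microsoft / ORG
    (True,  True),   # 4: Captain America / PER
    (False, False),  # 6: Marvel / ORG (spurious)
    (True,  True),   # 7: Shaker Heights / GPE
]
possible = 6  # expected entities: items 1-5 and 7

strict_correct = sum(1 for boundary, cls in guesses if boundary and cls)  # 2
strict_precision = strict_correct / len(guesses)                          # 2/6, about 33%
strict_recall = strict_correct / possible                                 # 2/6, about 33%
print(strict_precision, strict_recall)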

Whether a partial or strict scoring protocol is used depends on the objectives of the work. How important are precise boundaries and classes in that domain? Your answer may vary depending on what you are working on. For example, if you are more interested in counting the named entities that appear in a text, it may be sufficient just to account for partial matches.

In the next section, we will devise a named entity recognition system and calculate the associated accuracy scores for a project using real data.
