Named entity recognition project

In this set of small projects, we will try our NER techniques on a variety of text types, some that we have seen already in prior chapters and some new. Specifically, we will look for named entities in e-mail texts, board meeting minutes, IRC chat dialogue, and human-created summaries of IRC chat dialogue. With these different data sources, we will be able to see how writing style and content both affect the accuracy of the NER system.

A simple NER tool

Our first step is to write a simple named entity recognition program that will allow us to find and extract named entities from a text sample. We will take this program and point it at several different text samples in turn. The code and text files for this project are all available on the GitHub site for this book, at https://github.com/megansquire/masteringDM/tree/master/ch6.

The code we will write is a short Python program that uses the same NLTK library we introduced in Chapter 3, Entity Matching, and Chapter 5, Sentiment Analysis in Text. We will also import a pretty printer library so that the output of this program will be easier to understand:

import nltk 
import pprint

Next, we will set up five different file names, commenting out the ones we are not working with at the moment. After we walk through the rest of the code, we will describe each of these files in turn: where they came from and what the NER results were:

# sample files that we use in this chapter 
filename = 'apacheMeetingMinutes.txt' 
#filename = 'djangoIRCchat.txt' 
#filename = 'gnueIRCsummary.txt' 
#filename = 'lkmlEmails.txt' 
#filename = 'lkmlEmailsReduced.txt'

The next section of code describes our simple NER routine. First, we open the chosen file and read its contents into the text object:

with open(filename, 'r', encoding='utf8') as sampleFile:
    text=sampleFile.read()

Next, we load a tokenizer that will read through each line in the text and look for sentences. It is important to look at sentences instead of just looking at lines because, depending on the type of text we are working with, there could be multiple sentences per line. For example, most IRC chat will be one sentence per line, but most e-mail will have multiple sentences per line:

en = {}
try:   
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(text.strip())

For each sentence we find, we are going to figure out the part of speech for every word in the sentence. The ne_chunk() function takes the collection of words and tags, and finds the most likely candidates for named entities, storing these in the variable chunked:

    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)

Next we will examine each item in chunked and build a dictionary entry for it and its label. Recall from our earlier discussion that a label, or class, can be ORGANIZATION, PERSON, GPE (geopolitical entity, used for locations), and so on:

        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
except Exception as e:
    print(str(e))
    
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(en)

Finally, we print the dictionary of named entities and their classes. In the next section, we describe what happens when we run this code against each of the different files shown in the beginning of the program. We also describe where we got the data for these files, and what kind of text they include.
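
One practical note: depending on how NLTK was installed, the trained models used by the sentence tokenizer, the part-of-speech tagger, and the chunker may need to be downloaded before this program will run. A minimal, one-time setup might look like the following; the resource names shown here are the standard NLTK package names, so adjust them if your NLTK version differs:

import nltk

# one-time downloads of the corpora and models our NER program relies on
nltk.download('punkt')                       # sentence tokenizer
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker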

Apache Board meeting minutes

The first text source we will use is taken from the publicly available collection of meeting minutes from the Board of Directors for the Apache project. Since 2010 the Apache Board of Directors has posted the minutes from its meetings on its website, available here: http://www.apache.org/foundation/board/calendar.html.

The text file I created for this project is taken from the February 17, 2016 minutes, specifically the President's section B. The original link for this February minutes file is https://www.apache.org/foundation/records/minutes/2016/board_minutes_2016_02_17.txt.

The final file is 33 lines long. A sample of the file is as follows:

I believe our Membership felt fully involved and as a result is almost unanimous in their approval of the new design.
Well done Sally (and thanks to LucidWorks and HotWax Systems for donating creative services).
Sally has confirmed a return of her media/analyst trainings at ApacheCon.

Our NER program will be looking for words such as LucidWorks, HotWax Systems, Sally, and ApacheCon.

When we run the program against this Apache meeting minutes file, we get the following results:

{'ApacheCon': ['ORGANIZATION', 'NNP'],
'Appveyor CI': ['PERSON', 'NNP NNP'],
'CFP': ['ORGANIZATION', 'NNP'],
'David': ['PERSON', 'NNP'],
'GitHub': ['ORGANIZATION', 'NNP'],
'HotWax Systems': ['ORGANIZATION', 'NNP NNP'],
'Huge': ['GPE', 'JJ'],
'Infra': ['ORGANIZATION', 'NNP'],
'LucidWorks': ['ORGANIZATION', 'NNP'],
'Mark Thomas': ['PERSON', 'NNP NNP'],
'Melissa': ['GPE', 'NNP'],
'New': ['GPE', 'NNP'],
'Remediation': ['GPE', 'NN'],
'Sally': ['PERSON', 'NNP'],
'TAC': ['ORGANIZATION', 'NNP'],
'TLPs': ['ORGANIZATION', 'NNP'],
'VP Infra': ['ORGANIZATION', 'NNP NNP'],
'Virtual': ['PERSON', 'NNP'],
'iTunes': ['ORGANIZATION', 'NNS']}

The program correctly found most items, but it incorrectly flagged the words CFP, Huge, New, and Remediation. These words were capitalized because they appeared at the start of a sentence, which undoubtedly led NLTK to declare them named entities when they were not. The word Virtual may seem at first like another mistake, but reading the text shows that it is actually the name of a company; unfortunately, the NER tool declared it to be a PERSON. Additionally, Melissa should have been a PERSON, not GPE, and Appveyor CI should have been labeled ORGANIZATION rather than PERSON.

We did not find any false negatives with this text sample. Using the formulas from the last section, we can calculate the accuracy of our program as follows:

  • CORRECT: The NER system guessed 15 correct boundaries and 12 correct classes, so CORRECT = 27
  • GUESSED: The NER system guessed a total of 19 boundaries, and a total of 19 classes, so GUESSED = 38
  • POSSIBLE: The number of possible guesses for text boundaries should be 15, and the number of possible class guesses is also 15, so POSSIBLE = 30

To calculate the MUC precision, recall, and F1 harmonic mean for our NER system, we apply these measures to our data like this:

MUC_Precision = CORRECT / GUESSED
   = 27/38
   = 71%
MUC_Recall = CORRECT / POSSIBLE 
   = 27/30 
   = 90%
F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.71 * .90) / (.71 + .90))
   = 79%

In this case, the NER program seemed to err on the side of false positives. It made several incorrect guesses, but did not miss any named entities in the text. Next we will see how the program does with highly unstructured chat text.
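
Since we will repeat this same calculation for each of the remaining text samples, it is convenient to wrap the arithmetic in a small helper. The sketch below simply codes up the three formulas shown above; the function name and argument names are our own invention:

def muc_scores(correct, guessed, possible):
    # MUC-style precision, recall, and F1 computed from the point totals
    precision = correct / guessed
    recall = correct / possible
    f1 = 2 * ((precision * recall) / (precision + recall))
    return precision, recall, f1

# the Apache meeting minutes numbers from above
p, r, f1 = muc_scores(correct=27, guessed=38, possible=30)
print(round(p, 2), round(r, 2), round(f1, 2))    # 0.71 0.9 0.79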

Django IRC chat

The Django project has an IRC channel where the community can discuss various aspects of the project including how it works, bug fixing, and so on. The IRC logs are provided for anyone to read at http://django-irc-logs.com. From this collection of log files, we chose a random date, March 23, 2014, and extracted all of the IRC log messages sent on that date. We did not collect the system messages, such as users logging in or logging out. There are 677 lines of text in this sample.

One thing we notice right away about IRC chat is that, because of the casual nature of this communication format, most lines of text are not written with proper capitalization or punctuation. To clean the data so that it could be used by the sentence tokenizer, we added a period at the end of each line to simulate sentence structure, and we removed URLs. A few sample lines from the data set look like this:

is it not possible though?.
it's possible, if you write a pile of JS to do it.
to dah pls no JS.
i just want native django queryset filter things.
Maybe he wouldn't have to. Maybe you could just use AdminActions.

We can see that any line that already ended in a question mark now has an additional period at the end, but this will not affect our NER program. We also see differences in capitalization; the last person typing, for example, does capitalize properly, while the rest of the lines are capitalized more loosely.
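
The cleanup described above is simple enough to script. The following sketch shows one way it might be done, by stripping anything that looks like a URL and then appending a period to each remaining line; the name of the raw input file is hypothetical:

import re

with open('djangoIRCchatRaw.txt', 'r', encoding='utf8') as infile, \
     open('djangoIRCchat.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        # remove URLs, then add a period so the sentence tokenizer
        # has a sentence boundary to work with
        line = re.sub(r'https?://\S+', '', line).strip()
        if line:
            outfile.write(line + '.\n')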

The NER program identified 105 named entities. Examples of false positives include generic, capitalized nouns such as the Boolean value False, the generic acronym API (application programming interface), and the generic acronym CA (certificate authority). True positives include FTP, the File Transfer Protocol program, and South, the name of a system.

Here, the NER program is even more generous than it was with the Apache meeting minutes, producing 52 false positives. The remaining 53 entities have correct boundaries; those that also have the correct class earn 2 points: 1 point for the correct boundary and 1 point for the correct class. For entities that have correct boundaries but an incorrect class, we assign 1 point. For example, marking Allauth as a PERSON rather than ORGANIZATION yields 1 point. Entities whose boundaries are incorrect are given 0 points. Only the first few entries are shown, for space reasons:

{0  'API': ['ORGANIZATION', 'NNP'],
0    'APIs': ['ORGANIZATION', 'NNP'],
0    'Admin': ['PERSON', 'NNP'],
2    'AdminActions': ['ORGANIZATION', 'NNS'],
0    'Ahh': ['GPE', 'NNP'],
2    'Aldryn': ['PERSON', 'NNP'],
2    'Aleksander': ['PERSON', 'NN'],
1    'Allauth': ['PERSON', 'NNP'],
0    'Anyone': ['GPE', 'NN'],
2    'Australian': ['GPE', 'JJ'],
…}
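
If we keep these point values in a Python dictionary alongside the entities, tallying the score later becomes trivial. Here is a minimal sketch, using only the entries shown above:

# points awarded to each guessed entity (only the first few shown here)
scores = {'API': 0, 'APIs': 0, 'Admin': 0, 'AdminActions': 2,
          'Ahh': 0, 'Aldryn': 2, 'Aleksander': 2, 'Allauth': 1,
          'Anyone': 0, 'Australian': 2}

# summing the points over the full list of 105 entities gives CORRECT,
# and each guessed entity contributes one boundary and one class guess
correct = sum(scores.values())
guessed = 2 * len(scores)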

To identify false negatives, we read through the chat log, and constructed the following list of 21 named entities that should have been caught but were not. Most of these are usernames, which are rarely capitalized. Non-username false negatives include html5 and softlayer, which also should have been capitalized but were not:

m1chael
html5
softlayer
tuxskar
comcast
gunicorn
tjsimmons
nginx
zlio
theslowl
frog3r
HowardwLo
dodobas
spoutnik16
moldy
carlfk
benwilber
erik`
apollo13
frege
dpaste

We can apply the same formulas:

  • CORRECT: The NER system guessed 53 correct boundaries, and 32 of those entities also had the correct class, so CORRECT = 53 + 32 = 85
  • GUESSED: The NER system guessed a total of 105 boundaries, and a total of 105 classes, so GUESSED = 210
  • POSSIBLE: The number of possible guesses for text boundaries should be 74 (53 found entities plus 21 false negatives), and the number of possible class guesses would also be 74, so POSSIBLE = 148

To calculate the MUC precision, recall, and F1 harmonic mean for our NER system, we apply these measures as follows:

MUC_Precision = CORRECT / GUESSED
   = 85/210
   = 40%
MUC_Recall = CORRECT / POSSIBLE 
   = 85/148 
   = 57%
F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.40 * .57) / (.40 + .57))
   = 47%

We can see from these dismal numbers that the accuracy of the NER program is vastly reduced in an IRC chat context, when compared to the board meeting minutes context. To improve accuracy, we will need to address both false positives and false negatives.

The main issue with false negatives seems to be missing the names of the chat participants, so one way to improve those recall numbers would be to provide a better way of detecting usernames. We could find a list of chatters on the system and capitalize their names to make it more likely that the NER tool would find them. Another approach would be to teach the system what a common username protocol looks like in IRC. For example, on IRC it is common to begin a chat line by directing it towards the person you are talking to, like this:

tjsimmons: don't forget that people typically use nginx to serve /static.

Here the person who is being addressed is called tjsimmons. The system could be taught that any single word at the beginning of a line followed by a colon character is probably a user's name and should be included as a named entity.
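
One minimal way to implement this rule is with a regular expression that captures the first token on a line when it is immediately followed by a colon. The sketch below is only an illustration of the idea; a real system would still want to filter out false matches:

import re

# a single word at the start of a line, followed by a colon, is probably
# the username of the person being addressed
username_pattern = re.compile(r'^([^\s:]+):', re.MULTILINE)

with open('djangoIRCchat.txt', 'r', encoding='utf8') as chatFile:
    usernames = set(username_pattern.findall(chatFile.read()))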

False positives seem to mostly stem from over-sensitivity to capitalized generic words such as acronyms, function names, and the like. This is a harder problem to solve, but one approach could be to provide a domain-specific context for the NER to work from. For instance, we could provide a pre-constructed vocabulary of known words to ignore or we could train the system to recognize common features of uninteresting words from this domain. An example of the latter would be the rule if any capitalized word is followed by (), it is a function, so ignore it. Depending on the data you have, you may need to add additional layers to your NER system so that it can become more accurate.
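
As a rough sketch of both ideas, we could keep a small domain-specific stoplist and apply the function rule just described, then filter the en dictionary built by our NER program (the en and text variables below are the ones from our earlier code). The stoplist here contains only the generic words we noticed in this sample; a real list would be much longer:

# capitalized generic words from this domain that we never want as entities
stoplist = {'API', 'APIs', 'CA', 'False', 'True', 'Ahh', 'Anyone'}

def keep_entity(candidate, text):
    # drop known generic words
    if candidate in stoplist:
        return False
    # the rule described above: a capitalized word followed by () is a
    # function name, so ignore it
    if candidate + '()' in text:
        return False
    return True

filtered = {ne: label for ne, label in en.items() if keep_entity(ne, text)}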

GNUe IRC summaries

As a contrast to the Django IRC chat, which was very casual and very loosely punctuated and capitalized, we will also analyze a formal summary of an IRC chat, written in clear prose by a human being. The GNUe project IRC channel had a human-written summary produced each week for several years in the early 2000s. Two lines from the GNUe summary sample are shown here:

Michael Dean (mdean) said that a new release (feature wise) is probably about 3 or 4 weeks away, since the database upgrade was going to be huge. As of this writing, he may make an interim bug fix/small feature release to get some of the email support down.
Daniel Baumann (chillywilly) pointed out that this abstraction thingy GComm could be confused with the GNU Comm project. But as far as Jason Cater (jcater) was concerned, GComm is our internal package name... to the external world, it's GNUe Common, but said that was a good point.

By 2015, the GNUe IRC summaries were no longer available online, but I rebuilt the data set using XML files from Archive.org, and posted it on my FLOSSmole site at the following URL: http://flossdata.syr.edu/data/irc/GNUe/.

The data set for this project is called gnueIRCsummary.txt and it is available on the GitHub site for this chapter at https://github.com/megansquire/masteringDM/tree/master/ch6.

This file consists of the first 10 paragraphs of the 23-27 October 2001 GNUe summary, which is a sample of about 55 lines of text.

When we run the NER program against this data set, we see many cases of partial boundaries, so this data set will be a great way to test our partial MUC-style scoring protocol. For example, the system accurately caught Andrew Mitchell, but it split Jeff Bailey into two separate entities. How do we score these? Since the system scored Bailey incorrectly as an ORGANIZATION but Jeff correctly as a PERSON, we score one as correct and one as incorrect. Here, next to each line, I have added a number to indicate whether the item was given 0, 1, or 2 points:

  • 2 points means that both the boundaries and the class are correct
  • 1 point means that the class was correct but the boundary was only partial
  • 0 points means that both the boundary and the class are incorrect

Note that no partial points were given if that entity was already found in full. So there are no points given for Jason when Jason Cater was already found. The exception to this rule is that sometimes the first name is mentioned in the text without the last name. This is the case with Derek and Derek Neighbors, both of which are used in the text. Therefore, we can score both Derek and Derek Neighbors as a 2. Only the first five rows of this result set are shown, for space reasons:

{2   'Andrew Mitchell': ['PERSON', 'NNP NNP'],
0    'Bailey': ['ORGANIZATION', 'NNP'],
0    'Baumann': ['ORGANIZATION', 'NNP'],
1    'Bayonne': ['PERSON', 'NNP'],
1    'Cater': ['PERSON', 'NNP'],
...}

False negatives for this data set include:

pyro
pygmy
windows
orbit
omniORB

It is questionable whether the missing usernames should also be considered false negatives. For instance, in this data set, the first instance of a first and last name combination is followed by a username, as in Derek Neighbors (dneighbo). In the future, we may wish to train this system to find these usernames in addition to the full names. In this example, however, we elect to ignore the usernames and not penalize the system for not finding them, since it did attempt to find the full names, and those represent the same entity as the usernames.

We can apply the same formulas:

  • CORRECT: The NER system earned a total of 41 boundary and class points, so CORRECT = 41
  • GUESSED: The NER system guessed a total of 33 boundaries, and a total of 33 classes, so GUESSED = 66
  • POSSIBLE: The number of possible guesses for text boundaries should be 21 (all the 2 point answers plus the five false negatives), and the number of possible class guesses is also 21, so POSSIBLE = 42

To calculate the MUC precision, recall, and F1 harmonic mean for our NER system, we apply these measures as follows:

MUC_Precision = CORRECT / GUESSED 
   = 41/66
   = 62%
MUC_Recall = CORRECT / POSSIBLE 
   = 41/42 
   = 98%
F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.62 * .98) / (.62 + .98))
   = 76%

Here we see that the inclusion of partial scores for boundaries almost entirely makes up for the five false negatives. If we had scored these strictly, with no partial matches allowed, there would be only 16 totally correct guesses, so the numbers would look like this:

MUC_Precision = CORRECT / GUESSED 
   = 32/66
   = 48%
MUC_Recall = CORRECT / POSSIBLE 
   = 32/42 
   = 76%
F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.48 * .76) / (.48 + .76))
   = 59%

This example shows that whether we choose a loose or strict scoring protocol will affect the presumed accuracy of our NER system. When you are presented with NER accuracy results from someone else, it is important to ask about the scoring protocol that they used.

Next we will experiment with some e-mails. These will be similar to the proper English of the GNUe IRC summaries and the Apache Board meeting minutes, but will have the same high technical content as the Django IRC chats.

LKML e-mails

In Chapter 5, Sentiment Analysis in Text, we used a tiny sample of the e-mail messages sent to the Linux Kernel Mailing List. Here we start with the same 77 e-mails sent by Linus Torvalds to the LKML, but for this project, I made two changes to that data set. First, I removed the portions of lines that contained boilerplate text, such as On Fri, Jan 8, 2016 at 4:13 PM Linus Torvalds wrote:, since these lines have nothing to do with the concepts in the text, and I did not want to risk the NER program accidentally picking up words such as On, Fri, or Jan. Second, in order to reduce the result set of named entities to a more manageable size, I also removed three of the e-mails. These three e-mails were weekly summaries of patches that had been added to the kernel, so each one included dozens of names.
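
A cleanup like the first one can also be scripted. The sketch below simply drops whole attribution lines matching the pattern shown above, rather than trimming portions of lines as I did by hand; the name of the raw input file is hypothetical:

import re

# lines such as "On Fri, Jan 8, 2016 at 4:13 PM Linus Torvalds wrote:" are
# quoting boilerplate rather than content
attribution = re.compile(r'^On .+ wrote:\s*$')

with open('lkmlEmailsRaw.txt', 'r', encoding='utf8') as infile, \
     open('lkmlEmails.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        if not attribution.match(line):
            outfile.write(line)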

Note

On the GitHub site for this book, you will find both files, lkmlEmails.txt and lkmlEmailsReduced.txt. The second of these is the one we will use for the remainder of this chapter, although you should feel free to test with the first file too if you like. Experimenting with the first file will produce many, many more named entities than the second one.

Running our NER program against the lkmlEmailsReduced.txt file yields the following named entities. Once again, I have scored each as either 0, 1, or 2, following the example in the previous sections. Again, only the first five lines are shown for space reasons:

{2  'AIO': ['ORGANIZATION', 'NNP'],
0    'Actually': ['PERSON', 'NNP'],
1    'Al': ['GPE', 'NNP'],
1    'Alpha': ['GPE', 'NNP'],
2    'Andrew': ['GPE', 'NNP'],
...}

Once again, the NER program does find a lot of first names; however, here the last names are rarely used. (The case with the lkmlEmails.txt file is different. The inclusion of those three extra e-mails does mean a lot more duplicate first names.) Our program did seem to miss three named entities (perhaps more if we decided that function names or libraries were important to catch as well):

valgrind
mmap
github

To calculate precision and recall, we need to figure out the following:

  • CORRECT: The NER system earned a total of 72 boundary and class points, so CORRECT = 72
  • GUESSED: The NER system guessed a total of 72 boundaries, and a total of 72 classes, so GUESSED = 144
  • POSSIBLE: The number of possible guesses for text boundaries should be 42 (all 39 of the correct and partially correct answers plus the 3 false negatives), and the number of possible class guesses is also 42, so POSSIBLE = 84

To calculate the MUC precision, recall, and harmonic mean F1 for this NER system, we fit our data into the formulas as follows:

MUC_Precision = CORRECT / GUESSED 
   = 72/144
   = 50%
MUC_Recall = CORRECT / POSSIBLE 
   = 72/84 
   = 86%
F1 = 2*((MUC_Precision * MUC_Recall) / (MUC_Precision + MUC_Recall))
   = 2*((.5 * .86) / (.5 + .86))
   = 63%

Here the low number of false negatives drives up the recall rates, but precision is still fairly low due to a lot of false positives.

Having four very different types of text samples allows us to compare the strengths and weaknesses of this simple NER program against text with different characteristics. False negatives seem to result from words missing capitalization, and false positives seem to result from over-sensitivity to capitalized words at the beginning of sentences, acronyms, and boundary issues with multi-word phrases.
