Techniques for named entity recognition

Before we tackle the strategies for named entity recognition, we should differentiate between some similar terms that we will come across when doing this work. Usually, when English-speakers first begin to think about named entities, they assume named entities are just proper nouns. What is a proper noun? Proper nouns are typically capitalized in English, and refer to a specific named person, place, or thing. Proper names can include proper nouns as well as noun phrases. Alaska, Barack Obama, January, and The Grateful Dead are all proper names. Are all proper nouns and names capitalized? Not necessarily, as we saw with iPhone and iPad, and also eBay, and the author bell hooks. Are all capitalized nouns proper? No. For example, we write the Englishman came around for tea, where Englishman is capitalized, yet Englishman is a common noun in English.

NER is considerably more interesting than just recognizing nouns, or proper nouns. In a linguistic sense, named entities are sometimes defined more strictly to only include those for which there must be no ambiguity about what noun is being discussed. For example, Barack Obama is an unambiguous name for a specific person, so we call this name a rigid designator, and it would qualify as a named entity (and a proper name). On the other hand, the phrase The President of the United States is called a flaccid designator because, even though it is a proper name and refers to an actual person – who happens to be named Barack Obama at the moment – the actual physical entity that this name refers to will change. Other examples of rigid and flaccid designators would be the difference between Wednesday and tomorrow, or Mastering Data Mining with Python and the best book ever written. Some proper nouns are rigid designators and others are not, and some named entities are proper nouns and some are not.

Even though a flaccid designator may be capitalized or may refer to the same thing as a rigid designator in popular usage, the job of NER is to filter out the true named entities from the impostors. So the presence of capital letters on a word that has been also identified as a noun can be a good clue that the thing is a named entity. However, that logic would dictate that The President of the United States is necessarily a named entity, yet a strict definition of named entities as rigid designators would suggest that it should not be a named entity. Consider also the following sentences:

  • I am starting my new project on Monday.

    Here, Monday is a proper noun and a rigid designator, since it refers to one particular Monday in time.

  • Monday is the best day to start any new project.

    In this sentence, Monday is also a proper noun, but it is being used generically, so it is a flaccid designator.

  • The new Watch by Apple is very expensive, but it plays my iTunes.

    Here, the brand name Watch is a named entity, as are Apple and iTunes. All three are also proper nouns, but only the first two follow traditional capitalization rules.

  • Watch me buy this EDM song on the iTunes store.

    In this example, Watch is the first word of the sentence, so it is capitalized, but not a named entity. It is being used as a verb rather than a noun. EDM is an acronym for Electronic Dance Music, so it is capitalized.

From these examples, we can see that, while imperfect, both capitalization and part of speech are fairly good indicators of which words or phrases could be considered named entities in a text. Capitalization is easy to spot by observing the case of the characters in a word. Even medial caps can be included if we want to recognize words such as iPhone and iTunes. However, as we have shown, capitalization alone is not sufficient to indicate whether a word or phrase is a named entity. The next critical step will be to find nouns and noun phrases, which are strong indicators of named entities.
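As a rough illustration of the capitalization clue, the following sketch flags tokens that start with a capital letter or that use medial caps. The helper names and the two regular expressions are assumptions made for this example; they are one possible definition of "capitalized", not a standard one, and the tokenizer is deliberately crude.

```python
import re

# Hypothetical patterns: one possible definition of each clue.
INITIAL_CAP = re.compile(r"^[A-Z]")        # Watch, Apple, EDM
MEDIAL_CAP = re.compile(r"^[a-z]+[A-Z]")   # iPhone, iTunes, eBay


def capitalization_clue(token):
    """Return 'initial', 'medial', or None for a single token."""
    if INITIAL_CAP.match(token):
        return "initial"
    if MEDIAL_CAP.match(token):
        return "medial"
    return None


def capitalized_tokens(sentence):
    """List (token, clue) pairs for every token with a capitalization clue."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # crude tokenizer: letters only
    pairs = [(token, capitalization_clue(token)) for token in tokens]
    return [(token, clue) for token, clue in pairs if clue is not None]


print(capitalized_tokens("Watch me buy this EDM song on the iTunes store."))
```

Note that Watch is flagged here even though it is a verb in this sentence, which is exactly why capitalization has to be combined with part-of-speech information.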

Tagging parts of speech

To find the Part of Speech (POS) for each word in a sentence, we can use a type of software called a POS tagger. Once a text has been split into sentences and each sentence into words, or tokens, the POS tagger assigns a part of speech to every token. Each token is paired with its part of speech in a tuple. A tuple looks like this:

('dog', 'NN')

Sometimes the assigned part of speech is easy to guess, such as NN for noun, but other times the POS tagger will assign a more specific tag; for example, using the Brown Corpus tags available through NLTK, the word dogs is tagged as a plural noun and dog's as a possessive noun:

('dogs', 'NNS')
("dog's", 'NN$')

The second word, dog's, is surrounded by double quotes because it contains an apostrophe, which is the same character as a single quotation mark.
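In Python, such a pair is an ordinary two-element tuple of strings. The short sketch below uses hand-typed values rather than real tagger output, and simply prints two pairs to show where the different quote characters come from.

```python
# Each tagged token is a plain Python tuple: (token, tag).
# These values are typed by hand for illustration, not produced by a tagger.
tagged = [("dog", "NN"), ("dog's", "NN$")]

# repr() is what Python shows when a tuple is printed.
lines = [repr(pair) for pair in tagged]
for line in lines:
    print(line)

# Unpacking gives back the token and the tag separately.
token, tag = tagged[1]
print(token, "->", tag)
```

Python chooses double quotes for the second token only because the string itself contains a single quote; the data is otherwise the same kind of tuple.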

How does the POS tagger know the part of speech for every word, especially considering that some words will change their part of speech depending on how the word is used in a sentence? To answer this question, we must first understand the concept of a corpus. A corpus is a collection of texts, and an annotated corpus is one that has been tagged, for example with the correct POS for each word. Once we have a POS tagged corpus that we are confident is correct, we can use that to determine the parts of speech for all the tokens in new documents that we have not yet seen.
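To make the idea of learning from an annotated corpus concrete, here is a minimal sketch of one of the simplest possible approaches: remember the most frequent tag for each word. The three-sentence corpus is invented for illustration, the function names are hypothetical, and this is not a description of how NLTK's own tagger works.

```python
from collections import Counter, defaultdict

# A tiny hand-tagged corpus, invented for illustration only.
annotated_corpus = [
    [("I", "PRP"), ("watch", "VBP"), ("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("dogs", "NNS"), ("watch", "VBP"), ("me", "PRP")],
    [("my", "PRP$"), ("watch", "NN"), ("is", "VBZ"), ("new", "JJ")],
]


def train_unigram_tagger(corpus):
    """Map each lowercased word to the tag it received most often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default="NN"):
    """Tag new tokens; words never seen in the corpus get the default tag."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


model = train_unigram_tagger(annotated_corpus)
print(tag_tokens(["the", "dog", "is", "new"], model))
print(model["watch"])
```

Because this sketch ignores context, watch always receives its most common tag from the corpus, even in a sentence where it is a noun. Handling words whose part of speech depends on usage requires a tagger that also looks at the surrounding tokens.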

There are many well-known corpora that are used over and over again to tag new documents. One of the most famous is called the Brown Corpus, after Brown University. This corpus was created in 1961. It consists of 500 documents, each approximately 2000 words, for a total of about a million tagged tokens. The documents are all from native speakers of American English. The original documentation for this corpus can be found at http://www.hit.uib.no/icame/brown/bcm.html.

Today, the Brown Corpus is just one of many, many corpora available to be used as a model by a POS tagger. Some, like Brown, consist of a variety of text types, such as fiction, news articles, and religious books. Other corpora focus on specific text types, such as news. Still other corpora are tagged for different languages or different dialects, or have been updated with newer words that did not exist in 1961 when the Brown Corpus was created.

As we will see in our project later in the chapter, an off-the-shelf POS tagger, such as the one that comes with NLTK, can be directed to use different corpora. The default tagger in NLTK uses the tag set of the Penn Treebank corpus. The Penn Treebank list of abbreviations for parts of speech is quite extensive, and can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

For example, the different noun abbreviations in Penn are:

  • NN: Noun, singular or mass (I see a dog.)
  • NNS: Noun, plural (I see many dogs.)
  • NNP: Proper noun, singular (My dog is named Fido.)
  • NNPS: Proper noun, plural (There are many Fidos in the park.)

There are many other abbreviations for parts of speech but since we are focused on named entity recognition in this chapter, we will mostly be working with nouns.
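Since the noun tags are what matter most here, a small sketch of filtering tagged tokens by tag may help. The tagged sentence below is written by hand using the Penn noun tags listed above (a real tagger would produce it), and the helper name is hypothetical.

```python
# Hand-written (token, tag) tuples; a real POS tagger would produce these.
tagged_sentence = [
    ("My", "PRP$"), ("dog", "NN"), ("is", "VBZ"),
    ("named", "VBN"), ("Fido", "NNP"), (".", "."),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # all four Penn noun tags
PROPER_NOUN_TAGS = {"NNP", "NNPS"}         # proper nouns only


def select_by_tag(tagged, wanted_tags):
    """Return the tokens whose tag is in wanted_tags, in sentence order."""
    return [token for token, tag in tagged if tag in wanted_tags]


nouns = select_by_tag(tagged_sentence, NOUN_TAGS)
proper_nouns = select_by_tag(tagged_sentence, PROPER_NOUN_TAGS)
print(nouns)
print(proper_nouns)
```

Keeping the tag sets as named constants makes it easy to widen or narrow the filter later, for example to proper nouns only when looking for named entity candidates.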

Classes of named entities

If NER were as simple as just identifying proper nouns, it would not be nearly as much fun. In addition to POS tagging, we can also use a tagger to attempt to deduce what kind of named entity we have found. Common classes for the named entities include: PERSON, ORGANIZATION, GPE (for geopolitical entity, or place), and so on. It is true that these are very large, general classes for the nouns, but they still do serve as one additional layer of granularity. With these classes, Fido may be classified as a canine PERSON, Pirates of the Caribbean should be classified as an ORGANIZATION, and the partial match Caribbean would likely be classified as a place or GPE.
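One simple way to hold the result of this classification step is as pairs of entity text and class, grouped by class. The sketch below uses the three examples from the paragraph above with hand-assigned labels; it does not perform any classification itself, and the function name is hypothetical.

```python
from collections import defaultdict

# (entity text, class) pairs, labeled by hand from the examples in the text.
classified = [
    ("Fido", "PERSON"),
    ("Pirates of the Caribbean", "ORGANIZATION"),
    ("Caribbean", "GPE"),
]


def group_by_class(entities):
    """Collect entity names under their class label."""
    groups = defaultdict(list)
    for name, label in entities:
        groups[label].append(name)
    return dict(groups)


by_class = group_by_class(classified)
print(by_class)
```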

Now that we know more about the NER goals of identifying each named entity and specifying its class, it is time to start looking at techniques that can help us achieve these goals with real data.
