Chapter 4. Finding People and Things

The process of finding people and things is referred to as Named Entity Recognition (NER). Each entity, such as a person or a place, belongs to a named category that identifies what it is. A named category can be as simple as "people". Common entity types include:

  • People
  • Locations
  • Organizations
  • Money
  • Time
  • URLs

Finding names, locations, and other things in a document is an important and useful NLP task. NER is used in many places, such as conducting simple searches, processing queries, resolving references, disambiguating text, and finding the meaning of text. Sometimes we are only interested in entities that belong to a single category; using categories, the search can be restricted to those item types. Other NLP tasks, such as POS tagging and cross-referencing, also make use of NER.

The NER process involves two tasks:

  • Detection of entities
  • Classification of entities

Detection is concerned with finding the position of an entity within text. Once an entity is located, it is important to determine what type of entity was discovered. After these two tasks have been performed, the results can be used to solve other tasks such as searching and determining the meaning of the text. For example, identifying names in a movie or book review can help to find other movies or books that might be of interest. Extracting location information can assist in providing references to nearby services.
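The two tasks can be illustrated with a minimal Python sketch. It is not a production technique: the Entity type, the PATTERNS table, and the two regular expressions are hypothetical and deliberately simple. Detection is represented by the start and end offsets of each match, and classification by the label attached to it.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int   # offset of the first character (detection)
    end: int     # offset one past the last character (detection)
    text: str    # the matched text
    label: str   # the entity type (classification)


# Hypothetical, deliberately simple patterns for two entity types.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def find_entities(text: str) -> List[Entity]:
    """Return every pattern match with its position and type, ordered by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), match.group(), label))
    return sorted(entities)


sample = "Tickets cost $12.50 at https://example.com/tickets today."
for entity in find_entities(sample):
    print(entity)
```

Keeping the offsets alongside the label lets later tasks, such as search or cross-referencing, point back to the exact place in the text where the entity was found.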

Why is NER difficult?

Like many NLP tasks, NER is not always simple. Although the tokenization of a text will reveal its components, understanding what they are can be difficult. Relying on proper nouns will not always work because of the ambiguity of language. For example, Penny and Faith, while valid names, may also refer to a unit of currency and a belief, respectively. Similarly, a word such as Georgia can be the name of a country, a state, or a person.

Some phrases can be challenging. The phrase "Metropolitan Convention and Exhibit Hall" may contain words that in themselves are valid entities. When the domain is well known, a list of entities can be very useful and easy to implement.
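A list-based approach might look like the following Python sketch. The GAZETTEER entries, their labels, and the lookup function are assumptions made for illustration. The sketch tries the longest candidate phrase first, so that a multi-word entity is not broken up into shorter entries that happen to be valid on their own.

```python
# Hypothetical entity list: lowercase token tuples mapped to an entity type.
GAZETTEER = {
    ("metropolitan", "convention", "and", "exhibit", "hall"): "LOCATION",
    ("metropolitan",): "ORGANIZATION",  # shorter entry that overlaps the phrase above
    ("georgia",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def lookup(tokens):
    """Scan left to right, preferring the longest list entry at each position."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entry starts at this token
    return found


tokens = "The Metropolitan Convention and Exhibit Hall is in Georgia".split()
print(lookup(tokens))
```

A list like this cannot resolve ambiguity by itself. Georgia is labeled as a location here simply because that is what the list says.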

NER is typically applied at the sentence level. Otherwise, a phrase can easily span a sentence boundary, leading to the incorrect identification of an entity. For example, consider the following text:

"Bob went south. Dakota went west."

If we ignored the sentence boundaries, then we could inadvertently find the location entity South Dakota.
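The effect can be reproduced with a short Python sketch. The one-entry LOCATIONS list, the helper names, and the naive sentence splitter are all assumptions made for illustration; a real system would use a proper sentence detector.

```python
import re

LOCATIONS = ["South Dakota"]  # hypothetical one-entry location list


def normalize(text):
    """Lowercase the text and keep only its words, dropping punctuation."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


def find_locations(text):
    """Return the listed locations whose words appear consecutively in the text."""
    words = f" {normalize(text)} "
    return [loc for loc in LOCATIONS if f" {normalize(loc)} " in words]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Bob went south. Dakota went west."

# Ignoring sentence boundaries finds a location that is not really there.
print(find_locations(text))

# Searching each sentence separately avoids the false match.
print([find_locations(sentence) for sentence in split_sentences(text)])
```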

Specialized text such as URLs, e-mail addresses, and specialized numbers can be difficult to isolate. This identification is made even more difficult if we have to take into account variations of the entity form. For example, are parentheses used with phone numbers? Are dashes, or periods, or some other character used to separate its parts? Do we need to consider international phone numbers?
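As a rough illustration of how such variations add up, the following Python sketch uses a single regular expression for North American style phone numbers. The pattern and the find_phone_numbers function are assumptions made for this example. The pattern accepts parentheses around the area code, separators of spaces, dashes, or periods, and an optional leading country code of 1. It does not attempt to cover other international formats.

```python
import re

# Hypothetical pattern for North American style numbers only.
PHONE = re.compile(
    r"""
    (?<!\d)                         # do not start in the middle of a longer number
    (?:\+?1[\s.-]?)?                # optional country code 1
    (?:\(\d{3}\)\s?|\d{3}[\s.-])    # area code, with or without parentheses
    \d{3}[\s.-]                     # exchange, followed by a separator
    \d{4}                           # line number
    (?!\d)                          # do not stop in the middle of a longer number
    """,
    re.VERBOSE,
)


def find_phone_numbers(text):
    """Return every substring that matches the phone number pattern."""
    return [match.group() for match in PHONE.finditer(text)]


text = "Call (800) 555-1234, 800-555-1234, 800.555.1234, or +1 800 555 1234."
print(find_phone_numbers(text))
```

Each additional format, such as extensions or other country codes, would need further alternatives in the pattern, which is part of why this kind of text is hard to isolate reliably.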

These factors contribute to the need for good NER techniques.
