Techniques for name recognition

Several NER techniques are available. Some use regular expressions, and others rely on a predefined dictionary. Regular expressions are expressive enough to isolate many kinds of entities by their form. A dictionary of entity names can be compared against the tokens of a text to find matches.

Another common NER approach uses trained models to detect the presence of entities. These models depend on the type of entity we are looking for and on the target language. A model that works well for one domain, such as web pages, may not work well for a different domain, such as medical journals.

A model is trained on an annotated block of text that identifies the entities of interest. To measure how well a model has been trained, several measures are used:

  • Precision: The percentage of entities found that exactly match the spans in the evaluation data
  • Recall: The percentage of entities defined in the corpus that were found in the same location
  • Performance measure (F1 score): The harmonic mean of precision and recall, given by F1 = 2 * Precision * Recall / (Recall + Precision)

We will use these measures when we cover the evaluation of models.
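
As a rough illustration of the three measures above, the following sketch computes them for exact-match entity spans. The Span record, the character offsets, and the entity types are hypothetical and exist only for this example; real evaluation tools may count matches differently.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch: exact-match precision, recall, and F1 over entity spans. */
final class NerMetrics {

    /** Hypothetical span representation: character offsets plus an entity type. */
    record Span(int start, int end, String type) {}

    /** Fraction of predicted spans that also appear in the annotated (gold) data. */
    static double precision(Set<Span> gold, Set<Span> predicted) {
        if (predicted.isEmpty()) return 0.0;
        return (double) truePositives(gold, predicted) / predicted.size();
    }

    /** Fraction of annotated (gold) spans that the model found at the same location. */
    static double recall(Set<Span> gold, Set<Span> predicted) {
        if (gold.isEmpty()) return 0.0;
        return (double) truePositives(gold, predicted) / gold.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0;
        return 2 * precision * recall / (recall + precision);
    }

    /** Counts spans that match exactly in start, end, and type. */
    private static int truePositives(Set<Span> gold, Set<Span> predicted) {
        Set<Span> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        return hits.size();
    }

    public static void main(String[] args) {
        // Made-up annotations: three gold spans, four predictions, two exact matches.
        Set<Span> gold = Set.of(
            new Span(0, 16, "PERSON"),
            new Span(9, 16, "LOCATION"),
            new Span(25, 31, "LOCATION"));
        Set<Span> predicted = Set.of(
            new Span(0, 16, "PERSON"),
            new Span(25, 31, "LOCATION"),
            new Span(40, 45, "PERSON"),
            new Span(50, 55, "PERSON"));

        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        System.out.printf(Locale.ROOT, "P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
    }
}
```

With two of four predictions correct and two of three annotated spans found, precision is 0.5, recall is about 0.667, and F1 is about 0.571. Because F1 is a harmonic mean, it stays closer to the lower of the two values.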

NER is also known as entity identification and entity chunking. Chunking is the analysis of text to identify its parts, such as nouns, verbs, or other components. As humans, we tend to chunk a sentence into distinct parts. These parts form a structure that we use to determine the sentence's meaning. The NER process will create spans of text such as "Queen of England". However, a span may itself contain other entities, such as "England".

Lists and regular expressions

One technique is to use lists of "standard" entities along with regular expressions to identify the named entities. Named entities are sometimes referred to as proper nouns. The standard entities list could be a list of states, common names, months, or frequently referenced locations. Gazetteers, which are lists that contain geographical information used with maps, provide a source of location-related entities. However, maintaining and updating such lists can be tedious and time-consuming, and the lists can be specific to a language and locale. We will demonstrate this approach in the Using the ExactDictionaryChunker class section later in this chapter.
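
To make the list-based idea concrete, here is a minimal sketch in plain Java. It is not the ExactDictionaryChunker class; the gazetteer entries, the entity types, and the crude letter-based tokenization are all assumptions made for illustration. It tries the longest candidate first so that a multi-token name such as "New York" wins over the shorter entry "York".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: dictionary (gazetteer) lookup with longest-match-first. */
final class DictionaryNer {

    // Hypothetical gazetteer: lower-cased entity name -> entity type.
    static final Map<String, String> GAZETTEER = Map.of(
        "new york", "LOCATION",
        "york", "LOCATION",
        "texas", "LOCATION",
        "january", "MONTH");

    // Length, in tokens, of the longest gazetteer entry.
    static final int MAX_TOKENS = 2;

    /** Returns matches as "name/TYPE", in the order they occur in the text. */
    static List<String> findEntities(String text) {
        // Simplified tokenization: lower-case, then split on non-letters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            boolean matched = false;
            // Try the longest candidate first, then shrink.
            for (int len = Math.min(MAX_TOKENS, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String type = GAZETTEER.get(candidate);
                if (type != null) {
                    found.add(candidate + "/" + type);
                    i += len;          // skip past the matched tokens
                    matched = true;
                    break;
                }
            }
            if (!matched) i++;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findEntities("She moved from Texas to New York in January."));
        // prints [texas/LOCATION, new york/LOCATION, january/MONTH]
    }
}
```

The sketch also shows why such lists are locale specific: lower-casing and splitting on non-letters works tolerably for English place names but would need rethinking for other languages.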

Regular expressions can be useful in identifying entities. Their powerful syntax provides enough flexibility in many situations to accurately isolate the entity of interest. However, this flexibility can also make the expressions difficult to understand and maintain. We will demonstrate several regular expression approaches in this chapter.
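
As a small preview of the idea, the following sketch uses the standard java.util.regex classes to pull out entities that have a recognizable form. The two patterns (a hyphenated 10-digit phone number and a basic e-mail address) are deliberately simplified assumptions and will miss many valid real-world variations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: regular-expression entity extraction with character spans. */
final class RegexNer {

    // Illustrative patterns only; insertion order is preserved.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("PHONE", Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b"));
        PATTERNS.put("EMAIL", Pattern.compile("\\b[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b"));
    }

    /** Returns each match as "TYPE[start,end): text", grouped by pattern. */
    static List<String> findEntities(String text) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                found.add(entry.getKey() + "[" + matcher.start() + "," + matcher.end() + "): "
                    + matcher.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Call 800-555-1234 or write to help@example.com today.";
        for (String entity : findEntities(text)) {
            System.out.println(entity);
        }
        // prints:
        // PHONE[5,17): 800-555-1234
        // EMAIL[30,46): help@example.com
    }
}
```

Even these two short patterns hint at the maintenance cost: the e-mail expression is already hard to read at a glance, and extending it to cover more cases would make it harder still.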

Statistical classifiers

Statistical classifiers determine whether a word is the start of an entity, the continuation of an entity, or not an entity at all. Sample text is tagged to isolate entities. Once a classifier has been developed, it can be trained on different sets of data for different problem domains. The disadvantage of this approach is that it requires someone to annotate the sample text, which is a time-consuming process. In addition, a trained classifier is domain dependent.
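
The start, continuation, and not-an-entity decisions can be pictured as one tag per token. The sketch below assumes the widely used B-/I-/O tag naming, which this text does not itself prescribe, and uses hard-coded tags in place of the output of a trained classifier. It shows only how such per-token decisions are grouped into entity spans.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: grouping per-token begin/inside/outside tags into entities. */
final class BioDecoder {

    /** Returns entities as "text/TYPE"; tags are "B-TYPE", "I-TYPE", or "O". */
    static List<String> decode(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> entities = new ArrayList<>();
        StringBuilder current = null;   // text of the entity being built, if any
        String currentType = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = tag.startsWith("I-")
                && current != null
                && tag.substring(2).equals(currentType);
            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Any other tag closes the entity that was open, if there was one.
            if (current != null) {
                entities.add(current + "/" + currentType);
                current = null;
                currentType = null;
            }
            // "B-" starts an entity; a stray "I-" is leniently treated as a start too.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                current = new StringBuilder(tokens[i]);
                currentType = tag.substring(2);
            }
        }
        if (current != null) {
            entities.add(current + "/" + currentType);
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "Queen", "of", "England", "visited", "New", "York"};
        String[] tags = {"O", "B-PERSON", "I-PERSON", "I-PERSON", "O", "B-LOCATION", "I-LOCATION"};
        System.out.println(decode(tokens, tags));
        // prints [Queen of England/PERSON, New York/LOCATION]
    }
}
```

The annotation effort mentioned above amounts to producing tag sequences like the hard-coded one here, by hand, for every sentence in the training sample.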

We will examine several approaches to performing NER. We will start by explaining how regular expressions are used to identify entities.
