Summary

POS tagging is a powerful technique for identifying the grammatical parts of a sentence. Its output supports downstream tasks such as question analysis and sentiment analysis. We will return to this subject when we address parsing in Chapter 7, Using a Parser to Extract Relationships.

Tagging is not an easy process because of the ambiguities found in most languages. The increasing use of textese only makes it more difficult. Fortunately, there are models that do a good job of tagging this type of text. However, as new terms and slang are introduced, these models need to be kept up to date.

We investigated the use of OpenNLP, the Stanford API, and LingPipe in support of tagging. These libraries take several approaches to tagging words, both rule-based and model-based. We also saw how dictionaries can be used to enhance the tagging process.

We briefly touched on the model training process. Pretagged sample texts are used as input to the process, and a model emerges as output. Although we did not address validation of the model, it can be done in much the same way as in earlier chapters.
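To make the "pretagged text in, model out" idea concrete, the following is a minimal sketch of a most-frequent-tag baseline written in plain Java. It is not the training API of OpenNLP, the Stanford API, or LingPipe. The class name BaselineTagger, the word_TAG input format, and the fallback for unseen words are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a most-frequent-tag baseline,
// not the training code of any library discussed in this chapter.
class BaselineTagger {
    // The "model": for each word, how often each tag was seen in training.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // Overall tag frequencies, used as a fallback for unseen words.
    private final Map<String, Integer> tagTotals = new HashMap<>();

    // Training input: one pretagged sentence in an assumed word_TAG format.
    void train(String taggedSentence) {
        for (String token : taggedSentence.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep <= 0 || sep == token.length() - 1) {
                continue; // skip tokens without both a word and a tag
            }
            String word = token.substring(0, sep).toLowerCase();
            String tag = token.substring(sep + 1);
            counts.computeIfAbsent(word, k -> new HashMap<>())
                  .merge(tag, 1, Integer::sum);
            tagTotals.merge(tag, 1, Integer::sum);
        }
    }

    // Returns the tag seen most often for the word, or the most common
    // tag overall for unseen words (null if nothing has been trained).
    String tag(String word) {
        Map<String, Integer> forWord = counts.get(word.toLowerCase());
        return mostFrequent(forWord != null ? forWord : tagTotals);
    }

    // Ties are broken alphabetically so that results are deterministic.
    private static String mostFrequent(Map<String, Integer> tagCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger();
        tagger.train("The_DT dog_NN runs_VBZ");
        tagger.train("The_DT runs_NNS were_VBD long_JJ");
        tagger.train("She_PRP runs_VBZ daily_RB");
        System.out.println(tagger.tag("runs"));
    }
}
```

A baseline like this ignores context entirely, which is exactly the ambiguity problem that the model-based taggers in this chapter are designed to handle. It is useful mainly as a reference point when judging a trained model.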

The various POS tagger approaches can be compared on a number of factors, such as accuracy and speed. Although we did not cover these issues here, there are numerous web resources available. One comparison that examines tagger speed can be found at http://mattwilkens.com/2008/11/08/evaluating-pos-taggers-speed/.
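As a rough sketch of how such a comparison might be set up, the following plain Java helper computes token-level accuracy against gold-standard tags and a crude wall-clock timing. The class name TaggerEvaluation and the representation of a tagger as a Function from a word array to a tag array are assumptions made for illustration; a real tagger from any of the libraries would need to be wrapped to fit this shape.

```java
import java.util.function.Function;

// Hypothetical helper: not part of OpenNLP, the Stanford API, or LingPipe.
class TaggerEvaluation {

    // Fraction of tokens whose predicted tag matches the gold tag.
    static double accuracy(Function<String[], String[]> tagger,
                           String[] words, String[] goldTags) {
        if (words.length != goldTags.length) {
            throw new IllegalArgumentException("words and goldTags differ in length");
        }
        if (words.length == 0) {
            return 0.0; // nothing to score
        }
        String[] predicted = tagger.apply(words);
        int correct = 0;
        for (int i = 0; i < words.length; i++) {
            if (goldTags[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / words.length;
    }

    // Crude timing of a single tagging call in nanoseconds. It does no
    // JIT warm-up or repetition, so treat the result as indicative only.
    static long elapsedNanos(Function<String[], String[]> tagger, String[] words) {
        long start = System.nanoTime();
        tagger.apply(words);
        return System.nanoTime() - start;
    }
}
```

For a meaningful comparison, each tagger should be run on the same held-out text, and timings should be averaged over many runs after a warm-up period.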

In the next chapter, we will examine techniques to classify documents based on their content.
