Chapter 5. Detecting Part of Speech

Previously, we identified parts of text such as people, places, and things. In this chapter, we will investigate the process of finding parts of speech (POS). These are the grammatical elements that we recognize in English, such as nouns and verbs. We will find that the context of a word is an important factor in determining what type of word it is.

We will examine the tagging process, which essentially assigns a POS tag to a token. This process is at the heart of detecting POS. We will briefly discuss why tagging is important and then examine the various factors that make detecting POS difficult. Various NLP APIs are then used to illustrate the tagging process. We will also demonstrate how to train a model to address specialized text.

The tagging process

Tagging is the process of assigning a description to a token or a portion of text. This description is called a tag. POS tagging is the process of assigning a POS tag to a token. Typical tags include noun, verb, and adjective.

For example, consider the following sentence:

"The cow jumped over the moon."

For many of these initial examples, we will illustrate the result of a POS tagger using the OpenNLP tagger, which is discussed in the Using OpenNLP POS taggers section later in this chapter. If we use that tagger with the previous example, we will get the following results. Notice that each word is followed by a forward slash and then its POS tag. These tags will be explained shortly:

The/DT cow/NN jumped/VBD over/IN the/DT moon./NN
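
Output in this word/TAG form is easy to post-process. The following is a minimal sketch, not part of OpenNLP, that splits such a string back into word and tag pairs. The class and method names (TaggedTextParser, parse) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): splits "word/TAG" output into pairs.
class TaggedTextParser {

    // Returns one {word, tag} array per whitespace-separated item.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so a token that itself contains a slash stays intact.
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                pairs.add(new String[] {item, ""}); // no tag present
            } else {
                pairs.add(new String[] {item.substring(0, slash), item.substring(slash + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("The/DT cow/NN jumped/VBD over/IN the/DT moon./NN")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Note that the final pair is "moon." with the period still attached, because the sentence was split on whitespace only. This is one reason tokenization matters for tagging quality.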

Words can have more than one tag associated with them, depending on their context. For example, the word "saw" could be a noun or a verb. When a word can be classified into different categories, information such as its position and the words in its vicinity is used to probabilistically determine the appropriate category. For example, if a word is preceded by a determiner and followed by a noun, then tag the word as an adjective.
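
The contextual rule in the last sentence can be written down directly. The following is a minimal sketch that assumes the tags of the neighboring words are already known and that the candidate tags for the ambiguous word come from some dictionary. The names ContextRule and resolve are hypothetical, and falling back to the first candidate is an arbitrary choice made for the illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a single contextual disambiguation rule.
class ContextRule {

    // Rule from the text: determiner before + noun after => adjective (JJ).
    // If the rule does not apply, fall back to the first candidate tag.
    static String resolve(String previousTag, List<String> candidates, String nextTag) {
        boolean afterDeterminer = "DT".equals(previousTag);
        boolean beforeNoun = nextTag != null && nextTag.startsWith("NN");
        if (afterDeterminer && beforeNoun && candidates.contains("JJ")) {
            return "JJ";
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // "light" may be a noun, an adjective, or a verb (assumed dictionary entry).
        List<String> light = Arrays.asList("NN", "JJ", "VB");
        System.out.println(resolve("DT", light, "NN"));  // as in "the light box"
        System.out.println(resolve("PRP", light, "DT")); // rule does not apply
    }
}
```

A practical rule-based tagger would hold many such rules and apply them in a defined order.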

The general tagging process consists of tokenizing the text, determining possible tags, and resolving ambiguous tags. Algorithms are used to perform POS identification (tagging). There are two general approaches:

Rule-based: Rule-based taggers use a set of rules and a dictionary of words and possible tags. The rules are used when a word has multiple tags. Rules often use the previous and/or following words to select a tag.
Stochastic: Stochastic taggers are either based on the Markov model or are cue-based, using either decision trees or maximum entropy. Markov models are finite state machines where each state has two probability distributions. The model's objective is to find the optimal sequence of tags for a sentence. Hidden Markov Models (HMM) are also used. In these models, the states (here, the tags) are not directly visible; only the words are observed.
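
To make the stochastic approach more concrete, the following sketch finds the most probable tag sequence for a sentence with the Viterbi algorithm, a standard way to decode a Hidden Markov Model. It is an illustration only: the three-tag tag set, the three-word vocabulary, and every probability are invented for the example, and the class name ViterbiSketch is hypothetical. The two distributions attached to each state are the transition probabilities (which tag follows which) and the emission probabilities (which word a tag produces).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy HMM tagger; all numbers below are made up for illustration.
class ViterbiSketch {
    static final String[] TAGS = {"DT", "NN", "VBD"};
    // P(tag at sentence start)
    static final double[] START = {0.6, 0.3, 0.1};
    // TRANSITION[i][j] = P(next tag j | current tag i)
    static final double[][] TRANSITION = {
        {0.05, 0.90, 0.05},
        {0.10, 0.30, 0.60},
        {0.60, 0.30, 0.10}
    };
    // EMISSION.get(word)[j] = P(word | tag j)
    static final Map<String, double[]> EMISSION = new HashMap<>();
    static final double[] UNKNOWN = {1.0 / 3, 1.0 / 3, 1.0 / 3};
    static {
        EMISSION.put("the", new double[] {1.0, 0.0, 0.0});
        EMISSION.put("cow", new double[] {0.0, 0.5, 0.0});
        EMISSION.put("saw", new double[] {0.0, 0.5, 1.0});
    }

    static String[] tag(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) return new String[0];
        double[][] best = new double[n][k]; // best path probability ending in tag j at position t
        int[][] back = new int[n][k];       // previous tag on that best path
        for (int t = 0; t < n; t++) {
            double[] emit = EMISSION.getOrDefault(words[t].toLowerCase(Locale.ROOT), UNKNOWN);
            for (int j = 0; j < k; j++) {
                if (t == 0) {
                    best[0][j] = START[j] * emit[j];
                    continue;
                }
                for (int i = 0; i < k; i++) {
                    double score = best[t - 1][i] * TRANSITION[i][j] * emit[j];
                    if (score > best[t][j]) {
                        best[t][j] = score;
                        back[t][j] = i;
                    }
                }
            }
        }
        // Pick the best final tag, then follow the back pointers.
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (best[n - 1][j] > best[n - 1][last]) last = j;
        }
        String[] tags = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            tags[t] = TAGS[last];
            last = back[t][last];
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tag(new String[] {"the", "saw"})));        // DT NN
        System.out.println(String.join(" ", tag(new String[] {"the", "cow", "saw"}))); // DT NN VBD
    }
}
```

With these invented numbers, "saw" is tagged as a noun after a determiner and as a verb after a noun. The sketch multiplies raw probabilities, which is adequate for short sentences; implementations intended for long inputs usually add log probabilities instead to avoid underflow.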

A maximum entropy tagger uses statistics to determine the POS for a word and often uses a corpus to train a model. A corpus is a collection of words marked up with POS tags. Corpora exist for a number of languages, and they take a lot of effort to develop. Frequently used corpora include the Penn Treebank (http://www.cis.upenn.edu/~treebank/) and the Brown Corpus (http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html).

A sample from the Penn Treebank corpus, which illustrates POS markup, is as follows:

Well/UH what/WP do/VBP you/PRP think/VB about/IN
the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG
to/TO do/VB public/JJ service/NN work/NN for/IN a/DT
year/NN ?/.
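
To give a feel for how a tagged corpus can drive a statistical tagger, the following sketch counts how often each word appears with each tag and then assigns every word its most frequent tag. This is a deliberately simple baseline, not the maximum entropy approach described above, and the class and method names (UnigramBaseline, train, mostFrequentTag) are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical most-frequent-tag baseline trained from "word/TAG" text.
class UnigramBaseline {
    // word -> (tag -> number of times the word carried that tag)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(String taggedCorpus) {
        for (String item : taggedCorpus.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) continue; // skip items without a tag
            String word = item.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = item.substring(slash + 1);
            counts.computeIfAbsent(word, w -> new TreeMap<>()).merge(tag, 1, Integer::sum);
        }
    }

    // Returns the tag seen most often with the word, or the fallback for unseen words.
    String mostFrequentTag(String word, String fallback) {
        Map<String, Integer> tagCounts = counts.get(word.toLowerCase(Locale.ROOT));
        if (tagCounts == null) return fallback;
        String best = fallback;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : tagCounts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        UnigramBaseline baseline = new UnigramBaseline();
        baseline.train("Well/UH what/WP do/VBP you/PRP think/VB about/IN "
            + "the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG "
            + "to/TO do/VB public/JJ service/NN work/NN for/IN a/DT year/NN ?/.");
        System.out.println(baseline.mostFrequentTag("kids", "NN"));  // NNS
        System.out.println(baseline.mostFrequentTag("zebra", "NN")); // NN (fallback, unseen word)
    }
}
```

Even in this tiny sample, "do" occurs once as VBP and once as VB, so a per-word count cannot separate the two uses. That limitation is what context-sensitive models are meant to address.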

There are traditionally nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, a more complete analysis often requires additional categories and subcategories. There have been as many as 150 different parts of speech identified. In some situations, it may be necessary to create new tags. A short list is shown in the following table.

These are the tags we use frequently in this chapter:

Tag	Meaning
`NN`	Noun, singular or mass
`DT`	Determiner
`VB`	Verb, base form
`VBD`	Verb, past tense
`VBZ`	Verb, third person singular present
`IN`	Preposition or subordinating conjunction
`NNP`	Proper noun, singular
`TO`	to
`JJ`	Adjective

A more comprehensive list is shown in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tag set can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html. A set of tags is referred to as a tag set.

Tag	Description	Tag	Description
`CC`	Coordinating conjunction	`PRP$`	Possessive pronoun
`CD`	Cardinal number	`RB`	Adverb
`DT`	Determiner	`RBR`	Adverb, comparative
`EX`	Existential there	`RBS`	Adverb, superlative
`FW`	Foreign word	`RP`	Particle
`IN`	Preposition or subordinating conjunction	`SYM`	Symbol
`JJ`	Adjective	`TO`	to
`JJR`	Adjective, comparative	`UH`	Interjection
`JJS`	Adjective, superlative	`VB`	Verb, base form
`LS`	List item marker	`VBD`	Verb, past tense
`MD`	Modal	`VBG`	Verb, gerund or present participle
`NN`	Noun, singular or mass	`VBN`	Verb, past participle
`NNS`	Noun, plural	`VBP`	Verb, non-third person singular present
`NNP`	Proper noun, singular	`VBZ`	Verb, third person singular present
`NNPS`	Proper noun, plural	`WDT`	Wh-determiner
`PDT`	Predeterminer	`WP`	Wh-pronoun
`POS`	Possessive ending	`WP$`	Possessive wh-pronoun
`PRP`	Personal pronoun	`WRB`	Wh-adverb

Developing a corpus manually is labor intensive. However, some statistical techniques have been developed to create corpora. A number of corpora are available. One of the first was the Brown Corpus (http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM). Newer ones include the British National Corpus (http://www.natcorp.ox.ac.uk/corpus/index.xml), with over 100 million words, and the American National Corpus (http://www.anc.org/). A list of corpora can be found at http://en.wikipedia.org/wiki/List_of_text_corpora.

Importance of POS taggers

Proper tagging of a sentence can enhance the quality of downstream processing tasks. If we know that "sue" is a verb and not a noun, then this can help establish the correct relationships among tokens. Determining the POS, phrases, clauses, and any relationships between them is called parsing. This is in contrast to tokenization, where we are only interested in identifying "word" elements and are not concerned with their meaning.

POS tagging is used for many downstream processes such as question analysis and analyzing the sentiment of text. Some social media sites are frequently interested in assessing the sentiment of their clients' communication. Text indexing will frequently use POS data. Speech processing can use tags to help decide how to pronounce words.

What makes POS difficult?

There are many aspects of a language that can make POS tagging difficult. Many common English words can take two or more tags. A dictionary is not always sufficient to determine a word's POS. For example, the meaning of words like "bill" and "force" depends on their context. The following sentence demonstrates how both can be used in the same sentence as nouns and verbs.

"Bill used the force to force the manger to tear the bill in two."

Using the OpenNLP tagger with this sentence produces the following output:

Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
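
The ambiguity is easy to see if the tags are grouped by word. The following sketch, using the hypothetical names AmbiguityReport and tagsByWord, reads the tagger output shown above and lists every word that received more than one tag. Lowercasing the words and stripping trailing sentence punctuation are simplifying assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: groups the tags observed for each word in "word/TAG" output.
class AmbiguityReport {

    static Map<String, Set<String>> tagsByWord(String tagged) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) continue;
            // Lowercase and drop trailing sentence punctuation so "Bill" and "bill" are grouped.
            String word = item.substring(0, slash)
                .toLowerCase(Locale.ROOT)
                .replaceAll("[.!?]+$", "");
            String tag = item.substring(slash + 1);
            result.computeIfAbsent(word, w -> new TreeSet<>()).add(tag);
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT "
            + "manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$";
        tagsByWord(tagged).forEach((word, tags) -> {
            if (tags.size() > 1) {
                System.out.println(word + " -> " + tags);
            }
        });
        // prints:
        // bill -> [NN, NNP]
        // force -> [NN, VB]
    }
}
```

A dictionary alone would list both tags for each of these words; only the surrounding words indicate which one applies in a given position.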

Textese is a combination of different forms of text, including abbreviations, hashtags, emoticons, and slang. Its use in communication media such as tweets and text messages makes sentences more difficult to tag. For example, the following message is difficult to tag:

"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."

Its equivalent is:

"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."

Using the OpenNLP tagger, we will get the following output:

AFAIK/NNS she/PRP H8/CD cth!/.
BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.

In the Using the MaxentTagger class to tag textese section later in this chapter, we will demonstrate how the Stanford MaxentTagger class can handle textese. A short list of textese is given in the following table:

Phrase	Textese	Phrase	Textese
As far as I know	AFAIK	By the way	BTW
Away from keyboard	AFK	You're on your own	YOYO
Thanks	THNX or THX	As soon as possible	ASAP
Today	2day	What do you mean by that	WDYMBT
Before	B4	Be back in a minute	BBIAM
See you	C U	Can't	CNT
Haha	hh	Later	l8R
Laughing out loud	LOL	On the other hand	OTOH
Rolling on the floor laughing	ROFL or ROTFL	I don't know	IDK
Great	GR8	Cleaning the house	CTH
At the moment	ATM	In my humble opinion	IMHO
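
One simple way to cope with textese is to expand known abbreviations before tagging. The following is a minimal sketch of that idea, assuming a small lookup map filled with a few entries from the table above. The class name TexteseNormalizer is hypothetical, and this is not the approach taken by any of the taggers discussed in this chapter.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical pre-processing step: expands textese using a lookup table.
class TexteseNormalizer {
    static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        // A few entries taken from the table above; a real list would be much longer.
        EXPANSIONS.put("AFAIK", "as far as I know");
        EXPANSIONS.put("BTW", "by the way");
        EXPANSIONS.put("GR8", "great");
        EXPANSIONS.put("CTH", "cleaning the house");
        EXPANSIONS.put("BBIAM", "be back in a minute");
        EXPANSIONS.put("IDK", "I don't know");
    }

    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            // Separate trailing punctuation so that "cth!" still matches "CTH".
            String core = token.replaceAll("[.!?,]+$", "");
            String trailing = token.substring(core.length());
            String expansion = EXPANSIONS.get(core.toUpperCase(Locale.ROOT));
            if (out.length() > 0) out.append(' ');
            out.append(expansion != null ? expansion : core).append(trailing);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
        // prints: as far as I know she H8 cleaning the house! by the way had a great tym at the party be back in a minute.
    }
}
```

Tokens that are not in the map, such as "H8" and "tym", pass through unchanged, which shows the limit of a fixed lookup table.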

Note

Although there are several lists of textese, a large list can be found at http://www.ukrainecalling.com/textspeak.aspx.

Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems including:

If we convert text to lowercase, then a word such as "sam" could refer to a person or to the System for Award Management (www.sam.gov)
We have to take into account contractions such as "can't" and recognize that different characters may be used for the apostrophe
Although a phrase such as "vice versa" can be treated as a unit, it has also been used as the name of a band in England, the title of a novel, and the title of a magazine
We can't ignore hyphenated words such as "first-cut" and "prime-cut" that have meanings different from their individual use
Some words have embedded numbers such as iPhone 5S
Special character sequences such as a URL or e-mail address also need to be handled
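
Several of these problems can be reduced with a tokenizer that is aware of them. The following sketch uses a regular expression that keeps URLs, e-mail addresses, contractions, and hyphenated words together, after first mapping alternative apostrophe characters to a plain apostrophe. The class name SimpleTokenizer and the pattern itself are assumptions made for illustration; the tokenizers shipped with NLP libraries are considerably more thorough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex tokenizer that protects a few special token types.
class SimpleTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"                        // URLs
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"  // e-mail addresses
        + "|\\w+(?:['-]\\w+)*"                 // words, contractions, hyphenated words
        + "|\\S");                             // any other single non-space character

    static List<String> tokenize(String text) {
        // Map typographic quotes and backticks to a plain apostrophe first.
        String normalized = text.replace('\u2019', '\'')
                                .replace('\u2018', '\'')
                                .replace('`', '\'');
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(normalized);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I can\u2019t skip the first-cut at http://www.sam.gov today."));
        // prints: [I, can't, skip, the, first-cut, at, http://www.sam.gov, today, .]
    }
}
```

One known limitation of this pattern is that a URL immediately followed by sentence punctuation will absorb that punctuation, so it should be treated as a starting point only.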

Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:

"Whether "Blue" was correct or not (it's not) is debatable."

"Blue" could refer to the color blue or conceivably the nickname of a person. The output of the tagger for this sentence is as follows:

Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG
