Chapter 5. Detecting Part of Speech

Previously, we identified parts of text such as people, places, and things. In this chapter, we will investigate the process of finding parts of speech (POS). These are the grammatical elements that we recognize in English, such as nouns and verbs. We will find that the context of a word is an important aspect of determining what type of word it is.

We will examine the tagging process, which essentially assigns a POS tag to a token. This process is at the heart of detecting POS. We will briefly discuss why tagging is important and then examine the various factors that make detecting POS difficult. Various NLP APIs are then used to illustrate the tagging process. We will also demonstrate how to train a model to address specialized text.

The tagging process

Tagging is the process of assigning a description to a token or a portion of text. This description is called a tag. POS tagging is the process of assigning a POS tag to a token. These tags normally denote grammatical categories such as noun, verb, and adjective.

For example, consider the following sentence:

"The cow jumped over the moon."

For many of these initial examples, we will illustrate the result of a POS tagger using the OpenNLP tagger to be discussed in Using OpenNLP POS taggers, later in this chapter. If we use that tagger with the previous example, we will get the following results. Notice that the words are followed by a forward slash and then their POS tag. These tags will be explained shortly:

The/DT cow/NN jumped/VBD over/IN the/DT moon./NN
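
The word/TAG notation is easy to process in code. The following is a minimal sketch in plain Java and is not part of OpenNLP; the class and method names are hypothetical. It splits each item at its last slash, so a token that itself contains a slash keeps it.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedTokenParser {
    // Splits one "word/TAG" item at the LAST slash; returns {word, tag}
    static String[] split(String tagged) {
        int slash = tagged.lastIndexOf('/');
        if (slash < 0) {
            return new String[] {tagged, ""}; // no tag present
        }
        return new String[] {tagged.substring(0, slash), tagged.substring(slash + 1)};
    }

    // Parses a whitespace-separated line of word/TAG items
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (!item.isEmpty()) {
                pairs.add(split(item));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("The/DT cow/NN jumped/VBD over/IN the/DT moon./NN")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Note that the last pair is "moon." with the period still attached, because the period was never split off as its own token.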

Words can potentially have more than one tag associated with them depending on their context. For example, the word "saw" could be a noun or a verb. When a word can be classified into different categories, information such as its position and the words in its vicinity is used to probabilistically determine the appropriate category. For example, if a word is preceded by a determiner and followed by a noun, then the word is tagged as an adjective.
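
To make the idea of context concrete, here is a small sketch in plain Java. It is a deterministic simplification, not how any particular library works: the lexicon, the three rules, and all names are assumptions made for illustration. The first rule is the determiner/noun rule just described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ContextRuleTagger {
    // Hypothetical mini-lexicon: each word maps to its candidate tags
    static final Map<String, List<String>> LEXICON = Map.of(
        "the", List.of("DT"),
        "she", List.of("PRP"),
        "saw", List.of("NN", "VBD"),
        "light", List.of("NN", "JJ"),
        "box", List.of("NN"),
        "cow", List.of("NN"));

    // Unknown words are assumed to be nouns
    static List<String> candidates(String word) {
        return LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), List.of("NN"));
    }

    static List<String> tag(String[] words) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> options = candidates(words[i]);
            String prev = i > 0 ? tags.get(i - 1) : "";
            // "Followed by a noun" is approximated as: the next word CAN be a noun
            boolean nounFollows = i + 1 < words.length
                && candidates(words[i + 1]).contains("NN");
            String chosen = options.get(0); // default: first candidate
            if (options.size() > 1) {
                if (options.contains("JJ") && prev.equals("DT") && nounFollows) {
                    chosen = "JJ";  // determiner before, noun after
                } else if (options.contains("NN") && prev.equals("DT")) {
                    chosen = "NN";  // determiner before
                } else if (options.contains("VBD")
                        && (prev.equals("PRP") || prev.equals("NN"))) {
                    chosen = "VBD"; // pronoun or noun before
                }
            }
            tags.add(chosen);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag(new String[] {"she", "saw", "the", "cow"})); // [PRP, VBD, DT, NN]
        System.out.println(tag(new String[] {"the", "light", "box"}));      // [DT, JJ, NN]
    }
}
```

The word "saw" is tagged VBD after "she" but NN after "the", which is the kind of context dependence described above.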

The general tagging process consists of tokenizing the text, determining possible tags, and resolving ambiguous tags. Algorithms are used to perform POS identification (tagging). There are two general approaches:

  • Rule-based: Rule-based taggers use a set of rules and a dictionary of words and possible tags. The rules are applied when a word has multiple possible tags. Rules often use the previous and/or following words to select a tag.
  • Stochastic: Stochastic taggers are either based on a Markov model or are cue-based, in which case they use either decision trees or maximum entropy. Markov models are finite state machines where each state has two probability distributions. The objective is to find the optimal sequence of tags for a sentence. Hidden Markov Models (HMM) are also used. In these models, the states (here, the tags) are not directly observable; only the words are.
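
Finding the optimal sequence of tags under a hidden Markov model is commonly done with the Viterbi algorithm. The following plain Java sketch shows the idea on a four-tag toy model. Every probability in it is invented for illustration rather than estimated from a corpus, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ViterbiSketch {
    static final String[] TAGS = {"PRP", "VBD", "DT", "NN"};

    // START[j]: probability that a sentence begins with TAGS[j] (invented values)
    static final double[] START = {0.5, 0.0, 0.4, 0.1};

    // TRANS[i][j]: probability that TAGS[j] follows TAGS[i] (invented values)
    static final double[][] TRANS = {
        {0.05, 0.80, 0.05, 0.10},  // from PRP
        {0.20, 0.00, 0.60, 0.20},  // from VBD
        {0.00, 0.05, 0.05, 0.90},  // from DT
        {0.10, 0.50, 0.20, 0.20},  // from NN
    };

    // EMIT.get(word)[j]: probability of the word given TAGS[j] (invented values)
    static final Map<String, double[]> EMIT = Map.of(
        "she", new double[] {0.3, 0.0, 0.0, 0.0},
        "the", new double[] {0.0, 0.0, 0.6, 0.0},
        "saw", new double[] {0.0, 0.2, 0.0, 0.2});

    // Unknown words get the same small probability under every tag
    static final double[] UNKNOWN = {0.01, 0.01, 0.01, 0.01};

    static List<String> tag(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) {
            return new ArrayList<>();
        }
        double[][] score = new double[n][k]; // best path probability ending in tag j at word t
        int[][] back = new int[n][k];        // previous tag on that best path
        double[] emit = EMIT.getOrDefault(words[0], UNKNOWN);
        for (int j = 0; j < k; j++) {
            score[0][j] = START[j] * emit[j];
        }
        for (int t = 1; t < n; t++) {
            emit = EMIT.getOrDefault(words[t], UNKNOWN);
            for (int j = 0; j < k; j++) {
                double best = -1.0;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] * TRANS[i][j];
                    if (s > best) {
                        best = s;
                        back[t][j] = i;
                    }
                }
                score[t][j] = best * emit[j];
            }
        }
        // Pick the best final tag, then follow the back pointers
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) {
                last = j;
            }
        }
        String[] result = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            result[t] = TAGS[last];
            last = back[t][last];
        }
        return List.of(result);
    }

    public static void main(String[] args) {
        System.out.println(tag(new String[] {"she", "saw", "the", "saw"})); // [PRP, VBD, DT, NN]
    }
}
```

In this toy model "saw" is equally likely as VBD or NN, so the transition probabilities alone decide: a verb is likely after a pronoun and a noun is likely after a determiner. A production tagger would estimate these tables from a corpus and would typically work with log probabilities to avoid underflow on long sentences.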

A maximum entropy tagger uses statistics to determine the POS for a word and often uses a corpus to train a model. A corpus is a collection of words marked up with POS tags. Corpora exist for a number of languages. These take a lot of effort to develop. Frequently used corpora include the Penn Treebank (http://www.cis.upenn.edu/~treebank/) and the Brown Corpus (http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html).

A sample from the Penn Treebank corpus, which illustrates POS markup, is as follows:

Well/UH what/WP do/VBP you/PRP think/VB about/IN
the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG
to/TO do/VB public/JJ service/NN work/NN for/IN a/DT
year/NN ?/.
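
A tagged corpus such as this is the raw material for training. As a rough illustration, the sketch below counts how often each word appears with each tag, which is about the simplest statistic a tagger could be built on. It is plain Java with hypothetical names and is not the training procedure of any particular library.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TagFrequencyCounter {
    // The Penn Treebank sample quoted above, joined into one string
    static final String SAMPLE =
        "Well/UH what/WP do/VBP you/PRP think/VB about/IN "
        + "the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG "
        + "to/TO do/VB public/JJ service/NN work/NN for/IN a/DT "
        + "year/NN ?/.";

    // Returns word -> (tag -> count); words are lowercased before counting
    static Map<String, Map<String, Integer>> count(String corpus) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String item : corpus.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip items that lack a word/TAG shape
            }
            String word = item.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = item.substring(slash + 1);
            counts.computeIfAbsent(word, w -> new TreeMap<>()).merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    // Picks the tag seen most often for a word, or null if the word never occurred
    static String mostFrequentTag(Map<String, Map<String, Integer>> counts, String word) {
        Map<String, Integer> tags = counts.get(word.toLowerCase(Locale.ROOT));
        if (tags == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : tags.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = count(SAMPLE);
        System.out.println("do   -> " + counts.get("do"));
        System.out.println("idea -> " + mostFrequentTag(counts, "idea"));
    }
}
```

Even in this tiny sample, "do" occurs once as VBP and once as VB, so a per-word count cannot settle its tag and the surrounding context has to be consulted.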

There are traditionally nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, a more complete analysis often requires additional categories and subcategories. There have been as many as 150 different parts of speech identified. In some situations, it may be necessary to create new tags. A short list is shown in the following table.

These are the tags we use frequently in this chapter:

Tag    Meaning
NN     Noun, singular or mass
DT     Determiner
VB     Verb, base form
VBD    Verb, past tense
VBZ    Verb, third person singular present
IN     Preposition or subordinating conjunction
NNP    Proper noun, singular
TO     to
JJ     Adjective

A more comprehensive list is shown in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tag set can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html. A set of tags is referred to as a tag set.

Tag    Description                                 Tag    Description
CC     Coordinating conjunction                    PRP$   Possessive pronoun
CD     Cardinal number                             RB     Adverb
DT     Determiner                                  RBR    Adverb, comparative
EX     Existential there                           RBS    Adverb, superlative
FW     Foreign word                                RP     Particle
IN     Preposition or subordinating conjunction    SYM    Symbol
JJ     Adjective                                   TO     to
JJR    Adjective, comparative                      UH     Interjection
JJS    Adjective, superlative                      VB     Verb, base form
LS     List item marker                            VBD    Verb, past tense
MD     Modal                                       VBG    Verb, gerund or present participle
NN     Noun, singular or mass                      VBN    Verb, past participle
NNS    Noun, plural                                VBP    Verb, non-third person singular present
NNP    Proper noun, singular                       VBZ    Verb, third person singular present
NNPS   Proper noun, plural                         WDT    Wh-determiner
PDT    Predeterminer                               WP     Wh-pronoun
POS    Possessive ending                           WP$    Possessive wh-pronoun
PRP    Personal pronoun                            WRB    Wh-adverb

Developing a corpus by hand is labor intensive. However, some statistical techniques have been developed to create corpora. A number of corpora are available. One of the first was the Brown Corpus (http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM). Newer ones include the British National Corpus (http://www.natcorp.ox.ac.uk/corpus/index.xml), with over 100 million words, and the American National Corpus (http://www.anc.org/). A list of corpora can be found at http://en.wikipedia.org/wiki/List_of_text_corpora.

Importance of POS taggers

Proper tagging of a sentence can enhance the quality of downstream processing tasks. If we know that "sue" is a verb and not a noun, then this can help establish the correct relationships among tokens. Determining the POS, phrases, clauses, and any relationships between them is called parsing. This is in contrast to tokenization, where we are only interested in identifying "word" elements and are not concerned about their meaning.

POS tagging is used for many downstream processes such as question analysis and analyzing the sentiment of text. Some social media sites are frequently interested in assessing the sentiment of their clients' communication. Text indexing will frequently use POS data. Speech processing can use tags to help decide how to pronounce words.

What makes POS difficult?

There are many aspects of a language that can make POS tagging difficult. Most English words will have two or more tags associated with them. A dictionary is not always sufficient to determine a word's POS. For example, the meanings of words like "bill" and "force" depend on their context. The following sentence demonstrates how both words can be used in the same sentence as nouns and verbs.

"Bill used the force to force the manger to tear the bill in two."

Using the OpenNLP tagger with this sentence produces the following output:

Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$

The use of textese, a combination of different forms of text including abbreviations, hashtags, emoticons, and slang, in communication media such as tweets and text messages makes it more difficult to tag sentences. For example, the following message is difficult to tag:

"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."

Its equivalent is:

"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."

Using the OpenNLP tagger, we will get the following output:

AFAIK/NNS she/PRP H8/CD cth!/.
BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.

In the Using the MaxentTagger class to tag textese section later in this chapter, we will demonstrate how the Stanford MaxentTagger class can handle textese. A short list of textese is given in the following table:

Phrase                           Textese          Phrase                      Textese
As far as I know                 AFAIK            By the way                  BTW
Away from keyboard               AFK              You're on your own          YOYO
Thanks                           THNX or THX      As soon as possible         ASAP
Today                            2day             What do you mean by that    WDYMBT
Before                           B4               Be back in a minute         BBIAM
See you                          C U              Can't                       CNT
Haha                             hh               Later                       l8R
Laughing out loud                LOL              On the other hand           OTOH
Rolling on the floor laughing    ROFL or ROTFL    I don't know                IDK
Great                            GR8              Cleaning the house          CTH
At the moment                    ATM              In my humble opinion        IMHO

Note

Although there are several lists of textese, a large list can be found at http://www.ukrainecalling.com/textspeak.aspx.
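
One simple way to reduce the impact of textese is to expand known abbreviations before the text reaches the tagger. The sketch below is plain Java with hypothetical names; the lookup table holds only a few entries taken from the preceding table, and expanding before tagging is an assumption of this sketch rather than a technique the chapter prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TexteseExpander {
    // A few entries from the textese table; keys are stored in uppercase
    static final Map<String, String> TEXTESE = Map.of(
        "AFAIK", "as far as I know",
        "BTW", "by the way",
        "GR8", "great",
        "CTH", "cleaning the house",
        "BBIAM", "be back in a minute",
        "IDK", "I don't know",
        "B4", "before",
        "ASAP", "as soon as possible");

    static String expand(String message) {
        List<String> out = new ArrayList<>();
        for (String token : message.trim().split("\\s+")) {
            // Set trailing punctuation aside so "cth!" is looked up as "cth"
            int end = token.length();
            while (end > 0 && ".,!?;:".indexOf(token.charAt(end - 1)) >= 0) {
                end--;
            }
            String core = token.substring(0, end);
            String expansion = TEXTESE.get(core.toUpperCase(Locale.ROOT));
            out.add((expansion != null ? expansion : core) + token.substring(end));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(expand("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

Tokens that are missing from the table, such as "H8" and "tym" here, pass through unchanged, so a lookup table only helps as far as its coverage goes.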

Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems including:

  • If we convert the text to lowercase, then a word such as "sam" could refer either to a person or to the System for Award Management (www.sam.gov)
  • We have to take into account contractions such as "can't" and recognize that different characters may be used for the apostrophe
  • Although a phrase such as "vice versa" can be treated as a unit, it has also been used as the name of a band in England, the title of a novel, and the title of a magazine
  • We can't ignore hyphenated words such as "first-cut" and "prime-cut" that have meanings different from their individual use
  • Some words have embedded numbers such as iPhone 5S
  • Special character sequences such as a URL or e-mail address also need to be handled
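
Several of these problems can be reduced in a preprocessing step ahead of the tagger. The following plain Java sketch is one possible approach, with hypothetical names and a deliberately simple regular expression. It normalizes curly apostrophes, keeps URLs, e-mail addresses, contractions, and hyphenated words as single tokens, and preserves case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreTaggingTokenizer {
    // Alternatives are tried left to right:
    // URL, e-mail address, word (with inner ' or -), any other single symbol
    static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+|www\\.\\S+"
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"
        + "|\\w+(?:['-]\\w+)*"
        + "|\\S");

    // Maps left and right curly single quotes to the plain apostrophe
    static String normalizeApostrophes(String text) {
        return text.replace('\u2018', '\'').replace('\u2019', '\'');
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(normalizeApostrophes(text));
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Sam can\u2019t ignore first-cut prices at www.sam.gov or the iPhone 5S.";
        System.out.println(tokenize(text));
    }
}
```

The URL pattern is greedy, so punctuation that immediately follows a URL stays attached to it. The pattern also matches ASCII word characters only. A production tokenizer would need to handle both limitations.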

Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:

"Whether "Blue" was correct or not (it's not) is debatable."

"Blue" could refer to the color blue or conceivably the nickname of a person. The output of the tagger for this sentence is as follows:

Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG