Previously, we identified parts of text such as people, places, and things. In this chapter, we will investigate the process of finding parts of speech (POS). These are the grammatical elements that we recognize in English, such as nouns and verbs. We will find that the context of a word is an important factor in determining what type of word it is.
We will examine the tagging process, which essentially assigns a POS tag to a token. This process is at the heart of detecting POS. We will briefly discuss why tagging is important and then examine the various factors that make detecting POS difficult. Various NLP APIs are then used to illustrate the tagging process. We will also demonstrate how to train a model to address specialized text.
Tagging is the process of assigning a description to a token or a portion of text. This description is called a tag. POS tagging is the process of assigning a POS tag to a token. Typical tags include noun, verb, and adjective.
For example, consider the following sentence:
"The cow jumped over the moon."
For many of these initial examples, we will illustrate the result of a POS tagger using the OpenNLP tagger, which is discussed in the "Using OpenNLP POS taggers" section later in this chapter. If we use that tagger with the previous example, we will get the following results. Notice that each word is followed by a forward slash and then its POS tag. These tags will be explained shortly:
The/DT cow/NN jumped/VBD over/IN the/DT moon./NN
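The word/tag format shown above is easy to take apart in code. The following is a minimal sketch, not part of any tagger API, that splits such output into token and tag pairs. The class name `TaggedTokenParser` and its nested `TaggedToken` type are hypothetical names introduced only for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTokenParser {

    // Holds one token together with the POS tag assigned to it.
    public static final class TaggedToken {
        public final String token;
        public final String tag;

        public TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return token + " -> " + tag;
        }
    }

    // Splits "word/TAG word/TAG ..." into token/tag pairs.
    // The LAST slash is used as the separator so that a token which
    // itself contains a slash is not cut in the wrong place.
    public static List<TaggedToken> parse(String taggedSentence) {
        List<TaggedToken> result = new ArrayList<>();
        for (String item : taggedSentence.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no token or no tag attached; skip the item
            }
            result.add(new TaggedToken(item.substring(0, slash),
                                       item.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "The/DT cow/NN jumped/VBD over/IN the/DT moon./NN";
        for (TaggedToken t : parse(tagged)) {
            System.out.println(t);
        }
    }
}
```

Note that the final token is `moon.` with the period still attached, exactly as it appears in the tagger output.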
Words can potentially have more than one tag associated with them depending on their context. For example, the word "saw" could be a noun or a verb. When a word can be classified into different categories, information such as its position and the words in its vicinity is used to probabilistically determine the appropriate category. For example, if a word is preceded by a determiner and followed by a noun, then a tagger might tag the word as an adjective.
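To make the determiner and noun example concrete, here is a small sketch of that single rule. It is a deliberately simplified, deterministic illustration rather than how a real tagger weighs context. The class `ContextRuleSketch`, the candidate tag list for "light", and the fallback of choosing the first candidate are all assumptions made for this example:

```java
import java.util.Arrays;
import java.util.List;

public class ContextRuleSketch {

    // Picks a tag for a word from its candidate tags, using the tags
    // of the neighbouring words as context.
    public static String resolve(String previousTag,
                                 List<String> candidates,
                                 String nextTag) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate tag is required");
        }
        if (candidates.size() == 1) {
            return candidates.get(0); // unambiguous; nothing to resolve
        }
        // Rule from the text: determiner before, noun after => adjective.
        boolean afterDeterminer = "DT".equals(previousTag);
        boolean beforeNoun = nextTag != null && nextTag.startsWith("NN");
        if (afterDeterminer && beforeNoun && candidates.contains("JJ")) {
            return "JJ";
        }
        // Fallback (an assumption of this sketch): take the first candidate.
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical candidate tags for the word "light".
        List<String> lightTags = Arrays.asList("NN", "JJ", "VB");
        // "the light rain": determiner before, noun after.
        System.out.println(resolve("DT", lightTags, "NN"));
        // "the light faded": determiner before, verb after.
        System.out.println(resolve("DT", lightTags, "VBD"));
    }
}
```

A statistical tagger replaces hand-written conditions like this with probabilities learned from data, but the inputs it looks at are much the same.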
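The general tagging process consists of tokenizing the text, determining possible tags, and resolving ambiguous tags. Algorithms are used to perform POS identification (tagging). There are two general approaches: rule-based tagging, which relies on a set of rules and a dictionary of words and their possible tags, and stochastic tagging, which relies on statistics gathered from tagged text.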
A maximum entropy tagger uses statistics to determine the POS for a word and often uses a corpus to train a model. A corpus is a collection of words marked up with POS tags. Corpora exist for a number of languages, and they take a lot of effort to develop. Frequently used corpora include the Penn Treebank (http://www.cis.upenn.edu/~treebank/) and the Brown Corpus (http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html).
A sample from the Penn Treebank corpus, which illustrates POS markup, is as follows:
Well/UH what/WP do/VBP you/PRP think/VB about/IN the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG to/TO do/VB public/JJ service/NN work/NN for/IN a/DT year/NN ?/.
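Statistical taggers start from counts gathered over marked-up text like this. The following sketch only counts how often each tag occurs in the sample. It is not a maximum entropy model, and the class name `TagFrequencyCounter` is a hypothetical name used for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

public class TagFrequencyCounter {

    // Counts how many times each tag occurs in "word/TAG ..." text.
    // A TreeMap keeps the tags in sorted order for readable output.
    public static Map<String, Integer> countTags(String taggedText) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String item : taggedText.trim().split("\\s+")) {
            // Use the last slash so that ",/," and "?/." split correctly.
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip items without a token or tag
            }
            counts.merge(item.substring(slash + 1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Well/UH what/WP do/VBP you/PRP think/VB about/IN "
                + "the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG "
                + "to/TO do/VB public/JJ service/NN work/NN for/IN a/DT "
                + "year/NN ?/.";
        for (Map.Entry<String, Integer> e : countTags(sample).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In this sample, "do" appears once as VBP and once as VB, which is a small reminder that counts per tag alone cannot decide the tag of an individual word.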
There are traditionally nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, a more complete analysis often requires additional categories and subcategories. As many as 150 different parts of speech have been identified. In some situations, it may be necessary to create new tags. A short list is shown in the following table.
These are the tags we use frequently in this chapter:
Part | Meaning |
---|---|
NN | Noun, singular or mass |
DT | Determiner |
VB | Verb, base form |
VBD | Verb, past tense |
VBZ | Verb, third person singular present |
IN | Preposition or subordinating conjunction |
NNP | Proper noun, singular |
TO | to |
JJ | Adjective |
A more comprehensive list is shown in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tag set can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html. A set of tags is referred to as a tag set.
Tag | Description | Tag | Description |
---|---|---|---|
CC | Coordinating conjunction | PRP$ | Possessive pronoun |
CD | Cardinal number | RB | Adverb |
DT | Determiner | RBR | Adverb, comparative |
EX | Existential there | RBS | Adverb, superlative |
FW | Foreign word | RP | Particle |
IN | Preposition or subordinating conjunction | SYM | Symbol |
JJ | Adjective | TO | to |
JJR | Adjective, comparative | UH | Interjection |
JJS | Adjective, superlative | VB | Verb, base form |
LS | List item marker | VBD | Verb, past tense |
MD | Modal | VBG | Verb, gerund or present participle |
NN | Noun, singular or mass | VBN | Verb, past participle |
NNS | Noun, plural | VBP | Verb, non-third person singular present |
NNP | Proper noun, singular | VBZ | Verb, third person singular present |
NNPS | Proper noun, plural | WDT | Wh-determiner |
PDT | Predeterminer | WP | Wh-pronoun |
POS | Possessive ending | WP$ | Possessive wh-pronoun |
PRP | Personal pronoun | WRB | Wh-adverb |
Developing a corpus by hand is labor intensive. However, some statistical techniques have been developed to create corpora. A number of corpora are available. One of the first was the Brown Corpus (http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM). Newer ones include the British National Corpus (http://www.natcorp.ox.ac.uk/corpus/index.xml), with over 100 million words, and the American National Corpus (http://www.anc.org/). A list of corpora can be found at http://en.wikipedia.org/wiki/List_of_text_corpora.
Proper tagging of a sentence can enhance the quality of downstream processing tasks. If we know that "sue" is a verb and not a noun, then this can help establish the correct relationships among tokens. Determining the POS, phrases, clauses, and any relationships between them is called parsing. This is in contrast to tokenization, where we are only interested in identifying "word" elements and are not concerned with their meaning.
POS tagging is used for many downstream processes such as question analysis and analyzing the sentiment of text. Some social media sites are frequently interested in assessing the sentiment of their clients' communication. Text indexing will frequently use POS data. Speech processing can use tags to help decide how to pronounce words.
There are many aspects of a language that can make POS tagging difficult. Most English words will have two or more tags associated with them. A dictionary is not always sufficient to determine a word's POS. For example, the meaning of words like "bill" and "force" depends on their context. The following sentence demonstrates how both words can be used in the same sentence as nouns and as verbs.
"Bill used the force to force the manger to tear the bill in two."
Using the OpenNLP tagger with this sentence produces the following output:
Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
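We can confirm the ambiguity directly from this output by grouping the tags assigned to each word. The sketch below is only an illustration built on the output shown above. The class name `AmbiguousWordFinder` is hypothetical, and folding words to lowercase before grouping is an assumption made so that "Bill" and "bill" are treated as the same word:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AmbiguousWordFinder {

    // Groups the tags seen for each (lowercased) word in "word/TAG ..." text.
    public static Map<String, Set<String>> tagsByWord(String taggedSentence) {
        Map<String, Set<String>> tags = new LinkedHashMap<>();
        for (String item : taggedSentence.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip items without a token or tag
            }
            String word = item.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = item.substring(slash + 1);
            tags.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        String tagged = "Bill/NNP used/VBD the/DT force/NN to/TO force/VB "
                + "the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$";
        // Print only the words that received more than one distinct tag.
        for (Map.Entry<String, Set<String>> e : tagsByWord(tagged).entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}
```

Only "bill" and "force" receive more than one tag in this sentence, which matches the two words singled out in the discussion.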
The use of textese, a combination of different forms of text including abbreviations, hashtags, emoticons, and slang, in communication media such as tweets and text messages makes it more difficult to tag sentences. For example, the following message is difficult to tag:
"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."
Its equivalent is:
"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."
Using the OpenNLP tagger, we will get the following output:
AFAIK/NNS she/PRP H8/CD cth!/. BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.
In the "Using the MaxentTagger class to tag textese" section later in this chapter, we will provide a demonstration of how LingPipe can handle textese. A short list of textese is given in the following table:
Phrase | Textese | Phrase | Textese |
---|---|---|---|
As far as I know | AFAIK | By the way | BTW |
Away from keyboard | AFK | You're on your own | YOYO |
Thanks | THNX or THX | As soon as possible | ASAP |
Today | 2day | What do you mean by that | WDYMBT |
Before | B4 | Be back in a minute | BBIAM |
See you | C U | Can't | CNT |
Haha | hh | Later | l8R |
Laughing out loud | LOL | On the other hand | OTOH |
Rolling on the floor laughing | ROFL or ROTFL | I don't know | IDK |
Great | GR8 | Cleaning the house | CTH |
At the moment | ATM | In my humble opinion | IMHO |
Although there are several lists of textese, a large one can be found at http://www.ukrainecalling.com/textspeak.aspx.
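One possible way to use such a list, which is not the approach taken later in this chapter, is to expand known textese before the text reaches a tagger. The following is a rough sketch under that assumption. The class `TexteseExpander` and its small lookup map are hypothetical, and only a handful of entries from the table are included:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TexteseExpander {

    // A few entries copied from the textese table; keys are upper case.
    private static final Map<String, String> TEXTESE = new HashMap<>();
    static {
        TEXTESE.put("AFAIK", "as far as I know");
        TEXTESE.put("BTW", "by the way");
        TEXTESE.put("GR8", "great");
        TEXTESE.put("CTH", "cleaning the house");
        TEXTESE.put("BBIAM", "be back in a minute");
        TEXTESE.put("IDK", "I don't know");
    }

    // Replaces each whitespace-separated token found in the map with its
    // phrase. Trailing punctuation such as "!" or "." is kept in place.
    public static String expand(String message) {
        StringBuilder out = new StringBuilder();
        for (String token : message.trim().split("\\s+")) {
            int end = token.length();
            while (end > 0 && !Character.isLetterOrDigit(token.charAt(end - 1))) {
                end--;
            }
            String core = token.substring(0, end);
            String trailing = token.substring(end);
            String phrase = TEXTESE.get(core.toUpperCase(Locale.ROOT));
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(phrase != null ? phrase : core).append(trailing);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            expand("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

Forms that are missing from the map, such as "H8" and "tym", pass through unchanged, which shows the limit of a fixed lookup list.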
Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems including:
Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:
"Whether "Blue" was correct or not (it's not) is debatable."
"Blue" could refer to the color blue or conceivably the nickname of a person. The output of the tagger for this sentence is as follows:
Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG
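In this output, quotes and parentheses remain attached to the words, as in `(it's` and `not)`, which suggests that the text was split on whitespace only. The following sketch contrasts whitespace splitting with a simple regular expression that separates punctuation into its own tokens. The class `TokenizationSketch` and the pattern it uses are assumptions chosen for illustration, not the tokenizer of any particular NLP library:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizationSketch {

    // Either a run of word characters and apostrophes (so "it's" stays
    // together), or a single character that is neither a word character
    // nor whitespace (quotes, parentheses, periods, and so on).
    private static final Pattern TOKEN = Pattern.compile("[\\w']+|[^\\w\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence =
            "Whether \"Blue\" was correct or not (it's not) is debatable.";
        // Whitespace splitting leaves punctuation glued to the words.
        System.out.println(Arrays.asList(sentence.split("\\s+")));
        // The pattern above emits punctuation as separate tokens.
        System.out.println(tokenize(sentence));
    }
}
```

With punctuation separated, a tagger sees `it's` and `not` as ordinary words rather than the unfamiliar strings `(it's` and `not)`.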