Understanding tagging and POS

POS, or parts of speech, is concerned with identifying the types of components found in a sentence. For example, this sentence has several elements, including the verb "has", several nouns such as "example" and "elements", and adjectives such as "several". Tagging, or more specifically POS tagging, is the process of associating element types with words.

POS tagging is useful as it adds more information about the sentence. We can ascertain the relationship between words and often their relative importance. The results of tagging are often used in later processing steps.

This task can be difficult because we cannot rely on a simple dictionary of words to determine their type. For example, the word lead can be used as both a noun and a verb. We might use it in either of the following two sentences:

He took the lead in the play.
Lead the way!

POS tagging attempts to associate the proper label with each word of a sentence.

Using OpenNLP to identify POS

To illustrate this process, we will be using OpenNLP (https://opennlp.apache.org/), an open source Apache project that supports many other NLP tasks.

We will be using the POSModel class, which can be trained to recognize POS elements. In this example, we will use it with a previously trained model based on the Penn TreeBank tag-set (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html). Various pretrained models are found at http://opennlp.sourceforge.net/models-1.5/. We will be using the en-pos-maxent.bin model. This has been trained on English text using what is called maximum entropy.

Maximum entropy refers to the amount of uncertainty in the model, which this approach deliberately maximizes. For a given problem, there is a set of probabilities describing what is known about the dataset. These probabilities are used to build a model. For example, we may know that there is a 23 percent chance that one specific event follows a certain condition. We do not want to make any assumptions about the unknown probabilities, so we avoid adding unjustified information. A maximum entropy approach preserves as much uncertainty as possible; hence, it maximizes entropy.
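
The idea can be made concrete with a small calculation. The following sketch is illustrative only and is not part of OpenNLP; it computes the Shannon entropy of two simple distributions and shows that the distribution making the fewest assumptions, the uniform one, has the higher entropy:

// Illustrative only (not part of OpenNLP): Shannon entropy of a distribution,
// H(p) = -sum(p[i] * log(p[i])); a uniform distribution has the highest entropy
double[] skewed = {0.23, 0.77};
double[] uniform = {0.5, 0.5};
for (double[] p : new double[][]{skewed, uniform}) {
    double h = 0.0;
    for (double pi : p) {
        if (pi > 0.0) {
            h -= pi * Math.log(pi);
        }
    }
    out.println("Entropy: " + h);
}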

We will also use the POSTaggerME class, which is a maximum entropy tagger. This is the class that will make tag predictions. With any sentence, there may be more than one way of classifying, or tagging, its components.

We start with code to acquire the previously trained English tagger model and a simple sentence to be tagged:

try (InputStream input = new FileInputStream(
        new File("en-pos-maxent.bin"))) {
    String sentence = "Let's parse this sentence.";
    ...
} catch (IOException ex) {
    // Handle exceptions
}

The tagger uses an array of strings, where each string is a word. The following sequence takes the previous sentence and creates an array called words. The first part uses the Scanner class to parse the sentence string; we could instead have read the data from a file, as illustrated in the sketch after this snippet. After that, the List class's toArray method is used to create the array of strings:

List<String> list = new ArrayList<>(); 
Scanner scanner = new Scanner(sentence); 
while(scanner.hasNext()) { 
    list.add(scanner.next()); 
} 
String[] words = list.toArray(new String[0]);
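
Had the sentence been stored in a file, a similar approach could be used. The following sketch is one possibility; the file name sentence.txt is hypothetical:

// Hypothetical alternative: tokenize a sentence read from a text file
List<String> list = new ArrayList<>();
try (Scanner scanner = new Scanner(new File("sentence.txt"))) {
    while(scanner.hasNext()) {
        list.add(scanner.next());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}
String[] words = list.toArray(new String[0]);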

The model is then built using the file containing the model:

POSModel posModel = new POSModel(input); 

The tagger is then created based on the model:

POSTaggerME posTagger = new POSTaggerME(posModel); 

The tag method does the actual work. It is passed an array of words and returns an array of tags. The words and tags are then displayed:

String[] posTags = posTagger.tag(words); 
for(int i=0; i<posTags.length; i++) { 
    out.println(words[i] + " - " + posTags[i]); 
} 

The output for this example follows:

Let's - NNP
parse - NN
this - DT
sentence. - NN

The analysis has determined that the word let's is a singular proper noun, while the words parse and sentence are singular nouns. The word this is a determiner, that is, a word that modifies another word and helps identify a phrase as general or specific. A list of tags is provided in the next section.
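
If we also want to know how confident the tagger was, the POSTaggerME class's probs method returns the probabilities for the tags assigned in the most recent call to tag. A minimal sketch, placed immediately after the previous loop, might look like this:

// Probabilities for each tag assigned by the last call to tag
double[] probs = posTagger.probs();
for(int i=0; i<posTags.length; i++) {
    out.println(words[i] + " - " + posTags[i] + " - " + probs[i]);
}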

Understanding POS tags

The POS elements returned are abbreviations. A list of Penn TreeBank POS tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The following is a shortened version of this list:

Tag     Description
DT      Determiner
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
S       Simple declarative clause
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TOP     Top of the parse tree
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present

As mentioned earlier, there may be more than one possible set of POS assignments for a sentence. The topKSequences method, as shown next, will return various assignment possibilities along with a score. The method returns an array of Sequence objects whose toString method returns the score and POS list:

Sequence[] sequences = posTagger.topKSequences(words);
for(Sequence sequence : sequences) {
    out.println(sequence);
}

The output for the previous sentence follows, where the first sequence has the highest score and is therefore the most probable set of assignments:

-2.3264880694837213 [NNP, NN, DT, NN]
-2.6610271245387853 [NNP, VBD, DT, NN]
-2.6630142638557217 [NNP, VB, DT, NN]

Each line of output assigns possible tags to each word of the sentence. We can see that only the second word, parse, is determined to have other possible tags.
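
To work with these alternatives programmatically rather than through toString, the Sequence class provides accessor methods. The following sketch assumes the getScore and getOutcomes methods of opennlp.tools.util.Sequence:

// Examine each candidate tagging through the Sequence accessors
for(Sequence sequence : sequences) {
    out.println("Score: " + sequence.getScore());
    List<String> tags = sequence.getOutcomes();
    for(int i=0; i<tags.size(); i++) {
        out.println("  " + words[i] + " - " + tags.get(i));
    }
}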

Next, we will demonstrate how to extract relationships from text.
