Using NLP APIs

We will use the OpenNLP and Stanford APIs to demonstrate parsing and the extraction of relation information. LingPipe can also be used but will not be discussed here. An example of how LingPipe is used to parse biomedical literature can be found at http://alias-i.com/lingpipe-3.9.3/demos/tutorial/medline/.

Using OpenNLP

Parsing text is straightforward with the ParserTool class. Its static parseLine method accepts three arguments and returns an array of Parse instances. These arguments are:

  • A string containing the text to be parsed
  • A Parser instance
  • An integer specifying how many parses are to be returned

Each Parse instance holds the elements of one parse. The parses are returned in order of decreasing probability. To create a Parser instance, we will use the ParserFactory class's create method. This method uses a ParserModel instance that we will create using the en-parser-chunking.bin file.

This process is shown here, where an input stream for the model file is created using a try-with-resources block. The ParserModel instance is created followed by a Parser instance:

// getModelDir() is assumed to be a helper that returns the model directory
String fileLocation = getModelDir() +
    "/en-parser-chunking.bin";
try (InputStream modelInputStream =
        new FileInputStream(fileLocation)) {
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);
    ...
} catch (IOException ex) {
    // Handle exceptions
}

We will use a simple sentence to demonstrate the parsing process. In the following code sequence, the parseLine method is invoked using a value of 3 for the third argument. This will return the top three parses:

String sentence = "The cow jumped over the moon";
Parse parses[] = ParserTool.parseLine(sentence, parser, 3);

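Next, each parse is displayed along with its score. The getProb method returns a log probability, so the values are negative, and values closer to zero indicate more likely parses: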

for(Parse parse : parses) {
    parse.show();
    System.out.println("Probability: " + parse.getProb());
}

The output is as follows:

(TOP (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))
Probability: -1.043506016751117
(TOP (S (NP (DT The) (NN cow)) (VP (VP (VBD jumped) (PRT (RP over))) (NP (DT the) (NN moon)))))
Probability: -4.248553665013661
(TOP (S (NP (DT The) (NNS cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))
Probability: -4.761071294573854
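
The scores above are easier to compare once they are converted back to ordinary probabilities. The following is a minimal sketch, assuming the scores are natural logarithms. The LogProbDemo class and its toProbability method are hypothetical names used only for illustration and are not part of OpenNLP:

```java
import java.util.Locale;

// Illustrative only: converts the log scores from the sample output into
// ordinary probabilities, assuming natural logarithms.
final class LogProbDemo {

    // Converts a natural-log score to a probability between 0 and 1.
    static double toProbability(double logProb) {
        return Math.exp(logProb);
    }

    public static void main(String[] args) {
        double[] logProbs = {
            -1.043506016751117, -4.248553665013661, -4.761071294573854
        };
        for (double logProb : logProbs) {
            System.out.println(String.format(Locale.ROOT,
                "log = %.4f  probability = %.4f",
                logProb, toProbability(logProb)));
        }
    }
}
```

Under this assumption, the first parse has a probability of roughly 0.35, which is far higher than that of the other two.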

Notice that the parses differ in structure and tag assignment. The second parse treats over as a particle (PRT) attached to the verb, and the third tags cow as a plural noun (NNS). The following output shows the first parse formatted to make it easier to read:

(TOP
    (S
        (NP
            (DT The)
            (NN cow)
        )
        (VP
            (VBD jumped)
            (PP
                (IN over)
                (NP
                    (DT the)
                    (NN moon)
                )
            )
        )
    )
)
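
A layout like this can also be produced programmatically from the bracketed string. The following sketch uses only the standard library. The BracketTree class is a hypothetical helper, not an OpenNLP type, and it assumes well-formed input in which every tag is followed either by a single word or by nested constituents:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: reads a bracketed parse string
// and renders one constituent per line, indented by depth.
final class BracketTree {
    final String label;                       // tag such as NP or DT
    final String word;                        // token text, or null for phrases
    final List<BracketTree> children = new ArrayList<>();

    BracketTree(String label, String word) {
        this.label = label;
        this.word = word;
    }

    // Parses text such as "(NP (DT The) (NN cow))" into a tree.
    static BracketTree parse(String text) {
        String[] tokens = text.replace("(", " ( ").replace(")", " ) ")
                              .trim().split("\\s+");
        return parse(tokens, new int[] {0});
    }

    // Recursive descent; position[0] is the index of the next unread token.
    private static BracketTree parse(String[] tokens, int[] position) {
        position[0]++;                        // skip "("
        String label = tokens[position[0]++];
        if (!tokens[position[0]].equals("(")) {
            // Pre-terminal: the tag is followed directly by a word.
            BracketTree leaf = new BracketTree(label, tokens[position[0]++]);
            position[0]++;                    // skip ")"
            return leaf;
        }
        BracketTree node = new BracketTree(label, null);
        while (tokens[position[0]].equals("(")) {
            node.children.add(parse(tokens, position));
        }
        position[0]++;                        // skip ")"
        return node;
    }

    // Renders the tree with two spaces of indentation per level.
    String indented() {
        StringBuilder out = new StringBuilder();
        appendTo(out, 0);
        return out.toString();
    }

    private void appendTo(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(label);
        if (word != null) {
            out.append(' ').append(word);
        }
        out.append('\n');
        for (BracketTree child : children) {
            child.appendTo(out, depth + 1);
        }
    }

    public static void main(String[] args) {
        String bracketed = "(TOP (S (NP (DT The) (NN cow)) (VP (VBD jumped) "
            + "(PP (IN over) (NP (DT the) (NN moon))))))";
        System.out.print(BracketTree.parse(bracketed).indented());
    }
}
```

The sketch omits the closing parentheses in its output and does not handle tokens that are themselves parentheses, so it is meant for inspection rather than round-tripping.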

The showCodeTree method can be used instead to display parent-child relationships:

parse.showCodeTree();

The output for the first parse is shown here. Each line begins with the element's position in the tree, enclosed in brackets. The tag is displayed next, followed by two hash values separated by ->. The first number identifies the element and the second identifies its parent. For example, the third line shows that the determiner, The, has the noun phrase, The cow, as its parent:

[0] S -929208263 -> -929208263 TOP The cow jumped over the moon
[0.0] NP -929237012 -> -929208263 S The cow
[0.0.0] DT -929242488 -> -929237012 NP The
[0.0.0.0] TK -929242488 -> -929242488 DT The
[0.0.1] NN -929034400 -> -929237012 NP cow
[0.0.1.0] TK -929034400 -> -929034400 NN cow
[0.1] VP -928803039 -> -929208263 S jumped over the moon
[0.1.0] VBD -928822205 -> -928803039 VP jumped
[0.1.0.0] TK -928822205 -> -928822205 VBD jumped
[0.1.1] PP -928448468 -> -928803039 VP over the moon
[0.1.1.0] IN -928460789 -> -928448468 PP over
[0.1.1.0.0] TK -928460789 -> -928460789 IN over
[0.1.1.1] NP -928195203 -> -928448468 PP the moon
[0.1.1.1.0] DT -928202048 -> -928195203 NP the
[0.1.1.1.0.0] TK -928202048 -> -928202048 DT the
[0.1.1.1.1] NN -927992591 -> -928195203 NP moon
[0.1.1.1.1.0] TK -927992591 -> -927992591 NN moon

Another way of accessing the elements of the parse is through the getChildren method. This method returns an array of Parse objects, each representing an element of the parse. Using various Parse methods, we can get each element's text, tag, and labels. This is illustrated here:

Parse children[] = parse.getChildren();
for (Parse parseElement : children) {
    System.out.println(parseElement.getText());
    System.out.println(parseElement.getType());
    Parse tags[] = parseElement.getTagNodes();
    System.out.println("Tags");
    for (Parse tag : tags) {
        System.out.println("[" + tag + "]" 
            + " type: " + tag.getType() 
            + "  Probability: " + tag.getProb() 
            + "  Label: " + tag.getLabel());
    }
}

The output of this sequence is as follows:

The cow jumped over the moon
S
Tags
[The] type: DT  Probability: 0.9380626549164167  Label: null
[cow] type: NN  Probability: 0.9574993337971017  Label: null
[jumped] type: VBD  Probability: 0.9652983971550483  Label: S-VP
[over] type: IN  Probability: 0.7990638213315913  Label: S-PP
[the] type: DT  Probability: 0.9848023215770413  Label: null
[moon] type: NN  Probability: 0.9942338356992393  Label: null

Using the Stanford API

There are several approaches to parsing available in the Stanford NLP API. First, we will demonstrate a general-purpose parser, the LexicalizedParser class. Then, we will illustrate how the result of the parser can be displayed using the TreePrint class. This will be followed by a demonstration of how to determine word dependencies using the GrammaticalStructure class.

Using the LexicalizedParser class

The LexicalizedParser class is a lexicalized PCFG parser. It can use various models to perform the parsing process. Its apply method takes a List of CoreLabel objects and returns a parse tree.

In the following code sequence, the parser is instantiated using the englishPCFG.ser.gz model:

String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = 
   LexicalizedParser.loadModel(parserModel);

The List of CoreLabel objects is created using the Sentence class's toCoreLabelList method (later CoreNLP releases rename this class to SentenceUtils). Each CoreLabel object contains a word and other information. At this point, the words carry no tags or labels. Because the words are supplied as an array, they are effectively already tokenized:

String[] sentenceArray = {"The", "cow", "jumped", "over", 
    "the", "moon", "."};
List<CoreLabel> words = 
    Sentence.toCoreLabelList(sentenceArray);

The apply method can now be invoked:

Tree parseTree = lexicalizedParser.apply(words);

One simple approach to display the result of the parse is to use the pennPrint method, which displays the parse tree in the same way as the Penn TreeBank does (http://www.sfs.uni-tuebingen.de/~dm/07/autumn/795.10/ptb-annotation-guide/root.html):

parseTree.pennPrint();

The output is as follows:

(ROOT
  (S
    (NP (DT The) (NN cow))
    (VP (VBD jumped)
      (PP (IN over)
        (NP (DT the) (NN moon))))
    (. .)))

The Tree class provides numerous methods for working with parse trees.

Using the TreePrint class

The TreePrint class provides a simple way to display the tree. An instance of the class is created using a string describing the display format to be used. The valid output formats are available in the static outputTreeFormats array and are listed in the following table:

Tree Format Strings

penn                 dependencies                 collocations
oneline              typedDependencies            semanticGraph
rootSymbolOnly       typedDependenciesCollapsed   conllStyleDependencies
words                latexTree                    conll2007
wordsAndTags         xmlTree

Stanford uses typed dependencies to describe the grammatical relationships that exist within a sentence. These are detailed in the Stanford Typed Dependencies Manual (http://nlp.stanford.edu/software/dependencies_manual.pdf).

The following code example illustrates how the TreePrint class can be used. The printTree method performs the actual display operation.

In this case, the TreePrint object is created to show the typed dependencies in collapsed form:

TreePrint treePrint = 
    new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);

The output of this sequence is as follows, where the number attached to each word is its position within the sentence:

det(cow-2, The-1)
nsubj(jumped-3, cow-2)
root(ROOT-0, jumped-3)
det(moon-6, the-5)
prep_over(jumped-3, moon-6)
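
When only the printed text is available, for example in a log file, each line can be split back into its parts. The following sketch is one way to do this with a regular expression. The DependencyLine class is a hypothetical value class and not the Stanford TypedDependency class, and the pattern assumes the relation(governor-index, dependent-index) layout shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one printed dependency line. It is not the
// Stanford TypedDependency class.
final class DependencyLine {
    // relation(governor-index, dependent-index)
    private static final Pattern FORMAT =
        Pattern.compile("(\\w+)\\((.+)-(\\d+), ?(.+)-(\\d+)\\)");

    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    private DependencyLine(Matcher m) {
        relation = m.group(1);
        governor = m.group(2);
        governorIndex = Integer.parseInt(m.group(3));
        dependent = m.group(4);
        dependentIndex = Integer.parseInt(m.group(5));
    }

    // Parses text such as "det(cow-2, The-1)".
    static DependencyLine parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return new DependencyLine(m);
    }

    public static void main(String[] args) {
        DependencyLine d = parse("prep_over(jumped-3, moon-6)");
        System.out.println(d.relation + ": " + d.governor + "@"
            + d.governorIndex + " -> " + d.dependent + "@" + d.dependentIndex);
    }
}
```

Inside a program that already holds the parse tree, the GrammaticalStructure approach described later in this section is the more direct route.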

Using the "penn" string to create the object results in the following output:

(ROOT
  (S
    (NP (DT The) (NN cow))
    (VP (VBD jumped)
      (PP (IN over)
        (NP (DT the) (NN moon))))
    (. .)))

The "dependencies" string produces a simple list of dependencies:

dep(cow-2,The-1)
dep(jumped-3,cow-2)
dep(null-0,jumped-3,root)
dep(jumped-3,over-4)
dep(moon-6,the-5)
dep(over-4,moon-6)

The formats can be combined using commas. The following example will result in both the penn style and the typedDependenciesCollapsed formats being used for the display:

"penn,typedDependenciesCollapsed"

Finding word dependencies using the GrammaticalStructure class

Another approach to parsing text is to use the LexicalizedParser object created in the previous section in conjunction with the TreebankLanguagePack interface. A Treebank is a text corpus that has been annotated with syntactic or semantic information, providing information about a sentence's structure. The first major Treebank was the Penn TreeBank (http://www.cis.upenn.edu/~treebank/). Treebanks can be created manually or semiautomatically.

The next example illustrates how a simple string can be tokenized and then parsed. A tokenizer factory creates a tokenizer, which splits the sentence into a list of CoreLabel objects.

The CoreLabel class that we discussed in the Using the LexicalizedParser class section is used here:

String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory = 
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = 
    tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
parseTree = lexicalizedParser.apply(wordList);

The TreebankLanguagePack interface specifies methods for working with a Treebank. In the following code, a series of objects is created, culminating in a list of TypedDependency instances, which provide dependency information about elements of a sentence. A GrammaticalStructureFactory instance is created and used to create a GrammaticalStructure instance.

As this class's name implies, it stores the grammatical relations between elements in the tree:

TreebankLanguagePack tlp = 
    lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = 
    tlp.grammaticalStructureFactory();
GrammaticalStructure gs = 
    gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();

We can simply display the list as shown here:

System.out.println(tdl);

The output is as follows:

[det(cow-2, The-1), nsubj(jumped-3, cow-2), root(ROOT-0, jumped-3), det(moon-6, the-5), prep_over(jumped-3, moon-6)]

This information can also be extracted using the gov, reln, and dep methods, which return the governor word, the relationship, and the dependent element, respectively, as illustrated here:

for(TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov() 
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}

The output is as follows:

Governor Word: [cow/NN] Relation: [determiner] Dependent Word: [The/DT]
Governor Word: [jumped/VBD] Relation: [nominal subject] Dependent Word: [cow/NN]
Governor Word: [ROOT] Relation: [root] Dependent Word: [jumped/VBD]
Governor Word: [moon/NN] Relation: [determiner] Dependent Word: [the/DT]
Governor Word: [jumped/VBD] Relation: [prep_collapsed] Dependent Word: [moon/NN]

From this, we can glean the relationships within a sentence and the elements of each relationship.
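
Once the governor, relation, and dependent have been extracted, it is often convenient to look up everything that depends on a given word. The following sketch groups the triples from the preceding output by governor. The DependencyIndex class and its byGovernor method are hypothetical names, and the triples are hard-coded as plain strings so that the example runs without the Stanford libraries:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that groups dependency triples by their governor.
final class DependencyIndex {
    // Each row is {governor, relation, dependent}, copied from the output above.
    static final String[][] SAMPLE = {
        {"cow", "determiner", "The"},
        {"jumped", "nominal subject", "cow"},
        {"ROOT", "root", "jumped"},
        {"moon", "determiner", "the"},
        {"jumped", "prep_collapsed", "moon"}
    };

    // Maps each governor to its "relation: dependent" entries, keeping
    // the order in which the governors first appear.
    static Map<String, List<String>> byGovernor(String[][] triples) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String[] triple : triples) {
            index.computeIfAbsent(triple[0], key -> new ArrayList<>())
                 .add(triple[1] + ": " + triple[2]);
        }
        return index;
    }

    public static void main(String[] args) {
        byGovernor(SAMPLE).forEach((governor, dependents) ->
            System.out.println(governor + " -> " + dependents));
    }
}
```

In a real program, the rows would presumably be filled from the gov, reln, and dep methods inside the loop shown earlier instead of being typed in by hand.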

Finding coreference resolution entities

Coreference occurs when two or more expressions in a text refer to the same person or entity. Coreference resolution is the task of finding these expressions and linking them together. Consider the following sentence:

"He took his cash and she took her change and together they bought their lunch."

There are several coreferences in this sentence. The word "his" refers to "He" and the word "her" refers to "she". In addition, "they" refers to both "He" and "she".

Endophora is coreference in which an expression refers to something that either precedes or follows it in the text. Endophoric expressions can be classified as anaphors or cataphors. In the following sentence, the word "It" is the anaphor that refers to its antecedent, "the earthquake":

"Mary felt the earthquake. It shook the entire building."

In the next sentence, "she" is a cataphor as it points to the postcedent, "Mary":

"As she sat there, Mary felt the earthquake."

The Stanford API supports coreference resolution with the StanfordCoreNLP class using the dcoref annotator. We will demonstrate the use of this class with the sentence introduced at the start of this section.

We start with the creation of the pipeline and the use of the annotate method, as shown here:

String sentence = "He took his cash and she took her change " 
    + "and together they bought their lunch.";
Properties props = new Properties();
props.put("annotators", 
    "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);

The Annotation class' get method, when used with an argument of CorefChainAnnotation.class, will return a Map instance of the CorefChain objects, as shown here. These objects contain information about the coreferences found in the sentence:

Map<Integer, CorefChain> corefChainMap = 
    annotation.get(CorefChainAnnotation.class);

The set of the CorefChain objects is indexed using integers. We can iterate over these objects as shown here. The key set is obtained and then each CorefChain object is displayed:

Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while(setIterator.hasNext()) {
    CorefChain corefChain = 
        corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);
}
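
The same traversal can be written more compactly with a for-each loop over the map's entry set, which avoids the explicit iterator and the second lookup. The following sketch shows the pattern on a stand-in Map whose values are plain strings, because CorefChain objects require the Stanford libraries. The ChainIteration class and the sample values are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: iterates a map by entry, using String values as a
// stand-in for CorefChain objects.
final class ChainIteration {

    // Builds one line per entry in the map's iteration order.
    static String describe(Map<Integer, String> chains) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Integer, String> entry : chains.entrySet()) {
            out.append("Chain ").append(entry.getKey())
               .append(": ").append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> chains = new LinkedHashMap<>();
        chains.put(1, "[He, his]");
        chains.put(4, "[she, her]");
        System.out.print(describe(chains));
    }
}
```

With the real map, the loop variable would be declared as Map.Entry<Integer, CorefChain>, and getValue would return the chain directly.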

The following output is generated:

CorefChain: CHAIN1-["He" in sentence 1, "his" in sentence 1]
CorefChain: CHAIN2-["his cash" in sentence 1]
CorefChain: CHAIN4-["she" in sentence 1, "her" in sentence 1]
CorefChain: CHAIN5-["her change" in sentence 1]
CorefChain: CHAIN7-["they" in sentence 1, "their" in sentence 1]
CorefChain: CHAIN8-["their lunch" in sentence 1]

We get more detailed information using methods of the CorefChain and CorefMention classes. The latter class contains information about a specific coreference found in the sentence.

Add the following code sequence to the body of the previous while loop to obtain and display this information. The startIndex and endIndex fields of the class refer to the position of the words in the sentence:

System.out.print("ClusterId: " + corefChain.getChainID());
CorefMention mention = corefChain.getRepresentativeMention();
System.out.println(" CorefMention: " + mention 
    + " Span: [" + mention.mentionSpan + "]");

List<CorefMention> mentionList = 
    corefChain.getMentionsInTextualOrder();
Iterator<CorefMention> mentionIterator = 
    mentionList.iterator();
while(mentionIterator.hasNext()) {
    CorefMention cfm = mentionIterator.next();
    System.out.println("\tMention: " + cfm 
        + " Span: [" + cfm.mentionSpan + "]");
    System.out.print("\tMention Type: " 
        + cfm.mentionType + " Gender: " + cfm.gender);
    System.out.println(" Start: " + cfm.startIndex 
        + " End: " + cfm.endIndex);
}
System.out.println();

The output is as follows. Only the first and last chains are displayed to conserve space:

CorefChain: CHAIN1-["He" in sentence 1, "his" in sentence 1]
ClusterId: 1 CorefMention: "He" in sentence 1 Span: [He]
  Mention: "He" in sentence 1 Span: [He]
  Mention Type: PRONOMINAL Gender: MALE Start: 1 End: 2
  Mention: "his" in sentence 1 Span: [his]
  Mention Type: PRONOMINAL Gender: MALE Start: 3 End: 4

CorefChain: CHAIN8-["their lunch" in sentence 1]
ClusterId: 8 CorefMention: "their lunch" in sentence 1 Span: [their lunch]
  Mention: "their lunch" in sentence 1 Span: [their lunch]
  Mention Type: NOMINAL Gender: UNKNOWN Start: 14 End: 16
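
The Start and End values in this output are consistent with a 1-based start position and an end position that is one past the last word of the mention. The following sketch recovers a mention's text from the sentence tokens under that assumption. The MentionSpan class and its text method are hypothetical, and the token list is typed in by hand with the final period counted as a token:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: recovers mention text from token positions, assuming a
// 1-based start index and an exclusive end index.
final class MentionSpan {
    static final List<String> TOKENS = Arrays.asList(
        "He", "took", "his", "cash", "and", "she", "took", "her", "change",
        "and", "together", "they", "bought", "their", "lunch", ".");

    // Joins the tokens from startIndex up to, but not including, endIndex.
    static String text(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        System.out.println(text(TOKENS, 1, 2));
        System.out.println(text(TOKENS, 14, 16));
    }
}
```

With the positions reported above, the two calls yield He and their lunch, matching the spans shown for the first and last chains.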