Extracting relationships for a question-answer system

In this section, we will examine an approach for extracting relationships that can be used to answer queries. Candidate queries include:

  • Who is/was the 14th president of the United States?
  • What is the 1st president's home town?
  • When was Herbert Hoover president?

Answering these types of questions is not easy. We will demonstrate one approach for answering certain types of questions, simplifying many aspects of the process. Even with these restrictions, the system responds well to the queries.

This process consists of several steps:

  1. Finding word dependencies
  2. Identifying the type of question
  3. Extracting its relevant components
  4. Searching for the answer
  5. Presenting the answer

We will show the general framework for identifying whether a question is a who, what, when, or where type. Next, we will investigate some of the issues involved in answering the "who" type questions.

To keep the example simple, we will restrict the questions to those relating to presidents of the U.S. A simple database of presidential facts will be used to look up the answer to a question.

Finding the word dependencies

The question is stored as a simple string:

String question = 
    "Who is the 32nd president of the United States?";

We will use the LexicalizedParser class as developed in the "Finding word dependencies using the GrammaticalStructure class" section. The relevant code is duplicated here for your convenience:

String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = 
    LexicalizedParser.loadModel(parserModel);

TokenizerFactory<CoreLabel> tokenizerFactory = 
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = 
    tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);

TreebankLanguagePack tlp = 
    lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = 
    tlp.grammaticalStructureFactory();
GrammaticalStructure gs = 
    gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov() 
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}

When executed with the question, we get the following output:

[root(ROOT-0, Who-1), cop(Who-1, is-2), det(president-5, the-3), amod(president-5, 32nd-4), nsubj(Who-1, president-5), det(States-9, the-7), nn(States-9, United-8), prep_of(president-5, States-9)]
Governor Word: [ROOT] Relation: [root] Dependent Word: [Who/WP]
Governor Word: [Who/WP] Relation: [copula] Dependent Word: [is/VBZ]
Governor Word: [president/NN] Relation: [determiner] Dependent Word: [the/DT]
Governor Word: [president/NN] Relation: [adjectival modifier] Dependent Word: [32nd/JJ]
Governor Word: [Who/WP] Relation: [nominal subject] Dependent Word: [president/NN]
Governor Word: [States/NNPS] Relation: [determiner] Dependent Word: [the/DT]
Governor Word: [States/NNPS] Relation: [nn modifier] Dependent Word: [United/NNP]
Governor Word: [president/NN] Relation: [prep_collapsed] Dependent Word: [States/NNPS]

This information provides the foundation to determine the type of question.

Determining the question type

The relationships detected suggest ways to detect different types of questions. For example, to determine whether it is a "who" type question, we can check whether the relationship is nominal subject and the governor is who.

In the following code, we iterate over the typed dependencies of the question to determine whether any of them matches this combination, and if so, call the processWhoQuestion method to process the question:

for (TypedDependency dependency : tdl) {
    if ("nominal subject".equals( dependency.reln().getLongName())
        && "who".equalsIgnoreCase( dependency.gov().originalText())) {
        processWhoQuestion(tdl);
    }
}

This simple distinction works reasonably well. It correctly identifies all of the following variations of the same question:

Who is the 32nd president of the United States?
Who was the 32nd president of the United States?
The 32nd president of the United States was who?
The 32nd president is who of the United States?

We can also determine other question types using different selection criteria. The following questions typify other question types:

What was the 3rd President's party?
When was the 12th president inaugurated?
Where is the 30th president's home town?

We can determine the question type using the relations as suggested in the following table:

Question type   Relation             Governor   Dependent
What            nominal subject      what       NA
When            adverbial modifier   NA         when
Where           adverbial modifier   NA         where

This approach does require hardcoding the relationships.
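The hardcoded rules can be collected in one place. The following is a minimal sketch, not the code used in this section: Dep is a hypothetical stand-in for TypedDependency that holds only the relation's long name, the governor word, and the dependent word, and QuestionTypeSketch and its QuestionType enum are names introduced here for illustration. It uses only the standard library so it can be run without the parser models:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a typed dependency: relation long name,
// governor word, and dependent word. Not part of any library.
final class Dep {
    final String relation;
    final String governor;
    final String dependent;

    Dep(String relation, String governor, String dependent) {
        this.relation = relation;
        this.governor = governor;
        this.dependent = dependent;
    }
}

final class QuestionTypeSketch {
    enum QuestionType { WHO, WHAT, WHEN, WHERE, UNKNOWN }

    // Applies the hardcoded rules: who/what are matched on the governor of a
    // nominal subject, when/where on the dependent of an adverbial modifier.
    static QuestionType classify(List<Dep> deps) {
        for (Dep d : deps) {
            if ("nominal subject".equals(d.relation)) {
                if ("who".equalsIgnoreCase(d.governor)) {
                    return QuestionType.WHO;
                }
                if ("what".equalsIgnoreCase(d.governor)) {
                    return QuestionType.WHAT;
                }
            } else if ("adverbial modifier".equals(d.relation)) {
                if ("when".equalsIgnoreCase(d.dependent)) {
                    return QuestionType.WHEN;
                }
                if ("where".equalsIgnoreCase(d.dependent)) {
                    return QuestionType.WHERE;
                }
            }
        }
        // No rule matched any dependency
        return QuestionType.UNKNOWN;
    }

    public static void main(String[] args) {
        // Dependency taken from the parser output shown earlier:
        // nsubj(Who-1, president-5)
        List<Dep> deps = Arrays.asList(
            new Dep("determiner", "president", "the"),
            new Dep("nominal subject", "Who", "president"));
        System.out.println(classify(deps));
    }
}
```

With real parser output, the three strings would come from dependency.reln().getLongName(), dependency.gov().originalText(), and dependency.dep().originalText(), as in the loop shown earlier.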

Searching for the answer

Once we know the type of question, we can use the relations found in the text to answer the question. To illustrate this process, we will develop the processWhoQuestion method. This method uses the TypedDependency list to garner the information needed to answer a "who" type question about presidents. Specifically, we need to know which president the user is interested in, based on the president's ordinal rank.

We will also need a list of presidents to search for relevant information. The createPresidentList method was developed to perform this task. It reads a file, PresidentList, containing the president's name, inauguration year, and last year in office. The file uses the following format and can be downloaded from www.packtpub.com:

George Washington   (1789-1797)

The following createPresidentList method demonstrates the use of OpenNLP's SimpleTokenizer class to tokenize each line. A variable number of tokens make up a president's name. Once that is determined, the dates are easily extracted:

public List<President> createPresidentList() {
    ArrayList<President> list = new ArrayList<>();
    SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
    String line = null;
    try (FileReader reader = new FileReader("PresidentList");
            BufferedReader br = new BufferedReader(reader)) {
        while ((line = br.readLine()) != null) {
            String[] tokens = simpleTokenizer.tokenize(line);
            String name = "";
            String start = "";
            String end = "";
            int i = 0;
            // Every token before the opening parenthesis is part of the name
            while (!"(".equals(tokens[i])) {
                name += tokens[i] + " ";
                i++;
            }
            // The tokens after "(" are: start year, "-", end year
            start = tokens[i + 1];
            end = tokens[i + 3];
            if (end.equalsIgnoreCase("present")) {
                end = start;
            }
            // trim() removes the trailing space left by the loop above
            list.add(new President(name.trim(), 
                Integer.parseInt(start),
                Integer.parseInt(end)));
        }
     } catch (IOException ex) {
        // Handle exceptions
    }
    return list;
}
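For comparison, a line in this format can also be parsed with a regular expression from the standard library. This is only a sketch of an alternative, not the approach used in this section. It assumes the years are four digits separated by a plain hyphen, as in the sample line, and PresidentLineSketch and parse are names introduced here for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PresidentLineSketch {
    // The name, then "(", a four-digit start year, "-", either a four-digit
    // end year or the word "present", and finally ")".
    private static final Pattern LINE = Pattern.compile(
        "^(.+?)\\s*\\((\\d{4})-(\\d{4}|present)\\)\\s*$",
        Pattern.CASE_INSENSITIVE);

    // Returns {name, start, end}, or null when the line does not match
    // the assumed format.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String start = m.group(2);
        String end = m.group(3);
        if (end.equalsIgnoreCase("present")) {
            end = start; // Same convention as createPresidentList
        }
        return new String[] { m.group(1).trim(), start, end };
    }

    public static void main(String[] args) {
        String[] fields = parse("George Washington   (1789-1797)");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
    }
}
```

Unlike the tokenizer-based version, this keeps a name such as "Franklin D. Roosevelt" intact instead of rebuilding it from separate tokens, and it reports a malformed line by returning null rather than throwing an exception.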

A President class holds presidential information, as shown here. The getter methods have been left out:

public class President {
    private String name;
    private int start;
    private int end;

    public President(String name, int start, int end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }
    ...
}

The processWhoQuestion method follows. We use typed dependencies again to extract the ordinal value from the question. If the governor is president and the relation is adjectival modifier, then the dependent word is the ordinal. This string is passed to the getOrder method, which returns the ordinal as an integer. We subtract 1 from it because the list of presidents is zero-indexed, whereas the ordinals start at one:

public void processWhoQuestion(List<TypedDependency> tdl) {
    List<President> list = createPresidentList();
    for (TypedDependency dependency : tdl) {
        if ("president".equalsIgnoreCase(
                dependency.gov().originalText())
                && "adjectival modifier".equals(
                  dependency.reln().getLongName())) {
            String positionText = 
                dependency.dep().originalText();
            int position = getOrder(positionText)-1;
            System.out.println("The president is " 
                + list.get(position).getName());
        }
    }
}

The getOrder method follows. It simply takes the leading numeric characters and converts them to an integer. A more sophisticated version would handle other variations, including words such as "first" and "sixteenth":

private static int getOrder(String position) {
    String tmp = "";
    int i = 0;
    // The length check avoids an exception when the text is all digits
    while (i < position.length()
            && Character.isDigit(position.charAt(i))) {
        tmp += position.charAt(i++);
    }
    return Integer.parseInt(tmp);
}
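One possible shape for the more sophisticated version is sketched below. It is an illustration under stated assumptions rather than a complete solution: OrdinalSketch is a hypothetical class name, only the words "first" through "twentieth" are covered, and -1 is returned (a convention chosen here) when the text cannot be interpreted:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class OrdinalSketch {
    // Lookup table for spelled-out ordinals, first through twentieth
    private static final Map<String, Integer> WORDS = new HashMap<>();
    static {
        String[] names = { "first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth", "eleventh",
            "twelfth", "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth", "twentieth" };
        for (int i = 0; i < names.length; i++) {
            WORDS.put(names[i], i + 1);
        }
    }

    // Returns the ordinal as an int, or -1 when it cannot be interpreted.
    static int getOrder(String position) {
        String text = position.trim().toLowerCase(Locale.ROOT);
        // Count the leading digits, as in the simple version
        int i = 0;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        if (i > 0) {
            return Integer.parseInt(text.substring(0, i));
        }
        // No leading digits: fall back to the word table
        Integer value = WORDS.get(text);
        return value != null ? value : -1;
    }

    public static void main(String[] args) {
        System.out.println(getOrder("32nd"));
        System.out.println(getOrder("sixteenth"));
    }
}
```

A caller would need to check for -1 before using the result as a list index. Ordinals beyond "twentieth", and compound forms such as "thirty-second", are not handled by this table.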

When executed, we get the following output:

The president is Franklin D . Roosevelt

This implementation is a simple example of how information can be extracted from a sentence and used to answer questions. The other types of questions can be implemented in a similar fashion and are left as an exercise for the reader.
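As a starting point for that exercise, the lookup step of a "when" type question might be sketched as follows. This is a standalone illustration, not part of the code in this section: the nested President class is a minimal stand-in for the one shown earlier, getStart is an assumed getter name (the original getters were left out), and the list is built by hand instead of being read from the PresidentList file:

```java
import java.util.Arrays;
import java.util.List;

final class WhenQuestionSketch {
    // Minimal stand-in for the President class; getStart is an assumed
    // getter name, since the original getters were left out.
    static final class President {
        private final String name;
        private final int start;
        private final int end;

        President(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        String getName() { return name; }
        int getStart() { return start; }
    }

    // ordinal is one-based (1 means the first president); the list index
    // is zero-based, so we subtract 1 after checking the range.
    static String answerWhen(List<President> list, int ordinal) {
        if (ordinal < 1 || ordinal > list.size()) {
            return "No president found at position " + ordinal;
        }
        President p = list.get(ordinal - 1);
        return p.getName() + " was inaugurated in " + p.getStart();
    }

    public static void main(String[] args) {
        List<President> list = Arrays.asList(
            new President("George Washington", 1789, 1797),
            new President("John Adams", 1797, 1801),
            new President("Thomas Jefferson", 1801, 1809));
        System.out.println(answerWhen(list, 3));
    }
}
```

The ordinal itself would still have to be extracted from the typed dependencies, as processWhoQuestion does. The range check guards against ordinals that have no entry in the list.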
