Creating a ruleset

Quite often, when working with linguistic features, you will be writing custom rules. Here is one data structure suggestion to help you store these rules: a list of dictionaries. Each dictionary, in turn, can hold values ranging from simple strings to lists of strings. Avoid nesting another list of dictionaries inside a dictionary:

ruleset = [
    {
        'id': 1,
        'req_tags': ['NNP', 'VBZ', 'NN'],
    },
    {
        'id': 2,
        'req_tags': ['NNP', 'VBZ'],
    },
]

Here, I have written two rules. Each rule is simply a collection of part-of-speech tags stored under the req_tags key, and it comprises all of the tags that I will look for in a particular sentence.

Depending on the id, I will use a hardcoded question template to generate my questions. In practice, you can and should move the question template into your ruleset.
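
For instance, a version of the ruleset that carries its own templates might look like the following sketch (the template key and the {NNP}/{VBZ} placeholders are my own convention for illustration, not something the rules above require):

# Sketch: storing the question template alongside the tags it needs.
# The 'template' key and placeholder names are illustrative only.
ruleset_with_templates = [
    {
        'id': 1,
        'req_tags': ['NNP', 'VBZ', 'NN'],
        'template': 'What {VBZ} {NNP}?',
    },
    {
        'id': 2,
        'req_tags': ['NNP', 'VBZ'],
        'template': 'What does {NNP} {VBZ}?',
    },
]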

Next, I need a function to pull out all of the tokens that match a particular tag. We do this by simply iterating over the entire list of tokens and matching each token against the target tag:

def get_pos_tag(doc, tag):
    return [tok for tok in doc if tok.tag_ == tag]

On runtime complexity:

This lookup is O(n) in the number of tokens. As an exercise, can you think of a way to reduce it to O(1) per lookup?
Hint: You can precompute some results and store them, but at the cost of more memory consumption.
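
For instance, one way to get there (a sketch of the idea, not the only solution) is to index the document by tag once, so every later lookup is a single dictionary access:

from collections import defaultdict

def build_tag_index(doc):
    # One O(n) pass over the document: group tokens by their tag.
    index = defaultdict(list)
    for tok in doc:
        index[tok.tag_].append(tok)
    return index

# Each lookup is now a dictionary access instead of a full scan:
# tag_index = build_tag_index(doc)
# nnp_tokens = tag_index.get("NNP", [])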

Next, I am going to write a function that uses the preceding ruleset along with a question template.

Here is the broad outline that I will follow for each sentence:

  • For each rule ID, check whether all the required tags (req_tags) are present in the sentence
  • Find the first rule ID that matches
  • Find the words that match the required part-of-speech tags
  • Fill in the corresponding question template and return the question string

def sent_to_ques(sent: str) -> str:
    """
    Return a question string corresponding to a sentence string using a set of pre-written rules
    """
    doc = nlp(sent)
    pos_tags = [token.tag_ for token in doc]
    for rule in ruleset:
        if rule['id'] == 1:
            if all(key in pos_tags for key in rule['req_tags']):
                print(f"Rule id {rule['id']} matched for sentence: {sent}")
                NNP = get_pos_tag(doc, "NNP")
                NNP = str(NNP[0])
                VBZ = get_pos_tag(doc, "VBZ")
                VBZ = str(VBZ[0])
                ques = f'What {VBZ} {NNP}?'
                return ques
        if rule['id'] == 2:
            if all(key in pos_tags for key in rule['req_tags']):  # 'NNP' and 'VBZ' in sentence
                print(f"Rule id {rule['id']} matched for sentence: {sent}")
                NNP = get_pos_tag(doc, "NNP")
                NNP = str(NNP[0])
                VBZ = get_pos_tag(doc, "VBZ")
                VBZ = str(VBZ[0].lemma_)
                ques = f'What does {NNP} {VBZ}?'
                return ques

Within each rule ID match, I do something more: I keep only the first match for each part-of-speech tag that I receive. For instance, when I query for NNP, I pick the first element with NNP[0], convert it into a string, and drop all other matches.

While this is a perfectly good approach for simple sentences, it breaks down when you have conditional statements or complex reasoning. Let's run the preceding function for each sentence in the example text and see what questions we get:

for sent in doc.sents:
    print(f"The generated question is: {sent_to_ques(str(sent))}")

Rule id 1 matched for sentence: Bansoori is an Indian classical instrument.
The generated question is: What is Bansoori?
Rule id 2 matched for sentence: Tom plays Bansoori and Guitar.
The generated question is: What does Tom play?

This is quite good. In practice, you will need a much larger ruleset, maybe 10-15 rules and corresponding templates, just to have decent coverage of What? questions.

A few more rules might be needed to cover When, Who, and Where type questions. For instance, Who plays Bansoori? is also a valid question from the second sentence that we have in the preceding code.
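
As a rough sketch of what such a rule could look like (the id, tags, and template here are illustrative, not part of the ruleset above, and the matching code would need to pick the object NNP such as Bansoori rather than the subject):

# Hypothetical extra rule for Who-type questions; id, tags, and template
# are illustrative only.
who_rule = {
    'id': 3,
    'req_tags': ['NNP', 'VBZ'],
    'template': 'Who {VBZ} {NNP}?',   # e.g. "Who plays Bansoori?"
}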
