Training a tagger-based chunker

Training a chunker can be a great alternative to manually specifying regular expression chunk patterns. Instead of a painstaking process of trial and error to get exactly the right patterns, we can use existing corpus data to train chunkers, much like we trained part-of-speech taggers in Chapter 4, Part-of-Speech Tagging.

How to do it...

As with part-of-speech tagging, we will use the treebank corpus data for training. But this time, we will use the treebank_chunk corpus, which is specifically formatted to produce chunked sentences in the form of trees. The chunked sentences returned by its chunked_sents() method will be used by a TagChunker class to train a tagger-based chunker. The TagChunker uses a helper function, conll_tag_chunks(), to extract lists of (pos, iob) tuples from a list of Trees. These (pos, iob) tuples are then used to train a tagger, in the same way (word, pos) tuples were used in Chapter 4, Part-of-Speech Tagging to train part-of-speech taggers. But instead of learning part-of-speech tags for words, we are learning IOB tags for part-of-speech tags. Here's the code from chunkers.py:

import nltk.chunk
from nltk.tag import UnigramTagger, BigramTagger
from tag_util import backoff_tagger

def conll_tag_chunks(chunk_sents):
  # Convert each chunk Tree into CoNLL (word, pos, iob) triples, then
  # drop the words, keeping (pos, iob) pairs for tagger training.
  tagged_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
  return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

class TagChunker(nltk.chunk.ChunkParserI):
  def __init__(self, train_chunks, tagger_classes=[UnigramTagger, BigramTagger]):
    train_sents = conll_tag_chunks(train_chunks)
    self.tagger = backoff_tagger(train_sents, tagger_classes)

  def parse(self, tagged_sent):
    if not tagged_sent: return None
    (words, tags) = zip(*tagged_sent)
    # Tag the part-of-speech tags with IOB chunk tags.
    chunks = self.tagger.tag(tags)
    # Reassemble (word, pos, iob) triples and convert back to a Tree.
    return nltk.chunk.conlltags2tree([(w, t, c) for (w, (t, c)) in zip(words, chunks)])

Once we have our trained TagChunker, we can evaluate it and examine the resulting ChunkScore, just as we did for the RegexpParser in the previous recipes:

>>> from chunkers import TagChunker
>>> from nltk.corpus import treebank_chunk
>>> train_chunks = treebank_chunk.chunked_sents()[:3000]
>>> test_chunks = treebank_chunk.chunked_sents()[3000:]
>>> chunker = TagChunker(train_chunks)
>>> score = chunker.evaluate(test_chunks)
>>> score.accuracy()
0.97320393352514278
>>> score.precision()
0.91665343705350055
>>> score.recall()
0.9465573770491803
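
ChunkScore also provides an f_measure() method. As a quick check on the numbers above, the F-measure is the harmonic mean of precision p and recall r, 2 * p * r / (p + r), which works out to roughly 0.931 here.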

Pretty darn accurate! Training a chunker is clearly a great alternative to manually specified grammars and regular expressions.

How it works...

Recall from the Creating a chunked phrase corpus recipe in Chapter 3, Creating Custom Corpora that the conll2000 corpus defines chunks using IOB tags, which specify the type of chunk and where it begins and ends. We can train a part-of-speech tagger on these IOB tag patterns, and then use it to power a ChunkParserI subclass. But first, we need to transform a Tree, as returned by the chunked_sents() method of a corpus, into a format usable by a part-of-speech tagger. This is what conll_tag_chunks() does. It uses nltk.chunk.tree2conlltags() to convert a sentence Tree into a list of 3-tuples of the form (word, pos, iob), where pos is the part-of-speech tag and iob is an IOB tag, such as B-NP to mark the beginning of a noun phrase, or I-NP to mark that the word is inside a noun phrase. The reverse of this function is nltk.chunk.conlltags2tree(). Here's some code to demonstrate these nltk.chunk functions:

>>> import nltk.chunk
>>> from nltk.tree import Tree
>>> t = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
>>> nltk.chunk.tree2conlltags(t)
[('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')]
>>> nltk.chunk.conlltags2tree([('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])

The next step is to convert these 3-tuples into 2-tuples that the tagger can recognize. Because the RegexpParser uses part-of-speech tags for chunk patterns, we will do the same here and use part-of-speech tags as if they were the words to tag. By simply dropping the word from each 3-tuple (word, pos, iob), the conll_tag_chunks() function returns a list of 2-tuples of the form (pos, iob). When given the preceding example Tree in a list, the result is in a format we can feed to a tagger:

>>> conll_tag_chunks([t])
[[('DT', 'B-NP'), ('NN', 'I-NP')]]

The final step is a subclass of ChunkParserI called TagChunker. It trains on a list of chunk trees using an internal tagger. This internal tagger is composed of a UnigramTagger and a BigramTagger in a backoff chain, using the backoff_tagger() function created in the Training and combining Ngram taggers recipe in Chapter 4, Part-of-Speech Tagging.
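
The backoff_tagger() function lives in tag_util; a minimal version, assuming it follows the Chapter 4 recipe, looks something like this:

def backoff_tagger(train_sents, tagger_classes, backoff=None):
  # Instantiate each tagger class with the previous tagger as its backoff.
  # The last class in the list becomes the primary tagger, falling back
  # through the chain whenever it can't tag a token.
  for cls in tagger_classes:
    backoff = cls(train_sents, backoff=backoff)
  return backoff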

Finally, ChunkParserI subclasses must implement a parse() method that expects a part-of-speech tagged sentence. We unzip that sentence into a list of words and a list of part-of-speech tags. The tags are then tagged by the internal tagger to get IOB tags, which are recombined with the words and part-of-speech tags to create 3-tuples we can pass to nltk.chunk.conlltags2tree() to return a final Tree.
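
For example, parsing a simple part-of-speech tagged sentence with the trained chunker should produce a Tree much like the one we saw earlier (the exact chunks depend on the trained tagger):

>>> chunker.parse([('the', 'DT'), ('book', 'NN')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])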

There's more...

Since we have been talking about the conll IOB tags, let us see how the TagChunker does on the conll2000 corpus:

>>> from nltk.corpus import conll2000
>>> conll_train = conll2000.chunked_sents('train.txt')
>>> conll_test = conll2000.chunked_sents('test.txt')
>>> chunker = TagChunker(conll_train)
>>> score = chunker.evaluate(conll_test)
>>> score.accuracy()
0.89505456234037617
>>> score.precision()
0.81148419743556754
>>> score.recall()
0.86441916769448635

Not quite as good as on treebank_chunk, but conll2000 is a much larger corpus, so it's not too surprising.

Using different taggers

If you want to use different tagger classes with the TagChunker, you can pass them in as tagger_classes. For example, here's the TagChunker using just a UnigramTagger:

>>> from nltk.tag import UnigramTagger
>>> uni_chunker = TagChunker(train_chunks, tagger_classes=[UnigramTagger])
>>> score = uni_chunker.evaluate(test_chunks)
>>> score.accuracy()
0.96749259243354657

The tagger_classes will be passed directly into the backoff_tagger() function, which means they must be subclasses of SequentialBackoffTagger. In testing, the default of tagger_classes=[UnigramTagger, BigramTagger] produces the best results.
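
For example, here's a hypothetical variation that adds a TrigramTagger to the end of the backoff chain; you can evaluate it as shown earlier, though per the note above, don't expect it to beat the default:

>>> from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
>>> tri_chunker = TagChunker(train_chunks, tagger_classes=[UnigramTagger, BigramTagger, TrigramTagger])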

See also

The Training and combining Ngram taggers recipe in Chapter 4, Part-of-Speech Tagging covers backoff tagging with a UnigramTagger and a BigramTagger. The ChunkScore metrics returned by the evaluate() method of a chunker were explained in the previous recipe.
