Combining taggers with backoff tagging

Backoff tagging is one of the core features of SequentialBackoffTagger. It allows you to chain taggers together: if one tagger doesn't know how to tag a word, it passes the word on to its backoff tagger, which in turn can pass it on to its own backoff tagger, and so on until there are no backoff taggers left to check.

How to do it...

Every subclass of SequentialBackoffTagger can take a backoff keyword argument whose value is another SequentialBackoffTagger instance. So, we'll use the DefaultTagger class from the Default tagging recipe in this chapter as the backoff for the UnigramTagger class covered in the previous recipe, Training a unigram part-of-speech tagger. Refer to both recipes for details on train_sents and test_sents.

>>> tagger1 = DefaultTagger('NN')
>>> tagger2 = UnigramTagger(train_sents, backoff=tagger1)
>>> tagger2.evaluate(test_sents)
0.8758471832505935

By using a default tag of NN whenever the UnigramTagger is unable to tag a word, we've increased the accuracy by almost 2%!
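The same chain can be reproduced end to end without the earlier recipes by substituting a tiny hand-made training set for train_sents; the corpus split and the exact accuracy figure above come from the treebank data used in the previous recipes, so this is only a sketch of the mechanism:

```python
from nltk.tag import DefaultTagger, UnigramTagger

# A tiny, hand-made stand-in for train_sents; the earlier recipes
# build train_sents from the treebank corpus instead.
train_sents = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
               [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]

tagger1 = DefaultTagger('NN')
tagger2 = UnigramTagger(train_sents, backoff=tagger1)

# 'fox' never appeared in training, so the UnigramTagger defers to
# the DefaultTagger, which tags it 'NN'.
print(tagger2.tag(['the', 'fox', 'barks']))
# [('the', 'DT'), ('fox', 'NN'), ('barks', 'VBZ')]
```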

How it works...

When a SequentialBackoffTagger class is initialized, it creates an internal list of backoff taggers with itself as the first element. If a backoff tagger is given, then that tagger's internal list of taggers is appended to the new tagger's own list. Here's some code to illustrate this:

>>> tagger1._taggers == [tagger1]
True
>>> tagger2._taggers == [tagger2, tagger1]
True

The _taggers list is the internal list of backoff taggers that the SequentialBackoffTagger class uses when the tag() method is called. tag() goes through this list, calling choose_tag() on each tagger. As soon as a tag is found, it stops and returns that tag. This means that if the primary tagger can tag the word, then that's the tag that will be returned. But if it returns None, then the next tagger is tried, and so on, until either a tag is found or None is returned. Of course, None will never be returned if your final backoff tagger is a DefaultTagger.
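The lookup loop described above can be sketched in plain Python. This is a simplified illustration, not NLTK's actual source; the class and method names (SketchTagger, tag_word) are invented for the example:

```python
class SketchTagger:
    """Minimal illustration of SequentialBackoffTagger's chain logic."""
    def __init__(self, model=None, default=None, backoff=None):
        self.model = model or {}     # word -> tag lookups
        self.default = default       # fixed answer, like DefaultTagger
        self._taggers = [self]       # itself first...
        if backoff is not None:      # ...then the backoff's whole chain
            self._taggers += backoff._taggers

    def choose_tag(self, word):
        # A single tagger's answer: a tag, or None if it doesn't know.
        return self.model.get(word, self.default)

    def tag_word(self, word):
        # Like tag(): walk the chain, stop at the first non-None answer.
        for tagger in self._taggers:
            tag = tagger.choose_tag(word)
            if tag is not None:
                return tag
        return None

tagger1 = SketchTagger(default='NN')
tagger2 = SketchTagger(model={'the': 'DT', 'barks': 'VBZ'}, backoff=tagger1)
print(tagger2._taggers == [tagger2, tagger1])  # True
print(tagger2.tag_word('the'))    # 'DT' -- answered by tagger2 itself
print(tagger2.tag_word('fox'))    # 'NN' -- fell through to tagger1
```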

There's more...

While most of the taggers included in NLTK are subclasses of SequentialBackoffTagger, not all of them are. There are a few taggers that we'll cover in the later recipes that cannot be used as part of a backoff tagging chain, such as the BrillTagger class. However, these taggers generally take another tagger to use as a baseline, and a SequentialBackoffTagger class is often a good choice for that baseline.
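As a sketch of that pattern, a backoff chain can serve as the baseline for NLTK's Brill trainer. The BrillTagger class is covered properly in a later recipe; the one-sentence training set here is only a stand-in for the train_sents from the earlier recipes:

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

# A tiny stand-in for train_sents from the earlier recipes.
train_sents = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]]

# The backoff chain acts as the Brill trainer's baseline tagger.
baseline = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
trainer = BrillTaggerTrainer(baseline, fntbl37())
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.tag(['the', 'dog', 'barks']))
```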

Saving and loading a trained tagger with pickle

Since training a tagger can take a while, and you generally only need to do the training once, pickling a trained tagger is a useful way to save it for later usage. If your trained tagger is called tagger, then here's how to dump and load it with pickle:

>>> import pickle
>>> f = open('tagger.pickle', 'wb')
>>> pickle.dump(tagger, f)
>>> f.close()
>>> f = open('tagger.pickle', 'rb')
>>> tagger = pickle.load(f)

If your tagger pickle file is located in an NLTK data directory, you could also use nltk.data.load('tagger.pickle') to load the tagger.

See also

In the next recipe, we'll combine more taggers with backoff tagging. Also, see the previous two recipes for details on the DefaultTagger and UnigramTagger classes.
