Lazy corpus loading

Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks. And while you'll often want to specify a corpus reader in a common module, you don't always need to access it right away. To speed up module import time when a corpus reader is defined, NLTK provides a LazyCorpusLoader class that can transform itself into your actual corpus reader as soon as you need it. This way, you can define a corpus reader in a common module without it slowing down module loading.

How to do it...

The LazyCorpusLoader class requires two arguments: the name of the corpus and the corpus reader class, plus any other arguments needed to initialize the corpus reader class.

The name argument specifies the root directory name of the corpus, which must be within a corpora subdirectory of one of the paths in nltk.data.path. See the Setting up a custom corpus recipe of this chapter for more details on nltk.data.path.

For example, if you have a custom corpus named cookbook in your local nltk_data directory, its path would be ~/nltk_data/corpora/cookbook. You'd then pass 'cookbook' to LazyCorpusLoader as the name, and LazyCorpusLoader will look in ~/nltk_data/corpora for a directory named 'cookbook'.
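To make the lookup rule concrete, here is a minimal standard-library sketch of how a corpus name could be resolved against a list of search paths. The function find_corpus_root and the demo directory are hypothetical names used only for illustration. The real lookup is done by nltk.data.find, which is more elaborate than this.

```python
import os
import tempfile


def find_corpus_root(name, search_paths):
    """Return the first existing <path>/corpora/<name> directory.

    Hypothetical helper: a simplified stand-in for the lookup that
    nltk.data.find performs over nltk.data.path.
    """
    for base in search_paths:
        candidate = os.path.join(base, 'corpora', name)
        if os.path.isdir(candidate):
            return candidate
    # nltk.data.find also raises LookupError when nothing is found.
    raise LookupError('corpus %r not found in %r' % (name, search_paths))


# Demo with a throwaway directory standing in for ~/nltk_data.
demo_base = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_base, 'corpora', 'cookbook'))

# The first search path does not exist, so the second one is used.
found = find_corpus_root('cookbook', ['/nonexistent/nltk_data', demo_base])
print(found == os.path.join(demo_base, 'corpora', 'cookbook'))
```

The search paths are tried in order, so a corpus directory in an earlier path shadows one with the same name in a later path.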

The second argument to LazyCorpusLoader is reader_cls, which should be a subclass of CorpusReader, such as WordListCorpusReader. Pass the class itself, not its name as a string. You will also need to pass in any other arguments that reader_cls requires for initialization. The following example uses the same wordlist file we created in the earlier recipe, Creating a wordlist corpus. Here, the third argument to LazyCorpusLoader is the list of file IDs (filenames) that will be passed to WordListCorpusReader at initialization:

>>> from nltk.corpus.util import LazyCorpusLoader
>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['wordlist'])
>>> isinstance(reader, LazyCorpusLoader)
True
>>> reader.fileids()
['wordlist']
>>> isinstance(reader, LazyCorpusLoader)
False
>>> isinstance(reader, WordListCorpusReader)
True

How it works...

The LazyCorpusLoader class stores all the arguments given, but otherwise does nothing until you try to access an attribute or method. This way, initialization is very fast, eliminating the overhead of loading the corpus reader immediately. As soon as you do access an attribute or method, it does the following:

  1. Calls nltk.data.find('corpora/%s' % name) to find the corpus data root directory.
  2. Instantiates the corpus reader class with the root directory and any other arguments.
  3. Transforms itself into the corpus reader class.

So in the previous example code, before we call reader.fileids(), reader is an instance of LazyCorpusLoader, but after the call, reader becomes an instance of WordListCorpusReader.
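The three steps above can be sketched in plain Python. This is a simplified illustration, not NLTK's actual code. LazyLoader and ToyReader are hypothetical classes, and the root path is built directly instead of being searched for with nltk.data.find.

```python
class LazyLoader:
    """Hypothetical stand-in for the lazy-loading idea; not NLTK's code."""

    def __init__(self, name, reader_cls, *args, **kwargs):
        # Only store the arguments; nothing expensive happens here.
        self._name = name
        self._reader_cls = reader_cls
        self._args = args
        self._kwargs = kwargs

    def _load(self):
        # Step 1 (simplified): build the root path instead of searching for it.
        root = 'corpora/%s' % self._name
        # Step 2: instantiate the real reader with the root and stored arguments.
        reader = self._reader_cls(root, *self._args, **self._kwargs)
        # Step 3: become the reader by adopting its state and its class.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__

    def __getattr__(self, attr):
        # Runs only when normal lookup fails, that is, for reader attributes.
        self._load()
        # self is now a reader instance, so this is an ordinary lookup.
        return getattr(self, attr)


class ToyReader:
    """Hypothetical reader that just records what it was given."""

    def __init__(self, root, fileids, encoding='utf8'):
        self.root = root
        self._fileids = list(fileids)
        self.encoding = encoding

    def fileids(self):
        return self._fileids


reader = LazyLoader('cookbook', ToyReader, ['wordlist'], encoding='ascii')
print(isinstance(reader, LazyLoader))  # True: nothing has been loaded yet
print(reader.fileids())                # ['wordlist']: first access triggers loading
print(isinstance(reader, ToyReader))   # True: the object has transformed itself
```

Because the object swaps in the reader's class and state on first access, later attribute lookups no longer pass through the loader, so the laziness costs nothing after the first use.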

There's more...

All of the corpora included with NLTK and defined in nltk.corpus are initially instances of LazyCorpusLoader. The following is some code from nltk.corpus defining the treebank corpora:

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader,
    r'wsj_.*\.mrg', tagset='wsj', encoding='ascii')
treebank_chunk = LazyCorpusLoader('treebank/tagged', ChunkedCorpusReader,
    r'wsj_.*\.pos',
    sent_tokenizer=RegexpTokenizer(r'(?<=/\.)\s*(?![^\[]*\])', gaps=True),
    para_block_reader=tagged_treebank_para_block_reader, encoding='ascii')
treebank_raw = LazyCorpusLoader('treebank/raw', PlaintextCorpusReader,
    r'wsj_.*', encoding='ISO-8859-2')

As you can see in the previous code, LazyCorpusLoader passes any number of additional positional and keyword arguments through to the class given as reader_cls.
