Setting up a custom corpus

A corpus is a collection of text documents, and corpora is the plural of corpus. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.

Getting ready

You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C: ltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, or Mac OS X.

How to do it...

NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so it can be found by NLTK. So as not to conflict with the official data package, we'll create a custom nltk_data directory in our home directory. Here's some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path:

>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
...    os.mkdir(path)
>>> os.path.exists(path)
True
>>> import nltk.data
>>> path in nltk.data.path
True

If the last line, path in nltk.data.path, is True, then you should now have a nltk_data directory in your home directory. The path should be %UserProfile% ltk_data on Windows, or ~/nltk_data on Unix, Linux, or Mac OS X. For simplicity, I'll refer to the directory as ~/nltk_data.

Note

If the last line does not return True, try creating the nltk_data directory manually in your home directory, then verify that the absolute path is in nltk.data.path. It's essential to ensure that this directory exists and is in nltk.data.path before continuing. Once you have your nltk_data directory, the convention is that corpora reside in a corpora subdirectory. Create this corpora directory within the nltk_data directory, so that the path is ~/nltk_data/corpora. Finally, we'll create a subdirectory in corpora to hold our custom corpus. Let's call it cookbook, giving us the full path of ~/nltk_data/corpora/cookbook.

Now we can create a simple word list file and make sure it loads. In Chapter 2, Replacing and Correcting Words, Spelling correction with Enchant recipe, we created a word list file called mywords.txt. Put this file into ~/nltk_data/corpora/cookbook/. Now we can use nltk.data.load() to load the file.

>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw')
'nltk
'

Note

We need to specify format='raw' since nltk.data.load() doesn't know how to interpret .txt files. As we'll see, it does know how to interpret a number of other file formats.

How it works...

The nltk.data.load() function recognizes a number of formats, such as 'raw', 'pickle', and 'yaml'. If no format is specified, then it tries to guess the format based on the file's extension. In the previous case, we have a .txt file, which is not a recognized extension, so we have to specify the 'raw' format. But if we used a file that ended in .yaml, then we would not need to specify the format.

Filenames passed in to nltk.data.load() can be absolute or relative paths. Relative paths must be relative to one of the paths specified in nltk.data.path. The file is found using nltk.data.find(path), which searches all known paths combined with the relative path. Absolute paths do not require a search, and are used as is.

There's more...

For most corpora access, you won't actually need to use nltk.data.load, as that will be handled by the CorpusReader classes covered in the following recipes. But it's a good function to be familiar with for loading .pickle files and .yaml files, plus it introduces the idea of putting all of your data files into a path known by NLTK.

Loading a YAML file

If you put the synonyms.yaml file from the Chapter 2, Replacing and Correcting Words, Replacing synonyms recipe, into ~/nltk_data/corpora/cookbook (next to mywords.txt), you can use nltk.data.load() to load it without specifying a format.

>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/synonyms.yaml')
{'bday': 'birthday'}

This assumes that PyYAML is installed. If not, you can find download and installation instructions at http://pyyaml.org/wiki/PyYAML.

See also

In the next recipes, we'll cover various corpus readers, and then in the Lazy corpus loading recipe, we'll use the LazyCorpusLoader, which expects corpus data to be in a corpora subdirectory of one of the paths specified by nltk.data.path.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.105.227