A corpus is a collection of text documents, and corpora is the plural of corpus. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.
You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, or Mac OS X.
NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so that NLTK can find them. So as not to conflict with the official data package, we'll create a custom nltk_data directory in our home directory. Here's some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path:
>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
...     os.mkdir(path)
>>> os.path.exists(path)
True
>>> import nltk.data
>>> path in nltk.data.path
True
If the last line, path in nltk.data.path, is True, then you should now have an nltk_data directory in your home directory. The path should be %UserProfile%\nltk_data on Windows, or ~/nltk_data on Unix, Linux, or Mac OS X. For simplicity, I'll refer to the directory as ~/nltk_data.
If the last line does not return True, try creating the nltk_data directory manually in your home directory, then verify that the absolute path is in nltk.data.path. It's essential to ensure that this directory exists and is in nltk.data.path before continuing. Once you have your nltk_data directory, the convention is that corpora reside in a corpora subdirectory. Create this corpora directory within the nltk_data directory, so that the path is ~/nltk_data/corpora. Finally, we'll create a subdirectory in corpora to hold our custom corpus. Let's call it cookbook, giving us the full path of ~/nltk_data/corpora/cookbook.
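The directory layout described above can also be created in one step. What follows is a minimal sketch, not part of the original recipe: the helper name ensure_corpus_dir is hypothetical, and the demonstration runs against a temporary directory rather than your real ~/nltk_data so that it is safe to try.

```python
import os
import tempfile


def ensure_corpus_dir(data_dir, corpus_name):
    """Create <data_dir>/corpora/<corpus_name> if needed and return its path.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    corpus_dir = os.path.join(data_dir, "corpora", corpus_name)
    # makedirs creates any missing intermediate directories;
    # exist_ok=True makes the call safe to repeat.
    os.makedirs(corpus_dir, exist_ok=True)
    return corpus_dir


# Throwaway directory standing in for os.path.expanduser('~/nltk_data').
demo_root = tempfile.mkdtemp()
cookbook_dir = ensure_corpus_dir(demo_root, "cookbook")
print(os.path.isdir(cookbook_dir))
```

To build the real layout, you would pass os.path.expanduser('~/nltk_data') as data_dir instead of the temporary directory.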
Now we can create a simple word list file and make sure it loads. In the Spelling correction with Enchant recipe in Chapter 2, Replacing and Correcting Words, we created a word list file called mywords.txt. Put this file into ~/nltk_data/corpora/cookbook/. Now we can use nltk.data.load() to load the file.
>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw')
'nltk\n'
The nltk.data.load() function recognizes a number of formats, such as 'raw', 'pickle', and 'yaml'. If no format is specified, then it tries to guess the format based on the file's extension. In the previous case, we have a .txt file, which is not a recognized extension, so we have to specify the 'raw' format. But if we used a file that ended in .yaml, then we would not need to specify the format.
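The idea of guessing a format from the extension can be sketched in a few lines. This is an illustration of the behavior described above, not NLTK's actual implementation: the function guess_format and the KNOWN_FORMATS table are hypothetical, and the table lists only the two extensions named in this recipe. The real table in NLTK is larger and may differ between versions.

```python
import os

# Illustrative table only: the extensions this recipe names as recognized.
KNOWN_FORMATS = {".pickle": "pickle", ".yaml": "yaml"}


def guess_format(filename, format=None):
    """Return the explicit format if given, else guess it from the extension."""
    if format is not None:
        # An explicit format always wins, as with format='raw' above.
        return format
    ext = os.path.splitext(filename)[1].lower()
    if ext not in KNOWN_FORMATS:
        raise ValueError(
            "cannot guess a format for %r; pass format explicitly" % filename
        )
    return KNOWN_FORMATS[ext]


print(guess_format("corpora/cookbook/synonyms.yaml"))
print(guess_format("corpora/cookbook/mywords.txt", format="raw"))
```

In this sketch, calling guess_format on mywords.txt without a format raises ValueError, which mirrors why the earlier call had to pass format='raw'.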
Filenames passed in to nltk.data.load() can be absolute or relative paths. Relative paths must be relative to one of the paths specified in nltk.data.path. The file is found using nltk.data.find(path), which searches all known paths combined with the relative path. Absolute paths do not require a search, and are used as is.
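The search just described amounts to trying each known directory in turn. Here is a rough sketch of that logic using only the standard library. The function find_resource is a hypothetical stand-in written for illustration; it is not how nltk.data.find() is actually implemented, and the two temporary directories merely play the role of entries in nltk.data.path.

```python
import os
import tempfile


def find_resource(resource, search_paths):
    """Return the first existing location of resource (illustrative only)."""
    if os.path.isabs(resource):
        # Absolute paths are used as is, with no search.
        if os.path.exists(resource):
            return resource
        raise LookupError(resource)
    for base in search_paths:
        # Relative names use '/' separators, whatever the platform.
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(resource)


# Two temporary directories stand in for entries of nltk.data.path.
first, second = tempfile.mkdtemp(), tempfile.mkdtemp()
target_dir = os.path.join(second, "corpora", "cookbook")
os.makedirs(target_dir)
with open(os.path.join(target_dir, "mywords.txt"), "w") as f:
    f.write("nltk\n")

found = find_resource("corpora/cookbook/mywords.txt", [first, second])
print(found == os.path.join(target_dir, "mywords.txt"))
```

The file lives only under the second directory, so the search skips the first entry and returns the match from the second.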
For most corpora access, you won't actually need to use nltk.data.load, as that will be handled by the CorpusReader classes covered in the following recipes. But it's a good function to be familiar with for loading .pickle files and .yaml files, plus it introduces the idea of putting all of your data files into a path known by NLTK.
If you put the synonyms.yaml file from the Replacing synonyms recipe in Chapter 2, Replacing and Correcting Words, into ~/nltk_data/corpora/cookbook (next to mywords.txt), you can use nltk.data.load() to load it without specifying a format.
>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/synonyms.yaml')
{'bday': 'birthday'}
This assumes that PyYAML is installed. If not, you can find download and installation instructions at http://pyyaml.org/wiki/PyYAML.