Filtering stopwords in a tokenized sentence

Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. Most search engines will filter stopwords out of search queries and documents in order to save space in their index.

Getting ready

NLTK comes with a stopwords corpus that contains word lists for many languages. Be sure to unzip the datafile so NLTK can find these word lists in nltk_data/corpora/stopwords/.

How to do it...

We're going to create a set of all English stopwords, then use it to filter stopwords from a sentence.

>>> from nltk.corpus import stopwords
>>> english_stops = set(stopwords.words('english'))
>>> words = ["Can't", 'is', 'a', 'contraction']
>>> [word for word in words if word not in english_stops]
["Can't", 'contraction']

How it works...

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. As such, it has a words() method that can take a single argument for the file ID, which in this case is 'english', referring to a file containing a list of English stopwords. You could also call stopwords.words() with no argument to get a list of all stopwords in every language available.

There's more...

You can see the list of all English stopwords using stopwords.words('english') or by examining the word list file at nltk_data/corpora/stopwords/english. There are also stopword lists for many other languages. You can see the complete list of languages using the fileids() method:

>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

Any of these fileids can be used as an argument to the words() method to get a list of stopwords for that language.

See also

If you'd like to create your own stopwords corpus, see the Creating a word list corpus recipe in Chapter 3, Creating Custom Corpora, to learn how to use the WordListCorpusReader. We'll also be using stopwords in the Discovering word collocations recipe, later in this chapter.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.19.21