Filtering out stopwords, names, and numbers

It's a common requirement in text analysis to get rid of stopwords (common words with low information value). NLTK provides stopwords corpora for a number of languages. Load the English stopwords corpus and print some of the words:

import nltk

sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])

The following common words are printed:

Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']

Notice that all the words in this corpus are in lowercase.
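
Because the corpus stores only lowercase forms, membership tests should lowercase candidate words first, or capitalized forms will slip through. A minimal sketch, using 'The' purely as an illustrative example:

print('The' in sw)          # False: the corpus only contains 'the'
print('The'.lower() in sw)  # True: lowercase before testing membership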

NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, mostly with expired copyrights, that are available for free on the Internet (see http://www.gutenberg.org/).

Load the Gutenberg corpus and print some of its filenames:

gb = nltk.corpus.gutenberg
print "Gutenberg files", gb.fileids()[-5:]

Some of the titles printed may be familiar to you:

Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Extract the first two sentences from the milton-paradise.txt file; we will filter them later:

text_sent = gb.sents("milton-paradise.txt")[:2]
print "Unfiltered", text_sent

The following sentences are printed:

Unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]

Now, filter out the stopwords as follows:

for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print "Filtered", filtered

For the first sentence, we get the following output:

Filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']

If we compare this with the previous snippet, we notice that the word 'by' has been filtered out, as it was found in the stopwords corpus. Sometimes, we want to remove numbers and names too. We can remove words based on part-of-speech (POS) tags. In the Penn Treebank tagging scheme that pos_tag() uses, numbers correspond to the cardinal number (CD) tag, and names correspond to the proper noun, singular (NNP) tag. Tagging is an inexact process based on heuristics; it's a big topic that deserves an entire book (see the Preface). Tag the filtered text with the pos_tag() function:

tagged = nltk.pos_tag(filtered)
print "Tagged", tagged

For our text, we get the following tags:

Tagged [('[', 'NN'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'CD')]
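
If any of these tag abbreviations are unfamiliar, NLTK's built-in help can print the Penn Treebank definition of a tag along with examples. A quick sketch (this assumes the tagsets data has been downloaded, for example with nltk.download('tagsets')):

nltk.help.upenn_tagset('NNP')  # describes the proper noun, singular tag
nltk.help.upenn_tagset('CD')   # describes the cardinal number tag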

The pos_tag() function returns a list of tuples, where the second element of each tuple is the tag. As you can see, some of the words are tagged as NNP, although they probably shouldn't be. The heuristic here is to tag a word as NNP if its first character is uppercase. If we set all the words to lowercase, we will get a different result; this is left as an exercise for the reader. It's easy to remove the words with the NNP and CD tags from the list. Have a look at the filtering.py file in this book's code bundle:

import nltk

# Load the English stopwords and print a few of them.
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])

# Load the Gutenberg corpus and take the first two sentences
# of Paradise Lost.
gb = nltk.corpus.gutenberg
print("Gutenberg files", gb.fileids()[-5:])
text_sent = gb.sents("milton-paradise.txt")[:2]
print("Unfiltered", text_sent)

for sent in text_sent:
    # Remove stopwords with a case-insensitive match against the corpus.
    filtered = [w for w in sent if w.lower() not in sw]
    print("Filtered", filtered)

    # Tag the remaining words with Penn Treebank POS tags.
    tagged = nltk.pos_tag(filtered)
    print("Tagged", tagged)

    # Keep only words that are neither proper nouns (NNP)
    # nor cardinal numbers (CD).
    words = []
    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
            words.append(word[0])

    print(words)
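
As a starting point for the lowercasing exercise mentioned earlier, a minimal sketch might look like the following; it reuses the filtered list from the last loop iteration, and since the uppercase-first-letter heuristic no longer applies, the tagger assigns different tags:

lowered = [w.lower() for w in filtered]
print("Tagged lowercase", nltk.pos_tag(lowered))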