Analyzing word frequencies

The NLTK FreqDist class stores the frequency of each word in a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare, then filter out stopwords and punctuation:

words = nltk.corpus.gutenberg.words("shakespeare-caesar.txt")
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]
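The effect of the comprehension can be checked on a toy token list; the small sets here are hand-made stand-ins for the real stopword and punctuation sets:

```python
toy_sw = {"the", "and"}          # stand-in for the NLTK stopword set
toy_punct = {",", "."}           # stand-in for string.punctuation
toy_words = ["The", "noble", "Brutus", ",", "and", "Caesar", "."]

# Lowercase every token and drop stopwords and punctuation.
toy_filtered = [w.lower() for w in toy_words
                if w.lower() not in toy_sw and w.lower() not in toy_punct]
print(toy_filtered)  # ['noble', 'brutus', 'caesar']
```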

Create a FreqDist object and print the five most frequent words with their counts (in NLTK 3, keys() is no longer sorted by frequency, so use most_common()):

fd = nltk.FreqDist(filtered)
print("Most common", fd.most_common(5))

The words and counts are printed as follows:

Most common [('d', 215), ('caesar', 190), ('brutus', 161), ('bru', 153), ('haue', 148)]

The first word in this list is of course not an English word, so we may need to add a heuristic requiring words to have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the most frequent word and its count:

print("Max", fd.max())
print("Count", fd['d'])

The following result shouldn't be a surprise:

Max d
Count 215
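The two-character heuristic suggested above is one extra condition in the list comprehension; here is a minimal sketch (the helper name filter_tokens is our own, not part of NLTK):

```python
def filter_tokens(words, stopwords, punctuation, min_len=2):
    """Lowercase tokens, dropping stopwords, punctuation, and short tokens."""
    return [w.lower() for w in words
            if w.lower() not in stopwords
            and w.lower() not in punctuation
            and len(w) >= min_len]

# 'D' is dropped by the length check; 'the' and ',' by the other filters.
print(filter_tokens(["D", "Caesar", "the", ",", "Brutus"], {"the"}, {","}))
# ['caesar', 'brutus']
```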

The analysis until this point concerned single words, but we can extend the analysis to word pairs and triplets. These are also called bigrams and trigrams. We can find them with the bigrams() and trigrams() functions. Repeat the analysis, but this time for bigrams:

fd = nltk.FreqDist(nltk.bigrams(filtered))
print("Bigrams", fd.most_common(5))
print("Bigram Max", fd.max())
print("Bigram count", fd[('let', 'vs')])

The following output should be printed:

Bigrams [(('let', 'vs'), 16), (('wee', 'l'), 15), (('mark', 'antony'), 13), (('marke', 'antony'), 12), (('st', 'thou'), 12)]
Bigram Max ('let', 'vs')
Bigram count 16
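The trigrams() function works the same way with windows of three words. As a quick sketch, the windows it yields match what a plain zip over shifted copies of the list produces:

```python
tokens = ["friends", "romans", "countrymen", "lend", "ears"]

# nltk.trigrams(tokens) yields the same triples as this stdlib zip:
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams)
# [('friends', 'romans', 'countrymen'),
#  ('romans', 'countrymen', 'lend'),
#  ('countrymen', 'lend', 'ears')]
```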

Have a peek at the frequencies.py file in this book's code bundle:

import nltk
import string

# Load the Gutenberg text of Julius Caesar.
gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

# Filter out stopwords and punctuation.
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

# Single-word frequencies.
fd = nltk.FreqDist(filtered)
print("Most common", fd.most_common(5))
print("Max", fd.max())
print("Count", fd['d'])

# Bigram frequencies.
fd = nltk.FreqDist(nltk.bigrams(filtered))
print("Bigrams", fd.most_common(5))
print("Bigram Max", fd.max())
print("Bigram count", fd[('let', 'vs')])
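In NLTK 3, FreqDist subclasses the standard library's collections.Counter, so its dictionary-like behavior can be checked on toy data with Counter alone; FreqDist adds conveniences such as max() and plotting on top:

```python
from collections import Counter

# Counter mirrors FreqDist's counting and most_common() behavior.
fd = Counter(["a", "b", "a", "c", "a", "b"])
print(fd.most_common(2))  # [('a', 3), ('b', 2)]
print(fd["a"])            # 3
```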