The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out stopwords and punctuation:
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]
Create a FreqDist object and print the keys and values with the highest frequency:
fd = nltk.FreqDist(filtered)
print "Words", fd.keys()[:5]
print "Counts", fd.values()[:5]
The keys and values are printed as follows:
Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]
The first word in this list is of course not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the most frequent word and its count:
print "Max", fd.max()
print "Count", fd['d']
The following result shouldn't be a surprise:
Max d
Count 215
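The two-character heuristic mentioned above can be sketched as an extra filter condition. The following is a minimal sketch using collections.Counter and a toy token list as stand-ins, so it runs without the NLTK corpus download; the token list is hypothetical, not the actual Julius Caesar data:

```python
from collections import Counter

# Toy token list standing in for the filtered Julius Caesar words
# (a hypothetical stand-in, not the real corpus data).
tokens = ["d", "caesar", "brutus", "d", "caesar", "d", "haue"]

# The heuristic from the text: keep only words of at least two characters.
cleaned = [w for w in tokens if len(w) > 1]

# Counter plays the role of FreqDist here.
fd = Counter(cleaned)
print(fd.most_common(2))
```

With real data, the same `len(w) > 1` condition would simply be appended to the list comprehension that already filters stopwords and punctuation.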
The analysis until this point concerned single words, but we can extend the analysis to word pairs and triplets. These are also called bigrams and trigrams. We can find them with the bigrams() and trigrams() functions. Repeat the analysis, but this time for bigrams:
fd = nltk.FreqDist(nltk.bigrams(filtered))
print "Bigrams", fd.keys()[:5]
print "Counts", fd.values()[:5]
print "Bigram Max", fd.max()
print "Bigram count", fd[('let', 'vs')]
The following output should be printed:
Bigrams [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')]
Counts [16, 15, 13, 12, 12]
Bigram Max ('let', 'vs')
Bigram count 16
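The idea behind bigrams and trigrams is simply pairing each word with its successors. A minimal sketch of the same construction in plain Python with zip and a toy token list (a hypothetical stand-in for the filtered corpus):

```python
from collections import Counter

# Hypothetical toy token list, not the real corpus data.
tokens = ["let", "vs", "heare", "let", "vs", "go"]

# Pair each token with its successor, the same idea as nltk.bigrams().
bigrams = list(zip(tokens, tokens[1:]))

# Triplets of consecutive tokens, the same idea as nltk.trigrams().
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

# Count the pairs; the most common one is the top bigram.
fd = Counter(bigrams)
print(fd.most_common(1))
```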
Have a peek at the frequencies.py file in this book's code bundle:
import nltk
import string

gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

fd = nltk.FreqDist(filtered)
print "Words", fd.keys()[:5]
print "Counts", fd.values()[:5]
print "Max", fd.max()
print "Count", fd['d']

fd = nltk.FreqDist(nltk.bigrams(filtered))
print "Bigrams", fd.keys()[:5]
print "Counts", fd.values()[:5]
print "Bigram Max", fd.max()
print "Bigram count", fd[('let', 'vs')]
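Note that this listing targets Python 2 and an older NLTK, where fd.keys() returned words sorted by frequency. In NLTK 3, FreqDist behaves like collections.Counter: keys() is no longer frequency-ordered and cannot be sliced directly, so most_common() is the replacement for the keys()[:5] and values()[:5] idiom. A hedged sketch of the equivalent access, using Counter as a stand-in with the counts reported above:

```python
from collections import Counter

# Counter stands in for a modern FreqDist, seeded with the counts
# printed earlier in the chapter.
fd = Counter({"d": 215, "caesar": 190, "brutus": 161, "bru": 153, "haue": 148})

# most_common(n) returns (word, count) pairs sorted by frequency,
# replacing the old fd.keys()[:5] / fd.values()[:5] idiom.
top = fd.most_common(5)
words = [w for w, _ in top]
counts = [c for _, c in top]
print("Words", words)
print("Counts", counts)

# Analogous to fd.max() in the listing.
print("Max", fd.most_common(1)[0][0])
```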