Given that sentence detection is probably the first task you’ll want
to ponder when building an NLP stack, it makes sense to start there. Even
if you never complete the remaining tasks in the pipeline, it turns out
that EOS detection alone yields some powerful possibilities such as
document summarization, which we’ll be considering as a follow-up
exercise. But first, we’ll need to fetch some high-quality blog data.
Let’s use the tried-and-true feedparser module, which you can easy_install
if you don’t have it already, to fetch some posts from the O’Reilly Radar
blog. The listing in
Example 8-1 fetches a few posts and saves
them to a local file as plain old JSON, since nothing else in this chapter
hinges on the capabilities of a more advanced storage medium, such as
CouchDB. As always, you can choose to store the posts anywhere you’d
like.
Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)
# -*- coding: utf-8 -*-
# Note: Python 2 syntax (print statements); uses BeautifulSoup 3 and the
# clean_html helper from older NLTK releases.

import os
import sys
from datetime import datetime as dt
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

# Example feed:
# http://feeds.feedburner.com/oreilly/radar/atom
FEED_URL = sys.argv[1]


def cleanHtml(html):
    # Strip markup, then convert HTML entities back into characters
    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

# Report the number of entries in the feed (not the length of the first title)
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

if not os.path.isdir('out'):
    os.mkdir('out')

out_file = '%s__%s.json' % (fp.feed.title, dt.utcnow())
f = open(os.path.join(os.getcwd(), 'out', out_file), 'w')
f.write(json.dumps(blog_posts))
f.close()

print >> sys.stderr, 'Wrote output file to %s' % (f.name, )
Obtaining our unstructured text from a reputable source affords us
the luxury of assuming good English
grammar; hopefully this also means that one of NLTK’s out-of-the-box
sentence detectors will work reasonably well. There’s no better way to
find out than hacking some code to see what happens, so go ahead and take
a gander at the code listing in Example 8-2. It introduces the sent_tokenize
and word_tokenize functions, which are aliases for
NLTK’s currently recommended sentence detector and word tokenizer. A brief
discussion of the listing is provided afterward.
Example 8-2. Using NLTK’s NLP tools to parse blog data (blogs_and_nlp__sentence_detection.py)
# -*- coding: utf-8 -*-

import sys
import json
import nltk

# Load in output from blogs_and_nlp__get_feed.py
BLOG_DATA = sys.argv[1]
blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts
stop_words = nltk.corpus.stopwords.words('english') + [
    '.', ',', '--', "'s", '?', ')', '(', ':', "'", "'re", '"', '-', '}', '{',
    ]

for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences
             for w in nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Basic stats
    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())

    # Relies on older NLTK releases, where FreqDist.items() is sorted by
    # decreasing frequency
    top_10_words_sans_stop_words = [w for w in fdist.items()
                                    if w[0] not in stop_words][:10]

    print post['title']
    print ' Num Sentences:'.ljust(25), len(sentences)
    print ' Num Words:'.ljust(25), num_words
    print ' Num Unique Words:'.ljust(25), num_unique_words
    print ' Num Hapaxes:'.ljust(25), num_hapaxes
    print ' Top 10 Most Frequent Words (sans stop words): ', \
          ' '.join(['%s (%s)' % (w[0], w[1]) for w in top_10_words_sans_stop_words])
    print
The first things you’re probably wondering about are the sent_tokenize
and word_tokenize calls. NLTK provides several
options for tokenization, but it provides “recommendations” as to the best
available via these aliases. At the time of this writing (you can
double-check this with pydoc at any
time), the sentence detector is the PunktSentenceTokenizer
and the word tokenizer is the TreebankWordTokenizer. Let’s take a
brief look at each of these.
Internally, the PunktSentenceTokenizer relies heavily on being
able to detect abbreviations as part of collocation patterns, and it uses
some regular expressions to try to intelligently parse sentences by taking
into account common patterns of punctuation usage. A full explanation of
the innards of the PunktSentenceTokenizer’s logic is outside the
scope of this book, but Tibor Kiss and Jan Strunk’s original paper, “Unsupervised
Multilingual Sentence Boundary Detection,” discusses its approach and
is highly readable, so it’s worth taking some time to review. As we’ll
see in a bit, it is possible to instantiate the PunktSentenceTokenizer
with sample text that it trains on to try to improve its accuracy. The
underlying algorithm is an unsupervised learning algorithm; it
does not require you to explicitly mark up the sample training data in any
way. Instead, the algorithm inspects certain features that appear in the
text itself, such as the use of capitalization and the co-occurrences of
tokens, to derive
suitable parameters for breaking the text into sentences.
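To get a feel for how abbreviations might be picked up from raw text without any markup, here’s a minimal sketch. It is not the Punkt algorithm; it’s a much cruder heuristic than the statistical approach Kiss and Strunk describe, and the function name guess_abbreviations, its thresholds, and the sample text are all made up for illustration. The idea is simply that a short token that always shows up with a trailing period is more likely an abbreviation than a word that just happens to end a sentence.

```python
from collections import Counter


def guess_abbreviations(text, min_count=2, max_len=4):
    """Guess abbreviations: short tokens that *always* carry a trailing period.

    Hypothetical helper for illustration only; the thresholds are arbitrary.
    """
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        # Drop surrounding quotes and brackets, but keep a trailing period
        token = token.strip('"\'(),;:')
        if not token:
            continue
        if token.endswith('.'):
            with_period[token[:-1].lower()] += 1
        else:
            without_period[token.lower()] += 1
    # Keep short types seen repeatedly with a period and never without one
    return sorted(w for w, n in with_period.items()
                  if n >= min_count
                  and without_period[w] == 0
                  and len(w) <= max_len)


# Made-up sample text, unlabeled: no sentence boundaries are marked up
SAMPLE = ("Mr. Green met Dr. Plum at noon. Dr. Plum greeted Mr. Green. "
          "They talked. Then Mr. Green left.")

print(guess_abbreviations(SAMPLE))
```

On this sample, "mr" and "dr" qualify, while "green" does not because it also appears without a period. A sentence detector could then decline to break after tokens in that list.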
While NLTK’s WhitespaceTokenizer,
which creates tokens by breaking a piece of text on whitespace, would have
been the simplest word tokenizer to introduce, you’re already familiar
with some of the shortcomings of blindly breaking on whitespace. Instead,
NLTK currently recommends the TreebankWordTokenizer, a word tokenizer that
operates on sentences and uses the same conventions as the Penn Treebank
Project.[54] The one thing that may catch you off guard is that the
TreebankWordTokenizer’s tokenization
does some less-than-obvious things, such as splitting contractions and
possessive nouns into separate tokens. For example, the parsing for the
sentence “I’m hungry,”
would yield separate components for “I” and “’m”, maintaining a
distinction between the subject and verb for “I’m”. As you might imagine,
finely grained access to this kind of grammatical information can be quite
valuable when it’s time to do advanced analysis that scrutinizes
relationships between subjects and verbs in sentences.
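If the contraction handling seems abstract, the following rough sketch mimics the effect with a single regular expression. Treat it as an approximation for illustration, not as the TreebankWordTokenizer’s actual rules; the pattern and the function name split_clitics are assumptions of this example, and it makes no attempt to handle quoted text or other edge cases.

```python
import re

# Alternatives are tried left to right:
#   \w+(?=n't)  a word stem that is followed by "n't" (the "do" in "don't")
#   n't         the negation clitic itself
#   '\w+        other clitics and possessives such as 'm, 're, 's
#   \w+         ordinary words
#   [^\w\s]     any single punctuation character
CLITIC_AWARE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def split_clitics(sentence):
    """Tokenize a sentence, splitting contractions and possessives apart."""
    return CLITIC_AWARE.findall(sentence)


print(split_clitics("I'm hungry,"))
print(split_clitics("Mr. Green's alibi doesn't hold up."))
```

The first call separates "I" from "'m", which is the distinction between subject and verb described above; the second shows the same treatment applied to a possessive and to a negation.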
If you have a lot of trouble with advanced word tokenizers such as
NLTK’s TreebankWordTokenizer or PunktWordTokenizer, it’s fine to
fall back to the WhitespaceTokenizer until you decide whether
it’s worth the investment to use a more advanced tokenizer. In fact, in
some cases using a more straightforward tokenizer can be advantageous.
For example, using an advanced tokenizer on data that frequently inlines
URLs might be a bad idea, because these tokenizers do not recognize URLs
out of the box and will mistakenly break them up into multiple tokens.
It’s not in the scope of this book to implement a custom tokenizer, but
there are lots of online sources you can consult if this is something you’re interested in
attempting.
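The URL problem is easy to see with a small experiment. In the sketch below, a simple punctuation-splitting regular expression stands in for a more advanced tokenizer (it is an assumption of this example, not what NLTK’s tokenizers actually do internally), and the URL is a made-up placeholder.

```python
import re

# Hypothetical sample text with an inlined URL
text = "Read more at http://example.com/posts/nlp.html today"

# Splitting on whitespace keeps the URL in one piece
whitespace_tokens = text.split()

# A punctuation-aware pattern breaks the URL apart at every ':', '/' and '.'
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_tokens)
```

The whitespace split yields five tokens with the URL intact, whereas the punctuation-aware split scatters it across many small tokens that would pollute any frequency counts computed afterward.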
Given a sentence tokenizer and a word tokenizer, we can first parse
the text into sentences and then parse each sentence into tokens. Note
that while this approach is fairly intuitive, it has a subtle
Achilles’ heel: errors produced by the sentence detector propagate
forward and can cap the quality that the
rest of the NLP stack can produce. For example, if the sentence tokenizer
mistakenly breaks a sentence on the period after “Mr.” that appears in a
section of text such as “Mr. Green killed Colonel Mustard in the study
with the candlestick”, it may not be possible to extract the entity “Mr.
Green” from the text unless specialized repair logic is in place. Again,
it all depends on the sophistication of the full NLP stack and how it
accounts for error propagation. The out-of-the-box PunktSentenceTokenizer
is trained on the Penn
Treebank corpus and performs quite well. The end goal of the parsing is to
instantiate a handy-dandy FreqDist object, which expects a list of tokens.
The remainder of the code in Example 8-2 is straightforward usage of
a few of the commonly used NLTK APIs.
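To make the “Mr.” failure concrete, here’s a small sketch that contrasts a naive splitter with one that consults a short abbreviation list. Neither function is how NLTK does it; the names naive_split and abbrev_aware_split, the ABBREVIATIONS set, and the second sentence of the sample text are all assumptions made for illustration.

```python
import re

# First sentence is the example from the text; the second is made up
TEXT = ("Mr. Green killed Colonel Mustard in the study with the candlestick. "
        "He fled the scene.")

# Hypothetical, hand-picked abbreviation list (lowercase, without the period)
ABBREVIATIONS = {"mr", "mrs", "dr"}


def naive_split(text):
    """Break after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def abbrev_aware_split(text):
    """Like naive_split, but never break right after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token[:-1].lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


print(naive_split(TEXT))
print(abbrev_aware_split(TEXT))
```

The naive version strands “Mr.” in a sentence of its own, so no later stage that works sentence by sentence will ever see “Mr. Green” as a unit. The abbreviation-aware version returns the two sentences you’d expect.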
The aim of this section was to familiarize you with the first step involved in building an NLP pipeline. Along the way, we developed a few metrics that make a feeble attempt at characterizing some blog data. Our pipeline doesn’t involve part-of-speech tagging or chunking (yet), but it should give you a basic understanding of some concepts and get you thinking about some of the subtler issues involved. While it’s true that we could have simply split on whitespace, counted terms, tallied the results, and still gained a lot of information from the data, it won’t be long before you’ll be glad that you took these initial steps toward a deeper understanding of the data. To illustrate one possible application for what you’ve just learned, in the next section, we’ll look at a simple document summarization algorithm that relies on little more than sentence segmentation and frequency analysis.
[54] “Treebank” is a very specific term that refers to a corpus that’s been specially tagged with advanced linguistic information. In fact, the reason such a corpus is called a “treebank” is to emphasize that it’s a bank (think: collection) of sentences that have been parsed into trees adhering to a particular grammar.