Sentence Detection in Blogs with NLTK

Given that sentence detection is probably the first task you’ll want to ponder when building an NLP stack, it makes sense to start there. Even if you never complete the remaining tasks in the pipeline, it turns out that EOS detection alone yields some powerful possibilities such as document summarization, which we’ll be considering as a follow-up exercise. But first, we’ll need to fetch some high-quality blog data. Let’s use the tried and true feedparser module, which you can easy_install if you don’t have it already, to fetch some posts from the O’Reilly Radar blog. The listing in Example 8-1 fetches a few posts and saves them to a local file as plain old JSON, since nothing else in this chapter hinges on the capabilities of a more advanced storage medium, such as CouchDB. As always, you can choose to store the posts anywhere you’d like.

Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)

# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime as dt
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

# Example feed:
# http://feeds.feedburner.com/oreilly/radar/atom
FEED_URL = sys.argv[1]


def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]


fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

if not os.path.isdir('out'):
    os.mkdir('out')

out_file = '%s__%s.json' % (fp.feed.title, dt.utcnow())
f = open(os.path.join(os.getcwd(), 'out', out_file), 'w')
f.write(json.dumps(blog_posts))
f.close()

print >> sys.stderr, 'Wrote output file to %s' % (f.name, )
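
If you want a quick sanity check that the harvest worked before moving on, something like the following sketch will do; it assumes the file-naming convention from Example 8-1 and simply loads the most recently written JSON file from the out directory and prints a few titles.

# -*- coding: utf-8 -*-

# A quick sanity check: load the most recently written JSON file from
# the 'out' directory and print a few titles to confirm the harvest
# worked. The glob pattern assumes the naming convention in Example 8-1.

import os
import glob
import json

out_files = sorted(glob.glob(os.path.join('out', '*.json')),
                   key=os.path.getmtime)
blog_posts = json.loads(open(out_files[-1]).read())

print 'Loaded %s posts' % len(blog_posts)
for post in blog_posts[:3]:
    print post['title'], '->', post['link']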

Obtaining our unstructured text from a reputable source affords us the luxury of assuming good English grammar; hopefully this also means that one of NLTK’s out-of-the-box sentence detectors will work reasonably well. There’s no better way to find out than hacking some code to see what happens, so go ahead and take a gander at the code listing in Example 8-2. It introduces the sent_tokenize and word_tokenize methods, which are aliases for NLTK’s currently recommended sentence detector and word tokenizer. A brief discussion of the listing is provided afterward.

Example 8-2. Using NLTK’s NLP tools to parse blog data (blogs_and_nlp__sentence_detection.py)

# -*- coding: utf-8 -*-

import sys
import json
import nltk

# Load in output from blogs_and_nlp__get_feed.py

BLOG_DATA = sys.argv[1]
blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    "'s",
    '?',
    ')',
    '(',
    ':',
    "'",
    "'re",
    '"',
    '-',
    '}',
    '{',
    ]

for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Basic stats

    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once

    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = [w for w in fdist.items() if w[0]
                                    not in stop_words][:10]

    print post['title']
    print '\tNum Sentences:'.ljust(25), len(sentences)
    print '\tNum Words:'.ljust(25), num_words
    print '\tNum Unique Words:'.ljust(25), num_unique_words
    print '\tNum Hapaxes:'.ljust(25), num_hapaxes
    print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
        '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                       for w in top_10_words_sans_stop_words])
    print

The first things you’re probably wondering about are the sent_tokenize and word_tokenize calls. NLTK provides several options for tokenization, but it provides “recommendations” as to the best available via these aliases. At the time of this writing (you can double-check this with pydoc at any time), the sentence detector is the PunktSentenceTokenizer and the word tokenizer is the TreebankWordTokenizer. Let’s take a brief look at each of these.
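
If you’d like to double-check the mapping for yourself without leaving the interpreter, printing the docstrings for the aliases is a quick substitute for running pydoc from a terminal:

import nltk.tokenize

# Print the docstrings for the convenience aliases to see which
# tokenizer classes they currently delegate to
print nltk.tokenize.sent_tokenize.__doc__
print nltk.tokenize.word_tokenize.__doc__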

Internally, the PunktSentenceTokenizer relies heavily on being able to detect abbreviations as part of collocation patterns, and it uses some regular expressions to try to intelligently parse sentences by taking into account common patterns of punctuation usage. A full explanation of the innards of the PunktSentenceTokenizer’s logic is outside the scope of this book, but Tibor Kiss and Jan Strunk’s original paper, “Unsupervised Multilingual Sentence Boundary Detection,” describes the approach in a highly readable way, and you should take some time to review it. As we’ll see in a bit, it is possible to instantiate the PunktSentenceTokenizer with sample text that it trains on to try to improve its accuracy. The underlying algorithm is an unsupervised learning algorithm; it does not require you to explicitly mark up the sample training data in any way. Instead, the algorithm inspects certain features that appear in the text itself, such as the use of capitalization and the co-occurrences of tokens, to derive suitable parameters for breaking the text into sentences.
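
To make the training step concrete, here is a minimal sketch that trains a PunktSentenceTokenizer on the raw text of the harvested posts and then segments one of them; whether the custom model actually improves on the default depends entirely on your data.

# -*- coding: utf-8 -*-

# A minimal sketch of training a custom Punkt model: the unsupervised
# algorithm derives its parameters (abbreviations, collocations,
# sentence starters) directly from the raw sample text, so no
# hand-labeled training data is required.

import sys
import json
from nltk.tokenize.punkt import PunktSentenceTokenizer

BLOG_DATA = sys.argv[1]  # output from blogs_and_nlp__get_feed.py
blog_data = json.loads(open(BLOG_DATA).read())

# Train on the concatenated text of all of the posts...
sample_text = ' '.join([post['content'] for post in blog_data])
custom_tokenizer = PunktSentenceTokenizer(sample_text)

# ...and segment one of the posts with the freshly trained model
for sentence in custom_tokenizer.tokenize(blog_data[0]['content'])[:5]:
    print sentence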

While NLTK’s WhitespaceTokenizer, which creates tokens by breaking a piece of text on whitespace, would have been the simplest word tokenizer to introduce, you’re already familiar with some of the shortcomings of blindly breaking on whitespace. Instead, NLTK currently recommends the TreebankWordTokenizer, a word tokenizer that operates on sentences and uses the same conventions as the Penn Treebank Project.[54] The one thing that may catch you off guard is that the TreebankWordTokenizer’s tokenization does some less-than-obvious things, such as separately tokenizing the components of contractions and the possessive forms of nouns. For example, parsing the sentence “I’m hungry” would yield separate components for “I” and “’m”, maintaining a distinction between the subject and verb for “I’m”. As you might imagine, finely grained access to this kind of grammatical information can be quite valuable when it’s time to do advanced analysis that scrutinizes relationships between subjects and verbs in sentences.
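
A quick interactive experiment makes the contrast with naive whitespace splitting obvious; the following sketch tokenizes a contrived sentence both ways so you can compare the output for yourself.

# -*- coding: utf-8 -*-

# Contrast Treebank conventions with naive whitespace splitting: the
# TreebankWordTokenizer separates the contraction "I'm" into "I" and
# "'m" and peels trailing punctuation off of words, preserving the
# subject/verb distinction discussed above.

import nltk

sentence = "I'm hungry, and Mr. Green's study is locked."

print nltk.tokenize.TreebankWordTokenizer().tokenize(sentence)
print sentence.split()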

Note

If you have a lot of trouble with advanced word tokenizers such as NLTK’s TreebankWordTokenizer or PunktWordTokenizer, it’s fine to default back to the WhitespaceTokenizer until you decide whether it’s worth the investment to use a more advanced tokenizer. In fact, in some cases using a more straightforward tokenizer can be advantageous. For example, using an advanced tokenizer on data that frequently inlines URLs might be a bad idea, because these tokenizers do not recognize URLs out of the box and will mistakenly break them up into multiple tokens. It’s not in the scope of this book to implement a custom tokenizer, but there are lots of online sources you can consult if this is something you’re interested in attempting.
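
As a quick illustration of that trade-off, compare how the two tokenizers handle a sentence that inlines a URL: the WhitespaceTokenizer keeps the URL intact as a single token simply because it only splits on whitespace, while the more aggressive tokenizer generally breaks it apart at internal punctuation.

# -*- coding: utf-8 -*-

# Compare tokenizers on text that inlines a URL. Splitting only on
# whitespace leaves the URL as one token; a tokenizer that splits on
# punctuation is likely to carve it into pieces, which may or may not
# be what you want.

import nltk

text = 'Read the post at http://radar.oreilly.com/ for more details.'

print nltk.tokenize.WhitespaceTokenizer().tokenize(text)
print nltk.tokenize.TreebankWordTokenizer().tokenize(text)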

Given a sentence tokenizer and a word tokenizer, we can first parse the text into sentences and then parse each sentence into tokens. Note that while this approach is fairly intuitive, it can have a subtle Achilles’ heel in that errors produced by the sentence detector propagate forward and can potentially bound the upper limit of the quality that the rest of the NLP stack can produce. For example, if the sentence tokenizer mistakenly breaks a sentence on the period after “Mr.” that appears in a section of text such as “Mr. Green killed Colonel Mustard in the study with the candlestick”, it may not be possible to extract the entity “Mr. Green” from the text unless specialized repair logic is in place. Again, it all depends on the sophistication of the full NLP stack and how it accounts for error propagation. The out-of-the-box PunktSentenceTokenizer is trained on the Penn Treebank corpus and performs quite well. The end goal of the parsing is to instantiate a handy-dandy FreqDist object, which expects a list of tokens. The remainder of the code in Example 8-2 is straightforward usage of a few of the commonly used NLTK APIs.
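
The whole pipeline distills down to just a few lines, and it’s easy to spot-check how the default sentence detector copes with the “Mr.” abbreviation; the following sketch simply prints what comes out so that you can judge it for yourself.

# -*- coding: utf-8 -*-

# The essential pipeline: sentences -> tokens -> frequency distribution.
# Running the default sentence detector over text containing "Mr." is a
# quick way to check whether its abbreviation handling is good enough
# for your purposes.

import nltk

txt = 'Mr. Green killed Colonel Mustard in the study with the ' + \
      'candlestick. Mr. Green is not a very nice fellow.'

sentences = nltk.tokenize.sent_tokenize(txt)
print sentences  # how many sentences did the detector find?

# FreqDist just wants a flat list of tokens
tokens = [t.lower() for s in sentences
          for t in nltk.tokenize.word_tokenize(s)]
fdist = nltk.FreqDist(tokens)

print fdist['green'], len(tokens), len(fdist.keys())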

The aim of this section was to familiarize you with the first step involved in building an NLP pipeline. Along the way, we developed a few metrics that make a feeble attempt at characterizing some blog data. Our pipeline doesn’t involve part-of-speech tagging or chunking (yet), but it should give you a basic understanding of some concepts and get you thinking about some of the subtler issues involved. While it’s true that we could have simply split on whitespace, counted terms, tallied the results, and still gained a lot of information from the data, it won’t be long before you’ll be glad that you took these initial steps toward a deeper understanding of the data. To illustrate one possible application for what you’ve just learned, in the next section, we’ll look at a simple document summarization algorithm that relies on little more than sentence segmentation and frequency analysis.



[54] “Treebank” is a very specific term that refers to a corpus that’s been specially tagged with advanced linguistic information. In fact, the reason such a corpus is called a “treebank” is to emphasize that it’s a bank (think: collection) of sentences that have been parsed into trees adhering to a particular grammar.
