Chapter 8. Blogs et al.: Natural Language Processing (and Beyond)

This chapter is a modest attempt to introduce Natural Language Processing (NLP) and apply it to the unstructured data in blogs. In the spirit of the prior chapters, it attempts to present the minimal level of detail required to empower you with a solid general understanding of an inherently complex topic, while also providing enough of a technical drill-down that you’ll be able to immediately get to work mining some data. Although we’ve been regularly cutting corners and taking a Pareto-like approach—giving you the crucial 20% of the skills that you can use to do 80% of the work—the corners we’ll cut in this chapter are especially pronounced because NLP is just that complex. No chapter of any book (or any small multivolume set of books, for that matter) could possibly do it justice. This chapter is a pragmatic introduction that’ll give you enough information to do some pretty amazing things, like automatically generating abstracts from documents and extracting lists of important entities, but we will not journey very far into topics that would require multiple dissertations to sort out.

Although it’s not absolutely necessary that you have read Chapter 7 before diving into this chapter, it’s highly recommended that you do. A good understanding of NLP presupposes an appreciation and working knowledge of some of the fundamental strengths and weaknesses of TF-IDF, vector space models, and the like. In that regard, these two chapters are somewhat more tightly coupled than most other chapters in this book. The specific data source used in this chapter is blogs, but as was the case in Chapter 7, just about any source of text could be used. Blogs just happen to be a staple of the social web that is inherently well suited to text mining. And besides, the line between blog posts and articles is getting quite blurry these days!

NLP: A Pareto-Like Introduction

The opening section of this chapter is mostly an expository discussion that attempts to illustrate the difficulty of NLP and explain how it differs from the techniques introduced in previous chapters. The section after it, however, gets right down to business with some sample code to get you on your way.

Syntax and Semantics

You may recall from Chapter 7 that perhaps the most fundamental weakness of TF-IDF and cosine similarity is that these models don’t require a deep semantic understanding of the data. Quite the contrary: the examples in that chapter were able to take advantage of very basic syntax—tokens separated by whitespace—to break an otherwise opaque document into a bag of tokens, and then used frequency and simple statistical similarity metrics to determine which tokens were likely to be important in the data. Although you can do some really amazing things with these techniques, they don’t give you any notion of what a given token means in the context in which it appears in the document. Look no further than a sentence containing a homograph[51] such as “fish” or “bear” as a case in point; either one could be a noun or a verb.
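To make the limitation concrete, here’s a minimal interpreter sketch (with made-up sample sentences) showing that whitespace tokenization collapses “fish” the verb and “fish” the noun into the very same token:

>>> verb_use = "I fish for trout in the stream"
>>> noun_use = "The fish swims in the stream"
>>> sorted(set(verb_use.lower().split()) & set(noun_use.lower().split()))
['fish', 'in', 'stream', 'the']

A bag-of-tokens model sees an identical 'fish' token in both sentences, so frequency and similarity calculations treat the two usages as interchangeable.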

NLP is inherently complex and difficult to do even reasonably well, and completely nailing it for a large set of commonly spoken languages may very well be the problem of the century. After all, a complete mastery of NLP is practically synonymous with acing the Turing Test, and a computer program that achieves it would demonstrate an uncanny amount of human-like intelligence, even to the most careful observer. Whereas structured or semi-structured sources are essentially collections of records with some presupposed meaning given to each field that can immediately be analyzed, there are more subtle considerations to be handled with natural language data, even for seemingly simple tasks. For example, let’s suppose you’re given a document and asked to count the number of sentences in it. That’s a trivial task if you’re a human with even a basic understanding of English grammar, but it’s another story entirely for a machine, which will require a complex and detailed set of instructions to complete the same task.

The encouraging news is that machines can detect the ends of sentences on relatively well-formed data very quickly and with nearly perfect accuracy. However, even if you’ve accurately detected all of the sentences, there’s still a lot that you probably don’t know about the ways that words or phrases are used in those sentences. For example, consider the now classic circa-1990 phrase, “That’s the bomb”.[52] It’s a trivially parseable sentence consisting of almost nothing except a subject, predicate, and object. However, without additional context, even if you have perfect information about the components of the sentence, you have no way of knowing the meaning of the word “bomb”: it could be “something really cool,” or a nuke capable of immense destruction. The point is that even with perfect information about the structure of a sentence, you may still need additional context outside the sentence to interpret it. In this case, you need to resolve what the pronoun “that” really refers to[53] and do some inferencing about whether “that” is likely to be a dangerous weapon or not. Thus, as an overly broad generalization, we can say that NLP is fundamentally about taking an opaque document that consists of an ordered collection of symbols adhering to proper syntax and a well-defined grammar, and deducing the semantics associated with those symbols.

A Brief Thought Exercise

Let’s get back to the task of detecting sentences, the first step in most NLP pipelines, to illustrate some of the complexity involved in NLP. It’s deceptively easy to overestimate the utility of simple rule-based heuristics, and it’s important to work through an exercise so that you realize what some of the key issues are and don’t waste time trying to reinvent the wheel.

Your first attempt at solving the sentence detection problem might be to just count the periods, question marks, and exclamation points in the sentence. That’s the most obvious heuristic for starting out, but it’s quite crude and has the potential for producing an extremely high margin of error. Consider the following (pretty unambiguous) accusation:

“Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.”
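Before reaching for anything smarter, it’s worth seeing just how crude that heuristic is. Here’s a minimal interpreter sketch that counts the sentence-final punctuation in the sample text:

>>> txt = "Mr. Green killed Colonel Mustard in the study with the " \
... "candlestick. Mr. Green is not a very nice fellow."
>>> sum([txt.count(p) for p in '.?!'])
4

The heuristic reports four sentences where there are only two, because the periods in “Mr.” look exactly like end-of-sentence boundaries.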

Simply tokenizing the text by splitting on punctuation (specifically, periods) fares no better:

>>> txt.split(".")
['Mr', ' Green killed Colonel Mustard in the study with the candlestick', 
 ' Mr', ' Green is not a very nice fellow', '']

It should be immediately obvious that performing sentence detection by blindly breaking on periods, without incorporating some notion of context or higher-level information, is insufficient. In this case, the problem is the use of “Mr.”, a valid abbreviation that’s commonly used in the English language. Although we already know from Chapter 7 that n-gram analysis of this sample would likely tell us that “Mr. Green” is really one collocation or chunk (a compound token containing whitespace), if we had a larger amount of text to analyze, it’s not hard to imagine other edge cases that would be difficult to detect based on the appearance of collocations. Thinking ahead a bit, it’s also worth pointing out that finding the key topics in a sentence isn’t easy to accomplish with trivial logic either. As an intelligent human, you can easily discern that the key topics in our sample might be “Mr. Green”, “Colonel Mustard”, “the study”, and “the candlestick”, but training a machine to tell you the same things is a complex task. A few obvious possibilities probably occur to you, such as doing some “Title Case” detection with a regular expression, constructing a list of common abbreviations to parse out the proper noun phrases, and applying some variation of that logic to the problem of finding end-of-sentence boundaries so that you don’t get into trouble on that front.
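For instance, here’s a minimal sketch of that kind of abbreviation-aware logic, reusing txt from above (the abbreviation list is an illustrative stand-in, not anything definitive):

>>> ABBREVIATIONS = ['mr.', 'mrs.', 'dr.', 'col.', 'st.']
>>> def naive_sentences(text):
...     sentences, current = [], []
...     for token in text.split():
...         current.append(token)
...         # Treat '.', '?', or '!' as a sentence boundary unless the
...         # token is a known abbreviation such as 'Mr.'
...         if token[-1] in '.?!' and token.lower() not in ABBREVIATIONS:
...             sentences.append(' '.join(current))
...             current = []
...     if current:
...         sentences.append(' '.join(current))
...     return sentences
... 
>>> naive_sentences(txt)
['Mr. Green killed Colonel Mustard in the study with the candlestick.', 
 'Mr. Green is not a very nice fellow.']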

OK, sure. Those things will work for some examples, but what’s the margin of error going to be like for arbitrary English text? How forgiving is your algorithm for poorly formed English text; highly abbreviated information such as text messages or tweets; or (gasp) Romance languages, such as Spanish, French, or Italian? There are no simple answers here, and that’s why text analytics is such an important topic in an age when the amount of accessible textual data is increasing every second. These things aren’t pointed out to discourage you. They’re mentioned to motivate you to keep trying when times get tough, because this is an inherently difficult space that no one has completely conquered yet. Even if you do feel deflated, you won’t feel that way for long, because NLTK actually performs reasonably well out of the box for many situations involving arbitrary text, as we’ll see in the next section.
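As a quick preview of just how much heavy lifting NLTK can do for you, here’s what its built-in sentence tokenizer makes of the same sample text. (This sketch assumes that NLTK is installed and that its punkt sentence-tokenization model has been downloaded, e.g., via nltk.download('punkt').)

>>> import nltk
>>> nltk.sent_tokenize(txt)
['Mr. Green killed Colonel Mustard in the study with the candlestick.', 
 'Mr. Green is not a very nice fellow.']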



[51] A homonym is a special case of a homograph. Two words are homographs if they have the same spelling; they are homonyms if they have the same spelling and the same pronunciation. For some reason, “homonym” seems to be more common in everyday usage than “homograph,” even when it’s being misused.

[52] See also Urban Dictionary’s definitions for bomb.

[53] Resolving pronouns is called anaphora resolution, a topic that’s well outside the scope of this book.
