Bigram Analysis

As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open”, “source”, and “government”, could you necessarily say that the text is probably about “open source”, “open government”, both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify the nature of a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the words most frequently occur after “open”—i.e., you’d like to find the collocations that start with the token “open”.

Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. There are always T – n + 1 n-grams for a text containing T tokens, and if you were to consider all of the bigrams (2-grams) for the sequence of tokens ["Mr.", "Green", "killed", "Colonel", "Mustard"], you’d have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")]. You’d need a larger sample of text than just our sample sentence to determine collocations, but assuming you had background knowledge or additional text, the next step would be to statistically analyze the bigrams in order to determine which of them are likely to be collocations.
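To make the definition concrete, here is a minimal pure-Python sketch of the same computation (the compute_ngrams function is purely illustrative; NLTK provides the equivalent via nltk.ngrams, as Example 7-8 will show):

# A sketch of n-gram computation: each consecutive window of n tokens becomes
# a tuple, yielding len(tokens) - n + 1 n-grams in all
def compute_ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Mr.", "Green", "killed", "Colonel", "Mustard"]
print(compute_ngrams(tokens, 2))
# [('Mr.', 'Green'), ('Green', 'killed'), ('killed', 'Colonel'), ('Colonel', 'Mustard')]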

Warning

It’s worth noting that the storage necessary for persisting an n-gram model requires space for (T–n+1)*n tokens (which is practically T*n), where T is the number of tokens in question and n is defined by the size of the desired n-gram. As an example, assume a document contains 1,000 tokens and requires around 8 KB of storage. Storing all bigrams for the text would require roughly double the original storage, or 16 KB, as you would be storing 999*2 tokens plus overhead. Storing all trigrams for the text (998*3 tokens plus overhead) would require roughly triple the original storage, or 24 KB. Thus, without devising specialized data structures or compression schemes, the storage costs for n-grams can be estimated as n times the original storage requirement for any value of n.
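As a quick back-of-the-envelope check on those numbers, a sketch like the following reproduces the estimates from the warning (the figure of 8 bytes per token is an assumption derived from the 1,000-token/8 KB example above):

# Rough storage estimate for an n-gram model: (T - n + 1) n-grams of n tokens
# each, which approaches T*n tokens when T is much larger than n
def ngram_storage_estimate(num_tokens, n, bytes_per_token=8):
    return (num_tokens - n + 1) * n * bytes_per_token

print(ngram_storage_estimate(1000, 2))  # 15984 bytes, i.e., roughly 16 KB for bigrams
print(ngram_storage_estimate(1000, 3))  # 23952 bytes, i.e., roughly 24 KB for trigrams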

n-grams are very simple yet very powerful as a technique for clustering commonly co-occurring words. If you compute all of the n-grams for even a small value of n, you’re likely to discover that some interesting patterns emerge from the text itself with no additional work required. For example, in considering the bigrams for a sufficiently long text, you’re likely to discover proper names such as “Mr. Green” and “Colonel Mustard”, concepts such as “open source” or “open government”, and so forth. Similar patterns emerge when you consider frequent trigrams and n-grams for values of n slightly larger than three. In fact, computing bigrams in this way produces essentially the same results as the collocations function that you ran in an earlier interpreter session, except that the collocations function applies some additional statistical analysis that takes the use of rare words into account. As you already know from the interpreter session we looked at earlier in this chapter (Example 7-2), NLTK takes care of most of the effort in computing n-grams, discovering collocations from text, discovering the context in which a token has been used, and so on. Example 7-8 demonstrates.

Example 7-8. Using NLTK to compute bigrams and collocations for a sentence

>>> import nltk
>>> nltk.ngrams("Mr. Green killed Colonel Mustard in the study with the \
... candlestick. Mr. Green is not a very nice fellow.".split(), 2)
[('Mr.', 'Green'), ('Green', 'killed'), ('killed', 'Colonel'), 
 ('Colonel', 'Mustard'), ('Mustard', 'in'), ('in', 'the'), 
 ('the', 'study'), ('study', 'with'), ('with', 'the'), 
 ('the', 'candlestick.'), ('candlestick.', 'Mr.'), ('Mr.', 'Green'), 
 ('Green', 'is'), ('is', 'not'), ('not', 'a'), ('a', 'very'), 
 ('very', 'nice'), ('nice', 'fellow.')]
>>> txt = nltk.Text("Mr. Green killed Colonel Mustard in the study with the \
... candlestick. Mr. Green is not a very nice fellow.".split())
>>> txt.collocations()
Building collocations list
Mr. Green

Recall that the only drawback to using built-in demo functionality such as nltk.Text.collocations is that these functions don’t return data structures that you can store and manipulate. Whenever you run into such a situation, just take a look at the source code, which is usually pretty easy to adapt for your own purposes. Example 7-9 illustrates how you could compute the collocations and concordance indexes for a collection of tokens and maintain control of the results.

Example 7-9. Using NLTK to compute collocations in a similar manner to the nltk.Text.collocations demo functionality (plus__collocations.py)

# -*- coding: utf-8 -*-

import sys
import json
import nltk

# Load in human readable text from wherever you've saved it

DATA = sys.argv[1]
N = 25
data = json.loads(open(DATA).read())

all_tokens = [token for activity in data
              for token in activity['object']['content'].lower().split()]

# Find bigrams that occur at least twice and that don't contain stopwords,
# then rank them with the Jaccard metric
finder = nltk.BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(2)
stop_words = set(nltk.corpus.stopwords.words('english'))
finder.apply_word_filter(lambda w: w in stop_words)
scorer = nltk.metrics.BigramAssocMeasures.jaccard
collocations = finder.nbest(scorer, N)

for collocation in collocations:
    print(' '.join(collocation))

In short, the implementation loosely follows NLTK’s collocations demo function. It filters out bigrams that don’t appear more than a minimum number of times (two, in this case) and then applies a scoring metric to rank the results. In this instance, the scoring function is the well-known Jaccard Index, as defined by nltk.metrics.BigramAssocMeasures.jaccard. A contingency table is used by the BigramAssocMeasures class to rank the co-occurrence of terms in any given bigram as compared to the possibilities of other words that could have appeared in the bigram. Conceptually, the Jaccard Index measures the similarity of sets, and in this case, the sample sets are specific comparisons of bigrams that appeared in the text. The details of how contingency tables and Jaccard values are calculated are arguably an advanced topic, but the next section, “How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions,” provides an extended discussion of those details, since they’re critical to a good understanding of collocation detection.

In the meantime, though, let’s examine Example 7-10. This example shows some output from Tim’s Google+ data that should make it pretty apparent that returning scored bigrams is immensely more powerful than only returning tokens, because of the additional context that grounds the terms in meaning.

Example 7-10. Sample results from Example 7-9

ada lovelace
jennifer pahlka
hod lipson
pine nuts
safe, welcoming
1st floor,
5 southampton
7ha cost:
bcs, 1st
borrow 42
broadcom masters
building, 5
date: friday
disaster relief
dissolvable sugar
do-it-yourself festival,
dot com
fabric samples
finance protection
london, wc2e
maximizing shareholder
patron profiles
portable disaster
rural co
vat tickets:

Keeping in mind that no special heuristics or tactics that could have inspected the text for proper names based on Title Case were employed, it’s actually quite amazing that so many proper names and common phrases were sifted out of the data. There’s still a certain amount of inevitable noise in the results because we have not yet made any effort to clean punctuation from the tokens, but for the small amount of work we’ve put in, the results are really quite good. This might be a good time to mention that even if reasonably good natural language processing capabilities were employed, it might still be difficult to eliminate all the noise from the results of textual analysis. Getting comfortable with the noise and finding heuristics to control it is a good idea until you get to the point where you’re willing to make a significant investment in obtaining the perfect results that a well-educated human would be able to pick out from the text.

Hopefully, the primary observation you’re making at this point is that with very little effort and time invested, we’ve been able to use another very basic technique to draw out some pretty powerful meaning from some free text data, and the results seem to be pretty representative of what we already suspect should be true. This is encouraging, because it suggests that applying the same technique to anyone else’s Google+ data (or any other kind of unstructured text, for that matter) would potentially be just as informative, giving you a quick glimpse into key items that are being discussed. And just as importantly, while the data in this case probably confirms a few things you may already know about Tim O’Reilly, you may have learned a couple of new things, too—one of which is that he might just have a soft spot for Ada Lovelace, as evidenced by “ada lovelace” showing up in the collocation results. While it would be easy enough to use the concordance, a regular expression, or even the Python string type’s built-in find method to find posts relevant to “ada lovelace”, let’s instead take advantage of the code we developed in Example 7-5 and use TF-IDF to query for “ada lovelace.” What comes back? Survey says:

I just got an email from +Suw Charman about Ada Lovelace Day, 
and thought I'd share it here, sinc...
	Link: https://plus.google.com/107033731246200681024/posts/1XSAkDs9b44
	Score: 0.198150014715

And there you have it. The “ada lovelace” query leads us to some content about Ada Lovelace Day. You’ve effectively started with a nominal (if that) understanding of the text, narrowed in on some interesting topics using collocation analysis, and searched the text for one of those interesting topics using TF-IDF. There’s no reason you couldn’t also use cosine similarity at this point to find the most similar post to the one about the lovely Ada Lovelace (or whatever it is that you’re keen to investigate).

How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions

Note

This section dives into some of the more technical details of how BigramCollocationFinder and the Jaccard scoring function from Example 7-9 work. If this is your first reading of the chapter or you’re not interested in these details, feel free to skip this section and come back to it later.

A common data structure that’s used to compute metrics related to bigrams is the contingency table. The purpose of a contingency table is to compactly express the frequencies associated with the various possibilities for appearance of different terms of a bigram. Take a look at the bold entries in Table 7-5, where token1 expresses the existence of token1 in the bigram, and ~token1 expresses that token1 does not exist in the bigram.

Table 7-5. Contingency table example—values in italics represent “marginals,” and values in bold represent frequency counts of bigram variations

         token1                        ~token1
token2   frequency(token1, token2)     frequency(~token1, token2)     frequency(*, token2)
~token2  frequency(token1, ~token2)    frequency(~token1, ~token2)
         frequency(token1, *)                                         frequency(*, *)

Although there are a few details associated with which cells are significant for which calculations, hopefully it’s not difficult to see that the four middle cells in the table express the frequencies associated with the appearance of various tokens in the bigram. The values in these cells can be used to compute different similarity metrics that score and rank bigrams in order of likely significance, as was the case with the previously introduced Jaccard Index, which we’ll dissect in just a moment. First, however, let’s briefly discuss how the terms for the contingency table are computed.

The way that the various entries in the contingency table are computed is directly tied to which data structures you have precomputed or otherwise have available. If you assume that you only have available a frequency distribution for the various bigrams in the text, the way to calculate frequency(token1, token2) is a direct lookup, but what about frequency(~token1, token2)? With no other information available, you’d need to scan every single bigram for the appearance of token2 in the second slot and subtract frequency(token1, token2) from that value. (Take a moment to convince yourself that this is true if it isn’t obvious.)
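For instance, given only a frequency distribution of bigrams, a brute-force sketch of that computation might look like the following (the freq_not_token1_token2 function and the sample tokens are purely illustrative):

import nltk

tokens = "mr. green killed colonel mustard in the study mr. green is not nice".split()
bigram_fd = nltk.FreqDist(nltk.ngrams(tokens, 2))

# Brute force: scan every bigram for token2 in the second slot, then subtract
# frequency(token1, token2) to arrive at frequency(~token1, token2)
def freq_not_token1_token2(token1, token2):
    freq_any_token2 = sum(count for (w1, w2), count in bigram_fd.items() if w2 == token2)
    return freq_any_token2 - bigram_fd[(token1, token2)]

print(freq_not_token1_token2('mr.', 'green'))  # 0 -- "green" only ever follows "mr." here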

However, if you assume that you have a frequency distribution available that counts the occurrences of each individual token in the text (the text’s unigrams) in addition to a frequency distribution of the bigrams, there’s a much less expensive shortcut you can take that involves two lookups and an arithmetic operation. Subtract the number of times that the bigram (token1, token2) appeared from the number of times that token2 appeared as a unigram, and you’re left with the number of times the bigram (~token1, token2) appeared. For example, if the bigram (“mr.”, “green”) appeared three times and the unigram (“green”) appeared seven times, it must be the case that the bigram (~“mr.”, “green”) appeared four times (where ~“mr.” literally means “any token other than ‘mr.’”). In Table 7-5, the expression frequency(*, token2) represents the unigram token2 and is referred to as a marginal because it’s noted in the margin of the table as a shortcut. The value for frequency(token1, *) works the same way in helping to compute frequency(token1, ~token2), and the expression frequency(*, *) refers to any possible unigram and is equivalent to the total number of tokens in the text. Given frequency(token1, token2), frequency(token1, ~token2), and frequency(~token1, token2), the value of frequency(*, *) is necessary to calculate frequency(~token1, ~token2).
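Here is a sketch of the shortcut that assembles the entire contingency table from a unigram and a bigram frequency distribution (the contingency_table function and sample tokens are illustrative, and the marginals are approximated with unigram counts as described above, which ignores a small edge effect at the boundaries of the token stream):

import nltk

tokens = "mr. green killed colonel mustard in the study mr. green is not nice".split()
unigram_fd = nltk.FreqDist(tokens)
bigram_fd = nltk.FreqDist(nltk.ngrams(tokens, 2))

def contingency_table(token1, token2):
    n_11 = bigram_fd[(token1, token2)]       # frequency(token1, token2): a direct lookup
    n_01 = unigram_fd[token2] - n_11         # frequency(~token1, token2)
    n_10 = unigram_fd[token1] - n_11         # frequency(token1, ~token2)
    n_00 = len(tokens) - n_11 - n_01 - n_10  # frequency(~token1, ~token2), via frequency(*, *)
    return n_11, n_01, n_10, n_00

print(contingency_table('mr.', 'green'))  # (2, 0, 0, 11)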

Although this discussion of contingency tables may seem somewhat tangential, it’s an important foundation for understanding different scoring functions. For example, consider the Jaccard Index. Conceptually, it expresses the similarity of two sets and is defined by:

$Jaccard(Set_1, Set_2) = \frac{|Set_1 \cap Set_2|}{|Set_1 \cup Set_2|}$

In other words, that’s the number of items in common between the two sets divided by the total number of distinct items in the combined sets. It’s worth taking a moment to ponder this simple yet effective calculation. If Set1 and Set2 were identical, the union and the intersection of the two sets would be equivalent to one another, resulting in a ratio of 1.0. If both sets were completely different, the numerator of the ratio would be 0, resulting in a value of 0.0. Then there’s everything else in between. The Jaccard Index as applied to a particular bigram expresses the ratio between the frequency of a particular bigram and the sum of the frequencies with which any bigram containing a term in the bigram of interest appears. One interpretation of that metric might be that the higher the ratio is, the more likely it is that (token1, token2) appears in the text, and hence, the more likely it is that the collocation “token1 token2” expresses a meaningful concept.
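Expressed in terms of the contingency table’s cells, the calculation is a one-liner. The following sketch plugs in the numbers from the earlier “mr. green” discussion, with the added assumption (for the sake of the example) that the unigram “mr.” appeared five times:

# Jaccard for a bigram: the bigram's own frequency divided by the frequency of
# every bigram that contains at least one of its terms
def jaccard_from_contingency(n_11, n_01, n_10):
    return float(n_11) / (n_11 + n_01 + n_10)

# ("mr.", "green") appeared 3 times, "green" alone 7 times, "mr." alone 5 times
print(jaccard_from_contingency(3, 7 - 3, 5 - 3))  # 3/9, or about 0.33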

The selection of the most appropriate scoring function is usually determined based upon knowledge about the characteristics of the underlying data, some intuition, and sometimes a bit of luck. Most of the association metrics defined in nltk.metrics.association are discussed in Chapter 5 of Christopher Manning and Hinrich Schuetze’s Foundations of Statistical Natural Language Processing (MIT Press), which is conveniently available online and serves as a useful reference for the descriptions that follow. An in-depth discussion of these metrics is outside the scope of this book, but the chapter just mentioned provides a detailed account with in-depth examples. The Jaccard Index, Dice’s coefficient, and the likelihood ratio are good starting points if you find yourself needing to build your own collocation detector. They are described, along with some other key terms, in the list that follows; a short sketch comparing several of these measures in NLTK appears after the list.

Raw frequency

As its name implies, raw frequency is the ratio expressing the frequency of a particular n-gram divided by the frequency of all n-grams. It is useful for examining the overall frequency of a particular collocation in a text.

Jaccard Index

The Jaccard Index is a ratio that measures the similarity between sets. As applied to collocations, it is defined as the frequency of a particular collocation divided by the total number of collocations that contain at least one term in the collocation of interest. It is useful for determining the likelihood of whether the given terms actually form a collocation, as well as ranking the likelihood of probable collocations. Using notation consistent with previous explanations, this formulation would be mathematically defined as:

$Jaccard = \frac{frequency(token_1, token_2)}{frequency(token_1, token_2) + frequency(\sim token_1, token_2) + frequency(token_1, \sim token_2)}$
Dice’s coefficient

Dice’s coefficient is extremely similar to the Jaccard Index. The fundamental difference is that it weights agreements among the sets twice as heavily as Jaccard. It is defined mathematically as:

$Dice = \frac{2 \cdot frequency(token_1, token_2)}{frequency(*, token_2) + frequency(token_1, *)}$

Mathematically, it can be shown fairly easily that:

$Dice = \frac{2 \cdot Jaccard}{1 + Jaccard}$

You might choose to use this metric instead of the Jaccard Index when it’s more important to bias the score in favor of instances in which the collocation “token1 token2” appears.

Likelihood ratio

This metric is yet another approach to hypothesis testing that is used to measure the independence between terms that may form a collocation. It’s been shown to be a more appropriate approach for collocation discovery than the chi-square test in the general case, and it works well on data that includes many infrequent collocations. The particular calculations involved in computing likelihood estimates for collocations as implemented by NLTK assume a binomial distribution, where the parameters governing the distribution are calculated based upon the number of occurrences of collocations and constituent terms.

Chi-square

Like Student’s t-score, this metric is commonly used for testing independence between two variables and can be used to measure whether two tokens are collocations based upon Pearson’s chi-square test of statistical significance. Generally speaking, the differences obtained from applying the t-test and chi-square test are not substantial. The advantage of chi-square testing is that unlike t-testing, it does not assume an underlying normal distribution; for this reason, chi-square testing is more commonly used.

Student’s t-score

Traditionally, Student’s t-score has been used for hypothesis testing, and as applied to n-gram analysis, t-scores can be used for testing the hypothesis of whether two terms are collocations. The statistical procedure for this calculation uses a standard distribution per the norm for t-testing. An advantage of the t-score values as opposed to raw frequencies is that a t-score takes into account the frequency of a bigram relative to its constituent components. This characteristic facilitates ranking the strengths of collocations. A criticism of the t-test is that it necessarily assumes that the underlying probability distribution for collocations is normal, which is not often the case.

Pointwise Mutual Information

Pointwise Mutual Information (PMI) is a measure of how much information is gained about a particular word if you also know the value of a neighboring word. To put it another way, it measures how much one word can tell you about another. Ironically (in the context of the current discussion), the calculations involved in computing the PMI lead it to score high-frequency words lower than low-frequency words, which is the opposite of the desired effect. Therefore, it is a good measure of independence but not a good measure of dependence (i.e., a less-than-ideal choice for scoring collocations). It has also been shown that sparse data is a particular stumbling block for PMI scoring, and that other techniques such as the likelihood ratio tend to outperform it.
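As promised, here is a short sketch that runs several of these measures over the same data so you can compare their rankings side by side (the Gutenberg corpus is used only as a convenient stand-in for your own tokens, and the sketch assumes you’ve fetched it via nltk.download()):

import nltk

# Compare the top bigrams produced by several built-in association measures
tokens = [w.lower() for w in nltk.corpus.gutenberg.words('austen-emma.txt')]
finder = nltk.BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore bigrams appearing fewer than 3 times

measures = nltk.metrics.BigramAssocMeasures
for name, scorer in [('raw_freq', measures.raw_freq),
                     ('jaccard', measures.jaccard),
                     ('dice', measures.dice),
                     ('likelihood_ratio', measures.likelihood_ratio),
                     ('chi_sq', measures.chi_sq),
                     ('student_t', measures.student_t),
                     ('pmi', measures.pmi)]:
    print(name)
    print(finder.nbest(scorer, 5))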
