As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open”, “source”, and “government”, could you necessarily say that the text is probably about “open source”, “open government”, both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify the nature of a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the words most frequently occur after “open”—i.e., you’d like to find the collocations that start with the token “open”.
Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. A text containing T tokens yields T–n+1 n-grams for any given value of n, and
if you were to consider all of the bigrams (2-grams) for the sequence of
tokens ["Mr.", "Green", "killed", "Colonel",
"Mustard"]
, you’d have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed",
"Colonel"), ("Colonel", "Mustard")]
. You’d need a larger sample
of text than just our sample sentence to determine collocations, but
assuming you had background knowledge or additional text, the next step
would be to statistically analyze the bigrams in order to determine which
of them are likely to be collocations.
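The bigram computation just described can be sketched in a few lines of plain Python; the helper name `ngrams` here is our own, not NLTK's function of the same name:

```python
# A minimal sketch of computing n-grams in plain Python (no NLTK required).
def ngrams(tokens, n):
    # Slide a window of size n across the token list: T - n + 1 results
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["Mr.", "Green", "killed", "Colonel", "Mustard"]
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('Mr.', 'Green'), ('Green', 'killed'), ('killed', 'Colonel'), ('Colonel', 'Mustard')]
```

Note that the five tokens produce exactly four bigrams, consistent with the T–n+1 count.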
It’s worth noting that the storage necessary for persisting an n-gram model requires space for (T–n+1)*n tokens (which is practically T*n), where T is the number of tokens in question and n is the size of the desired n-gram. As an example, assume a document contains 1,000 tokens and requires around 8 KB of storage. Storing all bigrams for the text would require roughly double the original storage, or 16 KB, as you would be storing 999*2 tokens plus overhead. Storing all trigrams for the text (998*3 tokens plus overhead) would require roughly triple the original storage, or 24 KB. Thus, without devising specialized data structures or compression schemes, the storage costs for n-grams can be estimated as n times the original storage requirement for any value of n.
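The arithmetic behind those estimates is easy to check; the 8 KB for 1,000 tokens is the assumption carried over from the text:

```python
# Back-of-the-envelope check of the n-gram storage estimate above.
T = 1000                 # tokens in the document (assumption from the text)
kb_per_token = 8 / 1000  # ~8 KB of storage per 1,000 tokens (assumption)

estimates = {}
for n in (2, 3):
    ngram_tokens = (T - n + 1) * n      # 999*2 bigram tokens, 998*3 trigram tokens
    estimates[n] = ngram_tokens * kb_per_token

print(estimates)  # roughly {2: 16, 3: 24} KB, i.e. ~n times the original 8 KB
```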
n-grams are very simple yet very powerful as a
technique for clustering commonly co-occurring words. If you compute all
of the n-grams for even a small value of
n, you’re likely to discover that some interesting
patterns emerge from the text itself with no additional work required. For
example, in considering the bigrams for a sufficiently long text, you’re
likely to discover the proper names, such as “Mr. Green” and “Colonel
Mustard”, concepts such as “open source” or “open government”, and so
forth. Similar patterns emerge when you consider frequent trigrams and
n-grams for values of n slightly
larger than three. In fact, computing bigrams in this way produces essentially the same results as the collocations function that you ran in an earlier interpreter session, except that the collocations function applies some additional statistical analysis that accounts for rare words. As you already know from the
interpreter session we looked at earlier in this chapter (Example 7-2), NLTK takes care of most of the effort in computing
n-grams, discovering collocations from text,
discovering the context in which a token has been used, etc. Example 7-8 demonstrates.
Example 7-8. Using NLTK to compute bigrams and collocations for a sentence
>>> import nltk
>>> sentence = ("Mr. Green killed Colonel Mustard in the study with the "
...             "candlestick. Mr. Green is not a very nice fellow.")
>>> list(nltk.ngrams(sentence.split(), 2))
[('Mr.', 'Green'), ('Green', 'killed'), ('killed', 'Colonel'), ('Colonel', 'Mustard'),
 ('Mustard', 'in'), ('in', 'the'), ('the', 'study'), ('study', 'with'), ('with', 'the'),
 ('the', 'candlestick.'), ('candlestick.', 'Mr.'), ('Mr.', 'Green'), ('Green', 'is'),
 ('is', 'not'), ('not', 'a'), ('a', 'very'), ('very', 'nice'), ('nice', 'fellow.')]
>>> txt = nltk.Text(sentence.split())
>>> txt.collocations()
Mr. Green
Recall that the only drawback to using built-in demo functionality
such as nltk.Text.collocations
is that these functions don’t
return data structures that you can store and manipulate. Whenever you run
into such a situation, just take a look at the source code, which is
usually pretty easy to adapt for your own purposes. Example 7-9 illustrates how you could compute the
collocations and concordance indexes for a collection of tokens and
maintain control of the results.
Example 7-9. Using NLTK to compute collocations in a similar manner to the nltk.Text.collocations demo functionality (plus__collocations.py)
# -*- coding: utf-8 -*-

import sys
import json
import nltk

# Load in human-readable text from wherever you've saved it
DATA = sys.argv[1]
N = 25

data = json.loads(open(DATA).read())

# Flatten the content of each activity into a single list of lowercased tokens
all_tokens = [token for activity in data for token in
              activity['object']['content'].lower().split()]

finder = nltk.BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: w in nltk.corpus.stopwords.words('english'))
scorer = nltk.metrics.BigramAssocMeasures.jaccard
collocations = finder.nbest(scorer, N)

for collocation in collocations:
    c = ' '.join(collocation)
    print(c)
In short, the implementation loosely follows NLTK’s collocations
demo function. It filters out
bigrams that don’t appear at least a minimum number of times (two,
in this case) and then applies a
scoring metric to rank the results. In this instance, the scoring function is the well-known
Jaccard Index, as defined by nltk.metrics.BigramAssocMeasures.jaccard. A contingency table is used by the BigramAssocMeasures
class to rank the
co-occurrence of terms in any given bigram as compared to the
possibilities of other words that could have appeared in the bigram.
Conceptually, the Jaccard Index measures similarity of sets, and in this
case, the sample sets are specific comparisons of bigrams that appeared in
the text. The details of how contingency tables and Jaccard values are
calculated are arguably an advanced topic, but the next section, How the Collocation Sausage Is Made:
Contingency Tables and Scoring Functions, provides an extended discussion of those details
since they’re critical to a good understanding of collocation
detection.
In the meantime, though, let’s examine Example 7-10. This example shows some output from Tim’s Google+ data that should make it pretty apparent that returning scored bigrams is immensely more powerful than only returning tokens, because of the additional context that grounds the terms in meaning.
Example 7-10. Sample results from Example 7-9
ada lovelace
jennifer pahlka
hod lipson
pine nuts
safe, welcoming
1st floor,
5 southampton
7ha cost:
bcs, 1st
borrow 42
broadcom masters
building, 5
date: friday
disaster relief
dissolvable sugar
do-it-yourself festival,
dot com
fabric samples
finance protection
london, wc2e
maximizing shareholder
patron profiles
portable disaster
rural co
vat tickets:
Keeping in mind that no special heuristics or tactics that could have inspected the text for proper names based on Title Case were employed, it’s actually quite amazing that so many proper names and common phrases were sifted out of the data. There’s still a certain amount of inevitable noise in the results because we have not yet made any effort to clean punctuation from the tokens, but for the small amount of work we’ve put in, the results are really quite good. This might be a good time to mention that even if reasonably good natural language processing capabilities were employed, it might still be difficult to eliminate all the noise from the results of textual analysis. Getting comfortable with the noise and finding heuristics to control it is a good idea until you get to the point where you’re willing to make a significant investment in obtaining the perfect results that a well-educated human would be able to pick out from the text.
Hopefully, the primary observation you’re making at this point is
that with very little effort and time invested, we’ve been able to use
another very basic technique to draw out some pretty powerful meaning from
some free text data, and the results seem to be
pretty representative of what we already suspect should be
true. This is encouraging, because it suggests that applying
the same technique to anyone else’s Google+ data (or any other kind of
unstructured text, for that matter) would potentially be just as
informative, giving you a quick glimpse into key items that are being
discussed. And just as importantly, while the data in this case probably
confirms a few things you may already know about Tim O’Reilly, you may
have learned a couple of new things, too—one of which is that he might just have
a sweet spot for Ada Lovelace as evidenced by “ada lovelace” showing up in
the collocation results. While it would be easy enough to use the
concordance, a regular expression, or even the Python string type’s
built-in find
method to find posts
relevant to “ada lovelace”, let’s instead take advantage of the code we
developed in Example 7-5 and use TF-IDF to query
for “ada lovelace.” What comes back? Survey says:
I just got an email from +Suw Charman about Ada Lovelace Day, and thought I'd share it here, sinc...
Link: https://plus.google.com/107033731246200681024/posts/1XSAkDs9b44
Score: 0.198150014715
And there you have it. The “ada lovelace” query leads us to some content about Ada Lovelace Day. You’ve effectively started with a nominal (if that) understanding of the text, narrowed in on some interesting topics using collection analysis, and searched the text for one of those interesting topics using TF-IDF. There’s no reason you couldn’t also use cosine similarity at this point to find the most similar post to the one about the lovely Ada Lovelace (or whatever it is that you’re keen to investigate).
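Should you want to follow that cosine similarity thread, the computation over TF-IDF vectors can be sketched as follows; the dict-of-weights representation is an assumption here, not necessarily the data structure Example 7-5 produces:

```python
import math

# A sketch of cosine similarity between two TF-IDF vectors, each represented
# as a dict mapping term -> weight (a hypothetical representation).
def cosine_similarity(v1, v2):
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2)

# Hypothetical post vectors for illustration
post_a = {'ada': 0.8, 'lovelace': 0.6}
post_b = {'ada': 0.8, 'lovelace': 0.6}
post_c = {'open': 1.0}

print(cosine_similarity(post_a, post_b))  # ~1.0 for identical vectors
print(cosine_similarity(post_a, post_c))  # 0.0 for vectors with no shared terms
```

Ranking every other post by its similarity to the Ada Lovelace post would then be a one-line sort over these scores.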
This section dives into some of the more technical details of
how BigramCollocationFinder and the Jaccard scoring function from Example 7-9 work. If this is your first reading
of the chapter or you’re not interested in these details, feel free
to skip this section and come back to it later.
A common data structure that’s used to compute metrics related to bigrams is the contingency table. The purpose of a contingency table is to compactly express the frequencies associated with the various possibilities for appearance of different terms of a bigram. Take a look at the four inner cells in Table 7-5, where token1 expresses the existence of token1 in the bigram, and ~token1 expresses that token1 does not exist in the bigram.
Table 7-5. Contingency table example—the values in the rightmost column and bottom row are “marginals,” and the four inner cells are frequency counts of bigram variations
         |  token1                      |  ~token1                      |
token2   |  frequency(token1, token2)   |  frequency(~token1, token2)   |  frequency(*, token2)
~token2  |  frequency(token1, ~token2)  |  frequency(~token1, ~token2)  |
         |  frequency(token1, *)        |                               |  frequency(*, *)
Although there are a few details associated with which cells are significant for which calculations, hopefully it’s not difficult to see that the four inner cells in the table express the frequencies associated with the appearance of various tokens in the bigram. The values in these cells can be used to compute different similarity metrics that score and rank bigrams in order of likely significance, as was the case with the previously introduced Jaccard Index, which we’ll dissect in just a moment. First, however, let’s briefly discuss how the terms for the contingency table are computed.
The way that the various entries in the contingency table are computed is directly tied to which data structures you have precomputed or otherwise have available. If you assume that you only have available a frequency distribution for the various bigrams in the text, the way to calculate frequency(token1, token2) is a direct lookup, but what about frequency(~token1, token2)? With no other information available, you’d need to scan every single bigram for the appearance of token2 in the second slot and subtract frequency(token1, token2) from that value. (Take a moment to convince yourself that this is true if it isn’t obvious.)
However, if you assume that you have a frequency distribution available that counts the occurrences of each individual token in the text (the text’s unigrams) in addition to a frequency distribution of the bigrams, there’s a much less expensive shortcut you can take that involves two lookups and an arithmetic operation. Subtract the number of times that the bigram (token1, token2) appeared from the number of times that token2 appeared as a unigram, and you’re left with the number of times the bigram (~token1, token2) appeared. For example, if the bigram (“mr.”, “green”) appeared three times and the unigram (“green”) appeared seven times, it must be the case that the bigram (~“mr.”, “green”) appeared four times (where ~“mr.” literally means “any token other than ‘mr.’”). In Table 7-5, the expression frequency(*, token2) represents the unigram frequency of token2 and is referred to as a marginal because it’s noted in the margin of the table as a shortcut. The value for frequency(token1, *) works the same way in helping to compute frequency(token1, ~token2), and the expression frequency(*, *) refers to any possible unigram and is equivalent to the total number of tokens in the text. Given frequency(token1, token2), frequency(token1, ~token2), and frequency(~token1, token2), the value of frequency(*, *) is necessary to calculate frequency(~token1, ~token2).
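The two-lookup shortcut can be sketched with plain counters; the counts below are hypothetical, extending the “mr. green” numbers from the text with a made-up total of 20 tokens:

```python
from collections import Counter

# Hypothetical unigram and bigram frequency distributions
unigram_fd = Counter({'mr.': 5, 'green': 7, 'killed': 1})
bigram_fd = Counter({('mr.', 'green'): 3})
total_tokens = 20  # frequency(*, *): made-up token count for illustration

def contingency(t1, t2):
    n_ii = bigram_fd[(t1, t2)]                # frequency(token1, token2)
    n_oi = unigram_fd[t2] - n_ii              # frequency(~token1, token2)
    n_io = unigram_fd[t1] - n_ii              # frequency(token1, ~token2)
    n_oo = total_tokens - n_ii - n_oi - n_io  # frequency(~token1, ~token2)
    return n_ii, n_oi, n_io, n_oo

print(contingency('mr.', 'green'))  # (3, 4, 2, 11)
```

Note how the four is recovered exactly as in the prose: seven “green” unigrams minus three (“mr.”, “green”) bigrams.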
Although this discussion of contingency tables may seem somewhat tangential, it’s an important foundation for understanding different scoring functions. For example, consider the Jaccard Index. Conceptually, it expresses the similarity of two sets and is defined by:

Jaccard(Set1, Set2) = | Set1 ∩ Set2 | / | Set1 ∪ Set2 |
In other words, that’s the number of items in common between the two sets divided by the total number of distinct items in the combined sets. It’s worth taking a moment to ponder this simple yet effective calculation. If Set1 and Set2 were identical, the union and the intersection of the two sets would be equivalent to one another, resulting in a ratio of 1.0. If both sets were completely different, the numerator of the ratio would be 0, resulting in a value of 0.0. Then there’s everything else in between. The Jaccard Index as applied to a particular bigram expresses the ratio between the frequency of a particular bigram and the sum of the frequencies with which any bigram containing a term in the bigram of interest appears. One interpretation of that metric might be that the higher the ratio is, the more likely it is that (token1, token2) appears in the text, and hence, the more likely it is that the collocation “token1 token2” expresses a meaningful concept.
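The set formulation is easy to verify directly, including the boundary cases just described:

```python
# Jaccard Index over plain sets: |intersection| / |union|
def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

print(jaccard({'a', 'b'}, {'a', 'b'}))            # 1.0 -- identical sets
print(jaccard({'a', 'b'}, {'c', 'd'}))            # 0.0 -- completely different sets
print(jaccard({'a', 'b', 'c'}, {'b', 'c', 'd'}))  # 0.5 -- everything in between
```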
The selection of the most appropriate scoring function is usually
determined based upon knowledge about the characteristics of the
underlying data, some intuition, and sometimes a bit of luck. Most of
the association metrics defined in
nltk.metrics.association
are discussed in Chapter 5 of
Christopher Manning and Hinrich Schuetze’s Foundations of Statistical Natural Language
Processing (MIT Press), which is conveniently available
online and
serves as a useful reference for the descriptions that follow. An
in-depth discussion of these metrics is outside the scope of this book,
but the promotional chapter just mentioned provides a detailed account
with in-depth examples. The Jaccard Index, Dice’s coefficient, and the
likelihood ratio are good starting points if you find yourself needing
to build your own collocation detector. They are described, along with
some other key terms, in the list that follows:
As its name implies, raw frequency is the ratio expressing the frequency of a particular n-gram divided by the frequency of all n-grams. It is useful for examining the overall frequency of a particular collocation in a text.
The Jaccard Index is a ratio that measures the similarity between sets. As applied to collocations, it is defined as the frequency of a particular collocation divided by the total number of collocations that contain at least one term in the collocation of interest. It is useful for determining the likelihood of whether the given terms actually form a collocation, as well as ranking the likelihood of probable collocations. Using notation consistent with previous explanations, this formulation would be mathematically defined as:

Jaccard = frequency(token1, token2) / (frequency(token1, token2) + frequency(~token1, token2) + frequency(token1, ~token2))
Dice’s coefficient is extremely similar to the Jaccard Index. The fundamental difference is that it weights agreements among the sets twice as heavily as Jaccard. It is defined mathematically as:

Dice = 2 * frequency(token1, token2) / (frequency(*, token2) + frequency(token1, *))

Mathematically, it can be shown fairly easily that:

Dice = 2 * Jaccard / (1 + Jaccard)
You might choose to use this metric instead of the Jaccard Index when it’s more important to bias the score in favor of instances in which the collocation “token1 token2” appears.
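To see the relationship between the two metrics concretely, both can be written in terms of the contingency-table counts and checked numerically; the counts below are hypothetical:

```python
# Jaccard and Dice in terms of contingency counts, where
# n_ii = frequency(token1, token2), n_oi = frequency(~token1, token2),
# n_io = frequency(token1, ~token2). Counts below are hypothetical.
def jaccard(n_ii, n_oi, n_io):
    return n_ii / (n_ii + n_oi + n_io)

def dice(n_ii, n_oi, n_io):
    # 2 * f(t1, t2) / (f(*, t2) + f(t1, *))
    return 2 * n_ii / ((n_ii + n_oi) + (n_ii + n_io))

j = jaccard(3, 4, 2)  # 3/9
d = dice(3, 4, 2)     # 6/12
assert abs(d - 2 * j / (1 + j)) < 1e-12  # Dice = 2 * Jaccard / (1 + Jaccard)
```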
The likelihood ratio is yet another approach to hypothesis testing that is used to measure the independence between terms that may form a collocation. It’s been shown to be a more appropriate approach for collocation discovery than the chi-square test in the general case, and it works well on data that includes many infrequent collocations. The particular calculations involved in computing likelihood estimates for collocations as implemented by NLTK assume a binomial distribution, where the parameters governing the distribution are calculated based upon the number of occurrences of collocations and constituent terms.
Like Student’s t-score, the chi-square metric is commonly used for testing independence between two variables and can be used to measure whether two tokens are collocations based upon Pearson’s chi-square test of statistical significance. Generally speaking, the differences obtained from applying the t-test and chi-square test are not substantial. The advantage of chi-square testing is that unlike t-testing, it does not assume an underlying normal distribution; for this reason, chi-square testing is more commonly used.
Traditionally, Student’s t-score has been used for hypothesis testing, and as applied to n-gram analysis, t-scores can be used for testing the hypothesis of whether two terms are collocations. The statistical procedure for this calculation uses a standard distribution per the norm for t-testing. An advantage of the t-score values as opposed to raw frequencies is that a t-score takes into account the frequency of a bigram relative to its constituent components. This characteristic facilitates ranking the strengths of collocations. A criticism of the t-test is that it necessarily assumes that the underlying probability distribution for collocations is normal, which is not often the case.
Pointwise Mutual Information (PMI) is a measure of how much information is gained about a particular word if you also know the value of a neighboring word; to put it another way, it measures how much one word can tell you about another. Ironically (in the context of the current discussion), the calculations involved in computing the PMI lead it to score high-frequency words lower than low-frequency words, which is the opposite of the desired effect. Therefore, it is a good measure of independence but not a good measure of dependence (i.e., a less than ideal choice for scoring collocations). It has also been shown that sparse data is a particular stumbling block for PMI scoring, and that other techniques such as the likelihood ratio tend to outperform it.
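The standard pointwise formulation, and the low-frequency bias it produces, can be sketched as follows (the counts are hypothetical):

```python
import math

# PMI(t1, t2) = log2( P(t1, t2) / (P(t1) * P(t2)) ) -- the standard formulation.
def pmi(bigram_count, count1, count2, total_tokens):
    p_bigram = bigram_count / total_tokens
    p1 = count1 / total_tokens
    p2 = count2 / total_tokens
    return math.log2(p_bigram / (p1 * p2))

# Hypothetical counts in a 10,000-token text: a rare pair that always co-occurs
# scores far higher than a frequent pair that co-occurs just as reliably.
rare = pmi(2, 2, 2, 10_000)            # two rare words, always together
frequent = pmi(100, 100, 100, 10_000)  # two frequent words, always together
print(rare, frequent)
assert rare > frequent
```

This is exactly the high-frequency penalty described above: both pairs are perfectly associated, yet the rarer one wins.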