Text Mining Fundamentals

Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it’s helpful to first introduce some fundamentals from Information Retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 8 provides a deeper discussion of NLP.

Note

If you want to dig deeper into IR theory, the full text of Introduction to Information Retrieval is available online and provides more information than you could ever want to know about the field.

A Whiz-Bang Introduction to TF-IDF

Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus. TF-IDF stands for term frequency-inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents. Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, tf_idf = tf*idf, where the term tf represents the importance of a term in a specific document, and idf represents the importance of a term relative to the entire corpus. Multiplying these terms together produces a score that accounts for both factors and has been an integral part of every major search engine at some point in its existence. To get a more intuitive idea of how TF-IDF works, let’s walk through each of the calculations involved in computing the overall score.

For simplicity in illustration, suppose you have a corpus containing three sample documents and terms are calculated by simply breaking on whitespace, as illustrated in Example 7-3.

Example 7-3. Sample data structures used in illustrations throughout this chapter

corpus = { 
        'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. 
               Mr. Green is not a very nice fellow.",
        'b' : "Professor Plumb has a green plant in his study.",
        'c' : "Miss Scarlett watered Professor Plumb's green plant while he was away 
               from his office last week."
}

terms = {
        'a' : [ i.lower() for i in corpus['a'].split() ],
        'b' : [ i.lower() for i in corpus['b'].split() ],
        'c' : [ i.lower() for i in corpus['c'].split() ]
        }

A term’s frequency could simply be represented as the number of times it occurs in the text, but it is more commonly the case that it is normalized by taking into account the total number of terms in the text, so that the overall score accounts for document length relative to a term’s frequency. For example, the term “green” (once normalized to lowercase) occurs twice in corpus['a'] and only once in corpus['b'], so corpus['a'] would produce a higher score if frequency were the only scoring criterion. However, if you normalize for document length, corpus['b'] would have a slightly higher term frequency score for “green” (1/9) than corpus['a'] (2/19), because corpus['b'] is shorter than corpus['a']—even though “green” occurs more frequently in corpus['a']. Thus, a common technique for scoring a compound query such as “Mr. Green” is to sum the term frequency scores for each of the query terms in each document, and return the documents ranked by the summed term frequency score.

The previous paragraph is actually not as confusing as it sounds; for example, querying our sample corpus for “Mr. Green” would return the normalized scores reported in Table 7-1 for each document.

Table 7-1. Sample term frequency scores for “Mr. Green”

Documenttf(Mr.)tf(Green)Sum
corpus['a']2/192/194/19 (0.2105)
corpus['b']01/91/9 (0.1111)
corpus['c']01/161/16 (0.0625)

For this contrived example, a cumulative term frequency scoring scheme works out and returns corpus['a'] (the document that we’d expect it to return), since corpus['a'] is the only one that contains the compound token “Mr. Green”. However, a number of problems could have emerged because the term frequency scoring model looks at each document as an unordered collection of words. For example, queries for “Green Mr.” or “Green Mr. Foo” would have returned the exact same scores as the query for “Mr. Green”, even though neither of those compound tokens appear in the sample sentences. Additionally, there are a number of scenarios that we could easily contrive to illustrate fairly poor results from the term frequency ranking technique by exploiting the fact that trailing punctuation is not properly handled, and the context around tokens of interest is not taken into account by the calculations.

Considering term frequency alone turns out to be a common problem when scoring on a document-by-document basis because it doesn’t account for very frequent words that are common across many documents. Meaning, all terms are weighted equally regardless of their actual importance. For example, “the green plant” contains the stopword “the”, which skews overall term frequency scores in favor of corpus['a'] because “the” appears twice in that document, as does “green”. In contrast, in corpus['c'] “green” and “plant” each appear only once.

Consequently, the scores would break down as shown in Table 7-2, with corpus['a'] ranked as more relevant than corpus['c'] even though intuition might lead you to believe that ideal query results probably shouldn’t have turned out that way. (Fortunately, however, corpus['b'] still ranks highest.)

Table 7-2. Sample term frequency scores for “the green plant”

Documenttf(the)tf(green)tf(plant)Sum
corpus['a']2/192/1904/19 (0.2105)
corpus['b']01/91/92/9 (0.2222)
corpus['c']01/161/161/8 (0.125)

Toolkits such as NLTK provide lists of stopwords that can be used to filter out terms such as “the”, “a”, “and”, etc., but keep in mind that there may be terms that evade even the best stopword lists and yet still are quite common to specialized domains. The inverse document frequency metric is a calculation that provides a generic normalization metric for a corpus, and accounts for the appearance of common terms across a set of documents by taking into consideration the total number of documents in which a query term ever appears. The intuition behind this metric is that it produces a higher value if a term is somewhat uncommon across the corpus than if it is very common, which helps to account for the problem with stopwords we just investigated. For example, a query for “green” in the corpus of sample documents should return a lower inverse document frequency score than “candlestick”, because “green” appears in every document while “candlestick” appears in only one. Mathematically, the only nuance of interest for the inverse document frequency calculation is that a logarithm is used to squash the result into a compressed range (as shown in Figure 7-3) since its usual application is in multiplying it against term frequency as a scaling factor.

At this point, we’ve come full circle and devised a way to compute a score for a multiterm query that accounts for the frequency of terms appearing in a document, the length of the document in which any particular term appears, and the overall uniqueness of the terms across documents in the entire corpus. In other words, tf-idf = tf*idf. Example 7-4 is a naive implementation of this discussion that should help solidify the concepts described. Take a moment to review it, and then we’ll discuss a few sample queries.

The logarithm squashes a large range of values into a more compressed space

Figure 7-3. The logarithm squashes a large range of values into a more compressed space

Example 7-4. Running TF-IDF on sample data (plus__tf_idf.py)

# -*- coding: utf-8 -*-

import sys
from math import log

QUERY_TERMS = sys.argv[1:]


def tf(term, doc, normalize=True):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0


def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus if term.lower()
                              in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 0, so it's important
    # to return a value greater than 1 for consistent scoring. (Multiplying two values 
    # less than 1 returns a value less than each of them)

    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = 
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. 
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plumb has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plumb's green plant while he was away 
from his office last week."}

# Score queries by calculating cumulative tf_idf score for each term in query

query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score

Although we’re working on a trivially small scale, the calculations involved work the same for larger data sets. Table 7-3 is a consolidated adaptation of the program’s output for three sample queries that involve four distinct terms:

  • “green”

  • “Mr. Green”

  • “the green plant”

Even though the IDF calculations for terms are for the entire corpus, they are displayed on a per-document basis so that you can easily verify TF-IDF scores by skimming a single row and multiplying two numbers. Again, it’s worth taking a few minutes to understand the data in the table so that you have a good feel for the way the calculations work. It’s remarkable just how powerful TF-IDF is, given that it doesn’t account for the proximity or ordering of words in a document.

Table 7-3. Calculations involved in TF-IDF sample queries, as computed by Example 7-4

Documenttf(mr.)tf(green)tf(the)tf(plant)
corpus['a']0.10530.10531.10530
corpus['b']00.111100.1111
corpus['c']00.062500.0625
 idf(mr.)idf(green)idf(the)idf(plant)
corpus['a']2.09861.02.0991.4055
corpus['b']2.09861.02.0991.4055
corpus['c']2.09861.02.0991.4055
 tf-idf(mr.)tf-idf(green)tf-idf(the)tf-idf(plant)
corpus['a']0.22090.10530.22090
corpus['b']00.111100.1562
corpus['c']00.062500.0878

The same results for each query are shown in Table 7-4, with the TF-IDF values summed on a per-document basis.

Table 7-4. Summed TF-IDF values for sample queries as computed by Example 7-4

Querycorpus['a']corpus['b']corpus['c']
green0.10530.11110.0625
Mr. Green0.2209 + 0.1053 = 0.32620 + 0.1111 = 0.11110 + 0.0625 = 0.0625
the green plant0.2209 + 0.1053 + 0 = 0.32620 + 0.1111 + 0.1562 = 0.26730 + 0.0625 + 0.0878 = 0.1503

From a qualitative standpoint, the query results seem reasonable. The corpus['b'] document is the winner for the query “green”, with corpus['a'] just a hair behind. In this case, the deciding factor was the length of corpus['b'] being much smaller than corpus['a']: the normalized TF score tipped in favor of corpus['b'] for its one occurrence of “green”, even though “Green” appeared in corpus['a'] two times. Since “green” appears in all three documents, the net effect of the IDF term in the calculations was a wash. Do note, however, that if we had returned 0.0 instead of 1.0 for IDF calculations, as is done in some implementations, the TF-IDF scores for “green” would have been 0.0 for all three documents. Depending on the particular situation, it may be better to return 0.0 for the IDF scores rather than 1.0. For example, if you had 100,000 documents and “green” appeared in all of them, you’d almost certainly consider it to be a stopword and want to remove its effects in a query entirely.

Note

A very worthwhile exercise to consider is why corpus['a'] scored highest for “the green plant” as opposed to corpus['b'], which at first blush, might have seemed a little more obvious.

A finer point to observe is that the sample implementation provided in Example 7-4 adjusts the IDF score by adding a value of 1.0 to the logarithm calculation, for the purposes of illustration and because we’re dealing with a trivial document set. Without the 1.0 adjustment in the calculation, it would be possible to have the idf function return values that are less than 1.0, which would result in two fractions being multiplied in the TF-IDF calculation. Since multiplying two fractions together results in a value smaller than either of them, this turns out to be an easily overlooked edge case in the TF-IDF calculation. Recall that the intuition behind the TF-IDF calculation is that we’d like to be able to multiply two terms in a way that consistently produces larger TF-IDF scores for more-relevant queries than for less-relevant queries.

Querying Google+ Data with TF-IDF

Let’s apply TF-IDF to the Google+ data we collected earlier and see how it works out as a tool for querying the data. NLTK provides some abstractions that we can use instead of rolling our own, so there’s actually very little to do now that you understand the underlying theory. The listing in Example 7-5 assumes you saved the Google+ data as a JSON file, and it allows you to pass in multiple query terms that are used to score the documents by relevance.

Example 7-5. Querying Google+ data with TF-IDF (plus__tf_idf_nltk.py)

# -*- coding: utf-8 -*-

import sys
import json
import nltk

# Load in unstructured data from wherever you've saved it

DATA = sys.argv[1]
data = json.loads(open(DATA).read())

QUERY_TERMS = sys.argv[2:]

activities = [activity['object']['content'].lower().split() 
              for activity in data 
                if activity['object']['content'] != ""]

# Provides tf/idf/tf_idf abstractions

tc = nltk.TextCollection(activities)

relevant_activities = []

for idx in range(len(activities)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, activities[idx])
    if score > 0:
        relevant_activities.append({'score': score, 'title': data[idx]['title'],
                              'url': data[idx]['url']})

# Sort by score and display results

relevant_activities = sorted(relevant_activities, key=lambda p: p['score'], reverse=True)
for activity in relevant_activities:
    print activity['title']
    print '	Link: %s' % (activity['url'], )
    print '	Score: %s' % (activity['score'], )

Sample query results for “SOPA”, a controversial piece of proposed legislation, on Tim O’Reilly’s Google+ data are shown in Example 7-6.

Example 7-6. Sample results from Example 7-5

I think the key point of this piece by +Mike Loukides, that PIPA and SOPA provide a 
"right of ext...
	Link: https://plus.google.com/107033731246200681024/posts/ULi4RYpvQGT
	Score: 0.0805961208217
Learn to Be a Better Activist During the SOPA Blackouts +Clay Johnson has put together 
an awesome...
	Link: https://plus.google.com/107033731246200681024/posts/hrC5aj7gS6v
	Score: 0.0255051015259
SOPA and PIPA are bad industrial policy There are many arguments against SOPA and PIPA 
that are b...
	Link: https://plus.google.com/107033731246200681024/posts/LZs8TekXK2T
	Score: 0.0227351539694
Further thoughts on SOPA, and why Congress shouldn't listen to lobbyists Colleen Taylor
of GigaOM...
	Link: https://plus.google.com/107033731246200681024/posts/5Xd3VjFR8gx
	Score: 0.0112879721039

Given a search term, being able to narrow in on three Google+ content items ranked by relevance is of tremendous benefit when analyzing unstructured text data. Try out some other queries and qualitatively review the results to see for yourself how well the TF-IDF metric works, keeping in mind that the absolute values of the scores aren’t really important—it’s the ability to find and sort documents by relevance that matters. Then, begin to ponder the gazillion ways that you could tune or augment this metric to be even more effective. (After all, every data hacker has to complete this exercise at some point in life.) One obvious improvement that’s left as an exercise for the reader is to stem verbs so that variations in tense, grammatical role, etc., resolve to the same stem and can be more accurately accounted for in similarity calculations. The nltk.stem module provides easy-to-use implementations for several common stemming algorithms.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.52.124