Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Clustering Contacts by Job Title

Now that you can (hopefully) better appreciate the nature of the record-matching problem, let’s collect some real-world data from LinkedIn and start hacking out clusters. A few reasons you might be interested in mining your LinkedIn data are because you want to take an honest look at whether your networking skills have been helping you to meet the “right kinds of people,” or because you want to approach contacts who will most likely fit into a certain socioeconomic bracket with a particular kind of business enquiry or proposition. The remainder of this section systematically works through some exercises in clustering your contacts by job title.

Standardizing and Counting Job Titles

An obvious starting point when working with a new data set is to count things, and this situation is no different. This section introduces a pattern for transforming common job titles and then performs a basic frequency analysis.

Assuming you have a reasonable number of exported contacts, the minor nuances among job titles that you’ll encounter may actually be surprising, but before we get into that, let’s put together some code that establishes some patterns for normalizing record data and takes a basic inventory sorted by frequency. Example 6-2 inspects job titles and prints out frequency information for the titles themselves and for individual tokens that occur in them. If you haven’t already, you’ll need to easy_install pretty table , a package that you can use to produce nicely formatted tabular output.

Example 6-2. Standardizing common job titles and computing their frequencies (linkedin__analyze_titles.py)

# -*- coding: utf-8 -*-

import sys
import nltk
import csv
from prettytable import PrettyTable

CSV_FILE = sys.argv[1]

transforms = [
    ('Sr.', 'Senior'),
    ('Sr', 'Senior'),
    ('Jr.', 'Junior'),
    ('Jr', 'Junior'),
    ('CEO', 'Chief Executive Officer'),
    ('COO', 'Chief Operating Officer'),
    ('CTO', 'Chief Technology Officer'),
    ('CFO', 'Chief Finance Officer'),
    ('VP', 'Vice President'),
    ]

csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"')
contacts = [row for row in csvReader]

# Read in a list of titles and split apart
# any combined titles like "President/CEO"
# Other variations could be handled as well such
# as "President & CEO", "President and CEO", etc.

titles = []
for contact in contacts:
    titles.extend([t.strip() for t in contact['Job Title'].split('/')
                  if contact['Job Title'].strip() != ''])

# Replace common/known abbreviations

for i in range(len(titles)):
    for transform in transforms:
        titles[i] = titles[i].replace(*transform)

# Print out a table of titles sorted by frequency

pt = PrettyTable(fields=['Title', 'Freq'])
pt.set_field_align('Title', 'l')
titles_fdist = nltk.FreqDist(titles)
[pt.add_row([title, freq]) for (title, freq) in titles_fdist.items() if freq > 1]
pt.printt()

# Print out a table of tokens sorted by frequency

tokens = []
for title in titles:
    tokens.extend([t.strip(',') for t in title.split()])
pt = PrettyTable(fields=['Token', 'Freq'])
pt.set_field_align('Token', 'l')
tokens_fdist = nltk.FreqDist(tokens)
[pt.add_row([token, freq]) for (token, freq) in tokens_fdist.items() if freq > 1
 and len(token) > 2]
pt.printt()

In short, the code reads in CSV records and makes a mild attempt at normalizing them by splitting apart combined titles that use the forward slash (like a title of “President/CEO”) and replacing known abbreviations. Beyond that, it just displays the results of a frequency distribution of both full job titles and individual tokens contained in the job titles. This is nothing new, but it is useful as a starting template and provides you with some reasonable insight into how the data breaks down. Sample results are shown in Example 6-3.

Example 6-3. Sample results for Example 6-2

+--------------------------+------+
|          Title           | Freq |
+--------------------------+------+
| Chief Executive Officer  |  16  |
| Senior Software Engineer |  12  |
| Owner                    |  9   |
| Software Engineer        |  9   |

...

+-------------+------+
|    Token    | Freq |
+-------------+------+
| Engineer    |  44  |
| Software    |  31  |
| Senior      |  27  |
| Manager     |  26  |
| Chief       |  24  |
| Officer     |  12  |
| Director    |  11  |

...

What’s interesting is that the sample results show that the most common job title based on exact matches is “Chief Executive Officer,” while the most common token in the job titles is “Engineer.” Although “Engineer” is not a constituent token of the most common job title, it does appear in a large number of job titles (for example, “Senior Software Engineer” and “Software Engineer,” which show up near the top of the job titles list). In analyzing job title or address book data, this is precisely the kind of insight that motivates the need for an approximate matching or clustering algorithm. The next section investigates further.

Common Similarity Metrics for Clustering

The most substantive decision that needs to be made in taking a set of strings—job titles, in this case—and clustering them in a useful way is which underlying similarity metric to use. There are myriad string similarity metrics available, and choosing the one that’s most appropriate for your situation largely depends on the nature of your objective. A few of the common similarity metrics that might be helpful in comparing job titles follow:

Edit distance

Edit distance, also known as Levenshtein distance, is a simple measure of how many insertions, deletions, and replacements it would take to convert one string into another. For example, the cost of converting “dad” into “bad” would be one replacement operation and would yield a value of 1. NLTK provides an implementation of edit distance via the nltk.metrics.distance.edit_distance function. Note that the actual edit distance between two strings is often different from the number of operations required to compute the edit distance; computation of edit distance is usually on the order of M*N operations for strings of length M and N.

n-gram similarity

An n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. Peek ahead to Bigram Analysis for an extended discussion of n-grams and collocations.

There are many variations of n-gram similarity, but consider the straightforward case of computing all possible bigrams for the tokens of two strings, and scoring the similarity between the strings by counting the number of common bigrams between them. Example 6-4 demonstrates.

Example 6-4. Using NLTK to compute bigrams in the interpreter

>>> ceo_bigrams = nltk.bigrams("Chief Executive Officer".split(), pad_right=True, 
... pad_left=True)
>>> ceo_bigrams
[(None, 'Chief'), ('Chief', 'Executive'), ('Executive', 'Officer'), 
('Officer', None)]
>>> cto_bigrams = nltk.bigrams("Chief Technology Officer".split(), pad_right=True, 
... pad_left=True)
>>> cto_bigrams
[(None, 'Chief'), ('Chief', 'Technology'), ('Technology', 'Officer'), 
 ('Officer', None)]
>>> len(set(ceo_bigrams).intersection(set(cto_bigrams)))
2

Note that the use of the keyword arguments pad_right and pad_left intentionally allows for leading and trailing tokens to match. The effect of padding is to allow bigrams such as (None, ‘Chief’) to emerge, which are important matches across job titles. NLTK provides a fairly comprehensive array of bigram and trigram scoring functions via the BigramAssociationMeasures and TrigramAssociationMeasures classes defined in nltk.metrics.association module.

Jaccard distance

The Jaccard metric expresses the similarity of two sets and is defined by |Set1 intersection Set2/|Set1 union Set2|. In other words, it’s the number of items in common between the two sets divided by the total number of distinct items in the two sets. Peek ahead to How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions for an extended discussion. In general, you’ll compute Jaccard similarity by using n-grams, including unigrams (single tokens), to measure the similarity of two strings. An implementation of Jaccard distance is available at nltk.metrics.distance.jaccard_distance and could be implemented as:

len(X.union(Y)) - len(X.intersection(Y)))/float(len(X.union(Y))

where X and Y are sets of items.

MASI distance

The MASI^[41] distance metric is a weighted version of Jaccard similarity that adjusts the score to result in a smaller distance than Jaccard when a partial overlap between sets exists. MASI distance is defined in NLTK as nltk.metrics.distance.masi_distance, which can be implemented as:

1 - float(len(X.intersection(Y)))/float(max(len(X),len(Y)))

where X and Y are sets of items. Example 6-5 and the sample results in Table 6-2 provide some data points that may help you better understand this metric.

Example 6-5. Using built-in distance metrics from NLTK to compare small sets of items (linkedin__distances.py)

# -*- coding: utf-8 -*-

from nltk.metrics.distance import jaccard_distance, masi_distance
from prettytable import PrettyTable

fields = ['X', 'Y', 'Jaccard(X,Y)', 'MASI(X,Y)']
pt = PrettyTable(fields=fields)
[pt.set_field_align(f, 'l') for f in fields]

for z in range(4):
    X = set()
    for x in range(z, 4):
        Y = set()
        for y in range(1, 3):
            X.add(x)
            Y.add(y)
            pt.add_row([list(X), list(Y), round(jaccard_distance(X, Y), 2),
                       round(masi_distance(X, Y), 2)])
pt.printt()

Notice that whenever the two sets are either completely disjoint or equal—i.e., when one set is totally subsumed^[42] by the other—the MASI distance works out to be the same value as the Jaccard distance. However, when the sets overlap only partially, the MASI distance works out to be higher than the Jaccard distance. It’s worth considering whether you think MASI will be more effective than Jaccard distance for the task of scoring two arbitrary job titles and clustering them together if the similarity (distance) between them exceeds a certain threshold. The table of sample results in Table 6-2 tries to help you wrap your head around how MASI compares to plain old Jaccard.

Table 6-2. A comparison of Jaccard and MASI distances for two small sets of items as emitted from Example 6-5

X	Y	Jaccard (X,Y)	MASI (X,Y)
[0]	[1]	1.0	1.0
[0]	[1, 2]	1.0	1.0
[0, 1]	[1]	0.5	0.5
[0, 1]	[1, 2]	0.67	0.5
[0, 1, 2]	[1]	0.67	0.67
[0, 1, 2]	[1, 2]	0.33	0.33
[0, 1, 2, 3]	[1]	0.75	0.75
[0, 1, 2, 3]	[1, 2]	0.5	0.5
[1]	[1]	0.0	0.0
[1]	[1, 2]	0.5	0.5
[1, 2]	[1]	0.5	0.5
[1, 2]	[1, 2]	0.0	0.0
[1, 2, 3]	[1]	0.67	0.67
[1, 2, 3]	[1, 2]	0.33	0.33
[2]	[1]	1.0	1.0
[2]	[1, 2]	0.5	0.5
[2, 3]	[1]	1.0	1.0
[2, 3]	[1, 2]	0.67	0.5
[3]	[1]	1.0	1.0
[3]	[1, 2]	1.0	1.0

While the list of interesting similarity metrics could go on and on, there’s quite literally an ocean of literature about them online, and it’s easy enough to plug in and try out different similarity heuristics once you have a better feel for the data you’re mining. The next section works up a script that clusters job titles using the MASI similarity metric.

A Greedy Approach to Clustering

Given MASI’s partiality to partially overlapping terms, and that we have insight suggesting that overlap in titles is important, let’s try to cluster job titles by comparing them to one another using MASI distance, as an extension of Example 6-2. Example 6-6 clusters similar titles and then displays your contacts accordingly. Take a look at the code—especially the nested loop invoking the DISTANCE function that makes a greedy decision—and then we’ll discuss.

Example 6-6. Clustering job titles using a greedy heuristic (linkedin__cluster_contacts_by_title.py)

# -*- coding: utf-8 -*-

import sys
import csv
from nltk.metrics.distance import masi_distance

CSV_FILE = sys.argv[1]

DISTANCE_THRESHOLD = 0.34
DISTANCE = masi_distance

def cluster_contacts_by_title(csv_file):

    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Finance Officer'),
        ('VP', 'Vice President'),
        ]

    seperators = ['/', 'and', '&']

    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]

# Normalize and/or replace known abbreviations
# and build up list of common titles

    all_titles = []
    for i in range(len(contacts)):
        if contacts[i]['Job Title'] == '':
            contacts[i]['Job Titles'] = ['']
            continue
        titles = [contacts[i]['Job Title']]
        for title in titles:
            for seperator in seperators:
                if title.find(seperator) >= 0:
                    titles.remove(title)
                    titles.extend([title.strip() for title in title.split(seperator)
                                  if title.strip() != ''])

        for transform in transforms:
            titles = [title.replace(*transform) for title in titles]
        contacts[i]['Job Titles'] = titles
        all_titles.extend(titles)

    all_titles = list(set(all_titles))

    clusters = {}
    for title1 in all_titles:
        clusters[title1] = []
        for title2 in all_titles:
            if title2 in clusters[title1] or clusters.has_key(title2) and title1 
                in clusters[title2]:
                continue
            distance = DISTANCE(set(title1.split()), set(title2.split()))
            if distance < DISTANCE_THRESHOLD:
                clusters[title1].append(title2)

# Flatten out clusters

    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]

# Round up contacts who are in these clusters and group them together

    clustered_contacts = {}
    for cluster in clusters:
        clustered_contacts[tuple(cluster)] = []
        for contact in contacts:
            for title in contact['Job Titles']:
                if title in cluster:
                    clustered_contacts[tuple(cluster)].append('%s %s.'
                            % (contact['First Name'], contact['Last Name'][0]))

    return clustered_contacts

if __name__ == '__main__':
    clustered_contacts = cluster_contacts_by_title(CSV_FILE)

    for titles in clustered_contacts:
        common_titles_heading = 'Common Titles: ' + ', '.join(titles)
        print common_titles_heading

        descriptive_terms = set(titles[0].split())
        for title in titles:
            descriptive_terms.intersection_update(set(title.split()))
        descriptive_terms_heading = 'Descriptive Terms: ' 
            + ', '.join(descriptive_terms)
        print descriptive_terms_heading
        print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))
        print '
'.join(clustered_contacts[titles])
        print

The code listing starts by separating out combined titles using a list of common conjunctions and then normalizes common titles. Then, a nested loop iterates over all of the titles and clusters them together according to a thresholded MASI similarity metric. This tight loop is where most of the real action happens in the listing: it’s where each title is compared to each other title. If the distance between any two titles as determined by a similarity heuristic is “close enough,” we greedily group them together. In this context, being “greedy” means that the first time we are able to determine that an item might fit in a cluster, we go ahead and assign it without further considering whether there might be a better fit, or making any attempt to account for such a better fit if one appears later. Although incredibly pragmatic, this approach produces very reasonable results. Clearly, the choice of an effective similarity heuristic is critical, but given the nature of the nested loop, the fewer times we have to invoke the scoring function, the better. More will be said about this consideration in the next section, but do note that we do use some conditional logic to try to avoid repeating unnecessary calculations if possible.

The rest of the listing just looks up contacts with a particular job title and groups them for display, but there is one other interesting nuance involved in computing clusters: you often need to assign each cluster a meaningful label. The working implementation computes labels by taking the setwise intersection of terms in the job titles for each cluster, which seems reasonable given that it’s the most obvious common thread. Your mileage is sure to vary with other approaches.

The types of results you might expect from this code are useful in that they group together individuals who are likely to share common responsibilities in their job duties. This information might be useful for a variety of reasons, whether you’re planning an event that includes a “CEO Panel,” trying to figure out who can best help you to make your next career move, or trying to determine whether you are really well-enough connected to other similar professionals given your own job responsibilities. Abridged results for a sample network are presented in Example 6-7.

Example 6-7. Sample output for Example 6-6

Common Titles: Chief Executive Officer, 
               Chief Finance Officer, 
               Chief Technology Officer, 
               Chief Operations Officer
Descriptive Terms: Chief, Officer
----------------------------------------
Gregg P.
Scott S.
Mario G.
Brian F.
Damien K.
Chris A.
Trey B.
Emil E.
Dan M.
Tim E.
John T. H.

...


Common Titles: Consultant, 
               Software Engineer, 
               Software Engineer
Descriptive Terms: Software, Engineer
-----------------------------------------
Kent L.
Paul C.
Rick C.
James B.
Alex R.
Marco V.
James L.
Bill T.

...

Scalable clustering sure ain’t easy

It’s important to note that in the worst case, the nested loop executing the DISTANCE calculation from Example 6-6 would require it to be invoked in what we’ve already mentioned is O(n²) time complexity—in other words, len(all_titles)*len(all_titles) times. A nested loop that compares every single item to every single other item for clustering purposes is not a scalable approach for a very large value of n, but given that the unique number of titles for your professional network is not likely to be very large, it shouldn’t impose a performance constraint. It may not seem like a big deal—after all, it’s just a nested loop—but the crux of an O(n²) algorithm is that the number of comparisons required to process an input set increases exponentially in proportion to the number of items in the set. For example, a small input set of 100 job titles would require only 10,000 scoring operations, while 10,000 job titles would require 100,000,000 scoring operations. The math doesn’t work out so well and eventually buckles, even when you have a lot of hardware to throw at it.

Your initial reaction when faced with what seems like a predicament that scales exponentially will probably be to try to reduce the value of n as much as possible. But most of the time you won’t be able to reduce it enough to make your solution scalable as the size of your input grows, because you still have an O(n²) algorithm. What you really want to do is figure out a way to come up with an algorithm that’s on the order of O(k*n), where k is much smaller than n and represents a manageable amount of overhead that grows much more slowly than the rate of n’s growth. But as with any other engineering decision, there are performance and quality trade-offs to be made in all corners of the real world, and it sure ain’t easy to strike the right balance. In fact, many data-mining companies that have successfully implemented scalable record-matching analytics at a high degree of fidelity consider their specific approaches to be proprietary information (trade secrets), since they result in definite business advantages. (More times than you’d think, it’s not the ability to solve a problem that really matters, but the ability to implement it in a particular manner that can be taken to the market.)

For situations in which an O(n²) algorithm is simply unacceptable, one variation to the working example that you might try is rewriting the nested loops so that a random sample is selected for the scoring function, which would effectively reduce it to O(k*n), if k were the sample size. As the value of the sample size approaches n, however, you’d expect the runtime to begin approaching the O(n²) runtime. Example 6-8 shows how that sampling technique might look in code; the key changes to the previous listing are highlighted in bold. The key takeaway is that for each invocation of the outer loop, we’re executing the inner loop a much smaller, fixed number of times.

Example 6-8. A small modification to an excerpt of Example 6-6 that incorporates random sampling to improve performance

# ...snip ...

all_titles = list(set(all_titles))
clusters = {}
for title1 in all_titles:
    clusters[title1] = []
    for sample in range(SAMPLE_SIZE):
        title2 = all_titles[random.randint(0, len(all_titles)-1)]
        if title2 in clusters[title1] or clusters.has_key(title2) and title1 
            in clusters[title2]:
            continue
        distance = DISTANCE(set(title1.split()), set(title2.split()))
        if distance < DISTANCE_THRESHOLD:
        clusters[title1].append(title2)

# ...snip ...

Another approach you might consider is to randomly sample the data into n bins (where n is some number that’s generally less than or equal to the square root of the number of items in your set), perform clustering within each of those individual bins, and then optionally merge the output. For example, if you had one million items, an O(n²) algorithm would take a trillion logical operations, whereas binning the one million items into 1,000 bins containing 1,000 items each and clustering each individual bin would require only a billion operations. (That’s 1,000*1,000 comparisons for each bin for all 1,000 bins.) A billion is still a large number, but it’s three orders of magnitude smaller than a trillion, and that’s nothing to sneeze at.

There are many other approaches in the literature besides sampling or binning that could be far better at reducing the dimensionality of a problem. For example, you’d ideally compare every item in a set, and at the end of the day, the particular technique you’ll end up using to avoid an O(n²) situation for a large value of n will vary based upon real-world constraints and insights you’re likely to gain through experimentation and domain-specific knowledge. As you consider the possibilities, keep in mind that the field of machine learning offers many techniques that are designed to combat exactly these types of scale problems by using various sorts of probabilistic models and sophisticated sampling techniques. The next section introduces a fairly intuitive and well-known clustering algorithm called k-means, which is a general-purpose unsupervised approach for clustering a multidimensional space. We’ll be using this technique later to cluster your contacts by geographic location.

Intelligent clustering enables compelling user experiences

It’s easy to get so caught up in the data-mining techniques themselves that you lose sight of the business case that motivated them in the first place. Simple clustering techniques can create incredibly compelling user experiences (all over the place) by leveraging results even as simple as the job title ones we just produced. Figure 6-2 demonstrates a powerful alternative view of your data via a simple tree widget that could be used as part of a navigation pane or faceted display for filtering search criteria. Assuming that the underlying similarity metrics you’ve chosen have produced meaningful clusters, a simple hierarchical display that presents data in logical groups with a count of the items in each group can streamline the process of finding information, and power intuitive workflows for almost any application where a lot of skimming would otherwise be required to find the results.

Figure 6-2. Intelligently clustering data lends itself to faceted displays and compelling user experiences

The code to create a simple navigational display can be surprisingly simple, given the maturity of Ajax toolkits and other UI libraries. For example, Example 6-9 shows a simple web page that is all that’s required to populate the Dojo tree widget shown in Figure 6-2 using a makeshift template.

Example 6-9. A template for a Dojo tree widget, as displayed in Figure 6-2—just add data and watch it go!

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
		"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
	<title>Clustered Contacts</title>
    <link rel="stylesheet" 
        href="http://ajax.googleapis.com/ajax/libs/dojo/1.5/dojo/resources/dojo.css">
    <link rel="stylesheet" 
        href="http://ajax.googleapis.com/ajax/libs/dojo/1.5/dijit/themes/claro/claro.css">

    <script src="http://ajax.googleapis.com/ajax/libs/dojo/1.5/dojo/dojo.xd.js" 
        type="text/javascript" djConfig="parseOnLoad:true"></script>

	<script language="JavaScript" type="text/javascript">
		dojo.require("dojo.data.ItemFileReadStore");
		dojo.require("dijit.Tree");
		dojo.require("dojo.parser");
	</script>

    <script language="JavaScript" type="text/javascript">
        var data = %s; //substituted by Python
    </script>

</head>
<body class="claro">
	<div dojoType="dojo.data.ItemFileReadStore" jsId="jobsStore"
		data="data"></div>

	<div dojoType="dijit.Tree" id="mytree" store="jobsStore"
		label="Clustered Contacts" openOnClick="true">
	</div>
</body>
</html>

The data value that’s substituted into the template is a simple JSON structure, which we can obtain easily enough by modifying the output from Example 6-6, as illustrated in Example 6-10.

Example 6-10. Munging the clustered_contacts data that’s computed in Example 6-6 to produce JSON that’s consumable by Dojo’s tree widget

import json
import webbrowser
import os

data = {"label" : "name", "temp_items" : {}, "items" : []} 
for titles in clustered_contacts:
    descriptive_terms = set(titles[0].split())
    for title in titles:
        descriptive_terms.intersection_update(set(title.split()))
    descriptive_terms = ', '.join(descriptive_terms)

    if data['temp_items'].has_key(descriptive_terms):
        data['temp_items'][descriptive_terms].extend([{'name' : cc } for cc 
            in clustered_contacts[titles]])
    else:
        data['temp_items'][descriptive_terms] = [{'name' : cc } for cc 
            in clustered_contacts[titles]]

for descriptive_terms in data['temp_items']:
    data['items'].append({"name" : "%s (%s)" % (descriptive_terms, 
        len(data['temp_items'][descriptive_terms]),),                           "children" : [i for i in 
                              data['temp_items'][descriptive_terms]]})

del data['temp_items']

# Open the template and substitute the data

if not os.path.isdir('out'):
    os.mkdir('out')

TEMPLATE = os.path.join(os.getcwd(), '..', 'web_code', 'dojo', 'dojo_tree.html')                                                   
OUT = os.path.join('out', 'dojo_tree.html')

t = open(TEMPLATE).read()
f = open(OUT, 'w')
f.write(t % json.dumps(data, indent=4))
f.close()

webbrowser.open("file://" + os.path.join(os.getcwd(), OUT))

There’s incredible value in being able to create user experiences that present data in intuitive ways that power workflows. Something as simple as an intelligently crafted hierarchical display can inadvertently motivate users to spend more time on a site, discover more information than they normally would, and ultimately realize more value in the services the site offers.

Hierarchical and k-Means Clustering

Example 6-6 introduced a greedy approach to clustering, but there are at least two other common clustering techniques that you should know about: hierarchical clustering and k-means clustering. Hierarchical clustering is superficially similar to the greedy heuristic we have been using, while k-means clustering is radically different. We’ll primarily focus on k-means throughout the rest of this chapter, but it’s worthwhile to briefly introduce the theory behind both of these approaches since you’re very likely to encounter them during literature review and research. An excellent implementation of both of these approaches is available via the cluster module that you can easy_install, so please go ahead and take a moment to do that.

Hierarchical clustering

Hierarchical clustering is a deterministic and exhaustive technique, in that it computes the full matrix of distances between all items and then walks through the matrix clustering items that meet a minimum distance threshold. It’s hierarchical in that walking over the matrix and clustering items together naturally produces a tree structure that expresses the relative distances between items. In the literature, you may see this technique called agglomerative, because the leaves on the tree represent the items that are being clustered, while nodes in the tree agglomerate these items into clusters.

This technique is similar to but not fundamentally the same as the approach used in Example 6-6, which uses a greedy heuristic to cluster items instead of successively building up a hierarchy. As such, the actual wall clock time (the amount of time it takes for the code to run) for hierarchical clustering may be a bit longer, and you may need to tweak your scoring function and distance threshold. However, the use of dynamic programming and other clever bookkeeping techniques can result in substantial savings in wall clock execution time, depending on the implementation. For example, given that the distance between two items such as job titles is almost certainly going to be symmetric, you should only have to compute one half of the distance matrix instead of the full matrix. Even though the time complexity of the algorithm as a whole is still O(n²), only 0.5*n² units of work are being performed instead of n² units of work.

If we were to rewrite Example 6-6 to use the cluster module, the nested loop performing the clustering DISTANCE computation would be replaced with something like the code in Example 6-11.

Example 6-11. A minor modification to Example 6-6 that uses cluster.HierarchicalClustering instead of a greedy heuristic (linkedin__cluster_contacts_by_title_hac.py)

# ... snip ...

# Import the goods
from cluster import HierarchicalClustering

# Define a scoring function
def score(title1, title2): 
    return DISTANCE(set(title1.split()), set(title2.split()))

# Feed the class your data and the scoring function
hc = HierarchicalClustering(all_titles, score)

# Cluster the data according to a distance threshold
clusters = hc.getlevel(DISTANCE_THRESHOLD)

# Remove singleton clusters
clusters = [c for c in clusters if len(c) > 1]

# ... snip ...

If you’re very interested in variations on hierarchical clustering, be sure to check out the HierarchicalClustering class’ setLinkageMethod method, which provides some subtle variations on how the class can compute distances between clusters. For example, you can specify whether distances between clusters should be determined by calculating the shortest, longest, or average distance between any two clusters. Depending on the distribution of your data, choosing a different linkage method can potentially produce quite different results. Figure 6-3 displays a portion of a professional network as a radial tree layout and a dendogram using Protovis, a powerful HTML5-based visualization toolkit. The radial tree layout is more space-efficient and probably a better choice for this particular data set, but a dendogram would be a great choice if you needed to easily find correlations between each level in the tree (which would correspond to each level of agglomeration in hierarchical clustering) for a more complex set of data. Both of the visualizations presented here are essentially just static incarnations of the interactive tree widget from Figure 6-2. As they show, an amazing amount of information becomes apparent when you are able to look at a simple image of your professional network.^[43]

Figure 6-3. A Protovis radial tree layout of contacts clustered by job title (left), and a slice of the same data presented as a dendogram (right)

k-means clustering

Whereas hierarchical clustering is a deterministic technique that exhausts the possibilities and is an O(n²) approach, k-means clustering generally executes on the order of O(k*n) times. The substantial savings in performance come at the expense of results that are approximate, but they still have the potential to be quite good. The idea is that you generally have a multidimensional space containing n points, which you cluster into k clusters through the following series of steps:

Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K₁, K₂, …, K_k.
Assign each of the n points to a cluster by finding the nearest K_n—effectively creating k clusters and requiring k*n comparisons.
For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its K_i value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.)
Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.

Because k-means may not be all that intuitive at first glance, Figure 6-4 displays each step of the algorithm as presented in the online “Tutorial on Clustering Algorithms”, which features an interactive Java applet. The sample parameters used involve 100 data points and a value of 3 for the parameter k, which means that the algorithm will produce three clusters. The important thing to note at each step is the location of the squares, and which points are included in each of those three clusters as the algorithm progresses. The algorithm only takes nine steps to complete.

Figure 6-4. Progression of k-means for k=3 with 100 points

Although you could run k-means on points in two dimensions or two thousand dimensions, the most common range is usually somewhere on the order of tens of dimensions, with the most common cases being two or three dimensions. When the dimensionality of the space you’re working in is relatively small, k-means can be an effective clustering technique because it executes fairly quickly and is capable of producing very reasonable results. You do, however, need to pick an appropriate value for k, which is not always obvious. The next section demonstrates how to fetch more extended profile information for your contacts, which bridges this discussion with a follow-up exercise devoted to geographically clustering your professional network by applying k-means.

^[41]Measuring Agreement on Set-Valued Items. See http://www.cs.columbia.edu/~becky/pubs/lrec06masi.pdf.

^[42]def subsumed(X,Y): return X.union_update(Y) == X or Y.union_update(X) == Y

^[43]linkedin__cluster_contacts_by_title_hac.py provides a minimal adaptation of Example 6-6 that demonstrates how to produce output that could be consumed by Protovis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Clustering Contacts by Job Title

Create new playlist

Sign In

Sign Up