Exploring Social Connections with XFN

With a bit of context established for how microformats fit into the overall web space, let’s now turn to some practical applications of XFN, which is by far the most popular microformat you’re likely to encounter. As you already know, XFN is a means of identifying relationships to other people by including a few keywords in the rel attribute of an anchor tag. XFN is commonly used in blogs, and particularly in “blogroll” plug-ins such as those offered by WordPress.[10] Consider the following HTML content, shown in Example 2-1, that might be present in a blogroll.

Example 2-1. Example XFN markup

<div>
    <a href="http://example.org/matthew" rel="me">Matthew</a>
    <a href="http://example.com/users/jc" rel="friend met">J.C.</a>
    <a href="http://example.com/users/abe" rel="friend met co-worker">Abe</a>
    <a href="http://example.net/~baseeret" rel="spouse met">Baseeret</a>
    <a href="http://example.net/~lindsaybelle" rel="child met">Lindsay Belle</a>
</div>

From reading the values of the rel attributes, it’s hopefully pretty obvious what the relationships are between the various people. Namely, some guy named Matthew has a couple of friends, a spouse, and a child. He works with one of these friends and has met everyone in “real life” (which wouldn’t be the case for a strictly online associate). Apart from using a well-defined vocabulary, that’s about all there is to XFN. The good news is that it’s deceptively simple, yet incredibly powerful when employed at a large enough scale, because it is deliberately authored structured information. Unless you run across some content that’s just drastically out of date—as in two best friends who became archenemies but forgot to update their blogrolls accordingly—XFN gives you a very accurate indicator of how two people are connected. Given that most blogging platforms support XFN, there’s quite a bit of information that can be discovered. The bad news is that XFN doesn’t really tell you anything beyond those basics, so you have to use other sources of information and techniques to discover anything beyond conclusions such as, “Matthew and Baseeret are married and have a child named Lindsay Belle.” But you have to start somewhere, right?

Let’s whip up a simple script for harvesting XFN data similar to the service offered by rubhub, a social search engine that crawls and indexes a large number of websites using XFN. You might also want to check out one of the many online XFN tools if you want to explore the full specification before moving on to the next section.

A Breadth-First Crawl of XFN Data

Let’s get social by mining some XFN data and building out a social graph from it. Given that XFN can be embedded into any conceivable web page, the bad news is that we’re about to do some web scraping. The good news, however, is that it’s probably the most trivial web scraping you’ll ever do, and the BeautifulSoup package absolutely minimizes the burden. The code in Example 2-2 uses Ajaxian, a popular blog about modern-day web development, as the basis of the graph. Do yourself a favor and easy_install BeautifulSoup before trying to run it.

Example 2-2. Scraping XFN content from a web page (microformats__xfn_scrape.py)

# -*- coding: utf-8 -*-

import sys
import urllib2
import HTMLParser
from BeautifulSoup import BeautifulSoup

# Try http://ajaxian.com/
URL = sys.argv[1]

XFN_TAGS = set([
    'colleague',
    'sweetheart',
    'parent',
    'co-resident',
    'co-worker',
    'muse',
    'neighbor',
    'sibling',
    'kin',
    'child',
    'date',
    'spouse',
    'me',
    'acquaintance',
    'met',
    'crush',
    'contact',
    'friend',
    ])

try:
    page = urllib2.urlopen(URL)
except urllib2.URLError:
    print 'Failed to fetch ' + URL
    sys.exit(1)

try:
    soup = BeautifulSoup(page)
except HTMLParser.HTMLParseError:
    print 'Failed to parse ' + URL
    sys.exit(1)

anchorTags = soup.findAll('a')

# Print the link text, URL, and XFN tags for each anchor whose rel
# value contains at least one XFN keyword
for a in anchorTags:
    if a.has_key('rel'):
        tags = a['rel'].split()
        if len(set(tags) & XFN_TAGS) > 0:
            print a.contents[0], a['href'], tags

Warning

As of Version 3.1.x, BeautifulSoup switched to using HTMLParser instead of SGMLParser, and as a result, a common complaint is that BeautifulSoup seems a little less robust than the 3.0.x release branch. The author of BeautifulSoup recommends a number of options, with the most obvious choice being to simply use the 3.0.x version if 3.1.x isn’t robust enough for your needs. Alternatively, you can catch the HTMLParseError, as Examples 2-2 and 2-4 do, and handle or discard the offending content.

Running the code against a URL that includes XFN information returns the name, type of relationship, and a URL for each of a person’s friends. Sample output follows:

Dion Almaer http://www.almaer.com/blog/ [u'me']
Ben Galbraith http://weblogs.java.net/blog/javaben/ [u'co-worker']
Rey Bango http://reybango.com/ [u'friend']
Michael Mahemoff http://softwareas.com/ [u'friend']
Chris Cornutt http://blog.phpdeveloper.org/ [u'friend']
Rob Sanheim http://www.robsanheim.com/ [u'friend']
Dietrich Kappe http://blogs.pathf.com/agileajax/ [u'friend']
Chris Heilmann http://wait-till-i.com/ [u'friend']
Brad Neuberg http://codinginparadise.org/about/ [u'friend']

Assuming that the URL for each friend includes XFN or other useful information, it’s straightforward enough to follow the links and build out more social graph information in a systematic way. That approach is exactly what the next code example does: it builds out a graph in a breadth-first manner, which is to say that it does something like what is described in Example 2-3 in pseudocode.

Example 2-3. Pseudocode for a breadth-first search

Create an empty graph
Create an empty queue to keep track of nodes that need to be processed

Add the starting point to the graph as the root node
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from the queue 
  For each of the node's neighbors: 
    If the neighbor hasn't already been processed: 
      Add it to the queue   
      Add it to the graph 
      Create an edge in the graph that connects the node and its neighbor

Note that this approach to building out a graph has the advantage of naturally creating edges between nodes in both directions, if such edges exist, without any additional bookkeeping required. This is useful for social situations since it naturally lends itself to identifying mutual friends without any additional logic. The refinement of the code from Example 2-2 presented in Example 2-4 takes a breadth-first approach to building up a NetworkX graph by following hyperlinks if they appear to have XFN data encoded in them. Running the example code for even the shallow depth of two can return quite a large graph, depending on the popularity of XFN within the niche community of friends at hand, as can be seen by generating an image file of the graph and inspecting it or running various graph metrics on it (see Figure 2-1).
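As a quick illustration of that point, once a graph like the one Example 2-4 produces is in hand, reciprocal (“mutual”) links fall out of a simple check for edges that run in both directions. Here’s a minimal sketch, assuming a NetworkX DiGraph named g; the toy edges are purely illustrative:

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([('a', 'b'), ('b', 'a'), ('a', 'c')])  # illustrative toy data

# A pair is "mutual" if edges exist in both directions; the n1 < n2
# test keeps each pair from being reported twice
mutual = [(n1, n2) for (n1, n2) in g.edges()
          if g.has_edge(n2, n1) and n1 < n2]

print mutual  # [('a', 'b')]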

Note

Windows users may want to review Visualizing Tweet Graphs for an explanation of why the workaround for nx.drawing.write_dot is necessary in Example 2-4.

Example 2-4. Using a breadth-first search to crawl XFN links (microformats__xfn_crawl.py)

# -*- coding: utf-8 -*-

import sys
import os
import urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser
import networkx as nx

ROOT_URL = sys.argv[1]

if len(sys.argv) > 2:
    MAX_DEPTH = int(sys.argv[2])
else:
    MAX_DEPTH = 1

XFN_TAGS = set([
    'colleague',
    'sweetheart',
    'parent',
    'co-resident',
    'co-worker',
    'muse',
    'neighbor',
    'sibling',
    'kin',
    'child',
    'date',
    'spouse',
    'me',
    'acquaintance',
    'met',
    'crush',
    'contact',
    'friend',
    ])

OUT = "graph.dot"

depth = 0

g = nx.DiGraph()

next_queue = [ROOT_URL]

while depth < MAX_DEPTH:

    depth += 1
    (queue, next_queue) = (next_queue, [])

    for item in queue:
        try:
            page = urllib2.urlopen(item)
        except urllib2.URLError:
            print 'Failed to fetch ' + item
            continue

        try:
            soup = BeautifulSoup(page)
        except HTMLParser.HTMLParseError:
            print 'Failed to parse ' + item
            continue

        anchorTags = soup.findAll('a')

        if not g.has_node(item):
            g.add_node(item)

        for a in anchorTags:
            if a.has_key('rel'):
                tags = a['rel'].split()
                if len(set(tags) & XFN_TAGS) > 0:
                    friend_url = a['href']

                    # Queue only nodes that haven't already been seen,
                    # per the pseudocode in Example 2-3
                    if not g.has_node(friend_url):
                        next_queue.append(friend_url)

                    g.add_edge(item, friend_url)
                    g[item][friend_url]['label'] = a['rel'].encode('utf-8')
                    g.node[friend_url]['label'] = a.contents[0].encode('utf-8')

        # Further analysis of the graph could be accomplished here

if not os.path.isdir('out'):
    os.mkdir('out')

try:
    nx.drawing.write_dot(g, os.path.join('out', OUT))
except ImportError:

    # Help for Windows users:
    # Not a general-purpose method, but it produces the same output
    # that write_dot would provide for this graph

    dot = []
    for (n1, n2) in g.edges():
        dot.append('"%s" [label="%s"]' % (n2, g.node[n2]['label']))
        dot.append('"%s" -> "%s" [label="%s"]' % (n1, n2, g[n1][n2]['label']))

    f = open(os.path.join('out', OUT), 'w')
    f.write('strict digraph {\n%s\n}' % ';\n'.join(dot))
    f.close()

# *nix users could produce an image file with a good layout 
# as follows from a terminal:
# $ circo -Tpng -Ograph graph.dot
# Windows users could use the same options with circo.exe 
# or use the GVedit desktop application

Note

The use of the attribute label for the nodes and edges in the DiGraph isn’t arbitrary. Attributes included in nodes and edges are serialized into the DOT language, and Graphviz tools recognize label as a special attribute.
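For example, given the sample output shown earlier, the edge connecting http://ajaxian.com/ to Dion Almaer’s blog would be serialized into the DOT file along these lines (the exact formatting may vary slightly depending on which writer produced the file):

"http://www.almaer.com/blog/" [label="Dion Almaer"];
"http://ajaxian.com/" -> "http://www.almaer.com/blog/" [label="me"]

Graphviz tools pick up those label values and render them as the node and edge captions you see in Figure 2-1.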

Figure 2-1. Digraph of Ajaxian XFN data—note that the edge label “Me” between http://ajaxian.com and Dion Almaer indicates that Dion owns the blog

Despite its simplicity, the graph is very interesting. It connects eight people, with a gentleman by the name of Dion Almaer being the common thread. Note, however, that crawling out one or more levels might introduce other nodes in the graph that are connected to everyone else. From the graph alone, it’s unclear whether Dion might have a closer relationship with Ben Galbraith than the other folks because of the difference between “co-worker” and “friend” relationships, but we could nonetheless crawl Ben’s XFN information, if he has provided any at the destination identified by his hyperlink, and search for other co-worker tags to build out a social network of who works with whom. See Chapter 6 for more on mining data as it relates to colleagues and coworkers.
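To make that last idea concrete, here’s a minimal sketch of filtering the crawled graph down to co-worker relationships, assuming the DiGraph g from Example 2-4 (recall that each edge’s label attribute stores the raw space-separated rel value):

# Collect edges whose XFN tags include 'co-worker'
coworker_edges = [(n1, n2) for (n1, n2) in g.edges()
                  if 'co-worker' in g[n1][n2]['label'].split()]

for (n1, n2) in coworker_edges:
    print n1, '->', g.node[n2]['label']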

Brief analysis of breadth-first techniques

We generally won’t take quite this long of a pause to analyze the approach, but given that this example is the first real algorithm we’ve written, and that this isn’t the last we’ll see of it in this book, it’s worthwhile to examine it more closely. In general, there are two criteria for examination that you should always consider for an algorithm: efficiency and effectiveness. Or, to put it another way: performance and quality.

Standard performance analysis of any algorithm generally involves examining its worst-case time and space complexity—in other words, the amount of time it would take the program to execute, and the amount of memory required for execution over a very large data set. The breadth-first approach we’ve taken is essentially a breadth-first search, except that we’re not actually searching for anything in particular because there are no exit criteria beyond expanding the graph out either to a maximum depth or until we run out of nodes. If we were searching for something specific instead of just crawling links indefinitely, that would be considered an actual breadth-first search. In practice, however, almost all searches impose some kind of exit criteria because of the limitations of finite computing resources. Thus, a more common variation of a breadth-first search is called a bounded breadth-first search, which imposes a limit on the maximum depth of the search just as we do in this example.

For a breadth-first search (or breadth-first crawl), both the time and space complexity can be bounded in the worst case by b^d, where b is the branching factor of the graph and d is the depth. If you sketch out an example on paper and think about it, this analysis makes a lot of sense. If every node in a graph had five neighbors, and you only went out to a depth of one, you’d end up with six nodes in all: the root node and its five neighbors. If all five of those neighbors had five neighbors too, and you expanded out another level, you’d end up with 31 nodes in all: the root node, the root node’s five neighbors, and five neighbors for each of the root node’s neighbors. Such analysis may seem pedantic, but in the emerging era of big data, being able to perform at least a crude level of analysis based on what you know of the data set is more important than ever. Table 2-2 provides an overview of how b^d grows for a few sizes of b and d.

Table 2-2. Example branching factor calculations for graphs of varying depths

Branching factor   Nodes for    Nodes for    Nodes for    Nodes for    Nodes for
                   depth = 1    depth = 2    depth = 3    depth = 4    depth = 5
2                  3            7            15           31           63
3                  4            13           40           121          364
4                  5            21           85           341          1365
5                  6            31           156          781          3906
6                  7            43           259          1555         9331
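If you’d like to reproduce or extend these numbers, the arithmetic is just a geometric series: the total number of nodes out to depth d is 1 + b + b**2 + ... + b**d. A quick sketch in Python:

def total_nodes(b, d):
    # Total nodes reachable from the root of a graph with uniform
    # branching factor b, crawled out to depth d (root included)
    return sum(b ** i for i in range(d + 1))

for b in range(2, 7):
    print b, [total_nodes(b, d) for d in range(1, 6)]

Running it prints exactly the rows of Table 2-2.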

While the previous comments pertain primarily to the theoretical bounds of the algorithm, another consideration worth noting is the practical performance of the algorithm for a data set of a fixed size. Mild profiling of the code would reveal that it is primarily I/O bound: the vast majority of its time is spent waiting for the urlopen method to return content to be processed. A later example, Threading Together Conversations, introduces and demonstrates the use of a thread pool to increase performance in exchange for a modicum of complexity.
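To give a rough feel for the idea without getting ahead of that example, the gist is to let a handful of worker threads issue the urlopen calls concurrently while the main program waits for the fetches to complete. The following is a minimal sketch using only the standard library; the names and the worker count are illustrative, not prescriptive:

import urllib2
from threading import Thread
from Queue import Queue

NUM_WORKERS = 4  # an illustrative choice
task_queue, results = Queue(), Queue()

def worker():
    while True:
        url = task_queue.get()
        try:
            results.put((url, urllib2.urlopen(url).read()))
        except urllib2.URLError:
            results.put((url, None))  # signal the failed fetch
        task_queue.task_done()

for _ in range(NUM_WORKERS):
    t = Thread(target=worker)
    t.setDaemon(True)  # don't block interpreter exit
    t.start()

for url in ['http://ajaxian.com/']:  # ...plus whatever else needs fetching
    task_queue.put(url)
task_queue.join()  # block until every fetch has been processed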

A final consideration in analysis is the overall quality of the results. From the standpoint of quality analysis, basic visual inspection of the graph output reveals that there actually is a graph that connects people. Mission accomplished? Yes, but there’s always room for improvement. One consideration is that slight variations in URLs can result in multiple nodes appearing in the graph for the same person. For example, if Matthew is referenced in one hyperlink with the URL http://example.com/~matthew but as http://www.example.com/~matthew in another, those two nodes will remain distinct in the graph even though they most likely point to the same resource on the Web. Fortunately, XFN defines a special rel="me" value that can be used for identity consolidation; Google’s Social Graph API takes this very approach to connect a user’s various profiles, and many services use rel="me" to let users connect profiles across multiple external sites. Another (much lesser) issue in resolving URLs is the use or omission of a trailing slash. Most well-designed sites automatically redirect one to the other, so this detail is mostly a nonissue.
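A crude first step toward consolidating near-duplicate URLs is to normalize them before adding nodes to the graph. Here’s a sketch; the rules it applies (lowercasing the host, dropping a leading "www.", trimming a trailing slash) are illustrative rather than exhaustive:

from urlparse import urlparse, urlunparse

def normalize(url):
    parts = urlparse(url)
    netloc = parts.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]
    path = parts.path.rstrip('/')
    return urlunparse((parts.scheme, netloc, path,
                       parts.params, parts.query, parts.fragment))

print normalize('http://www.example.com/~matthew/')
# http://example.com/~matthew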

Fortunately, others have also recognized these kinds of problems and decided to do something about them. SocialGraph Node Mapper is an interesting open source project that standardizes URLs relative to trailing slashes, the presence of “www”, etc., but it also recognizes that various social networking sites might expose different URLs that all link back to the same person. For example, http://blog.example.com/matthew and http://status.example.com?user=matthew might resolve to the same person for a given social networking site.



[10] See http://codex.wordpress.org/Links_Add_New_SubPanel for an example of one popular plug-in.
