Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)

One of the most fascinating aspects of data mining is that it affords you the ability to discover new knowledge from existing information. There really is something to be said for the old adage that “knowledge is power,” and it’s especially true in an age where the amount of information available is steadily growing with no indication of decline. As an interesting exercise, let’s see what we can discover about some of the latent social networks that exist in the sea of Twitter data. The basic approach we’ll take is to collect some focused data on two or more topics in a specific way by searching on a particular hashtag, and then apply some of the same metrics we coded up in the previous section (where we analyzed Tim’s tweets) to get a feel for the similarity between the networks.

Since there’s no such thing as a “stupid question,” let’s move forward in the spirit of famed economist Steven D. Levitt[33] and ask the question, “What do #TeaParty and #JustinBieber have in common?”[34]

Example 5-14 provides a simple mechanism for collecting approximately the most recent 1,500 tweets (the maximum currently returned by the search API) on a particular topic and storing them away in CouchDB. Like other listings you’ve seen earlier in this chapter, it includes simple map/reduce logic to incrementally update the tweets in the event that you’d like to run it over a longer period of time to collect a larger batch of data than the search API can give you in a short duration. You might want to investigate the streaming API for this type of task.

Example 5-14. Harvesting tweets for a given query (the_tweet__search.py)

# -*- coding: utf-8 -*-

import sys
import twitter
import couchdb
from couchdb.design import ViewDefinition
from twitter__util import makeTwitterRequest

SEARCH_TERM = sys.argv[1]
MAX_PAGES = 15

KW = {
    'domain': 'search.twitter.com',
    'count': 200,
    'rpp': 100,
    'q': SEARCH_TERM,
    }

server = couchdb.Server('http://localhost:5984')
DB = 'search-%s' % (SEARCH_TERM.lower().replace('#', '').replace('@', ''), )

try:
    db = server.create(DB)
except couchdb.http.PreconditionFailed, e:

    # already exists, so append to it, and be mindful of duplicates

    db = server[DB]

t = twitter.Twitter(domain='search.twitter.com')

for page in range(1, MAX_PAGES + 1):
    KW['page'] = page
    tweets = makeTwitterRequest(t, t.search, **KW)

    # stop paging as soon as the search API runs out of results
    if len(tweets['results']) == 0:
        break

    db.update(tweets['results'], all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets['results'])

The following sections are based on approximately 3,000 tweets per topic and assume that you’ve run the script to collect data on #TeaParty and #JustinBieber (or any other topics that interest you).
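Because reaching roughly 3,000 tweets means running the script more than once, the same tweet can come back in overlapping batches. One way to stay mindful of duplicates, sketched below, is to give each document a deterministic `_id` so that a repeated tweet maps onto the same document key instead of a fresh auto-generated one. The helper name `with_couch_ids` is hypothetical, the sketch assumes each search result carries a unique `id` field, and it is not part of Example 5-14.

```python
def with_couch_ids(results):
    """Return copies of the search results keyed by tweet id.

    Assumes every result dict has a unique 'id' field (an assumption about
    the search API payload). Duplicates within the batch are dropped.
    """
    seen = set()
    docs = []
    for tweet in results:
        doc_id = str(tweet['id'])
        if doc_id in seen:
            continue  # same tweet returned twice in this batch
        seen.add(doc_id)
        doc = dict(tweet)      # shallow copy; leave the caller's data alone
        doc['_id'] = doc_id    # CouchDB uses _id as the document key
        docs.append(doc)
    return docs


# Made-up results for illustration only
sample = [
    {'id': 101, 'text': 'first #TeaParty tweet'},
    {'id': 102, 'text': 'second #TeaParty tweet'},
    {'id': 101, 'text': 'first #TeaParty tweet'},
]
docs = with_couch_ids(sample)
print([d['_id'] for d in docs])
```

The returned list could then be handed to `db.update` in place of `tweets['results']`.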

Warning

Depending on your terminal preferences, you may need to escape certain characters (such as the hash symbol) because of the way they might be interpreted by your shell. For example, in Bash, you’d need to escape a hashtag query for #TeaParty as \#TeaParty (or wrap it in single quotes as '#TeaParty') to ensure that the shell interprets the hash symbol as part of the query term, instead of as the beginning of a comment.
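If you build the command from another Python program rather than typing it by hand, the standard library's `shlex.quote` (available on Python 3) produces a shell-safe form of the query. This is a small sketch of the quoting step only; the command string is illustrative.

```python
import shlex

query = '#TeaParty'

# shlex.quote wraps the term in single quotes when it contains characters
# the shell would otherwise interpret, such as the hash symbol
command = 'python the_tweet__search.py ' + shlex.quote(query)
print(command)
```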

What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?

One of the simplest yet probably most effective ways to characterize two different crowds is to examine the entities that appear in an aggregate pool of tweets. In addition to giving you a good idea of the other topics that each crowd is talking about, you can compare the entities that do co-occur to arrive at a very rudimentary similarity metric. Example 5-4 already provides the logic we need to perform a first pass at entity analysis. Assuming you’ve run search queries for #JustinBieber and #TeaParty, you should have two CouchDB databases called “search-justinbieber” and “search-teaparty” that you can pass in to produce your own results. Sample results for each hashtag with an entity frequency greater than 20 follow in Tables 5-3 and 5-4; Figure 5-4 displays a chart conveying the underlying frequency distributions for these tables. Because the y-axis contains such extremes, it is adjusted to be a logarithmic scale, which makes the y values easier to read.

Table 5-3. Most frequent entities appearing in tweets containing #TeaParty

Entity                          Frequency
#teaparty                       2834
#tcot                           2426
#p2                             911
#tlot                           781
#gop                            739
#ocra                           649
#sgp                            567
#twisters                       269
#dnc                            175
#tpp                            170
#GOP                            150
#iamthemob                      123
#ucot                           120
#libertarian                    112
#obama                          112
#vote2010                       109
#TeaParty                       106
#hhrs                           104
#politics                       100
#immigration                    97
#cspj                           96
#acon                           91
#dems                           82
#palin                          79
#topprog                        78
#Obama                          74
#tweetcongress                  72
#jcot                           71
#Teaparty                       62
#rs                             60
#oilspill                       59
#news                           58
#glennbeck                      54
#FF                             47
#liberty                        47
@welshman007                    45
#spwbt                          44
#TCOT                           43
http://tinyurl.com/24h36zq      43
#rnc                            42
#military                       40
#palin12                        40
@Drudge_Report                  39
@ALIPAC                         35
#majority                       35
#NoAmnesty                      35
#patriottweets                  35
@ResistTyranny                  34
#tsot                           34
http://tinyurl.com/386k5hh      31
#conservative                   30
#AZ                             29
#TopProg                        29
@JIDF                           28
@STOPOBAMA2012                  28
@TheFlaCracker                  28
#palin2012                      28
@thenewdeal                     27
#AFIRE                          27
#Dems                           27
#asamom                         26
#GOPDeficit                     25
#wethepeople                    25
@andilinks                      24
@RonPaulNews                    24
#ampats                         24
#cnn                            24
#jews                           24
@First_Patriots                 23
#patriot                        23
#pjtv                           23
@Liliaep                        22
#nvsen                          22
@BrnEyeSuss                     21
@crispix49                      21
@koopersmith                    21
@Kriskxx                        21
#Kagan                          21
@blogging_tories                20
#cdnpoli                        20
#fail                           20
#nra                            20
#roft                           20

Table 5-4. Most frequent entities appearing in tweets containing #JustinBieber

Entity                          Frequency
#justinbieber                   1613
#JustinBieber                   1379
@lojadoaltivo                   354
@ProSieben                      258
#JUSTINBIEBER                   191
#Proform                        191
http://migre.me/TJwj            191
#Justinbieber                   107
#nowplaying                     104
@justinbieber                   99
#music                          88
#tickets                        80
@_Yassi_                        78
#musicmonday                    78
#video                          78
#Dschungel                      74
#Celebrity                      42
#beliebers                      38
#BieberFact                     38
@JustBieberFact                 32
@TinselTownDirt                 32
@rheinzeitung                   28
#WTF                            28
http://tinyurl.com/343kax4      28
#Telezwerge                     26
#Escutando                      22
#justinBieber                   22
#Restart                        22
#TT                             22
http://bit.ly/aARD4t            21
http://bit.ly/b2Kc1L            21
#bieberblast                    20
#Eclipse                        20
#somebodytolove                 20

Figure 5-4. Distribution of entities co-occurring with #JustinBieber and #TeaParty

What’s immediately obvious is that the #TeaParty tweets seem to have a lot more area “under the curve” and a much longer tail[35] (if you can even call it a tail) than the #JustinBieber tweets. Thus, at a glance, it would seem that the average number of hashtags for #TeaParty tweets would be higher than for #JustinBieber tweets. The next section investigates this assumption, but before we move on, let’s make a few more observations about these results. A cursory qualitative assessment of the results seems to indicate that the information encoded into the entities themselves is richer for #TeaParty. For example, in #TeaParty entities, we see topics such as #oilspill, #Obama, #palin, #libertarian, and @Drudge_Report, among others. In contrast, many of the most frequently occurring #JustinBieber entities are simply variations of #JustinBieber, with the rest of the hashtags being somewhat scattered and unfocused. Keep in mind, however, that this isn’t all that unexpected, given that #TeaParty is a very political topic whereas #JustinBieber is associated with pop culture and entertainment.
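One rough way to put a number on the "longer tail" impression is to measure how much of the total entity frequency falls outside the most frequent few items. The sketch below is only an illustration of that idea: the function name `tail_share` and both frequency lists are hypothetical, not the values behind Figure 5-4.

```python
def tail_share(frequencies, head_size):
    """Fraction of the total frequency that lies outside the top head_size items."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[head_size:]) / float(total)


# Hypothetical frequency lists, not the values from Tables 5-3 and 5-4
concentrated = [500, 300, 100, 50, 25, 25]   # a few dominant entities
spread_out = [10] * 20                        # many equally common entities

print('%.2f' % tail_share(concentrated, 3))
print('%.2f' % tail_share(spread_out, 3))
```

A distribution whose share stays high as `head_size` grows has more of its area in the tail.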

Some other observations are that a couple of user entities (@lojadoaltivo and @ProSieben) appear in the top few results—higher than the “official” @justinbieber account itself—and that many of the entities that co-occur most often with #JustinBieber are non-English words or user entities, often associated with the entertainment industry.
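The observation that many of the top #JustinBieber entities are case variants of the same hashtag suggests folding case before counting. The sketch below merges the five variants listed in Table 5-4; the function name `fold_case` is hypothetical, and whether to fold case at all is a judgment call, since doing so would also merge pairs such as #GOP and #gop in Table 5-3.

```python
from collections import Counter

# Case variants of the hashtag and their frequencies, copied from Table 5-4
table_5_4_variants = {
    '#justinbieber': 1613,
    '#JustinBieber': 1379,
    '#JUSTINBIEBER': 191,
    '#Justinbieber': 107,
    '#justinBieber': 22,
}

def fold_case(entity_counts):
    """Merge counts for entities that differ only by letter case."""
    merged = Counter()
    for entity, count in entity_counts.items():
        merged[entity.lower()] += count
    return merged

merged = fold_case(table_5_4_variants)
print(merged['#justinbieber'])  # 3312
```

Note that the merged figure counts entity occurrences rather than tweets, so a tweet containing two variants contributes twice.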

Having briefly scratched the surface of a qualitative assessment, let’s now return to the question of whether there are definitively more hashtags per tweet for #TeaParty than #JustinBieber.

On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?

Example 5-13 provides a working implementation for counting the average number of hashtags per tweet and can be readily applied to the search-justinbieber and search-teaparty databases without any additional work required.

Tallying the results for the two databases reveals that #JustinBieber tweets average around 1.95 hashtags per tweet, while #TeaParty tweets have around 5.15 hashtags per tweet. That’s roughly 2.6 times as many hashtags for #TeaParty tweets as for #JustinBieber tweets. Although this isn’t necessarily the most surprising find in the world, having firm data points on which to base further explorations or to back up conjectures is helpful: they are quantifiable results that can be tracked over time, or shared and reassessed by others.
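For readers who don't have Example 5-13 at hand, the calculation can be approximated as follows. This is a minimal sketch, not the listing from Example 5-13: it works on a plain list of tweet texts rather than a CouchDB database, the regular expression is a simplification of real hashtag rules, and the sample tweets are made up.

```python
import re

# Simplified pattern: a hash symbol followed by word characters.
# Real-world hashtag rules are more involved; this is an approximation.
HASHTAG = re.compile(r'#\w+')

def average_hashtags(texts):
    """Mean number of hashtags per tweet for a list of tweet texts."""
    if not texts:
        return 0.0
    total = sum(len(HASHTAG.findall(text)) for text in texts)
    return total / float(len(texts))

# Made-up tweets for illustration only
sample = [
    'Rally downtown this weekend #teaparty #tcot #gop',
    'Listening to the new single #nowplaying',
    'No hashtags in this one',
]
print(average_hashtags(sample))
```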

Although the difference in this case is striking, keep in mind that the data collected is whatever Twitter handed us back as the most recent ~3,000 tweets for each topic via the search API. The sample is probably a very good indicator, and the difference may well be statistically significant, but we haven’t tested that here. Whether they realize it or not, #TeaParty Twitterers are big believers in folksonomies: they clearly have a vested interest in ensuring that their content is easily accessible and cross-referenced, both through search APIs and by data hackers such as ourselves.

Which Gets Retweeted More Often: #JustinBieber or #TeaParty?

Earlier in this chapter, we made the reasonable conjecture that tweets that are retweeted with high frequency are likely to be more influential and more informative or editorial in nature than ones that are not. Tweets such as “Eating a pretzel” and “Aliens have just landed on the White House front lawn; we are all going to die! #fail #apocalypse” are extreme examples of content that is fairly unlikely and fairly likely to be retweeted, respectively. How does #TeaParty compare to #JustinBieber for retweets? Analyzing @mentions from the working set of search results again produces interesting results. Truncated results showing which users have retweeted #TeaParty and #JustinBieber most often, using a frequency threshold of 10, appear in Tables 5-5 and 5-6.

Table 5-5. Most frequent retweeters of #TeaParty

Entity                          Frequency
@teapartyleader                 10
@dhrxsol1234                    11
@HCReminder                     11
@ObamaBallBuster                11
@spitfiremurphy                 11
@GregWHoward                    12
@BrnEyeSuss                     13
@Calroofer                      13
@grammy620                      13
@Herfarm                        14
@andilinks                      16
@c4Liberty                      16
@FloridaPundit                  16
@tlw3                           16
@Kriskxx                        18
@crispix49                      19
@JIDF                           19
@libertyideals                  19
@blogging_tories                20
@Liliaep                        21
@STOPOBAMA2012                  22
@First_Patriots                 23
@RonPaulNews                    23
@TheFlaCracker                  24
@thenewdeal                     25
@ResistTyranny                  29
@ALIPAC                         32
@Drudge_Report                  38
@welshman007                    39

Table 5-6. Most frequent retweeters of #JustinBieber

Entity                          Frequency
@justinbieber                   14
@JesusBeebs                     16
@LeePhilipEvans                 16
@JustBieberFact                 32
@TinselTownDirt                 32
@ProSieben                      122
@lojadoaltivo                   189

If you do some back-of-the-envelope analysis by running Example 5-4 on the ~3,000 tweets for each topic, you’ll discover that about 1,335 of the #TeaParty tweets are retweets, while only about 763 of the #JustinBieber tweets are retweets. That’s roughly 1.75 times as many retweets for #TeaParty as for #JustinBieber. You’ll also observe that #TeaParty has a much longer tail, checking in with over 400 total retweets against #JustinBieber’s 131 retweets. Regardless of statistical rigor, intuitively, those are probably pretty relevant indicators that size up the different interest groups in meaningful ways. It would seem that #TeaParty folks more consistently retweet content than #JustinBieber folks; however, of the #JustinBieber folks who do retweet content, there are clearly a few outliers who retweet much more frequently than others. Figure 5-5 displays a simple chart of the values from Tables 5-5 and 5-6. As with Figure 5-4, the y-axis is a log scale, which makes the chart a little more readable by squashing the frequency values to require less vertical space.
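A tally of this kind can be sketched with a regular expression over tweet texts. The pattern below relies on the common "RT @user" and "via @user" conventions, which is an assumption rather than Twitter's own definition of a retweet; the function name `retweet_stats` and the sample tweets are hypothetical, and this is not the code from Example 5-4.

```python
import re
from collections import Counter

# Matches "RT @user" or "via @user" and captures the screen name.
# This convention-based pattern is an assumption, not Twitter's own definition.
RT_PATTERN = re.compile(r'\b(?:RT|via)\b\W*@(\w+)', re.IGNORECASE)

def retweet_stats(texts):
    """Return (number of tweets that look like retweets, Counter of credited users)."""
    retweet_count = 0
    credited = Counter()
    for text in texts:
        users = RT_PATTERN.findall(text)
        if users:
            retweet_count += 1
            credited.update('@' + user for user in users)
    return retweet_count, credited

# Made-up tweets for illustration only
sample = [
    'RT @Drudge_Report: headline of the day #teaparty',
    'Great read (via @ALIPAC) #teaparty #tcot',
    'RT @Drudge_Report: another headline #teaparty',
    'Just my own thoughts #teaparty',
]
count, credited = retweet_stats(sample)
print(count)
print(credited.most_common(2))
```

Filtering `credited` by a minimum frequency gives a truncated listing in the spirit of Tables 5-5 and 5-6.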

Figure 5-5. Distribution of users who have retweeted #JustinBieber and #TeaParty

How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?

A final looming question that might be keeping you up at night is how much overlap exists between the entities parsed out of the #TeaParty and #JustinBieber tweets. Borrowing from some of the ideas in Chapter 4, we’re essentially asking for the logical intersection of the two sets of entities. Although we could certainly compute this by taking the time to adapt existing Python code, it might be even easier to just capture the results of the scripts we already have on hand into two files and pass those filenames as parameters into a disposable script that provides a general-purpose facility for computing the intersection of the lines in any set of line-delimited files. In addition to getting the job done, this approach also leaves you with artifacts that you can casually peruse and readily share with others. Assuming you are working in a *nix shell with the script the_tweet__count_entities_in_tweets.py, one approach for capturing the entities from the #TeaParty and #JustinBieber output of Example 5-4 and storing them in sorted order follows:

#!/bin/bash

mkdir -p out
for db in teaparty justinbieber; do
    python the_tweet__count_entities_in_tweets.py search-$db 0 |
    tail -n +3 | awk '{print $2}' | sort > out/$db.entities
done

After you’ve run this script, you can pass the two filenames into the general-purpose Python program to compute the output, as shown in Example 5-15.

Example 5-15. Computing the set intersection of lines in files (the_tweet__compute_intersection_of_lines_in_files.py)

# -*- coding: utf-8 -*-

"""
Read in 2 or more files and compute the logical intersection of the lines in them
"""

import sys

filenames = sys.argv[1:]

# Map each filename to the set of its lines; stripping whitespace keeps a
# missing final newline from causing a spurious mismatch
data = {}
for filename in filenames:
    with open(filename) as f:
        data[filename] = set(line.strip() for line in f)

# Start from the first file's lines and narrow down with each remaining file
intersection = set(data[filenames[0]]) if filenames else set()
for filename in filenames[1:]:
    intersection &= data[filename]

msg = 'Common items shared amongst %s:' % ', '.join(filenames)
print(msg)
print('-' * len(msg))
for i in intersection:
    print(i)

The entities shared between #JustinBieber and #TeaParty are somewhat predictable, yet interesting. Example 5-16 lists the results from our sample.

Example 5-16. Sample results from Example 5-15

Common items shared amongst teaparty.entities, justinbieber.entities:
---------------------------------------------------------------------
#lol
#jesus
#worldcup
#teaparty
#AZ
#milk
#ff
#guns
#WorldCup
#bp
#News
#dancing
#music
#glennbeck
http://www.linkati.com/q/index
@addthis
#nowplaying
#news
#WTF
#fail
#toomanypeople
#oilspill
#catholic

It shouldn’t be surprising that #WorldCup, #worldcup, and #oilspill are in the results, given that they’re pretty popular topics; however, having #teaparty, #glennbeck, #jesus, and #catholic show up on the list of shared hashtags might be somewhat of a surprise if you’re not that familiar with the Tea Party movement. Further analysis could very easily determine exactly how strong the correlations are between the two searches by accounting for how frequently certain hashtags appear in each search. One thing that’s immediately clear from these results is that only #nowplaying and #music appear among the 20 most frequent entities associated with #JustinBieber in Table 5-4, so most of the common entities are out in the tail of the frequency distribution for #JustinBieber mentions. (And yes, having #WTF and #fail show up on the list at all, especially as a common thread between two diverse groups, is sort of funny. Experiencing frustration is, unfortunately, a common thread of humanity.) If you want to dig deeper, as a further exercise you might reuse Example 5-7 to enable full-text indexing on the tweets in order to search by keyword.
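As one possible way to account for how frequently shared hashtags appear in each search, the two sets of entity counts can be treated as vectors and compared with cosine similarity. This is a swapped-in technique chosen for illustration, not a method the text prescribes; the function name and the counts below are hypothetical and are not the sample data.

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two entity -> frequency dicts."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[e] * freq_b[e] for e in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical counts for illustration; not the sample data from the text
topic_a = {'#teaparty': 2800, '#tcot': 2400, '#news': 60, '#fail': 20}
topic_b = {'#justinbieber': 1600, '#music': 90, '#news': 5, '#fail': 3}

print(cosine_similarity(topic_a, topic_b))
```

A value near 0 means the shared entities carry little weight in either search, while a value near 1 means the two frequency profiles are nearly proportional.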



[33] Steven D. Levitt is the co-author of Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (Harper), a book that systematically uses data to answer seemingly radical questions such as, “What do school teachers and sumo wrestlers have in common?”

[34] This question was partly inspired by the interesting Radar post, “Data science democratized”, which mentions a presentation that investigated the same question.

[35] A “long tail” or “heavy tail” refers to a feature of statistical distributions in which a significant portion (usually 50 percent or more) of the area under the curve exists within its tail. This concept is revisited as part of a brief overview of Zipf’s law in Data Hacking with NLTK.
