Tweet and RT were sitting on a fence. Tweet fell off. Who was left?
In this chapter, we’ll largely use CouchDB’s map/reduce capabilities to exploit the entities in tweets (@mentions, #hashtags, etc.) to try to answer the question, “What’s everyone talking about?” With overall throughput now far exceeding 50 million tweets per day and occasional peak velocities in excess of 3,000 tweets per second, there’s vast potential in mining tweet content, and this is the chapter where we’ll finally dig in. Whereas the previous chapter primarily focused on the social graph linkages that exist among friends and followers, this chapter focuses on learning as much as possible about Twitterers by inspecting the entities that appear in their tweets. You’ll also see ties back to Redis for accessing user data you harvested in Chapter 4 and to NetworkX for graph analytics. So many tweets, so little time to mine them. Let’s get started!
It is highly recommended that you read Chapters 3 and 4 before reading this chapter. Much of its discussion builds upon the foundation those chapters established, including Redis and CouchDB, which are again used in this chapter.
If the pen is mightier than the sword, what does that say about the tweet? There are a number of interesting incidents in which Twitter has saved lives, one of the best known being James Karl Buck’s “Arrested” tweet, which led to his speedy release when he was detained by Egyptian authorities. It doesn’t take too much work to find evidence of similar incidents, as well as countless uses of Twitter for noble fundraising efforts and other benevolent causes. Having an outlet really can make a huge difference sometimes. More often than not, though, your home time line (tweet stream) and the public time line are filled with information that’s not quite so dramatic or intriguing. At times like these, cutting out some of the cruft can help you glimpse the big picture.
Given that as many as 50 percent of all tweets contain at least one entity that has been intentionally crafted by the tweet author, entities make a very logical starting point for tweet analysis. In fact, Twitter has recognized their value and begun to expose them directly in the time line API calls, and over the course of 2010 they became increasingly standard throughout the entire Twitter API. Consider the tweet in Example 5-1, retrieved from a time line API call with the opt-in include_entities=true parameter specified in the query.
Example 5-1. A sample tweet from a time line API call that illustrates tweet entities
{
    "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
    "entities" : {
        "hashtags" : [
            { "indices" : [ 97, 103 ], "text" : "gov20" },
            { "indices" : [ 104, 112 ], "text" : "opengov" }
        ],
        "urls" : [
            { "expanded_url" : null, "indices" : [ 76, 96 ], "url" : "http://bit.ly/9o4uoG" }
        ],
        "user_mentions" : [
            { "id" : 28165790, "indices" : [ 16, 28 ], "name" : "crowdFlower", "screen_name" : "crowdFlower" }
        ]
    },
    "id" : 16932571217,
    "text" : "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov",
    "user" : {
        "description" : "Founder and CEO, O'Reilly Media. Watching the alpha ...",
        "id" : 2384071,
        "location" : "Sebastopol, CA",
        "name" : "Tim O'Reilly",
        "screen_name" : "timoreilly",
        "url" : "http://radar.oreilly.com"
    }
}
By default, a tweet specifies a lot of useful information about its author via the user field in the status object, but the tweet entities provide insight into the content of the tweet itself. By briefly inspecting this one sample tweet, we can safely infer that @timoreilly is probably interested in the transformational topics of open government and Government 2.0, as indicated by the hashtags included in the tweet. It’s probably also safe to infer that @crowdflower has some relation to Government 2.0 and that the URL may point to such related content. Thus, if you wanted to discover some additional information about the author of this tweet in an automated fashion, you could consider pivoting from @timoreilly over to @crowdflower and exploring that user’s tweets or profile information, spawning a search on the hashtags included in the tweet to see what kind of other information pops up, or following the link and doing some page scraping to learn more about the underlying context of the tweet.
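As a rough illustration of those pivot points, the following sketch pulls the author, mentions, hashtags, and URLs out of a tweet shaped like the one in Example 5-1. The sample_tweet dictionary is a trimmed copy of that example, and pivot_points is a hypothetical helper name introduced here for illustration; it is not part of any Twitter library.

```python
# A trimmed copy of the tweet in Example 5-1 (most fields omitted for brevity)
sample_tweet = {
    "text": "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov",
    "entities": {
        "hashtags": [
            {"indices": [97, 103], "text": "gov20"},
            {"indices": [104, 112], "text": "opengov"},
        ],
        "urls": [
            {"expanded_url": None, "indices": [76, 96], "url": "http://bit.ly/9o4uoG"},
        ],
        "user_mentions": [
            {"id": 28165790, "indices": [16, 28],
             "name": "crowdFlower", "screen_name": "crowdFlower"},
        ],
    },
    "user": {"screen_name": "timoreilly"},
}


def pivot_points(tweet):
    """Collect the values you might pivot on (hypothetical helper)."""
    entities = tweet.get("entities", {})
    return {
        "author": tweet["user"]["screen_name"],
        "mentions": [um["screen_name"] for um in entities.get("user_mentions", [])],
        "hashtags": [ht["text"] for ht in entities.get("hashtags", [])],
        # Prefer the expanded URL when present; fall back to the short one
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
    }


points = pivot_points(sample_tweet)
print(points["author"])
print(points["mentions"])
print(points["hashtags"])
print(points["urls"])
```

Each list is a natural seed for a follow-up query: the mentions for a profile lookup, the hashtags for a search, and the URLs for page scraping.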
Given that there’s so much value to be gained from analyzing tweet entities, you’ll sorely miss them when working with APIs that don’t provide them or with historical archives of Twitter data, which are becoming more and more common to mine. Instead of manually parsing them out of the text yourself (not such an easy thing to do when tweets contain arbitrary Unicode characters), however, just easy_install twitter-text-py [29] so that you can focus your efforts on far more interesting problems. The script in Example 5-2 illustrates some basic usage of its Extractor class, which produces a structure similar to the one exposed by the time line APIs. You have everything to gain and nothing to lose by automatically embedding entities in this manner until tweet entities become the default.
As of December 2010, tweet entities were becoming more and more common through the APIs, but were not yet officially “blessed” as the norm. This chapter was written with the assumption that you’d want to know how to parse them out for yourself, but you should realize that keeping up with the latest happenings with the Twitter API might save you some work. Manual extraction of tweet entities might also be very helpful for situations in which you’re mining historical archives from organizations such as Infochimps or GNIP.
Example 5-2. Extracting tweet entities with a little help from the twitter_text package (the_tweet__extract_tweet_entities.py)
# -*- coding: utf-8 -*-

import sys
import json
import twitter_text
import twitter
from twitter__login import login

# Get a tweet id by clicking on a status right off of twitter.com.
# For example, http://twitter.com/#!/timoreilly/status/17386521699024896

TWEET_ID = sys.argv[1]

# You may need to set up your OAuth settings in twitter__login.py
t = login()

def getEntities(tweet):

    # Now extract various entities from it and build up a familiar structure

    extractor = twitter_text.Extractor(tweet['text'])

    # Note that the production Twitter API contains a few additional fields in
    # the entities hash that would require additional API calls to resolve

    entities = {}
    entities['user_mentions'] = []
    for um in extractor.extract_mentioned_screen_names_with_indices():
        entities['user_mentions'].append(um)

    entities['hashtags'] = []
    for ht in extractor.extract_hashtags_with_indices():

        # Massage field name to match production Twitter API

        ht['text'] = ht['hashtag']
        del ht['hashtag']
        entities['hashtags'].append(ht)

    entities['urls'] = []
    for url in extractor.extract_urls_with_indices():
        entities['urls'].append(url)

    return entities

# Fetch a tweet using an API method of your choice and mix in the entities

tweet = t.statuses.show(id=TWEET_ID)

tweet['entities'] = getEntities(tweet)

print(json.dumps(tweet, indent=4))
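To make the indices field in these structures more concrete, here is a minimal sketch that uses only the standard library. The regular expressions are deliberately naive assumptions for illustration and are not the rules that twitter-text implements; naive_entities is a hypothetical helper, and the tweet text is made up rather than the full text of Example 5-1.

```python
import re

# Deliberately naive patterns (an assumption for illustration); the real
# twitter-text rules handle Unicode, punctuation, and URLs far more carefully.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def naive_entities(text):
    """Return mentions and hashtags with [start, end) indices (hypothetical helper)."""
    return {
        "user_mentions": [
            {"screen_name": m.group(1), "indices": [m.start(), m.end()]}
            for m in MENTION_RE.finditer(text)
        ],
        "hashtags": [
            {"text": m.group(1), "indices": [m.start(), m.end()]}
            for m in HASHTAG_RE.finditer(text)
        ],
    }


# Made-up tweet text, not the full text of Example 5-1
text = "Great idea from @crowdflower #gov20 #opengov"
entities = naive_entities(text)

for ht in entities["hashtags"]:
    start, end = ht["indices"]
    # Slicing with the indices recovers the entity, marker character included
    print(text[start:end], ht["indices"])
```

The indices mark where the entity begins and one past where it ends, with the leading @ or # included, so slicing the tweet text with them returns the entity as it was written. A pattern this simple would also match the "@" inside an email address, which is one reason to prefer a maintained extractor for real data.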
Now, equipped with an overview of tweet entities and some of the interesting possibilities, let’s get to work harvesting and analyzing some tweets.
[29] The twitter-text-py module is a port of the twitter-text-rb module (both available via GitHub), which Twitter uses in production.