Chapter 5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet

Tweet and RT were sitting on a fence. Tweet fell off. Who was left?

In this chapter, we’ll largely use CouchDB’s map/reduce capabilities to exploit the entities in tweets (@mentions, #hashtags, etc.) to try to answer the question, “What’s everyone talking about?” With overall throughput now far exceeding 50 million tweets per day and occasional peak velocities in excess of 3,000 tweets per second, there’s vast potential in mining tweet content, and this is the chapter where we’ll finally dig in. Whereas the previous chapter primarily focused on the social graph linkages that exist among friends and followers, this chapter focuses on learning as much as possible about Twitterers by inspecting the entities that appear in their tweets. You’ll also see ties back to Redis for accessing the user data you harvested in Chapter 4, and to NetworkX for graph analytics. So many tweets, so little time to mine them. Let’s get started!


It is highly recommended that you read Chapters 3 and 4 before reading this chapter. Much of its discussion builds upon the foundation those chapters established, including Redis and CouchDB, which are again used in this chapter.

Pen : Sword :: Tweet : Machine Gun (?!?)

If the pen is mightier than the sword, what does that say about the tweet? There are a number of interesting incidents in which Twitter has saved lives, one of the best known being James Karl Buck’s “Arrested” tweet, which led to his speedy release after he was detained by Egyptian authorities. It doesn’t take too much work to find evidence of similar incidents, as well as countless uses of Twitter for noble fundraising efforts and other benevolent causes. Having an outlet really can make a huge difference sometimes. More often than not, though, your home time line (tweet stream) and the public time line are filled with information that’s not quite so dramatic or intriguing. At times like these, cutting out some of the cruft can help you glimpse the big picture. Given that as many as 50 percent of all tweets contain at least one entity that has been intentionally crafted by the tweet author, entities make a very logical starting point for tweet analysis. In fact, Twitter has recognized their value and begun to expose them directly in the time line API calls; starting in early 2010, they became increasingly standard throughout the Twitter API as the year unfolded. Consider the tweet in Example 5-1, retrieved from a time line API call with the opt-in include_entities=true parameter specified in the query.

Example 5-1. A sample tweet from a search API that illustrates tweet entities

{
    "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
    "entities" : { 
                    "hashtags" : [ 
                        {   "indices" : [ 97, 103 ],
                            "text" : "gov20"
                        },
                        {   "indices" : [ 104, 112 ],
                            "text" : "opengov"
                        }
                    ],
                    "urls" : [ 
                        {   "expanded_url" : null,
                            "indices" : [ 76, 96 ],
                            "url" : ""
                        }
                    ],
                    "user_mentions" : [ 
                        {   "id" : 28165790,
                            "indices" : [ 16, 28 ],
                            "name" : "crowdFlower",
                            "screen_name" : "crowdFlower"
                        }
                    ]
    },
    "id" : 16932571217,
    "text" : "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov",
    "user" : { 
        "description" : "Founder and CEO, O'Reilly Media. Watching the alpha ...",
        "id" : 2384071,
        "location" : "Sebastopol, CA",
        "name" : "Tim O'Reilly",
        "screen_name" : "timoreilly",
        "url" : ""
    }
}

By default, a tweet specifies a lot of useful information about its author via the user field in the status object, but the tweet entities provide insight into the content of the tweet itself. By briefly inspecting this one sample tweet, we can safely infer that @timoreilly is probably interested in the transformational topics of open government and Government 2.0, as indicated by the hashtags included in the tweet. It’s probably also safe to infer that @crowdflower has some relation to Government 2.0 and that the URL may point to such related content. Thus, if you wanted to discover some additional information about the author of this tweet in an automated fashion, you could consider pivoting from @timoreilly over to @crowdflower and exploring that user’s tweets or profile information, spawning a search on the hashtags included in the tweet to see what kind of other information pops up, or following the link and doing some page scraping to learn more about the underlying context of the tweet.
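To make the idea of pivoting concrete, here is a minimal sketch that collects the screen names, hashtags, and URLs from a tweet shaped like the one in Example 5-1. The pivot_candidates function and the trimmed-down tweet dictionary are hypothetical names introduced purely for illustration; they are not part of any Twitter library, and the sketch assumes the entities field is laid out as in Example 5-1.

```python
# -*- coding: utf-8 -*-

# A trimmed-down stand-in for the tweet in Example 5-1 (only the fields used here)
tweet = {
    "text": "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov",
    "user": {"screen_name": "timoreilly"},
    "entities": {
        "hashtags": [{"indices": [97, 103], "text": "gov20"},
                     {"indices": [104, 112], "text": "opengov"}],
        "urls": [{"expanded_url": None, "indices": [76, 96], "url": ""}],
        "user_mentions": [{"id": 28165790, "indices": [16, 28],
                           "name": "crowdFlower",
                           "screen_name": "crowdFlower"}],
    },
}


def pivot_candidates(tweet):
    """Hypothetical helper: gather starting points for follow-up queries."""
    entities = tweet.get("entities", {})
    return {
        "author": tweet["user"]["screen_name"],
        "mentions": [um["screen_name"]
                     for um in entities.get("user_mentions", [])],
        "hashtags": [ht["text"] for ht in entities.get("hashtags", [])],
        # Skip empty URL strings, such as the blank one in the sample data
        "urls": [u["url"] for u in entities.get("urls", []) if u.get("url")],
    }


candidates = pivot_candidates(tweet)
```

Each list in the result is a natural seed for one of the follow-ups described above: the mentions for exploring other users' tweets or profiles, the hashtags for spawning searches, and the URLs for page scraping.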

Given that there’s so much value to be gained from analyzing tweet entities, you’ll sorely miss them in APIs that don’t return them and in the historical archives of Twitter data that are becoming more and more common to mine. Rather than manually parsing them out of the text yourself (not such an easy thing to do when tweets contain arbitrary Unicode characters), just easy_install twitter-text-py[29] so that you can focus your efforts on far more interesting problems. The script in Example 5-2 illustrates some basic usage of its Extractor class, which produces a structure similar to the one exposed by the time line APIs. You have everything to gain and nothing to lose by automatically embedding entities in this manner until tweet entities become the default.


As of December 2010, tweet entities were becoming more and more common throughout the APIs, but were not yet officially “blessed” as the norm. This chapter was written with the assumption that you’d want to know how to parse them out for yourself, but you should realize that keeping up with the latest happenings with the Twitter API might save you some work. Manual extraction of tweet entities might also be very helpful for situations in which you’re mining historical archives from organizations such as Infochimps or GNIP.
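To show what parsing entities out for yourself involves, and why a dedicated library is the better choice, here is a deliberately naive sketch that uses only the standard library's re module. The patterns and the naive_entities function are illustrative assumptions, not the rules that twitter-text applies: they recognize only ASCII letters, digits, and underscores, and they ignore URLs entirely.

```python
# -*- coding: utf-8 -*-
import re

# Deliberately simple patterns limited to ASCII letters, digits, and
# underscores. These are illustrative assumptions, not twitter-text's rules.
MENTION_RE = re.compile(r'@([A-Za-z0-9_]+)')
HASHTAG_RE = re.compile(r'#([A-Za-z0-9_]+)')


def naive_entities(text):
    """Hypothetical helper: build a partial entities structure with indices."""
    entities = {'user_mentions': [], 'hashtags': []}
    for m in MENTION_RE.finditer(text):
        entities['user_mentions'].append(
            {'screen_name': m.group(1), 'indices': [m.start(), m.end()]})
    for m in HASHTAG_RE.finditer(text):
        entities['hashtags'].append(
            {'text': m.group(1), 'indices': [m.start(), m.end()]})
    return entities


text = "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov"
entities = naive_entities(text)
```

On this abbreviated text, the mention's indices come out as [16, 28], in agreement with Example 5-1, and slicing the text with any pair of indices returns the entity, including its leading @ or # character. The shortcuts show quickly, though: the same patterns would treat the tail of an email address such as bob@example.com as a mention of "example" and would miss hashtags written with non-ASCII characters, which is exactly the kind of detail the library handles for you.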

Example 5-2. Extracting tweet entities with a little help from the twitter_text package

# -*- coding: utf-8 -*-

import sys
import json
import twitter_text
import twitter
from twitter__login import login

# Get a tweet id by clicking on a status on Twitter's website.
# For example, a status URL ending in /timoreilly/status/17386521699024896

TWEET_ID = sys.argv[1]

# You may need to set up your OAuth settings in the twitter__login module
t = login()

def getEntities(tweet):

    # Now extract various entities from it and build up a familiar structure

    extractor = twitter_text.Extractor(tweet['text'])

    # Note that the production Twitter API contains a few additional fields in
    # the entities hash that would require additional API calls to resolve

    entities = {}
    entities['user_mentions'] = []
    for um in extractor.extract_mentioned_screen_names_with_indices():
        entities['user_mentions'].append(um)

    entities['hashtags'] = []
    for ht in extractor.extract_hashtags_with_indices():

        # massage field name to match production twitter api

        ht['text'] = ht['hashtag']
        del ht['hashtag']
        entities['hashtags'].append(ht)

    entities['urls'] = []
    for url in extractor.extract_urls_with_indices():
        entities['urls'].append(url)

    return entities

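# Fetch a tweet using an API method of your choice and mix in the entities.
# The call below assumes that login() returns a twitter.Twitter instance.

tweet = t.statuses.show(id=TWEET_ID)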

tweet['entities'] = getEntities(tweet)

print json.dumps(tweet, indent=4)

Now, equipped with an overview of tweet entities and some of the interesting possibilities, let’s get to work harvesting and analyzing some tweets.

[29] The twitter-text-py module is a port of the twitter-text-rb module (both available via GitHub), which Twitter uses in production.
