Obtaining Twitter data

Before analyzing Twitter data, one must, of course, obtain the data. One reason we like R for social media mining is that it makes obtaining targeted portions of Twitter data (relatively) simple. Besides having the capacity to read standard data types and files from traditional statistical software packages, R can also read many other specialized formats, including relational databases, Hadoop, and web sources such as the Twitter API. This chapter first covers how to obtain Twitter data and then describes some simple exploratory data analysis techniques.

To begin ingesting social media data from Twitter, you will need a developer account on Twitter. You can create one free of charge at https://dev.twitter.com/apps. Once you have a Twitter account, return to that page and sign in with your username and password. Then, simply click on the Create New Application button and enter the requested information. Note that these inputs are neither important nor binding; you simply need to provide a name, description, and website (even just a personal blog) in the required fields.

Once finished, you should see a page with a lot of information about your application. Included here is a section called OAuth settings. These are crucial in helping you authenticate your application with Twitter, thus allowing you to mine tweets. More specifically, these bits of information will authenticate you with the Twitter application programming interface (API). You'll want to copy the consumer key, consumer secret, request token URL, authorize URL, and access token URL to a file and keep them handy.

Now that we have set up an application with Twitter, we need to download the R package that allows us to pull tweets into our local R session. Though there are several packages that do this, we prefer the twitteR package for its ease of use and flexibility. Instructions for downloading packages can be found in Chapter 2, Getting Started with R, but in general, installing packages is done by invoking install.packages("…"). You can download the twitteR package and load it into your R session as follows:

> install.packages("twitteR")
> library(twitteR)

Now, we are just a few lines of R code away from pulling in Twitter data. If you are using a Windows machine, there is an additional preliminary step: downloading a cacert.pem file, which contains the certificate-authority certificates used to verify secure (SSL) Internet transfers, as shown in the following code snippet:

> download.file(url="http://curl.haxx.se/ca/cacert.pem",
    destfile="C:/.../cacert.pem")

In this example, we have saved the file to the C: directory, but you can save it to wherever you have the appropriate permissions on your machine. Also, note the use of the forward slash instead of the Windows-standard backslash in the file locations; R treats the backslash as an escape character. Next, create R objects from your own consumer information, filled in here with XX to indicate a placeholder, as seen in the following lines of code:

> my.key <- "XX"
> my.secret <- "XX"

With that done, pass this information to the OAuthFactory$new constructor. The requestURL, accessURL, and authURL values in the following code snippet are demonstrative, but you should verify them against the information Twitter provided when you authorized your application:

> cred <- OAuthFactory$new(consumerKey=my.key,
    consumerSecret=my.secret,
    requestURL='https://api.twitter.com/oauth/request_token',
    accessURL='https://api.twitter.com/oauth/access_token',
    authURL='https://api.twitter.com/oauth/authorize')
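
Note that OAuthFactory is not defined by twitteR itself; it is supplied by the ROAuth package, which twitteR loads as a dependency. If R reports that the object cannot be found, loading ROAuth explicitly should resolve it:

> # OAuthFactory comes from the ROAuth package, which twitteR depends on
> library(ROAuth)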

Next, run the cred$handshake call that follows this paragraph, supplying the full path to where you saved your cacert.pem file. This will print a URL in the R console that you will have to copy and paste into a browser. Doing so will take you to a Twitter page that supplies a numeric code, which you then copy and paste back into your R session after the cred$handshake call.

> cred$handshake(cainfo="C:/.../cacert.pem")

Finally, save your authentication settings as follows:

> save(cred, file="twitter authentication.Rdata")
> registerTwitterOAuth(cred)
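
In later R sessions, you can skip the handshake entirely: reload the saved credentials object and register it again. The following is a minimal sketch, assuming the .Rdata file sits in your current working directory:

> # in a new session: restore and re-register the saved OAuth credentials
> load("twitter authentication.Rdata")
> registerTwitterOAuth(cred)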

The registerTwitterOAuth function returns a value of TRUE on success; you are now ready to begin mining Twitter data, and after all of these steps, it will seem very simple. The workhorse of the twitteR package is a function called, appropriately, searchTwitter. The standard arguments to the function are a search term, the number of tweets to return, and, on Windows, the path to the cacert.pem file downloaded previously. More information about the function, including how to search specific time frames, geographic locations, and more, can be found by typing ?searchTwitter. For now, let's pull in some tweets with the #bigdata hashtag and save them to an object called bigdata as follows (note that you may leave off the cainfo argument on non-Windows machines):

> bigdata <- searchTwitter("#bigdata", n=1500, cainfo="cacert.pem") 
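
As a sketch of the richer queries mentioned previously, the following call narrows a search by date, location, and language using the since, until, geocode, and lang arguments documented in ?searchTwitter. The dates, coordinates, and object name here are purely illustrative:

> # hypothetical narrower search: English-language #bigdata tweets from
> # one week, within 50 miles of a given latitude/longitude
> bigdata.narrow <- searchTwitter("#bigdata", n=200,
    since="2013-07-15", until="2013-07-22",
    geocode="40.75,-74.00,50mi", lang="en",
    cainfo="cacert.pem")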

We can find out what class or type of object bigdata is by using the class function as follows:

> class(bigdata)
[1] "list"

We easily discover that bigdata is a list, that is, a collection of objects. We can access the first few objects in a list using the head() function as follows:

> head(bigdata) 
  
[[1]]
[1] "Timothy_Hughes: RT @MarketBuildr: Prescriptive versus #predictive #analytics http://t.co/oy7rS691Ht #BigData #Marketing"
  
[[2]]
[1] "DanVesset: already have on my schedule 3 upcoming business trips to #Texas .... where all data is #BigData"

[[3]]
[1] "imagineCALGARY: Excited to be working on our methodology for turning #bigdata into the story of #yyc's #sustainability journey: http://t.co/lzPMAEQIbN"

[[4]]
[1] "ozyind: RT @timoelliott: #BigData Will Save the Planet!!! http://t.co/Tumfrse5Kc by @jamesafisher #analytics #bi #marketing"

[[5]]
[1] "BernardMarr: Mining Big Data For Sales Leads http://t.co/Xh5pBGskaG

#bigdata
#datamining
#analytics"

[[6]]
[1] "mobiusmedia: RT @carmenaugustine: One size does not fit all: "It's up to each professional to determine what they mean by #bigdata #discovery" @jaimefit…"

You can access a particular object within a list by using double square brackets as follows:

> bigdata[[4]]
[1] "ozyind: RT @timoelliott: #BigData Will Save the Planet!!! http://t.co/Tumfrse5Kc by @jamesafisher #analytics #bi #marketing"

There is no guarantee that searchTwitter pulled in the number of tweets requested. We may have specified a small date range or an uncommon search term. Either way, we can check the length of the bigdata list-type object with the length() function as follows:

> length(bigdata)
[1] 1500

Before we get too search-happy, it should be noted that the Twitter REST API (v1.1) limits the number of searches that can be performed in any given time period. The limits vary based on the type of search, the type of application making the search, as well as other criteria. Generally speaking, however, when using searchTwitter, you will be limited to 15 searches every 15 minutes, so make them count! More specific information on Twitter's rate limits can be found at https://dev.twitter.com/docs/rate-limiting/1.1/limits.

The main tip to avoid the rate limit becoming a hindrance is to search judiciously for particular users, themes, or hashtags. Another option is to search more frequently for users and themes that are highly active and reserve less active ones for intermittent search windows. It is best practice to keep track of your searches and rate limit ceilings by querying the API from R, or by adding rate limit queries directly to your code. If you plan to create applications rather than merely analyze data in R, other options such as caching may prove useful. The following two lines of code return the current number of each type of search that remains in a user's allotment, as well as when each search limit will reset:

> rate.limit <- getCurRateLimitInfo(c("lists"))
> rate.limit
    resource                 limit    remaining   reset
1  /lists/subscribers        180      180         2013-07-23 21:49:49
2  /lists/memberships        15       15          2013-07-23 21:49:49
3  /lists/list               15       15          2013-07-23 21:49:49
4  /lists/ownerships         15       15          2013-07-23 21:49:49
5  /lists/subscriptions      15       15          2013-07-23 21:49:49
6  /lists/members            180      180         2013-07-23 21:49:49
7  /lists/subscribers/show   15       15          2013-07-23 21:49:49
8  /lists/statuses           180      180         2013-07-23 21:49:49
9  /lists/show               15       15          2013-07-23 21:49:49
10 /lists/members/show       15       15          2013-07-23 21:49:49
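
The "lists" argument above restricts the report to list-related resources. To see how much of your search allotment remains, the same function can be pointed at the search family instead, a minimal sketch assuming "search" is accepted as a resource name:

> # check the remaining search quota and when it resets
> search.limit <- getCurRateLimitInfo(c("search"))
> search.limit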

To limit the number of searches we have to undertake, it can be useful to convert our search results to a data frame and then save them for later analysis. Only two lines of code are used, one to convert the bigdata list to a data frame and another to save that data frame as a comma-separated value file:

# conversion from list to data frame
> bigdata.df <- do.call(rbind, lapply(bigdata, as.data.frame))

# write to csv; fill in the … with a valid path
> write.csv(bigdata.df, "C:/…/bigdata.csv")
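
When you return to the analysis later, the saved file can be read straight back into a data frame with read.csv(), sparing a fresh search against your rate limit (the path is again elided with …):

# reload previously saved tweets instead of re-searching
> bigdata.df <- read.csv("C:/…/bigdata.csv", stringsAsFactors=FALSE)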