As expected, before analyzing Twitter data, one must obtain the data. One reason we like R for social media mining is that it makes obtaining targeted portions of Twitter data (relatively) simple. Besides having the capacity to read standard data types and files from traditional statistical software packages, R can also read many other specialized sources. For instance, R can connect to relational databases, Hadoop, and web services such as Twitter. This chapter first covers how to obtain Twitter data before describing some simple exploratory data analysis techniques.
To begin ingesting social media data from Twitter, you will need a developer account on Twitter. You can start one (free of cost) at https://dev.twitter.com/apps. Once you have a Twitter account, return to that page and enter your username and password. Now, simply click on the Create New Application button and enter the requested information. Note that these inputs are neither important nor binding. You simply need to provide a name, description, and website (even just a personal blog) in the required fields.
Once finished, you should see a page with a lot of information about your application. Included here is a section called OAuth settings. These are crucial in helping you authenticate your application with Twitter, thus allowing you to mine tweets. More specifically, these bits of information will authenticate you with the Twitter application programming interface (API). You'll want to copy the consumer key, consumer secret, request token URL, authorize URL, and access token URL to a file and keep them handy.
Now that we have set up an application with Twitter, we need to download the R package that allows us to pull tweets into our local R session. Though there are several packages that do this, we prefer the twitteR
package for its ease of use and flexibility. Instructions for downloading packages can be found in Chapter 2, Getting Started with R, but in general, installing packages is done by invoking install.packages("…")
. You can download the twitteR
package and load it into your R session as follows:
> install.packages("twitteR")
> library(twitteR)
Now, we are just a few lines of R code away from pulling in Twitter data. If you are using a Windows machine, there is an additional preliminary step of downloading a cacert.pem
file, which forms a portion of certain types of certification schemes for Internet transfers, as shown in the following code snippet:
> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="C:/.../cacert.pem")
In this example, we have saved the file to the C:
directory, but you can save it to wherever you have the appropriate permissions on your machine. Also, note the use of the forward slash instead of the Windows-standard backslash in the file locations. Next, create R objects from your own consumer information, filled in here with XX
to indicate a placeholder, as seen in the following lines of code:
> my.key <- "XX"
> my.secret <- "XX"
With that done, pass this information to a function called OAuthFactory
. The requestURL
, accessURL
, and authURL
in the following code snippet are demonstrative, but you should verify this information with that provided by Twitter as a part of authorizing your application:
> cred <- OAuthFactory$new(consumerKey=my.key,
      consumerSecret=my.secret,
      requestURL='https://api.twitter.com/oauth/request_token',
      accessURL='https://api.twitter.com/oauth/access_token',
      authURL='https://api.twitter.com/oauth/authorize')
Finally, input the cred$handshake
call that follows this paragraph, including the full path to where you saved your cacert.pem
file. This will bring up a URL in the R console that you will have to copy and paste into a browser. Doing so will take you to a Twitter page that will supply you with a numeric code that you can copy and paste into your instance of R after the cred$handshake
call.
> cred$handshake(cainfo="C:/.../cacert.pem")
Finally, save your authentication settings as follows:
> save(cred, file="twitter authentication.Rdata")
> registerTwitterOAuth(cred)
The registerTwitterOAuth
function returns a value of TRUE
on success; you are now ready to begin mining Twitter data, and after all of these steps, it will seem very simple. The workhorse of the twitteR
package is a function called, appropriately, searchTwitter
. The standard arguments to the function are a search term, the number of tweets to return, and the cacert.pem
file downloaded previously. More information about the function, including how to search specific time frames, geographic locations, and more, can be found by typing ?searchTwitter
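As a sketch of those optional arguments (the parameter names come from the twitteR documentation, but the dates and coordinates below are illustrative placeholders, not values from this chapter):

```r
# Hypothetical example: tweets tagged #bigdata from a one-week window,
# restricted to roughly a 50-mile radius around New York City.
# geocode takes the form "latitude,longitude,radius".
nyc.bigdata <- searchTwitter("#bigdata",
                             n = 100,
                             since = "2013-07-01",
                             until = "2013-07-07",
                             geocode = "40.7128,-74.0060,50mi")
```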
. For now, let's pull in some tweets with the #bigdata
hashtag and save them to an object called bigdata
as follows (note that you may leave off the cainfo
argument on non-Windows machines):
> bigdata <- searchTwitter("#bigdata", n=1500, cainfo="cacert.pem")
We can find out what class or type of object bigdata
is by using the class
function as follows:
> class(bigdata)
[1] "list"
We easily discover that bigdata
is a list or a collection of objects. We can access the first few objects in a list using the head()
function as follows:
> head(bigdata)
[[1]]
[1] "Timothy_Hughes: RT @MarketBuildr: Prescriptive versus #predictive #analytics http://t.co/oy7rS691Ht #BigData #Marketing"

[[2]]
[1] "DanVesset: already have on my schedule 3 upcoming business trips to #Texas .... where all data is #BigData"

[[3]]
[1] "imagineCALGARY: Excited to be working on our methodology for turning #bigdata into the story of #yyc's #sustainability journey: http://t.co/lzPMAEQIbN"

[[4]]
[1] "ozyind: RT @timoelliott: #BigData Will Save the Planet!!! http://t.co/Tumfrse5Kc by @jamesafisher #analytics #bi #marketing"

[[5]]
[1] "BernardMarr: Mining Big Data For Sales Leads http://t.co/Xh5pBGskaG #bigdata #datamining #analytics"

[[6]]
[1] "mobiusmedia: RT @carmenaugustine: One size does not fit all: "It's up to each professional to determine what they mean by #bigdata #discovery" @jaimefit…"
You can access a particular object within a list by using double braces as follows:
> bigdata[[4]]
[1] "ozyind: RT @timoelliott: #BigData Will Save the Planet!!! http://t.co/Tumfrse5Kc by @jamesafisher #analytics #bi #marketing"
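Each element of the list is a status object, not a plain character string, so you can pull out individual fields as well as the printed text. A minimal sketch, assuming the accessor methods that twitteR's status reference class provides:

```r
# Sketch using twitteR's status accessors; each list element
# exposes methods such as getText() and getScreenName().
first.text   <- bigdata[[1]]$getText()        # the tweet text
first.author <- bigdata[[1]]$getScreenName()  # who posted it

# Extract the text of every tweet into a character vector
all.text <- sapply(bigdata, function(t) t$getText())
```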
There is no guarantee that searchTwitter
pulled in the number of tweets requested. We may have specified a small date range or an uncommon search term. Either way, we can check the length of the bigdata
list-type object with the length()
function as follows:
> length(bigdata)
[1] 1500
Before we get too search-happy, it should be noted that the Twitter REST API (v1.1) limits the number of searches that can be performed in any given time period. The limits vary based on the type of search, the type of application making the search, as well as other criteria. Generally speaking, however, when using searchTwitter
, you will be limited to 15 searches every 15 minutes, so make them count! More specific information on Twitter's rate limits can be found at https://dev.twitter.com/docs/rate-limiting/1.1/limits.
The main tip to avoid the rate limit becoming a hindrance is to search judiciously for particular users, themes, or hashtags. Another option is to more frequently search for users and/or themes that are more active and reserve less active users or themes to intermittent search windows. It is best practice to keep track of your searches and rate limit ceilings by querying in R, or by adding rate limit queries directly to your code. If you plan to create applications rather than merely analyze data in R, other options such as caching may prove useful. The following two lines of code return the current number of each type of search that remains in a user's allotment, as well as when each search limit will reset:
> rate.limit <- getCurRateLimitInfo(c("lists"))
> rate.limit
                  resource limit remaining               reset
1       /lists/subscribers   180       180 2013-07-23 21:49:49
2       /lists/memberships    15        15 2013-07-23 21:49:49
3              /lists/list    15        15 2013-07-23 21:49:49
4        /lists/ownerships    15        15 2013-07-23 21:49:49
5     /lists/subscriptions    15        15 2013-07-23 21:49:49
6           /lists/members   180       180 2013-07-23 21:49:49
7  /lists/subscribers/show    15        15 2013-07-23 21:49:49
8          /lists/statuses   180       180 2013-07-23 21:49:49
9              /lists/show    15        15 2013-07-23 21:49:49
10     /lists/members/show    15        15 2013-07-23 21:49:49
To limit the number of searches we have to undertake, it can be useful to convert our search results to a data frame and then save them for later analysis. Only two lines of code are used, one to convert the bigdata
list to a data frame
and another to save that data frame as a comma-separated value file:
# conversion from list to data frame
> bigdata.df <- do.call(rbind, lapply(bigdata, as.data.frame))
# write to csv; fill in the … with a valid path
> write.csv(bigdata.df, "C:/…/bigdata.csv")
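In a later session, you can then reload the saved tweets without spending any searches against the rate limit. A minimal sketch, using the same placeholder path as above:

```r
# Read the previously saved tweets back into a data frame;
# stringsAsFactors=FALSE keeps the tweet text as character strings.
bigdata.df <- read.csv("C:/…/bigdata.csv", stringsAsFactors = FALSE)
```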