One option for collecting posts from the past few days of social media is to process Twitter's global stream of tweet data. This streaming API provides access to around 1 percent of all tweets; if you are interested in all of the data, a commercial Twitter Firehose account is needed. In the following examples, we will use the free Twitter Search API, which provides access to no more than 3,200 tweets based on any search query—but this will be more than enough to do some quick analysis of the trending topics among R users.
So let's load the twitteR package and initialize the connection to the API by providing our application tokens and secrets, generated at https://apps.twitter.com:

> library(twitteR)
> setup_twitter_oauth(...)
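The four arguments of setup_twitter_oauth are the consumer key, consumer secret, access token, and access token secret of your Twitter application. A minimal sketch, with hypothetical placeholder strings standing in for your own credentials:

```r
library(twitteR)
# Replace the placeholders below with the values shown for your
# application at https://apps.twitter.com (these are NOT real keys)
setup_twitter_oauth(
    consumer_key    = 'YOUR_CONSUMER_KEY',
    consumer_secret = 'YOUR_CONSUMER_SECRET',
    access_token    = 'YOUR_ACCESS_TOKEN',
    access_secret   = 'YOUR_ACCESS_SECRET')
```
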
Now we can start using the searchTwitter function to search tweets for any keywords, including hashtags and mentions. This query can be fine-tuned with a couple of arguments: since and until set the beginning and end dates, and n sets the number of tweets to return. The language can be set with the lang argument in the ISO 639-1 format—for example, use en for English.
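Putting these arguments together, a quick sketch of a fine-tuned query (the dates and tweet count below are arbitrary examples, and a live, authenticated Twitter connection is required):

```r
library(twitteR)
# Fetch up to 50 English-language #rstats tweets posted in a given week
tweets <- searchTwitter('#rstats', n = 50, lang = 'en',
    since = '2015-07-01', until = '2015-07-08')
```
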
Let's search for the most recent tweet with the official R hashtag:
> str(searchTwitter("#rstats", n = 1, resultType = 'recent'))
Reference class 'status' [package "twitteR"] with 17 fields
 $ text         : chr "7 #rstats talks in 2014"| __truncated__
 $ favorited    : logi FALSE
 $ favoriteCount: num 2
 $ replyToSN    : chr(0)
 $ created      : POSIXct[1:1], format: "2015-07-21 19:31:23"
 $ truncated    : logi FALSE
 $ replyToSID   : chr(0)
 $ id           : chr "623576019346280448"
 $ replyToUID   : chr(0)
 $ statusSource : chr "Twitter Web Client"
 $ screenName   : chr "daroczig"
 $ retweetCount : num 2
 $ isRetweet    : logi FALSE
 $ retweeted    : logi FALSE
 $ longitude    : chr(0)
 $ latitude     : chr(0)
 $ urls         :'data.frame': 2 obs. of 5 variables:
  ..$ url         : chr [1:2] "http://t.co/pStTeyBr2r" "https://t.co/5L4wyxtooQ"
  ..$ expanded_url: chr [1:2] "http://budapestbiforum.hu/2015/en/cfp" "https://twitter.com/BudapestBI/status/623524708085067776"
  ..$ display_url : chr [1:2] "budapestbiforum.hu/2015/en/cfp" "twitter.com/BudapestBI/sta…"
  ..$ start_index : num [1:2] 97 120
  ..$ stop_index  : num [1:2] 119 143
This is quite an impressive amount of information for a character string of no more than 140 characters, isn't it? Besides the text of the actual tweet, we got some meta-information as well—for example, the author, post time, the number of times other users favorited or retweeted the post, the Twitter client name, and the URLs in the post in their shortened, expanded, and displayed formats. The location of the tweet is also available in some cases, if the user enabled that feature.
Based on this piece of information, we could focus on the Twitter R community in very different ways. Examples include:
Probably a mixture of these (and other) methods would be the best approach, and I highly suggest you try that as an exercise to practice what you have learned in this book. In the following pages, however, we will concentrate only on the textual content of the tweets.
So first, we need some recent tweets on the R programming language. To search for #rstats posts, instead of providing the related hashtag (like we did previously), we can use the Rtweets wrapper function as well:
> tweets <- Rtweets(n = 500)
This function returned 500 reference classes similar to those we saw previously. We can count the number of original tweets excluding retweets:
> length(strip_retweets(tweets))
[1] 149
But, as we are looking for the trending topics, we are interested in the original list of tweets, in which the retweets are also important, as they give a natural weight to the trending posts. So let's transform the list of reference classes to a data.frame:
> tweets <- twListToDF(tweets)
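Before mining the text, it can be worth taking a quick look at the resulting data.frame with standard tools. A brief sketch, using the column names returned by twListToDF:

```r
# Peek at the text and popularity of the first few tweets
head(tweets[, c('text', 'screenName', 'retweetCount')])

# Show the single most retweeted post in the sample
tweets$text[which.max(tweets$retweetCount)]
```
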
This dataset consists of 500 rows (tweets) and 16 variables on the content, author, and location of the posts, as described previously. Now, as we are only interested in the actual text of the tweets, let's load the tm package and import our corpus, as seen in Chapter 7, Unstructured Data:
> library(tm)
Loading required package: NLP
> corpus <- Corpus(VectorSource(tweets$text))
As the data is in the right format, we can start to clean it by removing common English stopwords and punctuation and transforming everything to lowercase; we might also want to remove any extra whitespace:
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, stripWhitespace)
It's also wise to remove the R hashtag, as this is part of all tweets:
> corpus <- tm_map(corpus, removeWords, 'rstats')
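To verify that the cleaning steps worked as intended, we can print a few of the processed documents. A quick sketch using tm's character coercion for text documents:

```r
# Show the first three cleaned tweets as plain character strings
lapply(corpus[1:3], as.character)
```
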
And then we can use the wordcloud package to plot the most important words:

> library(wordcloud)
Loading required package: RColorBrewer
> wordcloud(corpus)
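The word cloud is driven by term frequencies, which we can also compute explicitly with tm's TermDocumentMatrix; passing the frequencies to wordcloud ourselves lets us control options such as max.words. A sketch of this alternative route:

```r
library(tm)
library(wordcloud)

# Count how often each term occurs across all tweets in the corpus
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # inspect the ten most frequent terms

# Plot only the 50 most frequent words
wordcloud(names(freq), freq, max.words = 50)
```
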