R-related posts in social media

One option for collecting posts from the past few days of social media is to process Twitter's global stream of Tweet data. This streaming API provides access to around 1 percent of all tweets; if you are interested in all of this data, a commercial Twitter Firehose account is needed. In the following examples, we will use the free Twitter Search API, which provides access to no more than 3,200 tweets for any search query, but this will be more than enough to do some quick analysis on the trending topics among R users.

So let's load the twitteR package and initialize the connection to the API by providing our application tokens and secrets, generated at https://apps.twitter.com:

> library(twitteR)
> setup_twitter_oauth(...)

Now we can start using the searchTwitter function to search tweets for any keywords, including hashtags and mentions. This query can be fine-tuned with a couple of arguments: since and until set the beginning and end dates, and n sets the number of tweets to return. The language can be restricted with the lang argument, using the ISO 639-1 format, for example, en for English.

Let's search for the most recent tweet with the official R hashtag:

> str(searchTwitter("#rstats", n = 1, resultType = 'recent'))
Reference class 'status' [package "twitteR"] with 17 fields
 $ text         : chr "7 #rstats talks in 2014"| __truncated__
 $ favorited    : logi FALSE
 $ favoriteCount: num 2
 $ replyToSN    : chr(0) 
 $ created      : POSIXct[1:1], format: "2015-07-21 19:31:23"
 $ truncated    : logi FALSE
 $ replyToSID   : chr(0) 
 $ id           : chr "623576019346280448"
 $ replyToUID   : chr(0) 
 $ statusSource : chr "Twitter Web Client"
 $ screenName   : chr "daroczig"
 $ retweetCount : num 2
 $ isRetweet    : logi FALSE
 $ retweeted    : logi FALSE
 $ longitude    : chr(0) 
 $ latitude     : chr(0) 
 $ urls         :'data.frame':	2 obs. of  5 variables:
  ..$ url         : chr [1:2] 
      "http://t.co/pStTeyBr2r" "https://t.co/5L4wyxtooQ"
  ..$ expanded_url: chr [1:2] "http://budapestbiforum.hu/2015/en/cfp" 
      "https://twitter.com/BudapestBI/status/623524708085067776"
  ..$ display_url : chr [1:2] "budapestbiforum.hu/2015/en/cfp" 
      "twitter.com/BudapestBI/sta…"
  ..$ start_index : num [1:2] 97 120
  ..$ stop_index  : num [1:2] 119 143

This is quite an impressive amount of information for a character string of no more than 140 characters, isn't it? Besides the text of the actual tweet, we got some meta-information as well, for example, the author, the post time, the number of times other users favorited or retweeted the post, the name of the Twitter client, and the URLs in the post along with their shortened, expanded, and displayed formats. The location of the tweet is also available in some cases, if the user has enabled that feature.

Based on this information, we could focus on the Twitter R community in very different ways. Examples include:

  • Counting the users mentioning R
  • Analyzing social network or Twitter interactions
  • Time-series analysis on the time of posts
  • Spatial analysis on the location of tweets
  • Text mining of the tweet contents

Probably a mixture of these (and other) methods would be the best approach, and I highly suggest you do that as an exercise to practice what you have learned in this book. However, in the following pages we will only concentrate on the last item.
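As a quick illustration of the time-series idea, the created timestamps returned by the API can be tabulated by hour of the day with base R alone. The timestamps below are made up for the sake of a self-contained example; with real data, you would use the created field of the downloaded tweets instead:

```r
## Hypothetical tweet timestamps standing in for the created field
created <- as.POSIXct(c(
    "2015-07-21 09:15:00", "2015-07-21 09:47:00",
    "2015-07-21 14:02:00", "2015-07-22 09:30:00"), tz = "UTC")

## Number of tweets per hour of the day: peaks suggest when the
## community is most active
table(format(created, "%H"))
```

The same one-liner scales to thousands of tweets, and plotting the resulting table gives a quick activity profile of the community.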

So first, we need some recent tweets on the R programming language. Instead of passing the related hashtag to searchTwitter (like we did previously), we can also use the Rtweets wrapper function to search for #rstats posts:

> tweets <- Rtweets(n = 500)

This function returned a list of 500 reference class objects similar to the one we saw previously. We can count the number of original tweets, excluding retweets:

> length(strip_retweets(tweets))
[1] 149

But, as we are looking for the trending topics, we will stick to the original list of tweets, where the retweets are also important, as they give a natural weight to the trending posts. So let's transform the list of reference classes into a data.frame:

> tweets <- twListToDF(tweets)

This dataset consists of 500 rows (tweets) and 16 variables on the content, author, and location of the posts, as described previously. Now, as we are only interested in the actual text of the tweets, let's load the tm package and import our corpus as seen in Chapter 7, Unstructured Data:

> library(tm)
Loading required package: NLP
> corpus <- Corpus(VectorSource(tweets$text))

As the data is in the right format, we can start cleaning it: first we transform everything into lowercase (so that the lowercase English stopword list also matches capitalized words), then we remove the common English words and the punctuation; we might also want to remove any extra whitespace:

> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, stripWhitespace)

It's also wise to remove the R hashtag, as this is part of all tweets:

> corpus <- tm_map(corpus, removeWords, 'rstats')
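The cleaning steps can be tried out on a tiny stand-in corpus without touching the Twitter API. The two example tweets below are made up; the pipeline lowercases first so that capitalized words such as "The" also match the lowercase stopword list, and removes the hashtag term only after stripping punctuation:

```r
library(tm)

## Two made-up tweets standing in for the downloaded #rstats posts
corpus <- Corpus(VectorSource(c(
    "Learning #rstats is FUN!",
    "The tm package rocks #rstats")))

## Lowercase first, so the lowercase English stopword list matches
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Dropping punctuation turns "#rstats" into the plain word "rstats"
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, 'rstats')
corpus <- tm_map(corpus, stripWhitespace)

content(corpus[[1]])
```

The first document is reduced to the two content words "learning fun", which is exactly the kind of input we want to feed to the word cloud.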

And then we can use the wordcloud package to plot the most important words:

> library(wordcloud)
Loading required package: RColorBrewer
> wordcloud(corpus)
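If you prefer numbers over a plot, the frequencies that wordcloud draws can also be inspected directly via a term-document matrix. Here is a minimal sketch on a made-up three-document corpus; with real data, you would reuse the cleaned tweet corpus instead:

```r
library(tm)

## A made-up mini-corpus standing in for the cleaned tweets
corpus <- Corpus(VectorSource(
    c("data analysis", "data science", "big data")))

## Rows are terms, columns are documents; summing the rows gives the
## overall term frequencies that the word cloud visualizes
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq)
```

Here "data" tops the list with three occurrences, so it would be rendered as the largest word in the corresponding cloud.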