Depending on how many tweets you request, collecting them can take hours or even days. The query we are about to run should take only a few minutes. With the beepr package, you can trigger an alarm once the query is complete:
if(!require(beepr)){install.packages('beepr')}
library(beepr)
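The alarm idea can be wrapped around any long-running task. The following is only a minimal sketch: notify_when_done() is a hypothetical helper (it is not part of beepr), and the slow task is a stand-in for a real query. It assumes that falling back to a console message is acceptable when beepr is not installed.

```r
# Hypothetical helper (not part of beepr): run a task, then signal completion.
notify_when_done <- function(task, times = 3, pause = 3, sound = 5) {
  result <- task()                      # run the long job first
  if (requireNamespace("beepr", quietly = TRUE)) {
    for (i in seq_len(times)) {
      beepr::beep(sound)                # play one of beepr's built-in sounds
      Sys.sleep(pause)                  # wait before the next beep
    }
  } else {
    message("Task finished (beepr is not installed, so no sound was played).")
  }
  result                                # hand back whatever the task returned
}

# Stand-in for a slow query: sleeps for a second, then returns a value.
slow_sum <- notify_when_done(function() { Sys.sleep(1); sum(1:10) },
                             times = 1, pause = 0)
```

Returning the task's result means the helper can be dropped around an assignment without changing the rest of a script.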
Given point number three, you might not be able to reproduce the exact results that I got. Still, I encourage you to try the code and compare the results; that's a great way to get some practice. Let's get started with search_tweets2():
tweets_dt <- search_tweets2(q = '#rstats',
                            n = 20000,
                            include_rts = TRUE,
                            tweet_mode = 'extended',
                            retryonratelimit = TRUE,
                            token = my_token)
for(i in 0:2){beep(5); Sys.sleep(3)}
The previous code uses search_tweets2() to collect tweets containing #rstats (the q parameter) and stores them in a data frame named tweets_dt. There is also a function called search_tweets(), the first of its kind. search_tweets2() has a small advantage in that it lets you pass arguments directly to the API; the tweet_mode argument, for example, is described in the API's documentation.
The code asked for 20,000 tweets, but I ended up with only 15,999; point number three explains the shortfall. Setting include_rts to TRUE asks search_tweets2() to include retweets, and setting retryonratelimit to TRUE tells it to wait and resume after 15 minutes if the rate limit is reached. The last line triggers an alarm that sounds three times in a row, with a 3-second pause after each beep.
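To get a feel for how the 15-minute rate-limit pause adds up on larger requests, here is a rough back-of-the-envelope sketch. It rests on an assumption taken from rtweet's documentation rather than from this section: the standard search endpoint returns at most 18,000 tweets per 15-minute window. The helper estimate_wait_minutes() is hypothetical, and the real waiting time depends on when the window resets.

```r
# Assumption (from rtweet's documentation, not from this section): the standard
# search endpoint returns at most 18,000 tweets per 15-minute window.
tweets_per_window <- 18000
window_minutes <- 15

# Hypothetical helper: rough number of minutes spent waiting for resets.
estimate_wait_minutes <- function(n) {
  windows_needed <- ceiling(n / tweets_per_window)
  # The first window needs no waiting; never return a negative value.
  max(windows_needed - 1, 0) * window_minutes
}

wait_20k <- estimate_wait_minutes(20000)   # two windows, so one 15-minute wait
wait_1m  <- estimate_wait_minutes(1e6)     # 56 windows, so 55 waits
```

Under this assumption, a request for 20,000 tweets involves a single pause, while a request for a million tweets would spend well over 13 hours waiting.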
Our newly created data frame with Twitter data (tweets_dt) has 42 variables. You can check how many observations and variables you got using the following code:
dim(tweets_dt)
# [1] 15999 42
names(tweets_dt)
The last line outputs the names of all the variables. We used the search_tweets2() function provided by rtweet to collect information about tweets related to #rstats. Several other rtweet functions are often used to collect and handle Twitter data:
- get_timeline(): Gets a user's timeline
- stream_tweets(): Collects a live stream of Twitter data
- post_tweet(): Posts tweets from your console
- save_as_csv(): Easily saves Twitter data that's created by rtweet
- read_twitter_csv(): Easily reads Twitter data, saved as a .csv
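As a hedged sketch of how three of these functions might fit together, the snippet below defines a hypothetical helper, collect_and_save(), that fetches a timeline, writes it to a .csv file, and reads it back. The user name, the file name, and the helper itself are illustrative assumptions; the call only runs if rtweet is installed and a token named my_token exists, and argument details may differ between rtweet versions.

```r
# Hypothetical helper: fetch a user's timeline, save it, and read it back.
# Assumes the rtweet interface used in this chapter and a valid API token.
collect_and_save <- function(user, n, token, file) {
  timeline_dt <- rtweet::get_timeline(user, n = n, token = token)
  rtweet::save_as_csv(timeline_dt, file)   # write the data frame to .csv
  rtweet::read_twitter_csv(file)           # read the saved file back in
}

# Only attempt the call when rtweet and a token are actually available.
# 'hadleywickham' and 'timeline.csv' are placeholder values for illustration.
if (requireNamespace("rtweet", quietly = TRUE) && exists("my_token")) {
  timeline_back <- collect_and_save('hadleywickham', n = 100,
                                    token = my_token, file = 'timeline.csv')
}
```

Reading the file back right after saving it is a quick way to confirm that the data survived the round trip.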
So far, we briefly discussed what KDD and data mining could mean. We also learned about some ways to retrieve text from the web using three different packages: httr, rvest, and rtweet. Now, it's time to move further, clean the data, and transform it in an insightful way.