We have already looked at a few aspects of the data, such as the first ten rows, but there is more to explore. Let's check its dimensions:
dim(clean_dt)
# [1] 951 2
We are left with 951 distinct R packages after removing seven observations. Let's check the last 10 observations:
tail(clean_dt, 10)
# A tibble: 10 x 2
#   word             n
#   <chr>        <int>
# 1 wpp2017          1
# 2 wrapr            1
# 3 WufooR           1
# 4 xgboost          1
# 5 XR               1
# 6 xray             1
# 7 XRJulia          1
# 8 xtractomatic     1
# 9 ZeBook           1
#10 zipfextR         1
A summary could also be useful:
summary(clean_dt)
#     word                 n
#  Length:951         Min.   :  1.00
#  Class :character   1st Qu.:  1.00
#  Mode  :character   Median :  2.00
#                     Mean   : 10.58
#                     3rd Qu.:  7.00
#                     Max.   :481.00
Given that the first quartile is one, the median is two, and the third quartile is seven, it is fair to infer that at least half of the packages were cited only once or twice. The mean (10.58) sits well above the median and the maximum is 481, so the distribution is heavily right-skewed. This small effort of calling the summary() function gave us useful additional information: a few packages are very popular in the tweets, while most of the others are rarely mentioned.
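The reading of the quartiles can also be checked directly. The following is a minimal sketch, assuming a small made-up data frame named toy_dt (hypothetical, standing in for clean_dt, with the same word and n columns); the package names and counts are illustrative only and are not taken from the tweets.

```r
# Hypothetical stand-in for clean_dt: same columns, made-up counts
toy_dt <- data.frame(
  word = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF", "pkgG", "pkgH"),
  n    = c(120L, 15L, 7L, 2L, 2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Quartiles of the counts, the same kind of figures summary() reports
quartiles <- quantile(toy_dt$n, probs = c(0.25, 0.5, 0.75))

# Share of packages cited at most twice
share_low <- mean(toy_dt$n <= 2)

# A mean far above the median points to a right-skewed distribution
skew_gap <- mean(toy_dt$n) - median(toy_dt$n)

print(quartiles)
print(share_low)
print(skew_gap)
```

Running the same three expressions on clean_dt$n would put a number on "most packages" instead of leaving it to be read off the summary.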
Calling clean_dt[1,] returns the most-cited package: ggplot2. There is a very good reason for this. Visualization is a wonderful way to look for insights and to convince your audience. As soon as I get the results from my models, I try to tell the whole story using three to five plots.
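Taking the first row presumably works because clean_dt is already sorted by n in descending order. If that ordering cannot be taken for granted, the top entry can be looked up explicitly. A minimal sketch, again assuming a hypothetical toy_dt with made-up values rather than the real data:

```r
# Hypothetical stand-in for clean_dt, deliberately left unsorted
toy_dt <- data.frame(
  word = c("pkgB", "pkgA", "pkgC"),
  n    = c(15L, 120L, 7L),
  stringsAsFactors = FALSE
)

# Row with the largest count, regardless of row order
top_row <- toy_dt[which.max(toy_dt$n), ]

# Alternative: sort by descending count, then take the first row
sorted_dt  <- toy_dt[order(toy_dt$n, decreasing = TRUE), ]
top_sorted <- sorted_dt[1, ]

print(top_row)
```

Both approaches pick the same row here; which.max() returns only the first maximum, so ties would need separate handling.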
ggplot2 can build wonderful plots with only a few lines of code. Additionally, there are many supplemental packages built on top of it. To learn more about the data, we are about to use ggplot2 to craft some visualizations.