Peeking data

We have already seen a few things in the data, such as the first ten rows, but there is more to look at. Let's check its dimensions:

dim(clean_dt)
# [1] 951   2

That leaves 951 R packages, one per row, after removing seven observations. Let's check the last 10 observations:

tail(clean_dt, 10)
# A tibble: 10 x 2
#    word             n
#    <chr>        <int>
#  1 wpp2017          1
#  2 wrapr            1
#  3 WufooR           1
#  4 xgboost          1
#  5 XR               1
#  6 xray             1
#  7 XRJulia          1
#  8 xtractomatic     1
#  9 ZeBook           1
# 10 zipfextR         1

A summary could also be useful:

summary(clean_dt)
#      word                 n
#  Length:951         Min.   :  1.00
#  Class :character   1st Qu.:  1.00
#  Mode  :character   Median :  2.00
#                     Mean   : 10.58
#                     3rd Qu.:  7.00
#                     Max.   :481.00

Given that the first quartile is one, the median is two, and the third quartile is seven, it is fair to infer that at least half of the packages were cited only once or twice. Calling the summary() function took little effort and gave us useful additional information: a few packages are very popular in the tweets (the maximum is 481, and the mean of 10.58 sits well above the median), while most of the others are mentioned only a handful of times.
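The inference above can be checked directly by computing the share of packages at or below a given count. What follows is only a minimal sketch: it assumes a table with the same word and n columns as clean_dt, and toy_dt is a hypothetical stand-in with made-up counts so that the snippet runs on its own.

```r
# Hypothetical stand-in for clean_dt (made-up counts, same column names)
toy_dt <- data.frame(
  word = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  n    = c(40L, 7L, 2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Share of packages cited at most twice:
# the logical vector is averaged, so TRUE counts as 1 and FALSE as 0
share_low <- mean(toy_dt$n <= 2)

# Quartiles of the counts, the same quantities summary() reports
quartiles <- quantile(toy_dt$n, probs = c(0.25, 0.5, 0.75))

print(share_low)
print(quartiles)
```

Swapping toy_dt for clean_dt should give the real share of rarely cited packages.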

Calling clean_dt[1,] returns the most-cited package: ggplot2. There is a very good reason for this. Visualization is a wonderful way to look for insights and to convince your audience. As soon as I get the results from the models, I try to tell the whole story using three to five plots.
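Note that the clean_dt[1,] lookup returns the most-cited package only if the table is already sorted by n in descending order, which this section appears to assume. As a hedged sketch, the lookup below does not depend on row order; toy_dt is again a hypothetical stand-in with made-up counts.

```r
# Hypothetical stand-in for clean_dt, deliberately NOT sorted by n
toy_dt <- data.frame(
  word = c("pkgB", "pkgA", "pkgC"),
  n    = c(7L, 40L, 2L),
  stringsAsFactors = FALSE
)

# Option 1: pick the row holding the largest count, whatever the order
top_row <- toy_dt[which.max(toy_dt$n), ]

# Option 2: sort by n in descending order, after which row 1 is the top package
sorted_dt <- toy_dt[order(-toy_dt$n), ]

print(top_row)
print(sorted_dt[1, ])
```

Both options agree once the table is sorted; which.max() simply skips the sorting step.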

Building an article around three to five plots that tell the whole story well is a great way to present your results.

ggplot2 can build wonderful plots with only a few lines of code, and there are many supplemental packages built on top of it. To learn more about the data, we are about to use ggplot2 to craft some visualizations.
