Cleaning and transforming data

In Chapter 3, Data Wrangling with R, we approached the topic of data cleaning (munging). Data cleaning is so important that most data scientists spend the majority of their work time cleaning and preparing data. The last section, What is the R community tweeting about?, gave us a DataFrame with 15,999 rows and 42 columns. That is raw data. This section will clean and transform it.

Our initial goal was to check which packages the R community is talking about on Twitter. There are three variables we will use to achieve the final goal.

The text variable can be truncated when there is a retweet. When that is the case, check retweet_text, which won't be truncated. The quoted_text variable also brings useful information. To unite all the useful information into a single object, we can use the following code:

quotes <- tweets_dt$is_quote
rts <- tweets_dt$is_retweet

dt <- c(tweets_dt$quoted_text[quotes],
        tweets_dt$retweet_text[rts],
        tweets_dt$text[!rts])

The objects quotes and rts store logical values (TRUE and FALSE). The former tells you whether a tweet is a quote, while the latter signals a retweet. Next, the code block creates the dt object, which holds all of the non-truncated text retrieved by search_tweets2() that might contain package names.
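To see how this logical subsetting works, here is a minimal toy illustration (the vector contents are made up):

txt <- c("tweet A", "tweet B (truncated RT)", "tweet C")
rts <- c(FALSE, TRUE, FALSE)  # only the second element is a retweet

c(txt[rts], txt[!rts])        # retweet text first, the rest after
# [1] "tweet B (truncated RT)" "tweet A"  "tweet C"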

If you run class(dt), you will notice that dt is, in fact, a character vector. We need to unnest the words from the strings, and tidytext is the tool for this task. This package won't work well with a character vector; a DataFrame is a more suitable object. The conversion can be made as follows:

dt <- data.frame(tweets = dt,
                 stringsAsFactors = F)

Now, dt is a DataFrame with a single column named tweets. Do not fail to set stringsAsFactors = F, which prevents the strings from being interpreted as factors.

Strings accidentally being converted to factors is a major cause of errors in R for newbies and experienced users alike. Users hardly notice when a variable is unintentionally converted to a factor: functions such as data.frame() and read.csv() will interpret strings as factors by default. On the other hand, the conscious use of factors is very useful.
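If you want to see the pitfall for yourself, try the following quick check. Note that the default behavior described here applies to R versions prior to 4.0.0; since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE:

df1 <- data.frame(x = c("a", "b"))
class(df1$x)   # "factor" on R < 4.0.0, "character" on R >= 4.0.0

df2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(df2$x)   # always "character"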

To finally get that data cleaned, let's combine the methods from the dplyr and tidytext packages. Make sure you have both installed and loaded:

if(!require(dplyr)){install.packages('dplyr')}
if(!require(tidytext)){install.packages('tidytext')}
library(dplyr); library(tidytext)

The following code will clean and transform the data:

pkgs <- row.names(available.packages())

clean_dt <- dt %>%
  unnest_tokens(word, tweets,
                to_lower = F) %>%
  filter(word %in% pkgs) %>%
  count(word, sort = T)

After saving the names of all available packages into a single object named pkgs, several dplyr and tidytext functions are chained through pipes (%>%) to clean and transform dt into a format that will be helpful for the analysis to come. The following bullets explain how the code block works:

  • dt was piped through tidytext::unnest_tokens(). The piped object, dt, is a DataFrame derived from subsets of tweets_dt (a tibble). Once we make sure that our strings were not taken as factors, unnest_tokens() can receive them and split out the words. We only had to tell the function how to name the column that will receive the words (word) and where to find the texts (the tweets column). The to_lower = F argument is also of great importance, given that package names in R are case sensitive.
  • The outcome from unnest_tokens() is then piped through dplyr's filter(); this function filters the unnested words, keeping only package names (pkgs).
  • The filtered data is then counted by dplyr's count(); the pipes made sure to carry the filtered, unnested data to count(). The sort = T argument sorts the outcome; that is, the package names are listed from the ones that popped up most often to the ones that appeared least. A toy run of the whole pipeline follows this list.
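To see each step in isolation, here is a minimal toy run of the same pipeline on two made-up tweets (the texts and the resulting counts are invented purely for illustration):

library(dplyr)
library(tidytext)

toy <- data.frame(
  tweets = c("Learning ggplot2 and dplyr today", "ggplot2 is great"),
  stringsAsFactors = FALSE
)

toy %>%
  unnest_tokens(word, tweets, to_lower = FALSE) %>%  # one row per word
  filter(word %in% c("ggplot2", "dplyr")) %>%        # keep package names only
  count(word, sort = TRUE)                           # ggplot2 2, dplyr 1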

We can check the top 10 cited packages with the following code:

> head(clean_dt, 10)
# A tibble: 10 x 2
#    word          n
#    <chr>     <int>
#  1 ggplot2     481
#  2 here        479
#  3 not         477
#  4 available   384
#  5 useful      364
#  6 maps        363
#  7 tutorial    351
#  8 files       322
#  9 tidyverse   316
# 10 dplyr       296

Although here, not, available, and useful are real packages that are available for my current version of R, I am not sure that people using these words were talking about the packages, at least not every single time. There are several ways we could address this. Doing nothing is always an option. Another would be to apply a discount rate to packages that are named after pretty common words, or to arbitrarily subtract these common words.

The here package constructs paths to a project's files. The name of the not package stands for Narrowest-Over-Threshold (NOT) change-point detection.
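If you wanted to try the discount-rate alternative, a minimal sketch could look like the following. Both the 0.5 rate and the use of tidytext's stop_words lexicon as a proxy for "common words" are arbitrary choices made only for illustration:

library(dplyr)
library(tidytext)

common <- unique(stop_words$word)  # common English words shipped with tidytext

clean_dt %>%
  mutate(n_adj = ifelse(word %in% common, n * 0.5, n)) %>%  # discount common-word packages
  arrange(desc(n_adj))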

Each alternative comes with pros and cons—choose wisely. I am going for the third option:

clean_dt <- clean_dt[-(2:8),]
head(clean_dt, 10)
# A tibble: 10 x 2
#    word           n
#    <chr>      <int>
#  1 ggplot2      481
#  2 tidyverse    316
#  3 dplyr        296
#  4 population   242
#  5 shiny        221
#  6 ggfortify    192
#  7 purrr        191
#  8 fun          165
#  9 blogdown     146
# 10 tables       137

This result sounds more reasonable. I am not sure about fun and tables being cited as packages every single time either, but I am keeping them. Let me stress that the process I used could be called anything but rigorous. The only benefit I got from it was not overextending myself while still getting a result that sounds OK.

There are several ways to deal with this problem more carefully and rigorously. Here is one brief example:

  1. Collect tweets from the hashtag #PyData.
  2. Throw away tweets that also come with #RStats.
  3. Clean the dataset by filtering with a dictionary (rcorpora, maybe) and then filter based on package names. Only packages named after common words should remain.
  4. Count how many times those words appeared.
  5. Check the proportion of these words and compare it to what you got (see the sketch after this list).
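Here is a minimal sketch of what the first steps might look like, assuming the rtweet authentication from the previous section is still in place. The query size and the list of ambiguous names are illustrative choices, not prescriptions:

library(rtweet)
library(dplyr)
library(tidytext)

# Step 1: collect #PyData tweets; step 2: drop those that also carry #RStats
py <- search_tweets("#PyData", n = 5000, include_rts = FALSE)
py <- py[!grepl("#RStats", py$text, ignore.case = TRUE), ]

# Steps 3 and 4: count how often the ambiguous package names show up
ambiguous <- c("here", "not", "available", "useful")

py_counts <- data.frame(tweets = py$text, stringsAsFactors = FALSE) %>%
  unnest_tokens(word, tweets) %>%
  filter(word %in% ambiguous) %>%
  count(word, sort = TRUE)

# Step 5: compare these proportions against the #RStats counts
py_counts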

I am sure you can do this. It would be a great way to exercise your R coding skills. We have now finished cleaning the data. Let's move on and do some analysis.
