In the previous chapter, you learned the methods used to access different types of databases such as relational databases (SQLite and MySQL) and non-relational databases (MongoDB and Redis). Relational databases usually return data in a tabular form, while non-relational databases may support nested data structures and other features.
Even though the data is loaded into memory, it is usually far from ready for data analysis. Most data at this stage still needs cleaning and transforming, which may in fact take a large proportion of the total time before any statistical model or visualization can be applied. In this chapter, you'll learn about a set of built-in functions and several packages for data manipulation. The packages are extremely powerful. However, to work well with these packages, we need a solid understanding of the concepts introduced in the previous chapters.
In this chapter, we'll cover the following topics:

- Using the sqldf package
- Using data.table to manipulate data
- Using dplyr pipelines to manipulate data frames
- Using rlist to work with nested data structures
In R, these operations can be done with built-in functions. In this section, we will introduce the built-in functions that are most helpful for manipulating data into the form we need as model input or for presentation. Some of these functions and techniques were already presented in the previous chapters.
Most of the code in this section and subsequent sections is based on a set of fictitious data about some products. We will use the readr
package to load the data for better handling of column types. If you don't have this package installed, run install.packages("readr")
:
library(readr)
product_info <- read_csv("data/product-info.csv")
product_info
##    id      name  type   class released
## 1 T01    SupCar   toy vehicle      yes
## 2 T02  SupPlane   toy vehicle       no
## 3 M01     JeepX model vehicle      yes
## 4 M02 AircraftX model vehicle      yes
## 5 M03    Runner model  people      yes
## 6 M04    Dancer model  people       no
Once the data is loaded into memory as a data frame, we can take a look at its column types:
sapply(product_info, class)
##          id        name        type       class    released
## "character" "character" "character" "character" "character"
The readr::read_csv
function behaves differently from the built-in read.csv
. For example, it does not automatically convert string columns to factors, a conversion that can be problematic yet adds little value. Therefore, I recommend that you use the functions provided by readr
to read tabular data from file into R. If we had used read.csv
instead, all these columns would be factors with limited possible values.
Previously, you learned the basics of data frames. In this section, we will review the built-in functions to filter a data frame. Although a data frame is essentially a list of vectors, we can access it like a matrix since all column vectors are of the same length. To select rows that meet certain conditions, we will supply a logical vector as the first argument of []
, while the second is left empty. In the following examples, we will use the product information and statistics data introduced earlier to demonstrate basic data filtering and summarization techniques.
For example, we will take out all rows of toy
type:
product_info[product_info$type == "toy", ]
##    id     name type   class released
## 1 T01   SupCar  toy vehicle      yes
## 2 T02 SupPlane  toy vehicle       no
Alternatively, we could take out all rows that are not released:
product_info[product_info$released == "no", ]
##    id     name  type   class released
## 2 T02 SupPlane   toy vehicle       no
## 6 M04   Dancer model  people       no
To filter columns, we will supply a character vector as the second argument while the first is left empty, which is exactly what we did when we subset a matrix:
product_info[, c("id", "name", "type")]
##    id      name  type
## 1 T01    SupCar   toy
## 2 T02  SupPlane   toy
## 3 M01     JeepX model
## 4 M02 AircraftX model
## 5 M03    Runner model
## 6 M04    Dancer model
Alternatively, we can filter the data frame by regarding it as a list. We will supply only one character vector of column names in []
and omit the comma:
product_info[c("id", "name", "class")]
##    id      name   class
## 1 T01    SupCar vehicle
## 2 T02  SupPlane vehicle
## 3 M01     JeepX vehicle
## 4 M02 AircraftX vehicle
## 5 M03    Runner  people
## 6 M04    Dancer  people
To filter a data frame by both row and column, we will supply a vector as the first argument to select rows and a vector as the second to select columns:
product_info[product_info$type == "toy", c("name", "class", "released")]
##       name   class released
## 1   SupCar vehicle      yes
## 2 SupPlane vehicle       no
If the row-filtering condition is based on values of certain columns, the preceding code can be very redundant, especially when the condition gets more complicated. Another built-in function that simplifies code is subset
, as we introduced previously:
subset(product_info,
  subset = type == "model" & released == "yes",
  select = name:class)
##        name  type   class
## 3     JeepX model vehicle
## 4 AircraftX model vehicle
## 5    Runner model  people
The subset
function uses non-standard evaluation so that we can directly use the columns of the data frame without typing product_info
many times, because the expressions are meant to be evaluated in the context of the data frame.
Similarly, we can use with
to evaluate an expression in the context of the data frame, that is, the columns of the data frame can be used as symbols in the expression without repeatedly specifying the data frame:
with(product_info, name[released == "no"])
## [1] "SupPlane" "Dancer"
The expression can be more than a simple subsetting. We can summarize the data by counting the occurrences of each possible value of a vector. For example, we can create a table of the occurrences of types of records that are released:
with(product_info, table(type[released == "yes"]))
##
## model   toy
##     3     1
In addition to the table of product information, we also have a table of product statistics that describe some properties of each product:
product_stats <- read_csv("data/product-stats.csv")
product_stats
##    id material size weight
## 1 T01    Metal  120   10.0
## 2 T02    Metal  350   45.0
## 3 M01 Plastics   50     NA
## 4 M02 Plastics   85    3.0
## 5 M03     Wood   15     NA
## 6 M04     Wood   16    0.6
Now, think about how we can get the names of the products with the top three largest sizes. One way is to sort the records in product_stats
by size in descending order, select id
values of the top three records, and use these values to filter rows of product_info
by id
:
top_3_id <- product_stats[order(product_stats$size, decreasing = TRUE), "id"][1:3]
product_info[product_info$id %in% top_3_id, ]
##    id      name  type   class released
## 1 T01    SupCar   toy vehicle      yes
## 2 T02  SupPlane   toy vehicle       no
## 4 M02 AircraftX model vehicle      yes
Although it works as expected, this approach is quite verbose. Note that product_info
and product_stats
actually describe the same set of products from different perspectives. The connection between these two tables is the id
column. Each id
is unique and refers to the same product. To access both sets of information, we can put the two tables together into one data frame. The simplest way to do this is to use merge
:
product_table <- merge(product_info, product_stats, by = "id")
product_table
##    id      name  type   class released material size weight
## 1 M01     JeepX model vehicle      yes Plastics   50     NA
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
## 3 M03    Runner model  people      yes     Wood   15     NA
## 4 M04    Dancer model  people       no     Wood   16    0.6
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0
Now, we have a new data frame, product_table
, that combines product_info
and product_stats
through the shared id
column. In fact, even if you reorder the records in either table, the two tables can still be correctly merged.
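To see that merge matches rows by the key rather than by position, here is a minimal sketch with two tiny made-up data frames (df1 and df2 are hypothetical names, not part of our product data):

```r
# Two small data frames whose rows appear in different orders
df1 <- data.frame(id = c("A", "B"), x = c(1, 2))
df2 <- data.frame(id = c("B", "A"), y = c(20, 10))

# merge() matches records by the "id" key, not by row position,
# and by default sorts the result by the key column
merge(df1, df2, by = "id")
##   id x  y
## 1  A 1 10
## 2  B 2 20
```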
With the merged data frame, we can do things more easily. For example, we can sort the records by a column that originally belonged to either table, without manually working across both:
product_table[order(product_table$size), ]
##    id      name  type   class released material size weight
## 3 M03    Runner model  people      yes     Wood   15     NA
## 4 M04    Dancer model  people       no     Wood   16    0.6
## 1 M01     JeepX model vehicle      yes Plastics   50     NA
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0
To solve the earlier top-three-sizes problem, we can use the merged table directly and get the same answer:
product_table[order(product_table$size, decreasing = TRUE), "name"][1:3]
## [1] "SupPlane"  "SupCar"    "AircraftX"
The merged data frame allows us to sort the records by a column in one data frame and filter the records by a column in the other. For example, we will first sort the product records by weight in descending order and select all records of the model
type:
product_table_sorted <- product_table[order(product_table$weight, decreasing = TRUE), ]
product_table_sorted[product_table_sorted$type == "model", ]
##    id      name  type   class released material size weight
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
## 4 M04    Dancer model  people       no     Wood   16    0.6
## 1 M01     JeepX model vehicle      yes Plastics   50     NA
## 3 M03    Runner model  people      yes     Wood   15     NA
Sometimes, the column values are stored as text but can be converted to standard R data structures that better represent the data. For example, the released
column in product_info
only takes yes
and no
, which can be better represented with a logical vector. We can use <-
to modify the column values, as you learned previously. However, it is usually better to create a new data frame with the existing columns properly adjusted and new columns added without polluting the original data. To do this, we can use transform
:
transform(product_table,
  released = ifelse(released == "yes", TRUE, FALSE),
  density = weight / size)
##    id      name  type   class released material size weight    density
## 1 M01     JeepX model vehicle     TRUE Plastics   50     NA         NA
## 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0 0.03529412
## 3 M03    Runner model  people     TRUE     Wood   15     NA         NA
## 4 M04    Dancer model  people    FALSE     Wood   16    0.6 0.03750000
## 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0 0.08333333
## 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0 0.12857143
The result is a new data frame with released
converted to a logical vector and a new column, density
, added. You can easily verify that product_table
is not modified at all.
Also, note that transform
works in a way similar to subset
because both functions use non-standard evaluation to allow direct use of data frame columns as symbols in the arguments so that we don't have to type product_table$
before columns all the time.
In the preceding data, a number of columns contain missing values represented by NA
. Under many circumstances, we don't want any missing values to be present in our data, so we need to deal with them somehow. To demonstrate the various techniques, we will load another table that contains missing values: the results of tests of the quality, durability, and waterproofing of each product in the previous dataset. We will store the data in product_tests
:
product_tests <- read_csv("data/product-tests.csv")
product_tests
##    id quality durability waterproof
## 1 T01      NA         10         no
## 2 T02      10          9         no
## 3 M01       6          4        yes
## 4 M02       6          5        yes
## 5 M03       5         NA        yes
## 6 M04       6          6        yes
Note that the values in both quality
and durability
contain missing values (NA
). To exclude all rows with missing values, we can use na.omit()
:
na.omit(product_tests)
##    id quality durability waterproof
## 2 T02      10          9         no
## 3 M01       6          4        yes
## 4 M02       6          5        yes
## 6 M04       6          6        yes
Another way is to use complete.cases()
to get a logical vector, indicating all complete rows (without any missing value):
complete.cases(product_tests)
## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE
Then, we can use this logical vector to filter the data frame. For example, we can get the id
of all complete rows:
product_tests[complete.cases(product_tests), "id"]
## [1] "T02" "M01" "M02" "M04"
Alternatively, we can get the id of all incomplete rows:
product_tests[!complete.cases(product_tests), "id"]
## [1] "T01" "M03"
Note that product_info
, product_stats
, and product_tests
all share an id
column; we can merge them all together. Unfortunately, there's no built-in function to merge an arbitrary number of data frames. We can only merge two existing data frames at a time, or we'll have to merge them recursively:
product_full <- merge(product_table, product_tests, by = "id")
product_full
##    id      name  type   class released material size weight quality durability waterproof
## 1 M01     JeepX model vehicle      yes Plastics   50     NA       6          4        yes
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0       6          5        yes
## 3 M03    Runner model  people      yes     Wood   15     NA       5         NA        yes
## 4 M04    Dancer model  people       no     Wood   16    0.6       6          6        yes
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0      NA         10         no
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0      10          9         no
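When more than two tables share the same key, one common idiom is to fold merge over a list of data frames with Reduce, which performs the recursive merging for us. A minimal sketch with three made-up tables (d1, d2, and d3 are hypothetical names):

```r
# Three hypothetical data frames sharing an "id" column
d1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
d2 <- data.frame(id = 1:3, y = c(TRUE, FALSE, TRUE))
d3 <- data.frame(id = 1:3, z = c(0.1, 0.2, 0.3))

# Reduce() folds the binary merge over the list:
# equivalent to merge(merge(d1, d2, by = "id"), d3, by = "id")
merged <- Reduce(function(lhs, rhs) merge(lhs, rhs, by = "id"),
  list(d1, d2, d3))
merged
##   id x     y   z
## 1  1 a  TRUE 0.1
## 2  2 b FALSE 0.2
## 3  3 c  TRUE 0.3
```

The same call, applied to product_info, product_stats, and product_tests, would produce the fully merged table in one step.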
With the fully merged table, we can use tapply
, another apply-family function specialized to work with tabular data, to summarize the data using certain methods over given columns. For example, we can calculate the mean value of quality
of each type
:
mean_quality1 <- tapply(product_full$quality, list(product_full$type),
  mean, na.rm = TRUE)
mean_quality1
## model   toy
##  5.75 10.00
Note that we not only supply mean
but also specify na.rm = TRUE
to ignore the missing values in quality
. The result looks like a numeric vector, but is it? We will use str()
to take a look at its structure:
str(mean_quality1)
## num [1:2(1d)] 5.75 10
## - attr(*, "dimnames")=List of 1
##  ..$ : chr [1:2] "model" "toy"
In fact, it is a one-dimensional array:
is.array(mean_quality1)
## [1] TRUE
The tapply
function produces an array instead of a simple numeric vector because it generalizes easily to multiple grouping variables. For example, we can compute the mean value of quality
for each type
and class
pair:
mean_quality2 <- tapply(product_full$quality,
  list(product_full$type, product_full$class),
  mean, na.rm = TRUE)
mean_quality2
##       people vehicle
## model    5.5       6
## toy       NA      10
Now, we have a two-dimensional array, whose values can be extracted by two arguments:
mean_quality2["model", "vehicle"]
## [1] 6
Moreover, we can supply more columns for grouping. In the following code, we will use the with()
function to reduce redundant typing of product_full
:
mean_quality3 <- with(product_full,
  tapply(quality, list(type, material, released),
    mean, na.rm = TRUE))
mean_quality3
## , , no
##
##       Metal Plastics Wood
## model    NA       NA    6
## toy      10       NA   NA
##
## , , yes
##
##       Metal Plastics Wood
## model    NA        6    5
## toy     NaN       NA   NA
Now, a three-dimensional array is produced. Even though na.rm = TRUE
is specified, many cells are still NA. This is because no records fall into those grouping combinations:
str(mean_quality3)
## num [1:2, 1:3, 1:2] NA 10 NA NA 6 NA NA NaN 6 NA ...
## - attr(*, "dimnames")=List of 3
##  ..$ : chr [1:2] "model" "toy"
##  ..$ : chr [1:3] "Metal" "Plastics" "Wood"
##  ..$ : chr [1:2] "no" "yes"
We can access the cell value by supplying three arguments:
mean_quality3["model", "Wood", "yes"]
## [1] 5
In summary, tapply
groups the input data frame with n
specified variables and produces an array with n
dimensions. This approach to summarizing data can be hard to work with, especially when there are many grouping columns, mostly because a high-dimensional array is hard to present and not flexible for further manipulation. Later in this chapter, you will learn several methods that make group summaries much easier.
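For comparison, the built-in aggregate() function performs a similar group summary but returns a data frame rather than an array, which is often easier to work with downstream. A sketch on a small stand-in data frame (df is made up here to mirror the type, class, and quality columns of product_full):

```r
# A small stand-in for product_full (hypothetical rows)
df <- data.frame(
  type    = c("model", "model", "model", "model", "toy", "toy"),
  class   = c("vehicle", "vehicle", "people", "people", "vehicle", "vehicle"),
  quality = c(6, 6, 5, 6, NA, 10)
)

# Mean quality per (type, class) pair, returned as a data frame;
# the formula interface drops rows with NA by default (na.action = na.omit)
aggregate(quality ~ type + class, data = df, FUN = mean)
```

Each grouping combination becomes one row of the result, so the output stays flat no matter how many grouping variables are involved.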
Previously, you learned how to filter, sort, merge, and summarize data frames. These operations only work on rows and columns separately. Sometimes, however, we need to do something more complex.
For example, the following code loads a dataset of tests on quality and durability on different dates of two products:
toy_tests <- read_csv("data/product-toy-tests.csv")
toy_tests
##    id     date sample quality durability
## 1 T01 20160201    100       9          9
## 2 T01 20160302    150      10          9
## 3 T01 20160405    180       9         10
## 4 T01 20160502    140       9          9
## 5 T02 20160201     70       7          9
## 6 T02 20160303     75       8          8
## 7 T02 20160403     90       9          8
## 8 T02 20160502     85      10          9
Each row of the preceding data frame represents a record of tests of a particular product (id
) on a certain date
. If we need to compare the quality or durability of the two products over time, this format can be hard to work with. Instead, we need the data transformed into the following form so that we can compare the values of the two products more easily:
date     T01 T02
20160201   9   9
20160301  10   9
The reshape2
package is designed for such a transform. If you don't have it installed, run the following command:
install.packages("reshape2")
Once the package is installed, we can use reshape2::dcast
to transform the data so that we can easily compare the quality
of different products on the same date
. More specifically, it reshapes toy_tests
so that the date
column is shared, the values in id
are spread as columns, and the values for each date
and id
are quality
data:
library(reshape2)
toy_quality <- dcast(toy_tests, date ~ id, value.var = "quality")
toy_quality
##       date T01 T02
## 1 20160201   9   7
## 2 20160302  10  NA
## 3 20160303  NA   8
## 4 20160403  NA   9
## 5 20160405   9  NA
## 6 20160502   9  10
As you can see, toy_tests
is immediately transformed. The quality
values of both products are aligned with date
. Although both products are tested each month, the exact dates do not match each other. This results in missing values: if one product has a value on a given day but the other has no corresponding value on exactly the same day, an NA is produced.
One way to fill the missing value is to use the approach called Last Observation Carried Forward (LOCF), which means that if a non-missing value is followed by a missing value, then the non-missing value is carried forward to replace the missing value, until all subsequent missing values are replaced. One implementation of LOCF is provided by the zoo
package. Run the following command to install the package if you don't have it:
install.packages("zoo")
To demonstrate how it works, we will use zoo::na.locf()
to perform this technique over a very simple numeric vector with missing values:
zoo::na.locf(c(1, 2, NA, NA, 3, 1, NA, 2, NA))
## [1] 1 2 2 2 3 1 1 2 2
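The idea behind na.locf() can be sketched in a few lines of plain R. The following locf() is a hypothetical helper written for illustration (not the zoo implementation); it simply carries the last observed value forward:

```r
# Minimal LOCF sketch: replace each NA with the most recent non-NA value.
# Leading NAs have nothing to carry forward and are left unchanged here,
# whereas zoo::na.locf() removes them by default.
locf <- function(x) {
  for (i in seq_along(x)) {
    if (i > 1 && is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}
locf(c(1, 2, NA, NA, 3, 1, NA, 2, NA))
## [1] 1 2 2 2 3 1 1 2 2
```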
As you can see, each missing value is replaced with the previous non-missing value. To do the same thing with the T01
and T02
columns of toy_quality
, we can sub-assign the processed vector to the columns:
toy_quality$T01 <- zoo::na.locf(toy_quality$T01)
toy_quality$T02 <- zoo::na.locf(toy_quality$T02)
However, if toy_tests
contains thousands of products, it would be impractical to write thousands of similar lines of code. A better practice is to exclude the date column and sub-assign the processed columns all at once:
toy_quality[-1] <- lapply(toy_quality[-1], zoo::na.locf)
toy_quality
##       date T01 T02
## 1 20160201   9   7
## 2 20160302  10   7
## 3 20160303  10   8
## 4 20160403  10   9
## 5 20160405   9   9
## 6 20160502   9  10
We will use lapply()
to perform LOCF over all columns of toy_quality
, except date
, and assign the resulting list to the subset of toy_quality
without the date
column. This works because sub-assignment of a data frame accepts a list and still preserves the class of data frame.
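This behavior can be verified on a tiny made-up data frame: assigning a list to a subset of columns fills those columns and keeps the object a data frame:

```r
# A hypothetical data frame with missing values in all but the first column
df <- data.frame(date = 1:3, a = c(1, NA, NA), b = c(NA, 2, NA))

# Replace NAs with 0 in every column except the first;
# the assigned value is a list, but the result is still a data frame
df[-1] <- lapply(df[-1], function(x) ifelse(is.na(x), 0, x))
df
##   date a b
## 1    1 1 0
## 2    2 0 2
## 3    3 0 0
class(df)
## [1] "data.frame"
```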
However, although the data does not contain any missing values, the meaning of each row is changed. Originally, product T01
did not take a test on 20160303. The value there should be interpreted as the last tested quality value on or before that day. Another drawback is that, although both products take tests every month in the original data, the preceding reshaped data frame is not aligned to dates of a regular frequency.
One way to fix the drawbacks is to use year-month data instead of an exact date. In the following code, we will create a new column ym
, that is, the first six characters of toy_tests$date
. For example, substr(20160101, 1, 6)
will result in 201601
:
toy_tests$ym <- substr(toy_tests$date, 1, 6)
toy_tests
##    id     date sample quality durability     ym
## 1 T01 20160201    100       9          9 201602
## 2 T01 20160302    150      10          9 201603
## 3 T01 20160405    180       9         10 201604
## 4 T01 20160502    140       9          9 201605
## 5 T02 20160201     70       7          9 201602
## 6 T02 20160303     75       8          8 201603
## 7 T02 20160403     90       9          8 201604
## 8 T02 20160502     85      10          9 201605
This time, we will use the ym
column for alignment instead of date
:
toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
##       ym T01 T02
## 1 201602   9   7
## 2 201603  10   8
## 3 201604   9   9
## 4 201605   9  10
Now, the missing values are gone, and the quality scores of both products in each month are naturally presented.
Sometimes, we need to combine a number of columns into one that indicates the measure and another that stores the value. For example, the following code uses reshape2::melt
to combine the two measures (quality
and durability
) of the original data into a column named measure
and a column of the measured value:
toy_tests2 <- melt(toy_tests, id.vars = c("id", "ym"),
  measure.vars = c("quality", "durability"),
  variable.name = "measure")
toy_tests2
##     id     ym    measure value
## 1  T01 201602    quality     9
## 2  T01 201603    quality    10
## 3  T01 201604    quality     9
## 4  T01 201605    quality     9
## 5  T02 201602    quality     7
## 6  T02 201603    quality     8
## 7  T02 201604    quality     9
## 8  T02 201605    quality    10
## 9  T01 201602 durability     9
## 10 T01 201603 durability     9
## 11 T01 201604 durability    10
## 12 T01 201605 durability     9
## 13 T02 201602 durability     9
## 14 T02 201603 durability     8
## 15 T02 201604 durability     8
## 16 T02 201605 durability     9
The variable names are now contained in the data, which can be directly used by some packages. For example, we can use ggplot2
to plot data in such a format. The following code produces a scatter plot with a facet grid over different combinations of factors:
library(ggplot2)
ggplot(toy_tests2, aes(x = ym, y = value)) +
  geom_point() +
  facet_grid(id ~ measure)
Then, we can see a scatter plot grouped by product id
and measure
with ym
as x values and value
as y values.
The plot can be easily manipulated because the grouping factor (measure
) is contained as data rather than columns, which is easier to represent from the perspective of the ggplot2
package:
ggplot(toy_tests2, aes(x = ym, y = value, color = id)) +
  geom_point() +
  facet_grid(. ~ measure)
This time, the points of the two products are presented in different colors.