Chapter 12. Data Manipulation

In the previous chapter, you learned the methods used to access different types of databases such as relational databases (SQLite and MySQL) and non-relational databases (MongoDB and Redis). Relational databases usually return data in a tabular form, while non-relational databases may support nested data structures and other features.

Even though the data is loaded into memory, it is usually far from ready for analysis. Most data at this stage still needs cleaning and transforming, which may take a large proportion of the time before any statistical model or visualization can be applied. In this chapter, you'll learn about a set of built-in functions and several packages for data manipulation. The packages are extremely powerful, but to work well with them, we need a solid understanding of the material introduced in the previous chapters.

In this chapter, we'll cover the following topics:

  • Using basic functions to manipulate data frames
  • Using SQL to query data frames via the sqldf package
  • Using data.table to manipulate data
  • Using dplyr pipelines to manipulate data frames
  • Using rlist to work with nested data structures

Using built-in functions to manipulate data frames

Previously, you learned the basics of data frames. Here, we will review the built-in functions used to filter a data frame. Although a data frame is essentially a list of vectors, we can access it like a matrix since all column vectors are of the same length. To select rows that meet certain conditions, we will supply a logical vector as the first argument of [], while the second is left empty.

In R, these operations can be done with built-in functions. In this section, we will introduce the built-in functions that are most helpful for manipulating data into the form we need as model input or for presentation. Some of these functions and techniques were already presented in previous chapters.

Most of the code in this section and subsequent sections is based on a set of fictitious data about some products. We will use the readr package to load the data for better handling of column types. If you don't have this package installed, run install.packages("readr"):

library(readr) 
product_info <- read_csv("data/product-info.csv") 
product_info 
##    id      name  type   class released 
## 1 T01    SupCar   toy vehicle      yes 
## 2 T02  SupPlane   toy vehicle       no 
## 3 M01     JeepX model vehicle      yes 
## 4 M02 AircraftX model vehicle      yes 
## 5 M03    Runner model  people      yes 
## 6 M04    Dancer model  people       no 

Once the data is loaded into memory as a data frame, we can take a look at its column types:

sapply(product_info, class) 
##          id        name        type       class    released  
## "character" "character" "character" "character" "character" 

The readr::read_csv function behaves differently from the built-in read.csv. For example, it does not automatically convert string columns to factors (the default behavior of read.csv before R 4.0, which can be problematic and adds little value). Therefore, I recommend that you use the functions provided by readr to read tabular data from files into R. If we were using read.csv with that default, all these columns would be factors with limited possible values.
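To see the difference for yourself, here is a small sketch (using inline text rather than the book's data files) that reads the same kind of data with read.csv, explicitly setting stringsAsFactors = TRUE as older R versions did by default:

```r
# Sketch: read.csv with stringsAsFactors = TRUE (the pre-R 4.0 default)
# turns every string column into a factor, unlike readr::read_csv.
info <- read.csv(text = "id,name\nT01,SupCar\nT02,SupPlane",
  stringsAsFactors = TRUE)
sapply(info, class)
##       id     name
## "factor" "factor"
```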

In the following examples, we will use the product information data loaded earlier to demonstrate basic data filtering and summary techniques.

For example, we will take out all rows of toy type:

product_info[product_info$type == "toy", ] 
##    id     name type   class released 
## 1 T01   SupCar  toy vehicle      yes 
## 2 T02 SupPlane  toy vehicle       no 

Alternatively, we could take out all rows that are not released:

product_info[product_info$released == "no", ] 
##    id     name  type   class released 
## 2 T02 SupPlane   toy vehicle       no 
## 6 M04   Dancer model  people       no 

To filter columns, we will supply a character vector as the second argument while the first is left empty, which is exactly what we did when we subset a matrix:

product_info[, c("id", "name", "type")] 
##    id      name  type 
## 1 T01    SupCar   toy 
## 2 T02  SupPlane   toy 
## 3 M01     JeepX model 
## 4 M02 AircraftX model 
## 5 M03    Runner model 
## 6 M04    Dancer model 

Alternatively, we can filter the data frame by regarding it as a list. We will supply only one character vector of column names in [] and omit the comma:

product_info[c("id", "name", "class")] 
##    id      name   class 
## 1 T01    SupCar vehicle 
## 2 T02  SupPlane vehicle 
## 3 M01     JeepX vehicle 
## 4 M02 AircraftX vehicle 
## 5 M03    Runner  people 
## 6 M04    Dancer  people 

To filter a data frame by both row and column, we will supply a vector as the first argument to select rows and a vector as the second to select columns:

product_info[product_info$type == "toy", c("name", "class", "released")] 
##       name   class released 
## 1   SupCar vehicle      yes 
## 2 SupPlane vehicle       no 

If the row-filtering condition is based on values of certain columns, the preceding code can be very redundant, especially when the condition gets more complicated. Another built-in function that simplifies code is subset, as we introduced previously:

subset(product_info,  
  subset = type == "model" & released == "yes", 
  select = name:class) 
##        name  type   class 
## 3     JeepX model vehicle 
## 4 AircraftX model vehicle 
## 5    Runner model  people 

The subset function uses non-standard evaluation so that we can directly use the columns of the data frame without typing product_info many times, because the expressions are meant to be evaluated in the context of the data frame.

Similarly, we can use with to evaluate an expression in the context of the data frame, that is, the columns of the data frame can be used as symbols in the expression without repeatedly specifying the data frame:

with(product_info, name[released == "no"]) 
## [1] "SupPlane" "Dancer" 

The expression can be more than a simple subsetting. We can summarize the data by counting the occurrences of each possible value of a vector. For example, we can create a table of the occurrences of types of records that are released:

with(product_info, table(type[released == "yes"])) 
##  
## model   toy  
##     3     1 

In addition to the table of product information, we also have a table of product statistics that describes some properties of each product:

product_stats <- read_csv("data/product-stats.csv")  
product_stats  
##    id material size weight 
## 1 T01    Metal  120   10.0 
## 2 T02    Metal  350   45.0 
## 3 M01 Plastics   50     NA 
## 4 M02 Plastics   85    3.0 
## 5 M03     Wood   15     NA 
## 6 M04     Wood   16    0.6 

Now, consider how to get the names of the products with the top three largest sizes. One way is to sort the records in product_stats by size in descending order, select the id values of the top three records, and use these values to filter the rows of product_info by id:

top_3_id <- product_stats[order(product_stats$size, decreasing = TRUE), "id"][1:3] 
product_info[product_info$id %in% top_3_id, ] 
##    id      name  type   class released 
## 1 T01    SupCar   toy vehicle      yes 
## 2 T02  SupPlane   toy vehicle       no 
## 4 M02 AircraftX model vehicle      yes 

Although it works as intended, this approach looks quite redundant. Note that product_info and product_stats actually describe the same set of products from different perspectives, connected by the id column. Each id is unique and refers to the same product in both tables. To access both sets of information, we can put the two tables together into one data frame. The simplest way to do this is to use merge:

product_table <- merge(product_info, product_stats, by = "id") 
product_table 
##    id      name  type   class released material size weight 
## 1 M01     JeepX model vehicle      yes Plastics   50     NA 
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 
## 3 M03    Runner model  people      yes     Wood   15     NA 
## 4 M04    Dancer model  people       no     Wood   16    0.6 
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 

Now, we have a new data frame, product_table, which combines product_info and product_stats on the shared id column. In fact, even if you reorder the records in one of the tables, the two tables can still be correctly merged, because merge matches rows by id rather than by position.
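This order-independence is easy to check. The following sketch uses small stand-in tables (not the book's data files) and merges one of them in a shuffled row order:

```r
# merge() matches rows by the id key, so reordering one table's rows
# does not change the merged result (merge sorts the output by the
# key column by default).
info <- data.frame(id = c("T01", "T02", "M01"),
  name = c("SupCar", "SupPlane", "JeepX"))
stats <- data.frame(id = c("M01", "T02", "T01"),
  size = c(50, 350, 120))
m1 <- merge(info, stats, by = "id")
m2 <- merge(info, stats[c(3, 1, 2), ], by = "id")  # shuffled rows
identical(m1, m2)
## [1] TRUE
```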

The combined table makes many tasks easier. For example, we can now sort the data frame by any column from either of the original tables without having to work with the other manually:

product_table[order(product_table$size), ] 
##    id      name  type   class released material size weight 
## 3 M03    Runner model  people      yes     Wood   15     NA 
## 4 M04    Dancer model  people       no     Wood   16    0.6 
## 1 M01     JeepX model vehicle      yes Plastics   50     NA 
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 

To solve the earlier problem of finding the names of the three largest products, we can directly use the merged table and get the same answer:

product_table[order(product_table$size, decreasing = TRUE), "name"][1:3] 
## [1] "SupPlane"  "SupCar"    "AircraftX" 

The merged data frame also allows us to filter the records by a column from one of the original tables and sort them by a column from the other. For example, we will select all records of the model type and sort them by weight in descending order. Note that we filter first and then order the filtered subset; indexing an already reordered table with a logical vector computed on the original table would select the wrong rows:

model_products <- product_table[product_table$type == "model", ]
model_products[order(model_products$weight, decreasing = TRUE), ]
##    id      name  type   class released material size weight 
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 
## 4 M04    Dancer model  people       no     Wood   16    0.6 
## 1 M01     JeepX model vehicle      yes Plastics   50     NA 
## 3 M03    Runner model  people      yes     Wood   15     NA 

Sometimes, the column values are literal but can be converted to standard R data structures to better represent the data. For example, the released column in product_info only takes yes and no, which can be better represented with a logical vector. We can use <- to modify the column values, as you learned previously. However, it is usually better to create a new data frame with the existing columns properly adjusted and new columns added without polluting the original data. To do this, we can use transform:

transform(product_table,  
  released = ifelse(released == "yes", TRUE, FALSE), 
  density = weight / size) 
##    id      name  type   class released material size weight 
## 1 M01     JeepX model vehicle     TRUE Plastics   50     NA 
## 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0 
## 3 M03    Runner model  people     TRUE     Wood   15     NA 
## 4 M04    Dancer model  people    FALSE     Wood   16    0.6 
## 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0 
## 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0 
##      density 
## 1         NA 
## 2 0.03529412 
## 3         NA 
## 4 0.03750000 
## 5 0.08333333 
## 6 0.12857143 

The result is a new data frame with released converted to a logical vector and a new column, density, added. You can easily verify that product_table is not modified at all.

Also, note that transform works in a way similar to subset because both functions use non-standard evaluation to allow direct use of data frame columns as symbols in the arguments so that we don't have to type product_table$ before columns all the time.
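A related base function, within(), also uses non-standard evaluation: it evaluates assignments in the context of the data frame and returns a modified copy, much like transform. A sketch with stand-in values (not the book's data):

```r
# within() evaluates the assignment with the data frame's columns in
# scope and returns a new data frame; the original df is untouched.
df <- data.frame(size = c(120, 350), weight = c(10, 45))
df2 <- within(df, density <- weight / size)
round(df2$density, 4)
## [1] 0.0833 0.1286
```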

In the preceding data, a number of columns contain missing values represented by NA. Under many circumstances, we don't want any missing values to be present in our data, so we need to somehow deal with them. To demonstrate the various techniques, we will load another table that contains missing values: the test results of the quality, durability, and waterproofing of each product in the previous dataset. We will store the data in product_tests:

product_tests <- read_csv("data/product-tests.csv")  
product_tests 
##    id quality durability waterproof 
## 1 T01      NA         10         no 
## 2 T02      10          9         no 
## 3 M01       6          4        yes 
## 4 M02       6          5        yes 
## 5 M03       5         NA        yes 
## 6 M04       6          6        yes 

Note that the values in both quality and durability contain missing values (NA). To exclude all rows with missing values, we can use na.omit():

na.omit(product_tests) 
##    id quality durability waterproof 
## 2 T02      10          9         no 
## 3 M01       6          4        yes 
## 4 M02       6          5        yes 
## 6 M04       6          6        yes 

Another way is to use complete.cases() to get a logical vector, indicating all complete rows (without any missing value):

complete.cases(product_tests) 
## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE 

Then, we can use this logical vector to filter the data frame. For example, we can get the id of all complete rows:

product_tests[complete.cases(product_tests), "id"] 
## [1] "T02" "M01" "M02" "M04" 

Alternatively, we can get the id of all incomplete rows:

product_tests[!complete.cases(product_tests), "id"] 
## [1] "T01" "M03" 

Note that product_info, product_stats, and product_tests all share an id column, so we can merge them all together. Unfortunately, there's no built-in function that merges an arbitrary number of data frames in one call; merge combines only two data frames at a time, so we have to merge them step by step:

product_full <- merge(product_table, product_tests, by = "id") 
product_full 
##    id      name  type   class released material size weight 
## 1 M01     JeepX model vehicle      yes Plastics   50     NA 
## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 
## 3 M03    Runner model  people      yes     Wood   15     NA 
## 4 M04    Dancer model  people       no     Wood   16    0.6 
## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 
## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 
##   quality durability waterproof 
## 1       6          4        yes 
## 2       6          5        yes 
## 3       5         NA        yes 
## 4       6          6        yes 
## 5      NA         10         no 
## 6      10          9         no 
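The step-by-step merging above can be generalized to any number of tables by folding merge() over a list with Reduce(). A sketch with small stand-in tables (not the book's data files):

```r
# Reduce() applies the two-argument merge repeatedly: the first two
# tables are merged, then the result is merged with the third, and so on.
info  <- data.frame(id = c("T01", "T02"), name    = c("SupCar", "SupPlane"))
stats <- data.frame(id = c("T01", "T02"), size    = c(120, 350))
tests <- data.frame(id = c("T01", "T02"), quality = c(NA, 10))
full <- Reduce(function(x, y) merge(x, y, by = "id"),
  list(info, stats, tests))
ncol(full)
## [1] 4
```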

With the fully merged table, we can use tapply, another apply-family function specialized to work with tabular data, to summarize the data using certain methods over given columns. For example, we can calculate the mean value of quality of each type:

mean_quality1 <- tapply(product_full$quality, 
  list(product_full$type), 
  mean, na.rm = TRUE) 
mean_quality1 
## model   toy  
##  5.75 10.00 

Note that we not only supply mean but also specify na.rm = TRUE to ignore the missing values in quality. The result looks like a numeric vector. Let's take a look at its structure with str():

str(mean_quality1) 
##  num [1:2(1d)] 5.75 10 
##  - attr(*, "dimnames")=List of 1 
##   ..$ : chr [1:2] "model" "toy" 

In fact, it is a one-dimensional array:

is.array(mean_quality1) 
## [1] TRUE 

The tapply function produces an array instead of a simple numeric vector because it easily generalizes to multiple grouping variables. For example, we can compute the mean value of quality for each type and class pair:

mean_quality2 <- tapply(product_full$quality,  
  list(product_full$type, product_full$class),  
  mean, na.rm = TRUE) 
mean_quality2 
##       people vehicle 
## model    5.5       6 
## toy       NA      10 

Now, we have a two-dimensional array, whose values can be extracted by two arguments:

mean_quality2["model", "vehicle"] 
## [1] 6 

Moreover, we can supply more columns for grouping. In the following code, we will use the with() function to reduce redundant typing of product_full:

mean_quality3 <- with(product_full, 
  tapply(quality, list(type, material, released), 
    mean, na.rm = TRUE)) 
mean_quality3 
## , , no 
##  
##       Metal Plastics Wood 
## model    NA       NA    6 
## toy      10       NA   NA 
##  
## , , yes 
##  
##       Metal Plastics Wood 
## model    NA        6    5 
## toy     NaN       NA   NA

Now, a three-dimensional array is produced. Even though na.rm = TRUE is specified, many cells are still missing values, because no observation exists for those combinations of grouping values:

str(mean_quality3) 
##  num [1:2, 1:3, 1:2] NA 10 NA NA 6 NA NA NaN 6 NA ... 
##  - attr(*, "dimnames")=List of 3 
##   ..$ : chr [1:2] "model" "toy" 
##   ..$ : chr [1:3] "Metal" "Plastics" "Wood" 
##   ..$ : chr [1:2] "no" "yes" 

We can access the cell value by supplying three arguments:

mean_quality3["model", "Wood", "yes"] 
## [1] 5 

In summary, tapply groups the input data frame by n specified variables and produces an array with n dimensions. This approach to summarizing data can be hard to work with, especially when there are more grouping columns, mostly because high-dimensional arrays are hard to present and inflexible for further manipulation. Later in this chapter, you will learn several methods that make group summaries much easier.
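As a preview of easier group summaries, the built-in aggregate() function returns its results as a data frame rather than an n-dimensional array. A sketch with stand-in data (not the book's data files):

```r
# aggregate() with a formula groups quality by type and applies mean;
# the formula method drops rows with NA values before aggregating.
df <- data.frame(type    = c("toy", "toy", "model", "model"),
  quality = c(10, NA, 6, 5))
aggregate(quality ~ type, data = df, FUN = mean, na.rm = TRUE)
##    type quality
## 1 model     5.5
## 2   toy    10.0
```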

Reshaping data frames using reshape2

Previously, you learned how to filter, sort, merge, and summarize data frames. These operations only work on rows and columns separately. Sometimes, however, we need to do something more complex.

For example, the following code loads a dataset of tests on quality and durability on different dates of two products:

toy_tests <- read_csv("data/product-toy-tests.csv") 
toy_tests 
##    id     date sample quality durability 
## 1 T01 20160201    100       9          9 
## 2 T01 20160302    150      10          9 
## 3 T01 20160405    180       9         10 
## 4 T01 20160502    140       9          9 
## 5 T02 20160201     70       7          9 
## 6 T02 20160303     75       8          8 
## 7 T02 20160403     90       9          8 
## 8 T02 20160502     85      10          9 

Each row of the preceding data frame represents a test record for a particular product (id) on a certain date. If we need to compare the quality or durability of the two products at the same time, data in this format can be hard to work with. Instead, we need the data transformed as follows so that we can compare the values of the two products more easily:

date      T01   T02 
20160201    9     7 
20160502    9    10 

The reshape2 package is designed for such a transform. If you don't have it installed, run the following command:

install.packages("reshape2") 

Once the package is installed, we can use reshape2::dcast to transform the data so that we can easily compare the quality of different products on the same date. More specifically, it reshapes toy_tests so that the date column is shared, the values in id are spread as columns, and the values for each date and id are quality data:

library(reshape2) 
toy_quality <- dcast(toy_tests, date ~ id, value.var = "quality") 
toy_quality 
##       date T01 T02 
## 1 20160201   9   7 
## 2 20160302  10  NA 
## 3 20160303  NA   8 
## 4 20160403  NA   9 
## 5 20160405   9  NA 
## 6 20160502   9  10 

As you can see, toy_tests is transformed at once: the quality values of both products are now aligned by date. However, although both products are tested every month, the exact test dates do not always match. This results in a missing value whenever one product has a test on a given day but the other does not.

One way to fill the missing value is to use the approach called Last Observation Carried Forward (LOCF), which means that if a non-missing value is followed by a missing value, then the non-missing value is carried forward to replace the missing value, until all subsequent missing values are replaced. One implementation of LOCF is provided by the zoo package. Run the following command to install the package if you don't have it:

install.packages("zoo") 

To demonstrate how it works, we will use zoo::na.locf() to perform this technique over a very simple numeric vector with missing values:

zoo::na.locf(c(1, 2, NA, NA, 3, 1, NA, 2, NA)) 
## [1] 1 2 2 2 3 1 1 2 2 

As expected, each missing value is replaced by the preceding non-missing value. To do the same thing with the T01 and T02 columns of toy_quality, we can sub-assign the processed vectors to the columns:

toy_quality$T01 <- zoo::na.locf(toy_quality$T01) 
toy_quality$T02 <- zoo::na.locf(toy_quality$T02) 

However, if toy_tests contained thousands of products, it would be absurd to write thousands of similar lines of code. A better practice is to apply the function to all columns except the first and sub-assign the results back:

toy_quality[-1] <- lapply(toy_quality[-1], zoo::na.locf) 
toy_quality 
##       date T01 T02 
## 1 20160201   9   7 
## 2 20160302  10   7 
## 3 20160303  10   8 
## 4 20160403  10   9 
## 5 20160405   9   9 
## 6 20160502   9  10 

We use lapply() to perform LOCF over all columns of toy_quality except date and assign the resulting list to the subset of toy_quality without the date column. This works because sub-assignment of a data frame accepts a list and preserves the data frame class.

However, although the data no longer contains missing values, the meaning of each row has changed. Originally, product T01 did not take a test on 20160303; the carried-forward value must be interpreted as the last quality test value on or before that day. Another drawback is that, in the original data, both products take tests every month, but the reshaped data frame is not aligned to dates of a regular frequency.

One way to fix these drawbacks is to use the year-month part instead of the exact date. In the following code, we will create a new column, ym, taken from the first six characters of toy_tests$date. For example, substr("20160101", 1, 6) results in "201601":

toy_tests$ym <- substr(toy_tests$date, 1, 6) 
toy_tests 
##    id     date sample quality durability     ym 
## 1 T01 20160201    100       9          9 201602 
## 2 T01 20160302    150      10          9 201603 
## 3 T01 20160405    180       9         10 201604 
## 4 T01 20160502    140       9          9 201605 
## 5 T02 20160201     70       7          9 201602 
## 6 T02 20160303     75       8          8 201603 
## 7 T02 20160403     90       9          8 201604 
## 8 T02 20160502     85      10          9 201605 

This time, we will use the ym column for alignment instead of date:

toy_quality <- dcast(toy_tests, ym ~ id,  
    value.var = "quality") 
toy_quality 
##       ym T01 T02 
## 1 201602   9   7 
## 2 201603  10   8 
## 3 201604   9   9 
## 4 201605   9  10 

Now, the missing values are gone, and the quality scores of both products in each month are naturally presented.

Sometimes, we need to combine a number of columns into one that indicates the measure and another that stores the value. For example, the following code uses reshape2::melt to combine the two measures (quality and durability) of the original data into a column named measure and a column of the measured value:

toy_tests2 <- melt(toy_tests, id.vars = c("id", "ym"),  
  measure.vars = c("quality", "durability"), 
  variable.name = "measure") 
toy_tests2 
##     id     ym    measure value 
## 1  T01 201602    quality     9 
## 2  T01 201603    quality    10 
## 3  T01 201604    quality     9 
## 4  T01 201605    quality     9 
## 5  T02 201602    quality     7 
## 6  T02 201603    quality     8 
## 7  T02 201604    quality     9 
## 8  T02 201605    quality    10 
## 9  T01 201602 durability     9 
## 10 T01 201603 durability     9 
## 11 T01 201604 durability    10 
## 12 T01 201605 durability     9 
## 13 T02 201602 durability     9 
## 14 T02 201603 durability     8 
## 15 T02 201604 durability     8 
## 16 T02 201605 durability     9 

The variable names are now contained in the data itself, which some packages can use directly. For example, we can use ggplot2 to plot data in this format. The following code produces a scatter plot with a facet grid over the combinations of the two factors:

library(ggplot2) 
ggplot(toy_tests2, aes(x = ym, y = value)) +  
  geom_point() + 
  facet_grid(id ~ measure) 

This produces a scatter plot faceted by product id and measure, with ym on the x axis and value on the y axis.


The plot is easy to manipulate because the grouping variable (measure) is stored as data rather than as separate columns, which is the representation the ggplot2 package works with best:

ggplot(toy_tests2, aes(x = ym, y = value, color = id)) +  
  geom_point() + 
  facet_grid(. ~ measure) 

This time, the points of the two products are shown in different colors.
