© Matt Wiley and Joshua F. Wiley 2016

Matt Wiley and Joshua F. Wiley, Advanced R, 10.1007/978-1-4842-2077-1_9

9. Other Tools for Data Management

Matt Wiley and Joshua F. Wiley1

(1)Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA

Comparing data frames and data tables leads to an interesting question: what if there were still other ways to store data that are all, at their heart, tables of some sort? The idea behind dplyr (Wickham and Francois, 2015) is that regardless of what the data back end might be, our experience should be the same. To allow this, dplyr implements generic functions for common data management tasks. For each of these generic functions, specific methods are written that translate the generic operation into whatever code or language is required for a specific back end. This layer of abstraction ensures that users get a consistent experience, regardless of the specific data format, or back end, being used. It also makes dplyr extensible, in that support for a new format can be added simply by writing additional methods; the user experience need not change. For this chapter, our checkpoint header needs both the tibble and dplyr packages installed and added:

library(checkpoint)            
checkpoint("2016-09-04", R.version = "3.3.1")
library(dplyr)
library(tibble)

The dplyr package names functions after verbs that accomplish a specific action. That is, rather than having a few functions that each perform many tasks, dplyr has many functions that each accomplish one specific task. Unlike data.table (Dowle, Srinivasan, Short, and Lianoglou, 2015), dplyr is functional in that it always takes data as input and returns data as output; it does not modify the data in place in memory as data.table does. However, it is still quite fast, because many of its functions are written in optimized C++ and dplyr avoids making unnecessary copies.
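
To make the contrast concrete, here is a minimal sketch (not from the original text) showing that a dplyr verb returns a new object and leaves its input untouched:

## a small demonstration data frame
df <- data.frame(x = 1:3)
df2 <- mutate(df, y = x * 2)  ## returns a new data frame containing y
names(df)                     ## df itself is unchanged: only "x"
names(df2)                    ## the returned copy has both "x" and "y"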

As before, we set some options to reduce the number of rows printed. The default data format for dplyr is a tibble (Wickham, Francois, and Müller, 2016). Tibbles are implemented in the tibble package. Essentially, tibbles are an extension of data frames with some additional restrictions (for example, character strings are kept as characters rather than coerced to factors) and nicer methods for printing or displaying on the screen. Data manipulation is handled by dplyr, while tibble handles the actual storage of data. In technical terms, the tibble package implements the R object class for tibbles as well as some basic methods.

options(stringsAsFactors = FALSE,              
         tibble.print_max = 20, ## if over 20 rows
         tibble.print_min = 5, ## print first 5
         digits = 2) ## reduce digits printed by R


iris$Species <- as.character(iris$Species)
diris <- as_tibble(iris)
diris
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
# ... with 145 more rows

For data.table, we used sapply() to get a vector of classes for each column or variable. Although we could do the same with tibbles, tibbles automatically display the type or class of each column as well as the total number of rows and columns (for example, <dbl> indicates a double, or numeric, value, while <chr> indicates character data); a quick sketch of the sapply() approach follows anyway. We now turn our attention to the common goals of working with data: sorting, subsetting, ordering, computing, and reshaping. A helpful way to think of dplyr is that it uses verbs to describe actions on the data.
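
Although the printed tibble already shows this information, the data.table-style check still works on a tibble; a brief sketch (not in the original text):

sapply(diris, class)  ## class of each column, as we did with data.table
class(diris)          ## a tibble is also a tbl and a data.frame underneath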

Sorting

We sort in dplyr with the arrange() function, which takes a data set as its first argument. Additional arguments name the columns to sort by. The default is increasing order; wrapping a column name in desc() sorts that column in descending order.

diris <- arrange(diris, Sepal.Length)              
diris
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         2.9          1.4         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         3.2          1.3         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
# ... with 145 more rows
diris <- arrange(diris, Sepal.Length, desc(Sepal.Width))
diris
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
# ... with 145 more rows

The anyDuplicated() function returns the index of the first duplicate, whether for the entire data set or for just a particular column. When called on the entire data set, it looks for a match across all columns, which occurs for rows 78 and 79 in this data set (so the first duplicate is row 79). The first duplicated sepal length occurs in the third row, as shown in the following code:

anyDuplicated(diris)                
[1] 79


anyDuplicated(diris$Sepal.Length)
[1] 3

Whereas anyDuplicated() returns the row index of the first duplicate, duplicated() returns a logical vector flagging every duplicate, so tabulating it with table() tells us how many duplicates occur. Looking at our output, we see that a full match on all columns occurs only once (row 79). On the other hand, duplicated sepal lengths may happen for the first time in row 3, but there are many more of them.

table(duplicated(diris))              
FALSE  TRUE
  149     1
table(duplicated(diris$Sepal.Length))
FALSE  TRUE
   35   115

If we want to see only distinct rows, distinct() works much like arrange(). The first formal argument is the data, while the rest specify column names. If no column names are specified, distinct() operates on all columns. Recall from the duplicated() call that there are 149 unique rows.

distinct(diris)              
# A tibble: 149 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
# ... with 144 more rows

Similarly, recall that there are only 35 unique values if we consider only sepal length. Consequently, passing the Sepal.Length column as the second formal argument of distinct() yields a tibble with only 35 rows. We show the first five of those rows here:

distinct(diris, Sepal.Length)              
# A tibble: 35 x 1
  Sepal.Length
         <dbl>
1          4.3
2          4.4
3          4.5
4          4.6
5          4.7
# ... with 30 more rows

This section closes with one last call to distinct(), with more than one column passed in the second and third formal arguments. Here, a row is unique provided the combination of Sepal.Length and Sepal.Width is not duplicated. Pay special attention to rows 2 through 4, which have the same Sepal.Length yet are not duplicates because their Sepal.Width values differ:

distinct(diris, Sepal.Length, Sepal.Width)              
# A tibble: 117 x 2
  Sepal.Length Sepal.Width
         <dbl>       <dbl>
1          4.3         3.0
2          4.4         3.2
3          4.4         3.0
4          4.4         2.9
5          4.5         2.3
# ... with 112 more rows

Selecting and Subsetting

Selecting portions of the stored data in dplyr is different from the data.table methodology. Remember, dplyr works via function calls. These functions are designed to be independent of the underlying data structure. Nevertheless, we do want the same end results. Selecting rows by position is accomplished using slice(), which takes two formal arguments. The first is the data object, in our case diris, and the second allows us to select specific rows. We first select just the first five rows:

slice(diris, 1:5)              
# A tibble: 5 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa

As seen before, - can be used for negation. Therefore, we could keep only the last five rows by dropping rows 1 through 145:

slice(diris, -(1:145))              
# A tibble: 5 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
         <dbl>       <dbl>        <dbl>       <dbl>     <chr>
1          7.7         3.8          6.7         2.2 virginica
2          7.7         3.0          6.1         2.3 virginica
3          7.7         2.8          6.7         2.0 virginica
4          7.7         2.6          6.9         2.3 virginica
5          7.9         3.8          6.4         2.0 virginica

Rather than select rows directly, we may set logical conditions on them. Again, these arguments belong in the second formal of our function call, but in this case we use filter() rather than slice(). The filter() function uses logical indexing. It is easy enough to select all the rows in which Species is (or is not) setosa by writing a logical test:

filter(diris, Species == "setosa")              
# A tibble: 50 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
# ... with 45 more rows
 filter(diris, Species != "setosa")
# A tibble: 100 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1          4.9         2.5          4.5         1.7  virginica
2          4.9         2.4          3.3         1.0 versicolor
3          5.0         2.3          3.3         1.0 versicolor
4          5.0         2.0          3.5         1.0 versicolor
5          5.1         2.5          3.0         1.1 versicolor
# ... with 95 more rows

Rows may also be selected based on numeric values, using inequality, equality, or nonequality tests. Numeric and character tests may be mixed and matched to extract precisely the rows desired:

filter(iris, Sepal.Length < 5 & Petal.Width > .2)              
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.6         3.4          1.4         0.3     setosa
2          4.5         2.3          1.3         0.3     setosa
3          4.8         3.0          1.4         0.3     setosa
4          4.9         2.4          3.3         1.0 versicolor
5          4.9         2.5          4.5         1.7  virginica
filter(diris, Sepal.Length == 4.3 | Sepal.Width != 4.4)
# A tibble: 149 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
# ... with 144 more rows

If we seek multiple matches, the %in% operator may be used with a vector of values of interest. This lets us define the values of interest separately and then filter based on them:

interest <- c(4.3, 4.4)              
filter(diris, Sepal.Length %in% interest)
# A tibble: 4 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3         3.0          1.1         0.1  setosa
2          4.4         3.2          1.3         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa

Alternatively, if we wish to exclude any rows that have a certain characteristic, we can do so via negation. This is useful for removing outliers or erroneous data points:

filter(diris, !Sepal.Length %in% interest)              
# A tibble: 146 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.5         2.3          1.3         0.3  setosa
2          4.6         3.6          1.0         0.2  setosa
3          4.6         3.4          1.4         0.3  setosa
4          4.6         3.2          1.4         0.2  setosa
5          4.6         3.1          1.5         0.2  setosa
# ... with 141 more rows

The last filter() example we show uses several arguments. Using multiple arguments is not required, but it can be a helpful way to break code into human-readable chunks. Multiple arguments are logically equivalent to combining the conditions with & (an equivalent single-condition call is sketched after the output).

filter(diris,              
       Sepal.Length == 4.3 | Sepal.Length == 4.4,
       Petal.Width < .2)
# A tibble: 1 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          4.3           3          1.1         0.1  setosa
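
For comparison, here is a sketch (not in the original text) of the same filter written as a single condition:

## equivalent to the multi-argument filter() call above
filter(diris,
       (Sepal.Length == 4.3 | Sepal.Length == 4.4) & Petal.Width < .2)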

So far, we’ve selected rows from all columns. If we wish to select only specific columns, the select() function takes a data object as its first formal argument and column information in the remaining formals. Selecting one or more named columns is as simple as listing the column names. It is also possible to select by position. Contrast the named versus positional column arguments below, and notice that both can select columns in any order needed:

select(diris, Sepal.Length, Sepal.Width)              
# A tibble: 150 x 2
  Sepal.Length Sepal.Width
         <dbl>       <dbl>
1          4.3         3.0
2          4.4         3.2
3          4.4         3.0
4          4.4         2.9
5          4.5         2.3
# ... with 145 more rows
select(diris, 1, 5, 2)
# A tibble: 150 x 3
  Sepal.Length Species Sepal.Width
         <dbl>   <chr>       <dbl>
1          4.3  setosa         3.0
2          4.4  setosa         3.2
3          4.4  setosa         3.0
4          4.4  setosa         2.9
5          4.5  setosa         2.3
# ... with 145 more rows

So far, specific columns have been named directly. What if the columns need to vary depending on the program or the user’s needs? In that case, character string references are desirable, because strings are easy to store in variables. We show both a direct call to select_() and the use of a variable v to achieve the same results. Be sure to notice the underscore added to the name of the function! When using select_(), multiple variable names can be passed as separate character string arguments.

select_(diris, "Sepal.Length", "Sepal.Width")              
# A tibble: 150 × 2
  Sepal.Length Sepal.Width
         <dbl>       <dbl>
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
# ... with 145 more rows

Our variable, v, contains more than one column name. However, if it is passed in the default way, only the first name from the vector is used. To return all the variables, an object holding a vector of variable names should be passed to the .dots formal argument, as shown in the second example that follows:

v <- c("Sepal.Length", "Sepal.Width")                
select_(diris, v)
# A tibble: 150 x 1
  Sepal.Length
         <dbl>
1          4.3
2          4.4
3          4.4
4          4.4
5          4.5
# ... with 145 more rows


select_(diris, .dots = v)
# A tibble: 150 × 2
  Sepal.Length Sepal.Width
         <dbl>       <dbl>
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
# ... with 145 more rows

Another way of achieving the same result with our original select() function is by using one of the select helper functions. There are seven of these helper functions, and the one we presently want is one_of(), which manages character vectors:

select(diris, one_of(v))              
# A tibble: 150 x 1
  Sepal.Length
         <dbl>
1          4.3
2          4.4
3          4.4
4          4.4
5          4.5
# ... with 145 more rows

Because tibbles are built on data frames, bracket-style approaches like those used with data tables also work. These are not guaranteed to translate as smoothly throughout the dplyr experience, so it seems best to engage dplyr in its native language. However, the next two calls are shown for completeness:

diris[, v]              
# A tibble: 150 x 1
  Sepal.Length
         <dbl>
1          4.3
2          4.4
3          4.4
4          4.4
5          4.5
# ... with 145 more rows
diris[, 1]
# A tibble: 150 x 1
  Sepal.Length
         <dbl>
1          4.3
2          4.4
3          4.4
4          4.4
5          4.5
# ... with 145 more rows

So far, all our work returns another tibble as output. Several techniques extract the raw data as a vector. If we know the column name containing the data we want, it is easy to type the name directly. If our column needs to be variable, we reuse our variable v for the third method. Regardless of the method, the console prints the same data. Notice that these results are vectors, not tibbles.

head(diris[["Sepal.Length"]])
[1] 4.3 4.4 4.4 4.4 4.5 4.6
head(diris$Sepal.Length)       # easy to type
[1] 4.3 4.4 4.4 4.4 4.5 4.6
head(diris[[v]])               # easy if you have an R variable
[1] 4.3 4.4 4.4 4.4 4.5 4.6

Excluding a column or columns may be of more interest than selecting certain columns. Exclusion uses negation in several formats. Particularly, either the column(s) to be excluded may be named or referenced via position:

select(diris, -1)              
# A tibble: 150 x 4
  Sepal.Width Petal.Length Petal.Width Species
        <dbl>        <dbl>       <dbl>   <chr>
1         3.0          1.1         0.1  setosa
2         3.2          1.3         0.2  setosa
3         3.0          1.3         0.2  setosa
4         2.9          1.4         0.2  setosa
5         2.3          1.3         0.3  setosa
# ... with 145 more rows
select(diris, -Sepal.Length, -Petal.Width)
# A tibble: 150 x 3
  Sepal.Width Petal.Length Species
        <dbl>        <dbl>   <chr>
1         3.0          1.1  setosa
2         3.2          1.3  setosa
3         3.0          1.3  setosa
4         2.9          1.4  setosa
5         2.3          1.3  setosa
# ... with 145 more rows

Again, when variable exclusion is needed, we set a variable to hold the vector of names and paste negation onto the front of each name with paste0(), then pass the resulting character strings to select_(). Alternatively, we may use negation with the select helper function one_of() inside our usual select() call. Either way, we get the same results:

ex <- paste0("-", v)              
select_(diris, ex)
# A tibble: 150 x 4
  Sepal.Width Petal.Length Petal.Width Species
        <dbl>        <dbl>       <dbl>   <chr>
1         3.0          1.1         0.1  setosa
2         3.2          1.3         0.2  setosa
3         3.0          1.3         0.2  setosa
4         2.9          1.4         0.2  setosa
5         2.3          1.3         0.3  setosa
# ... with 145 more rows
select(diris, -one_of(v))
# A tibble: 150 x 4
  Sepal.Width Petal.Length Petal.Width Species
        <dbl>        <dbl>       <dbl>   <chr>
1         3.0          1.1         0.1  setosa
2         3.2          1.3         0.2  setosa
3         3.0          1.3         0.2  setosa
4         2.9          1.4         0.2  setosa
5         2.3          1.3         0.3  setosa
# ... with 145 more rows

Having now used this select helper function a couple of times, we look briefly at four more of them. These four help us choose variables without typing out an entire column name. In no particular order, starts_with(), ends_with(), and contains() match literal strings, while matches() uses a regular expression; all default to looking in the current variable list and to ignoring letter case, although the latter may be turned off. Rather than demonstrate all of these, the following code shows two examples using starts_with(). One benefit of the dplyr approach of making focused functions named after verbs that accomplish one specific task is that it is easy to tell what a function does from its name. The remaining helpers operate similarly to starts_with() and do exactly what their names indicate:

select(diris, starts_with("s"))              
# A tibble: 150 x 3
  Sepal.Length Sepal.Width Species
         <dbl>       <dbl>   <chr>
1          4.3         3.0  setosa
2          4.4         3.2  setosa
3          4.4         3.0  setosa
4          4.4         2.9  setosa
5          4.5         2.3  setosa
# ... with 145 more rows
select(diris, starts_with("s", ignore.case = FALSE))
# A tibble: 150 x 0
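
For readers who want to see another helper in action, here is a brief sketch (not in the original text) using contains() and matches():

select(diris, contains("Width"))    ## column names containing "Width"
select(diris, matches("^Petal"))    ## regular expression: names starting with Petal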

With this last bit of code, our journey through dplyr subsetting ends. We turn our attention to variable renaming and ordering.

Variable Renaming and Ordering

It is a regrettably routine part of data hygiene that column names require renaming or that columns require reordering. What made sense (or was "legal") in one database may not work in another. Many proprietary data management systems have fixed names that are unwieldy.

Viewing the names associated with a dplyr object uses the usual names() or colnames() functions, called on our data set diris. Either way, we get the same result, in the current column order:

names(diris)              
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
colnames(diris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"  

To change a name, call the rename() function on the data object and set the new name in the second formal argument using the format new name = old name. Multiple columns can be renamed at once by passing them as additional arguments (for example, in the third and fourth formals); a sketch follows the output below. Confirm that the change took place by comparing the results of names() with the old results. In the following code, we remove the full stop between Sepal and Length in our data:

diris <- rename(diris, SepalLength = Sepal.Length)              
names(diris)
[1] "SepalLength"  "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

To reorder the columns so that the lengths and the widths are paired together, simply use the select() function and reassign the result to your data object. Be sure to remember that the renamed SepalLength no longer has a full stop in the middle!

diris <- select(diris, SepalLength, Petal.Length, Sepal.Width, Petal.Width, Species)              
diris
# A tibble: 150 x 5
  SepalLength Petal.Length Sepal.Width Petal.Width Species
        <dbl>        <dbl>       <dbl>       <dbl>   <chr>
1         4.3          1.1         3.0         0.1  setosa
2         4.4          1.3         3.2         0.2  setosa
3         4.4          1.3         3.0         0.2  setosa
4         4.4          1.4         2.9         0.2  setosa
5         4.5          1.3         2.3         0.3  setosa
# ... with 145 more rows

Earlier, you saw that numeric select calls could reorder columns, so we close out this section with one last reordering example. Here, we use a variable to hold numeric positions, although we could just as well use character names in place of numbers. Then we use the select_() function to reorganize our columns back to their original order.

v <- c(1, 3, 2, 4, 5)              
diris <- select_(diris, .dots = as.list(v))
diris
# A tibble: 150 x 5
  SepalLength Sepal.Width Petal.Length Petal.Width Species
        <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1         4.3         3.0          1.1         0.1  setosa
2         4.4         3.2          1.3         0.2  setosa
3         4.4         3.0          1.3         0.2  setosa
4         4.4         2.9          1.4         0.2  setosa
5         4.5         2.3          1.3         0.3  setosa
# ... with 145 more rows

Computing on Data and Creating Variables

Now that you have a good sense of how to manage original data in dplyr, we’ll start to focus on how to create new data from old data. For simplicity’s sake, we re-create our diris data set from the original iris data set. Then we’re ready to use the mutate() function to create or replace columns. This function takes the usual data argument first, and the later formals express how the new columns are created. Whether there is one or many, simply write expressions that tell dplyr what to append to the tibble:

diris <- as_tibble(iris)  ## re-create diris from the original iris data
diris <- mutate(diris, V0 = 0, X1 = 1L, X2 = 2L)
diris
# A tibble: 150 x 8
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    V0    X1    X2
         <dbl>       <dbl>        <dbl>       <dbl>   <chr> <dbl> <int> <int>
1          5.1         3.5          1.4         0.2  setosa     0     1     2
2          4.9         3.0          1.4         0.2  setosa     0     1     2
3          4.7         3.2          1.3         0.2  setosa     0     1     2
4          4.6         3.1          1.5         0.2  setosa     0     1     2
5          5.0         3.6          1.4         0.2  setosa     0     1     2
# ... with 145 more rows

Of course, new data might not be about creating wholly new columns. Instead, we may create calculated columns based on existing information. Perhaps we want a new column named V that is the product of Petal.Length and Petal.Width. For clarity’s sake, we first remove the three columns just added:

diris <- select(diris, -V0, -X1, -X2)              
diris <- mutate(diris, V = Petal.Length * Petal.Width)
diris
# A tibble: 150 x 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species     V
         <dbl>       <dbl>        <dbl>       <dbl>   <chr> <dbl>
1          5.1         3.5          1.4         0.2  setosa  0.28
2          4.9         3.0          1.4         0.2  setosa  0.28
3          4.7         3.2          1.3         0.2  setosa  0.26
4          4.6         3.1          1.5         0.2  setosa  0.30
5          5.0         3.6          1.4         0.2  setosa  0.28
# ... with 145 more rows

More exotic calculations may be performed only when certain conditions hold. Here, if_else() takes three arguments. The first is the logical condition to test; the second and third formals indicate what to do when the condition is true or false, respectively. Additionally, mutate() allows us not only to create a column but also to change an existing one. In the first pass, we create a column V2 that is the product of petal length and width for the setosa species (and missing otherwise); in the second pass, we overwrite V2 with the square root of that product for virginica. To make sure we see a sample of each species, we use the slice() function to look at three specific rows:

diris <- mutate(diris, V2 = if_else(Species == "setosa", Petal.Length * Petal.Width, NA_real_))              
diris <- mutate(diris, V2 = if_else(Species == "virginica",
                                    sqrt(Petal.Length * Petal.Width), V2))
slice(diris, c(1, 51, 101))
# A tibble: 3 x 7
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species     V    V2
         <dbl>       <dbl>        <dbl>       <dbl>      <chr> <dbl> <dbl>
1          5.1         3.5          1.4         0.2     setosa  0.28  0.28
2          7.0         3.2          4.7         1.4 versicolor  6.58    NA
3          6.3         3.3          6.0         2.5  virginica 15.00  3.87

It is not necessary to create a new column; we can simply calculate summary information from a table by using the summarise() function. The first formal argument, as always, tells the function which data to use, while the rest hold individual calculations. One or many calculations may be coded; here, we demonstrate two, the arithmetic mean and the standard deviation:

summarise(diris, M = mean(Sepal.Width), SD = sd(Sepal.Width))              
# A tibble: 1 x 2
      M    SD
  <dbl> <dbl>
1   3.1  0.44

Unlike data.table, dplyr works via functions. While your mathematics professor may have spent some time lecturing about function composition (and nesting functions inside each other has its place), dplyr offers a cleaner look: chaining several functions together with the %>% operator. Suppose we want to take our data, use only the virginica species, and calculate the mean of the sepal widths. The result should be a tibble of just one variable, the mean we seek. Note that the %>% operator passes the object on its left as the first argument of the function on its right. Thus, even though Species == "virginica" looks like the first argument to filter() in the following code, the pipe supplies the data set as the first argument, so what we wrote becomes the second argument (an equivalent nested call is sketched after the output):

diris  %>%  filter(Species == "virginica") %>%  summarise(M = mean(Sepal.Width))              
# A tibble: 1 x 1
      M
  <dbl>
1     3
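
The pipe above is exactly equivalent to nesting the calls; a sketch (not in the original text) for comparison:

## function composition: the same result without %>%
summarise(filter(diris, Species == "virginica"), M = mean(Sepal.Width))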

This idea of taking it one step at a time can be extended to all sorts of data manipulations. Thinking back to our study of data.table, the third formal of that structure, by =, allowed us to organize by elements of a column. For dplyr, we use group_by() on a column name to accomplish the same task. Expanding our mean example to all three species, and computing means for two of our original columns, we now have a three-column tibble with one row for each species. Notice that virginica’s sepal width mean is 3, regardless of how we calculate it.

diris %>%  group_by(Species) %>%  summarise(M1 = mean(Sepal.Width), M2 = mean(Petal.Width))              
# A tibble: 3 x 3
     Species    M1    M2
       <chr> <dbl> <dbl>
1     setosa   3.4  0.25
2 versicolor   2.8  1.33
3  virginica   3.0  2.03

This chaining structure, just like the data table chaining structure, can be done indefinitely (or close to it). To find a correlation in the preceding data, we simply add one more step to the end of our chain:

diris %>%              
  group_by(Species) %>%
  summarise( M1 = mean(Sepal.Width),  M2 = mean(Petal.Width)) %>%
  summarise(r = cor(M1, M2))
# A tibble: 1 x 1
      r
  <dbl>
1 -0.76

Grouped data can be used to create new variables as well. We create a new variable based on whether a petal width is greater than the median petal width for that species. The printed output looks a bit different because the tibble now carries grouping information, but it is still a tibble:

diris <- diris %>%                
  group_by(Species) %>%
  mutate(MedPW = Petal.Width > median(Petal.Width))
diris
Source: local data frame [150 x 8]
Groups: Species [3]


  Sepal.Length Sepal.Width Petal.Length Petal.Width Species     V    V2 MedPW
         <dbl>       <dbl>        <dbl>       <dbl>   <chr> <dbl> <dbl> <lgl>
1          5.1         3.5          1.4         0.2  setosa  0.28  0.28 FALSE
2          4.9         3.0          1.4         0.2  setosa  0.28  0.28 FALSE
3          4.7         3.2          1.3         0.2  setosa  0.26  0.26 FALSE
4          4.6         3.1          1.5         0.2  setosa  0.30  0.30 FALSE
5          5.0         3.6          1.4         0.2  setosa  0.28  0.28 FALSE
# ... with 145 more rows

It is also easy to group by multiple variables. Grouping by both species and the MedPW indicator, we recalculate means for sepal and petal widths:

diris  %>%                
  group_by(Species, MedPW) %>%
  summarise( M1 = mean(Sepal.Width),  M2 = mean(Petal.Width))
Source: local data frame [6 x 4]
Groups: Species [?]


     Species MedPW    M1    M2
       <chr> <lgl> <dbl> <dbl>
1     setosa FALSE   3.4  0.19
2     setosa  TRUE   3.5  0.38
3 versicolor FALSE   2.6  1.19
4 versicolor  TRUE   3.0  1.50
5  virginica FALSE   2.8  1.81
6  virginica  TRUE   3.1  2.27

Again, it is always possible to chain on more operations:

diris %>%              
  group_by(Species, MedPW) %>%
  summarise(
     M1 = mean(Sepal.Width),
     M2 = mean(Petal.Width)) %>%
  group_by(MedPW) %>%
  summarise(r = cor(M1, M2))
# A tibble: 2 x 2
  MedPW     r
  <lgl> <dbl>
1 FALSE -0.78
2  TRUE -0.76

Merging and Reshaping Data

Without repeating what we said about merging and reshaping data in the data.table chapter, we note that a call to merge() still works, after a fashion. This is the same base R generic we used there, and because tibbles are built on data frames, it works on them too. However, the merged result loses its tibble status. Thus, we also introduce the *_join() functions, which are much closer to traditional database languages.

Once again, we first run a bit of code to make four separate data objects. In real life this step is rarely needed, because the pieces to be merged usually come from different data sources:

diris <- diris  %>%  group_by(Species) %>%  select(Species, Sepal.Width) %>%  slice(1:3)                
diris
Source: local data frame [9 x 2]
Groups: Species [3]


     Species Sepal.Width
       <chr>       <dbl>
1     setosa         3.5
2     setosa         3.0
3     setosa         3.2
4 versicolor         3.2
5 versicolor         3.2
6 versicolor         3.1
7  virginica         3.3
8  virginica         2.7
9  virginica         3.0


diris2 <- slice(diris, c(1, 4, 7))
diris2
Source: local data frame [3 x 2]
Groups: Species [3]


     Species Sepal.Width
       <chr>       <dbl>
1     setosa         3.5
2 versicolor         3.2
3  virginica         3.3


dalt1 <- tibble(
   Species = c("setosa", "setosa", "versicolor", "versicolor",Data managementmerging and reshaping Data
               "virginica", "virginica", "other", "other"),
   Type = c("wide", "wide", "wide", "wide",
            "narrow", "narrow", "moderate", "moderate"),
   MedPW = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
dalt1
# A tibble: 8 x 3
     Species     Type MedPW
       <chr>    <chr> <lgl>
1     setosa     wide  TRUE
2     setosa     wide FALSE
3 versicolor     wide  TRUE
4 versicolor     wide FALSE
5  virginica   narrow  TRUE
6  virginica   narrow FALSE
7      other moderate  TRUE
8      other moderate FALSE


dalt2 <- slice(dalt1, c(1, 3))
dalt2
# A tibble: 2 x 3
     Species  Type MedPW
       <chr> <chr> <lgl>
1     setosa  wide  TRUE
2 versicolor  wide  TRUE

The merge() function operates on two data sets. For one-to-one data such as diris2 and dalt2, merge() by default keeps only the rows that match in both data sets. We can see that for these two objects, the virginica row in diris2 is dropped:

merge(diris2, dalt2)              
     Species Sepal.Width Type MedPW
1     setosa         3.5 wide  TRUE
2 versicolor         3.2 wide  TRUE

If we wish to keep all three rows of diris2, there will be some missing data after the merge. Because we chose diris2 as the first argument, all.x = TRUE keeps all of the diris2 rows:

merge(diris2, dalt2, all.x = TRUE)              
     Species Sepal.Width Type MedPW
1     setosa         3.5 wide  TRUE
2 versicolor         3.2 wide  TRUE
3  virginica         3.3 <NA>    NA

It is also possible to keep all rows of the second object with all.y = TRUE, although in this case that makes no noticeable difference. Depending on how well two particular tables match, the pattern of missing values will differ. Similarly, it is possible to ask for all rows from both tables with all = TRUE. This has the potential to produce the most missing values; in our example, such a merge would be identical to the three rows shown in the preceding example (both variations are sketched below).
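
A sketch of those two variations (not shown in the original text):

## keep all rows of the second object; no different from the plain merge here
merge(diris2, dalt2, all.y = TRUE)
## keep all rows from both objects; identical to the all.x result for these data
merge(diris2, dalt2, all = TRUE)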

As mentioned, these merged results are not tibbles. We may wrap them in as_tibble() if we wish to merge and then coerce back to a tibble. However, dplyr also operates on database-style objects, as you’ll see in our database chapter. In particular, Structured Query Language (SQL) can be accessed through dplyr. SQL has joins, and we’ll look at four types of joins now. To generalize greatly, merge() is described in terms of how rows are matched, while joins are described in terms of which rows and columns of object 1 and object 2 are kept.

A left_join() returns all rows of the first object and all columns of both the first and second objects. Because this is a dplyr function call, we get a tibble as our output. As you can see, this is similar (although not a perfect match in all cases) to all.x = TRUE:

left_join(diris2,dalt2)                
Joining, by = "Species"
Source: local data frame [3 x 4]
Groups: Species [?]


     Species Sepal.Width  Type MedPW
       <chr>       <dbl> <chr> <lgl>
1     setosa         3.5  wide  TRUE
2 versicolor         3.2  wide  TRUE
3  virginica         3.3  <NA>    NA

A right_join() is much like a left join, except it returns all rows from the second object; again, all columns from both objects are returned. In this case, there is no major difference between this type of join and the default merge(). However, this case is not all cases. Many charts on the Internet claim to speak the truth about how joins work, and in our experience the accuracy of those explanations can be hit or miss. Our advice is to grab several small data tables, or cook up your own as we did here, and start joining things. After doing it often, you’ll gain a sense of which technique gets you closest to the data you want.

right_join(diris2,dalt2)                
Joining, by = "Species"
Source: local data frame [2 x 4]
Groups: Species [?]


     Species Sepal.Width  Type MedPW
       <chr>       <dbl> <chr> <lgl>
1     setosa         3.5  wide  TRUE
2 versicolor         3.2  wide  TRUE

The closest match to the basic merge() call is the inner_join() function, which in this case would yield the same results as the previous right join (a sketch follows the next output). By contrast, from a merge point of view, the strangest join is anti_join(), which returns all rows in the first data object that do not have matching values in the second object. As for columns, it returns only the first object’s columns.

anti_join(diris2,dalt2)                
Joining, by = "Species"
Source: local data frame [1 x 2]
Groups: Species [3]


    Species Sepal.Width
      <chr>       <dbl>
1 virginica         3.3
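
For completeness, a sketch of inner_join() on the same objects (not shown in the original text); as noted above, it returns the same two rows as the right join:

## rows with matches in both objects, all columns from both
inner_join(diris2, dalt2)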

Between joins and merges, there are plenty of options for combining data into a single object. Remember, your goal is to eventually get just the data needed for analysis. Sometimes it makes sense to first combine one or more collections of data into a single table, and from there use the techniques from earlier in this chapter to reduce that to your desired data set. Of course, if merges are more familiar, they often work just fine. If you truly require dplyr, as_tibble() always works as a wrapper to coerce your object back into a tibble; after all, data.table and dplyr’s tibbles are both extensions of data frames.
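
A one-line sketch of that wrapper approach (not in the original text):

## coerce a merge() result back into a tibble
as_tibble(merge(diris2, dalt2, all.x = TRUE))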

Our last goal is reshaping, which works better if there is a unique ID per row. We start by refreshing our data set one more time. From there, we introduce n(), which is a special way to refer to the number of rows, much like .N is in data tables:

diris <- as_tibble(iris)              
diris <- mutate(diris, ID = 1:n() )
diris
# A tibble: 150 x 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    ID
         <dbl>       <dbl>        <dbl>       <dbl>   <chr> <int>
1          5.1         3.5          1.4         0.2  setosa     1
2          4.9         3.0          1.4         0.2  setosa     2
3          4.7         3.2          1.3         0.2  setosa     3
4          4.6         3.1          1.5         0.2  setosa     4
5          5.0         3.6          1.4         0.2  setosa     5
# ... with 145 more rows

To reshape the data long, we use the base R reshape() function. We list the repeatedly measured variables, that is, the variables we wish to stack in the long data set, in the varying argument. The timevar argument takes the name of the variable indicating to which stack a row belongs; it is so named because these stacks are often based on time. We specify names for the newly stacked variables with v.names and use idvar for the ID variable. Any variables not listed in varying are assumed to be time invariant and are simply repeated.

diris.long <- as_tibble(reshape(as.data.frame(diris),                
  varying = list(
    c("Sepal.Length", "Sepal.Width"),
    c("Petal.Length", "Petal.Width")),
  timevar = "Type",
  v.names = c("Sepal", "Petal"),
  idvar = c("ID"),
  direction = "long"))


diris.long
# A tibble: 300 x 5
  Species    ID  Type Sepal Petal
*   <chr> <int> <int> <dbl> <dbl>
1  setosa     1     1   5.1   1.4
2  setosa     2     1   4.9   1.4
3  setosa     3     1   4.7   1.3
4  setosa     4     1   4.6   1.5
5  setosa     5     1   5.0   1.4
# ... with 295 more rows

At this point, it might not be clear what Type is. Type tells us whether a row holds the lengths or the widths of sepals and petals. When our data stacks long from wide, it doubles in length; the Sepal and Petal columns must do double duty, and Type lets us know which is which. Take a look at this slice() of our data:

slice(diris.long, c(1, 151))              
# A tibble: 2 x 5
  Species    ID  Type Sepal Petal
    <chr> <int> <int> <dbl> <dbl>
1  setosa     1     1   5.1   1.4
2  setosa     1     2   3.5   0.2

Type varies from 1 to 2 depending on whether the row holds lengths or widths. However, those labels are not coded into our data, so we mutate() our tibble to convert Type into a factor with two levels, Length and Width. Notice that our slice makes much more sense after this cleanup.

diris.long <- mutate(diris.long, Type = factor(Type, levels = 1:2,                
                            labels = c("Length", "Width")))
diris.long
# A tibble: 300 x 5
  Species    ID   Type Sepal Petal
    <chr> <int> <fctr> <dbl> <dbl>
1  setosa     1 Length   5.1   1.4
2  setosa     2 Length   4.9   1.4
3  setosa     3 Length   4.7   1.3
4  setosa     4 Length   4.6   1.5
5  setosa     5 Length   5.0   1.4
# ... with 295 more rows


slice(diris.long, c(1, 151))
# A tibble: 2 x 5
  Species    ID   Type Sepal Petal
    <chr> <int> <fctr> <dbl> <dbl>
1  setosa     1 Length   5.1   1.4
2  setosa     1  Width   3.5   0.2

We took wide data and made it long. We now reverse the process by using reshape() again, this time with the direction formal set to "wide". Since we are going wide, we do not need varying, because those columns are created by the process. We do need idvar = "ID" so that rows 1 and 151 are combined into the same single, wide row (the ids argument shown in the code matters only when reshaping to long, so it is harmless here).

diris.wide2 <- as_tibble(reshape(as.data.frame(diris.long),              
                                 v.names = c("Sepal", "Petal"),
                                 timevar = "Type",
                                 idvar = "ID",
                                 ids = diris.long$ID,
                                 direction = "wide"))
diris.wide2
# A tibble: 150 x 6
  Species    ID Sepal.Length Petal.Length Sepal.Width Petal.Width
*   <chr> <int>        <dbl>        <dbl>       <dbl>       <dbl>
1  setosa     1          5.1          1.4         3.5         0.2
2  setosa     2          4.9          1.4         3.0         0.2
3  setosa     3          4.7          1.3         3.2         0.2
4  setosa     4          4.6          1.5         3.1         0.2
5  setosa     5          5.0          1.4         3.6         0.2
# ... with 145 more rows

At this point, you can see that reshape() is part of base R and is not a very dplyr way of doing things. However, it is perhaps more powerful and versatile, which is why we chose to showcase it rather than more dplyr-style approaches such as the tidyr package (a brief sketch of that style follows). The truth is, tidyr is only around two years old, and it likely will become the standard way of managing such conversions. Given the speed at which technology progresses, by the time you, our gentle reader, have this book in hand, all of us will be ready for an update to this chapter.
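
As a taste of that style, here is a minimal sketch (not from the original text) using tidyr's gather() and separate(), assuming the tidyr package is installed (it is not part of this chapter's checkpoint header). Note that the resulting layout differs from the reshape() result above: all four measurements are stacked into a single Value column rather than kept as separate Sepal and Petal columns, and the column names Part, Type, Variable, and Value are our own illustrative choices:

library(tidyr)
diris.tidy <- iris %>%
  as_tibble() %>%
  mutate(ID = 1:n()) %>%                           ## unique ID per row
  gather(key = Variable, value = Value,            ## stack the four measurements
         Sepal.Length:Petal.Width) %>%
  separate(Variable, into = c("Part", "Type"))     ## split "Sepal.Length" and so on
diris.tidy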

Summary

The dplyr and tibble packages make working with data easier. By separating what we do to data from the data itself, we can learn a comparatively limited number of manipulations that we can apply to any data. Behind the scenes, dplyr translates those function calls to modify the data. In the next chapter, you’ll see how this allows us to quickly access big data. Table 9-1 describes the key functions used in this chapter.

Table 9-1. Key Functions Described in This Chapter

Function: What It Does

as_tibble(): Converts a data object (for example, a data frame) to a tibble.

arrange(): Sorts by a named column in increasing order, or in descending order if desc() is wrapped around the column name.

distinct(): dplyr’s answer to unique(); returns only the rows that are not duplicates.

slice(): Selects the requested rows from a tibble by position.

filter(): Keeps rows that match specified logical conditions.

select(): Selects specific columns by name or position. Functionality can be expanded by using the helper functions one_of(), starts_with(), ends_with(), contains(), matches(), num_range(), and everything().

select_(): A more programming-friendly version of select() that takes character strings or R objects containing character strings.

rename(): Renames columns.

mutate(): Changes or mutates an existing tibble; adding, deleting, or modifying column data is a popular use.

summarise(): Creates a new tibble based on calculations from an original tibble.

%>%: Piping avoids function composition and allows cleaner code that chains multiple functions.

group_by(): Groups by one or more columns. Corresponds to the third formal (by =) of data.table.

left_join(): Joins are a SQL way of dealing with data. A left join keeps all of the first object’s rows.

right_join(): Keeps all of the second object’s rows.

inner_join(): Returns all columns of both objects and only the rows of the first that have matches in the second.

anti_join(): Returns the rows of the first object that have no match in the second, with only the first object’s columns. This is helpful when the second object is meant to extend information on the first, as it gives a fast way to see what is left over.

reshape(): Reshapes data wide or long, depending on whether direction = "wide" or direction = "long" is chosen. This is a base R function; keeping a tibble requires wrapping the result in as_tibble().
