Chapter 3
R Data, Part 2: More Complicated Structures

3.1 Introduction

R data is made up of vectors, but, as you already know, there are more complicated structures that consist of a group of vectors put together. In this chapter, we talk about the three major structures in R that data handlers need to know about. The most important of these is the data frame, in which, eventually, almost all of our data will be held. But in order to build up to the data frame, we first need to describe matrices and lists. A data frame is part matrix, part list, and in order to use data frames most efficiently, you need to be able to think of them in both ways. Furthermore, we do encounter matrices in the data cleaning world, since the table() command can produce something that is basically a matrix.

3.2 Matrices

A matrix (plural matrices) is essentially a vector, arrayed in a (two-dimensional) rectangle. As with a vector, every element of a matrix needs to be of the same type – logical, numeric, or character. Most of the matrices we will see will be numeric, but it is also possible to have a logical matrix, typically for subscripting, as we shall see. We start using the vector of 15 numbers, 101, 102, ..., 115, to produce a 5 × 3 (i.e., five rows by three columns) numeric matrix.

> (a <- matrix (101:115, nrow = 5, ncol = 3))
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]  102  107  112
[3,]  103  108  113
[4,]  104  109  114
[5,]  105  110  115

There are a couple of points to mention here. First, the matrix is filled column by column, with the first column being filled before the second one starts. We often intuitively expect the matrix to be filled row by row, because our data comes in rows, and we read English left-to-right, but this is not how R works. If you need to load your data into your matrix by rows, use the byrow = TRUE argument. This arises when you copy a matrix off of a web page, for example; we expect the entries to be read along the top line, but R stores them down the first column. (We come back to this example in Section 6.5.3.)
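As a small sketch of the difference, the same fifteen numbers can be loaded both ways. The object names by_col and by_row are ours, chosen only for illustration.

```r
# Fill the same 15 values column by column (the default) and row by row
by_col <- matrix(101:115, nrow = 5, ncol = 3)
by_row <- matrix(101:115, nrow = 5, ncol = 3, byrow = TRUE)

by_col[1, ]   # first row when filled by column: 101 106 111
by_row[1, ]   # first row when filled by row:    101 102 103
```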

Second, notice the row and column indicators such as [4,] and [,2]. In the following section, we will see how to use those numbers to extract elements from the matrix, or to assign new ones.

Third, the length() function can be used on a matrix, but it returns the total number of elements in the matrix. More often we want to know the number of rows and columns; that information is returned by the nrow() and ncol() functions, or jointly by the dim() function, which gives the numbers of rows and columns in that order:

> length (a)
[1] 15
> dim (a)
[1] 5 3

Fourth, in our example, we used the matrix() function to create the matrix from one long vector. An alternative is to create a matrix from a set of equal-length vectors. The cbind() function (“c” for column) combines a set of vectors into a matrix column by column, while rbind() performs the operation row by row. If the vectors are of unequal length, R will use the usual recycling rules (Section 2.1.4). Again, all of the elements of a matrix need to be of the same sort, so if any vector is of type character, the entire matrix will be character.
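The following sketch, using two short vectors that we made up for the purpose (id and score), shows cbind() and rbind() side by side, and what happens when one of the vectors is character.

```r
id    <- 1:3
score <- c(90, 85, 77)

by_columns <- cbind(id, score)   # 3 rows, 2 columns named id and score
by_rows    <- rbind(id, score)   # 2 rows named id and score, 3 columns
dim(by_columns)                  # 3 2
dim(by_rows)                     # 2 3

# One character vector makes the whole matrix character
mixed <- cbind(id, grade = c("A", "B", "C"))
typeof(mixed)                    # "character"
```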

As with vectors, arithmetic operations on matrices are performed element by element, so A^2 squares each element of A and A * B multiplies two matrices element by element. There are special symbols for matrix-specific operations: for example, A %*% B performs the usual kind of matrix multiplication, t(A) transposes a matrix, and solve() inverts a matrix. These operations do not tend to come up much in data cleaning, but often, we want to perform an operation on a matrix row by row or column by column. We come back to these row and column operations in Section 3.2.3.
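To make the distinction concrete, here is a brief sketch with a 2 × 2 matrix of our own choosing. The identity matrix B is used only so that the matrix product is easy to check by eye.

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- diag(2)                 # 2 x 2 identity matrix

A^2        # squares each element: rows are 1 9 and 4 16
A * A      # element-by-element product, same values as A^2
A %*% B    # matrix product with the identity returns the values of A
t(A)       # transpose: the columns of A become rows
```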

3.2.1 Extracting and Assigning

Since a matrix is just a vector, it is possible to use a subscript just like the one we used in Section 2.3.1 to pull out or replace an element. In the example above, a[6] would produce 106 (remember that we count by columns first), and a[6] <- 999 would replace that element with 999. However, it is much more common to identify elements of a matrix by two subscripts, one for the row and one for the column. These two subscripts are separated by a comma. In our example, a[1,2] would produce 106, and a[1,2] <- 999 would replace that value.

Of course, it is possible to ask for more than one entry at a time. In this example, we ask for a 2 × 2 sub-matrix from our original matrix a:

> a[c(4, 2), c(3, 1)]
     [,1] [,2]
[1,]  114  104
[2,]  112  102

The two rows we asked for, numbers 4 and 2, in that order, are returned, with the corresponding entries from columns 3 and 1, in that order. Just as when we use subscripts on a vector, we may use duplicate subscripts; a vector of negative numbers indicates that the corresponding entries should be removed.

If you leave one of the two subscripts empty, you are asking for an entire row or column. This command says “give me all the rows except for number 2, and all the columns.”

> a[-2,]
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]  103  108  113
[3,]  104  109  114
[4,]  105  110  115

Notice here that some rows have been renumbered. The row that had been number 5 in the original a is now the fourth row. This is not surprising, but it raises the question as to whether we might be able to keep track of rows that have been deleted, since that would help us audit changes we have made to the data. We will describe one way to do that using row names in Section 3.2.2.

In addition to using a numeric subscript, we can use a logical one. Logical subscripts for rows or columns act exactly as logical subscripts for vectors (see Section 2.3.1). Whether you use numeric or logical subscripts, subscripting a matrix with row and column indices will return a rectangular object. To extract values from, or assign new values to, a non-rectangular set of entries, you can use a matrix subscript, which we describe in Section 3.2.5.
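As a minimal sketch of a logical row subscript, we rebuild the matrix a from the start of the section and keep only the rows whose first-column entry is odd. The name keep is ours.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)

# One TRUE or FALSE per row of a
keep <- a[, 1] %% 2 == 1
keep          # TRUE FALSE TRUE FALSE TRUE
a[keep, ]     # rows 1, 3 and 5 of a, all columns
```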

Demoting a Matrix to a Vector

In order to turn a matrix into a vector, use the c() function on it. Just as c() creates vectors from individual elements (see, e.g., Section 2.1.1), it also creates vectors from matrices. In our example, c(a) will produce a vector of 15 numbers. The entries in that vector come from the first column, followed by the second column, and so on. In order to extract data row by row, transpose the matrix first, using the t() function in a command like c(t(a)).
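A short sketch, using a small matrix m of our own so that the ordering is easy to follow:

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

c(m)       # column by column: 1 2 3 4 5 6
c(t(m))    # row by row:       1 3 5 2 4 6
```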

Sometimes, though, R produces a vector from a matrix when we did not expect it. In this example, see what happens when we ask for, say, the second column of a. Remember that a has five rows and three columns.

> a[,2]
[1] 106 107 108 109 110

The result of this operation is not a matrix with five rows and one column; it is a vector of length 5. This reduction – or “demotion” – from a matrix to a vector follows a general rule in R under which dimensions of length 1 are usually removed (“dropped”) by default. This can cause trouble when you have a function that is expecting a matrix, perhaps because it plans to use dim() to find the number of rows. If you pass a single column of a matrix, that is, a vector, such a function would call dim() on a vector, which returns the value NULL. The way around this is to specify that this dropping should not take place, using the drop = FALSE argument, like this:

> a[,2,drop = FALSE]
     [,1]
[1,]  106
[2,]  107
[3,]  108
[4,]  109
[5,]  110

The result of that operation is a matrix with five rows and one column. When building functions that take subsets of matrices, it is often a good idea to use drop = FALSE to ensure that the resulting subset is itself a matrix and not a vector value.
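The sketch below illustrates the problem with a hypothetical helper, count_rows(), that we invented for this purpose; it simply relies on dim() and so misbehaves when handed a vector.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)

# Hypothetical helper that assumes its argument is a matrix
count_rows <- function(m) dim(m)[1]

count_rows(a[, 2])                 # NULL: a[, 2] is a plain vector
count_rows(a[, 2, drop = FALSE])   # 5: still a 5 x 1 matrix
```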

3.2.2 Row and Column Names

It is very convenient to have a matrix whose rows and columns have names. We can assign (and extract) row and column names with the dimnames() function, described in Section 3.3.2, and there are also functions named rownames() and colnames() to do the same job. (There is also an equivalent row.names() function, spelled with a dot, but, interestingly, there is no col.names() function.) Rows and columns are named automatically by the table() function (technically, a two-way table has class table, not matrix, but that distinction will not matter here). We start this extended example by constructing a table.

> yr <- rep (2015:2017, each = 5)
> market <- c(2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2)
> (tbl <- table (market, yr))
      yr
market 2015 2016 2017
     2    3    1    3
     3    2    4    2

Notice that the row-name entries ("2" and "3" under market) are not a column of the table; they are merely identifiers. This table has three columns, not four. Now we show the column names and demonstrate how they can be changed using the colnames() function.

> colnames (tbl)
[1] "2015" "2016" "2017"
> colnames (tbl) <- c("FY15", "FY16", "FY17")
> tbl
      yr
market FY15 FY16 FY17
     2    3    1    3
     3    2    4    2

Once row or column names have been assigned, we can refer to them by name as well as by number. This makes it possible to refer to a row or column in a consistent way, without having to know its location. Notice, though, that dimension names are characters, even if they look numeric. So, for example, tbl[2,] will produce the second row of the matrix tbl, while tbl["2",] will produce the row whose name is "2", regardless of what number that row has – and even if earlier rows have been removed.

> tbl[2,]
FY15 FY16 FY17
   2    4    2
> tbl["2",]
FY15 FY16 FY17
   3    1    3
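The same idea helps when rows have been removed. In this sketch, which uses a small named matrix m of our own rather than tbl, a row keeps its name even though its position changes.

```r
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("x", "y")))
smaller <- m[-1, ]   # drop the first row

smaller[2, ]      # now the row that was originally third: 3 6
smaller["r2", ]   # still the row named "r2": 2 5
```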

3.2.3 Applying a Function to Rows or Columns

There are lots and lots of operations on matrices supported by R, but many of them are mathematical and not useful in data cleaning. One operation that does come up, though, is running a function separately on each row or column of a matrix. A few of these are so common that they are built in. Specifically, there are functions named colSums() and rowSums(), which compute all of the column sums or row sums, and corresponding functions for the means, colMeans() and rowMeans(). Very often, though, you want to apply some custom function, such as the one that tells how many entries are NA or missing. The facility for doing this is the apply() function, to which you supply the matrix, the direction of travel (1 for across rows, 2 for down columns), and then the function that is to be applied to each row or column. This last can be a function built into R, a function you have written yourself, or even a function defined on the fly.

> a <- matrix (101:115, 5, 3)
# These four commands produce identical results
> rowSums (a)
> apply (a, 1, sum)
# Pass argument na.rm into the sum() function
> apply (a, 1, sum, na.rm = T)
> apply (a, 1, function (x) sum (x))
[1] 318 321 324 327 330
# User-written command selects the second-smallest entry
# in each column
> apply (a, 2, function (x) sort(x)[2])
[1] 102 107 112

If each call to the function returns a vector of the same length, apply() creates a matrix. In this example, we use the range() function to produce two values for each column of a.

> apply (a, 2, range)
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]  105  110  115

When apply() is used with a vector-valued function, such as range() in the last example, the output is arranged in columns, regardless of whether the operation was performed on the rows or the columns of the original matrix. This does not always match our intuition, particularly when the operation was performed on rows. In this example, we show the row-by-row ranges of the a matrix and then transpose using the t() function.

> apply(a, 1, range)
     [,1] [,2] [,3] [,4] [,5]
[1,]  101  102  103  104  105
[2,]  111  112  113  114  115
# Use t() to transpose that matrix
> t(apply(a, 1, range))
     [,1] [,2]
[1,]  101  111
[2,]  102  112
[3,]  103  113
[4,]  104  114
[5,]  105  115

A difficulty arises when different calls to the function produce vectors of different lengths. In that case, R cannot construct a matrix and has to return the results in the form of a list (we discuss lists in Section 3.3). This might arise, say, when looking for the locations of unusual values by column. In this example, we look for the locations in each column of values greater than 109 in the matrix a.

> apply (a, 2, function (x) which (x > 109))
[[1]]
integer(0)
[[2]]
[1] 5
[[3]]
[1] 1 2 3 4 5

This result tells us that the first column has no entries > 109, the second column's fifth entry is > 109, and all five entries in the third column are > 109. In general, you have to be aware that apply() might return a list if the function being applied can return vectors of different lengths.

3.2.4 Missing Values in Matrices

One very common use of apply() is to count the number of missing values in each row or column, since missing values always affect how we do data cleaning. This code shows how to count the number of NA values in each column. To show off some more of R's capabilities, we use the semicolon, which allows multiple commands on one line, and the multiple assignment operation, which lets us assign several things at once.

> a <- matrix (101:115, 5, 3); a[5, 3] <- a[3, 1] <- NA; a
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]  102  107  112
[3,]   NA  108  113
[4,]  104  109  114
[5,]  105  110   NA
> apply (a, 2, function (x) sum (is.na (x)))
[1] 1 0 1

From the last command, we see there is one missing value in each of columns 1 and 3.

We saw how to use which() to identify missing values in a vector back in Section 2.4, and the same command can also identify missing values in a matrix. By default, which(is.na(vec)) will return the indices of vec with missing values as if vec had been stretched out into a long vector (column by column, as always). However, the arr.ind = TRUE argument will supply the row and column indices of the items selected by which(). This is extremely useful in tracking down a small number of missing values. In this example, we use which() to identify the missing entries in a.

> which (is.na (a))
[1]  3 15
> which (is.na (a), arr.ind = TRUE)
     row col
[1,]   3   1
[2,]   5   3

Here, which() returns a matrix with two named columns and two unnamed rows. Of course, this approach is not limited to finding NAs. It can also be used to find negative values, or anything else that is unexpected and needs to be cleaned.
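For instance, a sketch of the same technique applied to negative values, using a made-up matrix of amounts:

```r
amounts <- matrix(c(5, -1, 7, 3, 2, -4), nrow = 3)

# Row and column of every negative entry
bad <- which(amounts < 0, arr.ind = TRUE)
bad   # two rows: (row 2, col 1) and (row 3, col 2)
```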

3.2.5 Using a Matrix Subscript

In the last example, we saw how which() with arr.ind = TRUE returns a matrix giving a vector of rows and a vector of columns that, together, identify the cells that had NA values. One underused feature of R is that we can use a matrix subscript, such as the one returned by which() with arr.ind = TRUE to extract from, or assign to, another matrix. We can also use the vector returned by the ordinary use of which(), but the matrix approach sometimes makes it much easier to extract the necessary rows or columns. In this example, we construct a matrix with five columns of data and a sixth column named "Use". This final column tells us which of the data columns should be extracted for each of the rows.

> b <- matrix (1:20, nrow = 4, byrow = TRUE)
> b <- cbind (b, c(3, 2, 0, 5))
> colnames (b) <- c("P1", "P2", "P3", "P4", "P5", "Use")
> b
     P1 P2 P3 P4 P5 Use
[1,]  1  2  3  4  5   3
[2,]  6  7  8  9 10   2
[3,] 11 12 13 14 15   0
[4,] 16 17 18 19 20   5

Since the first row's value of Use is 3, we want to extract the third element of that row; since the second row's value of Use is 2, we want the second element of that row; and so on. Without the ability to use a matrix subscript, we might be forced to loop through the rows of b, but in R we can extract all these items in one call. Our matrix subscript has two columns, one giving the rows from which we are extracting (in this case, all the rows of b in order) and another giving the column from which to extract (in this case, the values in the Use column of b). Here we show how we can construct this matrix subscript and use it to extract the relevant entries of b.

> (subby <- cbind (1:nrow(b), b[,"Use"]))
     [,1] [,2]
[1,]    1    3
[2,]    2    2
[3,]    3    0
[4,]    4    5
> b[subby]
[1]  3  7 20

Notice that in this example the value of Use in the third row was zero – and therefore no value was produced for that row of the matrix subscript (see “zero subscripts” in Section 2.3.1). Negative values cannot be used in a matrix subscript.
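A matrix subscript works on the left-hand side of an assignment as well. In this sketch, we locate the missing cells of a matrix with which() and then overwrite exactly those cells; the replacement value 0 is an arbitrary choice made for illustration, not a recommendation for real data.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)
a[3, 1] <- NA
a[5, 3] <- NA

where <- which(is.na(a), arr.ind = TRUE)   # one row per missing cell
a[where] <- 0L                             # assign to exactly those cells
sum(is.na(a))                              # 0
```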

As a real-life example of where this might occur, we were recently given a matrix of customer payments. The first 96 columns contained monthly payment amounts. The last column gave the number of the month with the last payment in it. Our task was to extract the payment amount whose month appeared in that final column. So if, in the first row, that column had the value 15, we would have extracted the amount from the 15th column of the payment matrix; and so on for the second and subsequent rows.

Two points here. First, notice that in our earlier example we extracted using b[subby], with no additional commas: the matrix subscript defines both rows and columns. Second, remember to use cbind() to construct the subscript argument (our subby above). Make sure that the matrix subscript really is a matrix, and not two separate vectors, or you will extract rows and columns separately. Matrix subscripting works with names, too. If our matrix b had had both row and column names, we could have used a character matrix in exactly the same way as the numeric subby. In that case, b would need both row and column names so that both columns of the subscript argument could be character. We cannot have one vector be numeric and the other character, because we need to combine them into a matrix, and all the entries in a matrix have to be of the same type. It is also possible to have a logical matrix act as a subscript – but the results are surprising and we do not recommend it.
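Here is a minimal sketch of a character matrix subscript. It uses a small matrix m with row and column names that we made up for the purpose, and it also shows what happens when the two vectors are passed separately rather than bound together.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))

# Each row of the subscript names one cell: (row name, column name)
picks <- cbind(c("r1", "r2"), c("z", "x"))
m[picks]                        # 5 2

# Without cbind(), the two vectors select whole rows and columns
m[c("r1", "r2"), c("z", "x")]   # a 2 x 2 matrix
```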

3.2.6 Sparse Matrices

A sparse matrix is one whose entries are largely zero. For example, in a language processing application we might form a matrix with words in the rows and documents in the columns. Then a particular cell, say, the (i, j)th one, would have a zero if word i did not appear in document j, and since, in many examples, most words do not appear in most documents, that matrix might have a high proportion of zeros. There are a number of schemes for representing sparse matrices. The recommended Matrix package (Bates and Maechler, 2016) implements many of these. We encounter sparse matrices in our work, but rarely in the context of data cleaning, so we will not discuss them in this book.

3.2.7 Three- and Higher-Way Arrays

Three-way (and higher-way) matrices are called arrays in R. An array looks like a matrix in that all of its elements need to be of the same type, but a three-way array requires three subscripts, a four-way array requires four subscripts, and so on. The only time we seem to have encountered such a thing in data cleaning is when constructing a three- or higher-way table(). In this example, we show a three-way table made from three vectors each of length 8, and then we extract the value 3 from the second row of the first column of the first “panel.”

> who <- rep (c("George", "Sally"), c(2, 6))
> when <- rep (c("AM", "PM"), 4)
> worked <- c(T, T, F, T, F, T, F, T)
> (sched <- table (who, when, worked))
, , worked = FALSE
        when
who      AM PM
  George  0  0
  Sally   3  0
, , worked = TRUE
        when
who      AM PM
  George  1  1
  Sally   0  3
> sched[2,1,1]
[1] 3

Many commands that work on matrices, like apply() and prop.table(), operate on arrays as well. You can also use c() on an array to produce a vector – in this case, the first column of the first panel is followed by the second column of the first panel, and so on. The aperm() function plays the role of t() for higher-way arrays.
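As a sketch of these functions at work, we build a numeric three-way array arr of our own (not the sched table above) so that the results are easy to verify.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 panels

arr[2, 1, 1]                  # second row, first column, first panel: 2
apply(arr, 3, sum)            # one total per panel: 21 57 93 129
dim(aperm(arr, c(3, 1, 2)))   # panels moved to the first dimension: 4 2 3
c(arr)[1:6]                   # first panel, column by column: 1 2 3 4 5 6
```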

3.3 Lists

A list is the most general type of R object. A list is a collection of things that might be of different types or sizes; a list might include a numeric matrix, a character vector, a function, another list, or any other R object. Almost every modeling function in R returns a list, so it is important to understand lists when using R for modeling, but we also need to describe lists because one special sort of list is the data frame, which we describe in the following section.

Normally, we will encounter lists as return values from functions, but we can create a list with the list() function, like this:

> (mylist <- list (alpha = 1:3, b = "yes", funk = log, 45))
$alpha
[1] 1 2 3
$b
[1] "yes"
$funk
function (x, base = exp(1))  .Primitive("log")
[[4]]
[1] 45

Lists also appear as the output from the split() function, which divides a vector into (possibly unequal-length) pieces according to the value of another vector. We use this frequently in data cleaning. For example, we might divide a vector of people's ages according to their gender. In this simple example, we show how split() produces a list; later, in Section 3.5.1, we show how that list can be put to use.

> ages <- c(26, 45, 33, 61, 22, 71, 43)
> gender <- c("F", "M", "F", "M", "M", "F", "F")
> split (ages, gender)
$F
[1] 26 33 71 43
$M
[1] 45 61 22
> split (ages, ages > 60)
$`FALSE`
[1] 26 45 33 22 43
$`TRUE`
[1] 61 71

It is worth noting that if the second argument – gender in this case – has missing values, those values will be dropped from the output of split(). Notice also that in the second example the names of the list elements have been surrounded by backward quotes. This is for display, because FALSE and TRUE are not valid names here, but the character strings "FALSE" and "TRUE" are.

The length of a list, as found using length(), is the number of elements, regardless of how big each individual element is. The lengths() function returns a vector of lengths, one for each element on the list. In our example, length(mylist) returns the value 4, whereas lengths(mylist) returns a vector with four lengths in it (including the length of 1 that is returned for the function funk()). The str() command we described in Section 2.2.2 works on lists as well. The resulting value printed to the screen gives a description of every element on the list – one line for atomic elements and multiple lines for lists within lists. This is one way to help understand the structure of your data quickly.
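These three functions can be compared directly on the mylist object, which we re-create here so that the sketch runs on its own.

```r
mylist <- list(alpha = 1:3, b = "yes", funk = log, 45)

length(mylist)    # 4: the number of elements
lengths(mylist)   # 3 1 1 1, with names "alpha", "b", "funk" and ""
str(mylist)       # a short description of each element
```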

3.3.1 Extracting and Assigning

In the first example in this section, the first three elements were given names and the fourth was not. That output hints at how to extract items from a list. You can use double square brackets – so mylist[[4]] will return 45 – or, if an element has a name, you can use the dollar sign and the name – so mylist$b will return "yes", and split(ages, ages > 60)$"TRUE" will return the vector of ages > 60. Single square brackets can be used, with a numeric, logical, or name subscript, but there's a catch – single square brackets return a list, not the contents of the list. This is useful if you want only a couple of pieces of a list. For example, mylist[1:2] will return a list with the first two elements of mylist, and mylist[1] will return a list with the first element of mylist – not as a vector but as a list. A logical subscript will also work here: mylist[c(T, T, F, F)] will return the same list as mylist[c(1,2)] or mylist[c("alpha", "b")]. Most of the lists we run into will have names, and we usually extract elements one at a time with the dollar sign, but the distinction between single and double brackets is still important. Single brackets create lists; double brackets extract contents. And what happens in our example if you ask for mylist[[2:3]] or mylist[[c(F, T, T, F)]]? Unsurprisingly, these commands generate errors.

When you request a list item using single brackets and a name that is not present on the list, R returns a list with one NULL element; with double brackets or the dollar sign, it returns the NULL itself. This is consistent with the rule that says single brackets produce lists, while double brackets and dollar signs extract contents. Using double brackets with a numeric subscript greater than the number of elements in the list, such as mylist[[11]] in our example, produces an error rather than a NULL.
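The rule can be checked with a few short commands. This sketch re-creates mylist and uses the made-up name absent for an item that is not on the list.

```r
mylist <- list(alpha = 1:3, b = "yes", funk = log, 45)

class(mylist["b"])         # "list": single brackets keep the list wrapper
class(mylist[["b"]])       # "character": double brackets extract contents
mylist[["absent"]]         # NULL
mylist$absent              # NULL
length(mylist["absent"])   # 1: a list holding a single NULL
```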

Of course, to extract elements of a list by name, we need to know the names. We can determine the names a list has using the names() function. If the list has no names at all, this function will return NULL; if some elements have names, the names() function will return an empty string for those elements with no names. This example shows the names of the mylist list.

> names(mylist)
[1] "alpha" "b"     "funk"  ""

We can also use the names() function on the left-hand side of an assignment to change the names of the elements on a list. For example, the command names(mylist)[4] <- "RPM" would change the name of the fourth element of mylist to "RPM".

Unlike when you use single or double square brackets, when using the dollar sign to extract an item, you don't need its full name. (Technically, you can pass the exact argument into double square brackets to control this behavior, but we don't.) You only need enough to identify the item unambiguously. In this example, mylist$a would be enough to produce the same numeric vector returned by mylist$alpha, but if there were two items on the list, say alpha and algorithm, typing mylist$a would produce NULL. You would need to specify at least mylist$alp in order to be unambiguous. It's often convenient to use these abbreviated names, but that approach is best suited for quick work at the command line. We recommend using full names in functions and scripts, to avoid confusion or even an error if new items get added to the list later.
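A brief sketch of this partial matching, using a two-item list lst of our own to which we then add an item named algorithm:

```r
lst <- list(alpha = 1:3, b = "yes")
lst$a        # 1 2 3: "a" identifies alpha unambiguously
lst[["a"]]   # NULL: double brackets need the full name by default

lst$algorithm <- "sort"   # now two names start with "a"
lst$a        # NULL: the abbreviation is ambiguous
lst$alp      # 1 2 3
```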

To replace an item on a list, just re-assign it. If you want to add a new item to a list, just assign the new item to a new name. Here, naturally, you need to use the item's full name. If your mylist has an item called alpha and you use the command mylist$alp <- 3, you will create a new item named alp and leave the old one, alpha, unchanged. To delete an item from a list, you can use subscripting as we did for a vector. For example, either mylist <- mylist[c(1, 2, 4)] or mylist <- mylist[-3] will drop the third entry. But another, possibly easier, way is to assign NULL to the name or number. In this example, mylist$funk <- NULL or mylist[[3]] <- NULL would remove the item named funk from mylist. This behavior means that it is difficult to intentionally store a NULL value in a list, but this does not seem to be much of a limitation.
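The three operations look like this in practice. The list lst is a cut-down copy of mylist, built here so that the sketch is self-contained.

```r
lst <- list(alpha = 1:3, b = "yes", funk = log)

lst$b <- "no"       # replace an existing item
lst$alp <- 3        # adds a NEW item named alp; alpha is untouched
lst$funk <- NULL    # delete the item named funk
names(lst)          # "alpha" "b" "alp"
```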

Another useful function for operating on lists is unlist(), which, as its name suggests, tries to turn your entire list into a vector. When the list contains unusual objects, such as the function element of the mylist list in our example, the results of unlist() can be difficult to predict. This example shows the effect of unlist() operating on a list of regular vectors, which we create by excluding the function element of mylist.

> unlist (mylist[-3])
alpha1 alpha2 alpha3      b
   "1"    "2"    "3"  "yes"   "45"

Here we can see that R has produced names for each of the elements from the vector mylist$a, and in a well-behaved list these names can be useful.

3.3.2 Lists in Practice

Generally, we do not need lists much when data cleaning. As we have noted, lists arise as the output of many R functions – a function in R cannot return more than one result, so if a function computes two things of different sizes, it will need to return a list. For example, the rle() function we described in Chapter 2 returns the lengths of runs and, separately, the value associated with each run. It is then your job to extract the pieces from the list. The pieces will almost always be named, so they will be able to be extracted using the $ operator. (In the case of rle(), the pieces are called lengths and values.) Lists also arise as the output from the split() command. Normally, after calling split() we would then call an apply()-type function on each element of the resulting list. We describe this in Section 3.5.1. And, of course, the apply functions can themselves produce lists, as we saw in Section 3.2.3.
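For example, this sketch calls rle() on a short character vector of our own and pulls the two pieces out of the returned list by name.

```r
runs <- rle(c("a", "a", "b", "b", "b", "a"))

runs$lengths   # 2 3 1
runs$values    # "a" "b" "a"
```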

Another common context in which lists arise concerns the dimension names of a matrix. The dimnames() function returns NULL when applied to a matrix without row or column names. Otherwise, it returns a list with two elements: the vector of row names and the vector of column names. In general, this return has to be a list, rather than a matrix, because the number of rows and number of columns will be different. Either of the two entries may be NULL, because a matrix may have row names without column names, or vice versa. The dimnames() function may be used to assign, as well as extract, dimension names. These examples continue the earlier ones using the two-way table tbl and the three-way table sched from Sections 3.2.2 and 3.2.7, respectively, and show dimnames() at work. Notice that dimnames() produces a list with three vectors of names from the three-way table.

> dimnames(tbl)
$market
[1] "2" "3"
$yr
[1] "FY15" "FY16" "FY17"
> dimnames (sched)
$who
[1] "George" "Sally"
$when
[1] "AM" "PM"
$worked
[1] "FALSE" "TRUE"

As we have seen before, dimension names are always characters. So in the three-way array example, the names for the worked dimension are the character strings "FALSE" and "TRUE", not the logical values. In the following example, we show how we can modify an element of the dimnames() list.

> dimnames(tbl)[[2]][1] <- "Archive"
> tbl
      yr
market Archive FY16 FY17
     2       3    1    3
     3       2    4    2

In the dimnames() assignment we change the first column name. Here, dimnames(tbl) produces a list, the [[2]] part extracts the vector of column names, and the [1] part accesses the element we want to change. Of course, we could have achieved the same result with dimnames(tbl)$yr[1] <- "Archive".

Another list that arises from R itself is the list of session options, returned from a call to the options() function. This list includes dozens of elements describing things such as the number of digits to be displayed, the current choice of editor, the choices going into scientific notation, and many more. Calling names(options()) will produce a vector of the names of the current options. You can examine a particular option, once you know its name, with a command like options()$digits. To set an option, pass its name and value into the options() function, with a command like options(digits = 9).
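One pattern we find convenient, sketched here, is to save the list that options() returns when setting a value, so that the earlier setting can be restored afterwards. The getOption() function is another way to look up a single option by name.

```r
old <- options(digits = 4)   # options() returns the previous values
getOption("digits")          # 4
options(old)                 # restore the earlier setting
```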

3.4 Data Frames

Now that we understand how matrices and lists work, we can focus on the most important object of all, the data frame. A data frame (written with a dot in R, as data.frame) is a list of vectors, all of which are the same length, so that they can be arrayed in a matrix-like rectangle. (Technically, the elements of a data frame can also be matrices, as long as they are of the right size, but let us avoid that complication. For our purposes, the elements of a data frame will be ordinary vectors.) The vectors in the list serve as the columns in the rectangle. A data frame looks like a matrix, with the critical difference that the different columns can be of different types. One column can be numeric, another character, a third factor, a fourth logical, and so on. Each vector has elements of one type, as usual, but the data frame allows us to store the sort of data we get in real life. So a data frame about people might contain their names (which would probably be character), their ages (often numeric, but possibly factor), their gender (possibly character, possibly factor), their eligibility for a particular program (which might be logical), and so on. In this example, we use the data.frame() function to construct a data frame. In data cleaning, our data frames are very often produced by functions that read in data from the disk, a database, or some other source. We describe methods of acquiring data in Chapter 6, but for the moment we will use this simple example.

> (mydf <- data.frame (
        Who = letters[1:5], Cost = c(3, 2, 11, 4, 0),
        Paid = c(F, T, T, T, F), stringsAsFactors = FALSE))
  Who Cost  Paid
1   a    3 FALSE
2   b    2  TRUE
3   c   11  TRUE
4   d    4  TRUE
5   e    0 FALSE 

There are a few points worth noting here. First, R has provided row names (visible as 1 through 5 on the left) to the data frame automatically. A matrix need not have row names or column names, and a list need not have names, but a data frame must always have both row names and column names. R will create them if they are not explicitly assigned, as it did here. The data.frame() function ensures (unless you specify otherwise) that column names are valid and not duplicated. You may specify row names explicitly, using the row.names argument, in which case they must be not duplicated and not missing. Column names can be examined and set using the names() command, as with a list, or with the colnames() or dimnames() commands, as with a matrix. Generally, you will probably find the names() or colnames() approaches to be easier, since they involve vectors and not a list. For row names, the rownames() and row.names() functions allow the row names of a data frame to be examined or assigned. Section 3.2.2 describes how row names can be useful when handling matrices, and those points are true for data frames as well.

A second point is that, in versions of R before 4.0.0, the data.frame() function by default turns character vectors into factors. (From version 4.0.0 on, the default is stringsAsFactors = FALSE.) Factors are discussed in Section 1.6, and, as we mention there, they are useful, even required, in some modeling contexts. They are rarely what we want in data cleaning, however. The best way to keep factors out of data frames is to not allow them in the first place; we accomplished this in the example above by passing the stringsAsFactors = FALSE argument to the data.frame() function. Without that argument, the Who column of mydf would have been a factor variable with five levels. Another way to prevent factors from being created is to set the stringsAsFactors global option to be FALSE, using the options(stringsAsFactors = FALSE) command. However, we cannot rely on all of the users of our code having that setting in place, so we always try to remember to turn this option off explicitly when we call data.frame(). This issue will arise again when we talk about combining data frames later in this section, and about reading data in from outside sources in Chapter 6.

There are several functions that help you examine your data frame. Of course, in many cases, it will be too big to simply print out and examine. The head() and tail() functions display only the first or last six rows of a data frame, by default, but this can be changed by the second argument, named n. So head(mydf, n = 10) will show the first 10 rows, tail(mydf, 12) will show the last 12, and, using a negative argument, head(mydf, -120) will show all but the last 120. The str() function prints a compact representation of a data frame that includes the type of each column, as well as the first few entries. Other useful functions include dim(), to report the numbers of rows and columns, and summary(), which gives a brief description of each column.
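
The following sketch tries these inspection functions on a data frame large enough for the n argument to matter. The data frame big and its columns are hypothetical, built only for this illustration.

```r
# Hypothetical 30-row data frame, built only for illustration
big <- data.frame(id = 1:30, score = seq(0.5, 15, by = 0.5),
                  grp = rep(c("x", "y", "z"), 10),
                  stringsAsFactors = FALSE)
dim(big)                 # number of rows, then number of columns
h <- head(big, n = 3)    # first three rows
t12 <- tail(big, 12)     # last twelve rows
most <- head(big, -25)   # all but the last 25 rows
str(big)                 # type and first few entries of each column
summary(big)             # brief description of each column
```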

3.4.1 Missing Values in Data Frames

Because the columns of a data frame can be of different classes, missing values can be of different classes, too. A missing value in a numeric column will be a numeric missing value, while in a character column, the missing value will be of the character type. We discussed missing values at some length in Section 2.4. It is always good to know where missing values come from and why they exist – often investigating the causes of “missingness” will lead to discoveries about the data. The is.na() function operates on a data frame and returns a logical-valued matrix showing which elements (if any) are missing; the anyNA() function operates on data frames as well. One approach to handling missing data is to simply omit any observations (rows) of the data frame in which one or more elements is missing. R's na.omit() function does exactly that. (For this purpose, NaN is missing but Inf and -Inf are not.) This is the default behavior for a number of R's modeling functions, but in general we do not recommend deleting records with missing values until the reason for the values being missing is understood.
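
Here is a small sketch of these functions at work. The data frame pets is hypothetical; it holds a character NA, a NaN, and an Inf so that the treatment of each can be seen.

```r
# Hypothetical data with missing values of different types
pets <- data.frame(name = c("Rex", NA, "Tom", "Ann"),
                   weight = c(12.5, 4.1, NaN, Inf),
                   stringsAsFactors = FALSE)
is.na(pets)               # logical matrix, one entry per cell
anyNA(pets)               # TRUE if anything at all is missing
colSums(is.na(pets))      # number of missing values in each column
complete <- na.omit(pets) # drops row 2 (NA) and row 3 (NaN); keeps Inf
```

Counting the missing values by column, as in the colSums() line, is often a more useful first step than omitting rows, since it shows where the missingness is concentrated.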

3.4.2 Extracting and Assigning in Data Frames

Since a data frame is matrix-like and also list-like, we can use both matrix-style and list-style subsetting operations on a data frame. One difference appears when we select a single row. With a matrix, selecting a single row returns a vector, unless you specify drop = FALSE (see Section 3.2.1). However, with a data frame, even a single row is returned as a data frame with one row because in general even one row of a data frame will contain entries of different types.

With that one difference, we extract rows from a data frame just as we extract rows from a matrix – by number, including negatives; using a logical vector; or by names (as we mentioned, the rows of a data frame, and the columns, always have names). We can extract columns using either list-style access or matrix-style access. List-style access uses single brackets to produce sub-lists, which in this case means that using single brackets will produce a data frame. Double bracket subscripts, or the dollar sign, will produce a vector. The difference is that double brackets require an exact name, unless exact = FALSE is set, whereas the dollar sign only requires enough of the name to be unambiguous. If there are two columns with similar names, and your request is not sufficient to determine a unique answer, nothing at all (i.e., NULL) is returned. Therefore, it makes sense, particularly when writing functions for other people, to use full names for columns.

Matrix-style access uses column names or numbers; just as with a matrix, selecting only one column will produce a vector unless you explicitly set drop = FALSE. This example shows a number of ways of extracting columns from data frames. We start by showing list-style access using single brackets.

> mydf[2]          # Numeric subscript
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf["Cost"]     # Subscript by name
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf[c(F, T, F)] # Logical subscript
  Cost
1    3
2    2
3   11
4    4
5    0

Each of those operations produced a data frame with five rows and one column (which is, of course, a list). In the following examples, we use double brackets together with a numeric or character subscript and produce a vector. As with a list, a logical subscript with more than one TRUE inside a pair of double brackets will produce an error. (You might have expected the same result with a numeric subscript; in fact, a numeric subscript of length 2 can be used; it acts as a one-row matrix subscript.) When using a character index inside double brackets, you can specify exact = FALSE to permit the same sort of matching that we get with the dollar sign.

> mydf[[2]]
[1]  3  2 11  4  0
> mydf[["Cost"]]
[1]  3  2 11  4  0
> mydf[["C"]]
NULL
> mydf[["C", exact = FALSE]]
[1]  3  2 11  4  0

Notice that the result in each of these cases is a vector. In the following examples, we show the use of the dollar sign to extract a column. In this case, as we mentioned, we need to specify only enough of the name to be unambiguous.

> mydf$W                     # Extracts the "Who" column
[1] "a" "b" "c" "d" "e"

The dollar sign can only refer to one column at a time. To extract more than one column, we can use single brackets as above, or matrix-style access in which we explicitly specify rows and columns. As with a matrix, leaving one of those two indices blank will select all of them, and R will produce vectors from single columns unless the drop = FALSE argument is specified. This example shows extraction using matrix-style syntax.

> mydf[1:2, c("Cost", "Paid")]
  Cost  Paid
1    3 FALSE
2    2  TRUE
> mydf[,"Who", drop = FALSE] # Example of drop = FALSE
  Who
1   a
2   b
3   c
4   d
5   e

Removing a column from a data frame is exactly like removing an element from a list and is accomplished in the same way – by assigning NULL to the column reference. Running the command mydf$Paid <- NULL will remove the column Paid from the data frame using list-style notation, and mydf[,"Paid"] <- NULL performs the same task using matrix-style notation.

To replace subsets of elements you can once again use the matrix-style or list-style syntax. So, for example, mydf[c(1,3), "b"] <- "A" and mydf$b[c(1,3)] <- "A" both replace the first and third entries of the b column of mydf with "A". (Of course, if that column had been numeric or logical before, this operation will force R to convert it to character.)
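
The sketch below tries removal and replacement on a copy of mydf, which we rebuild here so that the commands run on their own. The copy named work is our own device for leaving mydf untouched, and we use the Who and Cost columns for the replacements.

```r
# A copy of the text's mydf, rebuilt so this sketch is self-contained
mydf <- data.frame(Who = letters[1:5], Cost = c(3, 2, 11, 4, 0),
                   Paid = c(FALSE, TRUE, TRUE, TRUE, FALSE),
                   stringsAsFactors = FALSE)
work <- mydf                   # work on a copy
work$Paid <- NULL              # list-style removal of a column
work[c(1, 3), "Who"] <- "A"    # matrix-style replacement
work$Who[c(2, 4)] <- "B"       # list-style replacement
work[5, "Cost"] <- "none"      # a character value in a numeric column...
class(work$Cost)               # ...converts the whole column to character
```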

3.4.3 Extracting Things That Aren't There

The critical difference between a matrix and a data frame is that the columns of a data frame can be vectors of different types. Another difference manifests itself when you try to access an element that isn't there, maybe because you asked for a row or column number that was too big or a row or column name that didn't exist. In a vector, attempts to extract an item beyond the end of the vector will produce NAs. But if you ask a matrix for a row or column that doesn't exist, R will produce an error. This example shows the difference:

> (mat <- matrix (1001:1006, 2, 3)) # Matrix with six items
     [,1] [,2] [,3]
[1,] 1001 1003 1005
[2,] 1002 1004 1006
# Ask for a non-existent entry, using vector-like indexing
> mat[8]
[1] NA
> mat[,4] # Ask for a non-existent column
Error in mat[, 4] : subscript out of bounds

In general, we prefer the error. A function that sees an NA will often try to carry on, whereas an error will force you to stop and figure out what has happened.

The situation with data frames (and lists) is different. Supplying subscripts for which there are no rows produces one row with all NAs for every unusable subscript. The entries in these rows will have the same classes (numeric, character, etc.) that the data frame had. This arises when some rows have been deleted, and then you, or a program, try to access one of the deleted rows by name. In this example, we show how asking for rows that don't exist can cause trouble.

> mydf2 <- data.frame (alpha = 1:5, b = c(T, T, F, T, F),
  NX = c("NA", "NB", "NC", "ND", "NE"),
  stringsAsFactors = FALSE,
  row.names = c("Red", "Blue", "White", "Reddish", "Black"))
> mydf2
        alpha     b NX
Red         1  TRUE NA
Blue        2  TRUE NB
White       3 FALSE NC
Reddish     4  TRUE ND
Black       5 FALSE NE
# Let's ask for rows that don't exist.
> mydf2[c(9, 4, 7, 1),]
        alpha    b   NX
NA         NA   NA <NA>
Reddish     4 TRUE   ND
NA.1       NA   NA <NA>
Red         1 TRUE   NA

In this example, we see that the resulting data frame has four rows, two of which contain only NA values. The character column's NAs are represented with angle brackets, as <NA>, to make it easy to distinguish a missing value from the legitimate character string NA in row 1. The first two columns' NAs are numeric and logical. As elsewhere (e.g., Section 2.4.3), logical subscripts will recycle – which is rarely what you want – and usually produce unwanted results when they contain NAs.

In the following example, we show one more operation that can produce rows with NAs in them. Since our data frame has rows named both "Red" and "Reddish", asking for a row named "Re" is ambiguous and produces a row of NAs. (In contrast, the row names of a matrix may not be abbreviated; supplying a name that is not an exact row name produces an error.)

> mydf2["Re",]              # Not enough to be unambiguous
   alpha  b   NX
NA    NA NA <NA>

A much more frequent problem happens when accessing columns. If you access a non-existent column in the matrix or list styles, or use an abbreviation, R produces an error. In our example, mydf2[,"gamma"] (referring to a non-existent column), mydf2[,"N"] (referring to an abbreviated name, with a comma), and mydf2["N"] (without the comma) all produce errors. In contrast, when using the double-bracket notation, NULL is returned when a name is abbreviated or non-existent. (As we mentioned with lists, there is in fact an exact argument to the double brackets that we do not use.) Just like the NA returned when accessing a non-existent element of a vector, this NULL has the potential to be more trouble than an error would have been. The use of the dollar sign, as we mentioned, permits the use of unambiguously abbreviated names but produces a NULL when used with a non-existent name. In this example, we show how asking for an abbreviated name can produce an unexpected result.

# Ask for the first column by abbreviated name.
> mydf2$alph
[1] 1 2 3 4 5               # No problem
# Create another column with a similar name
> mydf2$alpha.plus.1 <- mydf2$alpha + 1
> mydf2$alph
NULL
> mydf2$alph + 1            # No error, but..
numeric(0)                  # probably unexpected

The second-to-last operation produced NULL because alph was not sufficient to differentiate between the columns alpha and alpha.plus.1. If a row or column name matches exactly, R will extract it properly (so if you have alpha and alpha.plus.1 and ask for alpha, there is no ambiguity). It is a good practice to use complete names, unless there is a strong reason not to.
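
One way to get an error instead of a silent NULL is to check for the name before extracting. The function get_col() below is a hypothetical helper of our own, not part of R, and the data frame demo is built only for this sketch.

```r
# get_col() is a hypothetical helper, not a base R function
get_col <- function(df, name) {
  if (!name %in% names(df))
    stop("no column named '", name, "'")
  df[[name]]              # double brackets match names exactly
}
demo <- data.frame(alpha = 1:3, alpha.plus.1 = 2:4)
get_col(demo, "alpha")                        # the full name works
res <- tryCatch(get_col(demo, "alph"),        # abbreviation: an error,
                error = function(e) "error")  # not a silent NULL
```

The tryCatch() call is there only so that the sketch can run to completion; in ordinary use we would let the error stop the computation.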

3.5 Operating on Lists and Data Frames

Very often we will want to operate on each of the elements of a list or each of the rows or columns of a data frame. For example, we might want to know how many missing values are in each column. In Section 3.2.3, we saw apply() used on a matrix, but apply() does not work on a list (since a list doesn't have dimensions). The apply() function does work on data frames, but it first converts the data frame into a matrix. This conversion will only be sensible when all the columns are of the same type, as with the all-numeric data frame described in Section 3.5.2. In other cases, the results can be quite unexpected. In this example, we operate on the rows of a data frame, using apply(), to show how this can go wrong.

> (dd <- data.frame (a = c(TRUE, FALSE), b = c(1, 123),
      cc = c("a", "b"), stringsAsFactors = FALSE))
      a   b cc
1  TRUE   1  a
2 FALSE 123  b
> apply (dd, 1, function (x) x)
   [,1]    [,2]
a  " TRUE" "FALSE"
b  "  1"   "123"
cc "a"     "b"  

Here the function passed to apply() does nothing but return whatever is passed to it. Since data frame dd has a character column, apply() converted the whole data frame into a character matrix. It does this in part by calling the format() function column by column, producing the results seen here: a value " TRUE" with a leading space in row 1 (formatted to be the same length as the string "FALSE"), and "  1" with two leading spaces in row 2 (formatted to be the same length as the string "123"). Analogous conversions happen whenever a data frame with at least one column that is neither logical nor numeric is passed to apply(), used in other matrix functions such as t() (transpose), or accessed with a matrix subscript.

A general approach to this sort of operation (element by element for a list, column by column for a data frame) is supplied by sapply() and lapply(). The lapply() function always returns a list, whereas sapply() runs lapply() and then tries to make the output into a vector (if the function always returns a vector of length 1) or a matrix (if the function returns a vector of constant length). Be careful, though, because if the different function calls return items of different lengths, sapply() will need to return a list, just as the ordinary apply() function did back in Section 3.2.3. Moreover, if the function returns elements of different types (perhaps as a row of a data frame), sapply() will try to convert these to a common type. In these cases, use lapply(). The following example shows one very common use of sapply(), which is to return the classes of each column in a data frame.

> sapply (mydf2, class)
       alpha            b           NX alpha.plus.1
   "integer"    "logical"  "character"    "numeric"

In this example, the regular apply() function will convert the whole data frame to character first, before computing the classes, which it would report as all character.
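
Returning to the question raised at the start of this section, the following sketch counts the missing values in each column. The data frame obs and the function count_na() are hypothetical names of our own.

```r
# Hypothetical data frame with a few missing entries
obs <- data.frame(id = c(1, 2, NA, 4),
                  tag = c("a", NA, NA, "d"),
                  ok = c(TRUE, FALSE, TRUE, TRUE),
                  stringsAsFactors = FALSE)
count_na <- function(col) sum(is.na(col))
na_vec <- sapply(obs, count_na)    # named integer vector, one per column
na_list <- lapply(obs, count_na)   # the same counts, returned as a list
```

Because count_na() always returns a single number, sapply() is able to simplify the result to a vector; lapply() would be the safer choice for a function whose output varies in length.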

It is easy to operate on the columns of a data frame (or the elements of a list) with lapply() and sapply() functions. As we have seen, it is more difficult to operate on the rows. These two functions provide a solution to this problem. They can be used with an ordinary numeric vector as their first argument, in which case they act like a for() loop, applying their function to each element of the vector. The for()-like behavior of lapply() and sapply() is most useful when using a complicated function on each row of a data frame. The command sapply (1:nrow(ourdf), function (i) fancy (ourdf[i,])) runs a user-written function called fancy() on each row of a data frame. The argument to fancy() really is a data frame, and not one that has been converted into a matrix. In this example, we show how we might identify rows that contain the number 1. Note that the naïve use of apply() does not find the number 1 in the first row.

> apply (dd, 1, function (x) any (x == 1))
[1] FALSE FALSE
> sapply (1:2, function (i) any (dd[i,] == 1))
[1]  TRUE FALSE

3.5.1 Split, Apply, Combine

The family of apply() functions all operate as part of a strategy that Wickham (2011) calls “split-apply-combine.” The data is split (possibly by row, possibly by column), a function is applied to each piece, and the results recombined. We have already met the tapply() function (Section 2.5.2), which performs exactly this set of operations on vectors. We can also do this explicitly via split() and sapply() or lapply(). We start the following example by constructing a data frame with some people's ages, genders, and ages of spouses, and computing the average value of Age by Gender. In this example, we do not specify stringsAsFactors = FALSE.

> age <- data.frame (Age = c(35, 37, 56, 24, 72, 65),
 Spouse = c(34, 33, 49, 28, 70, 66),
 Gender = c("F", "M", "F", "M", "F", "F"))
> split (age$Age, age$Gender)
$F
[1] 35 56 72 65
$M
[1] 37 24
> sapply (split (age$Age, age$Gender), mean)
   F    M
57.0 30.5

Here the split() function returns a list with the elements of Age divided by value of Gender. Then sapply() operates the mean() function on each element of the list and returns a vector (i.e., it performs both the “apply” and “combine” operations). In this example, we could have used tapply(age$Age, age$Gender, mean) to produce an identical result.

However, unlike tapply(), split() can operate on a data frame, producing a list of data frames. We can then write a function to operate on each data frame. In this example, we split our data frame by Gender and then use summary() on each of the resulting data frames to return some information about every column. The summary() function is more informative when applied to the factor column Gender than when applied to a character column; this is why we did not specify stringsAsFactors = FALSE earlier. The result of the calls to summary() appears as a specially formatted table.

> split (age, age$Gender)
$F
  Age Spouse Gender
1  35     34      F
3  56     49      F
5  72     70      F
6  65     66      F
$M
  Age Spouse Gender
2  37     33      M
4  24     28      M
> lapply (split (age, age$Gender), summary)
$F
      Age            Spouse      Gender
 Min.   :35.00   Min.   :34.00   F:4
 1st Qu.:50.75   1st Qu.:45.25   M:0
 Median :60.50   Median :57.50
 Mean   :57.00   Mean   :54.75
 3rd Qu.:66.75   3rd Qu.:67.00
 Max.   :72.00   Max.   :70.00
$M
      Age            Spouse      Gender
 Min.   :24.00   Min.   :28.00   F:0
 1st Qu.:27.25   1st Qu.:29.25   M:2
 Median :30.50   Median :30.50
 Mean   :30.50   Mean   :30.50
 3rd Qu.:33.75   3rd Qu.:31.75
 Max.   :37.00   Max.   :33.00

Using sapply() in this case produces an unexpected result (try it!). That function tries hard to construct a vector or matrix whenever it can. A single command that produces essentially the same final result, without letting you save the list, is the by() function. In this example, by(age, age$Gender, summary) performs the summary() operation on each column, broken down by gender.
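
Splitting a whole data frame is most useful when the function needs more than one column from each piece. In this sketch, we rebuild the age data frame (with stringsAsFactors = TRUE given explicitly, so that Gender is a factor in any version of R) and compute the average difference between Age and Spouse for each gender. The names pieces and gap are our own.

```r
age <- data.frame(Age = c(35, 37, 56, 24, 72, 65),
                  Spouse = c(34, 33, 49, 28, 70, 66),
                  Gender = c("F", "M", "F", "M", "F", "F"),
                  stringsAsFactors = TRUE)
pieces <- split(age, age$Gender)      # list of two data frames
# Apply a function that uses two columns of each piece
gap <- sapply(pieces, function(d) mean(d$Age - d$Spouse))
gap                                   # F is 2.25, M is 0
by(age, age$Gender, summary)          # one-step alternative to lapply()
```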

Under some circumstances, the three tasks of split, apply, and combine might require separate functions, each of which may have its own arguments and conventions. The dplyr package (Wickham and Francois, 2015) presents a set of tools that aim to make this sort of processing more consistent. Although this package is intended for data frames, the earlier plyr package (Wickham, 2011) handles lists and arrays as well. Both are intended to be fast and efficient and to permit parallel computation, which we address in Section 5.5. We have been accustomed to performing these tasks in regular R, and we recommend that users know how to perform these tasks there, since lots of existing code and users take that approach.

3.5.2 All-Numeric Data Frames

We noted above that it is difficult to apply a function to the rows of a data frame because the entries of a row may have different classes. All-numeric data frames, though – those whose columns are all logical or numeric – behave specially in these situations. When one of these data frames is converted to a matrix the numeric nature of the columns is preserved (with logicals being converted to numeric). These data frames can also be transposed, or accessed with a matrix subscript, without losing their numeric nature. All-numeric data frames provide a useful way of storing numbers in a matrix-like way while being able to use data-frame-like syntax – but, again, as soon as one character column (perhaps an ID) is added, the nature of the data frame changes.

Just as there are functions as.numeric() and so on to convert vectors from one class to another (see Section 2.2.3), R provides as.matrix() and as.data.frame() functions to convert data frames to matrices and vice versa. This is mostly useful for all-numeric data frames or for older functions that require numeric matrices.
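
This sketch shows the conversions just described on a small all-numeric data frame, and then shows how one character column changes the outcome. The data frame num and the other object names are hypothetical.

```r
# Hypothetical all-numeric data frame (logical counts as numeric here)
num <- data.frame(x = c(1.5, 2.5), flag = c(TRUE, FALSE))
m <- as.matrix(num)        # numeric matrix; TRUE becomes 1
tm <- t(num)               # the transpose is also numeric
back <- as.data.frame(m)   # columns x and flag, both numeric now
# Adding one character column changes the conversion entirely
num$id <- c("A1", "B2")
m2 <- as.matrix(num)       # now a character matrix
```

Note that the round trip through as.matrix() is not lossless: the flag column comes back as numeric, not logical.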

3.5.3 Convenience Functions

We encourage users to use long names for their data objects and for their column names, for increased readability. However, this often leads to a situation where to use a simple expression we need a long line like the one in this example:

CustPayment2016$JanDebt +  CustPayment2016$FebPurch -
                           CustPayment2016$FebPmt

The with() and within() functions provide an easier way to perform operations such as these, and they are particularly useful when the same operation needs to be done multiple times on multiple data objects, usually data frames. For each of these functions, we pass the data frame's name and then the expression to be performed, like this:

with (CustPayment2016, JanDebt + FebPurch - FebPmt)

One issue is that if the expression passed to with() includes an assignment, the result of that assignment is not saved in the data frame. In order to create a new column in CustPayment2016 we would need code like this:

CustPayment2016$FebDebt <- with (CustPayment2016,
                     JanDebt + FebPurch - FebPmt)

As an alternative, the within() function can perform assignments; it returns a copy of the data with the expression evaluated. In this case, we could add a new column called FebDebt to the data frame with a command like this:

CustPayment2016 <- within (CustPayment2016,
                     FebDebt <- JanDebt + FebPurch - FebPmt)

Notice that in this example within() returns a copy, which then needs to be saved.
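
To see the difference between the two functions in action, this sketch builds a small version of CustPayment2016. The numbers, and the column name Lost, are invented for illustration.

```r
# Hypothetical payment data using the column names from the text
CustPayment2016 <- data.frame(JanDebt = c(100, 0, 250),
                              FebPurch = c(20, 75, 0),
                              FebPmt = c(50, 25, 250))
# with() evaluates the expression and returns its value
feb <- with(CustPayment2016, JanDebt + FebPurch - FebPmt)
# An assignment inside with() does not change the data frame
with(CustPayment2016, Lost <- JanDebt + FebPurch)
"Lost" %in% names(CustPayment2016)    # FALSE
# within() returns a modified copy, which must be saved
CustPayment2016 <- within(CustPayment2016,
                          FebDebt <- JanDebt + FebPurch - FebPmt)
```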

Two more convenience functions are the subset() and transform() functions. Much beloved of beginners, they make the subsetting and transformation process easier to follow by helping do away with square brackets. For example, we might extract all the rows of data frame d for which column Price is positive with a command like d[ d$Price > 0,]; subset() allows us to use the alternative subset(d, Price > 0). It is also possible to extract a subset of columns at the same time. The transform() function allows the user to specify transformations to existing columns in a data frame and returns the updated version. The help pages for both of these functions are accompanied by warnings that recommend using them interactively only, not for programming, and we generally avoid them.
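
For readers who meet these functions in other people's code, here is a short sketch of each, set beside the square-bracket version. The data frame d and its columns are hypothetical.

```r
d <- data.frame(Item = c("pen", "ink", "pad"),
                Price = c(1.5, -2, 4), Qty = c(10, 3, 2),
                stringsAsFactors = FALSE)
pos1 <- d[d$Price > 0, ]                 # square-bracket version
pos2 <- subset(d, Price > 0)             # equivalent subset() call
# Rows and columns can be chosen at the same time
few <- subset(d, Price > 0, select = c(Item, Qty))
d2 <- transform(d, Total = Price * Qty)  # returns an updated copy
```

One difference to keep in mind is that subset() treats a missing value in the condition as FALSE, whereas a logical subscript containing NA produces a row of NAs.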

A final convenience function is the ability to “pipe” provided by the %>% function in the magrittr package (Bache and Wickham, 2014). This is intended to make code more readable by allowing one function's output to serve as another's input directly at the command line, rather than requiring nested calls. For example, consider this evaluation of a mathematical expression:

> cos (log (sqrt (8 - 3)))
[1] 0.6933138

In R, we have to read this from the inside out: we compute 8 - 3; take the square root of the result; take the logarithm of that result; and finally compute the cosine of the result from the log() function. Using the pipe notation, we can pass the results of one computation to the next in the order in which they are performed. This example shows the same computation performed using the pipe notation.

> (8 - 3) %>% sqrt %>% log %>% cos
[1] 0.6933138

The pipe notation is particularly useful for nested functions and can be brought to bear on data frames. However, be aware that not every function is suitable for piping, and notice that the order of precedence required that we surround the 8 - 3 with parentheses.

3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames

Data frames can be re-ordered (i.e., sorted) using a command that extracts all the rows in a new order. This ordering will usually be a vector of row indices constructed with the order() function (see Section 2.6.2). So if a data frame named cust has columns ID and Date, then ord <- order(cust$ID, cust$Date) (or the slightly more convenient alternative, ord <- with(cust, order (ID, Date))) will produce a vector ord that shows the ordering of the data frame's rows by increasing ID, and then by increasing Date within ID. Therefore, the command cust <- cust[ord,] will replace the old cust with the newly ordered one.
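
The following sketch carries out this re-ordering on a small, made-up version of cust. The values are invented; only the column names ID and Date come from the description above.

```r
cust <- data.frame(ID = c(3, 1, 3, 2, 1),
                   Date = as.Date(c("2017-02-01", "2017-03-15",
                                    "2017-01-10", "2017-01-05",
                                    "2017-01-20")))
ord <- with(cust, order(ID, Date))   # row indices in sorted order
ord                                  # 5 2 4 3 1
sorted <- cust[ord, ]                # extract the rows in that order
```

The rows of sorted keep the row names they had in cust. If new sequential row names are wanted, rownames(sorted) <- NULL will reset them.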

In Section 2.6.4, we saw that the unique() function returns the unique entries in a vector, while its counterpart duplicated() returns a logical vector that is TRUE for any entry that appears earlier in the vector. These two functions operate directly on matrices and data frames as well. So the command unique(mydf) takes a data frame named mydf and returns the set of non-duplicated rows. As always, floating-point error can be a problem when detecting whether two things are identical.
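
As a brief sketch, the hypothetical data frame visits below has one row (the third) that repeats an earlier one.

```r
visits <- data.frame(ID = c(1, 2, 1, 3, 2),
                     Site = c("N", "S", "N", "N", "E"),
                     stringsAsFactors = FALSE)
dup <- duplicated(visits)      # TRUE where a row repeats an earlier one
uniq <- unique(visits)         # the same rows as visits[!dup, ]
```

A row counts as a duplicate only when every column matches, so the fifth row (ID 2 at site "E") is kept even though ID 2 appeared earlier.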

One more operation that comes up is random sampling from a data frame. This is a good plan when the original data set is so big that it cannot be easily used for testing, for example, or plotting. As with re-ordering, the idea is to construct a sample of row indices and then to subset the data frame with that sample. The sample() command is useful here. In its most basic form, we pass an integer named x giving the number of rows in the data frame and an argument named size giving the desired sample size. The result is a random set of integers selected without replacement with each value from 1 to x being equally likely. To sample 200 rows from a data frame named mydf, we could use the command sam <- sample(nrow(mydf), 200) to get a vector of 200 row numbers, and then mydf[sam,] to do the sampling. (This presumes that there are 200 or more rows in mydf. If not, R produces an error.) Of course, the new data frame's rows will maintain the numbers they had in the original mydf, so the row names of the new version will be out of order. If that bothers you, a quick sam <- sort(sam) prior to subsetting will fix that. The sample() function also has a number of more sophisticated features, including sampling with replacement and the ability to specify different probabilities for different choices.
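
Here is a sketch of the sampling steps on a made-up data frame named pool. The call to set.seed() is our addition, there only so that the sketch gives the same sample each time it is run.

```r
set.seed(42)                            # only to make the sketch repeatable
pool <- data.frame(id = 1:1000, value = rnorm(1000))
sam <- sample(nrow(pool), 200)          # 200 distinct row numbers
sam <- sort(sam)                        # optional: keep the original order
small <- pool[sam, ]                    # the sampled data frame
```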

3.6 Date and Time Objects

Most data cleaning problems will include dates (and sometimes times). The most important tasks we face with dates in data cleaning are doing arithmetic (e.g., adding a number of days to a date or finding the number of days between two dates) and extracting each date's day, day of the week, month, calendar quarter, or year. Objects representing dates and times come in several forms in R, but since one of them takes the form of a list, we have postponed discussion of those objects until here.

3.6.1 Formatting Dates

There are lots of ways to display a date in text, and during data cleaning it will feel like you meet all of them. Americans might write July 4, 2017 as “7/4/17,” but to most of the rest of the world, this indicates April 7th. Furthermore, this representation leaves unclear precisely where in the string the day starts; it starts in the third character for an American's “7/4/17” but in the first for an internationally formatted date like “26/05/17.” The two unambiguous formats “2017-07-04” and “2017/07/04” are good starting points for storing dates, especially in text files outside of R. (The value “2017-7-4” is permitted, but this format leads to date strings of different lengths; 20170704 is easy to mistake for an integer.)

The simplest date class in R is called Date, and an object of this class is represented internally as an integer representing the number of days since a particular “origin” date. The as.Date() function produces objects of class Date in two ways. First, it can convert an integer number of days since the origin into a date. The usual origin date in R is January 1, 1970, or, unambiguously, “1970-01-01.” In this example, we show how a vector of integers can be converted into a Date object.

> (dvec <- as.Date (c(0, 17250:17252),
                    origin = "1970-01-01"))
[1] "1970-01-01" "2017-03-25" "2017-03-26" "2017-03-27"

Notice that the value 0 is converted into the origin date, “1970-01-01.” If we are given integer dates, we need to know what the origin is supposed to be. This concern arises when reading data in from the Excel spreadsheet program. Excel uses integer dates, but the origins are different between Windows and Mac, and Excel mistakenly treats 1900 as a leap year. We describe this in more detail in Section 6.5.2 when we describe reading data in.

The second conversion that as.Date() can perform is to convert text-based representations such as “7/4/17” or “July 4, 2017,” using a format string that describes the way the input text is formatted. Each piece of the format string that starts with % identifies one part of the date or time; other pieces represent characters such as space, comma, /, or - between pieces of the input text. For example, %B matches the name of the month and %a matches the name of the day of the week. The most important pieces of the format are %d for day of the month, %m for the month, and %y and %Y for two- and four-digit year, respectively. (Two-digit years between 69 and 99 are assumed to be twentieth-century ones starting with 19, and the rest, twenty-first-century ones.) The help page for as.Date() refers us to the help page for strptime(), which lists all of the possibilities. For example, this command uses the format string "%b %d, %Y" to convert text dates such as "September 30, 2017" into a Date object.

> as.Date (c("Feb 29, 2016", "Feb 29, 2017",
             "September 30, 2017"), format = "%b %d, %Y")
[1] "2016-02-29" NA           "2017-09-30"

Notice that the format string had to contain the same pattern of spaces and comma that the input text had. R was able to read both the three-letter abbreviation Feb and the full name September – but it produced an NA for Feb 29, 2017 which was not a legitimate date.
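
The same approach handles the slash-separated formats mentioned at the start of this section. This sketch, with object names of our own choosing, also shows the two-digit-year rule at its boundary.

```r
us <- as.Date("7/4/17", format = "%m/%d/%y")      # American order
intl <- as.Date("26/05/17", format = "%d/%m/%y")  # day first
y69 <- as.Date("7/4/69", format = "%m/%d/%y")     # 69 is read as 1969
y68 <- as.Date("7/4/68", format = "%m/%d/%y")     # 68 is read as 2068
c(us, intl, y69, y68)
```

When two-digit years refer to birth dates or other events that might fall before 1969, the converted values need to be checked against what is plausible for the data.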

The names of the days of the week, and the months of the year, are set by the computer's locale (see Section 1.4.6). By changing locales R can be made to read in days or months in other languages, as well, which is useful when data comes from international sources. In this example, we have some dates in which the month has been given in Spanish. By changing the locale we can read these in; then by re-setting the locale we can use as.character() to convert them into English.

> sp.dates <- c("3 octubre 2016", "26 febrero 2017",
                "5 mayo 2017")
> as.Date (sp.dates, format = "%d %B %Y")
[1] NA NA NA
# Not understood in English locale; use Spanish for now
> Sys.setlocale ("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252"            # Setting was successful
> (dts <- as.Date (sp.dates, format = "%d %B %Y"))
[1] "2016-10-03" "2017-02-26" "2017-05-05"
> Sys.setlocale ("LC_TIME", "USA")  # Change back
[1] "English_United States.1252"    # Setting was successful
> as.character (dts, "%d %B %Y")
[1] "03 October 2016"  "26 February 2017" "05 May 2017"     

3.6.2 Common Operations on Date Objects

There are a number of convenience functions to manipulate date objects. The months() and weekdays() functions act on Date objects and return the names of the corresponding months and days of the week. Each has an abbreviate argument that defaults to FALSE; when set to TRUE these arguments produce three-letter abbreviations. This example demonstrates these convenience functions, along with the related quarters() function.

> d1 <- as.Date ("2017-01-02")
> d2 <- as.Date ("2017-06-15")
> weekdays (c(d1, d2))
[1] "Monday"   "Thursday"
> months (c(d1, d2))
[1] "January" "June"
> months (c(d1, d2), abbreviate = TRUE)
[1] "Jan" "Jun"
> quarters (c(d1, d2))
[1] "Q1" "Q2"

There is no function to extract the numeric day, month, or year from a Date object. These operations are performed using the format() function, which calls format.Date() to produce character output that can then be converted to numeric using as.numeric(). The elements of the format string are like those that are used in as.Date(). This example shows how to extract some of those pieces from a vector of Date objects – but, again, note that the output of format() is text.

> format (c(d1, d2), "%Y")
[1] "2017" "2017"
> format (c(d1, d2), "%d")
[1] "02" "15"
> format (d1, "%A, %B %d, %Y")
[1] "Monday, January 02, 2017"

The final command shows a more sophisticated formatting operation, using a format string like the one in as.Date().
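
Since format() returns text, the conversion to numbers mentioned above is a separate step. The following is a small sketch of that step; the variable names (dts, yr, mo, dy) are ours and purely illustrative.

```r
# Two illustrative dates
dts <- as.Date(c("2017-01-02", "2017-06-15"))

# format() produces character strings; as.numeric() converts them
yr <- as.numeric(format(dts, "%Y"))   # 2017 2017
mo <- as.numeric(format(dts, "%m"))   # 1 6
dy <- as.numeric(format(dts, "%d"))   # 2 15

print(data.frame(date = dts, yr = yr, mo = mo, dy = dy))
```

Note that the leading zeros in "01" and "02" disappear once the text is converted to numeric.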

It is permitted to use decimals in a Date object to represent times of day. If you want to create a date object to represent 1:00 p.m. on July 29, 2015, as.Date(16645 + 13/24, origin = "1970-01-01") will return a numeric, non-integer object that can be used as a date. However, as.Date("2017-07-29 13:00:00") produces a Date that is represented internally by the integer 17,376 – the time portion is ignored. Moreover, non-integer parts are never displayed and can even be truncated by some operations. When times of day are required, it is a better idea to use a POSIXt object (Section 3.6.4).

3.6.3 Differences between Dates

Very often we need to know how far apart two dates are. The difference between two Date objects is not a date; it is instead a period of time. In R, one of these differences is stored as a difftime object. Some functions, such as mean() and range(), handle difftime objects in the expected way. Others, such as hist() (to produce a histogram) or summary(), fail or produce unhelpful results. Normally, we will convert difftime objects into numeric items with as.numeric(). Be careful, though: the units that R uses for the conversion can depend on the size of the difference, whereas for data cleaning we almost always want to use one consistent choice of unit. Therefore, it is a good habit, when converting difftime objects to numbers, to specify units = "days" (or whichever unit we want) explicitly. In this example, which continues the one above, we show addition on dates plus an example of a difftime object.

# Date objects are numeric; we can add and subtract them
> d1 + 30
[1] "2017-02-01"
> (d <- d2 - d1)
Time difference of 164 days # an object of class difftime
> as.numeric (d)           # convert to numeric, in days
[1] 164
> units (d)
[1] "days"
> as.numeric (d, units = "weeks")
[1] 23.42857

In the last pair of commands, we saw that as.numeric() produced an output in days by default, the units being revealed by the units() command. We can also set the units of a difftime object explicitly, with a command like units(d) <- "weeks", or use the difftime() function directly, like difftime(d2, d1, units = "weeks").
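
The options for controlling units can be seen side by side in a short sketch. The two dates here are hypothetical ones of our own, chosen to be exactly two weeks apart so that the arithmetic is easy to check.

```r
a <- as.Date("2017-03-01")
b <- as.Date("2017-03-15")
d <- b - a                                   # a difftime object

in.days  <- as.numeric(d, units = "days")    # 14
in.weeks <- as.numeric(d, units = "weeks")   # 2

# Change the units stored in the object itself
units(d) <- "weeks"
new.units <- units(d)                        # "weeks"

# Or ask for the units when the difference is computed
in.hours <- as.numeric(difftime(b, a, units = "hours"))   # 336
```

Whichever route you choose, naming the unit explicitly means the resulting number never depends on R's choice.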

3.6.4 Dates and Times

If you don't need to do computations with times – only with dates – the Date class will be enough, at least back to 1752, when Britain switched from the Julian to the Gregorian calendar. If you need to do computations on times, there is a second set of objects that are stronger at storing and computing those. These are named POSIXct and POSIXlt objects, after the POSIX set of standards. Collectively, these two types of objects are called POSIXt objects. POSIXt objects measure the number of seconds (possibly with a decimal part) since the beginning of January 1, 1970 using Coordinated Universal Time (UTC), which is identical to Greenwich Mean Time (GMT). (Technically, the POSIX standard does not include leap seconds, a vector of which is given by R's built-in .leap.seconds variable. This has never affected us.)

The POSIXlt object is implemented as a list, which makes it easy to extract pieces; the POSIXct object acts more like a number, which makes it the choice for storing as a column in a data frame. We start with an example of a POSIXlt object. It prints out in a character string, but it behaves like a list. One unusual feature is that, to see the names of the list, you need to unlist() the object first. For example,

> (start <- as.POSIXlt("2017-01-17 14:51:23"))
[1] "2017-01-17 14:51:23 PST" # R has inferred time zone PST
> unlist (start)
   sec    min   hour   mday    mon   year   wday   yday
  "23"   "51"   "14"   "17"    "0"  "117"    "2"   "16"
 isdst   zone gmtoff
   "0"  "PST"     NA 

Here start really is a list, and we can extract components in the usual way, with a dollar sign or double brackets (but, although you can use its names, names(start) is NULL, and you cannot extract a subset of components with single brackets). Notice also that the first day of the month gets number 1, but the first month of the year, January, carries the number 0, and that the year element counts the number of years since 1900. The advantage of a list is that, given a vector of POSIXlt objects named date.vec, say, you can extract all the months at once with date.vec$mon – but again, January is month 0 and December is month 11. Weekdays are given in the list by wday, with 0–6 representing Sunday through Saturday, respectively. The weekdays() function from above, and the other Date functions, also work on POSIXt objects – but be aware that the results are displayed in the locale of the user. Notice that the time zone above, PST, is deduced by our computer from its locale. The help for DateTimeClasses gives more information on the niceties of time zones, many of which are system specific.

Although we can use the weekdays(), months(), and quarters() functions on POSIXct objects, we extract other components, such as years or hours, via the format() function, as we did for Date objects. This is slightly less efficient than the list-type extraction from a POSIXlt object, but we recommend using POSIXct objects where possible, because we have encountered unexpected behavior when changing time zones with POSIXlt objects.
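
The two styles of extraction can be compared in a short sketch. We fix the time zone at UTC here (an assumption of ours, made so that the results do not depend on the machine), and the object names are illustrative.

```r
lt <- as.POSIXlt("2017-01-17 14:51:23", tz = "UTC")

# List-style extraction from a POSIXlt object
month.num <- lt$mon + 1        # mon counts from 0, so add 1: 1
year.num  <- lt$year + 1900    # year counts from 1900: 2017
hour.num  <- lt$hour           # 14

# format()-style extraction from the equivalent POSIXct object
ct <- as.POSIXct(lt)
hour.ct <- as.numeric(format(ct, "%H"))   # 14
year.ct <- as.numeric(format(ct, "%Y"))   # 2017
```

The format() route needs no adjustment for the 0-based month or the 1900-based year, which removes one common source of error.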

It is worth noting that although a POSIXt object may have a time, a time is not required. When a Date object is converted into a POSIXt object, the resulting object is given a time of 00:00 (i.e., midnight) in UTC. A vector of POSIXt objects that are all at midnight displays without the time visible, but the objects do contain a time value. When a POSIXt object is converted to a Date object, the time is truncated.

3.6.5 Creating POSIXt Objects

R's as.POSIXct() and as.POSIXlt() functions convert text that is unambiguously formatted into POSIXt objects just as as.Date() does. Here the date can be followed by a 24-hour clock time like 17:13:14 or a 12-hour time with an AM/PM indicator. More usefully, perhaps, these functions allow the use of a format string such as the one used by as.Date(). This format string, documented in the help for strptime(), allows times, time zones, and AM/PM indicators, attributes that are also accepted by as.Date(). Often we discard time information, since we are only interested in dates, but sometimes discarding time information can lead to incorrect conclusions. In this example, we construct two POSIXct objects that represent the same moment expressed in two different time zones.

> (ct1 <- as.POSIXct ("Mar 31, 2017 10:26:08 pm",
        format = "%b %d, %Y %I:%M:%S %p"))
[1] "2017-03-31 22:26:08 PDT"
> (ct2 <- as.POSIXct ("2017-04-01 05:26:08", tz = "UTC"))
[1] "2017-04-01 05:26:08 UTC"
> as.numeric (ct1 - ct2, units = "secs")
[1] 0

The first date, ct1, is not given an explicit time zone, so the system selects the local one (shown here as PDT). In the second example, we explicitly provide the UTC indicator with the tz argument. The as.numeric() command shows that the two times are identical. There are a few confusing properties of POSIXt objects. All the objects in a vector of length greater than 1 will be displayed with the local time zone, and their weekdays() and months() will be, too. For a single object, though, these functions refer to the time zone of the object, although, as this example shows, there is a complication.

> c(ct1, ct2)
[1] "2017-03-31 22:26:08 PDT" "2017-03-31 22:26:08 PDT"
> weekdays (c(ct1, ct2))
[1] "Friday" "Friday"
> weekdays (ct2)
[1] "Saturday"
> weekdays (c(ct2))
[1] "Friday"

The top command shows that the vector of dates is displayed in our local time zone. That date refers to a moment that was on a Friday locally. When weekdays() acts on ct2 by itself, though, it shows that that moment was on a Saturday in Greenwich. In the final command, the c() causes ct2 to be converted to local time, where its date falls on a Friday.

To explicitly convert the time zone of a POSIXct object, you can set its tzone attribute, with a command like attr(ct1, "tzone") <- "UTC", or, equivalently, by assigning "GMT"; see the help for Sys.timezone() for a way to determine the names of time zones. (The approach for POSIXlt objects is more complicated and we do not discuss it here.) Note that when POSIXct objects are converted to Date objects, they are rendered in UTC, so as.Date(ct1) and as.Date(ct2) both produce dates with value "2017-04-01".
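
Here is a brief sketch of changing the tzone attribute. It assumes that the time zone name "America/Los_Angeles" is available on your system, which is usual but not guaranteed; the object names are ours.

```r
ct.utc <- as.POSIXct("2017-04-01 05:26:08", tz = "UTC")

# Copy the object and change only the way it is displayed
ct.la <- ct.utc
attr(ct.la, "tzone") <- "America/Los_Angeles"

# The underlying number of seconds is unchanged ...
same.moment <- as.numeric(ct.utc) == as.numeric(ct.la)   # TRUE

# ... but the clock time, and even the date, now differ
la.hour <- format(ct.la, "%H")          # "22"
la.day  <- format(ct.la, "%Y-%m-%d")    # "2017-03-31"
```

Setting the attribute changes only how the moment is rendered, never the moment itself.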

The format string that is passed to as.POSIXct() allows for a lot of flexibility in the way dates are formatted. This example shows how you might convert R's own date stamp, produced by the date() function, into a POSIXct object and then a Date object.

> (curdate <- date())
[1] "Wed Sep 21 00:36:47 2016"
> (now <- as.POSIXct (curdate,
                      format = "%A %B %d %H:%M:%S %Y"))
[1] "2016-09-21 00:36:47 PDT" # POSIXct object
> as.Date (now)
[1] "2016-09-21"

As long as the format of the dates in your data is consistent, it will probably be possible to read them in using as.POSIXct(). In some cases, dates may appear with extraneous text. If the content of the text is known exactly, the text can be matched. For example, the string Wednesday, the 17th of March, 2017 at 6:30 pm can be read in with the format string "%A, the %dth of %B, %Y at %I:%M %p". But this formatting will fail for the 21st or the 22nd or if the input string ends with p.m. (with periods). In cases where there is variable, extraneous text, you may have to resort to manipulating the text strings using the tools in Chapter 4.
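
As a preview of that kind of manipulation, this sketch strips the ordinal suffixes and the periods in p.m. before conversion. The input strings are hypothetical, the regular expressions are one possible choice among many, and the final conversion assumes an English locale in which month names and am/pm indicators are recognized.

```r
txt <- c("the 17th of March, 2017 at 6:30 pm",
         "the 21st of March, 2017 at 6:30 pm",
         "the 22nd of March, 2017 at 6:30 p.m.")

# Remove st, nd, rd, or th when it directly follows the day number
clean <- sub("([0-9]+)(st|nd|rd|th)", "\\1", txt)
# Turn a.m./p.m. (with periods) into am/pm
clean <- sub("([ap])\\.m\\.", "\\1m", clean)
print(clean)

# One format string now fits every element (English locale assumed)
when <- as.POSIXct(clean, format = "the %d of %B, %Y at %I:%M %p",
                   tz = "UTC")
```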

3.6.6 Mathematical Functions for Date and Times

Since Date and POSIXt objects are numeric, many functions intended to work on numeric data also work on these date objects. In particular, range(), max(), min(), mean(), and median() all produce vectors of date objects. The diff() function computes differences between adjacent elements in a vector, so diff(range(x)) produces the range of dates in the vector x as a difftime object. The summary() function acts on a vector of date objects, producing an object that is slightly different from a vector of dates but still usable. You can also tabulate Date and POSIXct objects with table() – but table() does not work on the list-like POSIXlt objects.

The seq() function can also be used to generate a sequence of dates. This is useful for generating the endpoints of “bins” for histograms or other summaries. As we mentioned, Date objects are implemented in units of days, so a sequence of Date objects one unit apart has values 1 day apart by default. However, POSIXt objects are in units of seconds, so a sequence of POSIXt objects one unit apart has values 1 second apart. One way to create a sequence of POSIXt objects representing consecutive days is to use by = 86400, since there are 60 × 60 × 24 = 86,400 seconds in a day. However, R has a better approach. When called with a vector of Date or POSIXt objects, the seq() function invokes one of the functions, seq.Date() or seq.POSIXt(), that is smarter about date objects. These functions let you use the by argument with a word like "hour", "day" and so on. An additional value "DSTday" (for POSIXt only) ignores daylight saving time to produce the same clock time every day. In this example, we generate some sequences of Date and POSIXt objects. Notice that R suppresses times for POSIXt dates when all of the times in the vector are midnight.

> seq (as.Date ("2016-11-04"), by = 1, length = 4)
[1] "2016-11-04" "2016-11-05" "2016-11-06" "2016-11-07"
# Create and save a POSIXct object, for convenience
> ourPos <- as.POSIXct ("2016-11-04 00:00:00")
> seq (ourPos, by = 1, length = 3)
[1] "2016-11-04 00:00:00 PDT" "2016-11-04 00:00:01 PDT"
[3] "2016-11-04 00:00:02 PDT"
> seq (ourPos, by = "day", length = 3)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
> seq (ourPos, by = "day", length = 4)
[1] "2016-11-04 00:00:00 PDT" "2016-11-05 00:00:00 PDT"
[3] "2016-11-06 00:00:00 PDT" "2016-11-06 23:00:00 PST"
> seq (ourPos, by = "DSTday", length = 4)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
[4] "2016-11-07 PST"
> seq (ourPos, by = "month", length = 4)
[1] "2016-11-04 PDT" "2016-12-04 PST" "2017-01-04 PST"
[4] "2017-02-04 PST"

In the top example, we see a sequence of Date objects 1 day apart (as specified by by = 1). That same specification produces POSIXt dates 1 second apart. Using by = "day" moves the clock by 24 hours, but since the Pacific Time Zone, where these examples were generated, switched from daylight saving to standard time on November 6, 2016, the old time of midnight daylight time was advanced 24 hours to 11 p.m. standard time. With by = "DSTday" the clock time is preserved across days. The final example shows how we can advance 1 month at a time – the help for seq.POSIXt() shows how these functions adjust for the case when advancing by month starting at January 31, for example.

Differences between two POSIXt objects, like differences between Date objects, are represented by difftime objects in R. Here, though, you need to be even more careful to specify the units when converting the difftime object to numeric. This example shows how neglecting that specification can cause problems.

> d1 <- as.POSIXct ("2017-05-01 12:00:00")
> d2 <- as.POSIXct ("2017-05-01 12:00:06") # d1 + 6 seconds
> d3 <- as.POSIXct ("2017-05-07 12:00:00") # d1 + 6 days
> (d2 - d1) == (d3 - d1)
[1] FALSE # expected
> as.numeric (d2 - d1) == as.numeric (d3 - d1)
[1] TRUE # possibly unexpected

Here the d2 − d1 difference has the value 6 seconds, while the d3 − d1 difference has the value 6 days. The units are preserved in the difftime objects but discarded by as.numeric(). It is a good practice to always specify units = "days" or whatever your preferred unit is, whenever you convert a difftime object to a numeric value.
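
With the units named explicitly, the two differences are no longer confused. This sketch repeats the three times from the example, using tz = "UTC" (our choice, so that the result does not depend on the local time zone) and object names of our own.

```r
t1 <- as.POSIXct("2017-05-01 12:00:00", tz = "UTC")
t2 <- as.POSIXct("2017-05-01 12:00:06", tz = "UTC")   # t1 + 6 seconds
t3 <- as.POSIXct("2017-05-07 12:00:00", tz = "UTC")   # t1 + 6 days

secs.12 <- as.numeric(t2 - t1, units = "secs")   # 6
secs.13 <- as.numeric(t3 - t1, units = "secs")   # 518400

secs.12 == secs.13                               # FALSE, as it should be
```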

3.6.7 Missing Values in Dates

Dates of different classes should not be combined in a vector. It is always wise to use an explicit function to force all the elements of a date vector to have the same class. This also applies to missing values in date objects – they need to be of the proper class. In this example, we combine an NA with the d1 date from above, using the c() function. The c() function can call a second function depending on the class of its first argument – c.Date(), c.POSIXct() or c.POSIXlt().

> c(d1, NA)
[1] "2017-05-01 12:00:00 PDT" NA
> c(NA, d1)
[1]         NA 1493665200
> c(as.POSIXct (NA), d1)
[1] NA                        "2017-05-01 12:00:00 PDT"
> c(NA, as.Date (d1))
[1]    NA 17287

The first command succeeds, as expected, because c.POSIXct() is able to convert the NA value into a POSIXct object. In the second command, though, c() sees the NA and does not call a class-specific function. Instead, it converts both values to numeric. The resulting second element is the number of seconds since the POSIXt origin date. The way around this is to explicitly specify an NA value of class POSIXct, as in the third command. The final command shows that this problem exists for Date objects as well – here, d1 is converted into the number of days since the origin date. The lesson is that you should ensure that every date element, even the NA ones, in your vector has the same class.
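
The same remedy applies to Date objects. In this small sketch (with an illustrative date of our own), the missing value is created with as.Date() so that c() dispatches on the Date class.

```r
dt <- as.Date("2017-05-01")

plain.na <- c(NA, dt)            # class is lost; result is numeric
typed.na <- c(as.Date(NA), dt)   # class is kept

class(plain.na)    # "numeric"
class(typed.na)    # "Date"
is.na(typed.na)    # TRUE FALSE
```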

3.6.8 Using Apply Functions with Dates and Times

Often a data set will arrive as a data frame with a series of dates in each row. These might be dates on which a phenomenon is repeatedly recorded – monthly manpower data, for example, or payment information. If an operation needs to be performed on each row – say, finding the range of the dates in each one – it is tempting to use apply() on such a data frame. As with earlier examples (Section 3.5), this will not succeed – even (perhaps surprisingly) if the data frame's columns are all Date or all POSIXct. A better approach is to operate on each row via the lapply() or sapply() functions. Here we show an example of a data frame whose columns are both Date objects.

> date.df <- data.frame (
         Start = as.Date (c("2017-05-03", "2017-04-16")))
> date.df$End <- as.Date (c("2018-06-01", "2018-02-16"))
> date.df
       Start        End
1 2017-05-03 2018-06-01
2 2017-04-16 2018-02-16
> apply (date.df, 1, function (x) x[2] - x[1])
Error in x[2] - x[1]: non-numeric arg. to binary operator

Here, the apply() function converts the data frame to a character matrix. (Why it does not convert it to a numeric one is not clear.) So the mathematical operation fails. One way to apply the function to each row is via sapply(), as in this example:

> sapply (1:2, function (i)
                   as.numeric (date.df[i,2] - date.df[i,1],
                               units = "days"))
[1] 394 306 

Using sapply() to index the rows, we can compute each difference in days in a straightforward way. In general, you will need to pay attention when dealing with data frames of dates row by row.
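
For a simple operation such as a difference between two columns, it may be easier to skip row-by-row processing entirely and subtract the columns, since arithmetic on Date vectors is vectorized. This is an alternative to the sapply() approach rather than a description of it; the data frame is rebuilt here so that the sketch stands alone.

```r
date.df <- data.frame(
    Start = as.Date(c("2017-05-03", "2017-04-16")),
    End   = as.Date(c("2018-06-01", "2018-02-16")))

# Subtract whole columns at once; name the units explicitly
gap <- as.numeric(date.df$End - date.df$Start, units = "days")
print(gap)   # 394 306
```

Row-wise functions such as sapply() remain the tool of choice when the operation cannot be written in terms of whole columns.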

3.7 Other Actions on Data Frames

It is a rare data cleaning task that does not involve manipulating data frames, and one very common operation is to combine two data frames. There are essentially three ways in which we might want to combine data frames: by columns (i.e., combining horizontally); by rows (i.e., stacking vertically); and matching up rows using a key (which we call merging). The first two of these are straightforward and the third is only a little more complicated. In this section, we describe these tasks, as well as some other actions you can perform on data frames. We show some more detailed examples in Chapter 7.

3.7.1 Combining by Rows or Columns

When we talk about “combining data frames by columns,” we mean combining them side by side, creating a “wide” result whose number of columns is the sum of the numbers of columns in the things being combined. We have seen the cbind() function, which is the preferred function for joining matrices. We can also supply two data frames as arguments to the data.frame() function and R will join them. Both cbind() and data.frame() can incorporate vectors and matrices in their arguments as well – but they will convert characters to factors unless you explicitly provide the stringsAsFactors = FALSE argument. R is prepared to recycle some inputs, but it is best if the things being combined have the same numbers of rows.

Recall that a data frame needs to have column names, and that we (almost) always want these to be distinct. If two columns have the same name, R will use the make.names() function with the unique = TRUE argument to construct a set of distinct names. If three data frames each have a column named a, for example, the result will have columns a, a.1, and a.2. It is always a good idea to examine the set of column names for duplication (perhaps using intersect() as in Section 2.6.3) to ensure that you know what action R will take.
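
A quick check for shared column names might look like the following sketch. The two small data frames are hypothetical ones made up for illustration.

```r
df.a <- data.frame(a = 1:2, b = 3:4)
df.b <- data.frame(a = 5:6, c = 7:8)

# Which column names appear in both?
shared <- intersect(names(df.a), names(df.b))
print(shared)   # "a"

# data.frame() makes the duplicated name distinct
combined <- data.frame(df.a, df.b)
names(combined)   # "a" "b" "a.1" "c"
```

Renaming the shared columns yourself before combining is usually clearer than relying on the automatically generated names.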

Combining data frames by rows means stacking them vertically, creating a “tall” result whose number of rows is the sum of the numbers of rows in the things being combined. The rbind() function combines data frames in this way. We can only operate rbind() on things with the same number of columns; moreover, the columns need to have the same names, but they need not be in the same order; R will match the names up. You will almost always want the columns being joined to be of the same sort – numeric with numeric, character with character, POSIXct with POSIXct, and so on – otherwise, R will convert each column to a common class. We usually check the classes explicitly and recommend you pass the stringsAsFactors = FALSE argument to rbind(). If we have two data frames called df1 and df2, we start by comparing the names, using code like this:

> n1 <- names (df1)
> n2 <- names (df2)
> all (sort (n1) == sort (n2)) # should be TRUE

We sort the names of each data frame to account for the fact that they might be out of order. Next, we extract the class of each column. The results, c1 and c2 as follows, will often be vectors, although they might be lists if some columns produce a vector of length 2 or more. (This will be the case if any columns are POSIXct objects.) We compare these two objects as in this example:

> c1 <- sapply (df1, class)
> c2 <- sapply (df2, class)
> isTRUE (all.equal (c1, c2[names (c1)])) # should be TRUE

Notice that we re-order the elements of c2 so that their names match the order of the names of c1. The all.equal() function compares two objects and returns TRUE if they match, and a small report (a vector of character strings) describing their differences if they do not. This report is informative, but to test for equality in, for example, an if() statement, we need the isTRUE() function. This function produces TRUE if its argument is a single TRUE, as returned by all.equal() when its arguments match, and FALSE if its argument is anything else, like the character strings produced by all.equal() when its arguments differ.
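
The two checks can be bundled into a small helper. The function can.rbind() below is hypothetical (it is not part of R), and it uses setequal() and lapply() as reasonable variations on the commands above; treat it as a sketch rather than a complete test.

```r
# Hypothetical helper: TRUE if names and column classes agree
can.rbind <- function(df1, df2) {
    n1 <- names(df1)
    n2 <- names(df2)
    if (!setequal(n1, n2)) return(FALSE)   # same names, any order
    c1 <- lapply(df1, class)               # always a list
    c2 <- lapply(df2, class)
    isTRUE(all.equal(c1, c2[n1]))          # compare in df1's order
}

x <- data.frame(a = 1:2, b = c("u", "v"), stringsAsFactors = FALSE)
y <- data.frame(b = c("w", "z"), a = 3:4, stringsAsFactors = FALSE)
z <- data.frame(a = c("1", "2"), b = c("u", "v"),
                stringsAsFactors = FALSE)

ok.xy <- can.rbind(x, y)   # TRUE: same names and classes
ok.xz <- can.rbind(x, z)   # FALSE: column a is character in z
```

Using lapply() instead of sapply() means the result is a list whether or not any column has a class vector of length 2 or more.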

If the data frames being combined have the usual unmodified numeric row names, R will adjust them so that the resulting row names go from 1 upward, but if there are non-numeric or modified row names, R will try to keep them, again deconflicting matches to ensure that row names are distinct.

When combining a large number of data frames, the do.call() function will often be useful. This function takes the name of a function to be run, and a list of arguments and runs the function with those arguments. For example, the command log(x = 32, base = 2) produces the result 5, because 2^5 = 32. We get the exact same result with the command do.call("log", list(x = 32, base = 2)). Notice that the arguments are specified in the form of a list. This mechanism allows us to combine a large number of data frames in a fairly simple way. Suppose we have a list of data frames named list.of.df (such a list arises frequently as the output from lapply()). Extracting the individual data frames from the list can be tedious, but we can rbind() them all with a command like do.call ("rbind", list.of.df) (assuming the data frames meet the rbind() criteria). If the data frames are not already on a list, we can construct such a list with a command like list(first.df, second.df, ...).
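
This sketch shows the whole pattern on a toy scale. The list of data frames is generated by lapply() with made-up contents; in practice each element might come from reading one file.

```r
# A list of three one-row data frames with identical columns
list.of.df <- lapply(1:3, function(i)
                     data.frame(id = i, square = i^2))

# Stack them all with a single call
stacked <- do.call("rbind", list.of.df)
print(stacked)
dim(stacked)    # 3 2

# The same mechanism with an ordinary function
five <- do.call("log", list(x = 32, base = 2))   # 5
```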

3.7.2 Merging Data Frames

Merging is a more complicated and powerful operation. In the usual type of merging, each data frame has a “key” field, typically a unique one. The merge() function matches up the keys and produces a data frame with one row per key, with all of the columns from both of the data frames. There are three main complications here: what to do when keys are present in one data set but not in the other, what to do when keys are duplicated, and what to do when keys match only approximately.

The action when keys are present in one data set, but not in the other, is controlled by the all.x and all.y arguments, both of which default to FALSE. For this purpose, x refers to the first-named data set and y to the second. By default, the result of the merge has one row for each key that appears in both x and y (except when there are duplicated keys). Database users call this an “inner join.” When all.x = TRUE and all.y = FALSE, the result has one row for each key in x (this is a “left join”). Columns of the corresponding keys that do not appear in y are filled with NA values. Naturally, the converse is true when all.x = FALSE and all.y = TRUE – the result has one row for each key in y, and it has NAs in the columns contributed from x for those keys that did not appear in x. When all.x = TRUE and all.y = TRUE, the result has one row for every key in either x or y (this is an “outer join”). In this example, we merge two small data sets to show the behavior brought about by all.x and all.y.

> (df1 <- data.frame (Key = letters[1:3], Value = 1:3,
    stringsAsFactors = FALSE))
  Key Value
1   a     1
2   b     2
3   c     3
> (df2 <- data.frame (Key = c("a", "c", "f"),
    Origin = 101:103, stringsAsFactors = FALSE))
  Key Origin
1   a    101
2   c    102
3   f    103
> merge (df1, df2, by = "Key")                 # inner join
  Key Value Origin
1   a     1    101
2   c     3    102
> merge (df1, df2, by = "Key", all.x = TRUE)   # left join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
> merge (df1, df2, by = "Key", all.y = TRUE)   # right join
  Key Value Origin
1   a     1    101
2   c     3    102
3   f    NA    103
> merge (df1, df2, by = "Key", all.x = TRUE,
                               all.y = TRUE)   # outer join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
4   f    NA    103

The behavior of merge() when keys are duplicated is straightforward, but it is rarely what we want. It is best to remove rows with duplicate keys, or to create a new column with a unique key, before merging. The number of rows produced by merge() when there are duplicates is the number of pairs of keys that match between the two data frames. In this example, we establish some duplicated keys and show the behavior of merge in the left join case.

> (df3 <- data.frame (Key = c("b", "b", "f", "f"),
          Origin = 101:104, stringsAsFactors = FALSE))
  Key Origin
1   b    101
2   b    102
3   f    103
4   f    104
> merge (df1, df3, by = "Key", all.x = TRUE)
  Key Value Origin
1   a     1     NA
2   b     2    101
3   b     2    102
4   c     3     NA

Here the merge produces one row every time a key in df1 matches a key in df3 – even if that happens more than once – in addition to producing rows for every key that does not match. If df1 had included two rows with the Key value of b, as df3 does, then the result would have had four rows with Key value b.
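
One way to guard against this is to test for duplicated keys before merging. The sketch below rebuilds the two data frames so that it stands alone; keeping only the first row for each key is our choice for illustration, and the right rule will depend on your data.

```r
df1 <- data.frame(Key = letters[1:3], Value = 1:3,
                  stringsAsFactors = FALSE)
df3 <- data.frame(Key = c("b", "b", "f", "f"),
                  Origin = 101:104, stringsAsFactors = FALSE)

# anyDuplicated() returns 0 when every key is unique
has.dups <- anyDuplicated(df3$Key) > 0       # TRUE

# Keep the first row for each key (one possible rule)
df3.first <- df3[!duplicated(df3$Key), ]

merged <- merge(df1, df3.first, by = "Key", all.x = TRUE)
print(merged)
nrow(merged)    # 3, one row per key in df1
```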

The issue of matching when keys match only approximately is a thornier one. This arises when matching on people's names, for example, since these are often represented in slightly different ways – think about the slightly differing strings “George H. W. Bush,” “George HW Bush,” “George Bush 41,” and so on. The adist() and agrep() functions (see also the discussion of grep() in Section 1.4.2) help find keys that match approximately, but this sort of “fuzzy matching” (also called “entity resolution” or “record linkage”) is beyond the scope of this book.
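
Just to show what these two functions return, here is a very small sketch. The fourth key and the max.distance value are arbitrary choices of ours, and we deliberately show no expected matches, since the results depend on the distance settings.

```r
keys <- c("George H. W. Bush", "George HW Bush",
          "George Bush 41", "Barbara Bush")

# adist() returns a matrix of edit distances: one row per element
# of its first argument, one column per element of its second
d <- adist("George HW Bush", keys)
print(d)

# agrep() returns the elements that approximately contain the pattern
hits <- agrep("George Bush", keys, max.distance = 0.3, value = TRUE)
print(hits)
```

A distance of 0 indicates an exact match; deciding how large a distance still counts as a match is the hard part of record linkage.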

3.7.3 Comparing Two Data Frames

At some point you will have two versions of a data frame, and you will want to know if they are identical. “Identical” can mean slightly different things here. For example, if two numeric vectors differ only by floating-point error, we would probably consider them identical. If a character vector has the same values as a factor, that might be enough to be identical, but it might not. The identical() function tests for very strict equivalence and can be used on any R objects. It returns a single logical value, which is TRUE when the two items are equal. The help page notes that this function should usually be applied neither to POSIXlt objects nor, presumably, to data frames containing these. This is partly because two times might represent the same value expressed in two different time zones.

The all.equal() function described above compares two objects but with slightly more room for difference. The tolerance argument lets you decide how different two numbers need to be before R declares them to be different. By default, R requires that the two data frames' names and attributes match, but those rules can be over-ridden. Moreover, two POSIXlt items that represent the same time are judged equal. When two items are equal under these criteria, all.equal() returns TRUE. Since the return value of all.equal() when its two arguments are not equal is a vector of text strings, one correct way to compare data frames a and b for equality is with isTRUE(all.equal (a, b)).
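
The difference between the two functions shows up clearly with floating-point error. In this sketch the two data frames (named a and b, as in the text) differ only in the last bits of one number.

```r
a <- data.frame(x = c(0.1 + 0.2, 1))
b <- data.frame(x = c(0.3, 1))

strict <- identical(a, b)            # FALSE: 0.1 + 0.2 is not exactly 0.3
loose  <- isTRUE(all.equal(a, b))    # TRUE: equal within tolerance

if (loose && !strict)
    cat("Equal up to floating-point error\n")
```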

3.7.4 Viewing and Editing Data Frames Interactively

R has a couple of functions that will let you edit a matrix, list, or data frame in an interactive, spreadsheet-like form. The View() function shows a read-only representation of a data frame, whereas edit() allows changes to be made. The return value from the edit() function can be saved to reflect the changes. A more dangerous option is provided by data.entry(); changes made by that function are saved automatically. If you use these functions to clean your data, of course, your steps will not be reproducible, and we strongly recommend using commented scripts and functions, which we describe in Chapter 5.

3.8 Handling Big Data

The ability to acquire, clean, handle, and model with big data sets will surely become more and more important in coming years. From its beginning, R has assumed that all relevant data will fit into main memory on the machine being used, and although the amount of memory installed in a computer has certainly grown over time, the size of data sets has been growing much faster. Handling data sets too big for the computer is not part of this book's focus, but in the following section we lay out some ideas for dealing with data sets that are just too big to hold in memory.

Given data that requires more storage than main memory can provide, we often proceed by breaking the data into pieces outside of R. For text data we use the command-line tools provided by the bash program (Free Software Foundation, 2016), a widespread command interpreter that comes standard on OS X and Linux systems and which is available for Windows as well. Bash includes tools such as split, which breaks up a data set by rows; cut, which extracts specific columns; and shuf, which permutes the lines in a file (which helps when taking random samples). These tools provide the ability to break the data into manageable pieces.

Another approach for manipulating large data, this time inside R (in main memory), was noted in Section 2.7. R has support for “long vectors,” those whose lengths exceed 2^31 − 1, but these are not recommended for character data. Moreover, they are vectors rather than data frames, so the long vector approach does not mirror the data frame approach.

Sometimes the data can fit into memory, but the system is very slow performing any actions on it. In this case, the data.table package might be useful; it advertises very fast subsetting and tabulation. Unfortunately, the syntax of the calls inside the data.table package is just foreign enough to be confusing. We will not cover the use of data.table in this book. If specific actions are slow, we can often gain insight by “profiling,” which is where we determine which actions are using up large amounts of time. The “Writing R Extensions” manual has a section on profiling (Section 3) that might be useful here.

Other ways to speed up computations include compiling functions and running in parallel. We discuss these and other ways to make your functions faster in Section 5.5.

There are several add-in packages that provide the ability to maintain “pointers” to data on disk, rather than reading the data into main memory. The advantage of this approach is that the size of objects it can handle is limited only by disk storage, which can be expected to be huge. In exchange, of course, we have to expect processing to be much slower because so much disk access can be required. Packages with this approach include bigmemory (Kane et al., 2013) and its relatives, and ff (Adler et al., 2014). The tm package (Feinerer et al., 2008) does something similar for large bodies of text.

R is so popular in the data science world that there are many other programs, including big data storage mechanisms, for which R interfaces are available. This allows you to use familiar R commands to access these other mechanisms without having to understand the details of those programs. In this way, you can keep your data in a relational database or some storage facility that uses, for example, distributed memory for efficient retrieval. These approaches are beyond the scope of this book, but we have some discussion of acquiring data from a relational database in Chapter 6.

3.9 Chapter Summary and Critical Data Handling Tools

Matrices are important in many mathematical and statistical contexts, but they do not play an important role in data cleaning. However, learning about matrices makes learning about data frames more natural. Data frames also have the attributes of lists, so we have discussed lists as well in this chapter. But the important type of object in this chapter, and in R generally, is the data frame.

Data frames are often created by reading data in from outside R. We can also create them directly by combining vectors, matrices, or other data frames with the cbind(), rbind(), or merge() functions. We can add a character vector to a data frame with the dollar-sign notation, but whenever we supply a character vector as a column to be added to a data frame via data.frame() or cbind(), we need to specify the stringsAsFactors = FALSE argument.
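The ways of building a data frame just listed might look like the following small sketch. All of the column names and values are invented for illustration. In recent versions of R (4.0.0 and later) stringsAsFactors = FALSE is already the default, so supplying it there is harmless.

```r
# Create a data frame directly, keeping character columns as character
df <- data.frame(id = 1:3, grp = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Add a character column with the dollar-sign notation
df$note <- c("x", "y", "z")

# Add a character column with cbind()
df2 <- cbind(df, site = c("north", "south", "east"),
             stringsAsFactors = FALSE)

# Combine with another data frame; merge() matches on the shared column "id"
scores <- data.frame(id = c(1, 3), score = c(10, 30))
df3 <- merge(df2, scores)

sapply(df3, class)                   # confirm no column became a factor
```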

Once our data has been placed into a data frame (say, one called data1), we often start by recording the classes of each column in a vector, using a command like col.cl <- sapply(data1, function(x) class(x)[1]). We use the function shown here rather than simply using the class() function, as we did earlier, to account for columns with a vector of two or more classes – usually these would be columns with one of the date classes like POSIXct. Keeping the names data1 for the data and col.cl for the vector of classes in this example, we use commands such as these as part of our data cleaning process:

  • table(col.cl) to tabulate the column classes. Often we will have an expectation that some proportion of the columns will be numeric, or that we will have, say, exactly 10 date columns. This is a good starting point to see if the data frame looks as we expect.
  • sapply(data1, function(x) sum(is.na(x))) to count missing values by column. If the number of columns is large, we would often use table() on the result of the sapply() call to see if there are a few columns with a large number of missing values. It is also interesting if many columns report the same number of missing values. For example, if there are 56 different columns each with exactly 196 missing values, we might hypothesize that those are the very same 196 records in every column – and investigate them. In some cases, we might also count the number of negative values or values equal to 99 or some other “missing” code. Instead of the function above, then, we might substitute function(x) sum(x < 0, na.rm = TRUE) or something analogous.
  • sapply(data1[,col.cl == "numeric"], range) to compute the ranges of numeric columns in a search for outliers or anomalies. If some columns have class “integer” we will need to address those as well, perhaps using col.cl %in% c("numeric", "integer"). We might have to add the na.rm = TRUE argument, and we might also use other functions here such as mean(), median(), or sd().
  • sapply(data1, function(x) length(unique(x))) to count unique values by column. Since these numbers will count NA values, we might instead use function(x) length(unique(na.omit(x))).
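The commands in this list can be tried together on a toy data frame. The data below are invented for the example (including the 99 standing in for a “missing” code), so the sketch shows only the shape of the results.

```r
data1 <- data.frame(
    id     = 1:5,
    weight = c(61.2, NA, 70.5, 58.0, 99),
    name   = c("a", "b", NA, "b", "e"),
    when   = as.POSIXct(c("2016-01-01 10:00:00", "2016-01-02 10:00:00",
                          "2016-01-03 10:00:00", "2016-01-04 10:00:00",
                          "2016-01-05 10:00:00"), tz = "UTC"),
    stringsAsFactors = FALSE)

# First class of each column; POSIXct columns carry two classes
col.cl <- sapply(data1, function(x) class(x)[1])
table(col.cl)                        # tabulate the column classes

# Missing values by column
na.count <- sapply(data1, function(x) sum(is.na(x)))

# Ranges of the numeric and integer columns, ignoring NA values
num.cols <- col.cl %in% c("numeric", "integer")
rng <- sapply(data1[, num.cols, drop = FALSE], range, na.rm = TRUE)

# Unique values by column, not counting NA
n.unique <- sapply(data1, function(x) length(unique(na.omit(x))))
```

The drop = FALSE argument is a precaution added here: without it, selecting a single column would return a vector rather than a data frame, and sapply() would then loop over its elements instead.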

The apply() family functions provide a lot of power, but they need to be exercised carefully on data frames. The apply() function itself converts the data frame to a matrix first, and should only be used if all the columns of a data frame are of the same type. The sapply() function tries to return a vector or matrix if it can, so if the return elements are of different classes they will often be converted. We suggest using lapply() unless you know that one of the other functions will succeed.
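A small sketch of these conversions, using an invented two-column data frame, may make the warning concrete.

```r
df <- data.frame(n = c(1.5, 20), s = c("a", "b"),
                 stringsAsFactors = FALSE)

# apply() works on as.matrix(df), so every row arrives as character
row.cl <- apply(df, 1, class)

# sapply() simplifies to one vector, so the number becomes a string
s.out <- sapply(df, function(x) x[1])

# lapply() returns a list, so each element keeps its own class
l.out <- lapply(df, function(x) x[1])
```

Here is.character(s.out) is TRUE, whereas l.out$n is still numeric.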

Another important focus of this chapter was on date (and time) objects. Although Date and POSIXct objects are implemented inside R in a numeric fashion, they are not quite numeric items in the usual ways. Similarly, while the POSIXlt object has some numeric features, it is best thought of as a list. So we deferred description of these objects until this chapter. Date and time data take up a lot of energy in the data cleaning process because the number of formats is large and variable and because of complications such as time zones and date arithmetic.
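The numeric and list-like natures of these objects can be seen in a short sketch. The particular dates are arbitrary, and the time zone is fixed at UTC so that the results do not depend on the local setting.

```r
d <- as.Date("1970-01-10")
as.numeric(d)                        # 9: days since 1970-01-01

ct <- as.POSIXct("2016-03-15 12:30:00", tz = "UTC")
as.numeric(ct)                       # seconds since 1970-01-01 00:00:00 UTC

lt <- as.POSIXlt(ct, tz = "UTC")
is.list(unclass(lt))                 # TRUE: a POSIXlt is built on a list
lt$hour                              # 12
lt$year + 1900                       # years are stored as offsets from 1900
lt$mon + 1                           # months are stored as 0 through 11

# Date arithmetic returns a difftime, not a plain number
gap <- as.Date("2016-03-01") - as.Date("2016-02-01")
as.numeric(gap)                      # 29, since 2016 was a leap year
```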
