Chapter 2
R Data, Part 1: Vectors

The basic unit of computation in R is the vector. A vector is a set of one or more basic objects of the same kind. (Actually, it is even possible to have a vector with no objects in it, as we will see, and this happens sometimes.) Each of the entries in a vector is called an element. In this chapter, we talk about the different sorts of vectors that you can have in R. Then, we describe the very important topic of subsetting, which is our word for extracting pieces of vectors – all of the elements that are greater than 10, for example. That topic goes together with assigning, or replacing, certain elements of a vector. We describe the way missing values are handled in R; this topic arises in almost every data cleaning problem. The rest of the chapter gives some tools that are useful when handling vectors.

2.1 Vectors

By a “basic” object, we mean an object of one of R's so-called “atomic” classes. These classes, which you can find in help(vector), are logical (values TRUE or FALSE, although T and F are provided as synonyms); integer; numeric (also called double); character, which refers to text; raw, which can hold binary data; and complex. Some of these, such as complex, probably won't arise in data cleaning.

2.1.1 Creating Vectors

We are mostly concerned with vectors that have been given to us as data. However, there are a number of situations when you will need to construct your own vectors. Of course, since a scalar is a vector of length 1, you can construct one directly, by typing its value:

> 5
[1] 5

R displays the [1] before the answer to show you that the 5 is the first element of the resulting vector. Here, of course, the resulting vector only had one entry, but R displays the [1] nonetheless. There is no such thing as a “scalar” in R; even $\pi$, represented in R by the built-in value pi, is a vector of length 1. To combine several items into a vector, use the c() function, which combines as many items as you need.

> c(1, 17)
[1] 1 17
> c(-1, pi, 17)
[1] -1.000000  3.141593 17.000000
> c(-1, pi, 1700000)
[1] -1.000000e+00  3.141593e+00  1.700000e+06

R has formatted the numbers in the vectors in a consistent way. In the second example, the number of digits of pi is what determines the formatting; see Section 1.3.3. In example three, the same number of digits is used, but the large number has caused R to use scientific notation. We discuss that in Section 4.2.2. Analogous formatting rules are applied to non-numeric vectors as well; this makes output much more readable. The c() function can also be used to combine vectors, as long as all the vectors are of the same sort.

Another vector-creation function is rep(), which repeats a value as many times as you need. For example, rep(3, 4) produces a vector of four 3s. In this example, we show some more of the abilities of rep().

> rep (c(2, 4), 3)              # repeat a vector
[1] 2 4 2 4 2 4
> rep (c("Yes", "No"), c(3, 1)) # repeat elements of vector
[1] "Yes" "Yes" "Yes" "No"
> rep (c("Yes", "No"), each = 8)
 [1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No"
[10] "No"  "No"  "No"  "No"  "No"  "No"  "No" 

The last two examples show rep() operating on a character vector. The final one shows how R displays longer vectors – by giving the number of the first element on each line. Here, for example, the [10] indicates that the first "No" on the second line is the 10th element of the vector.

2.1.2 Sequences

We also very often create vectors of sets of consecutive integers. For example, we might want the first 10 integers, so that we can get hold of the first 10 rows in a table. For that task we can use the colon operator, : . Actually, the colon operator doesn't have to be confined to integers; you can also use it to produce a sequence of non-integers that are one unit apart, as in the following example, but we haven't found that to be very useful.

> 1:5
[1] 1 2 3 4 5
> 6:-2
[1]  6  5  4  3  2  1  0 -1 -2 # Can go in reverse, by 1
> 2.3:5.9
[1] 2.3 3.3 4.3 5.3            # Permitted (but unusual)
> 3 + 2:7                      # Watch out here! This is 3 +
[1]  5  6  7  8  9 10          # (vector produced by 2:7)
> (3 + 2):7
[1] 5 6 7                      # This is 5:7

In that last pair of examples, we see that R evaluates the 2:7 operation before adding the 3. This is because : has a higher precedence in the order of operations than addition. The list of operators and their precedences can be found at ?Syntax, and precedence can always be over-ridden with parentheses, as in the example – but this is the only example of operator precedence that is likely to trip you up. Also notice that adding 3 to a vector adds 3 to each element of that vector; we talk more about vector operations in Section 2.1.4.

Finally, we sometimes need to create vectors whose entries differ by a number other than one. For that, we use seq(), a function that allows much finer control of starting points, ending points, lengths, and step sizes.
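
As a brief sketch of what seq() can do, the following calls use its by and length.out arguments. The particular values are our own illustration and do not come from any data set in this book.

```r
# Illustrative values only: by= sets the step size, length.out= the length.
evens   <- seq(2, 10, by = 2)          # 2 4 6 8 10
quarter <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
down    <- seq(10, 1, by = -3)         # 10 7 4 1 (a negative step counts down)
print(evens)
print(quarter)
print(down)
```

Giving by fixes the step and lets R work out the length; giving length.out does the reverse.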

2.1.3 Logical Vectors

We can create logical vectors using the c() function, but most often they are constructed by R in response to an operation on other vectors. We saw examples of operators back in Section 1.3.2; the R operators that perform comparisons are <, <=, >, >=, == (for “is equal to”) and != (for “not equal to”). In this example, we do some simple comparisons on a short vector.

> 101:105 >= 102                # Which elements are >= 102?
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> 101:105 == 104                # Which equal (==) 104?
[1] FALSE FALSE FALSE  TRUE FALSE

Of course, when you compare two floating-point numbers for equality, you can get unexpected results. In this example, we compute 1 - 1/46 * 46, which is zero; 1 - 1/47 * 47, and so on up through 50. We have seen this example before!

> 1 - 1/46:50 * 46:50 == 0
[1]  TRUE  TRUE  TRUE FALSE  TRUE

We noted earlier that R provides T and F as synonyms for TRUE and FALSE. We sometimes use these synonyms in the book. However, it is best to beware of using these shortened forms in code. It is possible to create objects named T or F, which might interfere with their usage as logical values. In contrast, the full names TRUE and FALSE are reserved words in R. This means that you cannot directly assign one of these names to an object and, therefore, that they are never ambiguous in code.
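
The hazard can be seen in a small sketch. The assignment to T here is deliberately bad practice, made up purely to show the effect.

```r
# T is an ordinary object name, so it can be reassigned (hypothetical example).
T <- 0
flag_short <- isTRUE(T)      # FALSE: T now holds the number 0
flag_full  <- isTRUE(TRUE)   # TRUE: the reserved word cannot be reassigned
rm(T)                        # remove our T so that R's built-in T is visible again
restored   <- isTRUE(T)      # TRUE once more
print(c(flag_short, flag_full, restored))
```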

The Number and Proportion of Elements That Meet a Criterion

One task that comes up a lot in data cleaning is to count the number (or proportion) of events that meet some criterion. We might want to know how many missing values there are in a vector, for example, or the proportion of elements that are less than 0.5. For these tasks, computing the sum() or mean() of a logical vector is an excellent approach. In our earlier example, we might have been interested in the number of elements that are $\geq 102$, or the proportion that are exactly 104.

> 101:105 >= 102
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> sum (101:105 >= 102)
[1] 4                            # Four elements are >= 102
> 101:105 == 104
[1] FALSE FALSE FALSE  TRUE FALSE
> mean (101:105 == 104)
[1] 0.2                          # 20% are == 104

It may be worth pondering this last example for a moment. We start with the logical vector that is the result of the comparison operator. In order to apply a mathematical function to that vector, R needs to convert the logical elements to numeric ones. FALSE values get turned into zeros and TRUE values into ones (we discuss conversion further in Section 2.2.3). Then, sum() adds up those 0s and 1s, producing the total number of 1s in the converted vector – that is, the number of TRUE values in the logical vector or the number of elements of the original vector that meet the criterion by being $\geq 102$. The mean() function adds up the same 0s and 1s and then divides that sum by the total number of elements; that operation produces the proportion of TRUE values in the logical vector, that is, the proportion of elements in the original vector that meet the criterion.
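
The same idea applies to the “less than 0.5” question mentioned earlier. In this sketch, the vector x holds invented values chosen only for illustration.

```r
# Hypothetical measurements, invented for illustration.
x <- c(0.2, 0.7, 0.4, 0.9, 0.1)
n_small    <- sum(x < 0.5)    # number of elements below 0.5
prop_small <- mean(x < 0.5)   # proportion of elements below 0.5
print(n_small)                # 3
print(prop_small)             # 0.6
```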

2.1.4 Vector Operations

Understanding how vectors work is crucial to using R properly and efficiently. Arithmetic operations on vectors produce vectors, which means you very often do not have to write an explicit loop to perform an operation on a vector. Suppose we have a vector of six integers, and we want to perform some operations on them. We can do this:

> 5:10
[1]  5  6  7  8  9 10
> (5:10) + 4
[1]  9 10 11 12 13 14
> (5:10)^2                      # Square each element;
[1]  25  36  49  64  81 100     # parentheses necessary

Just to repeat, arithmetic and most other mathematical operations operate on vectors and return vectors. So if you want the natural logarithm of every item in a vector named x, for example, you just enter log(x). If you want the square of the cosine of the logarithm of every element of x, you would use cos(log(x))^2, and so on. There are functions, such as length(), sum(), mean(), sd(), min(), and max(), that operate on a vector and produce a single number (which, to be sure, is also a vector in R). There are also functions such as range(), which returns a vector containing the smallest and largest values, and summary(), which returns a vector of summary statistics, but one of the sources of R's power is the ability to perform computations on every element of a vector at once.

In the last two examples above, we operated on a vector and a single number simultaneously. R handles this in the natural way: by repeating the 4 (in the first example) or the 2 (in the second) as many times as needed. R calls this recycling. In the following example, we see what R does in the case of operating on two vectors of the same length. The answer is, it performs the operation between the first elements of each vector, then the second elements, and so on. In the opening command, we have the usual assignment, using <-, and also an additional set of parentheses outside that command. These additional parentheses cause the result of the assignment to be printed. Without them, we would have created thing1, but its value would not have been displayed.

> (thing1 <- c(20, 15, 10, 5, 0)^2)
[1] 400 225 100  25   0
> (thing2 <- 105:101)
[1] 105 104 103 102 101
> thing2 + thing1
[1] 505 329 203 127 101
> thing2 / thing1
[1] 0.2625000 0.4622222 1.0300000 4.0800000       Inf

In the last lines, R computes the ratios element by element. The final ratio, 101/0, yields the result Inf, referring to an infinite value. We discuss Inf more in Section 2.4.4. The following example compares a function that returns a single, summary value to one that operates element by element.

> max (thing2, thing1)
[1] 400
> pmax (thing2, thing1)
[1] 400 225 103 102 101

The max() function produces the largest value anywhere in any of its arguments – in this case, the 400 from the first element of thing1. The pmax() (“parallel maximum”) function finds the larger of the first element of the two vectors, and the larger of the second element of the two vectors, and so on.

Two logical vectors can also be combined element by element, using the | logical operator for “or” (i.e., returning TRUE if either element is TRUE) and the & operator for “and” (i.e., returning TRUE only if both elements are TRUE). These operators differ in a subtle way from their doubled versions || and &&. The single versions evaluate the condition for every pair of elements from both vectors, whereas the doubled versions evaluate multiple TRUE/FALSE conditions from left to right, stopping as soon as possible. These doubled versions are most useful in, for example, if() statements.
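
A minimal sketch of the difference follows; the vectors p, q, and v are hypothetical and exist only for this illustration.

```r
# Hypothetical logical vectors for illustration.
p <- c(TRUE, TRUE, FALSE, FALSE)
q <- c(TRUE, FALSE, TRUE, FALSE)
either <- p | q      # element by element: TRUE TRUE TRUE FALSE
both   <- p & q      # element by element: TRUE FALSE FALSE FALSE

# && stops as soon as the answer is known. Here the first condition is FALSE,
# so the second condition (which involves a missing element) is never evaluated.
v <- numeric(0)
safe <- length(v) > 0 && v[1] > 10   # FALSE
print(either)
print(both)
print(safe)
```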

Recycling

There can be a complication, though: what if two vectors being operated on are not of the same length?

> 5:10 + c(0, 10, 100, 1000, 10000, 100000)  # Two 6-vectors
[1]      5     16    107   1008  10009 100010 # Add by element
> 5:10 + c(1, 10, 100)         # A 6-vector and a 3-vector
[1]   6  16 107   9  19 110    # The 3-vector is replicated
> 5:10 + 3:7                   # A 6-vector and a 5-vector
[1]  8 10 12 14 16 13          # 5+3, 6+4, ..., 9+7, 10+3
Warning message:
In 5:10 + 3:7 :
  longer object length is not a multiple of shorter length

It is important to understand these last two examples because the problem of mismatched vector lengths arises often. In the first of the two examples, the 3-vector (1, 10, 100) was added to the first three elements of the 6-vector, and then added again to the second three elements. Once again R is recycling. No warning was issued because 3 is a factor of 6, so the shorter vector was recycled an exact number of times. In the final example, the 5-vector was added to the first five elements of the 6-vector. In order to finish the addition, R recycled the first element of the 5-vector, the value 3. That value was added to the last entry of the 6-vector, 10, to produce the final element of the result, 13. The recycling only used part of the 5-vector; since 5 is not a factor of 6, a warning was issued.

Recycling a vector of length 1, as we did when we computed (5:10) + 4, is very common. Recycling vectors of other lengths is rarer, and we suggest you avoid it unless you are certain you know what you are doing. When you see the longer object length... warning as we did in the last example, we recommend you treat that as an error and get to the root of that problem.
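
One way to act on that recommendation is to check lengths before operating. The helper add_checked() below is hypothetical (it is not part of R); it is only a sketch of the idea, permitting equal lengths or a single recycled value and refusing anything else.

```r
# add_checked() is a hypothetical helper, not part of R.
add_checked <- function(x, y) {
  # Allow equal lengths, or recycling of a single value; refuse anything else.
  if (length(x) != length(y) && length(x) != 1 && length(y) != 1) {
    stop("lengths differ: ", length(x), " versus ", length(y))
  }
  x + y
}

ok  <- add_checked(5:10, 4)               # 9 10 11 12 13 14
bad <- tryCatch(add_checked(5:10, 3:7),   # lengths 6 and 5: refused
                error = function(e) conditionMessage(e))
print(ok)
print(bad)                                # "lengths differ: 6 versus 5"
```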

Tools for Handling Character Vectors

Almost every data cleaning problem requires some handling of characters. Either the data contains characters to start with – maybe names and addresses, or dates, or fields that indicate sex, for example – or we will need to construct some (perhaps turning sexes labeled 1 or 2 into M and F). We also often need to search through character strings to find ones that match a particular pattern; remove commas or currency signs that have been put into formatted numbers (such as “$2500.00”); or discretize a numeric variable into a smaller number of groups (such as turning an Age field into levels Child, Teen, Adult, Senior). Character data is so important, and so common, that we have devoted an entire chapter (Chapter 4) to special techniques for handling it.

2.1.5 Names

A vector may have names, a vector of character strings that act to identify the individual entries. It is possible to add names to a vector, and in this section we give examples of that. More commonly, though, R adds names to a table when you tabulate a vector using the table() function. We will have more to say about table(), and the names it produces, in Section 2.5. In the meantime, here is a simple example of a vector with names. Notice that the third name has an embedded space. This name is not “syntactically valid” according to R's rules. A syntactically valid name has only letters, numbers, dots, and underscores and starts either with a letter or a dot and then a non-numeric character. It is usually a bad practice to have a vector's names be invalid, but, as we show in the following example, it is possible. See Section 3.4.2 for information on how to ensure that your names are valid.

> vec <- c(101, 102, 103)
> names(vec)
NULL
> names(vec) <- c("a", "b", "Long name")
> names(vec)
[1] "a"         "b"         "Long name"

After the second line, R returned the special value NULL to indicate that the vector had no names. (We talk more about NULL in Section 2.4.5.) The names() function then assigned names to the elements of the vector. We can also assign names directly in the c() function, as in this example.

> c(a = 101, b = 102, Long.name = 103)
        a         b Long.name
      101       102       103 

In this case, we used a syntactically valid name; an invalid one would have had to be enclosed in quotation marks.
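
For comparison, here is a sketch of the quoted form, using the same values as above. We also show make.names(), one base R function that converts names into syntactically valid ones; whether it is the right tool for your data is a judgment call.

```r
# Quoting lets c() accept a name that is not syntactically valid.
vec2 <- c(a = 101, b = 102, "Long name" = 103)
print(names(vec2))               # "a" "b" "Long name"

# make.names() replaces the embedded space with a dot.
valid <- make.names(names(vec2))
print(valid)                     # "a" "b" "Long.name"
```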

2.2 Data Types

The three data types we have mentioned so far – numeric, logical, and character – are the ones we most often use. R does support several other data types. In this section, we mention these data types briefly, and then discuss the important topic of converting data from one type to another. Sometimes this is an operation we do explicitly and intentionally; other times R performs the conversion automatically.

2.2.1 Some Less-Common Data Types

Integers

R can store as integers the values between $-(2^{31} - 1)$ and $2^{31} - 1$. (The latter number is 2,147,483,647.) Values outside this range may be displayed as if they were integers, but they will be stored as doubles. When doing calculations, R automatically converts values that are too big to be integers into doubles, so the only time integer storage will matter is if you explicitly convert a really large value into an integer (see Section 2.2.3). If you need R to regard an item as an integer for some reason, you can append L on its end. So, for example, 123 is numeric but 123L is regarded as an integer value. Of course, it only makes sense to add L to a thing that really is an integer.
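
The following sketch checks these statements with typeof() and the built-in constant .Machine$integer.max.

```r
type_plain  <- typeof(123)            # "double"
type_suffix <- typeof(123L)           # "integer"
biggest     <- .Machine$integer.max   # 2147483647, the largest integer
# Adding 1 (a double) to the largest integer produces a double.
type_sum    <- typeof(biggest + 1)    # "double"
print(c(type_plain, type_suffix, type_sum))
print(biggest)
```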

Raw

“Raw” refers to data kept in binary (hexadecimal) form. This is the format that data from images, sound, or video will take in R. We rarely need to handle that kind of file in a data cleaning problem. However, we do sometimes resort to using raw data when a file has unexpected characters in it, or at the beginning of an analysis when we do not know what sort of data a file might have. In that case, the data will be read into R and held as a vector of class raw. A raw vector is a string of bytes represented in hexadecimal form. It can be converted into character data (when that makes sense) with the rawToChar() function. We talk more about reading raw data, particularly to handle the case of unexpected characters, in Section 6.2.5.
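
As a small sketch of the round trip between character and raw data, we can use charToRaw(), the base R counterpart of rawToChar(). The text "R data" is an arbitrary example string.

```r
bytes <- charToRaw("R data")   # one byte per character for this ASCII text
print(bytes)                   # 52 20 64 61 74 61
print(class(bytes))            # "raw"
text  <- rawToChar(bytes)      # back to the character string "R data"
print(text)
```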

Complex Numbers

R has the ability to manipulate complex numbers (numbers such as $a + bi$, where $i$ is $\sqrt{-1}$). Since complex numbers almost never arise in data cleaning, we will not discuss them in this book.

2.2.2 What Type of Vector Is This?

You can usually tell what sort of vector you have by looking at a few of its entries. Character data has entries surrounded by quotes; numeric entries have no quotes; and logical entries are either TRUE or FALSE. So, for example, the value "TRUE", with quotation marks, can only belong to a character vector. There are also several functions in R that tell you explicitly what sort of thing you have. Two of these functions, mode() and typeof(), tell you the basic type of vector. They are essentially identical for our purposes, except that typeof() differentiates between integer and double, whereas mode() calls them both numeric. The str() function (for “structure”) not only tells you the type of vector but also shows you the first few entries. A related function, class(), is a more general operator for complex types.

A second group of functions gives a TRUE/FALSE answer as to whether a specific vector has a specific mode. These functions are named is.logical(), is.integer(), is.numeric(), and is.character(), and each returns a single logical value describing the type of the vector. A more general version, is(), lets you specify the class as an argument: so is.numeric(pi) is identical to is(pi, "numeric"). This more general form is particularly useful when testing for more complicated, possibly user-defined classes.
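
Here is a short sketch applying these functions to a small integer vector of our own choosing. We load the methods package explicitly because is() lives there; in an interactive session it is normally already attached.

```r
library(methods)                 # provides is()

n <- 1:3                         # an integer vector
types <- c(mode = mode(n), typeof = typeof(n), class = class(n))
print(types)                     # mode "numeric", typeof "integer", class "integer"
str(n)                           #  int [1:3] 1 2 3

checks <- c(is.integer(n), is.numeric(n), is.character(n))
print(checks)                    # TRUE TRUE FALSE
print(is(pi, "numeric"))         # TRUE
```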

2.2.3 Converting from One Type to Another

It is important to remember that a vector can contain elements of only one type. When types are mixed – for example, if you inadvertently insert a character element into a numeric vector – R modifies the entire vector to be of the more complicated type. Here is an example:

> c(1, 4, 7, 2, 5)      # Create numeric vector
[1] 1 4 7 2 5
> c(1, 4, 7, 2, 5, "3") # What if one element is character?
[1] "1" "4" "7" "2" "5" "3"

In this example, the entire vector got converted to character. The rule is that R will convert every element of a vector to the “most complicated” type of any of the elements. Among the types we use most often, logical is the least complicated, followed by integer, numeric (double), complex, and then character. (Raw vectors behave a little differently from the others; when vectors are combined with c(), raw ranks below logical. See Section 6.2.5.)

It is important to know what values the less complicated types get when they are converted to more complicated ones. Logical elements that are converted into numeric become 0 where they have the value FALSE and 1 where they are TRUE. A logical converted into a character, however, gets values "FALSE" and "TRUE". A number gets converted into a high-accuracy text representation of itself, as we see in these examples.

> 1/7
[1] 0.1428571           # by default, 7 digits are displayed
> c(1/7, "a")
[1] "0.142857142857143" "a"

One instance where R frequently performs conversions automatically is from integer to numeric types.
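
A quick sketch with typeof() shows when this happens. The value 7L is arbitrary.

```r
count <- 7L
t1 <- typeof(count)         # "integer"
t2 <- typeof(count * 2L)    # "integer": both operands are integers
t3 <- typeof(count * 2)     # "double": 2 is numeric, so the result is too
t4 <- typeof(count / 1L)    # "double": division always produces a double
print(c(t1, t2, t3, t4))
```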

Conversion Functions

R will convert less complicated types into more complicated ones where required. Sometimes you need to force the elements of a vector back into a less complicated representation. Just as there are functions whose names start with is. for testing the type of an object, there is a set of as. functions for converting from one type to another. The rules are these: a character will be successfully converted to a numeric if it has the syntax of a number. It may have leading or trailing spaces (or new-lines or tabs), but no embedded ones; it may have leading zeros; it may have a decimal point (but only one); it may not have embedded commas; it may have a leading minus or plus sign, and, if it is in scientific notation, the exponent character E may be in upper- or lower-case and may also be followed by a minus or plus sign. In this example, we show some character strings that do and do not get converted to numbers. Notice that the elements of the vector that do not get converted turn into missing values (NA). We discuss missing values in Section 2.4.

> as.numeric (c(" 123.5  ", "-123e-2", "4,355", "45. 6",
                "$23", "75%"))
[1] 123.50  -1.23     NA     NA     NA     NA
Warning message:
NAs introduced by coercion 

In this case, the first two elements were successfully converted. The third has a comma, the fourth has an embedded space, and the last two have non-numeric characters. In order to convert strings such as those into numbers, you would have to remove the offending characters. We describe how to manipulate text in Chapter 4.

The warning message you see here is a very common one. Unlike most warning messages, this one will often arise naturally in the course of data cleaning – but make sure you understand exactly where it's coming from.

The only character values that can be successfully converted into logical are "T", "TRUE", "True", and "true" and "F", "FALSE", "False", and "false". In this case, no extraneous spaces are permitted. All other character values are converted into NAs.
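
The following sketch tries a few strings of our own invention, including one with a leading space and one with unusual capitalization.

```r
strings <- c("TRUE", "true", "T", "False", "F", " TRUE", "yes", "tRUE")
converted <- as.logical(strings)
print(converted)   # TRUE TRUE TRUE FALSE FALSE NA NA NA
```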

The rule is simple for converting numeric values into logical ones. Numeric values that are zero become FALSE; all other numbers become TRUE. The only issue is that sometimes numbers you expect to be zero aren't quite because of floating-point error. In this example, we convert some numbers and expressions to logical.

> as.logical (c(123, 5 - 5, 1e-300, 1e-400, 1 - 1/49 * 49))
[1]  TRUE FALSE  TRUE FALSE  TRUE

The first element here is clearly non-zero, so it gets converted to TRUE. The second evaluates to exactly zero and produces FALSE. The third is non-zero, but the fourth counts as zero since it is outside the range of double precision (see Section 1.3.3). The last element is our running example of an expression that “should” be zero but is not (again, see Section 1.3.3). Since it is not zero, it gets converted to TRUE. Numeric, non-missing values never produce NA when converted to logical.
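
When a value that “should” be zero might be off by floating-point error, one common workaround is to compare its absolute value to a small tolerance instead of converting it directly. This is only a sketch, and the tolerance 1e-8 is an arbitrary choice of ours that you would adjust to your data.

```r
x <- 1 - 1/49 * 49            # "should" be zero but is not exactly
exact  <- x == 0              # FALSE
near   <- abs(x) < 1e-8       # TRUE; the tolerance 1e-8 is an arbitrary choice
as_lgl <- as.logical(x)       # TRUE, because x is not exactly zero
print(c(exact, near, as_lgl))
```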

2.3 Subsets of Vectors

We very often need to pull out just a piece of a vector. This is called subsetting or extracting. In most cases, where we extract a subset, we can use a similar expression to replace (or assign) new values to a subset of the elements in a vector. Knowing how to do this is crucial to data cleaning in R; you cannot work efficiently in R without understanding this material.

2.3.1 Extracting

We constantly perform this operation in one form or another when cleaning data: we look at subsets of rows or columns, we examine a vector for anomalous entries, we extract all the elements of one vector for which another has a specific value, and so on. There are three methods by which we can extract a subset of a vector. First, we can use a numeric vector to specify which elements to extract. This numeric vector is an example of a “subscript” and its entries are called “indices.” Second, we can use a logical subscript; and, third, we can extract elements using their names.

Numeric Subscripts

The most basic way to extract a piece of a vector is to use a numeric subscript inside square brackets. For example, if you have a vector named a, the command a[1] will extract the first element of a. The result of that command is a vector of length 1, of the same mode as the original a. The command a[2:5] will produce a vector of length 4, with the second through fifth elements of a. If you ask for elements that aren't there – if, for example, a only had three elements – then R will fill up the missing spots with missing (NA) values. We discuss those further in Section 2.4. In this example, we have a vector a containing the numbers from 101 to 105.

> (a <- 101:105)
[1] 101 102 103 104 105
> a[3]
[1] 103
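
To see the missing values that appear when we ask for elements that aren't there, consider this sketch with a hypothetical three-element vector named short.

```r
short <- c(101, 102, 103)     # a hypothetical three-element vector
piece <- short[2:5]           # positions 4 and 5 do not exist
print(piece)                  # 102 103  NA  NA
```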

It's possible to pull out elements in any order, just by preparing the subscript properly. You can even use a numeric expression to compute your subscript, but only do this if you're sure your expression is an integer. If the result of your expression isn't an integer, even if it misses by just a tiny bit, you will get something you might not expect.

> a[c(4, 2)]
[1] 104 102
> a[1+1]                # A simple expression; this works
[1] 102
> a[2.999999999999999]  # This is truncated to 2, but...
[1] 102
> a[2.9999999999999999] # exactly 3 in double-precision.
[1] 103
> a[49 * (1/49)]        # This index gets truncated to zero;
integer(0)              # R produces a vector of length zero

There are two kinds of special values in numeric subscripts: negative values and zeros. Negative values tell R to omit those values, instead of extracting them – so a[-1], for example, returns everything except the first element of a. You can have more than one negative number in your subscript, but you cannot mix positive and negative numbers, and that makes sense. (For example, in the expression a[c(-1, 3)], should the second element be returned or not?)

Zeros are another special value in a subscript. They are simply ignored by R. Zeros appear primarily as a result of the match() function; you will rarely use them intentionally yourself. Knowing that zeros are permitted helps make sense of the error message in the following example, though.

> a[-2]             # Omit element 2
[1] 101 103 104 105
> a[c(-1, 3)]       # Illegal
Error in a[c(-1, 3)] : only 0's may be mixed
    with negative subscripts
> a[-1:2]   # Illegal, because -1:2 evaluates to -1, 0, 1, 2
Error in a[-1:2] : only 0's may be mixed
    with negative subscripts
> a[-(1:2)] # -(1:2) is (-1, -2): omit elements 1 and 2.
[1] 103 104 105

Logical Subscripts

Logical subscripts are also very powerful. A logical subscript is a logical vector of the same length as the thing being extracted from. Values in the original vector that line up with TRUE elements of the subscript are returned; those that line up with FALSE are not.

We almost never construct the logical subscript directly, using c(). Instead it is almost always the result of a comparison operation. In this example, we start with a vector of people's ages, and extract just the ones that are > 60.

> age <- c(53, 26, 81, 18, 63, 34)
> age > 60
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
> age[age > 60]
[1] 81 63

The age > 60 vector has one entry for each element of age, so it is easy to use it to extract the values of age that are $> 60$. But the power of logical subscripting goes well beyond that. Imagine that we also knew the names of each of the people. Here we show how to extract the names just for the people whose ages are $> 60$.

> people <- c("Ahmed", "Mary", "Lee", "Alex", "John", "Viv")
> age > 60            # Just as a reminder
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
> people[age > 60]    # Return name where (age > 60) is TRUE
[1] "Lee"  "John"

This particular manipulation – extracting a subset of one vector based on values in another – is something we do in every data cleaning problem. It is important to be sure that you know exactly how it works.

One case where results might be unexpected is when you inadvertently cause a logical subscript to be converted to a numeric one. In the example above, suppose we had saved the logical vector as a new R object called age.gt.60. In the following example, we show what happens if R is allowed to convert that logical vector into a numeric one.

> age.gt.60  <- age > 60
> people[age.gt.60]
[1] "Lee"  "John" # as expected
> people[0 + age.gt.60]
[1] "Ahmed" "Ahmed"
> people[-age.gt.60]
[1] "Mary"   "Lee"    "Alex"   "John"   "Viv"

In the 0 + age.gt.60 example, R has to convert the logical subscript to numeric in order to perform the addition. After the addition, then, the subscript has the values 0 0 1 0 1 0, and the extraction produces the first element of the vector two times, ignoring the zeros. In the last example, the negative sign once again causes R to convert the logical subscript to numeric; after the application of the sign operator the subscript has the values 0 0 -1 0 -1 0. The extraction drops the first element (because of the -1 values) and the rest are returned. This is a mistake we sometimes make with a logical subscript – in this example, we probably intended to enter people[!age.gt.60], with the ! operator, in order to return people whose ages are not greater than 60.

When using a logical subscript, it is possible for the two vectors – the data and the subscript – to be of different lengths. In that case R recycles the shorter one, as described in Section 2.1.4. This might be useful if, say, you wanted to keep every third element of your original vector, but in general we recommend that your logical subscript be the same length as the original vector.
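
The “every third element” case might look like the following sketch, where x is an arbitrary nine-element vector.

```r
x <- 101:109
every_third <- x[c(FALSE, FALSE, TRUE)]   # the subscript is recycled three times
print(every_third)                        # 103 106 109
```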

The which() function can be used to convert a logical vector into a numeric one. It returns the indices (i.e., the position numbers) of the elements that are TRUE. So this is particularly useful when trying to find one or two anomalous entries in a long vector of logical values. To find the locations of the minimum value in a vector y, you can use which(y == min(y)), but the act of finding the index specifically of the minimum or maximum value is so common that there are dedicated functions, called which.min() and which.max(), for this task. There is one difference, though: these two functions break ties by selecting the first index for which y is at its maximum or minimum, whereas which() returns all the matching indices.
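
The difference in how ties are handled shows up in this sketch, where the hypothetical vector y has its minimum value in two places.

```r
y <- c(7, 2, 9, 2, 5)             # hypothetical data with a tied minimum
all_min   <- which(y == min(y))   # 2 4: every position of the minimum
first_min <- which.min(y)         # 2: only the first position
first_max <- which.max(y)         # 3
print(all_min)
print(first_min)
print(first_max)
```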

Using Names

The third kind of subscripting is to use a vector's names. Since names are characters, a name subscript will need to be a character as well. Here is a named vector, together with an example of subscripting by name.

> (vec <- c(a = 12, b = 34, c = -1))
 a  b  c
12 34 -1
> vec["b"]
 b
34
> vec[names (vec) != "a"]
 b  c
34 -1 

To show all the values except the one named a, it is tempting to try something like vec[-"a"]. However, R tries to compute the value of “negative a,” fails, and produces an error. The final example above shows one way to exclude the element with a particular name from being extracted.
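
As a sketch of both points, the code below traps the error from the tempting form and then shows an alternative of our own that uses the base R function setdiff() to drop a name.

```r
vec <- c(a = 12, b = 34, c = -1)

# The tempting form fails; tryCatch() turns the error into a marker string.
attempt <- tryCatch(vec[-"a"], error = function(e) "error")
print(attempt)                             # "error"

# An alternative: keep every name except "a".
kept <- vec[setdiff(names(vec), "a")]
print(kept)                                # elements b (34) and c (-1)
```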

Named vectors are not uncommon, but they do not come up very often in data cleaning. The real use of names will become clearer in Chapter 3, where we will encounter rectangular structures that have row names, column names, or, very often, both.

2.3.2 Vectors of Length 0

Any of these extraction methods can produce a vector of length 0, if no element meets the criterion. This happens particularly often when all of the elements of a logical subscript are FALSE. A vector of length 0 is displayed as integer(0), numeric(0), character(0), or logical(0). In this example, we show how such a vector might arise.

> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> a <- b[b < 99] # Reasonable, but no elements of b are < 99
> a
numeric(0)       # a has length 0

A zero-length vector cannot be used intelligently in arithmetic, and watch out: the sum() of a numeric or logical vector of length 0 is itself zero. If a zero-length vector is used as the condition in an if() statement, an error results. This is an error that arises in data cleaning, as in this example:

> sum (a)
[1] 0            # Possibly unexpected
> sum (a + 12345)
[1] 0            # Definitely unexpected
> if (a < 2) cat ("yes\n")
Error in if (a < 2) cat("yes\n") : argument is length zero

In the last example, we made use of the cat() function, which writes its arguments out to the screen, or, as R calls it, the console. The \n represents the new-line character, which returns the cursor to the left of the screen. When writing functions to do data cleaning (Chapter 5), we will need to check that the conditions being tested are not vectors of length 0.
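One way to guard against a zero-length condition, sketched here under the assumption that we want a fallback message rather than an error, is to test the length first. The variable msg is our own name, introduced only for illustration.

```r
b <- c(101, 102, 103, 104)
a <- b[b < 99]                  # numeric(0), as in the example above

# && evaluates its right-hand side only when the left-hand side is TRUE,
# so the comparison is never attempted on a zero-length vector
if (length(a) > 0 && all(a < 2)) {
    msg <- "yes"
} else {
    msg <- "no usable values"
}
cat(msg, "\n")
```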

2.3.3 Assigning or Replacing Elements of a Vector

Every operation that extracts some values can also be used to replace those values, simply using the extraction operation on the left side of an assignment. Of course, R will require that the resulting vector have all its entries of the same type. So, for example, a[2] <- 3 will replace the second entry of a with the value 3. If a is logical, this operation will force it to be numeric; if a is character, the second entry of a will be assigned the character value "3". Just as we can extract using logical subscripts or names, we can use those subscripting techniques for assignment as well. These examples show replacement with numeric and logical subscripts.

> (a <- c(101, 102, -99, 104, -99, 9106)) # last item should
[1]  101  102  -99  104  -99 9106         # have been 106
> a[6] <- 106               # numeric subscript
> a
[1] 101 102 -99 104 -99 106
> a[a < 0] <- 9999          # logical subscript
> a
[1]  101  102 9999  104 9999  106

As we mentioned, a logical subscript will almost always have the same length as the data vector on which it is operating. In the preceding example, the logical subscript a < 0 has the same length as a itself.
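The type conversion that can accompany replacement, described at the start of this section, might look like the following sketch. The vectors lg and ch are hypothetical.

```r
lg <- c(TRUE, FALSE, TRUE)     # hypothetical logical vector
lg[2] <- 3
lg                             # 1 3 1: the whole vector is now numeric

ch <- c("x", "y", "z")         # hypothetical character vector
ch[2] <- 3
ch                             # "x" "3" "z": the 3 was stored as "3"
```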

These examples show how names can be used to assign new values to the elements of a vector.

> b <- c("A", "missing", "C", "D")
> names (b) <- c("Red", "White", "Blue", "Green")
> b
      Red     White      Blue     Green
      "A" "missing"       "C"       "D"
> b["White"] <- "B"         # name subscript
> b
  Red White  Blue Green
  "A"   "B"   "C"   "D" 

It is also possible to assign to elements of a vector out past its end. This is one way to combine two vectors. Elements that are not assigned will be given the special NA value (see the following section). Another way to combine two vectors is with the c() command, but either way, if two vectors of different types are combined, R will need to convert them to the same type. In this example, we combine two vectors.

> a <- 101:103
> b <- c(7, 2, 1, 15)
> c(a, b)                    # Combine two vectors
[1] 101 102 103   7   2   1  15
> a                          # Unchanged; no assignment made
[1] 101 102 103
> a[4:7] <- b                # index non-existent values
> a
[1] 101 102 103   7   2   1  15
> b
[1]  7  2  1 15
> b[6] <- 22                 # index non-existent value
> b
[1]  7  2  1 15 NA 22        # b[5] filled in with NA

In the last example, b[6] was assigned, but no instruction was given about what to do with the newly created fifth element of b. R filled it in with the special missing value code, NA. The following section describes how NA values operate in R.

2.4 Missing Data (NA) and Other Special Values

In R, missing values are identified by NA (or, under some circumstances, by <NA>; see Sections 2.5 and 4.6). This is a special code; it is not the two capital letters N and A put together. Missing values are inevitable in real data, so it is important to know the effect they have on computations, and to have tools to identify them and replace them where necessary. In this section, we discuss NA values in vectors; subsequent chapters expand the discussion to describe the effect of NA values in other sorts of R objects.

Missing values arise in several ways. First, sometimes data is just missing – it would make sense for an observation to be present, but in fact it was lost or never recorded. Second, some observations are inherently missing. For example, a field named MortPayRat might contain the ratio of a customer's monthly home mortgage payment to her monthly income. Customers with no mortgage at all would presumably have no value for this field. An NA value would make more sense than a zero, which would suggest a mortgage payment of zero. Third, as we saw in the last section, missing values appear when we try to extract an item that was never present in a vector. For example, the built-in item letters is a vector containing the 26 lower-case letters of the English alphabet. The expression letters[27] will return an NA. Finally, we sometimes see other special values Inf or -Inf or NaN in response to certain computations, like trying to divide by zero. Those special values can often be treated as if they were NA values. We discuss these and a final special value, NULL, in this section.

Since all the elements of a vector must be of the same kind, there are actually several different kinds of NA. An NA in a logical vector is a little different from an NA in a numeric or character one. (There are actually objects named NA_real_, NA_integer_, and NA_character_, which make this explicit.) Normally, the difference will not matter, but there is one case where knowing about the types of NA can explain some behavior that both arises fairly often and also seems mysterious. We mention this in Section 2.4.3.

2.4.1 The Effect of NAs in Expressions

A general, if imprecise, rule about NA values is that any computation with an NA itself becomes an NA. If you add several numbers, one of which is an NA, the sum becomes NA. If you try to compute the range of a numeric with missing values, both the minimum and maximum are computed as NA. This makes sense when you think of an NA as an unknown that could take on any value. Basic mathematical computations for numeric vectors all allow you to specify the na.rm = TRUE argument, to compute the result after omitting missing values.
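A short sketch of this rule, using a hypothetical vector v with one missing value:

```r
v <- c(3, NA, 8)           # hypothetical vector with one NA
sum(v)                     # NA
range(v)                   # NA NA
sum(v, na.rm = TRUE)       # 11
range(v, na.rm = TRUE)     # 3 8
```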

2.4.2 Identifying and Removing or Replacing NAs

In every data cleaning problem we need to determine whether there are NA values. What you cannot do to identify missing values is to compare them directly to the value NA. Just as adding an NA to something produces an NA, comparing an NA to something produces NA. So if a variable thing has value 3, the expression thing == NA produces NA, and if thing has value NA, the expression thing == NA also produces NA. To determine whether any of your values are missing, use the anyNA() function. This operates on a vector and returns a logical, which is TRUE if any value in the vector is NA. More useful, perhaps, is the is.na() function: if we have a vector named vec, a call to is.na(vec) returns a vector of logicals, one for each element in vec, giving TRUE for the elements that are NA and FALSE for those that are not. We can also use which(is.na(vec)) to find the numeric indices of the missing elements. Here, we show an example of a vector with NA values and some examples of what operations can, and cannot, be performed on them.

> (nax <- c(101, 102, NA, 104))
[1] 101 102  NA 104
> nax * 2                  # Arithmetic on NAs gives NAs...
[1] 202 204  NA 208
> nax >= 102               # ...as do comparisons
[1] FALSE  TRUE    NA  TRUE
> mean (nax)               # One NA affects the computation
[1] NA
> mean (nax, na.rm = TRUE) # na.rm = TRUE excludes NAs
[1] 102.3333
> is.na (nax)              # Locate NAs with logical vector
[1] FALSE FALSE  TRUE FALSE
> which (is.na (nax))      # Numeric indices of NAs
[1] 3
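The pitfall of comparing directly to NA, and the anyNA() alternative, can be sketched with the same vector. The last line, which counts the missing values by summing the logical vector, is a common idiom that we add here for illustration.

```r
nax <- c(101, 102, NA, 104)
cmp <- nax == NA       # NA NA NA NA: the comparison identifies nothing
any_missing <- anyNA(nax)       # TRUE
n_missing   <- sum(is.na(nax))  # 1: each TRUE counts as 1
```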

When your data has NA or other special values, you are faced with a decision about how to handle them. Generally they can be left alone, replaced, or removed. Removing missing values from a single vector is easy enough; the command vec[!is.na(vec)] will return the set of non-missing entries in vec. A more sophisticated alternative is the na.omit() function, which not only deletes the missing values but also keeps track of where in the vector they used to be. This information is stored in the vector's “attributes,” which are extra pieces of information attached to some R objects.

> nax[!is.na (nax)]      # Return the non-missing values
[1] 101 102 104
> (nay <- na.omit (nax)) # This keeps track of deleted ones
[1] 101 102 104
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"

Data cleaners will very often want to record information about the original location of discarded entries. In this example, these can be extracted with a command like attr(nay, "na.action").
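A minimal sketch of that extraction follows; the call to as.vector() is our own addition, used only to strip the class attribute so that the position prints as a plain number.

```r
nay <- na.omit(c(101, 102, NA, 104))
dropped <- attr(nay, "na.action")  # positions of the removed entries
as.vector(dropped)                 # 3
```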

Things get more complicated when the vector is one of many that need to be treated in parallel, perhaps because the vector is part of a more complicated structure like a matrix or data frame. Often if an entry is to be deleted, it needs to be deleted from all of these parallel items simultaneously. We talk more about these structures, and how to handle missing values in them, in Chapter 3. (We also note that most modeling functions in R have an argument called na.action that describes how that function should handle any NA values it encounters. This is outside our focus on data cleaning.)

2.4.3 Indexing with NAs

When an NA appears in an index, NA is produced, but the actual effect that R produces can be surprising. This arises often in data cleaning, since it is common to have a vector (usually fairly long and as part of a larger data set) with many NAs that you may not be aware of. Suppose we have a vector of data b and another vector of indices a, and we want to extract the set of elements of b for which a has the value 1, like this: b[a == 1]. The comparison a == 1 will return NA wherever a is missing, and b[NA] produces NA values. So the result is a vector with both the entries of b for which a == 1 and also one NA for every missing value in a. This is almost never what we want. If we want to extract the values of b for which a is both not missing and also equal to 1, we have to use the slightly clunky expression b[!is.na(a) & a == 1]. This example shows what this might look like in practice.

> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> (a <- c(1, 2, NA, 4))
[1]  1  2 NA  4
> b[!is.na (a) & a == 2] # We probably want this...
[1] 102
> b[a == 2]              # ...and not this.
[1] 102  NA

In the following example, we show how two commands that look alike are treated slightly differently by R.

> b[a[2]]                # a[2] = 2; extract element 2 of b
[1] 102                  # ... which is 102
> b[a[3]]                # a[3] is NA
[1] NA
> (a <- as.logical (a))  # Now convert a to logical
[1] TRUE TRUE   NA TRUE
> b[a[3]]                # a[3] is NA
[1] NA NA NA NA

In the first example of b[a[3]], the value in a[3] was a numeric NA, so R treated the subscripting operation as a numeric one. It returned only one value. In the second example, a[3] was a logical NA, and when R subscripts with a logical – even when that logical value is NA – it recycles the subscript to have the same length as the vector being subscripted (we saw this in Section 2.1.4).

The lesson here is that when you have an NA in a subscript, R may return something other than what you expect.
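One alternative to the explicit is.na() test, not used in the text above but a common idiom, is to wrap the comparison in which(). Since which() returns only the positions that are TRUE, the NA entries drop out. A minimal sketch:

```r
b <- c(101, 102, 103, 104)
a <- c(1, 2, NA, 4)
via_which <- b[which(a == 2)]        # 102: which() ignores the NA
explicit  <- b[!is.na(a) & a == 2]   # 102: the explicit form from the text
```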

2.4.4 NaN and Inf Values

A different kind of special value can arise when a computation is so big that it overflows the ability of the computer to express the result. Such a value is expressed in R as Inf or -Inf. On 64-bit machines Inf is a bit bigger than 1.797 × 10^308; it most often appears when a positive number is accidentally divided by zero. Inf values are not missing, and is.na(Inf) produces FALSE. Another special value is NaN, “not a number,” which is the result of certain specific computations such as 0/0 or Inf + -Inf or computing the mean of a vector of length 0. Unlike Inf, an NaN value is considered to be missing. As with NA values, Inf and NaN values take over every computation in which they are evaluated. There are rules for when more than one is present – for example, Inf + NA gives NA, but NaN + NA gives NaN. From a data cleaning perspective, all of these values cause trouble and you will generally want to identify any of these values early on. The function is.finite() is useful here; this produces TRUE for numbers that are neither NA nor NaN nor Inf nor -Inf. So in that sense it serves as a check on valid values. To see whether every element of a numeric vector vec consists of values that are not any of these special ones, use the command all(is.finite(vec)).
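The following sketch shows how these tests respond to each special value. The vector vec is hypothetical, and is.nan(), which is not discussed above, is the base R function that picks out NaN specifically.

```r
vec <- c(1, NA, 0/0, 1/0, -1/0)   # hypothetical: 1, NA, NaN, Inf, -Inf
is.na(vec)            # FALSE  TRUE  TRUE FALSE FALSE
is.nan(vec)           # FALSE FALSE  TRUE FALSE FALSE
is.finite(vec)        #  TRUE FALSE FALSE FALSE FALSE
all(is.finite(vec))   # FALSE
```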

2.4.5 NULL Values

A final sort of special value is the R value NULL. A NULL is an object with zero length, no contents and no class. (A vector of length 0 has no contents, but since it has a class – numeric, logical, or something else – it is not NULL.) In data cleaning, NULLs most often arise when attempting to access an element of a list, or a column of a data frame, which does not exist. We discuss this in Section 3.4.3. For the moment, the important point is that we can test for NULL values with the is.null() function, and that if you index using a NULL value the result will be a vector of length 0.
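These properties of NULL can be checked directly, as in this short sketch:

```r
b <- c(101, 102, 103, 104)
is.null(NULL)          # TRUE
length(NULL)           # 0
is.null(numeric(0))    # FALSE: a zero-length vector still has a class
b[NULL]                # numeric(0)
```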

2.5 The table() Function

The table() function is so important in data cleaning that it merits its own section. This command, as its name suggests, produces a table giving, for each of the unique values in its argument, the number of times that value appears. In this example, we will create a vector with some color names in it, and we will add in an NA as well.

> vec <- rep (c("red", "blue", NA, "green"), c(3, 2, 1, 4))
> vec
 [1] "red"   "red"   "red"   "blue"  "blue"  NA
 [7] "green" "green" "green" "green"
> table (vec)
vec
 blue green   red
    2     4     3 

There are a couple of things to notice here. First, the ordering of the results in the table is alphabetical, rather than being determined by the order the entries appear in the vector vec. Second, the resulting object is not quite a named vector, as you can see by the word vec that appears above the word blue. (We omit this line in many future displays to save space.) In fact, this object has class table, but it can be treated like a named vector – so, for example, table(vec)["green"] produces 4. Third, by default table omits NA as well as NaN values. In data cleaning this is almost never what we want. There are two different arguments to the table() function that serve to declare how you want missing values to be treated. The first of these is named useNA. This argument takes the character values "no" (meaning exclude NA values, which was the default as seen earlier), "ifany" (meaning to show an entry for NAs if there are any, but not if there aren't) and "always", meaning to show an entry for NAs whether there are any NA values or not. In our current example, where there is one NA, the table() command with useNA set to "ifany" or "always" will produce output like this:

> table (vec, useNA = "always")
 blue green   red  <NA>
    2     4     3     1 

Notice that R displays the entry for NA values as <NA>, with angle brackets. This makes it easier to use the characters "NA" as a regular character string, perhaps for “North America” or possibly “sodium.” (This angle bracket usage will appear again later.) R will not be confused if you have both NA values and also actual character strings with the angle brackets, such as "<NA>", but it is definitely a bad practice. To see what happens when there are no NAs, let us look at the same vector without its missing entry, which is number 6.

> table (vec[-6], useNA="ifany")
 blue green   red
    2     4     3
> table (vec[-6], useNA="always")
 blue green   red  <NA>
    2     4     3     0 

For data cleaning purposes, we almost always want to know about missing values, so we will almost always want useNA to be "ifany" or "always". The second missing-value argument, exclude, allows you to exclude specific values from the table. By default, exclude has the value c(NA, NaN), which is why those values do not appear in tables. Most commonly we set this value to NULL to signify that no entries should be excluded, although sometimes we exclude certain very common values. Here we might want to exclude the common value green while tabulating all other values, including NAs. The following example shows how we can do that. It also shows the use of exclude = NULL.

> table (vec, exclude="green")
blue  red <NA>
   2    3    1
> table (vec, exclude=NULL)
 blue green   red  <NA>
    2     4     3     1 

It is possible to supply both useNA and exclude at the same time, but the results may not be what you expect. We recommend using either useNA or exclude to display missing values in every table.

2.5.1 Two- and Higher-Way Tables

If we give two vectors of the same length to the table() function, the result is a two-way table, also called a cross-tabulation. For example, suppose we had a vector called years, one for each transaction in our data set, with values 2015, 2016, and 2017; and suppose we also had a vector called months, of the same length, with values such as "Jan", "Feb", and so on. Then table(years, months) would produce a 3 × 12 table of counts, with each cell in the table telling how many entries in the two vectors had the values for the cell. That is, the top-left cell would give the number of entries from January 2015; the one to the right of that would give the number of entries for February 2015; and so on. (If there are fewer than 12 months represented in the data, of course, there will be fewer than 12 columns in the table.) This is an important data cleaning task – to determine whether two variables are related in ways we expect. If, for example, we saw no transactions at all in March 2016, we would want to know why.
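A small sketch of such a cross-tabulation follows. The six transactions are made up for illustration, and only three months appear, so the table has three columns rather than 12. Because months is a character vector, its columns appear in alphabetical order rather than calendar order.

```r
years  <- c(2015, 2015, 2016, 2016, 2016, 2017)        # hypothetical
months <- c("Jan", "Feb", "Jan", "Jan", "Mar", "Feb")  # hypothetical
tab2 <- table(years, months)
tab2                   # 3 rows (years) by 3 columns (Feb, Jan, Mar)
tab2["2016", "Jan"]    # 2
tab2["2017", "Mar"]    # 0: no transactions recorded for March 2017
```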

In R, a two-way table is treated the same as a matrix; we discuss matrices in detail in the following chapter. For very large vectors, the data.table() function in the data.table package (Dowle et al., 2015) may prove more efficient than table(). Three- and higher-way tables are produced when the arguments to table() are three or more equal-length vectors. These tables are treated in R as arrays; we give an example in Section 3.2.7. The xtabs() function is also useful for creating more complex tables.

2.5.2 Operating on Elements of a Table

The table() command counts the number of observations that fall into a particular category. In the example above, the table(years, months) command produces a two-way table of counts. Often we want to know more than just how many observations fall into a cell. R has several special-purpose functions that operate on tables. The prop.table() function takes, as its first argument, the output from a call to table(), and depending on its second argument produces proportions of the total counts in the table by cell, or by row, or by column. In this example we set up three vectors, each of length 15. Then we show the effect of calling table(), and of calling prop.table() on the result. By default, prop.table() computes the proportions of observations in each cell of the table. In the final example, we use the second argument of 2 to compute the proportions within each column; supplying 1 would have produced the proportions within each row.

> yr <- rep (2015:2017, each=5)
> market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
              "b", "a", "b", "a", "b", "a")
> cost <- c(64, 87, 71, 79, 79, 91, 86, 92, NA,
            55, 37, 41, 60, 66, 82)
> (tab <- table (market, yr))
      yr
market 2015 2016 2017
     a    3    1    3
     b    2    4    2
> prop.table (tab)    # These proportions sum to 1
      yr
market       2015       2016       2017
     a 0.20000000 0.06666667 0.20000000
     b 0.13333333 0.26666667 0.13333333
> prop.table (tab, 2) # Each column's proportions sum to 1
      yr
market 2015 2016 2017
     a  0.6  0.2  0.6
     b  0.4  0.8  0.4

The margin.table() command produces the marginal totals from a table – that is, row or column totals (controlled by the second argument) for a two-way table, and corresponding sums for a higher-way one. The addmargins() function incorporates those totals into the table, producing a new row or column named Sum (or both). This is often a summary statistic we want, but watch out – the convention regarding the second argument of addmargins() is not the same as that of prop.table() and margin.table(). This example shows addmargins() in action.

> addmargins (tab)            # append row and column sums
      yr
market 2015 2016 2017 Sum
   a      3    1    3   7
   b      2    4    2   8
   Sum    5    5    5  15
> addmargins (tab, 2)         # append a Sum column (row totals)
      yr
market 2015 2016 2017 Sum
     a    3    1    3   7
     b    2    4    2   8
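The difference in conventions can be sketched by calling margin.table() and addmargins() with the same second argument. The data are the market and yr vectors from the earlier example.

```r
yr <- rep(2015:2017, each = 5)
market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
            "b", "a", "b", "a", "b", "a")
tab <- table(market, yr)
row_tot <- margin.table(tab, 1)  # totals for each row:    a = 7, b = 8
col_tot <- margin.table(tab, 2)  # totals for each column: 5 for every year
with_row <- addmargins(tab, 1)   # appends a Sum row holding the column totals
```

So a second argument of 1 gives row totals in margin.table(), but adds a new row of column totals in addmargins().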

We might also want to know the average, standard deviation, or maximum of entries in a numeric variable, broken down by which cell they fall into. In our example, we might want the maximum cost among the three observations from 2015 with market a, and for the two from 2015 and market b, and so on. For this purpose we use the tapply() function, whose name reminds us that it applies a function to a table. This function's arguments are the vector on which to do the computation (in our example, cost), an argument named INDEX describing the grouping (here, we might use the vector yr), and then the function to be applied. The following example shows tapply() at work. In the first line, we use the min() function to produce the minimum value for each year – but an NA is produced for 2016 since one cost for that year is NA. We can pass the na.rm = TRUE argument into tapply(), which then passes it into min() as in the following example, if we want to compute the minimum value among non-missing entries.

> tapply (cost, yr, min)       # find minimum within each yr
2015 2016 2017
  64   NA   37
> tapply (cost, yr, min, na.rm = TRUE)
2015 2016 2017
  64   55   37 

It is possible to extend this example to the two-way case of minimum cost, or another statistic, by both market and year. Here the tabularization part, represented by the argument INDEX, needs to be a list. We discuss lists starting in Section 3.3; for the moment, just know that a list is required when grouping with more than one vector. In the first example below, we compute the mean of the cost values for each combination of market and year (using na.rm = TRUE as above, and the list() function to construct the list). In the second example, we show how we can supply our own function “in line,” which makes it more transparent than if we had written a separate function. The details of writing functions are covered in Chapter 5, but here our function takes one argument, named x, and returns the value given by the sum of the squares of the entries of x. (In this example, we pass the na.rm = TRUE argument directly to sum() to keep our function simpler.) The tapply() function is in charge of calling our function six times, once for each cell of the table.

> tapply (cost, list (market, yr), mean, na.rm = TRUE)
      2015     2016     2017
a 76.66667 92.00000 59.66667
b 75.00000 77.33333 53.50000
> tapply (cost, list (market, yr),
                   function (x) sum (x^2, na.rm = TRUE))
   2015  2016  2017
a 17906  8464 11693
b 11282 18702  6037

2.6 Other Actions on Vectors

In this section, we describe additional actions on vectors that we find particularly important for data cleaning. This includes rounding numeric values, sorting, set operations, and the important topics of identifying duplicates and matching.

2.6.1 Rounding

R operates on numeric vectors using double-precision arithmetic, which means that often there are more significant digits available than are useful. Results will often need to be displayed with, say, two or three significant digits. The natural way to prepare displays like this is through formatting the numbers – that is, changing the way they display, but not their actual values. We discuss formatting in Section 1.2. But sometimes we want to change the numbers themselves, perhaps to force them to be integers or to have only a few significant digits. The round() function and its relatives do this. The round() function lets the user specify the number of digits to the right of the decimal point to be kept; the signif() function lets him or her specify the total number of significant digits retained. So round(123.4567, 3) produces 123.457, while signif(123.4567, 3) produces 123. A negative second argument rounds to the nearest power of 10, so round(123.4567, -1) rounds to the nearest 10 and produces 120, while round(123.4567, -2) rounds to the nearest 100 and produces 100. The trunc() function discards the part after the decimal point and produces an integer; floor() and ceiling() round to the next lower and next higher integer, respectively, so floor(-3.4) is -4 while trunc(-3.4) is -3. Rounding of problematic entries (like those that end in a 5) can be affected by floating-point error (see Section 1.3.3).
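The rounding functions can be compared side by side, as in this sketch. The final line is our own addition: it illustrates that R follows the IEC 60559 convention of rounding an exact half to the even digit, which is one reason entries ending in 5 can surprise.

```r
x <- 123.4567
round(x, 3)      # 123.457
signif(x, 3)     # 123
round(x, -1)     # 120
round(x, -2)     # 100
trunc(-3.4)      # -3
floor(-3.4)      # -4
ceiling(-3.4)    # -3
round(2.5)       # 2, not 3
```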

2.6.2 Sorting and Ordering

It is common to have to sort the elements of a vector, and the sort() function performs that task in R. By default, the sort is from smallest to largest, but the decreasing = TRUE argument will reverse the order. There are two minor complications. First, sort() will drop NA and NaN values by default, giving a vector shorter than the original when these values are present. This behavior is controlled by the na.last argument, which itself defaults to NA. If set to TRUE, this argument will have the sort() function place NA and NaN values at the end, and, if FALSE, at the beginning of the sorted output.

A second complication is in sorting character vectors. Sorting in this case is alphabetical, of course, so if the characters are text representations of numbers such as "1", "2", "5", "10", and "18", the resulting output, sorted alphabetically, will be "1", "10", "18", "2", and "5". Moreover, the sorting order depends on the character set and locale being used. We mentioned this in Section 1.4.6 and address it further in Section 1.5.
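Both complications can be seen in a short sketch. The vectors v and s are hypothetical, and the conversion with as.numeric() in the last line is one possible remedy when the text really does represent numbers.

```r
v <- c(10, NA, 2, 18)            # hypothetical numeric vector with an NA
sort(v)                          # 2 10 18: the NA is dropped
sort(v, na.last = TRUE)          # 2 10 18 NA
sort(v, na.last = FALSE)         # NA 2 10 18

s <- c("1", "2", "5", "10", "18")
sort(s)                          # "1" "10" "18" "2" "5"
sort(as.numeric(s))              # 1 2 5 10 18
```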

The related order() function returns a set of indices that you can use to sort a vector. This is useful when you want to re-arrange one vector's values in the order specified by a second vector. (If that sounds as if it wouldn't be a common task, wait until Section 3.5.4.) In this example, we have a vector of names, and a vector of scores, and we want the names in ascending order of score.

> nm <- c("Freehan", "Cash", "Horton",
    "Stanley", "Northrop", "Kaline")
> scores <- c(263, 263, 285, 259, 264, 287) # 2 tied at 263
> nm[order(scores)]              # ascending order of score
[1] "Stanley"  "Freehan"  "Cash"
[4] "Northrop" "Horton"   "Kaline"
> nm[order(scores, nm)]                 # tie broken by nm
[1] "Stanley"  "Cash"     "Freehan"     # (alphabetically)
[4] "Northrop" "Horton"   "Kaline"
> nm[order (scores, decreasing = TRUE)] # descending
[1] "Kaline"   "Horton"   "Northrop"
[4] "Freehan"  "Cash"     "Stanley"

As in the example, the order() function can be given more than one vector. In this case, the second vector is used to break ties in the first; if a third vector were supplied, it would be used to break any remaining ties, and so on. It is very common to re-order a set of data that has time indicators (month and year, maybe) from oldest to newest. The order() function has the same na.last argument that sort() has, although its default value is TRUE.
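Re-ordering by time indicators might look like the following sketch. All three vectors, and their names, are hypothetical; the months are stored as numbers so that they sort in calendar order.

```r
tx_year   <- c(2016, 2015, 2016, 2015)   # hypothetical years
tx_month  <- c(3, 11, 1, 2)              # hypothetical month numbers
tx_amount <- c(40, 10, 30, 20)           # hypothetical values to re-order
ord <- order(tx_year, tx_month)          # 4 2 3 1: oldest to newest
tx_amount[ord]                           # 20 10 30 40
```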

2.6.3 Vectors as Sets

Often we need to find the extent to which two vectors have values that overlap. For example, we might have customer data from two sources and we want to determine the extent to which the customer IDs agree; or we might want to find the set of states in which none of our customers reside. These call for techniques that treat vectors as sets and that will normally be most useful when the data is a small number of integers, character data, or factors, about which we say more in Section 1.6. They can be used with non-integer data as well, but as always we cannot rely on two floating-point numbers that we expect to be equal actually being equal.

The essential set membership operation is performed by the %in% function. R has a few functions with names like this, surrounded by percentage signs. This allows us to use a command like a %in% b, rather than the equivalent, but perhaps less transparent, is.element(a, b). The return value is a vector the same length as a, with a logical indicating whether each element of a is found anywhere in b. In data cleaning we very often tabulate the result of this function call; so a command like table(a %in% b) produces a table of FALSE and TRUE, giving the number of items in a that were not found in b, and the number that were. For this purpose, an NA value in a matches only an NA in b, and similarly an NaN value in a matches only an NaN value in b. In this example, we compare some alphanumeric characters to the built-in data set letters containing the 26 lower-case letters of the alphabet.

> c("g", "5", "b", "J", "!") %in% letters
[1]  TRUE FALSE  TRUE FALSE FALSE
> table (c("g", "5", "b", "J", "!") %in% letters)
FALSE  TRUE
    3     2 

The union(), intersect(), and setdiff() functions produce the union, intersection, and difference between two sets. This example shows those functions in action.

> union (c("g", "5", "b", "J", "!"),
                  letters)     # elements in either vector
  [1] "g" "5" "b" "J" "!" "a" "c" "d" "e" "f" "h" "i" "j"
 [14] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
 [27] "x" "y" "z"
> intersect (c("g", "5", "b", "J", "!"),
                  letters)     # elements in both vectors
[1] "g" "b"
> setdiff (c("g", "5", "b", "J", "!"),
                  letters)     # in first but not second
[1] "5" "J" "!"

2.6.4 Identifying Duplicates and Matching

Another data cleaning task is to find duplicates in vectors. The anyDuplicated() function tells you whether any of the elements of a vector are duplicates. The unique() function extracts only the set of distinct values (including, by default, NA and NaN). The distinct values appear in the output in the order in which they appear in the input; for data cleaning purposes we will often sort those unique values.

Often it will be important to know which elements are duplicates. The duplicated() function returns a logical vector with the value TRUE for the second and subsequent entries in a set of duplicates. However, the first entry in a set of duplicates is not indicated. For example, duplicated(c(1, 2, 1, 1)) returns FALSE FALSE TRUE TRUE; the first 1 is not considered duplicated under this definition. (Alternatively, the fromLast = TRUE argument reads from the end of the vector back to the beginning, but again the “first” member of a set of duplicates is not indicated.) Combining a call with fromLast = FALSE and one with fromLast = TRUE identifies all duplicates: either join the two logical vectors with the | operator, or apply union() to the indices that which() returns for each.
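One way to combine the two calls to duplicated() is sketched here with the | operator, which marks an element if either call flags it. The vector x is hypothetical.

```r
x <- c(1, 2, 1, 1, 3)             # hypothetical vector
anyDuplicated(x)                  # 3: index of the first repeat (0 if none)
unique(x)                         # 1 2 3
duplicated(x)                     # FALSE FALSE  TRUE  TRUE FALSE
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)
all_dups                          #  TRUE FALSE  TRUE  TRUE FALSE
which(all_dups)                   # 1 3 4
```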

A common task is to find all the entries that are duplicated anywhere in the data set (or that are never duplicated). One way to do this is via table(). Any value that appears more than once is, of course, duplicated (but remember that floating-point numbers might not match exactly). In this example, we construct a vector from the lower-case letters, but add a few duplicates.

> let <- c(letters, c("j", "j", "x"))
> (tab <- table (let))
let
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
> which (tab != 1) # table locations where duplicates appear
 j  x
10 24              # 10th & 24th table entries aren't ones
> names (tab)[tab != 1]
[1] "j" "x"

It is often useful to use table() twice in a row. This example counts the number of entries that appear once, twice, and so on in the original data.

> table (table (let))
 1  2  3
24  1  1           # 24 entries are 1, one is 2, one is 3

The last line shows that there are 24 entries in let that appear once; one entry, x, that appears twice; and one entry, j, that appears three times. We use this in almost every data cleaning problem to find entries that appear more often than we expect. In a real application, we might have tens of thousands of elements and only a few duplicates. The which(tab != 1) command shows us the elements that are duplicated, but not how many times each one appears; the table(table(let)) command shows us how many duplicates there are, but not which letter goes with which count.
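
When we want both pieces of information at once, one possible approach (a sketch, not the only way) is to subset the table itself, which keeps each duplicated value together with its count.

```r
let <- c(letters, c("j", "j", "x"))
tab <- table(let)

# Subsetting the table with a logical condition keeps names and counts
dups <- tab[tab > 1]
dups                    # j appears 3 times, x appears 2 times
names(dups)             # "j" "x"
```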

Another important task is matching, which is where we identify where, in a vector, we can find the values in another vector. We will find this particularly useful when merging data frames in Section 3.7.2. There are two ways to handle elements that do not match; they can be returned as NA, preserving the length of the original argument in the length of the return value, or, with the nomatch = 0 argument, they can be returned as 0, which allows the return value to be used as an index. In this example, we match two sets of names.

> nm <- c("Jensen", "Chang", "Johnson",
    "Lopez", "McNamara", "Reese")
> nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")
> match (nm, nm2)
[1]  4 NA NA  1 NA NA
> nm2[match (nm, nm2)]
[1] "Jensen" NA       NA       "Lopez"  NA       NA

The third command tells us that the first element of nm, which is Jensen, appears in position 4 of nm2; the second element of nm, Chang, does not appear in nm2, and so on. We can extract the elements that matched from the nm2 vector as in the last line – but the NA entries in the output of match() produce NAs in the vector of names. An easier approach is to supply the nomatch = 0 argument, as in this example.

> match (nm, nm2, nomatch = 0)
[1] 4 0 0 1 0 0
> nm2[match (nm, nm2, nomatch = 0)]
[1] "Jensen" "Lopez" 

We use match() (or its equivalent) in any data cleaning problem that requires combining two data sets. Understanding how match() works makes data cleaning easier. The match() function is, in fact, a more powerful version of %in%: where %in% reports only whether each value is found, match() also reports where.
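
The relationship between match() and %in% can be sketched as follows. The balance vector is hypothetical, invented here to show why knowing the position of a match is useful when combining data.

```r
nm  <- c("Jensen", "Chang", "Johnson", "Lopez", "McNamara", "Reese")
nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")

# %in% answers "is it there?"; match() with nomatch = 0 gives the same
# answer once we ask whether the returned position is positive
in_nm2    <- nm %in% nm2
via_match <- match(nm, nm2, nomatch = 0) > 0
identical(in_nm2, via_match)       # TRUE

# match() also says where, so we can look up values aligned with nm2
balance <- c(10, 20, 30, 40, 50)   # hypothetical: one balance per name in nm2
looked_up <- balance[match(nm, nm2)]
looked_up                          # 40 NA NA 10 NA NA
```

Here the NA entries are helpful: they keep the result the same length as nm, so unmatched names are visibly missing rather than silently dropped.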

2.6.5 Finding Runs of Duplicate Values

During a data cleaning problem, it often happens that a particular identifier – a name or account number, perhaps – appears many times in an input data set. As an example we might be given a list of payments, with each payment identified by a customer number and each customer contributing dozens of payments. It will be useful to count the number of times each repeated item appears. We also use this on logical vectors to find, for example, the locations and lengths of sets of payments that are equal to 0. The rle() function (the name stands for “run length encoding”) does exactly this: given a vector, it identifies the “runs” – that is, stretches of consecutive repeated values – and returns each run's value and length. In this example, we show what the output of the rle() function looks like.

> rle (c("a", "b", "b", "a", "c", "c", "c"))
Run Length Encoding
  lengths: int [1:4] 1 2 1 3
  values : chr [1:4] "a" "b" "a" "c"

This output shows that the vector starts with a run of length 1 (the first element in the lengths vector) with value a (the values vector); then a run of length 2 with value b; and so on. The output is actually returned in the form of a list with two parts named lengths and values; in Section 3.3, we discuss how to access the pieces of a list individually.
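
The logical-vector use mentioned above can be sketched like this. The pay vector is hypothetical, and the $ operator used to pull lengths and values out of the result is the list-access notation covered in Section 3.3.

```r
pay <- c(25, 0, 0, 0, 40, 0, 12, 0, 0)   # hypothetical payment amounts

r <- rle(pay == 0)            # runs of TRUE (zero) and FALSE (non-zero)

# The end of each run is the running total of the run lengths;
# the start is the end minus the length, plus one
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1

# Keep only the runs whose value is TRUE, that is, runs of zero payments
zero_start  <- run_start[r$values]    # 2 6 8
zero_length <- r$lengths[r$values]    # 3 1 2
```

So in this made-up data the zero payments begin at positions 2, 6, and 8, in runs of length 3, 1, and 2.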

2.7 Long Vectors and Big Data

Starting in version 3.0.0, R introduced something called a long vector, a special mechanism that allows vectors to be much longer than before. Since there are only 2^31 - 1 positive values of an integer, entries in a long vector beyond that point will have to be indexed by double indices. Other than that, this extension should, in principle, be invisible to users. One exception is that the match() function and its descendants, is.element() and %in%, do not work on long vectors. On long vectors, table() can be very slow and the data.table package provides some faster alternatives. R's documentation suggests avoiding the use of long vectors that are characters.
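
The integer limit behind this can be inspected directly. This small sketch does not build a long vector (that would need several gigabytes of memory); it only shows why indices past the limit have to be doubles.

```r
.Machine$integer.max                  # 2147483647, the largest integer
.Machine$integer.max == 2^31 - 1      # TRUE

# Integer arithmetic past the limit overflows to NA (with a warning) ...
overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflow)                       # TRUE

# ... whereas adding a double produces a double, which can hold the value
big_index <- .Machine$integer.max + 1
typeof(big_index)                     # "double"
```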

2.8 Chapter Summary and Critical Data Handling Tools

This chapter introduces R vectors, which come in several forms, primarily logical, numeric, and character. The mode(), typeof(), and class() functions give you information about the type of a vector. The set of is. functions, like is.numeric(), return TRUE or FALSE according to whether an object is of the specified mode, and the set of as. functions performs the conversion. Remember that logicals are simpler than numerics, and numerics simpler than character, and that converting from a simpler to a more complicated mode is straightforward. Converting from a more complicated to a simpler mode follows these rules:

  • Converting character to numeric produces NA for things that aren't numbers, like the character strings "TRUE" or "$199.99".
  • Converting character to logical produces NA for any string that isn't "TRUE", "True", "true", "T", "FALSE", "False", "false" or "F".
  • Converting numeric to logical produces FALSE for a zero and TRUE for any non-zero entry (and watch out for floating-point error here).
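
The three rules above can be checked with a few short calls. The input values are made up for illustration, and suppressWarnings() is used only to keep the sketch quiet; without it, R warns that NAs were introduced by coercion.

```r
# Character to numeric: strings that are not numbers become NA
num <- suppressWarnings(as.numeric(c("12", "3.5", "TRUE", "$199.99")))
num                                    # 12.0  3.5   NA   NA

# Character to logical: only a few spellings are recognized
lgl <- as.logical(c("TRUE", "true", "T", "yes", "F"))
lgl                                    # TRUE TRUE TRUE   NA FALSE

# Numeric to logical: zero is FALSE, anything else is TRUE
as.logical(c(0, 1, -2.5))              # FALSE TRUE TRUE

# Floating-point error: this difference is tiny but not exactly zero
fp <- as.logical(0.1 + 0.2 - 0.3)
fp                                     # TRUE
```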

Extracting and assigning subsets of vectors are critical parts of any data cleaning project. We can use any of the modes as an index or “subscript” with which to extract or assign. A logical subscript returns the values that match up with its TRUE entries. Logical subscripts are extended by recycling where necessary (but most often when we do this it is by mistake). A numeric subscript returns the values specified in the subscript – and, unsurprisingly, numeric subscripts are not recycled. The which() command identifies TRUE values in a logical vector, so you can use that to convert a logical subscript to a numeric one. Finally, a character subscript will extract, from a named vector, elements whose names are present in the subscript (and, again, this kind of subscript is not recycled).
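
As a brief recap in code, here is a sketch of the three kinds of subscript applied to a small named vector invented for this purpose.

```r
x <- c(a = 5, b = 12, c = 7, d = 20)

big <- x > 10            # logical subscript: FALSE TRUE FALSE TRUE
x[big]                   # elements b and d

idx <- which(big)        # numeric equivalent: positions 2 and 4
x[idx]                   # same elements as x[big]

x[c("d", "a")]           # character subscript: elements d and a, in that order

# A short logical subscript is recycled, here selecting every other element
every_other <- x[c(TRUE, FALSE)]       # elements a and c

# Assignment uses the same subscripts: cap every value above 10 at 10
y <- x
y[y > 10] <- 10
y                        # 5 10 7 10
```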

Any kind of vector can have missing values, indicated by NA, and there are a few other special values as well. Missing values influence computations they are involved in, so we often want to supply an argument like na.rm = TRUE to a function computing a sum, mean or other summary statistic on numeric data. You should expect to encounter missing values in any data set from any source and be prepared to accommodate them.
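
A minimal illustration, using a made-up vector of amounts with one missing value:

```r
amt <- c(10, NA, 30)

sum(amt)                   # NA: the missing value propagates
sum(amt, na.rm = TRUE)     # 40
mean(amt, na.rm = TRUE)    # 20, the mean of the two non-missing values
sum(is.na(amt))            # 1, the number of missing entries
```

Counting the missing entries with sum(is.na()) before removing them is a cheap way to see how much data a summary statistic is ignoring.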

The table() function is critical to data cleaning. It tabulates a vector, returning the number of times each unique value appears, with names corresponding to the original values in the data set. Passing two or more vectors to table() produces a two- or higher-way cross-tabulation. We recommend adding the useNA = "ifany", useNA = "always", or exclude = NULL arguments to ensure that table() counts and displays the number of NA values, unless you're certain no values are missing. Using table() on the output of table() – as in table(table(x)) – tells us how many items in a vector x appear once, twice, and so on. This is useful for detecting entries that appear more often than expected.
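
The effect of the useNA argument can be seen in this short sketch, which uses an invented character vector containing two missing values.

```r
v <- c("a", "b", NA, "a", NA, "c", "a")

plain <- table(v)                      # NA values are silently dropped
sum(plain)                             # 5, not 7

with_na <- table(v, useNA = "ifany")   # adds an <NA> entry with count 2
sum(with_na)                           # 7, every element accounted for

# How many values appear once, twice, and three times (NA included)
counts <- table(with_na)
counts                                 # two appear once, one twice, one 3 times
```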

Using names() on the output of table() will produce the unique entries in a vector, but we also use the unique() function to find these. We spend a lot of energy in identifying duplicates, and the duplicated() function is useful here – although, remember, it does not return TRUE for the first item in a set of duplicates. The is.element() and %in% functions help determine the extent to which two sets of values overlap; both of these are simpler versions of the match() function, which is critical to combining data from different sources.
