Chapter 4
R Data, Part 3: Text and Factors

A lot of data comes in character (“string”) form, sometimes because it really is text, and sometimes because it was originally intended to be numeric but included a small number of non-numeric items such as, for example, the word “Missing.” Almost every data cleaning problem requires manipulating text in some way, to find entries that include particular strings, to modify column names, or something else. In this chapter, we describe some of the operations you can perform on character data. This includes extracting pieces of strings, formatting numbers as text, and searching for matches inside text.

However, there are really two ways that character data can be stored in R. One is as a vector of character strings, as we saw in Chapter 2. The tools we mentioned above are primarily for this sort of data. A second way text can appear in R is as a factor, which is a way of storing individual text entries as integers, together with a set of character labels that match the integers back to the text. Factors are important in many R modeling functions, but they can cause trouble. We discuss factors in Section 4.6.

One consideration has become much more important in recent years: handling text from alphabets other than the English one. We are very often called on to deal with text containing accented characters from Western European languages, and increasingly, particularly as a result of data from social media sources, we find ourselves with text in other alphabets such as Cyrillic, Arabic, or Korean. The Unicode system of representing all the characters from all the world's alphabets (together with other symbols such as emoji) is implemented in R through encodings including the very popular UTF-8; Section 4.5 discusses how we can handle non-English texts in R.

4.1 Character Data

4.1.1 The length() and nchar() Functions

The length of a character vector is, as with other vectors, the number of elements it has. In the case of a character vector, you also might want to know how many characters are in each element. We use the nchar() function for that. Remember that some characters require two keystrokes to type (see Section 1.3.3 for a discussion), but they still count as only one character. In this example, we construct a character vector and observe how many characters each element has.

> (planets <- c("Mercury", "Venus", NA, "Mars"))
[1] "Mercury" "Venus"   NA        "Mars"
> length (planets) # Four elements
[1] 4
> nchar (planets)  # Count characters
[1]  7  5 NA  4

Notice that the number of characters in the missing value is itself missing. In older versions of R (through 3.2), nchar() reported the lengths of missing values as 2 – as if those entries were made up of the two characters NA. Starting with version 3.3, returning NA for that string's length is the default, though the older behavior can be requested by passing the keepNA = FALSE argument.
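
As a small sketch of the two behaviors (the object names n.default and n.old are ours, chosen only for illustration), the same vector can be measured both ways:

```r
planets <- c("Mercury", "Venus", NA, "Mars")

# Default in R 3.3 and later: the length of a missing string is missing
n.default <- nchar(planets)

# Older behavior on request: NA is counted as the two characters "NA"
n.old <- nchar(planets, keepNA = FALSE)

print(n.default)
print(n.old)
```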

4.1.2 Tab, New-Line, Quote, and Backslash Characters

There are a few characters in R that need special treatment. We discussed this in Section 1.3.3, but it is worth repeating that if you want to enter a tab, a new-line character, a double quotation mark, or a backslash character, it needs to be “protected” – we say “escaped” – by a backslash. The leading backslash does not count as a character and is not part of the string – it's just a way to enter these characters that otherwise would be taken literally. As an example, consider entering into R this text: She wrote, "To enter a 'new-line,' type "\n"." Normally, of course, we enclose text in quotation marks, but here R will think that the character string ends at the quotation mark preceding To. To remedy that, we escape the inner double quotation marks. (Alternatively, we could enclose the entire quote in single, instead of double quotation marks. Then we would have to escape the two inner single quotation marks.) Moreover, the backslash is a special character. It, too, needs to be escaped. So to enter our quote into an R object, we need to type this:

> (quo <-
      "She wrote, \"To enter a 'new-line,' type \"\\n\".\"")
[1] "She wrote, \"To enter a 'new-line,' type \"\\n\".\""
> nchar (quo)
[1] 47
> cat (quo, "\n")
She wrote, "To enter a 'new-line,' type "\n"."

Notice the length of the string as given by nchar(). Even though it takes 52 keystrokes to type it in, there are only 47 characters in the string. There is no real difference between single and double quotes in R; if you create a string with single quotes, it will be displayed just as if it had been created with double quotes.

The backslash also escapes hexadecimal (base 16) and Unicode entries. Hexadecimal values describe entries in the ASCII table that converts binary values into text ones. For example, if you type "\x45", R returns the value from the ASCII table that has been given value 45 in base 16 (69 in decimal): the upper-case E. Passing the string "\x45" to nchar() returns the value 1. Unicode entries can be one or more characters, and arguments to nchar() help control what that function will return in more complicated examples. We talk more about Unicode in Section 4.5.
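
The following sketch, with object names of our own choosing, illustrates that an escape sequence takes several keystrokes to type but produces a single character:

```r
e   <- "\x45"   # hexadecimal escape: ASCII code 45 in base 16, i.e., "E"
tab <- "\t"     # a tab: two keystrokes, one character
bs  <- "\\"     # an escaped backslash is stored as a single backslash

print(e)
print(nchar(c(e, tab, bs)))   # each element holds exactly one character
```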

4.1.3 The Empty String

In an earlier chapter (Section 2.3.2), we saw that some vectors have length 0. We could create a character vector of length 0 with a command like character(0). However, something that is much more common in text handling is the empty string, which is a regular character string that does not happen to have any characters in it. This is indicated by "", two quote characters with nothing between them. That empty string has length() 1 but nchar() 0. Often the empty string will correspond to a missing value but not always. It is very common to see empty strings when, for example, reading data in from spreadsheets. In our experience, spreadsheets will sometimes produce empty strings, and other times produce strings of spaces (e.g., sometimes, when all the other entries in a column are two characters long, the empty cells of the spreadsheet may contain two spaces). Naturally, these different types of empty or blank strings will need to be addressed in any data cleaning task.

One area of confusion is when using table() on a character vector. The names() of the table will always be exactly right, but since those names are displayed without quotation marks, leading spaces are impossible to see.

> vec <- c (" ", " ", "", "   ", "", "2016", "",
            " 2016", "2016", "   ")
> table (vec)
vec
                   2016  2016
    3     2     2     1     2
> names (table (vec))
[1] ""      " "     "   "   " 2016" "2016" 

In this example, we have items that are empty, items that consist of one space and three spaces, and items that look like 2016 but sometimes with a leading space.

The output of the table() function is not enough to determine the values being tabulated because of the leading spaces. We need the names() function applied to the table (or, equivalently, something like unique(vec)) to determine what the values are.

The nzchar() function is a fast way to determine whether a string is empty or not; it returns TRUE for strings that have non-zero length and FALSE for empty strings (think of “nz” as indicating “non-zero”).
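
As a hedged illustration, here is one way to flag both empty and all-space entries. The vector cells is made up for this sketch, and trimws() is a base R function (available since R 3.2) that we have not otherwise discussed; it strips leading and trailing white space.

```r
cells <- c("", "  ", "NY", NA, " CA")

# nzchar() is FALSE only for the truly empty string;
# by default it reports TRUE for NA
nz <- nzchar(cells)

# Treat an entry as blank if nothing is left after trimming white space
blank <- !is.na(cells) & !nzchar(trimws(cells))

print(nz)
print(blank)
```

Whether blank entries should then be converted to NA depends, of course, on what they mean in the data at hand.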

4.1.4 Substrings

Another action we perform frequently in data cleaning is to extract a piece of a string. This might be extracting a year from a text-formatted date, for example, or grabbing the last five characters of a US mailing address, which hold the ZIP code. The tool for this is the substring() function, which takes a piece of text, an argument first giving the position of the first character to extract, and an argument last that gives the last position. The last argument defaults to 1 million, so unless your strings exceed that length, last can be omitted when we seek the end of a string. For example, to extract characters three and four from a string named dt containing "2017-02-03", we use the command substring(dt, 3, 4); the result is the string "17". (If the string has fewer than three characters, the empty string is returned.) To extract the final five characters we could use substring(dt, nchar(dt) - 4). This extracts characters 6–10 from a string of length 10, characters 21–25 from a string of length 25, and so on.

The substring() function works on vectors, so substring(vec, nchar(vec) - 4) will produce a vector the same length as vec, giving the last five (or up to five) characters of each of its entries. In this example, the argument named first was a numeric vector, and in general both first and last may be vectors. This lets us use substring() to pull out parts of each element in a string vector depending on its contents (e.g., “all the characters after the first parenthesis”).

We can exploit this vectorization to use substring() to break a string into its individual characters. The command substring(a, 1:nchar(a), 1:nchar(a)) does exactly that, just as if we had called substring(a, 1, 1), substring(a, 2, 2), and so on. Another, slightly more efficient, way to break a string into its characters is mentioned below under strsplit() (Section 4.4.7).
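
The commands described in this section can be collected into a short sketch; the object names other than dt are ours.

```r
dt <- "2017-02-03"

yr2   <- substring(dt, 3, 4)            # characters three and four
last5 <- substring(dt, nchar(dt) - 4)   # final five characters
short <- substring("ab", 3, 4)          # too short: the empty string

# Vectorized first and last: one call per character position
chars <- substring(dt, 1:nchar(dt), 1:nchar(dt))

print(yr2)
print(last5)
print(chars)
```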

One of the strengths of substring() is that it can be used on the left side of an assignment operation. In this example, we change the last two letters of each month's name to "YZ", starting from a copy of R's built-in month.name object.

> new.month <- month.name
> substring (new.month, nchar (new.month) - 1) <- "YZ"
> new.month
 [1] "JanuaYZ"   "FebruaYZ"  "MarYZ"     "AprYZ"
 [5] "MYZ"       "JuYZ"      "JuYZ"      "AuguYZ"
 [9] "SeptembYZ" "OctobYZ"   "NovembYZ"  "DecembYZ" 

4.1.5 Changing Case and Other Substitutions

R is case-sensitive, and so we often need to manipulate the case of characters (i.e., change upper-case letters to lower-case or vice versa). The tolower() and toupper() functions perform those operations, as does the equivalent casefold() function, which takes an argument called upper that describes the direction of the intended change (upper = TRUE means “change to upper case,” with FALSE, the default, indicating “change to lower case”). Note that case-folding works with non-Roman alphabets in which that operation is defined, such as Cyrillic and Greek. The help page for these functions describes a more complicated approach that can capitalize the first letter in each word, which is particularly useful for multi-word names such as “kuala lumpur” or “san luis obispo.”

A more general substitution facility is provided by the chartr() (“character translation”) function. This takes two strings of characters, named old and new, plus a character vector, and it changes each character of that vector that appears in old into the corresponding character of new.
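
A brief sketch of these functions follows. The cities vector is invented for illustration.

```r
cities <- c("kuala lumpur", "San Luis Obispo")

up   <- toupper(cities)
low  <- tolower(cities)
fold <- casefold(cities, upper = TRUE)   # same result as toupper()

# chartr(old, new, x): a becomes x, b becomes y, c becomes z
swapped <- chartr("abc", "xyz", "aabbcc")

# A common cleaning step: replace spaces with underscores
under <- chartr(" ", "_", cities)

print(up)
print(swapped)
print(under)
```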

4.2 Converting Numbers into Text

Numbers get special treatment when they are converted into text because R needs to decide how they should be formatted. As we have seen earlier, R formats entries in a numeric vector for display, but that formatting is part of the print-out, not part of the vector, and the formatting can change when the vector changes. In this section, we describe some of the details of those formatting choices. We also describe how R uses scientific notation, and how you can create categorical versions of numeric vectors.

4.2.1 Formatting Numbers

Often it is convenient to represent a series of numbers in a consistent format for reporting. The primary tools for formatting are format() and sprintf(). The format() function provides a number of useful options, particularly for lining up decimal points and commas. (The European usage, with a comma to denote the start of the decimal and a period to separate thousands, is also supported.) Of course, formatting strings nicely in R doesn't guarantee that those strings will line up nicely in a report; that will depend on which font is used to display the formatted strings. Still, format() is a fast and easy way to format a set of numbers in a common way. Important arguments are digits, to determine the number of significant digits, nsmall to determine the number of digits in the “small” part (i.e., to the right of the decimal point), big.mark to determine whether a comma is used in the “big” part, drop0trailing, which removes trailing zeros in the small part, and zero.print, which, if FALSE, causes zeros to be printed as spaces. (You can also specify an alternate character like a dot, which might be useful when most entries are zero.)

This example shows some of these arguments at work.

# Seven digits by default, decimals aligned
> format (c(1.2, 1234.56789, 0))
[1] "   1.200" "1234.568" "   0.000"
# Add comma separator
> format (c(1.2, 1234.56789, 0), big.mark=",")
[1] "    1.200" "1,234.568" "    0.000"
# Currency style, blank zeros
> format (c(1.2, 1234.56789, 0), digits = 6, nsmall=2,
          zero.print=F, width=2)
[1] "   1.20" "1234.57" "       "

In the last example, the digits and nsmall arguments had to be chosen carefully in order to produce exactly two digits to the right of the decimal point. (The nsmall argument describes the minimum, not the maximum, number of digits to be printed.) There are a few formatting tasks, including incorporating text and adding leading zeros, that format() is not prepared for, and these are handled by sprintf().

The sprintf() function takes its name from a common function in the C language (the name evokes “string print, formatted”). This powerful function is complicated, and we just give a few examples here. The important point about sprintf() is the format string argument, which describes how each number is to be treated. (In R fashion, the format string can be a vector, in which case either that argument or the numerics being formatted may have to be recycled in the manner of Section 2.1.4.) A format string contains text, which gets reproduced in the function's output (this is useful for things such as dollar signs) and conversion strings, which describe how numbers and other variables should appear in that output. Conversion strings start with a percent sign and contain some optional modifiers and then a conversion character, which describes the manner of object being formatted. Although sprintf() can produce hexadecimal values and scientific notation (see the following discussion) with the proper conversion characters, the three most useful are i (or d) for integer values, f for double-precision numerics, and s for character strings. So, for example, sprintf("%f", 123) formats 123 as a double-precision number using its default conversion options and produces the text "123.000000", while sprintf("%f", 123.456) produces "123.456000".

Much of the power in sprintf() comes from the modifiers. Primary among these are the field width and precision, two numbers separated by a period that give the minimum width (the total number of characters, including sign and decimal point) and the number of digits to the right of the decimal point, respectively. Other modifiers include the 0, to pad with leading zeros; the space modifier, which leaves a space for the sign if there isn't one (so that negative and positive numbers line up); and the + modifier, which produces plus signs for positive numbers. So, to continue the example, the format string in sprintf("%9.1f", 123.456) asks for a field width of 9 and a precision of 1, and the result is the nine-character string "    123.5". The command sprintf("%09.1f", 123.456) asks for leading zeros and therefore produces "0000123.5". The items to be formatted, and even the format string itself, can be vectors. This vectorization makes it straightforward to insert numbers into sentences like this:

> costs <- c(1, 333, 555.55, 123456789.012)
# Format as integers using %d; %d rejects non-whole numbers,
# so we truncate with as.integer() first
> sprintf ("I spent $%d in %s", as.integer (costs),
           month.name[1:4])
[1] "I spent $1 in January"    "I spent $333 in February"
[3] "I spent $555 in March"    "I spent $123456789 in April"

In this example, each element of costs and month.name[1:4] is used, in turn, with the format string.

The format strings are very flexible. We show two more examples here.

# Format as double-precision (%f) with default precision
> sprintf ("I spent $%f in %s", costs, month.name[1:4])
[1] "I spent $1.000000 in January"
[2] "I spent $333.000000 in February"
[3] "I spent $555.550000 in March"
[4] "I spent $123456789.012000 in April"
# Format as currency, without specifying field width
> sprintf ("I spent $%.2f in %s", costs, month.name[1:4])
[1] "I spent $1.00 in January"
[2] "I spent $333.00 in February"
[3] "I spent $555.55 in March"
[4] "I spent $123456789.01 in April"

One final feature of sprintf() is that field width or precision (but not both) can themselves be passed as an argument by specifying an asterisk in the format conversion string. This allows fine-tuning of the widths of output, which is useful in reporting. Suppose we wanted all four output strings from the last example to have the same length. We can compute the length of the longest number in costs (after rounding to two decimal places) and supply that length as the field width, as seen in this example.

> biggest <- max (nchar (sprintf ("%.2f", costs)))
> sprintf ("I spent $%*.2f in %s",
            biggest, costs, month.name[1:4])
[1] "I spent $        1.00 in January"
[2] "I spent $      333.00 in February"
[3] "I spent $      555.55 in March"
[4] "I spent $123456789.01 in April"   

Although sprintf() is complicated, it is very handy for at least one job – generating labels that look like 001, 002, 003, and so on. The command sprintf("%03d", 1:100) will generate 100 labels of that sort.
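
The modifiers described in this section can be tried out in a short sketch like the following; the object names are ours.

```r
labs   <- sprintf("%03d", 1:100)        # "001", "002", ..., "100"
wide   <- sprintf("%9.1f", 123.456)     # width 9, precision 1
zeros  <- sprintf("%09.1f", 123.456)    # same, padded with leading zeros
signed <- sprintf("%+d", c(5L, -5L))    # plus sign for positive numbers

print(head(labs, 3))
print(c(wide, zeros))
print(signed)
```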

4.2.2 Scientific Notation

Scientific notation is the practice of representing every number by an optional sign, a number between 1 and 10, and a multiplier of a power of 10. The choice that R makes to put a number into scientific notation depends on the number of significant digits required. For example, the number 123,000,050 is written 1.23e+08, but 123,000,051 is written 123000051. When one number in a vector (or matrix) needs to be represented in scientific notation, R represents them all that way, which can be helpful or annoying, depending on the job at hand. In this example, we show some of the effects of the way R displays numbers in scientific notation. Notice that the rules are slightly different for integer and for floating-point values.

> 100000           # Big enough to start scientific notation
[1] 1e+05
> c(1, 100000)     # Both numbers get scientific notation
[1] 1e+00 1e+05
> c(1, 100000, 123456)     # R keeps precision here
[1]      1 100000 123456
> as.integer (10000000 + 1)  # Integers are a little different
[1] 10000001

There is no easy way to change scientific notation for a single command. The R option scipen controls the “crossover” points between regular (“fixed”) and scientific notation, which depends on the number of characters required to print the vector out. (This, in turn, depends on the number of digits R is prepared to display, which depends on the digits option.) Set options(scipen = 999) to disable all scientific notation, and options(scipen = -999) to require scientific notation everywhere – but don't forget to set it back to the default value of 0 as needed. (As with other options() calls, the value of scipen is re-set when you close and re-open R.) An alternative is to use the format() command with the scientific = FALSE option. This example shows the format() command at work on a large number.

> format (10000000)
[1] "1e+07"            # scientific notation
> format (100000000, scientific = FALSE)
[1] "100000000"

Notice that, like sprintf(), format() always produces a character string, which makes further numeric computation difficult.
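
One cautious pattern, sketched here with object names of our own, is to save the old option value when changing scipen and restore it immediately afterward. The call to options() returns the previous settings, which makes this straightforward.

```r
# Temporarily disable scientific notation, then restore the old setting
old <- options(scipen = 999)
fixed <- format(100000)       # fixed notation while scipen = 999
options(old)                  # put scipen back the way we found it

# Alternatively, request fixed or scientific notation for a single call
big <- format(100000000, scientific = FALSE)
sci <- format(123456, scientific = TRUE)

print(c(fixed, big, sci))
print(as.numeric(big) + 1)    # convert back before doing arithmetic
```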

4.2.3 Discretizing a Numeric Variable

Very often we construct a discretized, categorical version of a numeric vector with just a few levels for exploration or modeling purposes (we sometimes call this procedure “binning”). For example, we might want to convert a numeric vector into a categorical with levels “Small,” “Medium,” and “Large.” The natural tool for this in R is the cut() function. The arguments are the vector to be discretized, the breakpoints, and, optionally, some labels to be applied to the new levels. The result of a call to cut() is a factor vector; we discuss factors in Section 4.6, but for the moment we will simply convert the result back to characters. In this example, we start with a numeric vector and bin its values into three groups. We will set the interior boundary points at 4 and 7.

> vec <- c(1, 5, 6, 2, 9, 3, NA, 7)
> as.character (cut (vec, c(1, 4, 7, 10)))
[1] NA       "(4,7]"  "(4,7]"  "(1,4]"  "(7,10]" "(1,4]"
[7] NA       "(4,7]" 

The cut() function has some distracting quirks. By default, intervals do not include their left endpoint, so that in this example, the value 1 does not belong to any interval. This produces the NA in element 1 of the output; the second, of course, arises from the missing value in vec. The include.lowest = TRUE argument will force the leftmost breakpoint to belong to the leftmost bin. In this example, the number 1 would be found in the leftmost bin if include.lowest = TRUE were specified. Alternatively, the right = FALSE argument makes intervals include their left end and exclude their right (in which case include.lowest = TRUE actually refers to the largest of the breakpoints). In this example, any values larger than 10 would have produced NAs as well. This requires that you know the lower and upper limits of the data before deciding what the breakpoints need to be.
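
To make the effect of these two arguments concrete, here is a sketch that re-uses the vector from the example above; the names brk, low, and left are ours.

```r
vec <- c(1, 5, 6, 2, 9, 3, NA, 7)
brk <- c(1, 4, 7, 10)

# include.lowest = TRUE: the value 1 now falls in the leftmost bin
low <- as.character(cut(vec, brk, include.lowest = TRUE))

# right = FALSE: intervals include their left end, exclude their right,
# so the value 7 moves up into the last bin
left <- as.character(cut(vec, brk, right = FALSE))

print(low)
print(left)
```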

If the exact locations of the breakpoints are not important, cut() provides a straightforward way to produce bins of equal width or of approximately equal counts. The first is accomplished by specifying the breaks argument as an integer. In this case, cut() assigns every non-missing observation (even the lowest) to one of the bins. For bins of approximately equal counts, we can compute the quantiles of the numeric vector and use those as breakpoints. The following two examples show both of these approaches on a set of 100 numbers generated from R's random number generator with the standard normal distribution. We use the set.seed() function to initialize the random number generator; if you use this command your generator and ours should produce the same numbers. First we pass breaks as an integer to produce bins of approximately equal width.

> set.seed (246)
> vecN <- rnorm (100)
> table (cut (vecN, breaks = 5))
 (-3.18,-1.94] (-1.94,-0.709] (-0.709,0.522]
             2             21             52
  (0.522,1.75]    (1.75,2.99]
            20              5 

In the following example, we use R's quantile() function to compute the minimum, quartiles and maximum of the vecN vector. (Other choices are possible through the use of the probs argument.) Once the quantiles are computed we can pass them as breakpoints to produce four bins with approximately equal counts – but, as before, cut() produces NA for the smallest value unless include.lowest = TRUE – and by default, table() omits NA values.

> quantile (vecN)
         0%         25%         50%         75%        100%
-3.17217563 -0.61809034 -0.06712505  0.45347704  2.98469969
> table (cut (vecN, quantile (vecN))) # lowest value omitted
  (-3.17,-0.618] (-0.618,-0.0671]  (-0.0671,0.453]
              24               25               25
    (0.453,2.98]
              25
> table (cut (vecN, quantile (vecN), include.lowest = TRUE))
  [-3.17,-0.618] (-0.618,-0.0671]  (-0.0671,0.453]
              25               25               25
    (0.453,2.98]
              25 

Notice how supplying include.lowest = TRUE changed the first bin from a half-open interval (indicated by the label starting with a parenthesis, "(") to a closed one (label starting with a square bracket, "["). In general, the default labels are somewhat unwieldy – a character value like "(-0.618,-0.0671]" will be difficult to manage. The cut() function allows us to pass in a vector of text labels using the labels argument.
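
As a sketch of the labels argument, we can return to the small vector binned earlier in this section and name its three bins; the label text is our own choice.

```r
vec <- c(1, 5, 6, 2, 9, 3, NA, 7)

# One label per bin: three bins require exactly three labels
size <- cut(vec, c(1, 4, 7, 10),
            labels = c("Small", "Medium", "Large"),
            include.lowest = TRUE)

print(as.character(size))
print(table(size))
```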

4.3 Constructing Character Strings: Paste in Action

Character strings arise in data we bring in from other sources, but very often we need to construct our own. The primary tool for building character strings is the paste() function, plus its sibling paste0(). In its simplest form, paste() sticks together two character vectors, converting either or both, as necessary, from another class into a character vector first. By default R inserts a space in between the two. For example, paste("a", "b") produces the result "a b" while paste(1 == 2, 1 + 2) evaluates the two arguments, converts them to character (see Section 2.2.3) and produces "FALSE 3".

In practice, we prefer to control the character that gets inserted. We want a space sometimes, for example, when constructing diagnostic messages, but more often we want some other separator, in order to construct valid column names, for example. The sep argument to paste() allows us to specify the separator. Very often in our work we use a period, by setting sep = ".", or no separator at all, by setting sep = "". In the latter case, we can also use the paste0() function, which according to the help file operates more efficiently in this case.

What really gives paste() its power is that it handles vectors. If any of its arguments is a vector, paste() returns a vector of character strings, recycling (Section 2.1.4) shorter ones as needed. This gives us great flexibility in constructing sets of strings. For example, the command paste0("Col", LETTERS) produces a vector of the strings "ColA", "ColB", and so on, up to "ColZ".

A final useful argument to paste() is collapse, which combines all the strings of the vector into one long string, using the separator specified by the value of the collapse argument. Common choices are the empty string "", which joins the pieces directly, and the new-line and tab characters, when formatting text for output to tables.
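
The following sketch pulls together the sep and collapse arguments and the recycling behavior; the object names are ours.

```r
spaced <- paste("a", "b")               # default separator is a space
dotted <- paste("a", "b", sep = ".")    # choose our own separator
cols   <- paste0("Col", LETTERS[1:3])   # vectorized, no separator

# Recycling: the shorter vector is re-used as needed
recyc  <- paste0(c("x", "y"), 1:4)

# collapse joins the elements of the result into a single string
header <- paste(cols, collapse = "\t")

print(cols)
print(recyc)
cat(header, "\n")
```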

The paste() function is such a big part of character manipulation in R that we think it important to show a few examples of how it works and where it can be useful.

4.3.1 Constructing Column Names

When a data frame is constructed from data without header names, R constructs names such as V1 and V2. Normally, we will want to replace these with meaningful names of our own, but in big data sets the act of typing in those names is tedious and error-prone. Moreover it is often true that the names follow a pattern – for example, we might have a customer ID followed by 36 months of balance data from 2016 to 2018, followed by 36 months of payment data for the same years. One way to generate those latter 72 names is through the outer() function. This function operates on two vectors and performs another function on each pair of elements from the two vectors, producing a matrix of results. For example, outer(1:10, 1:10, "*") produces a 10 × 10 multiplication table. The command outer(month.abb, 2016:2018, paste, sep="."), similarly, produces a matrix. In this example, we show the first few rows of that matrix using the head() command.

> head (outer (month.abb, 2016:2018, paste, sep = "."), 3)
     [,1]       [,2]       [,3]
[1,] "Jan.2016" "Jan.2017" "Jan.2018"
[2,] "Feb.2016" "Feb.2017" "Feb.2018"
[3,] "Mar.2016" "Mar.2017" "Mar.2018"

To construct column labels with Bal. on the front, we can simply paste that string onto the elements of the matrix. Remember that paste() converts its arguments into character vectors before operating, so the result of this operation is a vector of column labels, as shown in this example, where again we show only a few of the elements of the result.

> myout <- outer (month.abb, 2016:2018, paste, sep = ".")
> paste0 ("Bal.", myout)[1:3]
[1] "Bal.Jan.2016" "Bal.Feb.2016" "Bal.Mar.2016"

So to construct a vector with all 73 desired names, we could use a single command as in this example.

newnames <- c("ID", paste0 ("Bal.", myout),
                    paste0 ("Pay.", myout))

We note that outer() is not very efficient. For very big data sets, we might create separate vectors from the year part and the month part, and then paste them together. Suppose that the balance and payment values alternated, so the first two columns gave balance and payment for January 2016, the next two for February 2016, and so on. Then a straightforward way to construct the labels using paste() is by repeating the components as needed, with rep(), and then pasting the resulting vectors together:

# 2 values x 12 months x 3 years
part1 <- rep (c("Bal", "Pay"), 12 * 3)
# Double each month, repeat x 3
part2 <- rep (rep (month.abb, each = 2), 3)
part3 <- rep (2016:2018, each = 24)
newnames <- c("ID", paste (part1, part2, part3, sep = "."))

The task in this example could actually have been done more easily with expand.grid(). This function takes, as arguments, vectors of values and produces a data frame containing all combinations of all the values. Since the output is a data frame, for many purposes you will want to specify stringsAsFactors = FALSE. The next step is to use paste() on the columns of the data frame. We use paste() regularly and in many contexts.
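
A sketch of the expand.grid() approach for the alternating balance and payment columns might look like this. The column names type, month, and year are our own; the approach relies on expand.grid() varying its first argument fastest.

```r
# All combinations; the first argument varies fastest, the last slowest
grid <- expand.grid(type = c("Bal", "Pay"),
                    month = month.abb,
                    year = 2016:2018,
                    stringsAsFactors = FALSE)

# Paste the columns of the data frame together, element by element
newnames <- c("ID", paste(grid$type, grid$month, grid$year, sep = "."))

print(head(newnames, 5))
print(length(newnames))
```

Because the ordering of the rows determines the ordering of the names, it is worth checking the first few results against the layout of the actual data.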

4.3.2 Tabulating Dates by Year and Month or Quarter Labels

Often we want to summarize vectors of dates (Section 3.6) across, for example, years, months, or calendar quarters. An easy way to do this is by pasting together identifiers of year and month, then using table() or tapply() to compute the relevant numbers of interest. We use paste() here because the built-in months() and quarters() functions do not produce the year as well (and the format() function does not extract quarters). In this example, we first generate 600 dates at random between January 1, 2015 and December 31, 2016 (a period of 731 days) and then tabulate them by quarter.

> set.seed (2016)
> dts <- as.Date (sample (0:730, size = 600),
                  origin = "2015-01-01")
> table (quarters (dts)) # Shows calendar quarter
 Q1  Q2  Q3  Q4
134 153 151 162 

To combine both year and quarter, we can use substring() to extract the years, then paste them together with the quarters. (We could also have extracted the years with format(dts, "%Y").) We put the years first in these labels so that the table labels are ordered chronologically. In this example, we paste the year and quarter, and then tabulate.

> table (paste0 (substring (dts, 1, 4), ".",
                 quarters (dts)))
2015.Q1 2015.Q2 2015.Q3 2015.Q4 2016.Q1 2016.Q2 2016.Q3
     72      71      75      81      62      82      76
2016.Q4
     81 

To add months to the years, we could call the months() function and once again use paste() to combine the year and month information. Alternatively, we can use format() directly, as in this example. Notice, however, that table() sorts its entries alphabetically by name.

> (mtbl <- table (format (dts, "%Y.%B")))
    2015.April    2015.August  2015.December ...
            24             23             24 ...

To put these entries into calendar order explicitly, we can use paste0() to construct a vector of names to be used as an index. Then we can use that index to re-arrange the entries in the mtbl table. We show that in this example.

> (month.order <- paste0 (rep (2015:2016, each = 12), ".",
                          month.name))
 [1] "2015.January"   "2015.February"  "2015.March" ...
> mtbl[month.order]
  2015.January  2015.February     2015.March ...
            24             21             27 ...

4.3.3 Constructing Unique Keys

Often we need to construct a single column that uniquely labels each row in a data frame. For example, we might have a table with one row for each customer in every month in which a transaction took place. Neither customer number nor month is enough to uniquely identify a row, but we can construct a unique key by pasting together customer number, year, and month. Here we would probably use a two-digit numeric month and put year before month. That way an alphabetical ordering of the keys would put every customer's transactions together in increasing date order.
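
Here is a minimal sketch of such a key. The vectors cust, yr, and mo are invented for illustration, and we assume the customer numbers all have the same number of digits so that alphabetical order matches numeric order.

```r
# Hypothetical data: one row per customer per month
cust <- c(1002L, 1001L, 1001L)
yr   <- c(2016L, 2017L, 2016L)
mo   <- c(3L, 2L, 11L)

# Year before month, and a two-digit month with a leading zero
key <- paste(cust, yr, sprintf("%02d", mo), sep = ".")

print(key)
print(sort(key))              # grouped by customer, in date order
print(anyDuplicated(key))     # 0 means every key is unique
```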

4.3.4 Constructing File and Path Names

In many data processing applications, our data is spread out over many files and we need to process all the files automatically. This might require constructing file names by pasting together a path name, a separator like /, and a file name. R can then loop over the set of file names to operate on each one. As an example, one way to get the full (absolute) file names of all the files in your working directory is by combining the name of the directory (retrieved with getwd()) and the names of the files (retrieved with list.files()). The command paste(getwd(), list.files(), sep = "/") produces a character vector of the absolute file names of files in the working directory. This is not quite the same as the output from list.files(full.names = TRUE); we discuss interacting with the file system in more detail in Section 5.4.
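As a hedged illustration of the pasting approach, the sketch below builds full file names inside a scratch directory. The directory and file names are made up for the example. It also shows file.path(), a base R function that performs the same construction with a / separator.

```r
# Create a scratch directory holding two empty files (hypothetical names)
scratch <- file.path(tempdir(), "paste_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("a.csv", "b.csv")))

# Paste the directory, a "/" separator, and each file name together
by.paste <- paste(scratch, list.files(scratch), sep = "/")

# file.path() is a shortcut for the same construction
by.fp <- file.path(scratch, list.files(scratch))
identical(by.paste, by.fp)
```

One difference worth knowing: for an empty directory, paste() still returns one string (the directory name plus the separator), whereas file.path() returns a zero-length vector.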

4.4 Regular Expressions

A regular expression is a pattern used in a tool to find strings that match the pattern. The patterns can be very complicated and perform surprisingly sophisticated matches, and in fact entire books have been written about regular expressions. While we cannot cover all the complexities of regular expressions in this book, we can make you knowledgeable enough to do powerful things.

We use regular expressions to find strings that match a rule, or set of rules, called a pattern. For example, the pattern a matches strings that include one or more instances of a anywhere in them. The pattern a8 matches strings with a8, with no intervening characters. Most characters, such as a and 8 in this example, match themselves. What gives regular expressions their power is the ability to add certain other characters that have special meaning to the pattern. The exact set of special characters differs across the different kinds of regular expression, but as a first example, the character ^ means “at the start of a line,” and $ means “at the end of a line.” So the pattern ^The matches every string that starts with The; end$ matches every string that ends with end, and ^No$ matches every string that consists entirely of No. By default, patterns are case-sensitive, but shortly we will see how to ignore case.

4.4.1 Types of Regular Expressions

The details of regular expressions differ from one implementation to the next, so a regular expression you write for R may not work in, for example, Python or another language. Actually, R supports two sorts of regular expressions: one is POSIX-style (named for the same POSIX standards group that gave us the POSIXt date objects), and the other is Perl-style, referring to the regular expressions used in the Perl language. (Specifically, if you need to look this up somewhere, the POSIX style incorporates the GNU extensions and the Perl style comes via the PCRE library.) By default, POSIX regular expressions are used in R.

4.4.2 Tools for Regular Expressions in R

There are three primary tools for regular expressions in R: grep() and its variants, regexpr() and its variants, and sub() and its variants. These three are similar in implementation. We start by describing grep() in some detail. The grep() function takes a pattern and a vector of strings, and returns a numeric vector giving the indices of the strings that match the pattern. With the value = TRUE argument, grep() returns the matching strings themselves. The related function grepl() (the letter l on the end standing for “logical”) returns a logical vector with TRUE indicating the elements that match. In this example, we look through R's built-in state.name vector to find elements with the capital letter C.

> grep ("C", state.name)
[1]  5  6  7 33 40
> grep ("C", state.name, value = TRUE)
[1] "California"     "Colorado"       "Connecticut"
[4] "North Carolina" "South Carolina"
> grep ("^C", state.name, value = TRUE)
[1] "California"  "Colorado"    "Connecticut"

The first call to grep() produces a vector of indices. These five numbers show the locations in state.name where strings containing C can be found. With value = TRUE, the names of the matching states are returned. In the final example, we search only for strings that start with C.

Several other arguments are also important. First, the ignore.case argument defaults to FALSE, but when set to TRUE, it allows the search to ignore whether letters are in upper- or lower-case. Second, setting invert = TRUE reverses the search – that is, grep() produces the indices of strings that do not match the pattern. (The invert argument is not available for grepl(), but of course you can use ! applied to the output of grepl() to produce a logical vector that is TRUE for non-matchers.) Third, fixed = TRUE suspends the rules about patterns and simply searches for an exact text string. This is particularly useful when you know the exact text you are looking for and that text contains a special character – for example, a negative amount indicated with parentheses, such as ($1,634.34). To continue an earlier example, grep("^The", vec) finds the entries of vec that start with the three characters The, whereas grep("^The", vec, fixed = TRUE) finds the entries that include the four characters ^The anywhere in the string.

A fourth useful argument is perl, which, when set to TRUE, leads the grep functions to use Perl-type regular expressions. Perl-type regular expressions have many strengths, but for this development we will describe the default, POSIX style. Finally, all of these regular expression functions permit the use of the useBytes argument, which specifies that matching should be done byte by byte, rather than character by character. This can make a difference when using character sets in which some characters are represented by more than one byte, such as UTF-8 (see Section 4.5).
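The following sketch exercises the ignore.case, invert, and fixed arguments. It uses R's built-in state.name vector and a small made-up vector of currency strings (amounts is an invented name).

```r
# Case-insensitive search: "c" now also matches the capital C
grep("c", state.name, value = TRUE, ignore.case = TRUE)

# invert = TRUE returns the indices of elements that do NOT match
no.a <- grep("a", state.name, invert = TRUE)

# grepl() has no invert argument; negate its result instead
no.a.logical <- !grepl("a", state.name)
identical(no.a, which(no.a.logical))

# fixed = TRUE treats the pattern as literal text (amounts are made up)
amounts <- c("($1,634.34)", "$250.00", "(none)")
grep("($", amounts, fixed = TRUE, value = TRUE)
```

Without fixed = TRUE, the pattern in the last call would have to escape both the parenthesis and the dollar sign.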

4.4.3 Special Characters in Regular Expressions

We have seen how ^ and $ match, respectively, the beginning and end of a line. There are a number of other special characters that have specific meanings in a regular expression. In order for one of these special characters to be used to stand for itself, it needs to be “protected” by a backslash. We talk more about the way backslashes multiply in R regular expressions in the following section.

Table 4.1 lists the special characters in R's implementation of POSIX regular expressions. Many implementations of regular expressions work on lines of text in a file, so we use the word “line” here synonymously with “element of a character vector.”

Table 4.1 Special characters in R (POSIX) regular expressions

Char   Name           Purpose                    Example

Matching characters
.      Period         Match any character        t.e matches strings with tae, tbe, t9e, t;e,
                                                 and so on, anywhere
[ ]    Brackets       Match any character        t[135]e matches t1e, t3e, t5e;
                      between them               t[1-5]e matches t1e, t2e, …, t5e, but note:
                                                 [a-d] might mean [abcd] or [aAbBcCdD],
                                                 depending on your computer. See
                                                 “character ranges”
^      Caret          (i) Start of line          ^L matches lines starting with L
                      (ii) “Not” when appearing  t[^h]e matches lines containing t, then
                      first in brackets          something not an h, then e
$      Dollar         End of line                the$ matches lines ending with the
|      Pipe           “Or” operator              th|sc matches either th or sc
( )    Parentheses    Grouping operators
\      Backslash      Escape character           See text

Repetition characters
{ }    Braces         Enclose repetition         (a|b){3} matches lines with three (a or b)'s
                      operators                  in a row, for example, aba, bab, …
,      Comma          Separate repetition        j{2,4} matches jj, jjj, jjjj
                      operators
*      Asterisk       Match 0 or more            ab* matches a, ab, abb, …
+      Plus           Match 1 or more            ab+ matches ab, abb, abbb, …
?      Question mark  Match 0 or 1               ab? matches a or ab

4.4.4 Examples

In this section, we show our first examples of using regular expressions to locate matching strings. Remember that, by default, grep() gives the indices of the matching strings; pass value = TRUE to get the strings themselves and use grepl() to get a logical indication of which strings match. In these functions, a string matches or does not – there is no notion of the position within a string where a match takes place. The tool for that is regexpr(), described in Section 4.4.5. We start by creating a vector of strings that contain the string sen in different cases and locations.

> sen <- c("US Senate", "Send", "Arsenic", "sent", "worsen")
> grep ("Sen", sen)             # which elements have "Sen"?
[1] 1 2
> grep ("Sen", sen, value = TRUE)     # elements with "Sen"
[1] "US Senate" "Send"
> grep ("[sS]en", sen, value = TRUE)  #  either case "S"
[1] "US Senate" "Send"      "Arsenic"   "sent"      "worsen"
> grep ("sen", sen, value = TRUE,     # upper or lower-case
        ignore.case = T)
[1] "US Senate" "Send"      "Arsenic"   "sent"      "worsen"
> grep ("^[sS]en", sen, value = T)    # start "Sen" or "sen"
[1] "Send" "sent"
> grep ("sen$", sen, value = T)       # end with "sen"
[1] "worsen"

The first grep() produces the indices of elements that match the pattern – this is useful for extracting the subset of items that match. The second grep() uses value = TRUE to return the elements themselves. These simple examples start to show the power of regular expressions. That power is multiplied by the ability to detect repetition, as we see next.

Repetition

The second part of Table 4.1 describes some repetition operators. Often we seek not a single character, but a set of matching characters – a series of digits, for example. The regular expression repetition operator ? allows for zero or one matches, essentially making the match optional; the * allows for zero or more, and the + operator allows for one or more matches. So the pattern 0+ matches one or more consecutive zeros. Since the dot character matches any character, the combination .+ means “a sequence of one or more characters”; this combination appears frequently in regular expressions. We also often see the similar pattern .* for “zero or more characters.” The following example shows how we can match strings with extraneous text using repetition operators. We start by creating a vector of strings, and our goal is to find elements of the vector that start with Reno and are followed at some point later in the string by a ZIP code (five digits).

> reno <- c("Reno", "Reno, NV 895116622", "Reno 911",
            "Reno Nevada 89507")
> grep ("Reno.+[0-9]{5}", reno, value = TRUE)
[1] "Reno, NV 895116622" "Reno Nevada 89507"

Here, the .+ accounts for any text after the o in Reno and the [0-9]{5} describes the set of five digits. It is tempting to add spaces to your pattern to make it more readable, but this is a mistake; the regular expression will then take the spaces literally and require that they appear. Notice that the nine-digit number was also matched by the {5} repetition, since the first five of the nine digits satisfy the requirement.
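To see the repetition operators side by side, here is a small sketch on a made-up vector x. Each pattern is anchored with ^ and $ so that the whole string must consist of an a followed by the stated number of b's.

```r
x <- c("a", "ab", "abb", "abbb", "b")   # made-up test strings

zero.or.one  <- grep("^ab?$", x, value = TRUE)      # "a" "ab"
zero.or.more <- grep("^ab*$", x, value = TRUE)      # "a" "ab" "abb" "abbb"
one.or.more  <- grep("^ab+$", x, value = TRUE)      # "ab" "abb" "abbb"
two.to.three <- grep("^ab{2,3}$", x, value = TRUE)  # "abb" "abbb"
```

Dropping the anchors would let most of these patterns match every string containing an a, since the repetition then only has to be satisfied somewhere in the string.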

We end this section with a more complicated example. Here, we search for strings with dates in the form of a one- or two-digit numeric day, a month name (as a three-letter abbreviation), and a four-digit year number, when there might be text between any of these pieces. This example shows the text to be matched.

> dt <- c("Balance due 16 Jun or earlier in 2017",
     "26 Aug or any day in 3018",
     "'76 Trombones' marched in a 1962 film",
     "4 Apr 2018", "9Aug2006",
     "99 Voters May Register in 20188")

The pieces of the regular expression to detect the dates are these. First, we can have leading text, so .* will match that if it is present. Second, [0-3]?[0-9] matches a one-digit number (since the first digit is optional, as indicated by the ?) or a two-digit number less than 40. Next, there is (optional) additional text, followed by a set of month names. The month-related part of the pattern looks like (Jan|Feb|Mar...|Dec), where the pipes denote that any month will match and the parentheses make this a single pattern. (The abbreviations in the pattern will match a full name in the text.) Finally, we match some more additional text, followed by four digits that have to start with a 1 or a 2. We construct the month-related part of the pattern first by using paste() with the collapse argument.

> (mo <- paste (month.abb, collapse = "|"))
[1] "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
> re <- paste0 (".*[0-3]?[0-9].*(", mo, ").*[1-2][0-9]{3}")
> grep (re, dt, value = TRUE)
[1] "Balance due 16 Jun or earlier in 2017"
[2] "4 Apr 2018"
[3] "9Aug2006"
[4] "99 Voters May Register in 20188"

Notice that the mar in marched does not match the month abbreviation Mar. However, the 99 in the final string matches the day portion of our pattern. That is because the [0-3] is optional; the first 9 matches in the [0-9] pattern and the second, in the .* pattern. Moreover, the five-digit year 20188 in that string matches the pattern [1-2][0-9]{3} because its first four digits do. We see how to refine this example later in the section – but regular expressions are tricky!

The Pain of Escape Sequences

Special characters give regular expressions much of their power. But sometimes we need to use special characters literally – for example, we might want to find strings that contain the actual dollar sign $. A dollar sign in a pattern normally indicates the end of a line; to use it literally in a pattern it needs to be “escaped” with a backslash. So in order for the regular expression “engine” to look for a dollar sign, we need to pass it the pattern \$. But remember that to type a backslash into R, we need to type two backslashes, since R also uses the backslash as the character that “protects” certain other characters (in strings like \n for new-line). That is, we have to type \\$ in R so that the engine can see \$ and know to search for a dollar sign. In this example, we create a vector of character strings and search for a dollar sign among them. Remember that $ matches the end of a string. In the first grep() command below, the pattern $ matches every element in the vector that has an end – all of them.

> vec <- c("c:\\temp", "/bin/u", "$5", "\n", "2 back: \\\\")
> grep ("$", vec)            # Indices of elements with ends
[1] 1 2 3 4 5
> grep ("\$", vec, value = TRUE)
Error: '\$' is an unrecognized escape...
> grep ("\\$", vec, value = TRUE)
[1] "$5"

The pattern \$ looks to R as if we are constructing a special “protected” character such as \n or \t. Since there is no such character in R, we see an error. The next command produces the elements of vec that contain dollar signs since value = TRUE; in this case, the only element that matches is $5.

Other special characters also need to be escaped. To search for a dot, use \\.; to search for a left parenthesis, use \\(, and so on. The “pain” of this section's title refers to searching for the backslash itself. Since a backslash is represented as \\, and we need to pass two of them to the engine, the pattern for finding backslashes in a string is \\\\. This looks like four characters, but it's actually two (as nchar ("\\\\") will confirm). The first tells the regular expression engine to take the second literally.
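A short sketch of these escaping rules, using a made-up vector dots, may help. The cat() call prints a string as the regular expression engine receives it, without R's own escaping.

```r
nchar("\\\\")                      # 2: the string holds two backslashes
cat("\\\\", "\n")                  # prints the two characters the engine sees

dots <- c("3.14", "314")           # made-up test strings
any.char <- grepl(".", dots)       # bare dot matches any character
real.dot <- grepl("\\.", dots)     # escaped dot matches only a period

# fixed = TRUE avoids the escaping altogether
paren <- grepl("(", "f(x)", fixed = TRUE)
```

Here any.char is TRUE for both strings, while real.dot is TRUE only for the string that contains a period.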

Backslashes are fortunately pretty rare in text, but they do arise in path names in the Windows operating system. In this example, we show how we can locate strings containing the backslash character.

> grep ("\\", vec)
Error in grep("\\", vec) :
 invalid regular expression '\', reason 'Trailing backslash'
> grep ("\\\\", vec, value = TRUE)         # elements with \
[1] "c:\\temp"     "2 back: \\\\"
> grep ("\\\\\\\\", vec, value = TRUE)     # two backslashes
[1] "2 back: \\\\"

In the first command, the backslash \\ is valid in R, but because the regular expression engine uses the backslash as well, it expects a second character (like \$ in our example above). When no second character is found, grep() produces an error. The second example shows the elements of vec that contain a backslash. Notice that the character \n in position 4 is a single character. The backslash depicts its special nature but is not part of the actual character. The final pattern matches the string with two backslashes.

The fixed = TRUE argument can alleviate some of the pain when searching for text that includes special characters. In this example, we repeat the searches above using fixed = TRUE.

> grep ("\\", vec, value = TRUE, fixed = TRUE)       # one \
[1] "c:\\temp"     "2 back: \\\\"
> grep ("\\\\", vec, value = TRUE, fixed = TRUE)     # two \
[1] "2 back: \\\\"

As a final example in this section, we show how we can use the pipe character | to find elements of vec with either forward slashes or backslashes.

> grep ("\\\\|/", vec, value = TRUE)
[1] "c:\\temp"     "/bin/u"     "2 back: \\\\"
> grep ("\\|/", vec, value = TRUE, fixed = TRUE)
character(0)

In the first command, we found strings containing either a backslash (\\\\) or forward slash (/), the two separated by the pipe character indicating “or.” In the second command, we used fixed = TRUE to look for strings containing the literal text \|/ in that order – and of course none was found.

Character Ranges and Classes

We saw earlier how we can match ranges of digits by enclosing them in square brackets as [0-9]. This extends to other sets of characters. For example, we might want to match any of the lower-case letters, or any punctuation character, or any of the letters A–G of the musical scale. It is easy to specify a range of characters using square brackets and a hyphen, so [a-z] matches a lower-case letter and [A-G] matches an upper-case musical note. To match musical notes given in either case, we can combine a range with the pipe character: [A-G]|[a-g] matches any of those seven letters in either case. To include a hyphen in the pattern, put it first or last in the brackets. (You can also put an opening square bracket in a set or range, but to include a closing square bracket, you will need to escape it so that the set isn't seen as ending with that character.) So, for example, the range [X-Z] matches “any of the letters X, Y or Z,” and includes Y, whereas the set [XZ-] matches “any of X, Z, or hyphen” and does not.

We can negate a character class or range by preceding it with the caret character ^. So the set [^XZ-] matches any characters other than X, Z, or hyphen. Notice that the caret must be inside the brackets; outside, it matches the start of the line as we saw above. A caret elsewhere than the first character is interpreted literally – that is, it matches the caret character.
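The sketch below tries these sets on two made-up vectors. It also uses [A-Ga-g], which places two ranges in one set as an alternative to joining two sets with the pipe character.

```r
notes <- c("A", "g", "H", "-", "^")     # made-up test strings
xyz   <- c("X", "Y", "-")

either.case <- grep("[A-Ga-g]", notes, value = TRUE)  # "A" "g"
with.hyphen <- grep("[XZ-]", xyz, value = TRUE)       # hyphen last: "X" "-"
negated     <- grep("[^XZ-]", xyz, value = TRUE)      # leading caret: "Y"
lit.caret   <- grep("[A^]", notes, value = TRUE)      # caret not first: "A" "^"
```

In the last pattern the caret is not the first character in the brackets, so it stands for itself.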

There is a predefined set of character classes that makes it easy to specify certain common sets. These include [:lower:], [:upper:], and [:alpha:] for lower-case, upper-case, and any letters; [:digit:] for digits; [:alnum:] for alphanumeric characters (letters or numbers); [:punct:] for punctuation; and a few more (see the help pages). Notice that the name of the class includes the square brackets; to use these in a regular expression they need to be enclosed in another set of square brackets. So, for example, the pattern [[:digit:]] matches one digit, and [^[:digit:]] matches any character that is not a digit. We start this example by showing how to identify strings that include, or do not include, upper-case letters.

> vec <- c("1234", "6", "99 Balloons", "Catch 22", "Mannix")
> grep ("[[:upper:]]", vec, value = TRUE)   # any upper-case
[1] "99 Balloons" "Catch 22"    "Mannix"
> grep ("[^[:upper:]]", vec, value = TRUE)  # any non-upper
[1] "1234"        "6"           "99 Balloons" "Catch 22"
[5] "Mannix"
> grep ("^[^[:upper:]]+$", vec, value = TRUE) # no upper
[1] "1234" "6"

The first grep() uses the [:upper:] character class to identify strings with at least one upper-case character in them. It is tempting to think that the second regular expression, [^[:upper:]], will find strings consisting only of non-upper-case characters, but as you can see the result is something different. In fact, this pattern matches every string that has at least one non-upper-case character. The last example shows how to identify strings consisting entirely of these characters – we specify that a sequence of one or more non-upper-case characters (i.e., [^[:upper:]] followed by +) should be all that can be found on the line (i.e., between the ^ and the $).

Some classes are so commonly used that they have aliases. We can use \d for [:digit:] and \s for [:space:], and \D and \S for “not a digit,” “not a space.” (Here, “space” includes tab and possibly other more unusual characters.) As with other escapes, these are typed into R with two backslashes. This makes it easy to, for example, find strings that contain no digits at all, as we see in this example.

> grep ("^[^[:digit:]]+$", vec, value = TRUE) # long way
[1] "Mannix"
> grep ("^\\D*$", vec, value = TRUE)          # shorter
[1] "Mannix"

Word Boundaries

Often we require that a match take place on a word boundary, that is, at the beginning or end of a word (which can be a space or related character such as tab or new-line, or at the beginning or end of the string). Word boundaries are indicated by \b, or by the pair \< and \>. The characters that are considered to go into a word include all the alphanumeric (non-space) characters. Recall our earlier example where we tried to locate strings with dates included – such as, for example, "4 Apr 2018". Our earlier effort inadvertently matched a year with the value 20188. Using the word-boundary characters, we can specify that, in order to match, a string must include a word with exactly four digits. This example shows how that might be done.

> (newvec <- grep ("\\<\\d{4}\\>", dt, value = TRUE))
[1] "Balance due 16 Jun or earlier in 2017"
[2] "26 Aug or any day in 3018"
[3] "'76 Trombones' marched in a 1962 film"
[4] "4 Apr 2018"
> grep (mo, newvec, value = TRUE)
[1] "Balance due 16 Jun or earlier in 2017"
[2] "26 Aug or any day in 3018"
[3] "4 Apr 2018"    

In the first command, we found strings that contained words of exactly four digits and saved the result into the new item newvec. The second command then searched for the month string in that new vector. We did not look for the days in this example, but in practice we very often use multiple passes (sometimes with invert = TRUE) in order to extract the set of strings we need. It may be more computationally efficient to call grep() only once for any problem, but that may not be the fastest route overall.

4.4.5 The regexpr() Function and Its Variants

While grep() identifies strings that match patterns, the regexpr() function is more precise: it returns the location of the (first) match within the string – that is, the number of the first character of the match. We can use this information to not only identify strings that contain numbers but also extract the number itself. This example shows the result of calling regexpr() with a pattern that looks for the first stand-alone integer in each string. It is not enough to extract a set of digits because that would match strings such as 11-dimensional or B2B. Word boundaries provide the mechanism for specifying an integer, as seen here:

> (regout <- regexpr ("\\<\\d+\\>", dt))
[1] 13  1  2  1 -1  1
attr(,"match.length")
[1]  2  2  2  1 -1  2
attr(,"useBytes")
[1] TRUE

The regexpr() function returns a vector (plus some other information we describe below). The vector, which starts with 13, shows the number of the character where the first integer begins. For example, the number 16 in the first element of dt appears starting at the 13th character in that string, the number 26 starts in the 1st character of the second element, and so on. The -1 in the fifth position indicates that string does not contain an integer as a word.

The function also returns attributes, extra pieces of information attached to its output. The match.length attribute in this case gives the length of the match – so the first element is 2 because the integer in the first string is two characters long; the fourth is 1 because the integer in the fourth string is one character long. (We will not need the useBytes attribute.) We could extract the match.length vector using the attr() function, and then use substring() to extract the numbers in the strings. But a more convenient alternative is provided by regmatches(), which takes the initial string and the output of regexpr() and performs the extraction for us, as in this example.

> regmatches (dt, regout)
[1] "16" "26" "76" "4"  "99"

There are five entries in this vector because only five of the strings contained integers. (The -1 values in the original vector remind you of which strings did not produce values here.)
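The attr() and substring() route mentioned above can be sketched as follows. This is one possible way to do it by hand, not necessarily how regmatches() works internally. The names len, found, and nums are introduced only for this illustration.

```r
# Same strings as the dt vector in the text
dt <- c("Balance due 16 Jun or earlier in 2017",
        "26 Aug or any day in 3018",
        "'76 Trombones' marched in a 1962 film",
        "4 Apr 2018", "9Aug2006",
        "99 Voters May Register in 20188")
regout <- regexpr("\\<\\d+\\>", dt)

len   <- attr(regout, "match.length")   # length of each match
found <- which(regout > 0)              # -1 marks "no match"

# substring() needs the first and last character positions
nums <- substring(dt[found], regout[found],
                  regout[found] + len[found] - 1)
nums
identical(nums, regmatches(dt, regout))
```

The extra bookkeeping for the non-matching elements is what regmatches() saves us.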

Finding All Matches

The regexpr() function finds the first instance of a match in a vector of strings. To find all the matches is only a little more complicated. We use the gregexpr() function, the g evoking “global.” The return value of gregexpr() is a list, not a vector, because some strings may contain many integers. However, regmatches() works on this return value just as it does for regexpr(). In this example, we extract all of the integers from each of our strings in one command.

# Note that some output from this command is suppressed
> (gout <- gregexpr ("\\<\\d+\\>", dt))
[[1]]
[1] 13 34
attr(,"match.length")
[1] 2 4
...
[[2]]
[1]  1 22
attr(,"match.length")
[1] 2 4
...
[[6]]
[1]  1 27
attr(,"match.length")
[1] 2 5
...
> regmatches (dt, gout)
[[1]]
[1] "16"   "2017"
[[2]]
[1] "26"   "3018"
[[3]]
[1] "76"   "1962"
[[4]]
[1] "4"    "2018"
[[5]]
character(0)
[[6]]
[1] "99"    "20188"
> matrix (as.numeric (unlist (regmatches (dt, gout))),
          ncol = 2, byrow = TRUE)
     [,1]  [,2]
[1,]   16  2017
[2,]   26  3018
[3,]   76  1962
[4,]    4  2018
[5,]   99 20188

Here, the result of the call to regmatches() is a list of length 6, one for each string in the original dt. The fifth entry in the list is empty because the fifth entry of dt had no integers that were words. The final command shows one way you might form the list into a two-column numeric matrix, a useful step on the way to constructing a data frame. A second approach would use do.call() and rbind().
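A sketch of the do.call() and rbind() approach follows. The list parts is a made-up stand-in for the output of regmatches(), including one empty entry. Dropping the empty entries explicitly, as done here, is a choice made for clarity in this illustration.

```r
# Stand-in for a regmatches() result; the third entry is empty
parts <- list(c("16", "2017"), c("26", "3018"), character(0),
              c("99", "20188"))

# Keep only the entries that contain matches
parts <- parts[lengths(parts) > 0]

# Convert each entry to numeric, then bind the entries as rows
num.mat <- do.call(rbind, lapply(parts, as.numeric))
num.mat
```

Like the matrix() approach, this relies on every remaining entry having the same number of pieces.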

Greedy Matching

By default, regular expression matching is “greedy” – that is, matches are as long as possible. As an example consider using the pattern \d.*\d to find a digit, zero or more characters, and a second digit in the string "4 Apr 3018". You might expect the regular expression engine to find the string "4 Apr 3", but in fact it gathers as much as possible: "4 Apr 3018", stopping at the last 8. Adding a question mark makes the match “ungreedy” (or “lazy”) – so that \d.*?\d produces "4 Apr 3".
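The difference can be checked directly by combining regexpr() and regmatches() on the string from this paragraph. This is a minimal sketch using R's default regular expressions; the names s, greedy, and lazy are introduced for the illustration.

```r
s <- "4 Apr 3018"

# Greedy: .* runs as far as it can before the final digit
greedy <- regmatches(s, regexpr("\\d.*\\d", s))

# Lazy: .*? stops at the first place a digit can follow
lazy <- regmatches(s, regexpr("\\d.*?\\d", s))

c(greedy = greedy, lazy = lazy)
```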

4.4.6 Using Regular Expressions in Replacement

In addition to finding matches, R has tools that allow you to replace the part of the string that matches a pattern with a new string. These are sub(), which replaces the first matching pattern, and gsub(), which replaces all the matching patterns. The replacement text is not a regular expression. For example, here is a vector of four character strings. In the first example, we replace the first lower-case i with the number 9. In the second, we replace the first instance of either i or I with 9, and in the last, we replace all instances of either one with 9.

> (mytxt <- c("This is", "what I write.",
              "Is it good?", "I'm not sure."))
[1] "This is" "what I write." "Is it good?" "I'm not sure."
> sub("i", "9", mytxt)       # replace first i with 9
[1] "Th9s is" "what I wr9te." "Is 9t good?" "I'm not sure."
> sub("[iI]", "9", mytxt)    # replace first (i or I) with 9
[1] "Th9s is" "what 9 write." "9s it good?" "9'm not sure."
> gsub("[iI]", "9", mytxt)   # replace all (i or I) with 9
[1] "Th9s 9s" "what 9 wr9te." "9s 9t good?" "9'm not sure."

Sometimes the text being matched is needed in the replacement. This can sometimes be done very neatly using “backreferences.” When a regular expression is enclosed in parentheses, its matching strings get labeled by integers and can be re-used in the replacement string by referring to them as \1, \2, and so on – of course, to be typed into R as \\1, \\2, and so on. In this example, we are given names in the form “Firstname Lastname” and asked to produce names of the form “Lastname, Firstname.”

> folks <- c("Norman Bethune", "Ralph Bunche",
                 "Lech Walesa", "Nelson Mandela")
> sub ("([[:alpha:]]+) ([[:alpha:]]+)", "\\2, \\1", folks)
[1] "Bethune, Norman" "Bunche, Ralph"   "Walesa, Lech"
[4] "Mandela, Nelson"

The first argument to the sub() command gave the pattern: a series of one or more letters (captured as backreference 1), a space, and another series of letters (backreference 2). The replacement part gives the second backreference, then a comma and space, and then the first backreference. We note that this task is more complicated with people whose names use three words, since sometimes the second word is a middle or maiden name (as with John Quincy Adams or Clare Boothe Luce) and sometimes it is part of the last name (Martin Van Buren, Arthur Conan Doyle) – and of course some people's names require four or more words (Edna St Vincent Millay, Aung San Suu Kyi).

4.4.7 Splitting Strings at Regular Expressions

It is common to want to split a string whenever a particular character occurs. This is more or less the opposite of the paste() operation. For example, in our work we often construct a unique key to identify each of our observations, using paste(). We might combine a company identifier, transaction identifier, and date, with a command like key <- paste(co.id, tr.id, date, sep = "-"). Of course, in this example, the sep = "-" argument specifies a hyphen as the separator.

At a later time, it might be necessary to “unpaste” those keys into their individual parts. The strsplit() function performs this duty. In this example, strsplit(key, "-") produces a list with one entry for each string in key. Each entry is a vector of parts that result when the key is broken at its hyphens; so if one key looked like 00147-NY-2016-K before the split, the corresponding entry in the output of strsplit() would be a vector with four elements (and no hyphens). If the key had two hyphens in a row, there would have been an empty string in the output vector. In this example, we show the effect of strsplit() on several keys constructed using hyphens.

> keys <- c("CA-2017-04-02-66J-44", "MI-2017-07-17-41H-72",
            "CA-2017-08-24-Missing-378")
> (key.list <- strsplit (keys, "-"))
[[1]]
[1] "CA"   "2017" "04"   "02"   "66J"  "44"
[[2]]
[1] "MI"   "2017" "07"   "17"   "41H"  "72"
[[3]]
[1] "CA"      "2017"    "08"      "24"      "Missing" "378"    

In cases like these, where the number of pieces is the same in every key, it is common to construct a matrix or data frame from the parts. We saw a similar example using the output of regmatches() in an earlier section. Here, we construct a character matrix in the same way. We can then use data.frame() to make the matrix into a data frame, although the columns of the latter will be character unless you then convert them explicitly.

> matrix (unlist (key.list), ncol = 6, byrow = TRUE)
     [,1] [,2]   [,3] [,4] [,5]      [,6]
[1,] "CA" "2017" "04" "02" "66J"     "44"
[2,] "MI" "2017" "07" "17" "41H"     "72"
[3,] "CA" "2017" "08" "24" "Missing" "378"

Note that the alternative do.call("rbind", key.list) produces the same character matrix.

Unlike the sep argument to paste(), which is a character string, the second argument to strsplit() can be a regular expression. The strsplit() function also accepts the fixed, perl, and useBytes arguments as the other regular expression operators do. Because that second argument can be a regular expression, extra work is required to split at periods. The command strsplit(key, ".") produces a split at every character, since the period can represent any character, so it returns an unhelpful vector of empty strings. The command strsplit(key, "\\.") or its alternatives, strsplit(key, "[.]") or strsplit(key, ".", fixed = TRUE) will split at periods. Remember that the output of strsplit() is always a list, even if only one character string is being split.
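The four calls can be compared on a single made-up key that uses periods as separators. The [[1]] at the end of each line extracts the one entry of the list that strsplit() returns.

```r
key <- "00147.NY.2016.K"     # hypothetical period-separated key

every.char <- strsplit(key, ".")[[1]]                # all empty strings
escaped    <- strsplit(key, "\\.")[[1]]              # escaped period
bracketed  <- strsplit(key, "[.]")[[1]]              # period in brackets
literal    <- strsplit(key, ".", fixed = TRUE)[[1]]  # literal period

escaped
```

The last three calls give the same four pieces; the first gives one empty string for each character in the key.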

4.4.8 Regular Expressions versus Wildcard Matching

The patterns used in regular expressions are more complicated, and more powerful, than the sort of wildcard matching that many users will have seen as part of a command-line interpreter. In wildcard matching, the only special characters are *, meaning “match zero or more characters,” and ?, meaning “match exactly one character.” So, for example, the wildcard-type pattern *an? matches any string that ends with an followed by exactly one more character. R does not use wildcard matching, but it does allow you to convert a wildcard pattern, which R calls a “glob,” into a regular expression, by means of the glob2rx() function. For example, glob2rx("*an?") produces "^.*an.$". Notice the ^ and $ signs; glob2rx() adds those by default. The trim.head = TRUE argument removes a leading ^.* from the result, and trim.tail = TRUE (the default) removes a trailing .*$.
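A short sketch of glob2rx() in use follows. The vector of test words is our own, made up for illustration.

```r
rx <- glob2rx("*an?")
rx                                   # "^.*an.$"

# Hypothetical test strings
words <- c("band", "bandit", "plans", "an")

# Only strings ending in "an" plus exactly one more character match
grepl(rx, words)
```

Because the converted pattern is anchored at both ends, bandit does not match even though it contains and.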

4.4.9 Common Data Cleaning Tasks Using Regular Expressions

Regular expressions make it possible to do many complicated things to text, specific to your particular problem and data. There are a few operations, though, that seem to be called for in a lot of data cleaning tasks. In these sections, we describe how to do some of these.

Removing Leading and Trailing Spaces

One frequent need in text handling is removing leading and trailing spaces from text. The regular expression "^ *" matches any string with leading spaces, while " *$" matches one with trailing spaces. To match either or both of these, we combine them with the pipe character, and use gsub() instead of sub() since some strings will have both kinds of matches – as in this example.

> gsub ("^ *| *$", "", c("  Both Kinds ", "Trailing   ",
                         "Neither",  "     Leading"))
[1] "Both Kinds" "Trailing"    "Neither"    "Leading"   

Here, the embedded space inside "Both Kinds" does not match – it is neither leading nor trailing – and is not deleted. The command gsub(" ", "", vec) will remove all spaces in every element of vec.

Converting Formatted Currency into Numeric

We see something similar in formatted currency amounts such as $12,345.67. Here, we need to remove the currency symbol and the comma before converting to numeric. If the only currency sign we expected to encounter was the dollar sign, we might do this:

> as.numeric (gsub ("\\$|,", "", "$12,345.67"))
[1] 12345.67

Recall that as.numeric() will accept and ignore leading and trailing spaces. More generally, we might delete a non-numeric leading character like this:

> as.numeric (gsub ("^[^0-9.]|,", "", "$12,345.67"))
[1] 12345.67
> as.numeric (gsub ("^[^[:digit:]]|,", "", "$12,345.67"))
[1] 12345.67

In this example, the first ^ anchors the match at the start of the string. The [^0-9.] bracketed expression starts with a ^, meaning “not,” so that part means “anything except a digit or a dot.” The |, sequence says “or a comma,” so the regular expression will find a leading non-numeric (and non-period) character, as well as any commas anywhere, and delete them all.
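Because ^ anchors the match at the start of the string, the bracketed expression on its own removes at most one leading character. The sketch below is a variant of our own, not taken from the examples above: it adds a + so that a run of leading non-numeric characters is removed. The sample amounts are made up, and negative amounts or amounts in parentheses are not handled.

```r
# Hypothetical formatted amounts
amounts <- c("$12,345.67", "USD 1,000", "$$7.50")

# "+" extends the match to one or more leading non-numeric characters
cleaned <- as.numeric(gsub("^[^0-9.]+|,", "", amounts))
cleaned

# Without "+", only the first "$" of "$$7.50" is removed
one.char <- gsub("^[^0-9.]|,", "", "$$7.50")
one.char
```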

Removing HTML Tags

Occasionally, we come across text formatted with HTML tags. These are instructions to the browser regarding display of the material, so, for example, <b>Bold</b> formats the word “Bold” in bold face. Other tags indicate headings, delineate cells of tables, paragraphs, and so on. It can be useful to strip out all of the formatting information as a first step toward processing the text. Every tag starts with the angle bracket < and ends with >. So, given a character string txt, the command gsub("<.*?>", "", txt) will delete all the brackets (the < and > are treated literally) and all the text between any pair (here, the .* indicates “zero or more characters” and the ? instructs the engine to match in a lazy way).
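The effect of the lazy ? is easiest to see by comparing it with the greedy version. This is a small sketch with a made-up fragment of HTML.

```r
# Hypothetical HTML fragment
txt <- "<p>This is <b>Bold</b> and <i>italic</i>.</p>"

lazy   <- gsub("<.*?>", "", txt)  # each tag is matched separately
greedy <- gsub("<.*>", "", txt)   # one match from the first < to the last >

lazy
greedy
```

The greedy pattern deletes the whole string here, because everything lies between the first < and the last >.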

Converting Linux/OS X File Paths to R and Windows Ones

The Windows file system uses the backward slash, \, to separate directories in a file path, whereas Linux and Mac operating systems use the forward one, /. Suppose we are given a Linux-style path like /usr/local/bin, and we want to switch the direction of the slashes. The command gsub("/", "\\\\", "/usr/local/bin") will produce the desired result. To make the change in the other direction, the command gsub("\\\\", "/", "\\usr\\local\\bin") will convert Windows path separators to Linux ones. As an alternative in this case, we can specify the matching pattern exactly with a command like gsub("\\", "/", "\\usr\\local\\bin", fixed = TRUE).
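The number of backslashes needed is easy to get wrong, so a round trip is a useful check. In this sketch, each backslash in the text is typed as two in R source, and the regular expression engine needs the result escaped once more.

```r
unix.path <- "/usr/local/bin"

# Replacement "\\\\" is two characters, which gsub() reads as one backslash
win.path <- gsub("/", "\\\\", unix.path)
cat(win.path, "\n")                  # prints \usr\local\bin

# Back again: as a regular expression, and as a fixed string
back.regex <- gsub("\\\\", "/", win.path)
back.fixed <- gsub("\\", "/", win.path, fixed = TRUE)
identical(back.regex, back.fixed)
```

Printing win.path without cat() shows doubled backslashes, but nchar(win.path) confirms that each separator is a single character.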

4.4.10 Documenting and Debugging Regular Expressions

Regular expressions are complicated, and debugging them is hard. It is annoying (and time-consuming) to try to fix a regular expression that you know is wrong when you're not sure why. It is worse to have one that is wrong and not know it. There are online aids to diagnosing problems with regular expressions that we have found useful. An Internet search will turn up a number of helpful sites – but make sure that the site you find describes the regular expression type (POSIX with GNU extensions or PCRE) that you use. Because regular expressions are complicated, be sure to document them as well as you can. Write out the patterns you expect to match and the rules you use to match them.

4.5 UTF-8 and Other Non-ASCII Characters

4.5.1 Extended ASCII for Latin Alphabets

Up until now we have implicitly been dealing only with the “usual” characters, those found on a keyboard used in English-speaking countries. The starting point for the way characters are displayed is ASCII, a table that gives characters and the corresponding standard digital representations. ASCII provides representations of only 128 characters, many of which are unprintable “control” characters, such as tab, new-line, or the command to ring the bell of an old-fashioned teleprinter. Much of the text we handle in our work is of this sort, but ASCII does not include codes for letters with accents or other diacritical marks, required for many Western European languages. Every computer today honors a much broader character set, often based on a standard named latin1, but realized in slightly different ways by different manufacturers. For example, Windows uses its own “Win-1252” table, which includes some characters not found in latin1, such as the Euro currency symbol and the curly “smart quotes,” and Apple OS X uses a table called “Mac OS Roman.” Each character has a hexadecimal representation – for example, the upper-case E with a circumflex, Ê, has code ca, and typing "\xca" into R (with quotation marks because this is text) will produce that character. The \x is used to introduce hexadecimal notation in R, and it is case-sensitive – \X may not be used – but the hexadecimal digits themselves are not case-sensitive. Entering text in hexadecimal is different from entering numeric values in hexadecimal. Typing 0xca produces the number whose hexadecimal value is ca, that is, the number 202. Typing "\xca" produces the character whose code in the latin1 table is ca, that is, Ê.

Characters represented by their hexadecimal codes can be used just like regular characters, as arguments to grep() or other functions. (They can also be entered in other, different ways that depend on your computer and keyboard.) Almost all of these characters will display on your screen, depending on which fonts you have installed. One exceptional character is the so-called null character, which has code 00 (following the convention that every character requires two hexadecimal digits). This character is not permitted in R text; if needed, nulls can be held in objects of class raw, but they should be avoided. In Chapter 6, we describe how you can skip null characters when reading data from outside sources.

The Windows and OS X character codes generally coincide. The one commonly encountered character for which the two disagree is the Euro currency symbol, €, which was introduced after the latin1 standard was decided. In the Win-1252 table, the symbol has hexadecimal value 80, whereas in OS X, it has db.

4.5.2 Non-Latin Alphabets

Of course, the need for standardization goes beyond the Euro sign. Increasingly, with the availability of data from social media and other sources, analysts need methods to read, store, and process characters from very different languages such as Chinese, Arabic, and Russian. The computing communities have settled on Unicode, which is a system that intends to describe all the symbols in all the world's alphabets. Unicode values are shown in R by preceding them with \U (or \u, but the upper-case \U is more general). Unicode includes ASCII as a subset. For example, the lower-case letter k has an ASCII and Unicode representation as the hexadecimal value 6b, so typing "\U6b" or "\U006B" into R will produce a lower-case k. As with other hexadecimal encodings, the hexadecimal digits of a Unicode value may be in either case.

As a non-Latin example, the two Chinese characters 中国 represent the word “China” in (simplified) Chinese. These cannot be represented in ASCII, but their Unicode representations are (from left) "\U4E2D" and "\U56FD", and these values can be entered (inside quotation marks) directly in R, as in this example:

> "\U4e2d\U56fd"
[1] "中国"              # If fonts permit
> nchar ("\U4e2d\U56fd")
[1] 2                   # Two characters...
> nchar ("\U4e2d\U56fd", type = "bytes")
[1] 6                   # ...requiring six bytes in UTF-8

There are several ways to represent Unicode, but the most popular, particularly in web pages, is UTF-8. In this encoding, each character in Unicode is represented by one or more bytes. For our purposes, it is not important to know how the encoding works, but it is important to be aware that some characters, particularly those in non-European alphabets, require more than one byte. In the example above, the two Chinese characters take up six bytes in UTF-8.
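Although the details of the encoding are not important here, it can be reassuring to look at the bytes. The following sketch uses the base R functions charToRaw() and utf8ToInt(), which are not otherwise discussed in this section, to show the six bytes and the two Unicode code points.

```r
china <- "\u4e2d\u56fd"          # the two characters for "China"

nchar(china)                     # 2 characters
nchar(china, type = "bytes")     # 6 bytes
charToRaw(china)                 # e4 b8 ad e5 9b bd
utf8ToInt(china)                 # 20013 22269, that is, 0x4E2D and 0x56FD

# An ASCII character entered as Unicode is the same one-byte character
identical("\u006b", "k")
```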

Depending on your computer, its fonts, and the windowing system, the Chinese characters may not appear. Instead, you might see the Unicode representation (such as \U4e2d), an empty square indicating an unprintable character, an empty space, or even, on some computers, some seemingly garbled characters. Sometimes these characters indicate the latin1 encoding, but on some computers the very same representation will be used for UTF-8. You can ensure that the computer knows these characters are UTF-8 by examining their encoding (the following section). The important point is that the display of UTF-8 characters can be inconsistent from one machine to the next, even when the encodings are correctly preserved. We talk about reading and writing UTF-8 text in Section 6.2.3.

4.5.3 Character and String Encoding in R

Handling Unicode in R requires knowledge of one more detail, which is “encoding.” R assigns an encoding to every element in a character vector (and different elements in a vector may have different encodings). ASCII strings are unencoded (so their encoding is marked as unknown); strings with latin1 characters (but no non-Latin Unicode) are encoded as latin1 and strings with non-Latin Unicode are encoded as UTF-8. The Encoding() function returns the encoding of the strings in a vector and iconv() will convert the encodings. In the following first example, we create a latin1 string and look for the à character using regexpr(). The search succeeds whether the regular expression is entered in latin1 style (as "\xe0"), Unicode style (as "\ue0"), or directly with the keyboard. In each case, the à is found in location 9 as we expect.

> (yogi <-  "It's d\xe9j\xe0 vu all over again.")
[1] "It's déjà vu all over again."
> Encoding (yogi)
[1] "latin1"
> c(regexpr ("\xe0", yogi), regexpr ("\ue0", yogi),
    regexpr ("à", yogi))
[1] 9 9 9

Different encodings only cause problems in the rare cases where the Win-1252 and Mac OS Roman pages disagree with Unicode, and the primary example of this issue is, again, the Euro sign. In this example, we create a string containing the Euro sign using the Windows value "\x80" (to repeat this example with OS X, use "\xdb"). We then use grepl() to check to see if the sign is found in the string. R encodes the string as latin1 when it sees the non-ASCII character. Here, the Euro sign in the latin1 string is not matched by the Unicode Euro, but after the string is converted into UTF-8, the Unicode Euro is matched.

> (bob <- "bob owes me \x80123")
[1] "bob owes me €123"
> Encoding (bob)
[1] "latin1"
> (euro <- "\U20ac")                # Assign Unicode Euro
[1] "€"
> Encoding (euro)
[1] "UTF-8"
> grepl (euro, bob)                  # Is it there?
[1] FALSE
> (bob <- iconv (bob, to = "UTF-8")) # Convert to UTF-8
[1] "bob owes me €123"
> grepl (euro, bob)                  # Is it there?
[1] TRUE

Notice that iconv() has no effect on strings that contain only ASCII text. These will continue to have encoding “unknown.”
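How R marks a string entered with \x codes depends on the computer's locale, so the results above may differ on other machines. The sketch below sidesteps that by declaring the encoding with Encoding<- and by giving iconv() an explicit from argument; both choices are our assumptions for the sake of a reproducible illustration, not part of the examples above.

```r
x <- "d\xe9j\xe0"              # bytes e9 and e0, as in the latin1 table
Encoding(x) <- "latin1"        # declare the encoding explicitly

y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)                    # "UTF-8"

nchar(x, type = "bytes")       # 4: one byte per character in latin1
nchar(y, type = "bytes")       # 6: the accented letters take two bytes each

# ASCII-only text is unaffected and stays "unknown"
ascii <- iconv("plain", to = "UTF-8")
Encoding(ascii)
```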

UTF-8 is vital for handling non-European text. Although the display is not always perfect, R is usually intelligent about handling UTF-8 once it is read in. UTF-8 text behaves as expected in regular expressions, paste() and other string manipulation tools. R's functions to read from, and write to, files also support the notion of encoding in UTF-8 and other formats. We talk more about reading and writing UTF-8 in Chapter 6.

We have noted that the display of UTF-8 strings can be unexpected on some computers (at least, for some characters). Even on computers equipped with the correct fonts, though, an issue arises when UTF-8 characters are part of a data frame. When the print() function is applied to a data frame, it calls the print.data.frame() function, which in turn calls format(). The latter, though, reacts poorly to UTF-8, often converting it into a form like <U+4E2D>. In this example, we create a data frame with those Chinese characters and show the results of printing the data frame.

> data.frame (a = "\U4e2d\U56fd", stringsAsFactors = FALSE)
                 a
1 <U+4E2D><U+56FD>

Here, the data.frame() command produced a data frame whose one entry was two characters. The data frame, as displayed by the print.data.frame() function, shows the <U+4E2D>-type notation. Despite the display, the underlying values of the characters are unchanged – as seen in the next command.

> data.frame (a = "\U4e2d\U56fd",
              stringsAsFactors = FALSE)[1,1]
[1] "中国"

R shows the expected result because print() is being called on a vector, not a data frame.

Sometimes, UTF-8 is inadvertently saved to disk in the <U+4E2D> form as literal characters – < followed by U, and so on. At the end of the chapter, we show one way to reconstruct the original UTF-8 from this representation.

4.6 Factors

4.6.1 What Is a Factor?

A factor is a special type of R vector that looks like text but in many cases behaves like an integer. Factors are important in modeling, but they often cause trouble in data entry and cleaning. In this section, we describe how factors are created, how they behave, and how to get them to do what you want them to do.

Factors arise in several ways. You can create a factor vector from some other vector using the factor() or equivalent as.factor() function; this will often be a final step, after data cleaning has been completed and modeling is about to start. Factors are also created automatically by R when constructing data frames, or when character vectors are added into data frames, with the data.frame() or cbind() functions (Sections 3.4 and 3.7.1), or when reading data into R from other formats (Section 6.1.2). In both of these cases, the behavior can be changed through a function argument or global option. (Beginning with R 4.0.0, the default for stringsAsFactors is FALSE, so in those versions this automatic conversion happens only on request.)

Factors are useful in a number of places in R but particularly in modeling. They provide a natural and powerful way of representing categorical variables in a statistical model. However, we recommend that you only turn character vectors into factors when all the data cleaning is finished and it is time to start modeling. Chapter 7 shows a complete data cleaning example in depth and there we ensure that our character data starts out and remains as character. Still, it is important to understand how factors work in R.

Think of a factor as having two parts. One part is the set of possible values, the “levels.” In a manpower example, the levels of a factor named “Gender” might be “Male” and “Female,” and perhaps a third called “Not Recorded.” The second part is a set of integer codes that R uses to represent and store the levels. These codes start at 1 and go up. By default, R assigns codes to levels alphabetically – so in this example, “Female” would be represented by 1, “Male” by 2, and “Not Recorded” by 3. The class() of a factor vector is factor, showing its special nature, but the mode() of a factor vector is numeric, and the typeof() is integer, referring to the underlying codes that R stores. The advantage of this representation is efficiency: in a data set of a million observations, it is clearly much more efficient to store a million small integers than to store millions of copies of longer strings.
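The two parts of a factor can be inspected directly. This is a small sketch using the Gender example, with made-up values.

```r
# Hypothetical values for the Gender example
gender <- factor(c("Male", "Female", "Not Recorded", "Female"))

levels(gender)       # "Female" "Male" "Not Recorded"
as.integer(gender)   # 2 1 3 1, the underlying integer codes

class(gender)        # "factor"
mode(gender)         # "numeric"
typeof(gender)       # "integer"
```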

4.6.2 Factor Levels

Once a set of levels is defined for a factor, it is resistant to change. If you try to change a value of one of the elements of a factor vector to a new value that is not already a level, R sets that value to NA and issues a warning. Conversely, if you remove all the elements with a particular value from the vector, that value is still one of the levels. In this example, we create a factor whose levels are the three colors of a traffic light.

> (cols  <- factor (c("red", "yellow", "green", "red",
                      "green", "red", "red")))
[1] red    yellow green  red    green  red    red
Levels: green red yellow
> table (cols)
 green    red yellow
     2      4      1 

We can tell that the result is showing factor levels, rather than character strings, because there are no quotation marks and because R also prints out the levels themselves. Notice that the levels consist of the unique values in the vector, sorted alphabetically, and that the table() command performs as expected on the factor vector. The levels and labels arguments to the factor() function control the setting and ordering of the factor's levels. In the following example, we show what happens when we exclude the elements whose values are green from the vector.

> cols[cols != "green"]
[1] red    yellow red    red    red
Levels: green red yellow
> table (cols[cols != "green"])
 green    red yellow
     0      4      1 

In this example, we see that the green level is still present in the vector, even though none of the elements in the vector have that value. Moreover, the table() command acknowledges the empty level. This can be annoying when many levels are empty, but it can also be helpful when, for example, levels are months of the year and some sources omit some months. In this case, tables constructed from the different sources can be expected to line up nicely.
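When empty levels are unwanted, base R's droplevels() function, which is not otherwise discussed in this section, removes them. A minimal sketch:

```r
cols <- factor(c("red", "yellow", "green", "red",
                 "green", "red", "red"))
no.green <- cols[cols != "green"]
levels(no.green)                 # "green" is still a level

trimmed <- droplevels(no.green)  # discard levels with no elements
levels(trimmed)                  # "red" "yellow"
table(trimmed)
```

Whether to drop empty levels is a judgment call; as noted above, keeping them helps tables from different sources line up.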

Another way in which factor levels are resistant to change is shown in this example, where we try to change the value yellow to amber. We start by making a copy of cols called cols2.

> cols2 <- cols
> cols2[2] <- "amber"
Warning message:
In `[<-.factor`(`*tmp*`, 2, value = "amber") :
  invalid factor level, NA generated
> cols2
[1] red   <NA>  green red   green red   red
Levels: green red yellow

This assignment failed because amber is not one of the levels of the factor vector cols2. It would have been okay to assign to the yellow element of our vector the value green or red because those levels existed in the factor. But trying to assign a new value, such as amber, generates an NA. Notice how that NA is displayed with angle brackets, as <NA>, to help distinguish it from a legitimate level value of NA.

The levels() function shows you the set of levels in a factor, and you can use that function in an assignment to change the levels. Here, we show how we might have changed the yellow level to have a different label.

> levels(cols)
[1] "green"  "red"    "yellow"
> levels(cols)[3] <- "amber"
> cols
[1] red   amber green red   green red   red
Levels: green red amber

This operation changes only the level labels; the underlying integer values are not changed. Here, then, the labels are no longer in alphabetical order. We often want to control the order of the levels in our factors; a good example is when we tabulate a factor whose levels are the names of the months. By default, the levels are set alphabetically (April, then August, and so on, up to September) – this affects the output of the table() function (and more, like the way plots are laid out). The order of the levels can be specified in the original call to the factor() function, and we can re-order the levels using another call to factor(), as in this example:

> levels(cols)
[1] "green" "red"   "amber"
> factor (cols, levels = c("red", "amber", "green"))
[1] red   amber green red   green red   red
Levels: red amber green

In this example, we changed the order of the levels through use of the factor() function. To repeat, assigning to the levels() function changes only the labels, not the underlying integers. The following example shows one common error in factor handling, which is assigning levels directly.

> (bad.idea <- cols)
[1] red   amber green red   green red   red
Levels: green red amber
> levels(bad.idea) <- c("red", "amber", "green")
> bad.idea
[1] amber green red   amber red   amber amber
Levels: red amber green

Here, the elements of bad.idea that used to be red are now amber. If you use this approach, make sure this is what you wanted.

The feature of R that causes more data-cleaning problems than any other, we think, is this: Factor values are easy to convert into their integer codes but we almost never want this. In the following section, we see an example of how having a factor can produce unexpected results.

4.6.3 Converting and Combining Factors

To convert a factor f to character, simply use as.character(f). Actually, the help files tell us that it is “slightly more efficient” to use levels(f)[f]. Here, the interior [f] is indexing the set of levels after converting f, internally, to its underlying integer codes. Usually, we use the slightly less efficient approach because we think it is easier to read. R's conversion of factors to integers can be useful when exploited carefully; this arises more often in plotting than in data cleaning.
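The two conversions can be checked against each other. This is a brief sketch with a made-up factor.

```r
f <- factor(c("red", "amber", "green", "red"))

as.character(f)      # the readable approach
levels(f)[f]         # indexes the levels by the integer codes
as.integer(f)        # 3 1 2 3, the codes themselves

identical(as.character(f), levels(f)[f])
```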

When a factor f has text labels that look like integers, it is tempting to try to convert it directly into a numeric vector using as.numeric(). This is almost always a mistake; convert levels to numeric only after first converting to character. This example shows how this conversion can go wrong. We start by creating a factor containing levels that look numeric, except that one of the values in the vector (and therefore one of the levels of the factor) is the text string Missing. This factor is intended to give the indices of elements to be extracted from the vector src.

> wanted <- factor (c(2, 6, 15, 44, "Missing"))    # indices
> src <- 101:200                    # vector to extract from
> as.numeric (wanted)               # ...but this happens
[1] 2 4 1 3 5
> src[wanted]
[1] 102 104 101 103 105

When wanted is created, its text labels ("2", "6", …, "Missing") are stored, together with its integer codes. By default, these are assigned according to the alphabetical order of the labels; so "15" gets code 1, "2" gets code 2, "44" gets code 3, and so on. When we enter src[wanted], R uses these integer codes to extract elements from src. If we actually want the 2nd, 6th, 15th, and so on elements of src, we have to convert the elements of wanted to character first, and then to numeric, as in this example.

> src[as.numeric (as.character (wanted))]
[1] 102 106 115 144  NA
Warning message:
NAs introduced by coercion 

Here, the warning message is harmless – it indicates that the text Missing could not be converted to a numeric value.

One time that the behavior of factors can be helpful is when we need to convert text into numeric labels for whatever reason. For example, given a character vector sex containing the values "F" and "M", we might be called on to produce a numeric vector with 0 for "F" and 1 for "M". In this case, factor(sex) creates a factor with levels "F" and "M", stored as the codes 1 and 2; as.numeric(factor(sex)) creates a numeric vector with values 1 and 2; and therefore as.numeric(factor(sex)) - 1 produces a numeric vector with values 0 and 1.
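A minimal sketch of this conversion follows, with made-up data. Naming the levels explicitly is our own addition; it guards against the coding changing if the data contain an unexpected value.

```r
sex <- c("F", "M", "M", "F")           # hypothetical data

as.numeric(factor(sex)) - 1            # 0 1 1 0

# Stating the levels fixes which value maps to 0 and which to 1
coded <- as.numeric(factor(sex, levels = c("F", "M"))) - 1
coded
```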

It is surprisingly difficult to combine two factor vectors, even if they have the same levels. In versions of R before 4.1.0, c() converts both vectors to their underlying integer codes before combining them; later versions return a factor whose levels are the union of the inputs' levels. Our recommendation is to always convert factors to characters before doing anything else to them. There is one happy exception, though, when two or more data frames containing factors are being combined with rbind() (Section 6.5). Other than in this case, however, combining factor vectors will usually end badly. We recommend converting factors into character, combining, and then, if necessary, calling factor() to return the new vector to factor form.
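The recommended sequence, convert to character, combine, and re-factor, might be sketched as follows with two small made-up factors.

```r
f1 <- factor(c("a", "b"))
f2 <- factor(c("c", "a"))

# Combine as character, then return to factor form
combined <- factor(c(as.character(f1), as.character(f2)))

as.character(combined)   # "a" "b" "c" "a"
levels(combined)         # "a" "b" "c"
```

This gives the same result in any version of R, which is one reason to prefer it.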

4.6.4 Missing Values in Factors

Like other vectors, factors may have missing values. Missing values look like NA values in most vectors, but in factors they are represented by <NA> with angle brackets. This level is special and does not prevent you from having a real level whose value is actually <NA>, but you should avoid that. (Analogously, it's permitted to have the string value "NA", and it is a good idea to avoid that, too.) Values of a factor that are missing have no level. In this example, we create a vector with missing values, and also with values that are legitimately "NA" and "<NA>".

> (f <- factor (c("b", "a", "NA", "b", NA, "a", "c",
                  "b", NA, "<NA>")))
 [1] b    a    NA   b    <NA> a    c    b    <NA> <NA>
Levels: <NA> a b c NA # alphabetized by default
> table (f, exclude=NULL)
<NA>    a    b    c   NA <NA>
   1    2    3    1    1    2
> levels (f)
[1] "<NA>" "a"    "b"    "c"    "NA"  

The levels() function makes no mention of the true missing values, since they do not have a level. The first <NA> in the output of table() describes the final element of the vector, whereas the last <NA> refers to the two items that really were missing. Clearly there is a possibility of confusion here.

When elements of a factor vector are missing, the addNA() function can be used to add an explicit level (which is itself NA) to the factor. More often we want to replace the NA values with an explicit level so that, for example, those entries are accounted for in the result of table(). In this example, we show one way to add such a level.

> (f <- factor (c("b", "a", NA, "b", "b", NA, "c", "a")))
[1] b    a    <NA> b    b    <NA> c    a
Levels: a b c
> f <- as.character(f)                # Convert to character
> f[is.na (f)] <- "Missing"           # Replace missings
> (f <- factor (f))                   # Re-factorize
[1] b       a       Missing b       b       Missing c       a
Levels: a b c Missing

Here, the factor is converted to character, missing values replaced by a value like Missing, and then the vector converted back to factor.
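For comparison, here is a brief sketch of the addNA() alternative mentioned above. It keeps the missing entries as NA but gives them a level of their own, so table() counts them.

```r
f <- factor(c("b", "a", NA, "b", "b", NA, "c", "a"))

f.na <- addNA(f)
levels(f.na)         # "a" "b" "c" NA: the last level is itself NA
as.integer(f.na)     # the missing entries now have code 4
table(f.na)          # the NA level appears in the table
```

The explicit "Missing" label is usually easier to work with downstream, since a level that is itself NA is easy to overlook.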

4.6.5 Factors in Data Frames

Factors routinely appear in data frames, and, as we have mentioned, they are important in R modeling functions. Factors inside data frames act just like factors outside them (except sometimes when printing, as we saw with Chinese characters in an earlier example) – they have a fixed set of levels and they are represented internally as integers. A few points should be noticed here. First, as we mentioned above, R is not good at combining factor vectors on their own, but when data frames containing factors are combined with rbind(), R creates new factors from the factors in the input, extending the set of levels to include all the levels from both data frames. The set of levels is formed by concatenating the two initial sets; the levels are not re-sorted. (If a column is factor but its corresponding column in another data frame is character, then the resulting combined column will have the class of the column in the first data frame passed to rbind().) Second, applying functions to the rows of a data frame containing factors can produce unexpected results. We discuss applying functions to the rows of a data frame in Section 3.5 and the concerns there apply even more to data frames containing factor levels. Our recommendation is to not use apply() functions on data frames, particularly with columns of different types. Instead, use sapply() or lapply() on columns. If you need to process rows separately, loop over the rows with a command like lapply(1:nrow(mydf), function(i) ...) where your function operates on mydf[i,], the ith row of the data frame mydf.
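The behavior of rbind() with factor columns can be seen in a small sketch. The data frames and their column name are made up for illustration.

```r
# Two hypothetical data frames with a factor column in common
df1 <- data.frame(col = factor(c("red", "green")))
df2 <- data.frame(col = factor(c("blue", "red")))

both <- rbind(df1, df2)

as.character(both$col)   # "red" "green" "blue" "red"
levels(both$col)         # "green" "red" "blue": concatenated, not re-sorted
```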

4.7 R Object Names and Commands as Text

4.7.1 R Object Names as Text

In some data cleaning problems, a large set of related objects need to be created or processed. For example, there might be 500 tables stored in disk files, and we want to read them all into R, saving them in objects with names such as M2.2013.Jan, M2.2013.Feb, …, and M2.2016.Dec. Or, we might have data frames named p001, p002, …, p100 and we want to run a function on each one. It is easy enough to construct the set of names using paste() and sprintf() (see Section 4.2.1). But there is a distinction between the name "p001" (a character string with four characters) and the R object p001 (a data frame). The R command get() accepts a character string and returns the object with that name. (If there is no object by that name, an error is produced; the related function exists() can test to see whether such an object exists, and get0() allows a value to be specified in place of the error.)

One place where get() is useful is when examining the contents of your workspace. The ls() command returns the names of the objects there; by using get() in a loop we can apply a function to every object in the workspace. For example, the object.size() function reports the size of an object in your workspace in bytes (by default). This function operates on an object, not a name in character form. So often we do something like this: first, we produce the set of names of the objects of interest, perhaps with a command like projNames <- ls(pattern = "^projA") to identify all the names of objects that start with projA. Then, the command sapply(projNames, function(i) object.size(get(i))) passes each name to the function, and the function uses get() to produce the object itself and report its size. The result is a named vector of the sizes of every object in the workspace whose name starts with projA.
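The steps just described can be tried on a pair of throwaway objects. This is a sketch only; the object names projA.x and projA.y are hypothetical.

```r
# Two hypothetical objects whose names start with projA
projA.x <- 1:10
projA.y <- letters

projNames <- ls(pattern = "^projA")       # names, as character strings

# get() turns each name into the object, which object.size() then measures
sizes <- sapply(projNames, function(i) object.size(get(i)))
sizes
```

The exact sizes reported depend on the platform and version of R.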

The complement of get() is assign(). This takes a name and a value and creates a new R object with that name and value. Be careful; it will over-write an existing object with that name. Assigning is useful when each iteration of a loop produces a new object. In the following example, we use a for() loop to create an item named AA whose value is 1, one named BB with value 2, and so on, up to an object ZZ with value 26. (We used double letters here to avoid creating an item F that might conflict with the alias for FALSE.)

> for (i in 1:26)
       assign (paste0 (LETTERS[i], LETTERS[i]), i, pos = 1)
> get ("WW")               # Example
[1] 23
# Remove the 26 new objects from the workspace
> remove (list = grep ("^[A-Z]{2}$", ls (), value = TRUE))

The final command uses a regular expression to remove all items in the workspace whose names start (^) with an upper-case letter ([A-Z]), contain exactly two such letters ({2}), and then come to an end ($). Note that this pattern matches any two-letter upper-case name, such as AB, and not only doubled letters. The remove() and rm() commands operate identically. Notice the pos = 1 argument to assign(). At the command line this has no effect. Inside a function it creates a variable in your R workspace, not one local to the function. We discuss the notions of global and local variables in Section 5.1.2.

4.7.2 R Commands as Text

It is also possible to construct R commands as text and then execute them. Suppose in our earlier example that we have objects p001, p002, c04-math-006, p100 and we want to run a function report() on each one, producing results res001, c04-math-007, res100. We could use the get() and assign() approach from above, like this:

nm <- paste0 ("p", sprintf ("%03d", 1:100))    # Object names
res <- paste0 ("res", sprintf ("%03d", 1:100)) # Result names
for (i in seq_along (nm)) {                    # Loop over positions
    result <- report (get (nm[i]))             # Run function
    assign (res[i], result, pos = 1)           # Save under result name
}

But it is easy to think of more complicated examples where each call is different. Perhaps the caller needs to supply the month and year associated with a file as an argument, or perhaps each call requires an additional argument whose name also varies. In these cases, it can be useful to construct a vector of R commands using, say, paste0(), and then execute them. This requires a two-step process: first the text is passed to parse() with the text argument, to create an R “expression” object; then the eval() function executes the expression. For example, to compute the logarithm of 11 and assign it to log.11 we can use the command eval(parse(text = "log.11 <- log(11)")). After this command runs, our R workspace has a new variable called log.11 whose value is about 2.4.
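The two steps can be seen separately in this short sketch, which uses the log.11 command from the text. The names cmdText and ex are ours, chosen only for illustration.

```r
cmdText <- "log.11 <- log (11)"   # the command, as ordinary text

ex <- parse (text = cmdText)      # step 1: text becomes an expression
class (ex)                        # "expression"

eval (ex)                         # step 2: the expression is executed
round (log.11, 4)                 # 2.3979
```

Nothing is computed until eval() runs; parse() only checks that the text is syntactically valid R and converts it.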

Imagine having objects p001, p002, ..., p100 and also q001, q002, ..., q100, and suppose we wanted to run res001 <- report(p001, q001), res002 <- report(p002, q002) and so on. It is simple to construct a set of 100 character strings containing these commands:

> num <- sprintf ("%03d", 1:100)       # 001, 002, etc.
> pnm <- paste0 ("p", num)
> qnm <- paste0 ("q", num)
> rnm <- paste0 ("res", num)
> cmd <- paste0 (rnm, " <- report (", pnm, ", ", qnm, ")")
> cmd[45]                              # as an example
[1] "res045 <- report (p045, q045)"

Now all 100 reports can be run with the command eval(parse(text = cmd)). This approach can both save time and cut down on the errors associated with copying and modifying dozens – or hundreds – of similar lines of code.
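Here is a scaled-down sketch of the whole process that can be run as is. It uses three pairs of objects rather than 100, and report() here is a hypothetical stand-in that simply adds its two arguments; the real function would be whatever your project requires.

```r
report <- function (p, q) p + q       # hypothetical stand-in

num <- sprintf ("%03d", 1:3)          # 001, 002, 003
for (i in seq_along (num)) {          # create p001..p003, q001..q003
    assign (paste0 ("p", num[i]), i)
    assign (paste0 ("q", num[i]), 10 * i)
}

# Build the three commands as text, then run them all at once
cmd <- paste0 ("res", num, " <- report (p", num, ", q", num, ")")
eval (parse (text = cmd))

c (res001, res002, res003)            # 11 22 33
```

When parse() is given a vector of strings it returns an expression with one element per command, and eval() executes them in order.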

As a final example, we encountered a problem with some UTF-8 data (Section 4.5), which we solved with regular expressions (Section 4.4) and eval(). Under some circumstances, UTF-8 can be saved to disk as ASCII in a form like "<U+4E2D><U+56FD>", that is, with a literal representation of <, U, +, and so on. To convert this into “real” UTF-8, we used regular expressions and the gsub() command to delete each > and to replace each <U+ with \U. Of course, + and \ are special characters and will need to be escaped. Then we surrounded the entire result in quotation marks. The resulting string contains what we might have typed in at the R command line, and when it is executed with parse() and eval(), the UTF-8 characters are produced, as in this example:

> inp <- "<U+4E2D><U+56FD>"             # ASCII (not UTF-8)
> (out <- gsub (">", "", inp))          # remove > chars
[1] "<U+4E2D<U+56FD"
> (out <- gsub ("<U\\+", "\\\\U", out)) # change <U+ to \U
[1] "\\U4E2D\\U56FD"
> (out <- paste0 ("\"", out, "\""))     # add quotes
[1] "\"\\U4E2D\\U56FD\""
> eval (parse (text = out))
[1] "中国"
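For comparison, the same conversion can be done without eval(). The sketch below is an alternative technique, not the approach described above: it pulls out the hexadecimal code points, converts them to integers with strtoi(), and builds the string with intToUtf8(). It assumes the input consists only of <U+XXXX> items, with no ordinary text mixed in.

```r
inp <- "<U+4E2D><U+56FD>"                 # ASCII (not UTF-8)

# Extract the hex digits found between each "<U+" and ">"
hex <- regmatches (inp,
           gregexpr ("(?<=<U\\+)[0-9A-Fa-f]+(?=>)", inp, perl = TRUE))[[1]]

codes <- strtoi (hex, base = 16L)         # hex text to integer code points
out <- intToUtf8 (codes)                  # code points to one UTF-8 string
```

This version never executes text as code, which can be a comfort when the input comes from an outside source.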

4.8 Chapter Summary and Critical Data Handling Tools

This chapter discusses character data, which forms an important part of almost every data cleaning project. Even if you have very little data in text form you will need to be proficient at handling text in order to modify column names or to operate on multiple files across multiple directories. This chapter includes discussion of these important R tools:

  • The substring() function, which extracts a piece of a string as identified by the starting and ending positions. This function and the others in this chapter are made more powerful by the fact that they are vectorized, so they can operate on a whole set of strings at once.
  • The format() and sprintf() functions, which help convert numeric values into nicely-formatted strings. The sprintf() function in particular provides a powerful interface for formatting values into report-like strings. Also handy here is the cut() function, which lets us convert a numeric variable into a categorical one for reporting or modeling.
  • The paste() and paste0() functions. These combine strings into longer ones in a vectorized way. We use the paste functions in every data cleaning project.
  • Regular expression functions. These functions (grep() and grepl(), regexpr() and gregexpr(), sub() and gsub(), and strsplit()) use regular expressions to find, extract, or replace parts of strings that match patterns. The power of regular expressions comes from the flexibility that the patterns provide. Regular expressions form a big subject, but we find that even a limited knowledge of them makes data cleaning much easier and more efficient.
  • Tools for UTF-8. UTF-8 describes a particular, popular encoding of the set of Unicode characters. More and more data cleaning problems will involve non-Roman text and R provides tools for handling these strings.
  • Factors. Factors are indispensable in some modeling contexts in R, and they provide for efficient storage of text items. In data cleaning tasks, however, they often get in the way. Remember to convert factors, even ones that look numeric, into character before converting the result into numeric.
  • The get() and assign() functions. These let us manipulate R objects by name, even when the name is held in an R object. This can make some repetitive tasks much simpler. The combination of parse() and eval() lets us construct R commands and execute them – again, allowing us to execute sequences of commands once we have created them with paste() and other tools.
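As a brief reminder of a few of these tools in action, here is a small sketch. The data values and the object names (yrs, fnames, f, codes, vals) are made up for illustration.

```r
# substring() is vectorized over its string argument
yrs <- substring (c ("2021-03", "2022-11"), 1, 4)       # "2021" "2022"

# sprintf() pads the numbers; paste0() glues the pieces together
fnames <- paste0 ("file", sprintf ("%03d", c (7L, 42L)))
fnames                                                  # "file007" "file042"

# A factor that looks numeric
f <- factor (c ("10", "20", "10", "5"))
codes <- as.numeric (f)                  # the integer codes, not the values
vals  <- as.numeric (as.character (f))   # the values: 10 20 10 5
```

The last two lines show why the conversion from factor to numeric should go through character first.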