8. Data Munging with data.table

Matt Wiley and Joshua F. Wiley

Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA

We already introduced the data.table package (Dowle, Srinivasan, Short, and Lianoglou, 2015). The data.table package is the heart of this chapter, which covers the basics of accessing, editing, and manipulating data under the broad term data management. Although not glamorous, data management is a critical first step to data visualization or analysis. Furthermore, the majority of time spent on an analysis project often goes to data management. For example, running a linear model in R takes one line of code, once the data is clean and in the expected format. Data management is challenging because raw data comes in all types, shapes, and formats, and missing data is common. In addition, you may also have to combine or merge separate data sources. In this chapter, we go beyond the basic use of data.table to more-complex data management tasks.

There tend to be two stages to this sort of data wrangling. One-time conversions are often done manually, as writing code is not efficient if it will not be reused (for example, changing one or two variable names, or renaming a data file for consistency). For operations needing repetition (for example, renaming or labeling hundreds of variables) or for larger data, more programming is used for data management. Add the stringdist package (van der Loo, 2014) to your checkpoint (Microsoft, 2016); another needed library is foreign (R Core Team, 2015). We run checkpoint() as well as code to make data tables print in a neat fashion:

library(checkpoint)              
checkpoint("2016-09-04", R.version = "3.3.1")


library(stringdist)
library(data.table)
library(foreign)
options(width = 70) # only 70 characters per line


options(stringsAsFactors = FALSE,
        datatable.print.nrows = 20,
        datatable.print.topn = 3,
        digits = 2)

Data Munging / Cleaning

To obtain data in need of this second stage of munging, we download data from the National Survey of Children’s Health, 2003 (ICPSR 4691) at www.icpsr.umich.edu/icpsrweb/ICPSR/studies/4691, where we chose the Stata data format. We place the file in our working directory and use read.dta() to read the data into R. After converting it to a data table, we set the key to be the identification number column IDNUMR:

d <- read.dta("ICPSR_04691/DS0001/04691-0001-Data.dta")              
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
d <- as.data.table(d)
setkey(d, IDNUMR)

To get a sense of the structure of this much data (there are over 300 variables), we use the str() function. We suppress attributes in order to focus on the column names and the types of data inside each column. Setting strict.width = "cut" enforces the width option we set earlier, and we look at only the first 20 columns.

str(d, give.attr = FALSE, strict.width = "cut", list.len = 20)              
Classes ‘data.table’ and 'data.frame': 102353 obs. of  301 variables:
 $ IDNUMR  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ STATE   : Factor w/ 51 levels "1-AK","2-AL",..: 35 25 45 18 36 37..
 $ MSA_STAT: Factor w/ 4 levels "-2 - MISSING",..: 4 4 4 4 4 4 4 4 4..
 $ AGEYR_CH: int  12 1 10 7 9 11 9 15 17 1 ...
 $ TOTKIDS4: Factor w/ 4 levels "1 - 1 CHILD",..: 1 1 1 3 4 2 3 3 1 ..
 $ AGEPOS4 : Factor w/ 5 levels "1 - ONLY CHILD",..: 1 1 1 4 2 2 3 2..
 $ S1Q01   : Factor w/ 8 levels "-4 - PARTIAL INTERVIEW",..: 6 5 6 5..
 $ RELATION: Factor w/ 6 levels "-2 - MISSING",..: 3 2 3 3 2 2 2 4 3..
 $ TOTADULT: Factor w/ 6 levels "-2 - MISSING",..: 4 3 2 3 2 3 3 2 4..
 $ EDUCATIO: Factor w/ 6 levels "-2 - MISSING",..: 4 4 4 4 3 4 4 3 4..
 $ PLANGUAG: Factor w/ 5 levels "-2 - MISSING",..: 2 2 2 2 2 2 2 2 2..
 $ S2Q01   : Factor w/ 11 levels "-4 - PARTIAL INTERVIEW",..: 5 5 5 ..
 $ S2Q02R  : int  59 29 49 45 96 64 51 64 65 33 ...
 $ HGHT_FLG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ S2Q03R  : int  100 20 55 60 98 115 64 135 115 26 ...
 $ WGHT_FLG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ BMICLASS: Factor w/ 5 levels "-2 - MISSING",..: 3 1 3 5 1 3 3 3 3..
 $ S2Q04   : Factor w/ 8 levels "-4 - PARTIAL INTERVIEW",..: 5 5 5 5..
 $ S2Q05   : Factor w/ 8 levels "-4 - PARTIAL INTERVIEW",..: 4 4 4 4..
 $ S2Q06   : Factor w/ 8 levels "-4 - PARTIAL INTERVIEW",..: 4 4 4 4..
  [list output truncated]

Many columns contain missing data that is currently stored as levels of a factor, rather than as R’s NA to indicate that the values are missing. With over 100,000 rows, we definitely need to automate any changes. From the study’s documentation, we can find the values that we will recode to missing. These values have consistent labels, but their numbers change depending on the number of legitimate levels in a variable. We use the table() function to generate a frequency table of the unique values in a variable. This is useful both as a way to see the unique values and to become familiar with the data and how much is missing (relatively little for variables such as education, and more for BMI class). We want to recode partial interview, not in universe, missing, legitimate skip, don’t know, and refused to NA in R. However, the labels are not identical: EDUCATIO has the number 97 for refused, whereas S1Q01 has 7 for refused.

table(d[, EDUCATIO])                

                      -2 - MISSING          1 - LESS THAN HIGH SCHOOL
                                 3                               4661
2 - 12 YEARS, HIGH SCHOOL GRADUATE          3 - MORE THAN HIGH SCHOOL
                             21238                              76022
                   96 - DON'T KNOW                       97 - REFUSED
                               324                                105
table(d[, PLANGUAG])


          -2 - MISSING            1 - ENGLISH 2 - ANY OTHER LANGUAGE         3 - DON'T KNOW
                     1                  94380                   7912                     51
           4 - REFUSED
                     9
table(d[, BMICLASS])


           -2 - MISSING           1-UNDERWEIGHT         2-NORMAL WEIGHT 3-AT RISK OF OVERWEIGHT
                  19963                    5921                   45650                   12207
           4-OVERWEIGHT
                  18612
table(d[, S1Q01])


-4 - PARTIAL INTERVIEW   -3 - NOT IN UNIVERSE           -2 - MISSING   -1 - LEGITIMATE SKIP
                     0                      0                      1                      0
              1 - MALE             2 - FEMALE         6 - DON'T KNOW            7 - REFUSED
                 52554                  49719                     14                     65

Recoding Data

In addition to programmatically finding cases with values to recode to missing (for example, 7 - REFUSED in S1Q01, and 97 - REFUSED in EDUCATIO), we also want to drop those levels from the factor after converting them to missing. We can solve this by using regular expressions to search for the character strings we know. For example, we know that a refused label always ends in REFUSED, regardless of whether it starts with 7 - or 97 - or anything else. We can search using regular expressions via the grep() function, which returns the value or the numeric position of matches, or the grepl() function, which returns a logical (hence the l) vector of whether each element matched or not.

Note

Regular expressions are a powerful tool for finding and matching character string patterns. In R, grep() and grepl() return the positions or a logical vector of which vector elements match the expected pattern. Combined with replacement, regular expressions help recode data.

We show these two functions in action with a simple example. The first formal argument, pattern, is the regular expression used for matching, while x is the data to search through. Notice the difference between the logical vector and the numeric positions. Either way, the second and fourth locations of x contain our matching letters of abc:

grep( pattern = "abc",  x = c("a", "abc", "def", "abc", "d"))                
[1] 2 4
grepl( pattern = "abc", x = c("a", "abc", "def", "abc", "d"))
[1] FALSE  TRUE FALSE  TRUE FALSE

So far, we have a basic regular expression. We could use this for our data, searching for REFUSED instead of abc, although it is good to be as accurate as possible. For example, what if one variable has 1 - REFUSED TREATMENT, 2 - DID NOT REFUSED TREATMENT, 3 - REFUSED (ignoring the grammar), where 1 and 2 are valid responses and 3 indicates a refusal to answer the question? A search for REFUSED would match all of those. We ultimately want to use grepl(), as logical values are useful indexes to set specific cases to NA. To ensure that we are grabbing the right values while checking that our regular expression is accurate, though, we use grep() with value = TRUE, which returns the matching strings. This makes it easy to see what we are matching.

grep( pattern = "REFUSED",                
      x = c( "1 - REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT", "3 - REFUSED"),
      value = TRUE)
[1] "1 - REFUSED TREATMENT"    "2 - DID NOT REFUSED TREATMENT"        "3 - REFUSED"

To make it more specific, we could add the hyphen (-), which, although more precise, still gives us a false positive:

grep( pattern = "- REFUSED",                
       x = c( "1 - REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT", "3 - REFUSED"),
       value = TRUE)
[1] "1 - REFUSED TREATMENT" "3 - REFUSED"  

To go further, we specify that - REFUSED must be the last part of the string. That is, nothing can come after - REFUSED. This is done by using $ in regular expressions to signify the end of the pattern string:

grep( pattern = "- REFUSED$",                
      x = c( "1 - REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT", "3 - REFUSED"),
      value = TRUE)
[1] "3 - REFUSED"

Now, your data may be different, so we need a variety of ways to refine exactly what we match. Regular expressions allow us to specify what we expect to find. The following pattern, although short, contains much information. The [0-9] signifies the digits 0 to 9, and the + means the previous expression should occur one or more times (but not zero times). This allows for both 1 and 97, but not a blank, and not anything but a number. After one or more numbers, the regular expression indicates that we expect to find - REFUSED and then the end of the string, which we specify using the dollar sign. We extend our possible values to match for our x value and specify our pattern precisely, with the results shown here:

grep( pattern = "[0-9]+ - REFUSED$",                
      x = c( "1 - TREATMENT REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT",
            "3 - JACK - REFUSED", "4 - REFUSED", "97 - REFUSED", "- REFUSED"),
      value = TRUE)
[1] "4 - REFUSED"  "97 - REFUSED"

As a final step, we know that if we want REFUSED, the string starts with a number as well. Because it may be a negative number, we expand our previous expression to search for a minus sign zero or more times at the start of the string, followed by one or more digits. We indicate the start of the string by using ^, and we indicate zero or more times by using *. By using this regular expression, we help protect against matching legitimate values in the data and errantly converting them to NA. We are quite specific in what we want, and anything that does not match our pattern is rejected.

grep( pattern = "^[-]*[0-9]+ - REFUSED$",                
      x = c( "1 - TREATMENT REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT",
             "3 - JACK - REFUSED", "4 - REFUSED", "97 - REFUSED", "-97 - REFUSED",
             "- REFUSED", "TRICK CASE 4 - REFUSED"),
      value = TRUE)
[1] "4 - REFUSED"   "97 - REFUSED"  "-97 - REFUSED"

By default, regular expressions are also case sensitive. In-depth coverage of using regular expressions is beyond the scope of this book, but we explain some examples throughout this chapter. We recommend Mastering Regular Expressions by Jeffrey Friedl (O’Reilly Media, 2006) for readers who are interested in comprehensive coverage of regular expressions. Regular expressions are quite useful for matching and working with string data because they allow us to encode very specific searches.
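As a quick, made-up illustration of case sensitivity (these strings are not from our data), a lowercase pattern misses the uppercase labels unless we pass ignore.case = TRUE to grep():

grep( pattern = "refused", x = c("97 - REFUSED", "4 - refused"), value = TRUE)
[1] "4 - refused"
grep( pattern = "refused", x = c("97 - REFUSED", "4 - refused"), value = TRUE, ignore.case = TRUE)
[1] "97 - REFUSED" "4 - refused"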

Going back to our data, we want to match more than just REFUSED. One option would be to create individual regular expressions for each and loop through, but this is cumbersome and inefficient. Instead, we can modify our regular expression to indicate different options in some positions. This is done by using a pipe, |, which functions as or, to separate options. In the following code, we keep the specifics about the start of the string but allow REFUSED or MISSING to be the ending text:

grep( pattern = "^[-]*[0-9]+ - REFUSED$|MISSING$",                
      x = c( "1 - TREATMENT REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT",
             "3 - JACK - REFUSED", "4 - REFUSED", "97 - REFUSED", "-97 - REFUSED",
             "-2 - MISSING", "- REFUSED", "TRICK CASE 4 - REFUSED", "4 - REFUSED HELP"),
      value = TRUE)
[1] "4 - REFUSED"   "97 - REFUSED"  "-97 - REFUSED" "-2 - MISSING"

Now we are ready to replace responses with NA for the patterns we want. As you see, the regular expression has become rather complex. A word of caution: many symbols have special meanings in regular expressions, so searching for them requires special care. For instance, you have seen that * is used to indicate zero or more times, not a literal asterisk. To search for a literal special character, it must be escaped by using a backslash (\). We now build the pattern that ought to identify all the missing-value codes we would like to find in our data set on child health. We pull the pattern out as a separate object for readability and run one last test on our made-up x data. Note that the pattern must be written as a single string, with no extra space or line break before the NOT IN UNIVERSE$ portion.

p <- "^[-]*[0-9]+ - REFUSED$|MISSING$|DON'T KNOW$|LEGITIMATE SKIP$|PARTIAL INTERVIEW$|NOT IN UNIVERSE$"


grep( pattern = p,
     x = c( "1 - TREATMENT REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT",
            "3 - JACK - REFUSED", "4 - REFUSED", "97 - REFUSED", "-97 - REFUSED",
            "-2 - MISSING", "96 - DON'T KNOW", "-4 - LEGITIMATE SKIP", "-3 - PARTIAL INTERVIEW",
            "-2 - NOT IN UNIVERSE", "- REFUSED", "TRICK CASE 4 - REFUSED",
            "4 - PARTIAL INTERVIEW OF DOCTOR"),
     value = TRUE)
[1] "4 - REFUSED"          "97 - REFUSED"        "-97 - REFUSED"       "-2 - MISSING"          
[5] "96 - DON'T KNOW"        "-4 - LEGITIMATE SKIP"   "-3 - PARTIAL INTERVIEW" "-2 - NOT IN UNIVERSE"  

Now that we can find cases we want to set to missing, let’s do it! Rather than type each variable, we loop through them in data.table, setting matching cases to NA. If a variable is a factor, we use droplevels() to remove its unused levels; otherwise, we return it as is. To replace each variable, we use the := operator introduced in the previous chapter in a new way. Previously, you saw how to create a variable in a data table as dat[, NewVar := value]. Now we are going to replace multiple variables at once. To do this, we pass a vector of variable names on the left of the assignment operator, :=, and a list of the values on the right-hand side. Because the variables already exist in the data set, we are overwriting them rather than creating new ones. The variable names are stored in the vector, v. We wrap v in parentheses so that data.table evaluates v as an R object name containing a vector of variable names. If we did not wrap v in parentheses, data.table would try to assign the results to a variable named v.

The next challenge is selecting the variables. We normally type unquoted variable names in a data table, but our variable names are stored in the vector v. To accomplish this, we leverage an internal object in data.table, .SD, which is essentially a list version of the data table. However, we do not want to loop through every single variable in the data table. Although .SD defaults to being all variables in the data table, we can make .SD contain only a subset of the variables by using the .SDcols argument. .SDcols is a formal argument that takes a character vector and uses that to control the variables included in the internal .SD object. Thus, we specify the variables to include in .SD by writing .SDcols = v. This operation does not print any easy-to-read values, so instead we return to EDUCATIO and S1Q01:

v <- c("EDUCATIO", "PLANGUAG", "BMICLASS", "S1Q01", "S2Q01")                  
d[, (v) := lapply(.SD, function(x) {
  x[grepl(pattern = p, x)] <- NA
  if (is.factor(x)) droplevels(x) else x }), .SDcols = v]


table(d[, EDUCATIO])

         1 - LESS THAN HIGH SCHOOL 2 - 12 YEARS, HIGH SCHOOL GRADUATE
                              4661                              21238
         3 - MORE THAN HIGH SCHOOL
                             76022
table(d[, S1Q01])


  1 - MALE 2 - FEMALE
     52554      49719

Note that there is some inefficiency here because the data is copied each time the function is run inside the data table. It would be more efficient to select the relevant rows in data.table and then set those as missing. The difficulty here is that our variables are factors, and we want to drop the excess levels or coerce them to characters. We did not coerce them to factors in the first place; the data came that way from the Stata data file. This is simply part of dealing with the data that we get. If we had all character data, we might have used a slightly more efficient idiom, as sketched next. If we wished, we might also have converted all factors to the character class at the beginning.
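As a minimal sketch (assuming, hypothetically, that the columns in v were character class rather than factors, so no level bookkeeping is needed), data.table’s set() could update only the matching rows by reference:

## hypothetical sketch: valid only if the columns in v are character class
for (k in v) {
  set(d, i = grep(pattern = p, x = d[[k]]), j = k, value = NA_character_)
}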

Note

The gsub() function combines matching with regular expressions, and replaces the portions of a string that match the regular expression with new text in this form:

gsub(pattern = "regex", replacement = "new values", x = "original data").

Another recoding task we may want to do is to drop the numbers so that only the labels remain. This can be accomplished using the gsub() function, which performs regular expression matching within a string and then replaces matches with specified values (which can be a zero-length string). We first look at the following simple example:

gsub( pattern = "abc", replacement = "",  x = c("a", "abcd", "123abc456"))
[1] "a"      "d"      "123456"

Because it just uses regular expressions to match, we can adapt the start of our earlier pattern to match and remove the leading numbers, testing it out as before:

p.remove <- "^[-]*[0-9]+ - "                
gsub( pattern = p.remove, replacement = "",
      x = c( "1 - TREATMENT REFUSED TREATMENT", "2 - DID NOT REFUSED TREATMENT",
             "3 - JACK - REFUSED", "4 - REFUSED", "97 - REFUSED", "-97 - REFUSED",
             "-2 - MISSING", "96 - DON'T KNOW", "-4 - LEGITIMATE SKIP",
             "-3 - PARTIAL INTERVIEW", "-2 - NOT IN UNIVERSE", "- REFUSED",
             "TRICK CASE 4 - REFUSED", "4 - PARTIAL INTERVIEW OF DOCTOR"))
 [1] "TREATMENT REFUSED TREATMENT" "DID NOT REFUSED TREATMENT"   "JACK - REFUSED"
 [4] "REFUSED"                     "REFUSED"                     "REFUSED"
 [7] "MISSING"                     "DON'T KNOW"                  "LEGITIMATE SKIP"
[10] "PARTIAL INTERVIEW"           "NOT IN UNIVERSE"             "- REFUSED"
[13] "TRICK CASE 4 - REFUSED"      "PARTIAL INTERVIEW OF DOCTOR"

For efficiency, we can combine this with our previous code to set some cases to missing. Doing it in one step avoids repetitive processing. First, though, we read the data back in so we have the raw data:

d <- read.dta("ICPSR_04691/DS0001/04691-0001-Data.dta")                
d <- as.data.table(d)
setkey(d, IDNUMR)

Because gsub() coerces the input to characters regardless of whether we pass factors or characters to it, we will get characters out. So, to know whether the input was a factor, we create an object, f, a logical indicating whether x started off as a factor. Then, rather than drop levels, we just convert the result back to a factor if appropriate. Other than that change, we meld our earlier code with our removal code:

d[, (v) := lapply(.SD, function(x) {                  
   f <- is.factor(x)
   x[grepl(pattern = p, x)] <- NA
   x <- gsub(pattern = p.remove, replacement = "", x)
   if (f) factor(x) else x }), .SDcols = v]


table(d[, EDUCATIO])

12 YEARS, HIGH SCHOOL GRADUATE          LESS THAN HIGH SCHOOL          MORE THAN HIGH SCHOOL
                         21238                           4661                          76022
table(d[, S1Q01])


FEMALE   MALE
 49719  52554


table(d[, S2Q01])

EXCELLENT      FAIR      GOOD      POOR VERY GOOD
    65252      2189     10680       297     23903

Recoding Numeric Values

Because numeric values have a more systematic structure, recoding them is easier than recoding string data. To indicate different types of “missingness” in numeric data, people often use out-of-bounds values, such as negative numbers or very high numbers (for example, 999). Coding different types of missing values makes sense from a data perspective, as the codes provide additional information (for example, skipped a question, not an applicable question). However, from a practical use perspective, we typically want to ignore all missing data (for example, when calculating the mean, it makes no sense to include -2 as a child’s age when the parent refused to report it or -3 if a parent forgot). We can find values that fall within a range by using the %between% operator in the data.table package, and we can locate the complement by using ! as before.
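To see the operator in isolation, consider a small made-up example; note that %between% is inclusive of both endpoints:

c(-2, 5, 96) %between% c(0, 90)
[1] FALSE  TRUE FALSE
!(c(-2, 5, 96) %between% c(0, 90))
[1]  TRUE FALSE  TRUE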

For example, for height (S2Q02R), the documentation indicates that values zero or below, and above 90, are used to code various types of missing values. Similarly, for weight (S2Q03R), values zero or below, as well as above 900, are used to code different types of missing values. To tell R that these values indicate missing, we need to recode them. The table() output that follows shows that there are indeed some missing values, and we can see exactly which codes are used (-2, 96, 97, 996, 997):

table(d[!S2Q02R %between% c(0, 90), S2Q02R])                  

  -2   96   97
   4 8096  131
table(d[!S2Q03R %between% c(0, 900), S2Q03R])


  -2  996  997
   4 2375   80

Again we could operate on these variables individually. However, that becomes cumbersome for many variables. The pattern often is that if valid values are less than 9, then 9 is missing. If values go into double digits, then >90 is missing, and so on. We recode programmatically by examining the maximum value and setting the bounds accordingly. We see what the variable was like originally with the following code and in Figure 8-1. To select a single variable while maintaining a data table, rather than a numeric vector, we wrap the variable name as a list by using the .() shorthand for the list() function, so that hist() dispatches to a method for data tables, producing somewhat more elegant histograms:

Figure 8-1. Outliers that are near >900 and >90
hist(d[, .(S2Q03R)])                
hist(d[, .(S2Q02R)])

We focus on three columns that need numeric recoding, looping through each column name in our vector. Designating m as our sorted vector of candidate maxima, we start with j and i set equal to 1. As long as j equals 1 and i is less than or equal to the number of values to try (9, 99, 999), we keep going. If the maximum of variable k (ignoring missing values) is less than the ith value of m, we set j to 0 (which breaks the loop) and replace any values outside the range from 0 to an upper bound, where the upper bound is m[i] - 9 when m[i] is greater than 90, or m[i] minus some minuscule number otherwise, setting those values to missing (NA_integer_). In essence, for each column we find the smallest candidate maximum that exceeds the column’s observed maximum, and then make sure that, effectively, values from 9 to 10, or 90 to 100, or 900 to 1,000, as well as negative values, are set to missing. Note that we use NA_integer_ rather than NA because we are replacing a subset of values in an existing variable; data.table expects the class of the replacement data to match the class of the variable. We could use NA earlier because we overwrote the entire variable rather than replacing selected values within it.

v2 <- c("S2Q02R", "S2Q03R", "AGEYR_CH")
m <- sort(c(9, 99, 999))


for (k in v2) {
  j <- i <- 1
  while (j == 1 & i <= length(m)) {
    ## is m[i] the smallest candidate maximum above the observed maximum?
    if (max(d[[k]], na.rm = TRUE) < m[i]) {
      j <- 0 ## stop the while() loop after this pass
      ## anything outside [0, upper] is a missing-value code, where upper
      ## is m[i] - 9 for double digits and up, or just below m[i] otherwise
      d[!(get(k) %between% c(0, ifelse(m[i] > 90,
        m[i] - 9, m[i] - 1e-9))), (k) := NA_integer_]
    } else {
      i <- i + 1
    }
  }
}

We show the result after recoding, using the same histogram function as before, in Figure 8-2. Notice that n is a fair bit lower in both cases, and naturally the x-axis range is much tighter.
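The calls are identical to those used for Figure 8-1:

hist(d[, .(S2Q03R)])
hist(d[, .(S2Q02R)])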

Figure 8-2. The outliers are gone because histograms ignore missing values

With this, we conclude our section on data hygiene. Next, we show how to create new variables to simplify the analysis process. From a data science perspective, variable creation is sometimes called feature creation. In many domains of social and behavioral research, questionnaires or surveys are commonly employed. In education, tests are administered. Regardless of whether the data is from tests, questionnaires, or surveys, a common task is to aggregate responses to individual items or variables to create a scale score, the overall test score, or some other aggregate index.

Creating New Variables

A questionnaire asks whether a doctor or health professional ever told the respondent that the focal child had any of nine possible conditions (for example, asthma, ADD or ADHD, depression or anxiety problems, diabetes, developmental delay, or physical impairment). The goal is to create a composite variable capturing the number of issues. Some problems apply only to older children, and some respondents may not know the answer or may have refused to answer. Thus, we might take the average of the yes responses to calculate the proportion of yes responses out of all valid responses per respondent.

We first create a vector of the nine variable names. From there, we select those columns from our data table and unlist() them to see all possible responses, which may require cleanup to reduce to a yes/no scenario. Notice in the following code that there are several cases of missing data, legitimate skips, don’t know, and refused:

v.health <- paste0("S2Q", c(19, 20, 21, 22, 23, 24, 26, 35, 37))                
v.health
[1] "S2Q19" "S2Q20" "S2Q21" "S2Q22" "S2Q23" "S2Q24" "S2Q26" "S2Q35" "S2Q37"
table(unlist(d[, v.health, with = FALSE]))


-4 - PARTIAL INTERVIEW   -3 - NOT IN UNIVERSE           -2 - MISSING
                     0                      0                    120
  -1 - LEGITIMATE SKIP                 0 - NO                1 - YES
                 48388                 833454                  37720
        6 - DON'T KNOW            7 - REFUSED
                  1362                    133

We already know how to clean these sorts of variables by using regular expressions and grep(). After cleanup, we check all the responses again, including NAs in our table with the argument useNA = "ifany". Again, the pattern must be written as a single string, with no extra space or line break before the NOT IN UNIVERSE$ portion.

p <- "^[-]*[0-9]+ - REFUSED$|MISSING$|DON'T KNOW$|LEGITIMATE SKIP$|PARTIAL INTERVIEW$|NOT IN UNIVERSE$"


d[, (v.health) := lapply(.SD, function(x) {
   x[grepl(pattern = p, x)] <- NA
   if (is.factor(x)) droplevels(x) else x
}), .SDcols = v.health]


table(unlist(d[, v.health, with = FALSE]), useNA = "ifany")

 0 - NO 1 - YES    <NA>
 833454   37720   50003

Now we can create a new variable that is the average of the yes responses. To calculate the average by row, we use the rowMeans() function. We leverage the fact that logical values are stored as FALSE = 0 and TRUE = 1. Thus, we test whether the character data is equal to 1 - YES, which will return TRUE or FALSE, and then take the row means of those logical values. We ignore (drop) missing values by setting na.rm = TRUE.

d[, HealthConditions := rowMeans(d[, v.health, with = FALSE] == "1 - YES", na.rm = TRUE)]                

table(round(d$HealthConditions, 2), useNA = 'ifany')

    0  0.11  0.12  0.14  0.17   0.2  0.22  0.25  0.29  0.33  0.38   0.4  0.43  0.44
77426 16116   239    20     5   676  4386   156    11  1933    75    81    12   768
  0.5  0.56  0.57   0.6  0.62  0.67  0.71  0.75  0.78  0.88  0.89   NaN
   51   264     3     3    13    80     1     5    19     1     3     6

We could follow a similar process with rowSums() to get the row counts. Looking at the counts, though, something has happened: there are no missing values! This is because with rowSums(), when na.rm = TRUE and a row has no valid data, its sum is zero, not a missing value. To correct this, we will need to manually set to missing any cases/rows where all responses are missing:

d[, NHealthConditions := rowSums(d[, v.health, with = FALSE] == "1 - YES", na.rm = TRUE)]                
table(d$NHealthConditions, useNA = 'ifany')


    0     1     2     3     4     5     6     7     8
77432 17068  4630  2018   820   278    84    20     3
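A tiny base-R example illustrates the underlying behavior: a row consisting entirely of missing values sums to zero once na.rm = TRUE drops them all.

rowSums(matrix(c(NA, NA, NA), nrow = 1), na.rm = TRUE)
[1] 0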

We count the number of missing responses per respondent, select rows where the number missing equals the total, and set those to missing. Again note that when replacing a subset of values from a variable in data.table, match the class of missing to the class of the variable (for example, NA_integer_ for integer, NA_real_ for numeric, NA_character_ for character, NA for logical). That gives us a better count:

d[, NMissHealthConditions := rowSums(is.na(d[, v.health, with = FALSE]))]                
d[NMissHealthConditions == length(v.health), NHealthConditions := NA_integer_]
table(d$NHealthConditions, useNA = 'ifany')


    0     1     2     3     4     5     6     7     8  <NA>
77426 17068  4630  2018   820   278    84    20     3     6

Using rowMeans() or rowSums() to create new variables is a relatively elegant solution. However, it is not a very data.table approach. The calculation occurs outside the original data.table; we call the object, d, a second time and subset the columns for use. When calling the data object a second time, we lose access to many features of data.table, such as performing operations for only certain subsets of the data, or performing an operation by some grouping factor, as illustrated next.
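For instance (a purely illustrative call using the variable we just created; we omit the 51 rows of output), an operation computed inside the data table can be grouped directly:

d[, mean(NHealthConditions, na.rm = TRUE), by = STATE]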

If ignoring missing values is not an issue, we can easily work directly within data.table by using the Reduce() function. Reduce() takes a function (typically, a binary operator) as its first argument, and a vector or list of arguments. We show examples here of a vector and of a list with addition, division, and powers:

Reduce(`+`, c(1, 2, 3))              
[1] 6
Reduce(`+`, list(1:3, 4:6, 7:9))
[1] 12 15 18
Reduce(`/`, list(1:3, 4:6, 7:9))
[1] 0.036 0.050 0.056
Reduce(`^`, list(1:3, 4:6, 3:1))
[1]    1 1024  729

This can then easily be applied to variables in a data.table. We cannot directly add the health variables, because they are factor class data. However, we could write a function to deal with it:

fplus <- function(e1, e2) {
  ## convert "0 - NO"/"1 - YES" factors to numeric 0/1
  ## (the underlying level codes are 1 and 2, so subtract 1)
  if (is.factor(e1)) {
    e1 <- as.numeric(e1) - 1 }
  if (is.factor(e2)) {
    e2 <- as.numeric(e2) - 1 }
  e1 + e2
}
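As a quick check on made-up data (this assumes the factor carries both levels, in the order "0 - NO" then "1 - YES", so that the underlying codes 1 and 2 map to 0 and 1):

f <- factor(c("0 - NO", "1 - YES", "1 - YES"), levels = c("0 - NO", "1 - YES"))
fplus(f, f)
[1] 0 2 2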

We select the columns by using .SDcols, and then .SD will be a list of variables that we can reduce and store results in a new variable:

d[, NHealthConditions2 := Reduce(fplus, .SD), .SDcols = v.health]                
table(d$NHealthConditions2)


    0     1     2     3     4     5     6     7     8
65459 16116  4386  1927   768   264    79    19     3

Fuzzy Matching

Approximate string matching, or fuzzy matching, is a technique whereby observed values are linked to a reference list. For example, you may have a list of registered users, and then individuals write their names on an attendance sheet. Alternatively, you may be working with filenames that are supposed to match certain pieces of information.

We start with two lists; the reference names are our registered user list. These are the names we trust and want to match to the observed names that were written quickly on our hypothetical attendance sheets. Users may have attended more than one event and thus show up multiple times, or may not show up at all. Also note that some of these names have double spaces between words, which are required to match our later output.

reference.Names <- c("This Test", "Test Thas",              
  "Jane Mary", "Jack Dun-Dee")
observed.Names <- c("this test", "test this", "test that",
  "JaNe  Mary.", "Mary Sou", "Jack Dee", "Jane Jack")

The challenge is to find out the number of events that registered users attended (that is, signed in for). A first pass can be done by using the stringdistmatrix() function from the stringdist package. We use the Damerau-Levenshtein distance method, which essentially counts the number of characters that have to change to go from one string to another. From this, we can see, for each observed name, the minimum number of characters that must be changed to match one of our reference names. For "this test", it takes two character changes to match the first reference name, eight for the second reference name, and so on. Note that by default, stringdistmatrix() is case sensitive, so switching cases counts as one change. Notice that this matrix gives us a grid of those matches, where the matrix elements are the distances:

stringdistmatrix(reference.Names, observed.Names, method = "dl")              
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    2    8    7   10    8    7    8
[2,]    8    3    3    9    8    8    8
[3,]    8    8    8    3    7    6    3
[4,]   11   11   11    9   10    4    9

If we want only the matches, we can use approximate matching with amatch(). It takes similar arguments, plus the maximum distance allowed before declaring that nothing matches. Notice that it returns positions in the reference name vector:

amatch(observed.Names, reference.Names, method = "dl", maxDist = 4)              
[1]  1  2  2  3 NA  4  3

We can expand a bit and make the matching non-case-sensitive by converting to lowercase via tolower(), or to uppercase via toupper(). Additionally, we can remove characters we do not care about, such as various punctuation marks. Both of these steps have a chance to lower the distances overall (notice how removing the period in "JaNe  Mary." lowers the distances in the fourth column):

stringdistmatrix( tolower(reference.Names), tolower(observed.Names), method = "dl")                
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    6    5   10    8    7    8
[2,]    6    1    1    9    8    8    8
[3,]    8    8    8    2    7    6    3
[4,]   11   11   11    9   10    4    9


stringdistmatrix( tolower(gsub("\\.", "", reference.Names)),
                 tolower(gsub("\\.", "", observed.Names)), method = "dl")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    6    5    9    8    7    8
[2,]    6    1    1    8    8    8    8
[3,]    8    8    8    1    7    6    3
[4,]   11   11   11    9   10    4    9

We can split a character string on a particular character by using strsplit(). Any pattern or character can be used, but we split on spaces. In regular expressions, the compound special character \s means any type of white space. Because the backslash itself must be escaped within an R string, we write it with a doubled backslash, "\\s". We demonstrate the result of strsplit() in the following code:

strsplit( x = reference.Names[1], split = "\\s")[[1]]
[1] "This" "Test"


strsplit( x = observed.Names[2], split = "\\s")[[1]]
[1] "test" "this"

Now we can use stringdistmatrix() again. The result shows that each subchunk of the second observed name is one character manipulation away from one of the subchunks of the first reference name:

stringdistmatrix(              
   strsplit( x = reference.Names[1], split = "\\s")[[1]],
   strsplit( x = observed.Names[2],  split = "\\s")[[1]],
   method = "dl")
     [,1] [,2]
[1,]    4    1
[2,]    1    4

If we ignore case, we get perfect matches. The second chunk of observed name 2 matches the first chunk of reference name 1, and the first chunk of observed name 2 perfectly matches the second chunk of the reference name 1:

stringdistmatrix(              
   strsplit( x = tolower(reference.Names[1]), split = "\\s")[[1]],
   strsplit( x = tolower(observed.Names[2]),  split = "\\s")[[1]],
   method = "dl")
     [,1] [,2]
[1,]    3    0
[2,]    0    3

To return the sum of the best matches, we can take the minimum of each row and then sum those minimums. The results show us that ignoring case and ignoring word order makes the second observed name a perfect match for the first reference name. We would never have gotten this result if we had tested it as one overall string.

sum(apply(stringdistmatrix(              
   strsplit( x = tolower(reference.Names[1]), split = "\\s")[[1]],
   strsplit( x = tolower(observed.Names[2]),  split = "\\s")[[1]],
   method = "dl"), 1, min))
[1] 0

Combining everything you have learned yields a list of techniques to use for string matching. We find values to ignore (for example, punctuation), split strings (for example, on spaces), and ignore case. We write a function to help with matching names. To this, we also add an optional argument, fuzz, to control whether to include close matches within a certain degree of tolerance. We also count exact matches. This code is not optimized for speed or efficiency, but it is an example combining many aspects of what you’ve learned. It applies that knowledge to a set of names, where each written name (the index argument) is compared against a vector of possible candidates (the pool). We comment the function inline to explain the pieces, rather than write a wall of text at the beginning.

matchIt <- function(index, pool, ignore = c("\\.", "-"), split = FALSE,
                    ignore.case = TRUE, fuzz = .05, method = "dl") {


## for those things we want to ignore, drop them, e.g., remove spaces, periods, dashes
  rawpool <- pool


  for (v in ignore) {
    index <- gsub(v, "", index)
    pool <- gsub(v, "", pool)
  }


## if ignore case, convert to lowercase
  if (ignore.case) {
    index <- tolower(index)
    pool <- tolower(pool)  }


  if (!identical(split, FALSE)) {
    index2 <- strsplit(index, split = split)[[1]]
    index2 <- index2[nzchar(index2)]
    pool2 <- strsplit(pool, split = split)
    pool2 <- lapply(pool2, function(x) x[nzchar(x)])


## calculate distances defaults to the Damerau-Levenshtein distance
    distances <- sapply(pool2, function(x) {
      sum(apply(stringdistmatrix(index2, x, method = method),
                1, min, na.rm = TRUE))
    })
  } else {
    ## calculate distances defaults to the Damerau-Levenshtein distance
    distances <- stringdist(index, pool, method = method)
  }


## some methods result in Infinity answers, set these missing
  distances[!is.finite(distances)] <- NA


## get best and worst
  best <- min(distances, na.rm = TRUE)
  worst <- max(distances, na.rm = TRUE)


## if fuzz, grab all distances within fuzz percent of the difference between best and worst
  if (fuzz) {
    usedex <- which(distances <= (best + ((worst - best) * fuzz)))
  } else {
    usedex <- which(distances == best)  }


## define a distance below which it is considered a perfect or exact match
  perfect <- distances[usedex] < .01
  out <- rawpool[usedex]


## count the number of perfect matches
  count <- sum(perfect)


  if (any(perfect)) {
  ## if there are perfect matches, use just one
    match <- out[perfect][1]
## return a data table of the perfect match and number of perfect matches (probably 1)
    data.table(
      Match = match,
      N = count,
      Written = NA_character_)
  } else {
## if no perfect match, return list of close matches, comma separated and exactly as written
    data.table(
      Match = paste(out, collapse = ", "),
      N = count,
      Written = index)
  }
}

Now we loop through each observed name and try to match it to reference names within 5 percent of the best match (fuzz = .05). Since the output always has the same structure, we can combine the results row-wise.

output <- lapply(observed.Names, function(n) {                
  matchIt( index = n, pool = reference.Names, ignore = c("\\.", "-"), split = "\\s", fuzz = .05)
})


output <- do.call(rbind, output)
output
                     Match N   Written
1:               This Test 1        NA
2:               This Test 1        NA
3:               Test Thas 0 test that
4:               Jane Mary 1        NA
5:               Jane Mary 0  mary sou
6:            Jack Dun-Dee 0  jack dee
7: Jane Mary, Jack Dun-Dee 0 jane jack

We can see that This Test shows up in two rows, so we aggregate up and print the final output. There are two perfect matches for This Test. There is no perfect match for Test Thas, although there is a close one listed under the Uncertain column. Finally, jack dee is a close but uncertain match for Jack Dun-Dee, and jane jack is an uncertain match for Jane Mary or Jack Dun-Dee.

finaloutput <- output[, .(              
  N = sum(N),
  Uncertain = paste(na.omit(Written), collapse = ", ")),
  by = Match]
finaloutput
                     Match N Uncertain
1:               This Test 2          
2:               Test Thas 0 test that
3:               Jane Mary 1  mary sou
4:            Jack Dun-Dee 0  jack dee
5: Jane Mary, Jack Dun-Dee 0 jane jack

Although it is challenging to cover every example and use case, these basic tools should be enough to cover many types of data management tasks when used in combination.

Summary

In this chapter, you saw how to locate text based on either exact regular expressions or fuzzy matching. Additionally, you learned how to substitute new string pieces for old. Finally, we discussed ways of creating new variables in data.table. Table 8-1 summarizes the key functions presented in this chapter.

Table 8-1. Key Functions Described in This Chapter

Function                 What It Does

read.dta()               Reads Stata files

str()                    Displays the structure of an R object

grep()                   Pattern matching that returns the positions (or values) of the elements that match

grepl()                  Pattern matching that returns a logical vector indicating whether each element of x matched

gsub()                   Matches a pattern and returns the original vector with each match replaced by new text

rowMeans(), rowSums()    Calculates the mean or sum of values for each row of a data table or matrix

Reduce()                 Takes a binary function as its first argument and successively applies it to the elements of its remaining argument

.SD                      Reserved name within data.table that refers to the variables in the data table (all of them, or a chosen subset) as a list

.SDcols                  Argument to data.table that specifies the variables to be included in .SD

strsplit()               Splits a string on a particular splitting character or pattern

stringdistmatrix()       Creates a matrix of distances between observed and expected values

amatch()                 Returns, for each observed value, the index of the closest match among the expected values

tolower()                Coerces all characters in the given string to lowercase
