String processing and pattern matching

Pattern matching is concerned with identifying patterns of characters in strings, and it has a long history in computer programming outside of its use in R. The simplest kind of pattern matching asks whether a given character equals a value or one of a group of values; this is easy to write in nearly any language, but it has very limited usefulness. The harder problem is dealing with classes of characters, such as uppercase letters or numerals. A language for describing such patterns, called regular expressions, has been adopted in many programming languages including R, where the grep family of functions is based on it. We will first discuss these functions and then delve into using regular expressions.

The grep family of functions includes a number of similar functions for identifying and replacing patterns of text. The most commonly used functions are as follows:

  • grep: This function is used to find strings that match a given character pattern. It takes a vector of strings as input and produces a vector of indices of those strings in the vector that match the given pattern.
  • grepl: This function is used to find strings that match a given character pattern, but differs from grep in the output. This function takes a vector of strings as input and produces a vector of logical values telling which elements of the original vector match the pattern.
  • sub: This function searches for a character pattern in a string and then replaces it with another string of text. It only makes this replacement in the first matching pattern it finds.
  • gsub: This function searches for a character pattern in a string and then replaces it with another string of text. As opposed to sub, this function makes the replacement in all available matches.
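A quick sketch of how these four functions differ (the fruit vector here is purely illustrative):

```r
fruits <- c("apple", "banana", "pear", "pineapple")

grep(pattern = "apple", fruits)    # indices of matching elements: 1 4
grepl(pattern = "apple", fruits)   # logical vector: TRUE FALSE FALSE TRUE

sub(pattern = "a", replacement = "A", x = "banana")   # first match only: "bAnana"
gsub(pattern = "a", replacement = "A", x = "banana")  # all matches: "bAnAnA"
```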

There is one more potentially quite useful function, which does not require strict pattern matching:

  • agrep: This function tells which elements of a vector of strings closely match a given pattern. It takes as input a vector of strings and returns a vector of indices reporting which elements of the input vector match the pattern. Rather than using strict matching, it will allow for close matches. Closeness is determined by the number of insertions, deletions, or substitutions that have to be made to achieve a perfect match.
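For example, agrep reports both an exact spelling and a close misspelling as matches (the word vector here is purely illustrative):

```r
words <- c("color", "colour", "colander", "apple")

# "colour" is one insertion away from "color", so with max.distance = 1
# both the first and second elements match
agrep(pattern = "color", words, max.distance = 1)   # 1 2
```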

Regular expressions

Regular expressions can contain literal characters or symbolic representations of characters, but what makes them powerful are metacharacters, character classes, and sequences.
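The examples that follow use a small pumpkins data frame whose weight column was entered inconsistently by hand. The original data is not reproduced here, so the following definition is an assumption, constructed to be consistent with the outputs shown below:

```r
# Hypothetical data consistent with the example outputs in this section;
# weight is character data because units were typed in by hand.
pumpkins <- data.frame(
  weight   = c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24"),
  location = c("europe", "Europe!", "USA", "us", "US"),
  stringsAsFactors = FALSE
)
```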

For letters or words, we can just use the literal string representation. For example, to locate entries containing the letter "k":

> pumpkins$weight[grep(pattern = "k", pumpkins$weight)]
[1] "2.4kg"  "3.1 kg"

To find particular letter sequences, we can simply use their literal string representation, but characters that have a symbolic meaning in regular expressions must be escaped with a backslash. Note that the backslash is itself an escape character in R strings, so it must be doubled: the regular expression \. is written as "\\." in R code.

Most punctuation marks cannot be used as literal characters, because regular expressions assign a particular meaning to them. If we try to find cells containing a "." without keeping this in mind, things go astray:

> pumpkins$weight[grep(pattern = ".", pumpkins$weight)]
[1] "2.3"        "2.4kg"      "3.1 kg"     "2700 grams" "24"        

The cell containing "2700 grams" has no "." yet gets included, because the unescaped "." is a metacharacter that matches any character.

If we want to search for a literal ".", we need to escape it with a backslash (doubled in R strings). The correct way to do this would be as follows:

> pumpkins$weight[grep(pattern = "\\.", pumpkins$weight)]
[1] "2.3"    "2.4kg"  "3.1 kg"

It is also worth mentioning that some characters that would normally be interpreted literally are given a special meaning by adding "\" in front of them. For example, "d" is interpreted literally in a regular expression as the letter d, but \d (written as "\\d" in R code) is interpreted as a digit.

Note

There are multiple standards for regular expressions, and different languages may handle regular expressions slightly differently. Perl regular expressions are allowed in R in many commands by passing the perl = TRUE argument to most functions in the grep family.

The following table gives the meaning of commonly used metacharacters and sequences in R. This is not a comprehensive list, and regular expression syntax can differ depending on the context and the language being used.

Metacharacter     Match meaning
.                 Any character.
$                 End of line.
?                 Zero or one of the previous character.
*                 Zero or more of the previous character.
+                 One or more of the previous character.
^                 Beginning of line when outside the [ ] operator; inside [ ], negates the character class that follows.
|                 The or operator.
[ ]               A character class described within the brackets.
{ }               The number of times the preceding pattern must occur for a match (?, *, and + are shortcuts for this).
\d                A digit.
\D                A non-digit character.
\s                A space character.
\S                A non-space character.
\w                An alphanumeric character.
\W                A non-alphanumeric character.

Now, let's look at how some of these can be used. We already saw an example of the "." metacharacter earlier in this chapter.

To look for at least one zero, we can use the following code:

> pumpkins$weight[grep(pattern = "0+", pumpkins$weight)]
[1] "2700 grams"

To look for those cases where someone recorded non-digit characters (which leaves out the final observation), use the following code:

> pumpkins$weight[grep(pattern = "\\D", pumpkins$weight)]
[1] "2.3"        "2.4kg"      "3.1 kg"     "2700 grams"

What if we want to look for cases where units were not recorded? We can look for a digit followed by an end of the string as shown in the following code:

> pumpkins$weight[grep(pattern = "\\d$", pumpkins$weight)]
[1] "2.3" "24"
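The remaining metacharacters work the same way. For example, "^" anchors a match to the beginning of the string, and "|" matches either of two alternatives (the vector below repeats the assumed example weights):

```r
x <- c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24")

grep(pattern = "^2", x)        # entries beginning with "2": 1 2 4 5
grep(pattern = "kg|grams", x)  # entries naming either unit: 2 3 4
```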

The last feature we will introduce here is character classes, which make regular expressions truly powerful. As an example, instead of telling R to search a string for any of the following elements: a, b, c, d, e, and so on, we can use the character class [a-z]. A table of character classes recognized in R is shown in the following table:

Character class   Meaning
[aeiou]           A lowercase vowel.
[AEIOU]           An uppercase vowel.
[0-9]             A digit.
[a-z]             A lowercase letter.
[A-Z]             An uppercase letter.
[a-zA-Z0-9]       A letter (either upper or lowercase) or a digit.
[^0-9]            Anything except a digit.
[[:alpha:]]       An upper or lowercase letter.
[[:punct:]]       A punctuation character.
[[:print:]]       A printable character.
[[:digit:]]       A digit character.

The last four character classes are examples of the POSIX character classes, a UNIX standard supported by many other languages. There are other POSIX-compliant expressions with significant overlap with the other regular expressions available in R.
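A quick illustration of character classes, again on the assumed example weight strings:

```r
x <- c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24")

grep(pattern = "[[:alpha:]]", x)  # entries containing any letter: 2 3 4
grep(pattern = "[^0-9. ]", x)     # anything besides digits, dots, and spaces: 2 3 4
```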

We will use pattern matching to clean up the previous dataset. In this example, we have only five observations, so we could clean it up by hand, but we will try to come up with some general rules that yield a data frame with one column of numbers using the same units and one column of strings using the same naming convention. As we will see, cleaning up datasets is often not a matter of statistical or mathematical decision making, but a series of judgments about how data entry can go wrong and how to interpret it.

Firstly, there are a few ways in which data entry can go wrong, such as:

  • Recording the units along with the number, as illustrated in the pumpkin dataset earlier in this chapter.
  • Recording the data in the wrong units (we want kilograms, not other units).
  • Recording the data in error in a manner we can't be sure about. An example of this is the fifth entry in the pumpkins dataset. Is this an accurate weight in the right units? Is it missing a decimal point? Is it just a complete error?

If data entry goes wrong in the first or second way, we can figure out exactly how to correct it, and we will write some R code to do so. If it goes wrong in the third way, we have a harder problem: we must decide whether to guess where things went wrong or to treat those observations as missing. Any time manual data entry is involved, the third type of problem is usually present. For example, when recording human temperature in degrees Fahrenheit, a value of 999 is clearly wrong, but we cannot be sure whether it is 99 with an extra digit accidentally typed or a mis-entered 99.9.

There are many ways to do this. Here, our general approach will be to first clean the text out of all weight entries. Then we will identify those entries recorded in grams instead of kilograms. We will then come up with a consistent naming paradigm for the locations. Once this is done, we will create a new cleaned data frame. Finally, we will get rid of elements of the data frame where the weights don't make sense.

Firstly, let's get rid of the text from the weights column. Here we just substitute alphabetical characters with nothing and coerce these to numbers. Let's have a look at the following example:

corrected.data <- as.numeric(gsub(pattern = "[[:alpha:]]", "", pumpkins$weight))

We then identify those cells where the units are in grams based on the number of digits, assuming that a run of four consecutive digits represents a measurement in grams rather than in kilograms, and we divide these measurements by 1000. If there were other units such as pounds or ounces, we would need to figure this out ahead of time and add another statement. The number of digits is specified with the {4} quantifier; in general, this is the technique used to require a given number of consecutive instances of the character class being sought. Let's have a look at the following example:

units.error.grams <- grep(pattern = "[[:digit:]]{4}", pumpkins$weight)
corrected.data[units.error.grams] <- corrected.data[units.error.grams] / 1000
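Putting these two steps together on the assumed example weights gives a numeric vector entirely in kilograms; a minimal self-contained sketch:

```r
weight <- c("2.3", "2.4kg", "3.1 kg", "2700 grams", "24")

# Strip all letters, then coerce the remaining text to numbers
corrected.data <- as.numeric(gsub(pattern = "[[:alpha:]]", "", weight))

# Four consecutive digits are assumed to mean the entry was recorded in grams
units.error.grams <- grep(pattern = "[[:digit:]]{4}", weight)
corrected.data[units.error.grams] <- corrected.data[units.error.grams] / 1000

corrected.data  # 2.3 2.4 3.1 2.7 24.0
```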

We then fix the locations. Here we will use approximate pattern matching, because there are many ways to misspell "Europe", and we don't want to think of all of them. We will use the agrep function here, which looks for approximate matches. In order to look for an approximate match, agrep needs to know what the true match looks like, given by the pattern argument. Since we are doing approximate pattern matching, we also need to figure out what we consider as an approximate representation of the pattern versus something that is not a representation of the pattern at all. We will do this with the max.dist argument, which tells agrep how far off from the pattern a string of text can be and still be considered an approximate representation of the pattern. We tell agrep how many single character insertions, deletions, and substitutions are allowed. Let's have a look at the following example:

european <- agrep(pattern = "europe", pumpkins$location, ignore.case = TRUE, max.dist = list(insertions = 1, deletions = 2))
american <- agrep(pattern = "us", pumpkins$location, ignore.case = TRUE, max.dist = list(insertions = 0, deletions = 2, substitutions = 0))
corrected.location <- pumpkins$location
corrected.location[european] <- "europe"
corrected.location[american] <- "US"

Finally, we create a new data frame with the consistent data, as shown in the following code, and review what our new data looks like:

> cleaned.pumpkins <- data.frame(corrected.data, corrected.location)
> names(cleaned.pumpkins) <- c('weight', 'location')
> cleaned.pumpkins
  weight location
1    2.3   europe
2    2.4   europe
3    3.1       US
4    2.7       US
5   24.0       US
> summary(cleaned.pumpkins)
     weight       location
 Min.   : 2.3   europe:2  
 1st Qu.: 2.4   US    :3  
 Median : 2.7             
 Mean   : 6.9             
 3rd Qu.: 3.1             
 Max.   :24.0    

Uh oh! The median weight is 2.7 kg with a mean of 6.9 kg, and a maximum of 24 kg. Clearly, there is an error here. Any pumpkin with a weight over 10 kg (two digits) is likely to be an error, so we get rid of these. (This is not a statistical question, but a judgment on the part of the researcher based on non-statistical knowledge of what is an implausible value.) We can either create a new data frame or fill it in as missing data.

Create a new dataset using the following code:

cleaned.pumpkins.2 <- cleaned.pumpkins[cleaned.pumpkins$weight <= 10,]

Fill in nonsensical values with missing values as follows:

cleaned.pumpkins[cleaned.pumpkins$weight > 10, 1] <- NA
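With the NA approach, summary() will report the NA count, and summary statistics then need na.rm = TRUE; a small sketch using the assumed cleaned weights:

```r
weight <- c(2.3, 2.4, 3.1, 2.7, 24)
weight[weight > 10] <- NA   # mark implausible values as missing

mean(weight, na.rm = TRUE)  # 2.625
```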