8
Regular expressions and essential string functions

The Web consists predominantly of unstructured text. One of the central tasks in web scraping is to collect the relevant information for our research problem from heaps of textual data. Within the unstructured text we are often interested in systematic information—especially when we want to analyze the data using quantitative methods. Systematic structures can be numbers or recurrent names like countries or addresses. We usually proceed in three steps. First we gather the unstructured text, second we determine the recurring patterns behind the information we are looking for, and third we apply these patterns to the unstructured text to extract the information. This chapter will focus on the last two steps. Consider HTML documents from the previous chapters as an example. In principle, they are nothing but collections of text. Our goal is always to identify and extract those parts of the document that contain the relevant information. Ideally we can do so using XPath—but sometimes the crucial information is hidden within atomic values. In some settings, relevant information might be scattered across an HTML document, rendering approaches that exploit the document structure useless. In this chapter we introduce a powerful tool that helps retrieve data in such settings—regular expressions. Regular expressions provide us with a syntax for systematically accessing patterns in text.

Consider the following short example. Imagine we have collected a string of names and corresponding phone numbers of fictional characters from the TV series “The Simpsons.” Our task is to extract the names and numbers and to put them into a data frame.


R> raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

The first thing we notice is how the names and numbers come in all sorts of formats. Some numbers include area codes, some contain dashes, others even parentheses. Yet, despite these differences we also notice the similarities between all the phone numbers and the names. Most importantly, the numbers all contain digits while all the names contain alphabetic characters. We can make use of this knowledge by writing two regular expressions that will extract only the information that we are interested in. Do not worry about the details of the functions at this point. They simply serve to illustrate the task that we tackle in this chapter. We will learn the various elements the queries are made up of and also how they can be applied in different contexts to extract information and get it into a structured format. We will return to the example in Section 8.1.3.

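The two extraction calls can be sketched as follows. The expressions are the ones dissected in Section 8.1.3; the object names name and phone are assumptions chosen for illustration, and the stringr package is assumed to be installed.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Names: runs of at least two letters, periods, commas, or blank spaces
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# Phone numbers: optional area code (optionally in parentheses),
# optional separator, three digits, optional separator, four digits
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))

name
phone
```

Both calls should return six elements, one per character in the directory.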

We can input the results into a data frame:

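A minimal sketch of that step, with the extracted values typed in by hand so that the snippet runs on its own. The object name phonebook is a hypothetical choice for illustration.

```r
# Values as extracted from raw.data (entered literally here)
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
phone <- c("555-1239", "(636) 555-0113", "555-6542",
           "555 8904", "636-555-3226", "5553642")

# One row per person, one column per piece of information
phonebook <- data.frame(name = name, phone = phone, stringsAsFactors = FALSE)
phonebook
```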

Although R offers the main functions necessary to accomplish such tasks, R was not designed with a focus on string manipulation. Therefore, relevant functions sometimes lack coherence. As the importance of text mining and natural language processing in particular has increased in recent years, several packages have been developed to facilitate text manipulation in R. In the following sections—and throughout the remainder of this volume—we rely predominantly on the stringr package, as it provides most of the string manipulation capability we require and it enforces a more consistent coding behavior (Wickham 2010).

The following section introduces regular expressions as implemented in R. Section 8.2 provides an overview on how string manipulation can be used in practice. This is done by presenting commands that are available in the stringr package. If you have previously worked with regular expressions, you can skip Section 8.1 without much loss. Section 8.3 concludes with some aspects of character encodings—an important concept in web scraping.

8.1 Regular expressions

Regular expressions are generalizable text patterns for searching and manipulating text data. Strictly speaking, they are not so much a tool as they are a convention on how to query strings across a wide range of functions. In this section, we will introduce the basic building blocks of extended regular expressions as implemented in R. The following string will serve as a running example:


R> example.obj <- "1. A small sentence. - 2. Another tiny sentence."

8.1.1 Exact character matching

At the most basic level, characters match characters, even in regular expressions. Thus, extracting a substring from a string returns that substring if it is present:


R> str_extract(example.obj, "small")
[1] "small"

Otherwise, the function would return a missing value:


R> str_extract(example.obj, "banana")
[1] NA

The function we use here and in the remainder of this section is str_extract() from the stringr package, which we assume is loaded in all subsequent examples. It is defined as str_extract(string, pattern) such that we first input the string that is to be operated upon and second the expression we are looking for. Note that this differs from most base functions, like grep() or grepl(), where the regular expression is typically input first. The function will return the first instance of a match to the regular expression in a given string. We can also ask R to extract every match by calling the function str_extract_all():


R> unlist(str_extract_all(example.obj, "sentence"))
[1] "sentence" "sentence"

The stringr package offers both str_whatever() and str_whatever_all() in many instances. The former addresses the first instance of a matching string while the latter accesses all instances. The syntax of all these functions is such that the character vector in question is the first element, the regular expression the second, and all possible additional values come after that. The functions’ consistency is the main reason why we prefer to use the stringr package by Hadley Wickham (2010). We introduce the package in more detail in Section 8.2. See Table 8.5 for an overview of the counterparts of the stringr functions in base R.

As str_extract_all() is ordinarily called on multiple strings, the results are returned as a list, with each list element providing the results for one string. Our input string in the call above is a character vector of length one; hence, the function returns a list of length one, which we unlist() for convenience of exposition. Compare this to the behavior of the function when we call it upon multiple strings at the same time. We create a vector containing the strings text, manipulation, and basics. We use the function str_extract_all() to extract all instances of the pattern a:


R> out <- str_extract_all(c("text", "manipulation", "basics"), "a")
R> out
[[1]]
character(0)

[[2]]
[1] "a" "a"

[[3]]
[1] "a"

The function returns a list of the same length as our input vector—three—where each element in the list contains the result for one string. As there is no a in the first string, the first element is an empty character vector. String two contains two as, string three one occurrence.

By default, character matching is case sensitive. Thus, capital letters in regular expressions are different from lowercase letters.


R> str_extract(example.obj, "small")
[1] "small"

small is contained in the example string while SMALL is not.


R> str_extract(example.obj, "SMALL")
[1] NA

Consequently, the function extracts no matching value. We can change this behavior by enclosing a string with ignore.case().


R> str_extract(example.obj, ignore.case("SMALL"))
[1] "small"
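Readers working with a more recent release of stringr may find that ignore.case() is no longer available. A sketch of the same query using the regex() modifier, which current versions of the package provide for this purpose:

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# regex() wraps the pattern; ignore_case = TRUE switches off case sensitivity
match.ci <- str_extract(example.obj, regex("SMALL", ignore_case = TRUE))
match.ci
```

The object name match.ci is an assumption made for this illustration.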

We are not limited to using regular expressions on words. A string is simply a sequence of characters. Hence, we can just as well match particles of words …


R> unlist(str_extract_all(example.obj, "en"))
[1] "en" "en" "en" "en"

… or mixtures of alphabetic characters and blank spaces.


R> str_extract(example.obj, "mall sent")
[1] "mall sent"

Searching for the pattern en in the example string returns every instance of the pattern, that is, both occurrences in the word sentence, which is contained twice in the example object. Sometimes we do not simply care about finding a match anywhere in a string but are concerned about the specific location within a string. There are two simple additions we can make to our regular expression to specify locations. The caret symbol (^) at the beginning of a regular expression marks the beginning of a string; $ at the end marks the end. Thus, extracting 2 from our running example will return a 2.


R> str_extract(example.obj, "2")
[1] "2"

Extracting a 2 from the beginning of the string, however, fails.


R> str_extract(example.obj, "^2")
[1] NA

Similarly, the $ sign signals the end of a string, such that …


R> unlist(str_extract_all(example.obj, "sentence$"))
character(0)

… returns no matches as our example string ends in a period character and not in the word sentence. Another powerful addition to our regular expressions toolkit is the pipe, displayed as |. This character is treated as an OR operator such that the function returns all matches to the expressions before and after the pipe.


R> unlist(str_extract_all(example.obj, "tiny|sentence"))
[1] "sentence" "tiny" "sentence"

8.1.2 Generalizing regular expressions

Up to this point, we have only matched fixed expressions. But the power of regular expressions stems from the possibility to write more flexible, generalized search queries. The most general among them is the period character. It matches any character.


R> str_extract(example.obj, "sm.ll")
[1] "small"

Another powerful generalization in regular expressions is the character class, which is enclosed in brackets: []. A character class means that any of the characters within the brackets will be matched.


R> str_extract(example.obj, "sm[abc]ll")
[1] "small"

The above code extracts the word small as the character a is part of the character class [abc]. A different way to specify the elements of a character class is to employ ranges of characters, using a dash -.


R> str_extract(example.obj, "sm[a-p]ll")
[1] "small"

In this case, any characters from a to p are valid matches. Apart from alphabetic characters and digits, we can also include punctuation and spaces in regular expressions. Accordingly, they can be part of a character class. For example, the character class [uvw. ] matches the letters u, v and w as well as a period and a blank space. Applying this to our running example (Recall: “1. A small sentence. - 2. Another tiny sentence.”) yields all of its constituent periods and spaces but neither u, v, or w as there are none in the object. Note that the period character in the character class loses its special meaning. Inside a character class, a dot only matches a literal dot.

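The character class [uvw. ] described above can be tried out as follows. This is a sketch assuming the stringr package is installed; the object name matches is chosen for illustration.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# Inside the brackets the period is a literal period, not "any character"
matches <- unlist(str_extract_all(example.obj, "[uvw. ]"))

# Tabulate what was found: only periods and blank spaces
table(matches)
```

The result contains the four periods and eight blank spaces of the example string, and no letters.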

So far, we have manually specified character classes. However, there are some typical collections of characters that we need to match in a body of text. For example, we are often interested in finding all alphabetic characters in a given text. This can be accomplished with the character class [a-zA-Z], that is, all letters from a to z as well as all letters from A to Z. For convenience, a number of common character classes have been predefined in R. Table 8.1 provides an overview of selected predefined classes.

Table 8.1 Selected predefined character classes in R regular expressions

[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] Lowercase characters: a–z
[:upper:] Uppercase characters: A–Z
[:alpha:] Alphabetic characters: a–z and A–Z
[:alnum:] Digits and alphabetic characters
[:punct:] Punctuation characters: . , ; etc.
[:graph:] Graphical characters: [:alnum:] and [:punct:]
[:blank:] Blank characters: Space and tab
[:space:] Space characters: Space, tab, newline, and other space characters
[:print:] Printable characters: [:alnum:], [:punct:] and [:space:]

Source: Adapted from http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html

In order to use the predefined classes, we have to enclose them in brackets. Otherwise, R assumes that we are specifying a character class consisting of the constituent characters. Say we are interested in extracting all the punctuation characters in our example. The correct expression is


R> unlist(str_extract_all(example.obj, "[[:punct:]]"))
[1] "." "." "-" "." "."

Notice how this differs from


R> unlist(str_extract_all(example.obj, "[:punct:]"))
[1] "n" "t" "n" "c" "n" "t" "t" "n" "n" "t" "n" "c"

Not enclosing the character class returns all the :, p, u, n, c, and t in our running example. Note that the duplicate : does not throw off R. A redundant inclusion of a character in a character class will only match each instance once.


R> unlist(str_extract_all(example.obj, "[AAAAAA]"))
[1] "A" "A"

Furthermore, while [A-Za-z] is almost identical to [:alpha:], the former disregards special characters, such that …


R> str_extract("François Hollande", "Fran[a-z]ois")
[1] NA

… returns no matches, while …


R> str_extract("François Hollande", "Fran[[:alpha:]]ois")
[1] "François"

… does. The predefined character classes will cover many requests we might like to make but in case they do not, we can even extend a predefined character class by adding elements to it.

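A sketch of such an extended class, assuming the expression [[:punct:]ABC] that the surrounding text describes. The object names are hypothetical.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# Predefined class plus three explicitly listed capital letters
extended <- unlist(str_extract_all(example.obj, "[[:punct:]ABC]"))

# The same class written with a range instead of single letters
ranged <- unlist(str_extract_all(example.obj, "[[:punct:]A-C]"))

extended
identical(extended, ranged)
```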

In this case, we extract all punctuation characters along with the capital letters A, B, and C. Incidentally, making use of the range operator we introduced above, this extended character class could be rewritten as [[:punct:]A-C]. Another nifty use of character classes is to invert their meanings by adding a caret (^) at the beginning of a character class. Doing so, the function will match everything except the contents of the character class.

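An inverted class can be sketched like this, assuming the expression [^[:alnum:]] and an installed stringr package. The object name non.alnum is chosen for illustration.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# The leading caret inverts the class: match anything that is NOT alphanumeric
non.alnum <- unlist(str_extract_all(example.obj, "[^[:alnum:]]"))

# Only blank spaces and punctuation remain
table(non.alnum)
```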

Accordingly, in our case matching every non-alphanumeric character yields every blank space and punctuation character. To recap, we have learned that every digit and character matches itself in a regular expression, a period matches any character, and a character class will match any of its constituent characters. However, we are still missing the option to use quantification in our expressions. Say we would like to extract a sequence starting with an s, ending with an l, and any three alphabetic characters in between from our running example. With the tools we have learned so far, our only option is to write an expression like s[[:alpha:]][[:alpha:]][[:alpha:]]l. Recall that we cannot use the . character as this would match any character, including blank spaces and punctuation.


R> str_extract(example.obj, "s[[:alpha:]][[:alpha:]][[:alpha:]]l")
[1] "small"

Writing our regular expressions in this manner not only quickly becomes difficult to read and understand, but it is also inefficient to write and more prone to errors. To avoid this we can add quantifiers to characters. For example, a number in {} after a character signals a fixed number of repetitions of this character. Using this quantifier, a sequence such as aaaa could be shortened to read a{4}. In our case, we thus write …


R> str_extract(example.obj, "s[[:alpha:]]{3}l")
[1] "small"

… where [[:alpha:]]{3} matches any three alphabetic characters. Table 8.2 provides an overview of the available quantifiers in R. A common quantification operator is the + sign, which signals that the preceding item has to be matched one or more times. Using the . as any character we could thus write the following in order to extract a sequence that runs from an A to sentence with any number—greater than zero—of any characters in between.

Table 8.2 Quantifiers in R regular expressions

? The preceding item is optional and will be matched at most once
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times

Source: Adapted from http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html


R> str_extract(example.obj, "A.+sentence")
[1] "A small sentence. - 2. Another tiny sentence"

R applies greedy quantification. This means that the program tries to extract the greatest possible sequence of the preceding character. As the . matches any character, the function returns the greatest possible sequence of any characters before a sequence of sentence. We can change this behavior by adding a ? to the expression in order to signal that we are only looking for the shortest possible sequence of any characters before a sequence of sentence. The ? means that the preceding item is optional and will be matched at most once (see again Table 8.2).


R> str_extract(example.obj, "A.+?sentence")
[1] "A small sentence"

We are not restricted to applying quantifiers to single characters. In order to apply a quantifier to a group of characters, we enclose them in parentheses.


R> unlist(str_extract_all(example.obj, "(.en){1,5}"))
[1] "senten" "senten"

In this case, we are asking the function to return a sequence of characters where the first character can be any character and the second and third characters have to be an e and an n. We are asking the function for all instances where this sequence appears at least once, but at most five times. The longest possible sequence that could conform to this request would thus be 3 × 5 = 15 characters long, where every second and third character would be an e and an n. In the next code snippet we drop the parentheses. The function will thus match all sequences that run from any character over e to n where the n has to appear at least once but at most five times. Consider how the previous result differs from the following:


R> unlist(str_extract_all(example.obj, ".en{1,5}"))
[1] "sen" "ten" "sen" "ten"

So far, we have encountered a number of characters that have a special meaning in regular expressions. They are called metacharacters. In order to match them literally, we precede them with two backslashes. In order to literally extract all period characters from our running example, we write


R> unlist(str_extract_all(example.obj, "\\."))
[1] "." "." "." "."

The double backslash before the period character is interpreted as a single literal backslash. Inputting a single backslash in a regular expression will be interpreted as introducing an escape sequence. Several of these escape sequences are quite common in web scraping tasks and should be familiar to you. The most common are \n and \t, which mean new line and tab. For example, “a\n\n\na” is interpreted as a, three new lines, and another a. If we want the entire regular expression to be interpreted literally, we have a better alternative than preceding every metacharacter with a backslash. We can enclose the expression with fixed() in order for metacharacters to be interpreted literally.


R> unlist(str_extract_all(example.obj, fixed(".")))
[1] "." "." "." "."

Most metacharacters lose their special meaning inside a character class. For example, a period character inside a character class will only match a literal period character. The only two exceptions to this rule are the caret (^) and the -. Putting the former at the beginning of a character class matches the inverse of the character class’ contents. The latter can be applied to describe ranges inside a character class. This behavior can be altered by putting the - at the beginning or the end of a character class. In this case it will be interpreted literally.

One last aspect of regular expressions that we want to introduce here are a number of shortcuts that have been assigned to several specific character classes. Table 8.3 provides an overview of available shortcuts.

Table 8.3 Selected symbols with special meaning

\w Word characters: [[:alnum:]_]
\W No word characters: [^[:alnum:]_]
\s Space characters: [[:space:]]
\S No space characters: [^[:space:]]
\d Digits: [[:digit:]]
\D No digits: [^[:digit:]]
\b Word edge
\B No word edge
\< Word beginning
\> Word end

Consider the \w character. This symbol matches any word character in our running example, such that …

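A sketch of such a query, assuming the shortcut is combined with the + quantifier so that whole runs of word characters are returned. The object name words is hypothetical.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# \w must be written as "\\w" inside an R string;
# the + collects consecutive word characters into one match
words <- unlist(str_extract_all(example.obj, "\\w+"))
words
```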

… extracts every word separated by blank spaces or punctuation. Note that \w is equivalent to [[:alnum:]_] and thus the leading digits are interpreted as whole words. Consider further the useful shortcuts for word edges \>, \<, and \b. Using them, we can be more specific in the location of matches. Imagine we would like to extract all e from our running example that are at the end of a word. To do so, we could apply one of the following two expressions:


R> unlist(str_extract_all(example.obj, "e\\>"))
[1] "e" "e"
R> unlist(str_extract_all(example.obj, "e\\b"))
[1] "e" "e"

This query extracts the two e from the edges of the word sentence. Finally, it is even possible to match a sequence that has been previously matched in a regular expression. This is called backreferencing. Say we are looking for the first letter in our running example and, whatever it may be, want to match further instances of that particular letter. To do so, we enclose the element in question in parentheses, for example ([[:alpha:]]), and reference it using \1.


R> str_extract(example.obj, "([[:alpha:]]).+?\\1")
[1] "A small sentence. - 2. A"

In our example, the letter is an A. The function returns this match and the subsequent characters up to the next instance of an A. To make matters a little more complicated, we now look for a lowercase word without the letter a up to and including the second occurrence of this word.


R> str_extract(example.obj, "(\\<[b-z]+\\>).+?\\1")
[1] "sentence. - 2. Another tiny sentence"

The expression we use is (\<[b-z]+\>).+?\1. First, consider the [b-z]+ part. The expression matches all sequences of lowercase letters of length one or more that do not contain the letter a. In our running example, neither the 1 nor the A fulfills this requirement. The first substring that would match this expression is the sm in the word small. Recall that the + quantifier is greedy. Hence, it tries to capture the longest possible sequence, which would be sm instead of s. This is not what we want. Instead, we are looking for a whole word of lowercase letters that does not contain the letter a. Thus, to exclude this finding we add the \< and \> to the expression to signal a word's beginning and end. This entire expression is enclosed in parentheses in order to reference it further down in the expression. The first part of the string that this expression matches is the word sentence. Next, we are looking for the subsequent occurrence of this substring in our string using the \1, regardless of what comes in between (.+?). Not so easy, is it?

8.1.3 The introductory example reconsidered

Now that we have encountered the main ingredients of regular expressions, we can come back to our introductory example of sorting out the Simpsons phone directory. Take another look at the raw data.


R> raw.data
[1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

In order to extract the names, we used the regular expression [[:alpha:]., ]{2,}. Let us have a look at it step by step. At its core, we used the character class [:alpha:], which signals that we are looking for alphabetic characters. Apart from these characters, names can also contain periods, commas and empty spaces, which we want to add to the character class to read [[:alpha:]., ]. Finally, we add a quantifier to impose the restriction that the contents of the character class have to be matched at least twice to be considered a match. Failing to add a quantifier would extract every single character that matches the character class. Moreover, we have to specify that we only want matches of at least length two; otherwise the expression would return the empty spaces between some of the phone numbers.

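The role of the quantifier can be checked with a short comparison. This is a sketch assuming an installed stringr package; the object name loose and the relaxed expression with + are additions made for illustration.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# At least two characters from the class: only the six names
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# One or more characters: single blank spaces inside phone numbers match too
loose <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]+"))

name
length(loose)
setdiff(loose, name)
```

The relaxed expression returns two additional matches, both of them a single blank space taken from within a phone number.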

We also wanted to extract all the phone numbers from the string. The regular expression we used for the task was a little more complicated to conform to the different formats of the phone numbers. Let us consider the elements that phone numbers consist of, mostly digits (\d). The primary source of difficulty stems from the fact that the phone numbers were not formatted identically. Instead, some contained empty spaces, dashes, parentheses, or did not have an area code attached to them.

Applying our knowledge of regular expressions, we are now able to dismantle the regular expression. In its entirety it reads \(?(\d{3})?\)?(-| )?\d{3}(-| )?\d{4}. Let us go through the expression. The first part of the expression reads \(?(\d{3})?\)?. In the center we find \d{3}, which we use to collect the three-digit area code. As this is not contained in every phone number we enclose the expression with two parentheses and add a question mark, signaling that this part of the expression can be dropped. Before and after this core element we add \( and \) to incorporate two literal parentheses surrounding the three-digit area code. These too can be dropped, if the phone number does not contain them, using the ?. Next, our regular expression contains the expression (-| )?. This means that either a dash or an empty space will be matched, but again, we enclose the entire expression with parentheses and add a question mark in order to signal that this part of the expression might be missing. These elements are then simply repeated. Specifically, we are looking for three digits, another dash or empty space that might or might not be part of the phone number, and four more digits. Applying this to our mock example yields


R> phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
R> phone
[1] "555-1239" "(636) 555-0113" "555-6542" "555 8904"
[5] "636-555-3226" "5553642"

Before moving on to discuss how regular expressions can be used in practice in the subsequent section, we would like to conclude this part with some general observations on regular expressions. First, even though we have provided a fairly comprehensive picture of how we can go about generalizing regular expressions to meet our string manipulation needs, there are still several aspects that we have not covered in this section. In particular, there are two flavors of regular expressions implemented in R: extended basic and Perl regular expressions. In the above example we have exclusively relied on the former. While Perl regular expressions provide some additional features, most tasks can be accomplished by relying on the default flavor, the extended basic variant.

Although there is no harm in learning Perl regular expressions we advise you to stick to the default for several reasons. One, it is generally confusing to keep two flavors in mind—especially if this is your first time approaching regular expressions. Two, most tasks can be accomplished with the default implementation. Sometimes this means solving a task in two steps rather than one but in many instances this behavior is even preferable. We believe that it is poor practice to try and come up with a “golden expression” that accomplishes all your string manipulation needs in just one line. For the sake of readability one should try to restrict the number of steps that are taken in any given line of code. This simplifies error detection and furthermore helps grasp what is going on in your code when revisiting it at a later stage. Keeping this rule in mind, the use of such intricate concepts as backreferences becomes dubious. While there may be instances when they cannot be avoided, they also tend to make code confusing. Splitting all the steps that are taken inside a backreference expression into several smaller steps is often preferable.

Now we have the building blocks ready to take a look at what can be accomplished with regular expressions in practice.

8.2 String processing

8.2.1 The stringr package

In this section we present some of the available functions that rely on regular expressions. To do so we look at functions that are implemented in the stringr package. Two functions we have used throughout the last section were str_extract() and str_extract_all(). They extract the first/all instance/s of a match between the regular expression and the string. To reiterate, str_extract() extracts the first matching instance to a regular expression, whereas str_extract_all() extracts every matching instance.

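The contrast between the two functions can be sketched as follows. The digit pattern and the object names are assumptions chosen for illustration.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# First match only: a character vector, one element per input string
first.digit <- str_extract(example.obj, "[[:digit:]]")

# All matches: a list, one character vector per input string
all.digits <- str_extract_all(example.obj, "[[:digit:]]")

first.digit
all.digits
```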

We have pointed out that the function outputs differ. In the former case a character vector is returned, while a list is returned in the latter case. Table 8.4 gives an overview of the different functions that will be introduced in the present chapter. Column two presents a short description of the function's purpose, column three specifies the format of the return value. If instead of extracting the result we are interested in the location of a match in a given string, we use the functions str_locate() or str_locate_all().

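A sketch of a locating call, assuming we search for the word tiny in the running example. The object name loc is hypothetical.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another tiny sentence."

# One row per input string, with columns "start" and "end"
loc <- str_locate(example.obj, "tiny")
loc
```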

Table 8.4 Functions of package stringr in this chapter

Function           Description                                           Output

Functions using regular expressions
str_extract()      Extracts first string that matches pattern            Character vector
str_extract_all()  Extracts all strings that match pattern               List of character vectors
str_locate()       Returns position of first pattern match               Matrix of start/end positions
str_locate_all()   Returns positions of all pattern matches              List of matrices
str_replace()      Replaces first pattern match                          Character vector
str_replace_all()  Replaces all pattern matches                          Character vector
str_split()        Splits string at pattern                              List of character vectors
str_split_fixed()  Splits string at pattern into fixed number of pieces  Matrix of character vectors
str_detect()       Detects patterns in string                            Boolean vector
str_count()        Counts number of pattern occurrences in string        Numeric vector

Further functions
str_sub()          Extracts strings by position                          Character vector
str_dup()          Duplicates strings                                    Character vector
str_length()       Returns length of string                              Numeric vector
str_pad()          Pads a string                                         Character vector
str_trim()         Discards string padding                               Character vector
str_c()            Concatenates strings                                  Character vector

The function outputs a matrix with the start and end position of the first instance of a match, in this case the 35th to 38th characters in our example string. We can make use of positional information in a string to extract a substring using the function str_sub().


R> str_sub(example.obj, start = 35, end = 38)
[1] "tiny"

Here we extract the 35th to 38th characters that we know to be the word tiny. Possibly, a more common task is to replace a given substring. As usual, this can be done using the assignment operator.


R> str_sub(example.obj, 35, 38) <- "huge"
R> example.obj
[1] "1. A small sentence. - 2. Another huge sentence."

str_replace() and str_replace_all() are used for replacements more generally.


R> str_replace(example.obj, pattern = "huge", replacement = "giant")
[1] "1. A small sentence. - 2. Another giant sentence."

We might care to split a string into several smaller strings. In the easiest of cases we simply define a split, say at each dash.

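A sketch of such a split, using the running example in its current state (with huge in place of tiny). The object name pieces is an assumption.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another huge sentence."

# str_split() returns a list; unlist() gives a plain character vector
pieces <- unlist(str_split(example.obj, "-"))
pieces
```

Note that the blank spaces surrounding the dash stay attached to the resulting pieces.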

We can also fix the number of particles we want the string to be split into. If we wanted to split the string at each blank space, but did not want more than five resulting strings, we would write

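A sketch of a split into a fixed number of pieces, assuming the blank space is expressed with the predefined class [[:blank:]]. The object name pieces.fixed is hypothetical.

```r
library(stringr)

example.obj <- "1. A small sentence. - 2. Another huge sentence."

# Split at blank spaces, but stop after five pieces;
# the last piece keeps the remainder of the string
pieces.fixed <- str_split_fixed(example.obj, "[[:blank:]]", 5)
pieces.fixed
```

The result is a matrix with one row per input string and five columns.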

So far, all the examples we looked at have assumed a single string object. Recall our little running example that consists of two sentences—but only one string.


R> example.obj
[1] "1. A small sentence. - 2. Another huge sentence."

We can apply the functions to several strings at the same time. Consider a character vector that consists of several strings as a second running example:


R> char.vec <- c("this","and this","and that")

The first thing we can do is to check the occurrence of particular pattern inside a character vector. Assume we are interested in knowing whether the pattern this appears in the elements of a given vector. The function we use to do this is str_detect().


R> str_detect(char.vec,"this")
[1] TRUE TRUE FALSE

Moreover, we could be interested in how often this particular word appears in the elements of a given vector …


R> str_count(char.vec,"this")
[1] 1 1 0

… or how many words there are in total in each of the different elements.


R> str_count(char.vec, "\\w+")
[1] 1 2 2

We can duplicate strings …
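A short sketch of duplication, assuming str_dup() from stringr and a repetition count of three, which we chose for illustration. The object name dup is ours.

```r
library(stringr)

char.vec <- c("this", "and this", "and that")

# Repeat every element three times without a separator
dup <- str_dup(char.vec, 3)
dup
```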


… or count the number of characters in a given string.


R> length.char.vec <- str_length(char.vec)
R> length.char.vec
[1] 4 8 8

Two important functions in web data manipulation are str_pad() and str_trim(). They are used to add characters to the edges of strings or trim blank spaces.
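As a sketch of padding, assuming we want every element to be as long as the longest one and to receive the blanks on both sides. The object name padded is ours.

```r
library(stringr)

char.vec <- c("this", "and this", "and that")

# Pad each element with blanks on both sides up to the length of the longest element
padded <- str_pad(char.vec, width = max(str_length(char.vec)),
                  side = "both", pad = " ")
padded
```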


In this case we add white spaces to the shorter string equally on both sides such that each string has the same length. The opposite operation is performed using str_trim(), which strips excess white spaces from the edges of strings.
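A sketch of the reverse operation, assuming a vector that was padded as described. The object names padded, trimmed, and left.only are ours, and the side argument is shown only to illustrate that trimming can be restricted to one edge.

```r
library(stringr)

padded <- c("  this  ", "and this", "and that")

# Strip leading and trailing white space from every element
trimmed <- str_trim(padded)
trimmed

# Restrict trimming to the left edge; trailing blanks survive
left.only <- str_trim(padded, side = "left")
left.only
```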


Finally, we can join strings using the str_c() function.


R> cat(str_c(char.vec, collapse = "\n"))
this
and this
and that

Here, we join the three strings of our character vector into a single string. We add a new line character (\n) between them and print the result using the cat() function, which interprets the new line character as a new line. Beyond joining the contents of one vector, we can use the function to join two different vectors.


R> str_c("text", "manipulation", sep = " ")
[1] "text manipulation"

If the length of one vector is a multiple of the length of the other, the function automatically recycles the shorter one.


R> str_c("text", c("manipulation", "basics"), sep = " ")
[1] "text manipulation" "text basics"

Throughout this book we frequently rely on the stringr package for string processing. However, base R provides string processing functionality as well. We find the base functions less consistent and thus more difficult to learn. If you still want to learn them or want to switch from base R functionality to the stringr package, have a look at Table 8.5. It provides an overview of the analogous functions from the stringr package as implemented in base R.

Table 8.5 Equivalents of the functions in the stringr package in base R

stringr function       Base function
Functions using regular expressions
str_extract()          regmatches()
str_extract_all()      regmatches()
str_locate()           regexpr()
str_locate_all()       gregexpr()
str_replace()          sub()
str_replace_all()      gsub()
str_split()            strsplit()
str_split_fixed()      -
str_detect()           grepl()
str_count()            -
Further functions
str_sub()              regmatches()
str_dup()              -
str_length()           nchar()
str_pad()              -
str_trim()             -
str_c()                paste(), paste0()
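To give a feeling for the base functions listed in Table 8.5, the following sketch applies some of them to the second running example. The patterns and object names are ours and serve only as illustration.

```r
char.vec <- c("this", "and this", "and that")

# Counterpart of str_detect(): does each element contain the pattern?
detected <- grepl("this", char.vec)

# Counterpart of str_length(): number of characters per element
lengths <- nchar(char.vec)

# Counterpart of str_c(): join all elements into one string
joined <- paste(char.vec, collapse = ", ")

# Counterpart of str_extract(): regexpr() locates the first match,
# regmatches() pulls the matched text out of the strings
first.th <- regmatches(char.vec, regexpr("th[a-z]+", char.vec))

# Counterparts of str_replace() and str_split()
replaced <- sub("this", "that", char.vec)
words <- strsplit(char.vec, " ")

detected
lengths
joined
first.th
```

Note how the argument order differs: the base functions take the pattern first, whereas the stringr functions take the string first.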

8.2.2 A couple more handy functions

Many string manipulation tasks can be accomplished using the stringr package we introduced in the previous section. However, there are a couple of additional functions in base R we would like to introduce in this section. Text data, especially data scraped from web sources, is often messy. Data that should be matched come in different formats, names are spelled differently—problems come from all sorts of places. Throughout this volume we stress the need to cleanse data after it is collected. One way to deal with messy text data is the agrep() function, which provides approximate matching via the Levenshtein distance. Without going into too much detail, the function calculates the number of insertions, deletions, and substitutions necessary to transform one string into another. Specifying a cutoff, we can provide a criterion on whether a pattern should be considered as present in a string.


R> agrep("Barack Obama", "Barack H. Obama", max.distance = list(all = 3))
[1] 1

In this case, we are looking for the pattern Barack Obama in the string Barack H. Obama and we allow three alterations in the string. See how this compares to a search for the pattern in the string Michelle Obama.


R> agrep("Barack Obama", "Michelle Obama", max.distance = list(all = 3))
integer(0)

Too many changes are needed in order to find the pattern in the string; hence there is no result. You can change the maximum distance between pattern and string by adjusting both the max.distance and the costs parameters. The higher the max.distance parameter (default = 0.1), the more approximate matches it will find. Using the costs parameter you can adjust the costs for the different operations necessary to liken two strings.
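To inspect the distances behind these two results, one option is the adist() function from the utils package, which is not used elsewhere in this chapter. The following is a sketch under that assumption; the object names are ours.

```r
pattern <- "Barack Obama"
candidates <- c("Barack H. Obama", "Michelle Obama")

# Levenshtein distance between the pattern and each complete candidate string
full.dist <- adist(pattern, candidates)

# partial = TRUE measures the distance to the best-matching substring,
# the notion of distance that agrep() uses by default
partial.dist <- adist(pattern, candidates, partial = TRUE)

full.dist
partial.dist

# Keep the candidates within three edits, mirroring max.distance = list(all = 3)
close.enough <- candidates[partial.dist[1, ] <= 3]
close.enough
```

Turning Barack Obama into Barack H. Obama takes three insertions (the letter H, the period, and a blank), which is why the first search succeeds with a cutoff of three.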

Another handy function is pmatch(). The function returns the positions of the strings in the first vector in the second vector. Consider the character vector from above, c("this","and this","and that").


R> pmatch(c("and this", "and that", "and these", "and those"), char.vec)
[1]  2  3 NA NA

We are looking for the positions of the elements in the first vector, c("and this", "and that", "and these", "and those"), in the character vector. The output signals that the first element is at the second position, the second at the third. The third and fourth elements in the first vector are not contained in the character vector. A final useful function is make.unique(). Using this function you can transform a collection of nonunique strings by adding digits where necessary.
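A small sketch of make.unique() on a vector of letters that we made up for illustration. The object name unique.names is ours.

```r
# Repeated elements receive a numeric suffix; first occurrences stay untouched
unique.names <- make.unique(c("a", "b", "a", "c", "b", "a"))
unique.names
```

The suffix is appended with a period by default; the sep argument changes the separator.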

Although there are a lot of handy functions already available, there will always be problems and situations when the one special function desperately needed is missing. One of those problems might be the following. Imagine we have to check for more than one pattern within a character vector and want to get a logical vector indicating compliant rows or an index listing all the compliant row numbers. For checking patterns, we know that grep(), grepl(), or str_detect() might be good candidates. Because grep() offers a switch for returning the matched text or a row index vector, we try to build a solution starting with grep(). We begin by downloading a test dataset of Simpsons episodes and store it in the local file episodes.Rdata.

Let us load the table containing all the Simpsons episodes.


R> load("episodes.Rdata")

As you can see below, it is easy to switch between different answers to the same question—which episodes mention Homer in the title—using grep(), grepl() and using the value = TRUE option. The easy switch makes these functions particularly valuable when we start developing regular expressions, as we might need an index or logical vector at the end, but we can use the value option to check if the used pattern actually works.


R> grep("Homer", episodes$title[1:10], value = TRUE)
[1] "Homer's Odyssey" "Homer's Night Out"
R> grepl("Homer", episodes$title[1:10])
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

What is missing, however, is the option to ask for a whole bunch of patterns to be matched at the same time. Imagine we would like to know whether there are episodes where Homer and Lisa are mentioned in the title. The standard solution would be to make a logical vector for each separate pattern to be matched and later combine them to a logical vector that equals TRUE when all patterns are found.


R> iffer1 <- grepl("Homer", episodes$title)
R> iffer2 <- grepl("Lisa", episodes$title)
R> iffer <- iffer1 & iffer2
R> episodes$title[iffer]
[1] "Homer vs. Lisa and the 8th Commandment"

Although this solution might seem acceptable in the case of two patterns, it becomes more and more inconvenient if the number of patterns grows or if the task has to be repeated. We will therefore create a new function built upon grep().

The idea of the grepall() function is that we need to repeat the pattern search for a series of patterns—as we did in the previous code snippet when doing two separate pattern searches. Going through a series of things can be done by using a loop or more efficiently by using apply functions. Therefore, we first apply the grepl() function to get the logical vectors indicating which patterns were found in which row. We use sapply() because we have a vector as input and would like to have a matrix like object as output. What we get is a matrix with columns referring to the different search patterns and rows referring to the individual strings. To make sure all patterns were found in a certain row we use a second apply—this time we use apply() because we have a matrix as input—where the all() function returns TRUE when all values in a row are true and FALSE if any one value in a row is false. Depending on whether or not we want to return a vector containing the row numbers or a vector containing the text for which all the patterns were found the value option switches between two different uses of the internal logical vector to return row numbers or text accordingly. To get the full logical vector we can use the logic option. Besides providing functionality that works like grep() and grepl() for multiple search terms, all other options like ignore.case, perl, fixed, or useBytes are forwarded to the first apply step, so that this functionality is also part of the new function.
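The following is a sketch of what such a function could look like, reconstructed from the description above rather than taken from any listing. The argument names value and logic follow the text; everything else, including the short vector of titles used to try it out, is our assumption.

```r
# Hypothetical sketch of a grepall()-style helper based on the description above
grepall <- function(pattern, x, ignore.case = FALSE, perl = FALSE,
                    fixed = FALSE, useBytes = FALSE,
                    value = FALSE, logic = FALSE) {
  # Step 1: one grepl() call per pattern, forwarding the usual options
  found <- sapply(pattern, function(p) {
    grepl(p, x, ignore.case = ignore.case, perl = perl,
          fixed = fixed, useBytes = useBytes)
  })
  # Keep a strings-by-patterns matrix even when x has a single element
  found <- matrix(found, nrow = length(x))
  # Step 2: a row is compliant only if every pattern was found in it
  all.found <- apply(found, 1, all)

  if (logic) return(all.found)     # full logical vector
  if (value) return(x[all.found])  # matching text
  which(all.found)                 # row numbers
}

# A small hand-typed vector of titles, not the episodes data set
titles <- c("Homer's Odyssey", "Homer vs. Lisa and the 8th Commandment",
            "Lisa's Substitute", "Bart the General")

grepall(c("Homer", "Lisa"), titles)
grepall(c("Homer", "Lisa"), titles, value = TRUE)
grepall(c("Homer", "Lisa"), titles, logic = TRUE)
```

The sketch does not guard against an empty input vector; a production version would need to handle that case.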

8.3 A word on character encodings

When working with web-based text data—particularly non-English data—one quickly runs into encoding issues. While there are no simple rules to deal with these problems, it is important to keep the difficulties that arise from them in mind. Generally speaking, character encodings refer to how the digital binary signals are translated into human-readable characters, for example, making a “d” from “01100100.” As there are many languages around the world, there are also many special characters, like ä, ø, ç, and so forth. The issues arise since there are different translation tables such that without knowing which particular table is used to encode a binary signal it is difficult to draw inferences on the correct content of a signal. If you have not changed the defaults, R works with the system encoding scheme to present the output. You can query this standard with the following function:


R> Sys.getlocale()
[1]"LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;
LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

If you have not figured it out already from the names on the cover, this book was written by four guys from Germany on a computer with a German operating system. The name of the character encoding, hidden behind the number 1252, is Windows-1252 and it is the default character encoding on systems that run Microsoft Windows in English and some other languages. Your output is likely to be a different one. For example, if you are working on a Windows PC and are located in the United States, R will return something like English_United States.1252. If you are operating on a Mac, the encoding standard is UTF-8. Let us input a string with some special characters. Consider this fragment from a popular Swedish song, called “small frogs” (små grodorna):


R> small.frogs <- "Små grodorna, små grodorna är lustiga att se."
R> small.frogs
[1] "Små grodorna, små grodorna är lustiga att se."

There are several special characters in this fragment. By default, our inputs and outputs are assumed to be of Windows-1252 standard; thus the output is correct. Using the function iconv(), we can translate a string from one encoding scheme to another:


R> small.frogs.utf8 <- iconv(small.frogs, from = "windows-1252", to = "UTF-8")
R> Encoding(small.frogs.utf8)
[1] "UTF-8"
R> small.frogs.utf8
[1] "Små grodorna, små grodorna är lustiga att se."

In this case, the function applies a translation table from the Windows-1252 encoding to the UTF-8 standard. Thus, the binary sequence is recast as a UTF-8-encoded string. Consider how this behavior differs from the one we encounter when applying the Encoding() function to the string.


R> Encoding(small.frogs.utf8) <- "windows-1252"
R> small.frogs.utf8
[1] "SmÃ¥ grodorna, smÃ¥ grodorna Ã¤r lustiga att se."

Doing so, we force the system to treat the UTF-8-encoded binary sequence as though it were generated by a different encoding scheme (our system default Windows-1252), resulting in the well-known garbled output we get, for example, when visiting a website with misspecified encodings. There are currently 350 conversion schemes available, which can be accessed using the iconvlist() function.
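The difference between translating and merely relabeling becomes visible at the byte level. The following sketch builds the word Små from raw bytes so that it does not depend on the system encoding. It uses the label latin1, which agrees with Windows-1252 for the letter å; the object names are ours.

```r
# "Små" as three bytes in Latin-1; the letter å is the single byte e5
x <- rawToChar(as.raw(c(0x53, 0x6d, 0xe5)))
Encoding(x) <- "latin1"

# iconv() applies the translation table: å becomes the two bytes c3 a5
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(x)   # 53 6d e5
charToRaw(y)   # 53 6d c3 a5

# Relabeling changes no bytes; the UTF-8 bytes are merely read as Latin-1,
# which is how å turns into the two characters Ã and ¥
z <- y
Encoding(z) <- "latin1"
identical(charToRaw(z), charToRaw(y))   # TRUE
```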

Having established the importance of keeping track of the encodings of text and web-based text, in particular, we now turn to the question of how to figure out the encoding of an unknown text. Luckily, in many instances a website gives a pointer in its header. Consider the <meta> tag with the http-equiv attribute from the website of the Science Journal, which is located at http://www.sciencemag.org/.


R> library(RCurl)
R> enc.test <- getURL("http://www.sciencemag.org/")

R> unlist(str_extract_all(enc.test, "<meta.+?>"))
[1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />"
[2] "<meta name=\"googlebot\" content=\"NOODP\" />"
[3] "<meta name=\"HW.ad-path\" content=\"/\" />"

The first tag provides some structured information on the type of content we can expect on the site as well as how the characters are encoded—in this case UTF-8. But what if such a tag is not available? While it is difficult to guess the encoding of a particular text, a couple of handy functions toward this end have been implemented in the tau package. There are three functions available to test the encoding of a particular string, is.ascii(), is.locale(), and is.utf8(). What these functions do is to test whether the binary sequences are “legal” in a particular encoding scheme. Recall that the letter “å” is stored as a particular binary sequence in the local encoding scheme. This binary sequence is not valid in the ASCII scheme—hence, the string cannot have been encoded in ASCII. And in fact, this is what we find:


R> library(tau)
R> is.locale(small.frogs)
[1] TRUE
R> is.ascii(small.frogs)
[1] FALSE
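When a <meta> tag like the one shown earlier is available, the declared encoding can itself be pulled out with a regular expression. The following sketch works on a hard-coded copy of that tag so that it runs without a network connection. The pattern and object names are ours, and real pages may declare the character set in other ways.

```r
library(stringr)

# A hard-coded copy of the first <meta> tag from the output above
meta.tag <- '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'

# Match "charset=" followed by everything up to a quote, blank, slash,
# closing bracket, or semicolon
declared <- str_extract(meta.tag, "charset=[^\"' />;]+")

# Drop the attribute name to keep only the encoding label
charset <- str_replace(declared, "charset=", "")
charset
```

The extracted label could then be passed to the from argument of iconv().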

Summary

Many aspects of automated data collection deal with textual data. Every step of a typical web scraping exercise might involve some form of string manipulation, whether you need to format a URL request according to your needs, collect information from an HTML page, (re-)arrange results that come in the form of strings, or perform general data cleansing. This chapter has introduced the most important tool for any of these tasks: regular expressions. These expressions allow you to search for information using highly flexible queries.

The chapter has also outlined the main elements of string manipulation. First, we considered the ingredients of regular expressions as implemented in R. Starting with the simplest of all cases where a character represents itself in a regular expression, we subsequently treated more elaborate concepts to generalize searches, such as quantifiers and character classes. In the second step, we considered how regular expressions are applied and how string manipulation is generally performed. To do so, we principally looked at the range of functions provided by the stringr package and several functions that go beyond the package. The chapter concluded with a discussion on how to deal with character encodings.

Further reading

In this chapter, we introduced extended regular expressions as implemented in R. Check out http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html for an overview of the available concepts. We restricted our exposition to extended regular expressions, as these suffice to accomplish most common tasks in string manipulation. There is, however, a second flavor of regular expressions that is implemented in R: Perl regular expressions. These introduce several aspects that allow string manipulations that were not discussed in this chapter. Should you be interested in finding out more about Perl regular expressions, check out http://www.pcre.org/.

Problems

  1. Describe regular expressions and why they can be used for web scraping purposes.

  2. Find a regular expression that matches any text.

  3. Copy the introductory example. The vector name stores the extracted names.

    1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
    2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
    3. Construct a logical vector indicating whether a character has a second name.
  4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

    1. [0-9]+\$
    2. \b[a-z]{1,4}\b
    3. .*?\.txt$
    4. \d{2}/\d{2}/\d{4}
    5. <(.+?)>.+?</\1>
  5. Rewrite the expression [0-9]+\$ in a way that all elements are altered but the expression performs the same task.

  6. Consider the mail address chunkylover53[at]aol[dot]com.

    1. Transform the string to a standard mail format using regular expressions.
    2. Imagine we are trying to extract the digits in the mail address. To do so we write the expression [:digit:]. Explain why this fails and correct the expression.
    3. Instead of using the predefined character classes, we would like to use the predefined symbols to extract the digits in the mail address. To do so we write the expression \D. Explain why this fails and correct the expression.
  7. Consider the string <title>+++BREAKING NEWS+++</title>. We would like to extract the first HTML tag. To do so we write the regular expression <.+>. Explain why this fails and correct the expression.

  8. Consider the string (5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression.

  9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

    
    clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
    Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
    d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
    fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
    
  10. Why is it important to be familiar with character encodings when working with string data?
