© Thomas Mailund 2019
Thomas Mailund, R Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-4894-2_8

8. Working with Strings: stringr

Thomas Mailund
Aarhus, Denmark
 
The stringr package gives you functions for string manipulation. The package will be loaded when you load the tidyverse package:
library(tidyverse)
You can also load the package alone using
library(stringr)

Counting String Patterns

The str_count() function counts how many tokens a string contains, where tokens, for example, can be characters or words.

By default, str_count() will count the number of characters in a string.
strings <- c(
    "Give me an ice cream",
    "Get yourself an ice cream",
    "We are all out of ice creams",
    "I scream, you scream, everybody loves ice cream.",
    "one ice cream,
    two ice creams,
    three ice creams",
    "I want an ice cream. Do you want an ice cream?"
)
str_count(strings)
## [1] 20 25 28 48 55 46

For each of the strings in strings, we get the number of characters. The result is the same as we would get using nchar() (but not length(), which would give us the length of the vector containing them, six in this example).
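The difference between counting characters and counting elements can be checked directly. The following is a minimal sketch; the vector fruit_words is a made-up example and is not part of the data above.

```r
library(stringr)

# Hypothetical example vector, not taken from the text above
fruit_words <- c("apple", "fig", "banana")

char_counts <- str_count(fruit_words)  # characters per string
base_counts <- nchar(fruit_words)      # base R equivalent
n_strings   <- length(fruit_words)     # number of elements in the vector

print(char_counts)
print(all(char_counts == base_counts))
print(n_strings)
```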

You can be explicit in specifying that str_count() should count characters by giving it a boundary() option. This determines the boundary between tokens, that is, the units to count.
str_count(strings, boundary("character"))
## [1] 20 25 28 48 55 46
If you want to count words instead, you can use boundary("word"):
str_count(strings, boundary("word"))
## [1] 5 5 7 8 9 11
You can use two additional options to boundary(): "line_break" and "sentence". They use heuristics, dependent on a locale(), to determine how many line breaks and sentences the text contains.
str_count(strings, boundary("line_break"))
## [1] 5 5 7 8 11 11
str_count(strings, boundary("sentence"))
## [1] 1 1 1 1 3 2

Notice that the line breaks are not the newlines in the text. The line breaks are where you would be expected to put newlines in your locale().

Finally, str_count() lets you count how often a substring is found in a string:
str_count(strings, "ice cream")
## [1] 1 1 1 1 3 2
str_count(strings, "cream") # gets the screams as well
## [1] 1 1 1 3 3 2
The pattern you ask str_count() to count is not just a string. It is a regular expression. Some characters take on special meaning in regular expressions. For example, a dot represents any single character, not a full stop.
str_count(strings, ".")
## [1] 20 25 28 48 53 46
If you want your pattern to be taken as a literal string and not a regular expression, you can wrap it in fixed():
str_count(strings, fixed("."))
## [1] 0 0 0 1 0 1
Since the pattern is a regular expression, we can use it to count punctuation characters:
str_count(strings, "[:punct:]")
## [1] 0 0 0 3 2 2
Or the number of times ice cream(s) is at the end of the string:
str_count(strings, "ice creams?$")
## [1] 1 1 1 0 1 0
The s? means zero or one s, and the $ means the end of the string. To also match at the end of the string when the match may be followed by a punctuation mark, put an optional [:punct:]? before the $:
str_count(strings, "ice creams?[:punct:]?$")
## [1] 1 1 1 1 1 1

Splitting Strings

Sometimes you want to split a string based on some separator, not unlike how we split on commas in comma-separated value files. The stringr function for this is str_split().

We can, for example, split on a space:
strings <- c(
    "one",
    "two",
    "one two",
    "one   two",
    "one. two."
)
str_split(strings, " ")
## [[1]]
## [1] "one"
##
## [[2]]
## [1] "two"
##
## [[3]]
## [1] "one" "two"
##
## [[4]]
## [1] "one" ""  ""  "two"
##
## [[5]]
## [1] "one." "two."

Since we are splitting on a single space, we get empty strings for "one   two", which contains three spaces.
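One way to avoid the empty strings, sketched here as an assumption rather than as the chapter's own approach, is to split on a run of whitespace instead of a single space. The pattern "\\s+" and the vector spaced are illustrative choices.

```r
library(stringr)

# "\\s+" matches one or more whitespace characters,
# so three spaces in a row count as a single separator.
spaced <- c("one two", "one   two")
pieces <- str_split(spaced, "\\s+")
print(pieces)
```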

You can use the boundary() function for splitting as well. For example, you can split a string into its words using boundary("word"):
str_split(strings, boundary("word"))
## [[1]]
## [1] "one"
##
## [[2]]
## [1] "two"
##
## [[3]]
## [1] "one" "two"
##
## [[4]]
## [1] "one" "two"
##
## [[5]]
## [1] "one" "two"

When we do this, we get rid of the empty strings from the previous example, and we also get rid of the full stops in the last string.

Capitalizing Strings

You can use str_to_lower() to transform a string into all lowercase.
macdonald <- "Old MACDONALD had a farm."
str_to_lower(macdonald)
## [1] "old macdonald had a farm."
Similarly, you can use str_to_upper() to translate it into all uppercase.
str_to_upper(macdonald)
## [1] "OLD MACDONALD HAD A FARM."
If you use str_to_sentence(), the first character is uppercase and the rest lowercase.
str_to_sentence(macdonald)
## [1] "Old macdonald had a farm."
The str_to_title() function will capitalize all words in your string.
str_to_title(macdonald)
## [1] "Old Macdonald Had A Farm."

Wrapping, Padding, and Trimming

If you want to wrap strings, that is, add newlines, so they fit into a certain width, you can use str_wrap().
strings <- c(
    "Give me an ice cream",
    "Get yourself an ice cream",
    "We are all out of ice creams",
    "I scream, you scream, everybody loves ice cream.",
    "one ice cream,
    two ice creams,
    three ice creams",
    "I want an ice cream. Do you want an ice cream?"
)
str_wrap(strings)
## [1] "Give me an ice cream"
## [2] "Get yourself an ice cream"
## [3] "We are all out of ice creams"
## [4] "I scream, you scream, everybody loves ice cream."
## [5] "one ice cream, two ice creams, three ice creams"
## [6] "I want an ice cream. Do you want an ice cream?"
The default width is 80 characters, but you can change that using the width argument.
str_wrap(strings, width = 10)
## [1] "Give me an ice cream"
## [2] "Get yourself an ice cream"
## [3] "We are all out of ice creams"
## [4] "I scream, you scream, everybody loves ice cream."
## [5] "one ice cream, two ice creams, three ice creams"
## [6] "I want an ice cream. Do you want an ice cream?"
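The wrap points are newline characters inside the returned strings, which are easy to miss in printed vector output. The following sketch, using only the first example string, counts the inserted newlines and checks the width of each resulting line. The variable names are illustrative.

```r
library(stringr)

wrapped <- str_wrap("Give me an ice cream", width = 10)

# Number of newline characters that str_wrap() inserted
n_breaks <- str_count(wrapped, fixed("\n"))

# Split on the newlines to look at the individual lines
wrapped_lines <- str_split(wrapped, fixed("\n"))[[1]]
line_widths <- str_length(wrapped_lines)

writeLines(wrapped)   # prints each wrapped line on its own row
print(line_widths)
```

Printing with writeLines() shows the text the way it would appear in a file or a report, while print() shows the newlines as \n escapes.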
You can indent the first line in the strings while wrapping them using the indent argument.
str_wrap(strings, width = 10, indent = 2)
## [1] " Give me an ice cream"
## [2] " Get yourself an ice cream"
## [3] " We are all out of ice creams"
## [4] " I scream, you scream, everybody loves ice cream."
## [5] " one ice cream, two ice creams, three ice creams"
## [6] " I want an ice cream. Do you want an ice cream?"

If you want your string to be left, right, or center justified, you can use str_pad().

The default is right-justifying strings.
str_pad(strings, width = 50)
## [1] "                              Give me an ice cream"
## [2] "                         Get yourself an ice cream"
## [3] "                      We are all out of ice creams"
## [4] "  I scream, you scream, everybody loves ice cream."
## [5] "one ice cream,\n    two ice creams,\n    three ice creams"
## [6] "    I want an ice cream. Do you want an ice cream?"
If you want to left justify instead, you can pass "right" to the side argument.
str_pad(strings, width = 50, side = "right")
## [1] "Give me an ice cream                              "
## [2] "Get yourself an ice cream                         "
## [3] "We are all out of ice creams                      "
## [4] "I scream, you scream, everybody loves ice cream.  "
## [5] "one ice cream,\n    two ice creams,\n    three ice creams"
## [6] "I want an ice cream. Do you want an ice cream?    "

You need to use "right" to left justify because the side argument determines which side to pad, and for left-justified text, the padding is on the right.

If you want to center your text, you should use "both"; you are padding both on the left and on the right.
str_pad(strings, width = 50, side = "both")
## [1] "               Give me an ice cream               "
## [2] "            Get yourself an ice cream             "
## [3] "           We are all out of ice creams           "
## [4] " I scream, you scream, everybody loves ice cream. "
## [5] "one ice cream,\n    two ice creams,\n    three ice creams"
## [6] "  I want an ice cream. Do you want an ice cream?  "
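Because spaces are hard to see in printed output, it can help to pad with a visible character when comparing the three values of side. This is a small sketch; the string "abc" and the pad character "-" are arbitrary choices for illustration.

```r
library(stringr)

# A visible pad character makes it clear which side receives the padding.
left_padded  <- str_pad("abc", width = 7, side = "left",  pad = "-")  # right-justified
right_padded <- str_pad("abc", width = 7, side = "right", pad = "-")  # left-justified
both_padded  <- str_pad("abc", width = 7, side = "both",  pad = "-")  # centered

print(c(left_padded, right_padded, both_padded))
```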
In these padding examples, we do not keep the lengths of the strings below the padding width. If a string is longer than the padding width, it is unchanged. You can use the str_trunc() function to cut the width down to a certain value. For example, we could truncate all the strings to width 25 before we pad them:
strings %>% str_trunc(25) %>% str_pad(width = 25, side = "left")
## [1] "     Give me an ice cream"
## [2] "Get yourself an ice cream"
## [3] "We are all out of ice ..."
## [4] "I scream, you scream, ..."
## [5] "one ice cream,\n    two..."
## [6] "I want an ice cream. D..."
The str_trim() function removes whitespace to the left and right of a string:
str_trim(c(
    " one small coke",
    "two large cokes ",
    " three medium cokes "
))
## [1] "one small coke" "two large cokes"
## [3] "three medium cokes"

It keeps whitespace inside the string.

Since str_trim() does not touch whitespace that is not flanking on the left or right, we cannot use it to remove extra spaces inside our string. For example, if we have two spaces between two words, as in the following example, str_trim() leaves them alone.
str_trim(c(
    "  one  small  coke",
    "two  large  cokes  ",
    "  three  medium  cokes  "
))
## [1] "one  small  coke"     "two  large  cokes"
## [3] "three  medium  cokes"
If we want the two spaces to be shortened into a single space, we can use str_squish() instead.
str_squish(c(
    "  one  small  coke",
    "two  large  cokes  ",
    "  three  medium  cokes  "
))
## [1] "one small coke" "two large cokes"
## [3] "three medium cokes"

Detecting Substrings

To check if a substring is found in another string, you can use str_detect().
str_detect(strings, "me")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
str_detect(strings, "I")
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
str_detect(strings, "cream")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
The pattern is a regular expression, so to test for ice cream followed by a full stop, you cannot simply search for "ice cream." because the dot matches any character.
str_detect(strings, "ice cream.")
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
You can, again, use a fixed() string.
str_detect(strings, fixed("ice cream."))
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
Alternatively, you can escape the dot.
str_detect(strings, "ice cream\\.")
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
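R string literals treat a backslash as an escape character, so a regular expression that needs a backslash before the dot has to be typed with a doubled backslash in R source. The sketch below illustrates this; the vector candidates is a made-up example.

```r
library(stringr)

# In R source, "\\." is a two-character string: a backslash and a dot.
escaped_dot <- "\\."
writeLines(escaped_dot)          # shows the pattern the regex engine receives
print(str_length(escaped_dot))

# Only a literal full stop after "ice cream" matches the escaped pattern.
candidates <- c("ice cream.", "ice cream!")
dot_matches <- str_detect(candidates, "ice cream\\.")
print(dot_matches)
```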
You can test if a substring is not found in a string by setting the negate argument to TRUE.
str_detect(strings, fixed("ice cream."), negate = TRUE)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
Two special case functions test for a string at the start or end of a string:
str_starts(strings, "I")
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
str_ends(strings, fixed("."))
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
If you want to know where a substring is found, you can use str_locate(). It will give you the start and the end index where it found a match.
str_locate(strings, "ice cream")
##      start end
## [1,]    12  20
## [2,]    17  25
## [3,]    19  27
## [4,]    39  47
## [5,]     5  13
## [6,]    11  19
Here you get a start and an end index for each string, but string number six has more than one occurrence of the pattern.
strings[6]
## [1] "I want an ice cream. Do you want an ice cream?"
You only get the indices for the first occurrence when you use str_locate().
str_locate(strings[6], "ice cream")
##     start end
## [1,]   11  19
The function str_locate_all() gives you all occurrences.
str_locate_all(strings[6], "ice cream")
## [[1]]
##     start end
## [1,]   11  19
## [2,]   37  45
If you want the start and end points of the strings between the occurrences, you can use invert_match().
ice_cream_locations <- str_locate_all(strings[6], "ice cream")
ice_cream_locations
## [[1]]
##      start end
## [1,]    11  19
## [2,]    37  45
invert_match(ice_cream_locations[[1]])
##     start end
## [1,]    0  10
## [2,]   20  36
## [3,]   46  -1

Extracting Substrings

To extract a substring matching a pattern, you can use str_extract(). It gives you the first substring that matches a regular expression.
str_extract(strings, "(s|ice )cream\\w*")
## [1] "ice cream" "ice cream" "ice creams"
## [4] "scream"    "ice cream" "ice cream"
It only gives you the first match, but if you want all substrings that match you can use str_extract_all().
strings[4]
## [1] "I scream, you scream, everybody loves ice cream."
str_extract(strings[4], "(s|ice )cream\\w*")
## [1] "scream"
str_extract_all(strings[4], "(s|ice )cream\\w*")
## [[1]]
## [1] "scream"    "scream"    "ice cream"

Transforming Strings

We can replace a substring that matches a pattern with some other string.
lego_str <- str_replace(strings, "ice cream[s]?", "LEGO")
lego_str
## [1] "Give me an LEGO"
## [2] "Get yourself an LEGO"
## [3] "We are all out of LEGO"
## [4] "I scream, you scream, everybody loves LEGO."
## [5] "one LEGO, two ice creams, three ice creams"
## [6] "I want an LEGO. Do you want an ice cream?"
lego_str <- str_replace(lego_str, "an LEGO", "a LEGO")
lego_str
## [1] "Give me a LEGO"
## [2] "Get yourself a LEGO"
## [3] "We are all out of LEGO"
## [4] "I scream, you scream, everybody loves LEGO."
## [5] "one LEGO, two ice creams, three ice creams"
## [6] "I want a LEGO. Do you want an ice cream?"
These two replacement operations can be written as a pipeline to make the code more Tidyverse-y:
strings %>%
    str_replace("ice cream[s]?", "LEGO") %>%
    str_replace("an LEGO", "a LEGO")
## [1] "Give me a LEGO"
## [2] "Get yourself a LEGO"
## [3] "We are all out of LEGO"
## [4] "I scream, you scream, everybody loves LEGO."
## [5] "one LEGO, two ice creams, three ice creams"
## [6] "I want a LEGO. Do you want an ice cream?"
Like most of the previous functions, str_replace() only affects the first match. To replace all occurrences, you need str_replace_all().
strings %>%
    str_replace_all("ice cream[s]?", "LEGO") %>%
    str_replace_all("an LEGO", "a LEGO")
## [1] "Give me a LEGO"
## [2] "Get yourself a LEGO"
## [3] "We are all out of LEGO"
## [4] "I scream, you scream, everybody loves LEGO."
## [5] "one LEGO, two LEGO, three LEGO"
## [6] "I want a LEGO. Do you want a LEGO?"
You can refer back to matching groups in the replacement string, something you will be familiar with for regular expressions.
us_dates <- c(
    valentines = "2/14",
    my_birthday = "2/15",
    # no one knows but let's just go
    # with this
    jesus_birthday = "12/24"
)
# US date format to a more sane format
str_replace(us_dates, "(.*)/(.*)", "\\2/\\1")
## [1] "14/2" "15/2" "24/12"
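If you want the matched groups as separate values rather than inside a replacement string, str_match() returns them as columns of a matrix. The sketch below assumes a single date and a digits-only pattern, both chosen here for illustration.

```r
library(stringr)

us_date <- "2/14"

# str_match() returns a matrix: column 1 is the full match,
# the remaining columns are the capture groups in order.
parts <- str_match(us_date, "(\\d+)/(\\d+)")
month <- parts[1, 2]
day   <- parts[1, 3]

print(parts)
print(str_c(day, "/", month))
```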
The str_dup() function duplicates a string, that is, it repeats a string several times.
str_c(
    "NA",
    str_dup("-NA", times = 7),
    " BATMAN!"
)
## [1] "NA-NA-NA-NA-NA-NA-NA-NA BATMAN!"
Here we also used str_c() to concatenate strings. This function works differently from c(); the latter will create a vector of multiple strings, while the former will create one string.
# -- concatenation -------------------------------------------
c("foo", "bar", "baz")
## [1] "foo" "bar" "baz"
str_c("foo", "bar", "baz")
## [1] "foobarbaz"
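The str_c() function also takes sep and collapse arguments, which control what goes between the pieces. A brief sketch with arbitrary separators:

```r
library(stringr)

# sep is placed between the arguments of a single call
with_sep <- str_c("foo", "bar", "baz", sep = "-")

# collapse joins the elements of a vector into one string
collapsed <- str_c(c("foo", "bar", "baz"), collapse = ", ")

print(with_sep)
print(collapsed)
```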

A more direct way to extract and modify a substring is using str_sub(). It lets you extract a substring specified by a start and an end index, and if you assign to it, you replace the substring. The str_sub() function is less powerful than the other functions as it doesn’t work on regular expressions, but because of this, it is also easier to understand.

If you do not know where a substring is found, you must first find it. You can use str_locate() for this.
my_string <- "this is my string"
my_location <- str_locate(my_string, "my")
my_location
##     start end
## [1,]    9  10
s <- my_location[,"start"]
e <- my_location[,"end"]
str_sub(my_string, s, e)
## [1] "my"
my_string_location <- str_locate(my_string, "string")
s <- my_string_location[,"start"]
e <- my_string_location[,"end"]
str_sub(my_string, s, e)
## [1] "string"
your_string <- my_string
s <- my_location[,"start"]
e <- my_location[,"end"]
str_sub(your_string, s, e) <- "your"
your_string
## [1] "this is your string"
your_banana <- your_string
your_string_location <- str_locate(your_string, "string")
s <- your_string_location[,"start"]
e <- your_string_location[,"end"]
str_sub(your_banana, s, e) <- "banana"
your_banana
## [1] "this is your banana"
When you assign to a call to str_sub(), it looks like you are modifying a string. This is an illusion. Assignment functions create new data and change the data that a variable refers to. So, if you have more than one reference to a string, be careful. Only one variable will point to the new value; the remaining will point to the old string. This is not specific to str_sub() but for R in general, and it is a potential source of errors.
my_string
## [1] "this is my string"
your_string
## [1] "this is your string"
your_banana
## [1] "this is your banana"
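The same behavior can be seen in a smaller, self-contained sketch. The names original and alias are hypothetical and only serve to show that the assignment rebinds one variable and leaves the other alone.

```r
library(stringr)

original <- "this is my string"
alias <- original          # both names refer to the same value for now

# Assigning through str_sub() rebinds alias to a new string
str_sub(alias, 1, 4) <- "that"

print(original)
print(alias)
```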

If you often write code to produce standard reports, virtually the same text each time but with a few selected values changed, then you are going to love str_glue(). This does precisely what you want. You give str_glue() a template string, a string with the mutable pieces in curly brackets. The result of calling str_glue() is the template text but with each pair of curly brackets replaced by the value of the R expression it contains.

The most straightforward use is when the template refers to variables.
macdonald <- "Old MacDonald"
eieio <- "E-I-E-I-O"
str_glue("{macdonald} had a farm. {eieio}")
## Old MacDonald had a farm. E-I-E-I-O
The variables do not need to be global. They can also be named arguments to str_glue().
str_glue(
    "{macdonald} had a farm. {eieio}",
    macdonald = "Thomas",
    eieio = "He certainly did not!"
)
## Thomas had a farm. He certainly did not!
Generally, you can put R expressions in the curly brackets, and the result of evaluating the expressions will be what is inserted into the template string.
str_glue(
    "{str_dup('NA-', times = 7)}NA BATMAN!"
)
## NA-NA-NA-NA-NA-NA-NA-NA BATMAN!
x <- 1:10
str_glue(
  "Holy {mean(x)} BATMAN!"
  )
## Holy 5.5 BATMAN!
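When the expressions in the template evaluate to vectors, str_glue() produces one string per element. The sketch below uses two made-up vectors, animals and sounds, to illustrate this.

```r
library(stringr)

# Hypothetical data for illustration
animals <- c("cow", "pig", "duck")
sounds  <- c("moo", "oink", "quack")

# One output string per element; the two vectors are matched up by position
verses <- str_glue("The {animals} says {sounds}")
print(verses)
```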