Topics in This Chapter
6.1 Converting between Strings and Code Point Sequences
6.7 Regular Expression Literals
6.9 Regular Expressions and Unicode
6.10 The Methods of the RegExp Class
In this chapter, you will learn about the methods that the standard library provides for string processing. We will then turn to regular expressions, which let you find strings that match patterns. After an introduction to the syntax of regular expressions and the JavaScript-specific idiosyncrasies, you will see how to use the API for finding and replacing matches.
A string is a sequence of Unicode code points. Each code point is an integer between zero and 0x10FFFF. The fromCodePoint function of the String class assembles a string from code point arguments:
let str = String.fromCodePoint(0x48, 0x69, 0x20, 0x1F310, 0x21) // 'Hi 🌐!'
If the code points are in an array, use the spread operator:
let codePoints = [0x48, 0x69, 0x20, 0x1F310, 0x21]
str = String.fromCodePoint(...codePoints)
Conversely, you can turn a string into an array of code points:
let characters = [...str] // [ 'H', 'i', ' ', '🌐', '!' ]
The result is an array of strings, each containing a single code point. You can obtain the code points as integers:
codePoints = [...str].map(c => c.codePointAt(0))
JavaScript stores strings as sequences of UTF-16 code units. The offset in a call such as 'Hi 🌐!'.codePointAt(i) refers to the UTF-16 encoding. In this example, valid offsets are 0, 1, 2, 3, and 5. If the offset falls in the middle of a pair of code units that make up a single code point, the call returns the second code unit of the pair (a lone surrogate), which is not a valid code point.
If you want to traverse the code points of a string without putting them in an array, use this loop:
for (let i = 0; i < str.length; i++) {
let cp = str.codePointAt(i)
if (cp > 0xFFFF) i++
. . . // Process the code point cp
}
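The same traversal can also be written with a for...of loop, which visits a string one code point at a time. The following is a minimal sketch; the helper name codePointsOf is hypothetical and not part of the standard library.

```javascript
// Hypothetical helper: collect the code points of a string as integers.
const codePointsOf = str => {
  const result = []
  for (const ch of str) {            // for...of iterates by code point, not code unit
    result.push(ch.codePointAt(0))
  }
  return result
}

const greeting = 'Hi \u{1F310}!'     // 'Hi 🌐!'
console.log(greeting.length)         // 6 (UTF-16 code units)
console.log(codePointsOf(greeting))  // [ 72, 105, 32, 127760, 33 ]
```

The string has six code units but only five code points, because the globe character occupies two code units.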
The indexOf method yields the index of the first occurrence of a substring:
let index = 'Hello yellow'.indexOf('el') // 1
The lastIndexOf method yields the index of the last occurrence:
index = 'Hello yellow'.lastIndexOf('el') // 7
As with all offsets into JavaScript strings, these values are offsets into the UTF-16 encoding:
index = 'I💛yellow'.indexOf('el') // 4
The offset is 4 because the “yellow heart” emoji 💛 is encoded with two UTF-16 code units.
If the substring is not present, these methods return -1.
The methods startsWith, endsWith, and includes return a Boolean result:
let isHttps = url.startsWith('https://')
let isGif = url.endsWith('.gif')
let isQuery = url.includes('?')
The substring method extracts a substring, given two offsets in UTF-16 code units. The substring contains all characters from the first offset up to, but not including, the second offset.
let substring = 'I💛yellow'.substring(3, 7) // 'yell'
If you omit the second offset, all characters until the end of the string are included:
substring = 'I💛yellow'.substring(3) // 'yellow'
The slice method is similar to substring, except that negative offsets are counted from the end of the string. -1 is the offset of the last code unit, -2 the offset of its predecessor, and so on. This is achieved by adding the string length to a negative offset.
'I💛yellow'.slice(-6, -2) // 'yell', same as slice(3, 7)
The length of 'I💛yellow' is 9 (recall that 💛 takes two code units). The offsets -6 and -2 are adjusted to 3 and 7.
With both the substring and slice methods, offsets larger than the string length are truncated to the length. Negative and NaN offsets are truncated to 0. (In the slice method, this happens after adding the string length to negative offsets.)
If the first argument to substring is larger than the second, the arguments are switched!
substring = 'I💛yellow'.substring(7, 3) // 'yell', same as substring(3, 7)
In contrast, str.slice(start, end) yields the empty string if start ≥ end.
I prefer the slice method over substring. It is more versatile, has a saner behavior, and the method name is shorter.
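The differences between slice and substring described above can be checked side by side. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
const s = 'I\u{1F49B}yellow'              // 'I💛yellow', 9 UTF-16 code units

const sliced = s.slice(3, 7)              // 'yell'
const swapped = s.substring(7, 3)         // 'yell' (substring swaps its arguments)
const emptySlice = s.slice(7, 3)          // ''     (slice does not swap)
const fromEnd = s.slice(-6, -2)           // 'yell' (offsets adjusted to 3 and 7)
const clampedStart = s.substring(-6, 7)   // 'I💛yell' (negative offset becomes 0)
const clampedEnd = s.slice(3, 100)        // 'yellow'  (offset truncated to the length)

console.log({ sliced, swapped, emptySlice, fromEnd, clampedStart, clampedEnd })
```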
Another way of taking a string apart is the split method. That method splits a string into an array of substrings, removing the provided separator.
let parts = 'Mary had a little lamb'.split(' ') // ['Mary', 'had', 'a', 'little', 'lamb']
You can supply a limit for the number of parts:
parts = 'Mary had a little lamb'.split(' ', 4) // ['Mary', 'had', 'a', 'little']
The separator can be a regular expression; see Section 6.12, “String Methods with Regular Expressions” (page 133).
Calling str.split('') with an empty separator splits the string into strings that each hold a 16-bit code unit, which is not useful if str contains characters above \u{FFFF}. Use [...str] instead.
In this section, you will find miscellaneous methods of the String class. Since strings are immutable in JavaScript, none of the string methods change the contents of a given string. They all return a new string with the result.
The repeat method yields a string repeated a given number of times:
const repeated = 'ho '.repeat(3) // 'ho ho ho '
The trim, trimStart, and trimEnd methods yield strings that remove leading and trailing white space, or just leading or trailing white space. White space characters include the space character, the nonbreaking space \u{00A0}, newline, tab, and 21 other characters with the Unicode character property White_Space.
The padStart and padEnd methods do the opposite: they add space characters until the string has a minimum length:
let padded = 'Hello'.padStart(10) // '     Hello', five spaces are added
You can also supply your own padding string:
padded = 'Hello'.padStart(10, '=-') // '=-=-=Hello'
The first parameter is the length of the padded string in UTF-16 code units. If your padding string contains characters that require two code units, you may get a malformed string:
padded = 'Hello'.padStart(10, '💛')
// Padded with two hearts and an unmatched code unit
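One way to avoid the malformed result is to count code points instead of code units when padding. The helper below, padStartByCodePoints, is a hypothetical name introduced only for this sketch; it is not a standard method.

```javascript
// Hypothetical helper: pad at the start until the string has
// targetLength code points (not code units).
const padStartByCodePoints = (str, targetLength, padding = ' ') => {
  const padPoints = [...padding]                  // split padding by code point
  const missing = targetLength - [...str].length
  if (missing <= 0 || padPoints.length === 0) return str
  let prefix = ''
  for (let i = 0; i < missing; i++) prefix += padPoints[i % padPoints.length]
  return prefix + str
}

const naive = 'Hello'.padStart(10, '\u{1F49B}')            // ends the padding with a lone surrogate
const safe = padStartByCodePoints('Hello', 10, '\u{1F49B}') // five whole hearts, then 'Hello'

console.log(naive.charCodeAt(4).toString(16))  // 'd83d', an unmatched high surrogate
console.log([...safe].length)                  // 10 code points
```

The trade-off is that the result is 10 code points long but 15 code units long, so it no longer satisfies a length limit expressed in code units.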
The toUpperCase and toLowerCase methods yield a string with all characters converted to upper- or lowercase.
let uppercased = 'Straße'.toUpperCase() // 'STRASSE'
As you can see, the toUpperCase method is aware of the fact that the uppercase of the German character 'ß' is the string 'SS'.
Note that toLowerCase does not recover the original string:
let lowercased = uppercased.toLowerCase() // 'strasse'
String operations such as conversion to upper- and lowercase can depend on the user’s language preferences. See Chapter 8 for methods toLocaleUpperCase, toLocaleLowerCase, localeCompare, and normalize that are useful when you localize your applications.
See Section 6.12, “String Methods with Regular Expressions” (page 133), for string methods match, matchAll, search, and replace that work with regular expressions.
The concat method concatenates a string with any number of arguments that are converted to strings.
const n = 7
let concatenated = 'agent'.concat(' ', n) // 'agent 7'
You can achieve the same effect with template strings or the join method of the Array class:
concatenated = `agent ${n}`
concatenated = ['agent', ' ', n].join('')
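Table 6-1 shows the most useful features of the String class.
Table 6-1 Useful Functions and Methods of the String Class

| Name | Description |
|---|---|
| Functions | |
| fromCodePoint(codePoints...) | Yields a string consisting of the given code points |
| Methods | |
| startsWith(s), endsWith(s), includes(s) | true if this string starts with, ends with, or contains s |
| indexOf(s), lastIndexOf(s) | The index of the first or last occurrence of s, or -1 if s is not present |
| substring(from, to), slice(from, to) | The substring of code units with index between from (inclusive) and to (exclusive) |
| repeat(n) | This string, repeated n times |
| trimStart(), trimEnd(), trim() | This string with leading, trailing, or leading and trailing white space removed |
| padStart(minLength, padString), padEnd(minLength, padString) | This string, padded at the start or end until its length reaches minLength |
| toLowerCase(), toUpperCase() | This string with all letters converted to lower or upper case |
| split(separator, maxParts) | An array of parts obtained by removing all copies of the separator (which can be a regular expression). If maxParts is provided, at most that many parts are returned |
| search(regex) | The index of the first match of regex, or -1 if there is no match |
| replace(regex, replacement) | This string, with the first match of regex (or all matches if the global flag is set) replaced |
| match(regex) | An array of matches if the global flag is set, otherwise the match results of the first match; null if there is no match |
| matchAll(regex) | An iterable of the match results |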
Finally, there are global functions for encoding URL components and entire URLs (or, more generally, URIs using schemes such as mailto or tel) into their “URL encoded” form. That form uses only characters that were considered “safe” when the Internet was first created. Suppose you need to produce a query for translating a phrase from one language into another. You might construct a URL like this:
const phrase = 'à coté de'
const prefix = 'https://www.linguee.fr/anglais-francais/traduction'
const suffix = '.html'
const url = prefix + encodeURIComponent(phrase) + suffix
The phrase is encoded into '%C3%A0%20cot%C3%A9%20de', the result of encoding characters into UTF-8 and encoding each byte into a code %hh with two hexadecimal digits. The only characters that are left alone are the “safe” characters
A-Z a-z 0-9 ! ' ( ) * . _ ~ -
In the less common case that you need to encode an entire URI, use the encodeURI function. It also leaves the characters
# $ & + , / : ; = ? @
unchanged since they can have special meanings in URIs.
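The difference between the two functions shows up as soon as the input contains URI delimiters. The sketch below uses the phrase from the text plus a made-up URL on example.com, which is only a placeholder.

```javascript
const phrase = '\u00E0 cot\u00E9 de'            // 'à coté de'

// encodeURIComponent escapes everything except the "safe" characters.
const component = encodeURIComponent(phrase)     // '%C3%A0%20cot%C3%A9%20de'
const query = encodeURIComponent('a/b?c=d&e')    // 'a%2Fb%3Fc%3Dd%26e'

// encodeURI keeps delimiters such as : / ? = so that the URI structure survives.
const whole = encodeURI('https://example.com/a b?q=\u00E0')
// 'https://example.com/a%20b?q=%C3%A0'

// The decoding functions reverse the process.
const roundTrip = decodeURIComponent(component)  // 'à coté de'

console.log(component, query, whole, roundTrip)
```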
In Chapter 1, you saw template literals, which are strings with embedded expressions:
const person = { name: 'Harry', age: 42 }
const message = `Next year, ${person.name} will be ${person.age + 1}.`
Template literals insert the values of the embedded expressions into the template string. In this example, the embedded expressions person.name and person.age + 1 are evaluated, converted to strings, and spliced with the surrounding string fragments. The result is the string
'Next year, Harry will be 43.'
You can customize the behavior of template literals with a tag function. As an example, we will be writing a tag function strong that produces an HTML string, highlighting the embedded values. The call
strong`Next year, ${person.name} will be ${person.age + 1}.`
will yield an HTML string
'Next year, <strong>Harry</strong> will be <strong>43</strong>.'
The tag function is called with the fragments of the literal string around the embedded expressions, followed by the expression values. In our example, the fragments are 'Next year, ', ' will be ', and '.', and the values are 'Harry' and 43. The tag function combines these pieces. The returned value is turned into a string if it is not already one.
Here is an implementation of the strong tag function:
const strong = (fragments, ...values) => {
  let result = fragments[0]
  for (let i = 0; i < values.length; i++)
    result += `<strong>${values[i]}</strong>${fragments[i + 1]}`
  return result
}
When processing the template string
strong`Next year, ${person.name} will be ${person.age + 1}.`
the strong function is called like this:
strong(['Next year, ', ' will be ', '.'], 'Harry', 43)
Note that all string fragments are put into an array, whereas the expression values are passed as separate arguments. The strong function uses a rest parameter (...values) to gather them all in a second array.
Also note that there is always one more fragment than there are expression values.
This mechanism is infinitely flexible. You can use it for HTML templating, number formatting, internationalization, and so on.
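As one illustration of HTML templating, a tag function can escape the embedded values before splicing them in. The names html and escapeHtml below are hypothetical, and the escaping covers only four characters, so treat this as a sketch rather than a complete sanitizer.

```javascript
// Hypothetical helper: escape the characters that are special in HTML text and attributes.
const escapeHtml = value => String(value)
  .replace(/&/g, '&amp;')
  .replace(/</g, '&lt;')
  .replace(/>/g, '&gt;')
  .replace(/"/g, '&quot;')

// Tag function: fragments are trusted markup, values are escaped.
const html = (fragments, ...values) => {
  let result = fragments[0]
  for (let i = 0; i < values.length; i++)
    result += escapeHtml(values[i]) + fragments[i + 1]
  return result
}

const userInput = '<script>alert("hi")</script>'
const markup = html`<p>${userInput}</p>`
console.log(markup)
// <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```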
If you prefix a template literal with String.raw, then backslashes are not escape characters:
path = String.raw`c:\users\nate`
Here, \u does not denote a Unicode escape, and \n is not turned into a newline character.
Even in raw mode, you cannot enclose arbitrary strings in backticks. You still need to escape all ` characters, $ before {, and \ before ` and {.
That doesn’t quite explain how String.raw works, though. Tag functions have access to a “raw” form of the template string fragments, in which backslash combinations such as \u and \n lose their special meanings.
Suppose we want to handle strings with Greek letters. We follow the convention of the LaTeX markup language for mathematical formulas. In that language, symbols start with backslashes. Therefore, raw strings are attractive: users want to write \nu and \upsilon, not \\nu and \\upsilon. Here is an example of a string that we want to be able to process:
greek`\nu=${factor}\upsilon`
As with any tagged template string, we need to define a function:
const greek = (fragments, ...values) => {
  const substitutions = { alpha: 'α', . . ., nu: 'ν', . . . }
  const substitute = str => str.replace(/\\[a-z]+/g,
    match => substitutions[match.slice(1)])
  let result = substitute(fragments.raw[0])
  for (let i = 0; i < values.length; i++)
    result += values[i] + substitute(fragments.raw[i + 1])
  return result
}
You access the raw string fragments with the raw property of the first parameter of the tag function. The value of fragments.raw is an array of string fragments with unprocessed backslashes.
In the preceding tagged template literal, fragments.raw is an array of two strings. The first string is \nu= and the second string is \upsilon, each including its backslash.
Note the following:
The \n in \nu is not turned into a newline.
The \u in \upsilon is not interpreted as a Unicode escape. In fact, it would not be syntactically correct. For that reason, fragments[1] cannot be parsed and is set to undefined.
${factor} is an embedded expression. Its value is passed to the tag function.
The greek function uses regular expression replacement, which is explained in detail in Section 6.13, “More about Regex Replace” (page 135). Identifiers starting with a backslash are replaced with their substitutions, such as ν for \nu.
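To see the raw and processed fragments side by side, a tag function can simply return both. The tag name inspect is hypothetical; the sketch assumes an engine that accepts invalid escape sequences in tagged templates (ECMAScript 2018 or later).

```javascript
// Hypothetical tag function that reports what a tag function receives.
const inspect = (fragments, ...values) => ({
  cooked: [...fragments],     // escape sequences processed (undefined if invalid)
  raw: [...fragments.raw],    // backslashes left untouched
  values
})

const factor = 2
const info = inspect`\nu=${factor}\upsilon`

console.log(info.raw)     // [ '\\nu=', '\\upsilon' ] as printed by Node, i.e. \nu= and \upsilon
console.log(info.cooked)  // [ '\nu=', undefined ]: a newline followed by u=, then undefined
console.log(info.values)  // [ 2 ]
```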
Regular expressions specify string patterns. Use them whenever you need to locate strings that match a particular pattern. For example, suppose you want to find hyperlinks in an HTML file. You need to look for strings of the form <a href=". . .">. But wait: there may be extra spaces, or the URL may be enclosed in single quotes. Regular expressions give you a precise syntax for specifying what sequences of characters are legal matches.
In a regular expression, a character denotes itself unless it is one of the reserved characters
. * + ? { | ( ) [ \ ^ $
For example, the regular expression href only matches the string href.
The symbol . matches any single character. For example, .r.f matches href and prof.
The * symbol indicates that the preceding construct may be repeated 0 or more times; with the + symbol, the repetition is 1 or more times. A suffix of ? indicates that a construct is optional (0 or 1 times). For example, be+s? matches be, bee, and bees. You can specify other multiplicities with { } (see Table 6-2).
A | denotes an alternative: .(oo+|ee+)f matches beef or woof. Note the parentheses: without them, .oo+|ee+f would be the alternative between .oo+ and ee+f. Parentheses are also used for grouping; see Section 6.11, “Groups” (page 131).
A character class is a set of character alternatives enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Inside a character class, the - denotes a range (all characters whose Unicode values fall within the two bounds). However, a - that is the first or last character in a character class denotes itself. A ^ as the first character in a character class denotes the complement: all characters except those specified. For example, [^0-9] denotes any character that is not a decimal digit.
There are six predefined character classes: \d (digits), \s (white space), \w (word characters), and their complements \D (non-digits), \S (nonwhite space), and \W (nonword characters).
The characters ^ and $ match the beginning and end of input. For example, ^[0-9]+$ matches a string entirely consisting of digits.
Be careful about the position of the ^ character. If it is the first character inside brackets, it denotes the complement: [^0-9]+$ matches a string of non-digits at the end of input.
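The patterns discussed above can be tried out directly. The sketch below wraps each pattern in ^ and $ where the text speaks of matching a whole string; the variable names are only for illustration.

```javascript
// Quantifiers: e one or more times, then an optional s
const bees = /^be+s?$/
console.log(['be', 'bee', 'bees', 'bs'].map(w => bees.test(w)))
// [ true, true, true, false ]

// Alternatives inside parentheses
const beefOrWoof = /^.(oo+|ee+)f$/
console.log(['beef', 'woof', 'wolf'].map(w => beefOrWoof.test(w)))
// [ true, true, false ]

// Anchors versus complement
const allDigits = /^[0-9]+$/
const trailingNonDigits = /[^0-9]+$/
console.log(allDigits.test('12345'), allDigits.test('12a45'))   // true false
console.log(trailingNonDigits.exec('agent007xyz')[0])           // 'xyz'
```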
I have a hard time remembering that ^ matches the start and $ the end. I keep thinking that $ should denote start, and on the US keyboard, $ is to the left of ^. But it’s exactly the other way around, probably since the archaic text editor QED used $ to denote the last line.
Table 6-2 summarizes the JavaScript regular expression syntax.
If you need to have a literal . * + ? { | ( ) [ \ ^ $, precede it by a backslash. Inside a character class, you only need to escape ] and \, provided you are careful about the positions of - and ^. For example, [\]^-] is a class containing ] ^ and -. (In JavaScript, an unescaped ] directly after the opening bracket closes the class, so [] is an empty class.)
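Table 6-2 Regular Expression Syntax

| Expression | Description | Example |
|---|---|---|
| Characters | | |
| A character other than the reserved characters listed in the text | Matches only the given character | J |
| . | Matches any character except a line terminator such as \n, or any character if the dotAll flag is set | |
| \u{hhhh}, \u{hhhhh} | The Unicode code point with the given hex value (requires the unicode flag) | \u{1F310} |
| \uhhhh, \xhh | The UTF-16 code unit with the given hex value | \xA0 |
| \f, \n, \r, \t, \v | Form feed (\x0C), newline (\x0A), carriage return (\x0D), tab (\x09), vertical tab (\x0B) | \n |
| \cL, where L is in [A-Za-z] | The control character corresponding to the character L | \cH is Ctrl-H or backspace (\x08) |
| \c, where c is not a letter or digit | The character c | \\ is a backslash |
| Character Classes | | |
| [C1C2. . .], where Ci are characters, ranges c-d, or character classes | Any of the characters represented by C1, C2, . . . | [0-9+-] |
| [^. . .] | Complement of a character class | [^\d\s] |
| \p{BooleanProperty}, \p{Property=Value}, \P{. . .} | A Unicode property (see Section 6.9); its complement (requires the unicode flag) | \p{L} is a Unicode letter |
| \d, \D | A digit [0-9]; the complement | \d+ is a sequence of digits |
| \w, \W | A word character [a-zA-Z0-9_]; the complement | |
| \s, \S | A space from [ \t\n\r\f\v], the nonbreaking space \u{A0}, or another Unicode white space character; the complement | \s*,\s* is a comma surrounded by optional white space |
| Sequences and Alternatives | | |
| XY | Any string from X, followed by any string from Y | [1-9][0-9]* is a positive number without leading zero |
| X\|Y | Any string from X or Y | http\|ftp |
| Grouping | | |
| (X) | Captures the match of X into a group (see Section 6.11) | '([^']*)' captures the quoted text |
| \n | Matches the nth group | (['"]).*\1 matches 'Fred' or "Fred" but not "Fred' |
| (?<name>X) | Captures the match of X with the given name | (?<quote>['"]) |
| \k<name> | The group with the given name | \k<quote> |
| (?:X) | Use parentheses without capturing X | In (?:http\|ftp)://(.*), the match after :// is \1 |
| Other (?. . .) | See Section 6.14 | |
| Quantifiers | | |
| X? | Optional X | \+? is an optional + sign |
| X*, X+ | 0 or more X, 1 or more X | [1-9][0-9]* |
| X{n}, X{n,}, X{m,n} | n times X, at least n times, between m and n times | [0-7]{3} is three octal digits |
| X*?, X+? | Reluctant quantifier, attempting the shortest match before trying longer matches | ".*?" matches one quoted string at a time |
| Boundary Matches | | |
| ^, $ | Beginning, end of input (or beginning, end of line if the multiline flag is set) | ^[0-9]+$ |
| \b, \B | Word boundary, nonword boundary | \bJava\B matches Java in JavaScript but not in Java code |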
A regular expression literal is delimited by slashes:
const timeRegex = /^([1-9]|1[0-2]):[0-9]{2} [ap]m$/
Regular expression literals are instances of the RegExp class.
The typeof operator, when applied to a regular expression, yields 'object'.
Inside the regular expression literal, use backslashes to escape characters that have special meanings in regular expressions, such as the . and + characters:
const fractionalNumberRegex = /[0-9]+\.[0-9]*/
Here, the escaped . means a literal period.
In a regular expression literal, you also need to escape a forward slash so that it is not interpreted as the end of the literal.
To convert a string holding a regular expression into a RegExp object, use the RegExp function, with or without new:
const fractionalNumberRegex = new RegExp('[0-9]+\\.[0-9]*')
Note that the backslash in the string must be escaped.
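The effect of forgetting to double the backslash is easy to observe. In the sketch below, the variable names are illustrative; the point is that '\.' inside a string literal is just a period by the time the RegExp constructor sees it.

```javascript
const fromLiteral = /[0-9]+\.[0-9]*/
const fromString = new RegExp('[0-9]+\\.[0-9]*')   // backslash doubled in the string
const tooLoose = new RegExp('[0-9]+\.[0-9]*')      // the string is '[0-9]+.[0-9]*'

console.log(fromLiteral.source === fromString.source)  // true
console.log(fromString.test('3x14'))                   // false, no literal period
console.log(tooLoose.test('3x14'))                     // true, . matches the x

// A forward slash must be escaped inside a literal.
const dateLike = /[0-9]+\/[0-9]+/
console.log(dateLike.test('6/10'))                     // true
```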
A flag modifies a regular expression’s behavior. One example is the i or ignoreCase flag. The regular expression
/[A-Z]+\.com/i
matches Horstmann.COM.
You can also set the flag in the constructor:
const regex = new RegExp(/[A-Z]+\.com/, 'i')
To find the flag values of a given RegExp object, you can use the flags property which yields a string of all flags. There is also a Boolean property for each flag:
regex.flags // 'i'
regex.ignoreCase // true
JavaScript supports six flags, shown in Table 6-3.
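Table 6-3 Regular Expression Flags

| Single Letter | Property Name | Description |
|---|---|---|
| i | ignoreCase | Case-insensitive match |
| m | multiline | ^ and $ match the beginning and end of a line |
| s | dotAll | . matches newlines |
| u | unicode | Match Unicode characters, not code units (see Section 6.9) |
| g | global | Find all matches (see Section 6.10) |
| y | sticky | Match must start at regex.lastIndex |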
The m or multiline flag changes the behavior of the start and end anchors ^ and $. By default, they match the beginning and end of the entire string. In multiline mode, they match the beginning and end of a line. For example,
/^[0-9]+/m
matches digits at the beginning of a line.
With the s or dotAll flag, the . pattern matches newlines. Without it, . matches any non-newline character.
The other three flags are explained in later sections.
You can use more than one flag. The following regular expression matches upper- or lowercase letters at the start of each line:
/^[A-Z]/im
For historical reasons, regular expressions work with UTF-16 code units, not Unicode characters. For example, the . pattern matches a single UTF-16 code unit, so the string
'Hello 🌐'
does not match the regular expression
/Hello .$/
The 🌐 character is encoded with two code units. The remedy is to use the u or unicode flag:
/Hello .$/u
With the u flag, the . pattern matches a single Unicode character, no matter how it is encoded in UTF-16.
If you need to keep your source files in ASCII, you can embed Unicode code points into regular expressions, using the \u{ } syntax:
/[A-Za-z]+ \u{1F310}/u
Without the u flag, /\u{1F310}/ matches the string 'u{1F310}'.
When working with international text, you should avoid patterns such as [A-Za-z] for denoting letters. These patterns won’t match letters in other languages. Instead, use \p{Property}, where Property is the name of a Boolean Unicode property. For example, \p{L} denotes a Unicode letter. The regular expression
/Hello, \p{L}+!/u
matches
'Hello, värld!'
and
'Hello, 世界!'
Table 6-4 shows the names of other common Boolean properties.
For Unicode properties whose values are not Boolean, use the syntax \p{Property=Value}. For example, the regular expression
/\p{Script=Han}+/u
matches any sequence of Chinese characters.
Using an uppercase P yields the complement: \P{L} matches any character that is not a letter.
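The following sketch exercises the u flag and Unicode property escapes. The sample strings are chosen for illustration, and non-ASCII characters are written as escapes so that the snippet stays in ASCII.

```javascript
const globeGreeting = 'Hello \u{1F310}'                 // 'Hello 🌐'

console.log(/Hello .$/.test(globeGreeting))             // false, . matches one code unit
console.log(/Hello .$/u.test(globeGreeting))            // true, . matches one code point

const letters = /^Hello, \p{L}+!$/u
console.log(letters.test('Hello, v\u00E4rld!'))         // true ('värld')
console.log(letters.test('Hello, \u4E16\u754C!'))       // true (two Han characters)
console.log(letters.test('Hello, 123!'))                // false, digits are not letters

const han = /\p{Script=Han}+/u
console.log(han.exec('JavaScript \u4E16\u754C')[0] === '\u4E16\u754C')  // true

console.log(/\P{L}/u.test('abc'))                       // false, every character is a letter
```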
Table 6-4 Common Boolean Unicode Properties

| Name | Description |
|---|---|
| L | Letter |
| Lu | Uppercase letter |
| Ll | Lowercase letter |
| Nd | Decimal number |
| P | Punctuation |
| S | Symbol |
| White_Space | White space, same as \s |
| Emoji, Emoji_Modifier, Emoji_Component | Emoji characters, modifiers, or components |
6.10 The Methods of the RegExp Class
The test method yields true if a string contains a match for the given regular expression:
/[0-9]+/.test('agent 007') // true
To test whether the entire string matches, your regular expression must use start and end anchors:
/^[0-9]+$/.test('agent 007') // false
The exec method yields an array holding the first matched subexpression, or null if there was no match.
For example,
/[0-9]+/.exec('agents 007 and 008')
returns an array containing the string '007'. (As you will see in the following section, the array can also contain group matches.)
In addition, the array that exec returns has two properties:
index is the index of the subexpression
input is the argument that was passed to exec
In other words, the array returned by the preceding call to exec is actually
['007', index: 7, input: 'agents 007 and 008']
To match multiple subexpressions, use the g or global flag:
let digits = /[0-9]+/g
Now each call to exec returns a new match:
let result = digits.exec('agents 007 and 008') // ['007', index: 7, . . .]
result = digits.exec('agents 007 and 008') // ['008', index: 15, . . .]
result = digits.exec('agents 007 and 008') // null
To make this work, the RegExp object has a property lastIndex that is set to the first index after the match in each successful call to exec. The next call to exec starts the match at lastIndex. The lastIndex property is set to zero when a regular expression is constructed or a match failed.
You can also set the lastIndex property to skip a part of the string.
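Repeated calls to exec are usually written as a loop that stops at null. The sketch below records lastIndex after each match to make the bookkeeping visible; the found array and its property names are invented for this example.

```javascript
const digits = /[0-9]+/g
const input = 'agents 007 and 008'

const found = []
let result
while ((result = digits.exec(input)) !== null) {
  // lastIndex now points just past the current match
  found.push({ match: result[0], index: result.index, next: digits.lastIndex })
}

console.log(found)
// [ { match: '007', index: 7, next: 10 }, { match: '008', index: 15, next: 18 } ]
console.log(digits.lastIndex)  // 0, reset by the final failed match
```

Without the g flag, lastIndex is never advanced, and a loop like this one would not terminate on an input that contains a match.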
With the y or sticky flag, the match must start exactly at lastIndex:
digits = /[0-9]+/y
digits.lastIndex = 5
result = digits.exec('agents 007 and 008') // null
digits.lastIndex = 8
result = digits.exec('agents 007 and 008') // ['07', index: 8, . . .]
If you simply want an array of all matched substrings, use the match method of the String class instead of repeated calls to exec; see Section 6.12, “String Methods with Regular Expressions” (page 133).
let results = 'agents 007 and 008'.match(/[0-9]+/g) // ['007', '008']
Groups are used for extracting components of a match. For example, here is a regular expression for parsing times with groups for each component:
let time = /([1-9]|1[0-2]):([0-5][0-9])([ap]m)/
The group matches are placed in the array returned by exec:
let result = time.exec('Lunch at 12:15pm') // ['12:15pm', '12', '15', 'pm', index: 9, . . .]
As in the preceding section, result[0] is the entire matched string. For i > 0, result[i] is the match for the ith group.
Groups are numbered by their opening parentheses. This matters if you have nested parentheses. Consider this example. We want to analyze line items of invoices that have the form
Blackwell Toaster USD29.95
Here is a regular expression with groups for each component:
/(\p{L}+(\s+\p{L}+)*)\s+([A-Z]{3})([0-9.]*)/u
In this situation, group 1 is 'Blackwell Toaster', the substring matched by the expression (\p{L}+(\s+\p{L}+)*), from the first opening parenthesis to its matching closing parenthesis.
Group 2 is ' Toaster', the substring matched by (\s+\p{L}+).
Groups 3 and 4 are 'USD' and '29.95'.
We aren’t interested in group 2; it only arose from the parentheses that were required for the repetition. For greater clarity, you can use a noncapturing group, by adding ?: after the opening parenthesis:
/(\p{L}+(?:\s+\p{L}+)*)\s+([A-Z]{3})([0-9.]*)/u
Now 'USD' and '29.95' are captured as groups 2 and 3.
When you have a group inside a repetition, such as (\s+\p{L}+)* in the example above, the corresponding group only holds the last match, not all matches.
If the repetition happened zero times, then the group match is set to undefined.
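The behavior of a group inside a repetition can be observed with the line item expression from the text. The two input strings below are made up for the illustration.

```javascript
const lineItem = /(\p{L}+(\s+\p{L}+)*)\s+([A-Z]{3})([0-9.]*)/u

// Two repetitions of the inner group: only the last one is kept.
const threeWords = lineItem.exec('Acme Deluxe Toaster USD29.95')
console.log(threeWords[1])  // 'Acme Deluxe Toaster'
console.log(threeWords[2])  // ' Toaster'
console.log(threeWords[3], threeWords[4])  // 'USD' '29.95'

// Zero repetitions: the group is undefined.
const oneWord = lineItem.exec('Toaster USD29.95')
console.log(oneWord[1])     // 'Toaster'
console.log(oneWord[2])     // undefined
```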
You can match against the contents of a captured group. For example, consider the regular expression:
/(['"]).*\1/
The group (['"]) captures either a single or double quote. The pattern \1 matches the captured string, so that "Fred" and 'Fred' match the regular expression but "Fred' does not.
Even though they are supposed to be outlawed in strict mode, several JavaScript engines still support octal character escapes in regular expressions. For example, \11 denotes \t, the character at code point 9.
However, if the regular expression has 11 or more capturing groups, then \11 denotes a match of the 11th group.
Numbered groups are rather fragile. It is much better to capture by name:
let lineItem = /(?<item>\p{L}+(\s+\p{L}+)*)\s+(?<currency>[A-Z]{3})(?<price>[0-9.]*)/u
When a regular expression has one or more named groups, the array returned by exec has a property groups whose value is an object holding group names and matches:
let result = lineItem.exec('Blackwell Toaster USD29.95')
let groupMatches = result.groups // { item: 'Blackwell Toaster', currency: 'USD', price: '29.95' }
The expression \k<name> matches against a group that was captured by name:
/(?<quote>['"]).*\k<quote>/
Here, the group with the name “quote” matches a single or double quote at the beginning of the string. The string must end with the same character. For example, "Fred" and 'Fred' are matches but "Fred' is not.
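Named groups combine well with destructuring. The sketch below uses a noncapturing inner group so that only the named groups are captured; that is a small variation on the expression in the text, not the author's exact pattern.

```javascript
const lineItem =
  /(?<item>\p{L}+(?:\s+\p{L}+)*)\s+(?<currency>[A-Z]{3})(?<price>[0-9.]*)/u

// Destructure the named matches from the groups property.
const { item, currency, price } =
  lineItem.exec('Blackwell Toaster USD29.95').groups
console.log(item, currency, price)  // 'Blackwell Toaster' 'USD' '29.95'

// A named backreference requires the same quote character at both ends.
const quoted = /(?<quote>['"]).*\k<quote>/
console.log(quoted.test('"Fred"'))   // true
console.log(quoted.test("'Fred'"))   // true
console.log(quoted.test(`"Fred'`))   // false
```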
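The features of the RegExp class are summarized in Table 6-5.
Table 6-5 Features of the RegExp Class

| Name | Description |
|---|---|
| Constructors | |
| new RegExp(regex, flags) | Constructs a regular expression from the given regex (a string or regular expression) and flags |
| Properties | |
| flags | A string of all flags |
| ignoreCase, multiline, dotAll, unicode, global, sticky | Boolean properties for all flag types |
| Methods | |
| test(str) | true if str contains a match for this regular expression |
| exec(str) | Match results for the current match of this regular expression inside str, or null if there is no match |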
As you saw in Section 6.10, “The Methods of the RegExp Class” (page 130), the workhorse method for getting match information is the exec method of the RegExp class. But its API is far from elegant. The String class has several methods that work with regular expressions and produce commonly used results more easily.
For a regular expression without the global flag set, the call str.match(regex) returns the same match results as regex.exec(str):
'agents 007 and 008'.match(/[0-9]+/) // ['007', index: 7, . . .]
With the global flag set, match simply returns an array of matches, which is often just what you want:
'agents 007 and 008'.match(/[0-9]+/g) // ['007', '008']
If there is no match, the String.match method returns null.
RegExp.exec and String.match are the only methods in the ECMAScript standard library that yield null to indicate the absence of a result.
If you have a global search and want all match results without calling exec repeatedly, you will like the matchAll method of the String class, which was standardized in ECMAScript 2020. It returns an iterable of the match results. Let’s say you want to look for all matches of the regular expression
let time = /([1-9]|1[0-2]):([0-5][0-9])([ap]m)/g
The loop
for (const [, hours, minutes, period] of input.matchAll(time)) {
  . . .
}
iterates over all match results, using destructuring to set hours, minutes, and period to the group matches. The initial comma ignores the entire matched expression.
The matchAll method yields the matches lazily. It is efficient if there are many matches but only a few are examined.
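Here is the matchAll loop filled in with a concrete body. The input string and the shape of the collected objects are assumptions made for this sketch.

```javascript
const time = /([1-9]|1[0-2]):([0-5][0-9])([ap]m)/g
const input = 'Standup at 9:30am, lunch at 12:15pm'

const times = []
for (const [, hours, minutes, period] of input.matchAll(time)) {
  // The group matches are strings; convert the numeric ones.
  times.push({ hours: Number(hours), minutes: Number(minutes), period })
}

console.log(times)
// [ { hours: 9, minutes: 30, period: 'am' }, { hours: 12, minutes: 15, period: 'pm' } ]
```

Note that matchAll throws a TypeError if the regular expression does not have the global flag.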
The search method returns the index of the first match or -1 if no match is found:
let index = 'agents 007 and 008'.search(/[0-9]+/) // Yields index 7
The replace method replaces the first match of a regular expression with a replacement string. To replace all matches, set the global flag:
let replacement = 'agents 007 and 008'.replace(/[0-9]/g, '?') // 'agents ??? and ???'
The split method can have a regular expression as argument. For example,
str.split(/\s*,\s*/)
splits str along commas that are optionally surrounded by white space.
In this section, we have a closer look at the replace method of the String class.
The replacement string parameter can contain patterns starting with a $ that are processed as shown in Table 6-6.
Table 6-6 Replacement String Patterns

| Pattern | Description |
|---|---|
| $`, $' | The portion before or after the matched string |
| $& | Matched string |
| $n | The nth group |
| $<name> | The group with the given name |
| $$ | Dollar sign |
For example, the following replacement repeats each vowel three times:
'hello'.replace(/[aeiou]/g, '$&$&$&') // 'heeellooo'
The most useful pattern is the group pattern. Here, we use groups to match the first and last name of a person in each line and flip them:
let names = 'Harry Smith\nSally Lin'
let flipped = names.replace(
  /^([A-Z][a-z]+) ([A-Z][a-z]+)/gm, "$2, $1")
  // 'Smith, Harry\nLin, Sally'
If the number after the $ sign is larger than the number of groups in the regular expression, the pattern is inserted verbatim:
let replacement = 'Blackwell Toaster $29.95'.replace('$29', '$19')
  // 'Blackwell Toaster $19.95' because there is no group 19
You can also use named groups:
flipped = names.replace(/^(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)$/gm,
  "$<last>, $<first>")
For more complex replacements, you can provide a function instead of a replacement string. The function receives the following arguments:
The string that was matched by the regular expression
The matches of all groups
The offset of the match
The entire string
In this example, we just process the group matches:
flipped = names.replace(/^([A-Z][a-z]+) ([A-Z][a-z]+)/gm,
  (match, first, last) => `${last}, ${first[0]}.`)
  // 'Smith, H.\nLin, S.'
The replace method also works with strings, replacing the first match of the string itself:
let replacement = 'Blackwell Toaster $29.95'.replace('$', 'USD')
  // Replaces $ with USD
Note that the $ is not interpreted as an end anchor.
If you call the search method with a string, it is converted to a regular expression:
let index = 'Blackwell Toaster $29.95'.search('$')
  // Yields 24, the end of the string, not the index of $
Use indexOf to search for a plain string.
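The sketch below shows a replacement function that uses all of its arguments, followed by the difference between search and indexOf for a string argument. The offsets array and the uppercase formatting are invented for the illustration.

```javascript
const names = 'Harry Smith\nSally Lin'
const offsets = []

const flipped = names.replace(
  /^([A-Z][a-z]+) ([A-Z][a-z]+)$/gm,
  (match, first, last, offset, whole) => {
    // offset is the position of this match inside whole (the entire string)
    offsets.push(offset)
    return `${last.toUpperCase()}, ${first}`
  })

console.log(flipped)   // 'SMITH, Harry\nLIN, Sally'
console.log(offsets)   // [ 0, 12 ]

const priced = 'Blackwell Toaster $29.95'
console.log(priced.search('$'))          // 24, '$' is treated as an end anchor
console.log(priced.indexOf('$'))         // 18, plain string search
console.log(priced.replace('$', 'USD'))  // 'Blackwell Toaster USD29.95'
```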
In the final section of this chapter, you will see several complex and uncommon regular expression features.
The + and * repetition operators are “greedy”: they match the longest possible strings. That’s generally desirable. You want /[0-9]+/ to match the longest possible string of digits, and not a single digit.
However, consider this example:
'"Hi" and "Bye"'.match(/".*"/g)
The result is the single match
'"Hi" and "Bye"'
because .* greedily matches everything until the final ". That does not help us if we want to match quoted substrings.
One remedy is to require non-quotes in the repetition:
'"Hi" and "Bye"'.match(/"[^"]*"/g)
Alternatively, you can specify that the match should be reluctant, by using the *? operator:
'"Hi" and "Bye"'.match(/".*?"/g)
Either way, now each quoted string is matched separately, and the result is
['"Hi"', '"Bye"']
There is also a reluctant version +? that requires at least one repetition.
The lookahead operator p(?=q) matches p provided it is followed by q, but does not include q in the match. For example, here we find the hours that precede a colon.
let hours = '10:30 - 12:00'.match(/[0-9]+(?=:)/g) // ['10', '12']
The inverted lookahead operator p(?!q) matches p provided it is not followed by q.
let minutes = '10:30 - 12:00'.match(/[0-9][0-9](?!:)/g) // ['30', '00']
There is also a lookbehind (?<=p)q that matches q as long as it is preceded by p.
minutes = '10:30 - 12:00'.match(/(?<=[0-9]+:)[0-9]+/g) // ['30', '00']
Note that the argument inside (?<=[0-9]+:) is itself a regular expression.
Finally, there is an inverted lookbehind (?<!p)q, matching q as long as it is not preceded by p.
hours = '10:30 - 12:00'.match(/(?<![0-9:])[0-9]+/g) // ['10', '12']
Regular expressions such as this one may have motivated Jamie Zawinski’s timeless quote, “Some people, when confronted with a problem, think: ‘I know, I’ll use regular expressions.’ Now they have two problems.”
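The greedy, reluctant, and lookaround examples from this section can be collected into one runnable sketch. The variable names are illustrative, and the lookbehind patterns assume an engine that supports them (ECMAScript 2018 or later).

```javascript
const quotes = '"Hi" and "Bye"'
const greedy = quotes.match(/".*"/g)        // [ '"Hi" and "Bye"' ]
const nonQuotes = quotes.match(/"[^"]*"/g)  // [ '"Hi"', '"Bye"' ]
const reluctant = quotes.match(/".*?"/g)    // [ '"Hi"', '"Bye"' ]

const schedule = '10:30 - 12:00'
const lookahead = schedule.match(/[0-9]+(?=:)/g)             // [ '10', '12' ]
const negLookahead = schedule.match(/[0-9][0-9](?!:)/g)      // [ '30', '00' ]
const lookbehind = schedule.match(/(?<=[0-9]+:)[0-9]+/g)     // [ '30', '00' ]
const negLookbehind = schedule.match(/(?<![0-9:])[0-9]+/g)   // [ '10', '12' ]

console.log({ greedy, nonQuotes, reluctant })
console.log({ lookahead, negLookahead, lookbehind, negLookbehind })
```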
Write a function that, given a string, produces an escaped string delimited by ' characters. Turn all non-ASCII Unicode into \u{. . .}. Produce escapes \b, \f, \n, \r, \t, \v, \', \\.
Write a function that fits a string into a given number of Unicode characters. If it is too long, trim it and append an ellipsis … (\u{2026}). Be sure to correctly handle characters that are encoded with two UTF-16 code units.
The substring and slice methods are very tolerant of bad arguments. Can you get them to yield an error with any arguments? Try strings, objects, arrays, no arguments.
Write a function that accepts a string and returns an array of all substrings. Be careful about characters that are encoded with two UTF-16 code units.
In a more perfect world, all string methods would take offsets that count Unicode characters, not UTF-16 code units. Which String methods would be affected? Provide replacement functions for them, such as indexOf(str, sub) and slice(str, start, end).
Implement a printf tagged template function that formats integers, floating-point numbers, and strings with the classic printf formatting instructions, placed after embedded expressions:
const formatted = printf`${item}%-40s | ${quantity}%6d | ${price}%10.2f`
Write a tagged template function spy that displays both the raw and “cooked” string fragments and the embedded expression values. In the raw string fragments, remove the backslashes that were needed for escaping backticks, dollar signs, and backslashes.
List as many different ways as you can to produce a regular expression that matches only the empty string.
Is the m/multiline flag actually useful? Couldn’t you just match \n? Produce a regular expression that can find all lines containing just digits without the multiline flag. What about the last line?
Produce regular expressions for email addresses and URLs.
Produce regular expressions for US and international telephone numbers.
Use regular expression replacement to clean up phone numbers and credit card numbers.
Produce a regular expression for quoted text, where the delimiters could be matching single or double quotes, or curly quotes “”.
Produce a regular expression for image URLs in an HTML document.
Using a regular expression, extract all decimal integers (including negative ones) from a string into an array.
Suppose you have a regular expression and you want to use it for a complete match, not just a match of a substring. You just want to surround it with ^ and $. But that’s not so easy. The regular expression needs to be properly escaped before adding those anchors. Write a function that accepts a regular expression and yields a regular expression with the anchors added.
Use the replace method of the String class with a function argument to replace all °F measurements in a string with their °C equivalents.
Enhance the greek function of Section 6.5, “Raw Template Literals” (page 122), so that it handles escaped backslashes and $ symbols. Also check whether a symbol starting with a backslash has a substitution. If not, include it verbatim.
Generalize the greek function of the preceding exercise to a general purpose substitution function that can be called as subst(dictionary)`templateString`.