Chapter 6. Strings and Regular Expressions


In this chapter, you will learn about the methods that the standard library provides for string processing. We will then turn to regular expressions, which let you find strings that match patterns. After an introduction to the syntax of regular expressions and the JavaScript-specific idiosyncrasies, you will see how to use the API for finding and replacing matches.

6.1 Converting between Strings and Code Point Sequences

A string is a sequence of Unicode code points. Each code point is an integer between zero and 0x10FFFF. The fromCodePoint function of the String class assembles a string from code point arguments:

let str = String.fromCodePoint(0x48, 0x69, 0x20, 0x1F310, 0x21) // 'Hi 🌐!'

If the code points are in an array, use the spread operator:

let codePoints = [0x48, 0x69, 0x20, 0x1F310, 0x21]
str = String.fromCodePoint(...codePoints)

Conversely, you can turn a string into an array of code points:

let characters = [...str] // [ 'H', 'i', ' ', '🌐', '!' ]

The result is an array of strings, each containing a single code point. You can obtain the code points as integers:

codePoints = [...str].map(c => c.codePointAt(0))

Caution

JavaScript stores strings as sequences of UTF-16 code units. The offset in a call such as 'Hi 🌐!'.codePointAt(i) refers to the UTF-16 encoding. In this example, valid offsets are 0, 1, 2, 3, and 5. If the offset falls in the middle of a pair of code units that make up a single code point, the call returns the lone second code unit of the pair, which is not a valid code point.

If you want to traverse the code points of a string without putting them in an array, use this loop:

for (let i = 0; i < str.length; i++) {
  let cp = str.codePointAt(i)
  if (cp > 0xFFFF) i++
  . . . // Process the code point cp
}
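As a quick check of the offsets discussed in this section, here is a minimal sketch that compares UTF-16 offsets with code point iteration. The variable names are illustrative only. A for...of loop over a string visits code points, so it can stand in for the index-based loop when you do not need the offsets.

```javascript
const greeting = 'Hi \u{1F310}!' // 'Hi 🌐!'

// length counts UTF-16 code units, spreading counts code points
const unitCount = greeting.length        // 6
const pointCount = [...greeting].length  // 5

// Offset 3 is the start of the surrogate pair, offset 4 is its second half
const whole = greeting.codePointAt(3)    // 0x1F310
const half = greeting.codePointAt(4)     // 0xDF10, a lone low surrogate

// for...of iterates by code point, not by code unit
const seen = []
for (const ch of greeting) seen.push(ch.codePointAt(0))
```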

6.2 Substrings

The indexOf method yields the index of the first occurrence of a substring:

let index = 'Hello yellow'.indexOf('el') // 1

The lastIndexOf method yields the index of the last occurrence:

index = 'Hello yellow'.lastIndexOf('el') // 7

As with all offsets into JavaScript strings, these values are offsets into the UTF-16 encoding:

index = 'I💛yellow'.indexOf('el') // 4

The offset is 4 because the “yellow heart” emoji 💛 is encoded with two UTF-16 code units.

If the substring is not present, these methods return -1.

The methods startsWith, endsWith, and includes return a Boolean result:

let isHttps = url.startsWith('https://')
let isGif = url.endsWith('.gif')
let isQuery = url.includes('?')

The substring method extracts a substring, given two offsets in UTF-16 code units. The substring contains all characters from the first offset up to, but not including, the second offset.

let substring = 'I💛yellow'.substring(3, 7) // 'yell'

If you omit the second offset, all characters until the end of the string are included:

substring = 'I💛yellow'.substring(3) // 'yellow'

The slice method is similar to substring, except that negative offsets are counted from the end of the string. -1 is the offset of the last code unit, -2 the offset of its predecessor, and so on. This is achieved by adding the string length to a negative offset.

'I💛yellow'.slice(-6, -2) // 'yell', same as slice(3, 7)

The length of 'I💛yellow' is 9; recall that the 💛 takes two code units. The offsets -6 and -2 are adjusted to 3 and 7.

With both the substring and slice methods, offsets larger than the string length are truncated to the length. Negative and NaN offsets are truncated to 0. (In the slice method, this happens after adding the string length to negative offsets.)

Caution

If the first argument to substring is larger than the second, the arguments are switched!

substring = 'I💛yellow'.substring(7, 3) // 'yell', same as substring(3, 7)

In contrast, str.slice(start, end) yields the empty string if start ≥ end.

I prefer the slice method over substring. It is more versatile, has a saner behavior, and the method name is shorter.
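The differences between the two methods are easiest to see side by side. The following sketch uses the same sample string as above; the variable names are only for illustration.

```javascript
const s = 'I\u{1F49B}yellow' // 'I💛yellow', 9 code units

const a = s.substring(7, 3) // 'yell': the arguments are switched
const b = s.slice(7, 3)     // '': start is not less than end
const c = s.slice(-6)       // 'yellow': -6 + 9 = 3
const d = s.substring(-6)   // the whole string: a negative offset becomes 0
const e = s.slice(3, 100)   // 'yellow': 100 is truncated to the length
```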

Another way of taking a string apart is the split method. That method splits a string into an array of substrings, removing the provided separator.

let parts = 'Mary had a little lamb'.split(' ')
  // ['Mary', 'had', 'a', 'little', 'lamb']

You can supply a limit for the number of parts:

parts = 'Mary had a little lamb'.split(' ', 4)
  // ['Mary', 'had', 'a', 'little']

The separator can be a regular expression—see Section 6.12, “String Methods with Regular Expressions” (page 133).

Caution

Calling str.split('') with an empty separator splits the string into strings that each hold a 16-bit code unit, which is not useful if str contains characters above \u{FFFF}. Use [...str] instead.
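Here is a small sketch of the difference, assuming a sample string with one character above \u{FFFF}. Reversing a string is a typical case where splitting by code unit goes wrong and spreading by code point works.

```javascript
const word = 'I\u{1F49B}JS' // 'I💛JS'

const units = word.split('') // 5 elements: the heart becomes two lone surrogates
const points = [...word]     // 4 elements: the heart stays intact

// Reversing by code point keeps the emoji well-formed
const reversed = [...word].reverse().join('') // 'SJ💛I'
```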

6.3 Other String Methods

In this section, you will find miscellaneous methods of the String class. Since strings are immutable in JavaScript, none of the string methods change the contents of a given string. They all return a new string with the result.

The repeat method yields a string repeated a given number of times:

const repeated = 'ho '.repeat(3) // 'ho ho ho '

The trim, trimStart, and trimEnd methods yield strings that remove leading and trailing white space, or just leading or trailing white space. White space characters include the space character, the nonbreaking space \u{00A0}, newline, tab, and 21 other characters with the Unicode character property White_Space.

The padStart and padEnd methods do the opposite—they add space characters until the string has a minimum length:

let padded = 'Hello'.padStart(10) // '     Hello', five spaces are added

You can also supply your own padding string:

padded = 'Hello'.padStart(10, '=-') // =-=-=Hello

Caution

The first parameter is the length of the padded string in UTF-16 code units. If your padding string contains characters that require two code units, you may get a malformed string:

padded = 'Hello'.padStart(10, '💛')
  // Padded with two hearts and an unmatched code unit

The toUpperCase and toLowerCase methods yield a string with all characters converted to upper- or lowercase.

let uppercased = 'Straße'.toUpperCase() // 'STRASSE'

As you can see, the toUpperCase method is aware of the fact that the uppercase of the German character 'ß' is the string 'SS'.

Note that toLowerCase does not recover the original string:

let lowercased = uppercased.toLowerCase() // 'strasse'

Note

String operations such as conversion to upper- and lowercase can depend on the user’s language preferences. See Chapter 8 for methods toLocaleUpperCase, toLocaleLowerCase, localeCompare, and normalize that are useful when you localize your applications.

Note

See Section 6.12, “String Methods with Regular Expressions” (page 133), for string methods match, matchAll, search, and replace that work with regular expressions.

The concat method concatenates a string with any number of arguments that are converted to strings.

const n = 7
let concatenated = 'agent'.concat(' ', n) // 'agent 7'

You can achieve the same effect with template strings or the join method of the Array class:

concatenated = `agent ${n}`
concatenated = ['agent', ' ', n].join('')

Table 6-1 shows the most useful features of the String class.

Table 6-1    Useful Functions and Methods of the String class

Name    Description

Functions

fromCodePoint(codePoints...)    Yields a string consisting of the given code points

Methods

startsWith(s), endsWith(s), includes(s)    true if a string starts or ends with s, or has s as a substring
indexOf(s, start), lastIndexOf(s, start)    The index of the first or last occurrence of s, searching from index start (which defaults to 0 for indexOf and to the end of the string for lastIndexOf), or -1 if s is not present
slice(start, end)    The substring of code units with index between start inclusive and end exclusive. Negative index values are counted from the end of the string. end defaults to the length of the string. Prefer this method over substring.
repeat(n)    This string, repeated n times
trimStart(), trimEnd(), trim()    This string with leading, trailing, or leading and trailing white space removed
padStart(minLength, padString), padEnd(minLength, padString)    This string, padded at the start or end until its length reaches minLength. The default padString is ' '.
toLowerCase(), toUpperCase()    This string with all letters converted to lower or upper case
split(separator, maxParts)    An array of parts obtained by removing all copies of the separator (which can be a regular expression). If maxParts is omitted, all parts are returned.
search(target)    The index of the first match of target (which can be a regular expression)
replace(target, replacement)    This string, with the first match of target replaced. If target is a global regular expression, all matches are replaced. See Section 6.13 about replacement patterns and functions.
match(regex)    An array of matches if regex is global, null if there is no match, and the match result otherwise. The match result is an array of all group matches, with properties index (the index of the match) and groups (an object mapping group names to matches).
matchAll(regex)    An iterable of the match results

Finally, there are global functions for encoding URL components and entire URLs—or, more generally, URIs using schemes such as mailto or tel—into their “URL encoded” form. That form uses only characters that were considered “safe” when the Internet was first created. Suppose you need to produce a query for translating a phrase from one language into another. You might construct a URL like this:

const phrase = 'à coté de'
const prefix = 'https://www.linguee.fr/anglais-francais/traduction'
const suffix = '.html'
const url = prefix + encodeURIComponent(phrase) + suffix

The phrase is encoded into '%C3%A0%20cot%C3%A9%20de', the result of encoding characters into UTF-8 and encoding each byte into a code %hh with two hexadecimal digits. The only characters that are left alone are the “safe” characters

A-Z a-z 0-9 ! ' ( ) * . _ ~ -

In the less common case, if you need to encode an entire URI, use the encodeURI function. It also leaves the characters

# $ & + , / : ; = ? @

unchanged since they can have special meanings in URIs.
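To see the difference between the two functions, consider the following sketch. The sample strings are made up for illustration; the accented letters are written as escapes so that the snippet stays in ASCII.

```javascript
const phrase = '\u00E0 cot\u00E9 de' // 'à coté de'
const encodedComponent = encodeURIComponent(phrase) // '%C3%A0%20cot%C3%A9%20de'

// Characters with a special meaning in URIs are encoded by
// encodeURIComponent but left alone by encodeURI
const query = 'a&b=c/d?'
const asComponent = encodeURIComponent(query) // 'a%26b%3Dc%2Fd%3F'
const asUri = encodeURI(query)                // 'a&b=c/d?'

// decodeURIComponent reverses the encoding
const roundTrip = decodeURIComponent(encodedComponent)
```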

6.4 Tagged Template Literals


In Chapter 1, you saw template literals—strings with embedded expressions:

const person = { name: 'Harry', age: 42 }
message = `Next year, ${person.name} will be ${person.age + 1}.`

Template literals insert the values of the embedded expressions into the template string. In this example, the embedded expressions person.name and person.age + 1 are evaluated, converted to strings, and spliced with the surrounding string fragments. The result is the string

'Next year, Harry will be 43.'

You can customize the behavior of template literals with a tag function. As an example, we will be writing a tag function strong that produces an HTML string, highlighting the embedded values. The call

strong`Next year, ${person.name} will be ${person.age + 1}.`

will yield an HTML string

'Next year, <strong>Harry</strong> will be <strong>43</strong>.'

The tag function is called with the fragments of the literal string around the embedded expressions, followed by the expression values. In our example, the fragments are 'Next year, ', ' will be ', and '.', and the values are 'Harry' and 43. The tag function combines these pieces. The returned value is turned into a string if it is not already one.

Here is an implementation of the strong tag function:

const strong = (fragments, ...values) => {
  let result = fragments[0]
  for (let i = 0; i < values.length; i++)
    result += `<strong>${values[i]}</strong>${fragments[i + 1]}`
  return result
}

When processing the template string

strong`Next year, ${person.name} will be ${person.age + 1}.`

the strong function is called like this:

strong(['Next year, ', ' will be ', '.'], 'Harry', 43)

Note that all string fragments are put into an array, whereas the expression values are passed as separate arguments. The strong function uses a rest parameter to gather them all in a second array.

Also note that there is always one more fragment than there are expression values.

This mechanism is infinitely flexible. You can use it for HTML templating, number formatting, internationalization, and so on.
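As one illustration of HTML templating, here is a sketch of a tag function that escapes embedded values before inserting them. The names escapeHtml and safeHtml are hypothetical, and the set of escaped characters is a minimal assumption rather than a complete sanitizer.

```javascript
// Hypothetical helper: escapes characters with a special meaning in HTML
const escapeHtml = value => String(value)
  .replace(/&/g, '&amp;')
  .replace(/</g, '&lt;')
  .replace(/>/g, '&gt;')
  .replace(/"/g, '&quot;')

// Tag function: the literal fragments are trusted, the embedded values are escaped
const safeHtml = (fragments, ...values) => {
  let result = fragments[0]
  for (let i = 0; i < values.length; i++)
    result += escapeHtml(values[i]) + fragments[i + 1]
  return result
}

const userInput = '<b>"Harry" & Sally</b>'
const markup = safeHtml`<p>${userInput}</p>`
  // '<p>&lt;b&gt;&quot;Harry&quot; &amp; Sally&lt;/b&gt;</p>'
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.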

6.5 Raw Template Literals


If you prefix a template literal with String.raw, then backslashes are not escape characters:

path = String.raw`c:\users\nate`

Here, \u does not denote a Unicode escape, and \n is not turned into a newline character.

Caution

Even in raw mode, you cannot enclose arbitrary strings in backticks. You still need to escape all ` characters, $ before {, and \ before ` and {.

That doesn’t quite explain how String.raw works, though. Tag functions have access to a “raw” form of the template string fragments, in which backslash combinations such as \u and \n lose their special meanings.

Suppose we want to handle strings with Greek letters. We follow the convention of the LaTeX markup language for mathematical formulas. In that language, symbols start with backslashes. Therefore, raw strings are attractive: users want to write \nu and \upsilon, not \\nu and \\upsilon. Here is an example of a string that we want to be able to process:

greek`\nu=${factor}\upsilon`

As with any tagged template string, we need to define a function:

const greek = (fragments, ...values) => {
  const substitutions = { alpha: 'α', . . ., nu: 'ν',  . . . }
  const substitute = str => str.replace(/\\[a-z]+/g,
    match => substitutions[match.slice(1)])

  let result = substitute(fragments.raw[0])
  for (let i = 0; i < values.length; i++)
    result += values[i] + substitute(fragments.raw[i + 1])
  return result
}

You access the raw string fragments with the raw property of the first parameter of the tag function. The value of fragments.raw is an array of string fragments with unprocessed backslashes.

In the preceding tagged template literal, fragments.raw is an array of two strings. The first string is \nu=, which has four characters including the backslash. The second string is \upsilon, which has eight characters including the backslash.

Note the following:

  • The \n in \nu is not turned into a newline.

  • The \u in \upsilon is not interpreted as a Unicode escape. In fact, it would not be syntactically correct. For that reason, fragments[1] cannot be parsed and is set to undefined.

  • ${factor} is an embedded expression. Its value is passed to the tag function.

The greek function uses regular expression replacement, which is explained in detail in Section 6.13, “More about Regex Replace” (page 135). Identifiers starting with a backslash are replaced with their substitutions, such as ν for \nu.
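To inspect what a tag function actually receives, you can use a tag that simply returns its arguments. The inspect function below is a hypothetical helper for experimentation, shown here with the same template as in the greek example.

```javascript
// Hypothetical tag function that returns what it receives, for inspection
const inspect = (fragments, ...values) =>
  ({ cooked: [...fragments], raw: [...fragments.raw], values })

const factor = 2
const parts = inspect`\nu=${factor}\upsilon`

// parts.cooked[0] is a newline followed by 'u='
// parts.cooked[1] is undefined because \upsilon is not a valid escape
// parts.raw holds the two fragments with their backslashes intact
```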

6.6 Regular Expressions


Regular expressions specify string patterns. Use them whenever you need to locate strings that match a particular pattern. For example, suppose you want to find hyperlinks in an HTML file. You need to look for strings of the form <a href=". . .">. But wait—there may be extra spaces, or the URL may be enclosed in single quotes. Regular expressions give you a precise syntax for specifying what sequences of characters are legal matches.

In a regular expression, a character denotes itself unless it is one of the reserved characters

. * + ? { | ( ) [ \ ^ $

For example, the regular expression href only matches the string href.

The symbol . matches any single character. For example, .r.f matches href and prof.

The * symbol indicates that the preceding construct may be repeated 0 or more times; with the + symbol, the repetition is 1 or more times. A suffix of ? indicates that a construct is optional (0 or 1 times). For example, be+s? matches be, bee, and bees. You can specify other multiplicities with { }—see Table 6-2.

A | denotes an alternative: .(oo+|ee+)f matches beef or woof. Note the parentheses—without them, .oo+|ee+f would be the alternative between .oo+ and ee+f. Parentheses are also used for grouping—see Section 6.11, “Groups” (page 131).

A character class is a set of character alternatives enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Inside a character class, the - denotes a range (all characters whose Unicode values fall within the two bounds). However, a - that is the first or last character in a character class denotes itself. A ^ as the first character in a character class denotes the complement—all characters except those specified. For example, [^0-9] denotes any character that is not a decimal digit.

There are six predefined character classes: \d (digits), \s (white space), \w (word characters), and their complements \D (non-digits), \S (nonwhite space), and \W (nonword characters).

The characters ^ and $ match the beginning and end of input. For example, ^[0-9]+$ matches a string entirely consisting of digits.

Be careful about the position of the ^ character. If it is the first character inside brackets, it denotes the complement: [^0-9]+$ matches a string of non-digits at the end of input.

Note

I have a hard time remembering that ^ matches the start and $ the end. I keep thinking that $ should denote start, and on the US keyboard, $ is to the left of ^. But it’s exactly the other way around, probably since the archaic text editor QED used $ to denote the last line.

Table 6-2 summarizes the JavaScript regular expression syntax.

If you need to have a literal . * + ? { | ( ) [ \ ^ $, precede it by a backslash. Inside a character class, you only need to escape [ and \, provided you are careful about the positions of ] - ^. For example, []^-] is a class containing all three of them.

Table 6-2    Regular Expression Syntax

Expression    Description    Example

Characters

A character other than . * + ? { | ( ) [ \ ^ $    Matches only the given character    J
.    Matches any character except \n, or any character if the dotAll flag is set
\u{hhhh}, \u{hhhhh}    The Unicode code point with the given hex value (requires unicode flag)    \u{1F310}
\uhhhh, \xhh    The UTF-16 code unit with the given hex value    \xA0
\f, \n, \r, \t, \v    Form feed (\x0C), newline (\x0A), carriage return (\x0D), tab (\x09), vertical tab (\x0B)
\cL, where L is in [A-Za-z]    The control character corresponding to the character L    \cH is Ctrl-H or backspace (\x08)
\c, where c is not in [0-9BDPSWbcdfknprstv]    The character c    \\

Character Classes

[C1C2. . .], where Ci are characters, ranges c-d, or character classes    Any of the characters represented by C1, C2, . . .    [0-9+-]
[^. . .]    Complement of a character class    [^\d\s]
\p{BooleanProperty}, \p{Property=Value}, \P{. . .}    A Unicode property (see Section 6.9); its complement (requires the unicode flag)    \p{L} are Unicode letters
\d, \D    A digit [0-9]; the complement    \d+ is a sequence of digits
\w, \W    A word character [a-zA-Z0-9_]; the complement
\s, \S    A space from [\t\n\v\f\r \xA0] or 18 additional Unicode space characters, same as \p{White_Space}; the complement    \s*,\s* is a comma surrounded by optional white space

Sequences and Alternatives

XY    Any string from X, followed by any string from Y    [1-9][0-9]* is a positive number without leading zero
X|Y    Any string from X or Y    http|ftp

Grouping

(X)    Captures the match of X into a group (see Section 6.11)    '([^']*)' captures the quoted text
\n (n is a group number)    Matches the nth group    (['"]).*\1 matches 'Fred' or "Fred" but not "Fred'
(?<name>X)    Captures the match of X with the given name    '(?<qty>[0-9]+)' captures the match with name qty
\k<name>    The group with the given name    \k<qty> matches the group with name qty
(?:X)    Use parentheses without capturing X    In (?:http|ftp)://(.*), the match after :// is \1
Other (?. . .)    See Section 6.14

Quantifiers

X?    Optional X    \+? is an optional + sign
X*, X+    0 or more X, 1 or more X    [1-9][0-9]+ is an integer ≥ 10
X{n}, X{n,}, X{m,n}    n times X, at least n times X, between m and n times X    [0-9]{4,6} are four to six digits
X*? or X+?    Reluctant quantifier, attempting the shortest match before trying longer matches    .*(<.+?>).* captures the shortest sequence enclosed in angle brackets

Boundary Matches

^ $    Beginning, end of input (or beginning, end of line if the multiline flag is set)    ^JavaScript$ matches the input or line JavaScript
\b, \B    Word boundary, nonword boundary    \bJava\B matches JavaScript but not Java code

6.7 Regular Expression Literals


A regular expression literal is delimited by slashes:

const timeRegex = /^([1-9]|1[0-2]):[0-9]{2} [ap]m$/

Regular expression literals are instances of the RegExp class.

The typeof operator, when applied to a regular expression, yields 'object'.

Inside the regular expression literal, use backslashes to escape characters that have special meanings in regular expressions, such as the . and + characters:

const fractionalNumberRegex = /[0-9]+\.[0-9]*/

Here, the escaped . means a literal period.

In a regular expression literal, you also need to escape a forward slash so that it is not interpreted as the end of the literal.

To convert a string holding a regular expression into a RegExp object, use the RegExp function, with or without new:

const fractionalNumberRegex = new RegExp('[0-9]+\\.[0-9]*')

Note that the backslash in the string must be escaped.
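The following sketch shows that the literal and the string form produce the same pattern. It also shows a common use of the constructor: building a regular expression from text that should be matched literally. The escapeRegex helper is hypothetical; it escapes the reserved characters listed in Section 6.6, plus } and ] for symmetry.

```javascript
// Hypothetical helper: puts a backslash before each reserved character
const escapeRegex = text => text.replace(/[.*+?{}|()[\]\\^$]/g, '\\$&')

const literal = /[0-9]+\.[0-9]*/
const fromString = new RegExp('[0-9]+\\.[0-9]*')
const sameSource = literal.source === fromString.source // true

// Match a price verbatim, without $ and . acting as regex syntax
const price = '$29.95'
const priceRegex = new RegExp(escapeRegex(price))
const found = priceRegex.test('Only $29.95 today') // true
const notFound = priceRegex.test('Only $29x95')    // false
```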

6.8 Flags


A flag modifies a regular expression’s behavior. One example is the i or ignoreCase flag. The regular expression

/[A-Z]+\.com/i

matches Horstmann.COM.

You can also set the flag in the constructor:

const regex = new RegExp(/[A-Z]+\.com/, 'i')

To find the flag values of a given RegExp object, you can use the flags property which yields a string of all flags. There is also a Boolean property for each flag:

regex.flags // 'i'
regex.ignoreCase // true

JavaScript supports six flags, shown in Table 6-3.

Table 6-3    Regular Expression Flags

Single Letter    Property Name    Description

i    ignoreCase    Case-insensitive match
m    multiline    ^, $ match start, end of line
s    dotAll    . matches newline
u    unicode    Match Unicode characters, not code units (see Section 6.9)
g    global    Find all matches (see Section 6.10)
y    sticky    Match must start at regex.lastIndex (see Section 6.10)

The m or multiline flag changes the behavior of the start and end anchors ^ and $. By default, they match the beginning and end of the entire string. In multiline mode, they match the beginning and end of a line. For example,

/^[0-9]+/m

matches digits at the beginning of a line.

With the s or dotAll flag, the . pattern matches newlines. Without it, . matches any non-newline character.

The other three flags are explained in later sections.

You can use more than one flag. The following regular expression matches upper- or lowercase letters at the start of each line:

/^[A-Z]/im
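Here is a short sketch of how the m and s flags change matching. The sample text is made up for illustration.

```javascript
const text = 'alpha\n42 beta\n7 gamma'

// Without m, ^ only matches at the start of the entire string
const withoutM = text.match(/^[0-9]+/g) // null
// With m, ^ also matches at the start of each line
const withM = text.match(/^[0-9]+/gm)   // ['42', '7']

// Without s, the . pattern does not match the newline after 'alpha'
const dotDefault = /alpha.42/.test(text) // false
const dotAll = /alpha.42/s.test(text)    // true

// Flags can be combined
const combined = /^[a-z]+$/im
const upperLine = combined.test('ALPHA\n1') // true: the first line matches
```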

6.9 Regular Expressions and Unicode


For historical reasons, regular expressions work with UTF-16 code units, not Unicode characters. By default, the . pattern matches a single UTF-16 code unit. As a result, the string

'Hello 🌐'

does not match the regular expression

/Hello .$/

The 🌐 character is encoded with two code units. The remedy is to use the u or unicode flag:

/Hello .$/u

With the u flag, the . pattern matches a single Unicode character, no matter how it is encoded in UTF-16.

If you need to keep your source files in ASCII, you can embed Unicode code points into regular expressions, using the \u{ } syntax:

/[A-Za-z]+ \u{1F310}/u

Caution

Without the u flag, /\u{1F310}/ matches the string 'u{1F310}'.

When working with international text, you should avoid patterns such as [A-Za-z] for denoting letters. These patterns won’t match letters in other languages. Instead, use \p{Property}, where Property is the name of a Boolean Unicode property. For example, \p{L} denotes a Unicode letter. The regular expression

/Hello, \p{L}+!/u

matches

'Hello, värld!'

and

'Hello, 世界!'

Table 6-4 shows the names of other common Boolean properties.

For Unicode properties whose values are not Boolean, use the syntax \p{Property=Value}. For example, the regular expression

/\p{Script=Han}+/u

matches any sequence of Chinese characters.

Using an uppercase P yields the complement: \P{L} matches any character that is not a letter.
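The effect of the u flag and of Unicode properties can be checked with a few calls to test and replace. This is a minimal sketch; non-ASCII characters are written as escapes so that the snippet stays in ASCII.

```javascript
const greeting = 'Hello \u{1F310}' // 'Hello 🌐'

const byUnit = /Hello .$/.test(greeting)   // false: . matches only one code unit
const byPoint = /Hello .$/u.test(greeting) // true: . matches the whole code point

const word = 'v\u00E4rld' // 'värld'
const asciiOnly = /^[A-Za-z]+$/.test(word) // false: ä is not in A-Z or a-z
const letters = /^\p{L}+$/u.test(word)     // true

// \P{L} is the complement: remove everything that is not a letter
const lettersOnly = 'a1 b2'.replace(/\P{L}/gu, '') // 'ab'
```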

Table 6-4    Common Boolean Unicode Properties

Name    Description

L    Letter
Lu    Uppercase letter
Ll    Lowercase letter
Nd    Decimal number
P    Punctuation
S    Symbol
White_Space    White space, same as \s
Emoji    Emoji characters, modifiers, or components

6.10 The Methods of the RegExp Class


The test method yields true if a string contains a match for the given regular expression:

/[0-9]+/.test('agent 007') // true

To test whether the entire string matches, your regular expression must use start and end anchors:

/^[0-9]+$/.test('agent 007') // false

The exec method yields an array holding the first matched subexpression, or null if there was no match.

For example,

/[0-9]+/.exec('agents 007 and 008')

returns an array containing the string '007'. (As you will see in the following section, the array can also contain group matches.)

In addition, the array that exec returns has two properties:

  • index is the index of the subexpression

  • input is the argument that was passed to exec

In other words, the array returned by the preceding call to exec is actually

['007', index: 7, input: 'agents 007 and 008']

To match multiple subexpressions, use the g or global flag:

let digits = /[0-9]+/g

Now each call to exec returns a new match:

result = digits.exec('agents 007 and 008') // ['007', index: 7, . . .]
result = digits.exec('agents 007 and 008') // ['008', index: 15, . . .]
result = digits.exec('agents 007 and 008') // null

To make this work, the RegExp object has a property lastIndex that is set to the first index after the match in each successful call to exec. The next call to exec starts the match at lastIndex. The lastIndex property is set to zero when a regular expression is constructed or a match failed.

You can also set the lastIndex property to skip a part of the string.

With the y or sticky flag, the match must start exactly at lastIndex:

digits = /[0-9]+/y
digits.lastIndex = 5
result = digits.exec('agents 007 and 008') // null
digits.lastIndex = 8
result = digits.exec('agents 007 and 008') // ['07', index: 8, . . .]

Note

If you simply want an array of all matched substrings, use the match method of the String class instead of repeated calls to exec—see Section 6.12, “String Methods with Regular Expressions” (page 133).

let results = 'agents 007 and 008'.match(/[0-9]+/g) // ['007', '008']
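When you do need the index of each match, the repeated calls to exec are usually written as a loop. The following sketch records the lastIndex value after each successful call; the variable names are illustrative.

```javascript
const digits = /[0-9]+/g
const input = 'agents 007 and 008'

const found = []
let match
// exec returns null when there are no more matches
while ((match = digits.exec(input)) !== null) {
  found.push({ text: match[0], index: match.index, next: digits.lastIndex })
}

// After the failed final call, lastIndex has been reset to 0
const finalLastIndex = digits.lastIndex
```

Be sure that the regular expression has the g or y flag. Without it, lastIndex is never advanced and the loop does not terminate when there is a match.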

6.11 Groups


Groups are used for extracting components of a match. For example, here is a regular expression for parsing times with groups for each component:

let time = /([1-9]|1[0-2]):([0-5][0-9])([ap]m)/

The group matches are placed in the array returned by exec:

let result = time.exec('Lunch at 12:15pm')
  // ['12:15pm', '12', '15', 'pm', index: 9, . . .]

As in the preceding section, result[0] is the entire matched string. For i > 0, result[i] is the match for the ith group.

Groups are numbered by their opening parentheses. This matters if you have nested parentheses. Consider this example. We want to analyze line items of invoices that have the form

Blackwell Toaster    USD29.95

Here is a regular expression with groups for each component:

/(\p{L}+(\s+\p{L}+)*)\s+([A-Z]{3})([0-9.]*)/u

In this situation, group 1 is 'Blackwell Toaster', the substring matched by the expression (\p{L}+(\s+\p{L}+)*), from the first opening parenthesis to its matching closing parenthesis.

Group 2 is ' Toaster', the substring matched by (\s+\p{L}+).

Groups 3 and 4 are 'USD' and '29.95'.

We aren’t interested in group 2; it only arose from the parentheses that were required for the repetition. For greater clarity, you can use a noncapturing group, by adding ?: after the opening parenthesis:

/(\p{L}+(?:\s+\p{L}+)*)\s+([A-Z]{3})([0-9.]*)/u

Now 'USD' and '29.95' are captured as groups 2 and 3.

Note

When you have a group inside a repetition, such as (\s+\p{L}+)* in the example above, the corresponding group only holds the last match, not all matches.

If the repetition happened zero times, then the group match is set to undefined.

You can match against the contents of a captured group. For example, consider the regular expression:

/(['"]).*\1/

The group (['"]) captures either a single or double quote. The pattern \1 matches the captured string, so that "Fred" and 'Fred' match the regular expression but "Fred' does not.

Caution

Even though they are supposed to be outlawed in strict mode, several JavaScript engines still support octal character escapes in regular expressions. For example, \11 denotes \t, the character at code point 9.

However, if the regular expression has 11 or more capturing groups, then \11 denotes a match of the 11th group.

Numbered groups are rather fragile. It is much better to capture by name:

let lineItem = /(?<item>\p{L}+(\s+\p{L}+)*)\s+(?<currency>[A-Z]{3})(?<price>[0-9.]*)/u

When a regular expression has one or more named groups, the array returned by exec has a property groups whose value is an object holding group names and matches:

let result = lineItem.exec('Blackwell Toaster    USD29.95')
let groupMatches = result.groups
  // { item: 'Blackwell Toaster', currency: 'USD', price: '29.95' }

The expression \k<name> matches against a group that was captured by name:

/(?<quote>['"]).*\k<quote>/

Here, the group with the name “quote” matches a single or double quote at the beginning of the string. The string must end with the same character. For example, "Fred" and 'Fred' are matches but "Fred' is not.
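Named groups combine well with destructuring. The following sketch uses the line item pattern from this section, with a noncapturing group for the repetition, together with the named backreference.

```javascript
const lineItem =
  /(?<item>\p{L}+(?:\s+\p{L}+)*)\s+(?<currency>[A-Z]{3})(?<price>[0-9.]*)/u
const result = lineItem.exec('Blackwell Toaster    USD29.95')

// Pick the group matches out of result.groups by name
const { item, currency, price } = result.groups
  // 'Blackwell Toaster', 'USD', '29.95'

// \k<quote> must match the same quote character that the group captured
const quoted = /(?<quote>['"]).*\k<quote>/
const matching = quoted.test(`"Fred"`)    // true
const mismatched = quoted.test(`"Fred'`)  // false
```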

The features of the RegExp class are summarized in Table 6-5.

Table 6-5    Features of the RegExp Class

Name    Description

Constructors

new RegExp(regex, flags)    Constructs a regular expression from the given regex (a string, regular expression literal, or RegExp object) and the given flags

Properties

flags    A string of all flags
ignoreCase, multiline, dotAll, unicode, global, sticky    Boolean properties for all flag types

Methods

test(str)    true if str contains a match for this regular expression
exec(str)    Match results for the current match of this regular expression inside str. See Section 6.10 for details. The match and matchAll methods of the String class are simpler to use than this method.

6.12 String Methods with Regular Expressions


As you saw in Section 6.10, “The Methods of the RegExp Class” (page 130), the workhorse method for getting match information is the exec method of the RegExp class. But its API is far from elegant. The String class has several methods that work with regular expressions and produce commonly used results more easily.

For a regular expression without the global flag set, the call str.match(regex) returns the same match results as regex.exec(str):

'agents 007 and 008'.match(/[0-9]+/) // ['007', index: 7, . . .]

With the global flag set, match simply returns an array of matches, which is often just what you want:

'agents 007 and 008'.match(/[0-9]+/g) // ['007', '008']

If there is no match, the String.match method returns null.

Note

RegExp.exec and String.match are the only methods in the ECMAScript standard library that yield null to indicate the absence of a result.

If you have a global search and want all match results without calling exec repeatedly, you will like the matchAll method of the String class, which became part of the standard in ES2020. It returns an iterable of the match results. Let’s say you want to look for all matches of the regular expression

let time = /([1-9]|1[0-2]):([0-5][0-9])([ap]m)/g

The loop

for (const [, hours, minutes, period] of input.matchAll(time)) {
  . . .
}

iterates over all match results, using destructuring to set hours, minutes, and period to the group matches. The initial comma ignores the entire matched expression.

The matchAll method yields the matches lazily. It is efficient if there are many matches but only a few are examined.

The search method returns the index of the first match or -1 if no match is found:

let index = 'agents 007 and 008'.search(/[0-9]+/) // Yields index 7

The replace method replaces the first match of a regular expression with a replacement string. To replace all matches, set the global flag:

let replacement = 'agents 007 and 008'.replace(/[0-9]/g, '?')
  // 'agents ??? and ???'

Note

The split method can have a regular expression as argument. For example,

str.split(/\s*,\s*/)

splits str along commas that are optionally surrounded by white space.
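As a quick illustration with a made-up input line:

```javascript
// Hypothetical input with irregular spacing around the commas
const line = 'red ,green,  blue , yellow'

// The separator is a comma together with any white space around it
const colors = line.split(/\s*,\s*/)
console.log(colors)   // [ 'red', 'green', 'blue', 'yellow' ]
```

White space at the very start or end of the string is not adjacent to a comma and stays in the first or last element. Call trim first if that matters.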

6.13 More about Regex Replace

In this section, we take a closer look at the replace method of the String class.

The replacement string parameter can contain patterns starting with a $ that are processed as shown in Table 6-6.

Table 6-6    Replacement String Patterns

Pattern     Description
$`, $'      The portion before or after the matched string
$&          Matched string
$n          The nth group
$<name>     The group with the given name
$$          Dollar sign

For example, the following replacement repeats each vowel three times:

'hello'.replace(/[aeiou]/g, '$&$&$&') // 'heeellooo'
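The less common patterns are easiest to understand by trying them out. This sketch uses illustrative inputs and brackets to make the inserted text visible:

```javascript
const word = 'hello'

const matched = word.replace(/ll/, '[$&]')   // the matched string 'll'
const before = word.replace(/ll/, '[$`]')    // the portion before the match
const after = word.replace(/ll/, "[$']")     // the portion after the match
console.log(matched, before, after)          // he[ll]o he[he]o he[o]o

// $$ yields a literal dollar sign, followed here by group 1
const price = '10 USD'.replace(/([0-9]+) USD/, '$$$1')
console.log(price)                           // $10
```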

The most useful pattern is the group pattern. Here, we use groups to match the first and last name of a person in each line and flip them:

let names = 'Harry Smith\nSally Lin'
let flipped = names.replace(
  /^([A-Z][a-z]+) ([A-Z][a-z]+)/gm, "$2, $1")
  // 'Smith, Harry\nLin, Sally'

If the number after the $ sign is larger than the number of groups in the regular expression, the pattern is inserted verbatim:

let replacement = 'Blackwell Toaster $29.95'.replace('$29', '$19')
  // 'Blackwell Toaster $19.95'—there is no group 19

You can also use named groups:

flipped = names.replace(/^(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)$/gm,
  "$<last>, $<first>")

For more complex replacements, you can provide a function instead of a replacement string. The function receives the following arguments:

  • The string that was matched by the regular expression

  • The matches of all groups

  • The offset of the match

  • The entire string

In this example, we just process the group matches:

flipped = names.replace(/^([A-Z][a-z]+) ([A-Z][a-z]+)/gm,
  (match, first, last)  => `${last}, ${first[0]}.`)
  // 'Smith, H.\nLin, S.'
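To see all of the arguments in action, the following sketch records what the replacement function receives on each call. The `seen` array and the parameter names are only for illustration:

```javascript
// Sample data mirroring the names string used in the text
const names = 'Harry Smith\nSally Lin'
const seen = []   // what the callback received, one entry per match

const flipped = names.replace(/^([A-Z][a-z]+) ([A-Z][a-z]+)/gm,
  (match, first, last, offset, whole) => {
    // offset is the index of the match; whole is the entire string
    seen.push({ match, offset, sameString: whole === names })
    return `${last}, ${first[0]}.`
  })

console.log(seen[0])   // { match: 'Harry Smith', offset: 0, sameString: true }
console.log(seen[1])   // { match: 'Sally Lin', offset: 12, sameString: true }
```

The function is called once per match, and its return value becomes the replacement for that match.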

Note

The replace method also works with strings, replacing the first match of the string itself:

let replacement = 'Blackwell Toaster $29.95'.replace('$', 'USD')
  // Replaces $ with USD

Note that the $ is not interpreted as an end anchor.

Caution

If you call the search method with a string, it is converted to a regular expression:

let index = 'Blackwell Toaster $29.95'.search('$')
  // Yields 24, the end of the string, not the index of $

Use indexOf to search for a plain string.
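A short comparison of the alternatives, with illustrative variable names. If you do want to use search, escaping the dollar sign in a regular expression literal is one option:

```javascript
const label = 'Blackwell Toaster $29.95'

const asRegex = label.search('$')    // the string becomes /$/, the end anchor
const literal = label.indexOf('$')   // plain substring search
const escaped = label.search(/\$/)   // the backslash makes $ a literal character

console.log(asRegex, literal, escaped)   // 24 18 18
```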

6.14 Exotic Features

In the final section of this chapter, you will see several complex and uncommon regular expression features.

The + and * repetition operators are “greedy”—they match the longest possible strings. That’s generally desirable. You want /[0-9]+/ to match the longest possible string of digits, and not a single digit.

However, consider this example:

'"Hi" and "Bye"'.match(/".*"/g)

The result is

['"Hi" and "Bye"']

because .* greedily matches everything until the final ". That does not help us if we want to match quoted substrings.

One remedy is to require non-quotes in the repetition:

'"Hi" and "Bye"'.match(/"[^"]*"/g)

Alternatively, you can specify that the match should be reluctant, by using the *? operator:

'"Hi" and "Bye"'.match(/".*?"/g)

Either way, now each quoted string is matched separately, and the result is

['"Hi"', '"Bye"']

There is also a reluctant version +? that requires at least one repetition.
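The sketch below compares the greedy and reluctant forms. The 'aaa' input is a made-up minimal case that shows the difference between *? and +?:

```javascript
const quoted = '"Hi" and "Bye"'

const greedy = quoted.match(/".*"/g)        // one match spanning both quotes
const reluctant = quoted.match(/".*?"/g)    // stops at the nearest closing quote
const nonQuotes = quoted.match(/"[^"]*"/g)  // same result without reluctance
console.log(greedy.length, reluctant, nonQuotes)

// A reluctant operator matches as little as it is allowed to
const star = 'aaa'.match(/a*?/)[0]   // '' since zero repetitions suffice
const plus = 'aaa'.match(/a+?/)[0]   // 'a' since one repetition is required
const all = 'aaa'.match(/a+/)[0]     // 'aaa', the greedy counterpart
```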

The lookahead operator p(?=q) matches p provided it is followed by q, but does not include q in the match. For example, here we find the hours that precede a colon.

let hours = '10:30 - 12:00'.match(/[0-9]+(?=:)/g) // ['10', '12']

The inverted lookahead operator p(?!q) matches p provided it is not followed by q.

let minutes = '10:30 - 12:00'.match(/[0-9][0-9](?!:)/g) // ['30', '00']

There is also a lookbehind (?<=p)q that matches q as long as it is preceded by p.

minutes = '10:30 - 12:00'.match(/(?<=[0-9]+:)[0-9]+/g) // ['30', '00']

Note that the argument inside (?<=[0-9]+:) is itself a regular expression.

Finally, there is an inverted lookbehind (?<!p)q, matching q as long as it is not preceded by p.

hours = '10:30 - 12:00'.match(/(?<![0-9:])[0-9]+/g) // ['10', '12']
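Here are the four operators applied to a different, made-up input. The inverted lookahead shows a common pitfall: without also rejecting a following digit, backtracking would let a shorter run of digits match.

```javascript
// Hypothetical sample text
const text = '$10 and €20 and 30 items'

// Lookahead: digits followed by ' items'; the suffix is not part of the match
const counts = text.match(/[0-9]+(?= items)/g)            // ['30']

// Inverted lookahead: also reject a following digit, or else '3' would
// match because it is followed by '0' and not by ' items'
const notCounts = text.match(/[0-9]+(?![0-9]| items)/g)   // ['10', '20']

// Lookbehind: digits preceded by a dollar sign
const dollars = text.match(/(?<=\$)[0-9]+/g)              // ['10']

// Inverted lookbehind: digits not preceded by a currency sign or a digit
const plain = text.match(/(?<![$€0-9])[0-9]+/g)           // ['30']
```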

Regular expressions such as this one may have motivated Jamie Zawinski’s timeless quote, “Some people, when confronted with a problem, think: ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Exercises

  1. Write a function that, given a string, produces an escaped string delimited by ' characters. Turn all non-ASCII Unicode into \u{. . .}. Produce escapes \b, \f, \n, \r, \t, \v, \', \\.

  2. Write a function that fits a string into a given number of Unicode characters. If it is too long, trim it and append an ellipsis … (\u{2026}). Be sure to correctly handle characters that are encoded with two UTF-16 code units.

  3. The substring and slice methods are very tolerant of bad arguments. Can you get them to yield an error with any arguments? Try strings, objects, array, no arguments.

  4. Write a function that accepts a string and returns an array of all substrings. Be careful about characters that are encoded with two UTF-16 code units.

  5. In a more perfect world, all string methods would take offsets that count Unicode characters, not UTF-16 code units. Which String methods would be affected? Provide replacement functions for them, such as indexOf(str, sub) and slice(str, start, end).

  6. Implement a printf tagged template function that formats integers, floating-point numbers, and strings with the classic printf formatting instructions, placed after embedded expressions:

    const formatted = printf`${item}%-40s | ${quantity}%6d | ${price}%10.2f`

  7. Write a tagged template function spy that displays both the raw and “cooked” string fragments and the embedded expression values. In the raw string fragments, remove the backslashes that were needed for escaping backticks, dollar signs, and backslashes.

  8. List as many different ways as you can to produce a regular expression that matches only the empty string.

  9. Is the m/multiline flag actually useful? Couldn’t you just match \n? Produce a regular expression that can find all lines containing just digits without the multiline flag. What about the last line?

  10. Produce regular expressions for email addresses and URLs.

  11. Produce regular expressions for US and international telephone numbers.

  12. Use regular expression replacement to clean up phone numbers and credit card numbers.

  13. Produce a regular expression for quoted text, where the delimiters could be matching single or double quotes, or curly quotes “”.

  14. Produce a regular expression for image URLs in an HTML document.

  15. Using a regular expression, extract all decimal integers (including negative ones) from a string into an array.

  16. Suppose you have a regular expression and you want to use it for a complete match, not just a match of a substring. You just want to surround it with ^ and $. But that’s not so easy. The regular expression needs to be properly escaped before adding those anchors. Write a function that accepts a regular expression and yields a regular expression with the anchors added.

  17. Use the replace method of the String class with a function argument to replace all °F measurements in a string with their °C equivalents.

  18. Enhance the greek function of Section 6.5, “Raw Template Literals” (page 122), so that it handles escaped backslashes and $ symbols. Also check whether a symbol starting with a backslash has a substitution. If not, include it verbatim.

  19. Generalize the greek function of the preceding exercise to a general purpose substitution function that can be called as subst(dictionary)`templateString`.
