Chapter 18. Regular Expressions

Regular expressions are patterns that describe strings. They can be passed as arguments to three XQuery built-in functions: matches, which determines whether a string value matches a particular pattern; replace, which replaces the parts of a string that match a pattern; and tokenize, which splits a string into tokens based on a delimiter pattern. This chapter explains the regular expression syntax used by XQuery.

The Structure of a Regular Expression

The regular expression syntax of XQuery is based on that of XML Schema, with some additions. Regular expressions, also known as regexes, can be composed of a number of different parts: atoms, quantifiers, and branches.

Atoms

An atom is the most basic unit of a regular expression. It might describe a single character, such as d, or an escape sequence that represents one or more characters, like \s or \p{Lu}. It could also be a character class expression that represents a range or choice of several characters, such as [a-z]. These kinds of atoms are described later in this chapter.
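As a rough illustration of these kinds of atoms, the sketch below uses Python's re module as a stand-in for an XQuery processor. That choice is an assumption made only so the example is easy to run; the two regex dialects differ in places, and Python's standard re module does not support \p{Lu}, so that escape is left out. The helper name whole_match is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# A normal character is an atom that matches itself.
print(whole_match(r"d", "d"))      # True
print(whole_match(r"d", "e"))      # False

# An escape sequence such as \s stands for one of several characters
# (here: a whitespace character).
print(whole_match(r"\s", " "))     # True
print(whole_match(r"\s", "x"))     # False

# A character class expression such as [a-z] stands for any one
# character in the given range.
print(whole_match(r"[a-z]", "q"))  # True
print(whole_match(r"[a-z]", "Q"))  # False
```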

Quantifiers

Atoms may indicate required, optional, or repeating strings. The number of times a matching string may appear is indicated by a quantifier, which appears directly after an atom. For example, to indicate that the letter d must appear one or more times, you can use the expression d+, where the + means "one or more." The different quantifiers are listed in Table 18-1.

Table 18-1. Kinds of quantifiers

Quantifier    Number of occurrences
none          1
?             0 or 1
*             0, 1, or many
+             1 or many
{n}           n
{n,}          n to many
{n,m}         n to m

Examples of the use of these quantifiers are shown in Table 18-2. Note that in these cases, the quantifier applies only to the letter o, not to the preceding f.

Table 18-2. Quantifier examples

Regular expression    Strings that match         Strings that do not match
fo                    fo                         f, foo
fo?                   f, fo                      foo
fo*                   f, fo, foo, fooo, ...      fx
fo+                   fo, foo, fooo, ...         f
fo{2}                 foo                        fo, fooo
fo{2,}                foo, fooo, foooo, ...      f, fo
fo{2,3}               foo, fooo                  f, fo, foooo
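The rows of Table 18-2 can be checked mechanically. The following is a minimal sketch that uses Python's re module as a stand-in for an XQuery processor (an assumption; for these simple quantifiers the two dialects agree). The names CASES and check are hypothetical, and re.fullmatch is used so that a pattern must account for the whole string.

```python
import re

# pattern -> (strings that should match, strings that should not match).
# The entries are copied from Table 18-2, with "..." rows cut short.
CASES = {
    "fo":      (["fo"], ["f", "foo"]),
    "fo?":     (["f", "fo"], ["foo"]),
    "fo*":     (["f", "fo", "foo", "fooo"], ["fx"]),
    "fo+":     (["fo", "foo", "fooo"], ["f"]),
    "fo{2}":   (["foo"], ["fo", "fooo"]),
    "fo{2,}":  (["foo", "fooo", "foooo"], ["f", "fo"]),
    "fo{2,3}": (["foo", "fooo"], ["f", "fo", "foooo"]),
}

def check(cases):
    """Return the (pattern, string) pairs that disagree with the table."""
    failures = []
    for pattern, (should_match, should_not_match) in cases.items():
        for text in should_match:
            if re.fullmatch(pattern, text) is None:
                failures.append((pattern, text))
        for text in should_not_match:
            if re.fullmatch(pattern, text) is not None:
                failures.append((pattern, text))
    return failures

print(check(CASES))  # [] means every row behaves as the table says
```

The table describes whole-string matching. The XQuery matches function instead returns true when any substring matches, so a call such as matches("foo", "fo") is true; adding the anchors ^ and $ around the pattern restricts it to the whole string.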

Parenthesized Sub-Expressions and Branches

A parenthesized sub-expression can be used as an atom in a larger regular expression. Parentheses are useful for repeating certain sequences of characters. For example, suppose you want to indicate a repetition of the string fo. The expression fo* matches fooo, but not fofo, because the quantifier applies to the final atom, not the entire string. To allow fofo, you can parenthesize fo, resulting in the regular expression (fo)*.
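The difference between fo* and (fo)* can be seen in a short sketch. It uses Python's re module in place of an XQuery processor (an assumption made for runnability; grouping and the * quantifier behave the same way in both), and the name results is hypothetical.

```python
import re

# Record whether each pattern matches each string in its entirety.
results = {}
for pattern in ("fo*", "(fo)*"):
    for text in ("fooo", "fofo"):
        results[(pattern, text)] = re.fullmatch(pattern, text) is not None

# In fo*, the * applies only to the final atom o.
print(results[("fo*", "fooo")])    # True
print(results[("fo*", "fofo")])    # False

# In (fo)*, the * applies to the whole parenthesized sub-expression.
print(results[("(fo)*", "fooo")])  # False
print(results[("(fo)*", "fofo")])  # True
```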

Parenthesized sub-expressions are also useful for specifying a choice between several different patterns. For example, to allow either the string fo or the string xy to come before z, you can use the expression (fo|xy)z. The two expressions on either side of the vertical bar character (|), in this case fo and xy, are known as branches.

The | character does not act on the atom immediately preceding it, but on the entire expression that precedes it (back to the previous | or corresponding opening parenthesis). For example, the regular expression (yes|no) indicates a choice between yes and no, not "ye, followed by s or n, followed by o." Branches at the top level can also be used without parentheses, as in yes|no.
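To make the scope of the | character concrete, here is a small sketch, again using Python's re module as a stand-in for an XQuery processor (an assumption; the helper name whole_match is hypothetical). It contrasts yes|no with a pattern that really does mean "ye, followed by s or n, followed by o."

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# The branches of yes|no are the complete strings "yes" and "no".
for text in ("yes", "no", "yeso", "yeno"):
    print(text, whole_match(r"yes|no", text))
# yes True, no True, yeso False, yeno False

# Parentheses limit the choice to s or n, giving the other reading.
for text in ("yeso", "yeno", "yes"):
    print(text, whole_match(r"ye(s|n)o", text))
# yeso True, yeno True, yes False

# The choice in (fo|xy)z is between fo and xy, each followed by z.
print(whole_match(r"(fo|xy)z", "foz"), whole_match(r"(fo|xy)z", "xyz"))
# True True
```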

Placing parentheses around a sub-expression also allows it to be referenced, which is useful for two purposes: back-references, and variable references when using the replace function. These features are covered in "Back-References" and "Using Sub-Expressions with Replacement Variables," respectively.

Table 18-3 shows some examples that exhibit the interaction between branches, atoms, and parentheses.

Table 18-3. Examples of parentheses in regular expressions

Regular expression    Strings that match         Strings that do not match
(fo)+z                foz, fofoz                 z, fz, fooz, ffooz
(fo|xy)z              foz, xyz                   z
(fo|xy)+z             fofoz, foxyz, xyfoz        z
(f+o)+z               foz, ffoz, foffoz          z, fz, fooz
yes|no                yes, no