Regular expressions are patterns that describe strings. They can be used as arguments to three XQuery built-in functions: to determine whether a string value matches a particular pattern (`matches`), to replace parts of a string that match a pattern (`replace`), and to tokenize strings based on a delimiter pattern (`tokenize`). This chapter explains the regular expression syntax used by XQuery.
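As a quick orientation, the three functions can be sketched as follows. The input strings and patterns here are illustrative choices, not examples from the chapter.

```xquery
(: Each item in this sequence calls one of the three regex-aware functions. :)
(
  matches("query", "q"),              (: true: the string contains a "q" :)
  replace("query", "[aeiou]", "*"),   (: "q**ry": each vowel is replaced :)
  tokenize("a, b, c", ",\s*")         (: ("a", "b", "c"): split on a comma plus optional whitespace :)
)
```

Note that a backslash needs no doubling inside an XQuery string literal, so `",\s*"` passes the escape `\s` to the regular expression processor unchanged.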
The regular expression syntax of XQuery is based on that of XML Schema, with some additions. Regular expressions, also known as regexes, can be composed of a number of different parts: atoms, quantifiers, and branches.
An atom is the most basic unit of a regular expression. It might describe a single character, such as `d`, or an escape sequence that represents one or more characters, like `\s` or `\p{Lu}`. It could also be a character class expression that represents a range or choice of several characters, such as `[a-z]`. These kinds of atoms are described later in this chapter.
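The following sketch tries each kind of atom against a one-character string. It assumes the anchors `^` and `$` to force the pattern to cover the whole string, because the XQuery `matches` function otherwise succeeds if any substring matches. The test strings are illustrative.

```xquery
(
  matches("d", "^d$"),         (: true: a single normal character :)
  matches(" ", "^\s$"),        (: true: \s matches a whitespace character :)
  matches("A", "^\p{Lu}$"),    (: true: \p{Lu} matches an uppercase letter :)
  matches("a", "^\p{Lu}$"),    (: false: "a" is lowercase :)
  matches("q", "^[a-z]$"),     (: true: "q" falls in the range a-z :)
  matches("Q", "^[a-z]$")      (: false: the range is case-sensitive :)
)
```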
Atoms may indicate required, optional, or repeating strings. The number of times a matching string may appear is indicated by a quantifier, which appears directly after an atom. For example, to indicate that the letter `d` must appear one or more times, you can use the expression `d+`, where the `+` means "one or more." The different quantifiers are listed in Table 18-1.
Table 18-1. Kinds of quantifiers

| Quantifier | Number of occurrences |
|---|---|
| (none) | 1 |
| `?` | 0 or 1 |
| `*` | 0, 1, or many |
| `+` | 1 or many |
| `{n}` | n |
| `{n,}` | n to many |
| `{n,m}` | n to m |

Examples of the use of these quantifiers are shown in Table 18-2. Note that in these cases, the quantifier applies only to the letter `o`, not to the preceding `f`.
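The quantifiers can also be tried directly with the `matches` function. This is a minimal sketch with illustrative strings. The `^` and `$` anchors are added so that the whole string must match, since `matches` otherwise accepts a match on any substring.

```xquery
(
  matches("f",     "^fo?$"),       (: true: zero occurrences of "o" :)
  matches("foo",   "^fo?$"),       (: false: at most one "o" is allowed :)
  matches("fooo",  "^fo*$"),       (: true: any number of "o" :)
  matches("f",     "^fo+$"),       (: false: at least one "o" is required :)
  matches("foo",   "^fo{2}$"),     (: true: exactly two "o" :)
  matches("fo",    "^fo{2,}$"),    (: false: at least two "o" are required :)
  matches("fooo",  "^fo{2,3}$"),   (: true: two to three "o" :)
  matches("foooo", "^fo{2,3}$")    (: false: four "o" exceeds the maximum :)
)
```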
A parenthesized sub-expression can be used as an atom in a larger regular expression. Parentheses are useful for repeating certain sequences of characters. For example, suppose you want to indicate a repetition of the string `fo`. The expression `fo*` matches `fooo`, but not `fofo`, because the quantifier applies to the final atom, not the entire string. To allow `fofo`, you can parenthesize `fo`, resulting in the regular expression `(fo)*`.
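The difference between the two expressions can be checked with a short sketch. The anchors `^` and `$` are an addition here: without them, `matches("fofo", "fo*")` would return true, because the leading `fo` alone satisfies the pattern.

```xquery
(
  matches("fooo", "^fo*$"),     (: true: * repeats only the "o" :)
  matches("fofo", "^fo*$"),     (: false: a second "f" is not allowed :)
  matches("fofo", "^(fo)*$"),   (: true: * repeats the whole group "fo" :)
  matches("fooo", "^(fo)*$")    (: false: "oo" is not a repetition of "fo" :)
)
```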
Parenthesized sub-expressions are also useful for specifying a choice between several different patterns. For example, to allow either the string `fo` or the string `xy` to come before `z`, you can use the expression `(fo|xy)z`. The two expressions on either side of the vertical bar character (`|`), in this case `fo` and `xy`, are known as branches.
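A minimal sketch of this choice, using illustrative test strings and anchors so that the whole string must match:

```xquery
(
  matches("foz", "^(fo|xy)z$"),   (: true: first branch, then "z" :)
  matches("xyz", "^(fo|xy)z$"),   (: true: second branch, then "z" :)
  matches("fz",  "^(fo|xy)z$")    (: false: neither branch precedes "z" :)
)
```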
The `|` character does not act on the atom immediately preceding it, but on the entire expression that precedes it (back to the previous `|` or corresponding opening parenthesis). For example, the regular expression `(yes|no)` indicates a choice between `yes` and `no`, not "`ye`, followed by `s` or `n`, followed by `o`." Branches at the top level can also be used without parentheses, as in `yes|no`.
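The scope of the vertical bar can be seen by comparing the two readings side by side. The strings are illustrative. The anchored patterns keep their parentheses because `^yes|no$` would be read as a choice between `^yes` and `no$`.

```xquery
(
  matches("yes",  "^(yes|no)$"),     (: true: first branch :)
  matches("no",   "^(yes|no)$"),     (: true: second branch :)
  matches("yeso", "^(yes|no)$"),     (: false: | does not act on "s" and "n" alone :)
  matches("yeso", "^ye(s|n)o$"),     (: true: this pattern means "ye", then "s" or "n", then "o" :)
  matches("maybe not", "yes|no")     (: true: unanchored top-level branches; "no" occurs inside "not" :)
)
```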
Placing parentheses around a sub-expression also allows it to be referenced, which is useful for two purposes: back-references, and variable references when using the `replace` function. These features are covered in "Back-References" and "Using Sub-Expressions with Replacement Variables," respectively.
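As a brief preview of both uses, the sketch below refers to parenthesized groups by number: `\1` inside a pattern, and `$1`, `$2`, and so on inside a replacement string. The date and the other strings are illustrative values.

```xquery
(
  (: back-reference: \1 must repeat whatever the first group matched :)
  matches("abab", "^(ab)\1$"),    (: true :)
  matches("abcd", "^(ab)\1$"),    (: false: "cd" does not repeat "ab" :)

  (: replacement variables: reorder the three captured groups :)
  replace("2006-10-23", "(\d{4})-(\d{2})-(\d{2})", "$2/$3/$1")   (: "10/23/2006" :)
)
```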
Table 18-3 shows some examples that exhibit the interaction between branches, atoms, and parentheses.