This appendix provides an overview of the XQuery regular expression syntax, which is based on the syntax used by XML Schema. In theory, this is the syntax supported by all XQuery implementations, although in practice you may find that your implementation varies slightly from the description here. When in doubt, consult the documentation accompanying your implementation.
This appendix begins with an overview of the regular expression syntax used by the built-in matches() and replace() functions (see Appendix C). It then provides a grammar for this regular expression language (Listing D.7) and a table of the Unicode properties (Table D.3).
Regular expressions, also known as regexps or regexes, provide a simple but powerful language for performing sophisticated string matching and replacement. Regexps are often more efficient and easier to maintain than hand-written programs that accomplish the same tasks, but sometimes are overlooked because they are perceived as complex and difficult to use.
XQuery uses a regular expression syntax that is derived from the ones used by XML Schema 1.0 and Perl. In XQuery, regular expressions match according to Unicode code point values, so collation isn't used. Only two functions (matches() and replace()) use regular expressions, but they are powerful tools for text manipulation.
Each regular expression consists of one or more branches separated by vertical bars (|). Each branch corresponds to a choice of expressions to match; the regular expression matches a string if any of its branches match.
Each branch consists of zero or more atoms, each of which may have an optional modifier. Each atom matches a character; atoms can be many different expressions, but most commonly are ordinary characters or the wildcard character (.). (Don't confuse regular expression atoms with atomic values, or the regular expression wildcard character with the wildcard node test used in paths.)
Ordinary characters match only themselves; the wildcard character matches (almost) any character. We explore the other kinds of atoms later in this section. Note that whitespace is significant in regular expressions, so don't add extra space characters unless you mean to do so.
Example D.1. Basic regular expressions
    replace("xyz", "x", "a") => "ayz"
    replace("xyz", ".", "a") => "aaa"
    replace("xyz", "x|z", "a") => "aya"
    replace("xyz", "x |z", "a") => "xya"
    replace("x y z", "x |z", "a") => "ay a"
The modifier immediately follows an atom and determines the number of times that atom must appear. It consists of a quantifier and an optional reluctant quantifier. The quantifiers and their meanings are listed in Table D.1. Listing D.2 demonstrates their use.
An error is raised if n is greater than m in the expression {n,m}. If the modifier indicates the atom must appear exactly zero times, then it's equivalent to the empty string (in other words, as if the atom hadn't been listed in the regular expression at all).
Table D.1. Regular expression quantifiers
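Modifier | Meaning |
---|---|
? | Zero or one occurrence |
* | Zero or more occurrences |
+ | One or more occurrences |
{n} | Exactly n occurrences |
{n,} | At least n occurrences |
{n,m} | At least n and at most m occurrences |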
Example D.2. Using regular expression modifiers
    matches("abc", "d?") => true
    replace("abc", "ad?", "x") => "xbc"
    matches("abc", "d*") => true
    matches("abc", "d+") => false
    matches("xxx", "x{2}") => true (: a substring of the input matches :)
    matches("xxx", "x{3}") => true
    matches("xxx", "x{2,}") => true
    matches("xxx", "x{2,4}") => true
    matches("xxx", "x{4,5}") => false
Because certain characters, such as parentheses, have special meaning in regular expressions, they cannot be written directly. These meta-characters must be preceded by a backslash. Most backslash escapes simply match the escaped character; for example, \. matches the period (.). The exceptions are the three escapes \t, \n, and \r, which match the tab, new line, and carriage return characters, respectively (U+0009, U+000A, and U+000D). Listing D.3 shows how to escape a meta-character.
Example D.3. Meta-characters require escaping in regular expressions
    replace("x(z", "(", "y") => error (: invalid regexp :)
    replace("x(z", "\(", "y") => "xyz"
    replace("x.y.z", ".", "-") => "-----"
    replace("x.y.z", "\.", "-") => "x-y-z"
At this point, you already know enough about XQuery regular expressions to be productive with them. However, there are some additional features that may be worth learning.
The caret (^) and dollar sign ($) meta-characters may be used to represent the beginning and end of the string, respectively. These are especially useful for ensuring that the entire string matches the pattern.
The functions that use regular expressions take an optional additional parameter that can specify two flags, i and m. The i flag indicates that matching should be carried out case-insensitively. The m flag indicates that the regular expression should match in “multi-line” mode. In multi-line mode, the wildcard character doesn't match the new line character, and the ^ and $ meta-characters match the beginning or end of lines, in addition to the beginning or end of the string.
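The anchors and flags described above can be sketched as a single query. This is a minimal illustration that assumes an implementation following the behavior described here; the input strings are made up for the example, and &#xA; is the character reference for the new line character.

```xquery
(: Each item evaluates to a boolean; the trailing comment gives the expected value. :)
(
  matches("xyz", "y"),                     (: true: some substring matches :)
  matches("xyz", "^y$"),                   (: false: the whole string must be "y" :)
  matches("xyz", "^xyz$"),                 (: true :)
  matches("XYZ", "^xyz$"),                 (: false: matching is case-sensitive by default :)
  matches("XYZ", "^xyz$", "i"),            (: true: the i flag ignores case :)
  matches("abc&#xA;def", "^def$"),         (: false: ^ and $ anchor to the whole string :)
  matches("abc&#xA;def", "^def$", "m")     (: true: with m, ^ and $ also match at line boundaries :)
)
```

Anchoring a pattern with both ^ and $ is the usual way to test that an entire string, rather than some substring of it, has a given form.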
Table D.2. Additional escapes
Escape sequence | Meaning | Equivalent to |
---|---|---|
\c | XML name characters. | |
\d | XML digit characters. | \p{Nd} |
\i | XML initial name characters. | |
\n | New line character | U+000A |
\p{IsBlock} | Match characters within the named Unicode block, such as \p{IsBasicLatin} | |
\p{Property} | Match characters having the named Unicode property, such as \p{L} | |
\r | Carriage return character | U+000D |
\s | Any XML whitespace character. | [ \t\n\r] |
\t | Tab character | U+0009 |
\w | Word characters: all characters except punctuation, separator, and “other” characters. | [^\p{P}\p{Z}\p{C}] |
In addition to the character escapes described previously, XQuery (and XML Schema) support escapes that match entire classes of Unicode characters. These escapes can match characters according to certain properties (or the absence of those properties), or match characters that belong to certain predefined categories (or don't belong). Given some escape \x, the capitalized version \X has the negated meaning. For example, \s matches any whitespace character, while \S matches any non-whitespace character. Table D.2 summarizes these special escape sequences and Listing D.4 demonstrates their use.
Note that implementations are free to implement new blocks or properties as they become part of the Unicode Database. In my experience, many implementations don't support these Unicode property escapes (\p). Also, some implementations don't support subtracted subgroups like [A-Z-[AEIOU]].
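As a rough sketch of the class escapes in practice, the following query exercises a few of them. The input strings are invented for illustration, and the last two items assume an implementation that supports subtracted subgroups and Unicode property escapes, which (as noted above) not every implementation does.

```xquery
(: The trailing comment on each item gives the expected value. :)
(
  matches("a b", "\s"),                      (: true: the string contains whitespace :)
  replace("a1b22", "\d", "#"),               (: "a#b##": every digit is replaced :)
  replace("a b", "\S", "x"),                 (: "x x": every non-whitespace character is replaced :)
  matches("abc1", "^\w+$"),                  (: true: letters and digits are word characters :)
  replace("hello", "[a-z-[aeiou]]", "*"),    (: "*e**o": lowercase letters except vowels :)
  replace("XQuery", "\p{Lu}", "_")           (: "__uery": uppercase letters :)
)
```

Note that the backslash has no special meaning inside an XQuery string literal, so the escapes are written exactly as they appear in the regular expression.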
Parenthesized subexpressions are treated as groups. The part of the input string matched by a group is called a captured substring, and can be referenced in the replacement string using a dollar sign followed by the number of the group (the first group is number 1). XQuery requires implementations to support references in replacements only; some implementations may also allow back references in the match pattern. Listing D.5 demonstrates the use of parenthesized groups.
Example D.5. Using parentheses to capture substrings
    replace("xylophone", "xylo(.*)", "tele$1") => "telephone"
    replace("<x/><y/>", "<(.*)/>", "$1") => "x/><y"
    replace("<x/><y/>", "<(\w*)/>", "$1") => "xy"
    matches("abcabc", "(abc)$1") => false (: some implementations return true :)
Normally, regular expressions match greedily (matching the longest substring possible). Greedy matching can sometimes produce unexpected results; for example, in the second example in Listing D.5, the pattern <(.*)/> matches everything between the first < and the last /> as a single match, instead of matching each element separately. The third example works around this by using the more specific pattern <(\w*)/>, but another way is to use a reluctant quantifier, as shown in Listing D.6. The optional reluctant quantifier (?) indicates that the regular expression should instead match the shortest substring possible that still allows the regular expression to succeed. This “reluctance” makes a difference only when using the replace() function.
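The difference between greedy and reluctant matching can be sketched as follows. This is an illustrative query only, reusing the input from Listing D.5 and one made-up string.

```xquery
(: The trailing comment on each item gives the expected value. :)
(
  replace("<x/><y/>", "<(.*)/>", "$1"),     (: "x/><y": greedy .* runs to the last "/>" :)
  replace("<x/><y/>", "<(.*?)/>", "$1"),    (: "xy": reluctant .*? stops at the first "/>" :)
  replace("aaa", "a+", "b"),                (: "b": one match covering all three characters :)
  replace("aaa", "a+?", "b"),               (: "bbb": three matches of one character each :)
  matches("aaa", "a+?")                     (: true: same answer as the greedy "a+" :)
)
```

The last item shows why reluctance doesn't matter for matches(): the function reports only whether a match exists, not how long it is.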
XQuery doesn't actually define a grammar for its regular expression syntax, but instead refers to the XML Schema 1.0 Recommendation. Surprisingly, that document also neglects to provide a formal definition, relying instead on prose description. I used that description together with the additional features introduced by XQuery to produce the grammar in Listing D.7.
In this grammar, Char is any XQuery character that isn't a meta-character (MetaChar) for the regular expression language, and XmlChar is any character that isn't a square bracket ([ or ]) or hyphen (-). The final production Property corresponds to the character properties listed in Table D.3, and the Block production corresponds to the names of character ranges in the Unicode Database.
Example D.7. The regular expression language of XQuery
    Regexp     := Branch ( '|' Branch)*
    Branch     := (Atom Modifier?)*
    Modifier   := Quantifier ('?')?
    Quantifier := '?' | '*' | '+' | '{' Quantity '}'
    Quantity   := QRange | QExact
    QRange     := QExact ',' QExact?
    QExact     := [0-9]+
    Atom       := Char | Escape | Group | '(' Regexp ')'
    MetaChar   := '^' | '$' | '.' | '\' | '|' | '?' | '*' | '+' | '(' | ')' | '[' | ']' | '{' | '}'
    Group      := '[' (PosGroup | NegGroup | SubGroup) ']'
    PosGroup   := (Range | Escape)+
    NegGroup   := '^' PosGroup
    SubGroup   := (PosGroup | NegGroup) '-' Group
    Range      := XmlChar | CharOrEsc '-' CharOrEsc
    CharOrEsc  := XmlChar | SingleEsc
    Escape     := SingleEsc | MultiEsc | CatEsc | ComplEsc | WildEsc
    WildEsc    := '.'
    SingleEsc  := '\' (MetaChar | 'n' | 'r' | 't' | '-')
    MultiEsc   := '\' EscSeq
    EscSeq     := 's' | 'i' | 'c' | 'd' | 'w' | 'S' | 'I' | 'C' | 'D' | 'W'
    CatEsc     := '\p{' CharProp '}'
    ComplEsc   := '\P{' CharProp '}'
    CharProp   := Property | Block
Table D.3 is reproduced from the XML Schema 1.0 Recommendation. The Unicode blocks accepted by the \p{} expression (such as Katakana and BasicLatin) are listed in the Unicode Database, and are omitted here for brevity.
Table D.3. Character properties for use with \p
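Category | Property | Meaning |
---|---|---|
Letters | L | All letters |
| Lu | Uppercase |
| Ll | Lowercase |
| Lt | Titlecase |
| Lm | Modifier |
| Lo | Other |
Marks | M | All marks |
| Mn | Nonspacing |
| Mc | Spacing combining |
| Me | Enclosing |
Numbers | N | All numbers |
| Nd | Decimal digit |
| Nl | Letter |
| No | Other |
Punctuation | P | All punctuation |
| Pc | Connector |
| Pd | Dash |
| Ps | Open |
| Pe | Close |
| Pi | Initial quote |
| Pf | Final quote |
| Po | Other |
Separators | Z | All separators |
| Zs | Space |
| Zl | Line |
| Zp | Paragraph |
Symbols | S | All symbols |
| Sm | Math |
| Sc | Currency |
| Sk | Modifier |
| So | Other |
Other | C | All others |
| Cc | Control |
| Cf | Format |
| Co | Private use |
| Cn | Not assigned |

Copyright © 1998–2004 World Wide Web Consortium, (Massachusetts Institute of Technology, Institut National de Recherche en Informatique et en Automatique, Keio University). All Rights Reserved.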