Appendix D. Regular Expressions

Introduction

This appendix provides an overview of the XQuery regular expression syntax, which is based on the syntax used by XML Schema. In theory, this is the syntax supported by all XQuery implementations, although in practice you may find that your implementation varies slightly from the description here. When in doubt, consult the documentation accompanying your implementation.

This appendix begins with an overview of the regular expression syntax used by the built-in matches() and replace() functions (see Appendix C). It then provides a grammar for this regular expression language (Listing D.7) and a table of the Unicode properties (Table D.3).

Overview

Regular expressions, also known as regexps or regexes, provide a simple but powerful language for performing sophisticated string matching and replacement. Regexps are often more efficient and easier to maintain than hand-written programs that accomplish the same tasks, but are sometimes overlooked because they are perceived as complex and difficult to use.

XQuery uses a regular expression syntax that is derived from the ones used by XML Schema 1.0 and Perl. In XQuery, regular expressions match according to Unicode code point values, so collation isn't used. Only two functions (matches() and replace()) use regular expressions, but they are powerful tools for text manipulation.

Each regular expression consists of one or more branches separated by vertical bars (|). Each branch corresponds to a choice of expressions to match; the regular expression matches a string if any of its branches match.

Each branch consists of zero or more atoms, each of which may have an optional modifier. Each atom matches a character; atoms can be many different expressions, but most commonly are ordinary characters or the wildcard character (.). (Don't confuse regular expression atoms with atomic values, or the regular expression wildcard character with the wildcard node test used in paths.)

Ordinary characters match only themselves; the wildcard character matches (almost) any character. We explore the other kinds of atoms later in this section. Note that whitespace is significant in regular expressions, so don't add extra space characters unless you mean to do so.

Example D.1. Basic regular expressions

replace("xyz", "x", "a")       => "ayz"
replace("xyz", ".", "a")       => "aaa"
replace("xyz", "x|z", "a")     => "aya"
replace("xyz", "x |z", "a")    => "xya"
replace("x y z", "x |z", "a")  => "ay a"
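The same rules apply when the branches are longer than a single character. The following is a minimal sketch with made-up input strings; the comments show the results the rules above imply, not the output of any particular implementation.

```xquery
(: Each item in this sequence is an independent expression. :)
(
  replace("cat dog bird", "cat|bird", "pet"),  (: "pet dog pet" :)
  replace("cat dog bird", "c.t|d.g", "pet"),   (: "pet pet bird" :)
  matches("cat dog bird", "fish|bird"),        (: true: the second branch matches :)
  matches("cat dog bird", "fish|frog")         (: false: neither branch matches :)
)
```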

The modifier immediately follows an atom and determines the number of times that atom must appear. It consists of a quantifier and an optional reluctant quantifier. The quantifiers and their meanings are listed in Table D.1. Listing D.2 demonstrates their use.

An error is raised if in the expression {n,m}, n is greater than m. If the modifier indicates the atom must appear exactly zero times, then it's equivalent to the empty string (in other words, as if the atom hadn't been listed in the regular expression at all).

Table D.1. Regular expression quantifiers

| Modifier | Meaning |
|----------|---------|
| none | Atom must appear exactly once |
| ? | Atom must appear zero or one times |
| * | Atom must appear zero or more times |
| + | Atom must appear one or more times |
| {n} | Atom must appear exactly n times |
| {n,} | Atom must appear at least n times |
| {n,m} | Atom must appear at least n and at most m times |

Example D.2. Using regular expression modifiers

matches("abc", "d?")       => true
replace("abc", "ad?", "x") => "xbc"
matches("abc", "d*")       => true
matches("abc", "d+")       => false
matches("xxx", "x{2}")     => true  (: "xx" is found within "xxx" :)
matches("xxx", "x{3}")     => true
matches("xxx", "x{2,}")    => true
matches("xxx", "x{2,4}")   => true
matches("xxx", "x{4,5}")   => false
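The two boundary cases mentioned earlier, a quantity of zero and an upper bound that is too small, can be sketched the same way. The input strings here are invented for illustration, and the results in the comments follow from the quantifier rules in Table D.1.

```xquery
(
  matches("abc", "ad{0}bc"),     (: true: d{0} matches the empty string, so the pattern acts like "abc" :)
  matches("abc", "ad{0,1}bc"),   (: true: {0,1} behaves like ? :)
  matches("abbbc", "ab{1,2}c")   (: false: three b's exceed the maximum of two :)
)
(: By contrast, "b{2,1}" is not a valid regular expression, so
   matches("abc", "b{2,1}") raises an error instead of returning false. :)
```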

Because certain characters, such as parentheses, have special meaning in regular expressions, they cannot be expressed directly. These meta-characters require a backslash escape in front of them. Most backslash-escaped characters result in the character being escaped; for example, \. matches the period (.). The exceptions are the three escapes \t, \n, and \r, which match the tab, new line, and carriage return characters, respectively (U+0009, U+000A, and U+000D). Listing D.3 shows how to escape a meta-character.

Example D.3. Meta-characters require escaping in regular expressions

replace("x(z", "(", "y")    => error (: invalid regexp :)
replace("x(z", "\(", "y")   => "xyz"
replace("x.y.z", ".", "-")  => "-----"
replace("x.y.z", "\.", "-") => "x-y-z"
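The whitespace escapes and the escaped backslash follow the same pattern. This is a small sketch with invented inputs. Note that XQuery string literals do not treat the backslash specially, so "\t" reaches the regular expression engine as two characters, and character references such as &#9; are used to put a real tab or new line into the input string.

```xquery
(
  replace("a&#9;b", "\t", " "),             (: "a b": \t matches the tab character :)
  replace("line1&#10;line2", "\n", "; "),   (: "line1; line2": \n matches the new line character :)
  replace("1+1=2", "\+", " plus "),         (: "1 plus 1=2": \+ matches a literal plus sign :)
  replace("C:\temp", "\\", "/")             (: "C:/temp": \\ matches a literal backslash :)
)
```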

At this point, you already know enough about XQuery regular expressions to be productive with them. However, there are some additional features that may be worth learning.

The caret (^) and dollar sign ($) meta-characters may be used to represent the beginning and end of the string, respectively. These are especially useful for ensuring that the entire string matches the pattern.

The functions that use regular expressions take an optional additional parameter that can specify two flags, i and m. The i flag indicates that the regular expression matches should be carried out case-insensitively. The m flag indicates that the regular expression should match in “multi-line” mode. In multi-line mode, the wildcard character doesn't match the new line character, and the ^ and $ meta-characters match the beginning or end of lines, in addition to the beginning or end of the string.
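The anchors and the two flags can be tried with a few short expressions. This is a minimal sketch with made-up inputs, assuming an implementation that accepts the flags as a third argument to matches(); the comments give the results implied by the description above.

```xquery
(
  matches("xyz", "y"),                  (: true: "y" occurs somewhere in the string :)
  matches("xyz", "^y$"),                (: false: the entire string is not "y" :)
  matches("xyz", "^x.z$"),              (: true: the pattern spans the entire string :)
  matches("XYZ", "xyz"),                (: false: matching is case-sensitive by default :)
  matches("XYZ", "xyz", "i"),           (: true: the i flag ignores case :)
  matches("ab&#10;cd", "^cd$"),         (: false: ^ and $ match only at the ends of the string :)
  matches("ab&#10;cd", "^cd$", "m")     (: true: in multi-line mode they also match at line boundaries :)
)
```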

Table D.2. Additional escapes

| Escape sequence | Meaning | Equivalent to |
|-----------------|---------|---------------|
| \c | XML name characters. | n/a |
| \d | XML digit characters. | \p{Nd} |
| \i | XML initial name characters. | [_\p{L}] |
| \n | New line character (U+000A) | [#x000A] |
| \p{IsBlock} | Match characters within the named Unicode block, such as IsBasicLatin or IsKatakana. See the Unicode Database for a list of all block names. | n/a |
| \p{Property} | Match characters having the named Unicode property, such as Lu (upper-case) or Mn (non-spacing). | n/a |
| \r | Carriage return character (U+000D) | [#x000D] |
| \s | Any XML whitespace character. | [ \t\n\r] |
| \t | Tab character (U+0009) | [#x0009] |
| \w | Word characters: all characters except punctuation, separator, and “other” characters. | [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] |

In addition to the character escapes described previously, XQuery (and XML Schema) support escapes that match entire classes of Unicode characters. These escapes can match characters according to certain properties (or the absence of those properties), or match characters that belong to certain predefined categories (or don't belong). Given some escape \x, the capitalized version \X has the negated meaning. For example, \s matches any whitespace character, while \S matches any non-whitespace character. Table D.2 summarizes these special escape sequences and Listing D.4 demonstrates their use.

Note that implementations are free to implement new blocks or properties as they become part of the Unicode Database. In my experience, many implementations don't support these Unicode property escapes (\p). Also, some implementations don't support subtracted subgroups like [A-Z-[AEIOU]].

Example D.4. Character escapes in regular expressions

replace("Hello, world", "\s", "X")      => "Hello,Xworld"
replace("Hello, world", "\S", "X")      => "XXXXXX XXXXX"
replace("Hello, world", "\p{Lu}", "")   => "ello, world"
replace("Hello, world", "\p{P}\s", "")  => "Helloworld"
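The remaining escapes from Table D.2 work the same way. The sketch below uses invented inputs; the last expression relies on a subtracted subgroup, which (as noted above) some implementations may reject, so treat that result as what a conforming implementation should produce and not as a guarantee.

```xquery
(
  replace("Order 66, aisle 7", "\d+", "#"),   (: "Order #, aisle #" :)
  replace("Order 66, aisle 7", "\D", ""),     (: "667": \D removes every non-digit :)
  matches("x1", "^\i\c*$"),                   (: true: "x" may start an XML name :)
  matches("1x", "^\i\c*$"),                   (: false: a digit may not start an XML name :)
  replace("QUEUE", "[A-Z-[AEIOU]]", "_")      (: "_UEUE": only the consonant is replaced :)
)
```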

Advanced Regexps

Parenthesized subexpressions are treated as groups. The part of the input string matched by a group is called a captured substring, and can be referenced in the replacement string using a dollar sign ($) followed by the number of the group (the first group is number 1). XQuery requires implementations to support references in replacements only; some implementations may also allow back references in the match pattern. Listing D.5 demonstrates the use of parenthesized groups.

Example D.5. Using parentheses to capture substrings

replace("xylophone", "xylo(.*)", "tele$1") => "telephone"
replace("<x/><y/>", "<(.*)/>", "$1")       => "x/><y"
replace("<x/><y/>", "<(\w*)/>", "$1")      => "xy"
matches("abcabc", "(abc)$1")               => false
                          (: some implementations return true :)
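Several groups can be captured at once and referenced in any order. This is a minimal sketch with made-up inputs. The last expression assumes the replacement-string rule from the final XQuery 1.0 functions specification, in which \$ stands for a literal dollar sign; older or nonconforming implementations may behave differently.

```xquery
(
  replace("2004-03-15", "(\d{4})-(\d{2})-(\d{2})", "$2/$3/$1"),  (: "03/15/2004" :)
  replace("Doe, Jane", "(\w+), (\w+)", "$2 $1"),                 (: "Jane Doe" :)
  replace("price: 20", "(\d+)", "\$$1")                          (: "price: $20" :)
)
```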

Normally, regular expressions match greedily (matching the longest substring possible). Greedy matching can sometimes produce unexpected results; for example, in the second example in Listing D.5, the pattern <(.*)/> matches everything between the first < and the last /> as a single match, instead of matching each element separately. The third example works around this by using the more specific pattern <(\w*)/>, but another way is to use a reluctant quantifier, as shown in Listing D.6. The optional reluctant quantifier (?) indicates that the regular expression should instead match the shortest substring possible for the regular expression to still succeed. This “reluctance” makes a difference only when using the replace() function.

Example D.6. Ordinary and reluctant quantifiers

replace("<x/><y/>", "<(.*)/>", "$1")  => "x/><y"
replace("<x/><y/>", "<(.*?)/>", "$1") => "xy"
replace("xyzxyzxyz", "x(.*)z", "$1")  => "yzxyzxy"
replace("xyzxyzxyz", "x(.*?)z", "$1") => "yyy"
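The reluctant marker can follow any quantifier, not only the asterisk. The sketch below, using an invented input, contrasts greedy and reluctant forms and shows that matches() gives the same answer either way.

```xquery
(
  replace("aaaa", "a{2,3}", "x"),    (: "xa": greedy takes three a's, leaving one unmatched :)
  replace("aaaa", "a{2,3}?", "x"),   (: "xx": reluctant takes two a's at a time :)
  replace("aaaa", "a+?", "x"),       (: "xxxx": reluctant takes one a at a time :)
  matches("aaaa", "a{2,3}"),         (: true :)
  matches("aaaa", "a{2,3}?")         (: true: reluctance does not change whether a match exists :)
)
```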

Regexp Language

XQuery doesn't actually define a grammar for its regular expression syntax, but instead refers to the XML Schema 1.0 Recommendation. Surprisingly, that document also neglects to provide a formal definition, relying instead on prose description. I used that description together with the additional features introduced by XQuery to produce the grammar in Listing D.7.

In this grammar, Char is any XQuery character that isn't a meta-character (MetaChar) for the regular expression language, and XmlChar is any character that isn't a square bracket ([ or ]) or hyphen (-). The final production Property corresponds to the character properties listed in Table D.3, and the Block production corresponds to the names of character ranges in the Unicode Database.

Example D.7. The regular expression language of XQuery

Regexp      := Branch ( '|' Branch)*
Branch      := (Atom Modifier?)*
Modifier    := Quantifier ('?')?
Quantifier  := '?' | '*' | '+' | '{' Quantity '}'
Quantity    := QRange | QExact
QRange      := QExact ',' QExact?
QExact      := [0-9]+
Atom        := Char | Escape | Group | '(' Regexp ')'
MetaChar    := '^' | '$' | '.' | '\' | '?' | '*' | '+'
             | '(' | ')' | '[' | ']' | '{' | '}'
Group       := '[' (PosGroup | NegGroup | SubGroup) ']'
PosGroup    := (Range | Escape)+
NegGroup    := '^' PosGroup
SubGroup    := (PosGroup | NegGroup) '-' Group
Range       := XmlChar | CharOrEsc '-' CharOrEsc
CharOrEsc   := XmlChar | SingleEsc
Escape      := SingleEsc | MultiEsc | CatEsc | ComplEsc | WildEsc
WildEsc     := '.'
SingleEsc   := '\' (MetaChar | 'n' | 'r' | 't' | '-')
MultiEsc    := '\' EscSeq
EscSeq      := 's' | 'i' | 'c' | 'd' | 'w'
             | 'S' | 'I' | 'C' | 'D' | 'W'
CatEsc      := '\p{' CharProp '}'
ComplEsc    := '\P{' CharProp '}'
CharProp    := Property | Block

Character Properties

The table is reproduced from the XML Schema 1.0 Recommendation. The Unicode blocks accepted by the \p{} expression (such as Katakana and BasicLatin) are listed in the Unicode Database, and are omitted here for brevity.

Table D.3. Character properties for use with \p

| Category | Property | Meaning |
|----------|----------|---------|
| Letters | L | All letters |
| | Lu | Uppercase letters |
| | Ll | Lowercase letters |
| | Lt | Titlecase letters |
| | Lm | Modifier |
| | Lo | Other |
| Marks | M | All marks |
| | Mn | Non-spacing |
| | Mc | Spacing combining |
| | Me | Enclosing |
| Numbers | N | All numbers |
| | Nd | Decimal digit |
| | Nl | Letter |
| | No | Other |
| Punctuation | P | All punctuation |
| | Pc | Connector |
| | Pd | Dash |
| | Ps | Open |
| | Pe | Close |
| | Pi | Initial quote |
| | Pf | Final quote |
| | Po | Other |
| Separators | Z | All separators |
| | Zs | Space |
| | Zl | Line |
| | Zp | Paragraph |
| Symbols | S | All symbols |
| | Sm | Math |
| | Sc | Currency |
| | Sk | Modifier |
| | So | Other |
| Other | C | All others |
| | Cc | Control |
| | Cf | Format |
| | Co | Private use |
| | Cn | Not assigned |

Copyright © 1998–2004 World Wide Web Consortium, (Massachusetts Institute of Technology, Institut National de Recherche en Informatique et en Automatique, Keio University). All Rights Reserved.
