Chapter 19. Regular Expressions

Regular expressions are patterns that describe strings. They can be used as arguments to four XQuery built-in functions to determine whether a string value matches a particular pattern (matches), to replace parts of a string that match a pattern (replace), to tokenize strings based on a delimiter pattern (tokenize), and to split a string into matching and non-matching parts (analyze-string). This chapter explains the regular expression syntax used by XQuery.
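
As a quick orientation, the following sketch calls each of the four functions once. The input strings are made up for illustration, and the analyze-string call assumes a processor that supports version 3.0 or later.

```xquery
(: Each expression contributes one item, or one sequence, to the result. :)
matches("query", "^q"),                          (: true :)
replace("query", "[aeiou]", "*"),                (: "q**ry" :)
tokenize("a, b,c", ",\s*"),                      (: ("a", "b", "c") :)
analyze-string("a1b2", "\d")/fn:match/string()   (: ("1", "2") :)
```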

The Structure of a Regular Expression

The regular expression syntax of XQuery is based on that of XML Schema, with some additions. Regular expressions, also known as regexes, can be composed of a number of different parts: atoms, quantifiers, and branches.

Atoms

An atom is the most basic unit of a regular expression. It might describe a single character, such as d, or an escape sequence that represents one or more characters, like \s or \p{Lu}. It could also be a character class expression that represents a range or choice of several characters, such as [a-z]. These kinds of atoms are described later in this chapter.

Quantifiers

Atoms may indicate required, optional, or repeating strings. The number of times a matching string may appear is indicated by a quantifier, which appears directly after an atom. For example, to indicate that the letter d must appear one or more times, you can use the expression d+, where the + means “one or more.” The different quantifiers are listed in Table 19-1.

Table 19-1. Kinds of quantifiers
Quantifier   Number of occurrences
none         1
?            0 or 1
*            0, 1, or many
+            1 or many
{n}          n
{n,}         n to many
{n,m}        n to m

Examples of the use of these quantifiers are shown in Table 19-2. Note that in these cases, the quantifier applies only to the letter o, not to the preceding f.

In this table and in all similar tables in this chapter, the “Strings that do not match” in the third column are ones that do not match (in their entirety) the regular expression shown. However, it is important to note that the matches function will return true if any part of a string matches the regular expression. For example, in the first row of the table, it says that foo does not match the regular expression fo. This is true, but the expression matches("foo", "fo") will return true because part of the string (the first two characters fo) matches the pattern.

Table 19-2. Quantifier examples
Regular expression   Strings that match        Strings that do not match
fo                   fo                        f, foo
fo?                  f, fo                     foo
fo*                  f, fo, foo, fooo, ...     fx
fo+                  fo, foo, fooo, ...        f
fo{2}                foo                       fo, fooo
fo{2,}               foo, fooo, foooo, ...     f, fo
fo{2,3}              foo, fooo                 f, fo, foooo
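
The difference between matching a whole string and matching part of it can be checked directly. This is a minimal sketch that borrows the ^ and $ anchors, described later in this chapter, to force the whole-string interpretation used in the table.

```xquery
(: matches() succeeds on a partial match, so fo is found inside foo :)
matches("foo", "fo"),            (: true :)
(: anchoring both ends requires the entire string to match :)
matches("foo", "^fo$"),          (: false :)
matches("foo", "^fo{2}$"),       (: true :)
matches("fooo", "^fo{2,3}$"),    (: true :)
matches("foooo", "^fo{2,3}$")    (: false :)
```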

Parenthesized Sub-Expressions and Branches

A parenthesized sub-expression can be used as an atom in a larger regular expression. Parentheses are useful for repeating certain sequences of characters. For example, suppose you want to indicate a repetition of the string fo. The expression fo* matches fooo, but not fofo, because the quantifier applies to the final atom, not the entire string. To allow fofo, you can parenthesize fo, resulting in the regular expression (fo)*.

Parenthesized sub-expressions are also useful for specifying a choice between several different patterns. For example, to allow either the string fo or the string xy to come before z, you can use the expression (fo|xy)z. The two expressions on either side of the vertical bar character (|), in this case fo and xy, are known as branches.

The | character does not act on the atom immediately preceding it, but on the entire expression that precedes it (back to the previous | or corresponding opening parenthesis). For example, the regular expression (yes|no) indicates a choice between yes and no, not “ye, followed by s or n, followed by o.” Branches at the top level can also be used without parentheses, as in yes|no.

Placing parentheses around a sub-expression also allows it to be referenced, which is useful for two purposes: back-references, and variable references when using the replace function. These features are covered in “Back-References” and “Using Sub-Expressions with Replacement Variables”, respectively.

Table 19-3 shows some examples that exhibit the interaction among branches, atoms, and parentheses.

Table 19-3. Examples of parentheses in regular expressions
Regular expression   Strings that match      Strings that do not match
(fo)+z               foz, fofoz              z, fz, fooz, ffooz
(fo|xy)z             foz, xyz                z
(fo|xy)+z            fofoz, foxyz, xyfoz     z
(f+o)+z              foz, ffoz, foffoz       z, fz, fooz
(yes|no)             yes, no                 yeno
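
A few of these rows can be verified with the matches function. The sketch below again adds the ^ and $ anchors so that each pattern is tested against the whole string.

```xquery
(: the quantifier applies to the whole parenthesized group :)
matches("fofoz", "^(fo)+z$"),      (: true :)
matches("fooz", "^(fo)+z$"),       (: false :)
(: each repetition may choose a different branch :)
matches("xyfoz", "^(fo|xy)+z$"),   (: true :)
(: the branches are yes and no, not ye followed by s or n :)
matches("yeno", "^(yes|no)$")      (: false :)
```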

Representing Individual Characters

A single character can be used to represent itself in a regular expression. In this case, it is known as a normal character. For example, the regular expression d matches the letter d, and def matches the string def, as you might expect. Each of the three single characters (d, e, and f) is its own atom, and can have a quantifier associated with it. For example, the regular expression d+ef matches the strings def, ddef, dddef, etc.

Certain characters, in order to be taken literally, must be escaped because they have another meaning in a regular expression. For example, the asterisk (*) will be treated like a quantifier unless it is escaped. These characters, called metacharacters, must be escaped (except when they are within square brackets): ., \, ?, *, +, |, ^, $, {, }, (, ), [, and ].

These characters are escaped by preceding them with a backslash. This is referred to as a single-character escape because there is only one matching character. For convenience, there are three additional single-character escapes for the whitespace characters tab, line feed, and carriage return. Table 19-4 lists the single-character escapes.

Table 19-4. Single-character escapes
Escape sequence   Character
\\                \
\|                |
\.                .
\-                -
\^                ^
\$                $
\?                ?
\*                *
\+                +
\{                {
\}                }
\(                (
\)                )
\[                [
\]                ]
\n                Line feed (#xA)
\r                Carriage return (#xD)
\t                Tab (#x9)

You can also use the standard XML syntax for character references and predefined entity references in regular expressions, as long as they are in quoted strings. For example, a space can be represented as &#x20;, and a less-than symbol (<) can be represented as &lt;. This can be useful for special characters. It is described further in “XML Entity and Character References”.

Table 19-5 shows some examples of representing individual characters in regular expressions.

Table 19-5. Representing individual characters
Regular expression   Strings that match   Strings that do not match
d                    d                    g
d+efg+               defg, ddefgg         defgefg, deffgg
defg                 defg                 d, efg
d|e|f                d, e, f              g
f*o                  fo, ffo, fffo        f*o
f\*o                 f*o                  fo, ffo, fffo
d&#233;f             déf                  def, df
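
The effect of escaping a metacharacter can be seen by comparing f*o with f\*o. This sketch relies on the fact that a backslash has no special meaning in an XQuery string literal, so the pattern can be written exactly as it appears in the table.

```xquery
(: escaped asterisk: a literal * is required :)
matches("f*o", "^f\*o$"),            (: true :)
matches("ffo", "^f\*o$"),            (: false :)
(: unescaped asterisk: a quantifier on f :)
matches("ffo", "^f*o$"),             (: true :)
(: a character reference in the pattern stands for the character itself :)
matches("d&#233;f", "^d&#233;f$")    (: true :)
```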

Representing Any Character

The period (.) has special significance in regular expressions: it matches any character except a line feed (#xA) or carriage return (#xD). The period character represents only one matching character, but a quantifier (such as *) can be applied to it to represent multiple characters.

Table 19-6 shows some examples of the wildcard escape character in use. For the third example in the table, assume a line feed character between f and o in the third column. This string does not match unless you are in dot-all mode.

Table 19-6. The wildcard escape character
Regular expression   Strings that match     Strings that do not match
f.o                  fao, fbo, f2o          fo, fbbo
f..o                 faao, fbco, f12o       fo, fao
f.*o                 fo, fao, fbcde23o      f&#xA;o
f\.o                 f.o                    fao

It is important to note that the period loses its wildcard power when placed in a character class expression (within square brackets).

XQuery functions that use regular expressions allow you to indicate that the processor should operate in dot-all mode. This is specified using the letter s in the $flags argument. In dot-all mode, the period matches any character whatsoever, including the line feed (#xA) and carriage return (#xD). See “Using Flags”.
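
The following sketch shows both behaviors: the period failing to match a line feed unless dot-all mode is on, and the period being taken literally inside square brackets. The variable name $s is arbitrary.

```xquery
(: $s is f, a line feed, and o :)
let $s := "f&#xA;o"
return (
  matches($s, "^f.o$"),         (: false: the period does not match #xA :)
  matches($s, "^f.o$", "s"),    (: true: dot-all mode :)
  matches("f.o", "^f[.]o$"),    (: true: a literal period :)
  matches("fao", "^f[.]o$")     (: false: no wildcard inside brackets :)
)
```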

Representing Groups of Characters

Sometimes characters fall into convenient groups, such as decimal digits or punctuation characters. Three different kinds of escapes can be used to represent a group of characters: multi-character escapes, category escapes, and block escapes. Like single-character escapes, they all start with a backslash.

Multi-Character Escapes

Multi-character escapes, listed in Table 19-7, represent groups of related characters. They are called multi-character escapes because there are several characters that they can match. However, each escape represents only one character in a matching string. To allow several matching characters, you should use a quantifier such as +.

Table 19-7. Multi-character escapes
Escape   Meaning
\s       A whitespace character, as defined by XML (space, tab, carriage return, or line feed)
\S       A character that is not a whitespace character
\d       A decimal digit (0 to 9), or a digit in another style, for example, an Indic Arabic digit
\D       A character that is not a decimal digit
\w       A “word” character, that is, any character not in one of the Unicode categories of Punctuation, Separators, and Other
\W       A non-word character, that is, any character in one of the Unicode categories of Punctuation, Separators, and Other
\i       A character that is allowed as the first character of an XML name, i.e., a letter, an underscore (_), or a colon (:); the i stands for “initial”
\I       A character that cannot be the first character of an XML name
\c       A character that can be part of an XML name, i.e., a letter, a digit, an underscore (_), a hyphen (-), a colon (:), or a period (.)
\C       A character that cannot be part of an XML name

Category Escapes

The Unicode standard defines categories of characters based on their purpose. For example, there are categories for punctuation, uppercase letters, and currency symbols. These categories, listed in Table 19-8, can be referenced in regular expressions by using category escapes.

Table 19-8. Unicode categories
Category      Property   Meaning           Property   Meaning
Letters       L          All letters       Lt         Titlecase
              Lu         Uppercase         Lm         Modifier
              Ll         Lowercase         Lo         Other
Marks         M          All marks         Mc         Spacing combining
              Mn         Non-spacing       Me         Enclosing
Numbers       N          All numbers       Nl         Letter
              Nd         Decimal digit     No         Other
Punctuation   P          All punctuation   Pe         Close
              Pc         Connector         Pi         Initial quote
              Pd         Dash              Pf         Final quote
              Ps         Open              Po         Other
Separators    Z          All separators    Zl         Line
              Zs         Space             Zp         Paragraph
Symbols       S          All symbols       Sk         Modifier
              Sm         Math              So         Other
              Sc         Currency
Other         C          All others        Co         Private use
              Cc         Control           Cn         Not assigned
              Cf         Format

Category escapes take the form \p{XX}, with XX representing the property listed in Table 19-8. For example, \p{Lu} matches any uppercase letter. Category escapes that use an uppercase P, as in \P{XX}, match all characters that are not in the category. For example, \P{Lu} matches any character that is not an uppercase letter.

Note that the category escapes include all alphabets. If you intend for an expression to match only the capital letters A through Z, it is better to use [A-Z] than \p{Lu}, because \p{Lu} allows uppercase letters of all character sets. Likewise, if your intention is to allow only the decimal digits 0 through 9, use [0-9] rather than \p{Nd} or \d, because there are decimal digits other than 0 through 9 in other character sets.
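
The difference is easy to demonstrate with characters outside the ASCII range. In this sketch, the two test characters are chosen purely for illustration: &#xC9; is an uppercase E with an acute accent, and &#x663; is an Arabic-Indic digit three.

```xquery
(: an accented capital is an uppercase letter, but is not in A-Z :)
matches("&#xC9;", "^\p{Lu}$"),    (: true :)
matches("&#xC9;", "^[A-Z]$"),     (: false :)
(: a non-ASCII decimal digit matches \d and \p{Nd}, but not 0-9 :)
matches("&#x663;", "^\d$"),       (: true :)
matches("&#x663;", "^\p{Nd}$"),   (: true :)
matches("&#x663;", "^[0-9]$")     (: false :)
```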

Block Escapes

Unicode defines a numeric codepoint for each character. Each range of characters is represented by a block name, also defined by Unicode. For example, characters 0000 through 007F are known as Basic Latin. Table 19-9 lists the first few block ranges as an example. For a complete, updated list, see the blocks file of the Unicode standard at http://www.unicode.org/Public/UNIDATA/Blocks.txt.

Table 19-9. Partial list of Unicode block names
Start code   End code   Block name (with spaces removed)
#x0000       #x007F     BasicLatin
#x0080       #x00FF     Latin-1Supplement
#x0100       #x017F     LatinExtended-A
#x0180       #x024F     LatinExtended-B
...          ...        ...

Block escapes can be used to refer to these character ranges in regular expressions. They take the form \p{IsXX}, with XX representing the Unicode block name with all spaces removed. For example, \p{IsLatin-1Supplement} matches any one character in the range #x0080 to #x00FF. As with category escapes, you can use an uppercase P to match characters not in the block. For example, \P{IsLatin-1Supplement} matches any character outside of that range.

Table 19-10 provides examples of representing groups of characters in regular expressions.

Table 19-10. Representing groups of characters
Regular expression   Strings that match   Strings that do not match   Comment
f\d                  f0, f1               f, f01                      Multi-character escape
f\d*                 f, f0, f012          ff                          Multi-character escape
f\s*o                fo, f o              foo                         Multi-character escape
\p{Ll}               a, b                 A, B, 1, 2                  Category escape
\P{Ll}               A, B, 1, 2           a, b                        Category escape
\p{L}                a, b, A, B           1, 2                        Category escape
\P{L}                1, 2                 a, b, A, B                  Category escape
\p{IsBasicLatin}     a, b                 â, ß                        Block escape
\P{IsBasicLatin}     â, ß                 a, b                        Block escape
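
The block escape rows can be checked as follows. This is only a sketch; the character &#xE2; (a with a circumflex) is used because its codepoint, #x00E2, falls in the Latin-1 Supplement block rather than in Basic Latin.

```xquery
matches("a", "^\p{IsBasicLatin}$"),                (: true :)
matches("&#xE2;", "^\p{IsBasicLatin}$"),           (: false :)
matches("&#xE2;", "^\P{IsBasicLatin}$"),           (: true :)
matches("&#xE2;", "^\p{IsLatin-1Supplement}$")     (: true :)
```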

Character Class Expressions

Character class expressions, which are enclosed in square brackets, indicate a choice among several characters. These characters can be listed singly, expressed as a range of characters, or expressed as a combination of the two.

Single Characters and Ranges

To specify a choice of several characters, you can simply list them inside square brackets. For example, [def] matches d or e or f. To match multiple occurrences of these letters, you can use a quantifier with a character class expression, as in [def]*, which will match not only defdef, but eddfefd as well. The characters listed can also be any of the escapes described earlier in this chapter. The expression [\p{Ll}\d] matches either a lowercase letter or a digit.

It is also possible to specify a range of characters by separating the starting and ending characters with a hyphen. For example, [a-z] matches any letter from a to z. The endpoints of the range must be single characters or single character escapes (not multi-character escapes such as \d).

You can specify more than one range in the same character class expression, which means that it matches a character in any of the ranges. The expression [a-zA-Z0-9] matches one character that is either between a and z, or between A and Z, or a digit from 0 to 9. Unicode codepoints are used to determine whether a character is in the range.

Ranges and single characters can be combined in any order. For example, [abc0-9] matches either a letter a, b, or c, or a digit from 0 to 9. This regular expression could also be expressed as [0-9abc] or [a0-9bc].

Subtraction from a Range

Subtraction allows you to express that you want to match a range of characters but leave a few out. For example, [a-z-[jkl]] matches any character from a to z except j, k, or l. A hyphen (-) precedes the character group to be subtracted, which is itself enclosed in square brackets. Like any character class expression, the subtracted group can be a list of single characters or ranges, or both. The expression [a-z-[j-l]] has the same meaning as the previous example. You can also subtract from a category escape, for example, [\p{Lu}-[ABC]].

Negative Character Class Expressions

It is also possible to specify a negative character class expression, meaning that a string should not match any of the characters specified. This is accomplished using the ^ character after the left square bracket. For example, [^a-z] matches any character that is not a letter from a to z. Any character class expression can be negated, including those that specify single characters, ranges, or a combination of the two. The negation applies to the entire character class expression, so [^a-z0-9] will match anything that is not a letter from a to z and also not a digit from 0 to 9.

Some examples of character class expressions are shown in Table 19-11.

Table 19-11. Character class expression examples
Regular expression   Strings that match   Strings that do not match   Comment
[def]                d, e, f              def                         Single characters
[def]*               d, eee, dfed         a, b                        Single characters, repeating
[\p{Ll}\d]           a, b, 1              A, B                        Single characters with escapes
[d-f]                d, e, f              a, D                        Range of characters
[0-9d-fD-F]          3, d, F              a, 3dF                      Multiple ranges
[0-9stu]             4, 9, t              a, 4t                       Range plus single characters
[s-u\d]              4, 9, t              a, t4                       Range plus multi-character escape
[a-x-[f]]            a, d, x              f, 2                        Subtracting from a range
[a-x-[fg]]           a, d, x              f, g, 2                     Subtracting from a range
[a-x-[e-g]]          a, d, x              e, g, 2                     Subtracting from a range with a range
[^def]               a, g, 2              d, e, f                     Negating single characters
[^\[]                a, b, c              [                           Negating a single-character escape
[^\d]                d, E                 1, 2, 3                     Negating a multi-character escape
[^a-cj-l]            d, 4                 b, j, l                     Negating a range
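
Subtraction and negation are the least familiar of these forms, so a short check may help. The sketch below anchors each pattern so that exactly one character is tested.

```xquery
(: subtraction: a to z, except j, k, and l :)
matches("k", "^[a-z-[jkl]]$"),        (: false :)
matches("m", "^[a-z-[jkl]]$"),        (: true :)
(: subtraction from a category escape: uppercase letters except A, B, C :)
matches("B", "^[\p{Lu}-[ABC]]$"),     (: false :)
matches("D", "^[\p{Lu}-[ABC]]$"),     (: true :)
(: negation of a multi-character escape: any character that is not a digit :)
matches("x", "^[^\d]$"),              (: true :)
matches("7", "^[^\d]$")               (: false :)
```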

Escaping Rules for Character Class Expressions

Special escaping rules apply to character class expressions. They are:

  • The characters [, ], \, and - should be escaped when included as single characters.

  • The character \ must be escaped if it is the lower bound of the range.

  • The characters [ and \ must be escaped if one of them is the upper bound of the range.

  • The character ^ must be escaped only if it appears first in the character class expression, directly after the opening bracket ([).

The other metacharacters do not need to be escaped when used in a character class expression, because they have no special meaning in that context. This includes the period character, which does not serve as a wildcard escape character when it appears inside a character class expression. However, it is never an error to escape any of the metacharacters, and getting into the habit of always escaping them eliminates the need to remember these rules.

Reluctant Quantifiers

XQuery supports reluctant quantifiers, which allow part of a regular expression to match the shortest possible string. Reluctant quantifiers are indicated by adding a question mark (?) to the end of any of the kinds of quantifiers identified in Table 19-1.

For example, given the string reluctant and the regular expression r.*t, the regular expression could match reluct or reluctant. Since a standard quantifier (*) is used, the match is on the longest possible string, reluctant. If the regular expression were r.*?t instead, which uses a reluctant quantifier, it would match reluct, the shorter of the two strings.

Reluctant quantifiers come into play when replacing matching values in a string. Table 19-12 shows some examples of calls to the replace function that use reluctant and non-reluctant quantifiers.

Table 19-12. Reluctant versus non-reluctant quantifiers
Example                               Return value
replace("reluctant", "r.*t", "X")     X
replace("reluctant", "r.*?t", "X")    Xant
replace("aaah", "a{2,3}", "X")        Xh
replace("aaah", "a{2,3}?", "X")       Xah
replace("aaaah", "a{2,3}", "X")       Xah
replace("aaaah", "a{2,3}?", "X")      XXh

Reluctant quantifiers have no effect on simply determining whether a string matches a regular expression, which explains why they are not supported in XML Schema. It may seem that the regular expression r.*?tly would not match the string reluctantly because r.*?t would match the shorter string reluct, leaving an extra antly which does not match the pattern ly. However, this is not the way it works. Reluctant quantifiers do not indicate that only the shorter string matches, just that the processor uses the shorter of the two matches if called on to perform a replacement or some other operation. Any of the quantifiers in the examples in Table 19-2 could be replaced by reluctant quantifiers, and the list of matching and non-matching strings would be the same.
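
The reluctantly case can be tried out directly. In this sketch, the first call shows that the whole string still matches despite the reluctant quantifier, and the other two show where the quantifier does make a difference.

```xquery
(: the reluctant quantifier still extends as far as needed for a match :)
matches("reluctantly", "^r.*?tly$"),    (: true :)
(: in a replacement, the shortest candidate (reluct) is used :)
replace("reluctantly", "r.*?t", "X"),   (: "Xantly" :)
(: the standard quantifier takes the longest candidate (reluctant) :)
replace("reluctantly", "r.*t", "X")     (: "Xly" :)
```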

Anchors

XQuery adds the concept of anchors to XML Schema regular expressions. In XML Schema validation, the regular expression is expected to match the entire string, not a part of it. For example, the regular expression str matches only the string str and not other strings that contain str, like 5str5. The XQuery matches function, on the other hand, will return true if any part of the string matches the pattern. For example, matches("5str5", "str") returns true because a portion of the string matches the regular expression str.

Because of this looser interpretation, it is sometimes useful to explicitly say that the expression should match the beginning or end of the string (or both). Anchors can be used for this purpose. The ^ character is used to match the beginning of the string, and the $ character is used to match the end of the string. For example, the regular expression ^str specifies that a matching string must start with str. Table 19-13 shows some examples that use anchors with the matches and replace functions.

Table 19-13. Using anchors
Example                         Return value
matches("string", "^str")       true
matches("string", "^ing")       false
matches("string", "str$")       false
matches("string", "ing$")       true
matches("string", "^s.*g$")     true
matches("string", "rin")        true
replace("aaaha", "^a", "X")     Xaaha
replace("aaaha", "^a+", "X")    Xha
replace("aaaha", "a", "X")      XXXhX
replace("aaaha", "a$", "X")     aaahX

XQuery functions that use regular expressions allow you to indicate that the processor should operate in multi-line mode. In multi-line mode, anchors match not just the beginning and end of the entire string, but also the beginning and end of any line within the string, as indicated by a line-feed character (#xA). This is specified using the letter m in the $flags argument, as described in “Using Flags”.
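
A two-line string makes the effect of multi-line mode visible. The variable name and its content in this sketch are made up for illustration.

```xquery
(: $text is two lines separated by a line feed :)
let $text := "first&#xA;second"
return (
  matches($text, "^second"),        (: false: ^ matches only at the start of the string :)
  matches($text, "^second", "m"),   (: true: ^ also matches after the line feed :)
  matches($text, "first$"),         (: false: $ matches only at the end of the string :)
  matches($text, "first$", "m")     (: true: $ also matches before the line feed :)
)
```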

Back-References

XQuery supports the use of back-references. Back-references allow you to ensure that certain characters in a string match each other. For example, suppose you want to ensure that a string is a product number delimited by either single or double quotes. The product number must be three digits, followed by a hyphen, followed by two uppercase letters. You could write the expression:

('|")\d{3}-[A-Z]{2}('|")

However, this would allow a string that starts with a single quote and ends with a double quote. You want to be sure the quotes match. You could write the expression:

'\d{3}-[A-Z]{2}'|"\d{3}-[A-Z]{2}"

but this requires repeating the entire pattern for the product number. Instead, you can parenthesize the expression representing the quotes and refer back to it by using an escaped digit. For example, the expression:

('|")\d{3}-[A-Z]{2}\1

is equivalent to the prior example, but it is shorter and simpler. The atom \1 indicates that you want to repeat the first parenthesized expression, namely ('|"). The characters that match the first parenthesized expression must be the same characters that match the back-reference. This means that the regular expression does not match a string that starts with a single quote and ends with a double quote.

The parenthesized sub-expressions are numbered in order from left to right based on the position of the opening parenthesis, starting with 1 (not 0). You can reference any of them by number. You can use as many digits as you want, provided that the number does not exceed the number of sub-expressions preceding it.
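
The product number pattern can be tested as shown in this sketch. The quotes are written as predefined entity references (&apos; and &quot;) only to avoid clashing with the string delimiters, and the ^ and $ anchors are an addition so that the whole string must match.

```xquery
(: the quote matched by group 1 must reappear at \1 :)
let $pattern := "^('|&quot;)\d{3}-[A-Z]{2}\1$"
return (
  matches("&apos;123-AB&apos;", $pattern),   (: true: single quotes on both sides :)
  matches("&quot;123-AB&quot;", $pattern),   (: true: double quotes on both sides :)
  matches("&apos;123-AB&quot;", $pattern)    (: false: the quotes differ :)
)
```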

Using Flags

Four XQuery functions use regular expressions: matches, replace, tokenize, and analyze-string. Each of these functions accepts a $flags argument that allows for additional options in the interpretation of the regular expression, such as multi-line processing and case insensitivity. Options are indicated by single letters; the $flags argument is a string that can contain any of the valid letters in any order, and duplicates are allowed.

The $flags argument allows five options:

s

The letter s indicates dot-all mode, which affects the period wildcard (.). (This is known as single-line mode in Perl.) This means that the period wildcard matches any character whatsoever, including line-feed and carriage return characters. If the letter s is not specified, the period wildcard matches any character except the line-feed (#xA) or carriage return (#xD) character.

m

The letter m indicates multi-line mode, which affects anchors. In multi-line mode, the ^ and $ characters match the beginning and end of a line, as well as the beginning and end of the whole string. For the purposes of the m flag, the line-feed (#xA) character delimits lines.

i

The letter i indicates case-insensitive mode. This means that matching does not distinguish between normal characters that are case variants of each other, as defined by Unicode. For example, in case-insensitive mode, [a-z] matches the lowercase letters a through z, uppercase letters A through Z, and a few other characters such as a Kelvin sign. The meaning of category escapes such as p{Lu} is not affected.

x

The letter x indicates that whitespace characters within regular expressions should be ignored. This is useful for making long regexes readable by splitting over many lines. If x is not specified, whitespace characters are considered to be significant and must match those in the string. If you want to represent significant whitespace when using the x flag, you can use the multi-character escape \s.

q

The letter q means that all metacharacters (such as [ or *, that would normally need to be escaped to be interpreted literally) are taken literally even without being escaped. For example, a.b would normally mean “a followed by any character followed by b,” but if used with the q flag, it will only match “a followed by a period followed by b.” This can be useful if you want to make simple text replacements without worrying about regular expression characters. The q flag can be used with the i flag, but if it is used with any of the other flags, it renders them ineffective.

If no flag options are desired, you should either pass a zero-length string, or omit the $flags argument entirely. Table 19-14 shows some examples of how the $flags argument, which is the third argument of the matches function, affects interpretation of regular expressions. They assume the following variable declaration (where the line break is significant):

declare variable $address := "123 Main Street
Traverse City, MI 49684";

Table 19-14. Examples of the $flags argument
Example                                      Return value
matches($address, "Street.*City")            false
matches($address, "Street.*City", "s")       true
matches($address, "Street$")                 false
matches($address, "Street$", "m")            true
matches($address, "street")                  false
matches($address, "street", "i")             true
matches($address, "Main Street")             true
matches($address, "Main Street", "x")        false
matches($address, "Main \s Street", "x")     true
matches($address, "[0-9]+")                  true
matches($address, "[0-9]+", "q")             false
matches($address, "street$", "im")           true
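
The flags work the same way with the other functions. The sketch below uses made-up inputs with replace and tokenize, where $flags is the fourth and third argument respectively; the q flag assumes a processor that supports version 3.0 or later.

```xquery
(: x flag: whitespace in the pattern is ignored, so it can be spaced out :)
matches("123-AB", "^ \d{3} - [A-Z]{2} $", "x"),   (: true :)
(: i flag: the delimiter x also matches X :)
tokenize("aXbxc", "x", "i"),                      (: ("a", "b", "c") :)
(: without q, the period is a wildcard and every character is replaced :)
replace("a.b.c", ".", "-"),                       (: "-----" :)
(: with q, the period is taken literally :)
replace("a.b.c", ".", "-", "q")                   (: "a-b-c" :)
```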

Using Sub-Expressions with Replacement Variables

The replace function allows parenthesized sub-expressions (also known as groups) to be referenced by number in the replacement string. In the $replacement string, you can use the variables $1, $2, $3, etc., to represent (in order) the parenthesized expressions in $pattern. This is very useful when replacing strings on the condition that they come directly before or after another string—for example, if you want to change instances of the word Chap to the word Sec, but only those that are followed by a space and a digit. This technique can also be used to reformat data for presentation. Table 19-15 shows some examples.

Table 19-15. Examples of using replacement variables
Example                                                                     Return value
replace("Chap 2...Chap 3...Chap 4...", "Chap (\d)", "Sec $1.0")             Sec 2.0...Sec 3.0...Sec 4.0...
replace("abc123", "([a-z])", "$1x")                                         axbxcx123
replace("2315551212", "(\d{3})(\d{3})(\d{4})", "($1) $2-$3")                (231) 555-1212
replace("2015-10-18", "\d{2}(\d{2})-(\d{2})-(\d{2})", "$2/$3/$1")           10/18/15
replace("25", "(\d+)", "\$$1.00")                                           $25.00

The variables are bound in order from left to right based on the position of the opening parenthesis. The variable $0 can be used to represent the string matched by the entire regular expression. If the variable number exceeds the number of parenthesized sub-expressions in the regular expression, it is replaced with a zero-length string.

If you wish to include the character $ in your replacement string, you must escape it with a backslash (i.e., \$), as shown in the fifth example. Backslashes must also be escaped in the $replacement string, as in \\.
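
Two of these rules are not covered by the table: the $0 variable and the escaped backslash. A minimal sketch, using made-up input strings:

```xquery
(: $0 stands for the whole match, so no parentheses are needed in the pattern :)
replace("abc", "b", "[$0]"),    (: "a[b]c" :)
(: a doubled backslash in the replacement string produces one backslash :)
replace("a/b", "/", "\\")       (: "a\b" :)
```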

Starting in version 3.0, if you have a set of parentheses that you don’t want to count for the purpose of sub-expressions, you can make it a non-capturing group by putting a ?: directly after the left parenthesis. For example, without non-capturing groups, the following query only refers to the second, third, and fourth groups.

replace("2015-10-18", "(\d{2})(\d{2})-(\d{2})-(\d{2})", "$3/$4/$2")

If the first group is changed to be a non-capturing group, the variable numbers need to change, as in:

replace("2015-10-18", "(?:\d{2})(\d{2})-(\d{2})-(\d{2})", "$2/$3/$1")

With this change, the first group is no longer counted as the first one.
