Character Class Expressions

Character class expressions, which are enclosed in square brackets, indicate a choice among several characters. These characters can be listed singly, expressed as a range of characters, or expressed as a combination of the two.

Single Characters and Ranges

To specify a choice of several characters, you can simply list them inside square brackets. For example, [def] matches d or e or f. To match multiple occurrences of these letters, you can use a quantifier with a character class expression, as in [def]*, which will match not only defdef, but eddfefd as well. The characters listed can also be any of the escapes described earlier in this chapter. The expression [\p{Ll}\d] matches either a lowercase letter or a digit.
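As a rough illustration, the examples above can be checked with Python's standard re module. This is only a sketch under stated assumptions: Python's regex flavor is not the one described in this chapter (it has no \p{Ll} escape, so only \d is shown), re.fullmatch is used on the assumption that the whole string must match, and the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# [def] matches exactly one character: d, e, or f.
print(matches(r"[def]", "e"))          # True
print(matches(r"[def]", "def"))        # False: three characters, not one

# With a quantifier, the choice is made again for every character.
print(matches(r"[def]*", "defdef"))    # True
print(matches(r"[def]*", "eddfefd"))   # True
print(matches(r"[def]*", "abc"))       # False

# An escape such as \d can be listed alongside single characters.
print(matches(r"[def\d]", "7"))        # True
```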

It is also possible to specify a range of characters by separating the starting and ending characters with a hyphen. For example, [a-z] matches any letter from a to z. The endpoints of the range must be single characters or single-character escapes (not multi-character escapes such as \d).

You can specify more than one range in the same character class expression, which means that it matches a character in any of the ranges. The expression [a-zA-Z0-9] matches one character that is either between a and z, or between A and Z, or a digit from 0 to 9. Unicode code points are used to determine whether a character is in the range.

Ranges and single characters can be combined in any order. For example, [abc0-9] matches either a letter a, b, or c or a digit from 0 to 9. This regular expression could also be expressed as [0-9abc] or [a0-9bc].
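The behavior of ranges can be sketched in Python, whose re module also compares code points when deciding range membership. The helper in_class and the sample string are hypothetical, chosen only for illustration, and Python's flavor differs from the one described here in other respects.

```python
import re


def in_class(pattern: str, ch: str) -> bool:
    """Return True if the single character ch matches the character class."""
    return re.fullmatch(pattern, ch) is not None


# Range membership is decided by code point: a is 97, z is 122, A is 65.
print(ord("a"), ord("z"), ord("A"))    # 97 122 65
print(in_class(r"[a-z]", "A"))         # False: 65 lies outside 97..122

# Several ranges in one class: a character in any of them matches.
print(all(in_class(r"[a-zA-Z0-9]", c) for c in "qQ7"))   # True
print(in_class(r"[a-zA-Z0-9]", "_"))   # False

# Ranges and single characters can appear in any order.
samples = "abc0123456789dxyz-"
variants = [r"[abc0-9]", r"[0-9abc]", r"[a0-9bc]"]
results = [[in_class(p, c) for c in samples] for p in variants]
print(results[0] == results[1] == results[2])            # True
```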

Subtraction from a Range

Subtraction allows you to express that you want to match a range of characters but leave a few out. For example, [a-z-[jkl]] matches any character from a to z except j, k or l. A hyphen (-) precedes the character group to be subtracted, which is itself enclosed in square brackets. Like any character class expression, the subtracted group can be a list of single characters or ranges, or both. The expression [a-z-[j-l]] has the same meaning as the previous example. You can also subtract from a multi-character escape, for example [\p{Lu}-[ABC]].
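Python's re module has no subtraction syntax, so the sketch below only emulates the effect: a negative lookahead rules out the subtracted characters before the range is tried. The lookahead is a swapped-in technique, not part of the syntax described in this chapter, and the names used are hypothetical.

```python
import re
import string

# Emulates [a-z-[jkl]]: reject j, k, l first, then accept any of a..z.
A_TO_Z_EXCEPT_JKL = re.compile(r"(?![jkl])[a-z]")
# Emulates [a-z-[j-l]], where the subtracted group is itself a range.
A_TO_Z_EXCEPT_J_TO_L = re.compile(r"(?![j-l])[a-z]")


def accepted(regex: re.Pattern) -> set:
    """Collect the sample characters that the pattern matches in full."""
    candidates = string.ascii_lowercase + "JKL019"
    return {c for c in candidates if regex.fullmatch(c)}


expected = set(string.ascii_lowercase) - set("jkl")
print(accepted(A_TO_Z_EXCEPT_JKL) == expected)      # True
print(accepted(A_TO_Z_EXCEPT_J_TO_L) == expected)   # True
```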

Negative Character Class Expressions

It is also possible to specify a negative character class expression, meaning that a string should not match any of the characters specified. This is accomplished using the ^ character after the left square bracket. For example, [^a-z] matches any character that is not a letter from a to z. Any character class expression can be negated, including those that specify single characters, ranges, or a combination of the two. The negation applies to the entire character class expression, so [^a-z0-9] will match anything that is not a letter from a to z and also not a digit from 0 to 9.
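Negation can be tried out the same way. The following is a minimal Python sketch, again assuming whole-string matching with re.fullmatch and using a hypothetical helper named matches; Python's flavor is close to, but not identical with, the one described here.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# [^a-z] matches any single character that is not a letter from a to z.
print(matches(r"[^a-z]", "Q"))       # True
print(matches(r"[^a-z]", "q"))       # False

# The negation covers the whole class, so both ranges are excluded.
print(matches(r"[^a-z0-9]", "q"))    # False
print(matches(r"[^a-z0-9]", "7"))    # False
print(matches(r"[^a-z0-9]", "-"))    # True
print(matches(r"[^a-z0-9]", "Q"))    # True
```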

Some examples of character class expressions are shown in Table 18-11.

Table 18-11. Character class expression examples

| Regular expression | Strings that match | Strings that do not match | Comment |
|--------------------|--------------------|---------------------------|---------|
| [def]              | d, e, f            | def                       | Single characters |
| [def]*             | d, eee, dfed       | a, b                      | Single characters, repeating |
| [\p{Ll}\d]         | a, b, 1            | A, B                      | Single characters with escapes |
| [d-f]              | d, e, f            | a, D                      | Range of characters |
| [0-9d-fD-F]        | 3, d, F            | a, 3dF                    | Multiple ranges |
| [0-9stu]           | 4, 9, t            | a, 4t                     | Range plus single characters |
| [s-u\d]            | 4, 9, t            | a, t4                     | Range plus multi-character escape |
| [a-x-[f]]          | a, d, x            | f, 2                      | Subtracting from a range |
| [a-x-[fg]]         | a, d, x            | f, g, 2                   | Subtracting from a range |
| [a-x-[e-g]]        | a, d, x            | e, g, 2                   | Subtracting from a range with a range |
| [^def]             | a, g, 2            | d, e, f                   | Negating single characters |
| [^\[]              | a, b, c            | [                         | Negating a single-character escape |
| [^\d]              | d, E               | 1, 2, 3                   | Negating a multi-character escape |
| [^a-cj-l]          | d, 4               | b, j, l                   | Negating a range |

Escaping Rules for Character Class Expressions

Special escaping rules apply to character class expressions. They are:

  • The characters [, ], \, and - must be escaped when included as single characters.[*]

  • The character \ must be escaped if it is the lower bound of the range.

  • The characters [ and \ must be escaped if one of them is the upper bound of the range.

  • The character ^ must be escaped only if it appears first in the character class expression, directly after the opening bracket ([).

The other metacharacters do not need to be escaped when used in a character class expression, because they have no special meaning in that context. This includes the period character, which does not serve as a wildcard escape character when it appears inside a character class expression. However, it is never an error to escape any of the metacharacters, and getting into the habit of always escaping them eliminates the need to remember these rules.
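The escaping rules can be illustrated with a short Python sketch. Python's own escaping rules inside character classes are more lenient than those listed above, so the patterns below follow the always-escape habit and stay valid under either set of rules; the helper matches and the constant SPECIALS are hypothetical names.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# [, ], \ and - written as escaped single characters inside a class.
SPECIALS = r"[\[\]\\\-]"
print(all(matches(SPECIALS, c) for c in "[]\\-"))   # True
print(matches(SPECIALS, "a"))                       # False

# Inside a class the period is a literal period, not a wildcard.
print(matches(r"[.]", "."))    # True
print(matches(r"[.]", "x"))    # False
print(matches(r".", "x"))      # True: outside a class it is a wildcard

# ^ negates only directly after the opening bracket.
print(matches(r"[\^a]", "^"))  # True: escaped, so a literal ^
print(matches(r"[a^]", "^"))   # True: not first, so a literal ^
print(matches(r"[^a]", "a"))   # False: first, so it negates the class
```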



[*] There has been a lot of confusion about the rules for escaping "-" in successive corrections to the XML Schema recommendation, so there are variations between products, but it's always safe to escape it.
