Sometimes characters fall into convenient groups, such as decimal digits or punctuation characters. Three different kinds of escapes can be used to represent a group of characters: multi-character escapes, category escapes, and block escapes. Like single-character escapes, they all start with a backslash.
Multi-character escapes, listed in Table 18-7, represent groups of related characters. They are called multi-character escapes because they allow a choice of multiple characters. However, each escape represents only one character in a matching string. To allow several such characters in a row, use a quantifier such as +.
Table 18-7. Multi-character escapes
Escape | Meaning |
---|---|
\s | A whitespace character, as defined by XML (space, tab, carriage return, or line feed) |
\S | A character that is not a whitespace character |
\d | A decimal digit (0 to 9), or a digit in another style, for example, an Arabic-Indic digit |
\D | A character that is not a decimal digit |
\w | A "word" character, that is, any character not in one of the Unicode categories Punctuation, Separators, and Other |
\W | A nonword character, that is, any character in one of the Unicode categories Punctuation, Separators, and Other |
\i | A character that is allowed as the first character of an XML name, i.e., a letter, an underscore (_), or a colon (:); the "i" stands for "initial" |
\I | A character that cannot be the first character of an XML name |
\c | A character that can be part of an XML name, i.e., a letter, a digit, an underscore (_), a hyphen (-), a colon (:), or a period (.) |
\C | A character that cannot be part of an XML name |
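To make the "one escape, one character" rule concrete, here is a minimal sketch using Python's standard re module. Python is an assumption made purely for illustration: its regular expression flavor is not the one described in this chapter, it has no \i, \I, \c, or \C escapes, and its \s and \w differ in detail from the definitions above. Its \d does, however, match decimal digits from other scripts. The sketch uses fullmatch so that the whole string must match, which makes the effect of the quantifier visible.

```python
import re

# Python's re module stands in for the regex flavor described in the text
# (an assumption for illustration; the two flavors differ in details).

# A single \d represents exactly one digit character.
one_digit = re.compile(r"\d")
# Adding the + quantifier allows one or more digit characters.
many_digits = re.compile(r"\d+")

print(bool(one_digit.fullmatch("7")))       # True
print(bool(one_digit.fullmatch("42")))      # False
print(bool(many_digits.fullmatch("42")))    # True

# \d is not limited to 0-9: U+0663 is ARABIC-INDIC DIGIT THREE.
print(bool(one_digit.fullmatch("\u0663")))  # True

# Whitespace as XML defines it, written out as an explicit character class.
xml_space = re.compile(r"[ \t\r\n]")
print(bool(xml_space.fullmatch("\t")))      # True
print(bool(xml_space.fullmatch("\u00a0")))  # False (no-break space)
```

The explicit class [ \t\r\n] is used for whitespace because Python's own \s accepts more characters than the four that XML defines.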
The Unicode standard defines categories of characters based on their purpose. For example, there are categories for punctuation, uppercase letters, and currency symbols. These categories, listed in Table 18-8, can be referenced in regular expressions using category escapes.
Table 18-8. Unicode categories
Category | Property | Meaning |
---|---|---|
Letters | L | All letters |
 | Lu | Uppercase |
 | Ll | Lowercase |
 | Lt | Titlecase |
 | Lm | Modifier |
 | Lo | Other |
Marks | M | All marks |
 | Mn | Nonspacing |
 | Mc | Spacing combining |
 | Me | Enclosing |
Numbers | N | All numbers |
 | Nd | Decimal digit |
 | Nl | Letter |
 | No | Other |
Punctuation | P | All punctuation |
 | Pc | Connector |
 | Pd | Dash |
 | Ps | Open |
 | Pe | Close |
 | Pi | Initial quote |
 | Pf | Final quote |
 | Po | Other |
Separators | Z | All separators |
 | Zs | Space |
 | Zl | Line |
 | Zp | Paragraph |
Symbols | S | All symbols |
 | Sm | Math |
 | Sc | Currency |
 | Sk | Modifier |
 | So | Other |
Other | C | All others |
 | Cc | Control |
 | Cf | Format |
 | Co | Private use |
 | Cn | Not assigned |
Category escapes take the form \p{XX}, with XX representing the property listed in Table 18-8. For example, \p{Lu} matches any uppercase letter. Category escapes that use an uppercase P, as in \P{XX}, match all characters that are not in the category. For example, \P{Lu} matches any character that is not an uppercase letter.
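The meaning of a category escape can be sketched outside of any regular expression engine. The following Python example is an illustration only: Python's standard re module does not support \p{XX}, so the sketch looks up each character's general category with the standard unicodedata module instead. The helper names in_category and not_in_category are hypothetical and were chosen for this example.

```python
import unicodedata

def in_category(ch: str, prop: str) -> bool:
    # Emulates \p{prop} for a single character (hypothetical helper).
    # unicodedata.category returns a two-letter code such as "Lu" or "Nd",
    # so a one-letter property such as "L" covers every code that starts with it.
    return unicodedata.category(ch).startswith(prop)

def not_in_category(ch: str, prop: str) -> bool:
    # Emulates \P{prop}, the complement of \p{prop}.
    return not in_category(ch, prop)

print(in_category("A", "Lu"))      # True: uppercase letter
print(in_category("a", "Lu"))      # False: lowercase letter
print(not_in_category("a", "Lu"))  # True
print(in_category("a", "L"))       # True: any letter
print(in_category("$", "Sc"))      # True: currency symbol
print(in_category("-", "Pd"))      # True: dash punctuation
```

The results depend on the version of the Unicode database bundled with the Python installation, which may differ from the version a given XQuery processor uses.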
Note that category escapes cover all scripts, not just the Latin alphabet. If you intend for an expression to match only the capital letters A through Z, it is better to use [A-Z] than \p{Lu}, because \p{Lu} allows uppercase letters of every script. Likewise, if your intention is to allow only the decimal digits 0 through 9, use [0-9] rather than \p{Nd} or \d, because other scripts have their own decimal digits.
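The difference between the explicit ranges and the Unicode-wide escapes is easy to see with a few sample characters. This sketch again uses Python as a stand-in, which is an assumption: Python's re has no \p{Lu}, so uppercase letters are checked with unicodedata, while its \d (for text patterns) does match any Unicode decimal digit. The sample characters are arbitrary choices for illustration.

```python
import re
import unicodedata

ascii_upper = re.compile(r"[A-Z]")
ascii_digit = re.compile(r"[0-9]")
any_digit = re.compile(r"\d")  # any Unicode decimal digit in Python text patterns

# U+0041 (A), U+00C9 (Latin capital E with acute), U+0394 (Greek capital delta)
for ch in ["A", "\u00c9", "\u0394"]:
    is_lu = unicodedata.category(ch) == "Lu"         # what \p{Lu} would accept
    is_ascii = ascii_upper.fullmatch(ch) is not None  # what [A-Z] accepts
    print(f"U+{ord(ch):04X} Lu={is_lu} [A-Z]={is_ascii}")

# U+0035 (5), U+0665 (Arabic-Indic digit five), U+096B (Devanagari digit five)
for ch in ["5", "\u0665", "\u096b"]:
    is_nd = any_digit.fullmatch(ch) is not None
    is_ascii = ascii_digit.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X} digit={is_nd} [0-9]={is_ascii}")
```

All three letters are in category Lu, but only the first is accepted by [A-Z]. Similarly, all three digits are decimal digits, but only the first is accepted by [0-9].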
Unicode defines a numeric code point for each character. Each range of characters is represented by a block name, also defined by Unicode. For example, characters 0000 through 007F are known as Basic Latin. Table 18-9 lists the first five block escape ranges as an example. For a complete, updated list, see the blocks file of the Unicode standard at http://www.unicode.org/Public/UNIDATA/Blocks.txt.
Block escapes can be used to refer to these character ranges in regular expressions. They take the form \p{IsXX}, with XX representing the Unicode block name with all spaces removed. For example, \p{IsLatin-1Supplement} matches any one character in the range #x0080 to #x00FF. As with category escapes, you can use an uppercase P to match characters not in the block. For example, \P{IsLatin-1Supplement} matches any character outside of that range.
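Because a block is simply a contiguous range of code points, a block escape amounts to a range check. The Python sketch below illustrates this under stated assumptions: the BLOCKS table and the helper names in_block and not_in_block are hypothetical, and the table lists only the two blocks whose ranges are given in the text, with names written the way block escapes expect them (spaces removed).

```python
# Hypothetical lookup table for illustration; only two blocks are listed.
# Each value is an inclusive (first, last) pair of code points.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
}

def in_block(ch: str, block: str) -> bool:
    # Emulates \p{IsXX} for a single character (hypothetical helper).
    first, last = BLOCKS[block]
    return first <= ord(ch) <= last

def not_in_block(ch: str, block: str) -> bool:
    # Emulates \P{IsXX}, the complement of \p{IsXX}.
    return not in_block(ch, block)

print(in_block("\u00e9", "Latin-1Supplement"))  # True: U+00E9 is in the range
print(in_block("e", "Latin-1Supplement"))       # False: U+0065 is below it
print(in_block("e", "BasicLatin"))              # True
print(not_in_block("e", "Latin-1Supplement"))   # True
```

A fuller version would load every range from the Unicode blocks file rather than hard-coding two entries.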
Table 18-10 provides examples of representing groups of characters in regular expressions.
Table 18-10. Representing groups of characters
Regular expression | Strings that match | Strings that do not match | Comment |
---|---|---|---|
 | | | multi-character escape |
 | | | multi-character escape |
 | | | multi-character escape |
 | | | category escape |
 | | | category escape |
 | | | category escape |
 | | | category escape |
 | | | block escape |
 | | | block escape |