Representing Groups of Characters

Sometimes characters fall into convenient groups, such as decimal digits or punctuation characters. Three different kinds of escapes can be used to represent a group of characters: multi-character escapes, category escapes, and block escapes. Like single-character escapes, they all start with a backslash.

Multi-Character Escapes

Multi-character escapes, listed in Table 18-7, represent groups of related characters. They are called multi-character escapes because each one allows a choice among multiple characters. However, each escape matches only one character in the string being tested. To allow several such characters in a row, use a quantifier such as +.

Table 18-7. Multi-character escapes

Escape  Meaning
\s      A whitespace character, as defined by XML (space, tab, carriage return, or line feed)
\S      A character that is not a whitespace character
\d      A decimal digit (0 to 9), or a digit in another style, for example, an Arabic-Indic digit
\D      A character that is not a decimal digit
\w      A "word" character, that is, any character not in one of the Unicode categories Punctuation, Separators, and Other
\W      A nonword character, that is, any character in one of the Unicode categories Punctuation, Separators, and Other
\i      A character that is allowed as the first character of an XML name, i.e., a letter, an underscore (_), or a colon (:); the "i" stands for "initial"
\I      A character that cannot be the first character of an XML name
\c      A character that can be part of an XML name, i.e., a letter, a digit, an underscore (_), a hyphen (-), a colon (:), or a period (.)
\C      A character that cannot be part of an XML name
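The one-character-per-escape rule is easy to try out. The following is a minimal sketch that uses Python's re module as a stand-in, which is an assumption: Python does not implement the regular expression dialect described in this chapter, its \s and \w cover slightly different sets of characters, and it has no \i or \c. The helper name matches is hypothetical; it uses re.fullmatch because the expressions here are tested against the whole string.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# \d stands for exactly one digit, so "f01" needs a quantifier.
print(matches(r"f\d", "f0"))      # True
print(matches(r"f\d", "f01"))     # False
print(matches(r"f\d+", "f01"))    # True

# \d is not limited to 0-9: U+0663 (ARABIC-INDIC DIGIT THREE) is a decimal digit too.
print(matches(r"\d", "\u0663"))   # True

# \s* allows any amount of whitespace, including none.
print(matches(r"f\s*o", "f \t o"))  # True
print(matches(r"f\s*o", "foo"))     # False
```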

Category Escapes

The Unicode standard defines categories of characters based on their purpose. For example, there are categories for punctuation, uppercase letters, and currency symbols. These categories, listed in Table 18-8, can be referenced in regular expressions using category escapes.

Table 18-8. Unicode categories

Category     Property  Meaning
Letters      L         All letters
             Lu        Uppercase
             Ll        Lowercase
             Lt        Titlecase
             Lm        Modifier
             Lo        Other
Marks        M         All marks
             Mn        Nonspacing
             Mc        Spacing combining
             Me        Enclosing
Numbers      N         All numbers
             Nd        Decimal digit
             Nl        Letter
             No        Other
Punctuation  P         All punctuation
             Pc        Connector
             Pd        Dash
             Ps        Open
             Pe        Close
             Pi        Initial quote
             Pf        Final quote
             Po        Other
Separators   Z         All separators
             Zs        Space
             Zl        Line
             Zp        Paragraph
Symbols      S         All symbols
             Sm        Math
             Sc        Currency
             Sk        Modifier
             So        Other
Other        C         All others
             Cc        Control
             Cf        Format
             Co        Private use
             Cn        Not assigned

Category escapes take the form \p{XX}, with XX representing the property listed in Table 18-8. For example, \p{Lu} matches any uppercase letter. Category escapes that use an uppercase P, as in \P{XX}, match all characters that are not in the category. For example, \P{Lu} matches any character that is not an uppercase letter.

Note that the category escapes cover all alphabets, not just the Latin one. If you intend for an expression to match only the capital letters A through Z, it is better to use [A-Z] than \p{Lu}, because \p{Lu} allows uppercase letters of all character sets. Likewise, if your intention is to allow only the decimal digits 0 through 9, use [0-9] rather than \p{Nd} or \d, because there are decimal digits other than 0 through 9 in other character sets.
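The difference between a category and an explicit range can be checked directly against the Unicode character database. The following sketch is written in Python as an assumption, and Python's built-in re module has no \p{...} escapes, so the category test is emulated with unicodedata.category. The function name in_category is hypothetical and handles one character at a time.

```python
import re
import unicodedata

def in_category(ch: str, prop: str) -> bool:
    # Emulates \p{prop} for a single character; prop is a one-letter
    # category such as "L" or a two-letter property such as "Lu".
    return unicodedata.category(ch).startswith(prop)

# The category Lu covers uppercase letters from every alphabet...
print(in_category("A", "Lu"))        # True
print(in_category("\u00c9", "Lu"))   # True  (Latin capital E with acute)
print(in_category("\u03a3", "Lu"))   # True  (Greek capital sigma)

# ...while [A-Z] accepts only the 26 basic Latin capitals.
print(re.fullmatch("[A-Z]", "\u00c9") is not None)   # False

# The same applies to Nd versus [0-9].
print(in_category("\u0663", "Nd"))                   # True  (Arabic-Indic digit three)
print(re.fullmatch("[0-9]", "\u0663") is not None)   # False
```

Negating the result, as in not in_category(ch, "Lu"), corresponds to the uppercase-P form of the escape.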

Block Escapes

Unicode defines a numeric code point for each character. Each range of characters is represented by a block name, also defined by Unicode. For example, characters 0000 through 007F are known as Basic Latin. Table 18-9 lists the first five blocks as an example. For a complete, updated list, see the blocks file of the Unicode standard at http://www.unicode.org/Public/UNIDATA/Blocks.txt.

Table 18-9. Partial list of Unicode block names

Start code  End code  Block name (with spaces removed)
#x0000      #x007F    BasicLatin
#x0080      #x00FF    Latin-1Supplement
#x0100      #x017F    LatinExtended-A
#x0180      #x024F    LatinExtended-B
#x0250      #x02AF    IPAExtensions
...         ...       ...

Block escapes can be used to refer to these character ranges in regular expressions. They take the form \p{IsXX}, with XX representing the Unicode block name with all spaces removed. For example, \p{IsLatin-1Supplement} matches any one character in the range #x0080 to #x00FF. As with category escapes, you can use an uppercase P to match characters not in the block. For example, \P{IsLatin-1Supplement} matches any character outside of that range.
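Because a block is just a contiguous range of code points, a block escape amounts to a range check. The following sketch, written in Python as an assumption, emulates that check using the five ranges from Table 18-9. The names BLOCKS and in_block are hypothetical, and a real implementation would load the full list from the Unicode blocks file rather than hard-coding it.

```python
# Ranges copied from Table 18-9; block names have their spaces removed.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "LatinExtended-A": (0x0100, 0x017F),
    "LatinExtended-B": (0x0180, 0x024F),
    "IPAExtensions": (0x0250, 0x02AF),
}

def in_block(ch: str, block: str) -> bool:
    # Emulates \p{IsXX} for a single character by comparing its code point
    # with the start and end of the named block.
    start, end = BLOCKS[block]
    return start <= ord(ch) <= end

print(in_block("a", "BasicLatin"))              # True
print(in_block("\u00e2", "BasicLatin"))         # False (a with circumflex)
print(in_block("\u00e2", "Latin-1Supplement"))  # True
print(in_block("\u00df", "Latin-1Supplement"))  # True  (sharp s)
```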

Table 18-10 provides examples of representing groups of characters in regular expressions.

Table 18-10. Representing groups of characters

Regular expression  Strings that match  Strings that do not match  Comment
f\d                 f0, f1              f, f01                     multi-character escape
f\d*                f, f0, f012         ff                         multi-character escape
f\s*o               fo, f o             foo                        multi-character escape
\p{Ll}              a, b                A, B, 1, 2                 category escape
\P{Ll}              A, B, 1, 2          a, b                       category escape
\p{L}               a, b, A, B          1, 2                       category escape
\P{L}               1, 2                a, b, A, B                 category escape
\p{IsBasicLatin}    a, b                â, ß                       block escape
\P{IsBasicLatin}    â, ß                a, b                       block escape
