Representing Groups of Characters

Sometimes characters fall into convenient groups, such as decimal digits or punctuation characters. Three different kinds of escapes can be used to represent a group of characters: multi-character escapes, category escapes, and block escapes. Like single-character escapes, they all start with a backslash.

Multi-Character Escapes

Multi-character escapes, listed in Table 18-7, represent groups of related characters. They are called multi-character escapes because each one allows a choice among multiple characters. However, each escape matches only one character in the string. To match several such characters in a row, follow the escape with a quantifier such as +.

Table 18-7. Multi-character escapes

Escape	Meaning
`\s`	A whitespace character, as defined by XML (space, tab, carriage return, or line feed)
`\S`	A character that is not a whitespace character
`\d`	A decimal digit: 0 to 9, or a decimal digit from another script, for example, an Arabic-Indic digit
`\D`	A character that is not a decimal digit
`\w`	A "word" character, that is, any character not in one of the Unicode categories Punctuation, Separators, and Other
`\W`	A nonword character, that is, any character in one of the Unicode categories Punctuation, Separators, and Other
`\i`	A character that is allowed as the first character of an XML name, i.e., a letter, an underscore (_), or a colon (:); the "i" stands for "initial"
`\I`	A character that cannot be the first character of an XML name
`\c`	A character that can be part of an XML name, i.e., a letter, a digit, an underscore (_), a hyphen (-), a colon (:), or a period (.)
`\C`	A character that cannot be part of an XML name
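
The point that each escape stands for exactly one character, and that a quantifier is needed for more, can be tried out with a minimal sketch in Python. This is only an approximation: Python's re module is not the XML Schema regular expression engine, its \s and \w cover slightly different character sets, and it has no equivalent of \i or \c. The helper name matches is an assumption made for this illustration. Because XQuery regular expressions used in schema patterns must match the whole string, the sketch uses re.fullmatch.

```python
import re

# Hypothetical helper for this sketch: whole-string matching, which is
# how pattern facets behave. Python's engine is only an approximation.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text) is not None

print(matches(r"\d", "7"))              # True: one decimal digit
print(matches(r"\d", "42"))             # False: \d stands for a single character
print(matches(r"\d+", "42"))            # True: the quantifier allows several digits
print(matches(r"\d+", "\u0663\u0664"))  # True: Arabic-Indic digits are decimal digits too
print(matches(r"\s+", " \t\n"))         # True: space, tab, line feed
print(matches(r"\S", " "))              # False: \S excludes whitespace
```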

Category Escapes

The Unicode standard defines categories of characters based on their purpose. For example, there are categories for punctuation, uppercase letters, and currency symbols. These categories, listed in Table 18-8, can be referenced in regular expressions using category escapes.

Table 18-8. Unicode categories

Category	Property	Meaning
Letters	`L`	All letters
	`Lu`	Uppercase
	`Ll`	Lowercase
	`Lt`	Titlecase
	`Lm`	Modifier
	`Lo`	Other
Marks	`M`	All marks
	`Mn`	Nonspacing
	`Mc`	Spacing combining
	`Me`	Enclosing
Numbers	`N`	All numbers
	`Nd`	Decimal digit
	`Nl`	Letter
	`No`	Other
Punctuation	`P`	All punctuation
	`Pc`	Connector
	`Pd`	Dash
	`Ps`	Open
	`Pe`	Close
	`Pi`	Initial quote
	`Pf`	Final quote
	`Po`	Other
Separators	`Z`	All separators
	`Zs`	Space
	`Zl`	Line
	`Zp`	Paragraph
Symbols	`S`	All symbols
	`Sm`	Math
	`Sc`	Currency
	`Sk`	Modifier
	`So`	Other
Other	`C`	All others
	`Cc`	Control
	`Cf`	Format
	`Co`	Private use
	`Cn`	Not assigned

Category escapes take the form \p{XX}, with XX representing the property listed in Table 18-8. For example, \p{Lu} matches any uppercase letter. Category escapes that use an uppercase P, as in \P{XX}, match all characters that are not in the category. For example, \P{Lu} matches any character that is not an uppercase letter.
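
The relationship between a category escape and its negated form can be sketched in Python. Python's built-in re module does not support category escapes, so this sketch emulates them for a single character with the standard unicodedata module, which reports the two-letter general category of a character. The function names in_category and not_in_category are assumptions made for this illustration.

```python
import unicodedata

# Emulates the lowercase-p form for one character. A one-letter property
# such as "L" covers every subcategory that starts with that letter
# (Lu, Ll, Lt, Lm, Lo), so a prefix test is enough.
def in_category(ch: str, prop: str) -> bool:
    return unicodedata.category(ch).startswith(prop)

# Emulates the uppercase-P form: everything the lowercase form rejects.
def not_in_category(ch: str, prop: str) -> bool:
    return not in_category(ch, prop)

print(in_category("A", "Lu"))       # True: uppercase letter
print(in_category("a", "Lu"))       # False: lowercase letter
print(in_category("a", "L"))        # True: any letter
print(not_in_category("1", "L"))    # True: a digit is not a letter
print(in_category("\u20ac", "Sc"))  # True: the euro sign is a currency symbol
print(in_category("-", "Pd"))       # True: hyphen-minus is dash punctuation
```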

Note that the category escapes cover all scripts, not just the Latin alphabet. If you intend for an expression to match only the capital letters A through Z, it is better to use [A-Z] than \p{Lu}, because \p{Lu} allows uppercase letters from every script. Likewise, if your intention is to allow only the decimal digits 0 through 9, use [0-9] rather than \p{Nd} or \d, because other scripts have their own decimal digits.
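
The difference between an explicit range and a category can be seen in a small Python sketch. It assumes the standard unicodedata module as a stand-in for the Lu and Nd categories, and the four helper names are hypothetical. The sample characters are Greek capital alpha, Cyrillic capital zhe, Arabic-Indic digit five, and Devanagari digit five.

```python
import re
import unicodedata

def is_ascii_upper(ch: str) -> bool:
    return re.fullmatch(r"[A-Z]", ch) is not None

def is_any_upper(ch: str) -> bool:
    # Stand-in for the Lu category: uppercase letters of every script.
    return unicodedata.category(ch) == "Lu"

def is_ascii_digit(ch: str) -> bool:
    return re.fullmatch(r"[0-9]", ch) is not None

def is_any_digit(ch: str) -> bool:
    # Stand-in for the Nd category: decimal digits of every script.
    return unicodedata.category(ch) == "Nd"

for ch in ["A", "\u0391", "\u0416"]:
    print(repr(ch), is_ascii_upper(ch), is_any_upper(ch))
# 'A' passes both tests; the Greek and Cyrillic capitals pass only the category test.

for ch in ["5", "\u0665", "\u096b"]:
    print(repr(ch), is_ascii_digit(ch), is_any_digit(ch))
# '5' passes both tests; the other two digits pass only the category test.
```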

Block Escapes

Unicode defines a numeric code point for each character and groups contiguous ranges of code points into named blocks. For example, code points #x0000 through #x007F make up the Basic Latin block. Table 18-9 lists the first five blocks as an example. For a complete, updated list, see the blocks file of the Unicode standard at http://www.unicode.org/Public/UNIDATA/Blocks.txt.

Table 18-9. Partial list of Unicode block names

Start code	End code	Block name (with spaces removed)
`#x0000`	`#x007F`	`BasicLatin`
`#x0080`	`#x00FF`	`Latin-1Supplement`
`#x0100`	`#x017F`	`LatinExtended-A`
`#x0180`	`#x024F`	`LatinExtended-B`
`#x0250`	`#x02AF`	`IPAExtensions`
`...`	`...`	`...`

Block escapes can be used to refer to these character ranges in regular expressions. They take the form \p{IsXX}, with XX representing the Unicode block name with all spaces removed. For example, \p{IsLatin-1Supplement} matches any one character in the range #x0080 to #x00FF. As with category escapes, you can use an uppercase P to match characters not in the block. For example, \P{IsLatin-1Supplement} matches any character outside of that range.
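
Since a block is just a range of code points, a block escape can be emulated for a single character with a range check. The Python sketch below hard-codes only the five ranges from Table 18-9; the BLOCKS dictionary and the in_block function are assumptions made for this illustration, not part of any library.

```python
# The five blocks from Table 18-9, keyed by block name with spaces removed.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "LatinExtended-A": (0x0100, 0x017F),
    "LatinExtended-B": (0x0180, 0x024F),
    "IPAExtensions": (0x0250, 0x02AF),
}

# Emulates the lowercase-p block escape for one character: True if its
# code point lies inside the named block. Negate the result to emulate
# the uppercase-P form.
def in_block(ch: str, block: str) -> bool:
    start, end = BLOCKS[block]
    return start <= ord(ch) <= end

print(in_block("a", "BasicLatin"))                # True
print(in_block("\u00e2", "BasicLatin"))           # False: a with circumflex is #x00E2
print(in_block("\u00e2", "Latin-1Supplement"))    # True
print(not in_block("a", "Latin-1Supplement"))     # True: outside #x0080 to #x00FF
```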

Table 18-10 provides examples of representing groups of characters in regular expressions.

Table 18-10. Representing groups of characters

Regular expression	Strings that match	Strings that do not match	Comment
`f\d`	`f0, f1`	`f, f01`	multi-character escape
`f\d*`	`f, f0, f012`	`ff`	multi-character escape
`f\s*o`	`fo, f o`	`foo`	multi-character escape
`\p{Ll}`	`a, b`	`A, B, 1, 2`	category escape
`\P{Ll}`	`A, B, 1, 2`	`a, b`	category escape
`\p{L}`	`a, b, A, B`	`1, 2`	category escape
`\P{L}`	`1, 2`	`a, b, A, B`	category escape
`\p{IsBasicLatin}`	`a, b`	`â, ß`	block escape
`\P{IsBasicLatin}`	`â, ß`	`a, b`	block escape
