More Atoms

Atoms that exactly match a character are the simplest atoms that can be used in a pattern facet. The other atoms that can be used in pattern facets are special characters, a wildcard that matches any character, or predefined and user-defined character classes.

Special Characters

Table 9-1 lists atoms that each match a single character, just like the characters you’ve already seen. They are written as escape sequences, either because the character has a special meaning in a pattern and must be escaped or (for the first three entries on the list) simply as a convenience.

Table 9-1. Special characters

Character    Description
\n           Newline (can also be written as &#x0A;, because it’s an XML document)
\r           Carriage return (can also be written as &#x0D;)
\t           Tabulation (can also be written as &#x09;)
\\           Character \
\|           Character |
\.           Character .
\-           Character -
\^           Character ^
\?           Character ?
\*           Character *
\+           Character +
\{           Character {
\}           Character }
\(           Character (
\)           Character )
\[           Character [
\]           Character ]

Wildcard

The dot character (.) has a special meaning: it’s a wildcard atom that matches any valid XML character except newline and carriage return. As with any atom, a dot may be followed by an optional quantifier; .* (dot, asterisk) is a common construct that matches zero or more occurrences of any character. To illustrate the usage of .* (and the fact that the pattern facet is a Swiss Army knife), a pattern facet can define the integers that are multiples of 10:

<define name="multipleOfTen">
  <data type="integer">
    <param name="pattern">.*0</param>
  </data>
</define>

or:

multipleOfTen = xsd:integer {pattern = ".*0"}
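As a rough illustration outside RELAX NG, the behavior of .*0 can be approximated with Python's re module. This is only a sketch: Python's regular expression dialect is not the W3C XML Schema one (its dot excludes only the newline by default), re.fullmatch stands in for the fact that a pattern facet must match the whole lexical value, and the helper name matches_pattern is hypothetical.

```python
import re

def matches_pattern(pattern: str, value: str) -> bool:
    """Return True if the whole value matches, as a pattern facet requires."""
    # fullmatch mimics the implicit anchoring of pattern facets.
    return re.fullmatch(pattern, value) is not None

# The pattern only constrains the last character...
print(matches_pattern(".*0", "120"))   # True
print(matches_pattern(".*0", "-30"))   # True
print(matches_pattern(".*0", "125"))   # False
# ...so on its own it also accepts a non-number; in the schema,
# the xsd:integer base type is what rules this value out.
print(matches_pattern(".*0", "abc0"))  # True
```

The last call shows why the pattern is good enough only in combination with the integer base type.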

Character Classes

W3C XML Schema has adopted the classical Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl), and user-defined classes are also available.

Classical Perl character classes

W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each class is designated by a backslash followed by a single letter; the classes designated by the upper- and lowercase versions of the same letter are complementary:

\s

Spaces. Matches XML whitespace (space #x20, tab #x09, linefeed #x0A, and carriage return #x0D).

\S

Characters that aren’t spaces.

\d

Digits (0 to 9, but also decimal digits in other scripts).

\D

Characters that aren’t digits.

\w

Extended “word” characters (any Unicode character not defined as punctuation, separator, or other). This conforms to the Perl definition, assuming UTF-8 support has been switched on.

\W

Nonword characters.

\i

XML 1.0 initial name characters (i.e., all the letters plus _ and :). This is a W3C XML Schema extension of Perl regular expressions.

\I

Characters that may not be used as an XML initial name character.

\c

XML 1.0 name characters (. : - plus initial name characters, digits, and the characters defined by Unicode as “combining” or “extender”). This is a W3C XML Schema extension of Perl regular expressions.

\C

Characters that can’t be used in an XML 1.0 name.

These character classes can be used with an optional quantifier like any other atom. The last pattern facet that you saw:

multipleOfTen = xsd:integer {pattern = ".*0"}

constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is xsd:integer is good enough for our purposes, but if the base type had been xsd:decimal (or xsd:string), you could be more restrictive and write:

multipleOfTen = xsd:integer {pattern = "-?\d*0"}

This syntax checks that the characters before the trailing zero are digits with an optional leading - (you’ll see later how to specify an optional leading - or +).
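To see what the stricter pattern -?\d*0 accepts by itself, here is a small Python sketch. It is an approximation under stated assumptions: Python's re module is not the schema's regular expression engine, re.fullmatch imitates whole-value matching, and the names MULTIPLE_OF_TEN and is_multiple_of_ten are made up for this illustration.

```python
import re

# Hypothetical stand-in for the pattern facet. For text strings, Python's \d
# also accepts decimal digits from other scripts, much like the schema's \d.
MULTIPLE_OF_TEN = re.compile(r"-?\d*0")

def is_multiple_of_ten(lexical: str) -> bool:
    """Check the whole lexical form against the pattern."""
    return MULTIPLE_OF_TEN.fullmatch(lexical) is not None

for candidate in ["0", "120", "-30", "1.0", "+10", "abc0"]:
    print(candidate, is_multiple_of_ten(candidate))
```

Unlike .*0, this pattern rejects 1.0 and abc0 without any help from the base type; it also rejects +10, because only a leading - is allowed.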

Unicode character classes

Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify the characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.); blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).

The syntax \p{Name} is the same for blocks and categories; the prefix Is is added to the name of a block to make the distinction (for example, \p{IsBasicLatin} is a block, and \p{Lu} is a category). The syntax \P{Name} is also available to select the characters that don’t match a block or category. A list of Unicode blocks and categories is given in the specification. Table 9-2 shows the Unicode character classes, and Table 9-3 shows the Unicode character blocks.

Table 9-2. Unicode character classes

Unicode character class    Includes
C                          Other characters (nonletters, nonsymbols, nonnumbers, nonseparators)
Cc                         Control characters
Cf                         Format characters
Cn                         Unassigned code points
Co                         Private-use characters
L                          Letters
Ll                         Lowercase letters
Lm                         Modifier letters
Lo                         Other letters
Lt                         Titlecase letters
Lu                         Uppercase letters
M                          All marks
Mc                         Spacing combining marks
Me                         Enclosing marks
Mn                         Nonspacing marks
N                          Numbers
Nd                         Decimal digits
Nl                         Number letters
No                         Other numbers
P                          Punctuation
Pc                         Connector punctuation
Pd                         Dashes
Pe                         Closing punctuation
Pf                         Final quotes (may behave like Ps or Pe)
Pi                         Initial quotes (may behave like Ps or Pe)
Po                         Other forms of punctuation
Ps                         Opening punctuation
S                          Symbols
Sc                         Currency symbols
Sk                         Modifier symbols
Sm                         Mathematical symbols
So                         Other symbols
Z                          Separators
Zl                         Line breaks
Zp                         Paragraph breaks
Zs                         Spaces

Table 9-3. Unicode character blocks

AlphabeticPresentationForms

EnclosedAlphanumerics

Malayalam

Arabic

EnclosedCJKLettersandMonths

MathematicalAlphanumericSymbols

ArabicPresentationForms-A

Ethiopic

MathematicalOperators

ArabicPresentationForms-B

GeneralPunctuation

MiscellaneousSymbols

Armenian

GeometricShapes

MiscellaneousTechnical

Arrows

Georgian

Mongolian

BasicLatin

Gothic

MusicalSymbols

Bengali

Greek

Myanmar

BlockElements

GreekExtended

NumberForms

Bopomofo

Gujarati

Ogham

BopomofoExtended

Gurmukhi

OldItalic

BoxDrawing

HalfwidthandFullwidthForms

OpticalCharacterRecognition

BraillePatterns

HangulCompatibilityJamo

Oriya

ByzantineMusicalSymbols

HangulJamo

PrivateUse

Cherokee

HangulSyllables

PrivateUse

CJKCompatibility

Hebrew

PrivateUse

CJKCompatibilityForms

HighPrivateUseSurrogates

Runic

CJKCompatibilityIdeographs

HighSurrogates

Sinhala

CJKCompatibilityIdeographsSupplement

Hiragana

SmallFormVariants

CJKRadicalsSupplement

IdeographicDescriptionCharacters

SpacingModifierLetters

CJKSymbolsandPunctuation

IPAExtensions

Specials

CJKUnifiedIdeographs

Kanbun

Specials

CJKUnifiedIdeographsExtensionA

KangxiRadicals

SuperscriptsandSubscripts

CJKUnifiedIdeographsExtensionB

Kannada

Syriac

CombiningDiacriticalMarks

Katakana

Tags

CombiningHalfMarks

Khmer

Tamil

CombiningMarksforSymbols

Lao

Telugu

ControlPictures

Latin-1Supplement

Thaana

CurrencySymbols

LatinExtended-A

Thai

Cyrillic

LatinExtendedAdditional

Tibetan

Deseret

LatinExtended-B

UnifiedCanadianAboriginalSyllabics

Devanagari

LetterlikeSymbols

YiRadicals

Dingbats

LowSurrogates

YiSyllables

You’ll see in the next section that W3C XML Schema has introduced an extension to regular expressions that makes it possible to specify intersections. This extension can define the intersection between a block and a category in a single pattern facet.

Note

Although Unicode blocks seem to be a great way to restrict text to a set of characters you can print, display, read, or store in a database, they aren’t designed for this purpose, and you must be careful when using them this way. John Cowan, who has taught courses on Unicode and enjoys obscure alphabets, wrote about this topic:

It’s important to note that Unicode blocks are a very crude mechanism for discrimination: not everything needed to write Greek is in the Greek block, and there are no less than five Latin blocks, one of which (Basic Latin; i.e., ASCII) contains many script-independent symbols. The blocks were originally created solely for internal Unicode organizational purposes, and have spread to the outside world somewhat randomly.

The five Latin blocks mentioned by John are BasicLatin, Latin-1Supplement, LatinExtended-A, LatinExtendedAdditional, and LatinExtended-B.

User-defined character classes

These classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list: for instance, the following defines the list of letters on the first row of a French keyboard:

[azertyuiop]

This expression specifies all characters between a and z:

[a-z]

This expression specifies all characters that aren’t between a and z:

[^a-z]

This expression defines the characters -, ^, and \:

[-^\\]

This expression specifies a decimal sign:

[-+]

These examples demonstrate that the contents of these square brackets follow a specific syntax and semantics. Like the regular expression’s main syntax, there’s a list of atoms, but instead of matching each atom against a character of the instance string, you define a logical space. Brackets operate in a space between the atoms and more formal character classes.

The caret (^) is a special character that has a different meaning depending on its location. A negator when it appears at the beginning of a class, the caret loses this special meaning and acts as a normal character when it appears later in the class definition.
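The bracket examples above can be tried out with Python's re module, whose bracket syntax behaves the same way for these simple cases. This is a sketch rather than a schema validator: the helper chars_in_class and the sample string are hypothetical, and they only show which characters each class accepts.

```python
import re

def chars_in_class(char_class: str, candidates: str) -> str:
    """Return the candidate characters accepted by a bracketed class."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

# Sample characters: a, q, z, 5, -, +, ^ and a backslash.
sample = "aqz5-+^\\"

print(chars_in_class(r"[azertyuiop]", sample))  # az     (listed characters)
print(chars_in_class(r"[a-z]", sample))         # aqz    (range)
print(chars_in_class(r"[^a-z]", sample))        # 5-+^\  (negated range)
print(chars_in_class(r"[-^\\]", sample))        # -^\    (caret is not first)
print(chars_in_class(r"[-+]", sample))          # -+     (sign characters)
```

Note how the caret negates the class only in [^a-z], where it comes first; in [-^\\] it is an ordinary member of the list.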

Tip

The escape format #xXX (such as #x2D) is a frequent source of confusion. Because this format is used in the W3C XML Schema recommendation to describe characters by their Unicode value, some people have tried to use it in regular expressions, but it is not meant to be used that way. If you want to define characters by their Unicode values, use numeric character references in the XML syntax (such as &#x2D;) or the character escape of the compact syntax (such as \x{2D}). Note that in both cases, the reference is replaced by the corresponding character at parse time, so the regular expression engine sees the actual character instead of the escape sequence.

Also, some characters may or must be escaped: \\ matches the character \. In fact, in a class definition, all the escape sequences you saw as atoms can be used. Even though some special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:

[-^\\]

can also be written as:

[\-\^\\]

or as:

[\^\\\-]

because when they are escaped, the location of the characters doesn’t matter.

Within square brackets, the character \ also keeps its meaning of a reference to a Perl or Unicode class. The following:

[\d\p{Lu}]

is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).

Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In square brackets, you’ve already seen two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ takes the complement of the set defined in the square bracket). W3C XML Schema extends the syntax of Perl regular expressions to introduce a third operation: the difference between sets. The syntax is:

[set1-[set2]]

It matches all the characters in set1 that don’t belong to set2, where set1 and set2 can use all the syntactic tricks you saw earlier.

This operator can perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and you can now define a class for the BasicLatinLetters as:

[\p{IsBasicLatin}-[^\p{L}]]

Using the \P construct, which is also a complement, you can define the class as:

[\p{IsBasicLatin}-[\P{L}]]

The corresponding definition is:

<define name="BasicLatinLetters">
  <data type="token">
    <param name="pattern">[\p{IsBasicLatin}-[\P{L}]]*</param>
  </data>
</define>

or:

BasicLatinLetters = xsd:token {pattern = "[\p{IsBasicLatin}-[\P{L}]]*"}
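Python's re module supports neither \p{...} nor class subtraction, so the sketch below reproduces the set arithmetic with plain sets and the standard unicodedata module instead of a regular expression. It assumes that the BasicLatin block is the range U+0000 to U+007F; the names BASIC_LATIN, is_letter, and basic_latin_letters are hypothetical.

```python
import unicodedata

# Assumption: the BasicLatin block covers code points U+0000 to U+007F.
BASIC_LATIN = {chr(cp) for cp in range(0x00, 0x80)}

def is_letter(ch: str) -> bool:
    # The general categories Lu, Ll, Lt, Lm, and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")

# Difference: BasicLatin minus the nonletters. Only the nonletters inside
# BasicLatin need to be listed, since the others can't be removed anyway.
non_letters = {ch for ch in BASIC_LATIN if not is_letter(ch)}
basic_latin_letters = BASIC_LATIN - non_letters

# The difference with a complement is the intersection of the two sets.
assert basic_latin_letters == {ch for ch in BASIC_LATIN if is_letter(ch)}

print("".join(sorted(basic_latin_letters)))
```

The result is the 52 unaccented letters A to Z and a to z, which is the set of characters the BasicLatinLetters pattern allows.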

Or-ing and Grouping

I used an or in the first example pattern facet when I wrote “1|5|15” to allow either 1, 5, or 15.

Ors are especially interesting when used with groups. Groups are complete regular expressions, which are themselves considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed by parentheses. To define a comma-separated list of 1, 5, or 15, ignoring whitespace between values and commas, the following pattern facet can be used:

<define name="myListOfBytes">
  <data type="token">
    <param name="pattern">(1|5|15)( *, *(1|5|15))*</param>
  </data>
</define>

or:

myListOfBytes = xsd:token {pattern = "(1|5|15)( *, *(1|5|15))*"}

Note the reliance on the whitespace processing of the base datatype (xsd:token collapses the whitespace). You don’t need to worry about leading and trailing whitespace, which is trimmed; the single spaces that may remain before and after each comma are matched by the " *" atoms (a space followed by an asterisk).
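The interplay between whitespace collapsing and the pattern can be sketched in Python. This is an approximation, not a RELAX NG validator: the collapse function imitates what xsd:token does before the pattern is applied, re.fullmatch imitates whole-value matching, and the names LIST_PATTERN, collapse, and is_list_of_bytes are invented for the example.

```python
import re

LIST_PATTERN = re.compile(r"(1|5|15)( *, *(1|5|15))*")

def collapse(value: str) -> str:
    """Approximate the whitespace processing of xsd:token."""
    # Runs of XML whitespace become one space; both ends are trimmed.
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")

def is_list_of_bytes(value: str) -> bool:
    """Apply the pattern to the collapsed value, as the datatype would."""
    return LIST_PATTERN.fullmatch(collapse(value)) is not None

print(is_list_of_bytes("1,5,15"))          # True
print(is_list_of_bytes("  1 ,\n   15  "))  # True
print(is_list_of_bytes("1,,5"))            # False
print(is_list_of_bytes("1 5"))             # False
print(is_list_of_bytes("1, 2"))            # False
```

Without the collapsing step, the second value would fail, because the pattern itself allows spaces only, not newlines or leading and trailing whitespace.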
