Patterns

The 'pattern' facet clearly requires more explanation than the brief description above. This XML Schema feature is based on the regular expression capabilities available in the Perl programming language, and is therefore very powerful, but also consequently quite complex.

Simple templates

The Pattern element holds a pattern in its Value attribute. The simplest possible form of pattern involves a series of characters that must be present, in the order specified, in each element or attribute definition that uses it. The following pattern restriction specifies that the value must be 'abc':

<pattern value="abc" />

In this simple form, a pattern is similar to an enumeration. Just as a restriction can contain multiple Enumeration elements, it can also contain multiple Pattern elements. The element or attribute instance is valid if it matches any of the patterns:

<restriction base="string">
  <pattern value="abc"/>
  <pattern value="xyz"/>
</restriction>


   <code>abc</code>
   <code>xyz</code>
   <code>acb</code> <!-- ERROR -->
   <code>xzy</code> <!-- ERROR -->
   <code>abcc</code> <!-- ERROR -->

Alternatively, a single pattern may contain multiple 'branches'. Each branch is essentially a separate expression, separated from surrounding branches using '|' symbols. Again, the pattern test succeeds if any one of the branches matches the target value, so the '|' symbol is performing a similar function to its use in DTD content models. The following example is therefore equivalent to the multi-pattern example above:

<restriction base="string">
  <pattern value="abc|xyz"/>
</restriction>

Note that, although branches are an optional technique at this level, they are the only way to achieve this effect in the later discussion on sub-expressions.

Atoms

Each branch of an expression (or the whole expression, if it is not divided into branches) consists of a number of 'atoms'. In the examples above, the single letter 'a' is an atom, and 'b' is another. Apart from individual characters, an atom can also be a 'character class' (an escape sequence, or a selection from a pre-defined or user-defined group of characters), or a complete sub-expression (as explained further below).

Each atom validates one portion of the value the pattern is being compared to, in sequential order from left to right. This is why the first atom, 'a', is expected to match the first character of the value. If the value does not begin with an 'a' character, the value has already failed to match the pattern. If the value does begin with 'a', then the next atom, 'b', is compared with the next character in the value.

Quantifiers

A 'quantifier' can be added to an atom to specify how frequently the atom may occur. The examples above have no quantifier. In these cases, an implied quantifier specifies that the atom must occur exactly once. The expression 'abc' therefore specifies that the content of the element or attribute must be exactly 'abc'. There must be one 'a' character, followed by one 'b' character, followed by one 'c' character. The values 'ab', 'bc' and 'abcc' would not be valid.

Explicit quantifiers include the symbols '?', '+' and '*' (which have meanings that will be unsurprising to those familiar with DTD content models), and 'quantities' that allow any number of occurrences to be precisely specified.

The explicit quantifier '?' indicates that the atom is optional. The expression 'ab?c' makes the 'b' atom optional, so legal values include 'abc' and 'ac'.

The explicit quantifier '+' indicates that the atom is repeatable. The expression 'ab+c' makes the 'b' atom repeatable, so legal values include 'abc' and 'abbbc'.

The explicit quantifier '*' indicates that the atom is both optional and repeatable. The expression 'ab*c' makes the 'b' atom optional and repeatable, so legal values include 'ac', 'abc' and 'abbbc'.

The following example includes all three quantifiers, and all of the following target Code elements are valid according to this pattern:

<pattern value="a+b?c*" />


   <code>a</code>
   <code>ab</code>
   <code>abc</code>
   <code>aaa</code>
   <code>aaab</code>
   <code>aaabc</code>
   <code>aaabccc</code>

Quantities

A 'quantity' is a more finely-tuned instrument for specifying occurrence options than the quantifiers described above. Instead of a single symbol, such as '+', a quantity involves one or two values, enclosed by curly brackets, '{' and '}'.

The simplest form of quantity involves a single value, such as '{3}' or '{123}', which specifies how many times the atom must occur. For example, 'ab{3}c' specifies that the value must be 'abbbc'.

A quantity range can have two values, separated by a comma, such as '{3,7}'. A value can include any number of occurrences of the atom between, and including, these two extremes. For example, 'ab{3,4}c' specifies that the value must be either 'abbbc' or 'abbbbc'.

It is also possible to specify just a minimum number of occurrences. If the second value is absent, then only a minimum is being specified, so '{3,}' means 'at least 3'. But the comma must still be present. For example, 'ab{2,}c' allows for everything from 'abbc' to 'abbbbbbbbbc' and beyond.
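As a minimal sketch of a quantity range in use (the Code element name simply follows the earlier examples, and the restriction is assumed to sit inside a simple type definition), the following pattern allows between two and four 'b' characters:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="ab{2,4}c"/>
</restriction>

<!-- instance values -->
   <code>abbc</code>
   <code>abbbc</code>
   <code>abbbbc</code>
   <code>abc</code> <!-- ERROR: only one 'b' -->
   <code>abbbbbc</code> <!-- ERROR: five 'b' characters -->
```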

Escape characters

A number of significant symbols have now been introduced, such as '*' and '{'. These symbols cannot be used within an expression as atoms, because they would be misinterpreted as significant pattern markup. It is therefore necessary to escape them, in the same way that '&' and '<' must be escaped in XML documents. Instead of the '&' character, however, the '\' symbol is used in a pattern. Again, just as '&amp;' is needed in XML documents to use the escape character itself as a data character, the '\' symbol must also be escaped in patterns, giving '\\' (this should be familiar to C and Java software developers). The characters that need to be escaped are '.', '\', '?', '*', '+', '{', '}', '(', ')', '|', '[' and ']'. For example, '?' is escaped as '\?':

\\ (the escape character)
\| (branch separator)
\. (not-a-line-end character)
\- (range separator) (character class subtraction)
\^ (used at start of negative character group)
\? (optional indicator)
\* (optional and repeatable indicator)
\+ (required and repeatable)
\{ (quantity start)
\} (quantity end)
\( (sub-group start)
\) (sub-group end)
\[ (range group start)
\] (range group end)

In some circumstances, the '-' and '^' characters must also be escaped:

\- (range separator)
\^ (negative group indicator)

In addition, escape characters are used to include whitespace characters that would otherwise be difficult or impossible to add, such as the tab character:

\n (newline)
\r (return)
\t (tab)
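To illustrate the effect of escaping, the following sketch (again using the hypothetical Code element from the earlier examples) treats '+' as an ordinary character rather than as a quantifier:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="1\+1"/>
</restriction>

<!-- instance values -->
   <code>1+1</code>
   <code>11</code> <!-- ERROR -->
   <code>111</code> <!-- ERROR -->
```

Without the '\' symbol, the expression '1+1' would have the opposite effect: the '+' would be read as a quantifier on the first '1', so '11' and '111' would be valid, but '1+1' would not.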

Character groups

Atoms (quantified or otherwise) do not have to be single characters. They can also be escape sequences, so an escaped character can be quantified, such as '\++', which states that the '+' character is required and repeatable. In addition, they can also be 'character groups', a feature that allows a particular character in the target value to be one of a number of pre-defined options.

It could be imagined that a product code needs to start with exactly three letters, and end with between two and four digits. While 'abc123' and 'wxy9876' would both be valid, 'ab123' and 'wxy98765' would not (the first has too few letters, and the second has too many digits). This requirement could be achieved using a very large number of branches, such as 'aaa00|aaa000|aaa0000|...' (and so on), but this is clearly impractical. Instead, a 'character class' expression is enclosed by square brackets, '[' and ']'. For example, '[abc]' means that, at this position, the letters 'a', 'b' or 'c' may appear.

When the first character is '^', the group becomes a 'negative character group', reversing the meaning, so that any character except those in the group can be matched. For example, '[^abc]' specifies that any character except 'a', 'b' or 'c' can be included. The '^' symbol can be used later in the group without having this significance, so '[a^b]' simply means that the character must be 'a' or '^' or 'b'.

Quantifiers can be used on groups as well as individual characters. For example, '[abc]+' specifies that at least one of the letters 'a', 'b' and 'c' must appear, but then additional characters from this set may also appear, so 'abcbca' would be a valid match.
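As a small sketch of a positive and a negative group working together (the Code element name follows the earlier examples), the following pattern requires one character from the set 'a', 'b' and 'c', followed by one character that is not in that set:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="[abc][^abc]"/>
</restriction>

<!-- instance values -->
   <code>ax</code>
   <code>b^</code>
   <code>ab</code> <!-- ERROR: second character is in the excluded set -->
   <code>xa</code> <!-- ERROR: first character is not 'a', 'b' or 'c' -->
   <code>a</code> <!-- ERROR: second character is missing -->
```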

Character ranges

It is not always efficient to have to specify every possible character in a group. For example, '[abcdefghijklmnopqrstuvwxyz]' is a verbose way to specify that any lower-case letter can occur. When a large set of options have sequential ASCII values, as in this example, a 'range' can be specified instead, using a '-' separator between the first character in the range and the last character in the range. The more succinct equivalent of the example above is therefore '[a-z]'. The expression '[a-zA-Z0-9]' allows all normal letters and digits to occur.
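The product code requirement described earlier (exactly three letters, then between two and four digits) could be sketched with two ranges and two quantities. This assumes that only lower-case letters are wanted, as in the sample values, and the Code element name again follows the earlier examples:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="[a-z]{3}[0-9]{2,4}"/>
</restriction>

<!-- instance values -->
   <code>abc123</code>
   <code>wxy9876</code>
   <code>ab123</code> <!-- ERROR: too few letters -->
   <code>wxy98765</code> <!-- ERROR: too many digits -->
```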

If the '-' character is to be matched, within a group, it must be escaped using '\-', but it is not necessary to do this outside of a group. For example, 'a-b[x\-y]+' matches 'a-bxxx---yyy'.
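A possible use of an escaped '-' inside a group, shown here as a sketch with the hypothetical Code element from the earlier examples, is a value that may contain only digits and hyphens:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="[0-9\-]+"/>
</restriction>

<!-- instance values -->
   <code>555-1234</code>
   <code>5551234</code>
   <code>555 1234</code> <!-- ERROR: space is not in the group -->
```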

An XML character reference can be included in a range, such as '&#123;' or '&#xAA;'. This is particularly useful for representing characters that are difficult, or even impossible, to enter directly from a keyboard.

This approach can still be used when some of the characters in the range are not wanted. Individual items can be selectively removed from the range, using a 'sub group', with a '-' prefix, as in '[a-z-[m]]', which removes 'm' as a valid option.
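More than one character can be removed in this way. The following sketch (Code element name as in the earlier examples) accepts lower-case letters other than the five vowels:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="[a-z-[aeiou]]+"/>
</restriction>

<!-- instance values -->
   <code>xyz</code>
   <code>rhythm</code>
   <code>rhyme</code> <!-- ERROR: contains 'e' -->
```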

Sub-groups

An entire expression can be created within another expression, with the embedded expression enclosed by brackets, '(' and ')'. This is useful when several options are required at a particular location in the pattern, because a complete expression can contain branches, such as '1|2|3'. Therefore, 'abc(1|2|3)d' matches 'abc1d', 'abc2d' and 'abc3d'.

The simple example above, with only a single character in each option, is just another way of writing the more succinct 'abc[123]d' expression. However, the simpler technique cannot work for multi-character alternatives, such as 'abc(XX|YY|ZZ)d'.

A quantifier can be assigned to a sub-group. In the following example, there can be any number of the given strings embedded within the value:

<pattern value="abc(XX|YY|ZZ)*d" />


   <code>abcd</code>
   <code>abcYYd</code>
   <code>abcZZYYXXXXd</code>

Character class escapes

There are various categories of 'character class escape':

  • single character escapes (discussed above)

  • multi-character escapes (such as '\S' (non-whitespace) and '.' (non-line-ending))

  • general category escapes (such as '\p{L}' and '\p{Lu}') and complementary general category escapes (such as '\P{L}' and '\P{Lu}')

  • block category escapes (such as '\p{IsBasicLatin}' and '\p{IsTibetan}') and complementary block category escapes (such as '\P{IsBasicLatin}' and '\P{IsTibetan}').

A 'single character escape' is an escape sequence for a single character, such as '\{' for the '{' character, which has a significant role in expressions (they are listed and discussed in more detail above).

Multi-character escapes

For convenience, a number of short escape codes are provided to represent very common sets of characters, including:

  • non-line-ending characters

  • whitespace characters and non-whitespace characters

  • initial XML name characters (and all characters except these characters)

  • subsequent XML name characters (and all characters except these characters)

  • decimal digits (and all characters except these digits).

The '.' character represents every character except a new-line or carriage-return character. The expression '.....' therefore represents a string of five characters that are not broken over lines.

The remaining multi-character escape characters are all escaped in the normal way, using a '\' symbol. They are all defined in pairs, with a lower-case letter representing a particular common requirement, and the equivalent upper-case letter representing the opposite effect.

The escape sequence '\s' represents any whitespace character, including the space, tab, new-line and carriage return characters. The '\S' sequence therefore represents any non-whitespace character.
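As a sketch of these two escapes used together (the Code element name follows the earlier examples, and the base type is assumed to be 'string', so that whitespace in the value is preserved), the following pattern describes two words separated by a single whitespace character:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="\S+\s\S+"/>
</restriction>

<!-- instance values -->
   <code>hello world</code>
   <code>helloworld</code> <!-- ERROR: no whitespace character -->
   <code>hello</code> <!-- ERROR: no second word -->
```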

The escape sequence '\i' represents any initial name character ('_', ':' or a letter). The '\I' sequence therefore represents any non-initial character. Similarly, the escape sequence '\c' represents any XML name character, and '\C' represents any non-XML name character.
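Taken together, these two escapes give a compact way to describe something shaped like an XML name: one initial name character, followed by any number of name characters. A sketch, using the hypothetical Code element from the earlier examples:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="\i\c*"/>
</restriction>

<!-- instance values -->
   <code>title</code>
   <code>_id</code>
   <code>xs:string</code>
   <code>1st</code> <!-- ERROR: a digit is not an initial name character -->
   <code>-x</code> <!-- ERROR: a hyphen is not an initial name character -->
```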

The escape sequence '\d' represents any decimal digit. It is equivalent to the expression '\p{Nd}' (see below). The '\D' sequence therefore represents any non-decimal digit character.
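For example, the following sketch (Code element name as in the earlier examples) requires two digits, then any single character that is not a digit, then two more digits:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="\d{2}\D\d{2}"/>
</restriction>

<!-- instance values -->
   <code>12:30</code>
   <code>12-30</code>
   <code>1230</code> <!-- ERROR: no non-digit separator -->
   <code>12:3</code> <!-- ERROR: only one trailing digit -->
```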

The escape sequence '\w' represents all characters except for punctuation, separator and 'other' characters (using techniques described below, this is equivalent to '[&#x0000;-&#x10FFFF;-[\p{P}\p{Z}\p{C}]]'), whereas the '\W' sequence represents only these characters.

Category escapes

The escape sequence '\p' or '\P' (reverse meaning) introduces a 'category escape' set. A category token is enclosed within the curly brackets that follow, '{' and '}'. These tokens represent pre-defined sets of characters, such as all upper-case letters or the Tibetan character set.

General category escapes

A 'general category escape' is a reference to a pre-defined set of characters, such as all of the upper-case letters, or all punctuation characters. These sets of characters have special names, such as 'Lu' for upper-case letters, and 'P' for all punctuation. For example, '\p{Lu}' represents all upper-case letters, and '\P{Lu}' represents all characters except upper-case letters.
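As a sketch of a general category escape in use (the Code element name follows the earlier examples, and 'Ll' is the lower-case letter category), the following pattern requires one upper-case letter followed by one or more lower-case letters:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="\p{Lu}\p{Ll}+"/>
</restriction>

<!-- instance values -->
   <code>Hello</code>
   <code>hello</code> <!-- ERROR: first letter is not upper-case -->
   <code>HELLO</code> <!-- ERROR: second letter is not lower-case -->
   <code>H</code> <!-- ERROR: no lower-case letters follow -->
```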

Single letter codes are used for major groupings, such as 'L' for all letters (of which upper-case letters are just a subset). The full set of options is:

L  All letters
 Lu uppercase
 Ll lowercase
 Lt titlecase
 Lm modifier
 Lo other
M  All Marks
 Mn nonspacing
 Mc spacing combining
 Me enclosing
N  All Numbers
 Nd decimal digit
 Nl letter
 No other
P  All Punctuation
 Pc connector
 Pd dash
 Ps open
 Pe close
 Pi initial quote
 Pf final quote
 Po other
Z  All Separators
 Zs space
 Zl line
 Zp paragraph
S  All Symbols
 Sm math
 Sc currency
 Sk modifier
 So other
C  All Others
 Cc control
 Cf format
 Co private use

These concepts are defined at http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html.

Block category escapes

The Unicode character set is divided into many significant groupings, such as musical symbols, Braille characters and Tibetan characters. A keyword is assigned to each group, such as 'MusicalSymbols', 'BraillePatterns' and 'Tibetan'.

In alphabetical order (reading down each of the three columns in turn), the full set of keywords is:

AlphabeticPresentationForms           Dingbats                          LetterlikeSymbols
Arabic                                EnclosedAlphanumerics             LowSurrogates
ArabicPresentationForms-A             EnclosedCJKLettersandMonths       Malayalam
ArabicPresentationForms-B             Ethiopic                          MathematicalAlphanumericSymbols
Armenian                              GeneralPunctuation                MathematicalOperators
Arrows                                GeometricShapes                   MiscellaneousSymbols
BasicLatin                            Georgian                          MiscellaneousTechnical
Bengali                               Gothic                            Mongolian
BlockElements                         Greek                             MusicalSymbols
Bopomofo                              GreekExtended                     Myanmar
BopomofoExtended                      Gujarati                          NumberForms
BoxDrawing                            Gurmukhi                          Ogham
BraillePatterns                       HalfwidthandFullwidthForms        OldItalic
ByzantineMusicalSymbols               HangulCompatibilityJamo           OpticalCharacterRecognition
Cherokee                              HangulJamo                        Oriya
CJKCompatibility                      HangulSyllables                   PrivateUse (three separate sets)
CJKCompatibilityForms                 Hebrew                            Runic
CJKCompatibilityIdeographs            HighPrivateUseSurrogates          Sinhala
CJKCompatibilityIdeographsSupplement  HighSurrogates                    SmallFormVariants
CJKRadicalsSupplement                 Hiragana                          SpacingModifierLetters
CJKSymbolsandPunctuation              IdeographicDescriptionCharacters  Specials (two separate sets)
CJKUnifiedIdeographs                  IPAExtensions                     SuperscriptsandSubscripts
CJKUnifiedIdeographsExtensionA        Kanbun                            Syriac
CJKUnifiedIdeographsExtensionB        KangxiRadicals                    Tags
CombiningDiacriticalMarks             Kannada                           Tamil
CombiningHalfMarks                    Katakana                          Telugu
CombiningMarksforSymbols              Khmer                             Thaana
ControlPictures                       Lao                               Thai
CurrencySymbols                       Latin-1Supplement                 Tibetan
Cyrillic                              LatinExtended-A                   UnifiedCanadianAboriginalSyllabics
Deseret                               LatinExtended-B                   YiRadicals
Devanagari                            LatinExtendedAdditional           YiSyllables

A reference to one of these categories involves a keyword that begins with 'Is…', followed by a name from the list above, such as 'Tibetan'. For example, '\p{IsTibetan}'.
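As a sketch of a block escape in use (the Code element name follows the earlier examples), the following pattern restricts a value to characters from the 'BasicLatin' block, which covers the ASCII range:

```xml
<!-- schema fragment -->
<restriction base="string">
  <pattern value="\p{IsBasicLatin}+"/>
</restriction>

<!-- instance values -->
   <code>plain text</code>
   <code>café</code> <!-- ERROR: 'é' belongs to the Latin-1Supplement block -->
```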
