Patterns


The 'pattern' facet clearly requires more explanation than the brief description above. This XML Schema feature is based on the regular expression capabilities available in the Perl programming language, and is therefore very powerful, but also consequently quite complex.

Simple templates

The Pattern element holds a pattern in its Value attribute. The simplest possible form of pattern involves a series of characters that must be present, in the order specified, in the value of each element or attribute that uses it. The following pattern restriction specifies that the value must be 'abc':

<pattern value="abc" />

In this simple form, a pattern is similar to an enumeration. Just as a restriction can contain multiple Enumeration elements, it can also contain multiple Pattern elements. The element or attribute instance is valid if it matches any of the patterns:

<restriction base="string">
  <pattern value="abc"/>
  <pattern value="xyz"/>
</restriction>


   <code>abc</code>
   <code>xyz</code>
   <code>acb</code> <!-- ERROR -->
   <code>xzy</code> <!-- ERROR -->
   <code>abcc</code> <!-- ERROR -->

Alternatively, a single pattern may contain multiple 'branches'. Each branch is essentially a separate expression, separated from surrounding branches using '|' symbols. Again, the pattern test succeeds if any one of the branches matches the target value, so the '|' symbol is performing a similar function to its use in DTD content models. The following example is therefore equivalent to the multi-pattern example above:

<restriction base="string">
  <pattern value="abc|xyz"/>
</restriction>

Note that, although branches are an optional technique at this level, they are the only way to achieve this effect in the later discussion on sub-expressions.
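The equivalence of multiple patterns and multiple branches can be checked with a rough sketch in Python. This is not an XML Schema validator: it assumes that Python's 're.fullmatch' is an acceptable stand-in for the implicit whole-value matching of a pattern facet, and the helper name 'matches_any' is purely illustrative.

```python
import re

# Hypothetical helper: a value is valid if it matches ANY of the patterns.
# re.fullmatch requires the whole value to match, mimicking a pattern facet.
def matches_any(patterns, value):
    return any(re.fullmatch(p, value) is not None for p in patterns)

multi_pattern = ["abc", "xyz"]   # two separate Pattern elements
single_pattern = ["abc|xyz"]     # one Pattern element with two branches

values = ["abc", "xyz", "acb", "xzy", "abcc"]
results_multi = [matches_any(multi_pattern, v) for v in values]
results_single = [matches_any(single_pattern, v) for v in values]
```

Both lists of results should be identical, with only the first two values accepted.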

Atoms

Each branch of an expression (or the whole expression, if it is not divided into branches) consists of a number of 'atoms'. In the examples above, the single letter 'a' is an atom, and 'b' is another. Apart from individual characters, an atom can also be a 'character class' (an escape sequence, or a selection from a pre-defined or user-defined group of characters), or a complete sub-expression (as explained further below).

Each atom validates one portion of the value the pattern is being compared to, in sequential order from left to right. This is why the first atom, 'a', is expected to match the first character of the value. If the value does not begin with an 'a' character, the value has already failed to match the pattern. If the value does begin with 'a', then the next atom, 'b', is compared with the next character in the value.

Quantifiers

A 'quantifier' can be added to an atom to specify how frequently the atom may occur. The examples above have no quantifier. In these cases, an implied quantifier specifies that the atom must occur exactly once. The expression 'abc' therefore specifies that the content of the element or attribute must be exactly 'abc'. There must be one 'a' character, followed by one 'b' character, followed by one 'c' character. The values 'ab', 'bc' and 'abcc' would not be valid.

Explicit quantifiers include the symbols '?', '+' and '*' (which have meanings that will be unsurprising to those familiar with DTD content models), and 'quantities' that allow any number of occurrences to be precisely specified.

The explicit quantifier '?' indicates that the atom is optional. The expression 'ab?c' makes the 'b' atom optional, so legal values include 'abc' and 'ac'.

The explicit quantifier '+' indicates that the atom is repeatable. The expression 'ab+c' makes the 'b' atom repeatable, so legal values include 'abc' and 'abbbc'.

The explicit quantifier '*' indicates that the atom is both optional and repeatable. The expression 'ab*c' makes the 'b' atom optional and repeatable, so legal values include 'ac', 'abc' and 'abbbc'.

The following example includes all three quantifiers, and all of the following target Code elements are valid according to this pattern:

<pattern value="a+b?c*" />


   <code>a</code>
   <code>ab</code>
   <code>abc</code>
   <code>aaa</code>
   <code>aaab</code>
   <code>aaabc</code>
   <code>aaabccc</code>
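The three quantifiers can be tried out with a small Python sketch. It assumes that Python's 're.fullmatch' behaves like the pattern facet for this simple expression; the 'is_valid' helper and the list of invalid values are illustrative additions.

```python
import re

PATTERN = "a+b?c*"  # one or more 'a', at most one 'b', any number of 'c'

def is_valid(value):
    # The whole value must match, as with a pattern facet.
    return re.fullmatch(PATTERN, value) is not None

valid_values = ["a", "ab", "abc", "aaa", "aaab", "aaabc", "aaabccc"]
invalid_values = ["", "b", "abb", "ba"]  # no 'a'; no 'a'; two 'b's; wrong order

all_valid = all(is_valid(v) for v in valid_values)
none_invalid = not any(is_valid(v) for v in invalid_values)
```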

Quantities

A 'quantity' is a more finely-tuned instrument for specifying occurrence options than the quantifiers described above. Instead of a single symbol, such as '+', a quantity involves one or two values, enclosed by curly brackets, '{' and '}'.

The simplest form of quantity involves a single value, such as '{3}' or '{123}', which specifies how many times the atom must occur. For example, 'ab{3}c' specifies that the value must be 'abbbc'.

A quantity range can have two values, separated by a comma, such as '{3,7}'. A value can include any number of occurrences of the atom between, and including, these two extremes. For example, 'ab{3,4}c' specifies that the value must be either 'abbbc' or 'abbbbc'.

It is also possible to specify just a minimum number of occurrences. If the second value is absent, then only a minimum is being specified, so '{3,}' means 'at least 3'. But the comma must still be present. For example, 'ab{2,}c' allows for everything from 'abbc' to 'abbbbbbbbbc' and beyond.
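The three forms of quantity can be compared side by side in a short Python sketch. It assumes that the curly-bracket syntax behaves identically in Python's 're' module for these expressions; 'is_valid' is an illustrative helper.

```python
import re

def is_valid(pattern, value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(pattern, value) is not None

exact = [v for v in ["abbc", "abbbc", "abbbbc"] if is_valid("ab{3}c", v)]
ranged = [v for v in ["abbc", "abbbc", "abbbbc", "abbbbbc"] if is_valid("ab{3,4}c", v)]
minimum = [v for v in ["abc", "abbc", "abbbbbbbbbc"] if is_valid("ab{2,}c", v)]
```

Each list keeps only the values that the corresponding quantity accepts.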

Escape characters

A number of significant symbols have now been introduced, such as '*' and '{'. These symbols cannot be used within an expression as atoms, because they would be misinterpreted as significant pattern markup. It is therefore necessary to escape them, in the same way that '&' and '<' must be escaped in XML documents. Instead of the '&' character, however, the '\' symbol is used in a pattern. Again, just as '&amp;' is needed in XML documents to use the escape character itself as a data character, the '\' symbol must also be escaped in patterns, giving '\\' (this should be familiar to C and Java software developers). The characters that need to be escaped are '.', '\', '?', '*', '+', '{', '}', '(', ')', '|', '[' and ']'. For example, '?' is escaped as '\?':

\\ (the escape character)
\| (branch separator)
\. (not-a-line-end character)
\- (range separator, character class subtraction)
\^ (used at start of negative character group)
\? (optional indicator)
\* (optional and repeatable indicator)
\+ (required and repeatable)
\{ (quantity start)
\} (quantity end)
\( (sub-group start)
\) (sub-group end)
\[ (range group start)
\] (range group end)

In some circumstances, the '-' and '^' characters must also be escaped:

\- (range separator)
\^ (negative group indicator)

In addition, escape characters are used to include whitespace characters that would otherwise be difficult or impossible to add, such as the tab character:

\n (newline)
\r (return)
\t (tab)
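The effect of escaping can be seen in the following Python sketch. It assumes that Python's 're' module treats these particular escapes in the same way as a pattern facet; raw strings (the 'r' prefix) are used so that Python itself does not interpret the backslashes, and 'is_valid' is an illustrative helper.

```python
import re

def is_valid(pattern, value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(pattern, value) is not None

escaped = is_valid(r"why\?", "why?")    # '\?' matches a literal question mark
unescaped = is_valid(r"why?", "why?")   # here '?' only makes the 'y' optional
backslash = is_valid(r"a\\b", "a\\b")   # '\\' matches one literal backslash
tabbed = is_valid(r"a\tb", "a\tb")      # '\t' matches a tab character
```

Only the unescaped form fails, because nothing in that pattern matches the literal '?' at the end of the value.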

Character groups

Atoms (quantified or otherwise) do not have to be single characters. They can also be escape sequences, so an escaped character can be quantified, such as '\++', which states that the '+' character is required and repeatable. In addition, they can also be 'character groups', a feature that allows a particular character in the target value to be one of a number of pre-defined options.

It could be imagined that a product code needs to start with exactly three letters, and end with between two and four digits. While 'abc123' and 'wxy9876' would both be valid, 'ab123' and 'wxy98765' would not (the first has too few letters, and the second has too many digits). This requirement could be achieved using a very large number of branches, such as 'aaa00 | aaa000 | aaa0000 | ...' (and so on), but this is clearly impractical. Instead, a 'character class' expression is enclosed by square brackets, '[' and ']'. For example, '[abc]' means that, at this position, the letters 'a', 'b' or 'c' may appear.

When the first character is '^', the group becomes a 'negative character group', reversing the meaning, so that any character except those in the group can be matched. For example, '[^abc]' specifies that any character except 'a', 'b' or 'c' can be included. The '^' symbol can be used later in the group without having this significance, so '[a^b]' simply means that the character must be 'a' or '^' or 'b'.

Quantifiers can be used on groups as well as individual characters. For example, '[abc]+' specifies that at least one of the letters 'a', 'b' and 'c' must appear, but then additional characters from this set may also appear, so 'abcbca' would be a valid match.
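Character groups, negative groups and quantified groups can be explored with a brief Python sketch. It assumes that Python's 're' module handles these simple groups in the same way as a pattern facet; 'is_valid' is an illustrative helper.

```python
import re

def is_valid(pattern, value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(pattern, value) is not None

group_hit = is_valid("[abc]", "b")             # one of 'a', 'b' or 'c'
group_miss = is_valid("[abc]", "d")
negative_hit = is_valid("[^abc]", "x")         # anything except 'a', 'b', 'c'
negative_miss = is_valid("[^abc]", "a")
caret_literal = is_valid("[a^b]", "^")         # '^' is literal when not first
repeated = is_valid("[abc]+", "abcbca")        # quantified group
repeated_miss = is_valid("[abc]+", "abd")
```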

Character ranges

It is not always efficient to have to specify every possible character in a group. For example, '[abcdefghijklmnopqrstuvwxyz]' is a verbose way to specify that any lower-case letter can occur. When a large set of options have sequential ASCII values, as in this example, a 'range' can be specified instead, using a '-' separator between the first character in the range and the last character in the range. The more succinct equivalent of the example above is therefore '[a-z]'. The expression '[a-zA-Z0-9]' allows all normal letters and digits to occur.

If the '-' character is to be matched, within a group, it must be escaped using '\-', but it is not necessary to do this outside of a group. For example, 'a-b[x\-y]+' matches 'a-bxxx---yyy'.
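Ranges make the product code requirement described earlier easy to express. The Python sketch below is one possible reading of that requirement: the expression '[a-zA-Z]{3}[0-9]{2,4}' is an assumption (the text does not say whether upper-case letters are allowed), and Python's 're.fullmatch' is assumed to behave like a pattern facet here.

```python
import re

# Assumed product code pattern: exactly three letters, then two to four digits.
PRODUCT_CODE = "[a-zA-Z]{3}[0-9]{2,4}"

def is_valid(pattern, value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(pattern, value) is not None

codes = ["abc123", "wxy9876", "ab123", "wxy98765"]
accepted = [c for c in codes if is_valid(PRODUCT_CODE, c)]

# An escaped '-' inside a group is a literal hyphen, not a range separator.
hyphen_ok = is_valid(r"a-b[x\-y]+", "a-bxxx---yyy")
```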

An XML character reference can be included in a range, such as '&#123;' (the '{' character) or '&#xAA;' (the 'ª' character). This is particularly useful for representing characters that are difficult, or even impossible, to enter directly from a keyboard.

This approach can still be used when some of the characters in the range are not wanted. Individual items can be selectively removed from the range, using a 'sub group', with a '-' prefix, as in '[a-z-[m]]', which removes 'm' as a valid option.
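Python's 're' module has no equivalent subtraction syntax, so the following sketch only emulates the effect of removing 'm' from the range 'a' to 'z'. It swaps in a different technique, a negative lookahead, which is not part of the XML Schema pattern language; 'is_valid' is an illustrative helper.

```python
import re

# Emulation only: '(?!m)' rejects 'm' before '[a-z]' is tried.
# This lookahead syntax is Python-specific, not XML Schema syntax.
A_TO_Z_EXCEPT_M = "(?!m)[a-z]"

def is_valid(value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(A_TO_Z_EXCEPT_M, value) is not None

allowed = [ch for ch in "almnz" if is_valid(ch)]
```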

Sub-groups

An entire expression can be created within another expression, with the embedded expression enclosed by brackets, '(' and ')'. This is useful when several options are required at a particular location in the pattern, because a complete expression can contain branches, such as '1|2|3'. Therefore, 'abc(1|2|3)d' matches 'abc1d', 'abc2d' and 'abc3d'.

The simple example above, with only a single character in each option, is just another interpretation of the more succinct 'abc[123]d' expression. However, the simpler technique cannot work for multi-character alternatives, such as 'abc(XX|YY|ZZ)d'.

A quantifier can be assigned to a sub-group. In the following example, there can be any number of the given strings embedded within the value:

<pattern value="abc(XX|YY|ZZ)*d" />


   <code>abcd</code>
   <code>abcYYd</code>
   <code>abcZZYYXXXXd</code>
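A quantified sub-group can be tested with a short Python sketch. It assumes that Python's 're.fullmatch' behaves like a pattern facet for this expression; 'is_valid' and the invalid sample values are illustrative additions.

```python
import re

PATTERN = "abc(XX|YY|ZZ)*d"  # any number of 'XX', 'YY' or 'ZZ' between 'abc' and 'd'

def is_valid(value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(PATTERN, value) is not None

valid_values = ["abcd", "abcYYd", "abcZZYYXXXXd"]
invalid_values = ["abcXd", "abcXYd", "abcXX"]  # half a pair; mixed pair; missing 'd'

all_valid = all(is_valid(v) for v in valid_values)
none_invalid = not any(is_valid(v) for v in invalid_values)
```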

Character class escapes

There are various categories of 'character class escape':

single character escapes (discussed above)
multi-character escapes (such as '\S' (non-whitespace) and '.' (non-line-ending))
general category escapes (such as '\p{L}' and '\p{Lu}') and complementary general category escapes (such as '\P{L}' and '\P{Lu}')
block category escapes (such as '\p{IsBasicLatin}' and '\p{IsTibetan}') and complementary block category escapes (such as '\P{IsBasicLatin}' and '\P{IsTibetan}').

A 'single character escape' is an escape sequence for a single character, such as the '{' character, which has a significant role in expressions (they are listed and discussed in more detail above).

Multi-character escapes

For convenience, a number of short escape codes are provided to represent very common sets of characters, including:

non-line-ending characters
whitespace characters and non-whitespace characters
initial XML name characters (and all characters except these characters)
subsequent XML name characters (and all characters except these characters)
decimal digits (and all characters except these digits).

The '.' character represents every character except a new-line or carriage-return character. The expression '.....' therefore represents a string of five characters that are not broken over lines.

The remaining multi-character escape characters are all escaped in the normal way, using a '\' symbol. They are all defined in pairs, with a lower-case letter representing a particular common requirement, and the equivalent upper-case letter representing the opposite effect.

The escape sequence '\s' represents any whitespace character, including the space, tab, new-line and carriage return characters. The '\S' sequence therefore represents any non-whitespace character.

The escape sequence '\i' represents any initial name character ('_', ':' or a letter). The '\I' sequence therefore represents any non-initial character. Similarly, the escape sequence '\c' represents any XML name character, and '\C' represents any non-XML name character.

The escape sequence '\d' represents any decimal digit. It is equivalent to the expression '\p{Nd}' (see below). The '\D' sequence therefore represents any non-decimal digit character.

The escape sequence '\w' represents all characters except for punctuation, separator and 'other' characters (using techniques described below, this is equivalent to '[&#x0000;-&#x10FFFF;-[\p{P}\p{Z}\p{C}]]'), whereas the '\W' sequence represents only these characters.
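Some of these escapes can be tried in Python, with caveats. The sketch below assumes that Python's '\d', '\s' and '\S' are close enough to the XML Schema versions for these sample values. Python's 're' module has no '\i' or '\c', and its '\w' differs: it accepts the underscore, which is connector punctuation and so falls outside the XML Schema definition given above. The 'is_valid' helper is illustrative.

```python
import re

def is_valid(pattern, value):
    # Whole-value match, as with a pattern facet.
    return re.fullmatch(pattern, value) is not None

spaced = is_valid(r"\d{3}\s\S+", "123 abc")    # digits, whitespace, non-whitespace
unspaced = is_valid(r"\d{3}\s\S+", "123abc")   # no whitespace after the digits
arabic_digit = is_valid(r"\d", "\u0663")       # ARABIC-INDIC DIGIT THREE is a decimal digit
python_w_underscore = is_valid(r"\w", "_")     # True in Python, unlike XML Schema
```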

Category escapes

The escape sequence '\p' or '\P' (reverse meaning) introduces a 'category escape' set. A category token is enclosed within following curly brackets, '{' and '}'. These tokens represent pre-defined sets of characters, such as all upper-case letters or the Tibetan character set.

General category escapes

A 'general category escape' is a reference to a pre-defined set of characters, such as all of the upper-case letters, or all punctuation characters. These sets of characters have special names, such as 'Lu' for upper-case letters, and 'P' for all punctuation. For example, '\p{Lu}' represents all upper-case letters, and '\P{Lu}' represents all characters except upper-case letters.

Single letter codes are used for major groupings, such as 'L' for all letters (of which upper-case letters are just a subset). The full set of options is:

L		All letters
	Lu	uppercase
	Ll	lowercase
	Lt	titlecase
	Lm	modifier
	Lo	other
M		All Marks
	Mn	nonspacing
	Mc	spacing combination
	Me	enclosing
N		All Numbers
	Nd	decimal digit
	Nl	letter
	No	other
P		All Punctuation
	Pc	connector
	Pd	dash
	Ps	open
	Pe	close
	Pi	initial quote
	Pf	final quote
	Po	other
Z		All Separators
	Zs	space
	Zl	line
	Zp	paragraph
S		All Symbols
	Sm	math
	Sc	currency
	Sk	modifier
	So	other
C		All Others
	Cc	control
	Cf	format
	Co	private use

These concepts are defined at http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html.
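Python's 're' module does not support '\p{...}', but the general category of a character can be looked up with the standard 'unicodedata' module. The sketch below emulates a category test on a single character; the 'in_category' helper is illustrative, and the category data comes from the Unicode version bundled with Python rather than the version cited above.

```python
import unicodedata

def in_category(char, category):
    # A one-letter code such as 'L' covers every two-letter code beginning
    # with it ('Lu', 'Ll', ...), so a prefix test is sufficient.
    return unicodedata.category(char).startswith(category)

upper_is_lu = in_category("A", "Lu")      # upper-case letter
upper_is_l = in_category("A", "L")        # also a letter of some kind
lower_is_lu = in_category("a", "Lu")      # lower-case, so not 'Lu'
digit_is_nd = in_category("5", "Nd")      # decimal digit
bang_is_p = in_category("!", "P")         # punctuation
euro_is_sc = in_category("\u20ac", "Sc")  # currency symbol
```

A complementary test, in the manner of '\P{Lu}', is simply the negation of the result.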

Block category escapes

The Unicode character set is divided into many significant groupings, such as musical symbols, Braille characters and Tibetan characters. A keyword is assigned to each group, such as 'MusicalSymbols', 'BraillePatterns' and 'Tibetan'.

In alphabetical order, the full set of keywords is:

AlphabeticPresentationForms	Dingbats	LetterlikeSymbols
Arabic	EnclosedAlphanumerics	LowSurrogates
ArabicPresentationForms-A	EnclosedCJKLettersandMonths	Malayalam
ArabicPresentationForms-B	Ethiopic	MathematicalAlphanumericSymbols
Armenian	GeneralPunctuation	MathematicalOperators
Arrows	GeometricShapes	MiscellaneousSymbols
BasicLatin	Georgian	MiscellaneousTechnical
Bengali	Gothic	Mongolian
BlockElements	Greek	MusicalSymbols
Bopomofo	GreekExtended	Myanmar
BopomofoExtended	Gujarati	NumberForms
BoxDrawing	Gurmukhi	Ogham
BraillePatterns	HalfwidthandFullwidthForms	OldItalic
ByzantineMusicalSymbols	HangulCompatibilityJamo	OpticalCharacterRecognition
Cherokee	HangulJamo	Oriya
CJKCompatibility	HangulSyllables	PrivateUse (3 separate sets)
CJKCompatibilityForms	Hebrew	Runic
CJKCompatibilityIdeographs	HighPrivateUseSurrogates	Sinhala
CJKCompatibilityIdeographsSupplement	HighSurrogates	SmallFormVariants
CJKRadicalsSupplement	Hiragana	SpacingModifierLetters
CJKSymbolsandPunctuation	IdeographicDescriptionCharacters	Specials (two separate sets)
CJKUnifiedIdeographs	IPAExtensions	SuperscriptsandSubscripts
CJKUnifiedIdeographsExtensionA	Kanbun	Syriac
CJKUnifiedIdeographsExtensionB	KangxiRadicals	Tags
CombiningDiacriticalMarks	Kannada	Tamil
CombiningHalfMarks	Katakana	Telugu
CombiningMarksforSymbols	Khmer	Thaana
ControlPictures	Lao	Thai
CurrencySymbols	Latin-1Supplement	Tibetan
Cyrillic	LatinExtended-A	UnifiedCanadianAboriginalSyllabics
Deseret	LatinExtended-B	YiRadicals
Devanagari	LatinExtendedAdditional	YiSyllables

A reference to one of these categories involves a keyword that begins with 'Is', followed by a name from the list above, such as 'Tibetan'. For example, '\p{IsTibetan}'.
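A block test amounts to checking whether a character's code point falls within the range assigned to that block. The Python sketch below emulates this for two blocks only; the 'BLOCKS' table and 'in_block' helper are illustrative, with ranges taken from the Unicode block definitions (Basic Latin is U+0000 to U+007F, Tibetan is U+0F00 to U+0FFF).

```python
# Illustrative subset of Unicode blocks: name -> (first, last) code point.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Tibetan": (0x0F00, 0x0FFF),
}

def in_block(char, block_name):
    # Emulates '\p{Is<block_name>}' for a single character.
    first, last = BLOCKS[block_name]
    return first <= ord(char) <= last

latin_a = in_block("A", "BasicLatin")
tibetan_ka = in_block("\u0f40", "Tibetan")   # TIBETAN LETTER KA
latin_in_tibetan = in_block("A", "Tibetan")
```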
