6. Using Regular Expressions to Specify Simple Datatypes

The Swiss Army Knife

Patterns (and regular expressions in general) are like a Swiss army knife when constraining simple datatypes. They are highly flexible, can compensate for many of the limitations of the other facets, and are often used to define user datatypes for various formats such as ISBN numbers, telephone numbers, or custom date formats. However, like a Swiss army knife, patterns have their own limitations.

Multirange datatypes (such as integers between -1 and 5 or 10 and 15) can be defined as a union of datatypes covering the different ranges (in this case, a union between a datatype accepting integers between -1 and 5 and a second datatype accepting integers between 10 and 15); however, after the union, the resulting datatype loses its integer semantics and can no longer be constrained using integer facets. Using patterns to define multirange datatypes is therefore an option: although less readable than a union, it preserves the semantics of the base type.
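
As a rough sketch of the two approaches just described (the type names multirangeByUnion and multirangeByPattern are hypothetical, the pattern syntax is explained later in this chapter, and the pattern deliberately accepts only the plain forms, with no “+” sign and no leading zeros):

<!-- Union of two ranges: the result is no longer an integer type -->
<xs:simpleType name="multirangeByUnion">
  <xs:union>
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="-1"/>
        <xs:maxInclusive value="5"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="10"/>
        <xs:maxInclusive value="15"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>

<!-- Pattern on the lexical space: the result is still an xs:integer -->
<xs:simpleType name="multirangeByPattern">
  <xs:restriction base="xs:integer">
    <!-- "-1", one digit from 0 to 5, or "1" followed by a digit from 0 to 5 -->
    <xs:pattern value="-1|[0-5]|1[0-5]"/>
  </xs:restriction>
</xs:simpleType>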

Cutting a tree with a Swiss army knife is long, tiring, and dangerous. Writing regular expressions may also become long, tiring, and dangerous when the number of combinations grows. One should try to keep them as simple as possible.

A Swiss army knife cannot change lead into gold, and no facet can change the primary type of a simple datatype. A string datatype restricted to match a custom date format will still retain the properties of a string and will never acquire the facets of a datetime datatype. This means that there is no effective way to express localized date formats.

The Simplest Possible Patterns

In their simplest form, patterns may be used as enumerations applied to the lexical space rather than on the value space.

If, for instance, we have a byte value that can only take the values “1,” “5,” or “15,” the classical way to define such a datatype is to use the xs:enumeration facet:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:enumeration value="1"/>
    <xs:enumeration value="5"/>
    <xs:enumeration value="15"/>
  </xs:restriction>
</xs:simpleType>

This is the “normal” way of defining this datatype if it matches the lexical space and the value space of an xs:byte. It gives the flexibility to accept instance documents with values such as “1,” “5,” and “15,” but also “01” or “0000005.” One of the particularities of xs:pattern is that it is the only facet that constrains the lexical space rather than the value space. If we have an application that is disturbed by leading zeros, we can use patterns instead of enumerations to define our datatype:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1"/>
    <xs:pattern value="5"/>
    <xs:pattern value="15"/>
  </xs:restriction>
</xs:simpleType>

This new datatype is still derived from xs:byte and has the semantics of a byte, but its lexical space is now constrained to accept only “1,” “5,” and “15,” leaving out any variation that has the same value but a different lexical representation.

Tip

This is an important difference from Perl regular expressions, on which W3C XML Schema patterns are built. A Perl expression such as /15/ matches any string containing “15,” while the W3C XML Schema pattern matches only the string equal to “15.” The Perl expression equivalent to this pattern is thus /^15$/.
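
Conversely, to get the “contains” behavior of the unanchored Perl expression, the wildcards have to be written out. A minimal sketch (the type name contains15 is hypothetical; the “.” wildcard is described later in this chapter and does not match line breaks):

<xs:simpleType name="contains15">
  <xs:restriction base="xs:string">
    <!-- Roughly equivalent to the Perl expression /15/ on a single-line string -->
    <xs:pattern value=".*15.*"/>
  </xs:restriction>
</xs:simpleType>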

This example has been carefully chosen to avoid using any of the meta characters used within patterns, which are: “.”, “\”, “?”, “*”, “+”, “{”, “}”, “(”, “)”, “[”, “]”, and “|”. We will see the meaning of these characters later in this chapter; for the moment, we just need to know that each of these characters needs to be “escaped” by a leading “\” to be used as a literal. For instance, to define a similar datatype for a decimal whose lexical space is limited to “1” and “1.5,” we write:

<xs:simpleType name="myDecimal">
  <xs:restriction base="xs:decimal">
    <xs:pattern value="1"/>
    <xs:pattern value="1\.5"/>
  </xs:restriction>
</xs:simpleType>

A common source of errors is that “normal” characters should not be escaped: we will see later that a leading “\” changes their meaning (for instance, “\s” matches all the XML whitespaces and not the character “s”).
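
As a further illustration of escaping, here is a sketch of a type whose only valid value is the literal string “C++” (the type name cPlusPlus is hypothetical). Each “+” must be escaped, since an unescaped “+” would be read as a quantifier:

<xs:simpleType name="cPlusPlus">
  <xs:restriction base="xs:token">
    <!-- \+ matches a literal plus sign -->
    <xs:pattern value="C\+\+"/>
  </xs:restriction>
</xs:simpleType>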

Quantifying

Despite an apparent similarity, the xs:pattern facet interprets its value attribute in a very different way than xs:enumeration does. xs:enumeration reads the value as a lexical representation, and converts it to the corresponding value for its base datatype, while xs:pattern reads the value as a set of conditions to apply on lexical values. When we write:

<xs:pattern value="15"/>

we specify three conditions (first character equals “1,” second character equals “5,” and the string must finish after this). Each of the matching conditions (such as first character equals “1” and second character equals “5”) is called a piece. This is just the simplest form to specify a piece.

Each piece in a pattern is composed of an atom identifying a character, or a set of characters, and an optional quantifier. Characters (except special characters that must be escaped) are the simplest form of atoms. In our example, we have omitted the quantifiers. Quantifiers may be defined using two different syntaxes: either a special character (* for 0 or more, + for one or more, and ? for 0 or 1) or a numeric range within curly braces ({n} for exactly n times, {n,m} for between n and m times, or {n,} for n or more times).
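
The numeric-range syntax can be sketched with a few throwaway types (the names threeZeros, twoToFourZeros, and atLeastTwoZeros are hypothetical and serve only to illustrate the quantifiers):

<!-- {n}: exactly three zeros, i.e. only "000" -->
<xs:simpleType name="threeZeros">
  <xs:restriction base="xs:string">
    <xs:pattern value="0{3}"/>
  </xs:restriction>
</xs:simpleType>

<!-- {n,m}: "00", "000", or "0000" -->
<xs:simpleType name="twoToFourZeros">
  <xs:restriction base="xs:string">
    <xs:pattern value="0{2,4}"/>
  </xs:restriction>
</xs:simpleType>

<!-- {n,}: "00", "000", "0000", and so on -->
<xs:simpleType name="atLeastTwoZeros">
  <xs:restriction base="xs:string">
    <xs:pattern value="0{2,}"/>
  </xs:restriction>
</xs:simpleType>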

Using these quantifiers, we can merge our three patterns into one:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1?5?"/>
  </xs:restriction>
</xs:simpleType>

This new pattern means there must be zero or one character “1” followed by zero or one character “5”. This is not exactly the same meaning as our three previous patterns, since the empty string “” is now accepted by the pattern in addition to “1,” “5,” and “15.” However, since the empty string doesn’t belong to the lexical space of our base type (xs:byte), the new datatype has the same lexical space as the previous one.

We could also use quantifiers to limit the number of leading zeros—for instance, the following pattern limits the number of leading zeros to up to 2:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="0{0,2}1?5?"/>
  </xs:restriction>
</xs:simpleType>

More Atoms

By this point, we have seen the simplest atoms that can be used in a pattern: “1,” “5,” and “\.” are atoms that exactly match a character. The other atoms that can be used in patterns are special characters, a wildcard that matches any character, or predefined and user-defined character classes.

Special Characters

Table 6-1 shows the list of atoms that match a single character, exactly like the characters we have already seen, but also correspond to characters that must be escaped or (for the first three characters on the list) that are just provided for convenience.

Table 6-1. Special characters

Atom	Description
\n	New line (can also be written as “&#x0A;” since we are in an XML document)
\r	Carriage return (can also be written as “&#x0D;”)
\t	Tabulation (can also be written as “&#x09;”)
\\	Character “\”
\|	Character “|”
\.	Character “.”
\-	Character “-”
\^	Character “^”
\?	Character “?”
\*	Character “*”
\+	Character “+”
\{	Character “{”
\}	Character “}”
\(	Character “(”
\)	Character “)”
\[	Character “[”
\]	Character “]”

Wildcard

The character “.” has a special meaning: it’s a wildcard atom that matches any XML valid character except newlines and carriage returns. As with any atom, “.” may be followed by an optional quantifier and “.*” is a common construct to match zero or more occurrences of any character. To illustrate the usage of “.*” (and the fact that xs:pattern is a Swiss army knife), a pattern may be used to define the integers that are multiples of 10:

<xs:simpleType name="multipleOfTen">
  <xs:restriction base="xs:integer">
    <xs:pattern value=".*0"/>
  </xs:restriction>
</xs:simpleType>

Character Classes

W3C XML Schema has adopted the “classical” Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl).

Classical Perl character classes

W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each of these classes is designated by a backslash followed by a single letter; the classes designated by the upper- and lowercase versions of the same letter are complementary:

\s: Spaces. Matches the XML whitespaces (space #x20, tabulation #x09, line feed #x0A, and carriage return #x0D).
\S: Characters that are not spaces.
\d: Digits (“0” to “9” but also digits in other alphabets).
\D: Characters that are not digits.
\w: Extended “word” characters (any Unicode character not defined as “punctuation”, “separator,” and “other”). This conforms to the Perl definition, assuming UTF8 support has been switched on.
\W: Nonword characters.
\i: XML 1.0 initial name characters (i.e., all the “letters” plus “_” and “:”). This is a W3C XML Schema extension over Perl regular expressions.
\I: Characters that may not be used as an XML initial name character.
\c: XML 1.0 name characters (initial name characters, digits, “.”, “:”, “-”, and the characters defined by Unicode as “combining” or “extender”). This is a W3C XML Schema extension to Perl regular expressions.
\C: Characters that may not be used in an XML 1.0 name.

These character classes may be used with an optional quantifier like any other atom. The last pattern that we saw:

<xs:pattern value=".*0"/>

constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is an xs:integer, this is good enough for our purposes, but if the base type had been an xs:decimal (or xs:string), we could be more restrictive and write:

<xs:pattern value="-?\d*0"/>

This checks that the characters before the trailing zero are digits with an optional leading - (we will see later on in Section 6.5.2.2 how to specify an optional leading - or +).
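
The XML-specific classes lend themselves to name-like datatypes. As a sketch (the type name xmlName is hypothetical; the Recommendation defines its built-in xs:Name with this same pattern, so in practice one would simply use xs:Name):

<xs:simpleType name="xmlName">
  <xs:restriction base="xs:token">
    <!-- One initial name character followed by any number of name characters -->
    <xs:pattern value="\i\c*"/>
  </xs:restriction>
</xs:simpleType>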

Unicode character classes

Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify the characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.), while blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).

The syntax \p{Name} is the same for blocks and categories; the prefix Is is added to the name of blocks to make the distinction (for instance, \p{IsBasicLatin} designates a block while \p{L} designates a category). The syntax \P{Name} is also available to select the characters that do not match a block or category. A list of Unicode blocks and categories is given in the specification. Table 6-2 shows the Unicode character classes (categories) and Table 6-3 shows the Unicode character blocks.

Table 6-2. Unicode character classes

Unicode Character Class	Includes
C	Other characters (non-letters, non-symbols, non-numbers, non-separators)
Cc	Control characters
Cf	Format characters
Cn	Unassigned code points
Co	Private use characters
L	Letters
Ll	Lowercase letters
Lm	Modifier letters
Lo	Other letters
Lt	Titlecase letters
Lu	Uppercase letters
M	All Marks
Mc	Spacing combining marks
Me	Enclosing marks
Mn	Non-spacing marks
N	Numbers
Nd	Decimal digits
Nl	Number letters
No	Other numbers
P	Punctuation
Pc	Connector punctuation
Pd	Dashes
Pe	Closing punctuation
Pf	Final quotes (may behave like Ps or Pe)
Pi	Initial quotes (may behave like Ps or Pe)
Po	Other forms of punctuation
Ps	Opening punctuation
S	Symbols
Sc	Currency symbols
Sk	Modifier symbols
Sm	Mathematical symbols
So	Other symbols
Z	Separators
Zl	Line breaks
Zp	Paragraph breaks
Zs	Spaces

Table 6-3. Unicode character blocks

AlphabeticPresentationForms	Arabic	ArabicPresentationForms-A
ArabicPresentationForms-B	Armenian	Arrows
BasicLatin	Bengali	BlockElements
Bopomofo	BopomofoExtended	BoxDrawing
BraillePatterns	ByzantineMusicalSymbols	Cherokee
CJKCompatibility	CJKCompatibilityForms	CJKCompatibilityIdeographs
CJKCompatibilityIdeographsSupplement	CJKRadicalsSupplement	CJKSymbolsandPunctuation
CJKUnifiedIdeographs	CJKUnifiedIdeographsExtensionA	CJKUnifiedIdeographsExtensionB
CombiningDiacriticalMarks	CombiningHalfMarks	CombiningMarksforSymbols
ControlPictures	CurrencySymbols	Cyrillic
Deseret	Devanagari	Dingbats
EnclosedAlphanumerics	EnclosedCJKLettersandMonths	Ethiopic
GeneralPunctuation	GeometricShapes	Georgian
Gothic	Greek	GreekExtended
Gujarati	Gurmukhi	HalfwidthandFullwidthForms
HangulCompatibilityJamo	HangulJamo	HangulSyllables
Hebrew	HighPrivateUseSurrogates	HighSurrogates
Hiragana	IdeographicDescriptionCharacters	IPAExtensions
Kanbun	KangxiRadicals	Kannada
Katakana	Khmer	Lao
Latin-1Supplement	LatinExtended-A	LatinExtendedAdditional
LatinExtended-B	LetterlikeSymbols	LowSurrogates
Malayalam	MathematicalAlphanumericSymbols	MathematicalOperators
MiscellaneousSymbols	MiscellaneousTechnical	Mongolian
MusicalSymbols	Myanmar	NumberForms
Ogham	OldItalic	OpticalCharacterRecognition
Oriya	PrivateUse	Runic
Sinhala	SmallFormVariants	SpacingModifierLetters
Specials	SuperscriptsandSubscripts	Syriac
Tags	Tamil	Telugu
Thaana	Thai	Tibetan
UnifiedCanadianAboriginalSyllabics	YiRadicals	YiSyllables
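
To make the syntax concrete, here is a small sketch using one category and one block from the tables above (the type names capitalizedWord and cyrillicWord are hypothetical):

<!-- Category: one uppercase letter followed by any number of lowercase letters -->
<xs:simpleType name="capitalizedWord">
  <xs:restriction base="xs:token">
    <xs:pattern value="\p{Lu}\p{Ll}*"/>
  </xs:restriction>
</xs:simpleType>

<!-- Block: note the "Is" prefix in front of the block name -->
<xs:simpleType name="cyrillicWord">
  <xs:restriction base="xs:token">
    <!-- One or more characters from the Cyrillic block (spaces are not part of it) -->
    <xs:pattern value="\p{IsCyrillic}+"/>
  </xs:restriction>
</xs:simpleType>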

We don’t yet know how to specify intersections between a block and a category in a single pattern, or how to specify that a datatype must be composed of only basic Latin letters. So, to “cross” these classifications and define the intersection of the category L (all the letters) and the block BasicLatin (ASCII characters up to #x7F), we can perform two successive restrictions:

<xs:simpleType name="BasicLatinLetters">
  <xs:restriction>
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:pattern value="\p{IsBasicLatin}*"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:pattern value="\p{L}*"/>
  </xs:restriction>
</xs:simpleType>

User-defined character classes

These classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list—for instance:

[azertyuiop]

to define the list of letters on the first row of a French keyboard,

[a-z]

to specify all the characters between “a” and “z”,

[^a-z]

for all the characters that are not between “a” and “z,” but also

[-^\\]

to define the characters “-,” “^,” and “\,” or

[-+]

to specify a decimal sign.

These examples are enough to see that what’s between these square brackets follows a specific syntax and semantics. Like the regular expression’s main syntax, we have a list of atoms, but instead of matching each atom in turn against successive characters of the instance string, we define a set: the character class matches any single character that matches one of the atoms found between the brackets.

We see also two special characters that have a different meaning depending on their location! The character -, which is a range delimiter when it is between a and z, is a normal character when it is just after the opening bracket or just before the closing bracket ([+-] and [-+] are, therefore, both legal). On the contrary, ^, which is a negator when it appears at the beginning of a class, loses this special meaning to become a normal character later in the class definition.

We also notice that characters may or must be escaped: “\\” is used to match the character “\”. In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:

[-^\\]

can also be written as:

[\-\^\\]

or as:

[\^\\\-]

since the location of the characters doesn’t matter any longer when they are escaped.

Within square brackets, the character “\” also keeps its meaning of a reference to a Perl or Unicode class. The following:

[\d\p{Lu}]

is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).
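
Used in a datatype, such a class might look like the following sketch (the type name productCode is hypothetical):

<xs:simpleType name="productCode">
  <xs:restriction base="xs:token">
    <!-- One or more characters, each a decimal digit or an uppercase letter -->
    <xs:pattern value="[\d\p{Lu}]+"/>
  </xs:restriction>
</xs:simpleType>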

Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we already saw two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ realizes the complement of the set defined in the square bracket). W3C XML Schema extended the syntax of the Perl regular expressions to introduce a third operation: the difference between sets. The syntax follows:

[set1-[set2]]

Its meaning is all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.
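
A simple sketch of the difference operator, using only ASCII ranges (the type name lowercaseConsonants is hypothetical):

<xs:simpleType name="lowercaseConsonants">
  <xs:restriction base="xs:token">
    <!-- Letters from "a" to "z", minus the vowels "a", "e", "i", "o", and "u" -->
    <xs:pattern value="[a-z-[aeiou]]+"/>
  </xs:restriction>
</xs:simpleType>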

This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin Letters as:

[\p{IsBasicLatin}-[^\p{L}]]

Or, using the \P construct, which is also a complement, we can define the class as:

[\p{IsBasicLatin}-[\P{L}]]

The corresponding datatype definition would be:

<xs:simpleType name="BasicLatinLetters">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}-[\P{L}]]*"/>
  </xs:restriction>
</xs:simpleType>

Oring and Grouping

In our first example pattern, we used three separate patterns to express three possible values. We can condense this definition using the “|” character, which is the “or” operator when used outside square brackets. The simple type definition is then:

<xs:simpleType name="myByte">
  <xs:restriction base="xs:byte">
    <xs:pattern value="1|5|15"/>
  </xs:restriction>
</xs:simpleType>

This syntax is more concise, but whether or not it’s more readable is subject to discussion. Also, these “ors” would not be very interesting if it were not possible to use them in conjunction with groups. Groups are complete regular expressions, which are, themselves, considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed by parentheses (“(” and “)”). To define a comma-separated list of “1,” “5,” or “15,” ignoring whitespace between values and commas, the following pattern could be used:

<xs:simpleType name="myListOfBytes">
  <xs:restriction base="xs:token">
    <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/>
  </xs:restriction>
</xs:simpleType>

Note how we have relied on the whitespace processing of the base datatype (xs:token collapses whitespace). We have not tested for leading and trailing whitespace, which is trimmed, and since runs of whitespace are collapsed, at most a single space can remain on each side of the comma. We match it with the piece “ *” (a space followed by the quantifier “*”) before and after the comma.
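
Groups with quantifiers are also handy for repeated fields. As a sketch (the type name dottedQuad is hypothetical, and the pattern checks only the shape of the value, not that each number is at most 255):

<xs:simpleType name="dottedQuad">
  <xs:restriction base="xs:token">
    <!-- Three "digits plus dot" groups followed by a final run of digits -->
    <xs:pattern value="([0-9]{1,3}\.){3}[0-9]{1,3}"/>
  </xs:restriction>
</xs:simpleType>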

Common Patterns

After this overview of the syntax used by patterns, let’s see some common patterns that you may have to use (or adapt) in your schemas or just consider as examples.

String Datatypes

Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.

Unicode blocks

Unicode is a great asset of XML; however, there are few applications able to process and display all the characters of the Unicode set correctly and still fewer users able to read them! If you need to check that your string datatypes belong to one (or more) Unicode blocks, you can derive them from basic types such as:

<xs:simpleType name="BasicLatinToken">
  <xs:restriction base="xs:token">
    <xs:pattern value="\p{IsBasicLatin}*"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="Latin-1Token">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"/>
  </xs:restriction>
</xs:simpleType>

Note that such patterns do not impose a character encoding on the document itself and that, for instance, the Latin-1Token datatype could validate instance documents using UTF-8, UTF-16, ISO-8859-1, or another encoding. (This assumes the characters used in this string belong to the two Unicode blocks BasicLatin and Latin-1Supplement.) In other words, working on the lexical space, i.e., after the transformations have been done by the parser, these patterns do not control the physical format of the instance documents.

Counting words

We have already seen a trick to count the words using a dummy derivation by list; however, this derivation counts only whitespace-separated “words,” ignoring the punctuation that was treated like normal characters. We can limit the number of words using a pattern instead. To do so, we can define an atom, which is a sequence of one or more “word” characters (\w+) followed by one or more nonword characters (\W+), and control its number of occurrences. If we are not very strict on the punctuation, we also need to allow an arbitrary number of nonword characters at the beginning of our value and to deal with the possibility of a value ending with a word (without further separation). One of the ways to avoid any ambiguity at the end of the string is to dissociate the last occurrence of a word to make the trailing separator optional:

<xs:simpleType name="story100-200words">
  <xs:restriction base="xs:token">
    <xs:pattern value="\W*(\w+\W+){99,199}\w+\W*"/>
  </xs:restriction>
</xs:simpleType>
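
The same construction scales down to smaller limits, which makes it easier to check by hand. A sketch for one to three words (the type name upToThreeWords is hypothetical):

<xs:simpleType name="upToThreeWords">
  <xs:restriction base="xs:token">
    <!-- Zero to two "word plus separator" groups, then a final word -->
    <xs:pattern value="\W*(\w+\W+){0,2}\w+\W*"/>
  </xs:restriction>
</xs:simpleType>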

URIs

We have seen that xs:anyURI doesn’t care about “absolutizing” relative URIs and it may be wise to impose the usage of absolute URIs, which are easier to process. Furthermore, it can also be interesting for some applications to limit the accepted URI schemes. This can easily be done by a set of patterns such as:

<xs:simpleType name="httpURI">
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="http://.*"/>
  </xs:restriction>
</xs:simpleType>
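
If several schemes are acceptable, they can be handled in the same way. A sketch accepting both http and https (the type name webURI is hypothetical; the quantifier “?” makes the “s” optional):

<xs:simpleType name="webURI">
  <xs:restriction base="xs:anyURI">
    <!-- "http" or "https", then "://", then anything -->
    <xs:pattern value="https?://.*"/>
  </xs:restriction>
</xs:simpleType>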

Numeric and Float Types

While numeric types aren’t strictly text, patterns can still be used appropriately to constrain their lexical form.

Leading zeros

Getting rid of leading zeros is quite simple but requires some precautions if we want to keep the optional sign and the number “0” itself. This can be done using patterns such as:

<xs:simpleType name="noLeadingZeros">
  <xs:restriction base="xs:integer">
    <xs:pattern value="[+-]?([1-9][0-9]*|0)"/>
  </xs:restriction>
</xs:simpleType>

Note that in this pattern, we chose to redefine all the lexical rules that apply to an integer. This pattern would give the same lexical space applied to an xs:token datatype as to an xs:integer. We could also have relied on the knowledge of the base datatype and written:

<xs:simpleType name="noLeadingZeros">
  <xs:restriction base="xs:integer">
    <xs:pattern value="[+-]?([^0].*|0)"/>
  </xs:restriction>
</xs:simpleType>

Relying on the base datatype in this manner can produce simpler patterns, but can also be more difficult to interpret since we would have to combine the lexical rules of the base datatype to the rules expressed by the pattern to understand the result.

Fixed format

The maximum number of digits can be fixed using xs:totalDigits and xs:fractionDigits. However, these facets are only maximum numbers and work on the value space. If we want to fix the format of the lexical space to be, for instance, “DDDD.DD”, we can write a pattern such as:

<xs:simpleType name="fixedDigits">
  <xs:restriction base="xs:decimal">
    <xs:pattern value="[+-]?.{4}\..{2}"/>
  </xs:restriction>
</xs:simpleType>
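
The pattern above relies on the base datatype to restrict what the wildcards can match, and it is slightly looser than it looks: a value such as “-123.45” also satisfies it, since the sign can be matched by the first wildcard. A more self-contained sketch, which spells the digits out explicitly (the type name fixedDigitsStrict is hypothetical):

<xs:simpleType name="fixedDigitsStrict">
  <xs:restriction base="xs:decimal">
    <!-- Optional sign, exactly four digits, a literal dot, exactly two digits -->
    <xs:pattern value="[+-]?[0-9]{4}\.[0-9]{2}"/>
  </xs:restriction>
</xs:simpleType>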

Datetimes

Dates and time have complex lexical representations. Patterns can give developers extra control over how they are used.

Time zones

The time zone support of W3C XML Schema is quite controversial and needs some additional constraints to avoid comparison problems. These patterns can be kept relatively simple since the syntax of the datetime is already checked by the schema validator and only simple additional checks need to be added. Applications which require that their datetimes specify a time zone may use the following template, which checks that the time part ends with a “Z” or contains a sign:

<xs:simpleType name="dateTimeWithTimezone">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+T.+(Z|[+-].+)"/>
  </xs:restriction>
</xs:simpleType>

Still simpler, applications that want to make sure that none of their datetimes specify a time zone may just check that the time part doesn’t contain the characters “+”, “-”, or “Z”:

<xs:simpleType name="dateTimeWithoutTimezone">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+T[^Z+-]+"/>
  </xs:restriction>
</xs:simpleType>

In these two datatypes, we used the separator “T”. This is convenient, since no occurrences of the signs can occur after this delimiter except in the time zone definition. This delimiter would be missing if we wanted to constrain dates instead of datetimes, but, in this case, we can detect the time zones on their “:” instead:

<xs:simpleType name="dateWithTimezone">
  <xs:restriction base="xs:date">
    <xs:pattern value=".+[:Z].*"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="dateWithoutTimezone">
  <xs:restriction base="xs:date">
    <xs:pattern value="[^:Z]*"/>
  </xs:restriction>
</xs:simpleType>
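
The same idea can be sketched for xs:time, which has neither a “T” delimiter nor any “-” in its time part, so a sign or a “Z” anywhere in the value can only belong to the time zone (the type names timeWithTimezone and timeWithoutTimezone are hypothetical):

<xs:simpleType name="timeWithTimezone">
  <xs:restriction base="xs:time">
    <!-- Ends with "Z" or contains a signed offset -->
    <xs:pattern value=".+(Z|[+-].+)"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="timeWithoutTimezone">
  <xs:restriction base="xs:time">
    <!-- No "Z", "+", or "-" anywhere in the value -->
    <xs:pattern value="[^Z+-]+"/>
  </xs:restriction>
</xs:simpleType>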

Applications may also simply impose a set of time zones to use:

<xs:simpleType name="dateTimeInMyTimezones">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+\+02:00"/>
    <xs:pattern value=".+\+01:00"/>
    <xs:pattern value=".+\+00:00"/>
    <xs:pattern value=".+Z"/>
    <xs:pattern value=".+-04:00"/>
  </xs:restriction>
</xs:simpleType>

We promised earlier to look at xs:duration and see how we can define two datatypes that have a complete sort order. The first datatype will consist of durations expressed only in months and years, and the second will consist of durations expressed only in days, hours, minutes, and seconds. The criteria used for the test can be the presence of a “D” (for day) or a “T” (the time delimiter). If neither of those characters are detected, then the datatype uses only year and month parts. The test for the other type cannot be based on the absence of “Y” and “M”, since there is also an “M” in the time part. We can test that, after an optional sign, the first field is either the day part or the “T” delimiter:

<xs:simpleType name="YMduration">
  <xs:restriction base="xs:duration">
    <xs:pattern value="[^TD]+"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="DHMSduration">
  <xs:restriction base="xs:duration">
    <xs:pattern value="-?P((\d+D)|T).*"/>
  </xs:restriction>
</xs:simpleType>
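
To see how the two patterns, [^TD]+ and -?P((\d+D)|T).*, divide the lexical space, here are a few sample values shown as hypothetical elements (the element names loanPeriod and gracePeriod are assumptions, imagined as declared with the types YMduration and DHMSduration respectively):

<loanPeriod>P1Y6M</loanPeriod>      <!-- valid: no "D" and no "T" -->
<loanPeriod>P1Y2D</loanPeriod>      <!-- invalid: contains a "D" -->
<gracePeriod>P2DT12H</gracePeriod>  <!-- valid: the first field after "P" is the day part -->
<gracePeriod>PT30M</gracePeriod>    <!-- valid: "T" follows "P" directly -->
<gracePeriod>P1M</gracePeriod>      <!-- invalid: the first field is a month part -->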

Back to Our Library

Let’s see where we can use our Swiss army knife in our library. The first datatype, which we promised to improve at the end of the last chapter, is the ISBN number. Without going into the details of how an ISBN number is constructed (which can’t be fully checked with W3C XML Schema), we can check that the total number of characters actually used is 10 and limit its contents to nine digits followed by a digit or the letter “X”:

<xs:simpleType name="isbn">
  <xs:restriction base="xs:NMTOKEN">
    <xs:length value="10"/>
    <xs:pattern value="[0-9]{9}[0-9X]"/>
  </xs:restriction>
</xs:simpleType>

Tip

You may wonder why we kept the xs:length, since as far as validation is concerned, it is less constraining than the xs:pattern that we added. This is a question worth asking, but it doesn’t have a complete answer yet. However, applications which use the PSVI as a source of meta information may or may not be able to deduce from a pattern that the length of a string has been fixed. It might be good practice to keep redundant facets to provide extra information to these future applications.

W3C XML Schema doesn’t allow expression of the fact that the book ID is the same value as the ISBN number with a “b” used as a prefix, but we can still define that it is a “b” with 9 digits and a trailing digit or “X”:

<xs:simpleType name="bookID">
  <xs:restriction base="xs:ID">
    <xs:pattern value="b[0-9]{9}[0-9X]"/>
  </xs:restriction>
</xs:simpleType>

To use this new datatype, we must be aware that we are using a global attribute that was referenced in the element book, but that was also referenced in the elements character and author, which do not have the same format. This is the main limitation in using global elements and attributes: they can be referenced only if they have the same types at all the locations in which they appear. We can work around this problem by creating a local attribute definition for the id attribute of book with the new datatype.

The last things we may want to constrain are the dates for which no time zones are needed and which, in fact, could just be a potential source of issues if we need to compare them:

<xs:simpleType name="date">
  <xs:restriction base="xs:date">
    <xs:pattern value="[^:Z]*"/>
  </xs:restriction>
</xs:simpleType>

Our new schema is then:

<?xml version="1.0"?> 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="string255">
    <xs:restriction base="xs:token">
      <xs:maxLength value="255"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="string32">
    <xs:restriction base="xs:token">
      <xs:maxLength value="32"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="isbn">
    <xs:restriction base="xs:NMTOKEN">
      <xs:length value="10"/>
      <xs:pattern value="[0-9]{9}[0-9X]"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="bookID">
    <xs:restriction base="xs:ID">
      <xs:pattern value="b[0-9]{9}[0-9X]"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="supportedLanguages">
    <xs:restriction base="xs:language">
      <xs:enumeration value="en"/>
      <xs:enumeration value="es"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="date">
    <xs:restriction base="xs:date">
      <xs:pattern value="[^:Z]*"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="name" type="string32"/>
  <xs:element name="qualification" type="string255"/>
  <xs:element name="born" type="date"/>
  <xs:element name="dead" type="date"/>
  <xs:element name="isbn" type="isbn"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="supportedLanguages"/>
  <xs:element name="title">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="string255">
          <xs:attribute ref="lang"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="author">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="dead" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="isbn"/>
        <xs:element ref="title"/>
        <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/> 
        <xs:element ref="character" minOccurs="0"
          maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="id" type="bookID"/>
      <xs:attribute ref="available"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="character">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="qualification"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
