Among the different facets available to restrict the lexical space of simple datatypes, the most flexible (and also the one that we will often use as a last resort when all the other facets are unable to express the restriction on a user-defined datatype) is based on regular expressions.
Patterns (and regular expressions in general) are like a Swiss army knife when constraining simple datatypes. They are highly flexible, can compensate for many of the limitations of the other facets, and are often used to define user datatypes on various formats such as ISBN numbers, telephone numbers, or custom date formats. However, like a Swiss army knife, patterns have their own limitations.
Multirange datatypes (such as integers between -1 and 5 or 10 and 15) can be defined as a union of datatypes meeting the different ranges (in this case, we could perform a union between a datatype accepting integers between -1 and 5 and a second datatype accepting integers between 10 and 15); however, after the union, the resulting datatype loses its integer semantics and can no longer be constrained using integer facets. Using patterns to define multirange datatypes is therefore an option: although less readable than a union, it preserves the semantics of the base type.
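As a sketch of the pattern-based alternative for the two ranges mentioned above, the following definition keeps xs:integer as its base type. The type name multiRangeInteger is invented for this illustration, the syntax used is explained later in this chapter, and the pattern deliberately rejects a leading "+" and leading zeros, which a union of range-restricted integers would accept:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: integers from -1 to 5, or from 10 to 15 -->
  <xs:simpleType name="multiRangeInteger">
    <xs:restriction base="xs:integer">
      <!-- "-1", one digit from 0 to 5, or "1" followed by a digit from 0 to 5 -->
      <xs:pattern value="-1|[0-5]|1[0-5]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Because the base type is still xs:integer, a later restriction of this type could add integer facets such as xs:maxInclusive.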
Cutting a tree with a Swiss army knife is long, tiring, and dangerous. Writing regular expressions may also become long, tiring, and dangerous when the number of combinations grows. One should try to keep them as simple as possible.
A Swiss army knife cannot change lead into gold, and no facet can change the primary type of a simple datatype. A string datatype restricted to match a custom date format will still retain the properties of a string and will never acquire the facets of a datetime datatype. This means that there is no effective way to express localized date formats.
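To make this limitation concrete, here is a minimal sketch of a string restricted to a DD/MM/YYYY layout. The type name customDate and the chosen layout are assumptions made for this illustration only:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a string that merely looks like DD/MM/YYYY -->
  <xs:simpleType name="customDate">
    <xs:restriction base="xs:string">
      <!-- Two digits, a slash, two digits, a slash, four digits -->
      <xs:pattern value="[0-9]{2}/[0-9]{2}/[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Such a type accepts "31/12/2001" but also "99/99/9999": the validator checks only the shape of the string, and date facets such as xs:minInclusive cannot be applied with date semantics.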
In their simplest form, patterns may be used as enumerations applied to the lexical space rather than on the value space.
If, for instance, we have a byte value that can only take the values “1,” “5,” or “15,” the classical way to define such a datatype is to use the xs:enumeration facet:
<xs:simpleType name="myByte"> <xs:restriction base="xs:byte"> <xs:enumeration value="1"/> <xs:enumeration value="5"/> <xs:enumeration value="15"/> </xs:restriction> </xs:simpleType>
This is the “normal” way of defining this datatype if it matches the lexical space and the value space of an xs:byte. It gives the flexibility to accept instance documents with values such as “1,” “5,” and “15,” but also “01” or “0000005.” One of the particularities of xs:pattern is that it is the only facet constraining the lexical space. If we have an application that is disturbed by leading zeros, we can use patterns instead of enumerations to define our datatype:
<xs:simpleType name="myByte"> <xs:restriction base="xs:byte"> <xs:pattern value="1"/> <xs:pattern value="5"/> <xs:pattern value="15"/> </xs:restriction> </xs:simpleType>
This new datatype is still derived from xs:byte and has the semantic of a byte, but its lexical space is now constrained to accept only “1,” “5,” and “15,” leaving out any variation that has the same value but a different lexical representation.
This is an important difference from Perl regular expressions, on which W3C XML Schema patterns are built. A Perl expression such as /15/ matches any string containing “15,” while the W3C XML Schema pattern matches only the string equal to “15.” The Perl expression equivalent to this pattern is thus /^15$/.
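Conversely, to obtain the “contains” behavior of the Perl expression /15/ in a W3C XML Schema pattern, the surrounding text must be matched explicitly. The following is a minimal sketch, with a type name (contains15) chosen only for illustration; it uses the “.” wildcard and the “*” quantifier described later in this chapter:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: any token that contains "15" somewhere -->
  <xs:simpleType name="contains15">
    <xs:restriction base="xs:token">
      <!-- Anything, then "15", then anything: the pattern must cover the whole value -->
      <xs:pattern value=".*15.*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

The base type xs:token is used here because its whitespace processing replaces line breaks, which the “.” wildcard does not match, with spaces.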
This example has been carefully chosen to avoid using any of the meta characters used within patterns, which are: “.”, “\”, “?”, “*”, “+”, “{”, “}”, “(”, “)”, “[”, and “]”. We will see the meaning of these characters later in this chapter; for the moment, we just need to know that each of these characters needs to be “escaped” by a leading “\” to be used as a literal. For instance, to define a similar datatype for a decimal whose lexical space is limited to “1” and “1.5,” we write:
<xs:simpleType name="myDecimal"> <xs:restriction base="xs:decimal"> <xs:pattern value="1"/> <xs:pattern value="1\.5"/> </xs:restriction> </xs:simpleType>
A common source of errors is that “normal” characters should not be escaped: we will see later that a leading “\” changes their meaning (for instance, “\s” matches all the XML whitespaces and not the character “s”).
Despite an apparent similarity, the xs:pattern facet interprets its value attribute in a very different way than xs:enumeration does. xs:enumeration reads the value as a lexical representation, and converts it to the corresponding value for its base datatype, while xs:pattern reads the value as a set of conditions to apply on lexical values. When we write:
<xs:pattern value="15"/>
we specify three conditions (first character equals “1,” second character equals “5,” and the string must finish after this). Each of the matching conditions (such as first character equals “1” and second character equals “5”) is called a piece. This is just the simplest form to specify a piece.
Each piece in a pattern is composed of an atom identifying a character, or a set of characters, and an optional quantifier. Characters (except special characters that must be escaped) are the simplest form of atoms. In our example, we have omitted the quantifiers.
Quantifiers may be defined using two different syntaxes: either a special character (* for 0 or more, + for one or more, and ? for 0 or 1) or a numeric range within curly braces ({n} for exactly n times, {n,m} for between n and m times, or {n,} for n or more times).
Using these quantifiers, we can merge our three patterns into one:
<xs:simpleType name="myByte"> <xs:restriction base="xs:byte"> <xs:pattern value="1?5?"/> </xs:restriction> </xs:simpleType>
This new pattern means there must be zero or one character (“1”) followed by zero or one character (“5”). This is not exactly the same meaning as our three previous patterns since the empty string “” is now accepted by the pattern. However, since the empty string doesn’t belong to the lexical space of our base type (xs:byte), the new datatype has the same lexical space as the previous one.
We could also use quantifiers to limit the number of leading zeros. For instance, the following pattern allows at most two leading zeros:
<xs:simpleType name="myByte"> <xs:restriction base="xs:byte"> <xs:pattern value="0{0,2}1?5?"/> </xs:restriction> </xs:simpleType>
By this point, we have seen the simplest atoms that can be used in a pattern: “1,” “5,” and “\.” are atoms that exactly match a character. The other atoms that can be used in patterns are special characters, a wildcard that matches any character, or predefined and user-defined character classes.
Table 6-1 shows the list of atoms that match a single character, exactly like the characters we have already seen, but also correspond to characters that must be escaped or (for the first three characters on the list) that are just provided for convenience.
The character “.” has a special meaning: it’s a wildcard atom that matches any XML valid character except newlines and carriage returns. As with any atom, “.” may be followed by an optional quantifier and “.*” is a common construct to match zero or more occurrences of any character. To illustrate the usage of “.*” (and the fact that xs:pattern is a Swiss army knife), a pattern may be used to define the integers that are multiples of 10:
<xs:simpleType name="multipleOfTen"> <xs:restriction base="xs:integer"> <xs:pattern value=".*0"/> </xs:restriction> </xs:simpleType>
W3C XML Schema has adopted the “classical” Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl).
W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each of these classes is designated by a single letter preceded by “\”; the classes designated by the upper- and lowercase versions of the same letter are complementary:
\s
Spaces. Matches the XML whitespaces (space #x20, tabulation #x09, line feed #x0A, and carriage return #x0D).
\S
Characters that are not spaces.
\d
Digits.
\D
Characters that are not digits.
\w
Extended “word” characters (any Unicode character not defined as “punctuation,” “separator,” or “other”). This conforms to the Perl definition, assuming UTF-8 support has been switched on.
\W
Nonword characters.
\i
XML 1.0 initial name characters (i.e., all the “letters” plus “_” and “:”). This is a W3C XML Schema extension over Perl regular expressions.
\I
Characters that may not be used as an XML initial name character.
\c
XML 1.0 name characters (initial name characters, digits, “.”, “:”, “-”, and the characters defined by Unicode as “combining” or “extender”). This is a W3C XML Schema extension to Perl regular expressions.
\C
Characters that may not be used in an XML 1.0 name.
These character classes may be used with an optional quantifier like any other atom. The last pattern that we saw:
<xs:pattern value=".*0"/>
constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is an xs:integer, this is good enough for our purposes, but if the base type had been an xs:decimal (or xs:string), we could be more restrictive and write:
<xs:pattern value="-?\d*0"/>
This checks that the characters before the trailing zero are digits with an optional leading - (we will see later on in Section 6.5.2.2 how to specify an optional leading - or +).
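As an illustration of the two XML-specific classes, the following sketch combines \i and \c with a quantifier to accept tokens shaped like XML 1.0 names. The type name nameLikeToken is an assumption made for this example:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a token shaped like an XML 1.0 name -->
  <xs:simpleType name="nameLikeToken">
    <xs:restriction base="xs:token">
      <!-- One initial name character followed by zero or more name characters -->
      <xs:pattern value="\i\c*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Values such as "book" or "xs:element" are accepted, while "1book" is rejected because a digit is not an initial name character.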
Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify the characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.), while blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).
The syntax \p{Name} is similar for blocks and categories; the prefix Is is added to the name of blocks to make the distinction (\p{IsBasicLatin} designates a block, \p{L} a category). The syntax \P{Name} is also available to select the characters that do not match a block or category. A list of Unicode blocks and categories is given in the specification. Table 6-2 shows the Unicode character classes and Table 6-3 shows the Unicode character blocks.
Unicode Character Class | Includes
C | Other characters (non-letters, non-symbols, non-numbers, non-separators)
Cc | Control characters
Cf | Format characters
Cn | Unassigned code points
Co | Private use characters
L | Letters
Ll | Lowercase letters
Lm | Modifier letters
Lo | Other letters
Lt | Titlecase letters
Lu | Uppercase letters
M | All Marks
Mc | Spacing combining marks
Me | Enclosing marks
Mn | Non-spacing marks
N | Numbers
Nd | Decimal digits
Nl | Number letters
No | Other numbers
P | Punctuation
Pc | Connector punctuation
Pd | Dashes
Pe | Closing punctuation
Pf | Final quotes (may behave like Ps or Pe)
Pi | Initial quotes (may behave like Ps or Pe)
Po | Other forms of punctuation
Ps | Opening punctuation
S | Symbols
Sc | Currency symbols
Sk | Modifier symbols
Sm | Mathematical symbols
So | Other symbols
Z | Separators
Zl | Line breaks
Zp | Paragraph breaks
Zs | Spaces
Unicode Character Block |
AlphabeticPresentationForms |
Arabic |
ArabicPresentationForms-A |
ArabicPresentationForms-B |
Armenian |
Arrows |
BasicLatin |
Bengali |
BlockElements |
Bopomofo |
BopomofoExtended |
BoxDrawing |
BraillePatterns |
ByzantineMusicalSymbols |
Cherokee |
CJKCompatibility |
CJKCompatibilityForms |
CJKCompatibilityIdeographs |
CJKCompatibilityIdeographsSupplement |
CJKRadicalsSupplement |
CJKSymbolsandPunctuation |
CJKUnifiedIdeographs |
CJKUnifiedIdeographsExtensionA |
CJKUnifiedIdeographsExtensionB |
CombiningDiacriticalMarks |
CombiningHalfMarks |
CombiningMarksforSymbols |
ControlPictures |
CurrencySymbols |
Cyrillic |
Deseret |
Devanagari |
Dingbats |
EnclosedAlphanumerics |
EnclosedCJKLettersandMonths |
Ethiopic |
GeneralPunctuation |
GeometricShapes |
Georgian |
Gothic |
Greek |
GreekExtended |
Gujarati |
Gurmukhi |
HalfwidthandFullwidthForms |
HangulCompatibilityJamo |
HangulJamo |
HangulSyllables |
Hebrew |
HighPrivateUseSurrogates |
HighSurrogates |
Hiragana |
IdeographicDescriptionCharacters |
IPAExtensions |
Kanbun |
KangxiRadicals |
Kannada |
Katakana |
Khmer |
Lao |
Latin-1Supplement |
LatinExtended-A |
LatinExtendedAdditional |
LatinExtended-B |
LetterlikeSymbols |
LowSurrogates |
Malayalam |
MathematicalAlphanumericSymbols |
MathematicalOperators |
MiscellaneousSymbols |
MiscellaneousTechnical |
Mongolian |
MusicalSymbols |
Myanmar |
NumberForms |
Ogham |
OldItalic |
OpticalCharacterRecognition |
Oriya |
PrivateUse |
PrivateUse |
PrivateUse |
Runic |
Sinhala |
SmallFormVariants |
SpacingModifierLetters |
Specials |
Specials |
SuperscriptsandSubscripts |
Syriac |
Tags |
Tamil |
Telugu |
Thaana |
Thai |
Tibetan |
UnifiedCanadianAboriginalSyllabics |
YiRadicals |
YiSyllables |
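The two kinds of classes can be sketched as follows, using one category-based pattern and one block-based pattern. Both type names (capitalizedWord and greekToken) are invented for this illustration:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type using categories: one uppercase letter, then lowercase letters -->
  <xs:simpleType name="capitalizedWord">
    <xs:restriction base="xs:token">
      <xs:pattern value="\p{Lu}\p{Ll}*"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- Hypothetical type using a block: note the "Is" prefix before the block name -->
  <xs:simpleType name="greekToken">
    <xs:restriction base="xs:token">
      <xs:pattern value="\p{IsGreek}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

The first type is independent of the script (it accepts “Paris” as well as “Ελλάδα”), while the second accepts any character of the Greek block, whether or not it is a letter.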
We don’t yet know how to specify intersections between a block and a category in a single pattern, or how to specify that a datatype must be composed of only basic Latin letters. So, to “cross” these classifications and define the intersection of the category L (all the letters) and the block BasicLatin (ASCII characters, up to #x7F), we can perform two successive restrictions:
<xs:simpleType name="BasicLatinLetters"> <xs:restriction> <xs:simpleType> <xs:restriction base="xs:token"> <xs:pattern value="\p{IsBasicLatin}*"/> </xs:restriction> </xs:simpleType> <xs:pattern value="\p{L}*"/> </xs:restriction> </xs:simpleType>
User-defined character classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list, for instance:
[azertyuiop]
to define the list of letters on the first row of a French keyboard,
[a-z]
to specify all the characters between “a” and “z”,
[^a-z]
for all the characters that are not between “a” and “z,” but also
[-^\\]
to define the characters “-,” “^,” and “\,” or
[-+]
to specify a decimal sign.
These examples are enough to see that what’s between these square brackets follows a specific syntax and semantic. Like the regular expression’s main syntax, we have a list of atoms, but instead of matching each atom against a successive character of the instance string, we define a set: the character class is the set of characters matching any of the atoms found between the brackets.
We also see two special characters that have a different meaning depending on their location! The character -, which is a range delimiter when it is between a and z, is a normal character when it is just after the opening bracket or just before the closing bracket ([+-] and [-+] are, therefore, both legal). Conversely, ^, which is a negator when it appears at the beginning of a class, loses this special meaning to become a normal character later in the class definition.
We also notice that characters may or must be escaped: “\\” is used to match the character “\”. In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:
[-^\\]
can also be written as:
[\-\^\\]
or as:
[\^\\\-]
since the location of the characters doesn’t matter any longer when they are escaped.
Within square brackets, the character “\” also keeps its meaning of a reference to a Perl or Unicode class. The following:
[\d\p{Lu}]
is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category “Lu”).
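As a sketch of how such a class might be used in a complete definition, the following type accepts codes made of exactly eight characters, each being a digit or an uppercase letter. The type name productCode and the length of eight are assumptions made for this illustration:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: eight characters, each a digit or an uppercase letter -->
  <xs:simpleType name="productCode">
    <xs:restriction base="xs:token">
      <!-- The class is a single atom, so the quantifier applies to the whole class -->
      <xs:pattern value="[\d\p{Lu}]{8}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

With this definition, “AB12CD34” is accepted, while “ab12cd34” (lowercase letters) and “AB12” (too short) are rejected.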
Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we already saw two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ realizes the complement of the set defined in the square bracket). W3C XML Schema extended the syntax of the Perl regular expressions to introduce a third operation: the difference between sets. The syntax follows:
[set1-[set2]]
Its meaning is all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.
This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin Letters as:
[\p{IsBasicLatin}-[^\p{L}]]
Or, using the \P construct, which is also a complement, we can define the class as:
[\p{IsBasicLatin}-[\P{L}]]
The corresponding datatype definition would be:
<xs:simpleType name="BasicLatinLetters"> <xs:restriction base="xs:token"> <xs:pattern value="[\p{IsBasicLatin}-[\P{L}]]*"/> </xs:restriction> </xs:simpleType>
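The difference operator is also handy with plain ranges. A minimal sketch, with a type name (lowercaseConsonants) chosen only for illustration, removes the vowels from the range of unaccented lowercase letters:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: one or more lowercase ASCII letters, vowels excluded -->
  <xs:simpleType name="lowercaseConsonants">
    <xs:restriction base="xs:token">
      <!-- set1 is the range a-z, set2 is the list of vowels to subtract -->
      <xs:pattern value="[a-z-[aeiou]]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Here “rhythm” is accepted, while “rhyme” is rejected because of its final “e”.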
In our first example pattern, we used three separate patterns to express three possible values. We can condense this definition using the “|” character, which is the “or” operator when used outside square brackets. The simple type definition is then:
<xs:simpleType name="myByte"> <xs:restriction base="xs:byte"> <xs:pattern value="1|5|15"/> </xs:restriction> </xs:simpleType>
This syntax is more concise, but whether or not it’s more readable is subject to discussion. Also, these “ors” would not be very interesting if it were not possible to use them in conjunction with groups. Groups are complete regular expressions, which are, themselves, considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed by parentheses (“(” and “)”). To define a comma-separated list of “1,” “5,” or “15,” ignoring whitespaces between values and commas, the following pattern could be used:
<xs:simpleType name="myListOfBytes"> <xs:restriction base="xs:token"> <xs:pattern value="(1|5|15)( *, *(1|5|15))*"/> </xs:restriction> </xs:simpleType>
Note how we have relied on the whitespace processing of the base datatype (xs:token collapses the whitespaces). We have not tested leading and trailing whitespaces, which are trimmed, and we only need to allow for single occurrences of spaces, which is done with the atom " *" placed before and after each comma.
After this overview of the syntax used by patterns, let’s see some common patterns that you may have to use (or adapt) in your schemas or just consider as examples.
Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.
Unicode is a great asset of XML; however, there are few applications able to process and display all the characters of the Unicode set correctly and still fewer users able to read them! If you need to check that your string datatypes belong to one (or more) Unicode blocks, you can derive them from basic types such as:
<xs:simpleType name="BasicLatinToken"> <xs:restriction base="xs:token"> <xs:pattern value="\p{IsBasicLatin}*"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="Latin-1Token"> <xs:restriction base="xs:token"> <xs:pattern value="[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"/> </xs:restriction> </xs:simpleType>
Note that such patterns do not impose a character encoding on the document itself and that, for instance, the Latin-1Token datatype could validate instance documents using UTF-8, UTF-16, ISO-8859-1, or another encoding. (This assumes the characters used in this string belong to the two Unicode blocks BasicLatin and Latin-1Supplement.) In other words, working on the lexical space, i.e., after the transformations have been done by the parser, these patterns do not control the physical format of the instance documents.
We have already seen a trick to count the words using a dummy derivation by list; however, this derivation counts only whitespace-separated “words,” ignoring the punctuation that was treated like normal characters. We can limit the number of words using a couple of patterns. To do so, we can define an atom, which is a sequence of one or more “word” characters (\w+) followed by one or more nonword characters (\W+), and control its number of occurrences. If we are not very strict on the punctuation, we also need to allow an arbitrary number of nonword characters at the beginning of our value and to deal with the possibility of a value ending with a word (without further separation). One of the ways to avoid any ambiguity at the end of the string is to dissociate the last occurrence of a word to make the trailing separator optional:
<xs:simpleType name="story100-200words"> <xs:restriction base="xs:token"> <xs:pattern value="\W*(\w+\W+){99,199}\w+\W*"/> </xs:restriction> </xs:simpleType>
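The bounds of the quantifier are one less than the word counts because the last word is matched separately. A scaled-down sketch of the same technique, with a type name (oneToThreeWords) invented for this illustration, makes the arithmetic easier to check by hand:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: between one and three words -->
  <xs:simpleType name="oneToThreeWords">
    <xs:restriction base="xs:token">
      <!-- Zero to two "word plus separator" groups, then the mandatory last word -->
      <xs:pattern value="\W*(\w+\W+){0,2}\w+\W*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

With this definition, “Hello” and “Hello, dear reader!” are accepted, while “one two three four” is rejected since it would require three repetitions of the group.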
We have seen that xs:anyURI doesn’t care about “absolutizing” relative URIs and it may be wise to impose the usage of absolute URIs, which are easier to process. Furthermore, it can also be interesting for some applications to limit the accepted URI schemes. This can easily be done by a set of patterns such as:
<xs:simpleType name="httpURI"> <xs:restriction base="xs:anyURI"> <xs:pattern value="http://.*"/> </xs:restriction> </xs:simpleType>
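When several schemes are acceptable, a group with alternatives can list them in a single pattern. The following is only a sketch: the type name webURI and the choice of the three schemes are assumptions made for this illustration:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: absolute URIs limited to three schemes -->
  <xs:simpleType name="webURI">
    <xs:restriction base="xs:anyURI">
      <!-- The group lists the accepted schemes; ".*" accepts the rest of the URI -->
      <xs:pattern value="(http|https|ftp)://.*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

The same lexical space could be obtained with three separate xs:pattern elements in the same restriction, one per scheme.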
While numeric types aren’t strictly text, patterns can still be used appropriately to constrain their lexical form.
Getting rid of leading zeros is quite simple but requires some precautions if we want to keep the optional sign and the number “0” itself. This can be done using patterns such as:
<xs:simpleType name="noLeadingZeros"> <xs:restriction base="xs:integer"> <xs:pattern value="[+-]?([1-9][0-9]*|0)"/> </xs:restriction> </xs:simpleType>
Note that in this pattern, we chose to redefine all the lexical rules that apply to an integer. This pattern would give the same lexical space applied to an xs:token datatype as on an xs:integer. We could also have relied on the knowledge of the base datatype and written:
<xs:simpleType name="noLeadingZeros"> <xs:restriction base="xs:integer"> <xs:pattern value="[+-]?([^0].*|0)"/> </xs:restriction> </xs:simpleType>
Relying on the base datatype in this manner can produce simpler patterns, but can also be more difficult to interpret since we would have to combine the lexical rules of the base datatype with the rules expressed by the pattern to understand the result.
The maximum number of digits can be fixed using xs:totalDigits and xs:fractionDigits. However, these facets are only maximum numbers and work on the value space. If we want to fix the format of the lexical space to be, for instance, “DDDD.DD”, we can write a pattern such as:
<xs:simpleType name="fixedDigits"> <xs:restriction base="xs:decimal"> <xs:pattern value="[+-]?.{4}\..{2}"/> </xs:restriction> </xs:simpleType>
Dates and time have complex lexical representations. Patterns can give developers extra control over how they are used.
The time zone support of W3C XML Schema is quite controversial and needs some additional constraints to avoid comparison problems. These patterns can be kept relatively simple since the syntax of the datetime is already checked by the schema validator and only simple additional checks need to be added. Applications which require that their datetimes specify a time zone may use the following template, which checks that the time part ends with a “Z” or contains a sign:
<xs:simpleType name="dateTimeWithTimezone"> <xs:restriction base="xs:dateTime"> <xs:pattern value=".+T.+(Z|[+-].+)"/> </xs:restriction> </xs:simpleType>
Still simpler, applications that want to make sure that none of their datetimes specify a time zone may just check that the time part doesn’t contain the characters “+”, “-”, or “Z”:
<xs:simpleType name="dateTimeWithoutTimezone"> <xs:restriction base="xs:dateTime"> <xs:pattern value=".+T[^Z+-]+"/> </xs:restriction> </xs:simpleType>
In these two datatypes, we used the separator “T”. This is convenient, since no occurrences of the signs can occur after this delimiter except in the time zone definition. This delimiter would be missing if we wanted to constrain dates instead of datetimes, but, in this case, we can detect the time zones on their “:” instead:
<xs:simpleType name="dateWithTimezone"> <xs:restriction base="xs:date"> <xs:pattern value=".+[:Z].*"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="dateWithoutTimezone"> <xs:restriction base="xs:date"> <xs:pattern value="[^:Z]*"/> </xs:restriction> </xs:simpleType>
Applications may also simply impose a set of time zones to use:
<xs:simpleType name="dateTimeInMyTimezones"> <xs:restriction base="xs:dateTime"> <xs:pattern value=".+\+02:00"/> <xs:pattern value=".+\+01:00"/> <xs:pattern value=".+\+00:00"/> <xs:pattern value=".+Z"/> <xs:pattern value=".+-04:00"/> </xs:restriction> </xs:simpleType>
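The same set of time zones can also be sketched as a single pattern by grouping the alternatives, which some readers may find easier to maintain. The type name below is hypothetical and the list of time zones is the one used above:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: same time zones, expressed with one group of alternatives -->
  <xs:simpleType name="dateTimeInMyTimezonesCondensed">
    <xs:restriction base="xs:dateTime">
      <!-- "Z", or +00:00 to +02:00 (the "+" must be escaped), or -04:00 -->
      <xs:pattern value=".+(Z|\+0[0-2]:00|-04:00)"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Note that the “+” sign is escaped when used as a literal outside square brackets, while the “-” sign needs no escaping there.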
We promised earlier to look at xs:duration and see how we can define two datatypes that have a complete sort order. The first datatype will consist of durations expressed only in months and years, and the second will consist of durations expressed only in days, hours, minutes, and seconds. The criteria used for the test can be the presence of a “D” (for day) or a “T” (the time delimiter). If neither of those characters is detected, then the datatype uses only year and month parts. The test for the other type cannot be based on the absence of “Y” and “M”, since there is also an “M” in the time part. We can test that, after an optional sign, the first field is either the day part or the “T” delimiter:
<xs:simpleType name="YMduration"> <xs:restriction base="xs:duration"> <xs:pattern value="[^TD]+"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="DHMSduration"> <xs:restriction base="xs:duration"> <xs:pattern value="-?P((\d+D)|T).*"/> </xs:restriction> </xs:simpleType>
Let’s see where we can use our Swiss army knife in our library. The first datatype, which we promised to improve at the end of the last chapter, is the ISBN number. Without going into the details of how an ISBN number is constituted (which can’t be fully checked with W3C XML Schema), we can check that the total number of characters actually used is 10 and limit its contents to digits and the letter “X”:
<xs:simpleType name="isbn"> <xs:restriction base="xs:NMTOKEN"> <xs:length value="10"/> <xs:pattern value="[0-9]{9}[0-9X]"/> </xs:restriction> </xs:simpleType>
You may wonder why we kept the xs:length, since as far as validation is concerned, it is less constraining than the xs:pattern that we added. This is a question worth asking, but it doesn’t have a complete answer yet. However, applications which use the PSVI as a source of meta information may or may not be able to deduce from a pattern that the length of a string has been fixed. It might be good practice to keep redundant facets to provide extra information to these future applications.
W3C XML Schema doesn’t allow expression of the fact that the book ID is the same value as the ISBN number with a “b” used as a prefix, but we can still define that it is a “b” with 9 digits and a trailing digit or “X”:
<xs:simpleType name="bookID"> <xs:restriction base="xs:ID"> <xs:pattern value="b[0-9]{9}[0-9X]"/> </xs:restriction> </xs:simpleType>
To use this new datatype, we must be aware that we are using a global attribute that was referenced in the element book, but that was also referenced in the elements character and author, which do not have the same format. This is the main limitation in using global elements and attributes: they can be referenced only if they have the same types at all the locations in which they appear. We can work around this problem by creating a local attribute definition for the id attribute of book with the new datatype.
The last things we may want to constrain are the dates for which no time zones are needed and which, in fact, could just be a potential source of issues if we need to compare them:
<xs:simpleType name="date"> <xs:restriction base="xs:date"> <xs:pattern value="[^:Z]*"/> </xs:restriction> </xs:simpleType>
Our new schema is then:
<?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:simpleType name="string255"> <xs:restriction base="xs:token"> <xs:maxLength value="255"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="string32"> <xs:restriction base="xs:token"> <xs:maxLength value="32"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="isbn"> <xs:restriction base="xs:NMTOKEN"> <xs:length value="10"/> <xs:pattern value="[0-9]{9}[0-9X]"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="bookID"> <xs:restriction base="xs:ID"> <xs:pattern value="b[0-9]{9}[0-9X]"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="supportedLanguages"> <xs:restriction base="xs:language"> <xs:enumeration value="en"/> <xs:enumeration value="es"/> </xs:restriction> </xs:simpleType> <xs:simpleType name="date"> <xs:restriction base="xs:date"> <xs:pattern value="[^:Z]*"/> </xs:restriction> </xs:simpleType> <xs:element name="name" type="string32"/> <xs:element name="qualification" type="string255"/> <xs:element name="born" type="date"/> <xs:element name="dead" type="date"/> <xs:element name="isbn" type="isbn"/> <xs:attribute name="id" type="xs:ID"/> <xs:attribute name="available" type="xs:boolean"/> <xs:attribute name="lang" type="supportedLanguages"/> <xs:element name="title"> <xs:complexType> <xs:simpleContent> <xs:extension base="string255"> <xs:attribute ref="lang"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element> <xs:element name="library"> <xs:complexType> <xs:sequence> <xs:element ref="book" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="author"> <xs:complexType> <xs:sequence> <xs:element ref="name"/> <xs:element ref="born"/> <xs:element ref="dead" minOccurs="0"/> </xs:sequence> <xs:attribute ref="id"/> </xs:complexType> </xs:element> <xs:element name="book"> <xs:complexType> <xs:sequence> <xs:element ref="isbn"/> <xs:element ref="title"/> <xs:element ref="author" minOccurs="0" 
maxOccurs="unbounded"/> <xs:element ref="character" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="id" type="bookID"/> <xs:attribute ref="available"/> </xs:complexType> </xs:element> <xs:element name="character"> <xs:complexType> <xs:sequence> <xs:element ref="name"/> <xs:element ref="born"/> <xs:element ref="qualification"/> </xs:sequence> <xs:attribute ref="id"/> </xs:complexType> </xs:element> </xs:schema>