RELAX NG includes a native type system, but this type library has been kept minimal by design because more complete type libraries are available. It consists of just two datatypes (token and string) that differ only in the whitespace processing applied before validation. The whole RELAX NG datatype system can be seen as a mechanism for adding validating transformations to text nodes. These transformations change text nodes into canonical formats (formats in which all the different representations of the same value are converted into a single normalized or “canonical” format).
The two native datatypes don’t detect format errors (their formats are broad enough to allow any value) but still transform text nodes into their canonical forms, which can make a difference for enumerations. Other datatype libraries, covered in Chapter 8, can detect format errors.
Enumerations are the first place you can see datatypes at work. Applying datatypes to enumeration values is done by adding a type attribute to value patterns. Up to now, we haven’t specified any datatype when we’ve written value elements. By default, they have the type token from the built-in library. Text values of this datatype receive full whitespace normalization similar to that performed by the XPath normalize-space() function: all sequences of one or more whitespace characters, that is, #x20 (space), #x9 (tab), #xA (linefeed), and #xD (carriage return), are replaced by a single space, and any leading and trailing space is then trimmed.
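The normalization just described can be sketched in a few lines of Python. This is only an illustration of the rule, not code taken from any RELAX NG implementation; the helper name normalize_token is hypothetical.

```python
import re

# The four XML whitespace characters: space, tab, linefeed, carriage return.
_XML_WHITESPACE_RUN = re.compile(r"[\x20\x09\x0A\x0D]+")

def normalize_token(text: str) -> str:
    """Collapse each run of XML whitespace to one space, then trim both ends."""
    collapsed = _XML_WHITESPACE_RUN.sub(" ", text)
    return collapsed.strip(" ")

print(repr(normalize_token("  on \t hold\r\n")))  # prints 'on hold'
```

Only the four characters listed above are treated as whitespace here, so a character such as a no-break space is left untouched.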
Reconsidering previous examples, writing:
<attribute name="available">
  <choice>
    <value>available</value>
    <value>checked out</value>
    <value>on hold</value>
  </choice>
</attribute>
or:
attribute available { "available" | "checked out" | "on hold" }
uses the default datatype (token) and is equivalent to the following:
<attribute name="available">
  <choice>
    <value type="token">available</value>
    <value type="token">checked out</value>
    <value type="token">on hold</value>
  </choice>
</attribute>
or:
attribute available { token "available" | token "checked out" | token "on hold" }
When the token datatype is used, whitespace normalization is applied both to the value defined in the schema and to the value found in the instance document. The comparison is done on the normalized results, which explains why "on hold" in the schema matched instance values in which spaces or tabs had been added before, between, and after the words.
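As a rough sketch of this comparison, the following Python normalizes both sides before testing membership in the enumeration. The function names and the ALLOWED list are hypothetical and only mirror the attribute example above; a real validator works on the schema itself.

```python
import re

# Enumeration values as written in the schema example.
ALLOWED = ["available", "checked out", "on hold"]

def normalize(text: str) -> str:
    """Whitespace normalization applied by the token datatype."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")

def matches_token_choice(instance_value: str, allowed=ALLOWED) -> bool:
    """Compare normalized instance text against normalized schema values."""
    normalized_allowed = {normalize(value) for value in allowed}
    return normalize(instance_value) in normalized_allowed

print(matches_token_choice("  on \t hold "))  # prints True
print(matches_token_choice("onhold"))         # prints False
```

Note that normalization only affects whitespace: a value with the space removed or with different capitalization still fails to match.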
The name of the token datatype, borrowed from W3C XML Schema, is highly confusing. In IT jargon, a token is a piece of a string between two delimiters, what is called a “word” in plain English. The token datatype doesn’t denote a word. If it did, “on” and “hold” would be valid tokens, but “on hold” wouldn’t. The token datatype is more of a “tokenized” datatype, in the sense that it’s a string that can be easily cut into tokens once nonsignificant whitespace is removed.
This confusion is dangerous because it can cause you to use the string datatype when what you need is token. (You’ll see later in this chapter that the string datatype should be reserved for select cases.)
To suppress this normalization, you can specify the second built-in datatype, string, which doesn’t perform any transformation on the values before comparing them to the specified value:
<attribute name="available">
  <choice>
    <value type="string">available</value>
    <value type="string">checked out</value>
    <value type="string">on hold</value>
  </choice>
</attribute>
or:
attribute available { string "available" | string "checked out" | string "on hold" }
Using the new definition, the value of our attribute must exactly match one of the values specified in the schema: available, checked out, or on hold. No extra whitespace is permitted.
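For contrast, the string datatype can be pictured as a plain equality test with no preprocessing. The sketch below is illustrative only; matches_string_choice and ALLOWED are hypothetical names, and the instance value is assumed to be the text as the validator receives it.

```python
# Enumeration values as written in the schema example.
ALLOWED = ["available", "checked out", "on hold"]

def matches_string_choice(instance_value: str, allowed=ALLOWED) -> bool:
    """Exact comparison: no whitespace is collapsed or trimmed."""
    return instance_value in allowed

print(matches_string_choice("on hold"))   # prints True
print(matches_string_choice(" on hold"))  # prints False (leading space)
print(matches_string_choice("on  hold"))  # prints False (two inner spaces)
```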
The native token and string datatypes have the same basic definition as the W3C XML Schema token and string datatypes. The difference is that additional restrictions, which can be applied to the W3C XML Schema datatypes using param elements, aren’t available with RELAX NG’s native datatypes. More details are provided in Chapter 8.