Sometimes, building an extensible schema is a matter of capturing existing practice in RELAX NG, while other times, the schema development comes before practice, and the schema developer has the opportunity to make a lot of choices. You often have to do your best to write an extensible schema for an existing XML vocabulary and are constrained by the existing vocabulary. Other times you can design whatever vocabulary seems appropriate to the information being described.
In the case of a fixed result, the only way to manage extensibility relies on how named patterns are defined, much the same way that programmers’ decisions about how to define classes in object-oriented environments have a lot of impact on its extensibility. In this section, I will examine the major approaches to use when defining named patterns and start elements with extensibility in mind.
Let’s look back at our first schema, the Russian doll schema:
<?xml version="1.0" encoding="utf-8" ?> <element xmlns="http://relaxng.org/ns/structure/1.0" name="library"> <oneOrMore> <element name="book"> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <text/> </element> <element name="title"> <attribute name="xml:lang"/> <text/> </element> <zeroOrMore> <element name="author"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <optional> <element name="died"> <text/> </element> </optional> </element> </zeroOrMore> <zeroOrMore> <element name="character"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <element name="qualification"> <text/> </element> </element> </zeroOrMore> </element> </oneOrMore> </element>
or, in the compact syntax:
element library { element book { attribute id {text}, attribute available {text}, element isbn {text}, element title {attribute xml:lang {text}, text}, element author { attribute id {text}, element name {text}, element born {text}, element died {text}?}*, element character { attribute id {text}, element name {text}, element born {text}, element qualification {text}}* } + }
What if you want to derive a schema that has a new
id
attribute on the library
element? That’s simple: take our schema, copy it,
and edit it as a new one. There is no option for extensibility at all
because you can’t include a
schema that
doesn’t have a grammar
element as
a root.
The first thing to consider when you want a RELAX NG schema to be
extensible is that you always want the
root element to be a
grammar
element. In this case, the change,
producing
russian-doll.rng
, is minor:
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="library"> <oneOrMore> <element name="book"> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <text/> </element> <element name="title"> <attribute name="xml:lang"/> <text/> </element> <zeroOrMore> <element name="author"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <optional> <element name="died"> <text/> </element> </optional> </element> </zeroOrMore> <zeroOrMore> <element name="character"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <element name="qualification"> <text/> </element> </element> </zeroOrMore> </element> </oneOrMore> </element> </start> </grammar>
In the compact syntax, grammar
is implicit, but
you still need a start
pattern to be able to
redefine anything. The result of adding this pattern,
russian-doll.rnc
, looks like:
start = element library { element book { attribute id { text }, attribute available { text }, element isbn { text }, element title { attribute xml:lang { text }, text }, element author { attribute id { text }, element name { text }, element born { text }, element died { text }? }*, element character { attribute id { text }, element name { text }, element born { text }, element qualification { text } }* }+ }
Once these minor changes have been made, the schema can at least be included into another schema and modified there.
Although the previous schemas can be redefined, this redefinition is
ineffective because the granularity
is very coarse, and you can’t redefine just the
library
element. The best you can do is the
following, which isn’t much of an improvement:
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="russian-doll.rng"> <start> <element name="library"> <attribute name="id"/> <oneOrMore> <element name="book"> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <text/> </element> <element name="title"> <attribute name="xml:lang"/> <text/> </element> <zeroOrMore> <element name="author"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <optional> <element name="died"> <text/> </element> </optional> </element> </zeroOrMore> <zeroOrMore> <element name="character"> <attribute name="id"/> <element name="name"> <text/> </element> <element name="born"> <text/> </element> <element name="qualification"> <text/> </element> </element> </zeroOrMore> </element> </oneOrMore> </element> </start> </include> </grammar>
or:
include "russian-doll.rnc" { start = element library { attribute id { text }, element book { attribute id { text }, attribute available { text }, element isbn { text }, element title { attribute xml:lang { text }, text }, element author { attribute id { text }, element name { text }, element born { text }, element died { text }? }*, element character { attribute id { text }, element name { text }, element born { text }, element qualification { text } }* }+ } }
In other words, we still need to redefine the whole schema. We’ve made no gains in modularity, because any changes in the original schema aren’t propagated into our resulting schema. To fix this, we need to create finer-grained definitions. A first approach to finer granularity involves defining a named pattern for each element (as with the schema style imposed by DTDs). That approach leads to a schema similar to the flat schema seen in Chapter 5 called flat.rng :
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <ref name="library-element"/> </start> <define name="library-element"> <element name="library"> <oneOrMore> <ref name="book-element"/> </oneOrMore> </element> </define> <define name="author-element"> <element name="author"> <attribute name="id"/> <ref name="name-element"/> <ref name="born-element"/> <optional> <ref name="died-element"/> </optional> </element> </define> <define name="book-element"> <element name="book"> <attribute name="id"/> <attribute name="available"/> <ref name="isbn-element"/> <ref name="title-element"/> <zeroOrMore> <ref name="author-element"/> </zeroOrMore> <zeroOrMore> <ref name="character-element"/> </zeroOrMore> </element> </define> <define name="born-element"> <element name="born"> <text/> </element> </define> <define name="character-element"> <element name="character"> <attribute name="id"/> <ref name="name-element"/> <ref name="born-element"/> <ref name="qualification-element"/> </element> </define> <define name="died-element"> <element name="died"> <text/> </element> </define> <define name="isbn-element"> <element name="isbn"> <text/> </element> </define> <define name="name-element"> <element name="name"> <text/> </element> </define> <define name="qualification-element"> <element name="qualification"> <text/> </element> </define> <define name="title-element"> <element name="title"> <attribute name="xml:lang"/> <text/> </element> </define> </grammar>
or, in the compact syntax, flat.rnc :
start = library-element library-element = element library { book-element+ } author-element = element author { attribute id { text }, name-element, born-element, died-element? } book-element = element book { attribute id { text }, attribute available { text }, isbn-element, title-element, author-element*, character-element* } born-element = element born { text } character-element = element character { attribute id { text }, name-element, born-element, qualification-element } died-element = element died { text } isbn-element = element isbn { text } name-element = element name { text } qualification-element = element qualification { text } title-element = element title { attribute xml:lang { text }, text }
These new schemas are more verbose, but they’re also
much more extensible. To add our id
attribute, we
need only to redefine the library
element:
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="flat.rng"> <define name="library-element"> <element name="library"> <attribute name="id"/> <oneOrMore> <ref name="book-element"/> </oneOrMore> </element> </define> </include> </grammar>
or:
include "flat.rnc" { library-element = element library { attribute id { text }, book-element+ } }
All changes
made to the flat schemas—except to the
library
element—are now propagate through to
the derived schemas.
Although
the previous result is much more
extensible, we still have to redefine the complete content of the
library
element to add our id
attribute. We may have reduced the problem of redefinition our
Russian doll model had, but we haven’t eliminated
it. If we change our main vocabulary and add a new attribute or
element to the library
element in
flat.rng, the modification
isn’t automatically taken into account in our
schema. We’ll need to edit it.
The modification isn’t automatically transferred,
because the extensibility of a
named pattern doesn’t
cross element boundaries. Because we have the boundary of the
library
element included within our
library-element
named pattern, the content of this
element isn’t extensible, as shown in Figure 12-1.
To avoid this difficulty, we could have split our named patterns
according to the content of the elements rather than by the element
themselves. We would then have been able to add new content within
the library
element, as shown in Figure 12-2.
Generalizing this approach for all the definitions of all the elements leads to a schema that looks like flat-content.rng :
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <start> <element name="library"> <ref name="library-content"/> </element> </start> <define name="library-content"> <oneOrMore> <element name="book"> <ref name="book-content"/> </element> </oneOrMore> </define> <define name="book-content"> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <ref name="isbn-content"/> </element> <element name="title"> <ref name="title-content"/> </element> <zeroOrMore> <element name="author"> <ref name="author-content"/> </element> </zeroOrMore> <zeroOrMore> <element name="character"> <ref name="character-content"/> </element> </zeroOrMore> </define> <define name="author-content"> <attribute name="id"/> <element name="name"> <ref name="name-content"/> </element> <element name="born"> <ref name="born-content"/> </element> <optional> <element name="died"> <ref name="died-content"/> </element> </optional> </define> <define name="born-content"> <text/> </define> <define name="character-content"> <attribute name="id"/> <element name="name"> <ref name="name-content"/> </element> <element name="born"> <ref name="born-content"/> </element> <element name="qualification"> <ref name="qualification-content"/> </element> </define> <define name="died-content"> <text/> </define> <define name="isbn-content"> <text/> </define> <define name="name-content"> <text/> </define> <define name="qualification-content"> <text/> </define> <define name="title-content"> <attribute name="xml:lang"/> <text/> </define> </grammar>
or, in the compact syntax, flat-content.rnc :
start = element library { library-content } library-content = element book { book-content }+ book-content = attribute id { text }, attribute available { text }, element isbn { isbn-content }, element title { title-content }, element author { author-content }*, element character { character-content }* author-content = attribute id { text }, element name { name-content }, element born { born-content }, element died { died-content }? born-content = text character-content = attribute id { text }, element name { name-content }, element born { born-content }, element qualification { qualification-content } died-content = text isbn-content = text name-content = text qualification-content = text title-content = attribute xml:lang { text }, text
We can now take full advantage of the named pattern and, instead of
redefining it, we can combine it neatly with the
id
attribute:
<?xml version="1.0" encoding="utf-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="flat-content.rng"/> <define name="library-content" combine="interleave"> <attribute name="id"/> </define> </grammar>
or:
include "flat-content.rnc" library-content &= attribute id { text }
Because of the nature of the content, the extension can be done using
a combination by
interleave
. This method of combination is
frequently useful, when attributes or elements need to be added, but
it works only when the relative order isn’t
significant for the schema. Otherwise, you still need to redefine the
pattern or to combine it by choice.
When you are free to define the vocabulary, there are three
principal guidelines for designing extensible formats. The first one
is independent of any schema language. The second is specific to
RELAX NG and maximizes the usage of combination through
interleave
. The third is a way to minimize the
impact of interleave
on schemas that need to be
converted into W3C XML Schema or DTD schemas.
Attributes are generally difficult to extend. When choosing from among elements and attributes, people often base their choice on the relative ease of processing, styling, or transforming. Instead, you should probably focus on their extensibility.
Independent of any XML schema language, when you have an attribute in an instance document, you are pretty much stuck with it. Unless you replace it with an element, there is no way to extend it. You can’t add any child elements or attributes to it because it’s designed to be a leaf node and to remain a leaf node. Furthermore, you can’t extend the parent element to include a second instance of an attribute with the same name. (Attributes with duplicate names are forbidden by XML 1.0.) You are thus making an impact not only on the extensibility of the attribute but also on the extensibility of the parent element.
Because attributes can’t be annotated with new
attributes and because they can’t be duplicated,
they can’t be localized like elements through
duplication with different values of xml:lang
attributes. Because attributes are more difficult to localize, you
should avoid storing any text targeted at human consumers within
attributes. You never know whether your application will become
international. These attributes would make it more difficult to
localize.
To understand the reasons behind these limitations, it’s worth looking at the original use cases for attributes. Attributes were originally designed to hold metadata, information about the contents of the document. Elements themselves are a kind of metadata, labelling the content found in the document, and attributes are a mechanism for refining that metadata. (Data about metadata is still metadata.) Because of this, the editors of XML 1.0 decided that the lack of extensibility in XML attributes was not an issue.
Although most XML tools provide equal access to elements and attribute contents and don’t require attributes to contain exclusively metadata, the syntactic restrictions created by considering attributes to be metadata remain. Therefore, it’s wise to use attributes for what they’ve been designed for—metadata. My advice is to use attributes only when there is a good reason to do so: when the information is clearly metadata, and you have good reason to believe that it will not have to be extended.
In our example library,
identifiers are
good candidates for being attributes, but even
available
probably should have been specified
as an element. Although at first glance available
might be considered metadata (available
doesn’t directly affect the description of the
content of a book), other users looking at the
book
element may want this information item to be
able to store more details. They might even want to give it more
structure, to extend it to indicate whether the book is available as
a new or as a used item, for example.
There are times when these rules about metadata and attributes must be relaxed. You saw in Chapter 11 that it wasn’t a good idea to add foreign elements into a text-only element. Doing so transforms its content model from text to mixed content. It’s always risky to extend a text-only element by adding elements, while additional attributes usually pass unnoticed by existing applications. In this case, the lack of further extensibility may be compensated for by the short-term gain in backward compatibility between the vocabularies before and after the extension.
XML users often confuse the usage of elements and attributes. A common bad habit is the assumption that schemas should always enforce a fixed order among child elements. In other words, the relative order between subelements always matters.
Relative order is much less natural than you usually think, at least at the schema level. To draw a parallel with another technology: it’s considered poor practice to pay attention to the physical order of columns and rows in the table of a relational database. Furthermore, UML—the dominant modeling methodology—doesn’t attach any order to the attributes of classes, nor does it attach any order to relations between classes (unless specifically specified). UML attributes often represent not only XML attributes but also elements.
The main reasons people expect order to be required derive from limitations in DTDs and, more recently, in W3C XML Schema. Still, there are strong reasons to believe that when there is no special reason, relative order between subelements is something that should be left to the choice of those creating document instances, and you shouldn’t bother users and applications with enforcing an unnecessary constraint at the schema level.
In RELAX NG, defining content models in which the relative order of
child elements isn’t significant is almost as simple
as defining content models where it is significant.
It’s just a matter of adding
interleave
elements. When the relative order
isn’t significant, the definition is more extensible
because these content models can easily be extended through pattern
combinations using interleave
.
Using content models in which the relative order of child elements
isn’t significant makes it easier to add new
elements and attributes if necessary. I demonstrated this in the
example about the addition of the id
attribute in
the library
element in the first section of this
chapter.
Note that together with the “element or attribute” question, the issue of order significance is among the most controversial for XML experts. Technical constraints may, in some cases, justify enforcing element order in documents. These constraints come into play most notably during stream processing of huge documents; requiring information to appear in a specific order might permit the skipping of processing long content that otherwise needs to be buffered if this information came after the content. Other arguments for requiring that the order of elements is important—which I find to be far from obvious—include the assertion that there is “disorder” carried by documents in which element order isn’t enforced; that it’s much easier to read documents when you know where to find each element; and finally there is concern that if the order isn’t enforced, human users will be disoriented, confused, and find themselves in an insoluble quandary when it comes to choosing an order.
While the interleave
pattern works just fine most
of the time, you need to keep in mind the restriction about the
interleave
pattern mentioned in Chapter 6: there can be only
one text
pattern in each
interleave
pattern. This restriction affects
mixed-content models found mainly in document-oriented applications
and may sometimes require schemas to specify the order when mixing
textual content and elements.
Generalizing content models in which the relative order of child elements isn’t significant might lead you to difficulties when you need to work with other schema languages; notably, DTD and W3C XML Schema, such as if you are using RELAX NG as your main schema language and want to maintain the possibility of converting your RELAX NG schemas to DTDs or W3C XML schemas for the same vocabulary.
A way to avoid these potential issues surrounding the relative order of elements is to add elements that act as containers. These containers can make it easier to specify that elements include a text node, several elements that aren’t repeated, or repeated elements with the same name.
Among the elements of our library, the book
element is the only one that would be problematic for other schema
languages if I decided to switch its content model to
interleave
. The book-content
pattern then becomes:
<define name="book-content"> <interleave> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <ref name="isbn-content"/> </element> <element name="title"> <ref name="title-content"/> </element> <zeroOrMore> <element name="author"> <ref name="author-content"/> </element> </zeroOrMore> <zeroOrMore> <element name="character"> <ref name="character-content"/> </element> </zeroOrMore> </interleave> </define>
or, in the compact syntax:
book-content = attribute id { text } & attribute available { text } & element isbn { isbn-content } & element title { title-content } & element author { author-content }* & element character { character-content }*
This change allows instance documents in which
author
and character
elements
are mixed up with the other elements, such as that shown in Figure 12-3.
W3C XML Schema can’t
support this mixing. In order to define a schema that can more easily
be translated into a W3C XML Schema, I can add containers to isolate
the author
and character
elements from the elements that can’t be repeated.
The content of the book-content
pattern thus
becomes:
<define name="book-content"> <interleave> <attribute name="id"/> <attribute name="available"/> <element name="isbn"> <ref name="isbn-content"/> </element> <element name="title"> <ref name="title-content"/> </element> <element name="authors"> <zeroOrMore> <element name="author"> <ref name="author-content"/> </element> </zeroOrMore> </element> <element name="characters"> <zeroOrMore> <element name="character"> <ref name="character-content"/> </element> </zeroOrMore> </element> </interleave> </define>
or:
book-content = attribute id { text } & attribute available { text } & element isbn { isbn-content } & element title { title-content } & element authors { element author { author-content }* } & element characters { element character { character-content }* }
and validates elements such as those shown in Figure 12-4.
The relative order between the isbn
,
title
, authors
, and
characters
elements is still not significant, but
the author
and character
elements are now grouped together under containers and
can’t interleave between the other elements.
That’s enough to make this schema much friendlier to
schema languages with less expressive power than RELAX NG.
Note that even if these containers aren’t necessary
for RELAX NG, they are considered good practice by many XML experts.
The containers facilitate access to author
and
character
elements. The downside is that
additional hierarchies are added, and XPath expressions that identify
the contained elements become more verbose: instead of writing
/library/book/character
to access to the
character
elements, you have to write
/library/book/characters/character
. This style can
get tedious.
The previous sections focused on making schemas easy to extend through combination of named patterns and limiting the use of redefinition, because it leads to schemas with redundant pieces that are more difficult to maintain. However, extension is just one way to modify a schema to adapt it to other applications. There are also times when it is necessary to restrict schemas, adding new constraints or removing elements and attributes.
With RELAX NG, designing
schemas that can
be restricted without complete redefinition is more difficult than
designing schemas that are easy to extend. This is because the only
restriction that can be applied through combination is the
combination of notAllowed
patterns through
interleave. As shown in Chapter 10, if the
definition of the died
element has been included
in the named pattern element-died
, you can use
this feature to remove the element from the schema:
<grammar xmlns="http://relaxng.org/ns/structure/1.0"> <include href="library.rng"/> <define name="element-died" combine="interleave"> <notAllowed/> </define> </grammar>
or:
include "library.rnc" element-died &= notAllowed
The rule of thumb for writing schemas that are easy to restrict is thus to increase the granularity of named patterns, exactly as is done when writing extensible schemas.
Note that the distinction between defining named patterns for
content, rather than for elements as was important for writing
extensible schemas, becomes meaningless for defining easily
restricted schemas. This is because interleaving a
notAllowed
pattern with an element or with its
content leads in both cases to a pattern that can’t
be matched in any instance structure.
The issue of restricting schemas is difficult enough that motivated people have propose specific solutions. Looking beyond the scope RELAX NG’s built-in features, Bob DuCharme has proposed a generic mechanism that relies on annotations that are preprocessed to generate subsets of schemas: see Chapter 13.
18.190.176.78