Extensible Schemas

Sometimes, building an extensible schema is a matter of capturing existing practice in RELAX NG, while other times, the schema development comes before practice, and the schema developer has the opportunity to make a lot of choices. You often have to do your best to write an extensible schema for an existing XML vocabulary and are constrained by the existing vocabulary. Other times you can design whatever vocabulary seems appropriate to the information being described.

Working from a Fixed Result

In the case of a fixed result, the only way to manage extensibility relies on how named patterns are defined, much the same way that programmers’ decisions about how to define classes in object-oriented environments have a lot of impact on its extensibility. In this section, I will examine the major approaches to use when defining named patterns and start elements with extensibility in mind.

Providing a grammar and a start element

Let’s look back at our first schema, the Russian doll schema:

 <?xml version="1.0" encoding="utf-8"  ?>
 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library">
  <oneOrMore>
   <element name="book">
    <attribute name="id"/>
    <attribute name="available"/>
    <element name="isbn">
     <text/>
    </element>
    <element name="title">
     <attribute name="xml:lang"/>
     <text/>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <element name="born">
       <text/>
      </element>
      <optional>
       <element name="died">
        <text/>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <element name="born">
       <text/>
      </element>
      <element name="qualification">
       <text/>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or, in the compact syntax:

 element library {
  element book {
   attribute id {text},
   attribute available {text},
   element isbn {text},
   element title {attribute xml:lang {text}, text},
   element author {
    attribute id {text},
    element name {text},
    element born {text},
    element died {text}?}*,
   element character {
    attribute id {text},
    element name {text},
    element born {text},
    element qualification {text}}*
  } +
 }

What if you want to derive a schema that has a new id attribute on the library element? That’s simple: take our schema, copy it, and edit it as a new one. There is no option for extensibility at all because you can’t include a schema that doesn’t have a grammar element as a root.

The first thing to consider when you want a RELAX NG schema to be extensible is that you always want the root element to be a grammar element. In this case, the change, producing russian-doll.rng , is minor:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <attribute name="id"/>
           <attribute name="available"/>
           <element name="isbn">
             <text/>
           </element>
           <element name="title">
             <attribute name="xml:lang"/>
             <text/>
           </element>
           <zeroOrMore>
             <element name="author">
               <attribute name="id"/>
               <element name="name">
                 <text/>
               </element>
               <element name="born">
                 <text/>
               </element>
               <optional>
                 <element name="died">
                   <text/>
                 </element>
               </optional>
             </element>
           </zeroOrMore>
           <zeroOrMore>
             <element name="character">
               <attribute name="id"/>
               <element name="name">
                 <text/>
               </element>
               <element name="born">
                 <text/>
               </element>
               <element name="qualification">
                 <text/>
               </element>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
 </grammar>

In the compact syntax, grammar is implicit, but you still need a start pattern to be able to redefine anything. The result of adding this pattern, russian-doll.rnc , looks like:

 start =
    element library
    {
       element book
       {
          attribute id { text },
          attribute available { text },
          element isbn { text },
          element title { attribute xml:lang { text }, text },
          element author
          {
             attribute id { text },
             element name { text },
             element born { text },
             element died { text }?
          }*,
          element character
          {
             attribute id { text },
             element name { text },
             element born { text },
             element qualification { text }
          }*
       }+
    }

Once these minor changes have been made, the schema can at least be included into another schema and modified there.

Maximize granularity

Although the previous schemas can be redefined, this redefinition is ineffective because the granularity is very coarse, and you can’t redefine just the library element. The best you can do is the following, which isn’t much of an improvement:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="russian-doll.rng">
     <start>
       <element name="library">
         <attribute name="id"/>
         <oneOrMore>
           <element name="book">
             <attribute name="id"/>
             <attribute name="available"/>
             <element name="isbn">
               <text/>
             </element>
             <element name="title">
               <attribute name="xml:lang"/>
               <text/>
             </element>
             <zeroOrMore>
               <element name="author">
                 <attribute name="id"/>
                 <element name="name">
                   <text/>
                 </element>
                 <element name="born">
                   <text/>
                 </element>
                 <optional>
                   <element name="died">
                     <text/>
                   </element>
                 </optional>
               </element>
             </zeroOrMore>
             <zeroOrMore>
               <element name="character">
                 <attribute name="id"/>
                 <element name="name">
                   <text/>
                 </element>
                 <element name="born">
                   <text/>
                 </element>
                 <element name="qualification">
                   <text/>
                 </element>
               </element>
             </zeroOrMore>
           </element>
         </oneOrMore>
       </element>
     </start>
   </include>
 </grammar>

or:

 include "russian-doll.rnc"
 {
 start =
       element library
       {
          attribute id { text },
          element book
          {
             attribute id { text },
             attribute available { text },
             element isbn { text },
             element title { attribute xml:lang { text }, text },
             element author
             {
                attribute id { text },
                element name { text },
                element born { text },
                element died { text }?
             }*,
             element character
             {
                attribute id { text },
                element name { text },
                element born { text },
                element qualification { text }
             }*
          }+
       }
 }

In other words, we still need to redefine the whole schema. We’ve made no gains in modularity, because any changes in the original schema aren’t propagated into our resulting schema. To fix this, we need to create finer-grained definitions. A first approach to finer granularity involves defining a named pattern for each element (as with the schema style imposed by DTDs). That approach leads to a schema similar to the flat schema seen in Chapter 5 called flat.rng :

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <ref name="library-element"/>
   </start>
   <define name="library-element">
     <element name="library">
       <oneOrMore>
         <ref name="book-element"/>
       </oneOrMore>
     </element>
   </define>
   <define name="author-element">
     <element name="author">
       <attribute name="id"/>
       <ref name="name-element"/>
       <ref name="born-element"/>
       <optional>
         <ref name="died-element"/>
       </optional>
     </element>
   </define>
   <define name="book-element">
     <element name="book">
       <attribute name="id"/>
       <attribute name="available"/>
       <ref name="isbn-element"/>
       <ref name="title-element"/>
       <zeroOrMore>
         <ref name="author-element"/>
       </zeroOrMore>
       <zeroOrMore>
         <ref name="character-element"/>
       </zeroOrMore>
     </element>
   </define>
   <define name="born-element">
     <element name="born">
       <text/>
     </element>
   </define>
   <define name="character-element">
     <element name="character">
       <attribute name="id"/>
       <ref name="name-element"/>
       <ref name="born-element"/>
       <ref name="qualification-element"/>
     </element>
   </define>
   <define name="died-element">
     <element name="died">
       <text/>
     </element>
   </define>
   <define name="isbn-element">
     <element name="isbn">
       <text/>
     </element>
   </define>
   <define name="name-element">
     <element name="name">
       <text/>
     </element>
   </define>
   <define name="qualification-element">
     <element name="qualification">
       <text/>
     </element>
   </define>
   <define name="title-element">
     <element name="title">
       <attribute name="xml:lang"/>
       <text/>
     </element>
   </define>
 </grammar>

or, in the compact syntax, flat.rnc :

 start = library-element
        
 library-element = element library { book-element+ }
 author-element =
    element author
    {
       attribute id { text },
       name-element,
       born-element,
       died-element?
    }
        
 book-element =
    element book
    {
       attribute id { text },
       attribute available { text },
       isbn-element,
       title-element,
       author-element*,
       character-element*
    }
        
 born-element = element born { text }

 character-element =
    element character
    {
       attribute id { text },
       name-element,
       born-element,
       qualification-element
    }
        
 died-element = element died { text }
        
 isbn-element = element isbn { text }
        
 name-element = element name { text }
        
 qualification-element = element qualification { text }
        
 title-element = element title { attribute xml:lang { text }, text }

These new schemas are more verbose, but they’re also much more extensible. To add our id attribute, we need only to redefine the library element:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="flat.rng">
     <define name="library-element">
       <element name="library">
         <attribute name="id"/>
         <oneOrMore>
           <ref name="book-element"/>
         </oneOrMore>
       </element>
     </define>
   </include>
 </grammar>

or:

 include "flat.rnc"
 {
    library-element = element library { attribute id { text }, book-element+ }
 }

All changes made to the flat schemas—except to the library element—are now propagate through to the derived schemas.

Defining named patterns for content rather than for elements

Although the previous result is much more extensible, we still have to redefine the complete content of the library element to add our id attribute. We may have reduced the problem of redefinition our Russian doll model had, but we haven’t eliminated it. If we change our main vocabulary and add a new attribute or element to the library element in flat.rng, the modification isn’t automatically taken into account in our schema. We’ll need to edit it.

The modification isn’t automatically transferred, because the extensibility of a named pattern doesn’t cross element boundaries. Because we have the boundary of the library element included within our library-element named pattern, the content of this element isn’t extensible, as shown in Figure 12-1.

A flat schema, which is difficult to extend
Figure 12-1. A flat schema, which is difficult to extend

To avoid this difficulty, we could have split our named patterns according to the content of the elements rather than by the element themselves. We would then have been able to add new content within the library element, as shown in Figure 12-2.

A split schema, which is easier to extend
Figure 12-2. A split schema, which is easier to extend

Generalizing this approach for all the definitions of all the elements leads to a schema that looks like flat-content.rng :

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <element name="library">
       <ref name="library-content"/>
     </element>
   </start>
   <define name="library-content">
     <oneOrMore>
       <element name="book">
         <ref name="book-content"/>
       </element>
     </oneOrMore>
   </define>
   <define name="book-content">
     <attribute name="id"/>
     <attribute name="available"/>
     <element name="isbn">
       <ref name="isbn-content"/>
     </element>
     <element name="title">
       <ref name="title-content"/>
     </element>
     <zeroOrMore>
       <element name="author">
         <ref name="author-content"/>
       </element>
     </zeroOrMore>
     <zeroOrMore>
       <element name="character">
         <ref name="character-content"/>
       </element>
     </zeroOrMore>
   </define>
   <define name="author-content">
     <attribute name="id"/>
     <element name="name">
       <ref name="name-content"/>
     </element>
     <element name="born">
       <ref name="born-content"/>
     </element>
     <optional>
       <element name="died">
         <ref name="died-content"/>
       </element>
     </optional>
   </define>
   <define name="born-content">
       <text/>
   </define>
   <define name="character-content">
     <attribute name="id"/>
     <element name="name">
       <ref name="name-content"/>
     </element>
     <element name="born">
       <ref name="born-content"/>
     </element>
     <element name="qualification">
       <ref name="qualification-content"/>
     </element>
   </define>
   <define name="died-content">
     <text/>
   </define>
   <define name="isbn-content">
     <text/>
   </define>
   <define name="name-content">
     <text/>
   </define>
   <define name="qualification-content">
     <text/>
   </define>
   <define name="title-content">
     <attribute name="xml:lang"/>
     <text/>
   </define>
 </grammar>

or, in the compact syntax, flat-content.rnc :

 start = element library { library-content }

 library-content = element book { book-content }+

 book-content =
    attribute id { text },
    attribute available { text },
    element isbn { isbn-content },
    element title { title-content },
    element author { author-content }*,
    element character { character-content }*

 author-content =
    attribute id { text },
    element name { name-content },
    element born { born-content },
    element died { died-content }?

 born-content = text

 character-content =
    attribute id { text },
    element name { name-content },
    element born { born-content },
    element qualification { qualification-content }

 died-content = text

 isbn-content = text

 name-content = text

 qualification-content = text

 title-content = attribute xml:lang { text }, text

We can now take full advantage of the named pattern and, instead of redefining it, we can combine it neatly with the id attribute:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="flat-content.rng"/>
   <define name="library-content" combine="interleave">
     <attribute name="id"/>
   </define>
 </grammar>

or:

 include "flat-content.rnc"

 library-content &= attribute id { text }

Because of the nature of the content, the extension can be done using a combination by interleave . This method of combination is frequently useful, when attributes or elements need to be added, but it works only when the relative order isn’t significant for the schema. Otherwise, you still need to redefine the pattern or to combine it by choice.

Free Formats

When you are free to define the vocabulary, there are three principal guidelines for designing extensible formats. The first one is independent of any schema language. The second is specific to RELAX NG and maximizes the usage of combination through interleave. The third is a way to minimize the impact of interleave on schemas that need to be converted into W3C XML Schema or DTD schemas.

Be cautious with attributes

Attributes are generally difficult to extend. When choosing from among elements and attributes, people often base their choice on the relative ease of processing, styling, or transforming. Instead, you should probably focus on their extensibility.

Independent of any XML schema language, when you have an attribute in an instance document, you are pretty much stuck with it. Unless you replace it with an element, there is no way to extend it. You can’t add any child elements or attributes to it because it’s designed to be a leaf node and to remain a leaf node. Furthermore, you can’t extend the parent element to include a second instance of an attribute with the same name. (Attributes with duplicate names are forbidden by XML 1.0.) You are thus making an impact not only on the extensibility of the attribute but also on the extensibility of the parent element.

Because attributes can’t be annotated with new attributes and because they can’t be duplicated, they can’t be localized like elements through duplication with different values of xml:lang attributes. Because attributes are more difficult to localize, you should avoid storing any text targeted at human consumers within attributes. You never know whether your application will become international. These attributes would make it more difficult to localize.

To understand the reasons behind these limitations, it’s worth looking at the original use cases for attributes. Attributes were originally designed to hold metadata, information about the contents of the document. Elements themselves are a kind of metadata, labelling the content found in the document, and attributes are a mechanism for refining that metadata. (Data about metadata is still metadata.) Because of this, the editors of XML 1.0 decided that the lack of extensibility in XML attributes was not an issue.

Although most XML tools provide equal access to elements and attribute contents and don’t require attributes to contain exclusively metadata, the syntactic restrictions created by considering attributes to be metadata remain. Therefore, it’s wise to use attributes for what they’ve been designed for—metadata. My advice is to use attributes only when there is a good reason to do so: when the information is clearly metadata, and you have good reason to believe that it will not have to be extended.

In our example library, identifiers are good candidates for being attributes, but even available probably should have been specified as an element. Although at first glance available might be considered metadata (available doesn’t directly affect the description of the content of a book), other users looking at the book element may want this information item to be able to store more details. They might even want to give it more structure, to extend it to indicate whether the book is available as a new or as a used item, for example.

There are times when these rules about metadata and attributes must be relaxed. You saw in Chapter 11 that it wasn’t a good idea to add foreign elements into a text-only element. Doing so transforms its content model from text to mixed content. It’s always risky to extend a text-only element by adding elements, while additional attributes usually pass unnoticed by existing applications. In this case, the lack of further extensibility may be compensated for by the short-term gain in backward compatibility between the vocabularies before and after the extension.

Use order sparingly

XML users often confuse the usage of elements and attributes. A common bad habit is the assumption that schemas should always enforce a fixed order among child elements. In other words, the relative order between subelements always matters.

Relative order is much less natural than you usually think, at least at the schema level. To draw a parallel with another technology: it’s considered poor practice to pay attention to the physical order of columns and rows in the table of a relational database. Furthermore, UML—the dominant modeling methodology—doesn’t attach any order to the attributes of classes, nor does it attach any order to relations between classes (unless specifically specified). UML attributes often represent not only XML attributes but also elements.

The main reasons people expect order to be required derive from limitations in DTDs and, more recently, in W3C XML Schema. Still, there are strong reasons to believe that when there is no special reason, relative order between subelements is something that should be left to the choice of those creating document instances, and you shouldn’t bother users and applications with enforcing an unnecessary constraint at the schema level.

In RELAX NG, defining content models in which the relative order of child elements isn’t significant is almost as simple as defining content models where it is significant. It’s just a matter of adding interleave elements. When the relative order isn’t significant, the definition is more extensible because these content models can easily be extended through pattern combinations using interleave.

Using content models in which the relative order of child elements isn’t significant makes it easier to add new elements and attributes if necessary. I demonstrated this in the example about the addition of the id attribute in the library element in the first section of this chapter.

Note that together with the “element or attribute” question, the issue of order significance is among the most controversial for XML experts. Technical constraints may, in some cases, justify enforcing element order in documents. These constraints come into play most notably during stream processing of huge documents; requiring information to appear in a specific order might permit the skipping of processing long content that otherwise needs to be buffered if this information came after the content. Other arguments for requiring that the order of elements is important—which I find to be far from obvious—include the assertion that there is “disorder” carried by documents in which element order isn’t enforced; that it’s much easier to read documents when you know where to find each element; and finally there is concern that if the order isn’t enforced, human users will be disoriented, confused, and find themselves in an insoluble quandary when it comes to choosing an order.

While the interleave pattern works just fine most of the time, you need to keep in mind the restriction about the interleave pattern mentioned in Chapter 6: there can be only one text pattern in each interleave pattern. This restriction affects mixed-content models found mainly in document-oriented applications and may sometimes require schemas to specify the order when mixing textual content and elements.

Use containers

Generalizing content models in which the relative order of child elements isn’t significant might lead you to difficulties when you need to work with other schema languages; notably, DTD and W3C XML Schema, such as if you are using RELAX NG as your main schema language and want to maintain the possibility of converting your RELAX NG schemas to DTDs or W3C XML schemas for the same vocabulary.

A way to avoid these potential issues surrounding the relative order of elements is to add elements that act as containers. These containers can make it easier to specify that elements include a text node, several elements that aren’t repeated, or repeated elements with the same name.

Among the elements of our library, the book element is the only one that would be problematic for other schema languages if I decided to switch its content model to interleave. The book-content pattern then becomes:

  <define name="book-content">
    <interleave>
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <ref name="isbn-content"/>
      </element>
      <element name="title">
        <ref name="title-content"/>
      </element>
      <zeroOrMore>
        <element name="author">
          <ref name="author-content"/>
        </element>
      </zeroOrMore>
      <zeroOrMore>
        <element name="character">
          <ref name="character-content"/>
        </element>
      </zeroOrMore>
    </interleave>
  </define>

or, in the compact syntax:

 book-content =
    attribute id { text }
  & attribute available { text }
  & element isbn { isbn-content }
  & element title { title-content }
  & element author { author-content }*
  & element character { character-content }*

This change allows instance documents in which author and character elements are mixed up with the other elements, such as that shown in Figure 12-3.

An instance document with interleaved content
Figure 12-3. An instance document with interleaved content

W3C XML Schema can’t support this mixing. In order to define a schema that can more easily be translated into a W3C XML Schema, I can add containers to isolate the author and character elements from the elements that can’t be repeated. The content of the book-content pattern thus becomes:

  <define name="book-content">
    <interleave>
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <ref name="isbn-content"/>
      </element>
      <element name="title">
        <ref name="title-content"/>
      </element>
      <element name="authors">
        <zeroOrMore>
          <element name="author">
            <ref name="author-content"/>
          </element>
        </zeroOrMore>
      </element>
      <element name="characters">
        <zeroOrMore>
          <element name="character">
            <ref name="character-content"/>
          </element>
        </zeroOrMore>
      </element>
    </interleave>
  </define>

or:

book-content =
  attribute id { text }
 & attribute available { text }
 & element isbn { isbn-content }
 & element title { title-content }
 & element authors { element author { author-content }* }
 & element characters { element character { character-content }* }

and validates elements such as those shown in Figure 12-4.

A document with interleaved container
Figure 12-4. A document with interleaved container

The relative order between the isbn, title, authors, and characters elements is still not significant, but the author and character elements are now grouped together under containers and can’t interleave between the other elements. That’s enough to make this schema much friendlier to schema languages with less expressive power than RELAX NG.

Note that even if these containers aren’t necessary for RELAX NG, they are considered good practice by many XML experts. The containers facilitate access to author and character elements. The downside is that additional hierarchies are added, and XPath expressions that identify the contained elements become more verbose: instead of writing /library/book/character to access to the character elements, you have to write /library/book/characters/character. This style can get tedious.

Restricting Existing Schemas

The previous sections focused on making schemas easy to extend through combination of named patterns and limiting the use of redefinition, because it leads to schemas with redundant pieces that are more difficult to maintain. However, extension is just one way to modify a schema to adapt it to other applications. There are also times when it is necessary to restrict schemas, adding new constraints or removing elements and attributes.

With RELAX NG, designing schemas that can be restricted without complete redefinition is more difficult than designing schemas that are easy to extend. This is because the only restriction that can be applied through combination is the combination of notAllowed patterns through interleave. As shown in Chapter 10, if the definition of the died element has been included in the named pattern element-died, you can use this feature to remove the element from the schema:

 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng"/>
   <define name="element-died" combine="interleave">
     <notAllowed/>
   </define>
 </grammar>

or:

 include "library.rnc"
 element-died &= notAllowed

The rule of thumb for writing schemas that are easy to restrict is thus to increase the granularity of named patterns, exactly as is done when writing extensible schemas.

Note that the distinction between defining named patterns for content, rather than for elements as was important for writing extensible schemas, becomes meaningless for defining easily restricted schemas. This is because interleaving a notAllowed pattern with an element or with its content leads in both cases to a pattern that can’t be matched in any instance structure.

The issue of restricting schemas is difficult enough that motivated people have propose specific solutions. Looking beyond the scope RELAX NG’s built-in features, Bob DuCharme has proposed a generic mechanism that relies on annotations that are preprocessed to generate subsets of schemas: see Chapter 13.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.190.176.78