Document Type Definitions

As discussed earlier, Document Type Definitions, or DTDs, are the form of document types specified by the XML 1.0 recommendation. Though there are alternatives, DTDs remain one of the most common ways of specifying a document type. In this section, we discuss the syntax of the various declarations that can occur in the Document Type Declaration; these can all appear in both the internal and external subsets.

Entity Declarations

Entities are sources of data that are used to compose a larger construct. Most, called general entities, are used to construct documents, but some, known as parameter entities, are used to construct the document type itself. Both are defined using an entity declaration in the Document Type Definition. Each kind of entity is defined in a separate namespace; there can be a general entity named myEntity and a parameter entity of the same name, and the names do not clash.

Entities can be declared more than once; the first definition for a name takes precedence. Because the internal subset is processed before the external subset, this allows the internal subset to override a definition provided in the external subset; when used with parameter entities, this mechanism can be used to extend DTDs. Document type extension generally works best when the DTD being extended has been carefully designed with this in mind. The DocBook DTD for technical documentation is an excellent example of this.
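The precedence rule can be observed with Python's standard-library parser. This is a minimal sketch, not a statement about every parser: the note element and the owner entity are names invented for the illustration, and both declarations are placed in the internal subset to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "owner" is declared twice.
DOC = """<!DOCTYPE note [
  <!ENTITY owner "First Declaration">
  <!ENTITY owner "Second Declaration">
]>
<note>&owner;</note>"""

root = ET.fromstring(DOC)

# The parser binds the name to the first declaration it sees and
# ignores the later one.
owner_text = root.text
print(owner_text)
```

The same rule is what lets a declaration in the internal subset win over one of the same name in the external subset.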

General entities can take a variety of forms: they may be parsed entities, consisting of XML text, or unparsed, such as an image stored as a Portable Network Graphics (PNG) file. The text of a parsed entity may be included in the entity declaration, or it may reside in an external source. The body of an unparsed entity is always stored externally. Most entities used with XML are parsed entities; unparsed constructs, such as images, are typically referenced using an absolute or relative URL rather than by a named entity.

Parsed general entities are used to define substitution text for a (typically) shorter name. Recall that in XML, text includes not only character data, but markup as well, so the substitution can actually insert additional structure into the document as long as all structures are complete within the substitution. At parse time, a parser resolves each entity reference into its substitution text, and evaluates the document based on how it looks after the entities have been resolved. A simple internal entity is as easy to create as a symbol and its replacement text:

<!ENTITY sandwich "Crabby Patty">

In your document, any reference to &sandwich; inserts the replacement text “Crabby Patty” into the document. For example:

I am hungry for a &sandwich;.

This sentence renders as:

I am hungry for a Crabby Patty.
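As a rough illustration, the same substitution can be reproduced with Python's xml.etree.ElementTree, which expands internal general entities declared in the internal subset. The menu and p elements and the special entity are assumptions added for the sketch; only sandwich comes from the text above. The second entity shows that replacement text may carry complete markup as well as character data.

```python
import xml.etree.ElementTree as ET

# "menu", "p", and the "special" entity are hypothetical names.
DOC = """<!DOCTYPE menu [
  <!ENTITY sandwich "Crabby Patty">
  <!ENTITY special "<em>Crabby Patty</em> with fries">
]>
<menu><p>I am hungry for a &sandwich;.</p><p>Today: &special;</p></menu>"""

root = ET.fromstring(DOC)
first, second = root.findall("p")

# Plain character data is spliced into the surrounding text.
print(first.text)

# Markup in the replacement text becomes real structure in the tree.
em = second.find("em")
print(em.text, "|", em.tail)
```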

External entities are defined using an entity declaration that gives a URL to an external resource containing the replacement text:

<!ENTITY legal SYSTEM "http://www.example.com/legal.xml">

If legal.xml contains the following, any reference to &legal; within a document yields:

<legal>Copyright 2001, Example Corporation</legal>

Like internal entities, external entities replace symbols with the appropriate text. Sometimes this must be done when the text uses characters that would otherwise be considered markup (such as the use of special characters like <, >, and & in your XML). Other times, entities are used to keep boilerplate information that is normally maintained somewhere else available to the document.
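One way a program can resolve an external entity is sketched below with Python's xml.parsers.expat. The RESOURCES dictionary is a stand-in for fetching the URL, so nothing touches the network; the page element, the function name, and the assumed contents of legal.xml are illustrative choices rather than anything a real server is known to return.

```python
import xml.parsers.expat

# Stand-in for the network: system identifier -> entity text (assumed).
RESOURCES = {
    "http://www.example.com/legal.xml":
        "<legal>Copyright 2001, Example Corporation</legal>",
}

DOC = """<!DOCTYPE page [
  <!ENTITY legal SYSTEM "http://www.example.com/legal.xml">
]>
<page>&legal;</page>"""


def parse_with_external_entities(document, resources):
    """Parse document, pulling external entity text from resources."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        events.append(("start", name))

    def chars(data):
        events.append(("text", data))

    def external_ref(context, base, system_id, public_id):
        text = resources.get(system_id)
        if text is None:
            return 0  # tells expat the entity could not be handled
        # A sub-parser processes the entity in the parent's context.
        child = parser.ExternalEntityParserCreate(context)
        child.StartElementHandler = start
        child.CharacterDataHandler = chars
        child.Parse(text.encode("utf-8"), True)
        return 1

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.ExternalEntityRefHandler = external_ref
    parser.Parse(document.encode("utf-8"), True)
    return events


events = parse_with_external_entities(DOC, RESOURCES)
legal_text = "".join(data for kind, data in events if kind == "text")
print(legal_text)
```

Without an ExternalEntityRefHandler, expat does not fetch anything on its own, which is why the handler has to supply the text.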

Parameter entities are different in both usage and applicability. They can only be used to create the Document Type Definition, and not to directly compose the document. The syntax of an XML document does not allow parameter entities to be referred to from within the document content, but only allows their use within the internal and external DTD subsets. There are no unparsed parameter entities. A nonvalidating parser may skip external parameter entities, but validating parsers are required to parse all referenced parameter entities.

The declaration for a parameter entity looks much like the declaration for a general entity, with just a couple of additional characters added:

<!ENTITY % node-decls SYSTEM "node-decls.dtd">

What this declaration has that the general entity declaration doesn’t is a percent sign (%) between the keyword ENTITY and the name of the entity, with whitespace on both sides to set it off (the whitespace is required). This parameter entity would be used like this:

%node-decls;

Note that the reference to the parameter entity uses the percent sign instead of the ampersand to mark the beginning of the name; this is necessary since the two sets of names may overlap.

The effect of parameter entity replacement is much like that of general entity replacement. The replacement text effectively replaces the entity reference, and interpretation of the document type continues using the modified text.

The usefulness of parameter entities is highest when working with modularized document types, which can provide carefully designed extension mechanisms using parameter entities. A large DTD, such as the industry-standard DocBook DTD for software documentation, can be customized by creating a new document type that simply defines several parameter entities and then incorporates the standard DocBook definition. Since the entity declarations in the customization layer override the definitions provided by DocBook, this mechanism can be used to either extend or restrict the specific document type in ways that are suitable for a specific project.

Element Type Declarations

Element type declarations are used to constrain an element’s content. They indicate what element types can be used as children of the element, and show how the children may be arranged. Element type declarations may look like this:

<!ELEMENT br      EMPTY>
<!ELEMENT generic ANY>
<!ELEMENT name    (address+)>
<!ELEMENT para    (#PCDATA | list | picture)*>

We can break up the declaration into particular syntactic components, each with a specific purpose:

<!ELEMENT name content-model>

The text <!ELEMENT tells the parser that this is an element type declaration. name gives a name to the element type; this allows it to be referenced from elsewhere in the Document Type Definition. The content-model is used to specify what can appear as content of the element, whether it can contain character data, other elements, or both. No element type may be declared more than once.

It is interesting to note that there is not a place for attributes to be declared. While attributes are associated with element types, they are defined using attribute declarations, described later in this chapter, in Section 2.6.3.

Content models

A content model describes what elements are allowed as children of the declared element type, in what order and combination they are allowed, and whether arbitrary character data is allowed.

The content models of all elements can be broken into two categories:

Element Content

This describes content made up only of elements. For example, you might define an address element that holds no character data of its own, only child elements. The specification defines content particles that “consist of names, choice lists of content particles, or sequence lists of content particles.”

Mixed Content

This content may contain character data. This is the most common arrangement in text documents:

<news title="XML from Outer Space">
    This article describes XML transmissions from outer space.
    <h1>Not a Meteor</h1>
    <para>Contrary to earlier reports, the XML that has landed from
          outer space is not a meteor.</para>
</news>

In this example, elements and character data are mixed beneath the news element. Elements that have a mixed content model are not required to allow other elements as content. In fact, an element type with only character data in the content model may be completely empty; there is no way to specify that there must be characters in the character data.

Let’s take another look at our example element declarations:

<!ELEMENT br      EMPTY>

The first two element type declarations are simple. The content model of the first, EMPTY, can be used to describe an empty br element as found in XHTML. It can contain no child elements and no character data. Under the validity rules of the current XML 1.0 recommendation, it may not contain comments, processing instructions, or whitespace either. An element type declared as EMPTY is considered a degenerate special case of element content.

<!ELEMENT generic ANY>

Next, we have an element named generic that can contain any kind of element defined in the document type (this does not allow undefined element types!). In addition to other elements, character data is allowed as well, so a content model of ANY is mixed content.

<!ELEMENT name    (address+)>

The third example is simple, but very different from the others. Instead of a simple name such as ANY or EMPTY, the model is described by something that closely resembles a regular expression. In this particular example, we have a name element that requires one or more address elements to be included. This form of content model is perhaps the most commonly used and allows for fine control. Content models can take on varying levels of complexity, but the goal is always the same: to define the content that is allowed or expected within the element.

The content model is written as a parenthesized group. Commas separate the items of a sequence, and vertical bar characters (|) separate the alternatives of a choice. For example:

<!ELEMENT name (first, last)>

This element type requires a first child element followed by a last child element, and nothing else. If you want to offer a choice between first or last, but not allow both, use a vertical bar:

<!ELEMENT name (first | last)>

These expressions can be nested within each other as well:

<!ELEMENT order (sku, quantity, (account | name), price)>

The above order element requires a child sku element, followed by a quantity element, then followed by either an account or a name element, and finally followed by a price element.

Additionally, the operators +, *, and ? can be tacked onto the end of content expressions to indicate the number of times an element or sequence must occur, or whether it is repeatable or even required. Without a modifier, the element must appear exactly once in that location. They are explained in the following list:

+

Content must appear one or more times.

*

Content may appear zero or more times.

?

Content may appear zero times or one time.

For example, to require an order element to have exactly one account, followed by one or more sku elements, then one or more price elements, and optionally a single shipping address (ship), you could use an element type declaration such as the following:

<!ELEMENT order (account, sku+, price+, ship?)>
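Because an element-content model closely resembles a regular expression, it can be translated into one and matched against the sequence of child element names. The following is a minimal sketch under stated assumptions: the function names are invented here, only element content is handled (not EMPTY, ANY, or #PCDATA), and a real validating parser is under no obligation to work this way.

```python
import re


def content_model_to_regex(model):
    """Translate an element-content model into a compiled regex that
    matches a string of child names, each followed by one space."""
    tokens = re.findall(r"[A-Za-z_][\w.\-]*|[(),|+*?]", model)
    parts = []
    for tok in tokens:
        if tok == ",":
            continue                      # a sequence is plain concatenation
        if tok == "(":
            parts.append("(?:")           # group without capturing
        elif tok in ")|+*?":
            parts.append(tok)             # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def children_match(model, children):
    """Return True if the list of child names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


ORDER = "(account, sku+, price+, ship?)"
print(children_match(ORDER, ["account", "sku", "sku", "price"]))   # True
print(children_match(ORDER, ["account", "price"]))                 # False: no sku
```

Nested groups work the same way: children_match("(sku, quantity, (account | name), price)", ["sku", "quantity", "name", "price"]) is accepted, while a sequence containing both account and name is rejected.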

To mix a combination of character data or elements, you can use the or operator to specify your mixed content, as shown here:

<!ELEMENT paragraph (#PCDATA | list | picture)*>

This paragraph element type allows character data, list elements, and picture elements to appear in any order and any number of times within paragraph elements (the repetition is denoted by the asterisk). #PCDATA can only be combined with elements using the or operator in a group that has a * modifier, and it can only occur in the outermost parenthesized group of a content model.

Attribute Declarations

As discussed earlier, attributes are used to provide name/value combinations as properties of elements. Attributes can appear only in start tags and empty element tags. An attribute-list declaration would be a part of a DTD, used to validate the XML document. An example follows:

<!ATTLIST news 
    title  CDATA #REQUIRED
    author CDATA #IMPLIED>

This is an attribute-list declaration that indicates that any news element is required to have a title attribute consisting of character data, and may optionally have an author attribute, also consisting of character data.

Attribute data types

The specification states that attribute types are of three kinds: string, tokenized, and enumerated. In the earlier attribute list example, you saw that a news element required a title attribute with the string type CDATA.

There are several tokenized attribute types:

ID

A unique identifier for this element. The identifier must be a name unique in the current document instance.

IDREF

Must match an ID somewhere in the XML document.

IDREFS

A list of one or more names, separated by spaces. Each must match an ID in the document.

ENTITY

Matches the name of an unparsed entity declared in the document.

ENTITIES

A space-separated list containing one or more names of unparsed entities.

NMTOKEN

Seldom used. The value must match the Nmtoken production defined in the XML recommendation: one or more name characters, with no restriction on which character comes first. Refer to the recommendation for more information.

NMTOKENS

A list of one or more space-separated NMTOKEN values; this is the least used attribute type.

The remaining attribute types, the enumerated types, are defined in the attribute list itself. An enumerated type is a type that takes a name from a defined list of names, in which the list is given in an attribute declaration. Each distinct set of names forms a separate type, but these types do not have names of their own. An example should help clarify this:

<!ATTLIST ship
          type  (sloop | frigate | dinghy)  #IMPLIED>

This declaration defines an attribute type that may have a value of dinghy, frigate, or sloop, but no other value. The element <ship type="yacht"/> would trigger a validation failure.
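Note that only a validating parser reports this failure. Python's standard-library parser is nonvalidating, so it reads the declaration but accepts the yacht anyway. The sketch below shows that behavior and adds a hand-written check; the ALLOWED_SHIP_TYPES set and the is_valid_ship function are illustrative helpers, not part of any library.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE ship [
  <!ATTLIST ship
            type  (sloop | frigate | dinghy)  #IMPLIED>
]>
<ship type="yacht"/>"""

# The nonvalidating parser does not enforce the enumeration.
ship = ET.fromstring(DOC)
print(ship.get("type"))

# Hypothetical manual check mirroring the enumerated type.
ALLOWED_SHIP_TYPES = {"sloop", "frigate", "dinghy"}


def is_valid_ship(element):
    """An #IMPLIED attribute may be absent; if present it must be listed."""
    value = element.get("type")
    return value is None or value in ALLOWED_SHIP_TYPES


print(is_valid_ship(ship))
```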

Attribute values and constraints

An attribute declaration allows the document type to specify a default value for an attribute if the attribute is missing. It can also indicate whether the attribute may be omitted from the document. Let’s look at a more interesting example of an attribute declaration:

<!ATTLIST chapter
    synopsis  CDATA                       #IMPLIED
    author    CDATA                       #REQUIRED
    email     CDATA                       "[email protected]"
    version   CDATA #FIXED                "1.0"
    type      (normal|reference|appendix) "normal">

The synopsis attribute is required to be a string (CDATA) if it is given at all, but it is not required, and does not have a default value because it is marked as #IMPLIED. (Most of the attributes in HTML are declared this way.) The #REQUIRED constraint means just what it says; the author attribute must be specified in the document. Because it is a string, it may be empty. If a string value is specified instead of #IMPLIED or #REQUIRED, as with the email attribute in our example attribute list, it becomes the default value that is used if no value is given in the document.

The #FIXED constraint can only be used in conjunction with a default value, which we see for the version attribute. When this constraint is used, the document is allowed to include the attribute, but the value must match that given by the default exactly, though it may be encoded using a different mixture of characters, entity references, and character references. If the value differs, a validating parser reports an error.

The type attribute is an example of an enumerated type, similar to what we looked at earlier. Default values and constraints are specified for enumerated types in the same way as for other types, with the additional constraint that if a value is specified, it must be one of the names included in the enumeration.
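Default values are visible even through a nonvalidating parser when the declarations sit in the internal subset. The following sketch uses Python's xml.etree.ElementTree with a subset of the chapter declaration; the author value is made up for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE chapter [
  <!ATTLIST chapter
      synopsis  CDATA                       #IMPLIED
      author    CDATA                       #REQUIRED
      version   CDATA #FIXED                "1.0"
      type      (normal|reference|appendix) "normal">
]>
<chapter author="A. Writer"/>"""

chapter = ET.fromstring(DOC)

# Only author was written in the document; version and type are
# supplied from the declared defaults. synopsis is #IMPLIED, so it
# has no default and stays absent.
attributes = dict(chapter.attrib)
print(attributes)
```

An application therefore cannot tell from the resulting attribute dictionary alone whether a value was written in the document or supplied by the DTD.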

ID attributes offer some unique behavior. Let’s create an attribute for the news element we defined previously:

<!ATTLIST news
          newsID ID #REQUIRED>

With this attribute list, news elements are required to have a newsID attribute. The allowed values are governed by the rules of the ID tokenized type. Specifically, the ID value is a name (as defined in this chapter in Section 2.5.2.1) and must not appear more than once in an XML document as the value of any attribute of type ID. In other words, ID values must uniquely represent an element within the document. Consider a legal example:

<news newsID="id39">Text</news>
<news newsID="id40">Text</news>

Since the values of ID attributes are required to be unique within a document, the following is illegal:

<news newsID="id39">Text</news>
<news newsID="id39">Text</news>

Additionally, no element type may declare more than one attribute of the ID type, so an element carries at most one ID value. As a result, some of the programming APIs can use the values of ID attributes to retrieve specific elements from a document.

What is most interesting about ID attributes, however, is not the attributes themselves, but the IDREF attribute type. While a particular value may only appear once in a document as an ID type, it may appear any number of times as the value of an IDREF or IDREFS attribute. In particular, attributes of those types may only take values that also appear as the value of an ID attribute somewhere in the same document. (An IDREFS attribute can take a value that is a space-separated list of ID values, each of which must exist in the document.) These values can be used to forge internal links between elements that a validating parser must check. This can be very convenient when a basic tree structure is not sufficient to model your data; the ID, IDREF, and IDREFS attributes can be used to extend the data model to include connected, directed graphs with typed arcs.
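The two checks a validating parser performs on these attributes, uniqueness of ID values and resolution of every IDREF, can be approximated by hand. This is a sketch only: the feed element, the related attribute, and the check_ids function are hypothetical, and the attribute names are passed in explicitly because the nonvalidating standard-library parser does not expose attribute types.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: "related" plays the role of an IDREFS attribute.
DOC = """<feed>
  <news newsID="id39" related="id40">Text</news>
  <news newsID="id40" related="id39 id41">Text</news>
  <news newsID="id39">Text</news>
</feed>"""


def check_ids(root, id_attr, idref_attrs):
    """Return (duplicate ID values, references that match no ID)."""
    ids = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted({
        ref
        for el in root.iter()
        for attr in idref_attrs
        for ref in el.get(attr, "").split()   # IDREFS is space-separated
        if ref not in ids
    })
    return duplicates, dangling


duplicates, dangling = check_ids(ET.fromstring(DOC), "newsID", ["related"])
print(duplicates)
print(dangling)
```

Here id39 is reported as a duplicate because two news elements claim it, and id41 is reported as dangling because no element carries it.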
