The Document Type Definition (DTD) is native to XML 1.0. You’ll learn how to use DTDs in this hack.
XML inherited the Document Type Definition (DTD) from SGML. It is the native language for validating XML—though it is not itself in XML syntax—and is interwoven into the XML 1.0 specification (http://www.w3.org/TR/2004/REC-xml-20040204/). Using non-XML syntax, a DTD defines the structure or content model of a valid XML instance. A DTD can define elements, attributes, entities, and notations, and can contain comments (just like XML comments), conditional sections, and a structure unique to DTDs called parameter entities. DTDs can be internal or external to an XML document, or both. This hack shows you how to implement all the basic structures of a DTD.
Example 5-1 shows external.xml, and Example 5-2 shows a DTD against which external.xml is valid. The external DTD is called order.dtd. A DTD stored outside the document like this is also known as an external subset. This DTD is a local file in this example, but it could also exist across a network.
Example 5-1. external.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE order SYSTEM "order.dtd">

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>
The XML declaration on line 1 declares that this document does not stand alone. That's because on line 2, external.xml references the DTD order.dtd. The file order.dtd is considered an external entity and is called an external subset. The SYSTEM keyword on line 2 indicates that the DTD will be identified by a system identifier, which for all practical purposes is a URL for a local or remote file.
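Because a system identifier is just a URI reference, pointing a document at a DTD served over a network changes only the quoted string. Here is a minimal sketch, assuming a made-up location on www.example.com for order.dtd (that URL is an illustration, not part of the original example):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Assumption: the URL below is a hypothetical home for order.dtd -->
<!DOCTYPE order SYSTEM "http://www.example.com/dtds/order.dtd">
<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <quantity>23</quantity>
 <price>
  <regular>4.99</regular>
 </price>
</order>
```

A validating processor would have to fetch the remote file before it could check the document, so a local copy is usually the more dependable choice during development.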
This DTD declares all the valid structures found in external.xml. The document element is order (line 4 of external.xml), which has child elements that describe the pieces of a purchase order, including information on the store, product, product packaging, product weight, quantity, price, and shipping method. Validate external.xml against its associated DTD order.dtd by using RXP, xmlvalid, or xmllint on the command line [Hack #9], or use RXP online or the Brown University STG online validator [Hack #9].
Example 5-2. order.dtd
<?xml encoding="UTF-8"?>
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>
<!ELEMENT ship (#PCDATA)>
The first line of order.dtd is a text declaration (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-TextDecl). A text declaration is similar to an XML declaration (see "The XML Declaration" in Chapter 1), except that version information (e.g., version="1.0") is optional; encoding declarations, such as encoding="UTF-8", are required; and there are no standalone declarations (e.g., standalone="no").
Most of the lines in this DTD contain element type declarations (http://www.w3.org/TR/2004/REC-xml-20040204/#elemdecls). This is one of several kinds of markup declarations (http://www.w3.org/TR/2004/REC-xml-20040204/#dt-markupdecl) that may appear in a DTD. The simplest, on lines 8 through 11, line 13, and lines 16 through 19, have content models for parsed character data (#PCDATA), which means that these elements must contain only text, with no element children. The elements declared on lines 3 and 14 (order and price) have content models that include only child elements. The +, ?, and * symbols are occurrence indicators that constrain how many times a child element may occur: + means the element may occur one or more times; ? means the element may occur zero or one time (that is, it's optional); and * means the element may occur zero or more times. An element name with no occurrence indicator must occur exactly once. The commas (,) between names form a sequence, meaning that the children must appear in the order listed.
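To see the occurrence indicators at work outside the order example, here is a minimal sketch built around a hypothetical memo document (the element names are invented for illustration). Its content model calls for one or more to elements, an optional cc, exactly one subject, and any number of para elements, in that order:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE memo [
<!-- to+ : one or more; cc? : zero or one; subject : exactly one; para* : zero or more -->
<!ELEMENT memo (to+,cc?,subject,para*)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT cc (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT para (#PCDATA)>
]>
<memo>
 <to>Tom</to>
 <to>Prineville store</to>
 <subject>Oats delivery</subject>
 <para>Twenty-three sacks are ready.</para>
 <para>Pickup is behind the store.</para>
</memo>
```

Moving subject ahead of the first to element, leaving subject out, or adding a second cc would each make this instance invalid, because the sequence or an occurrence constraint would no longer be satisfied.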
The DTD order.dtd has three attribute-list declarations on lines 5, 12, and 15. You can declare one or more attributes at a time, hence the phrase attribute list. The first declares three attributes, id, xmlns, and date. XML attributes declared in DTDs must have one of 10 possible types: CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION, and enumeration (see http://www.w3.org/TR/2004/REC-xml-20040204/#sec-attribute-types for an explanation of all the types).
The attribute id on line 5 is of type ID, which must be an XML name (http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Name) and must be unique (http://www.w3.org/TR/2004/REC-xml-20040204/#id). It is also required (#REQUIRED); that is, it must appear in any valid instance of the DTD.
On line 7, the attribute date is declared. The #IMPLIED keyword means that the attribute may or may not appear in a valid instance. CDATA means that the value of date will be a string.
The std attribute for the weight element is declared on line 12. It is required (#REQUIRED) and is of type NMTOKEN. A name token is a single, atomic unit: a string of name characters (such as letters, digits, periods, hyphens, underscores, and colons) with no whitespace. The attribute-list declaration on line 15 declares the cur (currency) attribute for the price element. The default value in quotes is USD (United States dollar), with possible values USD, CAD (Canadian dollar), AUD (Australian dollar), and EUR (Euro).
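The following sketch gathers several attribute types in one small, self-contained document. The inventory, item, and note names, and the second part number, are hypothetical, and IDREF is included only to show how it pairs with ID:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE inventory [
<!ELEMENT inventory (item+,note*)>
<!ELEMENT item (#PCDATA)>
<!-- sku must be unique in the document; unit allows no whitespace; cur defaults to USD -->
<!ATTLIST item sku  ID                #REQUIRED
               unit NMTOKEN           #IMPLIED
               cur  (USD|CAD|AUD|EUR) "USD">
<!ELEMENT note (#PCDATA)>
<!-- about must match the sku value of some item in the same document -->
<!ATTLIST note about IDREF #REQUIRED>
]>
<inventory>
 <item sku="TDI-983857" unit="lbs.">feed-grade whole oats</item>
 <item sku="TDI-983858" cur="CAD">rolled barley</item>
 <note about="TDI-983857">Sold by the 50 lb. sack.</note>
</inventory>
```

The first item carries no cur attribute, so a processor that reads the DTD supplies the default, USD. Repeating a sku value, writing a unit such as "lbs per sack" (which contains spaces), or using a currency outside the enumerated list would each be a validity error.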
You can also have a DTD that is internal to an XML document. This is called the internal subset. internal.xml is an example of an XML document that contains an internal subset (Example 5-3). The DTD is stored in the DOCTYPE declaration, which encloses markup declarations in square brackets ([ ]); see lines 2 and 21.
Example 5-3. internal.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE order [
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>
<!ELEMENT ship (#PCDATA)>
]>

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>
On line 1, the document internal.xml is declared to be standalone; i.e., it does not depend on markup declarations in an external entity. Notice that there is no SYSTEM keyword or system identifier (URL). This is because the markup declarations are enclosed in the document type declaration, rather than in an external entity. The document type declaration (lines 2 through 21) contains the same declarations as order.dtd, and the document itself (lines 23 through 35) is the same as external.xml, except for the DOCTYPE.
The document both.xml, shown in Example 5-4, uses both an internal subset and an external subset (both.dtd in Example 5-5). Notice how the document type declaration includes the SYSTEM keyword and a system identifier (both.dtd), and also encloses markup declarations in square brackets ([ ]). The advantage of this syntax is that DTDs can be developed and used in a modular fashion, and documents can be validated with these modules whether they exist locally or in disparate locations (across the Internet).
Example 5-4. both.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE order SYSTEM "both.dtd" [
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT ship (#PCDATA)>
]>

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>
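For both.xml to validate, the external subset has to supply whatever the internal subset leaves out: the price element, its cur attribute, and the high, regular, and discount children. Here is a minimal sketch of such a both.dtd, assuming it holds only those declarations and mirrors the corresponding lines of order.dtd (the actual file may differ):

```xml
<?xml encoding="UTF-8"?>
<!-- Hypothetical both.dtd: price-related declarations only (an assumption) -->
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>
```

When splitting a DTD this way, keep the two halves from overlapping: declaring the same element type in both the internal and external subsets is a validity error.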
A parameter entity (PE) is a special entity that can be used only in a DTD; PE references are not recognized in the body of an XML document. A PE provides a way to store information and then reuse that information elsewhere, multiple times. A good example of this can be found in the way the XHTML 1.0 strict DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd) defines a set of core attributes. Here is a fragment from the DTD:
<!-- core attributes common to most elements
  id       document-wide unique id
  class    space separated list of classes
  style    associated style info
  title    advisory title/amplification
-->
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >
Lines 1 through 6 of this fragment contain a comment explaining the purpose of four attributes, id, class, style, and title. Starting on line 7, an entity is declared. The percent sign (%) is a flag to the XML processor saying that this is a parameter entity. The information in double quotes makes up part of an attribute-list declaration that is reused three times in the DTD.
Where normal entity references [Hack #4] begin with an ampersand (&), parameter entity references begin with a percent sign (%). Lines 10 and 11 show the parameter entity references %StyleSheet; and %Text;, which are defined elsewhere in the DTD as:
<!ENTITY % StyleSheet "CDATA">
    <!-- style sheet data -->
<!ENTITY % Text "CDATA">
    <!-- used for titles etc. -->
%StyleSheet; and %Text; expand to CDATA. As you can see, a parameter entity can contain a reference to another parameter entity. In fact, the attrs parameter entity in xhtml1-strict.dtd references coreattrs and two other parameter entities:
<!ENTITY % attrs "%coreattrs; %i18n; %events;">
attrs, in turn, is used over 60 times in the DTD, so you can see that parameter entities are a handy way to reuse information in a DTD.
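The same technique works on a small scale. The sketch below is a hypothetical rewrite of the price-related part of order.dtd (the entity names text and currency are invented here) that factors the repeated (#PCDATA) content model and the currency list into parameter entities. It is written as an external DTD file because XML 1.0 does not allow parameter entity references inside markup declarations in the internal subset:

```xml
<?xml encoding="UTF-8"?>
<!-- Hypothetical external DTD fragment; the entity names are illustrative -->
<!ENTITY % text "(#PCDATA)">
<!ENTITY % currency "(USD|CAD|AUD|EUR)">

<!ELEMENT price (high?,regular,discount?)>
<!ATTLIST price cur %currency; "USD">
<!ELEMENT high %text;>
<!ELEMENT regular %text;>
<!ELEMENT discount %text;>
```

Adding a currency then means editing one entity declaration rather than every attribute list that uses the enumeration.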
This section briefly covers several other things you can include in DTDs: comments, conditional sections, unparsed entities, and notations.
DTDs can contain XML-style comments [Hack #1]. For example, the comments on lines 2 and 4 of Example 5-2 are formed just as they would be in an XML document.
Conditional sections allow you to include or exclude declarations in a DTD conditionally. This feature can help you develop a DTD while you are still trying out different content models. Look at this fragment from conditional.dtd:
<![INCLUDE[
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
]]>
<![IGNORE[
<!ATTLIST price cur (USD|EUR) "USD">
]]>
The structure that starts with the word INCLUDE indicates that the following declaration (which must be complete) is to be included in the DTD at validation time. The section marked IGNORE, however, is ignored. The following fragment, also in conditional.dtd, shows how you can turn these sections on or off with parameter entities.
<!ENTITY % on 'INCLUDE' >
<!ENTITY % off 'IGNORE' >
...
<![%on;[
<!ELEMENT price (high?,regular,discount?,total?)>
]]>
<![%off;[
<!ELEMENT price (regular,discount,total)>
]]>
Conditional sections are an interesting hack in themselves, but they are frequently considered more complicating than helpful.
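Because an entity declaration in the internal subset takes precedence over a later declaration of the same name in the external subset, a document can flip these switches itself. Here is a sketch, assuming that conditional.dtd declares on and off as in the fragment above, that it is otherwise equivalent to order.dtd, and that it also declares a total element with a (#PCDATA) content model (none of which is confirmed by the fragment):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Assumptions: conditional.dtd otherwise matches order.dtd and declares total -->
<!DOCTYPE order SYSTEM "conditional.dtd" [
<!-- Read before conditional.dtd, so these win over its own on and off declarations -->
<!ENTITY % on 'IGNORE'>
<!ENTITY % off 'INCLUDE'>
]>
<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <quantity>23</quantity>
 <price>
  <regular>4.99</regular>
  <discount>3.99</discount>
  <total>91.77</total>
 </price>
</order>
```

With the switches reversed, the stricter content model (regular,discount,total) is the one in force for this document, so leaving out discount or total would make it invalid.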
An unparsed entity is a resource upon which XML places no constraints. It can consist of a chunk of XML, non-XML text, a graphical file, a binary file, or any other electronic resource. An unparsed entity has a name that is associated with a system identifier or a public identifier.
Each unparsed entity is also tied to a notation, a declared name that identifies the format of the resource. For example, in DocBook [Hack #62], a module of the DTD (dbnotnx.mod, under the subdirectory docbook-4.3CR in this book's file archive) is dedicated to notations. Here is a notation from that module that associates the name GIF89a with a public identifier -//CompuServe//NOTATION Graphics Interchange Format 89a//EN:
<!NOTATION GIF89a PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN">
Here is another example from the same module that uses a system identifier for the name PNG:
<!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">
Elsewhere in a DTD that includes this module, you could declare several entities like this:
<!-- a module is pulled into a DTD with a parameter entity -->
<!ENTITY % dbnotnx SYSTEM "dbnotnx.mod">
%dbnotnx;
...
<!ENTITY g001 SYSTEM "g001.gif" NDATA GIF89a>
<!ENTITY g002 SYSTEM "g002.png" NDATA PNG>
...
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic img ENTITY #REQUIRED>
Each entity declaration associates an entity name with a file and with the name of a notation. The presence of the NDATA keyword indicates an unparsed entity. Then, in an instance, you could refer to the entity in an attribute, like this:
<graphic img="g001"/>
...
<graphic img="g002"/>
The syntax for unparsed entities is the most awkward and forbidding of any syntax in XML. The use of unparsed entities is rare, and the applications that support them are even rarer. If people want to display graphics, they usually transform their XML into HTML or XHTML and use the ubiquitously supported img tag.
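For reference, the pieces fit together as follows in a single self-contained document. This is a minimal sketch: the gallery element, the logo entity, and the file logo.png are hypothetical names chosen for illustration, while the PNG notation is the one quoted earlier:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE gallery [
<!-- 1. Name the format -->
<!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">
<!-- 2. Tie an entity name to a file and to that format; NDATA makes it unparsed -->
<!ENTITY logo SYSTEM "logo.png" NDATA PNG>
<!-- 3. Declare an attribute of type ENTITY to carry the reference -->
<!ELEMENT gallery (graphic+)>
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic img ENTITY #REQUIRED>
]>
<gallery>
 <graphic img="logo"/>
</gallery>
```

Note that the attribute value is the bare entity name, logo, not an entity reference such as &logo;. The XML processor does not read logo.png itself; it only reports the entity's identifiers and notation to the application, which decides what to do with the file.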