As you have seen, XML validators recognize well-formed XML, which is very useful for picking up syntax errors in your document. Unfortunately, a well-formed, syntactically correct XML document may still contain semantic errors. For example, a job in Listing 16.4 with no location or skills does not make sense; without these elements the XML document is still well-formed, but it is not valid.
What is required is a set of rules or constraints that define a valid structure for an XML document. There are two common methods for specifying XML rules—the Document Type Definition (DTD) and XML Schemas.
A DTD provides a template that defines the occurrence and arrangement of elements and attributes in an XML document. Using a DTD, you can define:
Element ordering and hierarchy
Which attributes are associated with an element
Default values and enumeration values for attributes
Any entity references used in the document (internal constants, external files, and parameters)
NOTE
Entity references are covered in Appendix A, “An Overview of XML.”
DTDs originated with SGML and have some disadvantages when compared with XML Schemas, which were developed explicitly for XML. One of these disadvantages is that a DTD is not written in XML, which means you have to learn another syntax to define a DTD. Another disadvantage is that DTDs are not as comprehensive as XML Schemas and therefore cannot constrain an XML document as tightly as an XML Schema can.
DTD rules can be included in the XML document as document type declarations, or they can be stored in an external document. The syntax is the same in both cases.
If a DTD is being used, the XML document must include a DOCTYPE declaration, which is followed by the name of the root element for the XML document. If an external DTD is being used, the declaration also includes the word SYSTEM followed by a system identifier (the URI that identifies the location of the DTD file). For example
<!DOCTYPE jobSummary SYSTEM "jobSummary.dtd">
specifies that the root element for this XML document is jobSummary and the remainder of the DTD rules are in the file called jobSummary.dtd in the same directory.
An external identifier can also include a public identifier. The public identifier precedes the system identifier and is denoted by the word PUBLIC. An XML processor can use the public identifier to try to generate an alternative URI. If the document is unavailable by this method, the system identifier will be used. For example:
<!DOCTYPE web-app PUBLIC '-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN' 'http://java.sun.com/dtd/web-app_2_3.dtd'>
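One way a processor can act on a public identifier is through a SAX EntityResolver. The following is a minimal sketch, not a definitive implementation: the public identifier, the example.invalid URL, the CATALOG map, and the class name are all hypothetical names introduced for illustration. Returning null from the resolver lets the parser fall back to the system identifier.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch: resolve a DTD by its public identifier instead of fetching the system URI.
class PublicIdResolverDemo {

    // Hypothetical catalog: public identifier -> DTD text held locally.
    static final Map<String, String> CATALOG = Map.of(
        "-//Example//DTD Job Summary 1.0//EN",
        "<!ELEMENT jobSummary (job*)>\n<!ELEMENT job (#PCDATA)>");

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The resolver is consulted for external entities, including the external DTD.
        builder.setEntityResolver((publicId, systemId) -> {
            String dtd = CATALOG.get(publicId);
            // null means "not in the catalog": the parser then uses the system identifier.
            return dtd == null ? null : new InputSource(new StringReader(dtd));
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE jobSummary PUBLIC \"-//Example//DTD Job Summary 1.0//EN\" "
            + "\"http://example.invalid/jobSummary.dtd\">\n"
            + "<jobSummary><job>Tree pruner</job></jobSummary>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Because the catalog supplies the DTD text, the parser never needs to contact the (unreachable) system URI in this sketch.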
The DTD defines every element in the XML document with element type declarations. Each element type declaration takes the following form:
<!ELEMENT name ( content ) >
For example, for the jobSummary XML document in Listing 16.4, the jobSummary root element is defined as
<!ELEMENT jobSummary ( job* )>
The * sign indicates that the jobSummary element may consist of zero or more job elements. There are other symbols used to designate rules for combining elements and these are listed in Table 16.3.
| Character | Meaning |
|---|---|
| * | Zero or more (not required) |
| + | One or more (at least one required) |
| ? | Element is optional (if present can only appear once) |
| \| | Alternate elements |
| () | Group of elements |
The following defines an XML job element that must include one location, an optional description, and at least one skill:
<!ELEMENT job (location, description?, skill+)>
Elements can contain other elements, text content, or a mixture of both. The jobSummary element in Listing 16.4 contains other elements but no text body, whereas the location element has a text body but does not contain any elements.
To define an element that has a text body, use the reference #PCDATA (Parsed Character DATA). For example, the location element in Listing 16.4 is defined by
<!ELEMENT location (#PCDATA)>
An element can also have no content (the <br> tag in HTML is such an example). This tag would be defined with the EMPTY keyword as
<!ELEMENT br EMPTY>
You will also see elements defined with contents of ANY. The ANY keyword denotes that the element can contain all possible elements, as well as PCDATA. The use of ANY should be avoided. If your data is so unstructured that it cannot be defined explicitly, there probably is no point in creating a DTD in the first place.
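The element declarations described above can be tried out with a validating JAXP parser. The sketch below places them in an internal DTD subset and collects validation errors. The declarations for description and skill, the class name, and the sample data are assumptions made for illustration, since the text only shows the declarations for job and location.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Sketch: check a job element against the content model (location, description?, skill+).
class ContentModelDemo {

    // Internal DTD subset; description and skill are assumed to be text-only elements.
    static final String DOCTYPE =
        "<!DOCTYPE job [\n"
        + "  <!ELEMENT job (location, description?, skill+)>\n"
        + "  <!ELEMENT location (#PCDATA)>\n"
        + "  <!ELEMENT description (#PCDATA)>\n"
        + "  <!ELEMENT skill (#PCDATA)>\n"
        + "]>\n";

    // Returns the validity errors reported for the given job element.
    static List<String> validate(String jobXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + jobXml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed and valid: one location, no description, one skill.
        System.out.println(validate(
            "<job><location>London</location><skill>Tree surgeon</skill></job>"));
        // Well-formed but not valid: the required location element is missing.
        System.out.println(validate("<job><skill>Tree surgeon</skill></job>"));
    }
}
```

Validity errors are recoverable, so the handler records them and parsing continues; only well-formedness (fatal) errors are rethrown.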
In Listing 16.4, the job element has two attributes—customer and reference. Attributes are defined in an ATTLIST that has the following form:
<!ATTLIST element attribute type default-value>
The element is the name of the element and attribute is the name of the attribute. The type defines the kind of attribute that is expected. A type is either one of the defined constants described in Table 16.4, or it is an enumerated type where the permitted values are given in a bracketed list.
The ATTLIST default-value component defines a value that will be used if one is not supplied. For example
<!ATTLIST button visible (true | false) "true">
declares that the element button has an attribute called visible that can be either true or false. If the attribute is not supplied in the document, it takes the default value true.
The default-value item can also be used to specify that the attribute is #REQUIRED, #FIXED, or #IMPLIED. The meaning of these values is given in Table 16.5.
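The effect of an ATTLIST default value can be observed from Java. In this sketch the parser supplies the visible attribute from the DTD when the document omits it. Declaring button as EMPTY and the class and method names are assumptions made only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch: read an attribute whose default value comes from the DTD.
class AttributeDefaultDemo {

    // Parses a button element and returns the value of its visible attribute.
    static String visibleOf(String buttonXml) throws Exception {
        String doc =
            "<!DOCTYPE button [\n"
            + "  <!ELEMENT button EMPTY>\n" // assumed content model for button
            + "  <!ATTLIST button visible (true | false) \"true\">\n"
            + "]>\n"
            + buttonXml;
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document parsed = builder.parse(new InputSource(new StringReader(doc)));
        return parsed.getDocumentElement().getAttribute("visible");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(visibleOf("<button/>"));                   // default applied
        System.out.println(visibleOf("<button visible=\"false\"/>")); // explicit value kept
    }
}
```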
Listing 16.7 is the DTD for the jobSummary XML document. Create the DTD in a file called jobSummary.dtd in the same directory as your jobSummary XML document.
Don't forget to add the following line to the jobSummary XML at line 2 (following the PI):
<!DOCTYPE jobSummary SYSTEM "jobSummary.dtd">
View the jobSummary.xml document in your XML browser or other XML validator.
If the browser cannot find the DTD, it will generate an error. Edit jobSummary.xml, remove the customer attribute, and check that your XML validator generates an appropriate error (such as “Required attribute 'customer' is missing”).
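If you prefer to run the same check from Java rather than a browser, the following sketch writes a DTD and an XML document to a temporary directory and validates the document against the external DTD. The DTD text here is an assumed stand-in for Listing 16.7 (in particular, treating both attributes as #REQUIRED is a guess), and the sample attribute values and class name are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Sketch: validate jobSummary.xml against jobSummary.dtd in the same directory.
class ExternalDtdDemo {

    // Assumed stand-in for Listing 16.7, which is not reproduced here.
    static final String DTD =
        "<!ELEMENT jobSummary (job*)>\n"
        + "<!ELEMENT job (location, description?, skill+)>\n"
        + "<!ATTLIST job customer CDATA #REQUIRED>\n"
        + "<!ATTLIST job reference CDATA #REQUIRED>\n"
        + "<!ELEMENT location (#PCDATA)>\n"
        + "<!ELEMENT description (#PCDATA)>\n"
        + "<!ELEMENT skill (#PCDATA)>\n";

    // Builds a one-job document around the given start tag and returns validity errors.
    static List<String> validate(String jobStartTag) throws Exception {
        Path dir = Files.createTempDirectory("jobSummary"); // left on disk; fine for a sketch
        Files.writeString(dir.resolve("jobSummary.dtd"), DTD);
        Path xml = dir.resolve("jobSummary.xml");
        Files.writeString(xml,
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE jobSummary SYSTEM \"jobSummary.dtd\">\n"
            + "<jobSummary>" + jobStartTag
            + "<location>London</location><skill>Tree surgeon</skill></job></jobSummary>");

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        // Parsing from a File lets the relative system identifier resolve next to the XML file.
        builder.parse(xml.toFile());
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<job customer=\"winston\" reference=\"Cigar Trimmer\">"));
        System.out.println(validate("<job reference=\"Cigar Trimmer\">")); // customer removed
    }
}
```

The exact wording of the error message depends on the parser, so the sketch simply prints whatever the parser reports.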
As already stated, DTDs have some limitations:
A DTD cannot define type information other than characters.
DTDs were not designed to support namespaces and, although it is possible to add namespaces to a DTD, how to do so is beyond the scope of this book.
DTDs are not easily extended.
You can only have one DTD per document, so you cannot have different definitions of an element in a single document and have them validated with a DTD.
The syntax for DTDs is not XML. Tools and developers must understand the DTD syntax as well as XML.
To address these issues, the W3C developed the XML Schema structure definition mechanism to fulfill the role of DTDs. XML Schemas are themselves XML documents.
The XML Schema standard is split into two parts:
A way of specifying the structure of, and constraints on, an XML document
A way of defining data types, including a set of predefined types
Because it is a more powerful and flexible mechanism than DTDs, the syntax for defining an XML schema is slightly more involved. An example of an XML schema for the jobSummary XML shown in Listing 16.4 can be seen in Listing 16.8.
TIP
The World Wide Web Consortium Web site provides access to a number of XML schema tools, including XML schema browsers and validators. These tools can be found at http://www.w3.org/XML/Schema.
The first thing to notice is that this schema exists within a namespace as defined on the second line. The string xsd is used by convention for a schema namespace, but any prefix can be used.
Elements that have sub-elements and/or attributes are defined as complex types. In addition to complex types, there are a number of built-in simple types. Examples of a few simple types are string, integer, decimal, boolean, and date.
A complex type element (one with attributes or sub-elements) has to be defined in the schema and will typically contain a set of element declarations, element references, and attribute declarations. Listing 16.8 contains the definition for the job tag complex type, which contains three elements (location, description, and skill) and two attributes (customer and reference).
In a schema, like a DTD, elements can be made optional or required. The job element in Listing 16.8 is optional because the value of the minOccurs attribute is 0. In general, an element is required to appear when the value of minOccurs is 1 or more. Similarly, the maximum number of times an element can appear is determined by the value of maxOccurs. This value can be a positive integer or the term unbounded to indicate there is no maximum number of occurrences. The default value for both the minOccurs and the maxOccurs attributes is 1. If you do not specify the number of occurrences, the element must be present and must occur only once.
Element attributes can be declared with a use attribute to indicate whether the element attribute is required, optional, or even prohibited.
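The occurrence and use constraints described above can be exercised with the javax.xml.validation API. The schema in this sketch is an assumed approximation written for illustration, not Listing 16.8 itself; the jobType name, the choice of xsd:string for every item, and the class name are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Sketch: minOccurs, maxOccurs, and use="required" checked by a schema validator.
class SchemaOccursDemo {

    // Assumed schema; elements without minOccurs/maxOccurs must appear exactly once.
    static final String XSD =
        "<?xml version=\"1.0\"?>\n"
        + "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xsd:element name=\"jobSummary\">\n"
        + "    <xsd:complexType>\n"
        + "      <xsd:sequence>\n"
        + "        <xsd:element name=\"job\" type=\"jobType\"\n"
        + "                     minOccurs=\"0\" maxOccurs=\"unbounded\"/>\n"
        + "      </xsd:sequence>\n"
        + "    </xsd:complexType>\n"
        + "  </xsd:element>\n"
        + "  <xsd:complexType name=\"jobType\">\n"
        + "    <xsd:sequence>\n"
        + "      <xsd:element name=\"location\" type=\"xsd:string\"/>\n"
        + "      <xsd:element name=\"description\" type=\"xsd:string\" minOccurs=\"0\"/>\n"
        + "      <xsd:element name=\"skill\" type=\"xsd:string\" maxOccurs=\"unbounded\"/>\n"
        + "    </xsd:sequence>\n"
        + "    <xsd:attribute name=\"customer\" type=\"xsd:string\" use=\"required\"/>\n"
        + "    <xsd:attribute name=\"reference\" type=\"xsd:string\" use=\"required\"/>\n"
        + "  </xsd:complexType>\n"
        + "</xsd:schema>\n";

    // Returns true if the document satisfies the schema, false if validation fails.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the first validation error is thrown as a SAXException
        }
    }

    public static void main(String[] args) throws Exception {
        // job has minOccurs="0", so an empty summary is acceptable.
        System.out.println(isValid("<jobSummary/>"));
        // The customer attribute is declared with use="required" but is missing here.
        System.out.println(isValid(
            "<jobSummary><job reference=\"r\"><location>London</location>"
            + "<skill>Tree surgeon</skill></job></jobSummary>"));
    }
}
```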
There are more aspects to schemas than it is possible to cover in this book. Visit the W3C Web site (www.w3.org) for more information on XML schemas and all other aspects of XML.