Enforcing XML Document Structure

If XML is used to transfer information between applications, there needs to be a mechanism for ensuring that the XML is not only syntactically correct (well formed) but also structurally correct (valid). There are two common mechanisms for this:

  • Document Type Definitions

  • XML Schemas

Document Type Definition (DTD)

A Document Type Definition (DTD) is a way of defining the structure of an XML document. DTD elements can be included in the XML document itself or in a separate external document. The syntax used to define a DTD is different from XML itself.

The following is an example DTD that describes the jobSummary XML:

<!DOCTYPE jobSummary>
<!ELEMENT jobSummary (job*)>
<!ELEMENT job (location, description?, skill*)>
<!ATTLIST job customer CDATA #REQUIRED>
<!ATTLIST job reference CDATA #REQUIRED>
<!ELEMENT location (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT skill (#PCDATA)>

The !DOCTYPE element must include the name of the root element. If the remainder of the document type definition is stored in an external file, the !DOCTYPE element has the following form:

<!DOCTYPE root_element SYSTEM "external_filename">

If the definitions are included in the XML document itself, the !DOCTYPE element must appear in the document prolog before the actual document data begins. In this case, the !DOCTYPE element must include all the DTD elements with the following syntax:

<!DOCTYPE jobSummary [
<!ELEMENT jobSummary (job*)>
<!ELEMENT job (location, description?, skill*)>
<!ATTLIST job customer CDATA #REQUIRED>
<!ATTLIST job reference CDATA #REQUIRED>
<!ELEMENT location (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT skill (#PCDATA)>
]>

The other elements (!ELEMENT and !ATTLIST) are described in this section.

Elements

Element declarations take the following form:

<!ELEMENT element_name (content)>

where element_name is the XML tag and content is one or more of the values shown in Table C.2. Note that the EMPTY and ANY keywords are written without parentheses.

Table C.2. DTD Content Specifications for Elements
Content Type   Syntax                      Element Contains
Element        <!ELEMENT tag (sub1)>       Sub-element only
#PCDATA        <!ELEMENT tag (#PCDATA)>    Text only
EMPTY          <!ELEMENT tag EMPTY>        Nothing
ANY            <!ELEMENT tag ANY>          Anything (text or elements)

Note

#PCDATA limits the content of the element to character data only; nested elements are not allowed. Do not confuse this with CDATA sections in XML, which are used to present large areas of uninterpreted text.


The characters in Table C.3 can be used to combine multiple element content types to define more complex elements.

Table C.3. Content Characters Used in DTD Definitions
Character   Meaning
,           Sequence operator; separates a list of required elements
*           Zero or more (not required)
+           One or more (at least one required)
?           Element is optional
|           Alternate elements
()          Group of elements

The following is a declaration for the job element:

<!ELEMENT job (location, description?, skill*)>

The job element consists of, in order, exactly one location, an optional description, and zero or more skill elements.
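
The content model for job can be checked with a validating parser. The following is a minimal sketch, assuming the JAXP parser that ships with the JDK; the class name JobDtdCheck, its helper methods, and the customer and reference values are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class JobDtdCheck {
    // Internal DTD subset built from the declarations shown in the text.
    static final String DTD = String.join("\n",
        "<!DOCTYPE jobSummary [",
        "<!ELEMENT jobSummary (job*)>",
        "<!ELEMENT job (location, description?, skill*)>",
        "<!ATTLIST job customer CDATA #REQUIRED>",
        "<!ATTLIST job reference CDATA #REQUIRED>",
        "<!ELEMENT location (#PCDATA)>",
        "<!ELEMENT description (#PCDATA)>",
        "<!ELEMENT skill (#PCDATA)>",
        "]>");

    // Wraps the given job content in a jobSummary document.
    // The attribute values are hypothetical sample data.
    static String document(String jobContent) {
        return DTD + "\n<jobSummary><job customer=\"acme\" reference=\"j1\">"
            + jobContent + "</job></jobSummary>";
    }

    // Returns true if the document is well formed and valid against its DTD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported to the handler, so rethrow them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // location only: description and skill may be omitted
        System.out.println(isValid(document("<location>London</location>")));
        // location followed by several skills
        System.out.println(isValid(document(
            "<location>London</location><skill>Java</skill><skill>XML</skill>")));
        // skill before location: wrong order
        System.out.println(isValid(document(
            "<skill>Java</skill><location>London</location>")));
        // no location at all
        System.out.println(isValid(document("<description>No location</description>")));
    }
}
```

Run as written, the first two documents should be reported as valid and the last two as invalid, because the sequence operator fixes both the order of the sub-elements and the requirement for a location.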

Attributes

Attribute declarations take the following form:

<!ATTLIST element_name
    attribute_1_name type default-value
    attribute_2_name type default-value>

An attribute type can be any one of the types shown in Table C.4, though CDATA (text) is the most common.

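Table C.4. DTD Attribute Types
Type       Attribute is a…
CDATA      Character string.
NMTOKEN    Name token (a string of XML name characters containing no whitespace).
NMTOKENS   Whitespace-separated list of name tokens.
ID         Unique identifier.
IDREF      Reference to an element found elsewhere in the document. The value for IDREF must match the ID of another element.
ENTITY     Name of an unparsed external entity, such as a binary data file (for example, a GIF image).
ENTITIES   Whitespace-separated list of unparsed entity names.
NOTATION   Name of a declared notation, which typically identifies a helper program.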

The default-value item can be a quoted literal value, or it can specify that the attribute is #REQUIRED, #FIXED, or #IMPLIED. The meanings of these keywords are presented in Table C.5.

Table C.5. DTD Attribute Default Values
Default Value     Meaning
#REQUIRED         Attribute must be provided.
#FIXED "value"    Effectively a constant declaration. If the attribute is supplied, it must be set to the given value or the XML is not valid.
#IMPLIED          The attribute is optional and the processing application is allowed to use any appropriate value if required.
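
The effect of these default declarations can be seen by parsing a document that omits the attributes. The following is a minimal sketch, assuming the DOM parser that ships with the JDK; the contact, status, and version attributes and the class name AttributeDefaults are hypothetical and are not part of the jobSummary DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaults {
    // Hypothetical declarations: customer is required, contact is implied,
    // status has a literal default and version is fixed.
    static final String XML = String.join("\n",
        "<!DOCTYPE job [",
        "<!ELEMENT job EMPTY>",
        "<!ATTLIST job",
        "    customer CDATA #REQUIRED",
        "    contact  CDATA #IMPLIED",
        "    status   CDATA \"open\"",
        "    version  CDATA #FIXED \"1.0\">",
        "]>",
        "<job customer=\"acme\"/>");

    // Parses the document and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Element job = parseRoot(XML);
        System.out.println(job.getAttribute("status"));   // default supplied by the parser
        System.out.println(job.getAttribute("version"));  // fixed value supplied by the parser
        System.out.println(job.hasAttribute("contact"));  // implied attribute stays absent
    }
}
```

With the JDK parser this should print open, 1.0, and false: the parser fills in the literal default and the fixed value, whereas an omitted #IMPLIED attribute is simply absent and the application must decide what to do.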

Entity References

Another DTD element not mentioned so far is the entity declaration, which defines content that can be pulled into the document with an entity reference. Entities come in more than one form. The first, called a general entity, provides shorthand for often-used text and is declared in the following format:

<!ENTITY name "replacement text">

Note

This is, in effect, how the special characters are handled. The predefined entity amp, for example, behaves as though it had been declared as <!ENTITY amp "&#38;#38;">, so that &amp; expands to a single & character.


The entity called name can then be referred to in the XML document using &name;, as shown in the following:

<!DOCTYPE book [
...
<!ENTITY copyright "Copyright 2002 by Sams Publishing">
]>

<book title="J2EE in 21 Days">A very useful book &copyright;</book>
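
To see the replacement happen, the book document can be parsed and its text read back. The following is a minimal sketch, assuming the DOM parser that ships with the JDK; the class name EntityExpansion and its helper are made up, and the DTD contains only the entity declaration because the rest of the original DTD is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityExpansion {
    // Only the entity declaration is included; a non-validating parse
    // does not need the element declarations.
    static final String XML = String.join("\n",
        "<!DOCTYPE book [",
        "<!ENTITY copyright \"Copyright 2002 by Sams Publishing\">",
        "]>",
        "<book title=\"J2EE in 21 Days\">A very useful book &copyright;</book>");

    // Returns the text content of the root element after entity expansion.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // prints "A very useful book Copyright 2002 by Sams Publishing"
        System.out.println(rootText(XML));
    }
}
```

The application never sees &copyright; itself; by the time the text is read from the DOM, the parser has substituted the replacement text.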

The second form, called an external entity reference, provides a mechanism to include data from external sources into the document's contents. This has the following format:

<!ENTITY name SYSTEM "URI">

For example, if the file Copy.xml, which can be retrieved from the Sams Web site, contains the following XML fragment

<copyright>
  <date>2002</date>
  <publisher>Sams Publishing</publisher>
</copyright>

this can be referenced in any XML document as follows:

<!DOCTYPE book [
...
<!ENTITY copyright SYSTEM "http://www.samspublishing.com/xml/Copy.xml">
]>
<book>
  <title>J2EE in 21 Days</title>
  &copyright;
  <synopsis>All you need to know about J2EE</synopsis>
</book>
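
An external entity is fetched by the parser when the document is read. The following is a rough sketch, assuming the DOM parser that ships with the JDK and its default settings for external entities. Rather than fetching the URI over the network, it supplies the Copy.xml fragment from an in-memory string through an EntityResolver; that substitution, the class name ExternalEntityDemo, and the helper method are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ExternalEntityDemo {
    static final String COPY_URI = "http://www.samspublishing.com/xml/Copy.xml";

    // Stand-in for the content of Copy.xml, taken from the fragment in the text.
    static final String COPY_XML =
        "<copyright><date>2002</date><publisher>Sams Publishing</publisher></copyright>";

    static final String BOOK_XML = String.join("\n",
        "<!DOCTYPE book [",
        "<!ENTITY copyright SYSTEM \"" + COPY_URI + "\">",
        "]>",
        "<book>",
        "  <title>J2EE in 21 Days</title>",
        "  &copyright;",
        "  <synopsis>All you need to know about J2EE</synopsis>",
        "</book>");

    // Parses the document, resolving the external entity from memory.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Serve Copy.xml from the string above; anything else resolves to empty content.
            builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader(COPY_URI.equals(systemId) ? COPY_XML : "")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(BOOK_XML);
        // The copyright fragment is now part of the book element's content.
        System.out.println(doc.getElementsByTagName("publisher").item(0).getTextContent());
    }
}
```

Because external entities cause the parser to read from locations named in the document, applications that process untrusted XML commonly restrict or disable them; a resolver such as the one sketched here is one way to control what is actually loaded.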

XML Schema

Like DTDs, an XML Schema can be used to specify the structure of an XML document. In addition, it has many advantages over DTDs:

  • Schemas have a way of defining data types, including a set of pre-defined types.

  • A schema is namespace aware.

  • It is possible to precisely specify the number of occurrences of an element (as opposed to a DTD's imprecise use of ?, *, and +) with the minOccurs and maxOccurs attributes.

  • Schemas can restrict the values that can be assigned to predefined types.

  • A schema is written in XML.

The following is a schema to define the jobSummary XML:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

  <xsd:element name="jobSummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="job" type="jobType" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="jobType">
    <xsd:sequence>
      <xsd:element name="location" type="xsd:string"/>
      <xsd:element name="description" type="xsd:string"/>
      <xsd:element name="skill" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="customer" type="xsd:string" use="required"/>
    <xsd:attribute name="reference" type="xsd:string" use="required"/>
  </xsd:complexType>
</xsd:schema>
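
A document can be checked against this schema with the javax.xml.validation API in the JDK. The following is a minimal sketch; the class name JobSchemaCheck, its helper methods, and the sample values are made up for illustration. Note that, as written, the schema requires a description and at least one skill, which is stricter than the DTD shown earlier.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class JobSchemaCheck {
    // The jobSummary schema from the text.
    static final String XSD = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" elementFormDefault=\"qualified\">",
        "  <xsd:element name=\"jobSummary\">",
        "    <xsd:complexType>",
        "      <xsd:sequence>",
        "        <xsd:element name=\"job\" type=\"jobType\" minOccurs=\"0\" maxOccurs=\"unbounded\"/>",
        "      </xsd:sequence>",
        "    </xsd:complexType>",
        "  </xsd:element>",
        "  <xsd:complexType name=\"jobType\">",
        "    <xsd:sequence>",
        "      <xsd:element name=\"location\" type=\"xsd:string\"/>",
        "      <xsd:element name=\"description\" type=\"xsd:string\"/>",
        "      <xsd:element name=\"skill\" type=\"xsd:string\" minOccurs=\"1\" maxOccurs=\"unbounded\"/>",
        "    </xsd:sequence>",
        "    <xsd:attribute name=\"customer\" type=\"xsd:string\" use=\"required\"/>",
        "    <xsd:attribute name=\"reference\" type=\"xsd:string\" use=\"required\"/>",
        "  </xsd:complexType>",
        "</xsd:schema>");

    // Builds a one-job document around the given skill elements (sample values are hypothetical).
    static String document(String skills) {
        return "<jobSummary><job customer=\"acme\" reference=\"j1\">"
            + "<location>London</location><description>Developer</description>"
            + skills + "</job></jobSummary>";
    }

    // Returns true if the document is valid against the schema.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema itself is broken", e);
        }
        try {
            // With no error handler installed, the first validity error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(document("<skill>Java</skill>"))); // one skill: valid
        System.out.println(isValid(document("")));                    // no skill: invalid
        System.out.println(isValid("<jobSummary/>"));                 // no jobs at all: valid
    }
}
```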

In schemas, elements can have a type attribute that can be one of the following:

  • string: Any combination of characters

  • integer: An integral number

  • float: A floating-point number

  • boolean: true/false or 1/0

  • date: A date in the form yyyy-mm-dd

There are many more predefined simple data types; a full list can be obtained from the W3C Web site.

Alternatively, an element can have a complex type, which is a combination of elements, or of elements and text.

The number of times an element can appear is controlled by two attributes:

  • minOccurs

  • maxOccurs

For example, the following skill element must appear at least once and can occur any number of times.

<xsd:element name="skill" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>

Elements can be made optional by setting the value of the minOccurs attribute to 0.

Attributes can be declared with a use attribute to indicate whether the attribute is required, optional, or even prohibited.

A declaration of a complex type generally includes one of the following that specifies how the elements appear in the document:

  • all: All the named elements must appear, but they may be in any order.

  • choice: One, and only one, of the elements listed must appear.

  • sequence: All the named elements must appear, in the order listed.
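
The difference between the three can be seen by validating the same content against each in turn. The following is a minimal sketch using the javax.xml.validation API in the JDK; the pair, first, and second elements, the class name ModelGroups, and its helper methods are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class ModelGroups {
    static final String FIRST = "<first>a</first>";
    static final String SECOND = "<second>b</second>";

    // Builds a schema whose root element 'pair' holds 'first' and 'second'
    // inside the given group: "all", "choice" or "sequence".
    static String schemaFor(String group) {
        return String.join("\n",
            "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">",
            "  <xsd:element name=\"pair\">",
            "    <xsd:complexType>",
            "      <xsd:" + group + ">",
            "        <xsd:element name=\"first\" type=\"xsd:string\"/>",
            "        <xsd:element name=\"second\" type=\"xsd:string\"/>",
            "      </xsd:" + group + ">",
            "    </xsd:complexType>",
            "  </xsd:element>",
            "</xsd:schema>");
    }

    // Returns true if <pair>content</pair> is valid under the given group.
    static boolean isValid(String group, String content) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(schemaFor(group))));
        } catch (SAXException e) {
            throw new IllegalStateException("schema itself is broken", e);
        }
        try {
            String xml = "<pair>" + content + "</pair>";
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("sequence", FIRST + SECOND)); // true: listed order
        System.out.println(isValid("sequence", SECOND + FIRST)); // false: wrong order
        System.out.println(isValid("all", SECOND + FIRST));      // true: any order
        System.out.println(isValid("all", FIRST));               // false: second is missing
        System.out.println(isValid("choice", FIRST));            // true: exactly one
        System.out.println(isValid("choice", FIRST + SECOND));   // false: both present
    }
}
```

These results assume the default occurrence constraints; setting minOccurs or maxOccurs on the individual elements, or on a choice or sequence itself, relaxes or tightens what each group accepts.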
