Structure and Syntax of XML

In this section, you will explore the syntax of XML and understand what is meant by a well-formed document.

Note

You will often encounter the terms “well-formed” and “valid” applied to XML documents. These are not the same. A well-formed document is structurally and syntactically correct, whereas a valid document is well-formed and also conforms to the rules laid down for that type of document, for example in a DTD or schema. A document can be well-formed but not valid.


The best way to become familiar with the syntax of XML is to write an XML document. To check your XML, you will need access to an XML-aware browser or another XML validator.

An XML browser includes an XML parser. To get the browser to check the syntax and structure of your XML document, simply use the browser to open the XML file. Well-formed XML will be displayed in a structured way (with indentation). If the XML is not well-formed, an appropriate error message will be given.
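
If you do not have an XML-aware browser to hand, any XML parser can perform the same check. The following is a minimal sketch in Python (not one of the book's own examples) that uses the standard library's xml.etree.ElementTree parser. The function name check_well_formed and the two sample strings are assumptions made for illustration. Note that this parser checks only that the document is well-formed; it does not validate it.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<note><to>Reader</to></note>"
bad = "<note><to>Reader</note>"  # the <to> element is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

The error message returned for the second string includes the line and column at which the parser gave up, which is much the same information a browser reports.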

Tip

An easy way to validate XML is to use a browser. A validating XML parser is available for Microsoft Internet Explorer versions 5 and later. To obtain this, access the Microsoft Developer Network (MSDN) Web site (msdn.microsoft.com/downloads/samples/internet/xml/xml_validator/) and follow the download instructions.

Other XML validators are available, such as the Sun Microsystems Multi-Schema XML Validator. This is a Java tool to validate XML documents and can be obtained from www.sun.com/software/xml/developers/multischema/.


HTML and XML

At first glance, XML looks very similar to HTML. Like an HTML document, an XML document consists of elements, each with a start tag and an end tag enclosed in angle brackets. In fact, Listing 16.1 is both well-formed HTML and well-formed XML.

Listing 16.1. Example XML and HTML
<html>
  <head><title>Web Page</title></head>
  <body>
    <H1>Teach Yourself J2EE in 21 Days</H1>
    <P>Now you have seen the web page – buy the book</P>
  </body>
</html>

An XML document is well-formed only if it contains no syntax errors. If you are familiar with HTML, you will be aware that many browsers are lenient with poorly formed HTML documents. Missing end tags, and even missing sections, are often ignored and so go unnoticed until the page is displayed in a more rigorous browser and fails to display correctly.

XML differs from HTML in that a missing end tag will always cause an error.
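
The difference in strictness can be seen by feeding the same markup to an HTML parser and an XML parser. The sketch below is in Python and uses the standard library's html.parser and xml.etree.ElementTree modules. The TagCollector class and the sample markup are hypothetical, introduced only for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# Neither <p> element has an end tag.
markup = "<body><p>first paragraph<p>second paragraph</body>"

collector = TagCollector()
collector.feed(markup)   # no exception: the missing end tags are tolerated
collector.close()

try:
    ET.fromstring(markup)
    xml_accepted = True
except ET.ParseError:
    xml_accepted = False  # the XML parser rejects the same text

print(collector.start_tags, xml_accepted)
```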

We will now look at XML syntax so you can understand what is going on.

Structure of an XML Document

The outermost element in an XML document is called the root element, or top-level element. Each XML document must have one and only one root element; if there is more than one, the parser will report an error.

The root element can be preceded by a prolog that contains XML declarations. The prolog is optional, but it is good practice to include one in every XML document to state the XML version being used (all full XML listings in this chapter include a prolog). Comments can be inserted almost anywhere in an XML document, except inside a tag. To be well-formed, even a minimal XML document must contain at least one element.
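
These structural rules are easy to try out with a parser. The following Python sketch uses the standard xml.etree.ElementTree module. The helper root_tag and the four sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def root_tag(xml_text):
    """Return the root element's name, or None if the text is not well-formed."""
    try:
        return ET.fromstring(xml_text).tag
    except ET.ParseError:
        return None


with_prolog = '<?xml version="1.0"?>\n<jobSummary></jobSummary>'
without_prolog = "<jobSummary/>"       # prolog omitted, still well-formed
two_roots = "<job></job><job></job>"    # a second top-level element
prolog_only = '<?xml version="1.0"?>'   # no element at all

for sample in (with_prolog, without_prolog, two_roots, prolog_only):
    print(root_tag(sample))
```

The first two documents yield the root element name, while the last two are rejected because one has two root elements and the other has none.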

Declarations

There are two types of XML declaration. XML documents may, and should, begin with an XML declaration, which specifies the version of XML being used. The following is an example of an XML declaration:

<?xml version ="1.0"?>

which tells the parser that this document conforms to XML version 1.0 (W3C Recommendation, 10 February 1998). As with all declarations, the XML declaration, if present, must be placed in the prolog; in fact, it must be the very first thing in the document.

The other type of declaration is called an XML document type declaration and is used to validate the XML. This will be discussed in more detail in the section titled “Creating Valid XML” later in this chapter.
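
To see how a parser treats the XML declaration, you can register a callback for it. The sketch below is in Python and uses the standard xml.parsers.expat module, whose XmlDeclHandler is called when the declaration is parsed. The function name declared_version and the sample documents are hypothetical.

```python
import xml.parsers.expat


def declared_version(xml_text):
    """Return the version given in the XML declaration, or None if absent."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        seen["version"] = version

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return seen.get("version")


print(declared_version('<?xml version="1.0"?><jobSummary/>'))
print(declared_version("<jobSummary/>"))  # no declaration present

# A declaration that is not at the very start of the document is an error.
try:
    declared_version('\n<?xml version="1.0"?><jobSummary/>')
    misplaced_accepted = True
except xml.parsers.expat.ExpatError:
    misplaced_accepted = False
```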

Elements

An element must have a start tag and an end tag, each enclosed in < and > characters. The end tag is the same as the start tag except that its name is preceded by a / character. Tags are case sensitive, and the names used in the start and end tags must match exactly. For example, <Start>...</start> does not make up an element, whereas <Start>...</Start> does, because both tags use the same letter case.

An element name can only contain letters, digits, underscores _, colons :, periods ., and hyphens -. An element name may not begin with a digit, period, or hyphen.
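
The naming rule can be expressed as a regular expression. The Python sketch below is a deliberately simplified, ASCII-only approximation of the rule as stated above (the full XML specification also permits many non-ASCII letters). The names NAME_PATTERN and is_simple_xml_name are assumptions made for illustration.

```python
import re

# First character: a letter, underscore, or colon.
# Remaining characters: letters, digits, underscores, colons, periods, hyphens.
# ASCII only; this is an approximation, not the full XML name grammar.
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_simple_xml_name(name):
    """Return True if name satisfies the simplified naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("jobSummary", "job-2.a", "2ndJob", "-job", "job summary"):
    print(candidate, is_simple_xml_name(candidate))
```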

The element may also optionally have attributes and a body. All the elements in Listing 16.2 are well-formed XML elements.

Listing 16.2. Valid XML Elements
1: <start>this is the beginning</start>
2: <date day="16th" Month="February">My Birthday</date>
3: <today yesterday="15th" Month="February"></today>
4: <box color="red"/>
5: <head></head>
6: <end/>

Table 16.1 describes each of these elements.

Table 16.1. XML Elements
Line in Listing 16.2   Element Type                       An Element With
1                      <tag>text</tag>                    A start tag, body, and end tag
2                      <tag attribute="text">text</tag>   An attribute and a body
3                      <tag attribute="text"></tag>       An attribute but no body
4                      <tag attribute="text"/>            Short form of an element with an attribute but no body
5                      <tag></tag>                        A start tag and end tag but no body
6                      <tag/>                             Shorthand for the previous tag
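
A parser treats the long and short forms of an empty element identically. The following Python sketch, which uses the standard xml.etree.ElementTree module, parses some of the elements from Listing 16.2 and inspects what the parser found. The variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The long and short forms of an element with no body.
long_form = ET.fromstring("<head></head>")
short_form = ET.fromstring("<head/>")
print(long_form.text, short_form.text)    # neither element has body text
print(len(long_form), len(short_form))    # neither has child elements

# An element with attributes and a body.
date = ET.fromstring('<date day="16th" Month="February">My Birthday</date>')
print(date.attrib, date.text)

# Attribute names are case sensitive: "month" is not the same as "Month".
print(date.get("Month"), date.get("month"))

# An element with an attribute but no body, in the short form.
box = ET.fromstring('<box color="red"/>')
print(box.get("color"))
```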

Although the body of an element may contain nearly all the printable Unicode characters, certain characters are not allowed in certain places. The characters in Table 16.2 have a special meaning in XML markup and cannot be used in tag or attribute names. If these characters are required in the body of an element or in an attribute value, the appropriate symbolic form in Table 16.2 can be used to represent them; the & and < characters must always be written this way.

Table 16.2. Special XML Characters
Character   Name                  Symbolic Form
&           Ampersand             &amp;
<           Open angle bracket    &lt;
>           Close angle bracket   &gt;
'           Single quotes         &apos;
"           Double quotes         &quot;

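In practice, you rarely substitute the symbolic forms in Table 16.2 by hand. As a minimal sketch, the Python code below uses the escape function from the standard xml.sax.saxutils module, which replaces &, <, and > and accepts an optional dictionary of further replacements. The sample strings and the element name description are assumptions made for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "salary > 20000 & age < 65"

# escape() replaces &, < and > with their symbolic forms.
escaped = escape(raw)
print(escaped)

# Quotes are only replaced if you ask for it via the second argument.
quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)

# A parser turns the symbolic forms back into the original characters.
element = ET.fromstring("<description>" + escaped + "</description>")
print(element.text == raw)
```
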
The elements in an XML document form a tree-like hierarchy, with elements containing other elements and data. Elements must nest properly; that is, an end tag must close the most recently opened element that has not yet been closed. This means that

<B><I>bold and italic</I></B>

is correct, while

<B><I>bold and italic</B></I>

is not.
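
The nesting rule is naturally checked with a stack: each start tag is pushed, and each end tag must match the tag on top of the stack. The following Python sketch illustrates the idea only. It recognizes simple tags without attributes and ignores comments and declarations, so it is not a substitute for a real parser. The names TAG and properly_nested are hypothetical.

```python
import re

# Matches simple start tags (<B>), end tags (</B>), and empty tags (<B/>).
# Attributes, comments, and declarations are deliberately not handled.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)\s*(/?)>")


def properly_nested(text):
    """Return True if every end tag closes the most recent open start tag."""
    open_elements = []
    for closing, name, empty in TAG.findall(text):
        if empty:
            continue                      # <name/> opens and closes at once
        if not closing:
            open_elements.append(name)    # start tag: remember it
        elif not open_elements or open_elements.pop() != name:
            return False                  # end tag does not match latest start
    return not open_elements              # nothing may be left open


print(properly_nested("<B><I>bold and italic</I></B>"))
print(properly_nested("<B><I>bold and italic</B></I>"))
```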

Well-Formed XML Documents

An XML document is said to be well-formed if there is exactly one root element, and it and every sub-element has delimiting start and end tags that are properly nested within each other.

The following is a simple XML document with an XML declaration followed by a number of elements. The structure represents a list of jobs that could be used in the Agency case study example. In Listing 16.3, <jobSummary> is the root element, and it contains a number of <job> elements.

Listing 16.3. Example jobSummary XML
<?xml version ="1.0"?>
<jobSummary>
  <job>
    <customer>winston</customer>
    <reference>Cigar Trimmer</reference>
    <location>London</location>
    <description>Must like to talk and smoke</description>
    <skill>Cigar maker</skill>
    <skill>Critic</skill>
  </job>
  <job>
    <customer>george</customer>
    <reference>Tree pruner</reference>
    <location>Washington</location>
    <description>Must be honest</description>
    <skill>Tree surgeon</skill>
  </job>
</jobSummary>
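
Because the document is well-formed, a parser can turn it into a tree that a program can walk. The Python sketch below, using the standard xml.etree.ElementTree module, reads the content of Listing 16.3 and collects each job into a dictionary. The variable names and the dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

JOB_SUMMARY = """<?xml version="1.0"?>
<jobSummary>
  <job>
    <customer>winston</customer>
    <reference>Cigar Trimmer</reference>
    <location>London</location>
    <description>Must like to talk and smoke</description>
    <skill>Cigar maker</skill>
    <skill>Critic</skill>
  </job>
  <job>
    <customer>george</customer>
    <reference>Tree pruner</reference>
    <location>Washington</location>
    <description>Must be honest</description>
    <skill>Tree surgeon</skill>
  </job>
</jobSummary>"""

root = ET.fromstring(JOB_SUMMARY)   # the <jobSummary> root element

jobs = []
for job in root.findall("job"):     # each <job> child of the root
    jobs.append({
        "customer": job.findtext("customer"),
        "reference": job.findtext("reference"),
        "skills": [skill.text for skill in job.findall("skill")],
    })

for entry in jobs:
    print(entry)
```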
						

Attributes

Attributes are name/value pairs that are associated with elements. There can be any number of attributes, although a given attribute name can appear only once in an element, and an element's attributes all appear inside the start tag. The names of attributes are case sensitive and are limited to certain characters in the same way as those of elements. Attribute names can only contain letters, digits, underscores _, colons :, periods ., and hyphens -. An attribute name cannot begin with a digit, period, or hyphen.

The value of an attribute is a text string delimited by quotes. Unlike HTML, all attribute values in an XML document must be enclosed in quotes; either single or double quotes can be used. Listing 16.4 shows the jobSummary XML document rewritten to use attributes to hold some of the data.

Listing 16.4. JobSummary.xml XML with Attributes
<?xml version ="1.0"?>
<jobSummary>
  <job customer="winston" reference="Cigar Trimmer">
    <location>London</location>
    <description>Must like to talk and smoke</description>
    <skill>Cigar maker</skill>
    <skill>Critic</skill>
  </job>
  <job customer="george" reference="Tree pruner">
    <location>Washington</location>
    <description>Must be honest</description>
    <skill>Tree surgeon</skill>
  </job>
</jobSummary>

This version is preferable to the previous one for two reasons. First, it is easier to check by eye to make sure that every job also has a customer and reference. Second, in programming terms, the reference and the customer are items that need to be integrity checked (the reference has to be unique and the customer must already exist). You generally make an item an attribute when its value is limited in some way. You will see later, in the “Document Type Definitions” section, how to use DTDs and schemas to check the validity of both elements and attributes.
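
Until DTDs and schemas are introduced, the kind of integrity check described above can be sketched in application code. The following Python example uses the standard xml.etree.ElementTree module on an abbreviated copy of Listing 16.4. The function check_jobs, the KNOWN_CUSTOMERS set, and the wording of the problem messages are all hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

JOB_SUMMARY = """<?xml version="1.0"?>
<jobSummary>
  <job customer="winston" reference="Cigar Trimmer">
    <location>London</location>
  </job>
  <job customer="george" reference="Tree pruner">
    <location>Washington</location>
  </job>
</jobSummary>"""

KNOWN_CUSTOMERS = {"winston", "george"}   # hypothetical list of customers


def check_jobs(xml_text, known_customers):
    """Return a list of problems found in the job attributes."""
    problems = []
    seen_references = set()
    for job in ET.fromstring(xml_text).findall("job"):
        customer = job.get("customer")     # None if the attribute is absent
        reference = job.get("reference")
        if customer is None or reference is None:
            problems.append("job is missing customer or reference")
            continue
        if customer not in known_customers:
            problems.append(f"unknown customer: {customer}")
        if reference in seen_references:
            problems.append(f"duplicate reference: {reference}")
        seen_references.add(reference)
    return problems


print(check_jobs(JOB_SUMMARY, KNOWN_CUSTOMERS))
print(check_jobs(JOB_SUMMARY, {"winston"}))
```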

Comments

XML comments use the same syntax as HTML comments. They are introduced by <!-- and ended with -->, for example:

<!-- this is a comment -->

Comments can appear anywhere in a document except within the tags, for example,

<item quantity="1 lb">Cream cheese <!-- this is a comment --></item>

is acceptable, whereas the following is not

<item <!-- this is a comment --> quantity="1 lb">Cream cheese </item>

Note

As with commenting code, the comments you add to your XML should be factually correct, useful, and to the point. They should be used to make the XML document easier to read and comprehend.


Any character is allowed in a comment, including those that cannot be used in elements and tags, but to maintain compatibility with SGML, the combination of two hyphens together (--) cannot be used within the text of a comment.

Comments should be used to annotate the XML, but be aware that the parser might remove them, so they may not always be accessible to a receiving application.
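
As an illustration of a parser discarding comments, the Python sketch below uses the standard xml.etree.ElementTree module, whose default parser drops them. Keeping comments requires the insert_comments option of TreeBuilder, which assumes Python 3.8 or later. The helper parses and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

text = '<item quantity="1 lb">Cream cheese <!-- this is a comment --></item>'

# Default parser: the comment is silently discarded.
item = ET.fromstring(text)
print(len(item))            # number of child nodes kept in the tree

# Python 3.8+: ask the tree builder to keep comments as child nodes.
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
item_with_comment = ET.fromstring(text, parser=keeping_parser)
print(len(item_with_comment))


def parses(xml_text):
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A comment inside a tag, or "--" inside a comment, is not well-formed.
print(parses('<item <!-- this is a comment --> quantity="1 lb">x</item>'))
print(parses("<item><!-- two -- hyphens --></item>"))
```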
