Structure and Syntax of XML

In this section, you will explore the syntax of XML and understand what is meant by a well-formed document.

NOTE

You will often encounter the terms “well-formed” and “valid” applied to XML documents. These are not the same. A well-formed document is structurally and syntactically correct: the XML conforms to the XML language definition (every start tag has a correctly nested corresponding end tag, all attribute values are quoted, only valid characters have been used, and so on). A valid document is well-formed and also semantically correct: the XML conforms to some external definition stored in an XML Schema or Document Type Definition. A document can be well-formed without being valid, but a valid document is always well-formed.


The best way to become familiar with the syntax of XML is to write an XML document. To check your XML, you will need access to an XML-aware browser or another XML validator. The XML-aware browser or XML validator will allow you to ensure that the XML is well-formed. If the XML references an XML Schema or Document Type Definition (more on these later) the validator can also check that the XML is valid.

An XML browser includes an XML parser. To get the browser to check the syntax and structure of your XML document, simply use the browser to open the XML file. Well-formed XML will be displayed in a structured way (with indentation). If the XML is not well-formed, an appropriate error message will be given.
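
The same check can also be made programmatically. The following is a minimal sketch, assuming the standard JAXP SAX parser that ships with the JDK; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration. It only tests well-formedness and does not attempt any validation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the book's Agency case study.
final class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML (no DTD or schema validation). */
    static boolean isWellFormed(String xml) {
        try {
            // A non-validating SAX parser checks only syntax and structure.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal (well-formedness) errors as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be created or input could not be read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>Now you have seen the web page</p>")); // true
        System.out.println(isWellFormed("<p>Now you have seen the web page"));     // false: missing end tag
    }
}
```

A real application would usually report the parser's error message rather than reduce it to a boolean.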

TIP

An easy way to check XML is to use an XML-aware browser. The latest versions of most popular browsers are now XML-aware. You can download validating XML parsers from Sun Microsystems at www.sun.com/software/xml/developers/multischema/ and the Microsoft Developers Network at msdn.microsoft.com/downloads/samples/internet/xml/xml_validator/. There are numerous other XML validators and XML editors available on the Internet.


HTML and XML

At first glance, XML looks very similar to HTML. An XML document consists of elements that have a start and end tag, just like HTML. In fact, Listing 16.1 is both well-formed HTML and XML.

Listing 16.1. Example XML and HTML
<html>
 <head><title>Web Page</title></head>
 <body>
  <h1>Teach Yourself J2EE in 21 Days</h1>
  <p>Now you have seen the web page – buy the book</p>
 </body>
</html>

An XML document is only well-formed if there are no syntax errors. If you are familiar with HTML, you will be aware that many browsers are lenient with poorly formed HTML documents. Missing end tags and even missing sections will often be ignored and therefore unnoticed until the page is displayed in a more rigorous browser, and fails to display correctly.

XML differs from HTML in that a missing end tag will always cause an error.

We will now look at XML syntax so you can understand what is going on.

Structure of an XML Document

The outermost element in an XML document is called the root element. Each XML document must have one and only one root element, often called the top level element. If there is more than one root element, an error will be generated.

The root element can be preceded by a prolog that contains XML declarations. Comments can be inserted at almost any point in an XML document (the exceptions are covered in the “Comments” section). The prolog is optional, but it is good practice to include a prolog with all XML documents giving the XML version being used (all full XML listings in this chapter will include a prolog). A minimal XML document must contain at least one element.

Declarations

There are two types of XML declaration. XML documents may, and should, begin with an XML declaration, which specifies the version of XML being used. The following is an example of an XML declaration:

<?xml version ="1.0"?>

The version attribute of the XML declaration tells the parser that this document conforms to XML version 1.0 (W3C Recommendation, 10 February 1998). As with all declarations, the XML declaration belongs in the prolog; if present, it must be the very first thing in the document, with nothing (not even whitespace or a comment) before it.
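
A parser makes the declared version available to the application. The sketch below assumes the JDK's built-in DOM parser; XmlDeclarationInfo and version are hypothetical names used only for this example. Note that the DOM reports "1.0" as the default when a document has no XML declaration at all.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
final class XmlDeclarationInfo {

    /** Parses the text and returns the XML version reported by the DOM. */
    static String version(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Document.getXmlVersion() is part of DOM Level 3.
            return doc.getXmlVersion();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Whitespace before the equals sign is permitted in the declaration.
        System.out.println(version("<?xml version =\"1.0\"?><jobSummary/>"));
    }
}
```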

The other type of declaration is called an XML document type declaration and is used to validate the XML. This will be discussed in more detail in the section titled “Creating Valid XML” later in this chapter.

Elements

An element must have a start tag and an end tag, each enclosed in < and > characters. The end tag is the same as the start tag except that the name is preceded by a / character. Tag names are case sensitive, and the names used in the start and end tags must match exactly. For example, the tags <Start>...</start> do not make up an element, whereas <Start>...</Start> do, because both tags use the same letter case.

An element name can only contain letters, digits, underscores _, colons :, periods ., and hyphens -. An element name must begin with a letter or underscore.

An element may also optionally have attributes and a body. All attribute values must be quoted; both single and double quotes are permitted. All the elements in Listing 16.2 are well-formed XML elements.

Listing 16.2. Well-formed XML Elements
<start>this is the beginning</start>
<date day="16th" Month="February">My Birthday</date>
<today yesterday="15th" Month="February"></today>
<box color="red"/>
<head></head>
<end/>

Table 16.1 describes each of these elements.

Table 16.1. XML Elements
Element Type                          XML Element Includes
<tag>text</tag>                       A start tag, body, and end tag
<tag attribute="text"> text </tag>    An attribute and a body
<tag attribute="text"> </tag>         An attribute but no body
<tag attribute="text"/>               Short form of attribute but no body
<tag></tag>                           A start tag and end tag but no body
<tag/>                                Shorthand for the previous tag

Although the body of an element may contain nearly all the printable Unicode characters, certain characters have a special meaning to the parser. In particular, < and & must never appear literally in an element body or attribute value, because the parser treats them as the start of a tag or of an entity reference. To avoid confusion (to human readers as well as parsers), the characters in Table 16.2 should not be used literally in element bodies or attribute values. If these characters are required, the appropriate symbolic string (a predefined entity reference) in Table 16.2 can be used to represent them.

Table 16.2. Special XML Characters
Character    Name                   Symbolic Form
&            Ampersand              &amp;
<            Open angle bracket     &lt;
>            Close angle bracket    &gt;
'            Single quotes          &apos;
"            Double quotes          &quot;

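When XML is generated by a program, the substitutions in Table 16.2 are usually applied by a small helper. The following is a minimal sketch of such a helper; XmlEscape and escape are hypothetical names, and the method simply maps each of the five special characters to its symbolic form.

```java
// Hypothetical helper class for illustration only.
final class XmlEscape {

    /** Replaces the five special characters from Table 16.2 with their symbolic forms. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <description>Fish &amp; Chips &lt;fresh&gt;</description>
        System.out.println("<description>" + escape("Fish & Chips <fresh>") + "</description>");
    }
}
```

Escaping all five characters everywhere is more than the parser strictly requires, but it is safe in both element bodies and attribute values.
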
The elements in an XML document have a tree-like hierarchy, with elements containing other elements and data. Elements must nest—that is, an end tag must close the textually preceding start tag. This means that

<b><i>bold and italic</i></b>

is correct, while

<b><i>bold and italic</b></i>

is not.
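
The nesting rule can be pictured as a stack: each start tag is pushed, and each end tag must match the name on top of the stack. The sketch below illustrates only that idea; it is a deliberately simplified checker (the regular expression ignores comments, declarations, and attribute values containing angle brackets) and is not how a real XML parser is implemented. NestingCheck and isProperlyNested are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified checker for illustration only; use a real parser in practice.
final class NestingCheck {

    // Matches <name ...>, </name> and <name .../>. Group 1 is "/" for an end tag,
    // group 2 is the tag name, group 3 is "/" for an empty-element tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)[^<>]*?(/?)>");

    /** Returns true if every end tag closes the most recently opened start tag. */
    static boolean isProperlyNested(String xml) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            boolean endTag = !m.group(1).isEmpty();
            boolean emptyElement = !m.group(3).isEmpty();
            String name = m.group(2);
            if (endTag) {
                // Names are compared case sensitively, as XML requires.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!emptyElement) {
                open.push(name);
            }
        }
        return open.isEmpty(); // anything left on the stack was never closed
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested("<b><i>bold and italic</i></b>")); // true
        System.out.println(isProperlyNested("<b><i>bold and italic</b></i>")); // false
    }
}
```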

Well-formed XML Documents

An XML document is said to be well-formed if it has exactly one root element, the root and every sub-element have delimiting start and end tags that are properly nested within each other, and all attribute values are quoted.

The following is a simple XML document with an XML declaration followed by a number of elements. The structure represents a list of jobs that could be used in the Agency case study example. In Listing 16.3, <jobSummary> is the root element, and it contains a number of <job> elements.

Listing 16.3. Example jobSummary XML
<?xml version ="1.0"?>
<jobSummary>
 <job>
  <customer>winston</customer>
  <reference>Cigar Trimmer</reference>
  <location>London</location>
  <description>Must like to talk and smoke</description>
  <skill>Cigar maker</skill>
  <skill>Critic</skill>
 </job>
 <job>
  <customer>george</customer>
  <reference>Tree pruner</reference>
  <location>Washington</location>
  <description>Must be honest</description>
  <skill>Tree surgeon</skill>
 </job>
</jobSummary>
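
Once a document like this has been parsed, an application can walk the element tree to extract the data. The following is a minimal sketch, assuming the JDK's built-in DOM parser; JobSummaryReader and textOf are hypothetical names, and the sample document in main is an abbreviated form of the jobSummary structure.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
final class JobSummaryReader {

    /** Returns the text of every element with the given tag name, in document order. */
    static List<String> textOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version =\"1.0\"?>"
                + "<jobSummary>"
                + "<job><customer>winston</customer><skill>Cigar maker</skill><skill>Critic</skill></job>"
                + "<job><customer>george</customer><skill>Tree surgeon</skill></job>"
                + "</jobSummary>";
        System.out.println(textOf(xml, "customer")); // [winston, george]
        System.out.println(textOf(xml, "skill"));    // [Cigar maker, Critic, Tree surgeon]
    }
}
```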

Attributes

Attributes are name/value pairs that are associated with elements. There can be any number of attributes, and an element's attributes all appear inside the start tag. The names of attributes are case sensitive and are limited to the following characters: letters, digits, underscores _, periods ., and hyphens -. An attribute name must begin with a letter or underscore.

The value of an attribute is a text string delimited by quotes; either single or double quotes may be used. Unlike HTML, all attribute values in an XML document must be enclosed in quotes. Listing 16.4 shows the jobSummary XML document rewritten to use attributes to hold some of the data.

Listing 16.4. JobSummary.xml XML with Attributes
<?xml version ="1.0"?>
<jobSummary>
 <job customer="winston" reference="Cigar Trimmer">
  <location>London</location>
  <description>Must like to talk and smoke</description>
  <skill>Cigar maker</skill>
  <skill>Critic</skill>
 </job>
 <job customer="george" reference="Tree pruner">
  <location>Washington</location>
  <description>Must be honest</description>
  <skill>Tree surgeon</skill>
 </job>
</jobSummary>
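
Attribute values are read through the element that carries them. The sketch below assumes the JDK's built-in DOM parser; JobAttributes and customers are hypothetical names. The sample in main mixes single and double quotes to show that the parser treats both delimiters the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
final class JobAttributes {

    /** Returns the customer attribute of every job element, in document order. */
    static List<String> customers(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList jobs = doc.getElementsByTagName("job");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < jobs.getLength(); i++) {
                // getElementsByTagName returns only element nodes, so the cast is safe.
                Element job = (Element) jobs.item(i);
                // getAttribute returns an empty string if the attribute is absent.
                result.add(job.getAttribute("customer"));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<jobSummary>"
                + "<job customer=\"winston\" reference='Cigar Trimmer'/>"
                + "<job customer='george' reference=\"Tree pruner\"/>"
                + "</jobSummary>";
        System.out.println(customers(xml)); // [winston, george]
    }
}
```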

The choice between nested elements and attributes is a contentious area. There are many schools of thought, and it usually ends up being a matter of personal taste or corporate standards. Prior to the introduction of XML Schemas (see the section “XML Schemas”), there were advantages to using attributes when the values were constrained in some way, such as values that are numbers or follow specific patterns. With XML Schemas, element values can be constrained in the same way as attribute values.

Comments

XML comments are introduced by <!-- and ended with -->, for example

<!-- this is a comment -->

Comments can appear anywhere in a document except within the tags, for example,

<item quantity="1lb">Cream cheese <!-- this is a comment --></item>

is acceptable, whereas the following is not

<item <!-- this is a comment --> quantity="1lb">Cream cheese </item>

NOTE

As with commenting code, the comments you add to your XML should be factually correct, useful, and to the point. They should be used to make the XML document easier to read and comprehend.


Any character is allowed in a comment, including those that cannot be used in elements and tags, but to maintain compatibility with SGML, the combination of two hyphens together (--) cannot be used within the text of a comment.

Comments should be used to annotate the XML, but you should be aware that the parser might remove the comments, so they may not always be accessible to a receiving application.
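
Whether comments reach the application can depend on how the parser is configured. As an illustration, the following sketch uses the JAXP DocumentBuilderFactory setting setIgnoringComments to parse the same document with and without comments; CommentHandling and countComments are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
final class CommentHandling {

    /** Parses the text and counts the comment nodes that survive in the DOM tree. */
    static int countComments(String xml, boolean ignoreComments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // When true, the parser drops comments instead of adding them to the tree.
            factory.setIgnoringComments(ignoreComments);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return count(doc);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Recursively counts comment nodes below (and including) the given node.
    private static int count(Node node) {
        int total = node.getNodeType() == Node.COMMENT_NODE ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += count(children.item(i));
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<item quantity=\"1lb\">Cream cheese <!-- this is a comment --></item>";
        System.out.println(countComments(xml, false)); // 1
        System.out.println(countComments(xml, true));  // 0
    }
}
```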

Namespaces

When designers define an XML structure for some data, they are free to choose tag names that are appropriate for the data. Consequently, there is nothing to stop two individuals from using the same tag name for different purposes or in different ways. Consider the job agency that deals with two contract companies, each of which uses a different form of job description (such as those in Listings 16.3 and 16.4). How can an application differentiate between these different types of job descriptions?

The answer is to use namespaces. XML provides namespaces that can be used to impose a hierarchical structure on XML tag names in the same way that Java packages provide a naming hierarchy for Java classes. You can define a unique namespace with which you can qualify your tags to avoid them being confused with those from other XML authors.

An attribute called xmlns (XML Namespace) is added to an element tag in a document and is used to define the namespace. For example, the second line in Listing 16.5 indicates that the tags for the whole of this document are scoped within the agency namespace.

Listing 16.5. XML Document with Namespace
<?xml version ="1.0"?>
<jobSummary xmlns="agency">
 <job customer="winston" reference="Cigar Trimmer">
  <location>London</location>
  <description>Must like to talk and smoke</description>
  <skill>Cigar maker</skill>
  <skill>Critic</skill>
 </job>
 <job customer="george" reference="Tree pruner">
  <location>Washington</location>
  <description>Must be honest</description>
  <skill>Tree surgeon</skill>
 </job>
</jobSummary>

The xmlns attribute can be added to any element in the document to enable scoping of elements, and multiple namespaces can be defined in the same document using prefixes. For example, Listing 16.6 declares two namespaces with the prefixes ad and be. All the job tags and their children have been prefixed with the appropriate namespace prefix, so two different forms of the job tag (one with attributes and one without) can coexist in the same file.

Listing 16.6. XML Document with Namespaces
<?xml version ="1.0"?>
<jobSummary xmlns:ad="ADAgency" xmlns:be="BEAgency">
 <ad:job customer="winston" reference="Cigar Trimmer">
  <ad:location>London</ad:location>
  <ad:description>Must like to talk and smoke</ad:description>
  <ad:skill>Cigar maker</ad:skill>
  <ad:skill>Critic</ad:skill>
 </ad:job>
 <be:job>
  <be:customer>george</be:customer>
  <be:reference>Tree pruner</be:reference>
  <be:location>Washington</be:location>
  <be:description>Must be honest</be:description>
  <be:skill>Tree surgeon</be:skill>
 </be:job>
</jobSummary>
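
An application can use the namespace to pick out one form of the job tag and ignore the other. The following is a minimal sketch, assuming the JDK's built-in DOM parser with namespace awareness switched on (it is off by default in JAXP); NamespaceLookup and countElements are hypothetical names, and the sample in main is an abbreviated document using the same two namespace names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
final class NamespaceLookup {

    /** Counts the elements with the given local name in the given namespace. */
    static int countElements(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, prefixes are treated as part of the tag name.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<jobSummary xmlns:ad=\"ADAgency\" xmlns:be=\"BEAgency\">"
                + "<ad:job customer=\"winston\" reference=\"Cigar Trimmer\"/>"
                + "<be:job><be:customer>george</be:customer></be:job>"
                + "</jobSummary>";
        System.out.println(countElements(xml, "ADAgency", "job")); // 1
        System.out.println(countElements(xml, "BEAgency", "job")); // 1
        System.out.println(countElements(xml, "*", "job"));        // 2 ("*" matches any namespace)
    }
}
```

The lookup uses the namespace name, not the prefix, so a document that chose different prefixes for the same namespaces would give the same result.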
