Constructing XML Documents

Documents are the heart of XML. Any amount of usable XML is presented as a document, often stored in a file. One of the very first things you must understand in order to use XML is how to create a well-formed document. In this section, we examine the syntactic components of a document, starting with the individual characters and looking at how they are viewed when building larger syntactic constructs. Then we look at the constructs defined for all documents by the XML recommendation.

Characters in XML Documents

The XML Specification defines a character as “an atomic unit of text as specified by ISO/IEC 10646.” (Remember, ISO/IEC 10646 is more commonly referred to as Unicode.) Of course, this explanation is exactly what you should say at a party if someone asks. One of the goals of both standardization and XML is to make documents easily understandable by platforms around the globe. As such, simple things like ASCII characters can become quite complex.

Regardless, the specification states that the legal characters are tab, carriage return, and line feed, together with the legal characters of Unicode and ISO/IEC 10646. If you were to write an XML parser, the topic of characters and standardization would be of incredible importance to you. For the rest of us, it’s usually enough to choose an XML parser that gets it right.

You can declare the character encoding used in an XML document using the optional XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

For an external entity that is not a document itself, a variation of the XML declaration, called an encoding declaration, is used:

<?xml encoding="UTF-8"?>

More information on the XML declaration is provided in “The Document Prolog” later in this chapter. For now, let’s look at some of the most widely used character sets and encodings. (A character set that can be mapped into Unicode can be considered an encoding of Unicode, even if it does not directly support everything defined in Unicode.)

The ASCII character set

The American Standard Code for Information Interchange (ASCII) is a 7-bit text format (meaning that it takes a sequence of seven 1’s and 0’s to form a character). ASCII is understood by virtually every computer in use. Unicode extends ASCII, so the first 128 characters of Unicode coincide with the first 128 characters of ASCII.

The ISO-8859-1 character set

The character set ISO-8859-1 is also known as Latin-1. The ISO-8859-1 set is very widely used as it contains support for most (but not all) Western European languages. The first 256 characters of Unicode are identical to ISO-8859-1 for compatibility reasons. The first 128 characters of ISO-8859-1 are identical to ASCII. The second 128 are a combination of control characters, special characters, and accented letters. ISO-8859-1 was inspired by DEC Multinational Character Set, but there are a few differences. There are also various ISO-8859-X sets with support for additional languages and characters.

UTF-8 Encoding

UCS Transformation Format, 8-bit (UTF-8), is documented in IETF RFC 2279 by F. Yergeau (since superseded by RFC 3629). UTF-8 is the most popular complete encoding of Unicode.

UTF-8 is a superset of ASCII: the first 128 Unicode characters are encoded as the same single bytes that ASCII uses. Since Unicode can supposedly support over 2 billion characters (way beyond 128), getting it to fit in a stream of discrete 8-bit bytes requires some encoding. UTF-8 solves this problem by representing each Unicode character with a unique sequence of bytes. In a UTF-8 stream, ASCII characters occupy only one byte, whereas all other characters are represented by two or more bytes in the stream. Your XML declaration using UTF-8 appears as follows:

<?xml version="1.0" encoding="UTF-8"?>
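To make the variable-length behavior concrete, here is a small Python sketch that encodes a handful of characters and counts the bytes each one needs. The sample characters are arbitrary choices for illustration, not taken from the specification.

```python
# Encode a few sample characters and record how many bytes UTF-8 needs for each.
# The samples (A, e-acute, euro sign, an emoji) are illustrative choices only.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_lengths.items():
    # Print the code point rather than the character so the output is plain ASCII.
    print(f"U+{ord(ch):04X} -> {count} byte(s)")
```

Only the ASCII character fits in a single byte; the others take two, three, and four bytes respectively.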

The most detailed information for dealing with UTF-8 encoding comes from the RFC.

Text, Character Data, and Markup

The specification states “text consists of intermingled character data and markup.” The main point here is that every character within an XML document is either character data (the actual information content we’re most interested in, such as an address or item quantity), or it is markup (containing all of the special characters needed to create start tags, end tags, entities, comments, CDATA delimiters, DTDs, processing instructions, and declarations). All the characters together constitute text.

Character data in the content of elements is “any string of characters that does not contain the start-delimiter of any markup.” Clearly, it is important to know the difference between the two, since it is markup that allows our programs to interpret the character data correctly.

All markup begins with one of two characters: the less-than sign (<) and the ampersand (&). All markup that begins with the less-than sign ends with the greater-than sign (>), and markup that begins with an ampersand ends with a semi-colon (;). These are the only special characters you need to be aware of most of the time. In some situations, the single-quote ( ' ) and double-quote (") characters need special attention. This does not mean that your documents and data cannot include these characters, only that they require some special encoding in the XML text. Any Unicode character can be part of the character data.

One result is that you’re unable to use literal special characters such as ampersands (&) or angle brackets (<, >) freely within your text. For example, the following would confound an XML processor:

<question>Is 5 < 7 ≤ 9?</question>

The text of the question element contains a character that cannot appear literally in character data. The < is expected to start a new markup component, so the following space is interpreted as a syntax error. The less-than sign is used to start a variety of markup constructs, the most common of which are the element start and end tags. The ampersand is used to mark entity references.

In order to use these special characters within your XML document, you’ll need to encode them using entity or character references. To turn the example into proper XML, we need to use this:

<question>Is 5 &lt; 7 &#x2264; 9?</question>

Entity references are discussed later in this chapter, although many of you who have worked with HTML will find them familiar, as they include &amp; (&), &lt; (<), &gt; (>), &quot; ("), and &apos; ('). XML allows you to define your own entities as well, and they can contain more than a single character, but those five are defined by the XML specification and do not need to be defined specially for your documents. Character references are slightly different in that they specify individual Unicode characters without attempting to use mnemonic identifiers for them. A character reference you might have seen used in HTML would be something like &#174; (®, the registered trademark symbol). In XML, the numeric portion of the reference may be given using hexadecimal digits as well if the letter x is inserted between the sharp sign and the first digit. The reference &#xAE; also refers to the registered trademark symbol. Character references cannot be defined by authors, and they always refer to Unicode characters by the ordinal value assigned to them in the Unicode specification.
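As a rough illustration of escaping in practice, the following Python sketch uses the standard library’s xml.sax.saxutils.escape and xml.etree.ElementTree. The variable names are made up for this example, and the question text simply mirrors the one used above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Is 5 < 7 \u2264 9?"

# escape() replaces &, <, and > with the corresponding entity references.
escaped = escape(raw_text)
parsed_text = ET.fromstring(f"<question>{escaped}</question>").text

# The unescaped form is not well-formed, so the parser rejects it.
try:
    ET.fromstring(f"<question>{raw_text}</question>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# Decimal and hexadecimal character references name the same character.
reference_text = ET.fromstring("<r>&#174;&#xAE;</r>").text
```

After parsing, parsed_text is identical to raw_text: the references exist only in the serialized document, not in the data the application sees.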

Names

The XML specification defines several small lexical details, but perhaps one of the most important is the name. Names are tokens composed of letters, digits, underscores, hyphens, periods, and colons; the first character of a name must be a letter, an underscore, or a colon, so it cannot be a digit, hyphen, or period. Names are used for naming anything that needs a name in XML, including element types, attributes, and entities.

Some names cannot be used in day-to-day XML markup. First, names beginning with the string xml (in any mixture of upper- and lowercase) are “reserved for standardization in [the XML specification] or future versions of this specification.” Second, when naming your elements, you should avoid the colon (:), as it is the basis for XML namespaces (a method of prefixing element names with tokens to give them domain context). While the XML 1.0 specification allows colons in element and attribute names, the more recent Namespaces specification assigns them a particular syntactic significance that constrains their use. In other words, if you’re defining a whole class of elements related specifically to books, such as bookTitle or bookAuthor, it’s better to use capitalization, hyphenation, or underscores to separate the words (such as book_title, book-title, or bookTitle) as opposed to using the colon, such as book:Title. Using an expression like book:Title leads XML processors to believe that you are referring to a Title element within the namespace bound to the prefix book. Of course, it may be that Namespaces are appropriate for your application, in which case you should take the time to read the Namespaces specification very carefully and define any that are needed.
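The naming rules can be approximated with a short Python check. This is only a sketch: it deliberately handles ASCII characters alone, whereas the specification also admits a large set of non-ASCII letters, and the helper names looks_like_name and is_reserved are hypothetical.

```python
import re

# ASCII-only approximation of an XML name: a letter, underscore, or colon,
# followed by letters, digits, underscores, periods, colons, or hyphens.
# The real production in the specification also allows many non-ASCII characters.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")


def looks_like_name(candidate):
    """Return True if candidate matches the simplified ASCII name pattern."""
    return ASCII_NAME.fullmatch(candidate) is not None


def is_reserved(candidate):
    """Return True if candidate starts with 'xml' in any mixture of case."""
    return candidate.lower().startswith("xml")


for candidate in ["book-title", "book_title", "2ndEdition", "XmlThing"]:
    print(candidate, looks_like_name(candidate), is_reserved(candidate))
```

In real code, it is usually simpler to let the XML parser reject bad names than to validate them by hand.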

Whitespace in Character Data

When working with XML-based markup languages, it can be difficult to know how to treat whitespace. For many applications, whitespace can be handled as just more normal character data, while this is not sufficient for others. The problem most often manifests itself when presentation to the user is being controlled by the application. While the XML specification does not attempt to solve the problem, it does provide a way to include a hint for processing tools and applications that the whitespace in a particular element should be preserved as given, rather than treated as malleable space.

The easiest way to visualize the problem is to consider the way program source code is most commonly presented in HTML. Most HTML authors wrap source code in a pre element:

<pre>
def hello(  ):
    print "Hello, world!"
</pre>

This is certainly the easiest way to present source code in HTML. Now consider what happens if, instead of using a pre element, we use a paragraph, or p, element:

<p>
def hello(  ):
    print "Hello, world!"
</p>

This creates a very different effect in most web browsers, typically causing the entire program text to be shown on a single line with only a single space separating each word, even though the example includes multiple lines and multiple adjacent spaces.

The solution looks simple, at least for HTML. Simply use a pre element when we want to preserve whitespace. This obvious solution unfortunately has an equally obvious problem—it only works for HTML, not for arbitrary XML-based markup languages. A solution is needed that also works for a non-HTML document like this:

<Poem>
  Ode to a node,
  Nested beneath its tree,
  Snug as a bug in its XML rug
  Dreaming of the W3C.
</Poem>

How is an XML tool to know that the line breaks and other presentation for a poem are significant?

The XML specification defines an attribute called xml:space that you can attach to an element to communicate to the application that whitespace should be preserved. It is the responsibility of the client application to act on this information and indeed preserve whitespace when handling or formatting the data. A typical compliant XML parser passes the whitespace from the document through to the application regardless of whether the xml:space attribute has been seen (in either the document or the schema). An application can use the attribute to determine just what manipulations it can perform on the document content.

The value of the xml:space attribute can be either default or preserve. If the value is default, the application is allowed to treat the whitespace in whatever way it normally would; the XML specification imposes no limitations on how the whitespace is affected in this case. If, however, the value is preserve, the application is expected to avoid interfering with the whitespace in the element to which the attribute applies, as well as all child elements, until it encounters a child that specifies a value for xml:space. At that point, the child’s value for xml:space takes precedence for itself and its descendants.

The xml:space attribute can be used in a couple of different ways. The first is to simply include it in the document instance, which is sufficient for well-formed XML. The first line of our poem becomes:

<Poem xml:space="preserve">

While this seems reasonable for small quantities of XML text, it proves unworkable for large volumes of documents that are edited by humans. Think about what HTML would be like if we had to always include a special attribute to get the effect of the pre element! For this reason, the xml:space attribute is most often used by including it in the document schema. In a DTD, we would write something like this:

<!ATTLIST Poem xml:space (default|preserve) 'preserve'>

Attribute list declarations will be discussed in more detail in Section 2.6.3 later in this chapter.

From a practical point of view, most applications that parse XML look at the names of the elements to determine what to do with the character data contained therein. For example, while parsing the text of a book formatted in XML, you may come across a code element that tells you to preserve the whitespace within that section. If you look carefully, however, often the document type specifies that xml:space has a default value of preserve for those elements.
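Here is a minimal Python sketch of an application honoring xml:space, using xml.etree.ElementTree. The display_text helper is hypothetical, and for brevity it inspects only the element itself rather than inheriting the setting from ancestors as a complete implementation would.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the fixed namespace bound to the xml prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

poem_source = """<Poem xml:space="preserve">
  Ode to a node,
  Nested beneath its tree,
</Poem>"""
poem = ET.fromstring(poem_source)


def display_text(element):
    """Return the element's text, collapsing whitespace unless xml:space says preserve.

    Simplification: only the element's own attribute is checked, not its ancestors'.
    """
    text = element.text or ""
    if element.get(f"{{{XML_NS}}}space") == "preserve":
        return text
    return " ".join(text.split())


print(repr(display_text(poem)))
```

Note that the parser hands over the whitespace unchanged in both cases; the decision to collapse or preserve it is made entirely by the application code.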

End-of-Line Handling

The specification is straightforward where end-of-line handling is concerned. An XML parser must pass characters to applications with normalized line endings. That is, the two-character sequence 0x0D 0x0A, or a standalone 0x0D character not followed by 0x0A, is converted to a single 0x0A character. For the less hexadecimal among us, it means that typical formatting codes such as \r\n and \r are converted to \n. And for those of you who have never used those weird backslash characters, it means that text coming from platforms that commonly use carriage-returns plus linefeed characters to terminate lines (such as Windows) is converted to use only linefeed characters.
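The normalization can be observed with a short Python sketch that feeds mixed line endings to xml.etree.ElementTree. The element name lines is made up for the example, and the input is given as bytes so that nothing outside the parser touches the line endings.

```python
import xml.etree.ElementTree as ET

# One document containing all three line-ending styles: \r\n, a lone \r, and \n.
raw = b"<lines>one\r\ntwo\rthree\nfour</lines>"

# The parser delivers the character data with every line ending reduced to \n.
normalized = ET.fromstring(raw).text
print(repr(normalized))
```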

Language Identification

An attribute named xml:lang is provided by the specification and can be placed inside documents to indicate the language used in the content. Again, this attribute must be declared in valid documents, much like xml:space. The values that can be used within this attribute are defined in IETF RFC 1766, or in a later version. Most language codes have two letters, such as en for English, but dialects may be specified using a hyphen and an additional two-letter country code; United States English can be specified as en-US, while the Queen’s English can be specified as en-GB.

The Document Prolog

An XML document contains a prolog, which includes everything that precedes the single element that is the document content. The prolog consists of an optional declaration called the XML declaration, followed by an optional Document Type Declaration; comments and processing instructions, in any number (including zero!), may appear after the XML declaration, both before and after the Document Type Declaration. So the prolog may be completely empty, but often contains the XML declaration as a matter of good form. The Document Type Declaration is required if the document is intended to conform to a DTD.

The XML declaration looks much like a processing instruction, but is slightly different because of a special purpose it serves. Since XML requires that all documents are Unicode, but does not constrain the encoding of the Unicode characters to bytes in the data stream that contains the document, there must be a way to determine that encoding. Some encodings can be recognized by the leading bytes of the data stream. A set of specific rules for determining the encoding from the leading bytes of the data stream is given as part of the XML recommendation. For many encodings, however, that is not possible. The XML specification states that in those cases where the encoding is not supplied by some external source (such as the headers of an HTTP response), the document must be encoded in UTF-8 or UTF-16, or include an XML declaration that specifies the encoding. The declaration always includes the version of the XML specification with which the document conforms (only XML 1.0 has been defined at this time). A typical XML declaration would look like:

<?xml version="1.0" encoding="iso-8859-1"?>

This declares that the document is encoded in the character set ISO 8859-1, more commonly known as Latin-1. It’s entirely legal to omit the encoding from the declaration as well, so the minimal declaration looks like this:

<?xml version="1.0"?>

I’m sure this already appears on coffee mugs.
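The effect of the encoding declaration can be sketched in Python with xml.etree.ElementTree. The name element and its content are invented for the example; the point is that the same Latin-1 bytes parse with the declaration and fail without it, because the parser then assumes UTF-8.

```python
import xml.etree.ElementTree as ET

# Build a document whose bytes are ISO-8859-1 and whose declaration says so.
source = '<?xml version="1.0" encoding="iso-8859-1"?><name>Jos\u00e9</name>'
latin1_bytes = source.encode("iso-8859-1")

# With the declaration, the parser decodes the accented letter correctly.
name = ET.fromstring(latin1_bytes).text

# Strip the declaration: the same bytes are no longer valid under the UTF-8 default.
undeclared = latin1_bytes.split(b"?>", 1)[1]
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```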

After the XML declaration, a Document Type Declaration may appear. Note that this is different from the Document Type Definition, although the first two words and obvious abbreviations are the same. To avoid confusion, the acronym “DTD” is never used to refer to this; it is usually called the “DOCTYPE declaration.” If given, this declaration specifies the name of the document element, and may specify both internal and external components of the DTD. Let’s look at the simplest form of this declaration:

<!DOCTYPE book>

This tells us that the document element is of the type named book, but nothing else; this is not very useful by itself. There are actually two additional components to this declaration, each of which is optional, but one or both must be provided for the declaration to be particularly useful. Let’s look at an example that contains both of these components:

<!DOCTYPE book SYSTEM "http://xml.example.com/dtds/book.dtd" [
  <!ENTITY myCompany "Super Mega Ultra Corporation">
 ]>

Here, we include a specification for an external subset of the DTD (the SYSTEM and the quoted string), and an internal subset enclosed in brackets.

If the Document Type Declaration is given, the name of the document type must match the name of the root element. If you declare your document type as <!DOCTYPE Tool [...]>, then your root element must be Tool. Furthermore, all the specific relationships in the DTD concerning nesting, character data, and attributes must be enforced against the document if it is to pass the test for validity.

If you decide to use both the internal and external subsets, the internal subset overrules the external. That is, the rules contained within the DTD inside your XML document prevail over rules for the same construct in an external DTD subset.
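As a hedged illustration of an internal subset at work, this Python sketch parses a document that declares the myCompany entity internally and then refers to it. The external SYSTEM identifier is left out on purpose so the example does not depend on fetching a DTD, and the sentence inside the book element is invented.

```python
import xml.etree.ElementTree as ET

# A document with only an internal DTD subset that declares one general entity.
document = """<!DOCTYPE book [
  <!ENTITY myCompany "Super Mega Ultra Corporation">
]>
<book>Published by &myCompany;.</book>"""

book = ET.fromstring(document)

# The entity reference is replaced by its declared text before the application sees it.
print(book.tag, "->", book.text)
```

ElementTree is a non-validating parser, so it expands the entity but does not check the document against the DTD, including the requirement that the DOCTYPE name match the root element.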

Start, End, and Empty Element Tags

An element’s name communicates its type. The attributes contained within a start tag are not recognized in any particular order. The specification sees no difference between <name first="Chris" last="Jones"> and <name last = "Jones" first = "Chris">.

There are several constraints to keep in mind when working with tags. First, there is a constraint on attributes: they must be unique. No attribute name can appear twice in the same start tag. Next, if the document is to be considered valid, the attributes must have been declared, and the values must be of the types specified. Additionally, attribute values cannot be, nor can they contain, external entity references. Finally, an attribute value in a start tag, or its entity replacement text, must not contain the character <. As for end tags, the specification requires only that they exactly match the start tag’s name. Attributes are not allowed in end tags.
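These well-formedness constraints can be checked quickly in Python with xml.etree.ElementTree. The is_well_formed helper is a hypothetical convenience for this sketch, and the test only covers well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Attribute order and whitespace around = make no difference to the parsed result.
first = ET.fromstring('<name first="Chris" last="Jones"/>')
second = ET.fromstring('<name last = "Jones" first = "Chris"/>')
same_attributes = first.attrib == second.attrib

# A repeated attribute name and a literal < in a value are both rejected.
duplicate_ok = is_well_formed('<name first="Chris" first="Christopher"/>')
less_than_ok = is_well_formed('<name first="a<b"/>')

# An end tag must match the start tag's name exactly, including case.
mismatch_ok = is_well_formed("<name></Name>")
```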

Elements can contain just about any type of character data, as long as it is not confused with surrounding XML markup itself. This has been addressed earlier in this chapter in Section 2.5.2.

Empty elements are elements without content. They may contain attributes as shown in this example:

<names>
  <name first="Chris" last="Jones"/>
  <name/>
</names>

This XML represents two well-formed name elements. Both are empty, but the first expresses two attributes as well.

Quotes around attribute values

The specification defines literals as “any quoted string not containing the quotation mark used as a delimiter for that string.” Functionally, literals are used to indicate the content for an internal entity and the values of attributes. Typically, attribute and value combinations look like this:

<account refnum="23908403"/>

In this example, refnum is an attribute of the account element and has a value of 23908403. Either single or double quotation marks may be used, with the restriction that whichever is used to quote the value may not be directly used in the value, though it may be included using entity references or numeric character references.

As an example of an attribute value that contains both types of quotation marks, let’s use this phrase:

The cat said “The dog yelled ‘Help!,’ then I pounced.”

Encoded as an attribute, we end up with this:

<talltale text=
  'The cat said "The dog yelled &apos;Help!,&apos; then I pounced."'
  />
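When attribute values are generated by a program, the quoting can be delegated to a library. The sketch below uses xml.sax.saxutils.quoteattr from the Python standard library, which picks a delimiter and escapes whatever conflicts with it; its choice of delimiter may differ from the hand-written example above, but the parsed value is the same.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

phrase = 'The cat said "The dog yelled \'Help!,\' then I pounced."'

# quoteattr() returns the value already wrapped in quotes, with any conflicting
# quote characters (and &, <, >) replaced by entity references.
quoted = quoteattr(phrase)
serialized = f"<talltale text={quoted}/>"

# Parsing the result recovers the original phrase exactly.
element = ET.fromstring(serialized)
print(serialized)
```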

Comments

Comments in XML are similar to comments in HTML. The specification states that comments can reside anywhere outside of other markup. A simple XML comment looks like this:

<!-- This is a comment. -->

Since comments are not allowed inside other markup, you can’t embed a comment inside an XML start tag:

<book name="Python and XML" <!--comment here-->>

This type of expression is not allowed by the XML specification. Interestingly enough, comments can appear inside a DTD. In addition, comments are not considered part of the document’s character data. A couple of other caveats are that the double-hyphen (--) cannot be used inside the text of a comment as the characters --> are used to indicate that the comment is being closed. Since one of the goals of XML is to avoid the syntactic difficulties of preceding markup languages, XML simply does not allow a double-hyphen within the body of comments. Entities and other markup are not handled within the text of a comment, so you can use the characters special to the rest of XML in your comments without worry that they’ll cause syntax errors in your data. The correct version of the earlier comment element is as follows:

<book name="Python and XML">
This book is about the Python programming language
and XML markup language.
<!--comment here-->
</book>

By placing the comment inside the element instead of in the start tag, we’ve made it follow the rules.
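The following Python sketch uses xml.dom.minidom, which keeps comments as separate nodes, to show that a comment is not part of the character data and that a double hyphen inside a comment is rejected. The element content is shortened from the example above.

```python
from xml.dom import Node, minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    '<book name="Python and XML">About Python and XML.<!--comment here--></book>'
)
children = doc.documentElement.childNodes

# Comments and character data arrive as different node types.
comments = [n.data for n in children if n.nodeType == Node.COMMENT_NODE]
text = "".join(n.data for n in children if n.nodeType == Node.TEXT_NODE)

# A double hyphen in the body of a comment makes the document ill-formed.
try:
    minidom.parseString("<book><!-- two -- hyphens --></book>")
    double_hyphen_ok = True
except ExpatError:
    double_hyphen_ok = False
```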

Processing Instructions

Processing Instructions (PI) allow an XML document to pass instructions to a handling application. The XML processor does not consider Processing Instructions to be part of the document’s character data. The point of PIs is to hand information to an application. For example, if you are communicating an urgent piece of news and want the receiving application to present some sort of alert to the user, you might place the following instruction within the XML, so that varying applications can act accordingly (e.g., a Palm VII could beep, an X Window application could raise an alert box, and so on):

<?newsAlert title="Martians Invade"?>

In this example, newsAlert is referred to as the target; the rest of the text is commonly called the data, which is how most of the APIs used to work with PIs refer to the two parts. The specification requires the leading <?, a target that is a legal name (other than xml in any mixture of case), the trailing ?>, and the absence of the character pair ?> within the PI. There is no specific syntax associated with the content that follows the target, and the target is usually the name of the tool expected to handle the instruction. It is becoming common for applications to expect the content following the target to look much like a series of attributes with values, which are commonly referred to as pseudo-attributes. Clients of this XML document are able to handle or ignore the PI in whatever way is appropriate to them. Processing Instructions are useful because they provide an XML-oriented way of passing events between applications or adding annotations to the data that are specific to particular applications. Historically, PIs were used in the SGML community to encode instructions to formatting applications, with semantics such as “add a page break here.”
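To show how an application receives a processing instruction, this Python sketch parses the newsAlert example with xml.dom.minidom and reads the two parts. The enclosing news element is invented for the example.

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<news><?newsAlert title="Martians Invade"?></news>')

# The processing instruction is the first (and only) child of the root element.
pi = doc.documentElement.firstChild

# The API exposes the two parts separately; the data is an uninterpreted string,
# so any pseudo-attributes must be parsed by the application itself.
print(pi.target)
print(pi.data)
```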

CDATA Sections

A CDATA section is used to escape special characters in character data in your document. For example:

<![CDATA[The <ool <utter Knife & Sharpening Set]]>

This is actually an encoding of the character data:

The <ool <utter Knife & Sharpening Set

Without using a CDATA section, this must be encoded using general entities or character references:

The &lt;ool &lt;utter Knife &amp; Sharpening Set

The CDATA section is a good way to escape longer stretches of text that contain many characters that would otherwise be treated as markup if included directly in the text. Note that a CDATA section starts with the markup '<![CDATA['; no whitespace is allowed around the word CDATA. Once inside a CDATA section, no XML syntax is recognized until the characters ']]>' are encountered. Entity and character references aren’t resolved or recognized, so the text &#174; does not resolve to the registered trademark symbol, though it would in normal character data or in a CDATA attribute value.
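A short Python sketch with xml.etree.ElementTree confirms that the CDATA form and the escaped form yield the same character data, and that a character reference such as &#174; is left untouched inside a CDATA section. The item element is invented for the example.

```python
import xml.etree.ElementTree as ET

# The same character data written two ways: in a CDATA section and with references.
cdata_form = ET.fromstring(
    "<item><![CDATA[The <ool <utter Knife & Sharpening Set]]></item>"
).text
escaped_form = ET.fromstring(
    "<item>The &lt;ool &lt;utter Knife &amp; Sharpening Set</item>"
).text

# Inside CDATA a character reference is plain text; outside, it is resolved.
literal_ref = ET.fromstring("<item><![CDATA[&#174;]]></item>").text
resolved_ref = ET.fromstring("<item>&#174;</item>").text

print(cdata_form == escaped_form, repr(literal_ref), repr(resolved_ref))
```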
