5.1. XML Documents

XML documents are defined in the XML Recommendation as strings of characters, but the markup within them implies a rich structure. This structure is implied in that Recommendation and defined more carefully in the Infoset Recommendation. The latter Recommendation was written after the Namespace Recommendation, so it is “namespace aware”: the data structures it defines are designed to make namespace information directly available as properties of objects.
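As a rough illustration of what "namespace aware" means in practice, the sketch below uses Python's standard `xml.etree.ElementTree` parser, which reports each element's name in the form `{namespace name}local name`. The sample document, its namespace URI, and the helper `split_name` are made up for this example; ElementTree is not an Infoset implementation, it merely exposes comparable information.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the namespace URI is invented for illustration.
doc = '<po:order xmlns:po="http://example.com/po"><po:item/></po:order>'

root = ET.fromstring(doc)


def split_name(tag):
    """Split ElementTree's '{uri}local' form into (namespace name, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag  # element is in no namespace


namespace_name, local_name = split_name(root.tag)
print(namespace_name, local_name)
```

The prefix `po` does not appear in the result: what the parser makes directly available is the namespace name and the local name, which is the kind of information the paragraph above describes as properties of objects.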

Notation Note

The XML Recommendation uses the term “document” solely to mean the concrete document; the Infoset Recommendation uses the terms “infoset” and “document information item” to mean the abstract document. When dealing with schemas, it is often immaterial whether the documents you think of are concrete or abstract.


A concrete XML document is a string of characters that consists of a prolog, an element, and “misc.” (‘Misc’ really is an XML technical term, but you won’t find it much in this book outside of this paragraph.) The prolog is typically an XML Declaration and/or a Document Type Declaration with misc intervening, but may be just an empty string. The element is required and is generally the important part of the document—that is where the data characters reside. Misc consists of incidental character strings: whitespace, processing instructions, and comments. The element’s structure and content are of primary interest in this book.
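The three-part shape of a concrete document can be seen by parsing a small example. The following is a minimal sketch using Python's standard `xml.dom.minidom`; the sample document and the processing-instruction target `hypothetical-pi` are invented for illustration. Note that minidom does not expose the XML Declaration or top-level whitespace as nodes, so only the comment, the element, and the processing instruction show up.

```python
from xml.dom import Node, minidom

# Hypothetical document: XML Declaration and a comment (prolog),
# one element, then a processing instruction (misc).
doc = (
    '<?xml version="1.0"?>\n'
    "<!-- a comment in the prolog -->\n"
    "<note>Hello</note>\n"
    "<?hypothetical-pi after the element?>\n"
)

dom = minidom.parseString(doc)

# Top-level nodes in document order.
top_level_types = [node.nodeType for node in dom.childNodes]

# The single required element.
document_element = dom.documentElement
print(document_element.tagName, top_level_types)
```

Under these assumptions, the comment and the processing instruction sit beside the element rather than inside it, while the data characters (`Hello`) reside within the element.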

An abstract XML document (an XML infoset) is made up of objects—instances of various classes—combined in various ways. These instances are all called “information items” in the Infoset Recommendation. To begin with, the topmost object is a single document information item.

Notation Note

Abstract document, infoset, and document information item are essentially the same thing. The Infoset Recommendation takes the point of view that the various information items exist in isolation, linked by properties, as opposed to being part of the document element, so it considers an infoset a collection of information items with links between them. This distinction is at most a matter of how you think of objects and the values of their properties, or it can be thought of as just an implementation detail.


Other information items occur as values of various properties of the document information item or (recursively) properties of those information items in turn.
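One way to picture information items linked through property values is as plain objects. The sketch below is only a mental model, not an API defined by the Infoset Recommendation: the class names are hypothetical, only a handful of properties are shown, and the attributes property (an unordered set in the Recommendation) is held in a list for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class CharacterItem:
    character_code: int  # models the [character code] property


@dataclass
class AttributeItem:
    local_name: str
    normalized_value: str
    namespace_name: Optional[str] = None


@dataclass
class ElementItem:
    local_name: str
    namespace_name: Optional[str] = None
    attributes: List[AttributeItem] = field(default_factory=list)
    children: List[Union["ElementItem", CharacterItem]] = field(default_factory=list)


@dataclass
class DocumentItem:
    document_element: ElementItem
    children: List[ElementItem] = field(default_factory=list)


# Abstract counterpart of the concrete string <greeting lang="en">Hi</greeting>
greeting = ElementItem(
    "greeting",
    attributes=[AttributeItem("lang", "en")],
    children=[CharacterItem(ord(c)) for c in "Hi"],
)
document = DocumentItem(document_element=greeting, children=[greeting])

# Reach the data characters by following properties from the topmost item.
text = "".join(
    chr(item.character_code)
    for item in document.document_element.children
    if isinstance(item, CharacterItem)
)
print(text)
```

Every item other than the document item is reached only as (part of) the value of some property, directly or recursively, which is the arrangement the preceding sentence describes.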

Any (concrete) element must be in one of two forms. Either it consists simply of an empty-element tag, or it consists of three consecutive character strings concatenated: a start-tag, content, and an end-tag. Of these three, the simplest is the end-tag, which is a string consisting of ‘</’, a type name, and ‘>’. A start-tag is a string consisting of ‘<’, a type name, optional attribute value specifications, and a closing ‘>’, with intervening whitespace where needed or desired after the type name. When an element has both a start-tag and an end-tag, the type name in the two tags must be the same.

Type name: A name used to name an element type.

An empty-element tag is like a start-tag except that the terminating ‘>’ is instead a ‘/>’.
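The tag forms just described can be assembled by simple string concatenation. This is a minimal sketch: the helper functions `start_tag`, `end_tag`, and `empty_element_tag` are hypothetical names introduced here, they do no validation of names or attribute specifications, and Python's standard `xml.etree.ElementTree` is used only to check how a parser treats the results.

```python
import xml.etree.ElementTree as ET


def start_tag(type_name, attr_specs=""):
    """'<', type name, optional attribute specifications, '>'."""
    return "<" + type_name + (" " + attr_specs if attr_specs else "") + ">"


def end_tag(type_name):
    """'</', type name, '>'."""
    return "</" + type_name + ">"


def empty_element_tag(type_name, attr_specs=""):
    """Like a start-tag, but terminated by '/>' instead of '>'."""
    return "<" + type_name + (" " + attr_specs if attr_specs else "") + "/>"


# The two permitted forms of an element with empty content.
two_tag_form = start_tag("para", 'id="p1"') + "" + end_tag("para")
empty_form = empty_element_tag("para", 'id="p1"')

a = ET.fromstring(two_tag_form)
b = ET.fromstring(empty_form)
same_after_parsing = (a.tag, a.attrib, a.text or "") == (b.tag, b.attrib, b.text or "")

# Start- and end-tags with different type names are not a well-formed element.
try:
    ET.fromstring(start_tag("para") + end_tag("item"))
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True

print(two_tag_form, empty_form, same_after_parsing, mismatch_rejected)
```

With this parser, the two concrete forms yield the same element once parsed, and the mismatched pair is rejected as not well formed.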

Note

A concrete element, thought of as an object, has no properties special to elements. Instead, it has object-valued propertyMethods, which return values, but these values must be defined in terms of the base string. The fact that everything devolves back to the base string is what makes it “concrete.”


The content of an element is slightly more complicated. It, too, is a character string (consisting of the concatenation of data characters, complete elements, and character and entity references), which occurs between the start- and end-tags. If the element consists of an empty-element tag, the content is an empty string.
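To make the concrete notion of content tangible, the sketch below takes a small hypothetical element whose content mixes data characters, a complete subelement, an entity reference, and a character reference. The string slicing is deliberately naive and works only for this simple example; `xml.dom.minidom` from the Python standard library is used to show what the same content looks like after parsing.

```python
from xml.dom import Node, minidom

# Hypothetical element: data characters, an entity reference (&amp;),
# a complete subelement (<b>...</b>), and a character reference (&#169;).
element = "<p>Tom &amp; <b>Jerry</b> &#169;</p>"

# Concrete view: the character string between the start- and end-tags.
# Naive slicing, adequate only because the start-tag here has no '>' in it.
concrete_content = element[element.index(">") + 1 : element.rindex("</")]

# Parsed view: references are resolved and the subelement becomes a node.
p = minidom.parseString(element).documentElement
parsed_content = [
    child.data if child.nodeType == Node.TEXT_NODE else child.tagName
    for child in p.childNodes
]

print(repr(concrete_content))
print(parsed_content)
```

The concrete content still contains the references as written, whereas the parsed form holds the characters they stand for.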

This book uses the following terms to refer to elements and parts thereof (abstract and concrete) and to connect that terminology to definitions in the various Recommendations:

Document element: The outermost element in an XML document.

Element:

  1. A character string conforming to the requirements of the XML Recommendation.

  2. An element information item.

Attribute specification: A character string conforming to the requirements of the XML Recommendation.

Attribute:

  1. An attribute specification.

  2. An attribute information item.

Content of an element:

  1. The character string between the start- and end-tags of an element. (Special case: The content of an element consisting of an empty-element tag is a priori the empty string.)

  2. The terms in the value of an element information item’s children property (a sequence or list of various kinds of information items, especially element and character information items).

Children of an element:

  1. The immediate subelements of an element.

  2. The immediate subelements and data characters of an element.

  3. The immediate subelements, data characters, and attributes of an element.
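The three senses of "children" listed above can be compared side by side. This is a sketch using Python's standard `xml.dom.minidom` on a made-up element; note that minidom groups consecutive data characters into text nodes, whereas the Infoset Recommendation models each character as its own character information item, so the counts below are per text node, not per character.

```python
from xml.dom import Node, minidom

# Hypothetical element with one attribute, two runs of data characters,
# and one immediate subelement.
elem = minidom.parseString('<p id="x">one<b/>two</p>').documentElement

# Sense 1: immediate subelements only.
sense1 = [n for n in elem.childNodes if n.nodeType == Node.ELEMENT_NODE]

# Sense 2: immediate subelements and data characters (as text nodes here).
sense2 = list(elem.childNodes)

# Sense 3: sense 2 plus the element's attributes.
attrs = [elem.attributes.item(i) for i in range(elem.attributes.length)]
sense3 = sense2 + attrs

print(len(sense1), len(sense2), len(sense3))
```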

Attributes of an element:

  1. The attribute specifications found in the start-tag or empty-element tag of the element.

  2. The members of the value of the element information item’s attributes property (a set of attribute information items).
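The difference between the two senses shows up when the same attributes are written in a different order or with different quote characters. In the sketch below, which uses Python's standard `xml.etree.ElementTree` on two made-up empty-element tags, the concrete attribute specifications differ as strings, yet the parsed attributes compare equal, consistent with treating them as a set.

```python
import xml.etree.ElementTree as ET

# Two hypothetical empty-element tags: same attributes, different order
# and different quote characters in the attribute specifications.
first = '<point x="1" y="2"/>'
second = "<point y='2' x='1'/>"

concrete_equal = first == second  # compared as character strings

first_attrs = ET.fromstring(first).attrib
second_attrs = ET.fromstring(second).attrib
abstract_equal = first_attrs == second_attrs  # compared as name/value mappings

print(concrete_equal, abstract_equal)
```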

Subelement of an element: An immediate subelement of the element or (recursively) a subelement of one of those immediate subelements.

Immediate subelement of an element: An element information item term in the element information item’s children property’s value (a sequence or list of information items) or a substring of the content of the element, which, when parsed, gives rise to such an element information item.
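The recursive definition of "subelement" translates directly into a short recursive function. This is a sketch using Python's standard `xml.etree.ElementTree`; the sample element and the function name `subelements` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: <b> and <d> are immediate subelements of <a>;
# <c> is a subelement of <a> but not an immediate one.
root = ET.fromstring("<a><b><c/></b><d/></a>")

# Immediate subelements: iterating an Element yields its direct child elements.
immediate = [child.tag for child in root]


def subelements(elem):
    """Yield each immediate subelement, then (recursively) its subelements."""
    for child in elem:
        yield child
        yield from subelements(child)


all_subelements = [e.tag for e in subelements(root)]

print(immediate, all_subelements)
```

The element itself is not among its own subelements, which is why the function starts from the children rather than from `root`.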

Data character:

  1. A character in an XML document that an XML parser recognizes as data (rather than markup).

  2. A character information item in the children of an element information item.

  3. Such a character information item or a character in the value of an attribute.

Metadata string:

  1. A character string recognized as markup but retained in the abstract data structure because it provides information about the abstract structure. (Example: An element’s type name or the name of an attribute.)

  2. A character string that is the value of an information item property other than the value property of an attribute.

Markup punctuation: A character string recognized as markup but which only serves to identify or delineate markup. (Examples: Whitespace and the strings ‘<’, ‘</’, ‘=’, ‘/>’, and ‘>’ found in various tags.) Markup punctuation is typically not retained in the abstract data structure.
