Chapter 6

The XML Information Set (Infoset) and Beyond

6.1 Introduction

Look at any XML document and you will see a sequence of tags and values set out on a page or a computer screen. Zoom in (metaphorically) and it’s a sequence of characters. Zoom in again and it’s some ink on a page or pixels on a screen or bits in memory or on a disk. In whatever form the XML document is presented, that form represents some information – the cast of a movie, the line items in a purchase order, or the sections and chapters of this book. When a program performs operations on XML – query, update, extract – it does not need or want to deal with bits in memory or even with tags and values. The program wants to operate on the information itself.

To that end, the W3C has defined a more abstract representation of that information, the XML Information Set, or Infoset. In this chapter, we look at the Infoset in some detail and then describe some of the later developments. The Post-Schema-Validation Infoset (PSVI) was defined by the XML Schema Working Group to add type and validation information to the Infoset. The XPath 1.0 Data Model, though similar to the Infoset, added some important notions that influenced the data models that followed (particularly the XQuery Data Model). The Document Object Model (DOM), though strictly speaking an API, has an implicit data model closely related to the Infoset. We end the chapter with a brief introduction to the XQuery Data Model (described in more detail in Section 10.6, “The Data Model”), the most ambitious effort yet, which has both strong typing and an API.

The descriptions in this chapter (and indeed in this book) are necessarily incomplete. The goal is to give the reader a general understanding of the concepts rather than a reference manual from which to implement a query engine. That said, we go into a fair amount of detail on the Infoset, which lays the foundations for other data models. And we go into some detail on the XQuery Data Model and type system in the next chapter, since it is so central to the XQuery language.

6.2 What Is the Infoset?

The XML Information Set, or Infoset, is an abstract representation of the core information in an XML document. That is, the Infoset encapsulates the meaning of a document, so an XML processor need not be concerned about variations in syntax. Every well-formed XML document that conforms to the W3C XML Namespace recommendation¹ can be represented as an Infoset. An XML document does not have to be valid (conform to a DTD or Schema) to be represented as an Infoset.

The W3C XML Information Set Recommendation² (“Infoset”) defines the Infoset representation of a document as a set of information items. There are 11 information items, and each information item has a set of properties. The Infoset information items are summarized in the next section; for a complete description, see the Information Set Recommendation.

Note that not all the information contained in a document is represented in the Infoset (see Section 6.4). The goals of the Infoset Recommendation are to select the most generally useful information in a document and to define how to represent that information in a standard way using standard terminology. Interestingly, the recommendation itself says it exists only so that other specs have a standard way of talking about information in a document. Nonetheless, the Infoset has become the basis for several more sophisticated data models used by XML processors (more on data models later).

6.3 The Infoset Information Items and Their Properties

The W3C XML Information Set Recommendation defines 11 kinds of information items. Each information item (except the Namespace information item) is associated with a definition and/or some syntax given in the W3C XML recommendation.³ Each information item has a set of properties, and a property may itself contain one or more information items – for example, the [children] property of an element might include element information items.

The Infoset information items and their properties are summarized next. The numbered entries represent information items, and the lettered entries beneath each one describe its properties. Property names are enclosed in square brackets [ ].

1. Document Information Item – The document information item is the starting point for all the information items in the Infoset. Think of an Infoset as a tree in which each tree node represents either some character data or an XML marked-up construct (e.g., an element, a comment, or a processing instruction) and each branch is a “parent/child” relationship. The document information item is the root node in that tree. It is a notional node; i.e., it is not represented in the character string or printed form of the XML document. It exists only so that the Infoset is truly a tree – so that an XML processor can start at the document information item and visit any part of the Infoset using common tree-walking algorithms. Take a look at Figure 6-4 (near the end of the chapter). An Infoset representing only the nodes that are part of the XML document, those below the dashed line, would not be a tree – we need to add a notional root node (the document information item) to make it a tree. Its properties include:

Figure 6-4 Tree Structure Corresponding to a Trivial XML Document.

a. Information from the XML declaration ([character encoding scheme], [standalone], [version]).

b. The [document element] property – contains the element information item for the document element. The document element is the single top-level element in the XML document (“movie” or “movies” in most of our examples). This top-level element is sometimes referred to as the “root element,” since it is the root of the tree of elements within the Infoset tree. We said the document information item (see earlier) is the root node – remember, not all nodes in the Infoset tree are elements. The root element may have sibling nodes that are not elements (the prolog, comments, processing instructions). In Figure 6-4, the element “A” is the root element, while “R” is the root node. Other XML abstractions have the same concepts but use different names. We will refer back to the “root element” and “root node” for consistency.

c. [children] – a list of information items representing the children of the document information item, in document order. This list contains exactly one element information item, which represents the “document element,” plus information items for processing instructions and comments that are children of the root node. If there is a DTD declaration, its information item appears here, too.

d. [all declarations processed] – is “not strictly speaking part of the Infoset of the document” (according to the Infoset spec). This property is metadata describing the state of the Infoset build. If true, it means that all declarations in the document have been read and processed, that is, everything that can be known about the document is known. If false, some properties may be “unknown” (e.g., the [references] property of the attribute information item).

2. Element Information Item – Each element information item represents an XML element. Its properties include:

a. [children] – a list of child information items, in document order. The list includes an element information item for each child element as well as information items for processing instructions and comments in the XML element. [children] also includes an information item for each data character and unexpanded entity reference in the XML element.

b. [parent] – the information item for the parent of this XML element. This is an element information item, except where the XML element is the root element, in which case the parent is a document information item. Notice that the treelike structure of an XML document is preserved by the [parent] and [children] properties.

c. [attributes] – an unordered set of attribute information items. Information items in this set may come directly from the text of the document, or they may be introduced by DTD defaults.

d. [local name] – the (local) name of this element, e.g., “movie” or “title.”

e. [namespace name] – the namespace URI reference (if any). The namespace name and the local name together uniquely name this element.⁴

f. [prefix] – the namespace prefix, if any. If the prefix is present, it must be associated with a namespace name.

3. Attribute Information Item – The attribute information item represents an attribute. Its properties are:

a. [owner element] – the element information item of the element in which this attribute appears. Note that, according to the Infoset specification, the relationship between an attribute and its associated element is not a parent/child relationship; it’s an owner element/attribute relationship.⁵

b. [normalized value] – the value of the XML attribute, normalized as specified by the W3C XML Recommendation. Normalization resolves character references and entity references, replaces each white-space character (#x20,⁶ #xD, #xA, #x9) with a space character (#x20) and replaces all end-of-line characters with #xA. Unless the attribute type is CDATA, normalization also collapses sequences of spaces to a single space and removes leading and trailing spaces.

c. [specified] – a flag to show whether the attribute was specified as part of its owner element or produced by defaults in a DTD. This is one place where the Infoset preserves information that would be needed to reconstruct the XML document exactly. We will see other places where the Infoset discards such information.

d. [local name] – the name of this attribute.

e. [namespace name], [prefix] – the namespace name and namespace prefix, if any, of the name of this attribute (see also the earlier discussion of the element information item).

f. [attribute type] – the type, if any, of this attribute. Possible values are ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION, CDATA, and ENUMERATION. The Infoset specification first became a recommendation in the same year as XML Schema (2001), and it deals with only DTD types, not the much richer set of types available in XML Schema.

g. [references] – if the attribute type is IDREF, IDREFS, ENTITY, ENTITIES, or NOTATION, then the [references] property is an ordered list of the element, unparsed entity, or notation information items referenced in the attribute value. Otherwise, this property has no value.⁷

4. Processing Instruction (PI) Information Item – The PI information item represents a processing instruction. Its properties include:

a. [target] – the target of the PI.

b. [content] – the content of the PI.

c. [parent] – the document, element, or document type declaration information item for the parent of this PI.

5. Unexpanded Entity Reference Information Item – The unexpanded entity reference information item provides a mechanism for a nonvalidating XML parser to indicate that an entity reference has been read but not expanded. The motivation for this information item is that some applications, such as browsers, may not want to immediately expand every entity reference. Unexpanded entity reference properties include:

a. [name] – the name of the entity.

b. [system identifier] – the system identifier of the entity, as it appears in the entity declaration.

c. [public identifier] – the normalized public identifier of the entity.

d. [parent] – the element information item that contains this information item in its [children] property.

6. Character Information Item – The Infoset contains a character information item for each data character in the XML document. Information about where this character came from – whether it appeared literally in the document, as a character reference or in a CDATA section – is discarded. Only the contents of elements (and not, for example, attribute values) are counted as “data characters.”⁸ Character information item properties are:

a. [character code] – the ISO 10646 (UCS) character code (equivalently, the Unicode code point).

b. [element content white space] – a flag to indicate whether this character is “white space in element content.” This property enables an XML processor to preserve white space in element content when it sees the xml:space="preserve" attribute.

c. [parent] – the element information item of the element containing this character data.

7. Comment Information Item – The comment information item represents a comment. Its properties are:

a. [content] – a string, the content of the comment.

b. [parent] – the document or element information item for this comment’s parent.

8. Document Type Declaration Information Item – The Infoset contains at most one Document Type Declaration information item, containing information about processing instructions from the DTD. Information about entities and notations from the DTD appears in the document information item, not here. PIs from the internal DTD subset appear before those in the external subset, but there is no way to distinguish between the two sources. Much of the content of the DTD, including the definition of element and attribute structures, is discarded. The Document Type Declaration information item properties are:

a. [system identifier] – the system identifier of the external DTD subset, as it appears in the DOCTYPE declaration.

b. [public identifier] – the normalized public identifier of the external DTD subset.

c. [children] – an ordered list of processing instruction information items, representing processing instructions appearing in the DTD.

d. [parent] – the document information item.

9. Unparsed Entity Information Item – There is an unparsed entity information item for each unparsed general entity declared in the DTD. An unparsed entity references non-XML data – data that the XML processor is not expected to parse – such as a gif image. Unparsed entity properties include:

a. [name] – the name of the entity.

b. [system identifier] – the system identifier of the unparsed entity, as it appears in the entity declaration.

c. [public identifier] – the normalized⁹ public identifier of the unparsed entity.

d. [notation name] – the notation name associated with the unparsed entity.

e. [notation] – the information item for the notation named in [notation name].¹⁰

10. Notation Information Item – There is a notation information item for each notation declared in the DTD. Notation properties include:

a. [name] – the name of the notation.

b. [system identifier] – the system identifier of the notation, as it appears in the notation declaration.

c. [public identifier] – the normalized public identifier of the notation.

11. Namespace Information Item – For every element, there is a namespace information item for each of its in-scope namespaces. Namespace properties are:

a. [prefix] – the namespace prefix.

b. [namespace name] – the namespace name (URI) to which the prefix is bound.

From this description of the information items that go to make up an Infoset, it is clear that the Infoset represents both the data and the structure of an XML document. The data is represented in the information items and their properties, and the treelike structure is preserved by the [parent] and [children] properties. The Infoset also preserves some, but not all, of the information needed to reconstruct the original XML document, so parts of the Infoset can be serialized – put back into an XML document – in only one way, while other parts could map to an XML document in several ways.
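Several of the properties listed above can be observed through an ordinary parser. The following is a minimal sketch using Python's standard xml.dom.minidom module. It is an assumption of this sketch that a DOM tree is a close enough stand-in for the Infoset (the Infoset Recommendation itself defines no API), and the sample document, its namespace URI, and its element names are hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sample document: the namespace URI and names are illustrative only.
XML = (
    "<!-- a comment before the document element -->"
    '<m:movie xmlns:m="http://example.com/movies" year="1999">'
    "<m:title>Hi</m:title>"
    "</m:movie>"
)

doc = minidom.parseString(XML)   # plays the role of the document information item
root = doc.documentElement       # roughly the [document element] property

# [children] of the document: a comment plus exactly one element.
top_level = [child.nodeType for child in doc.childNodes]

# [parent] and [children] preserve the tree shape.
title = root.firstChild

# An attribute has an owner element, not a parent.
year = root.getAttributeNode("year")

# One character code per data character, as in the character information item.
char_codes = [ord(ch) for ch in title.firstChild.data]

print(root.localName, root.namespaceURI, root.prefix)
print(title.parentNode is root, root.parentNode is doc)
print(year.ownerElement is root, year.parentNode is None)
print(char_codes)
```

Note that the DOM groups the characters "Hi" into a single text node, whereas the Infoset would hold one character information item per character; the list of character codes above is the closer analogue.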

Consider a sample movie document, Example 6-1.

Example 6-1 A Sample movie Document

Figure 6-1 shows a tree representation of (part of) the Infoset for Example 6-1.

Figure 6-1 Infoset Tree for a Sample movie Document.

6.4 The Infoset vs. the Document

We started this chapter by saying that the Infoset is “an abstract representation of the core information in an XML document.” Before we go any further, let’s dissect this definition to clarify the relationship between Infoset and document.

The Infoset is not the document. The Infoset takes some of the information conveyed by the XML document and represents it in an abstract way. This abstract representation may in turn be represented in a number of ways – as a tree diagram, as a table, or even as another XML document. The most common representation of an Infoset is an in-memory structure as part of an application. Unfortunately, the Infoset Recommendation does not specify an API to such a structure. Both the representation of the Infoset and the provision of an API to get at information items are left up to the implementation.

As we just said, the Infoset does not represent all the information in an XML document. So what information is included and what is left out? Let’s take another look at the sample movie document in Example 6-1. Assume for now that when we say “the document,” we actually mean the ink on the page. (Of course, the ink on the page is itself an abstraction. You may even be reading a different abstraction – say, pixels on a screen. But for now we’ll assume that the ink on the page is the ultimate reality.) There is some information conveyed by the ink on the page that is obviously not relevant to an XML processor – the size of the font, the color of the ink, the kinds of quotes around attribute values. And most of the information in the Infoset clearly is relevant – such as the data itself and the parent-child structure. But some information is borderline – information that is in the document, but not in the Infoset, that might be considered relevant. For example:

• The source of characters – Character information is represented in the Infoset as character information items. The only properties of a character information item are [character code], [element content white space], and [parent]. In other words, the Infoset tells us what characters are in the data but not how they got there. CDATA sections, general parsed entities, and character references, if present in the XML document, cannot be reconstructed just by looking at the Infoset.

• Order of attributes – Attributes appear in a document in a particular order, but the [attributes] property of the element information item in the Infoset is an unordered set – i.e., the Infoset Recommendation says that attribute order is unimportant, and so it is not preserved. In addition, attribute values are white-space-normalized (e.g., multiple white-space characters are collapsed to a single white space, and leading and trailing white space is removed).

• Empty elements – An empty element may appear in a document either in the form “<movie/>” or in the form “<movie></movie>.” The Infoset does not distinguish between the two.

See Appendix D of the Infoset Recommendation for a nonexhaustive list of information not represented in the Infoset.
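The losses described above are easy to observe with a parser that, like the Infoset, does not record them. The sketch below uses Python's standard xml.etree.ElementTree module, which is not an Infoset implementation but happens to discard the same details. The tiny documents are hypothetical, and no DTD is involved, so attributes are treated as CDATA (white-space characters are replaced by spaces but not collapsed).

```python
import xml.etree.ElementTree as ET

# The source of characters: literal, character reference, and CDATA section
# all produce the same data characters.
literal = ET.fromstring("<title>AB</title>").text
char_ref = ET.fromstring("<title>&#65;B</title>").text
cdata = ET.fromstring("<title><![CDATA[AB]]></title>").text
same_characters = literal == char_ref == cdata

# Empty elements: both spellings serialize identically after parsing.
short_form = ET.tostring(ET.fromstring("<movie/>"))
long_form = ET.tostring(ET.fromstring("<movie></movie>"))

# Order of attributes: the two attribute sets compare equal.
first = ET.fromstring('<movie id="m1" year="1999"/>').attrib
second = ET.fromstring('<movie year="1999" id="m1"/>').attrib

# Attribute value normalization: the literal tab and newline become spaces.
normalized = ET.fromstring('<movie note="a\tb\nc"/>').get("note")

print(same_characters, short_form == long_form, first == second, repr(normalized))
```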

Interestingly, an early working draft of the Infoset¹¹ defined six more information items (for a total of 17). The extra information items – internal entity, external entity, entity start and end markers, and CDATA start and end markers – would have made it easier to reconstruct a document from its Infoset. The decision to drop these information items was a good one – this information is syntactic rather than semantic and does not belong in the Infoset.

Some of the Infoset information may come from a DTD. DTDs have a small amount of information about types – for example, an attribute may have a type, one of ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION, CDATA, or ENUMERATION. But the Infoset does not include type information from an XML Schema. This is the biggest shortcoming of the Infoset, and it’s addressed by an extension to the Infoset known as the Post-Schema-Validation Infoset, or PSVI (see Section 6.6).

An Infoset may¹² be created from a document, usually via an XML parser. The resulting Infoset is an abstract representation of the essence of that document. If the Infoset is then serialized, the resulting document will contain the same information as the document we started with, but the two documents will probably not be identical.¹³
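The round trip can be sketched in a few lines. This example again assumes Python's xml.etree.ElementTree as a stand-in for an Infoset-building parser, and uses a hypothetical movie document. It relies on ET.canonicalize (available from Python 3.8), which is used here only as a convenient way to compare two documents while ignoring purely syntactic differences.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: single-quoted attribute, CDATA section,
# and an empty element written as a start-tag/end-tag pair.
original = (
    "<movie year='1999' id=\"m1\">"
    "<title><![CDATA[Some Title]]></title>"
    "<rating></rating>"
    "</movie>"
)

# Parse (build the abstract tree), then serialize it again.
tree = ET.fromstring(original)
round_tripped = ET.tostring(tree, encoding="unicode")

# The two character strings differ...
identical_text = round_tripped == original

# ...but they carry the same information once syntax is canonicalized.
same_information = ET.canonicalize(original) == ET.canonicalize(round_tripped)

print(identical_text, same_information)
```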

Now we have a good picture of what the Infoset is, what’s in it, and how it relates to a document. But what is the Infoset good for? The main benefit of the Infoset is that it offers an XML processor an abstraction of what’s important in the document. Operations on documents can be defined in terms of the Infoset, and the XML processor can ignore details like character entity evaluation.

6.5 The XPath 1.0 Data Model

The XPath 1.0 Data Model, though similar to the Infoset, added some important notions that influenced the data models that followed (particularly the XQuery Data Model). The XPath 1.0¹⁴ Data Model is a tree representation of an XML document. The tree is defined in terms of seven types of nodes – root, element, text, attribute, namespace, processing instruction, and comment nodes. Four of the Infoset information items are not represented in the XPath Data Model – unexpanded entity references, unparsed entities, DTD, and notation items. Six of the others map one-to-one to XPath data model nodes. And one – the Infoset’s character item – is represented as a collection of character items in the XPath Data Model’s text node. See the XPath 1.0 Recommendation for a mapping from the XPath Data Model to the Infoset.¹⁵

The XPath 1.0 Data Model introduces several important notions:

• The Infoset describes the information in an XML document as information items. Though these items are hierarchic in nature and have a single top-level item, the Infoset spec purposely avoids using the terms tree and nodes.¹⁶ The XPath 1.0 Data Model, on the other hand, talks about the data model as a tree, made up of nodes.

• The XPath 1.0 Data Model introduces the notion of a text node, made up of “a sequence of one or more consecutive character information items.”

• In the XPath 1.0 Data Model, every node has an associated string value. The string value may represent a single value (as in the string value of a text node or an attribute node), or it may be the concatenation of the string values of all the descendant text nodes.

• Since XPath 1.0’s purpose is to query (address is the XPath term) documents, it includes the notion of a node set, the precursor to XQuery’s sequences. Interestingly, the node set is not a part of the XPath 1.0 Data Model, which models only input to XPath expressions, not output.
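The string value and node set notions can be illustrated with a short sketch. It assumes Python's xml.etree.ElementTree, which supports only a subset of XPath 1.0 and returns a plain list rather than a true node set; the helper name string_value and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
movie = ET.fromstring(
    "<movie><title>Movie One</title>"
    "<cast><actor>Actor A</actor><actor>Actor B</actor></cast>"
    "</movie>"
)


def string_value(node):
    """Concatenate the text of all descendant text nodes, in document order."""
    return "".join(node.itertext())


# Addressing with a path expression yields a collection of nodes.
actors = movie.findall("./cast/actor")

print(string_value(movie))
print([string_value(actor) for actor in actors])
```

For a leaf element such as an actor, the string value is just its own text; for movie, it is the concatenation of every descendant text node.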

Figure 6-2 shows an XPath 1.0 Data Model tree for the sample movie document, Example 6-1. The figure is smaller than Figure 6-1 because the individual character items are now collected together into text nodes. It is also simpler because a lot of the information in the Infoset (such as anything to do with entities or DTDs) is not represented.

Figure 6-2 XPath 1.0 Data Model Tree for a Sample movie Document.

6.6 The Post-Schema-Validation Infoset (PSVI)

The Infoset provides an abstraction of the data and structure in a document, so a processor can deal with information items and their properties. The only type information in the Infoset is the attribute type information available in a DTD. However, if the document can be associated with an XML Schema, then there is a lot of valuable information available – type information – that cannot be represented in the Infoset as it is defined by the Infoset Recommendation. To address this, the XML Schema Recommendation Part 1¹⁷ (“Schema 1”) defines extensions (“augmentations”) to the Infoset, to form a Post-Schema-Validation Infoset, or PSVI. The PSVI is an abstraction, just as the Infoset is – it’s an abstraction of the information represented in the document augmented by the information in the XML Schema.

6.6.1 Infoset + Additional Properties and Information Items

When you validate an XML document against an XML Schema, the Schema processor augments the Infoset of that document by adding properties to attribute and element information items. Validation also adds some new information items not defined in the Infoset.

“Schema 1” defines about two dozen additional properties of element information items. For example:

• [validity] – validity of the element: valid, invalid, or notKnown.

• [validation attempted] – what kind of validation was attempted: full, none, or partial.

• [validation context] – a reference to the nearest ancestor with a [schema information] property, i.e., a pointer to the schema against which the document was validated.

• [schema normalized value] – generally, the white-space-normalized content of a leaf node. Similar to the string value in the XPath Data Model, but here white-space-normalization rules are derived from the element’s schema definition.

• [type definition type] – simple type or complex type.

• [type definition anonymous] – true (anonymous type) or false (named type).

• [type definition name] – if not anonymous, the name of the type. If anonymous, may contain a processor-supplied unique name.

• [identity constraint table] – contains an identity-constraint binding information item for each unique or key constraint in the schema.

As well as these additional properties, the PSVI introduces several other information items, such as:

• Identity-constraint binding information item – contains information on unique and key constraints.

• Namespace schema information item – Properties include [schema documents], a set of schema document information items.

• Schema document information item – Properties are [document location], a URI, and [document], a document information item.

Many of these additional properties can be associated with attributes as well as elements.
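One way to picture these augmentations is as extra fields on an element record. The following is a rough sketch only: the class and field names are hypothetical, just a handful of the properties listed above are modeled, and no schema processor is involved (the values are filled in by hand).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"
    NOT_KNOWN = "notKnown"


class ValidationAttempted(Enum):
    FULL = "full"
    NONE = "none"
    PARTIAL = "partial"


@dataclass
class PsviElement:
    """Hypothetical element record carrying a few PSVI properties."""
    local_name: str
    validity: Validity = Validity.NOT_KNOWN
    validation_attempted: ValidationAttempted = ValidationAttempted.NONE
    type_definition_type: Optional[str] = None        # "simple" or "complex"
    type_definition_anonymous: Optional[bool] = None
    type_definition_name: Optional[str] = None
    schema_normalized_value: Optional[str] = None
    children: List["PsviElement"] = field(default_factory=list)


def invalid_elements(element):
    """Collect local names of elements whose [validity] is invalid, depth-first."""
    found = [element.local_name] if element.validity is Validity.INVALID else []
    for child in element.children:
        found.extend(invalid_elements(child))
    return found


# Hand-built example: a valid title and an invalid year inside a movie.
title = PsviElement("title", Validity.VALID, ValidationAttempted.FULL,
                    "simple", False, "string", "Movie One")
year = PsviElement("year", Validity.INVALID, ValidationAttempted.FULL,
                   "simple", False, "positiveInteger", "-1")
movie = PsviElement("movie", Validity.INVALID, ValidationAttempted.FULL,
                    "complex", True, None, None, [title, year])

print(invalid_elements(movie))
```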

6.6.2 Additional Information in the PSVI

So what information can we get from a PSVI that we cannot get from an Infoset? The PSVI gives us lots of information about the schema validity of the document as well as information about types.

Schema Validity

Schema validity, as “Schema 1” tells us, is “not a binary predicate”! First, you can choose to validate in a number of ways – strict (everything must be valid), lax (if it’s defined in the schema it must be valid, else ignore it), or skip (don’t try to validate anything against the schema). Second, you can mix and match these validation modes within a document – i.e., you can do strict validation on some parts of the document, skip on some others, and lax on the rest. The PSVI tracks which kind of validation was done where as well as the result (valid, invalid, notKnown) for each element and attribute.

Types

An XML Schema may contain a lot of information about types. In the Schema world, type information covers structure type information as well as data type information.

Complex types define the structure of an element – the valid attributes, children and content of an element.

Simple types define the data type of the (simple)¹⁸ content of an element or of the value of an attribute.

XML Schema data types are defined in XML Schema Part 2: Data Types¹⁹ [Schema 2]. XML Schema has the following built-in data types:

• Primitive types – familiar data types such as string, Boolean, decimal, float.

• Derived types – built-in types derived from the primitive types, such as normalizedString, integer, positiveInteger.

In addition, users can define:

• Complex types (named or anonymous) – A complex type, as opposed to a simple type, describes an element that has one or more attributes or child elements. Think of a complex type as describing a subtree rather than a leaf node.

• Derived types – defined by restricting or extending built-in types or user-defined types.

See Chapter 5, “Structural Metadata,” for a more detailed discussion of XML Schema types.
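The relationship between primitive and derived types can be pictured as a chain of base types. The sketch below records only the handful of built-in types mentioned above, plus one hypothetical user-defined type (starRating); the table and function names are assumptions made for illustration, not part of any schema API.

```python
# Base type of each derived type; primitive types have no entry.
BASE_TYPE = {
    "normalizedString": "string",
    "integer": "decimal",
    "nonNegativeInteger": "integer",
    "positiveInteger": "nonNegativeInteger",
    # Hypothetical user-defined type, derived by restriction.
    "starRating": "positiveInteger",
}


def derivation_chain(type_name):
    """Follow base types from type_name back to its primitive type."""
    chain = [type_name]
    while chain[-1] in BASE_TYPE:
        chain.append(BASE_TYPE[chain[-1]])
    return chain


def primitive_of(type_name):
    """Return the primitive type at the end of the derivation chain."""
    return derivation_chain(type_name)[-1]


print(derivation_chain("starRating"))
print(primitive_of("normalizedString"))
```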

6.6.3 Limitations of the PSVI

We have seen that the PSVI adds structure type and data type information to the Infoset. This information is useful when querying XML. But the PSVI does not go far enough.

• The PSVI type system is not quite extensive enough for query purposes (we see in Chapter 10, “Introduction to XQuery 1.0,” that the XQuery Data Model adds some more types).

• There is no API for the PSVI – the DOM, probably the most widely used API, only knows about the Infoset (see Section 6.7).

• The PSVI only deals with documents – when querying XML, we need to consider arbitrary sequences of documents, nodes, and/or values. (Some would argue that “sequences” should also be on that list, but at the time of writing even the XQuery Data Model cannot model sequences of sequences.)

6.6.4 Visualizing the PSVI

There is an enormous amount of information in the PSVI for even a simple document – Figure 6-3 shows just a small part of the PSVI information for one element (the title) of the sample movie document, Example 6-1.

Figure 6-3 Part of the PSVI Tree for movie.xml.

6.7 The Document Object Model (DOM) – An API

The Document Object Model (DOM) is fundamentally different from the Infoset and the PSVI. While the Infoset and PSVI are data models – they define an abstract representation of the data in an XML document – the DOM is an API. It defines an interface to the data and structure of an XML (or HTML) document so that a program can navigate and manipulate them. The DOM is language- and platform-independent: The specification defines bindings for Java and ECMAScript (a scripting language very close to JavaScript). If you have written any dynamic web pages using JavaScript, you have probably used the DOM without realizing it.²⁰

The DOM is defined in a suite of W3C Recommendations.²¹ The DOM Level 1 Specification²² defines a set of objects – in the sense of “object-oriented programming” – that can represent any structured document, including an XML document. Later specs build on Level 1. DOM Level 2²³ adds a DOMTimeStamp data type, support for namespaces, plus several extra specifications, including views and events. DOM Level 3²⁴ adds load and save, and validation. There are also some notes associated with Level 3, including a note on DOM and XPath.²⁵

The DOM is a tree-based (as opposed to event-based)²⁶ API. DOM Level 1 defines a hierarchy of node objects. The spec refers to this hierarchy as “The DOM Structure Model” – an appropriate name, since it looks a lot like a data model without the data type information. In DOM Level 1, all element and attribute content is treated as character data (as in the Infoset), and all values are returned as strings of type DOMString. Though DOM Level 2 did introduce one more data type – DOMTimeStamp – the DOM data model is still essentially untyped, except for some vendor extensions. Notably, Microsoft has introduced a number of proprietary extensions to the DOM, including the nodeTypedValue property of a node. nodeTypedValue returns the value of a node, with the type specified in an associated XML Schema, if present.

For an XML document, the hierarchy of node objects is a tree, with a single (notional) document node. Remember that the DOM provides an API to manipulate a document, not just to navigate around a static document. When editing a document, it is often useful to deal with a fragment – a part of the tree that may have more than one top node. To handle fragments, the DOM introduces the DocumentFragment node type, which adds a notional root node to a fragment.
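A short sketch shows a fragment with two top nodes being built and then spliced into a document. It assumes Python's xml.dom.minidom as the DOM implementation, and the movie titles are placeholder data.

```python
from xml.dom import minidom

doc = minidom.parseString("<movies><movie>One</movie></movies>")

# Build a fragment holding two sibling elements with no common parent element.
fragment = doc.createDocumentFragment()
for title in ("Two", "Three"):
    movie = doc.createElement("movie")
    movie.appendChild(doc.createTextNode(title))
    fragment.appendChild(movie)

top_nodes_before = len(fragment.childNodes)

# Appending the fragment moves its children into the document;
# the fragment node itself is not inserted.
doc.documentElement.appendChild(fragment)
top_nodes_after = len(fragment.childNodes)

print(doc.documentElement.toxml())
```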

There are 12 DOM node types, which are similar to the information items in the Infoset. Table 6-1 compares the DOM node types with the Infoset items.

Table 6-1 DOM Node Types and Infoset Items

DOM Structure Model Node Type	Corresponding Infoset Information Item	Differences
Document	Document	−
DocumentFragment	−	A part of a document, possibly with multiple top-nodes – not defined in the Infoset.
Element	Element	−
Attr	Attribute	−
DocumentType	Document type declaration	DOM DocumentType includes entities and notations. In the Infoset these are properties of Document.
ProcessingInstruction	Processing Instruction	−
Comment	Comment	−
Text	Character	DOM groups character Infoset items together into text nodes, like XPath.
CDATASection	−	The Infoset does not model CDATA sections.
Entity	−	The Infoset does not model entities.
EntityReference	Unexpanded entity reference	−
Notation	Notation	−
−	Namespace	Although DOM Level 2 supports namespaces via several of its interfaces, it does not represent namespaces in its structure model.
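The node types in Table 6-1 surface in DOM implementations as numeric nodeType codes. As a rough sketch, assuming Python's xml.dom.minidom as the DOM implementation and an invented movie document, the following walks a parsed tree and records the type of each node. The helper name describe is hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Map each numeric nodeType code to the name of its constant.
NODE_TYPE_NAMES = {
    getattr(Node, name): name for name in dir(Node) if name.endswith("_NODE")
}

# Hypothetical sample document, used only for illustration.
doc = parseString(
    '<?xml version="1.0"?><!-- cast list -->'
    '<movie year="1999"><title>Alpha</title></movie>'
)

def describe(node, depth=0, out=None):
    """Collect (depth, node type name, node name) in document order."""
    out = [] if out is None else out
    out.append((depth, NODE_TYPE_NAMES[node.nodeType], node.nodeName))
    for child in node.childNodes:
        describe(child, depth + 1, out)
    return out

rows = describe(doc)

# Attributes are not children of their element, so the walk above does
# not visit them; they are reached through the owning element instead.
year_attr = doc.documentElement.getAttributeNode("year")
```

Note that the text "Alpha" appears as a single text node under title rather than as five separate character items, which matches the Text row of the table.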

A DOM parser builds instances of these node types. The DOM also introduces some objects to represent results:

• NodeList – an ordered list (sequence) of Nodes.

• NamedNodeMap – an unordered list of nodes, e.g., all the attributes of an element.

NodeLists and NamedNodeMaps contain references to parts of the actual document, not copies, so DOM methods manipulate the “live” document.
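The “live” behavior can be seen in a small sketch, again assuming Python's xml.dom.minidom and an invented cast document. A NodeList obtained once from childNodes reflects a later change to the tree, and an attribute changed through the NamedNodeMap is changed in the document itself. Implementations differ on which lists are live, so treat this as an illustration of the idea rather than a guarantee for every DOM method.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for illustration.
doc = parseString('<cast film="Alpha"><actor>A</actor><actor>B</actor></cast>')
cast = doc.documentElement

# Fetch the NodeList of children once, before modifying the tree.
children = cast.childNodes
count_before = children.length

# Add a third actor to the document.
new_actor = doc.createElement("actor")
new_actor.appendChild(doc.createTextNode("C"))
cast.appendChild(new_actor)

# The same NodeList object now reports the new child as well.
count_after = children.length

# A NamedNodeMap hands back the actual Attr node, not a copy.
attrs = cast.attributes
attrs["film"].value = "Beta"
film_now = cast.getAttribute("film")
```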

The important part of the DOM spec is the interfaces and methods it defines on this underlying data model – the DOM is, after all, an API. We will not describe these interfaces and methods in detail. We will just observe that the DOM, by itself, is not very useful for querying.

• The DOM defines only two ways to access the values in elements and attributes. Neither allows for accurate, simple, efficient queries over XML.

– You can access values of elements and their attributes by name. This is useful only if you know the name of the element (or attribute) for which you are looking. The DOM method getElementsByTagName returns all elements with the given name that are descendants of the current node, so this access method does not take account of where the element occurs.

– You can access values of elements and their attributes by “walking the DOM tree” – i.e., get the top-level node and look at its children, then look at their children, and so on.

• The DOM is not type-aware (though there are proprietary extensions to the DOM that are type-aware) – all values are returned as strings. That means that, if you want to perform any operations that depend on type (equality, greater than, less than, etc.), you have to explicitly cast the returned value to some appropriate host-language type.
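The limitations listed above can be sketched in a few lines, assuming Python's xml.dom.minidom and an invented movies document. The helper names text_of and child_elements are hypothetical. Access by name returns every title regardless of position, a tree walk is needed to restrict the result to top-level movies, and year values must be cast from strings before a numeric comparison.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for illustration.
XML = """<movies>
  <movie><title>Alpha</title><year>1999</year>
    <remake><title>Alpha II</title><year>2010</year></remake>
  </movie>
  <movie><title>Beta</title><year>2005</year></movie>
</movies>"""
doc = parseString(XML)
root = doc.documentElement

def text_of(element):
    """Concatenate the text-node children of an element."""
    return "".join(
        c.data for c in element.childNodes if c.nodeType == Node.TEXT_NODE
    )

def child_elements(node, name):
    """Return only the direct element children with the given tag name."""
    return [
        c for c in node.childNodes
        if c.nodeType == Node.ELEMENT_NODE and c.tagName == name
    ]

# 1. Access by name: every descendant title, wherever it occurs.
all_titles = [text_of(t) for t in doc.getElementsByTagName("title")]

# 2. Walking the tree: only titles directly under a top-level movie.
top_titles = [
    text_of(t)
    for m in child_elements(root, "movie")
    for t in child_elements(m, "title")
]

# 3. Values come back as strings, so compare only after an explicit cast.
recent = [
    text_of(child_elements(m, "title")[0])
    for m in child_elements(root, "movie")
    if int(text_of(child_elements(m, "year")[0])) > 2000
]

# Without the cast, string comparison gives a misleading answer.
misleading = "999" > "2000"
```

Here all_titles includes the remake's title, top_titles does not, and misleading is True because the strings are compared character by character.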

That said, the DOM is a very popular way to access and manipulate XML, and many query implementations use the DOM at some level.

6.8 Introducing the XQuery Data Model

For the rest of this book we focus on the XQuery 1.0 and XPath 2.0 Data Model and its relationship to the SQL data model.

We said early in this chapter that the Infoset is an abstract representation of the information in an XML document, invented so that XML processors could perform operations on XML without having to deal with the details of how that information is represented in the original source input. The XQuery Data Model could be described as “the (extended) Infoset for XQuery” – that is, it is an abstract representation of the information in an XML document, defined for the purpose of an XQuery engine.

The XQuery language is defined in terms of the XQuery Data Model – that is, it is assumed that every query takes an XQuery Data Model instance as input and returns an XQuery Data Model instance as output. How one or more input documents get converted into an XQuery Data Model instance and how the resulting XQuery Data Model instance is presented to the user are left up to the implementation.

Why doesn’t XQuery just use the Infoset? The Infoset is insufficient, for a couple of reasons. First, the Infoset has no data type information, and any reasonable query language needs to know about the types of the data values with which it’s dealing in order to do comparisons, ordering, and so on. So why not use the PSVI? After all, that is the Infoset extended with type information. The PSVI was defined as part of XML Schema, which is concerned with validating documents, not querying them. That said, the XQuery Data Model is based largely on the PSVI, with some additional types.

Second, the Infoset represents only well-formed XML documents. XQuery needs to be able to represent a result (and, by extension, an intermediate result or input) that is an XML document, a subtree, a value, or a sequence of (a mixture of) any of these. The XQuery Data Model introduces the notion of a sequence – in XQuery, everything is a sequence of 0, 1, or more items, where an item is indistinguishable from a sequence of items of length 1. An item may be a value or a node. A node may be a document, element, attribute, text, namespace, processing instruction, or comment node.
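The sequence rule can be modeled informally in a few lines of Python. This is only a sketch of the idea, not how any XQuery engine represents data. The helper seq is hypothetical, and plain Python values stand in for items. It also reflects one property not spelled out above: XQuery sequences are flat, so a sequence placed inside another sequence is merged into it.

```python
def seq(*values):
    """Build a flat tuple that stands in for an XQuery sequence.

    Hypothetical helper: nested sequences are merged into the result,
    so a single item and a one-item sequence end up identical.
    """
    flat = []
    for value in values:
        if isinstance(value, tuple):
            flat.extend(seq(*value))   # a sequence inside a sequence
        else:
            flat.append(value)         # a single item
    return tuple(flat)

empty = seq()                          # a sequence of 0 items
single = seq(42)                       # a sequence of 1 item
wrapped = seq(seq(42))                 # the same item, wrapped again
mixed = seq(1, seq("a", "b"), seq())   # values of different kinds
```

In this model single and wrapped compare equal, which is the sense in which an item is indistinguishable from a sequence of length 1.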

We describe the XQuery Data Model and its relationship to the Infoset and XML Schema in more detail in Section 10.6, “The Data Model.”

6.9 A Note Regarding Data Model Terminology

More than one W3C specification defines terms related to a data model for XML. Unfortunately, there is no universal agreement on the concepts involved, much less the terminology used for those concepts. In particular, several of these specifications are, in our opinion, unnecessarily confusing in the terms they use to reference the topmost elements of XML documents.

We struggled more than once with the problems caused by this lack of uniformity of concept and terminology. To aid our readers, we offer the following information to better their understanding.

Consider the trivial XML document illustrated in Example 6-2. That document corresponds to the tree structure shown in Figure 6-4.

Example 6-2 Trivial XML Document

The mere fact that some specifications have multiple names for the same concept (see, for example, the XML column’s cell corresponding to tree node A in Table 6-2) is problem enough. But the fact that different specifications use certain words (root is a good example) for different purposes – or not at all – just makes things difficult for no good reason.

Table 6-2

Tree-Related Terminology

6.10 Chapter Summary and Further Reading

We started this chapter by looking at the Infoset – an abstract representation of the information in an XML document. The Infoset is extended with type information in the Post-Schema-Validation Infoset, defined by XML Schema. XQuery defined its own data model – the XQuery Data Model – based on the Infoset, with additional type information and sequences. We also mentioned the DOM, an API for accessing and manipulating XML, which has its own underlying data model (the DOM Structure Model), which is similar to the Infoset.

For further reading, there are a number of mappings between data models – see especially the mapping from DOM to XPath 1.0 Data Model in the DOM Level 3 Note,²⁵ and the mapping from XPath 1.0 Data Model to Infoset that we saw earlier in this chapter.¹⁵ If you want to see the details of the PSVI, take a look at the XSV (XML Schema Validator)²⁷ tool. XSV takes as input an XML document and an XML Schema document and outputs its PSVI as an XML document according to the PSVI Schema.²⁸ There’s also a stylesheet²⁹ to display validity information from the PSVI as a color-coded HTML page.

Related readings include the W3C Recommendation on Canonical XML³⁰ (interestingly, this is defined on the XPath Data Model) and Erik Wilde’s proposal to make the Infoset extensible in a standard way.³¹

¹ Namespaces in XML (Cambridge, MA: World Wide Web Consortium, 1999). Available at: http://www.w3.org/TR/REC-xml-names/.

² XML Information Set (Second Edition) (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/xml-infoset/.

³ Extensible Markup Language (XML) 1.0 (Third Edition) (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/REC-xml/.

⁴ The XML 1.0 spec refers to the string between “<” and “>” in a start tag that names an element as the element’s type, or element-type. Oddly, it refers to the analogous string for an attribute as an attribute name. These strings may consist of a namespace prefix plus a local name, separated by a colon (making up a qualified name). The namespace prefix, if it exists, must be associated with a namespace URI reference (also known as a namespace name). If the Infoset is processed by a namespace-aware processor, the processor must use the namespace name, not the prefix – the prefix is just a placeholder for the namespace name.

⁵ The XPath 1.0 spec refers to an attribute’s owner element as its parent, but it explicitly says that an attribute is not a child of its owner (parent) element. The XQuery Data Model spec uses this same definition for an element/attribute relationship.

⁶ The convention here for character codepoints is used in many XML specifications. “#xN” denotes the codepoint with the hexadecimal value N.

⁷ Actually, this is a simplification. The Infoset spec describes three other cases that result in the [references] property of an attribute having no value. The attribute value might be syntactically invalid. The attribute type might denote that the attribute value can only legally reference a unique thing, whereas the attribute value actually references something that is not unique within the document (e.g., the attribute might be an IDREF that references an ID that occurs more than once in the document). Or the attribute type might denote that the attribute value references some (not necessarily unique) thing, whereas the attribute value actually references something that does not exist within the document (e.g., the attribute might be an IDREF that references an ID that does not occur in any ID attribute in the document). In this latter case, there is an exception when the [all declarations processed] property of the document information item is false. This means the thing we are trying to reference might exist somewhere, and we just haven’t read it yet, so the [references] property is “unknown.”
How did this description get so complicated? Most of the complexity arises when we need to account for the cases where the document is not valid (e.g., there are multiple attributes of type ID with the same value) or where the processor has not yet attempted to find out whether or not the document is valid (i.e., where not all declarations have been processed). If the tiny amount of type information taken into account when building the Infoset (the 10 attribute types available in the DTD) can introduce this much complexity, imagine how complicated it is to build the XQuery Data Model based on the broad range of data types, structure types, and validation/validity states allowed in the PSVI. Or just read on.

⁸ This is consistent with the XPath 1.0 Data Model notion of a “text node” as a collection of data characters that does not include attribute values and with the idea that attribute values are somehow not quite data.

⁹ To normalize an identifier, replace each string of white space with a single space character (#x20), and remove leading and trailing white space.

¹⁰ The [notation] property of an unparsed entity may have no value (if there are zero or many notations with the name in [notation name]), or it may be “unknown” (if there are no notations with that name and not all declarations have been processed). See also the footnote discussion of the [references] property of an attribute.

¹¹ XML Information Set, W3C Working Draft 2 (Cambridge, MA: World Wide Web Consortium, 2001). Available at: http://www.w3.org/TR/2001/WD-xml-infoset-20010202/.

¹² An application may create an Infoset that does not represent any document – e.g., an Infoset that represents an intermediate result of some processing.

¹³ For some tips on creating XML in a canonical form, see Canonical XML (Cambridge, MA: World Wide Web Consortium, 2001). Available at: http://www.w3.org/TR/xml-c14n.

¹⁴ XML Path Language (XPath) Version 1.0 (Cambridge, MA: World Wide Web Consortium, 1999). Available at: http://www.w3.org/TR/1999/REC-xpath-19991116.

¹⁵ XML Path Language (XPath) Version 1.0, Appendix ? (Cambridge, MA: World Wide Web Consortium, 1999). Available at: http://www.w3.org/TR/1999/REC-xpath-19991116#infoset.

¹⁶ The Infoset spec says: The terms information set and information item are similar in meaning to the generic terms tree and node, as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models. Information items do not map one-to-one with the nodes of the DOM or the “tree” and “nodes” of the XPath data model.

¹⁷ XML Schema Part 1: Structures Second Edition (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/xmlschema-1/.

¹⁸ Simple content is the content of an attribute or of an element that does not have any child elements.

¹⁹ XML Schema Part 2: Datatypes Second Edition (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/xmlschema-2/.

²⁰ For a simple description of how the DOM plays in DHTML (Dynamic HTML), see Fabian Guisset, The DOM and JavaScript. Available at: http://www.mozilla.org/docs/dom/reference/javascript.html.

²¹ Document Object Model Activity Statement (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/DOM/Activity.

²² Document Object Model Level 1 (Second Edition) (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/DOM/DOMTR#dom1.

²³ Document Object Model Level 2 (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/DOM/DOMTR#dom2.

²⁴ Document Object Model Level 3 (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/DOM/DOMTR#dom3.

²⁵ Document Object Model (DOM) Level 3 XPath Specification (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/2004/NOTE-DOM-Level-3-XPath-20040226/.

²⁶ For an example of event-based parsing, see Java API for XML Parsing (JAXP) at http://jcp.org/en/jsr/detail?id=5, or the SAX (Simple API for XML) home page at http://www.saxproject.org.

²⁷ Henry S. Thompson and Richard Tobin, Current Status of XSV: Coverage, Known Bugs, etc. (Edinburgh, England: University of Edinburgh, 2005). Available at: http://www.ltg.ed.ac.uk/~ht/xsv-status.html.

²⁸ Richard Tobin and Henry Thompson, A Schema for Serialized Infosets (Edinburgh, England: University of Edinburgh, 2005). Available at: http://www.w3.org/2001/05/serialized-infoset-schema.html.

²⁹ C. M. Sperberg-McQueen, Document List (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/People/cmsmcq/doclist.html#xslt.

³⁰ Canonical XML (Cambridge, MA: World Wide Web Consortium, 2001). Available at: http://www.w3.org/TR/xml-c14n.

³¹ Erik Wilde, Making the Infoset Extensible (Zurich, Switzerland: Swiss Federal Institute of Technology, 2002). Available at: http://www.idealliance.org/papers/xml02/dx_xml02/papers/05-01-06/05-01-06.html.
