Chapter 7. Constructing XML

Introduction

In Chapter 3, you learned how to access XML that exists outside the query and navigate over it. In this chapter, you will learn how to construct XML directly in a query. Constructing XML is useful for several purposes, including creating a new result shape (transformation), representing temporary intermediate data structures (composition), and organizing data into conceptual groups (views).

XQuery has expressions for constructing all seven of the well-known XML node kinds: element, attribute, text, document, comment, processing-instruction, and namespace. For most of these node kinds, XQuery supports two different construction expressions: one with a syntax similar to XML, and an alternate XQuery syntax primarily used for nodes whose names or contents are computed from XQuery expressions.

XML construction is a fairly complex process. Handling whitespace characters, namespace nodes, sequences of atomic values, and types are some of its trickier aspects.

Constructed XML elements and attributes are validated against the in-scope schema types (see Section 7.10). Use type operators such as validate to apply different or more specific types to constructed XML nodes (see Chapter 9).

Element Nodes

XQuery provides two different ways to construct elements: the direct constructor and the computed constructor. The direct constructor is essentially the XML syntax slightly modified to support embedded XQuery expressions. The element name is constant, but its content can be totally or partially computed by XQuery expressions. The computed constructor is specific to XQuery and is most commonly used when the element name is computed by some other XQuery expression (although it can also be used when the name is constant).

Direct Element Constructor

The XML syntax for constructing elements should be familiar to you already, and XQuery uses it directly. For example, the XQuery <x/> in Listing 7.1 constructs a sequence containing one element named x, while the XQuery <x><y/></x> constructs an element named x that contains another element named y.

Example 7.1. Direct element construction

<x/>
<x></x>
<x><y/></x>

The characters in between the start and end tags of the element are its content. When an element is constructed using a self-closing tag (<x/>) or with separate start and end tags with no characters in between (<x></x>), its content is empty. The syntax choice makes no difference.

Just like in XML, element nodes may contain any other kind of node except document. The other node kinds are explained later in this chapter. Character references, character entity references, and CDATA sections are also allowed, but become ordinary text; that is, the XQuery Data Model doesn't “remember” that they were CDATA or references.

When any part of an element's content is enclosed by curly braces ({}), the enclosed expression is evaluated as an XQuery. All other content is treated as ordinary character data. To use a curly brace character as an ordinary character, it must be escaped by doubling it or by using a character entity reference.

For example, the XQuery element constructed by <x>a{1+1}</x> contains the text value a2. The character a is kept unchanged, but the curly-brace enclosed expression 1+1 is treated as an XQuery expression, evaluated, and produces the result 2. To output the expression literally, without computing it, double-up its curly braces like this: <x>a{{1+1}}</x>. This XQuery results in an x element whose text content consists of the six characters a{1+1}.
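As a minimal sketch of the escaping rules just described, the following query builds three x elements. The text content noted in each comment follows from those rules; the third form writes the braces as hexadecimal character references, which is one possible spelling of the "character reference" alternative.

```xquery
(
  <x>a{1+1}</x>,            (: enclosed expression evaluated; text content: a2 :)
  <x>a{{1+1}}</x>,          (: doubled braces are literal; text content: a{1+1} :)
  <x>a&#x7B;1+1&#x7D;</x>   (: braces as character references; text content: a{1+1} :)
)
```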

The exact rules for element content are quite a bit more involved than this; see Section 7.11 for complete details.

Computed Element Constructor

XQuery supports a second syntax for constructing elements consisting of the element keyword followed by two expressions: the name and the content. The name can be either an ordinary name constant or an XQuery expression enclosed in braces. Listing 7.2 shows both of these possibilities. The content is always an enclosed expression.

Example 7.2. Computed element construction

element x { 1 + 1 }
element { concat("x", ":y") } { 1 + 1 }

For example, the XQuery expression element x { 1 + 1 } is equivalent to the XQuery <x>{1 + 1}</x>. It computes an element with the name x, and whose content is evaluated from the XQuery expression 1 + 1. More commonly, however, this syntax is used when the name isn't constant. For example, the XQuery expression element { concat("a", 1 + 1) } { "x" } computes both the element name (in this case, a2) and the content.

The name expression is converted to a qualified name as follows: If the name expression results in an xs:QName value, then that value is used directly. If the name results in an xs:string value, then that string is parsed as a QName using the in-scope namespaces. No other type of value is allowed as the name expression.
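The two accepted kinds of name value can be sketched as follows. The prefix p and the namespace urn:example are placeholders chosen for this illustration, not names used elsewhere in the chapter.

```xquery
declare namespace p = "urn:example";
(
  (: name supplied as an xs:QName value, used directly :)
  element { xs:QName("p:item") } { "from a QName value" },

  (: name supplied as a string, parsed using the in-scope namespaces :)
  element { concat("p:", "item") } { "from a string" }
)
```

Both constructors should produce an element named p:item in the urn:example namespace. A name expression of any other type, such as an integer, is expected to raise an error.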

Again, the rules for evaluating element content are involved; see Section 7.11.

Attribute Nodes

XQuery also supports two styles of attribute construction: direct attribute constructors, and computed constructors similar to the computed constructors for elements.

Direct Attribute Constructors

The usual XML syntax for attributes (attributes in an element's start tag between the element name and the tag end character >) constructs attributes directly. Attribute values may appear inside single or double quotes; the quote character must be escaped (using an entity or by doubling it) when used in the content.

As with element content, an attribute value may contain character data, including character entities, and enclosed expressions are evaluated as XQuery expressions. Curly braces must be doubled to be used as character content. Listing 7.3 shows two examples of direct attribute constructors.

Example 7.3. Direct attribute construction

<x a="value1" b='value2' />
<y a="{1+2}" b="{{1+2}}"/>

Like elements, attribute content requires some special rules to handle whitespace and sequences of values. See Section 7.11 for details.
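The quoting rules for direct attribute constructors can be sketched with a single element. The attribute names and values here are illustrative only.

```xquery
<x a="say ""hi"""
   b='it''s'
   c="&quot;quoted&quot;"/>
(: a is: say "hi"     b is: it's     c is: "quoted" :)
```

Attributes a and b escape the delimiting quote character by doubling it, while c uses the named entity instead.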

Computed Attribute Constructors

XQuery provides an alternate syntax for attributes, similar to that for elements, using the attribute keyword. For example, attribute name { "value" } constructs an attribute with the given name and value. Listing 7.4 shows a computed attribute constructor on its own and another in an XML element as part of its computed content.

Example 7.4. Computed attribute construction

attribute { xs:QName("a") } {1+2}
<x>{ attribute a {1+2} } </x>

As with elements, the main reason to use the computed attribute constructor is that the name can be computed from an enclosed XQuery expression. However, another reason to use the alternate syntax is to construct an attribute node without a parent element. For example, you might write a function that computes an “attribute group” and uses it over and over again in other elements, as shown in Listing 7.5. Of course, this means that attributes don't always have a parent node, which can cause some difficulties when serializing out the data model (see Chapter 13).

Example 7.5. Computed attribute constructor can create "floating" attributes

declare function my-attrs() as attribute()* {
  (attribute one { "1" },
   attribute two { "2" },
   attribute three { "3" })
};

<x>
  <y>{ my-attrs() }</y>
  <z>{ my-attrs() }</z>
</x>
=>
<x>
  <y one="1" two="2" three="3"/>
  <z one="1" two="2" three="3"/>
</x>

Text Nodes

Text nodes can actually be created in three ways:

  • Using element content (explained in Section 7.11)

  • Using the computed text constructor syntax text { content }, where content is any sequence of XQuery expressions

  • Using the CDATA syntax (described at the end of this section)

In the computed text constructor case, the content sequence is first atomized; if the atomized sequence is empty, then no text node is constructed. Otherwise, the atomic values are converted to xs:string and joined together with a space character between each pair—exactly like a call to the built-in string-join() function—and the resulting string value is the value of the text node.
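A short sketch of these two cases, using only built-in functions:

```xquery
(
  string(text { 1, 2, 3 }),   (: "1 2 3": values joined with single spaces :)
  count(text { () })          (: 0: an empty content sequence constructs no text node :)
)
```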

Because text nodes are already nameless, the main reason to use this alternate syntax is to create “floating” text nodes without parent elements. The computed text constructor can also be useful when you need fine-grained control over whitespace handling in element content (see Section 7.11).

Sometimes, mainly in elements, the text value contains a lot of special characters that would require escaping or entitization if you wrote them normally. Instead of escaping or entitizing every such character, you can use a CDATA constructor. (In XQuery, CDATA constructors can be used anywhere, not just in elements.)

The CDATA constructor has the form <![CDATA[chars]]> where chars is a sequence of zero or more characters, excluding the sequence ]]> (in other words, exactly like it works in XML). The CDATA constructor creates a text node whose value is that string of characters.

A common misconception is that the CDATA constructor allows you to represent other characters, such as control characters, that aren't legal in XML, but in fact it doesn't allow this—it just gives you a way to avoid writing lots of character entities. (However, not all XQuery implementations enforce the XML rules; check the documentation accompanying your implementation.)

Document Nodes

XQuery provides a computed document constructor, document { content }, which constructs a document node with the given content. This constructor creates a new document node, copying all the content and stripping it of useful type information.

If the content sequence contains document or attribute nodes, an error is raised. Sequences of one or more consecutive atomic values are replaced by text nodes containing those atomic values converted to xs:string and joined together with spaces in between, like a call to the string-join() function. All other items in the content sequence are deep-copied—losing their node identity—and given new types: elements are typed as xs:anyType, attributes as xs:anySimpleType.

The new document node isn't validated against a schema, nor are XML well-formedness rules checked. If its content is empty, then an empty document node is constructed.
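The content rules above can be illustrated with a small sketch. The element names are arbitrary.

```xquery
(
  (: two top-level elements are accepted, because well-formedness is not checked :)
  count(document { <a/>, <b/> }/*),      (: 2 :)

  (: consecutive atomic values become one space-separated text node :)
  string(document { 1, 2, <a>x</a> })    (: "1 2x" :)
)
```

By contrast, an expression such as document { attribute a { 1 } } is expected to raise an error, because attribute nodes are not allowed in the content sequence.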

The main reason to use the document constructor, aside from the effects just mentioned, is to simulate a document loaded by the built-in doc() function. For example, let's suppose you wish to write a function that will return a computed document instead of one loaded from XML. Your first attempt would probably look like Listing 7.6, and it would be wrong.

Example 7.6. Incorrect implementation of a pseudo-document function

declare function pseudo-doc() {
  <x>
    <y/>
  </x>
};

The problem with this implementation becomes clear when you consider a path like pseudo-doc()/x. This path returns the empty sequence, instead of matching the x element as you might expect. The first step constructs the x element, and then the second step selects its child elements named x—but there aren't any.

We can solve this problem by using the document constructor, as in Listing 7.7.

Example 7.7. Correct implementation of a pseudo-document function

declare function pseudo-doc() as document-node() {
  document {
    <x>
      <y/>
    </x>
  }
};

With this correct definition, the first step of the path pseudo-doc()/x selects the document node, and the second step finds its child element named x, as expected.

Comment Nodes

XQuery supports the XML syntax for comment nodes, so you can write <!--content--> to create an XML comment node, where content is any sequence of characters not containing the terminator sequence -->.

XQuery also provides a computed comment constructor, with the comment keyword followed by an expression enclosed in curly braces. The expression is evaluated, atomized, and the resulting values converted to string and concatenated with space characters in between to produce the comment content. XML comment nodes shouldn't be confused with XQuery comments, which don't have any effects on a query. Listing 7.8 shows both comment constructor styles.

Example 7.8. Direct and computed comment construction

<!-- this is a comment -->
comment { "this is a comment" }

Processing Instruction Nodes

Processing instruction nodes are constructed using the usual XML syntax: <?name content?>, where name is any valid, unprefixed XML name, and content is any sequence of characters not containing the terminator sequence ?>. They can also be constructed using the processing-instruction keyword followed by an enclosed name expression and an enclosed content expression. In both cases, the content part is optional. Listing 7.9 demonstrates both styles of construction.

Example 7.9. Processing instruction constructors

<?hello world?>
processing-instruction { "hello" } { "world" }

Although the XML declaration <?xml version="1.0"?> that may appear at the top of an XML file looks like a processing instruction, it isn't. It cannot be selected or constructed by XQuery.

Namespace Nodes

XML namespaces can be bewildering. On the one hand, they are data similar to ordinary attributes; on the other hand, they are meta-data that affects how other XML names are interpreted. On the one hand, they are nodes with unique identities; on the other hand, they are copied by some data models into each node in their scope.

One of the biggest debates while designing XQuery was what to do with an expression like <foo xmlns="urn:bar"/>. How should the namespace declaration attribute (xmlns="urn:bar") affect the element? What about computed constructors, such as <foo>{ attribute xmlns {"urn:bar"} }</foo> or even <foo xmlns="{concat('urn:', 'bar')}"/>? Should these even be allowed?

Because namespace declaration attributes are so nuanced, in XQuery it's generally best to forgo them entirely and put all namespace declarations in the query prolog. However, XQuery also accepts and uses namespace declaration attributes when they appear in direct element constructors. In Listing 7.10, the first element uses the namespace declaration in the prolog, while the second element uses a namespace declaration attribute to accomplish the same effect.

Example 7.10. Two different ways to declare a namespace

declare namespace foo="urn:one";
<foo:x/>
<bar:y xmlns:bar="urn:two"/>

Despite its name, the namespace declaration attribute doesn't cause an attribute to be constructed; instead, it constructs a namespace node and puts the namespace prefix (or default element namespace) into scope for that element and all of its content. The namespace declaration attribute cannot be computed; its content must be a literal string.
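Both effects can be observed with a brief sketch. The prefix p and the namespace urn:example are placeholders for this illustration.

```xquery
(
  (: the declaration is not an attribute node, so no attributes are found :)
  count((<x xmlns:p="urn:example"/>)/@*),                              (: 0 :)

  (: the prefix is in scope for nested content, so the child is in urn:example :)
  namespace-uri((<p:outer xmlns:p="urn:example"><p:inner/></p:outer>)/*)
)
```

The second expression should return urn:example, the namespace that the nested p:inner element picks up from its parent's declaration.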

XQuery also supports a computed namespace constructor, demonstrated in Listing 7.11, in which the prefix is still constant but the namespace value can be computed by an arbitrary XQuery expression. The namespace value is processed the same as the content expressions in the computed comment and processing instruction constructors described previously.

Example 7.11. Computed namespace constructor

<foo:x>{ namespace foo { "bar" } }</foo:x>
=>
<foo:x xmlns:foo="bar"/>

As in XML, the namespace prefixes xml and xmlns are special and cannot be overridden. However, any of the other XQuery built-in namespace prefixes, such as xs and fn, can be overridden using a namespace declaration attribute (just as they can be overridden using namespace declarations in the prolog).

Composition

Navigation over constructed nodes is called composition.

In XSLT 1.0, constructed elements create result tree fragments and composition is specifically disallowed. In contrast, XQuery encourages composition; constructed nodes aren't any different from nodes loaded from a document, and can be manipulated or navigated in the same way (see Listing 7.12). Most implementations eliminate unnecessary temporary nodes (although some things can prevent this optimization; see Chapter 13).

Example 7.12. XQuery supports composition of construction and navigation

(<x><y><z/></y></x>)//z
=>
<z/>

Because XQuery doesn't have structural types other than XML and flat sequences, composition enables you to construct your own hierarchical data structures, usually without significant loss in efficiency.
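For instance, a constructed element can serve as a small lookup table that the same query then navigates. This is only a sketch; the codes and code element names and the key attribute are invented for the illustration.

```xquery
let $table :=
  <codes>
    <code key="a">alpha</code>
    <code key="b">beta</code>
  </codes>
return string($table/code[@key = "b"])
(: "beta" :)
```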

As a simple example, consider creating a point element that has x, y, and z attributes (corresponding to those coordinates). You could then write functions that use these point elements, and extract the coordinate values using attribute navigation, as shown in Listing 7.13. (If you're familiar with XML Schema and your XQuery implementation supports schema import, then you should consider creating complex types to associate with your data structures.)

Example 7.13. Composition facilitates custom XML "data structures"

declare function make-origin() as element(point) {
  <point x="0" y="0" z="0"/>
};

declare function length-squared($p as element(point)) as xs:double {
  $p/@x * $p/@x + $p/@y * $p/@y + $p/@z * $p/@z
};

declare function scale($p as element(point),
                       $scale as xs:double) as element(point) {
  <point x="{$p/@x * $scale}" y="{$p/@y * $scale}"
                              z="{$p/@z * $scale}" />
};

make-origin()
=>
<point x="0" y="0" z="0" />

scale(<point x="1" y="2" z="3"/>, 2)
=>
<point x="2" y="4" z="6" />

length-squared(make-origin())              => 0E0
length-squared(<point x="1" y="2" z="3"/>) => 1.4E1

Validation

If you don't use complex types from XML Schema, or if your implementation doesn't support import schema, then you can skip this section.

Every constructed element is implicitly validated against the current validation context, exactly like the validate expression (see Chapter 9). If the validation mode is skip, then no validation is actually performed; instead, the element is typed as xs:anyType, and its attributes are typed as xdt:untypedAtomic.

Validation is a complex process that not only augments the data model with type information for this element and its attributes, but may also add attributes (with default values) to the element.

When the element name is a constant qualified name, whether used in a direct element constructor or a computed one, it is added to the validation context; otherwise, the validation context is reset to global, regardless of whatever the initial or default validation context was. The new validation context is used for nested expressions.

Element and Attribute Content

The complete rules for handling element and attribute content are somewhat more complex than you might expect, mainly due to three complications: special characters, such as < or {; whitespace characters, which have special meaning in both XQuery and XML; and embedded XQuery expressions.

Character Escapes

In addition to the doubled-up curly brace escapes, XQuery supports three kinds of character references: hexadecimal, decimal, and named entities.

As in XML, entity references all begin with an ampersand (&) and end with a semicolon (;). XQuery has five named entities corresponding to the five special characters: less-than (<), greater-than (>), ampersand (&), quote ("), and apostrophe ('). These characters and their named entity references are listed in Table 7.1.

Numeric entities can be written in either decimal or hexadecimal format. Decimal numeric entities are written &#N; where N is any decimal number. Hexadecimal numeric entities are written &#xN; where N is any hexadecimal number. Hexadecimal digits may be uppercase or lowercase, but the x that precedes the value must be lowercase. In both cases, the number denotes a Unicode character code point (see Chapter 8) and must be a valid XML character. For example, the character reference &#0; isn't valid because character 0 (NULL) isn't a valid XML character.
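A brief sketch of the numeric forms, using code points 65 (A), 60 (<), and 160 (non-breaking space):

```xquery
(
  (: decimal, uppercase-hex, and lowercase-hex digits are all accepted :)
  string(<x>&#65;&#x41;&#x3c;</x>),   (: "AA<" :)

  (: character references also work in string literals :)
  string-to-codepoints("&#160;")      (: 160 :)
)
```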

Table 7.1. The five named character entities supported by XQuery

Named escape    Numerical escape (decimal)    Numerical escape (hexadecimal)    Result
&lt;            &#60;                         &#x3C;                            <
&gt;            &#62;                         &#x3E;                            >
&amp;           &#38;                         &#x26;                            &
&quot;          &#34;                         &#x22;                            "
&apos;          &#39;                         &#x27;                            '

As mentioned in Section 7.4, CDATA sections can be useful when text contains many characters that would require escapes, as demonstrated by Listing 7.14.

Example 7.14. CDATA sections eliminate the need for entity escapes

<x><![CDATA[Special characters such as <, >, and & do not need to be escaped in a CDATA
section]]></x>

<x><![CDATA[However, the CDATA section cannot contain its terminator sequence, ], ], >.
This character sequence can be split across two CDATA sections, like this:
 ]]]]><![CDATA[>]]></x>

Whitespace

Recall that XML whitespace consists of sequences of any of the four characters space (U+0020), tab (U+0009), line-feed (U+000A), and carriage return (U+000D). One of the more useful character escapes is the non-breaking space character U+00A0 (&#160;), which is not treated as whitespace by XML or XQuery, but is often treated as an ordinary space character by other applications (such as Web browsers).

As in HTML, whitespace in XML is mostly insignificant; applications that depend on whitespace being preserved exactly as written are in for a difficult time. That said, XQuery has very well-defined rules for how and when whitespace characters are preserved, stripped, or normalized.

Whitespace preservation keeps the whitespace characters exactly as written. Whitespace stripping removes boundary whitespace (explained momentarily). Whitespace normalization replaces consecutive whitespace characters with a single space character. (A variant known as new-line normalization replaces any end-of-line character sequence with a single line-feed character.)

Boundary whitespace is whitespace that occurs by itself in between XML constructors and/or enclosed XQuery expressions, excluding whitespace constructed using character entity references. For example, boundary whitespace occurs between the y and z elements in the expression <x><y/> <z/></x>, and between these elements and the enclosed XQuery expression in <x><y/> {1+1} <z/></x>. However, in the expression <x>y z</x> there isn't any boundary whitespace, nor is there any in the expressions <x> y z </x> or <x><y/>&#x20;<z/></x>.

Boundary whitespace is preserved or stripped depending on the current XML space policy. This policy can be set in the query prolog using the XML space declaration, as shown in Listings 7.15 and 7.16. The default is strip.

Example 7.15. Preserve boundary whitespace

declare xmlspace preserve;
<x><y/> {1+1} <z/> a b </x>
=>
<x><y/> 2 <z/> a b </x>

Example 7.16. Strip boundary whitespace

declare xmlspace strip;
<x><y/> {1+1} <z/> a b </x>
=>
<x><y/>2<z/> a b </x>

Computed whitespace isn't boundary whitespace either; for example, the expression <x>{" "}</x> always results in <x> </x>, regardless of the XML space policy.
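The cases that are never boundary whitespace can be checked with a short sketch. Because none of these spaces is boundary whitespace, the results should not depend on the XML space policy.

```xquery
(
  string-length(<x>{" "}</x>),    (: 1: computed whitespace :)
  string-length(<x>&#x20;</x>),   (: 1: whitespace written as a character reference :)
  string-length(<x> a </x>)       (: 3: whitespace adjacent to other characters :)
)
```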

In addition to all these rules, users may explicitly normalize whitespace or trim whitespace off the ends of string values using built-in text processing functions such as normalize-space() (see Chapter 8). Additionally, whitespace characters in attribute constructors are normalized (exactly like in XML).

Note that validation effectively removes whitespace (as well as other content); for example, if a schema has provided an element declaration for the name x with type xs:integer, then <x> 1<!--2-->3 </x> removes the whitespace characters (and the comment) and creates an element named x containing the integer 13.

Content Sequence

Finally, it remains to be explained how the content sequence of elements and attributes is computed, given that the content can contain both character data and embedded XQuery expressions that evaluate to nodes, atomic values, and sequences of these. The attribute content sequence is the simpler of the two, so let's consider that first.

Attribute Content

First, entity and character references are resolved into the corresponding strings. Each block of character data is treated as an atomic value of type xs:string containing those characters. Whitespace normalization is applied to this character data.

Next, each enclosed expression is evaluated and atomized. If the result is the empty sequence, then the empty string is used. Otherwise, each atomic value is converted to xs:string and joined together with space characters in between (exactly like the string-join() function). Either way, the result is a string value. The example in Listing 7.17 demonstrates these rules.

Example 7.17. Examples of the attribute content rules

<x a="12&apos;" b="{1,2}" c="12{3, 4}56" d="1{2+3}4"/>
=>
<x a="12'" b="1 2" c="123 456" d="154"/>

Finally, the sequence of strings is concatenated together without spaces to produce the final attribute value. The type of the attribute is initially xdt:untypedAtomic, although validation, which happens implicitly for element constructors, may assign a type to the attribute.
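Two further cases, sketched here, show the empty-sequence rule and the final concatenation step. The attribute names are arbitrary.

```xquery
<x a="[{()}]" b="{1, 2}{3}" c="{1, 2}-{3}"/>
(: a is: []     b is: 1 23     c is: 1 2-3 :)
```

In a, the empty sequence contributes the empty string. In b, the two enclosed expressions yield the strings "1 2" and "3", which are then concatenated without a space between them.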

Element Content

The rules for evaluating element content are detailed but straightforward. They consist primarily of three main steps, described next with examples.

First, any entity and character references are resolved into their corresponding strings. Boundary whitespace is stripped, and the remaining character sequences are converted to text nodes containing those characters (one text node for each consecutive block of text).

Second, any nested constructors are evaluated, resulting in new nodes.

Third, any enclosed XQuery expressions are evaluated. Each one results in a sequence of items. If an item is a node, then it is deep-copied (destroying its identity and replacing all element types with xs:anyType and attribute types with xs:anySimpleType). Each sequence of consecutive atomic values is handled just as in the attribute case: the values are converted to xs:string and joined together with space characters in between, and a text node containing that string value is constructed in their place. The examples in Listing 7.18 illustrate these rules.

Example 7.18. Examples of the element content rules

<a><b>12</b><c>{1,2}</c><d>12{3, 4}56</d><e>1{2+3}4</e></a>
=>
<a><b>12</b><c>1 2</c><d>123 456</d><e>154</e></a>

<x>{ attribute y { 1 }, element z { "a", "b" }, text { " c " } }</x>
=>
<x y="1"><z>a b</z> c </x>

At this point, the content sequence has been normalized to consist entirely of nodes. Several error cases are checked next; if any of the following conditions occurs, then an error is raised:

  • Any node is a document node

  • An attribute node occurs after a non-attribute node

  • Two or more attributes have the same name

  • A namespace node occurs after a non-namespace node

Listing 7.19 demonstrates two of these error cases. Note that these rules are different from XSLT 1.0, which, for example, allows an implementation to use the last attribute given when names collide.

Example 7.19. Some XML rules are applied to element construction

<x a="1" a="2"/> => error("Duplicate attribute a")

<x>y{ attribute z { 1 }}</x>
=> error("Attribute after text content")

Otherwise, adjacent text nodes are concatenated (without spaces between) and replaced by single text nodes, and the final sequence becomes the content of the element. If the sequence is empty, then the element is constructed but empty.
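The difference between joining atomic values within one enclosed expression and merging adjacent text nodes can be sketched as follows:

```xquery
(
  count((<x>a{1}{2}b</x>)/text()),   (: 1: the four pieces merge into one text node :)
  string(<x>a{1}{2}b</x>),           (: "a12b": no spaces added between text nodes :)
  string(<x>{1, 2}</x>)              (: "1 2": values in one expression are space-joined :)
)
```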

Conclusion

This chapter shows the myriad ways an XQuery can create new XML nodes.

For elements and attributes, XQuery supports two methods of construction: a direct constructor syntax that is essentially XML, extended to allow enclosed XQuery expressions, and a computed constructor syntax that is unique to XQuery (typically used to construct nodes whose names are computed). Constructed elements are implicitly validated.

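For comment and processing instruction nodes, XQuery accepts the direct XML syntax without modification and also provides computed constructors. For document nodes, XQuery supports only a computed constructor syntax. Namespace nodes are usually created implicitly, based on the in-scope namespaces, although namespace declaration attributes and computed namespace constructors also create them.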

Text nodes are created implicitly in element content, may be created explicitly using a computed constructor syntax, and may be created using CDATA section constructors (exactly as in XML).

Further Reading

For more information about the XQuery Data Model, see Chapter 2 of this book. You may also be interested in the official standards, such as the XQuery Data Model specification at http://www.w3.org/TR/query-datamodel/, the XML specification at http://www.w3.org/TR/1998/REC-xml-19980210, and Namespaces in XML at http://www.w3.org/TR/1999/REC-xml-names-19990114/.
