Chapter 22. Working with Other XML Constructs

So far, this book has focused on elements and attributes. This chapter discusses the other kinds of nodes, namely comments, processing instructions, documents, and text nodes. CDATA sections and XML character and entity references are also covered in this chapter.

XML Comments

XML comments, delimited by <!-- and -->, can be both queried and constructed in XQuery. Some implementations will discard comments when parsing input documents or loading them into a database, so you should consult the documentation for your implementation to see what is supported.

XML Comments and the Data Model

Comments may appear at the beginning or end of an input document, or within element content. Example 22-1 shows a small XML document with two comments, on the second and fifth lines.

Example 22-1. XML document with comments (comment.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a business document -->
<b:businessDocument xmlns:b="http://datypic.com/b">
  <b:header>
     <!-- date created --><b:date>2015-10-15</b:date>
  </b:header>
</b:businessDocument>

Comment nodes do not have any children, and their parent is either a document node or an element node. In this example, the comment on the second line is a child of the document node, and the comment on the fifth line is a child of the b:header element.

Comment nodes do not have names, so calling any of the name-related functions with a comment node will result in the empty sequence or a zero-length string, depending on the function. The string value (and typed value) of a comment node is its content, as an instance of xs:string.

Querying Comments

Comments can be queried using path expressions. The comment() kind test can be used to specifically ask for comments. For example:

doc("comment.xml")//comment()

will return both comments in the document, while:

doc("comment.xml")//b:header/comment()

will return only the second comment.

The node() kind test will return comments as well as all other node kinds. For example:

doc("comment.xml")/b:businessDocument/b:header/node()

will return a sequence consisting of the second comment, followed by the b:date element. This is in contrast to *, which selects child element nodes only.

You can take the string value of a comment node (e.g., by using the string function) and use that string in various operations.
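As a small illustration, the following sketch returns the text of each comment in comment.xml (Example 22-1). The call to normalize-space is only one possible way to remove the padding spaces that sit just inside the comment delimiters.

```xquery
(: string() yields the comment's content; normalize-space() trims the
   leading and trailing spaces found inside the delimiters :)
for $c in doc("comment.xml")//comment()
return normalize-space(string($c))
```

With the document in Example 22-1, this should yield the two strings This is a business document and date created.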

Comments and Sequence Types

The comment() keyword can also be used in sequence types to match comment nodes. For example, if you wanted to write a function that places the content of a comment in a constructed comment element, you could use the function shown in Example 22-2. The use of the comment() sequence type in the function signature ensures that only comment nodes are passed to this function.

Example 22-2. Function that processes comments
declare function local:createCommentElement
  ($commentToAdd as comment()) as element() {
  <comment>{string($commentToAdd)}</comment>
};

A comment node will also match the node() and item() sequence types.

Constructing Comments

XML comment constructors can be used in queries to specify XML comments. Unlike XQuery comments, which are delimited by (: and :), XML comments are intended to appear in the results of the query.

XML comments can be constructed using either direct or computed constructors. A direct XML comment constructor is delimited as it would be in an XML document, by <!-- and -->. It is included character by character in the results of the query; no expressions that appear in direct comment constructors are evaluated.

Computed comment constructors are useful when you want to calculate the value of a comment. A computed comment constructor consists of an expression surrounded by comment{ and }, as shown in Figure 22-1. The expression within the constructor is evaluated and cast to xs:string.

Figure 22-1. Syntax of a computed comment constructor

As in XML syntax, neither direct nor computed comment constructors can result in a comment that contains two consecutive hyphens (--) or ends in a hyphen.
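When the text of a computed comment comes from data, one way to stay within this rule is to clean the string first. The sketch below uses a hypothetical helper, local:safe-comment (not part of XQuery or of this book's examples). It alters the text slightly in exchange for avoiding the dynamic error (XQDY0072) that a computed comment constructor would otherwise raise.

```xquery
(: hypothetical helper: makes a string safe to use as comment content :)
declare function local:safe-comment($text as xs:string) as comment() {
  (: collapse any run of hyphens into a single hyphen :)
  let $collapsed := replace($text, "-+", "-")
  (: a comment may not end in a hyphen, so pad with a space if needed :)
  let $padded := if (ends-with($collapsed, "-"))
                 then concat($collapsed, " ")
                 else $collapsed
  return comment { $padded }
};

local:safe-comment("see the --verbose flag-")
```

Other cleanup strategies, such as replacing hyphens with another character, would work equally well; the choice depends on how faithfully the original text needs to be preserved.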

Example 22-3 shows examples of XML comment constructors. As you can see, the enclosed expression in the direct constructor is not evaluated, while the expression in the computed constructor is evaluated. In either case, a comment constructor results in a standard XML comment appearing in the query results.

Example 22-3. XML comment constructors

Query

let $count := count(doc("catalog.xml")//product)
(: unordered list :)
return <ul>
         <!-- {concat(" List of ", $count, " products ")} -->
         {comment{concat(" List of ", $count, " products ")}}
       </ul>

Results

<ul>
  <!-- {concat(" List of ", $count, " products ")} -->
  <!-- List of 4 products -->
</ul>

Note that the XQuery comment (: unordered list :) is not included in the results. XQuery comments, described in “Comments”, are used to comment on the query itself.

Processing Instructions

Processing instructions are generally used in XML documents to tell the XML application to perform some particular action. For example, a processing instruction similar to:

<?xml-stylesheet type="text/xsl" href="formatter.xsl"?>

appears in some XML documents to associate them with an XSLT stylesheet. When opened in some browsers, the XML document will be displayed using that stylesheet. This processing instruction has a target, which consists of the characters after the <?, up to the first space, namely xml-stylesheet. The rest of the characters are referred to as its content, namely type="text/xsl" href="formatter.xsl". Although the content of this particular processing instruction looks like a pair of attributes, it is simply considered a string.

Processing instructions can be both queried and constructed using XQuery.

Processing Instructions and the Data Model

Although processing instructions often appear at the beginning of an XML document, they can actually appear within element content or at the end of the document as well. Example 22-4 shows a small XML document with two processing instructions. The xml-stylesheet processing instruction appears on the second line, whereas doc-processor appears within the content of the b:header element. The first line is the XML declaration, which, although it looks like a processing instruction, is not considered to be one.

Example 22-4. XML document with processing instructions (pi.xml)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="formatter.xsl"?>
<b:businessDocument xmlns:b="http://datypic.com/b">
  <b:header>
    <?doc-processor appl="BDS" version="4.3"?>
    <b:date>2015-10-15</b:date>
  </b:header>
</b:businessDocument>

Processing instruction nodes do not have any children, and their parent is either a document node or an element node. In this example, the xml-stylesheet processing instruction is a child of the document node, and doc-processor is a child of the b:header element.

The node name of a processing instruction node is its target. It is never in a namespace, so the namespace portion of the name will be a zero-length string. The string value (and typed value) is its content, minus any leading spaces, as an instance of xs:string. For example, the node name of the first processing instruction in the example is xml-stylesheet, and its string value is type="text/xsl" href="formatter.xsl".
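Because the content is a plain string, pulling out something that looks like an attribute requires string functions. The following sketch, which assumes the pi.xml document from Example 22-4, extracts the href value with a regular expression. It is one possible approach rather than a built-in facility.

```xquery
(: the xml-stylesheet processing instruction is a child of the document node :)
let $pi := doc("pi.xml")/processing-instruction(xml-stylesheet)
(: keep only the text between the quotes that follow href= :)
return replace(string($pi), '.*href="([^"]*)".*', '$1')
```

For Example 22-4 this should return formatter.xsl. Note that replace returns its input unchanged when the pattern does not match, so a more careful query might first test the content with matches.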

Querying Processing Instructions

Processing instructions can be queried in path expressions by using the processing-instruction() kind test. For example:

doc("pi.xml")//processing-instruction()

will return both processing instructions in the document, whereas:

doc("pi.xml")/b:businessDocument/b:header/processing-instruction()

will return only the doc-processor processing instruction. You can also specify a target between the parentheses. For example, specifying processing-instruction(doc-processor) returns only processing instructions whose target is doc-processor. Quotes can optionally be used around the target for compatibility with XPath 1.0.
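Both forms of the target test can be sketched against pi.xml (Example 22-4) as follows. The first uses a bare name and the second uses the quoted form.

```xquery
(: bare-name target, then the quoted form kept for XPath 1.0 compatibility :)
(
  doc("pi.xml")//processing-instruction(doc-processor),
  doc("pi.xml")/processing-instruction("xml-stylesheet")
)
```

Each expression selects a single processing instruction from the example document, so the result is a sequence of two nodes.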

The node() kind test will return processing instructions as well as all other node kinds. For example, the expression:

doc("pi.xml")/b:businessDocument/b:header/node()

will return a sequence consisting of the doc-processor processing instruction node, followed by the b:date element. This is in contrast to *, which selects child element nodes only.

Processing Instructions and Sequence Types

The processing-instruction() keyword can also be used in sequence types to match processing instruction nodes. For example, to display the target and content of a processing instruction as a string, you could use the function shown in Example 22-5. The use of the processing-instruction() sequence type in the function signature ensures that only processing instruction nodes are passed to this function.

Example 22-5. Function that displays processing instructions
declare function local:displayPIValue
  ($pi as processing-instruction()) as xs:string {
  concat("Target is ", name($pi),
         " and content is ", string($pi))
};

As with node kind tests, a specific target may be specified in the sequence type. If the sequence type for the argument had been processing-instruction(xml-stylesheet), the function would only accept processing-instruction nodes with that target. Either way, the target must be a valid XML name with no colon, or type error XPTY0004 will be raised.

A processing instruction node will also match the node() and item() sequence types.

Constructing Processing Instructions

Processing instructions can be constructed in queries, using either direct or computed constructors. A direct processing-instruction constructor uses the XML syntax, namely target, followed by the optional content, enclosed in <? and ?>.

A computed processing-instruction constructor allows you to use an expression for its target and/or content. Its syntax, shown in Figure 22-2, has three parts:

  1. The keyword processing-instruction

  2. The target, which can be either a literal name or an enclosed expression (in braces) that evaluates to a name

  3. The content as an enclosed expression (in braces), that is evaluated and cast to xs:string

Figure 22-2. Syntax of a computed processing-instruction constructor

Example 22-6 shows three different processing-instruction constructors. The first is a direct constructor, the second is a computed constructor with a literal name, and the third is a computed constructor with a calculated name.

Example 22-6. Processing-instruction constructors

Query

<ul>{
  <?doc-processor version="4.3"?>,
  processing-instruction doc-processor2 {'version="4.3"'},
  processing-instruction {concat("doc-processor", "3")}
        {concat('version="', '4.3', '"')}
}</ul>

Results

<ul>
  <?doc-processor version="4.3"?>
  <?doc-processor2 version="4.3"?>
  <?doc-processor3 version="4.3"?>
</ul>

Whether it’s a direct or computed constructor, the target specified must be a valid NCName, which means that it must follow the rules for XML names and not contain a colon.

Documents

Document nodes represent entire XML documents in the XQuery data model. When an input document is opened using the doc function, a document node is returned. The document node should not be confused with the outermost element node, which is its child.

Not all XML data selected or constructed by queries has a document node at its root. Some implementations will allow you to query XML fragments, such as an element or a sequence of elements that are not part of a document. When XML is stored in a relational database, it often holds elements without any containing document. It is also possible, using element constructors, to create result elements that are not part of a document.

The root function can be used to determine whether a node is part of a document. It will return the root of the hierarchy, whether it is a document node or simply a standalone element.
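A minimal sketch of this test is shown below. The function name local:in-document is hypothetical, and the query assumes the desc.xml document from Example 22-8.

```xquery
(: hypothetical helper: true if the node's tree is rooted at a document node :)
declare function local:in-document($n as node()) as xs:boolean {
  root($n) instance of document-node()
};

(
  local:in-document(doc("desc.xml")//i),   (: node from a parsed document :)
  local:in-document(<i>favorite</i>)       (: standalone constructed element :)
)
```

The first call should return true, because doc returns a document node, while the second should return false, because a directly constructed element is the root of its own tree.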

Document Nodes and the Data Model

A document node is the root of a node hierarchy, and therefore has no parent. The children of a document node are the comments and processing instructions that appear outside of any element, and the outermost element node. For example, the document shown in Example 22-4 would be represented by a single document node that has two children: the xml-stylesheet processing-instruction node and the businessDocument element node.

The string value of a document node is the string value of all its text node descendants, concatenated together. In Example 22-4, that would be 2015-10-15, along with any whitespace-only text nodes that were preserved from the indentation. Its typed value is the same as its string value, but with the type xs:untypedAtomic.

Document nodes do not have names. In particular, the base URI of a document node is not its name. Therefore, calling any of the name-related functions with a document node will result in the empty sequence or a zero-length string, depending on the function.

Document Nodes and Sequence Types

The document-node() keyword can be used in sequence types to match document nodes. Used with nothing in between the parentheses, it will match any document node. It is also possible to include an element test in between the parentheses. For example:

document-node(element(product))

tests for a document whose only element child (the outermost element) is named product. The document-node() keyword can also be used with a schema element test, as in:

document-node(schema-element(product))

Schema element tests are described in “Sequence Types and Schemas”.

Constructing Document Nodes

Documents can be explicitly constructed using XQuery. This is generally not necessary, because the results of a query do not have to be an XML document node; they can be a single element, or a sequence of multiple elements, or even any combination of nodes and atomic values. If the results of a query are serialized, they become an XML “document” automatically, regardless of whether a document node was constructed in the query.

However, being able to construct a document node is useful if the application that processes the results of the query expects a complete XML document, with a document node. It’s also useful when you are doing schema validation. Validation of a document node gives a more thorough check than validation of the outermost element, because it checks ID/IDREF integrity.

A computed document constructor is used to construct a complete XML document. Its syntax, shown in Figure 22-3, consists of an expression enclosed in document{ and }. An example of a computed document constructor is shown in Example 22-7.

Figure 22-3. Syntax of a computed document constructor

The enclosed expression must evaluate to a sequence of nodes. If it contains (directly) any attribute nodes, a type error is raised.

Example 22-7. Computed document constructor

Query

document {
  element product {
    attribute dept { "ACC" },
    element number { 563 },
    element name { attribute language {"en"}, "Floppy Sun Hat"}
  }
}

Results

<product dept="ACC">
  <number>563</number>
  <name language="en">Floppy Sun Hat</name>
</product>

No validation is performed on the document node, unless it is enclosed in a validate expression. XQuery does not require that a document node only contain one single element node, although XML syntax does require a document to have only one outermost element. If you want a result document that is well-formed XML, you should ensure that the enclosed expression evaluates to only one element node.
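The difference between the data model and XML syntax on this point can be seen in a short sketch. The element names a and b are arbitrary.

```xquery
(: XQuery accepts a document node with two element children :)
let $d := document { <a/>, <b/> }
return count($d/*)
```

This returns 2. Serializing $d, however, would not produce a well-formed XML document, because there would be two outermost elements.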

Text Nodes

Text nodes represent the character data content within elements. Every adjacent string of characters within element content makes up a single text node. Text nodes can be both queried and constructed in XQuery, although these expressions have limited usefulness.

Text Nodes and the Data Model

A text node does not have any children, and its parent is an element. In Example 22-8, the desc element has three children:

  • A text node whose content is Our  (ending with a space)

  • A child element i

  • A text node whose content is  shirt! (starting with a space)

The i element itself has one child: a text node whose content is favorite.

Example 22-8. Text nodes in XML (desc.xml)
<desc>Our <i>favorite</i> shirt!</desc>

The string value of a text node is its content, as an instance of xs:string. Its typed value is the same as the string value, except that it is of type xs:untypedAtomic rather than xs:string.

Text nodes do not have names, so calling any of the name-related functions with a text node will result in the empty sequence or a zero-length string, depending on the function.

If your document has no DTD or schema, any whitespace appearing between the tags in your source XML will be translated into text nodes. This is true even if it is just there to indent the document. For example, the following b:header element node:

<b:header>
   <b:date>2015-10-15</b:date>
</b:header>

has three children. The first and third children are text nodes that contain only whitespace, and the second child is the b:date element node. If a DTD or schema is used, and the element’s type allows only child elements (no character data content), then the whitespace will be discarded and b:header will not have text node children.

In the data model, there are never two adjacent text nodes with the same parent; all adjacent text is merged into a single text node. This means that if you construct a new element using:

<example>{1}{2}{3}</example>

the resulting example element will have only one text node child, whose value is 123. There is also no such thing as an empty text node, so the element constructor:

<example>{""}</example>

will result in an element with no children at all.
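These rules can be checked directly in a query. The sketch below also shows, for contrast, a single enclosed expression containing a sequence of three values, which are separated by spaces when they are converted to text.

```xquery
(
  count(<example>{1}{2}{3}</example>/text()),   (: 1 :)
  string(<example>{1}{2}{3}</example>),          (: "123" :)
  string(<example>{1, 2, 3}</example>),          (: "1 2 3" :)
  count(<example>{""}</example>/node())          (: 0 :)
)
```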

Querying Text Nodes

Text nodes can be queried using path expressions. The text() kind test can be used to specifically ask for text nodes. For example:

doc("desc.xml")//text()

will return all three text nodes in the document, while:

doc("desc.xml")/desc/text()

will return only the two text nodes that are children of desc.

The node() kind test will return text nodes as well as all other node kinds. For example:

doc("desc.xml")/desc/node()

will return a sequence consisting of the first text node, the i element node, and the second text node. This is in contrast to *, which selects child element nodes only.

Text Nodes and Sequence Types

The text() keyword can also be used in sequence types to match text nodes. For example, to display the content of a text node as a string, you could use the function shown in Example 22-9. The use of the text() sequence type in the function signature ensures that only text nodes are passed to this function.

Example 22-9. Function that displays text nodes
declare function local:displayTextNodeContent
  ($textNode as text()) as xs:string {
  concat("Content of the text node is ", $textNode)
};

A text node will also match the node() and item() sequence types.

Why Work with Text Nodes?

Because text nodes contain all the data content of elements, it may seem that the text() kind test would be used frequently and would be covered earlier in this book. However, because of atomization and casting, it is often unnecessary to ask explicitly for the text nodes. For example, the expression:

doc("catalog.xml")//product[name/text()="Floppy Sun Hat"]

has basically the same effect as:

doc("catalog.xml")//product[name="Floppy Sun Hat"]

because the name element is atomized before being compared to the string Floppy Sun Hat. Likewise, the expression:

distinct-values(doc("catalog.xml")//product/number/text())

is very similar to:

distinct-values(doc("catalog.xml")//product/number)

because the function conversion rules call for atomization of the number elements.

One difference is that text nodes, when atomized, result in untyped values, while element nodes will take on the type specified in the schema. Therefore, if your number element is of type xs:integer, the second distinct-values expression above will compare the numbers as integers. The first expression will compare them as untyped values, which, according to the rules of the distinct-values function, means that they are treated like strings.

Warning

Not only is it almost always unnecessary to use the node test text(), it sometimes yields surprising results. For example, the expression:

doc("catalog.xml")//product[4]/desc/text()

has a string value of Our shirt! instead of Our favorite shirt! because only the text nodes that are direct children of the desc element are included. If /text() is left out of the expression, its string value is Our favorite shirt!.
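The same effect can be reproduced with the smaller desc.xml document from Example 22-8, as in this sketch:

```xquery
let $desc := doc("desc.xml")/desc
return (
  string-join($desc/text(), ""),   (: only the two direct text node children :)
  string($desc)                    (: all descendant text, including that of i :)
)
```

The first string omits the word favorite, because that text node is a child of i rather than of desc. The second string is Our favorite shirt!.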

There are some cases where the text() sequence type does come in handy, though. One case is when you are working with mixed content and want to work with each text node specifically. For example, suppose you wanted to modify the product catalog to change all the i elements to em elements (without knowing in advance where i elements appear). You could use the recursive function shown in Example 22-10.

Example 22-10. Testing for text nodes
declare function local:change-i-to-em
  ($node as element()) as node() {
  element {node-name($node)} {
    $node/@*,
    for $child in $node/node()
    return if ($child instance of text())
           then $child
           else if ($child instance of element(i))
                then <em>{$child/@*, $child/node()}</em>
                else if ($child instance of element())
                     then local:change-i-to-em($child)
                     else ()
  }
};

The function checks all the children of an element node. If it encounters a text node, it copies it as is. If it encounters an element child, it recursively calls itself to process that child element’s children. When it encounters an i element, it constructs an em element and includes the original children of i.

It is important, in this case, to test for text nodes because the desc element has mixed content; it contains both text nodes and child element nodes. If you throw away the text nodes, it changes the content of the document.

Constructing Text Nodes

You can also construct text nodes, using a text node constructor. The syntax of a text node constructor, shown in Figure 22-4, consists of an expression enclosed by text{ and }. For example, if the value of variable $seq is 1, the expression:

text{concat("Sequence number: ", $seq)}

will construct a text node whose content is Sequence number: 1.

Figure 22-4. Syntax of a text node constructor

The value of the expression used in the constructor is atomized (if necessary) and cast to xs:string. Text node constructors have limited usefulness in XQuery because they are created automatically in element constructors by using literal text or expressions that return atomic values. For example, the expression:

<example>{concat("Sequence number: ", $seq)}</example>

will automatically create a text node as a child of the example element node. No explicit text node constructor is needed.

XML Entity and Character References

Like XML, the XQuery syntax allows for the escaping of individual characters by using two mechanisms: character references and predefined entity references. These escapes can be used in string literals, as well as in the content of direct element and attribute constructors.

Character references are useful for representing characters that are not easily typed on a keyboard. They take two forms:

  • &# plus a sequence of decimal digits representing the character’s Unicode codepoint, followed by a semicolon (;).

  • &#x plus a sequence of hexadecimal digits representing the character’s Unicode codepoint, followed by a semicolon (;).

For example, a space can be represented as &#x20; or &#32;. The number always refers to the Unicode codepoint; it doesn’t depend on the query encoding. Table 22-1 lists a few common XML character references.

Table 22-1. XML character reference examples
Character reference    Meaning
&#x20;                 Space
&#xA;                  Line feed
&#xD;                  Carriage return
&#x9;                  Tab
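As a quick check that the decimal and hexadecimal forms name the same character, the following sketch compares several spellings using the built-in string-to-codepoints function:

```xquery
(
  string-to-codepoints("&#x41;&#65;A"),   (: three spellings of the letter A :)
  "&#x20;" eq " "                         (: a referenced space equals a typed one :)
)
```

This returns 65 three times, followed by true.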

Predefined entity references are useful for escaping characters that have special meaning in XML syntax. They are listed in Table 22-2.

Table 22-2. Predefined entity references
Entity reference    Meaning
&amp;               Ampersand (&)
&lt;                Less than (<)
&gt;                Greater than (>)
&apos;              Apostrophe/single quote (')
&quot;              Double quote (")

Certain of these characters must be escaped, namely:

  • In literal strings, ampersands, as well as single or double quotes (depending on which was used to surround the literal)

  • In the content of direct element constructors (but not inside curly braces), both ampersands and less-than characters

  • In attribute values of direct element constructors (but not inside curly braces), single or double quotes (depending on which was used to surround the attribute value)
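The three cases listed above might look like this in a query. The strings and the p element are made up for illustration.

```xquery
(
  "Fish &amp; Chips",                          (: ampersand in a string literal :)
  'It&apos;s here',                            (: quote that matches the delimiter :)
  <p title="say &quot;hi&quot;">a &lt; b</p>   (: attribute value and element content :)
)
```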

The set of predefined entities does not include certain entities that are predefined for HTML, such as &nbsp; and &eacute;. If these characters are needed as literals in queries, they should be represented using character references. For example, if your query is generating HTML output and you want to generate a non-breaking space character, which is often written as &nbsp; in HTML, you can represent it in your query as &#xa0;. If you want to be less cryptic, you can use a variable, as in:

declare variable $nbsp := "&#xa0;";
<h1>aaa{$nbsp}bbb</h1>

Example 22-11 shows a query that uses character and entity references in both a literal string and in the content of an element constructor. The first line of the query uses &#65; in place of the letter A in a quoted string. The second line uses various predefined entity references, as well as the character reference &#x20;, which represents the space character inside a direct element constructor.

Example 22-11. Query with XML entities

Query

if (doc("catalog.xml")//product[@dept='&#65;CC'])
then <h1>Accessories &amp; Misc&#x20;List from &lt;catalog&gt;</h1>
else ()

Results

<h1>Accessories &amp; Misc List from &lt;catalog&gt;</h1>

In element constructors, references must appear directly in the literal content, outside of any enclosed expression. For example, the constructor:

<quoted>&apos;{"abc"}&apos;</quoted>

returns the result <quoted>'abc'</quoted>, while the constructor:

<quoted>{&apos;"abc"&apos;}</quoted>

raises error XPST0003, because &apos; is within the curly braces of the enclosed expression.

Including an entity or character reference in a query does not necessarily result in a reference in the query results. As you can see from Example 22-11, the results of the query (when serialized) contain a space character rather than a character reference.

CDATA Sections

A CDATA section is a convenience used in XML documents that allows you to include literal text in element content without having to use &lt; and &amp; entity references to escape less-than and ampersand symbols, respectively. CDATA sections are not a separate kind of node; they are not represented in the XQuery data model at all.

CDATA sections are delimited by <![CDATA[ and ]]>. Example 22-12 shows two h1 elements. The first element has a CDATA section that contains some literal text, including an unescaped ampersand character. It also contains a reference to a <catalog> element that is intended to be taken as a string, not as an XML element in the document. If this text were not enclosed in a CDATA section, the XML element would not be well formed. The second h1 element shown in the example is equivalent, using predefined entities to escape the ampersand and less-than characters.

Example 22-12. Two equivalent h1 elements, one with a CDATA section
<h1><![CDATA[Catalog & Price List from <catalog>]]></h1>
<h1>Catalog &amp; Price List from &lt;catalog&gt;</h1>

When your query accesses an XML document that contains a CDATA section, the CDATA section is not retained. If the first h1 element in Example 22-12 is queried, its content is Catalog & Price List from <catalog>. There is no way for the query processor to know that a CDATA section was used in the input document.

For convenience, CDATA sections can also be specified in a query, in the character data content of an element constructor. Example 22-13 shows a query that uses a CDATA section. All of the text in a CDATA section is taken literally; it is not possible to include enclosed expressions in a CDATA section.

Just as in an XML document, a CDATA section in a query serves as a convenient way to avoid having to escape characters. Including a CDATA section in a query does not result in a CDATA section in the query results. As you can see from Example 22-13, the results of the query (when serialized) contain an escaped ampersand and less-than sign in the element content. However, it is possible to force CDATA sections for certain output elements using the cdata-section-elements serialization parameter, described in “Serialization Parameters”.

Example 22-13. Query with CDATA section

Query

if (doc("catalog.xml")//product)
then <h1><![CDATA[Catalog & Price List from <catalog>]]></h1>
else <h1>No catalog items to display</h1>

Results

<h1>Catalog &amp; Price List from &lt;catalog></h1>
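That a CDATA section leaves no trace in the data model can be confirmed by comparing the two h1 elements from Example 22-12, as in this sketch:

```xquery
(: both constructors produce an h1 element with one identical text node :)
deep-equal(
  <h1><![CDATA[Catalog & Price List from <catalog>]]></h1>,
  <h1>Catalog &amp; Price List from &lt;catalog&gt;</h1>
)
```

This returns true, because the CDATA section and the entity references are simply two ways of writing the same character data.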