Chapter 2. XQuery Foundations

This chapter provides a brief overview of the foundations of XQuery: its design, its place among XML-related standards, and its processing model. It also discusses the underlying data model behind XQuery and the use of types and namespaces in queries.

The Design and History of the XQuery Language

The XML Query Working Group of the World Wide Web Consortium (W3C) began work on XQuery in 1999. It used as a starting point an XML query language called Quilt, which was itself influenced by two earlier XML query languages: XQL and XML-QL.

The working group set out to design a language that would:

  • Be useful for both highly structured and semi-structured documents

  • Be protocol-independent, allowing a query to be evaluated on any system with predictable results

  • Be a declarative language rather than a procedural one

  • Be strongly typed, allowing queries to be “compiled” to identify possible errors and to optimize evaluation of the query

  • Allow querying across collections of documents

  • Use and share as much as possible with appropriate W3C recommendations, such as XML, Namespaces in XML, XML Schema, and XPath

The XQuery recommendation and related supporting standards include over 15 separate documents and over 1,000 printed pages. These documents are listed (with links) at the public XQuery website at http://www.w3.org/XML/Query. The various recommendation documents are generally designed to be used by implementers of XQuery software, and they vary in readability and accessibility.

Version 1.0 of XQuery became a standard in 2007. Subsequently, version 3.0 was finalized in 2014. The version number 2.0 was skipped in order to align the version numbers with XPath and XSLT. XQuery 3.1 was finalized in 2017. A list of the major new features added in versions 3.0 and 3.1 can be found in “New Features in XQuery 3.0” and “New Features in XQuery 3.1”, respectively.

XQuery in Context

XQuery is dependent on or related to a number of other technologies, particularly XPath, XSLT, SQL, and XML Schema. This section explains how XQuery fits in with these technologies.

XQuery and XPath

XPath started out as a language for selecting elements and attributes from an XML document while traversing its hierarchy and filtering out unwanted content. XPath 1.0 is a fairly simple yet useful recommendation that specifies path expressions and a limited set of functions. Later versions of XPath are much more than that, encompassing a wide variety of expressions and functions, not just path expressions.

XQuery and XPath overlap to a very large degree. They have the same data model and the same set of built-in functions and operators. XPath is essentially a subset of XQuery. XQuery has a number of features that are not included in XPath, such as FLWORs, user-defined functions, and XML constructors. These features are not concerned with selecting data; they have to do with structuring and sorting query results, or with more complex programming.

The two languages are consistent in that any expression that is valid in both languages evaluates to the same value using both languages.
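As a rough illustration of the overlap, the following sketch combines a path expression that is valid in both languages with a FLWOR that is valid only in XQuery. It assumes the catalog.xml document used elsewhere in this chapter, with product elements that each carry a dept attribute and a single name child.

```xquery
xquery version "3.1";

(: A path expression: valid in both XPath and XQuery :)
doc("catalog.xml")/catalog/product[@dept = "ACC"]/name,

(: A FLWOR with an element constructor: valid only in XQuery :)
for $product in doc("catalog.xml")/catalog/product
where $product/@dept = "ACC"
order by $product/name
return <li>{data($product/name)}</li>
```

The first expression only selects existing nodes, while the second sorts the selection and wraps each name in a newly constructed element.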

XQuery Versus XSLT

XSLT is a W3C language for transforming XML documents into other XML documents or, indeed, documents of any kind. There is a lot of overlap in the capabilities of XQuery and XSLT. XSLT makes use of XPath, so it has the same data model and supports all the same built-in functions and operators as XQuery, as well as many of the same expressions.

Some of the differences between XQuery and XSLT are:

  • XSLT implementations are generally optimized for transforming entire documents. Whether they load the entire input document into memory or use the streaming features of XSLT 3.0, the use case is typically handling a single document at a time. XQuery is optimized for selecting fragments of data, for example, from a database. XQuery is designed to be scalable and to take advantage of database features such as indexes for optimization.

  • XQuery has a more compact non-XML syntax, which is sometimes easier to read and write (and embed in program code) than the XML syntax of XSLT.

  • XQuery is designed to select from a collection of documents as opposed to a single document. FLWORs make it easy to join information across (and within) documents. XSLT stylesheets can also operate on multiple documents, but XSLT processors are not particularly optimized for this less common use case.

Generally, when transforming an entire XML document from one XML vocabulary to another, it makes more sense to use XSLT. When your main focus is selecting a subset of data from an XML document or database, you should use XQuery. The relationship between XQuery and XSLT is explored further in Chapter 27.

XQuery Versus SQL

XQuery borrows ideas from SQL, and many of the designers of XQuery were also designers of SQL. The line between XQuery and SQL may seem clear; XQuery is for XML, and SQL is for relational data. However, increasingly this line is blurred, because relational database vendors are putting XML frontends on their products and allowing XML to be stored in traditionally relational databases.

XQuery does not replace SQL for the highly structured data that is traditionally stored in relational databases. The two can coexist, with XQuery being used to query less-structured data, or data that is destined for an XML-based application, and SQL continuing to be used for highly structured relational data.

Chapter 26 compares XQuery and SQL, and describes how they can be used together.

XQuery and XML Schema

XML Schema is a W3C standard for defining schemas, which can be used to validate XML documents and to assign types to XML elements and attributes. XQuery uses the type system of XML Schema, which includes built-in types that represent common datatypes such as decimal, date, and string. XML Schema also specifies a language for defining your own types based on the built-in types.

If an input document to a query has a schema, the types can be used when evaluating expressions on the input data. For example, if your item element has a quantity attribute, and you know from the schema that the value of the quantity attribute is an integer, you can perform sorts or other operations on that attribute’s value without converting it to an integer in the query. This also has the advantages of allowing the processor to better optimize the query and to catch errors earlier.
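A minimal sketch of the difference, assuming that the item elements in order.xml carry a quantity attribute and that no schema is in use. Untyped values are compared as strings in an order by clause, so the conversion is written out explicitly here.

```xquery
xquery version "3.1";

(: No schema: convert explicitly so that a quantity of 10 sorts after 9 :)
for $item in doc("order.xml")//item
order by xs:integer($item/@quantity)
return $item
```

If a schema declared quantity to be of type xs:integer, the clause could simply be written as order by $item/@quantity and would still sort numerically.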

XQuery users are not required to use schemas. It is entirely possible (and common) to write a complete query with no mention of schemas. However, a rich set of functions and operators is provided that generally operate on typed data, so it is useful to understand the type system and use the built-in types, even if no schema is present. Chapter 14 covers schemas in more detail.

Processing Queries

A simple, typical example of a processing model for XQuery is shown in Figure 2-1. This section describes the various components of this model.

Figure 2-1. A basic XQuery processor

Input Documents

Throughout this book, the term input document is used to refer to the data that is being queried, which is most often XML. The XML that is being queried can, in fact, take a number of different forms, for example:

  • Text files that are XML documents

  • Fragments of XML documents that are retrieved from the Web using a URI

  • A collection of XML documents that are associated with a particular URI

  • Data stored in native XML databases

  • Data stored in relational databases that have an XML frontend

  • In-memory XML documents

Some queries use a hardcoded link to the location of the input document(s), using the doc or collection function in the query. Other queries operate on a set of input data that is set by the processor at the time the query is evaluated.

Whether it is physically stored as an XML document or not, an XML input document must still conform to the constraints that apply to XML documents. For example, an element cannot have two attributes with the same name, and element and attribute names cannot contain special characters other than hyphens, underscores, and periods.

In addition to XML input, it is also possible to query JSON documents and simple text files. This can be done via the json-doc and unparsed-text functions, or through in-memory data structures.
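The input functions mentioned in this section can be sketched together as follows. Apart from catalog.xml, the file names and the collection URI are hypothetical, and the way a processor resolves them (especially collection URIs) varies by implementation.

```xquery
xquery version "3.1";

(: catalog.json, notes.txt, and the collection URI are hypothetical names :)
let $oneDocument   := doc("catalog.xml")
let $manyDocuments := collection("http://example.com/orders")
let $jsonData      := json-doc("catalog.json")
let $plainText     := unparsed-text("notes.txt")
return (
  count($oneDocument//product),   (: product elements in one XML document :)
  count($manyDocuments),          (: nodes returned for the collection :)
  exists($jsonData),              (: whether the JSON file yielded a value :)
  string-length($plainText)       (: length of the text file as a string :)
)
```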

The Query

An XQuery query could be contained in a text file, embedded in program code or in a query library, generated dynamically by program code, or input by the user on a command line or in a dialog box. Queries can also be composed from multiple files, known as modules.

A query is made up of three parts: a version declaration, a prolog, and a body, in that order.

  • The optional version declaration says what version of XQuery you are using, for example 3.1. If the version declaration does not appear, the processor makes an assumption about the version based on which version it supports.

  • The optional query prolog contains various declarations that are used in evaluating the query. This includes namespace declarations, variable declarations, user-defined functions, and other settings. These declarations are discussed in relevant sections throughout the book and summarized in Chapter 12.

  • The query body contains one or more expressions, separated by commas, that indicate what the query should return.

So far, the examples in this book have had only a query body. Example 2-1 shows a query with all three parts. As you can see, a semicolon ends the version declaration and each of the two declarations in the prolog. The query body contains two expressions: a constructed h1 element and a FLWOR. The comma after the h1 element is used to separate the two expressions in the query body.

Example 2-1. A query with a prolog

Query

xquery version "3.1";
            
declare namespace html = "http://www.w3.org/1999/xhtml";
declare variable $orderTitle := "Order Report";

<h1>{$orderTitle}</h1>,
for $item in doc("order.xml")//item
order by $item/@num
return <p>{data($item/@num)}</p>

Results

<h1>Order Report</h1>
<p>443</p>
<p>557</p>
<p>557</p>
<p>563</p>
<p>784</p>
<p>784</p>

The Context

A query is not evaluated in a vacuum. The query context consists of a collection of information that affects the evaluation of the query. Some of these values can be set by the processor outside the scope of the query, while others are set in the query prolog. The context may include such values as:

  • The context item, which determines the context for path expressions in the query, i.e., what input documents are being queried

  • Current date and time, and the implicit time zone

  • Names and values of variables that are bound outside the query or in the prolog

  • External function libraries built into your processor
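The following sketch shows how a query might draw on several of these context values. The variable names are hypothetical, and the query assumes that the processor supplies a value for $maxResults and sets the context item to a document that contains item elements.

```xquery
xquery version "3.1";

(: $maxResults is a hypothetical variable bound by the processor at run time :)
declare variable $maxResults as xs:integer external;
(: $reportTitle is bound in the prolog :)
declare variable $reportTitle := "Order Report";

<report title="{$reportTitle}"
        generated="{current-dateTime()}"
        timezone="{implicit-timezone()}">{
  (: the leading // is resolved against the context item set by the processor :)
  subsequence(//item, 1, $maxResults)
}</report>
```

If no context item or no value for $maxResults is supplied, a processor would be expected to raise an error rather than return a result.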

The Query Processor

The query processor is the software that parses, analyzes, and evaluates the query. The analysis and evaluation phases are roughly equivalent to compiling and executing program code. The analysis phase finds syntax errors and other static errors that do not depend on the input data. The evaluation phase computes the results of the query from the input documents, possibly raising dynamic errors for situations like missing input documents or division by zero. Either phase may raise type errors, which result when a value has a different type than expected. Errors in XQuery all have eight-character names, such as XPST0001, and they are described in detail in Appendix C.
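As a small illustration of a dynamic error, the sketch below uses the try/catch expression (available from version 3.0) to catch a division by zero. The err prefix is declared explicitly because not every processor predeclares it; the variable name $divisor is made up for the example.

```xquery
xquery version "3.1";

declare namespace err = "http://www.w3.org/2005/xqt-errors";

try {
  (: the division is only attempted during the evaluation phase :)
  let $divisor := 0
  return 10 div $divisor
} catch err:FOAR0001 {
  (: $err:code holds the name of the error that was raised :)
  "caught " || local-name-from-QName($err:code)
}
```

This should yield the string caught FOAR0001. A static error such as a syntax error is reported during analysis, before any of the query is evaluated, so a try/catch expression inside the query cannot intercept it.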

There are a number of implementations of XQuery. Some are open source, while others are available commercially from major vendors. Many are listed at the official XQuery website at http://www.w3.org/XML/Query. This book does not delve into all the details of individual XQuery implementations but points out features that are implementation-defined or implementation-dependent, meaning that they may vary by implementation.

The Results of the Query

The query processor returns a sequence of values as the results. The results are often XML elements (or entire documents), but a query could also return a result that is not XML, for example a string or an array of integers. Depending on the implementation, these results can then be written to a physical file, sent to a user interface, or passed to another application for further processing.

Writing the results to a physical XML document is known as serialization. In your query you can specify that you want the output serialized as XML, HTML, XHTML, text, or JSON. “Serializing Output” covers serialization options in more detail.
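One way to request a serialization method from within the query is through option declarations in the prolog, sketched below. The output prefix is conventional rather than required, and some processors also accept these settings through their own APIs or command-line options.

```xquery
xquery version "3.1";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
(: ask for XML output, indented for readability :)
declare option output:method "xml";
declare option output:indent "yes";

<report>
  <title>Order Report</title>
</report>
```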

The XQuery Data Model

XQuery has a data model that is used to define formally all the values used within queries, including those from the input document(s), those in the results, and any intermediate values. The data model is officially known as the XQuery and XPath Data Model, or XDM. Understanding the data model is analogous to understanding tables, columns, and rows when learning SQL. It describes the structure of both the inputs and outputs of the query. It is not necessary to become an expert on the intricacies of the data model to write XML queries, but it is essential to understand the basic components:

Node

An XML construct such as an element or attribute

Atomic value

A simple data value with no markup associated with it

Function

Starting in version 3.0, a function is a full-fledged item in the data model. Maps and arrays are subtypes of functions. These more advanced use cases are described in Chapters 23 and 24.

Item

A generic term that refers to a node, an atomic value, or a function

Sequence

An ordered list of zero, one, or more items

The relationship among these components is depicted in Figure 2-2.

Figure 2-2. Basic components of the data model
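The components can also be told apart inside a query. The sketch below builds a sequence of four items and classifies each one with instance of tests; the item values are chosen only for illustration.

```xquery
xquery version "3.1";

(: a sequence holding one node, two atomic values, and one function :)
let $items := (<number>784</number>, 784, "ACC", upper-case#1)
for $item in $items
return
  if ($item instance of node()) then "node"
  else if ($item instance of xs:anyAtomicType) then "atomic value"
  else if ($item instance of function(*)) then "function"
  else "other"
```

The query returns the four strings node, atomic value, atomic value, and function, in that order.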

Nodes

Nodes are used to represent XML constructs such as elements and attributes. Nodes are returned by many expressions, including path expressions and constructors. For example, the following path expression returns four product element nodes:

doc("catalog.xml")/catalog/product

Node kinds

XQuery uses six kinds of nodes:

Element node

An XML element

Attribute node

An XML attribute

Document node

An entire XML document (not its outermost element)

Text node

Some character data content of an element

Processing instruction node

An XML processing instruction

Comment node

An XML comment

Most of this book focuses on element and attribute nodes, the ones most often used within queries. Generally, the book refers to them as “elements” and “attributes” rather than “element nodes” and “attribute nodes,” unless a special emphasis on the data model is required. The other node kinds are discussed in Chapter 22.

The data model also allows for namespace nodes, but the XQuery language does not provide any way to access them or perform any operations on them. Therefore, they are not discussed directly in this book. Chapter 10 provides complete coverage of namespaces in XQuery.
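To make the six node kinds concrete, the following sketch constructs a tiny document in the query itself and counts each kind with a kind test. The content of the document is invented for the example.

```xquery
xquery version "3.1";

let $doc := document {
  <!-- a comment -->,
  <?display mode="compact"?>,
  <product dept="MEN">Cotton Dress Shirt</product>
}
return (
  $doc instance of document-node(),       (: true: the document node :)
  count($doc/element()),                  (: 1 element node :)
  count($doc/product/@*),                 (: 1 attribute node :)
  count($doc/product/text()),             (: 1 text node :)
  count($doc/comment()),                  (: 1 comment node :)
  count($doc/processing-instruction())    (: 1 processing instruction node :)
)
```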

The node hierarchy

An XML document (or document fragment) is made up of a hierarchy of nodes. For example, suppose you have the document shown in Example 2-2.

Example 2-2. Small XML example
<catalog xmlns="http://datypic.com/cat">
  <product dept="MEN" xmlns="http://datypic.com/prod">
    <number>784</number>
    <name language="en">Cotton Dress Shirt</name>
    <colorChoices>white gray</colorChoices>
    <desc>Our <i>favorite</i> shirt!</desc>
  </product>
</catalog>

When translated to the XQuery data model, it looks like the diagram in Figure 2-3. (Depending on the processor, there may also be text nodes, not shown in the diagram, for the line breaks and spaces used to indent the XML document.)

Figure 2-3. A node hierarchy

The node family

A family analogy is used to describe the relationships between nodes in the hierarchy. Each node can have a number of different kinds of relatives:

Children

An element may have zero, one, or several elements as its children. It can also have text, comment, and processing instruction children. Attributes are not considered children of an element. A document node can have an element child (the outermost element), as well as comment and processing instruction children.

Parent

The parent of an element is either another element or a document node. The parent of an attribute is the element that carries it. Strangely, even though attributes are not considered children of elements, elements are considered parents of attributes!

Ancestors

Ancestors are a node’s parent, parent’s parent, etc.

Descendants

Descendants are a node’s children, children’s children, etc.

Siblings

A node’s siblings are the other children of its parent. Attributes are not considered to be siblings.
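These family relationships map onto path expressions. The sketch below uses a simplified, namespace-free version of the product element from Example 2-2, constructed in the query so that it is self-contained.

```xquery
xquery version "3.1";

let $product :=
  <product dept="MEN">
    <number>784</number>
    <name language="en">Cotton Dress Shirt</name>
  </product>
return (
  count($product/*),                          (: 2: the dept attribute is not a child :)
  name($product/@dept/..),                    (: product: the parent of the attribute :)
  name($product/number/parent::*),            (: product: the parent of number :)
  $product/descendant::*/name(),              (: number, name :)
  name($product/number/following-sibling::*), (: name: the sibling of number :)
  count($product/@dept/following-sibling::node())  (: 0: attributes have no siblings :)
)
```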

Roots, documents, and elements

A lot of confusion surrounds the term root in XML processing, because it’s used to mean several different things. XML 1.0 uses the term root element to mean the top-level, outermost element in a document. Every well-formed XML document must have a single element at the top level. In Example 2-2, the root element is the catalog element.

XPath 1.0, by contrast, does not use the term root element and instead would call the catalog element the document element. XPath 1.0 has a separate concept of a root node, which is equivalent to a document node in XQuery (and later versions of XPath). A root node represents the entire document and would be the parent of the catalog element in our example.

This terminology made sense in XPath 1.0, where the input to a query was always expected to be a complete, well-formed XML document. However, the XQuery/XPath data model allows for inputs that are not complete documents. For example, the input might be a document fragment, a sequence of multiple elements, or even a sequence of processing instruction nodes. Therefore, the root is not one special kind of node; it could be one of several different kinds of nodes.

In order to avoid confusion, this book does not use either of the terms root element or document element. Instead, when referring to the top-level element, it uses the term outermost element. The term root is reserved for whatever node might be at the top of a hierarchy, which may be a document node (in the case of a complete XML document), or an element or other kind of node (in the case of a document fragment).
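The built-in root function makes the distinction visible. In this sketch, the fragment is constructed in the query and has no document node, while the product taken from catalog.xml belongs to a complete document.

```xquery
xquery version "3.1";

let $fragment := <catalog><product/></catalog>
return (
  (: for a fragment, the root is the outermost element itself :)
  root($fragment/product) is $fragment,
  root($fragment/product) instance of element(),
  (: for a complete document, the root is a document node :)
  root((doc("catalog.xml")//product)[1]) instance of document-node()
)
```

All three tests are expected to return true.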

Node identity and name

Every node has a unique identity. You may have two XML elements in the input document that have the same name and contain the exact same content, but that does not mean they have the same identity. Identity is unique to each node and is assigned by the query processor. You can test whether two nodes have the same identity using the is operator. It is also possible to retrieve a unique identifier for a node using the generate-id function.

In addition to their identity, element and attribute nodes have names. These names can be accessed using the built-in functions node-name, name, and local-name.
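A short sketch of identity versus equality, using two elements constructed in the query so that they have the same name and content but are distinct nodes:

```xquery
xquery version "3.1";

let $first  := <number>784</number>
let $second := <number>784</number>
return (
  $first is $first,                            (: true: the same node :)
  $first is $second,                           (: false: equal content, different identity :)
  deep-equal($first, $second),                 (: true: same name and content :)
  generate-id($first) = generate-id($second),  (: false: identifiers differ per node :)
  local-name($first)                           (: number :)
)
```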

String and typed values of nodes

There are two kinds of values for a node: string and typed. All nodes have a string value. The string value of an element node is its character data content and that of all its descendant elements concatenated together. If an element has no content, its string value is a zero-length string. The string value of an attribute node is simply the attribute value.

The string value of a node can be accessed using the string function. For example:

string(doc("catalog.xml")/catalog/product[4]/number)

returns the string 784, while:

string(<desc>Our <i>favorite</i> shirt!</desc>)

returns the string Our favorite shirt!, without the i start and end tags.

Element and attribute nodes also both have a typed value that takes into account their type, if any. An element or attribute might have a particular type if it has been validated with a schema. The typed value of a node can be accessed using the data function. For example:

data(doc("catalog.xml")/catalog/product[4]/number)

returns the integer 784, if the number element is declared in a schema to be an integer. If it is not declared in the schema, its typed value is still 784, but the value is considered to be untyped (meaning it does not have a specified type).
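The untyped case can be checked directly. This sketch uses an element constructed in the query, which has not been validated against any schema:

```xquery
xquery version "3.1";

let $number := <number>784</number>
return (
  string($number) instance of xs:string,       (: true: the string value :)
  data($number) instance of xs:untypedAtomic,  (: true: the typed value is untyped :)
  data($number) instance of xs:integer         (: false: no schema declared a type :)
)
```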

Atomic Values

An atomic value is a simple data value such as 784 or ACC, with no markup, and no association with any particular element or attribute. An atomic value can have a specific type, such as xs:integer or xs:string, or it can be untyped, meaning that it is assigned the generic type xs:untypedAtomic.

Atomic values can be extracted from element or attribute nodes using the string and data functions described in the previous section. They can also be created from literals in queries. For example, in the expression @dept = 'ACC', the string ACC is an atomic value. The result of the entire expression is also an atomic value; it is a Boolean true/false value.

The line between a node and an atomic value that it contains is often blurred. That is because all functions and operators that expect to have atomic values as their operands also accept nodes. For example, you can call the substring function as follows:

doc("catalog.xml")//product[4]/substring(name, 1, 15)

The function expects a string atomic value as the first argument, but you can pass it an element node (name). In this case, the atomic value is automatically extracted from the node in a process known as atomization.

Unlike nodes, atomic values don’t have identities. It’s not meaningful (or possible) to ask whether "abc" and "abc" are the same string or different strings; you can only ask whether they are equal.

Sequences

Sequences are ordered collections of items. A sequence can contain zero, one, or many items of any kind. For example, you could have a sequence of atomic values, or a sequence of nodes, or a sequence that contains both atomic values and nodes.

Most often, sequences are created by expressions or functions that return them. For example, the expression:

doc("catalog.xml")/catalog/product

returns a sequence of four items, which happen to be product element nodes.

A sequence can also be created explicitly using a sequence constructor. The syntax of a sequence constructor is a series of values, delimited by commas, surrounded by parentheses. For example, the expression (1, 2, 3) creates a sequence consisting of those three atomic values.

You can also use expressions in sequence constructors. For example, the expression:

(doc("catalog.xml")/catalog/product, 1, 2, 3)

results in a seven-item sequence containing the four product element nodes, plus the three atomic values 1, 2, and 3, in that order.

Sequences do not have names, although they may be bound to a named variable. For example, the let clause:

let $prodList := doc("catalog.xml")/catalog/product

binds the sequence of four product elements to the variable $prodList.

A sequence with only one item is known as a singleton sequence. There is no difference between a singleton sequence and the item it contains. Therefore, any of the functions or operators that can operate on sequences can also operate on individual items, which are treated as singleton sequences.

A sequence with zero items is known as the empty sequence. In XQuery, the empty sequence is different from a zero-length string (i.e., "") or a zero value. Many of the built-in functions and operations accept the empty sequence as an argument, and have defined behavior for handling it. Some expressions will return the empty sequence, such as doc("catalog.xml")//foo, if there are no foo elements in the document.

Sequences cannot be nested within other sequences; there is only one level of items. If a sequence constructor is inserted into another sequence constructor, the items created by the inserted sequence constructor become full-fledged items of the result sequence. For example:

(10, (20, 30), 40)

is equivalent to:

(10, 20, 30, 40)

Quite a few functions and operators in XQuery operate on sequences. Some of the most used functions on sequences are the aggregate functions (count, min, max, avg, and sum). In addition, the union, except, and intersect operators allow sequences of nodes to be combined. There are also a number of functions that operate generically on any sequence, such as index-of and insert-before.

Like atomic values, sequences have no identity. You can’t ask whether (1, 2, 3) and (1, 2, 3) are the same sequence; you can only compare their contents.
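Several of the points in this section can be verified with a single query. The values are arbitrary and chosen only to keep the results easy to check.

```xquery
xquery version "3.1";

(
  count((10, (20, 30), 40)),        (: 4: nested constructors are flattened :)
  count(()),                        (: 0: the empty sequence has no items :)
  count(""),                        (: 1: a zero-length string is still an item :)
  5 instance of xs:integer+,        (: true: an item is a singleton sequence :)
  sum((1, 2, 3)),                   (: 6 :)
  index-of((10, 20, 30), 20),       (: 2 :)
  insert-before((1, 2, 3), 2, 99)   (: 1, 99, 2, 3 :)
)
```

Because the outer parentheses are themselves a sequence constructor, the four items returned by insert-before become ordinary items of the overall result.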

Types

XQuery is a strongly typed language, meaning that each function and operator expects its arguments or operands to be of a particular type. This section provides some basic information about types that is useful to any query author. More detailed coverage of types in XQuery can be found in Chapter 11.

The XQuery type system is based on that of XML Schema. XML Schema has built-in simple types representing common datatypes such as xs:integer, xs:string, and xs:date. The xs prefix is used to indicate that these types are defined in the XML Schema specification. Types are assigned to items in the input document during schema validation, which is optional. If no schema is used, the items are untyped.

The type system of XQuery is not as rigid as it may sound because there are a number of type conversions that happen automatically. Most notably, the processor attempts to automatically cast untyped items to the type required by a particular operation. Casting involves converting a value from one type to another following specified rules. For example, the function call:

doc("order.xml")/order/substring(@num, 1, 4)

does not require that the num attribute be declared to be of type xs:string. If it is untyped, it is cast to xs:string. In fact, if you do not plan to use a schema, you can in many cases use XQuery without any regard for types. However, if you do use a schema and the num attribute is declared to be of type xs:integer, you cannot use the preceding substring example without explicitly converting the value of the num attribute to xs:string, as in:

doc("order.xml")/order/substring(xs:string(@num), 1, 4)

Namespaces

Namespaces are used to identify the vocabulary to which XML elements and attributes belong, and to disambiguate names from different vocabularies. This section provides a brief overview of the use of namespaces in XQuery for those who expect to be writing queries with basic use of namespaces. More detailed coverage of namespaces, including a complete explanation of the use of namespaces in XML documents, can be found in Chapter 10.

Many of the names used in a query are namespace-qualified, including those of:

  • Elements and attributes from an input document

  • Elements and attributes in the query results

  • Functions, variables, and types

Example 2-3 shows an input document that contains a namespace declaration, a special attribute whose name starts with xmlns. The prod prefix is bound to the namespace http://datypic.com/prod. This means that any element or attribute name in the document that is prefixed with prod is in that namespace.

Example 2-3. Input document with namespaces (prod_ns.xml)
<prod:product xmlns:prod="http://datypic.com/prod">
  <prod:number>563</prod:number>
  <prod:name language="en">Floppy Sun Hat</prod:name>
</prod:product>

Example 2-4 shows a query (and its results) that might be used to select the products from the input document.

Example 2-4. Querying with namespaces

Query

declare namespace prod = "http://datypic.com/prod";
for $prod in doc("prod_ns.xml")/prod:product
return $prod/prod:name

Results

<prod:name xmlns:prod="http://datypic.com/prod"
           language="en">Floppy Sun Hat</prod:name>

The namespace declaration that appears in the first line of the query binds the namespace http://datypic.com/prod to the prefix prod. The prod prefix is then used in the body of the query to refer to elements in the input document. The namespace (not the prefix) is considered to be a significant part of the name of an element or attribute, so the namespace URIs (if any) in the query and input document must match exactly. The prefixes themselves are technically irrelevant; they do not have to be the same in the input document and the query.
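To see that the prefix itself does not matter, the query from Example 2-4 can be sketched with a different prefix. The prefix p is arbitrary; only the namespace URI has to match the one in prod_ns.xml.

```xquery
xquery version "3.1";

(: p is bound to the same namespace URI that the input document binds to prod :)
declare namespace p = "http://datypic.com/prod";

for $product in doc("prod_ns.xml")/p:product
return $product/p:name
```

This selects the same name element as Example 2-4, because p:name and prod:name resolve to the same namespace-qualified name.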
