Chapter 1. A Tour of XQuery

Introduction

XQuery 1.0 is a concise but flexible query language for XML. XQuery is the product of many years of work by individuals and companies from around the world. Actively developed by the World Wide Web Consortium (W3C) XML Query and XSL Working Groups from the first W3C workshop on Query Languages in 1998 through today, XQuery contains ideas that are both decades old and brand-new.

XQuery is mainly intended for use with XML data, but is also finding uses as a language for data integration, even with data sources that may not be XML but can be viewed through an XML lens. Time will tell how successful XQuery will become, but the stage is set for it to become for XML what Structured Query Language (SQL) has become for relational data.

At the time of this writing, XQuery 1.0 is still in the draft stage, at Last Call. Consequently, some aspects of the language will change between now and the final Recommendation (when it becomes a standard). As much as possible, anticipated changes are noted throughout this book, and also at the book's Web site at http://www.qbrundage.com/xquery.

Also, as can be expected with a new language still under development, some parts of XQuery will strike you as a little rough or overly complex. In such cases (such as the data model and type system described in the next chapter), the design rationale is explained and numerous examples are provided to ease the learning process.

Getting Started

The many examples in this book are based on the current draft specifications. To run these examples, you need an XQuery implementation; several are listed on this book's Web site.

The XQuery standard gives implementations discretion in how they implement many of its features. Consequently, no two XQuery implementations are exactly the same. As much as possible, I have noted these potential differences throughout the book. However, I do not explain how to use any particular XQuery implementation in this book—please consult the implementation's documentation for instructions and differences between it and the official standard.

Notational Conventions

A few words about the notation used in this book: Important words and phrases are italicized when first introduced. Examples are always set off from the main text using a fixed width font, and sometimes appear on separate lines like Listing 1.1.

Example 1.1. This is a sample listing

this is the example

In some cases, it helps to see not only the example but also what the expected result of executing it should be. A => symbol, as shown in Listing 1.2, separates examples from their results.

Example 1.2. This is a sample listing with its evaluated result

example expression
=>
its result.

another example => another result

XML and XQuery are intricately tied to the Unicode character set (described in Chapter 8). To avoid confusion when describing some Unicode characters, I use the notation U+NNNN where NNNN is the corresponding hexadecimal number for that character (usually NNNN is a four-digit number). For example, U+003F is the question mark (?).

Why XQuery?

At some point, you should stop to wonder: Why use XQuery? Why not use some other XML query language, like XPath or XSLT? For that matter, why use a query language at all? Why not work with the XML data structures directly using an existing programming language and some XML API? These are important questions, so let's address them before diving into the technical features of the XQuery language.

Query Languages Versus Programming Languages

First, why use a new language that is specific to a particular domain (like querying XML) instead of an existing, general-purpose programming language? There are two main reasons: ease-of-use and performance.

When it comes to ease-of-use, existing programming languages have some obvious advantages: They usually offer expressive power, allowing complex ideas to be expressed in a few lines of code. They also leverage your existing knowledge, so you can be productive right away.

In contrast, domain-specific languages let you work with domain concepts (like XML) directly. For example, most general-purpose programming languages treat XML as any other API, instead of as a first-class part of the language. Instead of providing operators for constructing and navigating XML directly, you have to access it through an API layer. Just as text manipulation is easier in Perl than in, say, Fortran, so a single line of an XML query language like XSLT or XQuery can accomplish the equivalent of hundreds of lines of C, C#, Java, or some other general-purpose language.

As far as performance goes, there are three reasons that domain languages usually outperform general-purpose languages. One is that domain languages are usually optimized for tasks common to that domain. General-purpose programming languages have to perform well on a wide range of tasks, while XML query languages only have to perform well on a narrow set of common XML tasks. This focus may limit their applicability, but often yields superior performance.

Another is that the general-purpose language will use an API of some sort for working with that domain. The abstraction layer provided by that API almost always hides the internal data structures from software using the API. Information hiding is a great benefit to program design, but can introduce overhead and by definition prevents you from manipulating the underlying data directly. In contrast, a query language is the abstraction. When the query is executed, it has full access to all aspects of its own internal data structures.

However, the main reason XML query languages can outperform general-purpose programming languages is that they are less constrained. A programming language usually has to do exactly what you tell it to do, in the order you specified. Without extensive analysis, temporary intermediate results must be computed exactly as described. For example, using a C++ matrix library to multiply two matrices and then extract the value of a single entry usually computes the entire matrix product, even though only a single value was required.

In a query language, every temporary intermediate result is unimportant. As long as the query produces the correct, final “answer,” how it computes that answer is irrelevant. Maybe it looks it up in a cache (because this query has been answered recently) or maybe it uses a new algorithm discovered yesterday. The program you write using the query language is unaffected; it automatically benefits from whatever advances take place in the underlying implementation.

This difference is often summarized by saying that query languages are declarative (stating what you want), while programming languages are procedural (stating how you want it done). The difference is subtle, but significant.

Of course, these reasons are all generalizations, and therefore break down at a certain point. Sometimes general-purpose programming languages find ways to express the same features offered by domain languages, without complicating the language too much and without significant loss in performance. Sometimes domain languages are very poorly implemented, resulting in performance far worse than manually traversing the structures yourself.

However, the point is clear: Query languages enjoy certain advantages over traditional, general-purpose programming languages. These advantages are illustrated in the success of SQL for accessing relational data, and the successes of many other domain-specific “little languages,” from regular expressions to shell scripts to graphics libraries.

XQuery Versus XPath, XSLT, and SQL

Returning to the original question, you may ask: Why XQuery instead of an existing query language like XPath or XSLT (or SQL)? (If you're not familiar with these languages, feel free to skip ahead to the next section.)

This question frames the choice incorrectly. XPath 1.0, XSLT 1.0, and SQL are great query languages, and XQuery does not replace them for every task. Each of these languages is useful in different situations.

XQuery is another tool in the XML developer's workshop, not the only tool. So the question becomes: When should you use XQuery instead of XPath, XSLT, or SQL?

XPath 1.0 introduced a convenient syntax for addressing parts of an XML document. If you need to select a node out of an existing XML document or database, XPath is the perfect choice, and XQuery doesn't change that.

However, XPath 1.0 wasn't designed for any other purpose. XPath can't create new XML, it can't select only part of an XML node (say, just the tag omitting attributes and content), and—because of its conciseness—it can be hard to read and understand. XPath 1.0 also can't introduce variables or namespace bindings—although it does use them—and it has a very simple type system, essentially just string, boolean, double, and nodeset (a sequence of nodes in document order). If you need to work with date values, calculate the maximum of a set of numbers, or sort a list of strings, then XPath just can't do it.
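For contrast, the three tasks just mentioned are each a line or two in XQuery. The following is a minimal sketch, assuming the function and type names of the XQuery 1.0 function library (max and xs:date); an implementation of an earlier draft may differ.

```xquery
(: Maximum of a set of numbers :)
max((3, 17, 5)),
(: Comparing two date values :)
xs:date("2004-01-31") lt xs:date("2004-02-01"),
(: Sorting a list of strings :)
for $s in ("pear", "apple", "fig")
order by $s
return $s
(: => 17, true, "apple", "fig", "pear" :)
```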

XSLT 1.0 (which was developed at the same time as XPath) takes XML querying a step further, including XPath 1.0 as a subset to address parts of an XML document and then adding many other features. XSLT is fantastic for recursively processing an XML document or translating XML into HTML and text. XSLT can create new XML or copy (part of) existing nodes, and it can introduce variables and namespaces.

Some people say that XSLT 1.0 can't be optimized and isn't strongly-typed. However, both of these assertions are false: Every expression in XSLT has a compile-time type, and XSLT can certainly be optimized. XSLT 1.0 does have a small type system, and many implementations naively execute XSLT as written without any optimizations, so it's easy to see how these misconceptions came to exist.

However, XSLT 1.0 still has a few drawbacks, some of which could be easily corrected (and will be in the future XSLT 2.0 standard), and others that cannot be addressed without effectively creating a language like XQuery.

XSLT does not work with sequences of values (only sequences of nodes), and user-defined functions, joins, and other common operations can be awkward and difficult to write. A sorted outer join, grouping the result into pairs of adjacent nodes, can be expressed using XSLT, but most users won't ever figure out how to do it.

XML Schema 1.0 didn't exist when XSLT 1.0 was invented, so XSLT uses a different type system and has no operators for validation or other schema interactions.

XSLT 1.0 also uses an XML syntax, which is both a strength and a weakness. This is a strength because it goes “meta”—XSLT can process itself. However, XML is also very verbose compared to plain text, and authoring an XML document is overkill for simple processing tasks.

Finally, XSLT 1.0 encourages and often requires users to solve problems in unnatural ways. XSLT is inherently recursive, but most programmers today think procedurally; we think of calling functions directly ourselves, not having functions called for us in an event-driven fashion whenever a match occurs. Many people write large XSLT queries using only a single <xsl:template> rule, apparently unaware that XSLT's recursive matching capabilities would cut their query size in half and make it much easier to maintain.

XQuery takes a different approach from XSLT 1.0, with more functionality but similar results. XQuery is especially great at expressing joins and sorts. XQuery can manipulate sequences of values and nodes in arbitrary order, not just document order. XQuery takes a procedural approach to query processing, putting users in the driver's seat and making it easy to write user-defined functions, including recursive ones, but more difficult to perform pattern matching. Support for XML Schema 1.0 is built into XQuery, and XQuery was designed with optimization in mind.
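A small sketch may make the claim about joins and sorts concrete. The element names, attribute names, and data below are invented for illustration. The query joins two sequences on a key and returns the result sorted by title rather than in input order.

```xquery
let $authors := (<author id="a1">Ann</author>,
                 <author id="a2">Bob</author>)
let $books := (<book author="a2">Zebra</book>,
               <book author="a1">Apple</book>,
               <book author="a2">Mango</book>)
for $b in $books, $a in $authors
where $b/@author = $a/@id       (: the join condition :)
order by string($b)             (: sort by title, not by input order :)
return <entry>{ string($b) } by { string($a) }</entry>
(: =>
   <entry>Apple by Ann</entry>
   <entry>Mango by Bob</entry>
   <entry>Zebra by Bob</entry>
:)
```

The join and the sort are both stated as conditions on the result. The implementation is free to choose how to evaluate them.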

XQuery also supports a really important feature that was purposely disabled in XSLT 1.0, something commonly known as composition. Composition allows users to construct temporary XML results in the middle of a query, and then navigate into them. This is such an important feature that many vendors added extension functions such as nodeset() to XSLT 1.0 to support it anyway; XQuery makes it a first-class operation.
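As a minimal sketch of composition, with an invented order/item structure, the query below builds a temporary element, binds it to a variable, and then navigates into it with an ordinary path expression.

```xquery
(: Construct a temporary XML value in the middle of the query... :)
let $tmp := <order><item price="3"/><item price="4"/></order>
(: ...and then navigate into it like any other XML. :)
return count($tmp/item[@price > 3])
(: => 1 :)
```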

XSLT is still stronger than XQuery at certain tasks. XQuery is focused on generating XML instead of HTML and text, although it is capable of generating them. Compared to XSLT, XQuery's type system is much larger and more complex. And of course, XQuery is new, so XQuery implementations are less mature than XSLT ones.

SQL is a relational query language for databases, and several products and standards efforts extend it to handle XML. Although not designed to be an XML query language, SQL increasingly is finding use as one.

I mention SQL because XQuery has similarities to it in style and syntax, and both can be used to query databases. The biggest difference between XQuery and SQL is that SQL focuses on unordered sets of “flat” rows, while XQuery focuses on ordered sequences of values and hierarchical nodes.

Documents and Databases

You may also question whether XQuery is right for your application. Whether you're working with documents or databases, you may wonder whether XQuery is designed primarily for the other.

Some years ago, there were documents, databases, and programs. Documents were units of semi-structured data (mostly unstructured text with some structural “markup” signifying logical parts of the document) usually stored as standalone files. Databases were self-contained storage systems with highly structured data organized into rigid tables with typed columns. Programs were source code stored in flat text files and parsed on demand to extract structure and meaning.

Each of these had its own community happily evolving independently of the others, with different tools and terminologies. The document community had word processors, document storage systems, scripting languages, and search engines; the database community had query builders, B-trees, indices, query languages, and query processors; the programming community had editors, source control systems, programming languages, and compilers.

And then XML happened.

In many ways, XML has brought these communities together. The increased contact has led many to realize that the challenges in their field are identical in essence to challenges in the other fields. However, each community has independently discovered very different ways of solving the same problem. For example, few database systems use just-in-time compilation and code generation techniques that are today common in compiler implementations; conversely, few programming languages use indices and cost-based optimizers that are common in databases.

Perhaps the most electrifying effects of XML have occurred between the database and document communities. The difficulties involved with storing and querying semi-structured documents have gained new prominence among database vendors; at the same time, the document community increasingly treats documents as “views” over structured data that can be queried and transformed into other representations.

XQuery was designed by people from all three communities to solve common document, database, and programming tasks. In places, XQuery has made compromises to one community or the other; for example, XQuery uses ordered sequences, regular expressions, recursion, and untyped data (document-centric features that are difficult for databases), but also supports joins and transformations, strongly-typed data, and allows expression evaluations to be reordered or skipped entirely through optimization (database-centric features that are difficult for document processors).

XQuery is a query language for both XML documents and databases. It makes no distinction between the two; to XQuery, it's all XML.

Typed and Untyped Data

XQuery can handle both ordinary XML and XML associated with an XML schema. These are colloquially known as untyped XML and typed XML, respectively.

If all your XML data is untyped, you may think that you don't need to understand XML Schema, and you're partially correct. However, you do need to understand the XQuery type system, which plays a major role.

XQuery uses XML Schema 1.0 as the basis for its type system. Consequently, these two standards share some terminology and definitions. XQuery also provides some operators such as import schema and validate to support working with XML schemas.

XML Schema Redux

If you are already familiar with XML Schema 1.0, then you have a head start on some of the more challenging aspects of XQuery, such as its type system.

Otherwise, this book teaches only the subset required to use XQuery effectively. All you really need to know about XML Schema 1.0 for the purposes of XQuery is that it is an XML standard aimed at describing how to validate XML according to certain rules. For example, an attribute value might be required to be an integer, or an element might be required to contain certain subelements in a particular order. The set of validation rules that apply to a particular element or attribute constitutes its schema type. The schema type is often given a qualified name so it can be referred to elsewhere.

When an XML document satisfies all of the schema rules associated with it, then it is valid; otherwise, it is invalid. XML used by XQuery is always well-formed, and every implementation supports untyped XML. Some implementations may also support XML data that is typed and valid, or even typed but invalid.

In XML Schema 1.0, some types are built in to the standard itself, and users can also define their own types. Types are either primitive (like float and string) or else derived from other types. (If you are already familiar with XML Schema, then you know it actually has several different kinds of derivation. The only one that is used in the XQuery type system is derivation by restriction. Throughout this book, “derived” means “derived by restriction.”)

The XQuery type system consists of all of the built-in XML Schema types, plus seven XML node kinds, plus six more types that are new to XQuery. That's right; XQuery defines 59 built-in types (compared to 9 types in Java). Don't panic—you only need to understand a few of these types to use XQuery effectively.

The Types You Need

Every XQuery expression has a static type (compile-time) and a dynamic type (run-time). The dynamic type applies to the actual value that results when the expression is evaluated; the value is an instance of that dynamic type. The static type applies to the expression itself, and can be used to perform type checking during compilation. All XQuery implementations perform dynamic type checking, but only some perform static type checking.

Figure 1.1 depicts the XQuery types you need to know. Arrows show inheritance, and dotted lines indicate that some types in between have been omitted.

Figure 1.1. Part of the XQuery type system

Every XQuery value is a sequence containing zero or more items. Each individual item in a sequence is a singleton, and is the same as a sequence of length one containing just that item. Consequently, sequences are never nested.
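The flattening rule is easy to check with a few expressions. This sketch assumes the count and deep-equal functions of the XQuery 1.0 function library.

```xquery
count((1, (2, 3), (), 4)),             (: => 4: inner parentheses add no nesting :)
count(()),                             (: => 0: the empty sequence :)
deep-equal((1, (2, 3)), (1, 2, 3))     (: => true: both are the same three-item sequence :)
```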

Every singleton item in XQuery has a type derived from item(). The item() type is similar to the object type in Java and C#, except that it is abstract: you can't create an instance of item(). (It's written with parentheses in part to avoid confusion with user-defined types with the same name and in part to be consistent with the XPath node tests.)

As shown in Figure 1.1, items are classified into two kinds: XML nodes and atomic values. Nodes derive from the type node(), and atomic values derive from xdt:anyAtomicType. Like item(), the node() and xdt:anyAtomicType types are abstract.
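The instance of operator can be used to test these classifications directly. This is a sketch only. It avoids the xdt prefix because that namespace is tied to the draft, and it assumes the element() kind test is available in your implementation.

```xquery
(42 instance of item(),            (: => true :)
 <a/> instance of node(),          (: => true :)
 42 instance of node(),            (: => false: 42 is an atomic value :)
 <a/> instance of element(),       (: => true: element is one of the seven node kinds :)
 (1, <a/>) instance of item()+)    (: => true: a sequence may mix both kinds :)
```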

You are probably already familiar with the seven XML node kinds depicted in Figure 1.1. XQuery provides several ways to create instances of XML nodes, which we'll explore later in this chapter and in Chapter 7.

Finally, there are 50 kinds of atomic types. Of these, you really only need to know fourteen—ten from XML Schema and four new ones (including the xdt:anyAtomicType) added by XQuery. The types you need to know are depicted in Figure 1.1. We explain them all in Chapter 2; some of their meanings are clear already from their names. Many examples of these types occur throughout the rest of the book.

All of the atomic type names are in one of two namespaces: The XML Schema type names are in the XML Schema namespace http://www.w3.org/2001/XMLSchema, which is bound to the prefix xs. The XQuery type names are in the XQuery type namespace http://www.w3.org/2003/11/xpath-datatypes, which is bound to the prefix xdt. These prefixes are built in to XQuery, and we'll use them throughout the book. (The namespaces are versioned to the current draft. The values given here correspond to the Last Call drafts available at the time of publication.)

Finally, it's worth mentioning that every user-defined type derives from one of the XML Schema built-in atomic types. The four XQuery atomic types, the seven node kinds, item(), and node() do not allow user derivation. In particular, it isn't possible to create your own node kind, although it is possible to create structural (complex) types using XML Schema and the import schema operator (see Chapter 9). Not all implementations support user-defined types and schema import.

The Types You Don't Need

While Figure 1.1 illustrates the types that you do need, Table 1.1 lists the 36 XQuery types from XML Schema that you don't need.

Many of these types exist in XQuery only because they existed in XML Schema, and are less “types” (in the traditional sense) than just validation rules. Each of these types does serve a purpose, but these purposes are often esoteric, like xs:NOTATION, or highly specialized, like xs:language, and so it's unlikely you'll ever need them (unless you already know what they do). For complete information about these lesser-used types, see Appendix A.

Table 1.1. XML Schema types you don't need

▪ xs:anySimpleType

▪ xs:anyType

▪ xs:anyURI

▪ xs:base64Binary

▪ xs:byte

▪ xs:duration

▪ xs:ENTITIES

▪ xs:ENTITY

▪ xs:gDay

▪ xs:gMonth

▪ xs:gMonthDay

▪ xs:gYear

▪ xs:gYearMonth

▪ xs:hexBinary

▪ xs:ID

▪ xs:IDREF

▪ xs:IDREFS

▪ xs:int

▪ xs:language

▪ xs:long

▪ xs:Name

▪ xs:NCName

▪ xs:negativeInteger

▪ xs:NMTOKEN

▪ xs:NMTOKENS

▪ xs:nonNegativeInteger

▪ xs:nonPositiveInteger

▪ xs:normalizedString

▪ xs:NOTATION

▪ xs:positiveInteger

▪ xs:short

▪ xs:token

▪ xs:unsignedByte

▪ xs:unsignedInt

▪ xs:unsignedLong

▪ xs:unsignedShort

A Sample Query

Figure 1.2 presents a prototypical XQuery for your reading pleasure. Each query consists of a prolog and/or a body. The prolog, if any, sets up the compile-time environment (schema and module imports, namespace declarations, user-defined functions, and so on). The body, if any, is evaluated to produce the value of the overall query.

Figure 1.2. Anatomy of an XQuery

The rest of this chapter highlights the various kinds of expressions that can occur in the prolog and body of an XQuery, and the remaining chapters drill into the details.

Processing Model

Every XQuery expression evaluates to a sequence (a single item is equivalent to a sequence of length one containing that item). Items in a sequence can be atomic values or nodes. Collectively, these make up the XQuery Data Model, described in Chapter 2.

XQuery is primarily designed as a strongly-typed language, meaning that every expression has a compile-time type, and must be combined with other expressions that have compatible types. When types are incompatible, the error can be detected and reported early (at compile-time). However, XQuery allows implementations to perform dynamic type checking instead, in which these errors are reported only during execution.

To understand the difference, consider the expression in Listing 1.3. If an implementation supports static typing, then this query will produce a compile-time error, because the if expression might evaluate to a string and strings cannot be added to integers. In implementations that use dynamic typing, the result depends on the value of the variable $foo. If it is true, then the query will succeed (because if evaluates to an integer); if it is false, then it will raise a dynamic type error.

Example 1.3. Static typing versus dynamic typing

13 + (if ($foo) then 30 else "0")
=>
43 or a type error, depending on the implementation and $foo

This book mostly describes the type rules in general, without respect to when they are applied. Consult your XQuery implementation's documentation to determine what kind of type checking it performs.

Next, let's turn our attention to the various kinds of XQuery expressions.

Comments and Whitespace

In XQuery, the whitespace characters are space (U+0020), tab (U+0009), carriage return (U+000D), and new line (U+000A). XQuery allows descriptive comments to appear anywhere that whitespace characters are allowed and ignored—which is almost everywhere. Their only purpose is to make code easier for humans to read.

XQuery comments begin with the two characters (: and end with the two characters :), as shown in Listing 1.4. Note that in places where whitespace characters are not ignored (such as string constants or direct XML constructors), comments are not ignored either but instead are treated as ordinary text.

Example 1.4. Comments spice up any XQuery

(: You are here. :)
let $i := 42 (: This is also a comment. :)
return <x>(: This is not a comment. :)</x>
=>
<x>(: This is not a comment. :)</x>

Prolog

As mentioned earlier, every query begins with an optional section called the prolog. The prolog sets up the compile-time context for the rest of the query, including things like default namespaces, in-scope namespaces, user-defined functions, imported schema types, and even external variables and functions (if the implementation supports them). Chapter 5 explains all of these expressions. Each prolog statement must end with a semicolon (;).

For example, the query prolog in Listing 1.5 declares a namespace, and then the body of the query uses it.

Example 1.5. Query prolog sets up static context

declare namespace x = "http://www.awprofessional.com/";
<x:foo/>

The query prolog also can be used to define global variables and to create user-defined functions for use in the rest of the query. The sample query shown previously in Figure 1.2 defines a recursive function, my:fact(), that computes the factorial of an integer, and then defines a global variable, $my:ten, that uses it.

Each function definition starts with the keywords declare function, followed by the name of the function, the names of its parameters (if any) and optionally their types, optionally the return type of the function, and finally the body of the function (enclosed in curly braces). Figure 1.3 illustrates all of these parts, using the same example as in Figure 1.2 but with types added.

Figure 1.3. A prototypical user-defined function
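The figures are not reproduced here, so the following is a reconstruction from the description above, not the original figure. It is a sketch under stated assumptions: the namespace URI bound to the my prefix is hypothetical, and the declare variable syntax follows the XQuery 1.0 Recommendation, which may differ from the draft syntax your implementation accepts.

```xquery
(: Prolog: a namespace, a typed recursive function, and a global variable. :)
declare namespace my = "urn:example:my";   (: hypothetical namespace URI :)

declare function my:fact($n as xs:integer) as xs:integer {
  if ($n le 1) then 1
  else $n * my:fact($n - 1)
};

declare variable $my:ten as xs:integer := my:fact(10);

(: Body: evaluated to produce the value of the overall query. :)
<fact n="10">{ $my:ten }</fact>
(: => <fact n="10">3628800</fact> :)
```

Every prolog declaration ends with a semicolon, and the body follows the last one.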

Queries may be divided into separate modules. Each module is a self-contained unit, analogous to a file containing code. Modules are most commonly used to define function libraries, which can then be shared by many queries using the import module statement in the prolog. Modules and user-defined functions are fully explained in Chapter 4. Note that not every implementation supports modules.

Constants

The constant value you will encounter most frequently is the empty sequence, written as left and right parentheses: (). Naturally enough, it denotes a sequence of length zero. XML constants are also very common, but are described later in Section 1.12. This section discusses constant atomic values such as booleans, strings, and integers.

Boolean Constants

Boolean constants are written as functions, true() and false(), mainly because that's how they were handled in XPath 1.0. These represent the two boolean values true and false, respectively. The type of a boolean constant is xs:boolean.

String Constants

String constants may be written using either single- or double-quotes, such as "hello" and 'world'. The choice makes no difference in meaning. In XQuery, string values are always sequences of Unicode code points (see Chapter 8), and may be the empty string. The type of a string constant is xs:string.

Escape the quote character by doubling it; for example, '''' is a string containing one apostrophe character. Also, string constants may contain the five built-in XML entity references (&amp;, &apos;, &gt;, &lt;, and &quot;) or XML character references (such as &#x20; for a space or &#xB1; for ±), as shown in Listing 1.6.
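Quote doubling is easy to misread, so here is a short sketch comparing the two quoting styles. Each comparison pairs a string written with doubled quotes against the same string written with the other delimiter.

```xquery
('It''s' = "It's",                          (: => true :)
 "Say ""hi""" = 'Say "hi"',                 (: => true :)
 string-length(''''),                       (: => 1: a single apostrophe :)
 "&lt;&amp;&gt;" = '&#x3C;&#x26;&#x3E;')    (: => true: entity and character references agree :)
```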

Example 1.6. String constants may contain entity and character references

"&lt; &amp; &gt; are special characters in XML"
=>
"< & > are special characters in XML"

Numeric Constants

There are three kinds of numeric constants: integers, floating-point numbers, and fixed-point numbers.

Integer constants are written as a sequence of digits (0-9) with no decimal point, for example 42. The type of an integer constant is xs:integer. XQuery defines unary - and + operators that can be used to negate the integer or emphasize that it is positive, respectively. These operators can also be used with the other number types below.

Decimal constants (that is, fixed-point numbers) are written using a sequence of digits with a decimal point anywhere in the number, such as 42. or 4.2 or .42. The type of a decimal constant is xs:decimal. Decimal numbers are commonly used to represent quantities with a fixed number of decimal places, such as monetary values.

Double values (that is, double-precision floating-point numbers) are decimal or integer numbers with exponential notation after the number, such as 42E0 or 4.2e+0 or 42E-2. The type of a double constant is xs:double.
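The form of a numeric constant alone determines its type, which a few instance of tests can confirm. This is a sketch; the fourth line relies on the XML Schema rule that xs:integer is derived from xs:decimal.

```xquery
(42 instance of xs:integer,       (: => true :)
 4.2 instance of xs:decimal,      (: => true :)
 42E0 instance of xs:double,      (: => true :)
 42 instance of xs:decimal,       (: => true: xs:integer derives from xs:decimal :)
 4.2 instance of xs:double,       (: => false: no exponent, so it is a decimal :)
 (-42) instance of xs:integer)    (: => true: unary minus keeps the type :)
```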

XML Schema defines the xs:decimal and xs:integer types to have arbitrary precision (allowing any number of digits), but XQuery allows implementations to use limited-precision types for efficiency in computations. Consequently, their behavior varies from one implementation to the next.

Some implementations provide arbitrary-precision integers and decimals. However, decimal is more commonly implemented using either 128 or 64 bits and integer is commonly implemented using 64 or 32 bits. The documentation for your XQuery implementation should clearly state what the implementation does.

Other Constants

Finally, a constant value of any type—the ones already described, the other built-in types, and user-defined types—can be constructed by writing the type name followed by a parenthesized string containing the value. For example, the expression xs:float("1.25") constructs an xs:float constant with the value 1.25 and xs:ID("X1") constructs an xs:ID constant with the value X1.

These expressions are known as type constructors because they construct a value of a given atomic type. The type constructors use the XML Schema validation rules for that type. For example, xs:boolean("1") and xs:boolean("true") result in the boolean constant true, exactly like the expression true(). However, xs:boolean("wahr") and xs:boolean("vrai") are errors, even on German and French systems, because wahr and vrai are not boolean representations accepted by XML Schema (and, therefore, are not accepted by XQuery either).
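The constructor rules above can be sketched as follows. The last line uses the castable as operator, which is assumed here to be supported by your implementation, to test a lexical form without raising the error.

```xquery
(xs:boolean("1"),                   (: => true :)
 xs:boolean("true") = true(),       (: => true :)
 xs:float("1.25") * 2,              (: => 2.5 :)
 "wahr" castable as xs:boolean)     (: => false: xs:boolean("wahr") would raise an error :)
```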

You can find the validation rules for every built-in atomic type in Appendix A.

XML

Although XQuery can be used to compute simple atomic values, more often it is used to produce XML output. This section briefly touches on the various ways to construct XML in a query (see Chapter 7 for more information).

All seven XML node-kinds are supported in XQuery. XML syntax can be used verbatim as XQuery expressions, as shown in Listing 1.7. These expressions are called node constructors because they construct XML nodes.

As you'll see in a moment, it's also possible to load XML from outside the query, or to compute parts of an XML structure using embedded XQuery expressions.

Example 1.7. Constructing XML in XQuery is a snap

<hello world="this is" xmlns="http://www.awprofessional.com/">
  XQuery!!
  <!-- the last language you'll ever need -->
  <?or maybe not?>
  <![CDATA[Even CDATA sections are allowed]]>
</hello>

By leveraging the XML syntax with which you are already familiar, XQuery makes it simple to create XML values. However, the greatest flexibility comes in creating dynamic XML values that combine constant parts with parts that are computed by XQuery expressions.

To allow XQuery expressions to compute part or all of a node's content, XQuery reserves the two curly-brace characters ({}) in element and attribute constructors. The curly braces enclose an XQuery expression to be evaluated; the results of the XQuery are inserted into the XML structure at that point. Listing 1.8 demonstrates XQuery expressions in both attribute and element content.

Example 1.8. XQuery expressions may be embedded in XML constructors

<x y="6*7 = {6*7}">
It is { true() or false() } that this is an example.
</x>
=>
<x y="6*7 = 42">It is true that this is an example.</x>

To use the curly braces as ordinary characters in an XML constant, they must be escaped by doubling them ({{ and }}) as shown in Listing 1.9 or by using character references. XQuery supports hexadecimal and decimal character references (such as &#xA0; and &#160;), as well as the five built-in named entity references (such as &amp;) from XML.

Example 1.9. Curly braces may be escaped by doubling them

<add>
  {{ 1 + 1  = { 1+1 }}}
</add>
=>
<add>{ 1 + 1 = 2 }</add>

Only some kinds of nodes can have computed content. XML comments and processing instructions are always constants in XQuery, and are written using the usual XML syntax <!--comment--> and <?processing instruction?>, respectively. Curly braces in them are treated as ordinary characters instead of as expressions.
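
A small sketch of the difference, using a made-up note element: the braces inside the comment are copied through unchanged, while the same braces in element content are evaluated.

```xquery
<note><!-- { 1 + 1 } is not evaluated here -->{ 1 + 1 }</note>
=>
<note><!-- { 1 + 1 } is not evaluated here -->2</note>
```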

In addition to the usual XML syntax, XQuery provides an alternate keyword-style syntax for creating nodes. This alternate syntax allows not only the content but also the node name to be computed (for elements and attributes). Otherwise, computed constructors have the same effect as direct XML syntax; it's a matter of personal choice which you use.

Element nodes can be constructed using the usual XML syntax shown in Listing 1.7 or using an alternate syntax, shown in Listing 1.10, that allows their names and content to be computed by XQuery expressions.

Example 1.10. Element nodes can be constructed using an alternate syntax

element { "any-name" } { "any content" }
=>
<any-name>any content</any-name>
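
The name in Listing 1.10 is a constant string, but it can be any expression that produces a valid name. As a minimal sketch (the item- prefix is just an illustrative choice), both the name and the content below are computed:

```xquery
element { concat("item-", string(1 + 1)) } { 6 * 7 }
=>
<item-2>42</item-2>
```

This is something the direct XML syntax cannot express, because there the element name must be written literally.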

Document nodes can be constructed only using this alternate syntax. Attribute and text nodes can be constructed using the alternate syntax, or can be constructed inside of XML elements as usual. Finally, namespace nodes are constructed only in element constants. All of these are shown in Listing 1.11.

Example 1.11. Other computed constructors

document {
   element foo {
      attribute bar { 1 + 1 }
      text { "baz" }
      <x xmlns='urn:x'>Ordinary XML can be
      intermixed with the alternate syntax</x>
   }
}
=>
<foo bar="2">baz<x xmlns='urn:x'>Ordinary XML can be
      intermixed with the alternate syntax</x></foo>

An XQuery expression in a node constructor can result in the empty sequence, in which case it contributes nothing to the content, or it can result in a sequence containing more than one item. The rules in this case are somewhat complicated, but in general the values are separated by spaces. These effects are demonstrated in Listing 1.12.

Example 1.12. Sequence content is flattened before inserting into XML

<x y="{ () }">{ (1, 2) }</x>
=>
<x y="">1 2</x>
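
As a further sketch of the same rules, the following query puts a two-item sequence in an attribute and mixes atomic values with an element node in the content. The adjacent atomic values are joined with a single space, while the node is inserted as a child.

```xquery
<x y="{ (1, 2) }">{ ("a", "b"), <c/> }</x>
=>
<x y="1 2">a b<c/></x>
```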

Constructing XML turns out to be a very intricate process, in part because there are so many special cases in XML and XQuery, like how whitespace characters are handled and how XQuery values are represented as XML. For a complete explanation of all the rules, see Chapter 7.

Built-in Functions

Not counting the type constructors mentioned in Section 1.11.4, XQuery defines 110 built-in functions, counted by name. Both built-in and user-defined functions are invoked by writing the function name followed by zero or more argument values in parentheses, like function(param1, param2). Function invocations are matched to function definitions by the function name and the number of arguments, which together are called the function signature.

For example, the number of items in a sequence can be counted using the count() function. A subsequence of several items can be selected using the subsequence() function. Sequences use 1-based indexing, so the first item in the sequence occurs at position 1. See Listing 1.13 for examples.

Example 1.13. Some of the built-in sequence functions

count(("a", 2, "c"))              => 3
subsequence((-5,4,-3,2,-1), 3)    => (-3, 2, -1)
subsequence((-5,4,-3,2,-1), 2, 3) => (4, -3, 2)

All of the built-in functions (except type constructors) belong to the namespace http://www.w3.org/2003/11/xpath-functions, which is bound to the prefix fn. This is also the default namespace for functions, which means that unqualified function names are matched against the built-in functions. For example, true() is the same as fn:true(), provided that you haven't changed the default function namespace or the namespace binding for fn. I generally omit the built-in prefix in this book.
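
For instance, assuming the default function namespace and the fn prefix binding have not been changed, the qualified and unqualified forms below invoke the same built-in functions:

```xquery
fn:count((1, 2, 3))           => 3
count((1, 2, 3))              => 3
fn:subsequence((1, 2, 3), 2)  => (2, 3)
```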

Appendix C lists all of the built-in functions, sorted alphabetically for convenient reference, and you will encounter many of them throughout the following chapters.

Operators

Sequences of values are written with commas separating items in the sequence, as shown in Listing 1.14.

Example 1.14. Comma creates a sequence of expressions

1, "fish"

Because the comma operator has the lowest precedence of all XQuery operators, sequences usually need to be enclosed in parentheses. In fact, parentheses can be used to group all kinds of expressions together, as shown in Listing 1.15.

Example 1.15. Parentheses are used around sequences or to group expressions

()           => ()
(1, 2)       => (1, 2)
1 + 2 * 3    => 7
(1 + 2)*3    => 9

XQuery defines a rich assortment of other operators and expressions. In this chapter we scratch the surface of a few of them; for complete details, see Chapter 5 and Appendix B.

Some operators are unary prefix (such as - and +) or unary postfix (such as []), meaning they take a single operand and appear before or after it, respectively. Others are binary infix operators (such as or and +), meaning they appear between their two operands. Some operators are written using punctuation symbols (like * and /), while others are written using names (like div and intersect).

XQuery does not reserve keywords; instead, context is used to determine their meaning. Like XML, XQuery is a case-sensitive language. All XQuery keywords are lowercase.

Logic Operators

XQuery defines three logic operators: and, or, and not(). The not() operator is written as a function because that's how XPath 1.0 handled it; the other two are binary operators, as shown in Listing 1.16. Each of these operators performs the corresponding boolean calculation.

Example 1.16. Boolean operators

true() and false()  => false()
true() or false()   => true()
not(false())        => true()

XQuery also provides an if/then/else operator to express conditional statements. The if condition is always evaluated, and then only the corresponding branch (then if the condition is true, else if it is false) is evaluated and becomes the result of the entire expression, as shown in Listing 1.17.

Example 1.17. Conditionals

if (true()) then "true" else "false"
=>
"true"

Conditionals may also be chained together one after another, as shown in Listing 1.18. The final else branch is always required.

Example 1.18. A sequence of conditionals

if (expr < 0)
then "negative"
else if (expr > 0)
then "positive"
else "zero"
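
To see the chained conditional produce actual values, here is a sketch that applies it to each member of a small sequence. It borrows the for clause, which is introduced later in this chapter, and the variable name $n is arbitrary.

```xquery
for $n in (-2, 0, 7)
return
  if ($n < 0) then "negative"
  else if ($n > 0) then "positive"
  else "zero"
=>
("negative", "zero", "positive")
```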

Arithmetic Operators

XQuery supports eight mathematical operators, corresponding to addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), modulo (mod), unary plus (+), and unary minus (-). Division is expressed using a keyword (div) instead of the usual slash operator, because slash is used for a different purpose (see Section 1.15). Listing 1.19 shows how these operators are used.

Example 1.19. XQuery provides the usual arithmetic operators

1 + 2      => 3
3 - 4      => -1
1 * 2      => 2
1 div 2    => 5E-1
1 idiv 2   => 0
1 mod 2    => 1
+1.0       => 1.0
-1.0       => -1.0
2.0 * 3    => 6.0
2E1 div 4  => 5E0

In addition to these operators, XQuery also defines a few arithmetic functions, including round(), floor(), ceiling(), round-half-to-even(), abs(), sum(), min(), max(), and avg(). Listing 1.20 illustrates the use of some of these functions. Many XQuery implementations also provide additional mathematical capabilities through extension functions.

Example 1.20. XQuery also defines several arithmetic functions

min((2, 1, 3, -100))        => -100
round(9 div 2)              => 5
round-half-to-even(9 div 2) => 4

Like XML construction, arithmetic has a lot of detailed rules, such as type promotion. For complete details, see Chapter 5 and Appendix B.
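
The following sketch rounds out the listings above with negative operands and a few of the functions not yet shown. Note that idiv truncates toward zero and that the result of mod takes the sign of its first operand.

```xquery
7 idiv 2         => 3
-7 idiv 2        => -3
7 mod 2          => 1
-7 mod 2         => -1
floor(2.5)       => 2
ceiling(2.5)     => 3
abs(-3)          => 3
sum((1, 2, 3))   => 6
max((2, 1, 3))   => 3
```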

Text Operators

XQuery 1.0 doesn't define any text operators per se, although it does provide a large number of built-in functions for string manipulation, including regular expression matching and replacement.

Note that the plus operator (+) doesn't perform string concatenation. To combine strings, use either the concat() or the string-join() function.
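
A brief sketch of the difference, with strings chosen only for illustration:

```xquery
concat("XQuery", " ", "1.0")        => "XQuery 1.0"
string-join(("a", "b", "c"), ", ")  => "a, b, c"
"XQuery" + "1.0"                    => error (strings are not valid operands of +)
```

The concat() function takes its strings as separate arguments, while string-join() takes a single sequence plus a separator, which makes it the better fit when the number of strings is not known in advance.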

Two of the most commonly used text functions are substring(), which can be used to extract zero or more characters from a given string value, and string-length(), which computes the length of a string. Like sequences, string positions are always 1-based, so the first character in the string occurs at position 1. Listing 1.21 demonstrates a few of the more common XQuery string functions.

Example 1.21. Some of the built-in string functions

string-length("abcde")               => 5
substring("abcde", 3)                => "cde"
substring("abcde", 2, 3)             => "bcd"
concat("ab", "cd", "", "e")          => "abcde"
string-join(("ab","cd","","e"), "")  => "abcde"
string-join(("ab","cd","","e"), "x") => "abxcdxxe"
contains("abcde", "e")               => true
replace("abcde", "a.*d", "x")        => "xe"
replace("abcde", "([ab][cd])+", "x") => "axde"
normalize-space("  a  b cd  e  ")    => "a b cd e"

Two other very useful string functions are string-to-codepoints() and codepoints-to-string(). The first takes a string and returns the sequence of Unicode code points it contains. The second does the reverse; it takes a sequence of code points and returns the string containing those characters. Both are demonstrated in Listing 1.22.

Example 1.22. Strings are sequences of Unicode code points

string-to-codepoints("Hello")          => (72,101,108,108,111)
codepoints-to-string((87,79,82,76,68)) => "WORLD"

Most string functions in XQuery accept an optional collation parameter. A collation describes how characters should be compared (in comparisons, sorts, and substring searches). For example, a case-insensitive collation would treat X and x as the same character; another common collation treats all punctuation characters as less than all letters.

In XQuery, collations are represented using URI strings. The only collation that implementations are required to support is also the default collation, known as the Unicode code point collation. This collation corresponds to the URI http://www.w3.org/2003/11/xpath-functions/collation/codepoint and it sorts characters according to their Unicode code points. Implementations are free to support any additional collations they wish; there is no standard for specifying collation names. See Chapter 8 for additional information and examples.
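
As a minimal sketch, the built-in compare() function returns -1, 0, or 1 and accepts the collation as an optional last argument. The first line assumes the default collation has not been changed. The second passes the draft URI quoted above explicitly; an implementation based on a different version of the specification may expect a different URI, so check its documentation.

```xquery
compare("a", "B")
=> 1

compare("a", "B", "http://www.w3.org/2003/11/xpath-functions/collation/codepoint")
=> 1
```

The result is 1 because the code point of a (97) is greater than that of B (66). A case-insensitive collation, if your implementation provides one, would instead return -1.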

Comparison Operators

XQuery also supports many different comparison operators. The comparison operators are grouped into three categories: value, general, and node comparison operators.

Value comparison operators compare two singleton values and return true if the operands compare true (using the default collation for string comparisons), and false otherwise. The value comparison operators are all expressed using keywords: eq, ne, gt, ge, lt, and le. These have the expected meanings (for example, eq returns true if the values are equal, ne returns true if they are unequal, gt returns true if the first operand is greater than the second, etc.). Listing 1.23 illustrates some examples.

Example 1.23. Value comparison operators work on singleton values

1 eq 1 => true
1 eq 2 => false
1 ne 2 => true
1 gt 2 => false
1 lt 2 => true

General comparison operators are similar to value comparisons, except that they operate on sequences. They return true if there exists an item in the first sequence and an item in the second sequence such that the two compare true using the corresponding value comparison operator. The general comparison operators are represented using punctuation: =, !=, >, >=, <, and <=.

As Listing 1.24 demonstrates, the general comparison operators sometimes produce surprising results. For example, (1, 2) = (2, 3) is true because there exists an item (2) in the first sequence and there exists an item (2 again) in the second sequence such that the two items are equal.

Example 1.24. General comparison operators work on sequences

(1, 2, 3) = 4     => false
(1, 2, 3) = 3     => true
(1, 2) = (3, 4)   => false
(1, 2) != (3, 4)  => true
(1, 2) = (2, 3)   => true
(1, 2) != (2, 3)  => true
(1, 2) != (1, 2)  => true
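
One way to make the "there exists" rule concrete is to spell it out with a quantified expression (covered with FLWOR in Chapter 6). The sketch below is only meant to convey the idea for these integer sequences; it ignores the type conversions that the general comparisons also perform.

```xquery
(1, 2) = (2, 3)
=> true

some $x in (1, 2), $y in (2, 3) satisfies $x eq $y
=> true

(1, 2) eq 2
=> error (eq requires singleton operands)
```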

Finally, there are three node comparison operators: <<, >>, and is. The node comparison operators depend on node identity and document order, which are explained in Chapter 2.

These operators work on sequences of nodes. Like the general comparisons, the node comparisons test whether there exists a node in the first sequence and there exists a node in the second sequence such that the comparison is true.

The is operator returns true if two nodes are the same node by identity. The << operator is pronounced “before” and tests whether a node occurs before another one in document order. Similarly, the >> operator is pronounced “after” and tests whether a node occurs after another one in document order. Listing 1.25 demonstrates the use of these three operators.

Example 1.25. Node comparison operators work on sequences of nodes

<a/> is <b/>                       => false
not(<a/> is <a/>)                  => true
doc("test.xml") is doc("test.xml") => true
x/.. << x                          => true
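
Here is a self-contained sketch of the same operators that does not depend on an external document. The element names are arbitrary, and the let clause and path steps are explained later in this chapter.

```xquery
let $doc := <r><a/><b/></r>
let $a := $doc/a
return ($a is $doc/a, $a << $doc/b, $doc/b >> $a, <a/> is <a/>)
=>
(true, true, true, false)
```

The first comparison is true because both operands select the very same node. The last is false because each constructor creates a new node with its own identity, even though the two look identical.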

Paths

You have seen how to construct XML and how to operate on sequences of nodes and values, but of course the most important topic is the application of XQuery to existing (external) XML sources. This section explores the use of existing XML data. In the examples given in this section and the remainder of this chapter, suppose team.xml is the XML document shown in Listing 1.26.

XQuery provides several functions to access existing XML data, including the doc() function. This function is similar to the document() function in XSLT: It takes a single argument, which is a string URI pointing to the XML source to be loaded, and returns the resulting document. For example, doc("team.xml") accesses the data source team.xml.

Given an XML document, the next step is to select some of the nodes it contains. Just as XSLT 1.0 used XPath 1.0 to select nodes, XQuery uses the XPath 2.0 path syntax. By conscious design, these paths are somewhat similar to file system paths, because both navigate a hierarchy of information.

Example 1.26. The team.xml document

<?xml version='1.0'?>
<Team name="Project 42" xmlns:a="urn:annotations">
  <Employee id="E6" years="4.3">
    <Name>Chaz Hoover</Name>
    <Title>Architect</Title>
    <Expertise>Puzzles</Expertise>
    <Expertise>Games</Expertise>
    <Employee id="E2" years="6.1" a:assigned-to="Jade Studios">
      <Name>Carl Yates</Name>
      <Title>Dev Lead</Title>
      <Expertise>Video Games</Expertise>
      <Employee id="E4" years="1.2" a:assigned-to="PVR">
        <Name>Panda Serai</Name>
        <Title>Developer</Title>
        <Expertise>Hardware</Expertise>
        <Expertise>Entertainment</Expertise>
      </Employee>
      <Employee id="E5" years="0.6">
        <?Follow-up?>
        <Name>Jason Abedora</Name>
        <Title>Developer</Title>
        <Expertise>Puzzles</Expertise>
      </Employee>
    </Employee>
    <Employee id="E1" years="8.2">
      <!-- new hire 13 May -->
      <Name>Kandy Konrad</Name>
      <Title>QA Lead</Title>
      <Expertise>Movies</Expertise>
      <Expertise>Sports</Expertise>
      <Employee id="E0" years="8.5" a:status="on leave">
        <Name>Wanda Wilson</Name>
        <Title>QA Engineer</Title>
        <Expertise>Home Theater</Expertise>
        <Expertise>Board Games</Expertise>
        <Expertise>Puzzles</Expertise>
      </Employee>
    </Employee>
    <Employee id="E3" years="2.8">
      <Name>Jim Barry</Name>
      <Title>QA Engineer</Title>
      <Expertise>Video Games</Expertise>
    </Employee>
  </Employee>
</Team>

For example, suppose you want to select the Team element at the top of the document. This can be done using the XQuery doc("team.xml")/Team. The slash operator iterates through every node in the expression on the left (the context), and for each such node performs the selection on the right (the step). In this case, the context is the root node of the document team.xml, and the step selects its Team element children. Any number of steps may be combined together in a path.

To select attribute nodes instead of elements, you can use the @ symbol in front of the step name. For example, doc("team.xml")/Team/@name selects the attribute name="Project 42".
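
Two more paths over the team.xml document from Listing 1.26, as a sketch of how steps combine. Both assume that doc("team.xml") resolves to that document in your implementation.

```xquery
doc("team.xml")/Team/Employee/@id
=>
id="E6"

doc("team.xml")/Team/Employee/Employee/Name
=>
<Name>Carl Yates</Name>
<Name>Kandy Konrad</Name>
<Name>Jim Barry</Name>
```

The first path finds only one attribute because Team has a single Employee child. The second goes one level deeper and selects the names of that employee's three direct reports, in document order.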

Paths are easily one of the most important types of expressions in XQuery. Paths provide many other navigation operators for moving around the hierarchy, selecting different kinds of nodes, and filtering the nodes selected. Chapter 3 covers paths and navigation more generally.

Variables

Variables in XQuery are written using a dollar sign symbol in front of a name, like so: $variable. The variable name may consist of only a local-name like this one, or it may be a qualified name consisting of a prefix and local-name, like $prefix:local. In this case, it behaves like any other XML qualified name. (The prefix must be bound to a namespace in scope, and it is the namespace value that matters, not the prefix.)

Several different expressions in XQuery can introduce new variables into scope. These are described later in the book: function definitions (Chapter 4), global variable declarations (Chapter 5), FLWOR and quantification (Chapter 6), and typeswitch (Chapter 9). If there is already a variable in scope with that name, then the new definition temporarily overrides the old one.
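
The following sketch illustrates the overriding rule using the let and for clauses described in the next section. The inner $x hides the outer one only inside the for expression; once that expression ends, $x refers to the outer value again.

```xquery
let $x := 1
return (
  for $x in (10, 20) return $x + 1,
  $x
)
=>
(11, 21, 1)
```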

It's worth observing that XQuery variables, despite being called “variable,” are actually immutable. In fact, everything in XQuery is read-only; in XQuery 1.0, no expressions can change the values of variables or XML data. There are proposed extensions to XQuery (see Chapter 14) that would allow some values to be modified and may appear in future versions of the standard (and possibly the implementation you use today).

FLWOR

The central expression in XQuery is the so-called “flower expression,” named after the first letters of its clauses—for, let, where, order by, return—FLWOR. FLWOR is an expression with many features, which are covered completely in Chapter 6.

The FLWOR expression is used for many different purposes in XQuery: to introduce variables, to iterate over sequences, to filter results, to sort sequences, and to join different data sources. The FLWOR expression in Listing 1.27 uses all five clauses to iterate over an existing document and return a result.

Example 1.27. A typical FLWOR expression

for $i in doc("orders.xml")//Customer
let $name := concat($i/@FirstName, $i/@LastName)
where $i/@ZipCode = 91126
order by $i/@LastName
return
  <Customer Name="{$name}">
    { $i//Order }
  </Customer>

The for and let clauses may appear in any order relative to one another, and there may be any number of each, provided there is at least one for or let clause. Each for clause iterates through a sequence, binding a variable to each member of the sequence in turn. Each let clause assigns a variable to the value of an expression. Every variable introduced this way is in scope for the remainder of the FLWOR expression, including any for/let clauses that follow.

The optional where clause filters the possibilities, and the optional order by clause sorts the result into a particular order. Finally, the return clause constructs the result, which can be any expression at all.
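
To watch all five clauses work together without needing an external document, consider this sketch over a literal sequence. The variable and element names are arbitrary.

```xquery
for $n in (3, 1, 2)
let $sq := $n * $n
where $sq > 1
order by $n descending
return <sq n="{$n}">{ $sq }</sq>
=>
<sq n="3">9</sq>
<sq n="2">4</sq>
```

The where clause discards 1 (whose square is not greater than 1), and the order by clause puts the remaining results in descending order of $n.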

A very simple FLWOR is shown in Listing 1.28. This example declares a variable ($variable) using a let clause, and then returns some expression (which might use that variable).

Example 1.28. FLWOR can introduce variables into scope

let $variable := "any expression here"
return concat("xx", $variable, "xx")
=>
"xxany expression herexx"

A more complex example is shown in Listing 1.29. This FLWOR iterates through a sequence, and returns only those members that are greater than 3.

Example 1.29. FLWOR is also useful for filtering sequences

for $i in (1, 2, 3, 4, 5)
where $i > 3
return $i
=>
(4, 5)

Often, simple FLWOR expressions can be expressed using paths instead. For example, Listing 1.29 could also be expressed as the path (1,2,3,4,5)[. > 3]. Path expressions are very concise, but can be difficult to comprehend.

FLWOR is especially useful when used together with paths. For example, consider the team.xml example in Listing 1.26. Suppose you want to list all employees alphabetically by last name. You can use a path to select the employee names, the tokenize() function to split the name into first and last parts, and then an order by clause to sort by the last name, as shown in Listing 1.30.

Example 1.30. Sort employee names by last name

for $e in doc("team.xml")//Employee
let $name := $e/Name
order by tokenize($name, " ")[2] (: Split on spaces and sort by the last name :)
return $name
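
A variation on this query, sketched under the assumptions that every name in team.xml consists of exactly two words separated by a single space and that the default code point collation is in effect, returns strings in "Last, First" form instead of Name elements:

```xquery
for $name in doc("team.xml")//Employee/Name
let $parts := tokenize($name, " ")
order by $parts[2]
return concat($parts[2], ", ", $parts[1])
=>
("Abedora, Jason", "Barry, Jim", "Hoover, Chaz", "Konrad, Kandy",
 "Serai, Panda", "Wilson, Wanda", "Yates, Carl")
```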

FLWOR is also commonly used to join a data source with itself or other data sources, as shown in Listing 1.31. For more examples, see Chapter 6.

Example 1.31. Joining two documents together

for $i in doc("one.xml")//fish,
    $j in doc("two.xml")//fish
where $i/red = $j/blue
return <fishes> { $i, $j } </fishes>
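
Because the documents in Listing 1.31 are not shown, here is a self-contained sketch of the same join pattern over two small sequences constructed in the query itself. All of the element and attribute names are invented for illustration.

```xquery
let $people := (<p id="1">Ann</p>, <p id="2">Bob</p>)
let $pets := (<pet owner="2">Rex</pet>, <pet owner="1">Tom</pet>)
for $p in $people, $t in $pets
where $p/@id = $t/@owner
return <owns person="{ $p }" pet="{ $t }"/>
=>
<owns person="Ann" pet="Tom"/>
<owns person="Bob" pet="Rex"/>
```

The for clause considers every pairing of a person with a pet, and the where clause keeps only the pairs whose id and owner values match.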

Error Handling

XQuery distinguishes between static errors that may occur when compiling a query and dynamic errors that may occur when evaluating a query. Dynamic errors may be reported statically if they are detected during compilation (for example, xs:decimal("X") may result in either a dynamic or a static error, depending on the implementation).

Most XQuery expressions perform extensive type checking. For example, the addition $a + $b results in an error if either $a or $b is a sequence containing more than one item, or if the two values cannot be added together. For example, "1" + 2 is an error. This is very different from XPath and XSLT 1.0, in which "1" + 2 converted the string to a number, and then performed the addition without error.
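
When you do want the conversion that XPath 1.0 performed implicitly, you can ask for it explicitly. The following sketch shows two ways, using a type constructor and the built-in number() function:

```xquery
"1" + 2              => error
xs:integer("1") + 2  => 3
number("1") + 2      => 3E0
```

The number() function always produces an xs:double, which is why the last result is written in exponential notation.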

XQuery also defines a built-in error() function that takes an optional argument (the error value) and raises a dynamic error. In addition, some implementations support the trace() function, which allows you to generate a message without terminating query execution. See Appendix C for examples.

Many other XQuery operations may cause dynamic errors, such as type conversion errors. As mentioned previously, often implementations are allowed to evaluate expressions in any order or to optimize out certain temporary expressions. Consequently, an implementation may optimize out some dynamic errors. For example, error() and false() might raise an error, or might return false. The only expressions that guarantee a particular order-of-evaluation are if/then/else and typeswitch.
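
Because if/then/else evaluates only the selected branch, it is a natural way to guard an expression that might otherwise raise a dynamic error. A minimal sketch, with an arbitrary message string:

```xquery
for $d in (2, 0)
return
  if ($d eq 0)
  then "cannot divide by zero"
  else 10 idiv $d
=>
(5, "cannot divide by zero")
```

Writing the same test with and or or would not be a reliable guard, since the implementation may evaluate their operands in either order.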

Conclusion

Query languages are powerful tools for manipulating XML, and XQuery doesn't disappoint. With literally hundreds of operators and functions, it's a rich and feature-full language for constructing and navigating XML and typed values.

The core features of XQuery are its type system, XML constructors, navigation paths, and FLWOR (“flower”) expressions, all of which are explained in later chapters. XQuery also provides many useful operators (Chapter 5) and even allows users to define their own functions (Chapter 4).
