Chapter 13. Query Optimization

Introduction

One of the key ideas behind XQuery was to create a language with a solid formal basis, like SQL has the relational calculus, and to encourage query optimization. In this chapter we explore what the XQuery standard says about optimization, and the effects query optimization have on your query—actually, more what effects your queries have on the optimizations that can be performed.

Every XQuery implementation optimizes its queries differently. Some don't perform any optimization at all, while others optimize queries so much you wouldn't even recognize them anymore. Here we focus on what optimizations can be performed, not necessarily what optimizations are performed by your particular implementation.

Common Query Optimizations

Let's begin by taking a look at three evaluation strategies commonly employed by XQuery implementations to achieve optimal performance.

Lazy Evaluation

The best way for a program to go quickly is to not compute anything at all—an empty loop sure is fast! Seriously though, not evaluating expressions whose values will never be used is one of the chief optimization techniques available to an XQuery processor.

Lazy evaluation applies to almost every XQuery expression. For example, dead code elimination involves not compiling functions that are never used. Variables are typically not evaluated until used. Operators such as and and or are short-circuited, meaning that if one operand is enough to determine the whole expression, then the other can be skipped. Function parameters that aren't used by the function needn't be evaluated.

Early Evaluation

Although it may seem contradictory, many expressions also benefit from being evaluated only once, and as early as possible. You are probably already familiar with many of the terms used to describe these kinds of optimizations.

For example, instead of computing the same expression several times, it's better to use a variable and compute it only once (common subexpression elimination). Predicates and where clauses are commonly reordered so that they can filter sequences as early as possible, reducing unnecessary later computations (loop-invariant code-motion). Constant-folding is applied, so that expressions that can be evaluated at compile-time are, instead of calculating them at run-time.

Streaming and Database Evaluation

A third evaluation strategy, often used along with the other two, involves reducing the working set. Instead of keeping all the XML data used by a query in memory, buffer as little as possible (streaming execution) and/or offload some data to disk for part of or all of the computation (database execution).

Simple path expressions like doc("a.xml")/x/y[2]/z don't need to load the entire document in-memory; instead, the XML can be processed using a streaming API. Expressions like x[@id = 1] can avoid looping through all elements to test the predicate for each one by instead storing and looking up the matching elements in a hash table or equivalent (i.e., indexing). As that example demonstrates, sometimes it's better to trade a little space for time. Other times, it's better to trade a little time for space. Query optimization is a difficult and fuzzy problem, with no single “correct” answer.

Barriers to Optimization

As you write queries, you should understand that some expressions hinder the kinds of optimizations just described, while others help them. In this section we focus on expressions that prevent optimization and ways to avoid or mitigate those expressions. We briefly consider examples of four well-known barriers to XML query optimization: node identity, sequence order, error preservation, and side effects.

Node Identity

Every XML node has a unique identity and some XQuery operators—such as union and node order comparisons—depend on it. However, node identity means that two of the most common optimization techniques, variable substitution and common subexpression elimination, don't work without serious analysis by the implementation.

To understand why, consider a simple query like the one in Listing 13.1. This query constructs exactly one node, which is then bound to a variable. The node comparison operator is returns true, because of course this node is the same as itself.

Example 13.1. A query that thwarts variable substitution

let    $a := <a/>
return $a is $a

However, an implementation cannot eliminate the variable $a by replacing it with its value. If it did that, the result would be <a/> is <a/>, which constructs two different nodes, and then the is operator would return false.

This is just a simple example; this problem appears anywhere an expression results in a node or sequence of nodes. Potentially a lot of data-flow analysis is required to determine when a variable bound to a node can be substituted and when it cannot, or conversely, when an expression can be replaced by a variable (common subexpression elimination).

You can assist these optimizations by assigning variables to nodes only when you depend on node identity or when the computation is expensive. For example, if you just want the typed value of a node, extract that typed value when you assign the variable, instead of assigning the variable to the node and extracting its later when you use it. This way, the type conversion is cached along with the value. Listing 13.2 illustrates the difference.

Example 13.2. Atomic values are cheaper than nodes

let $a := <x>42</x>       (: potentially expensive :)
let $b := data(<x>31</x>) (: potentially cheaper :)
return data($a) + $b

Sequence Order

Most databases work with tables whose rows are unordered. One of the reasons is that when you don't have to preserve ordering, many optimizations are possible. When you do have to preserve ordering, sorting the final results may completely wipe out the gains brought by the optimization in the first place.

In XML and XQuery, order matters. Nodes have an ordering (document order), and sequences are also ordered. Every path you write implicitly sorts its results by document order. Most implementations know that a path that involves only simple child steps is already in document order and does not need to be sorted. However, the moment that you use an operator such as union or // in the path, the sequence can end up out of order.

For example, in most implementations, the simple path //x/y requires a sort at the end because it's possible that the XML is shaped like Listing 13.3. The reason is that most implementations loop through the nodes of each step in order. In this case, the //x step selects the x nodes in order, but then selecting their y children finds the second y first. Of course, every implementation is different, but this gives you some idea of the impact of sorting on even simple path expressions.

Example 13.3. Most implementations of //x/y reach the y elements out of order

<x>
  <x>
    <y id="1"/>
  </x>
  <y id="2"/>
</x>

As another example, consider two predicates applied to a sequence. The order in which the predicates are applied matters when position is used. For example, in the query (1, 2, 3, 4)[. > 2][2], the first predicate reduces the sequence to (3, 4), and then the second predicate selects the second of these (4). If the predicates are applied in the other order, then the result is the empty sequence (). Using position prevents the implementation from executing the predicates in just any order it wishes. One thing you can do to avoid these barriers to optimization is to refrain from using position unless necessary.

Some implementations also make use of the built-in unordered() function. This function doesn't change any values; it just tells an implementation that order is unimportant. For example, if you write unordered(//x/y), then an implementation doesn't need to maintain any particular ordering to the result. You've indicated that even if the nodes are out of order, that's okay.

Error Preservation

Consider the query doc("a.xml")[doc("b.xml")], which says “select a.xml, but only if b.xml exists.” Now, must this query load and parse a.xml entirely before testing whether b.xml exists, or can it test b.xml first and then load a.xml?

As another example, consider the query $set["x" cast as xs:integer]. If $set is empty, must the predicate (which results in an error) still be evaluated? Or may an implementation resolve the predicate at compile-time, finding the error before the query is even executed?

In general, XQuery has chosen a permissive set of rules for query evaluation. Generally speaking, most errors that are reported at compile-time can be reported at run-time instead, or vice versa. In equally broad terms, expressions that might result in an error can usually be skipped if the implementation doesn't need to evaluate them.

There are a few exceptions to this rule. One is that an implementation that chooses to perform static typing must report certain errors at compile-time, while an implementation that chooses dynamic typing must not report certain errors at compile-time.

The most important exceptions to these general policies occur with the conditional expression (if/then/else). XQuery requires that the condition always be evaluated, and that the corresponding branch be evaluated (and only that branch). Expressions like predicates and where clauses don't have these guarantees. When it's important to your application that errors be preserved (and not be optimized out), use conditionals.

Side Effects

Some XQuery expressions have side effects. For example, the expression <a>{$nodes}</a> has the side effect that all the nodes are copied. Even if you then navigate into this expression, like (<a>{$nodes}</a>)/node(), you don't get back the original node sequence, but instead get a copy of it. As you saw in Section 13.3.1, node identity matters, and one of the side effects of constructors is that they lose node identity by deep-copying their contents.

Some implementations also support external functions, and then of course all bets are off. If the external function depends on having side effects (for example, a random number generator or a counter that is incremented), then implementations must take that into account during query optimization. Consequently, most implementations that support external functions often do not heavily optimize queries and vice versa.

Formal Semantics

The XQuery Formal Semantics, one of the documents that makes up the XQuery standard, exists to help address these issues. The idea behind the Formal Semantics is to provide an algebraic framework for describing queries precisely, and then using this framework to prove which optimizations are valid and which are not.

It does this by taking a theoretical approach to XQuery. The XQuery Formal Semantics begins by stating some axioms and then establishing basic theorems that determine how expressions can be reduced into simpler (or faster) expressions. This is done using techniques from logic theory, including methods of type inferencing and formal reduction. Consequently, the Formal Semantics is of value only to XQuery implementers and researchers.

However, even the Formal Semantics is incomplete in its treatment of the issues facing XML query. Node identity, for example, isn't represented by the Formal Semantics, and issues such as streaming query and expressions with side effects aren't addressed at all.

Conclusion

In this chapter we considered some of the optimizations that implementations can perform, expressions that perform barriers to those optimizations, and how you can avoid these barriers. We also very briefly touched on the XQuery Formal Semantics, a document that attempts to establish a formal basis for the XQuery language.

Further Reading

If you're interested in the Formal Semantics, read XQuery from the Experts: A Guide to the W3C XML Query Language, which contains explanatory essays from the Formal Semantics editors. The Formal Semantics document itself can be found online at http://www.w3.org/TR/xquery-semantics/.

XML query optimization more generally is still a burgeoning field. Although many papers are published each year on this topic and related subjects such as graph rewrites, there doesn't yet exist one unified reference for the subject.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.104.238