Chapter 16. Writing Better Queries

Well-designed, robust queries have the advantages of running faster and with fewer errors, as well as being easier to debug and maintain. This chapter describes some of the goals of query design, with particular attention to handling errors and tuning for performance.

Query Design Goals

Some of the elements of good query design include:

Clarity

Queries that clearly convey their meaning are much easier to understand and therefore to maintain.

Modularity

Expressions should be reusable in many parts of a query and across multiple queries.

Robustness

Queries should be able to handle all possible combinations of values of input data.

Error handling

Queries should handle dynamic errors gracefully, with useful messages.

Performance

Queries should be tuned for performance.

The rest of this chapter takes a closer look at these design goals.

Clarity

You can increase the clarity of your queries by improving the layout of the query, making appropriate use of names, and using comments liberally. In addition to the recommendations in this chapter, you can go to http://xqdoc.org/xquery-style.html for some more detailed XQuery style conventions.

Improving the Layout

To make the structure of a query more obvious, you should make appropriate use of whitespace and parentheses. Whitespace (line breaks, spaces, and tabs) is allowed anywhere between keywords to make it more readable.

It is helpful to split longer FLWOR and conditional expressions into multiple lines and indent each clause to line up, as shown in Example 16-1. FLWORs embedded within FLWORs should be further indented. When constructing XML results, you should indent the element constructors just as you would indent the elements in an XML document.

Example 16-1. Making use of whitespace

Less clear query

for $prod in doc("catalog.xml")//product return
<product><num>{data($prod/number)}</num>
<price>{for $price in doc("prices.xml")//prod
where $prod/number = $price/@num
return data($price/price)}</price>
</product>

More clear query

for    $prod in doc("catalog.xml")//product
return <product>
         <num>{data($prod/number)}</num>
         <price>{for    $price in doc("prices.xml")//prod
                 where  $prod/number = $price/@num
                 return data($price/price)}</price>
       </product>

Parentheses can be used around most expressions to group them together. If the beginning and end of an expression are not obvious, parentheses are highly recommended. For example, a complex where clause in a FLWOR, or a complex then clause in a conditional expression, are much clearer when wrapped in parentheses.

Choosing Names

Choosing meaningful names can also make a query much easier to understand. This includes the names of variables, functions, and function parameters. Names can also be used, along with repeating let clauses, to make an expression more clear. For example, in the expression:

let $substring := substring($myString, 1, 32)
let $substringNoQuotes := replace($substring, '"', '')
let $substringUpperCase := upper-case($substringNoQuotes)
return $substringUpperCase

the names are bound to the string in various states of processing. This is more obvious than its equivalent:

upper-case(replace(substring($myString, 1, 32), '"', ''))

Namespace prefixes should also be chosen carefully. When possible, use popular prefix conventions such as xs for XML Schema, wsdl for Web Services Description Language, and html for XHTML. If you are using several namespaces, declare prefixes for all of them rather than making one the default. This makes it more clear which namespace each name belongs to.

Using Comments for Documentation

An important part of writing understandable queries is documenting them. Comments delimited by (: and :) can appear anywhere that insignificant whitespace is allowed in a query. For example, they may appear at the end of a line to explain the expression on that line, as a separate line, or as a block on multiple lines, as in:

 (:::::::::::::::::::::::::::::::::
  : The following expression returns the price of a product
  : It assumes there is one price per product element
  ::::::::::::::::::::::::::::::::::)

A standard method of documenting XQuery modules and functions is by using xqDoc tags. These tags, listed in Table 16-1, appear in normal XQuery comments. All of them are optional and most are allowed to repeat.

Table 16-1. xqDoc tags
TagMeaning
@author The author of the component
@version The version number
@since The first version (e.g., of a library) when a component is supported
@see Where to go for additional information; it can be a URL or a textual description
@param A description of a function parameter, in the form @param $name text
@return A description of what a function returns
@deprecated An indication that the component has been deprecated and should therefore no longer be used; text can follow the keyword for further explanation
@error A description of a type of error the function might return

Once a module is documented using xqDoc tags, human-readable HTML documentation can be generated automatically. The process is very similar to that of Javadoc, which generates documentation for Java classes. For more information, or to download the scripts to generate the documentation, see http://xqdoc.org.

Example 16-2 shows a function that is documented using xqDoc comments. The documentation, which appears before the function declaration, contains a textual description of the function, followed by the @param and @return tags to describe the inputs and output of the function. HTML tags (the b elements) are used in the description to enhance the display of the description in the resulting HTML documentation.

Example 16-2. Documenting a function with xqDoc
declare namespace functx = "http://www.functx.com";
(:~
: The <b>functx:substring-after-last</b> function returns the part
: of <b>$string</b> that appears after the last occurrence of
: <b>$delim</b>. If <b>$string</b> does not contain
: <b>$delim</b>, the entire string is returned.
:
: @param $string the string to substring
: @param $delim the delimiter
: @return the substring
:)
declare function functx:substring-after-last
($string as xs:string?, $delim as xs:string) as xs:string?
 { "function body here"  };

Modularity

Expressions that are used more than once or twice should be separated into functions and shared. Functions make it clearer to the query reader what is going on. Having a function clearly named, with a set of named, typed parameters, serves as a form of documentation. It also physically separates it from the rest of the query, which makes it easier to decipher complex queries with many nested expressions.

In addition, function declarations encourage reuse. When reused, an expression needs to be written (and maintained) only once. If you want to change the algorithm later, for example to accept the empty sequence or to fix a bug, you can do it in one place only.

Functions can be made even more reusable by separating them into library modules. XQuery libraries can also be used to create standardized sets of functions for specific XML vocabularies. These libraries can serve as an API to an XML vocabulary, shielding query authors from some of the complexity of the vocabulary. They can then be distributed to a variety of query writers, allowing reuse among an entire community of users.

Robustness

Queries should be able to handle all possible combinations of values of input data. This includes handling any potential variations in the data, and considering the impact of missing or empty values.

Handling Data Variations

It is important to consider variations in the input documents that may cause incorrect results or dynamic errors. Some common problems occur when:

Sequences of multiple items appear where only one item was expected

For example, the expressions $prod[name eq "Floppy Sun Hat"] and substring($prod/name, 1, 30) raise type error XPTY0004 if there is more than one name child. The expression $prod/name != "Floppy Sun Hat" evaluates to true if two name children exist and either one is not equal to Floppy Sun Hat.

Zero items appear where one was expected

For example, the expression $prod/price - $prod/discount returns the empty sequence if there is no discount element.

Values do not conform to the expected type

For example, the expression max($prod/number) raises an error if the product number is N/A instead of an integer.

Values are outside the expected range

For example, zero and negative numbers where a positive number was expected.

You should not assume that because an input document is validated by a schema it must be exactly as you expect. A schema can validate, for example, that a required element is present, but other assumptions might be made that cannot be validated by a schema. For example, the constraint “if the discounted attribute is true, a discount child must appear” cannot be validated by XML Schema 1.0.

Handling Missing Values

Individual data values may be missing from an input document. Sometimes these missing values are in error, and sometimes they are not. In either case, they need to be handled gracefully.

Suppose you are calculating the sale price of an item by taking the regular price, and applying a discount. Some product prices have discounts, but others do not. The absence of a discount could be represented in (at least) four ways:

  • A discount element or attribute that is entirely absent from the input document

  • An empty element or an attribute whose value is a zero-length string—for example, <discount></discount>, <discount/>, or discount=""

  • An element that is marked with the attribute xsi:nil—for example, <discount xsi:nil="true"></discount>

  • An element or attribute that has a default “missing” value such as N/A or 0—for example, <discount>N/A</discount>

Absent values

An expression used to calculate a sequence of prices might be:

for $prod in doc("prices.xml")//prod
return $prod/price - $prod/discount

In the case where the discount element is absent, the value of the discount expression is the empty sequence, and therefore the $prod/price - $prod/discount expression also returns the empty sequence. You probably intended instead for the discount to default to zero, and for your expression to return the price itself if no discount was available.

Another problem that might occur when the discount value is missing is in calculating an average. To find the average discount, you might be tempted to use an expression like avg(doc("prices.xml")//discount). However, that function gives the average of the discount values that exist, ignoring any prices that do not have discounts.

Empty and nil values

The second possible scenario is that the discount element appears, but it is empty (and it may or may not have an xsi:nil attribute). It may appear in the input document as either <discount></discount> or <discount/>, which are equivalent elements in XML. This scenario also poses problems for the $prod/price - $prod/discount expression. Assuming the discount element is untyped, the processor attempts to cast the empty value to xs:double for use of the arithmetic expression. This results in a type error because the value is not a valid number.

Default “missing” values

The final scenario is one where the discount element contains a value such as N/A to indicate that it is absent. A simple conditional expression can handle this scenario, as in:

for $prod in doc("prices.xml")//prod
return if ($prod/discount = "N/A")
       then $prod/price
       else $prod/price - $prod/discount

Alternatively, if N/A is used for a number of different elements, you could alter the functx:if-absent or functx:if-empty function to check for this value, too.

Error Handling

Evaluating a query has two phases: the analysis phase, which catches static errors, and the evaluation phase, which catches dynamic errors. This section does not cover static errors, since they can be caught by the query processor and debugged as part of the development process. The dynamic errors are the unexpected errors that need to be considered carefully when writing queries.

Avoiding Dynamic Errors

It is important to consider variations in the input documents that may cause dynamic errors. For example, if you are dividing a total amount by the number of items in an order, consider the possibility that there are no items in the order, which may result in a “division by zero” error. You can avoid this by checking the number of items first, as in:

if ($items) then $orderTotal div count($items) else 0

Dynamic type errors often occur when data cannot be cast to the required type. For example, to double the price discount, you might use the expression 2 * $discount. This expression raises a dynamic error if the value of discount is N/A or a zero-length string, which cannot be cast to a numeric type. You can avoid a dynamic error by checking whether a value is castable in advance, as in:

if ($discount castable as xs:decimal) then 2 * $discount else 0

If the input data is really in error (and you are not performing schema validation), it may be helpful to test for the error condition and raise a more meaningful error by using the error function. If the input data is not in error, you should modify the query to allow for the variations in input documents.

If an input document is schema validated, you can be less concerned about some of these dynamic errors. For example, if the schema specifies that the number of items in an order must be more than zero, you may not have to worry about a “division by zero” error. If the schema validates that the type of a discount element is xs:decimal, there is no chance that it is N/A.

The error and trace Functions

The error function is used to explicitly raise an error when certain conditions arise. For example, if an important data item is missing or invalid, you may want to stop evaluation of the query with a specific error message. To do this, you can incorporate calls to the error function in your query. For example:

if (not($prod/number))
then error(QName("http://datypic.com/err", "ProdNumReq"), 
           "missing product number")
else $prod/number

raises a ProdNumReq error (whose description is “missing product number”) if $prod has no number child.

During the query debugging process, the trace function can be used to track the value of an item. For example:

trace($var1, "The value of $var1 is: ")

might write the string The value of $var1 is: 4 to a log file.

Try/Catch Expressions

The try/catch expression, new in XQuery 3.0, allows the processor to try to evaluate an expression (specified in the try clause), but if there is a dynamic error, gracefully bow out and evaluate a different expression (specified in the catch clause). For example, one of the previous examples could be rewritten as:

try { $orderTotal div count($items) }
catch * { 0 }

where it is trying to perform decimal division. If an error occurs, the expression will return 0. The syntax of the try/catch expression is shown in Figure 16-1.

Figure 16-1. Syntax of a try/catch expression

The use of the * wildcard in the previous example means that any error will be caught by that catch clause. It is also possible to list one or more specific error messages, in one or more catch clauses, as in:

try { $orderTotal div count($items) }
catch err:FOCA0001 { xs:float('INF')}
catch err:FOAR0001 | err:XPTY0004 { 0 }
catch * { () }

In this example, there are three catch clauses. The first catch clause specifies what to return if the built-in error FOCA0001 is raised. The second catch clause applies if either FOAR0001 or XPTY0004 is raised; multiple error names are separated by vertical bars. The third catch clause catches any errors other than the three specifically listed. Only the first catch clause that applies is evaluated.

For the built-in errors, the error names are in the namespace http://www.w3.org/2005/xqt-errors, which is typically prefixed with err. Because this is not one of the predeclared namespaces, you need to declare the namespace in the query prolog in order to use it. You could also have error names in custom namespaces if you called the error function from within the try clause and used a qualified name as the first argument. The wildcard can also indicate error names in a specific namespace, for example catch err:* {...} would catch all errors in the built-in error namespace.

There is no requirement to have catch clauses that trap all errors, and if no wildcard is used, and an error is raised that is not listed in a catch clause, the expression will fail as it normally would if there was no try/catch expression.

Certain variables, listed in Table 16-2, can be used within the catch clause that may provide more information about the error. Only the err:code variable is guaranteed to have a value; the other variables are not necessarily supported by all processors. The catch variables are all in the built-in error namespace and therefore should be prefixed with err when referenced. Example 16-3 shows an example of using the error function to raise custom errors, and then provide the error information in a formatted string in the result.

Table 16-2. Variables available in a catch clause
NameTypeDescription
err:codexs:QNameThe error name
err:descriptionxs:string?A description of the error, if one is available. If the error function was called, this is the second argument.
err:valueitem()*Value associated with the error. If the error function was called, this is the third argument.
err:modulexs:string?The URI (or system ID) of the module containing the expression where the error occurred, if available
err:line-numberxs:integer?The line number within the module of the instruction where the error occurred, if available
err:column-numberxs:integer?The column number within the module of the instruction where the error occurred, if available
err:additionalitem()*Any additional implementation-defined information
Example 16-3. Using the catch variables
xquery version "3.0";
declare namespace err = "http://www.w3.org/2005/xqt-errors";
let $order-date := doc("order.xml")/order/@date
return 
try {
  if (not($order-date)) 
  then error(QName("http://datypic.com/err", "MissingOrdDate"),
             "Order date must be present")
  else if (not($order-date castable as xs:date)) 
  then error(QName("http://datypic.com/err", "InvalidOrdDate"),
             "Order date must be YYYY-MM-DD")
  else concat("The order-date is ",
         format-date(xs:date($order-date), "[M01]/[D01]/[Y0001]"))
}
catch * { 
  concat("[", $err:line-number, ": ", $err:column-number,
         "] Error ", $err:code, ": ", $err:description)
}

Performance

When querying large documents or databases, it is important to tune queries to optimize performance. Implementations vary significantly in their ability to take clues from the query in order to optimize its evaluation. This section provides some general tips for improving query performance. For more specific tuning information for your XQuery processor, consult the documentation.

Avoid Reevaluating the Same or Similar Expressions

A let clause can be used to evaluate an expression once and bind the value to a variable that can be referenced many times. This can be much more efficient than evaluating the expression many times. For example, suppose you want to add a bargain-bin element to your results, but only if there are products whose price is less than 30. You first need to check whether any bargain products exist, and if so, construct a bargain-bin element and list the products in it. Example 16-4 shows an example of this.

Example 16-4. Avoid reevaluating the same expression

Less efficient query

if (doc("prices.xml")/prices/priceList/prod[price < 30])
then <bargain-bin>{
       doc("prices.xml")/*/priceList/prod[price < 30]
     }</bargain-bin>
else ()

More efficient query

let $bargains := doc("prices.xml")/prices/priceList/prod[price < 30]
return if ($bargains)
       then <bargain-bin>{$bargains}</bargain-bin>
       else ()

In the first query, similar path expressions appear in the test expression (after if) and in the bargain-bin element constructor. In the second query, the expression is evaluated once and bound to the variable $bargains, then referenced twice in the rest of the query. This is considerably more efficient, since the expensive expression need only be evaluated once. Using some XQuery implementations, the difference in performance can be dramatic, especially when the doc function is part of the expression.

Avoid Unnecessary Sorting

If you are not concerned about the order in which your results are returned, you can improve the performance of your query by not sorting. Some expressions, particularly path expressions and the union, intersect, and except expressions, always sort the results in document order unless they appear in an unordered expression or function. Example 16-5 shows two queries that select all the number and name elements from the catalog document.

Example 16-5. Avoid unnecessary sorting

Less efficient query

let $doc := doc("catalog.xml")
return $doc//number | $doc//name

More efficient query

unordered {
  let $doc := doc("catalog.xml")
  return $doc//(number|name)
}

The first query has two inefficiencies related to sorting:

  • It selects the elements without using an unordered expression, so each of the two path expressions sorts the elements in document order.

  • It performs a union of the two sequences, which causes them to be re-sorted in document order.

The more efficient query uses an unordered expression to indicate that the order of the elements does not matter. Even if you care about the order of the final results, there may be some steps along the way that can be unordered. More information on indicating that order is not significant can be found in “Indicating That Order Is Not Significant”.

Avoid Expensive Path Expressions

Using some processors, the use of the descendant-or-self axis (abbreviated //) in path expressions can be very expensive, because every descendant node must be checked. If the path to the desired descendant is known and consistent, it may be far more efficient to specify the exact path. Example 16-6 shows an example of this situation.

The first query uses the // abbreviation to indicate all number descendants of the input document, while the second specifies the exact path to the number descendants. It should be noted, however, that for some XQuery implementations, starting a path expression with // actually helps performance significantly, because it triggers the use of an index based on the element name.

Example 16-6. Avoid expensive path expressions

Less efficient query

doc("catalog.xml")//number

More efficient query

doc("catalog.xml")/catalog/product/number

Use of the parent, ancestor, or ancestor-or-self axis can also be costly when using some XQuery implementations based on databases. This is because the document may actually be stored in chunks in the database, and accessing reverse axes requires traversing to a different chunk. If possible, expressions should be rewritten to avoid these axes.

Use Predicates Instead of where Clauses

Using some XQuery implementations that are based on databases, predicates are more efficient than where clauses of FLWORs. An example of this is shown in Example 16-7.

Example 16-7. Use predicates instead of where clauses

Less efficient query

for $prod in doc("catalog.xml")//product
where $prod/@dept = "ACC"
order by $prod/name
return $prod/name

More efficient query

for $prod in doc("catalog.xml")//product[@dept = "ACC"]
order by $prod/name
return $prod/name

The first query uses a where clause $prod/@dept = "ACC" to filter out elements, while the second query uses the predicate [@dept = "ACC"]. The predicate is more efficient in some implementations because it filters out the elements before they are selected from the database and stored in memory.

All of these performance recommendations are somewhat implementation-dependent. You should consult the documentation for your XQuery processor to gain additional tips to improve query performance.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.147.53