Chapter 7. Sorting and Grouping

This chapter explains how to sort and group data from input documents. It covers sorting in FLWORs, grouping results together, and calculating summary values by using aggregate functions.

Sorting in XQuery

Path expressions, which are most often used to select elements and attributes from input documents, always return items in document order. FLWORs by default return results based on the order of the sequence specified in the for clause, which is also often document order if a path expression was used.

You can sort data in an order other than document order by using the order by clause of the FLWOR. Therefore, in some cases it is necessary to use a FLWOR where it would not otherwise be necessary. For example, if you simply want to select all your items from an order, you can use the path expression doc("order.xml")//item. However, if you want to sort those items based on their num attribute, the most straightforward way to do this is using a FLWOR with an order by clause. The other alternative is to use the sort function, described in “The sort Function”.

The order by Clause

Example 7-1 shows an order by clause in a FLWOR.

Example 7-1. The order by clause
for $item in doc("order.xml")//item
order by $item/@num
return $item

The results will be sorted by item number. The syntax of an order by clause is shown in Figure 7-1.

The order by clause is made up of one or more ordering specifications, separated by commas, each of which consists of an expression and an optional modifier. The expression can only return one value for each item being sorted. In Example 7-1, there is only one num attribute of $item. If instead, you had specified order by $item/@*, which selects all attributes of item, type error XPTY0004 would have been raised because more than one value is returned by that expression.

Figure 7-1. Syntax of an order by clause

You can order by a value that is not returned by the expression. For example, you can order by $item/@dept and only return $item/@num in the results.

Using multiple ordering specifications

In order to sort on more than one expression, you can include multiple ordering specifications, as shown in Example 7-2.

Example 7-2. Using multiple ordering specifications
for $item in doc("order.xml")//item
order by $item/@dept, $item/@num
return $item

This sorts the results first by department, then by item number. An unlimited number of ordering specifications can be included.

Sorting and types

When sorting values, the processor considers their type. All the values returned by a single ordering specification expression must have comparable types. For example, they could be all xs:integer or all xs:string. They could also be a mix of xs:integer and xs:decimal, since values of these two types can be compared.

However, if integer values are mixed with string values, type error XPTY0004 is raised. It is acceptable, of course, for different ordering specifications to sort on values of different types; in Example 7-2, item numbers could be integers while departments are strings.

Untyped values are treated like strings. If your values are untyped but you want them to be treated as numeric for sorting purposes, you can use the number function, as in:

order by number($item/@num)

This allows the untyped value 10 to come after the untyped value 9. If they were treated as strings, the value 10 would come before 9.

Order modifiers

Several order modifiers can optionally be specified for each ordering specification:

  • ascending and descending specify the sort direction. The default is ascending.

  • empty greatest and empty least specify how to sort the empty sequence.

  • collation, followed by a collation URI in quotes, specifies a collation used to determine the sort order of strings. Collations are described in detail in “Collations”.

The syntax of an order modifier is shown in Figure 7-2.

Figure 7-2. Syntax of an order modifier

Order modifiers apply to only one order specification. For example, if you specify:

order by $item/@dept, $item/@num descending

the descending modifier applies only to $item/@num, not to $item/@dept. If you want both to be sorted in descending order, you have to specify:

order by $item/@dept descending, $item/@num descending

Empty order

The order modifiers empty greatest and empty least indicate whether the empty sequence and NaN should be considered a low value or a high value. If empty greatest is specified, the empty sequence is greater than NaN, and NaN is greater than all other values. If empty least is specified, the opposite is true; the empty sequence is less than NaN, and NaN is less than all other values. Note that this applies to the empty sequence and NaN only, not to zero-length strings, which are always sorted before other strings.

You can also specify the default behavior for all order by clauses in the query prolog, using an empty order declaration, whose syntax is shown in Figure 7-3.

Figure 7-3. Syntax of an empty order declaration

Example 7-3 shows a query that uses an empty order declaration and sorts the results by the color attributes. Because the greatest option is chosen, the items with no color attribute appear last in the results.

Example 7-3. Using an empty order declaration

Query

declare default order empty greatest;
for $item in doc("order.xml")//item
order by $item/@color
return $item

Results

<item dept="WMN" num="557" quantity="1" color="black"/>
<item dept="MEN" num="784" quantity="1" color="gray"/>
<item dept="WMN" num="557" quantity="1" color="navy"/>
<item dept="MEN" num="784" quantity="1" color="white"/>
<item dept="ACC" num="563" quantity="1"/>
<item dept="ACC" num="443" quantity="2"/>

The setting in the empty order declaration applies unless it is overridden by an order modifier in an individual ordering specification. The empty order declaration in the prolog applies only when an order by clause is present; otherwise, the results are not sorted. If no empty order declaration is present, the default order for empty sequences is implementation-defined.

Stable ordering

When you sort on $item/@num, several values may be returned that have the same sort value. If stable ordering is not in use, the implementation is free to return those values that have equal sort values in any order. If you want those with equal sort values to be sorted in the order of the input sequence, or if you simply want to ensure that every implementation returns the values in the same order for the query, you can use the stable keyword before the keywords order by. For example, if you specify:

stable order by $item/@num

the items with the same num value are always returned in the order returned by the for expression, within the sorted results.

More complex order specifications

So far, the order specifications have been simple path expressions. You can sort based on almost any expression, as long as it only returns a single item. For example, you could sort on the result of a function call, such as:

order by substring($item/@dept, 2, 2)

which sorts on a substring of the department, or you could sort on a conditional expression, as in:

order by (if ($item/@color) then $item/@color else "unknown")

which sorts on the color if it exists or the string unknown if it does not. In addition, you could use a path expression that refers to a completely different XML document, as in:

order by doc("catalog.xml")//product[number = $item/@num]/name

which orders the results based on a name it looks up in the catalog.xml document.

A common requirement is to parameterize the sort key—that is, to decide at runtime what sort key to use. In some cases you can use:

order by $item/@*[name()=$param]

In other cases you may need to use an extension function, as described in “Dynamic Paths”.

The sort Function

A built-in sort function is available in version 3.1 that will sort a sequence of items. The one-argument version simply sorts the items based on their typed values. For example:

sort(doc("catalog.xml")//product/number)

will return the number elements, sorted by their contents. A second argument can be used to provide an optional collation, and a third argument can be used to provide function that generates the sort key for each item. For example:

sort(doc("catalog.xml")//product, (), function($product) { $product/number })

will return the product elements sorted by their number child. The third argument is a function, in this case an inline function expression, whose syntax is covered in detail later in the book in “Inline Function Expressions”.

Document Order

Every XML document (or document fragment) has an order, known as document order, which defines the sequence of nodes. Document order is significant because certain expressions return nodes in document order. Additionally, document order is used when determining whether one node precedes another. Note that items in sequences are not always arranged in document order; it depends on how the sequence was constructed.

Document order defined

The document order of a set of nodes is:

  • The document node itself

  • Each element node in order of the appearance of its start tag, followed by:

    • Its attribute nodes, in an implementation-dependent order

    • Its children (text nodes, child elements, comments, and processing instructions) in the order they appear

If a sequence containing nodes from more than one document is sorted in document order, it is arbitrary (implementation-dependent) which document comes first, but all the nodes from one document come before all the nodes from the other document. For nodes that are not part of a document, such as those that are constructed in your query, the order is implementation-dependent, but stable.

There is no such thing as a document order on atomic values.

Sorting in document order

Certain kinds of expressions, including path expressions and operators that combine sequences (|, union, intersect, and except), eliminate duplicate nodes and return nodes in document order automatically. For example, the path expression:

doc("catalog.xml")//product/(number | name)

retrieves the number and name children of product, in document order. If you want all the number children to appear before all the name children, you need to use a sequence constructor, as in:

(doc("catalog.xml")//product/number, doc("catalog.xml")//product/name)

which uses parentheses and a comma. This sequence constructor maintains the order of the items, putting all the results of the first expression first in the sequence, and all the results of the second expression next.

If you have a sequence of nodes that are not in document order, but you want them to be, you can simply use the expression:

$mySequence/.

where $mySequence is a sequence of nodes. The / operator means that it is a path expression, which always returns nodes in document order.

Inadvertent re-sorting in document order

If you have used an order by clause to sort the results of a FLWOR, you should use caution when using the resulting sequence in another expression because the results may be re-sorted to document order. The example shown in Example 7-4 first sorts the products in order by product number, then returns their names in li elements.

Example 7-4. Inadvertent re-sorting in document order
let $sortedProds := for $prod in doc("catalog.xml")//product
                    order by $prod/number
                    return $prod
for $prodName in $sortedProds/name
return <li>{string($prodName)}</li>

However, this query returns the products in document order, not product number order. This is because the expression $sortedProds/name re-sorts the nodes back to document order. In this case, the expression can easily be rewritten as shown in Example 7-5. In more complex queries, the error might be more subtle.

Example 7-5. FLWOR without inadvertent re-sorting
for $prod in doc("catalog.xml")//product
order by $prod/number
return <li>{string($prod/name)}</li>

Document Order Comparisons

Two nodes can be compared based on their relative position in document order by using the << and >> operators. For example, $n1 << $n2 returns true if $n1 precedes $n2 in document order. According to the definition of document order, a parent precedes its children.

Each of the operands of the << and >> operators must be a single node, or the empty sequence. If one of the operands is the empty sequence, the result of the comparison is the empty sequence.

Example 7-6 shows a FLWOR that makes use of an order comparison in its where clause. For each product, it checks whether any other products later in the document are in the same department. If so, it returns the product element. Specifically, it binds the $prods variable to a sequence of all four product elements. In the where clause, it uses predicates to choose from the $prods sequence those that are in the same department as the current $prod, and then gets the last of those. If the current $prod precedes that last product in the department, the expression evaluates to true, and the product is selected.

In the case of catalog.xml, only the second product element is returned because it appears before another product in the same department (ACC).

Example 7-6. Using an order comparison
let $prods := doc("catalog.xml")//product
for $prod in $prods
where $prod << $prods[@dept = $prod/@dept][last()]
return $prod

Reversing the Order

The reverse function reverses the order of items in a sequence. For example:

reverse(doc("catalog.xml")//product)

returns the product elements in reverse document order. The function is not just for reversing document order; it can reverse any sequence. For example:

reverse( (6, 2, 3) )

returns the sequence (3, 2, 6).

Indicating That Order Is Not Significant

As described in the previous section, several kinds of expressions return results in document order. In cases where the order of the results does not matter, the processor may be much more efficient if it does not have to keep track of order. This is especially true for FLWORs that perform joins. For example, processing multiple variable bindings in a for clause might be significantly faster if the processor can decide which variable binding controls the join without regard to the order of the results.

To make a query more efficient, there are three ways for a query author to indicate that order is not significant: the unordered function, the unordered expression, and the ordering mode declaration.

The unordered function

A query author can tell the processor that order does not matter for an individual expression by enclosing it in a call to the unordered function, as shown in Example 7-7. The unordered function takes as an argument any sequence of items, and returns those same items in an undetermined order. Rather than being a function that performs some operation on its argument, it is more a signal to the processor to evaluate the expression without regard to order.

Example 7-7. Using the unordered function
unordered(
for $item in doc("order.xml")//item,
    $prod in doc("catalog.xml")//product
where $item/@num = $prod/number
return <item number="{$item/@num}"
             name="{$prod/name}"
             quantity="{$item/@quantity}"/>
)

The unordered expression

An unordered expression is similar to a call to the unordered function, except that it affects not just the main expression passed as an argument, but also every embedded expression. The syntax of an unordered expression is similar, but it uses curly braces instead of the parentheses, as shown in Example 7-8.

Example 7-8. An unordered expression
unordered {
for $item in doc("order.xml")//item,
    $prod in doc("catalog.xml")//product
where $item/@num = $prod/number
return <item number="{$item/@num}"
             name="{$prod/name}"
             quantity="{$item/@quantity}"/>
}

Similarly, an ordered expression will allow you to specify that order matters in a certain section of your query. This is generally unnecessary except to override an ordering mode declaration, as described in the next section.

The ordering mode declaration

You can specify whether order is significant for an entire query in the query prolog, using an ordering mode declaration, whose syntax is shown in Figure 7-4.

Figure 7-4. Syntax of an ordering mode declaration

For example, the prolog declaration:

declare ordering unordered;

allows the processor to disregard order for the scope of the entire query, unless it is overridden by an ordered expression or an order by clause. If no ordering mode declaration is present, the default is ordered.

Grouping

Queries are often written to summarize or organize information into categories. For example, suppose you want your list of items to be grouped by department. The only way to accomplish this in version 1.0 is to use FLWORs along with the distinct-values function, as shown in Example 7-9.

Example 7-9. Grouping by department

Query

for $d in distinct-values(doc("order.xml")//item/@dept)
let $items := doc("order.xml")//item[@dept = $d]
order by $d
return <department code="{$d}">{
         for $i in $items
         order by $i/@num
         return $i
       }</department>

Results

<department code="ACC">
  <item dept="ACC" num="443" quantity="2"/>
  <item dept="ACC" num="563" quantity="1"/>
</department>
<department code="MEN">
  <item dept="MEN" num="784" quantity="1" color="white"/>
  <item dept="MEN" num="784" quantity="1" color="gray"/>
</department>
<department code="WMN">
  <item dept="WMN" num="557" quantity="1" color="navy"/>
  <item dept="WMN" num="557" quantity="1" color="black"/>
</department>

In this example, the variable $d is iteratively bound to each of the distinct values for department code, namely WMN, ACC, and MEN. For each department, the variable $items is bound to all the items that have the particular department code $d. Because $items is bound in a let clause rather than a for clause, the entire sequence of items (for a single department) is bound to $items, not each item individually. The order by clause causes the results to be sorted by department.

The inner FLWOR is used simply to sort $items by item number. If the order of the items within a department is not a concern, the entire inner FLWOR can simply be replaced by $items, which returns the items in document order.

Grouping Using the group by Clause

Starting in version 3.0, it is possible to be more explicit about grouping by a particular value by using a group by clause. This clause not only makes your intentions clearer, but is also in many cases likely to be a lot faster to execute. Example 7-10 shows a revised version of Example 7-9 that uses a group by clause, and returns the same results.

Example 7-10. Grouping by department with group by clause
xquery version "3.0";
for $item in doc("order.xml")//item
let $d := $item/@dept
group by $d
order by $d
return <department code="{$d}">{
         for $i in $item
         order by $i/@num
         return $i}</department>

Adding the group by clause changes the iteration of the FLWOR expression. Instead of iterating over the items specified in the for clause (the item elements), it is now iterating over groups of those items, so it evaluates the return clause once for every group.

The group by clause also changes the variables that were defined before it. Before the group by clause, the $item variable is bound to one item element at a time. After the group by clause, the $item variable is bound to a sequence of one or more item elements, whichever ones are in the current group.

Figure 7-5. Syntax of a group by clause

The syntax of a group by clause is shown in Figure 7-5. The group by clause is made up of one or more grouping specifications, separated by commas, each of which consists of a grouping variable name with an optional expression and an optional collation specification. The as clause, which declares the type of the grouping variable, is described in “Type Declarations in FLWORs”. A collation can be used to compare strings for equality. Collations are described in detail in “Collations”.

The grouping variable name, for example, $d in Example 7-10, is required, and must be bound to a value. In the example, this was done by the let clause. It is not possible to specify just an expression to group on, for example, group by $item/@dept. However, as a shortcut you can bind a value to that variable in the group by clause. For example, you can eliminate the let clause in Example 7-10 and give $d a value in the group by clause using the following syntax:

for $item in doc("order.xml")//item
group by $d := $item/@dept
order by $d
return <department code="{$d}">{$item}</department>

The grouping variable must be bound to a single atomic value, known as the grouping key. In Example 7-10, there is only one dept attribute of $item. If instead, you had specified group by $item/@*, which selects all attributes of item, type error XPTY0004 would have been raised because more than one value is returned by that expression.

Grouping on multiple values

In order to group on more than one expression, you can include multiple grouping specifications, separated by commas, as shown in Example 7-11.

Example 7-11. Using multiple grouping specifications

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $d:= $item/@dept, $n:= $item/@num
return <group dept="{$d}" num="{$n}" count="{count($item)}"/>

Results

<group dept="ACC" num="563" count="1"/>
<group dept="MEN" num="784" count="2"/>
<group dept="WMN" num="557" count="2"/>
<group dept="ACC" num="443" count="1"/>

This groups the results both by department and item number. An unlimited number of grouping specifications can be included. To have nested groups, where the groups of numbers appear within the groups of departments, you would need to use a FLWOR embedded within another FLWOR, as shown in Example 7-12.

Example 7-12. Using nested FLWORs for multilevel grouping

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $d:= $item/@dept
return <group dept="{$d}" count="{count($item)}">{
        for $item-in-dept in $item
        group by $n := $item-in-dept/@num
        return <subgroup num="{$n}" count="{count($item-in-dept)}"/>
      }</group>

Results

<group dept="ACC" count="2">
   <subgroup num="443" count="1"/>
   <subgroup num="563" count="1"/>
</group>
<group dept="MEN" count="2">
   <subgroup num="784" count="2"/>
</group>
<group dept="WMN" count="2">
   <subgroup num="557" count="2"/>
</group>

More complex grouping specifications

The previous examples show grouping on individual values that appear in the input document. However, because you can use any expression to bind a value to a grouping variable, you can write more sophisticated grouping specifications. Example 7-13 shows how you can group the items based on whether the product number is over 500. The grouping variable $g is bound to a Boolean value indicating whether the product number is over 500.

Example 7-13. Grouping high and low product numbers

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $g:= $item/@num > 500
return <group prodnumrange="{if ($g) then 'high' else 'low'}"
                 count="{count($item)}"/>

Results

<group prodnumrange="high" count="5"/>
<group prodnumrange="low" count="1"/>

You could even use the query shown in Example 7-14 to group the items on ranges of product numbers.

Example 7-14. Grouping on a range of values

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $g:= $item/@num - ($item/@num mod 100)
return <group prodnumrange="{$g}-{$g+99}" count="{count($item)}"/>

Results

<group prodnumrange="400-499" count="1"/>
<group prodnumrange="500-599" count="3"/>
<group prodnumrange="700-799" count="2"/>

Aggregating Values

In addition to simply regrouping items, it is often desirable to perform calculations on the groups. For example, suppose you want to know the number of item elements in a department, or the sum of the quantities for a department. This type of aggregation can be performed using the aggregate functions. Example 7-15 shows some of these functions in action.

Example 7-15. Aggregation

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $d := $item/@dept
order by $d
return 
  <department code="{$d}"
              numItems="{count($item)}"
              distinctItemNums="{count(distinct-values($item/@num))}"
              totQuant="{sum($item/@quantity)}"/>

Results

<department code="ACC" numItems="2" distinctItemNums="2" totQuant="3"/>
<department code="MEN" numItems="2" distinctItemNums="1" totQuant="2"/>
<department code="WMN" numItems="2" distinctItemNums="1" totQuant="2"/>

The aggregate functions can operate on the $item variable, because after the group by clause, that variable is bound to all the items in the current group. Here is how the aggregate functions work:

count

This function is used to determine the number of items in the sequence. In Example 7-15, the count function is used to calculate the value of numItems, which is the number of items in the department. It is also used to calculate the value of distinctItemNums. In the latter case, the count function is combined with the distinct-values function to count only the unique numbers in that department.

sum

This function is used to determine the total value of the items in a sequence. In Example 7-15, the sum function is used to calculate the value of totQuant, the sum of all the quantity attributes for that department.

min and max

These functions are used to determine the minimum and maximum values of the items in the sequence.

avg

This function is used to determine the average value of the items in a sequence.

The sum and avg functions accept values that are all numeric, all xs:yearMonthDuration values, or all xs:dayTimeDuration values. The max and min functions accept values of any type that is ordered (i.e., values can be compared using < and >). This includes strings, dates, and many other types.

The sum, min, max, and avg functions treat untyped data as numeric. This means that if you are not using a schema, and you want to find a maximum string value, you need to use an expression like:

max(doc("order.xml")//item/string(@dept))

which uses the string function to convert each value to xs:string before the comparison. Otherwise, a type error will be raised because the dept attribute contains a value that is not a number.

Ignoring “Missing” Values

The sequence passed to an aggregate function may contain nodes that are zero-length strings, even though the user might think of them as “missing” values. For example, the minimum value of the color attribute in order.xml is black. However, if there had been an item with a color attribute whose value was a zero-length string (as in color=""), the min function would have returned a zero-length string.

Counting “Missing” Values

On the other hand, there may be cases where you want “missing” values to be taken into consideration, but they are not. For example, the avg function ignores any absent nodes. If you want the average product discount, and you use:

avg(doc("prices.xml")//discount)

you get the average of the two discount values. It does not take into account the fact that there are three products, and that you might want the discount to be counted as zero for the product with no discount child. To count absent discount children as zero, you need to calculate the average explicitly, using:

sum(doc("prices.xml")//prod/discount)
div count(doc("prices.xml")//prod)

On the other hand, if a prod had an empty discount child (i.e., <discount/> or <discount></discount>), it would be considered a zero-length string and the avg function would raise error FORG0006 because this value is not of a numeric or duration type. In that case, you would have to test for missing values using:

avg(doc("prices.xml")//prod/discount[. != ""])

Aggregating on Multiple Values

So far, the aggregation examples assume that you want to group on one value, the dept attribute. Suppose you want to group on two values: the dept attribute and the num attribute. You would like to know the number of items and total quantity for each department/product number combination. This could be accomplished by adding another grouping specification, as shown in Example 7-16.

Example 7-16. Aggregation on multiple values

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $d := $item/@dept, $n := $item/@num
order by $d, $n
return
    <group dept="{$d}" num="{$n}" numItems="{count($item)}"
           totQuant="{sum($item/@quantity)}"/>

Results

<group dept="ACC" num="443" numItems="1" totQuant="2"/>
<group dept="ACC" num="563" numItems="1" totQuant="1"/>
<group dept="MEN" num="784" numItems="2" totQuant="2"/>
<group dept="WMN" num="557" numItems="2" totQuant="2"/>

Constraining and Sorting on Aggregated Values

In addition to returning aggregated values in the query results, you can constrain and sort the results on the aggregated values. Suppose you want to return the similar results to those shown in Example 7-16, but you only want the groups whose total quantity (totQuant) is greater than 1, and you want the results sorted by the number of items (numItems). The query shown in Example 7-17 accomplishes this.

Example 7-17. Constraining and sorting on aggregated values

Query

xquery version "3.0";
for $item in doc("order.xml")//item
group by $d := $item/@dept, $n := $item/@num
where sum($item/@quantity) gt 1
order by count($item)
return
    <group dept="{$d}" num="{$n}" numItems="{count($item)}"
              totQuant="{sum($item/@quantity)}"/>

Results

<group dept="ACC" num="443" numItems="1" totQuant="2"/>
<group dept="WMN" num="557" numItems="2" totQuant="2"/>
<group dept="MEN" num="784" numItems="2" totQuant="2"/>

Adjusting the query was a simple matter of adding a where clause that tested the total quantity, and modifying the order by clause to use the number of items.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.16.130.201