CHAPTER 8


Analytic Functions

The use of analytic functions, also known as windowing functions, is often overlooked, even though they’ve been around since Oracle 8i. Perhaps because the primary documentation for these functions is found in the Oracle Database Data Warehousing Guide (http://www.oracle.com/technetwork/indexes/documentation/index.html), they are often thought useful only in data warehousing SQL. In the previous chapter, we examined how advanced grouping techniques can be used to accomplish so much that you may be wondering why you need to bother with analytic functions. Well, just as really good carpenters have numerous tools in their toolbox, we should make sure our developer toolbox is filled with many different tools to help us write good SQL. If we limit ourselves to only a couple of tools, we're bound to use constructs every now and then that are less performance friendly than the alternatives.

If you follow Tom Kyte (if you don't, you should—at asktom.oracle.com—and you should read his blog at tkyte.blogspot.com), you're probably aware of his “Analytics Rock . . . Analytics Roll!” mantra. In this chapter, we take a look at how analytic functions work. I provide descriptions and examples of many commonly used analytics, and discuss the performance benefits that analytics provide. As with any construct, analytics have their good points and their not-so-good points, so you should always test carefully to make sure an analytic function is the best option compared with alternatives such as advanced grouping (Chapter 7) and subquery factoring (Chapter 10).

Overview

Queries using analytic functions compute aggregates based on groups of rows. Although they are similar to the grouped totals provided by common aggregate functions, the main difference is that analytic functions return more than one row for each group. Instead of just getting a single row per group, you can get all the detail data rows as well. One way to differentiate analytic functions from their aggregate function cousins is to use the term window for the group of rows on which an analytic function operates.

Windows are defined by the analytic clause, and each row is evaluated within its sliding window. These sliding windows determine the range of rows used to perform the calculations for the current row. Each window can vary in size based on either a physical number of rows or a logical interval such as time. These functions do so much more than common aggregates in that they can reference values across rows, produce multilevel aggregations, and allow subsets of data to be sorted with more granular control.

Although conventional SQL statements can be used to implement online analytic processing (OLAP) queries, these statements are usually more complex and perform poorly in comparison. With analytic functions, repeated access to the same objects can be avoided, saving time and resources. Because both detail data and grouped values can be returned, analytics can be used to provide cumulative, moving, centered, and reporting totals easily.

Analytic functions are the last operations performed in a query except for the final ORDER BY clause. All other clauses are completed, including the WHERE, GROUP BY, and HAVING clauses, before the analytic functions are processed. For this reason, analytic functions can be used only in the select list or ORDER BY clause.
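For example, because an analytic function cannot appear in a WHERE clause, filtering on its result requires wrapping the query in an inline view. Here is a minimal sketch of the pattern, written against the sales_fact table created in Listing 8-1:

-- Referencing the analytic expression directly in the inner WHERE clause
-- would raise ORA-30483 (window functions are not allowed here).
select year, week, running_sum_ytd
from (
      select year, week,
             sum(sale) over (
                   partition by product, country, region, year
                   order by week
                   rows between unbounded preceding and current row
                  ) running_sum_ytd
      from sales_fact
     )
where running_sum_ytd > 1000;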

As mentioned, the use of analytic functions is often assumed to be pertinent only in data warehousing or large reporting applications. However, when you see the power and flexibility of what they can do, you'll likely find uses for them everywhere.

Example Data

To begin our investigation of the analytic SQL functions, let’s create a denormalized fact table using the script in Listing 8-1. All the tables in this chapter refer to objects in the SH schema supplied by the Oracle example scripts.

Listing 8-1.  Denormalized sales_fact Table

drop table sales_fact;

CREATE table sales_fact AS
SELECT country_name country,country_subRegion region, prod_name product,
calendar_year year, calendar_week_number week,
SUM(amount_sold) sale,
sum(amount_sold*
( case
 when mod(rownum, 10)=0 then 1.4
 when mod(rownum, 5)=0 then 0.6
 when mod(rownum, 2)=0 then 0.9
 when mod(rownum,2)=1 then 1.2
 else 1
    end )) receipts
FROM sales, times, customers, countries, products
WHERE sales.time_id = times.time_id AND
sales.prod_id = products.prod_id AND
sales.cust_id = customers.cust_id AND
customers.country_id = countries.country_id
GROUP BY
country_name,country_subRegion, prod_name, calendar_year, calendar_week_number;

Anatomy of Analytic Functions

Analytic functions have three basic components: partition-by-clause, order-by-clause, and the windowing-clause. The basic syntax of an analytic function is as follows:

function1 (argument1, argument2,..argumentN)
over ([partition-by-clause] [order-by-clause] [windowing-clause])

function1 is the analytic function you wish to call that accepts zero or more arguments. The partition-by-clause groups the rows by partitioning column values. All rows with the same value for the partitioning column are grouped as a data partition.

Operationally, rows are sorted by the partitioning columns and are separated into data partitions. For example, the SQL clause partition by product, country partitions the data using the product and country columns. Rows are sorted by both columns and are grouped into one partition for each combination of product and country.

The order-by-clause sorts the rows in a data partition by a column or expression. In an analytic SQL statement, the position of a row in the data partition is important and it is controlled by the order-by-clause. Rows are sorted by the sort columns within a data partition. Because the partition-by-clause sorts the rows by the partitioning columns, you actually end up with one sort that includes columns specified in the partition-by-clause and order-by-clause.

Sort order, just like with the SQL statement’s ORDER BY, can be specified as ascending or descending order. Nulls can be specified to sort to the top or bottom in a data partition using the clause NULLS FIRST or NULLS LAST.

The windowing-clause specifies the subset of rows on which the analytic function operates. This window can be dynamic and is aptly termed a sliding window. You can specify the top and bottom boundary condition of the sliding window using the window specification clause. Syntax for the window specification clause is as follows:

[ROWS | RANGE] BETWEEN <Start expr> AND <End expr>

where
<Start expr> is [UNBOUNDED PRECEDING | CURRENT ROW | n PRECEDING | n FOLLOWING]
<End expr> is [UNBOUNDED FOLLOWING | CURRENT ROW | n PRECEDING | n FOLLOWING]

The keyword PRECEDING specifies the top boundary condition, and the clause FOLLOWING or CURRENT ROW specifies the bottom boundary condition for the window. A sliding window provides the ability to compute complex metrics with ease. For example, you can compute the running sum of the sale column using the clause rows between unbounded preceding and current row. In this example, the top row in the window is the first row in the current partition and the bottom row in the window is the current row.

Note   The windowing-clause is not supported by all analytic functions.

Analytic functions may not be nested, but a nesting effect can be achieved by placing the encompassing SQL statement in an inline view and then by applying analytic functions outside the view. Analytic functions can be used in deeply nested inline views, too.
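For example, the following minimal sketch (again using the sales_fact table from Listing 8-1) achieves a nesting effect: the inline view computes a running sum, and the outer query applies lag to that analytic result:

select year, week, running_sum,
       lag(running_sum) over (order by year, week) prior_running_sum
from (
      select year, week,
             sum(sale) over (
                   order by year, week
                   rows between unbounded preceding and current row
                  ) running_sum
      from sales_fact
      where country = 'Australia' and product = 'Xtend Memory'
     );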

List of Functions

Table 8-1 contains the analytic functions for easy reference.

Table 8-1. Analytic Functions

Function Description
lag To access a prior row in a partition or result set.
lead To access a later row in a partition or result set.
first_value To access the first row in a partition or result set.
last_value To access the last row in a partition or result set.
nth_value To access an arbitrary row in a partition or result set.
rank To rank the rows in a sort order. Ranks are skipped in the case of ties.
dense_rank To rank the rows in a sort order. Ranks are not skipped in the case of ties.
row_number To sort the rows and assign a unique number to each row. This is a nondeterministic function.
ratio_to_report To compute the ratio of a value to the sum of values in the partition or result set.
percent_rank To compute the rank of a value, normalized to a value between 0 and 1.
percentile_cont To retrieve the value matching the specified percent_rank. The inverse of the percent_rank function; assumes a continuous distribution model.
percentile_disc To retrieve the value matching the specified percent_rank. Assumes a discrete distribution model.
ntile To group rows into a specified number of buckets.
listagg To convert column values from different rows into a list format.

Aggregation Functions

Aggregation functions can operate in analytic mode or conventional nonanalytic mode. Aggregation functions in nonanalytic mode reduce the result set to fewer rows. However, in analytic mode, aggregation functions do not reduce the result set but can fetch both aggregated and nonaggregated columns in the same row. Aggregation functions in analytic mode provide the ability to aggregate data at different levels without any need for a self-join.

Analytic functions are useful in writing complex report queries that aggregate data at different levels. Consider a demographic market analysis report for a product, a favorite among advertising executives, that requires sales data to be aggregated at myriad levels such as age, gender, store, district, region, and country. Aggregation functions in analytic mode can be used to implement this market analysis report with ease. Analytic functions improve the clarity and performance of SQL statements markedly compared with their nonanalytic counterparts.

Let’s review the example in Listing 8-2. The SQL statement calculates the sum of the sale column from the beginning of the year for a product, country, region, and year combination. The clause partition by product, country, region, year specifies the partition columns. Within the data partition, rows are sorted by the week column using the clause order by week.

As mentioned, in Listing 8-2, the SQL calculates the sum of sale column, so the analytic function must operate on a window of rows from the beginning of the year to the current week. This goal is achieved by the windowing clause rows between unbounded preceding and current row. The sum(sale) function calculates the sum of the sale column values over this window of rows. Because the rows are sorted by the week column, the sum function is operating over a set of rows from the beginning of the year until the current week.

Listing 8-2.  Running Sum of the sale Column

SQL> select  year, week,sale,
2      sum (sale) over(
3            partition by product, country, region, year
4            order by week
5            rows between unbounded preceding and current row
6             ) running_sum_ytd
7    from sales_fact
8    where country in ('Australia')  and product ='Xtend Memory'
9*   order by product, country,year, week ;

 YEAR WEEK       SALE  RUNNING_SUM_YTD
----- ---- ---------- ----------------
...
 2000   49      42.38          3450.85
 2000   50      21.19          3472.04
 2000   52      67.45          3539.49
 2001    1      92.26            92.26
 2001    2     118.38           210.64
 2001    3      47.24           257.88
 2001    4     256.70           514.58
...

Notice in the output of Listing 8-2 that the column running_sum_ytd holds the output of the sum function in analytic mode. The column value resets at the onset of the new year 2001 because year is also a partitioning column, so a new partition starts with each new year.

When a new year begins, the window slides to the next data partition and the sum function begins aggregating from week 1. Implementing this functionality with a conventional SQL statement requires multiple self-joins and/or costly column-level subqueries.

Aggregate Function over an Entire Partition

In some cases, analytic functions might need to be applied over all rows in a given data partition. For example, computing the maximum value of the sale column for the entire year requires a window encompassing every row in the data partition. In Listing 8-3, I use the SQL clause rows between unbounded preceding and unbounded following to specify that the MAX function applies to all rows in a data partition. The key difference between Listing 8-2 and Listing 8-3 is that the clause unbounded following specifies the window size to include all rows in a data partition.

Listing 8-3.  Maximum of sale Column

SQL> select  year, week,sale,
2      max (sale) over(
3            partition by product, country, region ,year
4            order by week
5            rows between unbounded preceding and unbounded following
6             ) Max_sale
7    from sales_fact
8    where country in ('Australia')  and product ='Xtend Memory'
9*   order by product, country,year, week ;

 YEAR WEEK       SALE        MAX_SALE
----- ---- ---------- ---------------
...
 2000   44     135.24          246.74
 2000   45      67.62          246.74
 2000   46     246.74          246.74
...
 2000   50      21.19          246.74
 2000   52      67.45          246.74
 2001    1      92.26          278.44
 2001    2     118.38          278.44
...

Granular Window Specifications

Window specifications can be more granular, too. Let’s say that we want to calculate the maximum of the sale column for a five-week window encompassing two weeks prior to the current week, the current week, and the two weeks following the current week. We do this by using the clause rows between 2 preceding and 2 following.

In Listing 8-4, for week 36, the maximum value for the sale column during the five-week window is 178.52. For week 37, the maximum value for the sale column during the five-week window is 118.41. You can see these values in the MAX_WEEKS_5 column of the output.

Listing 8-4.  Maximum of sale Column for a Five-Week Window

SQL> select  year, week,sale,
2      max (sale) over(
3            partition by product, country, region ,year
4            order by week
5            rows between 2 preceding and 2 following
6             ) max_weeks_5
7    from sales_fact
8    where country in ('Australia')  and product ='Xtend Memory'
9*   order by product, country,year, week ;

YEAR WEEK       SALE MAX_WEEKS_5
---- ---- ---------- -----------
...
2000   34     178.52      178.52
2000   35      78.82      178.52
2000   36     118.41      178.52
2000   37     117.96      118.41
2000   38      79.36      118.41
...

Default Window Specification

When an order-by-clause is present but a windowing clause is not specified explicitly, the default window is range between unbounded preceding and current row (note that the default is a range window, not a rows window). It is a good practice to specify the windowing clause explicitly to avoid ambiguities.
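The rows and range variants differ when the order by key contains duplicate values: with range, every row tied with the current row on the sort key falls inside the window, whereas with rows the window ends at the current row itself. The following sketch contrasts the two; on tied week values, running_range reports the same total for all tied rows, and running_rows does not:

select year, week, sale,
       sum(sale) over (order by week) running_range,  -- default window: range
       sum(sale) over (order by week
                       rows between unbounded preceding and current row
                      ) running_rows
from sales_fact
where country = 'Australia' and product = 'Xtend Memory';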

lead and lag

lag and lead functions provide interrow referencing ability. lag provides the ability to access a prior row in the result set. The lead function allows access to a later row in the result set.

In the retail industry, same-store sales is a metric calculated to measure an outlet’s performance, usually by comparing sales data with that of the same quarter in the prior year. With a normalized data model, this metric cannot be computed from a single row; it requires access to another row, because the sale column values for the current and prior years are stored in different rows. Using the powerful interrow referencing ability of the lead and lag functions, this metric can be calculated with ease.

Another example is percentage increase or decrease calculations requiring access to the prior or following row. This calculation can be written optimally using lead and lag functions, too.
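As a sketch of that calculation, the following query computes the week-over-week percentage change in the sale column of the sales_fact table; the nullif call guards against division by zero:

select year, week, sale,
       round(100 * (sale - lag(sale) over (order by year, week))
                 / nullif(lag(sale) over (order by year, week), 0), 2) pct_change
from sales_fact
where country = 'Australia' and product = 'Xtend Memory'
order by year, week;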

Syntax and Ordering

As discussed earlier, data in analytic SQL is partitioned on a partitioning column. Fetching a prior row is a position-dependent operation, and the order of the rows in a data partition is important in maintaining logical consistency. Within a data partition, rows are sorted with an order-by-clause to control the position of a row in the result set. Syntax for the lag function is as follows:

lag (expression, offset, default ) over (partition-clause order-by-clause)

lead and lag do not support the windowing clause. Only the partition by and order by clauses are supported with these two functions.

Example 1: Returning a Value from a Prior Row

Let’s say that you need to fetch the sales quantity for the current week and the prior week in the same row. This requirement indicates an interrow reference, which in turn necessitates a self-join, or perhaps a column-list subquery, in a nonanalytic SQL statement. However, the lag function provides this interrow reference without requiring that access step.

Listing 8-5 uses lag(sale,1,sale) to retrieve the sale column value from one row prior in the result set. The clause order by year, week specifies the column sort order in each data partition. Because the rows are ordered by the columns year and week, the function lag(sale,1,sale) retrieves the sale column value from the prior row, which is the sale column value from the prior week (assuming no gaps in the week column). For example, refer to the row where year is equal to 1998 and week is equal to 3. For that row, the lag function is retrieving the sale column value from the prior row where year is equal to 1998 and week is equal to 2. Notice that the analytic function does not specify the partitioning column in the clause lag(sale,1,sale). It is referring implicitly to the current partition.

Listing 8-5.  lag Function

col product format A30
col country format A10
col region format A10
col year format 9999
col week format 99
col sale format 999999.99
col receipts format 999999.99
set lines 120 pages 100

SQL> select  year, week,sale,
2    lag(sale, 1,sale) over(
3          partition by product, country, region
4          order by year, week
5     ) prior_wk_sales
6  from sales_fact
7  where country in ('Australia')  and product ='Xtend Memory'
8  order by product, country,year, week ;

YEAR WEEK       SALE PRIOR_WK_SALES
---- ---- ---------- --------------
1998    1      58.15          58.15
1998    2      29.39          58.15
1998    3      29.49          29.39
...
1998   52      86.38          58.32
1999    1      53.52          86.38
1999    3      94.60          53.52

The third argument in the lag function specifies a default value and it is optional. If the analytic function refers to a nonexistent row, then a null is returned. This is the default behavior, but you can modify it by specifying some other return value in the third argument. For example, consider the row with year equal to 1998 and week equal to 1. This is the first row in its data partition. In this row’s case, the lag function accesses a nonexistent prior row. Because the third argument to lag is sale, the lag function returns the current row’s sale value when the referenced row does not exist.

Understanding That Offset Is in Rows

It is possible to access any row within a data partition by specifying a different offset. In Listing 8-6, the lag function is using an offset of ten to access the tenth prior row. Output shows that at the row with year equal to 2001 and week equal to 52, the lag function is accessing the tenth prior row in the result set, which is for week equal to 40. Notice that lag (sale,10,sale) is not accessing week equal to 42 by subtracting ten from the current week column value of 52; rather, this clause is accessing the tenth prior row in the partition. In this case, the tenth prior row is the row with a week column value equal to 40.

Listing 8-6.  lag Function with Offset of Ten

SQL> select  year, week,sale,
2    lag(sale, 10,sale) over(
3          partition by product, country, region
4          order by year, week
5     ) prior_wk_sales_10
6  from sales_fact
7  where country in ('Australia')  and product ='Xtend Memory'
8  order by product, country,year, week ;

 YEAR WEEK       SALE PRIOR_WK_SALES_10
----- ---- ---------- -----------------
2001   38     139.00            139.28
2001   39     115.57             94.48
2001   40      45.18            116.85
2001   41      67.19            162.91
...
2001   49      45.26             93.16
2001   50      23.14               139
2001   51     114.82            115.57
2001   52      23.14             45.18

This distinction is tricky because data gaps usually go undetected in the development environment; in production, the problem manifests itself as a bug. If there are gaps in the data, as in this example, you have a few options: populate dummy values for the missing rows or use the MODEL clause discussed in Chapter 9.
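One way to populate dummy rows for the missing weeks—shown here only as an illustrative sketch—is a partitioned outer join against a generated list of weeks, which densifies the data so that an offset of ten rows reliably means ten weeks:

with all_weeks as (
      select level week from dual connect by level <= 52
     ),
     f as (
      select year, week, sale
      from sales_fact
      where country = 'Australia' and product = 'Xtend Memory'
     )
select f.year, w.week, nvl(f.sale, 0) sale,
       lag(nvl(f.sale, 0), 10, 0) over (
             partition by f.year order by w.week
            ) prior_wk_sales_10
from f partition by (f.year)
     right outer join all_weeks w on (f.week = w.week)
order by f.year, w.week;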

Example 2: Returning a Value from an Upcoming Row

The lead function is similar to the lag function except that the lead function accesses later rows in the ordered result set. For example, in Listing 8-7, the clause lead(sale, 1,sale) accesses a later row in the ordered result set.

Listing 8-7.  lead Function

SQL> select  year, week,sale,
  2      lead(sale, 1,sale) over(
  3            partition by product, country, region
  4            order by year, week
  5       ) next_wk_sales
  6    from sales_fact
  7    where country in ('Australia')  and product ='Xtend Memory'
  8*   order by product, country,year, week ;

YEAR WEEK       SALE  NEXT_WK_SALES
---- ---- ---------- --------------
2000   31      44.78         134.11
2000   33     134.11         178.52
2000   34     178.52          78.82
2000   35      78.82         118.41
...

The partition by clause can be used to specify different partition boundaries and the order by clause can be used to alter the sorting order within a partition. With an effective choice of partitioning and order by columns, any row in a result set can be accessed.

first_value and last_value

The first_value and last_value functions are useful in calculating the maximum and minimum values in an ordered result set. The first_value function retrieves the column value from the first row in a window of rows; the last_value function retrieves the column value from the last row in that window. Queries generating reports such as Top Store by Sales for a product and market segment are classic use cases for these analytic functions. Usually, store details and sales amounts are shown in the report together for the store with the maximum value in the sale column. With the proper partition clause specification, the first_value function can be used to retrieve these values in an optimal manner. Essentially, any report calculating maximum and minimum values can use the first_value and last_value functions.

The power of first_value and last_value functions emanates from the support for partitioning and windowing clauses. Multilevel aggregation can be implemented using the partitioning clause concisely. For example, if the goal is to fetch the rows with maximum or minimum column values aggregated at different levels such as country, product, or region from the sales table, then implementing the multilevel aggregation is akin to deciding the columns to include in the partitioning clause.

Using the windowing clause, you can define a sliding dynamic window in which these functions operate. This window can be defined to include just a few prior and/or later rows, or every row in a data partition. Specifically, queries computing metrics such as “maximum sales so far” can be implemented using these functions. Because the window can be a sliding window, these two functions can be used to answer questions such as “Which store had maximum sales in the past three weeks?” or “Which product had maximum returns in the last two weeks?”
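For instance, the following sketch computes a trailing three-week maximum against the sales_fact table. Note that it uses the max aggregate in analytic mode rather than first_value: because the rows here must be ordered by week for the window to mean “the past three weeks,” max is the more direct tool for the value itself.

select year, week, sale,
       max(sale) over (
             partition by product, country, region, year
             order by week
             rows between 2 preceding and current row
            ) max_sale_3wk
from sales_fact
where country = 'Australia' and product = 'Xtend Memory'
order by year, week;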

Syntax for the first_value function is as follows:

first_value(expression) over (partition-clause order-by-clause windowing-clause)

In Listing 8-8, the clause partition by product, country, region, year partitions the rows using the specified partitioning columns. The rows are sorted in a descending order on the sale column values by the clause order by sale desc.

The top and bottom boundary condition of the window is specified by the clause rows between unbounded preceding and unbounded following. In this example, we retrieve the top sales value at a level of product, country, region, and year columns, and hence the window includes all rows in a data partition.

Operationally, data are sorted by the product, country, region, year, and sale columns. Sorting order for the sale column is in descending order, though. The first row in every data partition has the highest value for the sale column because of the descending sort order specification of the sale column. So, the first_value(sale) clause fetches the maximum sale column value in the data partition.

In addition to fetching the maximum column value, you might want to fetch other columns from that top row. For example, you might want to fetch the year and week column values in which the maximum sale occurred. In a conventional SQL statement, implementing this results in a join and subquery. But, with analytic functions, it is simpler to fetch other attributes from that top row, too. Hence, the first_value(week) clause, with other parts of the analytic function kept the same as first_value(sale), fetches the week column value associated with that top row.

Example: first_value to Calculate Maximum

In the output of Listing 8-8, the top_sale_value column is an aggregated column that calculates the maximum value of the sale column. The sale column is a nonaggregated column. Both aggregated and nonaggregated column values are fetched in the same row without a self-join.

Listing 8-8.  first_value Function

SQL> select  year, week,sale,
 2      first_value (sale) over(
 3            partition by product, country, region ,year
 4            order by sale desc
 5            rows between unbounded preceding and unbounded following
 6       ) top_sale_value,
 7      first_value (week) over(
 8            partition by product, country, region ,year
 9            order by sale desc
10            rows between unbounded preceding and unbounded following
11       ) top_sale_week
12    from sales_fact
13    where country in ('Australia')  and product ='Xtend Memory'
14*   order by product, country,year, week;

 YEAR WEEK       SALE TOP_SALE_VALUE TOP_SALE_WEEK
----- ---- ---------- -------------- -------------
 2000   49      42.38         246.74            46
 2000   50      21.19         246.74            46
 2000   52      67.45         246.74            46
 2001    1      92.26         278.44            16
 2001    2     118.38         278.44            16
 2001    3      47.24         278.44            16
 2001    4     256.70         278.44            16

Aggregation can be performed at a different level with a different partitioning clause. For example, to compute the maximum value at the product, country, and region levels, the partitioning clause is partition by product, country, region.

Example: last_value to Calculate Minimum

Similarly, you can use the last_value function to calculate minimum or maximum values. The last_value function fetches the column values from the last row in a window of rows. For example, if you want to calculate the minimum sale column value, use the combination of the clause last_value(sale) and the clause order by sale desc sorting order. The clause order by sale desc sorts the rows by sale column values in a descending order, and the clause last_value(sale) fetches the sale column value from the last row. Listing 8-9 provides an example for last_value function usage.

Listing 8-9.  last_value Function

SQL> select  year, week,sale,
 2      last_value (sale) over(
 3            partition by product, country, region ,year
 4            order by sale desc
 5            rows between unbounded preceding and unbounded following
 6           ) low_sale
 7    from sales_fact
 8    where country in ('Australia')  and product ='Xtend Memory'
 9*   order by product, country,year, week ;

YEAR WEEK       SALE    LOW_SALE
----- ---- ---------- ------------
...
2000   49      42.38      19.84
2000   50      21.19      19.84
2000   52      67.45      19.84
2001    1      92.26      22.37
2001    2     118.38      22.37
2001    3      47.24      22.37
...

Granular control of window specification can be used effectively to produce complex reports. For example, the clause rows between 10 preceding and 10 following specifies a window of 21 rows to calculate a maximum or minimum value.

Null values are handled by the clause [RESPECT NULLS|IGNORE NULLS]. The clause RESPECT NULLS is the default, and the first_value function returns the null value if the column value in the first row is null. If the clause IGNORE NULLS is specified, then the first_value function returns the first nonnull column value in a window of rows.
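As a quick syntax sketch (the sales_fact data contains no null sale values, so this is purely illustrative):

select year, week, sale,
       first_value(sale ignore nulls) over (
             partition by product, country, region, year
             order by week
             rows between unbounded preceding and current row
            ) first_nonnull_sale
from sales_fact
where country = 'Australia' and product = 'Xtend Memory';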

Other Analytic Functions

Oracle Database implements a great many other analytic functions. Some of the more commonly used ones are described in the following subsections; these are the functions that should be on your short list of good functions to know.

nth_value

Although the first_value and last_value functions provide the ability to fetch the first or last row, respectively, in an ordered result set, it is not quite straightforward to fetch any arbitrary row with these functions. In fact, fetching the second row using either the first_value or last_value function is a complex task.

Oracle Database version 11gR2 introduced another analytic function—nth_value—which is a generalization of first_value and last_value functions. Using the nth_value function, you can fetch any row in the ordered result set, not just first or last values. The first_value function can be written as nth_value (column_name, 1).

In statistical analysis, outliers can occur in the head or tail of the result set. In some cases, it might be important to ignore the first_value or last_value in an ordered result set and to fetch the value from the next row. The second value in a result set can be fetched using the nth_value function by passing 2 as the offset.

The nth_value function also supports windowing clauses. As discussed earlier, a windowing clause provides the ability to implement a sliding dynamic window. This, in turn, allows you to write simple queries to answer complex questions such as “Which store had the second highest sales for a product in a span of 12 weeks?”

Syntax for the nth_value function is as follows:

NTH_VALUE (measure,  n )  [ FROM FIRST| FROM LAST] [RESPECT NULLS|IGNORE NULLS]
OVER (partitioning-clause order-by-clause windowing-clause)

The first argument to the nth_value function is the column name; the second argument is the offset in a window. For example, the clause nth_value(sale, 2) accesses the second row in a window. In Listing 8-10, the SQL statement is fetching the week column value with the second highest sale column value at product, country, region, and year levels. The second row in this result set is the row with the second highest value for the sale column because the rows are sorted by sale column in descending order. The clause partition by product, country, region, year specifies the partitioning columns.

Listing 8-10.  nth_value

SQL> select year, week, sale,
  2     nth_value ( sale, 2) over (
  3      partition by product,country, region, year
  4      order by sale desc
  5      rows between unbounded preceding and unbounded following
  6     ) sale_2nd_top
  7  from sales_fact
  8  where country in ('Australia') and product='Xtend Memory'
  9* order by product, country , year, week ;

      YEAR       WEEK       SALE SALE_2ND_TOP
---------- ---------- ---------- ------------
...
      2000         49      42.38       187.48
      2000         50      21.19       187.48
      2000         52      67.45       187.48
      2001          1      92.26       256.7
      2001          2     118.38       256.7
      2001          3      47.24       256.7
...

For the nth_value function, clauses FROM FIRST and RESPECT NULLS are the defaults. If the clause FROM FIRST is specified, then the nth_value function finds the offset row from the beginning of the window. The clause RESPECT NULLS returns null values if the column contains null values in the offset row.

With an ability to specify a windowing clause, the nth_value function is quite powerful in accessing an arbitrary row in the result set or in a partition.

rank

The rank function returns the position of a row, as a number, in an ordered set of rows. If the rows are sorted by columns, then the position of a row in a window reflects the rank of its value in that window of rows. In the case of a tie, rows with equal values have the same rank, and the subsequent ranks are skipped, leaving gaps in the rank values. This means that two rows can have the same rank, and the ranks are not necessarily consecutive.

The rank function is useful to compute the top or bottom n rows. For example, a query to find the top ten weeks by sales quantity is a typical retail industry data warehouse query. Such a query greatly benefits from the use of rank. If you need to write any query that computes top or bottom n elements of a result set, use the rank function or dense_rank function.

The rank function is also useful in finding inner n rows. For example, if the goal is to fetch rows from 21 through 40 sorted by sales, then you can use the rank function in a subquery with a predicate between 21 and 40 to filter 20 inner rows.
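A minimal sketch of that inner-n pattern follows: the inline view computes the rank, and the outer predicate keeps only rows ranked 21 through 40.

select * from (
      select year, week, sale,
             rank() over (order by sale desc) sales_rank
      from sales_fact
      where country = 'Australia' and product = 'Xtend Memory'
     )
where sales_rank between 21 and 40
order by sales_rank;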

Syntax for the rank function is as follows:

rank() over (partition-clause order-by-clause)

In Listing 8-11, you calculate the top ten rows by sale for product, country, region, and year column values. The clause partition by product, country, region, year specifies the partitioning columns, and the rows are sorted by the sale column in descending order within each data partition using the order by sale desc clause. The rank function calculates the rank of each row in its data partition. This SQL is wrapped inside an inline view, and then a predicate of sales_rank <= 10 is applied to fetch the top ten weeks by the sale column. Also, notice that the windowing clause is not applicable to the rank function; it operates on all rows in a data partition.

Listing 8-11.  Use of rank Function: Top Ten Sales Weeks

SQL> select * from (
 2   select  year, week,sale,
 3     rank() over(
 4            partition by product, country, region ,year
 5            order by sale desc
 6             ) sales_rank
 7    from sales_fact
 8    where country in ('Australia')  and product ='Xtend Memory'
 9    order by product, country,year, week
10  ) where sales_rank<=10
11* order by 1,4 ;

YEAR WEEK       SALE SALES_RANK
---- ---- ---------- ----------
...
2001   16     278.44          1
2001    4     256.70          2
2001   21     233.70          3
2001   48     182.96          4
2001   30     162.91          5
2001   14     162.91          5
2001   22     141.78          7
2001   43     139.58          8
...

The rank function assigns the same rank in the case of ties. In the output of Listing 8-11, notice that there are two rows with a sales rank of 5 because the sale column value is 162.91 for both rows. Also notice that the next rank is 7, not 6. In a nutshell, the rank function skips ranks after ties: the number of ranks skipped is one less than the number of tied rows. If three rows had tied for rank 5, the next rank would be 8.

dense_rank

dense_rank is a variant of the rank function. The difference between the rank and dense_rank functions is that the dense_rank function does not skip the ranks in the case of ties. As discussed earlier, the dense_rank function is useful in finding top, bottom, or inner n rows in a result set. In Listing 8-12, the dense_rank function is used instead of the rank function. Note that the rank for week equal to 22 is 6 in Listing 8-12 and 7 in Listing 8-11.

Listing 8-12.  dense_rank Function

SQL> select * from (
 2   select  year, week,sale,
 3     dense_rank() over(
 4            partition by product, country, region ,year
 5            order by sale desc
 6             ) sales_rank
 7    from sales_fact
 8    where country in ('Australia')  and product ='Xtend Memory'
 9    order by product, country,year, week
10  ) where sales_rank<=10
11* order by 1,4 ;

 YEAR WEEK       SALE SALES_RANK
----- ---- ---------- ----------
 2001   16     278.44          1
 2001    4     256.70          2
 2001   21     233.70          3
 2001   48     182.96          4
 2001   14     162.91          5
 2001   30     162.91          5
 2001   22     141.78          6

The dense_rank function is useful in queries in which the ranks need to be consecutive. For example, ranks may not be skipped in a query to compute the top ten students in a class roster. On the other hand, the rank function is useful when ranks need not be consecutive.

Sort order for nulls can be controlled by the NULLS FIRST or NULLS LAST clause in the dense_rank function. NULLS LAST is the default for ascending sort order; NULLS FIRST is the default for descending sort order. In Listing 8-12, descending sort order is used, so the default NULLS FIRST behavior is in effect; rows with null values would have a rank of 1 in this case.

Another useful way to use dense_rank is with the FIRST and LAST functions. These functions operate as both aggregate and analytic functions on the set of rows that rank as the FIRST or LAST with respect to a given sort specification. So, when you need a value from the first or last row of an ordered group, but the needed value is not the sort key, the FIRST and LAST functions eliminate the need to join the table back to itself to determine the correct value. The syntax is as follows:

Aggregate syntax:
aggregate function KEEP
(dense_rank [FIRST | LAST] ORDER BY expression [DESC | ASC] NULLS [FIRST | LAST])

Analytic syntax:
aggregate function KEEP
(dense_rank [FIRST | LAST] ORDER BY expression [DESC | ASC] NULLS [FIRST | LAST])
OVER (partition-clause)

Note how the dense_rank function basically acts as a modifier to the specified aggregate function (MIN, MAX, SUM, AVG, COUNT, VARIANCE, or STDDEV). The KEEP keyword is for clarity and qualifies the aggregate function, indicating that only the first or last values of the aggregate function will be returned. When used in this scenario, dense_rank indicates that Oracle aggregates over only those rows with the minimum (FIRST) or the maximum (LAST) dense_rank. Listing 8-13 provides a good example of how to use this function as demonstrated in the pln.sql script used earlier in the book.

Listing 8-13.  dense_rank Function Used within the FIRST|LAST KEEP function

SQL>get pln.sql
  1  select xplan.*
  2  from
  3  (
  4  select max(sql_id) keep
  5         (dense_rank last order by last_active_time) sql_id
  6       , max(child_number) keep
  7         (dense_rank last order by last_active_time) child_number
  8  from v$sql
  9  where upper(sql_text) like '%&1%'
 10  and upper(sql_text) not like '%FROM V$SQL WHERE UPPER(SQL_TEXT) LIKE %'
 11  ) sqlinfo,
 12  table(DBMS_XPLAN.DISPLAY_CURSOR(sqlinfo.sql_id, sqlinfo.child_number,'ALLSTATS LAST')) xplan ;

In this SQL, the desired result is to return SQL_ID and CHILD_NUMBER from V$SQL for the most recently executed SQL statement matching the specified SQL_TEXT input string (&1). The dense_rank function ensures the last statement executed based on LAST_ACTIVE_TIME is returned. Without this function, we have to use a LAST_VALUE function in an inline view and execute a MAX aggregate on the return set of that view by grouping by SQL_ID and CHILD_NUMBER. In other words, we have to add an extra step. This method is actually a shorter and more efficient way of getting the answer we want.

row_number

The row_number function assigns a unique number for each row in the ordered result set. If the partitioning clause is specified, then each row is assigned a number unique within a data partition, based on its position in the sort order in that partition. If the partitioning clause is not specified, then each row in the result set is assigned a unique number.

The row_number function is also useful to fetch top, bottom, or inner n queries, similar to the rank and dense_rank functions. Even though the rank, dense_rank, and row_number functions have similar functionality, there are subtle differences among them. The chief one is how ties are handled: row_number assigns a unique number arbitrarily among tied rows, whereas rank and dense_rank assign them equal ranks. Like rank and dense_rank, the row_number function does not allow a windowing clause.

Syntax for the row_number function is as follows:

row_number() over (partition-clause order-by-clause)

The row_number function is a nondeterministic function. The value of the row_number function is undetermined if two rows have the same value in a data partition. For example, in Listing 8-14, the rows with week column values of 19, 8, 12, and 4 have the same sale column value of 46.54. The row_number function returns values of 31, 32, 34, and 33, respectively, for these rows in the example output. But the result could just as easily be 34, 31, 32, 33 or 32, 34, 31, 33. In fact, you might get different results with each execution of the query. In contrast, the rank and dense_rank functions are deterministic and always return consistent values if a query is reexecuted.

Listing 8-14.  row_number Function

SQL> select  year, week,sale,
 2     row_number() over(
 3            partition by product, country, region ,year
 4            order by sale desc
 5             ) sales_rn,
 6    rank() over(
 7            partition by product, country, region ,year
 8            order by sale desc
 9             ) sales_rank
10    from sales_fact
11    where country in ('Australia')  and product ='Xtend Memory'
12*   order by product, country,year,sales_rank ;

 YEAR WEEK       SALE   SALES_RN SALES_RANK
----- ---- ---------- ---------- ----------
...
 2000   19      46.54         31         31
 2000    8      46.54         32         31
 2000   12      46.54         34         31
 2000    4      46.54         33         31
...
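If you need repeatable results from row_number, make the sort specification unique by appending a tiebreaker column. Here is a sketch based on Listing 8-14:

select year, week, sale,
       row_number() over (
             partition by product, country, region, year
             order by sale desc, week   -- week breaks ties deterministically
            ) sales_rn
from sales_fact
where country in ('Australia') and product = 'Xtend Memory'
order by product, country, year, sales_rn;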

ratio_to_report

The analytic function ratio_to_report calculates the ratio of a value to the sum of values in the data partition. If the partitioning clause is not specified, this function calculates the ratio of a value to the sum of values in the whole result set. This analytic function is very useful in calculating ratios at various levels without a need for self-joins.

ratio_to_report is useful in computing the percentage of a value compared with the total value in a report. For example, consider a sales report for a product in a retail chain. Each outlet in the retail chain contributes to the total sum of sales computed for that product, and knowing what percentage of sales each outlet generates is quite useful for market trend analysis. ratio_to_report allows you to compute that percentage easily. Furthermore, this ratio can be calculated at various levels, such as district, region, and country. Essentially, data can be sliced and diced in various ways for market trend analysis.

In Listing 8-15, the SQL statement computes two ratios: sales_ratio_yr is computed at product, country, region, and year levels, and the ratio sales_ratio_prod is computed at product, country, and region levels. The ratio_to_report function returns a ratio and it is multiplied by 100 to compute a percentage.

Listing 8-15.  ratio_to_report Function

SQL> select  year, week,sale,
 2  trunc(100*
 3         ratio_to_report(sale) over(partition by product, country, region ,year)
 4        ,2) sales_yr,
 5  trunc(100*
 6         ratio_to_report(sale) over(partition by product, country, region)
 7        ,2) sales_prod
 8  from sales_fact
 9  where country in ('Australia')  and product ='Xtend Memory'
10  order by product, country,year, week ;

      YEAR       WEEK       SALE   SALES_YR SALES_PROD
---------- ---------- ---------- ---------- ----------
      1998          1      58.15       2.26        .43
      1998          2      29.39       1.14        .21
      1998          3      29.49       1.15        .22
      1998          4      29.49       1.15        .22
      1998          5       29.8       1.16        .22
...
      2001         48     182.96       3.96       1.36
      2001         49      45.26        .98        .33
      2001         50      23.14         .5        .17
      2001         51     114.82       2.48        .85
      2001         52      23.14         .5        .17

159 rows selected.

The ratio_to_report(sale) over( partition by product, country, region, year) clause calculates the ratio of the sale column value to the sum of sale column values in a data partition, partitioned by the columns product, country, region, and year. The next clause ratio_to_report(sale) over( partition by product, country, region) is different because the year column is not included in the partitioning columns. So, the ratio is calculated for all years.

The ratio_to_report function returns a null value if the expression or column passed to it is null for the current row. Null values in other rows of the data partition are simply ignored by the underlying sum, just as with aggregate functions.

percent_rank

The percent_rank function returns the rank of a value in a data partition, expressed as a fraction between 0 and 1. percent_rank is calculated as (rank – 1)/(n – 1), where n is the number of rows in the data partition if the partitioning clause is specified, or the total number of rows in the result set if the partitioning clause is not specified. For example, in a 45-row partition, the row ranked 2 has a percent_rank of (2 – 1)/(45 – 1) ≈ 0.0227. The percent_rank function is useful to compute the relative standing of a value in a result set as a percentile.

This rank can be calculated relative to a partition or the whole result set. For example, computing the sales percentile of a retail outlet in a district or region helps find the top-performing outlets or the worst-performing outlets.

In Listing 8-16, I fetch the rows falling in the top 50th percentile of sale values by year using the percent_rank function. The clause percent_rank() over(partition by product, country, region, year order by sale desc) calculates the percent rank of the sale column in a data partition defined by the partitioning columns product, country, region, and year. Rows are ordered by the sale column in descending order. Function output is multiplied by 100 to compute a percentage.

Listing 8-16.  percent_rank Function

SQL> select * from (
 2  select  year, week,sale,
 3     100 * percent_rank() over(
 4            partition by product, country, region , year
 5            order by sale desc
 6            ) pr
 7    from sales_fact
 8    where country in ('Australia')  and product ='Xtend Memory'
 9  ) where pr <50
10* order by year, sale desc ;

 YEAR WEEK       SALE      PR
----- ---- ---------- -------
2001   16     278.44     .00
2001    4     256.70    2.27
2001   21     233.70    4.55
2001   48     182.96    6.82
...

percentile_cont

The percentile_cont function is useful for computing interpolated values, such as the median household income per region or city. The percentile_cont function takes a probability value between 0 and 1 and returns an interpolated percentile value: the value that would have that percent_rank with respect to the sort specification. In effect, the percentile_cont function performs the inverse of the percent_rank function, and it is easier to understand percentile_cont in conjunction with the output of percent_rank.

The percentile_cont function retrieves the column value matching (or interpolated from) the percent_rank passed as its argument. For example, the clause percentile_cont(0.25) retrieves the value that has a percent_rank of 0.25, assuming matching sort order for the two functions. Another example is computing the median household income in a city or region. The median value has a percent_rank of 0.5 by the definition of median, so the clause percentile_cont(0.5) returns the median value. In fact, the median function is a specific case of the percentile_cont function with an argument of 0.5.

Nulls are ignored by the function. This function does not support windowing clauses either.

Syntax for the percentile_cont function is as follows:

percentile_cont(expr) within group (order-by-clause) over (partition-clause)

The syntax for the percentile_cont function is slightly different from the analytic functions discussed so far. A new clause within group (order by sale desc) replaces the order by clause, and it is functionally the same as specifying an order by clause. In Listing 8-17, the clause percentile_cont (0.5) within group (order by sale desc) over( partition by product, country, region , year) calls the percentile_cont function and passes a probability value of 0.5. Sort order is defined by the clause within group (order by sale desc). The partition by clause over( partition by product, country, region , year) specifies the partitioning columns.

Listing 8-17 shows the output of percent_rank in a side-by-side comparison with that from percentile_cont, with a similar partition by clause and sort specification. Notice that for the column values year equal to 2001 and week equal to 5, the sale column value is 93.44 and the percent_rank of that value is 0.5. Essentially, a value of 93.44 occurs with a percent_rank of 0.5 in the descending order of the sale column values in that data partition. In a nutshell, the value 93.44 is the median value, and thus its percent_rank is 0.5. Hence, the percentile_cont function with an argument of 0.5 returns a value of 93.44.

Listing 8-17.  The percentile_cont Function

SQL> select  year, week,sale,
  2     percentile_cont (0.5)  within group
  3       (order by sale desc)
  4       over( partition by product, country, region , year ) pc,
  5     percent_rank () over (
  6            partition by product, country, region , year
  7            order by sale desc ) pr
  8    from sales_fact
  9*   where country in ('Australia')  and product ='Xtend Memory' ;

      YEAR       WEEK       SALE         PC         PR
---------- ---------- ---------- ---------- ----------
...
      2000         28      88.96      79.09 .461538462
      2000         38      79.36      79.09 .487179487
      2000         35      78.82      79.09 .512820513
...
      2001         46      93.58      93.44 .477272727
      2001          5      93.44      93.44         .5
      2001         37      93.16      93.44 .522727273
...
      2001         52      23.14      93.44 .909090909
      2001         50      23.14      93.44 .909090909
      2001          6      22.44      93.44 .954545455
      2001         23      22.38      93.44 .977272727
      2001         18      22.37      93.44          1

159 rows selected.

In addition, note the output row for the column values with year equal to 2000. There is no sale column value with percent_rank matching 0.5 exactly in the data partition. If there is no value matching exactly, then the percentile_cont function computes an interpolated value using the nearest values. Note there is a row in that data partition with a percent_rank of 0.48 for the sale column value of 79.36, and the next row in that sort order has a percent_rank of 0.51 for the sale column value of 78.82. Because the specified percent_rank of 0.5 is between 0.48 and 0.51, the percentile_cont function interpolated these two corresponding sale column values (79.36 and 78.82) and calculated percentile_cont (0.5) as 79.09, an average of the two sale column values. Values are averaged because this function assumes continuous distribution.

Notice that output rows are not sorted in Listing 8-17. The reason for this is that, even though there is an order by clause specified in the analytic function specification (line 3 and line 7), there is no order by clause in the main body of the query. Should you need rows to be sorted, you need to specify sorting order explicitly in the main body of the query as well.

percentile_disc

The percentile_disc function is functionally similar to percentile_cont except that percentile_cont uses a continuous distribution model whereas percentile_disc assumes a discrete distribution model. As discussed earlier, when there is no value matching exactly the specified percent_rank, percentile_cont(0.5) computes an average of the two nearest values. In contrast, the percentile_disc function retrieves the value with a percent_rank just greater than the passed argument in the case of ascending order; in the case of descending order, the percentile_disc function retrieves the value that has a percent_rank just smaller than the passed argument.

In Listing 8-18, the percentile_cont function is replaced by two calls to the percentile_disc function. The first call to the function starting in line 2 specifies descending sort order, and the next call in line 4 specifies no sort order, so it defaults to ascending sort order. In both calls to the percentile_disc function, an argument of 0.5 is passed. Because there is no row with a percent_rank of 0.5, the percentile_disc function with the descending sort order specification returns a value of 79.36 (for the rows from the year 2000, weeks 38 and 35), because this value has a percent_rank of 0.48—just below the specified argument of 0.5. For the ascending order, this function returns a value of 78.82, because this value has a percent_rank of 0.51—just above 0.5.

Listing 8-18.  The percentile_disc Function

SQL> select  year, week,sale,
 2     percentile_disc (0.5)  within group (order by sale desc)
 3       over( partition by product, country, region , year ) pd_desc,
 4     percentile_disc (0.5)  within group (order by sale )
 5       over( partition by product, country, region , year ) pd_asc,
 6     percent_rank () over (
 7            partition by product, country, region , year
 8            order by sale desc ) pr
 9    from sales_fact
10*   where country in ('Australia')  and product ='Xtend Memory' ;

      YEAR       WEEK       SALE    PD_DESC     PD_ASC         PR
---------- ---------- ---------- ---------- ---------- ----------
      1998         48     172.56      58.78      58.32          0
      1998         10     117.76      58.78      58.32 .028571429
      1998         18     117.56      58.78      58.32 .057142857
      1998         23     117.56      58.78      58.32 .057142857
      1998         26     117.56      58.78      58.32 .057142857
...
      2000         28      88.96      79.36      78.82 .461538462
      2000         38      79.36      79.36      78.82 .487179487
      2000         35      78.82      79.36      78.82 .512820513
      2000          7       70.8      79.36      78.82 .538461538
...
      2001         52      23.14      93.44      93.44 .909090909
      2001         50      23.14      93.44      93.44 .909090909
      2001          6      22.44      93.44      93.44 .954545455
      2001         23      22.38      93.44      93.44 .977272727
      2001         18      22.37      93.44      93.44          1

159 rows selected.

NTILE

The NTILE function divides an ordered set of rows in a data partition into a specified number of buckets and assigns a unique group number to each bucket. This function is useful in statistical analysis. For example, if you want to remove the outliers (values that are outside the norm), you can group them in the top or bottom buckets and eliminate those values from the statistical analysis. Oracle Database statistics collection packages also use NTILE functions to calculate histogram boundaries. In statistical terminology, the NTILE function creates equiheight histograms, in which each bucket holds roughly the same number of rows.

The number of buckets is passed as the argument to this analytic function. For example, NTILE(100) groups the rows into 100 buckets, assigning a unique number for each bucket. This function does not support windowing clauses, however.

In Listing 8-19, the data partition is split into ten buckets using the clause NTILE (10). Rows are sorted by the sale column in descending order. The NTILE function groups rows into buckets, with each bucket containing an equal number of rows. Because the rows are sorted by the sale column values in descending order, rows with lower group numbers have higher sale column values. Outliers in the data can be removed easily with this technique.

Listing 8-19.  NTILE Function

 1  select  year, week,sale,
 2     ntile (10) over(
 3            partition by product, country, region , year
 4            order by sale desc
 5            ) group#
 6    from sales_fact
 7*   where country in ('Australia')  and product ='Xtend Memory' ;

      YEAR       WEEK       SALE     GROUP#
---------- ---------- ---------- ----------
...
      2001         16     278.44          1
      2001          4      256.7          1
      2001         21      233.7          1
      2001         48     182.96          1
      2001         14     162.91          1
      2001         30     162.91          2
      2001         22     141.78          2
      2001         43     139.58          2
      2001         25     139.28          2
      2001         38        139          2
      2001         42     136.98          3
      2001         24     136.92          3
      2001          2     118.38          3
      2001         20     118.03          3
      2001         29     116.85          3
      2001         12     116.81          4
      2001         13     116.81          4
      2001         39     115.57          4
      2001         33     115.52          4
      2001         51     114.82          4
      2001         27      94.48          5
      2001         46      93.58          5
      2001          5      93.44          5
      2001         37      93.16          5
      2001          9      92.67          5
      2001          1      92.26          6
      2001         31      92.21          6
      2001         15      91.98          6
      2001         36      91.12          6
      2001         11      71.57          7
      2001          7      69.96          7
      2001         10      69.05          7
      2001         34       68.9          7
      2001         32       68.9          8
      2001         41      67.19          8
      2001          3      47.24          8
      2001          8      46.06          8
      2001         49      45.26          9
      2001         40      45.18          9
      2001         44      23.29          9
      2001         52      23.14          9
      2001         50      23.14         10
      2001          6      22.44         10
      2001         23      22.38         10
      2001         18      22.37         10

159 rows selected.

If the rows cannot be divided equally, there may be a row count difference of at most one between the buckets. In this example, the 45 rows for the year 2001 are divided into ten buckets; each of the first five buckets has five rows, and each of the last five buckets has only four rows.
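
As mentioned at the start of this section, one practical use of NTILE is trimming outliers. The following is a quick sketch rather than one of the numbered listings: it computes an average of the sale column after discarding the top and bottom deciles.

select avg(sale) trimmed_avg
  from (
   select sale,
          ntile (10) over (order by sale) bucket#
     from sales_fact
    where country in ('Australia')  and product ='Xtend Memory'
       )
 where bucket# between 2 and 9;  -- drop buckets 1 and 10 (the extreme values)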

The NTILE function is also useful in real-world applications such as dividing total work among n parallel processes. Let's say you have ten parallel processes: you can divide the total work into ten buckets and assign each bucket to a process, as sketched next.
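
A minimal sketch of that idea follows; here, work_queue is a hypothetical staging table with a rowid column and a worker number, and the bucket number doubles as the worker assignment:

insert into work_queue (rid, worker_id)           -- work_queue is hypothetical
select rowid, ntile (10) over (order by rowid)    -- one bucket per worker
  from sales_fact;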

stddev

The stddev function calculates the standard deviation, defined as the square root of the variance, among a set of rows. If a partitioning clause is specified, the deviation is calculated within each data partition; if no partitioning clause is specified, it is calculated over all rows in the result set.

In Listing 8-20, the clause stddev (sale) calculates the standard deviation on the sale column among the rows in a data partition. The partitioning clause partition by product, country, region, year specifies the partitioning columns. The windowing clause rows between unbounded preceding and unbounded following specifies the window as all rows in that data partition. Essentially, this SQL calculates the standard deviation on the sale column among all rows in a data partition.

Listing 8-20.  stddev Function

SQL> select  year, week,sale,
2     stddev (sale) over(
3            partition by product, country, region , year
4            order by Sale desc
5            rows between unbounded preceding and unbounded following
6            ) stddv
7    from sales_fact
8    where country in ('Australia')  and product ='Xtend Memory'
9* order by year, week ;

      YEAR       WEEK       SALE      STDDV
---------- ---------- ---------- ----------
      1998          1      58.15 33.5281435
      1998          2      29.39 33.5281435
      1998          3      29.49 33.5281435
      1998          4      29.49 33.5281435
      1998          5       29.8 33.5281435
 ...
      2001         48     182.96 59.1063592
      2001         49      45.26 59.1063592
      2001         50      23.14 59.1063592
      2001         51     114.82 59.1063592
      2001         52      23.14 59.1063592

159 rows selected.

Standard deviation can be calculated at a coarser or a finer level by specifying an appropriate partition by clause and windowing clause, as the following sketch shows.
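
For instance, dropping the year column from the partitioning clause of Listing 8-20 computes a coarser deviation across all years; with no order-by or windowing clause, the window defaults to the entire partition:

select year, week, sale,
       stddev (sale) over (
             partition by product, country, region
              ) stddv_all_years
  from sales_fact
 where country in ('Australia')  and product ='Xtend Memory';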

There are various other statistical functions that can be used to calculate statistical metrics; for example, stddev_samp calculates the cumulative sample standard deviation, stddev_pop calculates the population standard deviation, and so forth. A detailed discussion of these functions is outside the scope of this book, but the sketch that follows shows their basic analytic form.
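
A minimal sketch; stddev_pop uses the population formula (divide by n), whereas stddev_samp uses the sample formula (divide by n - 1):

select year, week, sale,
       stddev_pop (sale) over (partition by year) sd_pop,
       stddev_samp (sale) over (partition by year) sd_samp
  from sales_fact
 where country in ('Australia')  and product ='Xtend Memory';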

listagg

Oracle Database version 11gR2 introduced another analytic function, listagg, which is very useful in string manipulation. This function concatenates column values from multiple rows into a single list, with the rows grouped by one or more expressions in the partition-by-clause. For example, if you want to concatenate all the employee names in a department, you can use this function to produce one comma-separated list per department.

Syntax for this function is of the following format:

listagg (string, separator) within group (order-by-clause) over (partition-by-clause)

Syntax for the listagg function uses the clause within group (order-by-clause) to specify sorting order. This clause is similar to the order by clause in other analytic functions. The first argument to this function is the string or column name to concatenate. The second argument is the separator for the values. In Listing 8-21, the partitioning clause is not specified and rows are ordered by the country column in descending order. The output shows that country names are converted to a list separated by commas.

Note that the listagg function does not support windowing clauses.

Listing 8-21.  listagg Function

  1  select listagg (country, ',')
  2    within group (order by country desc)
  3   from (
  4    select distinct country from sales_fact
  5    order by country
  6*  ) ;

LISTAGG(COUNTRY,',')WITHINGROUP(ORDERBYCOUNTRYDESC)
--------------------------------------------------------------
United States of America,United Kingdom,Turkey,Spain,Singapore,
Saudi Arabia,Poland,New Zealand,Japan,Italy,Germany,France,
Denmark,China,Canada,Brazil,Australia,Argentina

One restriction of the listagg function is that its result is constrained to the maximum size of a VARCHAR2 datatype. Beginning in 12c, the maximum size of a VARCHAR2 was increased from 4000 to 32,767 bytes. However, it appears this increase did not affect the maximum size of the listagg result string, as shown in Listing 8-22.

Listing 8-22.  Size Restriction on listagg Result String

SQL>select length(acol) from (
  2  SELECT LISTAGG(object_name) WITHIN GROUP (ORDER BY NULL) acol
  3    FROM all_objects where rownum < 359);

LENGTH(ACOL)
------------
        3975

SQL>select length(acol) from (
  2  SELECT LISTAGG(object_name) WITHIN GROUP (ORDER BY NULL) acol
  3    FROM all_objects where rownum < 360);
  FROM all_objects where rownum < 360)
       *
ERROR at line 3:
ORA-01489: result of string concatenation is too long

Note that when enough rows are returned to make the length of the string exceed 4000 bytes (the pre-12c limit for a VARCHAR2), an ORA-01489 error is raised. I was fully expecting the listagg function to use the new, higher limit, but it appears the old 4000-byte limit is hard coded. This is likely an oversight that should be corrected, because the larger datatype size would make listagg much easier to use for longer string combinations. In the meantime, if your resulting string exceeds the limit, you need an alternative means of producing the final string (such as a collection or a user-defined PL/SQL function).
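
Another widely used alternative is to aggregate into a CLOB using xmlagg, which is not bound by the VARCHAR2 limit. The following is a sketch rather than one of the book's listings; note that xmlelement escapes XML special characters such as & and <, so results containing them may need post-processing:

select rtrim(
         xmlagg( xmlelement(e, object_name || ',')
                 order by object_name
               ).extract('//text()').getclobval()   -- CLOB result, no 4000-byte cap
       , ',') acol
  from all_objects;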

Performance Tuning

Analytic functions are very useful for tuning complex SQL statements. Interrow referencing, aggregation at multiple levels, and nth-row access are a few of the important features analytic functions provide. For example, a typical query that fetches both aggregated and nonaggregated rows must perform a self-join. In data warehouse environments, because of the sheer size of the tables involved, these self-joins can be cost prohibitive.

The efficiency that analytics bring to the table often makes them useful tools in rewriting queries that do not perform well. In turn, however, you can sometimes face the need to tune an analytic function. To this end, there are some useful things to know about analytic functions and execution plans, analytics and predicates, and strategies for indexing.

Execution Plans

Analytic functions introduce a few new operations into the SQL execution plan. The presence of the keywords WINDOW SORT indicates that the SQL statement uses an analytic function. In this section, I review the mechanics of analytic function execution.

Listing 8-23 shows a typical explain plan for a SQL statement using an analytic function. Execution starts at the innermost operation and works its way outward:

  1. Table sales_fact is accessed using a full table scan access path.
  2. Filter predicates on the product, country, region, and year columns are applied to filter the required rows.
  3. Analytic functions are applied over the filtered rows from step 2.
  4. The predicate on the week column is applied after the analytic functions have executed.

Listing 8-23.  Explain Plan

---------------------------------------------------------------------------
| Id  | Operation           | Name           | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                |     5 |   290 |   581   (3)|
|*  1 |  VIEW               | MAX_5_WEEKS_VW |     5 |   290 |   581   (3)|
|   2 |   WINDOW SORT       |                |     5 |   330 |   581   (3)|
|*  3 |    TABLE ACCESS FULL| SALES_FACT     |     5 |   330 |   580   (3)|
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("WEEK"<14)
   3 - filter("PRODUCT"='Xtend Memory' AND "COUNTRY"='Australia' AND
              "REGION"='Australia' AND "YEAR"=2000)

image Note   The cost-based optimizer does not assign or calculate a cost for analytic functions (as of 11gR2, and it is still true in 12c). The cost of the SQL statement is calculated without considering the cost of analytic functions.

Predicates

For better performance, predicates must be applied to the tables as early as possible so that the result set is reduced and the analytic functions operate on relatively few rows. Predicate safety is an important consideration when executing analytic functions, because not all predicates can be applied beforehand.

In Listing 8-24, a view called max_5_weeks_vw is defined, and a SQL statement accesses the view with predicates on the country, product, region, year, and week columns. The execution plan shows that the following filter predicates are applied in step 3:

filter(("PRODUCT"='Xtend Memory' AND "COUNTRY"='Australia' AND "REGION"='Australia' AND "YEAR"=2000))

Listing 8-24.  Predicates

create or replace view max_5_weeks_vw as
   select  country , product, region, year, week,sale,
    max (sale) over(
          partition by product, country, region ,year
          order by year, week
          rows between 2 preceding and 2 following
           ) max_weeks_5
  from sales_fact ;

SQL> select year, week, sale, max_weeks_5 from  max_5_weeks_vw
2    where country in ('Australia')  and product ='Xtend Memory' and
3    region='Australia' and year= 2000 and week <14
4*   order by year, week ;

-------------------------------------------------------
| Id  | Operation           | Name           | E-Rows |
-------------------------------------------------------
|   0 | SELECT STATEMENT    |                |        |
|*  1 |  VIEW               | MAX_5_WEEKS_VW |      5 |
|   2 |   WINDOW SORT       |                |      5 |
|*  3 |    TABLE ACCESS FULL| SALES_FACT     |      5 |
-------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("WEEK"<14)
   3 - filter(("PRODUCT"='Xtend Memory' AND "COUNTRY"='Australia' AND
              "REGION"='Australia' AND "YEAR"=2000))

However, the predicate "WEEK"<14 is not applied in step 3, and is only applied in step 1, indicating that the predicate is applied after executing the analytic functions in step 2’s WINDOW SORT step. All supplied predicates except that on the week column were pushed into the view. Filtering of these predicates then took place before executing the analytic functions.

Generally speaking, predicates on partitioning columns can be pushed safely into the view, so they are applied before the analytic functions execute. Predicates on columns appearing only in the order-by-clause of the analytic function syntax cannot be pushed safely, because interrow references need access to other rows in the same partition even when those rows are not returned in the final result set.
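
To see why pushing the week predicate is unsafe, consider the following sketch, which filters the base table directly so that the predicate is applied before the WINDOW SORT. The max_weeks_5 values for weeks 12 and 13 now differ from those returned by the view, because the week 14 and 15 rows no longer feed their sliding windows:

select year, week, sale,
       max (sale) over(
             partition by product, country, region, year
             order by year, week
             rows between 2 preceding and 2 following
              ) max_weeks_5
  from sales_fact
 where country in ('Australia')  and product ='Xtend Memory'
   and region='Australia' and year= 2000
   and week <14     -- weeks 14 and 15 are filtered out before the analytic runs
 order by year, week ;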

Indexes

A good strategy for index selection is to match the predicates applied at the table access step. As discussed earlier, predicates on partitioning columns are pushed into the view, and these predicates are applied before the analytic functions execute. So, it's probably a better approach to index the partitioning columns if the SQL statements use those predicates.

In Listing 8-25, a new index is added on the country and product columns. Step 4 in the execution plan shows that index-based access is used. The Predicate Information section shows that the predicates on all four partitioning columns are applied at steps 4 and 3, before the analytic function executes. The predicate on the week column, however, is applied much later in the plan, at step 1. So, in this case, adding the week column to the index is not useful, because that predicate is not applied until after the analytic function execution completes.

Listing 8-25.  Predicates and Indexes

create index sales_fact_i1 on  sales_fact( country, product);

SQL> select year, week, sale, max_weeks_5 from  max_5_weeks_vw
2    where country in ('Australia')  and product ='Xtend Memory' and
3    region='Australia' and year= 2000 and week <14
4*   order by year, week ;

------------------------------------------------------------------
| Id  | Operation                     | Name           | E-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                |        |
|*  1 |  VIEW                         | MAX_5_WEEKS_VW |      5 |
|   2 |   WINDOW SORT                 |                |      5 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| SALES_FACT     |      5 |
|*  4 |     INDEX RANGE SCAN          | SALES_FACT_I1  |    147 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("WEEK"<14)
   3 - filter(("REGION"='Australia' AND "YEAR"=2000))
   4 - access("COUNTRY"='Australia' AND "PRODUCT"='Xtend Memory')

Advanced Topics

A few advanced topics concerning analytic functions are worth discussing. In this section, I briefly cover dynamic analytic statements, nesting of analytic functions, parallelism, and PGA sizing.

Dynamic SQL

A common question about analytic SQL statements is "Can a bind variable be used in place of the partitioning or sorting columns?" The answer is no. Static analytic SQL statements cannot change their partitioning or sorting columns; if you want the flexibility to modify them dynamically, you need to use dynamic SQL statements.

If your goal is to modify the partitioning columns dynamically, then consider creating a packaged procedure to capture the logic in the procedure. In Listing 8-26, the procedure analytic_dynamic_prc accepts a string to be used as partitioning columns. A SQL statement is constructed using the arguments passed and is executed dynamically using execute immediate syntax. The result of the analytic statement is fetched into an array and is printed using a call to the dbms_output package.

Listing 8-26.  Dynamic SQL Statement

create or replace procedure
  analytic_dynamic_prc ( part_col_string varchar2, v_country varchar2, v_product varchar2)
is
  type numtab is table of number(18,2) index by binary_integer;
  l_year numtab;
  l_week numtab;
  l_sale numtab;
  l_rank numtab;
  l_sql_string  varchar2(512) ;
begin
 l_sql_string :=
 'select * from (
   select  year, week,sale,
    rank() over(
          partition by ' ||part_col_string ||'
          order by sale desc
           ) sales_rank
   from sales_fact
   where country in (' ||chr(39) || v_country || chr(39) || ' )  and
         product =' || chr(39) || v_product || chr(39) ||
       ' order by product, country, year, week
    ) where sales_rank<=10
    order by 1,4';
 execute immediate l_sql_string bulk collect into  l_year, l_week, l_sale, l_rank;
 for  i  in 1 .. l_year.count
  loop
       dbms_output.put_line ( l_year(i) ||' |' || l_week (i) ||
                            '|'|| l_sale(i) || '|' || l_rank(i) );
  end loop;
 end;
/

exec analytic_dynamic_prc ( 'product, country, region','Australia','Xtend Memory');
...
1998 |48|172.56|9
2000 |46|246.74|3
2000 |21|187.48|5
2000 |43|179.12|7
2000 |34|178.52|8
2001 |16|278.44|1
2001 |4|256.7|2

exec analytic_dynamic_prc ( 'product, country,region, year','Australia','Xtend Memory');

1998 |48|172.56|1
1998 |10|117.76|2
1998 |18|117.56|3
1998 |23|117.56|3
1998 |26|117.56|3
1998 |38|115.84|6
1998 |42|115.84|6
...

In the first call, analytic_dynamic_prc passes the string product, country, region as the first argument and the columns in this list are used as the partitioning columns. The second call to the procedure uses the string product, country, region, year to use a different list of columns for the partitioning clause.

Note that this procedure is given as an example only and should not be construed as production-ready code.

Nesting Analytic Functions

Analytic functions cannot be nested, but a nesting effect can be achieved with the use of subqueries. For example, the clause lag(first_value(column,1),1) is syntactically incorrect. Subqueries can be used to create a nesting effect, as explored next.

Suppose your goal is to fetch the maximum sale column value for the year and the prior year in the same row. If so, then the analytic functions lag and first_value can be used in the subqueries to write a SQL statement. In Listing 8-27, an inner subquery fetches the year and week sale column value in which the maximum sale occurred, in addition to fetching the maximum sale column value for that year. The lag function in the outer query retrieves the prior year maximum sale column value.

Listing 8-27.  Nesting Analytic Functions

select  year, week, top_sale_year,
   lag( top_sale_year) over ( order by year desc) prev_top_sale_yer
from (
 select distinct
    first_value ( year) over (
          partition by product, country, region ,year
          order by sale desc
          rows between unbounded preceding and unbounded following
     ) year,
    first_value ( week) over (
          partition by product, country, region ,year
          order by sale desc
          rows between unbounded preceding and unbounded following
     ) week,
    first_value (sale) over(
          partition by product, country, region ,year
          order by sale desc
          rows between unbounded preceding and unbounded following
     ) top_sale_year
  from sales_fact
  where country in ('Australia')  and product ='Xtend Memory'
)
  order by year, week ;

      YEAR       WEEK TOP_SALE_YEAR PREV_TOP_SALE_YER
---------- ---------- ------------- -----------------
      1998         48        172.56            148.12
      1999         17        148.12            246.74
      2000         46        246.74            278.44
      2001         16        278.44

Notice that the partitioning clause differs between the lag and first_value functions. The first_value function computes the top sale row in a partition specified by the partitioning columns product, country, region, and year, whereas lag fetches the prior year's value by specifying only a sorting clause: order by year desc. With multilevel nesting of analytic functions, complex goals can be implemented concisely.

Parallelism

Analytic functions can be “parallelized” by specifying a parallel hint in the SQL statement or by setting parallelism at the object level. If you have a huge amount of data to process with analytic functions, parallelism is a good choice, and a SQL statement using multilevel nesting can also benefit from it.
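
For example, parallelism might be requested in either of two standard ways; this is a sketch, and the degree of 8 is an arbitrary choice for illustration:

alter table sales_fact parallel 8;    -- object-level default degree

select /*+ parallel(sf 8) */ year, week, sale
  from sales_fact sf
 where country in ('Australia')  and product ='Xtend Memory';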

Listing 8-28 shows the execution plan for the query in Listing 8-27 using parallelism. In the execution plan, there are two WINDOW operations because the SQL statement has nested the lag and first_value analytic functions.

Listing 8-28.  Parallelism

----------------------------------------------------
 Id  | Operation                        | Name
-----------------------------------------------------
   0 | SELECT STATEMENT                 |
   1 |  SORT ORDER BY                   |
   2 |   WINDOW BUFFER                  |
   3 |    PX COORDINATOR                |
   4 |     PX SEND QC (ORDER)           | :TQ10003
   5 |      SORT ORDER BY               |
   6 |       PX RECEIVE                 |
   7 |        PX SEND RANGE             | :TQ10002
   8 |         VIEW                     |
   9 |          HASH UNIQUE             |
  10 |           PX RECEIVE             |
  11 |            PX SEND HASH          | :TQ10001
  12 |             WINDOW SORT          |
  13 |              PX RECEIVE          |
  14 |               PX SEND HASH       | :TQ10000
  15 |                PX BLOCK ITERATOR |
* 16 |                 TABLE ACCESS FULL| SALES_FACT

Optimal distribution of rows among the parallel query (PQ) slaves is critical to maintaining functional correctness, and it is handled automatically by Oracle Database.

PGA Size

Most operations associated with analytic functions are performed in the PGA of the process. Recall that the PGA is a private memory area containing the data and control information for a server process; Oracle reads and writes information in the PGA on behalf of that process. For optimal performance, the memory area must be big enough that analytic functions can execute without spilling to disk. This is analogous to a sort operation: if a sort spills to disk because the memory allocation is too small, sort performance is not optimal. Similarly, the execution performance of analytic functions suffers if the operation spills to disk.

Database initialization parameter PGA_AGGREGATE_TARGET (PGAT) is a target for the cumulative maximum size of the PGA. By default, a serial process can allocate a PGA up to the maximum size of 5 percent of the PGAT value. For parallel processes, the limit is up to 30 percent of PGAT. The PGA_AGGREGATE_LIMIT (PGAL) parameter enables you to set a hard limit on PGA memory usage. If the PGAL value is exceeded, Oracle aborts or terminates the sessions or processes that are consuming the most PGA memory. The default for this parameter is set to the greater of 2GB, 200 percent of PGAT, or 3MB multiplied by the value of the PROCESSES parameter. Regardless, it never exceeds 120 percent of the physical memory size less the total SGA size.

Excessive PGA demand can lead to high rates of spilling to TEMP and even operating system swapping, causing the system to become unresponsive and unstable. It is essential to keep PGAT at a high enough value for analytic functions to perform well.
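
To check whether analytic workareas are spilling, you can query the standard dynamic performance views. A minimal sketch:

select name, value
  from v$pgastat
 where name in ('total PGA allocated', 'cache hit percentage');

select operation_type, actual_mem_used, tempseg_size
  from v$sql_workarea_active;   -- a non-null tempseg_size indicates a spill to TEMP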

Organizational Behavior

The hardest thing about analytic functions is the organizational resistance to change. Developers and DBAs are comfortable writing SQL statements using conventional syntax, and analytic syntax does not come easily at first. However, this is a change worth embracing. Another plus: the use of analytic functions forces you to think in terms of sets.

Every major release of Oracle Database introduces new features, and we need to harness them to write more efficient and concise SQL statements. Proper training in these features is also essential and, hopefully, this chapter has provided insight into analytic functions. When you start writing SQL statements using analytic functions, begin with a simple statement, then add complexity to meet the goal.

Summary

Complex SQL statements can be written concisely using analytic functions. Understanding analytic functions provides you with a whole new way of thinking, analytically speaking. The ability to reference other rows, combined with the partitioning and windowing clauses, allows you to simplify complex SQL statements. Many performance issues can be resolved by rewriting SQL statements with analytic functions. You may find resistance from developers and DBAs alike when it comes to using analytic functions, but that resistance can be overcome easily by demonstrating the performance improvements they provide.
