14. Advanced Analytic SQL

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 14. Advanced Analytic SQL

For years, SQL has been criticized for its inability to handle routine decision support queries. With a host of new analytic functions introduced in Oracle8i, Oracle9i, Database, and Oracle Database 10g, Oracle has taken giant strides toward eliminating this deficiency. In doing so, Oracle has further blurred the distinction between its multipurpose relational database server and other, special-purpose data warehouse and statistical analysis servers.

Analytic SQL Overview

The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:

Find the top 10 salespeople in each sales district last year.
Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.
Identify the region that suffered the worst quarter-to-quarter sales decline last year.
Find the best and worst selling menu items by state for each quarter last year.

Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc., to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.

Although all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:

They may require different levels of aggregation of the same data.
They may involve intratable comparisons (comparing one or more rows in a table with other rows in the same table).
They may require an extra filtering step after the result set has been sorted (i.e., finding the top 10 and bottom 10 salespeople last month).

Although it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: “Find all customers whose total orders in 2001 exceeded 20% of the aggregate sales for their geographic region.”

For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called orders) containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:

Sum all sales per region last year.
Sum all sales per customer last year.

After these two intermediate result sets have been constructed, each customer’s total can be compared to the total for their region to see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region’s sales.

The query to aggregate sales by region looks as follows:

            SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
            FROM orders o
            WHERE o.year = 2001
            GROUP BY o.region_id;

REGION_ID  TOT_SALES
---------- ----------
         5    6585641
         6    6307766
         7    6868495
         8    6854731
         9    6739374
        10    6238901

The query to aggregate sales by customer would be:

            SELECT o.cust_nbr cust_nbr, o.region_id region_id,
              SUM(o.tot_sales) tot_sales
            FROM orders o
            WHERE o.year = 2001
            GROUP BY o.cust_nbr, o.region_id;

  CUST_NBR  REGION_ID  TOT_SALES
---------- ---------- ----------
         1          5    1151162
         2          5    1224992
         3          5    1161286
         4          5    1878275
         5          5    1169926
         6          6    1788836
         7          6     971585
         8          6    1141638
         9          6    1208959
        10          6    1196748
        11          7    1190421
        12          7    1182275
        13          7    1310434
        14          7    1929774
        15          7    1255591
        16          8    1068467
        17          8    1944281
        18          8    1253840
        19          8    1174421
        20          8    1413722
        21          9    1020541
        22          9    1036146
        23          9    1224992
        24          9    1224992
        25          9    2232703
        26         10    1808949
        27         10    1322747
        28         10     986964
        29         10     903383
        30         10    1216858

By placing each of the two queries in an inline view and joining them on region_id, you can identify those customers whose total sales exceeds 20% of their region, as in:

            SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
              cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales
            FROM
             (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
              FROM orders o
              WHERE o.year = 2001
              GROUP BY o.region_id) region_sales INNER JOIN
             (SELECT o.cust_nbr cust_nbr, o.region_id region_id,
                SUM(o.tot_sales) tot_sales
              FROM orders o
              WHERE o.year = 2001
              GROUP BY o.cust_nbr, o.region_id) cust_sales
              ON cust_sales.region_id = region_sales.region_id
            WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2);

  CUST_NBR  REGION_ID CUST_SALES REGION_SALES
---------- ---------- ---------- ------------
         4          5    1878275      6585641
         6          6    1788836      6307766
        14          7    1929774      6868495
        17          8    1944281      6854731
        20          8    1413722      6854731
        25          9    2232703      6739374
        26         10    1808949      6238901
        27         10    1322747      6238901

The final step is to join the region and customer dimensions to include the customer and region names in the result set:

            SELECT c.name cust_name,
              big_custs.cust_sales cust_sales, r.name region_name,
              100 * ROUND(big_custs.cust_sales / 
                big_custs.region_sales, 2)  percent_of_region
            FROM 
             (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
                cust_sales.tot_sales cust_sales, 
                region_sales.tot_sales region_sales
              FROM
               (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
                FROM orders o
                WHERE o.year = 2001
                GROUP BY o.region_id) region_sales INNER JOIN
               (SELECT o.cust_nbr cust_nbr, o.region_id region_id,
                  SUM(o.tot_sales) tot_sales
                FROM orders o
                WHERE o.year = 2001
                GROUP BY o.cust_nbr, o.region_id) cust_sales
                ON cust_sales.region_id = region_sales.region_id
              WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs INNER JOIN 
customer c
              ON big_custs.cust_nbr = c.cust_nbr
              INNER JOIN region r
              ON big_custs.region_id = r.region_id;

CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
---------------------- ---------- -------------------- -----------------
Flowtech Inc.             1878275 New England                         29
Spartan Industries        1788836 Mid-Atlantic                        28
Madden Industries         1929774 Southeast US                        28
Evans Supply Corp.        1944281 Southwest US                        28
Malden Labs               1413722 Southwest US                        21
Worcester Technologies    2232703 Northwest US                        33
Alpha Technologies        1808949 Central US                          29
Phillips Labs             1322747 Central US                          21

Using nothing more exotic than inline views, therefore, it is possible to construct a single query that generates the desired results. Such a solution, however, has the following shortcomings:

The query is fairly complex.
Two passes through the same rows of the orders table are required to generate the different aggregation levels needed by the query.

Let’s see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer, and then call an analytic function that performs a second level of aggregation to generate total sales per region:

            SELECT o.region_id region_id, o.cust_nbr cust_nbr,
             SUM(o.tot_sales) tot_sales,
             SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
            FROM orders o
            WHERE o.year = 2001
            GROUP BY o.region_id, o.cust_nbr;

REGION_ID   CUST_NBR  TOT_SALES REGION_SALES
---------- ---------- ---------- ------------
         5          1    1151162      6585641
         5          2    1224992      6585641
         5          3    1161286      6585641
         5          4    1878275      6585641
         5          5    1169926      6585641
         6          6    1788836      6307766
         6          7     971585      6307766
         6          8    1141638      6307766
         6          9    1208959      6307766
         6         10    1196748      6307766
         7         11    1190421      6868495
         7         12    1182275      6868495
         7         13    1310434      6868495
         7         14    1929774      6868495
         7         15    1255591      6868495
         8         16    1068467      6854731
         8         17    1944281      6854731
         8         18    1253840      6854731
         8         19    1174421      6854731
         8         20    1413722      6854731
         9         21    1020541      6739374
         9         22    1036146      6739374
         9         23    1224992      6739374
         9         24    1224992      6739374
         9         25    2232703      6739374
        10         26    1808949      6238901
        10         27    1322747      6238901
        10         28     986964      6238901
        10         29     903383      6238901
        10         30    1216858      6238901

The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view, filter out those customers with less than 20% of their region’s total sales, and join the region and customer tables to generate the desired result set:

            SELECT c.name cust_name,
              cust_sales.tot_sales cust_sales, r.name region_name,
              100 * ROUND(cust_sales.tot_sales / 
                cust_sales.region_sales, 2)  percent_of_region
            FROM 
             (SELECT o.region_id region_id, o.cust_nbr cust_nbr,
                SUM(o.tot_sales) tot_sales,
                SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
              FROM orders o
              WHERE o.year = 2001
              GROUP BY o.region_id, o.cust_nbr) cust_sales INNER JOIN region r
              ON cust_sales.region_id = r.region_id
              INNER JOIN customer c
              ON cust_sales.cust_nbr = c.cust_nbr
            WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2);

CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
---------------------- ---------- -------------------- -----------------
Flowtech Inc.             1878275 New England                         29
Spartan Industries        1788836 Mid-Atlantic                        28
Madden Industries         1929774 Southeast US                        28
Evans Supply Corp.        1944281 Southwest US                        28
Malden Labs               1413722 Southwest US                        21
Worcester Technologies    2232703 Northwest US                        33
Alpha Technologies        1808949 Central US                          29
Phillips Labs             1322747 Central US                          21

Using an inline view saves us from having to join the region and customer tables to the orders table; otherwise, we would have to include columns from the region and customer tables in the GROUP BY clause.

Later in this chapter, under “Reporting Functions,” we’ll get into the details of how the SUM...OVER function works. For now, you can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.

Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle’s suite of analytic functions can only be used in the SELECT and ORDER BY clauses of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query’s ORDER BY clause is evaluated to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions as well as specify analytic functions not found in the SELECT clause.

The remainder of this chapter introduces the Oracle8i Database and Oracle9i Database analytic functions, grouped by functionality.

Ranking Functions

Determining the performance of a particular business entity compared to its peers is central to a wide variety of business decisions. Examples include:

Identifying assets with the highest utilization
Determining the worst selling products by region
Finding the best performing salespeople

Prior to the release of Oracle8i Database, you could use the ORDER BY clause to sort a result set on one or more columns, but any further processing to calculate rankings or percentiles had to be performed using a procedural language. Beginning with Oracle8i Database, however, you can take advantage of several new functions to either generate rankings for each row in a result set or to group rows into buckets for percentile calculations.

RANK, DENSE_RANK, and ROW_NUMBER

The RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolves around how each one handles ties:

ROW_NUMBER: Returns a unique number for each row starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
DENSE_RANK: Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned.
RANK: Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.

An example will best illustrate the differences. First, here is the query to generate the aggregate sales data by region and customer for the year 2001:

               SELECT region_id, cust_nbr, 
                 SUM(tot_sales) cust_sales
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY region_id, cust_nbr;

REGION_ID    CUST_NBR CUST_SALES
---------- ---------- ----------
         5          1    1151162
         5          2    1224992
         5          3    1161286
         5          4    1878275
         5          5    1169926
         6          6    1788836
         6          7     971585
         6          8    1141638
         6          9    1208959
         6         10    1196748
         7         11    1190421
         7         12    1182275
         7         13    1310434
         7         14    1929774
         7         15    1255591
         8         16    1068467
         8         17    1944281
         8         18    1253840
         8         19    1174421
         8         20    1413722
         9         21    1020541
         9         22    1036146
         9         23    1224992
         9         24    1224992
         9         25    2232703
        10         26    1808949
        10         27    1322747
        10         28     986964
        10         29     903383
        10         30    1216858

Notice that three of the customers (2, 23, and 24) have the same value for total sales ($1,224,992). In the next query, three function calls are added to generate rankings for each customer across all regions, and the results are then ordered by the ROW_NUMBER function to make the difference in rankings easier to observe:

               SELECT region_id, cust_nbr, 
                 SUM(tot_sales) cust_sales,
                 RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank,
                 DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
                 ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY sales_number;

REGION_ID   CUST_NBR  CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER
---------- ---------- ---------- ---------- ---------------- ------------
         9         25    2232703          1                1            1
         8         17    1944281          2                2            2
         7         14    1929774          3                3            3
         5          4    1878275          4                4            4
        10         26    1808949          5                5            5
         6          6    1788836          6                6            6
         8         20    1413722          7                7            7
        10         27    1322747          8                8            8
         7         13    1310434          9                9            9
         7         15    1255591         10               10           10
         8         18    1253840         11               11           11
         5          2    1224992         12               12           12
         9         23    1224992         12               12           13
         9         24    1224992         12               12           14
        10         30    1216858         15               13           15
         6          9    1208959         16               14           16
         6         10    1196748         17               15           17
         7         11    1190421         18               16           18
         7         12    1182275         19               17           19
         8         19    1174421         20               18           20
         5          5    1169926         21               19           21
         5          3    1161286         22               20           22
         5          1    1151162         23               21           23
         6          8    1141638         24               22           24
         8         16    1068467         25               23           25
         9         22    1036146         26               24           26
         9         21    1020541         27               25           27
        10         28     986964         28               26           28
         6          7     971585         29               27           29
        10         29     903383         30               28           30

Don’t be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clauses internally to sort their results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer’s sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key.

Both the RANK and DENSE_RANK functions assign the rank of 12 to the three rows with total sales of $1,224,992, while the ROW_NUMBER function assigns the ranks 12, 13, and 14 to the same rows. The difference between the RANK and DENSE_RANK functions manifests itself in the ranking assigned to the next-lowest sales total; the RANK function leaves a gap in the ranking sequence and assigns a rank of 15 to customer number 30, while the DENSE_RANK function continues the sequence with a ranking of 13.

Deciding which of the three functions to use depends on the desired outcome. If you want to identify the top 13 customers from this result set, you would use:

ROW_NUMBER: If you want exactly 13 rows without regard to ties. In this case, one of the customers who might otherwise be included in the list will be excluded from the final set.
RANK: If you want at least 13 rows but don’t want to include rows that would have been excluded had there been no ties. In this case, you would retrieve 14 rows.
DENSE_RANK: If you want all customers with a ranking of 13 or less, including all duplicates. In this case, you would retrieve 15 rows.

While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause:

               SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
                 RANK( ) OVER (PARTITION BY region_id
                   ORDER BY SUM(tot_sales) DESC) sales_rank,
                 DENSE_RANK( ) OVER (PARTITION BY region_id
                   ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
                 ROW_NUMBER( ) OVER (PARTITION BY region_id
                   ORDER BY SUM(tot_sales) DESC) sales_number
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY region_id, sales_number;

REGION_ID    CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER
---------- ---------- ---------- ---------- ---------------- ------------
         5          4    1878275          1                1            1
         5          2    1224992          2                2            2
         5          5    1169926          3                3            3
         5          3    1161286          4                4            4
         5          1    1151162          5                5            5
         6          6    1788836          1                1            1
         6          9    1208959          2                2            2
         6         10    1196748          3                3            3
         6          8    1141638          4                4            4
         6          7     971585          5                5            5
         7         14    1929774          1                1            1
         7         13    1310434          2                2            2
         7         15    1255591          3                3            3
         7         11    1190421          4                4            4
         7         12    1182275          5                5            5
         8         17    1944281          1                1            1
         8         20    1413722          2                2            2
         8         18    1253840          3                3            3
         8         19    1174421          4                4            4
         8         16    1068467          5                5            5
         9         25    2232703          1                1            1
         9         23    1224992          2                2            2
         9         24    1224992          2                2            3
         9         22    1036146          4                3            4
         9         21    1020541          5                4            5
        10         26    1808949          1                1            1
        10         27    1322747          2                2            2
        10         30    1216858          3                3            3
        10         28     986964          4                4            4
        10         29     903383          5                5            5

Each customer receives a ranking between one and five depending on their relation to other customers in the same region. Of the three customers with duplicate total sales, two of them are in region 9; as before, the RANK and DENSE_RANK functions generate identical rankings for both customers.

Tip

The PARTITION BY clause used in ranking functions is used to divide a result set into pieces so that rankings can be applied within each subset. This is completely different from the PARTITION BY RANGE/HASH/LIST clauses introduced in Chapter 10 for breaking a table or index into multiple pieces.

Handling NULLs

All ranking functions allow you to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in:

SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
  RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr;

If omitted, NULL values will either appear last in ascending rankings or first in descending rankings.

Top/bottom N queries

One of the most common uses of a ranked data set is to identify the top N or bottom N performers. Since you can’t call analytic functions from the WHERE or HAVING clauses, you are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top five salespersons for 2001:

                  SELECT s.name, sp.sp_sales total_sales
                  FROM 
                   (SELECT salesperson_id, SUM(tot_sales) sp_sales,
                      RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank
                    FROM orders
                    WHERE year = 2001
                    GROUP BY salesperson_id) sp INNER JOIN salesperson s
                    ON sp.salesperson_id = s.salesperson_id
                  WHERE sp.sales_rank <= 5
                  ORDER BY sp.sales_rank;

NAME                                              TOTAL_SALES
------------------------------------------------- -----------
Jeff Blake                                            1927580
Sam Houseman                                          1814327
Mark Russell                                          1784596
John Boorman                                          1768813
Carl Isaacs                                           1761814
Tim McGowan                                           1761814

FIRST/LAST

Although there is no function for returning only the top or bottom N from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: “Find the regions with the best and worst total sales last year.” Unlike the top five salespeople example from the previous section, this query needs an additional piece of information—the size of the result set—to answer the question.

Oracle9i provides the ability to answer such queries efficiently using functions that rank the result set based on a specified ordering, identify the row with the top or bottom ranking, and report on any column available in the result set. These functions are composed of three parts:

An ORDER BY clause that specifies how to rank the result set.
The keywords FIRST and LAST to specify whether to use the top or bottom-ranked row.
An aggregate function (i.e., MIN, MAX, AVG, COUNT) used as a tiebreaker in case more than one row of the result set tie for the FIRST or LAST spot in the ranking.

The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales:

                  SELECT
                    MIN(region_id)
                      KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region,
                    MIN(region_id)
                      KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region
                  FROM orders
                  WHERE year = 2001
                  GROUP BY region_id;

BEST_REGION WORST_REGION
----------- ------------
          7           10

The use of the MIN function in the previous query is a bit confusing: it is used only if more than one region ties for either first or last place in the ranking. If there were a tie, the row with the minimum value for region_id would be chosen. To find out if a tie actually exists, you could call each function twice using MIN for the first and MAX for the second, and see if they return the same results:

                  SELECT
                    MIN(region_id)
                      KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) min_best_region,
                    MAX(region_id)
                      KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) max_best_region,
                    MIN(region_id)
                      KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) min_worst_region,
                    MAX(region_id)
                      KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) max_worst_region
                  FROM orders
                  WHERE year = 2001
                  GROUP BY region_id;

MIN_BEST_REGION MAX_BEST_REGION MIN_WORST_REGION MAX_WORST_REGION
--------------- --------------- ---------------- ----------------
              7               7               10               10

In this case, there are no ties for either first or last place. Depending on the type of data you are working with, using an aggregate function as a tiebreaker can can be somewhat arbitrary.

NTILE

Another way rankings are commonly used is to generate buckets into which sets of rankings are grouped. For example, you may want to find those customers whose total sales ranked in the top 25%. The following query uses the NTILE function to group the customers into four buckets (or quartiles):

               SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
                 NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY sales_quartile, cust_sales DESC;

REGION_ID    CUST_NBR CUST_SALES SALES_QUARTILE
---------- ---------- ---------- --------------
         9         25    2232703              1
         8         17    1944281              1
         7         14    1929774              1
         5          4    1878275              1
        10         26    1808949              1
         6          6    1788836              1
         8         20    1413722              1
        10         27    1322747              1
         7         13    1310434              2
         7         15    1255591              2
         8         18    1253840              2
         5          2    1224992              2
         9         23    1224992              2
         9         24    1224992              2
        10         30    1216858              2
         6          9    1208959              2
         6         10    1196748              3
         7         11    1190421              3
         7         12    1182275              3
         8         19    1174421              3
         5          5    1169926              3
         5          3    1161286              3
         5          1    1151162              3
         6          8    1141638              4
         8         16    1068467              4
         9         22    1036146              4
         9         21    1020541              4
        10         28     986964              4
         6          7     971585              4
        10         29     903383              4

The sales_quartile column in this query specifies NTILE(4) to create four buckets. The NTILE function finds each row’s place in the ranking, and then assigns each row to a bucket such that every bucket contains the same number of rows. If the number of rows is not evenly divisible by the number of buckets, then the extra rows are distributed so that the number of rows per bucket differs by one at most. In the previous example, there are four buckets allocated for 30 rows, with buckets one and two containing eight rows each, and buckets three and four containing seven rows each. This approach is referred to as equiheight buckets because each bucket contains (optimally) the same number of rows.

Just like in the top N query discussed earlier, you will need to wrap the query in an inline view if you want to filter on the NTILE result:

               SELECT r.name region, c.name customer, cs.cust_sales
               FROM 
                (SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
                   NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile
                 FROM orders
                 WHERE year = 2001
                 GROUP BY region_id, cust_nbr) cs INNER JOIN customer c
                 ON cs.cust_nbr = c.cust_nbr
                 INNER JOIN region r
                 ON cs.region_id = r.region_id
               WHERE cs.sales_quartile = 1

               ORDER BY cs.cust_sales DESC;

REGION               CUSTOMER                       CUST_SALES
-------------------- ------------------------------ ----------
Northwest US         Worcester Technologies            2232703
Southwest US         Evans Supply Corp.                1944281
Southeast US         Madden Industries                 1929774
New England          Flowtech Inc.                     1878275
Central US           Alpha Technologies                1808949
Mid-Atlantic         Spartan Industries                1788836
Southwest US         Malden Labs                       1413722
Central US           Phillips Labs                     1322747

The outer query filters on sales_quartile = 1, which removes all rows not in the top 25% of sales, and then joins the region and customer dimensions to generate the final results.

WIDTH_BUCKET

Similar to the NTILE function, the WIDTH_BUCKET function groups rows of the result set into buckets. Unlike NTILE, however, the WIDTH_BUCKET function attempts to create equiwidth buckets, meaning that the range of values is evenly distributed across the buckets. If your data were distributed across a bell curve, therefore, you could expect the buckets representing the low and high ranges of the bell curve to contain few records, whereas the buckets representing the middle ranges would contain many records.

WIDTH_BUCKET can operate on numeric or date types, and takes the following four parameters:

The expression that generates the buckets
The value used as the start of the range for bucket #1
The value used as the end of the range for bucket #N
The number of buckets to create (N)

WIDTH_BUCKET uses the values of the second, third, and fourth parameters to generate N buckets containing comparable ranges. If the expression yields values that fall outside the range specified by the second and third parameters, the WIDTH_BUCKET function will generate two additional buckets, numbered 0 and N + 1, into which the outliers are placed. If you want to work with the entire result set, you need to make sure your values for the second and third parameters completely enclose the range of values in the result set. However, if you only wish to work with a subset of the data, you can specify values for the second and third parameters that enclose the desired range, and any rows falling outside the range will be placed into buckets 0 and N + 1.

Here’s an example that uses the NTILE example from earlier to generate three buckets for the total sales per customer:

               SELECT region_id, cust_nbr,
                 SUM(tot_sales) cust_sales,
                 WIDTH_BUCKET(SUM(tot_sales), 1, 3000000, 3) sales_buckets
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY cust_sales;

REGION_ID   CUST_NBR CUST_SALES SALES_BUCKETS
---------- ---------- ---------- -------------
        10         29     903383             1
         6          7     971585             1
        10         28     986964             1
         9         21    1020541             2
         9         22    1036146             2
         8         16    1068467             2
         6          8    1141638             2
         5          1    1151162             2
         5          3    1161286             2
         5          5    1169926             2
         8         19    1174421             2
         7         12    1182275             2
         7         11    1190421             2
         6         10    1196748             2
         6          9    1208959             2
        10         30    1216858             2
         5          2    1224992             2
         9         24    1224992             2
         9         23    1224992             2
         8         18    1253840             2
         7         15    1255591             2
         7         13    1310434             2
        10         27    1322747             2
         8         20    1413722             2
         6          6    1788836             2
        10         26    1808949             2
         5          4    1878275             2
         7         14    1929774             2
         8         17    1944281             2
         9         25    2232703             3

Based on these parameters, the WIDTH_BUCKET function generates three buckets; the first bucket starts at 1, and the third bucket has an upper range of 3,000,000. Since there are three buckets, the ranges for each bucket will be 1 to 1,000,000, 1,000,001 to 2,000,000, and 2,000,001 to 3,000,000. When the rows are placed in the appropriate bucket, there are three rows that fall into bucket #1, a single row that falls in bucket #3, and the remaining 26 rows that fall into the second bucket.

The values 1 and 3,000,000 were chosen to guarantee that all rows in the result set would be placed into one of the three buckets. If you want to generate buckets only for rows that have aggregate sales between $1,000,000 and $2,000,000, the WIDTH_BUCKET function will place the remaining rows in the 0^th and 4^th buckets:

               SELECT region_id, cust_nbr,
                 SUM(tot_sales) cust_sales,
                 WIDTH_BUCKET(SUM(tot_sales), 1000000, 2000000, 3) sales_buckets
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY cust_sales;

REGION_ID   CUST_NBR CUST_SALES SALES_BUCKETS
---------- ---------- ---------- -------------
        10         29     903383             0
         6          7     971585             0
        10         28     986964             0
         9         21    1020541             1
         9         22    1036146             1
         8         16    1068467             1
         6          8    1141638             1
         5          1    1151162             1
         5          3    1161286             1
         5          5    1169926             1
         8         19    1174421             1
         7         12    1182275             1
         7         11    1190421             1
         6         10    1196748             1
         6          9    1208959             1
        10         30    1216858             1
         5          2    1224992             1
         9         24    1224992             1
         9         23    1224992             1
         8         18    1253840             1
         7         15    1255591             1
         7         13    1310434             1
        10         27    1322747             1
         8         20    1413722             2
         6          6    1788836             3
        10         26    1808949             3
         5          4    1878275             3
         7         14    1929774             3
         8         17    1944281             3
         9         25    2232703             4

Keep in mind that the WIDTH_BUCKET function does not remove rows from the result set that do not lie within the specified range; rather, they are placed into special buckets that your query can either utilize or ignore as needed.

CUME_DIST and PERCENT_RANK

The final two ranking functions, CUME_DIST and PERCENT_RANK, use the rank of a particular row to calculate additional information. The CUME_DIST function (short for Cumulative Distribution) calculates the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition. The PERCENT_RANK function calculates the ratio of a row’s ranking to the number of rows in the partition using the formula:

(RRP -- 1) / (NRP -- 1)

where RRP is the “rank of row in partition,” and NRP is the “number of rows in partition.”

Both functions utilize DENSE_RANK for their rankings and can be specified to be in ascending or descending order. The following query demonstrates the use of these two functions (both specifying descending order) with the customer yearly sales query:

               SELECT region_id, cust_nbr, 
                 SUM(tot_sales) cust_sales,
                 CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_cume_dist,
                 PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_percent_rank
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, cust_nbr
               ORDER BY cust_sales DESC;

REGION_ID   CUST_NBR  CUST_SALES SALES_CUME_DIST SALES_PERCENT_RANK
---------- ---------- ---------- --------------- ------------------
         9         25    2232703      .033333333                  0
         8         17    1944281      .066666667         .034482759
         7         14    1929774              .1         .068965517
         5          4    1878275      .133333333         .103448276
        10         26    1808949      .166666667         .137931034
         6          6    1788836              .2         .172413793
         8         20    1413722      .233333333         .206896552
        10         27    1322747      .266666667         .24137931
         7         13    1310434              .3         .275862069
         7         15    1255591      .333333333         .310344828
         8         18    1253840      .366666667         .344827586
         5          2    1224992      .466666667         .379310345
         9         23    1224992      .466666667         .379310345
         9         24    1224992      .466666667         .379310345
        10         30    1216858              .5         .482758621
         6          9    1208959      .533333333         .517241379
         6         10    1196748      .566666667         .551724138
         7         11    1190421              .6         .586206897
         7         12    1182275      .633333333         .620689655
         8         19    1174421      .666666667         .655172414
         5          5    1169926              .7         .689655172
         5          3    1161286      .733333333         .724137931
         5          1    1151162      .766666667         .75862069
         6          8    1141638              .8         .793103448
         8         16    1068467      .833333333         .827586207
         9         22    1036146      .866666667         .862068966
         9         21    1020541              .9         .896551724
        10         28     986964      .933333333         .931034483
         6          7     971585      .966666667         .965517241
        10         29     903383               1                  1

Let’s walk through a couple of calculations for customer number 1 in the previous result set. With total sales of $1,151,162, customer number 1 ranks 23^rd in the set of 30 customers in descending order of sales. Since there are a total of 30 rows, the CUME_DIST is equal to 23/30, or .766666667. The PERCENT_RANK function yields (23 - 1) / (30 - 1) = .75862069. It should come as no surprise that each function returns identical values for the rows that have identical sales totals, since the calculations are based on rank, which is identical for all three rows.

Hypothetical Functions

For some types of analysis, determining what might have happened is more revealing than knowing what really happened. Oracle provides special versions of RANK, DENSE_RANK, CUME_DIST, and PERCENT_RANK that allow rankings and distributions to be calculated for hypothetical data, allowing the user to see what would have happened if a specific value (or set of values) was included in a data set.

To illustrate this concept, let’s rank all customers by total sales for 2001, and then see where a hypothetical sales figure would fall in the ranking. Here is the query that generates the rankings and distributions:

               SELECT cust_nbr, SUM(tot_sales) cust_sales,
                 RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) rank,
                 DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) dense_rank,
                 CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) cume_dist,
                 PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) percent_rank
               FROM orders
               WHERE year = 2001
               GROUP BY cust_nbr
               ORDER BY rank;

  CUST_NBR CUST_SALES       RANK DENSE_RANK  CUME_DIST PERCENT_RANK
---------- ---------- ---------- ---------- ---------- ------------
        25    2232703          1          1 .033333333            0
        17    1944281          2          2 .066666667   .034482759
        14    1929774          3          3         .1   .068965517
         4    1878275          4          4 .133333333   .103448276
        26    1808949          5          5 .166666667   .137931034
         6    1788836          6          6         .2   .172413793
        20    1413722          7          7 .233333333   .206896552
        27    1322747          8          8 .266666667    .24137931
        13    1310434          9          9         .3   .275862069
        15    1255591         10         10 .333333333   .310344828
        18    1253840         11         11 .366666667   .344827586
         2    1224992         12         12 .466666667   .379310345
        23    1224992         12         12 .466666667   .379310345
        24    1224992         12         12 .466666667   .379310345
        30    1216858         15         13         .5   .482758621
         9    1208959         16         14 .533333333   .517241379
        10    1196748         17         15 .566666667   .551724138
        11    1190421         18         16         .6   .586206897
        12    1182275         19         17 .633333333   .620689655
        19    1174421         20         18 .666666667   .655172414
         5    1169926         21         19         .7   .689655172
         3    1161286         22         20 .733333333   .724137931
         1    1151162         23         21 .766666667    .75862069
         8    1141638         24         22         .8   .793103448
        16    1068467         25         23 .833333333   .827586207
        22    1036146         26         24 .866666667   .862068966
        21    1020541         27         25         .9   .896551724
        28     986964         28         26 .933333333   .931034483
         7     971585         29         27 .966666667   .965517241
        29     903383         30         28          1            1

Now let’s see where a customer with an even million dollars of sales would have ranked:

               SELECT 
                 RANK(1000000) WITHIN GROUP 
                   (ORDER BY SUM(tot_sales) DESC) hyp_rank,
                 DENSE_RANK(1000000) WITHIN GROUP 
                   (ORDER BY SUM(tot_sales) DESC) hyp_dense_rank,
                 CUME_DIST(1000000) WITHIN GROUP 
                   (ORDER BY SUM(tot_sales) DESC) hyp_cume_dist,
                 PERCENT_RANK(1000000) WITHIN GROUP 
                   (ORDER BY SUM(tot_sales) DESC) hyp_percent_rank
               FROM orders
               WHERE year = 2001
               GROUP BY cust_nbr;

  HYP_RANK HYP_DENSE_RANK HYP_CUME_DIST HYP_PERCENT_RANK
---------- -------------- ------------- ----------------
        28             26    .903225806               .9

The WITHIN GROUP clause has the effect of injecting a fictitious row into the result set before determining the rankings. One possible use of this functionality would be to see how actual sales compare to sales targets.

Windowing Functions

The ranking functions described thus far are quite useful when comparing items within a fixed window of time, such as “last year” or “second quarter.” But what if you want to perform computations using a window that slides as you progress through the data set? Oracle’s windowing functions allow aggregates to be calculated for each row in a result set based on a specified window. The aggregation window can be defined in one of three ways:

By specifying a set of rows: “From the current row to the end of the partition”
By specifying a time interval: “For the 30 days preceding the transaction date”
By specifying a range of values: “All rows having a transaction amount within 5% of the current row’s transaction amount”

The first set of examples will generate a window that fills the entire partition, and then show how the window can be detached from one or both ends of the partition so that it floats with the current row. All of the examples will be based on the following query, which calculates total monthly sales in 2001 for the Mid-Atlantic region:

            SELECT month, 
              SUM(tot_sales) monthly_sales
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES
---------- -------------
         1        610697
         2        428676
         3        637031
         4        541146
         5        592935
         6        501485
         7        606914
         8        460520
         9        392898
        10        510117
        11        532889
        12        492458

The first step is to sum the monthly sales for the entire result set by specifying an “unbounded” window. Note the ROWS BETWEEN clause in the following example:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              SUM(SUM(tot_sales)) OVER (ORDER BY month 
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) total_sales
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES TOTAL_SALES
---------- ------------- -----------
         1        610697     6307766
         2        428676     6307766
         3        637031     6307766
         4        541146     6307766
         5        592935     6307766
         6        501485     6307766
         7        606914     6307766
         8        460520     6307766
         9        392898     6307766
        10        510117     6307766
        11        532889     6307766
        12        492458     6307766

Each time the function executes, it sums the monthly sales from months 1 through 12; thus, the same calculation is being performed 12 times. This is a rather inefficient way to generate the yearly sales total (see Section 14.4 later in this chapter for a better method), but it should give you an idea of the syntax for building an aggregation window. The next query will create a window that spans from the top of the partition to the current row. The function identifies the month that has the maximum sales, up to and including the current month:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              MAX(SUM(tot_sales)) OVER (ORDER BY month
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) max_preceeding
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES MAX_PRECEEDING
---------- ------------- --------------
         1        610697         610697
         2        428676         610697
         3        637031         637031
         4        541146         637031
         5        592935         637031
         6        501485         637031
         7        606914         637031
         8        460520         637031
         9        392898         637031
        10        510117         637031
        11        532889         637031
        12        492458         637031

Unlike the first query, which has a window size fixed at 12 rows, this query’s aggregation window grows from a single row for month 1 to 12 rows for month 12. The keywords CURRENT ROW are used to indicate that the window should end at the current row being inspected by the function. If you replace MAX in the previous query with SUM, you can calculate a running total:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              SUM(SUM(tot_sales)) OVER (ORDER BY month
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) running_total
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES RUNNING_TOTAL
---------- ------------- --------------
         1        610697         610697
         2        428676        1039373
         3        637031        1676404
         4        541146        2217550
         5        592935        2810485
         6        501485        3311970
         7        606914        3918884
         8        460520        4379404
         9        392898        4772302
        10        510117        5282419
        11        532889        5815308
        12        492458        6307766

You have now seen examples using windows that are fixed at one or both ends. The next query will define a window that floats freely with each row:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              AVG(SUM(tot_sales)) OVER (ORDER BY month 
                ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES ROLLING_AVG
---------- ------------- -----------
         1        610697    519686.5
         2        428676  558801.333
         3        637031  535617.667
         4        541146  590370.667
         5        592935  545188.667
         6        501485  567111.333
         7        606914      522973
         8        460520  486777.333
         9        392898  454511.667
        10        510117  478634.667
        11        532889  511821.333
        12        492458    512673.5

For each of the 12 rows, the function calculates the average sales of the current month, the previous month, and the following month. The value of the ROLLING_AVG column is therefore the average sales within a three month floating window centered on the current month, with the exception that months 1 and 12 are calculated using a two-month window, since there is no previous month for month 1 or following month for month 12.

Working with Ranges

The previous windowing examples use the ROWS BETWEEN option to specify which rows to include in the aggregation. You may alternately specify a range and let Oracle determine which rows lie within the range. For example, the previous query used ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING to generate a three-month rolling average; the same results can be achieved by substituting RANGE for ROWS:

               SELECT month,
                 SUM(tot_sales) monthly_sales,
                 AVG(SUM(tot_sales)) OVER (ORDER BY month
                 RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
               FROM orders
               WHERE year = 2001
                 AND region_id = 6
               GROUP BY month
               ORDER BY month;

     MONTH MONTHLY_SALES ROLLING_AVG
---------- ------------- -----------
         1        610697    519686.5
         2        428676  558801.333
         3        637031  535617.667
         4        541146  590370.667
         5        592935  545188.667
         6        501485  567111.333
         7        606914      522973
         8        460520  486777.333
         9        392898  454511.667
        10        510117  478634.667
        11        532889  511821.333
        12        492458    512673.5

This substitution works because the month column contains integer values, so adding and subtracting 1 from the current month yields a three-month range. The next variation achieves the same results but specifies a range of +/- 1.999:

               SELECT month,
                 SUM(tot_sales) monthly_sales,
                 AVG(SUM(tot_sales)) OVER (ORDER BY month
                 RANGE BETWEEN 1.999 PRECEDING AND 1.999 FOLLOWING) rolling_avg
               FROM orders
               WHERE year = 2001
                 AND region_id = 6
               GROUP BY month
               ORDER BY month;

     MONTH MONTHLY_SALES ROLLING_AVG
---------- ------------- -----------
         1        610697    519686.5
         2        428676  558801.333
         3        637031  535617.667
         4        541146  590370.667
         5        592935  545188.667
         6        501485  567111.333
         7        606914      522973
         8        460520  486777.333
         9        392898  454511.667
        10        510117  478634.667
        11        532889  511821.333
        12        492458    512673.5

If you are generating a window based on a DATE column, you can specify a range in increments of days, months, or years. Since the orders table has no DATE columns, the next example shows how a date range can be specified against the order_dt column of the cust_order table:

               SELECT TRUNC(order_dt) day,
                 SUM(sale_price) daily_sales,
                 AVG(SUM(sale_price)) OVER (ORDER BY TRUNC(order_dt)
                 RANGE BETWEEN INTERVAL '2' DAY PRECEDING 
                   AND INTERVAL '2' DAY FOLLOWING) five_day_avg
               FROM cust_order
               WHERE sale_price IS NOT NULL 
                 AND order_dt BETWEEN TO_DATE('01-JUL-2001','DD-MON-YYYY')
                 AND TO_DATE('31-JUL-2001','DD-MON-YYYY')
               GROUP BY TRUNC(order_dt);

DAY       DAILY_SALES FIVE_DAY_AVG
--------- ----------- ------------
16-JUL-01         112          146
18-JUL-01         180          114
20-JUL-01          50          169
21-JUL-01          50   165.333333
22-JUL-01         396   165.333333

This query generates a five-day rolling window by specifying a range of +/- two days around the truncated order date.

FIRST_VALUE and LAST_VALUE

Oracle provides two additional aggregate functions, called FIRST_VALUE and LAST_VALUE, that can be used with windowing functions to identify the values of the first and last values in the window. In the case of the three-month rolling average query shown previously, you could display the values of all three months along with the average of the three, as in:

               SELECT month,
                 FIRST_VALUE(SUM(tot_sales)) OVER (ORDER BY month 
                   ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) prev_month,
                 SUM(tot_sales) monthly_sales,
                 LAST_VALUE(SUM(tot_sales)) OVER (ORDER BY month 
                   ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) next_month,
                 AVG(SUM(tot_sales)) OVER (ORDER BY month 
                   ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
               FROM orders
               WHERE year = 2001 
                 AND region_id = 6
               GROUP BY month
               ORDER BY month;

     MONTH PREV_MONTH MONTHLY_SALES NEXT_MONTH ROLLING_AVG
---------- ---------- ------------- ---------- -----------
         1     610697        610697     428676    519686.5
         2     610697        428676     637031  558801.333
         3     428676        637031     541146  535617.667
         4     637031        541146     592935  590370.667
         5     541146        592935     501485  545188.667
         6     592935        501485     606914  567111.333
         7     501485        606914     460520      522973
         8     606914        460520     392898  486777.333
         9     460520        392898     510117  454511.667
        10     392898        510117     532889  478634.667
        11     510117        532889     492458  511821.333
        12     532889        492458     492458    512673.5

These functions are useful for queries that compare each value to the first or last value in the period, such as: “How did each month’s sales compare to the first month?”

LAG/LEAD Functions

Although not technically windowing functions, the LAG and LEAD functions are included here because they allow rows to be referenced by their position relative to the current row, much like the PRECEDING and FOLLOWING clauses within windowing functions. LAG and LEAD are useful for comparing one row of a result set with another row of the same result set. For example, the query “Compute the total sales per month for the Mid-Atlantic region, including the percent change from the previous month” requires data from both the current and preceding rows to calculate the answer. This is, in effect, a two-row window, but the offset from the current row can be specified as one or more rows, making LAG and LEAD act like specialized windowing functions where only the outer edges of the window are utilized.

Here is the SQL that uses the LAG function to generate the data needed to answer the question posed in the previous paragraph:

               SELECT month, 
                 SUM(tot_sales) monthly_sales,
                 LAG(SUM(tot_sales), 1) OVER (ORDER BY month) prev_month_sales
               FROM orders
               WHERE year = 2001
                 AND region_id = 6
               GROUP BY month
               ORDER BY month;

     MONTH MONTHLY_SALES PREV_MONTH_SALES
---------- ------------- ----------------
         1        610697
         2        428676           610697
         3        637031           428676
         4        541146           637031
         5        592935           541146
         6        501485           592935
         7        606914           501485
         8        460520           606914
         9        392898           460520
        10        510117           392898
        11        532889           510117
        12        492458           532889

As you might expect, the LAG value for month 1 is NULL, since there is no preceding month. This would also be the case for the LEAD value for month 12. If you would like the LAG and LEAD functions to return a non-NULL value for these cases, you can specify a substitute value via the optional third parameter (see next example).

The next query utilizes the output from the previous query to generate the percentage difference from month to month. Note how a third parameter has been specified for the LAG function so that month 1 will use the current month’s sales instead of a NULL value for the percentage change:

               SELECT months.month month, months.monthly_sales monthly_sales,
                 ROUND((months.monthly_sales - months.prev_month_sales) /
                   months.prev_month_sales, 3) * 100 percent_change
               FROM
                (SELECT month, 
                   SUM(tot_sales) monthly_sales,
                   LAG(SUM(tot_sales), 1, SUM(tot_sales)) 
                     OVER (ORDER BY month) prev_month_sales
                 FROM orders
                 WHERE year = 2001
                   AND region_id = 6
                 GROUP BY month) months
               ORDER BY month;


     MONTH MONTHLY_SALES PERCENT_CHANGE
---------- ------------- --------------
         1        610697              0
         2        428676          -29.8
         3        637031           48.6
         4        541146          -15.1
         5        592935            9.6
         6        501485          -15.4
         7        606914             21
         8        460520          -24.1
         9        392898          -14.7
        10        510117           29.8
        11        532889            4.5
        12        492458           -7.6

Reporting Functions

Similar to the windowing functions described earlier, reporting functions allow the execution of various aggregate functions (MIN, MAX, SUM, COUNT, AVG, etc.) against a result set. Unlike windowing functions, however, the reporting functions cannot specify localized windows and thus generate the same result for each entire partition (or the entire result set, if no partitions are specified). Therefore, anything that can be accomplished using a reporting function can also be accomplished using a windowing function with an unbounded window, although it will generally be more efficient to use the reporting function.

Earlier in the chapter, we used a windowing function with an unbounded reporting window to generate the total sales for the 12 months of 2001:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              SUM(SUM(tot_sales)) OVER (ORDER BY month 
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) total_sales
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES TOTAL_SALES
---------- ------------- -----------
         1        610697     6307766
         2        428676     6307766
         3        637031     6307766
         4        541146     6307766
         5        592935     6307766
         6        501485     6307766
         7        606914     6307766
         8        460520     6307766
         9        392898     6307766
        10        510117     6307766
        11        532889     6307766
        12        492458     6307766

The next query adds a reporting function to generate the same results:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              SUM(SUM(tot_sales)) OVER (ORDER BY month 
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) window_sales,
              SUM(SUM(tot_sales)) OVER ( ) reporting_sales
            FROM orders
            WHERE year = 2001 
              AND region_id = 6
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES WINDOW_SALES REPORTING_SALES
---------- ------------- ------------ ---------------
         1        610697      6307766         6307766
         2        428676      6307766         6307766
         3        637031      6307766         6307766
         4        541146      6307766         6307766
         5        592935      6307766         6307766
         6        501485      6307766         6307766
         7        606914      6307766         6307766
         8        460520      6307766         6307766
         9        392898      6307766         6307766
        10        510117      6307766         6307766
        11        532889      6307766         6307766
        12        492458      6307766         6307766

The empty parentheses after the OVER clause for the reporting_sales column indicate that the entire result set should be included in the sum, which has the same effect as using an unbounded window function. Hopefully, you will agree that the reporting function is easier to understand than the unbounded window function.

Reporting functions are useful when you need both detail and aggregate data (or different aggregation levels) to answer a business query. For example, the query “Show the monthly sales totals for 2001 along with each month’s percentage of yearly sales” requires the detail rows to be aggregated first to the month level, and then to the year level to answer the question. Rather than computing both aggregations from the detail rows, you can use the SUM function with a GROUP BY clause to aggregate to the month level, and then use a reporting function to aggregate the monthly totals, as in:

            SELECT month, 
              SUM(tot_sales) monthly_sales,
              SUM(SUM(tot_sales)) OVER ( ) yearly_sales
            FROM orders
            WHERE year = 2001
            GROUP BY month
            ORDER BY month;

     MONTH MONTHLY_SALES YEARLY_SALES
---------- ------------- ------------
         1       3028325     39594908
         2       3289336     39594908
         3       3411024     39594908
         4       3436482     39594908
         5       3749264     39594908
         6       3204730     39594908
         7       3233532     39594908
         8       3081290     39594908
         9       3388292     39594908
        10       3279637     39594908
        11       3167858     39594908
        12       3325138     39594908

You would then simply divide MONTHLY_SALES by YEARLY_SALES to compute the requested percentage (see Section 14.4.2 later in the chapter).

Report Partitions

Like ranking functions, reporting functions can include PARTITION BY clauses to split the result set into multiple pieces, allowing multiple aggregations to be computed across different subsets of the result set. The following query generates total sales per salesperson per region along with the total regional sales for comparison:

               SELECT region_id, salesperson_id, 
                 SUM(tot_sales) sp_sales,
                 SUM(SUM(tot_sales)) OVER (PARTITION BY region_id) region_sales
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, salesperson_id
               ORDER BY region_id, salesperson_id;

REGION_ID  SALESPERSON_ID   SP_SALES REGION_SALES
---------- -------------- ---------- ------------
         5              1    1927580      6585641
         5              2    1461898      6585641
         5              3    1501039      6585641
         5              4    1695124      6585641
         6              5    1688252      6307766
         6              6    1392648      6307766
         6              7    1458053      6307766
         6              8    1768813      6307766
         7              9    1735575      6868495
         7             10    1723305      6868495
         7             11    1737093      6868495
         7             12    1672522      6868495
         8             13    1516776      6854731
         8             14    1814327      6854731
         8             15    1761814      6854731
         8             16    1761814      6854731
         9             17    1710831      6739374
         9             18    1625456      6739374
         9             19    1645204      6739374
         9             20    1757883      6739374
        10             21    1542152      6238901
        10             22    1468316      6238901
        10             23    1443837      6238901
        10             24    1784596      6238901

The value for the REGION_SALES column is the same for all salespeople in the same region. In the next section, you will see two different approaches for using this information to generate percentage calculations.

RATIO_TO_REPORT

One of the more common uses of reporting functions is to generate the value of the denominator for performance calculations. With the query from the previous section, for example, the next logical step would be to divide each salesperson’s total sales (SP_SALES) by the total region sales (REGION_SALES) to determine what ratio of the total region sales can be attributed to each salesperson. One option is to use the reporting function as the denominator in the percentage calculation, as in:

               SELECT region_id, salesperson_id, 
                 SUM(tot_sales) sp_sales,
                 ROUND(SUM(tot_sales) /
                   SUM(SUM(tot_sales)) OVER (PARTITION BY region_id), 
                   2) percent_of_region
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, salesperson_id
               ORDER BY region_id, salesperson_id;

REGION_ID  SALESPERSON_ID   SP_SALES PERCENT_OF_REGION
---------- -------------- ---------- -----------------
         5              1    1927580               .29
         5              2    1461898               .22
         5              3    1501039               .23
         5              4    1695124               .26
         6              5    1688252               .27
         6              6    1392648               .22
         6              7    1458053               .23
         6              8    1768813               .28
         7              9    1735575               .25
         7             10    1723305               .25
         7             11    1737093               .25
         7             12    1672522               .24
         8             13    1516776               .22
         8             14    1814327               .26
         8             15    1761814               .26
         8             16    1761814               .26
         9             17    1710831               .25
         9             18    1625456               .24
         9             19    1645204               .24
         9             20    1757883               .26
        10             21    1542152               .25
        10             22    1468316               .24
        10             23    1443837               .23
        10             24    1784596               .29

Because this is such a common operation, however, Oracle has spared us the trouble by including the RATIO_TO_REPORT function. The RATIO_TO_REPORT function allows you to calculate each row’s contribution to either the entire result set, or some subset of the result set if the PARTITION BY clause is included. The next query uses RATIO_TO_REPORT to generate the percentage contribution of each salesperson to her region’s total sales:

               SELECT region_id, salesperson_id, 
                 SUM(tot_sales) sp_sales,
                 ROUND(RATIO_TO_REPORT(SUM(tot_sales)) 
                   OVER (PARTITION BY region_id), 2) sp_ratio
               FROM orders
               WHERE year = 2001
               GROUP BY region_id, salesperson_id
               ORDER BY region_id, salesperson_id;

REGION_ID  SALESPERSON_ID   SP_SALES   SP_RATIO
---------- -------------- ---------- ----------
         5              1    1927580        .29
         5              2    1461898        .22
         5              3    1501039        .23
         5              4    1695124        .26
         6              5    1688252        .27
         6              6    1392648        .22
         6              7    1458053        .23
         6              8    1768813        .28
         7              9    1735575        .25
         7             10    1723305        .25
         7             11    1737093        .25
         7             12    1672522        .24
         8             13    1516776        .22
         8             14    1814327        .26
         8             15    1761814        .26
         8             16    1761814        .26
         9             17    1710831        .25
         9             18    1625456        .24
         9             19    1645204        .24
         9             20    1757883        .26
        10             21    1542152        .25
        10             22    1468316        .24
        10             23    1443837        .23
        10             24    1784596        .29

Summary

We have covered a lot of ground in this chapter, so don’t feel bad if it takes a couple of passes to get a feel for all of the different analytic functions and how they can be applied. You’ll find the material easier to digest if you concentrate on one category at a time (Ranking, Windowing, Reporting). If you’ve been working with Oracle for many years, you are probably chomping at the bit to give these functions a try. Along with being compact and efficient, Oracle’s analytic functions keep analytical calculations where they belong—in the database server—instead of relying on procedural languages or spreadsheet macros to finish the job.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.