For years, SQL has been criticized for its inability to handle routine decision support queries. With a host of new analytic functions introduced in Oracle8i, Oracle9i, Database, and Oracle Database 10g, Oracle has taken giant strides toward eliminating this deficiency. In doing so, Oracle has further blurred the distinction between its multipurpose relational database server and other, special-purpose data warehouse and statistical analysis servers.
The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:
Find the top 10 salespeople in each sales district last year.
Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.
Identify the region that suffered the worst quarter-to-quarter sales decline last year.
Find the best and worst selling menu items by state for each quarter last year.
Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc., to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.
Although all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:
They may require different levels of aggregation of the same data.
They may involve intratable comparisons (comparing one or more rows in a table with other rows in the same table).
They may require an extra filtering step after the result set has been sorted (i.e., finding the top 10 and bottom 10 salespeople last month).
Although it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: “Find all customers whose total orders in 2001 exceeded 20% of the aggregate sales for their geographic region.”
For this and other examples in this chapter, we use a simple star
schema consisting of a single fact table (called
orders
) containing aggregated sales information
across the following dimensions: region
,
salesperson
, customer
, and
month
. There are two main facets to this query,
each requiring a different level of aggregation of the same data:
Sum all sales per region last year.
Sum all sales per customer last year.
After these two intermediate result sets have been constructed, each customer’s total can be compared to the total for their region to see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region’s sales.
The query to aggregate sales by region looks as follows:
SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id;
REGION_ID TOT_SALES ---------- ---------- 5 6585641 6 6307766 7 6868495 8 6854731 9 6739374 10 6238901
The query to aggregate sales by customer would be:
SELECT o.cust_nbr cust_nbr, o.region_id region_id,
SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.cust_nbr, o.region_id;
CUST_NBR REGION_ID TOT_SALES ---------- ---------- ---------- 1 5 1151162 2 5 1224992 3 5 1161286 4 5 1878275 5 5 1169926 6 6 1788836 7 6 971585 8 6 1141638 9 6 1208959 10 6 1196748 11 7 1190421 12 7 1182275 13 7 1310434 14 7 1929774 15 7 1255591 16 8 1068467 17 8 1944281 18 8 1253840 19 8 1174421 20 8 1413722 21 9 1020541 22 9 1036146 23 9 1224992 24 9 1224992 25 9 2232703 26 10 1808949 27 10 1322747 28 10 986964 29 10 903383 30 10 1216858
By placing each of the two queries in an
inline view and joining them on
region_id
, you can identify those customers whose
total sales exceeds 20% of their region, as in:
SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales
FROM
(SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id) region_sales INNER JOIN
(SELECT o.cust_nbr cust_nbr, o.region_id region_id,
SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.cust_nbr, o.region_id) cust_sales
ON cust_sales.region_id = region_sales.region_id
WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2);
CUST_NBR REGION_ID CUST_SALES REGION_SALES ---------- ---------- ---------- ------------ 4 5 1878275 6585641 6 6 1788836 6307766 14 7 1929774 6868495 17 8 1944281 6854731 20 8 1413722 6854731 25 9 2232703 6739374 26 10 1808949 6238901 27 10 1322747 6238901
The final step is to join the
region
and
customer
dimensions to include the customer and
region names in the result set:
SELECT c.name cust_name,
big_custs.cust_sales cust_sales, r.name region_name,
100 * ROUND(big_custs.cust_sales /
big_custs.region_sales, 2) percent_of_region
FROM
(SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
cust_sales.tot_sales cust_sales,
region_sales.tot_sales region_sales
FROM
(SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id) region_sales INNER JOIN
(SELECT o.cust_nbr cust_nbr, o.region_id region_id,
SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.cust_nbr, o.region_id) cust_sales
ON cust_sales.region_id = region_sales.region_id
WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs INNER JOIN customer c
ON big_custs.cust_nbr = c.cust_nbr
INNER JOIN region r
ON big_custs.region_id = r.region_id;
CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 Southeast US 28 Evans Supply Corp. 1944281 Southwest US 28 Malden Labs 1413722 Southwest US 21 Worcester Technologies 2232703 Northwest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21
Using nothing more exotic than inline views, therefore, it is possible to construct a single query that generates the desired results. Such a solution, however, has the following shortcomings:
The query is fairly complex.
Two passes through the same rows of the orders
table are required to generate the different aggregation levels
needed by the query.
Let’s see how we can both simplify the query and
perform the same work in a single pass through the
orders
table using one of the new analytic
functions. Rather than issuing two separate queries to aggregate
sales per region and per customer, we will create a single query that
aggregates sales over both region and customer, and then call an
analytic function that performs a second level of aggregation to
generate total sales per region:
SELECT o.region_id region_id, o.cust_nbr cust_nbr,
SUM(o.tot_sales) tot_sales,
SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id, o.cust_nbr;
REGION_ID CUST_NBR TOT_SALES REGION_SALES ---------- ---------- ---------- ------------ 5 1 1151162 6585641 5 2 1224992 6585641 5 3 1161286 6585641 5 4 1878275 6585641 5 5 1169926 6585641 6 6 1788836 6307766 6 7 971585 6307766 6 8 1141638 6307766 6 9 1208959 6307766 6 10 1196748 6307766 7 11 1190421 6868495 7 12 1182275 6868495 7 13 1310434 6868495 7 14 1929774 6868495 7 15 1255591 6868495 8 16 1068467 6854731 8 17 1944281 6854731 8 18 1253840 6854731 8 19 1174421 6854731 8 20 1413722 6854731 9 21 1020541 6739374 9 22 1036146 6739374 9 23 1224992 6739374 9 24 1224992 6739374 9 25 2232703 6739374 10 26 1808949 6238901 10 27 1322747 6238901 10 28 986964 6238901 10 29 903383 6238901 10 30 1216858 6238901
The analytic function can be found in line 3 of the previous query
and the result has the alias region_sales
. The
aggregate function (SUM(o.tot_sales)
) in line 2
generates the total sales per customer and region as directed by the
GROUP BY clause, and the analytic function in line 3 aggregates these
sums for each region, thereby computing the aggregate sales per
region. The value for the region_sales
column is
identical for all customers within the same region and is equal to
the sum of all customer sales within that region. We can then wrap
the query in an inline view, filter out those customers with less
than 20% of their region’s total sales, and join the
region
and customer
tables to
generate the desired result set:
SELECT c.name cust_name,
cust_sales.tot_sales cust_sales, r.name region_name,
100 * ROUND(cust_sales.tot_sales /
cust_sales.region_sales, 2) percent_of_region
FROM
(SELECT o.region_id region_id, o.cust_nbr cust_nbr,
SUM(o.tot_sales) tot_sales,
SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id, o.cust_nbr) cust_sales INNER JOIN region r
ON cust_sales.region_id = r.region_id
INNER JOIN customer c
ON cust_sales.cust_nbr = c.cust_nbr
WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2);
CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 Southeast US 28 Evans Supply Corp. 1944281 Southwest US 28 Malden Labs 1413722 Southwest US 21 Worcester Technologies 2232703 Northwest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21
Using an inline view saves us from having to join the
region
and customer
tables to
the orders table; otherwise, we would have to include columns from
the region
and customer
tables
in the GROUP BY clause.
Later in this chapter, under “Reporting Functions,” we’ll get into the details of how the SUM...OVER function works. For now, you can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.
Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle’s suite of analytic functions can only be used in the SELECT and ORDER BY clauses of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query’s ORDER BY clause is evaluated to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions as well as specify analytic functions not found in the SELECT clause.
The remainder of this chapter introduces the Oracle8i Database and Oracle9i Database analytic functions, grouped by functionality.
Determining the performance of a particular business entity compared to its peers is central to a wide variety of business decisions. Examples include:
Identifying assets with the highest utilization
Determining the worst selling products by region
Finding the best performing salespeople
Prior to the release of Oracle8i Database, you could use the ORDER BY clause to sort a result set on one or more columns, but any further processing to calculate rankings or percentiles had to be performed using a procedural language. Beginning with Oracle8i Database, however, you can take advantage of several new functions to either generate rankings for each row in a result set or to group rows into buckets for percentile calculations.
The RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolves around how each one handles ties:
Returns a unique number for each row starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned.
Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.
An example will best illustrate the differences. First, here is the query to generate the aggregate sales data by region and customer for the year 2001:
SELECT region_id, cust_nbr,
SUM(tot_sales) cust_sales
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY region_id, cust_nbr;
REGION_ID CUST_NBR CUST_SALES ---------- ---------- ---------- 5 1 1151162 5 2 1224992 5 3 1161286 5 4 1878275 5 5 1169926 6 6 1788836 6 7 971585 6 8 1141638 6 9 1208959 6 10 1196748 7 11 1190421 7 12 1182275 7 13 1310434 7 14 1929774 7 15 1255591 8 16 1068467 8 17 1944281 8 18 1253840 8 19 1174421 8 20 1413722 9 21 1020541 9 22 1036146 9 23 1224992 9 24 1224992 9 25 2232703 10 26 1808949 10 27 1322747 10 28 986964 10 29 903383 10 30 1216858
Notice that three of the customers (2, 23, and 24) have the same value for total sales ($1,224,992). In the next query, three function calls are added to generate rankings for each customer across all regions, and the results are then ordered by the ROW_NUMBER function to make the difference in rankings easier to observe:
SELECT region_id, cust_nbr,
SUM(tot_sales) cust_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank,
DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY sales_number;
REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER ---------- ---------- ---------- ---------- ---------------- ------------ 9 25 2232703 1 1 1 8 17 1944281 2 2 2 7 14 1929774 3 3 3 5 4 1878275 4 4 4 10 26 1808949 5 5 5 6 6 1788836 6 6 6 8 20 1413722 7 7 7 10 27 1322747 8 8 8 7 13 1310434 9 9 9 7 15 1255591 10 10 10 8 18 1253840 11 11 11 5 2 1224992 12 12 12 9 23 1224992 12 12 13 9 24 1224992 12 12 14 10 30 1216858 15 13 15 6 9 1208959 16 14 16 6 10 1196748 17 15 17 7 11 1190421 18 16 18 7 12 1182275 19 17 19 8 19 1174421 20 18 20 5 5 1169926 21 19 21 5 3 1161286 22 20 22 5 1 1151162 23 21 23 6 8 1141638 24 22 24 8 16 1068467 25 23 25 9 22 1036146 26 24 26 9 21 1020541 27 25 27 10 28 986964 28 26 28 6 7 971585 29 27 29 10 29 903383 30 28 30
Don’t be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clauses internally to sort their results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer’s sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key.
Both the RANK and DENSE_RANK functions assign the rank of 12 to the three rows with total sales of $1,224,992, while the ROW_NUMBER function assigns the ranks 12, 13, and 14 to the same rows. The difference between the RANK and DENSE_RANK functions manifests itself in the ranking assigned to the next-lowest sales total; the RANK function leaves a gap in the ranking sequence and assigns a rank of 15 to customer number 30, while the DENSE_RANK function continues the sequence with a ranking of 13.
Deciding which of the three functions to use depends on the desired outcome. If you want to identify the top 13 customers from this result set, you would use:
If you want exactly 13 rows without regard to ties. In this case, one of the customers who might otherwise be included in the list will be excluded from the final set.
If you want at least 13 rows but don’t want to include rows that would have been excluded had there been no ties. In this case, you would retrieve 14 rows.
If you want all customers with a ranking of 13 or less, including all duplicates. In this case, you would retrieve 15 rows.
While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
RANK( ) OVER (PARTITION BY region_id
ORDER BY SUM(tot_sales) DESC) sales_rank,
DENSE_RANK( ) OVER (PARTITION BY region_id
ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
ROW_NUMBER( ) OVER (PARTITION BY region_id
ORDER BY SUM(tot_sales) DESC) sales_number
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY region_id, sales_number;
REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER ---------- ---------- ---------- ---------- ---------------- ------------ 5 4 1878275 1 1 1 5 2 1224992 2 2 2 5 5 1169926 3 3 3 5 3 1161286 4 4 4 5 1 1151162 5 5 5 6 6 1788836 1 1 1 6 9 1208959 2 2 2 6 10 1196748 3 3 3 6 8 1141638 4 4 4 6 7 971585 5 5 5 7 14 1929774 1 1 1 7 13 1310434 2 2 2 7 15 1255591 3 3 3 7 11 1190421 4 4 4 7 12 1182275 5 5 5 8 17 1944281 1 1 1 8 20 1413722 2 2 2 8 18 1253840 3 3 3 8 19 1174421 4 4 4 8 16 1068467 5 5 5 9 25 2232703 1 1 1 9 23 1224992 2 2 2 9 24 1224992 2 2 3 9 22 1036146 4 3 4 9 21 1020541 5 4 5 10 26 1808949 1 1 1 10 27 1322747 2 2 2 10 30 1216858 3 3 3 10 28 986964 4 4 4 10 29 903383 5 5 5
Each customer receives a ranking between one and five depending on their relation to other customers in the same region. Of the three customers with duplicate total sales, two of them are in region 9; as before, the RANK and DENSE_RANK functions generate identical rankings for both customers.
The PARTITION BY clause used in ranking functions is used to divide a result set into pieces so that rankings can be applied within each subset. This is completely different from the PARTITION BY RANGE/HASH/LIST clauses introduced in Chapter 10 for breaking a table or index into multiple pieces.
All ranking functions allow you to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr;
If omitted, NULL values will either appear last in ascending rankings or first in descending rankings.
One of the most common uses of a ranked data set is to identify the top N or bottom N performers. Since you can’t call analytic functions from the WHERE or HAVING clauses, you are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top five salespersons for 2001:
SELECT s.name, sp.sp_sales total_sales
FROM
(SELECT salesperson_id, SUM(tot_sales) sp_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank
FROM orders
WHERE year = 2001
GROUP BY salesperson_id) sp INNER JOIN salesperson s
ON sp.salesperson_id = s.salesperson_id
WHERE sp.sales_rank <= 5
ORDER BY sp.sales_rank;
NAME TOTAL_SALES ------------------------------------------------- ----------- Jeff Blake 1927580 Sam Houseman 1814327 Mark Russell 1784596 John Boorman 1768813 Carl Isaacs 1761814 Tim McGowan 1761814
Although there is no function for returning only the top or bottom N from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: “Find the regions with the best and worst total sales last year.” Unlike the top five salespeople example from the previous section, this query needs an additional piece of information—the size of the result set—to answer the question.
Oracle9i provides the ability to answer such queries efficiently using functions that rank the result set based on a specified ordering, identify the row with the top or bottom ranking, and report on any column available in the result set. These functions are composed of three parts:
An ORDER BY clause that specifies how to rank the result set.
The keywords FIRST and LAST to specify whether to use the top or bottom-ranked row.
An aggregate function (i.e., MIN, MAX, AVG, COUNT) used as a tiebreaker in case more than one row of the result set tie for the FIRST or LAST spot in the ranking.
The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales:
SELECT
MIN(region_id)
KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region,
MIN(region_id)
KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region
FROM orders
WHERE year = 2001
GROUP BY region_id;
BEST_REGION WORST_REGION ----------- ------------ 7 10
The use of the MIN function in the previous query is a bit confusing:
it is used only if more than one region ties for either first or last
place in the ranking. If there were a tie, the row with the minimum
value for region_id
would be chosen. To find out
if a tie actually exists, you could call each function twice using
MIN for the first and MAX for the second, and see if they return the
same results:
SELECT
MIN(region_id)
KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) min_best_region,
MAX(region_id)
KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) max_best_region,
MIN(region_id)
KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) min_worst_region,
MAX(region_id)
KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) max_worst_region
FROM orders
WHERE year = 2001
GROUP BY region_id;
MIN_BEST_REGION MAX_BEST_REGION MIN_WORST_REGION MAX_WORST_REGION --------------- --------------- ---------------- ---------------- 7 7 10 10
In this case, there are no ties for either first or last place. Depending on the type of data you are working with, using an aggregate function as a tiebreaker can can be somewhat arbitrary.
Another way rankings are commonly used is to generate buckets into which sets of rankings are grouped. For example, you may want to find those customers whose total sales ranked in the top 25%. The following query uses the NTILE function to group the customers into four buckets (or quartiles):
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY sales_quartile, cust_sales DESC;
REGION_ID CUST_NBR CUST_SALES SALES_QUARTILE ---------- ---------- ---------- -------------- 9 25 2232703 1 8 17 1944281 1 7 14 1929774 1 5 4 1878275 1 10 26 1808949 1 6 6 1788836 1 8 20 1413722 1 10 27 1322747 1 7 13 1310434 2 7 15 1255591 2 8 18 1253840 2 5 2 1224992 2 9 23 1224992 2 9 24 1224992 2 10 30 1216858 2 6 9 1208959 2 6 10 1196748 3 7 11 1190421 3 7 12 1182275 3 8 19 1174421 3 5 5 1169926 3 5 3 1161286 3 5 1 1151162 3 6 8 1141638 4 8 16 1068467 4 9 22 1036146 4 9 21 1020541 4 10 28 986964 4 6 7 971585 4 10 29 903383 4
The sales_quartile
column in this query specifies
NTILE(4)
to create four buckets. The NTILE
function finds each row’s place in the ranking, and
then assigns each row to a bucket such that every bucket contains the
same number of rows. If the number of rows is not evenly divisible by
the number of buckets, then the extra rows are distributed so that
the number of rows per bucket differs by one at most. In the previous
example, there are four buckets allocated for 30 rows, with buckets
one and two containing eight rows each, and buckets three and four
containing seven rows each. This approach is referred to as
equiheight buckets
because each bucket contains (optimally) the same number of rows.
Just like in the top N query discussed earlier, you will need to wrap the query in an inline view if you want to filter on the NTILE result:
SELECT r.name region, c.name customer, cs.cust_sales
FROM
(SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr) cs INNER JOIN customer c
ON cs.cust_nbr = c.cust_nbr
INNER JOIN region r
ON cs.region_id = r.region_id
WHERE cs.sales_quartile = 1
ORDER BY cs.cust_sales DESC;
REGION CUSTOMER CUST_SALES -------------------- ------------------------------ ---------- Northwest US Worcester Technologies 2232703 Southwest US Evans Supply Corp. 1944281 Southeast US Madden Industries 1929774 New England Flowtech Inc. 1878275 Central US Alpha Technologies 1808949 Mid-Atlantic Spartan Industries 1788836 Southwest US Malden Labs 1413722 Central US Phillips Labs 1322747
The outer query filters on sales_quartile = 1
,
which removes all rows not in the top 25% of sales, and then joins
the region
and customer
dimensions to generate the final results.
Similar to the NTILE function, the WIDTH_BUCKET function groups rows of the result set into buckets. Unlike NTILE, however, the WIDTH_BUCKET function attempts to create equiwidth buckets, meaning that the range of values is evenly distributed across the buckets. If your data were distributed across a bell curve, therefore, you could expect the buckets representing the low and high ranges of the bell curve to contain few records, whereas the buckets representing the middle ranges would contain many records.
WIDTH_BUCKET can operate on numeric or date types, and takes the following four parameters:
The expression that generates the buckets
The value used as the start of the range for bucket #1
The value used as the end of the range for bucket #N
The number of buckets to create (N)
WIDTH_BUCKET uses the values of the second, third, and fourth parameters to generate N buckets containing comparable ranges. If the expression yields values that fall outside the range specified by the second and third parameters, the WIDTH_BUCKET function will generate two additional buckets, numbered 0 and N + 1, into which the outliers are placed. If you want to work with the entire result set, you need to make sure your values for the second and third parameters completely enclose the range of values in the result set. However, if you only wish to work with a subset of the data, you can specify values for the second and third parameters that enclose the desired range, and any rows falling outside the range will be placed into buckets 0 and N + 1.
Here’s an example that uses the NTILE example from earlier to generate three buckets for the total sales per customer:
SELECT region_id, cust_nbr,
SUM(tot_sales) cust_sales,
WIDTH_BUCKET(SUM(tot_sales), 1, 3000000, 3) sales_buckets
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY cust_sales;
REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS ---------- ---------- ---------- ------------- 10 29 903383 1 6 7 971585 1 10 28 986964 1 9 21 1020541 2 9 22 1036146 2 8 16 1068467 2 6 8 1141638 2 5 1 1151162 2 5 3 1161286 2 5 5 1169926 2 8 19 1174421 2 7 12 1182275 2 7 11 1190421 2 6 10 1196748 2 6 9 1208959 2 10 30 1216858 2 5 2 1224992 2 9 24 1224992 2 9 23 1224992 2 8 18 1253840 2 7 15 1255591 2 7 13 1310434 2 10 27 1322747 2 8 20 1413722 2 6 6 1788836 2 10 26 1808949 2 5 4 1878275 2 7 14 1929774 2 8 17 1944281 2 9 25 2232703 3
Based on these parameters, the WIDTH_BUCKET function generates three buckets; the first bucket starts at 1, and the third bucket has an upper range of 3,000,000. Since there are three buckets, the ranges for each bucket will be 1 to 1,000,000, 1,000,001 to 2,000,000, and 2,000,001 to 3,000,000. When the rows are placed in the appropriate bucket, there are three rows that fall into bucket #1, a single row that falls in bucket #3, and the remaining 26 rows that fall into the second bucket.
The values 1 and 3,000,000 were chosen to guarantee that all rows in the result set would be placed into one of the three buckets. If you want to generate buckets only for rows that have aggregate sales between $1,000,000 and $2,000,000, the WIDTH_BUCKET function will place the remaining rows in the 0th and 4th buckets:
SELECT region_id, cust_nbr,
SUM(tot_sales) cust_sales,
WIDTH_BUCKET(SUM(tot_sales), 1000000, 2000000, 3) sales_buckets
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY cust_sales;
REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS ---------- ---------- ---------- ------------- 10 29 903383 0 6 7 971585 0 10 28 986964 0 9 21 1020541 1 9 22 1036146 1 8 16 1068467 1 6 8 1141638 1 5 1 1151162 1 5 3 1161286 1 5 5 1169926 1 8 19 1174421 1 7 12 1182275 1 7 11 1190421 1 6 10 1196748 1 6 9 1208959 1 10 30 1216858 1 5 2 1224992 1 9 24 1224992 1 9 23 1224992 1 8 18 1253840 1 7 15 1255591 1 7 13 1310434 1 10 27 1322747 1 8 20 1413722 2 6 6 1788836 3 10 26 1808949 3 5 4 1878275 3 7 14 1929774 3 8 17 1944281 3 9 25 2232703 4
Keep in mind that the WIDTH_BUCKET function does not remove rows from the result set that do not lie within the specified range; rather, they are placed into special buckets that your query can either utilize or ignore as needed.
The final two ranking functions, CUME_DIST and PERCENT_RANK, use the rank of a particular row to calculate additional information. The CUME_DIST function (short for Cumulative Distribution) calculates the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition. The PERCENT_RANK function calculates the ratio of a row’s ranking to the number of rows in the partition using the formula:
(RRP
-- 1) / (NRP
-- 1)
where RRP
is the “rank of
row in partition,” and
NRP
is the “number of
rows in partition.”
Both functions utilize DENSE_RANK for their rankings and can be specified to be in ascending or descending order. The following query demonstrates the use of these two functions (both specifying descending order) with the customer yearly sales query:
SELECT region_id, cust_nbr,
SUM(tot_sales) cust_sales,
CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_cume_dist,
PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_percent_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY cust_sales DESC;
REGION_ID CUST_NBR CUST_SALES SALES_CUME_DIST SALES_PERCENT_RANK ---------- ---------- ---------- --------------- ------------------ 9 25 2232703 .033333333 0 8 17 1944281 .066666667 .034482759 7 14 1929774 .1 .068965517 5 4 1878275 .133333333 .103448276 10 26 1808949 .166666667 .137931034 6 6 1788836 .2 .172413793 8 20 1413722 .233333333 .206896552 10 27 1322747 .266666667 .24137931 7 13 1310434 .3 .275862069 7 15 1255591 .333333333 .310344828 8 18 1253840 .366666667 .344827586 5 2 1224992 .466666667 .379310345 9 23 1224992 .466666667 .379310345 9 24 1224992 .466666667 .379310345 10 30 1216858 .5 .482758621 6 9 1208959 .533333333 .517241379 6 10 1196748 .566666667 .551724138 7 11 1190421 .6 .586206897 7 12 1182275 .633333333 .620689655 8 19 1174421 .666666667 .655172414 5 5 1169926 .7 .689655172 5 3 1161286 .733333333 .724137931 5 1 1151162 .766666667 .75862069 6 8 1141638 .8 .793103448 8 16 1068467 .833333333 .827586207 9 22 1036146 .866666667 .862068966 9 21 1020541 .9 .896551724 10 28 986964 .933333333 .931034483 6 7 971585 .966666667 .965517241 10 29 903383 1 1
Let’s walk through a couple of calculations for customer number 1 in the previous result set. With total sales of $1,151,162, customer number 1 ranks 23rd in the set of 30 customers in descending order of sales. Since there are a total of 30 rows, the CUME_DIST is equal to 23/30, or .766666667. The PERCENT_RANK function yields (23 - 1) / (30 - 1) = .75862069. It should come as no surprise that each function returns identical values for the rows that have identical sales totals, since the calculations are based on rank, which is identical for all three rows.
For some types of analysis, determining what might have happened is more revealing than knowing what really happened. Oracle provides special versions of RANK, DENSE_RANK, CUME_DIST, and PERCENT_RANK that allow rankings and distributions to be calculated for hypothetical data, allowing the user to see what would have happened if a specific value (or set of values) was included in a data set.
To illustrate this concept, let’s rank all customers by total sales for 2001, and then see where a hypothetical sales figure would fall in the ranking. Here is the query that generates the rankings and distributions:
SELECT cust_nbr, SUM(tot_sales) cust_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) rank,
DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) dense_rank,
CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) cume_dist,
PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) percent_rank
FROM orders
WHERE year = 2001
GROUP BY cust_nbr
ORDER BY rank;
CUST_NBR CUST_SALES RANK DENSE_RANK CUME_DIST PERCENT_RANK ---------- ---------- ---------- ---------- ---------- ------------ 25 2232703 1 1 .033333333 0 17 1944281 2 2 .066666667 .034482759 14 1929774 3 3 .1 .068965517 4 1878275 4 4 .133333333 .103448276 26 1808949 5 5 .166666667 .137931034 6 1788836 6 6 .2 .172413793 20 1413722 7 7 .233333333 .206896552 27 1322747 8 8 .266666667 .24137931 13 1310434 9 9 .3 .275862069 15 1255591 10 10 .333333333 .310344828 18 1253840 11 11 .366666667 .344827586 2 1224992 12 12 .466666667 .379310345 23 1224992 12 12 .466666667 .379310345 24 1224992 12 12 .466666667 .379310345 30 1216858 15 13 .5 .482758621 9 1208959 16 14 .533333333 .517241379 10 1196748 17 15 .566666667 .551724138 11 1190421 18 16 .6 .586206897 12 1182275 19 17 .633333333 .620689655 19 1174421 20 18 .666666667 .655172414 5 1169926 21 19 .7 .689655172 3 1161286 22 20 .733333333 .724137931 1 1151162 23 21 .766666667 .75862069 8 1141638 24 22 .8 .793103448 16 1068467 25 23 .833333333 .827586207 22 1036146 26 24 .866666667 .862068966 21 1020541 27 25 .9 .896551724 28 986964 28 26 .933333333 .931034483 7 971585 29 27 .966666667 .965517241 29 903383 30 28 1 1
Now let’s see where a customer with an even million dollars of sales would have ranked:
SELECT
RANK(1000000) WITHIN GROUP
(ORDER BY SUM(tot_sales) DESC) hyp_rank,
DENSE_RANK(1000000) WITHIN GROUP
(ORDER BY SUM(tot_sales) DESC) hyp_dense_rank,
CUME_DIST(1000000) WITHIN GROUP
(ORDER BY SUM(tot_sales) DESC) hyp_cume_dist,
PERCENT_RANK(1000000) WITHIN GROUP
(ORDER BY SUM(tot_sales) DESC) hyp_percent_rank
FROM orders
WHERE year = 2001
GROUP BY cust_nbr;
HYP_RANK HYP_DENSE_RANK HYP_CUME_DIST HYP_PERCENT_RANK ---------- -------------- ------------- ---------------- 28 26 .903225806 .9
The WITHIN GROUP clause has the effect of injecting a fictitious row into the result set before determining the rankings. One possible use of this functionality would be to see how actual sales compare to sales targets.
The ranking functions described thus far are quite useful when comparing items within a fixed window of time, such as “last year” or “second quarter.” But what if you want to perform computations using a window that slides as you progress through the data set? Oracle’s windowing functions allow aggregates to be calculated for each row in a result set based on a specified window. The aggregation window can be defined in one of three ways:
By specifying a set of rows: “From the current row to the end of the partition”
By specifying a time interval: “For the 30 days preceding the transaction date”
By specifying a range of values: “All rows having a transaction amount within 5% of the current row’s transaction amount”
The first set of examples will generate a window that fills the entire partition, and then show how the window can be detached from one or both ends of the partition so that it floats with the current row. All of the examples will be based on the following query, which calculates total monthly sales in 2001 for the Mid-Atlantic region:
SELECT month,
SUM(tot_sales) monthly_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES ---------- ------------- 1 610697 2 428676 3 637031 4 541146 5 592935 6 501485 7 606914 8 460520 9 392898 10 510117 11 532889 12 492458
The first step is to sum the monthly sales for the entire result set by specifying an “unbounded” window. Note the ROWS BETWEEN clause in the following example:
SELECT month,
SUM(tot_sales) monthly_sales,
SUM(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) total_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES TOTAL_SALES ---------- ------------- ----------- 1 610697 6307766 2 428676 6307766 3 637031 6307766 4 541146 6307766 5 592935 6307766 6 501485 6307766 7 606914 6307766 8 460520 6307766 9 392898 6307766 10 510117 6307766 11 532889 6307766 12 492458 6307766
Each time the function executes, it sums the monthly sales from months 1 through 12; thus, the same calculation is being performed 12 times. This is a rather inefficient way to generate the yearly sales total (see Section 14.4 later in this chapter for a better method), but it should give you an idea of the syntax for building an aggregation window. The next query will create a window that spans from the top of the partition to the current row. The function identifies the month that has the maximum sales, up to and including the current month:
SELECT month,
SUM(tot_sales) monthly_sales,
MAX(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) max_preceeding
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES MAX_PRECEEDING ---------- ------------- -------------- 1 610697 610697 2 428676 610697 3 637031 637031 4 541146 637031 5 592935 637031 6 501485 637031 7 606914 637031 8 460520 637031 9 392898 637031 10 510117 637031 11 532889 637031 12 492458 637031
Unlike the first query, which has a window size fixed at 12 rows, this query’s aggregation window grows from a single row for month 1 to 12 rows for month 12. The keywords CURRENT ROW are used to indicate that the window should end at the current row being inspected by the function. If you replace MAX in the previous query with SUM, you can calculate a running total:
SELECT month,
SUM(tot_sales) monthly_sales,
SUM(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) running_total
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES RUNNING_TOTAL ---------- ------------- -------------- 1 610697 610697 2 428676 1039373 3 637031 1676404 4 541146 2217550 5 592935 2810485 6 501485 3311970 7 606914 3918884 8 460520 4379404 9 392898 4772302 10 510117 5282419 11 532889 5815308 12 492458 6307766
You have now seen examples using windows that are fixed at one or both ends. The next query will define a window that floats freely with each row:
SELECT month,
SUM(tot_sales) monthly_sales,
AVG(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES ROLLING_AVG ---------- ------------- ----------- 1 610697 519686.5 2 428676 558801.333 3 637031 535617.667 4 541146 590370.667 5 592935 545188.667 6 501485 567111.333 7 606914 522973 8 460520 486777.333 9 392898 454511.667 10 510117 478634.667 11 532889 511821.333 12 492458 512673.5
For each of the 12 rows, the function calculates the average sales of the current month, the previous month, and the following month. The value of the ROLLING_AVG column is therefore the average sales within a three month floating window centered on the current month, with the exception that months 1 and 12 are calculated using a two-month window, since there is no previous month for month 1 or following month for month 12.
The previous windowing
examples use the ROWS BETWEEN option
to specify which rows to include in the aggregation. You may
alternately specify a range and let Oracle determine which rows lie
within the range. For example, the previous query used ROWS
BETWEEN 1 PRECEDING AND 1 FOLLOWING
to generate a
three-month rolling average; the same results can be achieved by
substituting RANGE for ROWS:
SELECT month,
SUM(tot_sales) monthly_sales,
AVG(SUM(tot_sales)) OVER (ORDER BY month
RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES ROLLING_AVG ---------- ------------- ----------- 1 610697 519686.5 2 428676 558801.333 3 637031 535617.667 4 541146 590370.667 5 592935 545188.667 6 501485 567111.333 7 606914 522973 8 460520 486777.333 9 392898 454511.667 10 510117 478634.667 11 532889 511821.333 12 492458 512673.5
This substitution works because the month column contains integer values, so adding and subtracting 1 from the current month yields a three-month range. The next variation achieves the same results but specifies a range of +/- 1.999:
SELECT month,
SUM(tot_sales) monthly_sales,
AVG(SUM(tot_sales)) OVER (ORDER BY month
RANGE BETWEEN 1.999 PRECEDING AND 1.999 FOLLOWING) rolling_avg
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES ROLLING_AVG ---------- ------------- ----------- 1 610697 519686.5 2 428676 558801.333 3 637031 535617.667 4 541146 590370.667 5 592935 545188.667 6 501485 567111.333 7 606914 522973 8 460520 486777.333 9 392898 454511.667 10 510117 478634.667 11 532889 511821.333 12 492458 512673.5
If you are generating a window based on a DATE column, you can
specify a range in increments of days, months, or years. Since the
orders
table has no DATE columns, the next example
shows how a date range can be specified against the
order_dt
column of the
cust_order
table:
SELECT TRUNC(order_dt) day,
SUM(sale_price) daily_sales,
AVG(SUM(sale_price)) OVER (ORDER BY TRUNC(order_dt)
RANGE BETWEEN INTERVAL '2' DAY PRECEDING
AND INTERVAL '2' DAY FOLLOWING) five_day_avg
FROM cust_order
WHERE sale_price IS NOT NULL
AND order_dt BETWEEN TO_DATE('01-JUL-2001','DD-MON-YYYY')
AND TO_DATE('31-JUL-2001','DD-MON-YYYY')
GROUP BY TRUNC(order_dt);
DAY DAILY_SALES FIVE_DAY_AVG --------- ----------- ------------ 16-JUL-01 112 146 18-JUL-01 180 114 20-JUL-01 50 169 21-JUL-01 50 165.333333 22-JUL-01 396 165.333333
This query generates a five-day rolling window by specifying a range of +/- two days around the truncated order date.
Oracle provides two additional aggregate functions, called FIRST_VALUE and LAST_VALUE, that can be used with windowing functions to identify the values of the first and last values in the window. In the case of the three-month rolling average query shown previously, you could display the values of all three months along with the average of the three, as in:
SELECT month,
FIRST_VALUE(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) prev_month,
SUM(tot_sales) monthly_sales,
LAST_VALUE(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) next_month,
AVG(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) rolling_avg
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH PREV_MONTH MONTHLY_SALES NEXT_MONTH ROLLING_AVG ---------- ---------- ------------- ---------- ----------- 1 610697 610697 428676 519686.5 2 610697 428676 637031 558801.333 3 428676 637031 541146 535617.667 4 637031 541146 592935 590370.667 5 541146 592935 501485 545188.667 6 592935 501485 606914 567111.333 7 501485 606914 460520 522973 8 606914 460520 392898 486777.333 9 460520 392898 510117 454511.667 10 392898 510117 532889 478634.667 11 510117 532889 492458 511821.333 12 532889 492458 492458 512673.5
These functions are useful for queries that compare each value to the first or last value in the period, such as: “How did each month’s sales compare to the first month?”
Although not technically windowing functions, the LAG and LEAD functions are included here because they allow rows to be referenced by their position relative to the current row, much like the PRECEDING and FOLLOWING clauses within windowing functions. LAG and LEAD are useful for comparing one row of a result set with another row of the same result set. For example, the query “Compute the total sales per month for the Mid-Atlantic region, including the percent change from the previous month” requires data from both the current and preceding rows to calculate the answer. This is, in effect, a two-row window, but the offset from the current row can be specified as one or more rows, making LAG and LEAD act like specialized windowing functions where only the outer edges of the window are utilized.
Here is the SQL that uses the LAG function to generate the data needed to answer the question posed in the previous paragraph:
SELECT month,
SUM(tot_sales) monthly_sales,
LAG(SUM(tot_sales), 1) OVER (ORDER BY month) prev_month_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES PREV_MONTH_SALES ---------- ------------- ---------------- 1 610697 2 428676 610697 3 637031 428676 4 541146 637031 5 592935 541146 6 501485 592935 7 606914 501485 8 460520 606914 9 392898 460520 10 510117 392898 11 532889 510117 12 492458 532889
As you might expect, the LAG value for month 1 is NULL, since there is no preceding month. This would also be the case for the LEAD value for month 12. If you would like the LAG and LEAD functions to return a non-NULL value for these cases, you can specify a substitute value via the optional third parameter (see next example).
The next query utilizes the output from the previous query to generate the percentage difference from month to month. Note how a third parameter has been specified for the LAG function so that month 1 will use the current month’s sales instead of a NULL value for the percentage change:
SELECT months.month month, months.monthly_sales monthly_sales,
ROUND((months.monthly_sales - months.prev_month_sales) /
months.prev_month_sales, 3) * 100 percent_change
FROM
(SELECT month,
SUM(tot_sales) monthly_sales,
LAG(SUM(tot_sales), 1, SUM(tot_sales))
OVER (ORDER BY month) prev_month_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month) months
ORDER BY month;
MONTH MONTHLY_SALES PERCENT_CHANGE ---------- ------------- -------------- 1 610697 0 2 428676 -29.8 3 637031 48.6 4 541146 -15.1 5 592935 9.6 6 501485 -15.4 7 606914 21 8 460520 -24.1 9 392898 -14.7 10 510117 29.8 11 532889 4.5 12 492458 -7.6
Similar to the windowing functions described earlier, reporting functions allow the execution of various aggregate functions (MIN, MAX, SUM, COUNT, AVG, etc.) against a result set. Unlike windowing functions, however, the reporting functions cannot specify localized windows and thus generate the same result for each entire partition (or the entire result set, if no partitions are specified). Therefore, anything that can be accomplished using a reporting function can also be accomplished using a windowing function with an unbounded window, although it will generally be more efficient to use the reporting function.
Earlier in the chapter, we used a windowing function with an unbounded reporting window to generate the total sales for the 12 months of 2001:
SELECT month,
SUM(tot_sales) monthly_sales,
SUM(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) total_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES TOTAL_SALES ---------- ------------- ----------- 1 610697 6307766 2 428676 6307766 3 637031 6307766 4 541146 6307766 5 592935 6307766 6 501485 6307766 7 606914 6307766 8 460520 6307766 9 392898 6307766 10 510117 6307766 11 532889 6307766 12 492458 6307766
The next query adds a reporting function to generate the same results:
SELECT month,
SUM(tot_sales) monthly_sales,
SUM(SUM(tot_sales)) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) window_sales,
SUM(SUM(tot_sales)) OVER ( ) reporting_sales
FROM orders
WHERE year = 2001
AND region_id = 6
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES WINDOW_SALES REPORTING_SALES ---------- ------------- ------------ --------------- 1 610697 6307766 6307766 2 428676 6307766 6307766 3 637031 6307766 6307766 4 541146 6307766 6307766 5 592935 6307766 6307766 6 501485 6307766 6307766 7 606914 6307766 6307766 8 460520 6307766 6307766 9 392898 6307766 6307766 10 510117 6307766 6307766 11 532889 6307766 6307766 12 492458 6307766 6307766
The empty parentheses after the OVER clause for the
reporting_sales
column indicate that the entire
result set should be included in the sum, which has the same effect
as using an unbounded window function. Hopefully, you will agree that
the reporting function is easier to understand than the unbounded
window function.
Reporting functions are useful when you need both detail and aggregate data (or different aggregation levels) to answer a business query. For example, the query “Show the monthly sales totals for 2001 along with each month’s percentage of yearly sales” requires the detail rows to be aggregated first to the month level, and then to the year level to answer the question. Rather than computing both aggregations from the detail rows, you can use the SUM function with a GROUP BY clause to aggregate to the month level, and then use a reporting function to aggregate the monthly totals, as in:
SELECT month,
SUM(tot_sales) monthly_sales,
SUM(SUM(tot_sales)) OVER ( ) yearly_sales
FROM orders
WHERE year = 2001
GROUP BY month
ORDER BY month;
MONTH MONTHLY_SALES YEARLY_SALES ---------- ------------- ------------ 1 3028325 39594908 2 3289336 39594908 3 3411024 39594908 4 3436482 39594908 5 3749264 39594908 6 3204730 39594908 7 3233532 39594908 8 3081290 39594908 9 3388292 39594908 10 3279637 39594908 11 3167858 39594908 12 3325138 39594908
You would then simply divide MONTHLY_SALES by YEARLY_SALES to compute the requested percentage (see Section 14.4.2 later in the chapter).
Like ranking functions, reporting functions can include PARTITION BY clauses to split the result set into multiple pieces, allowing multiple aggregations to be computed across different subsets of the result set. The following query generates total sales per salesperson per region along with the total regional sales for comparison:
SELECT region_id, salesperson_id,
SUM(tot_sales) sp_sales,
SUM(SUM(tot_sales)) OVER (PARTITION BY region_id) region_sales
FROM orders
WHERE year = 2001
GROUP BY region_id, salesperson_id
ORDER BY region_id, salesperson_id;
REGION_ID SALESPERSON_ID SP_SALES REGION_SALES ---------- -------------- ---------- ------------ 5 1 1927580 6585641 5 2 1461898 6585641 5 3 1501039 6585641 5 4 1695124 6585641 6 5 1688252 6307766 6 6 1392648 6307766 6 7 1458053 6307766 6 8 1768813 6307766 7 9 1735575 6868495 7 10 1723305 6868495 7 11 1737093 6868495 7 12 1672522 6868495 8 13 1516776 6854731 8 14 1814327 6854731 8 15 1761814 6854731 8 16 1761814 6854731 9 17 1710831 6739374 9 18 1625456 6739374 9 19 1645204 6739374 9 20 1757883 6739374 10 21 1542152 6238901 10 22 1468316 6238901 10 23 1443837 6238901 10 24 1784596 6238901
The value for the REGION_SALES column is the same for all salespeople in the same region. In the next section, you will see two different approaches for using this information to generate percentage calculations.
One of the
more common uses of
reporting functions is to generate the value of the denominator for
performance calculations. With the query from the previous section,
for example, the next logical step would be to divide each
salesperson’s total sales
(SP_SALES
) by the total region sales
(REGION_SALES
) to determine what ratio of the
total region sales can be attributed to each salesperson. One option
is to use the reporting function as the denominator in the percentage
calculation, as in:
SELECT region_id, salesperson_id,
SUM(tot_sales) sp_sales,
ROUND(SUM(tot_sales) /
SUM(SUM(tot_sales)) OVER (PARTITION BY region_id),
2) percent_of_region
FROM orders
WHERE year = 2001
GROUP BY region_id, salesperson_id
ORDER BY region_id, salesperson_id;
REGION_ID SALESPERSON_ID SP_SALES PERCENT_OF_REGION ---------- -------------- ---------- ----------------- 5 1 1927580 .29 5 2 1461898 .22 5 3 1501039 .23 5 4 1695124 .26 6 5 1688252 .27 6 6 1392648 .22 6 7 1458053 .23 6 8 1768813 .28 7 9 1735575 .25 7 10 1723305 .25 7 11 1737093 .25 7 12 1672522 .24 8 13 1516776 .22 8 14 1814327 .26 8 15 1761814 .26 8 16 1761814 .26 9 17 1710831 .25 9 18 1625456 .24 9 19 1645204 .24 9 20 1757883 .26 10 21 1542152 .25 10 22 1468316 .24 10 23 1443837 .23 10 24 1784596 .29
Because this is such a common operation, however, Oracle has spared us the trouble by including the RATIO_TO_REPORT function. The RATIO_TO_REPORT function allows you to calculate each row’s contribution to either the entire result set, or some subset of the result set if the PARTITION BY clause is included. The next query uses RATIO_TO_REPORT to generate the percentage contribution of each salesperson to her region’s total sales:
SELECT region_id, salesperson_id,
SUM(tot_sales) sp_sales,
ROUND(RATIO_TO_REPORT(SUM(tot_sales))
OVER (PARTITION BY region_id), 2) sp_ratio
FROM orders
WHERE year = 2001
GROUP BY region_id, salesperson_id
ORDER BY region_id, salesperson_id;
REGION_ID SALESPERSON_ID SP_SALES SP_RATIO ---------- -------------- ---------- ---------- 5 1 1927580 .29 5 2 1461898 .22 5 3 1501039 .23 5 4 1695124 .26 6 5 1688252 .27 6 6 1392648 .22 6 7 1458053 .23 6 8 1768813 .28 7 9 1735575 .25 7 10 1723305 .25 7 11 1737093 .25 7 12 1672522 .24 8 13 1516776 .22 8 14 1814327 .26 8 15 1761814 .26 8 16 1761814 .26 9 17 1710831 .25 9 18 1625456 .24 9 19 1645204 .24 9 20 1757883 .26 10 21 1542152 .25 10 22 1468316 .24 10 23 1443837 .23 10 24 1784596 .29
We have covered a lot of ground in this chapter, so don’t feel bad if it takes a couple of passes to get a feel for all of the different analytic functions and how they can be applied. You’ll find the material easier to digest if you concentrate on one category at a time (Ranking, Windowing, Reporting). If you’ve been working with Oracle for many years, you are probably chomping at the bit to give these functions a try. Along with being compact and efficient, Oracle’s analytic functions keep analytical calculations where they belong—in the database server—instead of relying on procedural languages or spreadsheet macros to finish the job.
3.139.97.40