Appendix A. Window Function Refresher

The recipes in this book take full advantage of the window functions added to the ISO SQL standard in 2003, as well as vendor-specific window functions. This appendix is meant to serve as a brief overview of how window functions work. Window functions make many typically difficult tasks (difficult to solve using standard SQL, that is) quite easy. For a complete list of window functions available, full syntax, and in-depth coverage of how they work, please consult your vendor’s documentation.

Grouping

Before moving on to window functions, it is crucial that you understand how grouping works in SQL -the concept of grouping results in SQL can be difficult to master . The problems stem from not fully understanding how the GROUP BY clause works and why certain queries return certain results when using GROUP BY.

Simply stated, grouping is a way to organize like rows together. When you use GROUP BY in a query, each row in the result set is a group and represents one or more rows with the same values in one or more columns that you specify. That’s the gist of it.

If a group is simply a unique instance of a row that represents one or more rows with the same value for a particular column (or columns), then practical examples of groups from table EMP include all employees in department 10 (the common value for these employees that enable them to be in the same group is DEPTNO=10) or all clerks (the common value for these employees that enable them to be in the same group is JOB=CLERK). Consider the following queries. The first shows all employees in department 10; the second query groups the employees in department 10 and returns the following information about the group: the number of rows (members) in the group, the highest salary, and the lowest salary:

select deptno,ename
		from emp
		where deptno=10

	  DEPTNO ENAME
	  ------ ----------
		  10 CLARK
		  10 KING
		  10 MILLER

	 select deptno,
				   count(*) as cnt,
				   max(sal) as hi_sal,
				   min(sal) as lo_sal
			 from emp
		   where deptno=10
		   group by deptno


		DEPTNO		  CNT	  HI_SAL	 LO_SAL
		------ ---------- ---------- ----------
			10			3		5000		1300

If you were not able to group the employees in department 10 together, to get the information in the second query above you would have to manually inspect the rows for that department (trivial if there are only three rows, but what if there were three million rows?). So, why would anyone want to group? Reasons for doing so vary; perhaps you want to see how many different groups exist or how many members (rows) are in each group. As you can see from the simple example above, grouping allows you to get information about many rows in a table without having to inspect them one by one.

Definition of an SQL Group

In mathematics, a group is defined, for the most part, as (G, •, e), where G is a set, • is a binary operation in G, and e is a member of G. We will use this definition as the foundation for what a SQL group is. A SQL group will be defined as (G, e), where G is a result set of a single or self-contained query that uses GROUP BY, e is a member of G, and the following axioms are satisfied:

  • For each e in G, e is distinct and represents one or more instances of e.

  • For each e in G, the aggregate function COUNT returns a value > 0.

Tip

The result set is included in the definition of a SQL group to reinforce the fact that we are defining what groups are when working with queries only. Thus, it would be accurate to replace “e” in each axiom with the word “row” because the rows in the result set are technically the groups.

Because these properties are fundamental to what we consider a group, it is important that we prove they are true (and we will proceed to do so through the use of some example SQL queries).

Groups are non-empty

By its very definition, a group must have at least one member (or row). If we accept this as a truth, then it can be said that a group cannot be created from an empty table. To prove that proposition true, simply try to prove it is false. The following example creates an empty table, and then attempts to create groups via three different queries against that empty table:

create table fruits (name varchar(10))

	select name
		from fruits
	  group by name

	  (no rows selected)

	 select count(*) as cnt
		  from fruits
		group by name

		(no rows selected)

	select name, count(*) as cnt
		from fruits
	   group by name

	   (no rows selected)

As you can see from these queries, it is impossible to create what SQL considers a group from an empty table.

Groups are distinct

Now let’s prove that the groups created via queries with a GROUP BY clause are distinct. The following example inserts five rows into table FRUITS, and then creates groups from those rows:

insert into fruits values ('Oranges')
	insert into fruits values ('Oranges')
	insert into fruits values ('Oranges')
	insert into fruits values ('Apple')
	insert into fruits values ('Peach')

	select *
		from fruits

	NAME
	--------
	Oranges
	Oranges
	Oranges
	Apple
	Peach

	select name
		from fruits
	  group by name

	 NAME
	 -------
	 Apple
	 Oranges
	 Peach

	 select name, count(*) as cnt
		from fruits
	   group by name

	   NAME			CNT
	   ------- --------
	   Apple		1
	   Oranges		3
	   Peach		1

The first query shows that “Oranges” occurs three times in table FRUITS. However, the second and third queries (using GROUP BY) return only one instance of “Oranges.” Taken together, these queries prove that the rows in the result set (e in G, from our definition) are distinct, and each value of NAME represents one or more instances of itself in table FRUITS.

Knowing that groups are distinct is important because it means, typically, you would not use the DISTINCT keyword in your SELECT list when using a GROUP BY in your queries.

Tip

We don’t pretend GROUP BY and DISTINCT are the same. They represent two completely different concepts. We do state that the items listed in the GROUP BY clause will be distinct in the result set and that using DISTINCT as well as GROUP BY is redundant.

COUNT is never zero

The queries and results in the preceding section also prove the final axiom that the aggregate function COUNT will never return zero when used in a query with GROUP BY on a nonempty table. It should not be surprising that you cannot return a count of zero for a group. We have already proved that a group cannot be created from an empty table, thus a group must have at least one row. If at least one row exists, then the count will always be at least 1.

Tip

Remember, we are talking about using COUNT with GROUP BY, not COUNT by itself. A query using COUNT without a GROUP BY on an empty table will of course return zero.

Paradoxes

“Hardly anything more unfortunate can befall a scientific writer than to have one of the foundations of his edifice shaken after the work is finished…. This was the position I was placed in by a letter of Mr. Bertrand Russell, just when the printing of this volume was nearing its completion.”

The preceding quote is from Gottlob Frege in response to Bertrand Russell’s discovery of a contradiction to Frege’s axiom of abstraction in set theory.

Paradoxes many times provide scenarios that would seem to contradict established theories or ideas. In many cases these contradictions are localized and can be “worked around,” or they are applicable to such small test cases that they can be safely ignored.

You may have guessed by now that the point to all this discussion of paradoxes is that there exists a paradox concerning our definition of an SQL group, and that paradox must be addressed. Although our focus right now is on groups, ultimately we are discussing SQL queries. In its GROUP BY clause, a query may have a wide range of values such as constants, expressions, or, most commonly, columns from a table. We pay a price for this flexibility, because NULL is a valid “value” in SQL. NULLs present problems because they are effectively ignored by aggregate functions. With that said, if a table consists of a single row and its value is NULL, what would the aggregate function COUNT return when used in a GROUP BY query? By our very definition, when using GROUP BY and the aggregate function COUNT, a value >= 1 must be returned. What happens, then, in the case of values ignored by functions such as COUNT, and what does this mean to our definition of a GROUP? Consider the following example, which reveals the NULL group paradox (using the function COALESCE when necessary for readability):

select *
		from fruits

	 NAME
	 -------
	 Oranges
	 Oranges
	 Oranges
	 Apple
	 Peach

	 insert into fruits values (null)
	 insert into fruits values (null)
	 insert into fruits values (null)
	 insert into fruits values (null)
	 insert into fruits values (null)

	 select coalesce(name,'NULL') as name
		from fruits

	 NAME
	 --------
	 Oranges
	 Oranges
	 Oranges
	 Apple
	 Peach
	 NULL
	 NULL
	 NULL
	 NULL
	 NULL

	 select coalesce(name,'NULL') as name,
			count(name) as cnt
		from fruits
	  group by name

	  NAME			  CNT
	  -------- ----------
	  Apple				1
	  NULL				0
	  Oranges			3
	  Peach				1

It would seem that the presence of NULL values in our table introduces a contradiction, or paradox, to our definition of a SQL group. Fortunately, this contradiction is not a real cause for concern, because the paradox has more to do with the implementation of aggregate functions than our definition. Consider the final query in the preceding set; a general problem statement for that query would be:

Count the number of times each name occurs in table FRUITS or count the number of members in each group.

Examining the INSERT statements above, it’s clear that there are five rows with NULL values, which means there exists a NULL group with five members.

Tip

While NULL certainly has properties that differentiate it from other values, it is nevertheless a value, and can in fact be a group.

How, then, can we write the query to return a count of 5 instead of 0, thus returning the information we are looking for while conforming to our definition of a group? The example below shows a workaround to deal with the NULL group paradox:

select coalesce(name,'NULL') as name,
		   count(*) as cnt
	   from fruits
	 group by name

	 NAME			CNT
	 --------- --------
	 Apple			 1
	 Oranges		 3
	 Peach			 1
	 NULL			 5

The workaround is to use COUNT() rather than COUNT(NAME) to avoid the NULL group paradox. Aggregate functions will ignore NULL values if any exist in the column passed to them. Thus, to avoid a zero when using COUNT do not pass the column name; instead, pass in an asterisk (). The * causes the COUNT function to count rows rather than the actual column values, so whether or not the actual values are NULL or not NULL is irrelevant.

One more paradox has to do with the axiom that each group in a result set (for each e in G) is distinct. Because of the nature of SQL result sets and tables, which are more accurately defined as multisets or “bags,” not sets (because duplicate rows are allowed), it is possible to return a result set with duplicate groups. Consider the following queries:

select coalesce(name,'NULL') as name,
		   count(*) as cnt
	   from fruits
	 group by name
	 union all
	select coalesce(name,'NULL') as name,
			count(*) as cnt
	   from fruits
	 group by name

	 NAME			  CNT
	 ---------- ---------
	 Apple				1
	 Oranges			3
	 Peach				1
	 NULL				5
	 Apple				1
	 Oranges			3
	 Peach				1
	 NULL				5


	 select x.*
		from (
	 select coalesce(name,'NULL') as name,
			count(*) as cnt
		from fruits
	  group by name
			) x,
			(select deptno from dept) y

	 NAME			   CNT
	 ---------- ----------
	 Apple				 1
	 Apple				 1
	 Apple				 1
	 Apple				 1
	 Oranges			 3
	 Oranges			 3
	 Oranges			 3
	 Oranges			 3
	 Peach				 1
	 Peach				 1
	 Peach				 1
	 Peach				 1
	 NULL				 5
	 NULL				 5
	 NULL				 5
	 NULL				 5

As you can see in these queries, the groups are in fact repeated in the final results. Fortunately, this is not much to worry about because it represents only a partial paradox. The first property of a group states that for (G, e), G is a result set from a single or self-contained query that uses GROUP BY. Simply put, the result set from any GROUP BY query itself conforms to our definition of a group. It is only when you combine the result sets from two GROUP BY queries to create a multiset that groups may repeat. The first query in the preceding example uses UNION ALL, which is not a set operation but a multiset operation, and invokes GROUP BY twice, effectively executing two queries.

Tip

If you use UNION, which is a set operation, you will not see repeating groups.

The second query in the preceding set uses a Cartesian product, which only works if you materialize the group first and then perform the Cartesian. Thus the GROUP BY query when self-contained conforms to our definition. Neither of the two examples takes anything away from the definition of a SQL group. They are shown for completeness, and so that you can be aware that almost anything is possible in SQL.

Relationship Between SELECT and GROUP BY

With the concept of a group defined and proved, it is now time to move on to more practical matters concerning queries using GROUP BY. It is important to understand the relationship between the SELECT clause and the GROUP BY clause when grouping in SQL. It is important to keep in mind when using aggregate functions such as COUNT that any item in your SELECT list that is not used as an argument to an aggregate function must be part of your group. For example, if you write a SELECT clause such as:

	select deptno, count(*) as cnt
		from emp

then you must list DEPTNO in your GROUP BY clause:

select deptno, count(*) as cnt
		from emp
	   group by deptno

	   DEPTNO	CNT
	   ------- ----
			10	  3
			20	  5
			30    6

Constants, scalar values returned by user-defined functions, window functions, and non-correlated scalar subqueries are exceptions to this rule. Since the SELECT clause is evaluated after the GROUP BY clause, these constructs are allowed in the SELECT list and do not have to (and in some cases cannot) be specified in the GROUP BY clause. For example:

select 'hello' as msg,
			1 as num,
			deptno,
			(select count(*) from emp) as total,
			count(*) as cnt
	   from emp
	 group by deptno

	 MSG	NUM DEPTNO TOTAL CNT
	 -----  --- ------ ----- ---
	 hello	1		10	14		3
	 hello	1		20	14		5
	 hello	1		30	14		6

Don’t let this query confuse you. The items in the SELECT list not listed in the GROUP BY clause do not change the value of CNT for each DEPTNO, nor do the values for DEPTNO change. Based on the results of the preceding query, we can define the rule about matching items in the SELECT list and the GROUP BY clause when using aggregates a bit more precisely:

Items in a SELECT list that can potentially change the group or change the value returned by an aggregate function must be included in the GROUP BY clause.

The additional items in the preceding SELECT list did not change the value of CNT for any group (each DEPTNO), nor did they change the groups themselves.

Now it’s fair to ask: exactly what items in a SELECT list can change a grouping or the value returned by an aggregate function? The answer is simple: other columns from the table(s) you are selecting from. Consider the prospect of adding the JOB column to the query we’ve been looking at:

select deptno, job, count(*) as cnt
		from emp
	  group by deptno, job

	  DEPTNO JOB		CNT
	  ------ ---------- ----
	  10	CLERK			1
	  10	MANAGER			1
	  10	PRESIDENT		1
	  20	CLERK			2
	  20	ANALYST			2
	  20	MANAGER			1
	  30	CLERK			1
	  30	MANAGER			1
	  30	SALESMAN		4

By listing another column, JOB, from table EMP, we are changing the group and changing the result set; thus we must now include JOB in the GROUP BY clause along with DEPTNO, otherwise the query will fail. The inclusion of JOB in the SELECT/GROUP BY clauses changes the query from “How many employees are in each department?” to “How many different types of employees are in each department?” Notice again that the groups are distinct; the values for DEPTNO and JOB individually are not distinct, but the combination of the two (which is what is in the GROUP BY and SELECT list, and thus is the group) are distinct (e.g., 10 and CLERK appear only once).

If you choose not to put items other than aggregate functions in the SELECT list, then you may list any valid column you wish, in the GROUP BY clause. Consider the following two queries, which highlight this fact:

select count(*)
		from emp
	   group by deptno

		COUNT(*)
	   ---------
				3
				5
				6

	select count(*)
		from emp
	   group by deptno,job

	   COUNT(*)
	 ----------
			1
			1
			1
			2
			2
			1
			1
			1
			4

Including items other than aggregate functions in the SELECT list is not mandatory, but often improves readability and usability of the results.

Tip

As a rule, when using GROUP BY and aggregate functions, any items in the SELECT list [from the table(s) in the FROM clause] not used as an argument to an aggregate function must be included in the GROUP BY clause. However, MySQL has a “feature” that allows you to deviate from this rule, allowing you to place items in your SELECT list [that are columns in the table(s) you are selecting from] that are not used as arguments to an aggregate function and that are not present in your GROUP BY clause. We use the term “feature” very loosely here as its use is a bug waiting to happen - to avoid it. As a matter of fact, if you use MySQL and care at all about the accuracy of your queries we suggest you urge them to remove this, ahem, “feature.”

Windowing

Once you understand the concept of grouping and using aggregates in SQL, understanding window functions is easy. Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group. The group of rows to perform the aggregation on is the window (hence the name “window functions”). DB2 actually calls such functions online analytic processing (OLAP) functions, and Oracle calls them analytic functions, but the ISO SQL standard calls them window functions, so that’s the term used in this book.

A Simple Example

Let’s say that you wish to count the total number of employees across all departments. The traditional method for doing that is to issue a COUNT(*) query against the entire EMP table:

select count(*) as cnt
		from emp

		  CNT
		-----
		   14

This is easy enough, but often you will find yourself wanting to access such aggregate data from rows that do not represent an aggregation, or that represent a different aggregation. Window functions make light work of such problems. For example, the following query shows how you can use a window function to access aggregate data (the total count of employees) from detail rows (one per employee):

select ename,
			   deptno,
			   count(*) over() as cnt
		 from emp
	   order by 2

	   ENAME	    DEPTNO	CNT
	   ----------   ------ ------
	   CLARK			10	   14
	   KING				10	   14
	   MILLER			10	   14
	   SMITH			20	   14
	   ADAMS			20	   14
	   FORD				20	   14
	   SCOTT			20	   14
	   JONES			20	   14
	   ALLEN			30	   14
	   BLAKE			30	   14
	   MARTIN			30	   14
	   JAMES			30	   14
	   TURNER			30	   14
	   WARD				30	   14

The window function invocation in this example is COUNT(*) OVER(). The presence of the OVER keyword indicates that the invocation of COUNT will be treated as a window function, not as an aggregate function. In general, the SQL standard allows for all aggregate functions to also be window functions, and the keyword OVER is how the language distinguishes between the two uses.

So, what did the window function COUNT(*) OVER () do exactly? For every row being returned in the query, it returned the count of all the rows in the table. As the empty parentheses suggest, the OVER keyword accepts additional clauses to affect the range of rows that a given window function considers. Absent any such clauses, the window function looks at all rows in the result set, which is why you see the value 14 repeated in each row of output.

Hopefully you begin to see the great utility of window functions, which is that they allow you to work with multiple levels of aggregation in one row. As you continue through this appendix, you’ll begin to see even more just how incredibly useful that ability can be.

Order of Evaluation

Before digging deeper into the OVER clause, it is important to note that window functions are performed as the last step in SQL processing prior to the ORDER BY clause. As an example of how window functions are processed last, let’s take the query from the preceding section and use a WHERE clause to filter out employees from DEPTNO 20 and 30:

select ename,
		   deptno,
		   count(*) over() as cnt
	   from emp
	 where deptno = 10
	 order by 2

	 ENAME		DEPTNO	  CNT
	 ---------- ------  ------
	 CLARK			10		3
	 KING			10		3
	 MILLER			10		3

The value for CNT for each row is no longer 14, it is now 3. In this example, it is the WHERE clause that restricts the result set to three rows, hence the window function will count only three rows (there are only three rows available to the window function by the time processing reaches the SELECT portion of the query). From this example you can see that window functions perform their computations after clauses such as WHERE and GROUP BY are evaluated.

Partitions

Use the PARTITION BY clause to define a partition or group of rows to perform an aggregation over. As we’ve seen already, if you use empty parentheses then the entire result set is the partition that a window function aggregation will be computed over. You can think of the PARTITION BY clause as a “moving GROUP BY” because unlike a traditional GROUP BY, a group created by PARTITION BY is not distinct in a result set. You can use PARTITION BY to compute an aggregation over a defined group of rows (resetting when a new group is encountered) and rather than having one group represent all instances of that value in the table, each value (each member in each group) is returned. Consider the following query:

select ename,
				  deptno,
				  count(*) over(partition by deptno) as cnt
				from emp
			order by 2

			ENAME		DEPTNO		CNT
			----------	------   ------
			CLARK			10		3
			KING			10		3
			MILLER			10		3
			SMITH			20		5
			ADAMS			20		5
			FORD			20		5
			SCOTT			20		5
			JONES			20		5
			ALLEN			30		6
			BLAKE			30		6
			MARTIN			30		6
			JAMES			30		6
			TURNER			30		6
			WARD			30      6

This query still returns 14 rows, but now the COUNT is performed for each department as a result of the PARTITION BY DEPTNO clause. Each employee in the same department (in the same partition) will have the same value for CNT, because the aggregation will not reset (recompute) until a new department is encountered. Also note that you are returning information about each group, along with the members of each group. You can think of the preceding query as a more efficient version of the following:

select e.ename,
		   e.deptno,
		   (select count(*) from emp d
			  where e.deptno=d.deptno) as cnt
		from emp e
	  order by 2


	  ENAME		 DEPTNO	   CNT
	  ---------- ------ ------
	  CLARK			10		3
	  KING			10		3
	  MILLER		10		3
	  SMITH			20		5
	  ADAMS			20		5
	  FORD			20		5
	  SCOTT			20		5
	  JONES			20		5
	  ALLEN			30		6
	  BLAKE			30		6
	  MARTIN		30		6
	  JAMES			30		6
	  TURNER		30		6
	  WARD			30		6

Additionally, what’s nice about the PARTITION BY clause is that it performs its computations independently of other window functions, partitioning by different columns in the same SELECT statement. Consider the following query, which returns each employee, her department, the number of employees in her respective department, her job, and the number of employees with the same job:

select ename,
			deptno,
			count(*) over(partition by deptno) as dept_cnt,
			job,
			count(*) over(partition by job) as job_cnt
		from emp
	   order by 2

	   ENAME	    DEPTNO   DEPT_CNT JOB		JOB_CNT
	   ----------   ------   -------- --------- -------
	   MILLER			10			3 CLERK			4
	   CLARK			10			3 MANAGER		3
	   KING				10			3 PRESIDENT		1
	   SCOTT			20			5 ANALYST		2
	   FORD				20			5 ANALYST		2
	   SMITH			20			5 CLERK			4
	   JONES			20			5 MANAGER		3
	   ADAMS			20			5 CLERK			4
	   JAMES			30			6 CLERK			4
	   MARTIN			30			6 SALESMAN		4
	   TURNER			30			6 SALESMAN		4
	   WARD				30			6 SALESMAN		4
	   ALLEN			30			6 SALESMAN		4
	   BLAKE			30			6 MANAGER		3

In this result set, you can see that employees in the same department have the same value for DEPT_CNT, and that employees who have the same job position have the same value for JOB_CNT.

By now it should be clear that the PARTITION BY clause works like a GROUP BY clause, but it does so without being affected by the other items in the SELECT clause and without requiring you to write a GROUP BY clause.

Effect of NULLs

Like the GROUP BY clause, the PARTITION BY clause lumps all the NULLs into one group or partition. Thus, the effect from NULLs when using PARTITION BY is similar to that from using GROUP BY. The following query uses a window function to count the number of employees with each distinct commission (returning–1 in place of NULL for readability):

select coalesce(comm,-1) as comm,
		count(*)over(partition by comm) as cnt
	  from emp

	  COMM		  CNT
	------ ----------
		0			1
	  300			1
	  500			1
	 1400			1
	   -1		   10
	   -1		   10
	   -1		   10
	   -1		   10
	   -1		   10
	   -1		   10
	   -1          10
	   -1          10
	   -1		   10
	   -1		   10

Because COUNT(*) is used, the function counts rows. You can see that there are 10 employees having NULL commissions. Use COMM instead of *, however, and you get quite different results:

select coalesce(comm,-1) as comm,
           count(comm)over(partition by comm) as cnt
      from emp

    COMM		CNT
	---- ----------
	   0          1
	 300		  1
     500          1
    1400          1
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0
      -1          0

This query uses COUNT(COMM), which means that only the non-NULL values in the COMM column are counted. There is one employee with a commission of 0, one employee with a commission of 300, and so forth. But notice the counts for those with NULL commissions! Those counts are 0. Why? Because aggregate functions ignore NULL values, or more accurately, aggregate functions count only non-NULL values.

Tip

When using COUNT, consider whether you wish to include NULLs. Use COUNT(column) to avoid counting NULLs. Use COUNT(*) if you do wish to include NULLs (since you are no longer counting actual column values, you are counting rows).

When Order Matters

Sometimes the order in which rows are treated by a window function is material to the results that you wish to obtain from a query. For this reason, window function syntax includes an ORDER BY subclause that you can place within an OVER clause. Use the ORDER BY clause to specify how the rows are ordered with a partition (remember, “partition” in the absence of a PARTITION BY clause means the entire result set).

Warning

Some window functions require you to impose order on the partitions of rows being affected. Thus, for some window functions an ORDER BY clause is mandatory. At the time of this writing, SQL Server does not allow ORDER BY in the OVER clause when used with aggregate window functions. SQL Server does permit ORDER BY in the OVER clause when used with window ranking functions.

When you use an ORDER BY clause in the OVER clause of a window function you are specifying two things:

  1. How the rows in the partition are ordered

  2. What rows are included in the computation

Consider the following query, which sums and computes a running total of salaries for employees in DEPTNO 10:

select deptno,
			ename,
			hiredate,
			sal,
			sum(sal)over(partition by deptno) as total1,
			sum(sal)over() as total2,
			sum(sal)over(order by hiredate) as running_total
		from emp
	   where deptno=10

	   DEPTNO  ENAME   HIREDATE	   SAL TOTAL1 TOTAL2 RUNNING_TOTAL
	   ------  ------  ----------- ----- ------ ------ -------------
		   10  CLARK   09-JUN-1981  2450   8750	  8750		2450
		   10  KING	   17-NOV-1981  5000   8750	  8750		7450
		   10  MILLER  23-JAN-1982  1300   8750	  8750		8750
Warning

Just to keep you on your toes, I’ve included a sum with empty parentheses. Notice how TOTAL1 and TOTAL2 have the same values. Why? Once again, the order in which window functions are evaluated answers the question. The WHERE clause filters the result set such that only salaries from DEPTNO 10 are considered for summation. In this case there is only one partition—the entire result set, which consists of only salaries from DEPTNO 10. Thus TOTAL1 and TOTAL2 are the same.

Looking at the values returned by column SAL, you can easily see where the values for RUNNING_TOTAL come from. You can eyeball the values and add them yourself to compute the running total. But more importantly, why did including an ORDER BY in the OVER clause create a running total in the first place? The reason is, when you use ORDER BY in the OVER clause you are specify a default “moving” or “sliding” window within the partition even though you don’t see it. The ORDER BY HIREDATE clause terminates summation at the HIREDATE in the current row.

The following query is the same as the previous one, but uses the RANGE BETWEEN clause (which you’ll learn more about later) to explicitly specify the default behavior that results from ORDER BY HIREDATE:

select deptno,
		   ename,
		   hiredate,
		   sal,
		   sum(sal)over(partition by deptno) as total1,
		   sum(sal)over() as total2,
		   sum(sal)over(order by hiredate
						range between unbounded preceding
							and current row) as running_total
	   from emp
	 where deptno=10

	 DEPTNO ENAME  HIREDATE		 SAL TOTAL1 TOTAL2 RUNNING_TOTAL
	 ------ ------ ----------- ----- ------ ------ -------------
		 10 CLARK  09-JUN-1981	2450 8750	  8750			2450
		 10 KING   17-NOV-1981	5000 8750	  8750			7450
		 10 MILLER 23-JAN-1982  1300 8750     8750          8750

The RANGE BETWEEN clause that you see in this query is termed the framing clause by ANSI and we’ll use that term here. Now, it should be easy to see why specifying an ORDER BY in the OVER clause created a running total; we’ve (by default) told the query to sum all rows starting from the current row and include all prior rows (“prior” as defined in the ORDER BY, in this case ordering the rows by HIREDATE).

The Framing Clause

Let’s apply the framing clause from the preceding query to the result set, starting with the first employee hired, who is named CLARK.

  1. Starting with CLARK’s salary, 2450, and including all employees hired before CLARK, compute a sum. Since CLARK was the first employee hired in DEPTNO 10, the sum is simply CLARK’s salary, 2450, which is the first value returned by RUNNING_TOTAL.

  2. Let’s move to the next employee based on HIREDATE, named KING, and apply the framing clause once again. Compute a sum on SAL starting with the current row, 5000 (KING’s salary), and include all prior rows (all employees hired before KING). CLARK is the only one hired before KING so the sum is 5000 + 2450, which is 7450, the second value returned by RUNNING_TOTAL.

  3. Moving on to MILLER, the last employee in the partition based on HIREDATE, let’s one more time apply the framing clause. Compute a sum on SAL starting with the current row, 1300 (MILLER’s salary), and include all prior rows (all employees hired before MILLER). CLARK and KING were both hired before MILLER, and thus their salaries are included in MILLER’s RUNNING_TOTAL: 2450 + 5000 + 1300 is 8750, which is the value for RUNNING_TOTAL for MILLER.

As you can see, it is really the framing clause that produces the running total. The ORDER BY defines the order of evaluation and happens to also imply a default framing.

In general, the framing clause allows you to define different “sub-windows” of data to include in your computations. There are many ways to specify such sub-windows. Consider the following query:

select deptno,
			ename,
			sal,
			sum(sal)over(order by hiredate
						 range between unbounded preceding
						   and current row) as run_total1,
			sum(sal)over(order by hiredate
						  rows between 1 preceding
						   and current row) as run_total2,
			sum(sal)over(order by hiredate
						 range between current row
						   and unbounded following) as run_total3,
			sum(sal)over(order by hiredate
						  rows between current row
						   and 1 following) as run_total4
		from emp
	  where deptno=10

	DEPTNO ENAME	SAL RUN_TOTAL1 RUN_TOTAL2 RUN_TOTAL3 RUN_TOTAL4
	------ ------ ----- ---------- ---------- ---------- ----------
		10 CLARK	2450	  2450		 2450	    8750		7450
		10 KING		5000	  7450       7450       6300        6300
		10 MILLER   1300      8750       6300       1300        1300

Don’t be intimidated here; this query is not as bad as it looks. You’ve already seen RUN_TOTAL1 and the effects of the framing clause “UNBOUNDED PRECEDING AND CURRENT ROW”. Here’s a quick description of what’s happening in the other examples:

RUN_TOTAL2

Rather than the keyword RANGE, this framing clause specifies ROWS, which means the frame, or window, is going to be constructed by counting some number of rows. The 1 PRECEDING means that the frame will begin with the row immediately preceding the current row. The range continues through the CUR-RENT ROW. So what you get in RUN_TOTAL2 is the sum of the current employee’s salary and that of the preceding employee, based on HIREDATE.

Tip

It so happens that RUN_TOTAL1 and RUN_TOTAL2 are the same for both CLARK and KING. Why? Think about which values are being summed for each of those employees, for each of the two window functions. Think carefully, and you’ll get the answer.

RUN_TOTAL3

The window function for RUN_TOTAL3 works just the opposite of that for RUN_TOTAL1; rather than starting with the current row and including all prior rows in the summation, summation begins with the current row and includes all subsequent rows in the summation.

RUN_TOTAL4

Is inverse of RUN_TOTAL2; rather than starting from the current row and including one prior row in the summation, start with the current row and include one subsequent row in the summation.

Tip

If you can understand what’s been explained thus far, you will have no problem with any of the recipes in this book. If you’re not catching on, though, try practicing with your own examples and your own data. It’s usually easier to learn by coding new features rather than just reading about them.

A Framing Finale

As a final example of the effect of the framing clause on query output, consider the following query:

select ename,
			sal,
			min(sal)over(order by sal) min1,
			max(sal)over(order by sal) max1,
			min(sal)over(order by sal
						 range between unbounded preceding
						   and unbounded following) min2,
			max(sal)over(order by sal
						 range between unbounded preceding
						   and unbounded following) max2,
			min(sal)over(order by sal
						 range between current row
						   and current row) min3,
			max(sal)over(order by sal
						 range between current row
						   and current row) max3,
			max(sal)over(order by sal
						  rows between 3 preceding
						   and 3 following) max4
	  from emp

	  ENAME		SAL		MIN1	MAX1	MIN2	MAX2	MIN3	MAX3	MAX4
	  ------  -----   ------  ------  ------  ------  ------  ------  ------
	  SMITH		800		800		800		800		5000	800		800		1250
	  JAMES		950		800		950		800		5000	950		950		1250
	  ADAMS		1100	800		1100	800		5000	1100	1100	1300
	  WARD		1250	800		1250	800		5000	1250	1250	1500
	  MARTIN	1250	800		1250	800		5000	1250	1250	1600
	  MILLER	1300	800		1300	800		5000	1300	1300	2450
	  TURNER	1500	800		1500	800		5000	1500	1500	2850
	  ALLEN		1600	800		1600	800		5000	1600	1600	2975
	  CLARK		2450	800		2450	800		5000	2450	2450	3000
	  BLAKE		2850	800		2850	800		5000	2850	2850	3000
	  JONES		2975	800		2975	800		5000	2975	2975	5000
	  SCOTT		3000	800		3000	800		5000	3000	3000	5000
	  FORD		3000	800		3000	800		5000	3000	3000	5000
	  KING		5000	800		5000	800		5000	5000	5000	5000

OK, let’s break this query down:

MIN1

The window function generating this column does not specify a framing clause, so the default framing clause of UNBOUNDED PRECEDING AND CURRENT ROW kicks in. Why is MIN1 800 for all rows? It’s because the lowest salary comes first (ORDER BY SAL), and it remains the lowest, or minimum, salary forever after.

MAX1

The values for MAX1 are much different from those for MIN1. Why? The answer (again) is the default framing clause UNBOUNDED PRECEDING AND CURRENT ROW. In conjunction with ORDER BY SAL, this framing clause ensures that the maximum salary will also correspond to that of the current row.
Consider the first row, for SMITH. When evaluating SMITH’s salary and all prior salaries, MAX1 for SMITH is SMITH’s salary, because there are no prior salaries. Moving on to the next row, JAMES, when comparing JAMES’ salary to all prior salaries, in this case comparing to the salary of SMITH, JAMES’ salary is the higher of the two, and thus it is the maximum. If you apply this logic to all rows, you will see that the value of MAX1 for each row is the current employee’s salary.

MIN2 and MAX2

The framing clause given for these is UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, which is the same as specifying empty parentheses. Thus, all rows in the result set are considered when computing MIN and MAX. As you might expect, the MIN and MAX values for the entire result set are constant, and thus the value of these columns is constant as well.

MIN3 and MAX3

The framing clause for these is CURRENT ROW AND CURRENT ROW, which simply means use only the current employee’s salary when looking for the MIN and MAX salary. Thus both MIN3 and MAX3 are the same as SAL for each row. That was easy, wasn’t it?

MAX4

The framing clause defined for MAX4 is 3 PRECEDING AND 3 FOLLOWING, which means, for every row, consider the three rows prior and the three rows after the current row, as well as the current row itself. This particular invocation of MAX(SAL) will return from those rows the highest salary value.
If you look at the value of MAX4 for employee MARTIN you can see how the framing clause is applied. MARTIN’s salary is 1250 and the three employee salaries prior to MARTIN’s are WARD’s (1250), ADAMS’ (1100) and JAMES’ (950). The three employee salaries after MARTIN’s are MILLER’s (1300), TURNER’s (1500), and ALLEN’s (1600). Out of all those salaries, including MARTIN’s, the highest is ALLEN’s, and thus the value of MAX4 for MARTIN is 1600.

Readability + Performance = Power

As you can see, window functions are extremely powerful as they allow you to write queries that contain both detailed and aggregate information. Using window functions allows you to write smaller, more efficient queries as compared to using multiple self join and/or scalar subqueries. Consider the following query, which easily answers all of the following questions: “What is the number of employees in each department? How many different types of employees are in each department (e.g., how many clerks are in department 10)? How many total employees are in table EMP?”

select deptno,
			job,
			count(*) over (partition by deptno) as emp_cnt,
			count(job) over (partition by deptno,job) as job_cnt,
			count(*) over () as total
		from emp

		DEPTNO JOB			EMP_CNT		JOB_CNT		TOTAL
		------ --------- ---------- ---------- ----------
			10 CLERK			3			1			14
			10 MANAGER			3			1			14
			10 PRESIDENT		3			1			14
			20 ANALYST			5			2			14
			20 ANALYST			5			2			14
			20 CLERK			5			2			14
			20 CLERK			5			2			14
			20 MANAGER			5			1			14
			30 CLERK			6			1			14
			30 MANAGER			6			1			14
			30 SALESMAN			6			4			14
			30 SALESMAN			6			4			14
			30 SALESMAN			6			4			14
			30 SALESMAN			6			4			14

To return the same result set without using window functions would require a bit more work:

select a.deptno, a.job,
		(select count(*) from emp b
			where b.deptno = a.deptno) as emp_cnt,
		(select count(*) from emp b
			where b.deptno = a.deptno and b.job = a.job) as job_cnt,
		(select count(*) from emp) as total
	 from emp a
	order by 1,2

	DEPTNO JOB			EMP_CNT		JOB_CNT		TOTAL
	------ --------- ----------  ---------- ----------
		10 CLERK			3			1			14
		10 MANAGER			3			1			14
		10 PRESIDENT		3			1			14
		20 ANALYST			5			2			14
		20 ANALYST			5			2			14
		20 CLERK			5			2			14
		20 CLERK			5			2			14
		20 MANAGER			5			1			14
		30 CLERK			6			1			14
		30 MANAGER			6			1			14
		30 SALESMAN			6			4			14
		30 SALESMAN			6			4			14
		30 SALESMAN			6			4			14
		30 SALESMAN			6			4			14

The non-window solution is obviously not difficult to write, yet it certainly is not as clean or efficient (you won’t see performance differences with a 14-row table, but try these queries with, say, a 1,000- or 10,000-row table and then you’ll see the benefit of using window functions over multiple self joins and scalar subqueries).

Providing a Base

Besides readability and performance, window functions are useful for providing a “base” for more complex “report style” queries. For example, consider the following “report style” query that uses window functions in an inline view and then aggregates the results in an outer query. Using window functions allows you to return detailed as well as aggregate data, which is useful for reports. The query below uses window functions to find counts using different partitions. Because the aggregation is applied to multiple rows, the inline view returns all rows from EMP, which the outer CASE expressions can use to transpose and create a formatted report:

select deptno,
		   emp_cnt as dept_total,
		   total,
		   max(case when job = 'CLERK'
					then job_cnt else 0 end) as clerks,
		   max(case when job = 'MANAGER'
					then job_cnt else 0 end) as mgrs,
		   max(case when job = 'PRESIDENT'
					then job_cnt else 0 end) as prez,
		   max(case when job = 'ANALYST'
					then job_cnt else 0 end) as anals,
		   max(case when job = 'SALESMAN'
					then job_cnt else 0 end) as smen
		from (
	  select deptno,
			job,
			count(*) over (partition by deptno) as emp_cnt,
			count(job) over (partition by deptno,job) as job_cnt,
			count(*) over () as total	
		from emp
			) x
	  group by deptno, emp_cnt, total

	  DEPTNO DEPT_TOTAL TOTAL CLERKS MGRS PREZ ANALS SMEN
	  ------ ---------- ----- ------ ---- ---- ----- ----
		10			3		14		1	1	1	  0		0
		20			5		14		2	1	0	  2		0
		30			6		14		1	1	0	  0		4

The query above returns each department, the total number of employees in each department, the total number of employees in table EMP, and a breakdown of the number of different job types in each department. All this is done in one query, without additional joins or temp tables!

As a final example of how easily multiple questions can be answered using window functions, consider the following query:

select ename as name,
			sal,
			max(sal)over(partition by deptno) as hiDpt,
			min(sal)over(partition by deptno) as loDpt,
			max(sal)over(partition by job) as hiJob,
			min(sal)over(partition by job) as loJob,
			max(sal)over() as hi,
			min(sal)over() as lo,
			sum(sal)over(partition by deptno
							 order by sal,empno) as dptRT,
			sum(sal)over(partition by deptno) as dptSum,
			sum(sal)over() as ttl
		 from emp
	   order by deptno,dptRT


	   NAME		SAL HIDPT LODPT HIJOB LOJOB		HI	LO DPTRT DPTSUM		TTL
	   ------ ----- ----- ----- ----- ----- ----- ---- ------ ------ ------
	   MILLER 1300  5000  1300  1300   800  5000  800   1300   8750  29025
	   CLARK  2450  5000  1300  2975  2450  5000  800   3750   8750  29025
	   KING   5000  5000  1300  5000  5000  5000  800   8750   8750  29025
	   SMITH   800  3000   800  1300   800  5000  800    800  10875  29025
	   ADAMS  1100  3000   800  1300   800  5000  800   1900  10875  29025
	   JONES  2975  3000   800  2975  2450  5000  800   4875  10875  29025
	   SCOTT  3000  3000   800  3000  3000  5000  800   7875  10875  29025
	   FORD   3000  3000   800  3000  3000  5000  800  10875  10875  29025
	   JAMES   950  2850   950  1300   800  5000  800    950   9400  29025
	   WARD   1250  2850   950  1600  1250  5000  800   2200   9400  29025
	   MARTIN 1250  2850   950  1600  1250  5000  800   3450   9400  29025
	   TURNER 1500  2850   950	1600  1250  5000  800   4950   9400  29025
	   ALLEN  1600  2850   950  1600  1250  5000  800   6550   9400  29025
	   BLAKE  2850  2850   950  2975  2450  5000  800   9400   9400  29025

This query answers the following questions easily, efficiently, and readably (and without additional joins to EMP!). Simply match the employee and her salary with the different rows in the result set to determine:

  1. who makes the highest salary of all employees (HI)

  2. who makes the lowest salary of all employees (LO)

  3. who makes the highest salary in her department (HIDPT)

  4. who makes the lowest salary in her department (LODPT)

  5. who makes the highest salary in her job (HIJOB)

  6. who makes the lowest salary in her job (LOJOB)

  7. what is the sum of all salaries (TTL)

  8. what is the sum of salaries per department (DPTSUM)

  9. what is the running total of all salaries per department (DPTRT)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.21.229