Apart from grouping and aggregation, PostgreSQL provides another way to perform computations based on the values of several records. It can be done by using the window functions. Grouping and aggregation implies the output of a single record for every group of input records. Window functions can do similar things, but they are executed for every record, and the number of records in the output and the input is the same:
In the preceding diagram, the rectangles represent the records of a table. Let's assume that the color of the rectangles indicates the value of a field, used to group the records. When GROUP BY
is used in a query, each distinct value of that field will create a group, and each group will become a single record in the results of the query. That was explained in Chapter 5, SQL Language. Window functions can access the values of other records of the same group (which is called a partition in this case), although the number of records stays the same. When window functions are used, no grouping is necessary, although possible.
Window functions are evaluated after grouping and aggregation. For that reason, the only places in the SELECT
query where the window functions are allowed are Select-List and the ORDER BY
clause.
The syntax of the window functions is as follows:
<function_name>(<function_arguments>) OVER( [PARTITION BY <expression_list>] [ORDER BY <order_by_list>] [{ROWS | RANGE} <frame_start> | {ROWS | RANGE} BETWEEN <frame_start> AND <frame_end>])
The construct in the parenthesis, after the OVER
keyword, is called the window definition. The last part of the window definition, which starts with ROWS
, is called a frame
clause. The syntax of frame_start
and frame_end
is described later.
Window functions, in general, work like aggregating functions. They process sets of records. These sets are built separately for each processed record. That's why, unlike the normal aggregating functions, window functions are evaluated for each row.
For each record, a set of rows to be processed by a window function is built in the following way:
At the beginning, the PARTITION BY
clause is processed. All the records that have the same values as all the expressions in the expression_list
on the current row, are taken. The set of those rows is called a partition. The current row is also included in the partition. In fact, the PARTITION BY
clause has the same logic and syntax as the GROUP BY
clause of the SELECT
statement, except that it is not possible to refer to the output column names or numbers in PARTITION BY
. In other words, while processing each record, a window function will take a look into all the other records to check if any of them falls into the same partition as the current one. If no PARTITION BY
is specified, it means that all the rows will be included in a single partition at this step.
Next, the partition is sorted according to the ORDER BY
clause which has the same syntax and logic as the ORDER BY
clause in the SELECT
statement. Again, no references to the output column names or numbers are allowed here. If the ORDER BY
clause is omitted, then all the records of the set are considered to have the same precedence.
In the end, the frame clause is processed. It means taking a subset from the whole partition to pass it to the window function. The subset is called a window frame. The frame has its starting and ending points. The start of the frame, which is referenced by frame_start
in the preceding syntax diagram, can be any of the following:
UNBOUNDED PRECEDING
: The very first record of the partition.<value> PRECEDING
: A record, that is <value>
number of records before the current one. <value>
is an integer expression which cannot return a negative value and which cannot use aggregating functions or other window functions. "0 PRECEDING"
points to the current row.CURRENT ROW
: The current row itself.<value> FOLLOWING
: A record, that is <value>
number of records after the current record.The ending point—frame_end
—can be any one of the following:
<value> PRECEDING
CURRENT ROW
<value> FOLLOWING
UNBOUNDED PRECEDING
—the very last record of the partition.The starting point should precede the ending point. That's why, for example, ROWS BETWEEN CURRENT ROW AND 1 PRECEDING
is not correct.
A window frame can be defined using the ROWS
mode or the RANGE
mode. It affects the meaning of the CURRENT ROW
. In the ROWS
mode, the CURRENT ROW
points to the current record itself. In the RANGE
mode, the CURRENT ROW
points to the first or to the last record that has the same position when sorted according to the ORDER BY
clause. First or last will be chosen with a view to make the frame wider.
If frame_end
is omitted, then CURRENT ROW
is used instead.
If the whole frame clause is omitted, then the frame will be built using the RANGE UNBOUNDED PRECEDING
definition.
Look at the following example of a window definition:
OVER (PARTITION BY a ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND 5 FOLLOWING)
The preceding definition means that for every row, all the records with the same value of the field a
will form the partition. Then, the partition will be ordered in an ascending manner by the values of the field b
, and the frame will contain all the records from the first to the fifth one following the current row.
The window definitions can be quite long, and in many cases, it is not convenient to use them in the Select-list. PostgreSQL provides a way to define windows and give them names that can be used in the OVER
clause in window functions. This is done by using the WINDOW
clause of the SELECT
statement, which is specified after the HAVING
clause, as follows:
SELECT count() OVER w, sum(b) OVER w, avg(b) OVER (w ORDER BY c ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM table1 WINDOW w AS (PARTITION BY a)
The predefined window can be used as is. In the preceding example, the window functions count
and sum
do that. The window definition can also be further detailed like it is for the function avg
in the example. The syntactical difference is the following: to reuse the same window definition, the window name should be specified after the OVER
keyword, without parenthesis. To extend the window definition with the ORDER BY
or frame
clause, one should use the name of the window inside the parenthesis.
When the same window definition is used several times, PostgreSQL will optimize the execution of the query by building partitions only once and then reusing the results.
All aggregating functions can be used as window functions, with the exception of ordered-set and hypothetical-set aggregates. User defined aggregating functions can also be used as window functions. The presence of an OVER
clause indicates that the function is a window function.
When the aggregating function is used as a window function, it will aggregate the rows that belong to the window frame of a current row.
The typical use cases for the window functions are computing statistical values of different kinds. Take the car portal database for example. There is a table called advertisement that contains information about the advertisements that users create. Suppose it is required to analyze the quantity of advertisements that the users create over a period of time. The query which generates the report would be as follows:
WITH monthly_data AS ( SELECT date_trunc('month', advertisement_date) AS month, count(*) as cnt FROM car_portal_app.advertisement GROUP BY date_trunc('month', advertisement_date) ) SELECT to_char(month,'YYYY-MM') as month, cnt, sum(cnt) OVER (w ORDER BY month) AS cnt_year, round(avg(cnt) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING),1) AS mov_avg, round(cnt/sum(cnt) OVER w *100,1) AS ratio_year FROM monthly_data WINDOW w AS (PARTITION BY date_trunc('year',month)); month | cnt | cnt_year | mov_avg | ratio_year ---------+-----+----------+---------+------------ 2014-01 | 42 | 42 | 41.3 | 5.7 2014-02 | 51 | 93 | 45.3 | 7.0 2014-03 | 31 | 124 | 57.4 | 4.2 2014-04 | 57 | 181 | 69.6 | 7.8 2014-05 | 106 | 287 | 73.2 | 14.5 2014-06 | 103 | 390 | 81.2 | 14.1 2014-07 | 69 | 459 | 86.2 | 9.4 2014-08 | 71 | 530 | 74.2 | 9.7 2014-09 | 82 | 612 | 60.8 | 11.2 2014-10 | 46 | 658 | 54.4 | 6.3 2014-11 | 36 | 694 | 49.8 | 4.9 2014-12 | 37 | 731 | 35.2 | 5.1 2015-01 | 48 | 48 | 32.5 | 84.2 2015-02 | 9 | 57 | 31.3 | 15.8 (14 rows)
In the WITH
clause, the data is aggregated on a monthly basis. In the main query, the window w
is defined as implying partitioning by year. This means that every window function that uses the window w
will work with the records of the same year as the current record.
The first window function, sum
, uses the window w
. Since ORDER BY
is specified, each record has its place in the partition. The Frame
clause is omitted, which means that the frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
is applied. That means, the function calculates the sum of the values for the records from the beginning of each year till the current month. It is the cumulative total on an yearly basis.
The second function, avg
, calculates the moving average. For each record, it calculates the average value of five records—ranging from two preceding the current record to the second one following the current one. It does not use a predefined window, because the moving average does not take the year into account. Only the order of the values matters.
The third window function, sum
, uses the same window definition again. It calculates the sum of the values for the whole year.
There are several window functions that are different from aggregating functions. They are used to get the values of other records within the partition, calculate the rank of the current row among all rows, and generate row numbers.
For example, let's extend the report of the previous example. Suppose, it is necessary to calculate the difference in the quantity of advertisements for each month against the previous months and against the same month of the previous year. Suppose it is also required to get the rank of the current month. The query would be as follows:
WITH monthly_data AS ( SELECT date_trunc('month', advertisement_date) AS month, count(*) as cnt FROM car_portal_app.advertisement GROUP BY date_trunc('month', advertisement_date) ) SELECT to_char(month,'YYYY-MM') as month, cnt, cnt - lag(cnt) OVER (ORDER BY month) as prev_m, cnt - lag(cnt, 12) OVER (ORDER BY month) as prev_y, rank() OVER (w ORDER BY cnt DESC) as rank FROM monthly_data WINDOW w AS (PARTITION BY date_trunc('year',month)) ORDER BY month DESC; month | cnt | prev_m | prev_y | rank ---------+-----+--------+--------+------ 2015-02 | 9 | -39 | -42 | 2 2015-01 | 48 | 11 | 6 | 1 2014-12 | 37 | 1 | | 10 2014-11 | 36 | -10 | | 11 2014-10 | 46 | -36 | | 8 2014-09 | 82 | 11 | | 3 2014-08 | 71 | 2 | | 4 2014-07 | 69 | -34 | | 5 2014-06 | 103 | -3 | | 2 2014-05 | 106 | 49 | | 1 2014-04 | 57 | 26 | | 6 2014-03 | 31 | -20 | | 12 2014-02 | 51 | 9 | | 7 2014-01 | 42 | | | 9 (14 rows)
The lag
function returns the value of a given expression for the record, which is the given number of records before the current one (default is 1
). In the first occurrence of the function in the example, it returns the value of the field cnt
in the previous record, which corresponds to the previous month.
The second lag
returns the value of cnt
for twelve previous records.
The rank
function returns the rank of the current row within the partition. It returns the rank with gaps. That means that if two records have the same position according to the ORDER BY
clause, both of them will get the same rank. The next record will get the rank after the next rank.
Other window functions are as follows:
lead
: This returns the value of a given expression evaluated for the record that is the given number of records after the current row.first_value
, last_value
, nth_value
: This returns the value of a given expression evaluated for the first record, last record, or nth record of the frame respectively.row_number
: This returns the number of the current row within the partition.dense_rank
: This returns the rank of the current row without gaps.percent_rank
and cume_dist
: These return the relative rank of the current row. The difference is that the first function uses rank and the second uses row_number
as a numerator for the calculations.ntile
: This divides the partition into the given number of equal parts and returns the integer number of the part where the current record belongs.A more detailed description of these functions is available in the documentation at http://www.postgresql.org/docs/current/static/functions-window.html.
Since window functions are evaluated after grouping, it is possible to use aggregating functions within the window functions, but not the other way around.
A code like the following is right:
sum( count(*) ) OVER()
This will also work: sum(a) OVER( ORDER BY count(*) )
However, the code, sum( count(*) OVER() )
,is wrong.
For example, to calculate the rank of the seller accounts by the number of advertisements they give, the following query can be used:
SELECT seller_account_id, dense_rank() OVER(ORDER BY count(*) DESC) FROM car_portal_app.advertisement GROUP BY seller_account_id; seller_account_id | dense_rank -------------------+------------ 26 | 1 128 | 2 28 | 2 126 | 2 ...
18.117.105.74