Window functions

Apart from grouping and aggregation, PostgreSQL provides another way to perform computations based on the values of several records. It can be done by using the window functions. Grouping and aggregation implies the output of a single record for every group of input records. Window functions can do similar things, but they are executed for every record, and the number of records in the output and the input is the same:

Window functions

GROUP BY and Window functions

In the preceding diagram, the rectangles represent the records of a table. Let's assume that the color of the rectangles indicates the value of a field, used to group the records. When GROUP BY is used in a query, each distinct value of that field will create a group, and each group will become a single record in the results of the query. That was explained in Chapter 5, SQL Language. Window functions can access the values of other records of the same group (which is called a partition in this case), although the number of records stays the same. When window functions are used, no grouping is necessary, although possible.

Window functions are evaluated after grouping and aggregation. For that reason, the only places in the SELECT query where the window functions are allowed are Select-List and the ORDER BY clause.

Window definition

The syntax of the window functions is as follows:

<function_name>(<function_arguments>)
OVER(
[PARTITION BY <expression_list>]
[ORDER BY <order_by_list>]
[{ROWS | RANGE} <frame_start> |
 {ROWS | RANGE} BETWEEN <frame_start> AND <frame_end>])

The construct in the parenthesis, after the OVER keyword, is called the window definition. The last part of the window definition, which starts with ROWS, is called a frame clause. The syntax of frame_start and frame_end is described later.

Window functions, in general, work like aggregating functions. They process sets of records. These sets are built separately for each processed record. That's why, unlike the normal aggregating functions, window functions are evaluated for each row.

For each record, a set of rows to be processed by a window function is built in the following way:

At the beginning, the PARTITION BY clause is processed. All the records that have the same values as all the expressions in the expression_list on the current row, are taken. The set of those rows is called a partition. The current row is also included in the partition. In fact, the PARTITION BY clause has the same logic and syntax as the GROUP BY clause of the SELECT statement, except that it is not possible to refer to the output column names or numbers in PARTITION BY. In other words, while processing each record, a window function will take a look into all the other records to check if any of them falls into the same partition as the current one. If no PARTITION BY is specified, it means that all the rows will be included in a single partition at this step.

Next, the partition is sorted according to the ORDER BY clause which has the same syntax and logic as the ORDER BY clause in the SELECT statement. Again, no references to the output column names or numbers are allowed here. If the ORDER BY clause is omitted, then all the records of the set are considered to have the same precedence.

In the end, the frame clause is processed. It means taking a subset from the whole partition to pass it to the window function. The subset is called a window frame. The frame has its starting and ending points. The start of the frame, which is referenced by frame_start in the preceding syntax diagram, can be any of the following:

  • UNBOUNDED PRECEDING: The very first record of the partition.
  • <value> PRECEDING: A record, that is <value> number of records before the current one. <value> is an integer expression which cannot return a negative value and which cannot use aggregating functions or other window functions. "0 PRECEDING" points to the current row.
  • CURRENT ROW: The current row itself.
  • <value> FOLLOWING: A record, that is <value> number of records after the current record.

The ending point—frame_end—can be any one of the following:

  • <value> PRECEDING
  • CURRENT ROW
  • <value> FOLLOWING
  • UNBOUNDED PRECEDING—the very last record of the partition.

The starting point should precede the ending point. That's why, for example, ROWS BETWEEN CURRENT ROW AND 1 PRECEDING is not correct.

A window frame can be defined using the ROWS mode or the RANGE mode. It affects the meaning of the CURRENT ROW. In the ROWS mode, the CURRENT ROW points to the current record itself. In the RANGE mode, the CURRENT ROW points to the first or to the last record that has the same position when sorted according to the ORDER BY clause. First or last will be chosen with a view to make the frame wider.

If frame_end is omitted, then CURRENT ROW is used instead.

If the whole frame clause is omitted, then the frame will be built using the RANGE UNBOUNDED PRECEDING definition.

Look at the following example of a window definition:

OVER (PARTITION BY a ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND 5 FOLLOWING)

The preceding definition means that for every row, all the records with the same value of the field a will form the partition. Then, the partition will be ordered in an ascending manner by the values of the field b, and the frame will contain all the records from the first to the fifth one following the current row.

The WINDOW clause

The window definitions can be quite long, and in many cases, it is not convenient to use them in the Select-list. PostgreSQL provides a way to define windows and give them names that can be used in the OVER clause in window functions. This is done by using the WINDOW clause of the SELECT statement, which is specified after the HAVING clause, as follows:

SELECT
    count() OVER w,
    sum(b) OVER w,
    avg(b) OVER (w ORDER BY c ROWS BETWEEN 1 PRECEDING
      AND 1 FOLLOWING)
  FROM table1
  WINDOW w AS (PARTITION BY a)

The predefined window can be used as is. In the preceding example, the window functions count and sum do that. The window definition can also be further detailed like it is for the function avg in the example. The syntactical difference is the following: to reuse the same window definition, the window name should be specified after the OVER keyword, without parenthesis. To extend the window definition with the ORDER BY or frame clause, one should use the name of the window inside the parenthesis.

When the same window definition is used several times, PostgreSQL will optimize the execution of the query by building partitions only once and then reusing the results.

Using window functions

All aggregating functions can be used as window functions, with the exception of ordered-set and hypothetical-set aggregates. User defined aggregating functions can also be used as window functions. The presence of an OVER clause indicates that the function is a window function.

When the aggregating function is used as a window function, it will aggregate the rows that belong to the window frame of a current row.

The typical use cases for the window functions are computing statistical values of different kinds. Take the car portal database for example. There is a table called advertisement that contains information about the advertisements that users create. Suppose it is required to analyze the quantity of advertisements that the users create over a period of time. The query which generates the report would be as follows:

WITH monthly_data AS (
  SELECT date_trunc('month', advertisement_date) AS
      month, count(*) as cnt
    FROM car_portal_app.advertisement
    GROUP BY date_trunc('month', advertisement_date)
)
SELECT to_char(month,'YYYY-MM') as month, cnt,
    sum(cnt) OVER (w ORDER BY month) AS cnt_year,
    round(avg(cnt) OVER (ORDER BY month
      ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING),1)
      AS mov_avg,
    round(cnt/sum(cnt) OVER w *100,1) AS ratio_year
  FROM monthly_data
  WINDOW w AS (PARTITION BY date_trunc('year',month));

  month  | cnt | cnt_year | mov_avg | ratio_year
---------+-----+----------+---------+------------
 2014-01 |  42 |       42 |    41.3 |        5.7
 2014-02 |  51 |       93 |    45.3 |        7.0
 2014-03 |  31 |      124 |    57.4 |        4.2
 2014-04 |  57 |      181 |    69.6 |        7.8
 2014-05 | 106 |      287 |    73.2 |       14.5
 2014-06 | 103 |      390 |    81.2 |       14.1
 2014-07 |  69 |      459 |    86.2 |        9.4
 2014-08 |  71 |      530 |    74.2 |        9.7
 2014-09 |  82 |      612 |    60.8 |       11.2
 2014-10 |  46 |      658 |    54.4 |        6.3
 2014-11 |  36 |      694 |    49.8 |        4.9
 2014-12 |  37 |      731 |    35.2 |        5.1
 2015-01 |  48 |       48 |    32.5 |       84.2
 2015-02 |   9 |       57 |    31.3 |       15.8
(14 rows)

In the WITH clause, the data is aggregated on a monthly basis. In the main query, the window w is defined as implying partitioning by year. This means that every window function that uses the window w will work with the records of the same year as the current record.

The first window function, sum, uses the window w. Since ORDER BY is specified, each record has its place in the partition. The Frame clause is omitted, which means that the frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is applied. That means, the function calculates the sum of the values for the records from the beginning of each year till the current month. It is the cumulative total on an yearly basis.

The second function, avg, calculates the moving average. For each record, it calculates the average value of five records—ranging from two preceding the current record to the second one following the current one. It does not use a predefined window, because the moving average does not take the year into account. Only the order of the values matters.

The third window function, sum, uses the same window definition again. It calculates the sum of the values for the whole year.

There are several window functions that are different from aggregating functions. They are used to get the values of other records within the partition, calculate the rank of the current row among all rows, and generate row numbers.

For example, let's extend the report of the previous example. Suppose, it is necessary to calculate the difference in the quantity of advertisements for each month against the previous months and against the same month of the previous year. Suppose it is also required to get the rank of the current month. The query would be as follows:

WITH monthly_data AS (
  SELECT date_trunc('month', advertisement_date) AS
      month, count(*) as cnt
    FROM car_portal_app.advertisement
    GROUP BY date_trunc('month', advertisement_date)
)
SELECT to_char(month,'YYYY-MM') as month,
    cnt,
    cnt - lag(cnt) OVER (ORDER BY month) as prev_m,
    cnt - lag(cnt, 12) OVER (ORDER BY month) as prev_y,
    rank() OVER (w ORDER BY cnt DESC) as rank
  FROM monthly_data
  WINDOW w AS (PARTITION BY date_trunc('year',month))
  ORDER BY month DESC;

  month  | cnt | prev_m | prev_y | rank
---------+-----+--------+--------+------
 2015-02 |   9 |    -39 |    -42 |    2
 2015-01 |  48 |     11 |      6 |    1
 2014-12 |  37 |      1 |        |   10
 2014-11 |  36 |    -10 |        |   11
 2014-10 |  46 |    -36 |        |    8
 2014-09 |  82 |     11 |        |    3
 2014-08 |  71 |      2 |        |    4
 2014-07 |  69 |    -34 |        |    5
 2014-06 | 103 |     -3 |        |    2
 2014-05 | 106 |     49 |        |    1
 2014-04 |  57 |     26 |        |    6
 2014-03 |  31 |    -20 |        |   12
 2014-02 |  51 |      9 |        |    7
 2014-01 |  42 |        |        |    9
(14 rows)

The lag function returns the value of a given expression for the record, which is the given number of records before the current one (default is 1). In the first occurrence of the function in the example, it returns the value of the field cnt in the previous record, which corresponds to the previous month.

The second lag returns the value of cnt for twelve previous records.

The rank function returns the rank of the current row within the partition. It returns the rank with gaps. That means that if two records have the same position according to the ORDER BY clause, both of them will get the same rank. The next record will get the rank after the next rank.

Other window functions are as follows:

  • lead: This returns the value of a given expression evaluated for the record that is the given number of records after the current row.
  • first_value, last_value, nth_value: This returns the value of a given expression evaluated for the first record, last record, or nth record of the frame respectively.
  • row_number: This returns the number of the current row within the partition.
  • dense_rank: This returns the rank of the current row without gaps.
  • percent_rank and cume_dist: These return the relative rank of the current row. The difference is that the first function uses rank and the second uses row_number as a numerator for the calculations.
  • ntile: This divides the partition into the given number of equal parts and returns the integer number of the part where the current record belongs.

A more detailed description of these functions is available in the documentation at http://www.postgresql.org/docs/current/static/functions-window.html.

Window functions with grouping and aggregation

Since window functions are evaluated after grouping, it is possible to use aggregating functions within the window functions, but not the other way around.

A code like the following is right:

sum( count(*) ) OVER()

This will also work: sum(a) OVER( ORDER BY count(*) )

However, the code, sum( count(*) OVER() ),is wrong.

For example, to calculate the rank of the seller accounts by the number of advertisements they give, the following query can be used:

SELECT seller_account_id,
    dense_rank() OVER(ORDER BY count(*) DESC)
  FROM car_portal_app.advertisement
  GROUP BY seller_account_id;

seller_account_id | dense_rank
-------------------+------------
                26 |          1
               128 |          2
                28 |          2
               126 |          2
...
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.105.74