In the following sections, some other advanced SQL techniques will be introduced:
Quite often it is necessary to find the first records based on some criteria. For example, let's take the car_portal
database; suppose it is required to find the first advertisement for each car_id
in the advertisement
table.
Grouping can help in this case. It will require a subquery to implement the logic:
SELECT advertisement_id, advertisement_date, adv.car_id, seller_account_id FROM car_portal_app.advertisement adv INNER JOIN ( SELECT car_id, min(advertisement_date) min_date FROM car_portal_app.advertisement GROUP BY car_id ) first ON adv.car_id=first.car_id AND adv.advertisement_date = first.min_date;
However, if the ordering logic is complex and cannot be implemented by using the function min
, that approach will not work.
Although window functions can solve the problem, they are not always convenient to use:
SELECT DISTINCT first_value(advertisement_id) OVER w AS advertisement_id, min(advertisement_date) OVER w AS advertisement_date, car_id, first_value(seller_account_id) OVER w AS seller_account_id FROM car_portal_app.advertisement WINDOW w AS (PARTITION BY car_id ORDER BY advertisement_date);
In the preceding code, DISTINCT
is used to remove the duplicates which where grouped together in the previous example.
PostgreSQL provides an explicit way of selecting the first record within each group. The DISTINCT ON
keywords are used for that. The syntax is as follows:
SELECT DISTINCT ON (<expression_list>) <Select-List> ... ORDER BY <order_by_list>
In the preceding code snippet, for each distinct combination of values of expression_list
, only the first record will be returned by the SELECT
statement. The ORDER BY
clause is used to define a rule to determine which record is the first. Expression_list
from DISTINCT ON
must be included in the ORDER BY
list.
For the task being discussed, it can be applied in the following way:
SELECT DISTINCT ON (car_id) advertisement_id, advertisement_date, car_id, seller_account_id FROM car_portal_app.advertisement ORDER BY car_id, advertisement_date;
Functions in PostgreSQL can return not only single values but also relations. They are called set returning functions.
There is a quite typical task for every SQL developer: to generate a sequence of integers, each in a separate record. This sequence or relation can have many use cases. For example, suppose it is necessary to select data from the car_portal
database to count the number of cars for each year of manufacture, from 2010 till 2015; and it is required to show zero for the years where no cars exist at all. The simple SELECT
statement from the only table, car
, cannot be used to implement that logic. It is not possible to see the records that are absent from the table data itself. That's why it would be useful if a table with the numbers from 2010 to 2015 existed. Then it could be outer-joined to the results of the query.
One could create that table in advance, but it is not a very good idea, because the number of records necessary in that table is not known in advance; and if one created a very big table, it would be just a waste of disk space in most cases.
There is a generate_series
function called which serves exactly that purpose. It returns a set of integers with the given start and stop values. The query for the example would be as follows:
SELECT years.manufacture_year, count(car_id) FROM generate_series(2010, 2015) as years (manufacture_year) LEFT JOIN car_portal_app.car ON car.manufacture_year = years.manufacture_year GROUP BY years.manufacture_year ORDER BY 1; manufacture_year | count ------------------+------- 2010 | 11 2011 | 12 2012 | 12 2013 | 21 2014 | 16 2015 | 0 (6 rows)
In the preceding query, the generate_series
function returns six integers from 2010 to 2015. That set has an alias t
. The table car
is left-joined to the set of integers and then, all the years can be seen in the output result set.
It is possible to specify a step when calling generate_series
:
SELECT * FROM generate_series(5, 11, 3); generate_series ----------------- 5 8 11 (3 rows)
The generate_series
function can also return a set of timestamp values:
SELECT * FROM generate_series('2015-01-01'::date, '2015-01-31'::date, interval '7 days'); generate_series ------------------------ 2015-01-01 00:00:00-05 2015-01-08 00:00:00-05 2015-01-15 00:00:00-05 2015-01-22 00:00:00-05 2015-01-29 00:00:00-05 (5 rows)
There are a couple of other set-returning functions designed to work with arrays:
generate_subscripts
: This generates numbers that can be used to reference the elements in a given array for the given dimension(s). This function is useful for enumerating the elements of an array in an SQL statement.unnest
: This transforms a given array into a set or rows where each record corresponds to an element of the array.Set-returning functions are also called table functions.
Table functions can return sets of rows of a predefined type, like generate_series
returns a set of int
or bigint
(depending on the type of input argument). Moreover, they can return a set of abstract type records. That allows a function to return different numbers of columns depending on their implementation. But SQL requires that the row structure of all input relations is defined so that the optimizer can parse the query and build the execution plan. That's why all table functions that return sets of records, when used in a query, must be followed by a definition of their row structure:
function_name [AS] alias (column_name column_type [, ...])
The output of several set-returning functions can be combined together as if they were joined on the position of each row. The ROWS FROM
syntax is used for this, as follows:
ROWS FROM (function_call [,…]) [[AS] alias (column_name [,...])]
The preceding construct will return a relation. The number of rows is equal to the largest output of the functions. Each column corresponds to the respective function in the ROWS FROM
clause. If a function returns fewer rows than other functions, the missing values will be set to NULL
. For example:
SELECT foo.a, foo.b FROM ROWS FROM ( generate_series(1,3), generate_series(1,7,2) ) AS foo(a, b); a | b ---+--- 1 | 1 2 | 3 3 | 5 | 7 (4 rows)
Subqueries were discussed in the previous chapter. However, it is worth mentioning one specific pattern of using them in more detail.
It is very convenient to use subqueries in the Select-list. They are used in the SELECT
statements, for example, to create calculated attributes when querying a table. Let's take the car portal database again. Suppose it is necessary to query the table car
to retrieve information about the cars. For each car, it is required to assess its age by comparing it with the age of the other cars of the same model. Furthermore, it is required to query the number of cars of the same model.
These two additional fields can be generated by scalar subqueries, as follows:
SELECT car_id, manufacture_year, CASE WHEN manufacture_year <= (SELECT avg(manufacture_year) FROM car_portal_app.car WHERE car_model_id = c.car_model_id ) THEN 'old' ELSE 'new' END as age, (SELECT count(*) FROM car_portal_app.car WHERE car_model_id = c.car_model_id ) AS same_model_count FROM car_portal_app.car c; car_id | manufacture_year | age | same_model_count --------+------------------+-----+------------------ 1 | 2008 | old | 3 2 | 2014 | new | 6 3 | 2014 | new | 2 ... (229 rows)
The power of those subqueries is that they can refer to the main
table in their WHERE
clause. That makes them easy. It is also very simple to add more columns in the query by adding other subqueries. On the other hand, there is a problem here: performance. The car
table is scanned by the database server once for the main query, and then it is scanned two times again for each retrieved row, that is, for the age
column and for the same_model_count
column.
It is possible, of course, to calculate those aggregates once for each car model independently and then join the results with the car
table:
SELECT car_id, manufacture_year, CASE WHEN manufacture_year <= avg_year THEN 'old' ELSE 'new' END as age, same_model_count FROM car_portal_app.car INNER JOIN ( SELECT car_model_id, avg(manufacture_year) avg_year, count(*) same_model_count FROM car_portal_app.car GROUP BY car_model_id ) subq USING (car_model_id); car_id | manufacture_year | age | same_model_count --------+------------------+-----+------------------ 1 | 2008 | old | 3 2 | 2014 | new | 6 3 | 2014 | new | 2 ... (229 rows)
The result is the same and the query is 20 times faster. However, this query is only good for retrieving many rows from the database. If it is required to get the information about only one car, the first query will be faster.
One can see that the first query could perform better if it was possible to select two columns in the subquery in the Select-list. But that is not possible. Scalar queries can return only one column.
There is yet another way of using subqueries. It combines the advantages of the subqueries in the Select-list, which can refer to the main table, with the subqueries in the FROM
clause, which can return multiple columns. This can be done via lateral subqueries that were added in PostgreSQL version 9.3. Putting the LATERAL
keyword before the subquery code in the FROM
clause makes it possible to reference any preceding items of the FROM
clause from the subquery.
The query would be as follows:
SELECT car_id, manufacture_year, CASE WHEN manufacture_year <= avg_year THEN 'old' ELSE 'new' END as age, same_model_count FROM car_portal_app.car c, LATERAL (SELECT avg(manufacture_year) avg_year, count(*) same_model_count FROM car_portal_app.car WHERE car_model_id = c.car_model_id) subq; car_id | manufacture_year | age | same_model_count --------+------------------+-----+------------------ 1 | 2008 | old | 3 2 | 2014 | new | 6 3 | 2014 | new | 2 ... (229 rows)
This query is approximately two times faster than the first one, and is the best one for retrieving only one row from the car
table.
When it comes to set-returning functions, it is not necessary to use the LATERAL
keyword. All functions that are mentioned in the FROM
clause can use the output of any preceding functions or subqueries:
SELECT a, b FROM generate_series(1,3) AS a, generate_series(a, a+2) AS b; a | b ---+--- 1 | 1 1 | 2 1 | 3 2 | 2 2 | 3 2 | 4 3 | 3 3 | 4 3 | 5 (9 rows)
In the preceding query, the first function that has the alias a
returns three rows. For each of those three rows, the second function is called, returning three more rows.
There are several aggregating functions that are executed in a special way.
The first group of such aggregating functions are called ordered-set aggregates. They take into account not just the values of the argument expressions but also their order. They are related to statistics and calculate percentile values.
Percentile is the value of a group in which the given percentage of other values is less than that of the group. For example, if a value is at the 95th percentile, it means it is higher than 95 percent of the other values. In PostgreSQL, one can calculate a continuous or discrete percentile. A discrete percentile is one of the actual values of a group, while a continuous percentile is an interpolated value between two actual values. It is possible to calculate the percentile for a given fraction, or several percentile values for a given array of fractions.
For example, the query regarding the distribution of the number of advertisements per car:
SELECT percentile_disc(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY cnt) FROM ( SELECT count(*) cnt FROM car_portal_app.advertisement GROUP BY car_id) subq; percentile_disc ----------------- {2,3,5} (1 row)
The result means that there are, at the most, two advertisements for 25 percent of the cars, three advertisements for 50 percent of the cars, and five advertisements for 75 percent of the cars in the database.
The syntax of the ordered-set aggregating functions differs from the normal aggregates and uses a special construct, WITHIN GROUP
(ORDER BY
expression). The expression here is actually an argument of a function. Here, not just the order of the rows but the values of the expressions as well affect the result of the function. In contrast to the ORDER BY
clause of the SELECT
query, only one expression is possible here, and no references to the output column numbers are allowed.
Another ordered-set aggregating function is mode
. It returns the most frequently appearing value of the group. If two values appear the same number of times, then the first of them will be returned.
For example, the following query gets the ID of the most frequent car model in the database:
SELECT mode() WITHIN GROUP (ORDER BY car_model_id) FROM car_portal_app.car; mode ------ 64 (1 row)
To get the same result without this function will require self-join or ordering and limiting the result. Both are more expensive operations.
Another group of aggregates that use the same syntax are the hypothetical-set aggregating functions. They are rank
, dense_rank
, percent_rank
, and cume_dist
. There are window functions with the same names. Window functions take no argument and they return the result for the current row. Aggregate functions have no current row because they are evaluated for a group of rows. But they take an argument: the value for the hypothetical current row.
For example, the aggregate function rank
returns the rank of a given value in the ordered set as if that value existed in the set:
SELECT rank(2) WITHIN GROUP (ORDER BY a) FROM generate_series(1,10,3) a; rank ------ 2 (1 row)
In the preceding query, the value 2
does not exist in the output of generate_series
(it returns 1..4..7..10
) . But if it existed, it would take the second position in the output.
Another topic worth mentioning about aggregating functions is the FILTER
clause.
The FILTER
clause filters the rows that are passed to the particular aggregating function based on a given condition. For example, suppose it is required to count the number of cars in the database for each car model separately, for each number of doors. If one groups the records by these two fields, the result will be correct but not very convenient to use in reporting:
SELECT car_model_id, number_of_doors, count(*) FROM car_portal_app.car GROUP BY car_model_id, number_of_doors; car_model_id | number_of_doors | count --------------+-----------------+------- 47 | 4 | 1 42 | 3 | 2 76 | 5 | 1 52 | 5 | 2 ...
The FILTER
clause makes the output much clearer:
SELECT car_model_id, count(*) FILTER (WHERE number_of_doors = 2) doors2, count(*) FILTER (WHERE number_of_doors = 3) doors3, count(*) FILTER (WHERE number_of_doors = 4) doors4, count(*) FILTER (WHERE number_of_doors = 5) doors5 FROM car_portal_app.car GROUP BY car_model_id; car_model_id | doors2 | doors3 | doors4 | doors5 --------------+--------+--------+--------+-------- 43 | 0 | 0 | 0 | 2 8 | 0 | 0 | 1 | 0 11 | 0 | 2 | 1 | 0 80 | 0 | 1 | 0 | 0 ...
Note that the cars with a number of doors other that is than from 2 to 5 will not be counted by the query.
The same result can be achieved by calling functions, as follows:
count(CASE WHEN number_of_doors = 2 THEN 1 END) doors2
18.188.143.21