SELECT
statements or SELECT
queries or just queries are used to retrieve data from the database. SELECT
queries can have different sources: tables, views, functions or the VALUES
command. All of them are relations or can be treated as relations or return relations, which functions can do. The output of SELECT
is also a relation which in general can have multiple columns and contain many rows. Since the result and the source of the query have the same nature in SQL, it is possible to use one SELECT
query as a source for another statement. But in this case, both queries are considered as parts of one bigger query. The source of the data, output format, filters, grouping, ordering and required transformations of the data are specified in the code of the query.
In general, SELECT
queries do not change the data in the database and could be considered as read-only, but there is an exception. If a volatile function is used in the query, then the data can be changed by the function.
Let's start with a simple example. We will use the sample database of the car web portal, which was described in previous chapters. The sample database is used by the web portal application.
To connect to the database, the following command is used:
> psql -h localhost car_portal
There is a table called car
that contains information about the cars registered in the system. Suppose it is necessary to query the database to get the information about cars that have 3
doors. They should be sorted by their ID. The output should be limited to 5
records due to pagination in the user interface. The query will look like this:
SELECT car_id, registration_number, manufacture_year FROM car_portal_app.car WHERE number_of_doors=3 ORDER BY car_id LIMIT 5;
And the result:
car_id | registration_number | manufacture_year --------+---------------------+------------------ 2 | VSVW4565 | 2014 5 | BXGK6290 | 2009 6 | ORIU9886 | 2007 7 | TGVF4726 | 2009 8 | JISW6779 | 2013 (5 rows)
The syntax and the logic of the query are the following. The query starts from the keyword SELECT
which determines the type of statement. Therefore, this keyword is always required. The keyword is followed by the comma-separated list of the fields to be retrieved from the database. Instead of the list, it is possible to use an asterisk (*
) which would mean that all the fields from the table are selected.
The name of the table is specified after the FROM
keyword. It is possible to get the data from several tables at the same time. The filter criteria - predicate - is after the WHERE
keyword. The sorting rule is at the end after ORDER BY
. And the keyword LIMIT
makes the database return not more than 5
rows even if the number of records in the table is larger.
These parts of the query—the keywords and the following expressions are called clauses, such as the FROM
clause, the WHERE
clause, and so on.
All of these clauses have their own purpose and logic. They must follow each other in the specific order. And none of them is mandatory. The simplified syntax diagram for the SELECT
statement is as follows:
SELECT [DISTINCT | ALL] <expression>[[AS] <output_name>][, …] [FROM <table>[, <table>… | <JOIN clause>…] [WHERE <condition>] [GROUP BY <expression>|<output_name>|<output_number> [,…]] [HAVING <condition>] [ORDER BY <expression>|<output_name>|<output_number> [ASC | DESC] [NULLS FIRST | LAST] [,…]] [OFFSET <expression>] [LIMIT <expression>];
Some elements were not included here such as the WINDOW
clause, the WITH
clause or FOR UPDATE
. A complete syntax diagram can be found in the documentation at http://www.postgresql.org/docs/current/static/sql-select.html.
Some of the omitted elements will be described in the next chapters.
There is no part of the SELECT
statement that is always mandatory. For example, the query might be simpler if no ordering or filtering is needed:
SELECT * FROM car_portal_app.car;
Even the FROM
clause is not mandatory. When one needs to evaluate an expression that does not take any data from the database, the query takes this form:
SELECT 1; ?column? ---------- 1
That can be considered as Hello world
in SQL. Note that the FROM
clause is optional in PostgreSQL but in other RDBMS such as Oracle the FROM
keyword may be required.
Logically, the sequence of the operations performed by the SELECT
query is as follows:
FROM
clause, they are evaluated first.JOIN
conditions or set some fields to NULL
in case of outer joins.WHERE
clause.GROUP BY
list.HAVING
conditions.DISTINCT
is specified.UNION
, EXCEPT
, or INTERSECT
.ORDER BY
clause.OFFSET
and LIMIT
.In fact PostgreSQL optimizes that algorithm by performing the steps in a different order or even simultaneously. For example, if LIMIT 1
is specified, then it does not make sense to retrieve all the rows from the source tables, but only one that matches the WHERE
condition.
After the SELECT
keyword, one should specify the list of fields or expressions to get from the database. This list is called Select-list. It defines the structure of the query result: the number, names, and type of the selected values.
Every expression in Select-list has a name in the output of the query. The names, when not provided by the user, are assigned automatically by the database and in most cases the name reflects the source of the data: a name of a column when a field from a table is selected, or a name of a function when one is used. In other cases, the name will look like ?column?
. It is possible and in many cases, it totally makes sense to provide a different name to a selected expression. This is done using the AS
keyword, like this:
SELECT car_id AS identifier_of_the_car ...
In the result, the column car_id
from the table car
will be named identifier_of_the_car
. The keyword AS
is optional. The output column name follows the rules for any SQL identifier. It is possible to use the same name for several columns in a single SELECT
query. Double quoted names could be used, for example, when some report is generated by the SELECT
query without any subsequent processing. In that case, it may make sense to use more human readable column names:
SELECT car_id "Identifier of the car" ...
In many cases, it is convenient to use an asterisk (*
) instead of Select-list. Asterisk represents all the fields from all the tables specified in the FROM
clause. And it is possible to use an asterisk for each table separately, like this:
SELECT car.*, car_model.marke ...
In this example, all fields are selected from table car
and only one field—marke
—from car_model
.
It is considered a bad practice to use *
in situations where the query is used in other code: in applications, stored procedures, view definitions, and so on. It is not recommended because in case of using *
, the output format depends not on the code of the query but on the structure of the data. And if the data structure changes, the output format also changes which will break the application using it. But if you explicitly specify all the output fields in the query and afterwards add another column in the input table, this will not change the output of the query and will not break the application.
So, in our example instead of SELECT * ...
, it is safer to use:
SELECT car_id, number_of_owners, regestration_number, number_of_doors, car_model_id, mileage ...
Expressions in the Select-list are called value expressions or scalar expressions. That is because each expression in the Select-list always returns only one value (but the value can be an array).
Scalar expressions can also be called SQL expressions or simply expressions.
Each expression in SQL has its data type. It is determined by the data type(s) of the input. In many cases, it is possible to explicitly change the type of the data. Each item of the Select-list becomes a column in the output dataset of a type that the corresponding expression has.
SQL expressions can contain:
This list is not complete. There are several other cases of using SQL expressions that are not covered by this chapter.
Column names can be qualified and unqualified. Qualified means that the name of the column is preceded by the table name and optionally the schema name, all separated by the period (.
) symbol. Unqualified are just names of the fields without table references. Qualified column names must be used when several tables in the FROM
clause have columns with the same name. Unqualified naming in this case will cause an error "column reference is ambiguous". That means the database cannot understand which column is meant there. It is possible to use table alias instead of table name, and in case of using subqueries or functions, the alias must be used.
An example of using qualified names in Select-list is as follows:
SELECT car.car_id, car.number_of_owners FROM car_portal_app.car;
SQL supports all common operators as most of the other languages: logical, arithmetic, string, binary, date/time, and so on. We will discuss logical operators later in reference to SQL conditions. An examples of using arithmetic operators in expressions would be:
SELECT 1+1 AS two, 13%4 AS one, -5 AS minus_five, 5! AS factorial, |/25 AS square_root; two | one | minus_five | factorial | square_root -----+-----+------------+-----------+------------- 2 | 1 | -5 | 120 | 5
Also, in PostgreSQL, it is possible to create user-defined operators.
Function calls can also be a part of an SQL expression. To call an SQL function one should use its name and the arguments in parenthesis. Like this:
SELECT substring('this is a string constant',11,6); substring ----------- string
A function named substring
was executed here. Three arguments were passed to the function: a string and two integers. This function extracts a part from, given string starting from the character specified by the second argument and having specified length.
By default, PostgreSQL assigns to the output column the same name as the function.
If a function has no arguments it is still necessary to use parenthesis to indicate that it is a function name and not a field name or another identifier or a keyword.
Another thing that makes SQL very flexible and powerful is scalar subqueries, which can be used as a part of the value expression. It allows the developer to combine the results of different queries together. Scalar subqueries or scalar queries are queries that return exactly one column and one or zero records. They have no special syntax and their difference from non-scalar queries is nominal.
Consider the example:
SELECT (SELECT 1) + (SELECT 2) AS three; three ---------- 3
Here, the result of one scalar query which returns the value of 1
is added to the result of another scalar query returning 2
. The result of the whole expression is 3
.
Type casting means changing the data type of a value.
Type casts have several syntax patterns with the same meaning:
CAST ( <value> AS <type>)
<value>::<type>
<type> '<value>'
<type> (<value>)
The first is a common SQL syntax that is supported in most databases. The second is PostgreSQL specific. The third is only applicable for string constants is and usually used to define constants of other types but string or numeric. The last is function-like and can be applied only for types whose names are also existing function names, which is not very convenient. That's why this syntax is not widely used.
In many cases, PostgreSQL can do implicit type conversion. For example, the concatenation operator ||
takes two operands of type string. But if one tries to concatenate a string with a number, PostgreSQL will convert the number to a string automatically:
SELECT 'One plus one equals ' || (1+1) AS str; str ----------------------- One plus one equals 2
A conditional expression is an expression returning different results depending on some condition. It is similar to an IF
—THEN
—ELSE
statement in other programming languages. The syntax is as follows:
CASE WHEN <condition1> THEN <expression1> [WHEN <condition2> THEN <expression2> ...] [ELSE <expression n>] END
The behavior is understandable from the syntax: if the first condition is met, then the result of the first expression is returned, if the second condition is met, then the second expression is used, and so on. If no condition is met, then the expression specified in the ELSE
part is evaluated. Each <condition>
is an expression returning Boolean (true or false) result. All expressions used after the keyword THEN
should return result of the same type or at least of compatible types.
The number of condition-expression pairs should be one or more. ELSE
is optional and when ELSE
is not specified and the no condition's result is true, then the CASE
expression returns NULL
.
CASE
can be used in any place where the SQL expression is used. CASE
expressions can be nested, that is they can be put one inside another. The order of evaluating conditions is the same as specified in the expression. That means for any condition, it is known that all preceding conditions have returned false. And if any condition returns true, subsequent conditions are not evaluated at all.
There is a simplified syntax for the CASE
expression. When all the conditions implement checking of equality of the same expression to several values, it is possible to use it like this:
CASE <checked_expression> WHEN <value1> THEN <result1> [WHEN <value2> THEN <result2> ...] [ELSE <result_n>] END
This means when the value of checked_expression
is equal to value1
, the result1
is returned, and so on.
This is an example of using CASE
:
SELECT CASE WHEN now() > date_trunc('day', now()) + interval '12 hours' THEN 'PM' ELSE 'AM' END; case ------ PM
Here, the current time is compared to midday (the current time is truncated to day
, which gives midnight and then time interval of 12 hours
is added). When the current time is more than midday, the expression returns the sting PM
, otherwise it returns AM
.
A single SQL expression can have many operators, functions, type casts, and so on. The length of an SQL expression has no limits in language specification. The Select-list is not the only place where SQL expressions can be used. In fact, they are used almost everywhere in SQL statements. For example, one can order the results of the query based on some SQL expression, as a sorting key. In an INSERT
statement they are used to calculate values of the fields for newly inserted records. SQL expressions that return Boolean values are used as conditions.
PostgreSQL supports the short-circuit evaluation of the expressions and it sometimes it skips the evaluation of the part of the expression when it does not affect the result. For example, when evaluating the expression false AND z()
, PostgreSQL will not call the function because the result of the
AND
operator is false
, regardless of the result of the z()
function call.
Another thing related to the Select-list is pair of keywords DISTINCT
and ALL
, which can be used right after the SELECT
keyword. When DISTINCT
is specified, only unique rows from the input dataset will be returned. ALL
returns all the rows, this is the default.
Consider the examples:
SELECT ALL marke FROM car_portal_app.car_model; marke --------------- Audi Audi Audi Audi BMW BMW ... (99 rows) and SELECT DISTINCT marke FROM car_portal_app.car_model; marke --------------- Ferrari GMC Citroen UAZ Audi Volvo ... (25 rows)
The input in both cases is the same: the table with the car models. But the first query returned 99
records while the second only returned 25
. This is because the first returned all the rows that were in the input table. The second query selected only the unique rows. DISTINCT
removes duplicate records based on the Select-list, not on the data in the input table. For example, if only the first letter of the name of the manufacturer is selected the result will be even shorter because some names start from the same letter:
SELECT DISTINCT substring(marke, 1, 1) FROM car_portal_app.car_model; substring ----------- H S C J L I ... (21 rows)
DISTINCT
also works when several columns are selected. In that case, DISTINCT
will remove duplicated combinations of all the column values.
The source of the rows for the query is specified after the FROM
keyword. It is called the FROM
clause. The query can select rows from zero, one, or more sources. When no source is specified, the FROM
keyword should be omitted. The source of the rows for the query can be any of:
When multiple sources are specified, they should be separated by a comma, or the JOIN
clause should be used.
It is possible to set aliases for tables in the FROM
clause. The optional keyword AS
is used for that:
SELECT a.car_id, a.number_of_doors FROM car_portal_app.car AS a; car_id | number_of_doors --------+----------------- 1 | 5 2 | 3 3 | 5 ...
In the preceding example, the table name car_portal_app.car
was substituted by its alias a
in the Select-list. If an alias is used for a table or view in the FROM
clause, in the Select-list (or anywhere else) it is no longer possible to refer to the table by its name. Subqueries when used in the FROM
clause must have an alias. Also aliases are often used when self-join is performed, which means using the same table several times in the FROM
clause.
It is possible to select records from several sources at a time. Consider the following examples. There are two tables each having three rows:
SELECT * FROM car_portal_app.a; a_int | a_text -------+-------- 1 | one 2 | two 3 | three (3 rows) SELECT * FROM car_portal_app.b; b_int | b_text -------+-------- 2 | two 3 | three 4 | four (3 rows)
When records are selected from both of them, we get the combinations of all their rows :
SELECT * FROM car_portal_app.a, car_portal_app.b; a_int | a_text | b_int | b_text -------+--------+-------+-------- 1 | one | 2 | two 1 | one | 3 | three 1 | one | 4 | four 2 | two | 2 | two 2 | two | 3 | three 2 | two | 4 | four 3 | three | 2 | two 3 | three | 3 | three 3 | three | 4 | four (9 rows)
All possible combinations of records are called Cartesian product and usually it does not make much sense. In most cases, the user is interested in certain combinations of rows, when rows from one table match rows from another table based on some criteria. For example, it may be necessary to select only the combinations when the integer fields of both the tables have equal values. To get it the query should be changed:
SELECT * FROM car_portal_app.a, car_portal_app.b WHERE a_int=b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 2 | two | 2 | two 3 | three | 3 | three (2 rows)
This condition (a_int=b_int
) join the tables. The joining conditions could be specified in the WHERE
clause but in most cases it is better to put them into a FROM
clause to make it more visible that they are here for joining and not for filtering the result of the join, though there is no formal difference.
The JOIN
keyword is used to add join conditions to the FROM
clause. The following query has the same logic and the same results as the previous one:
SELECT * FROM car_portal_app.a JOIN car_portal_app.b ON a_int=b_int;
The JOIN
condition may be specified using any of these three ways:
<first table> JOIN <second table> ON <condition>
The condition can be any. It's even not necessary to include fields from the joined tables.
<first table> JOIN <second table> USING (<field list>)
The join is based on the equality of all the fields specified in the comma separated <field list>
. The fields should exist in both tables with the same name. So this syntax is not flexible enough.
<first table> NATURAL JOIN <second table>
Here, the join is based on the equality of all the fields that have the same name in both tables.
Usage of USING
or NATURAL JOIN
syntax has a drawback which is similar to the usage of *
in the Select-list. It is possible to change the structure of the tables, for example, by adding another column or renaming them, in a way which does not make the query invalid but changes the logic of the query. This will cause errors that are very difficult to find.
What if not all rows from the first table can be matched to a row in the second table? In our example, only rows with integer values 2
and 3
exist in both tables. And when we join on the condition a_int=b_int
, only those two rows are selected from the tables. The rest of the rows are not selected. This kind of the join is called inner join. It can be shown as a filled area on the diagram, like this:
When all the records from one table are selected, regardless of the existence of matching records in the other table, it is called an outer join. There are three types of outer joins. Look at the following diagrams:
If all records are selected from the first table, along with only those records that match the joining condition from the second—it is a left outer join.
When all records from the second table are selected , along with only the matching records from the first table—it is a right outer join.
And when all the records from both tables are selected—it is a full outer join.
In SQL syntax, the words inner
and outer
are optional. Consider the following code example:
SELECT * FROM car_portal_app.a JOIN car_portal_app.b ON a_int=b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 2 | two | 2 | two 3 | three | 3 | three (2 rows) SELECT * FROM car_portal_app.a LEFT JOIN car_portal_app.b ON a_int=b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 1 | one | | 2 | two | 2 | two 3 | three | 3 | three (3 rows) SELECT * FROM car_portal_app.a RIGHT JOIN car_portal_app.b ON a_int=b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 2 | two | 2 | two 3 | three | 3 | three | | 4 | four (3 rows) SELECT * FROM car_portal_app.a FULL JOIN car_portal_app.b ON a_int=b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 1 | one | | 2 | two | 2 | two 3 | three | 3 | three | | 4 | four (4 rows)
Note that the Cartesian product is not the same as the result of the full outer join. Cartesian product means all possible combinations of all records from the tables without any specific matching rules. Full outer join returns pairs of records when they that match the join conditions. And the records that do not have a pair in other table are returned separately. Outer joins return empty values (NULLs) in columns that correspond to the table where no matching record is found.
Since it is possible to query not only tables but also views, functions and subqueries, it is also possible to join them using the same syntax as for joining tables:
SELECT * FROM car_portal_app.a INNER JOIN (SELECT * FROM car_portal_app.b WHERE b_text = 'two') subq ON a.a_int=subq.b_int; a_int | a_text | b_int | b_text -------+--------+-------+-------- 2 | two | 2 | two
In the example, the subquery got the alias sq
and it was used in the join condition.
Also, it is possible to join more than two tables. In fact, every join clause joins all the tables before the JOIN
keyword with one table right after the keyword.
For example, this is correct:
SELECT * FROM table_a JOIN table_b ON table_a.field1=table_b.field1 JOIN table_c ON table_a.field2=table_c.field2 AND table_b.field3=table_c.field3;
At the moment of JOIN table_c
, the table table_a
has been mentioned already in the FROM
clause, therefore it is possible to refer to that table.
However, this is not correct:
SELECT * FROM table_a JOIN table_b ON table_b.field3=table_c.field3 JOIN table_c ON table_a.field2=table_c.field2
The code will cause an error because at JOIN table_b
, the table table_c
has not been there yet.
The Cartesian product can also be implemented using the JOIN
syntax. The keywords CROSS JOIN
are used for that. The following code:
SELECT * FROM car_portal_app.a CROSS JOIN car_portal_app.b;
Is equivalent to:
SELECT * FROM car_portal_app.a, car_portal_app.b;
The join condition in INNER JOIN
in the logic of the query has the same meaning as a condition for filtering the rows in the WHERE
clause. So the two following queries are the same:
SELECT * FROM car_portal_app.a INNER JOIN car_portal_app.b ON a.a_int=b.b_int; SELECT * FROM car_portal_app.a, car_portal_app.b WHERE a.a_int=b.b_int;
But this is not the case for outer joins. There is no way to implement an outer join with the WHERE
clause in PostgreSQL, though it may be possible in other databases.
It is possible to join a table with itself. It is called self-join. Self-join has no special syntax. In fact, all the data sources in the query are considered as independent even though they could be the same. Suppose one wants to know about each record of a table and how many records exist with a bigger value than the field a_int
. The following query can be used for that:
SELECT t1.a_int AS current, t2.a_int AS bigger FROM car_portal_app.a t1 INNER JOIN car_portal_app.a t2 ON t2.a_int > t1.a_int; current | bigger --------+-------- 1 | 2 1 | 3 2 | 3 (3 rows)
The table a
is joined to itself. From the logic of the query, it does not matter if two different tables are joined or if the same table is used twice. To be able to reference the fields of the table and distinguish the instances of the table, the table aliases are used. The first instance is called t1
and the second t2
. From the results, it is visible that for the value 1
there are two bigger values: 2
and 3
; and for the value 2
only one bigger value of 3
exists. The examined value is in the column named current
and the bigger values are in the column of the same name.
The value 3
is not selected because there are no values bigger than 3
. But if one wants to explicitly show that, LEFT JOIN
is used:
SELECT t1.a_int AS current, t2.a_int AS bigger FROM car_portal_app.a t1 LEFT JOIN car_portal_app.a t2 ON t2.a_int > t1.a_int; current | bigger --------+-------- 1 | 2 1 | 3 2 | 3 3 | (4 rows)
In many cases, after the rows are taken from the input tables, they should be filtered. It is done via the WHERE
clause. The filtering condition is specified after the WHERE
keyword. The condition is an SQL expression returning a Boolean value. That's why the syntax of the WHERE
condition is the same as in the expressions in the Select-list. This is specific for PostgreSQL. In other databases, it may not be possible to use Boolean values in the Select-list, which makes SQL conditions different from SQL expressions. But in PostgreSQL, the difference is only in the data type of a returned value.
Simple examples of the WHERE
conditions can be:
SELECT * FROM car_portal_app.car_model WHERE marke='Peugeot'; SELECT * FROM car_portal_app.car WHERE mileage < 25000; SELECT * FROM car_portal_app.car WHERE number_of_doors > 3 AND number_of_owners <= 2;
Although there is no formal difference between SQL expressions and SQL conditions in PostgreSQL, usually all expressions returning a Boolean type are called conditional expressions or just conditions. And they are mostly used in the WHERE
clause, in conditional expressions like CASE
or in the JOIN
clause.
Logical operators are usually used in conditional expressions. They are AND
, OR
and NOT
. They take Boolean arguments and return Boolean values. Logical operators are evaluated in the following order: NOT
, AND
, OR
. But they have lower priority than any other operators. PostgreSQL tries to the optimize evaluation of logical expressions. For example, when the OR
operator is evaluated, and it is known that the first operand has the true
value, PostgreSQL may not evaluate the second operand at all, because the result of OR
is already known. For that reason, PostgreSQL may change the actual order of evaluating expressions on purpose to get the results faster.
Sometimes that might cause problems. For example, it is not possible to divide by zero, and in case one wanted to filter rows based on the result of division, this would not be correct:
SELECT * FROM t WHERE a<>0 AND b/a>0.5
Because it is not guaranteed that PostgreSQL will evaluate the first condition (a<>0
) before the second, and in case a = 0
, this could cause an error. To be secure one should use a CASE
statement because the order of evaluation of CASE
conditions is determined by the statement:
SELECT * FROM t WHERE CASE WHEN a=0 THEN false ELSE b/a>0.5 END
There are some other operators or expressions returning Boolean values that are used in conditional expressions:
OVERLAPS
operatorAs if in the Select-list, functions can be used in the WHERE
clause as well as anywhere in the expression. Suppose one wants to search for car models whose name is four letters long. This can be done by using a length
function:
SELECT * FROM car_portal_app.car_model WHERE length(model)=4; car_model_id | marke | model --------------+------------+------- 47 | KIA | Seed 57 | Nissan | GT-R 70 | Renault | Clio ...
Comparison operators are <
(less), >
(more), <=
(equal or less), >=
(equal or more), =
(equal) and <>
or !=
(not equal—those two are synonyms). These operators can compare not only numbers but any values that can be compared, for example dates, or strings.
There is a BETWEEN
construct that also relates to comparing:
x BETWEEN a AND b
The preceding code is equivalent to:
x>=a AND a<=b
The OVERLAPS
operator checks to see if two ranges of dates overlap or not. An example would be:
SELECT 1 WHERE (date '2014-10-15', date '2014-10-31') OVERLAPS (date '2014-10-25', date '2014-11-15'); ?column? ---------- 1
Comparison operators have different precedence: >=
and <=
have the highest priority. Then comes BETWEEN
, then OVERLAPS
, then <
and >
. And =
has the lowest priority.
Pattern matching is always about strings. There are two similar operators: LIKE
and ILIKE
. They check if a string matches a given pattern. Only two wildcards can be used in the pattern: underscore (_
) for exactly one character (or number) and percent sign (%
) for any number of any characters, including an empty string.
LIKE
and ILIKE
are the same except that the first is case-sensitive and second is not.
For example, to get car models whose names start with s
and have exactly four characters, one can use the following query:
SELECT * FROM car_portal_app.car_model WHERE model ILIKE 's_ _ _'; car_model_id | marke | model --------------+------------+------- 47 | KIA | Seed
There are another two pattern matching operators: SIMILAR
and ~
. They check for pattern matching using regular expressions. The difference between them is that SIMILAR
uses regular expression syntax defined in SQL standard, while ~
uses Portable Operating System Interface (POSIX) regular expressions.
In this example, one selects all car models whose names consist of exactly two words:
SELECT * FROM car_portal_app.car_model WHERE model ~ '^w+W+w+$'; car_model_id | marke | model --------------+---------------+-------------- 21 | Citroen | C4 Picasso 33 | Ford | C-Max 34 | Ford | S-Max ...
Row and array comparison constructs are used to make multiple comparisons between values, groups of values and arrays.
The expression IN
is used for checking if a value equals to any of the values from a list. The expression is as follows:
a IN (1, 2, 3)
The preceding code will return true
if a
equals to 1
, 2
or 3
. It is a shorter way of implementing the following:
(a = 1 OR a = 2 OR a = 3)
SQL allows the use of array types that mean several elements as a whole in one single value. This can be used for enriching comparison conditions. For example, this checks if a
is bigger than any of x
, y
or z
:
a > ANY (ARRAY[x, y, z])
The preceding code is equivalent to:
(a > x OR a > y OR a > z)
This checks if a
is bigger than either x
, y
, or z
:
a > ALL (ARRAY[x, y, z])
The preceding code is equivalent to:
(a > x AND a > y AND a > z )
The keywords IN
, ALL
, and ANY
(which has a synonym SOME
) can also be used in subquery expressions with the same logic. A result of subquery can be used in any place where it is possible to use a set of values or an array. This makes it possible, for instance, to select records from one table, when some values exist in an other table. For example, here car models are selected when there is a car of that model:
SELECT * FROM car_portal_app.car_model WHERE car_model_id IN (SELECT car_model_id FROM car_portal_app.car); car_model_id | marke | model --------------+---------------+-------------- 2 | Audi | A2 3 | Audi | A3 4 | Audi | A4 ... (86 rows)
Sometimes an IN
expression can be replaced by inner join, but not always. Consider the example:
SELECT car_model.* FROM car_portal_app.car_model INNER JOIN car_portal_app.car USING (car_model_id); car_model_id | marke | model --------------+---------------+-------------- 2 | Audi | A2 2 | Audi | A2 2 | Audi | A2 3 | Audi | A3 3 | Audi | A3 4 | Audi | A4 4 | Audi | A4 ... (229 rows)
Although the same table is queried and the same columns are returned, the number of records is bigger. That is because there are many cars of the same model, and for them, the model is selected several times.
The NOT IN
construct with a subquery is sometimes very slow because the check for the nonexistence of a value is more expensive than the opposite.
In all previous examples,the number of records returned by the SELECT
query is the same as the number of rows from the input table (or tables) after filtering. In other words, every row from any of the source table (or joined tables) becomes exactly one row in the query result. Rows are processed one by one.
But SQL provides a way to get some aggregated result of processing several records at a time and get the result in one single row. The easiest example would be counting the total number of records in the table. The input is all the records of a table. The output is one single record. Grouping and aggregation is used for this.
The GROUP BY
clause is used for grouping. Grouping means splitting the whole input set of records into several groups with a view to have only one result row for each group. Grouping is performed on the basis of a list of expressions. All records that have the same combination of values of grouping expressions are grouped together. That means the groups are identified by the values of expressions defined in the GROUP BY
clause. Usually, it makes sense to include those expressions in the Select-list, in order to make it visible which group is referred to by the result row.
For example:
SELECT a.marke, a.model FROM car_portal_app.car_model a INNER JOIN car_portal_app.car b ON a.car_model_id=b.car_model_id GROUP BY a.marke, a.model; marke | model ---------------+-------------- Opel | Corsa Ferrari | 458 Italia Peugeot | 308 Opel | Insignia ... (86 rows)
Here, the list of all the car models that are used in the table car
is selected. Each record in the result set represents a group of records from the source tables relating to the same car model. In fact, this query gives the same result as SELECT DISTINCT marke, model...
without GROUP BY
but the logic is different. DISTINCT
removes duplicated values, but GROUP BY
groups duplicated values together.
It is almost useless just to group rows. Usually, it is necessary to do some computing on the groups. In the last case, it would be interesting to know how many cars of which model are in the system. This is done by aggregation. Aggregation means performing calculation, on a group of records returning a single value for the whole group. This is done by the special aggregating functions that are used in the Select-list. To get the number of cars, one needs to use the count
function:
SELECT a.marke, a.model, count(*) FROM car_portal_app.car_model a INNER JOIN car_portal_app.car b ON a.car_model_id=b.car_model_id GROUP BY a.marke, a.model; marke | model | count ---------------+--------------+------- Opel | Corsa | 6 Ferrari | 458 Italia | 4 Peugeot | 308 | 3 Opel | Insignia | 4 ... (86 rows)
There are several aggregating functions available in PostgreSQL. The most frequently used are count
, sum
, max
, min
, and avg
for computing, respectively the number of records in the group, the sum of any numeric expression for all the records in the group, the biggest, the lowest and average value of any expression. There are some other aggregating functions such as corr
, which computes the correlation coefficient of the two given arguments, stddev
—for standard deviation, string_agg
, which concatenates the string values of an expression, and others.
When grouping and aggregation is used, the records are grouped. This means several records become one. Therefore, no other expressions except the aggregation functions and expressions from the
GROUP BY
list can be included in the Select-list. If it is done, the database will raise an error:
SELECT a_int, a_text FROM car_portal_app.a GROUP BY a_int; ERROR: column "a.a_text" must appear in the GROUP BY clause or be used in an aggregate function
But it is possible to create new expressions based on the expressions from the GROUP BY
list. For example, if we have GROUP BY a, b
, it is possible to SELECT a+b
.
What if it is needed to group all of the records of the table together, not on the basis of the values of some field, but the whole table? To do that one should include aggregating functions in the Select-list (and only them!) and not use the GROUP BY
clause:
SELECT count(*) FROM car_portal_app.car; count ------- 229 (1 row)
If all the records of a table are grouped without any GROUP BY
expressions, then exactly one group exists. Note that the SQL queries that have aggregating functions in the Select-list and do not have the GROUP BY
clause always return exactly one row, even if there are no rows in the input tables, or if all of them are filtered out. For example:
SELECT count(*) FROM car_portal_app.car WHERE number_of_doors = 15; count ------- 0 (1 row)
There are no cars with 15
doors. If the table is selected with this WHERE
condition, no rows will be returned. But if one uses count(*)
, the aggregating function will return a row with a value of zero.
It is possible to count the number of unique values of the expression with count(DISTINCT <expression>)
. For example:
SELECT count(*), count(DISTINCT car_model_id) FROM car_portal_app.car; count | count -------+------- 229 | 86
Here, the first column has the total number of cars. The second column is the number of the car models to which the cars belong. Since some cars are of the same model, the number of models is lower than the total number of the cars.
Aggregating functions are not allowed in the WHERE
clause but it is possible to filter groups that follow a certain condition. This is different from filtering in the WHERE
clause because WHERE
filters input rows, and groups are calculated afterwards. The filtering of the groups is done by the HAVING
clause. That is very similar to the WHERE
clause but only aggregating functions are allowed there. The HAVING
clause is specified after the GROUP BY
clause. Suppose one needs to know which car models have a number of cars greater than 5
. This can be done by using a subquery:
SELECT marke, model FROM ( SELECT a.marke, a.model, count(*) c FROM car_portal_app.car_model a INNER JOIN car_portal_app.car b ON a.car_model_id=b.car_model_id GROUP BY a.marke, a.model ) subq WHERE c >5; marke | model ---------+------- Opel | Corsa Peugeot | 208 (2 rows)
But the simpler and clearer way is to do it via a HAVING
clause:
SELECT a.marke, a.model FROM car_portal_app.car_model a INNER JOIN car_portal_app.car b ON a.car_model_id=b.car_model_id GROUP BY a.marke, a.model HAVING count(*)>5; marke | model ---------+------- Opel | Corsa Peugeot | 208
The results of the query are not ordered by default. The order of the rows is not defined and may depend on their physical place on the disc, the joining algorithm, or on other factors. But in many cases it is required to have the result set sorted. This is done with the ORDER BY
clause. The list of expressions whose values should be sorted is specified after the ORDER BY
keywords. At the beginning, the records are sorted on the basis of the first expression of the ORDER BY
list. If some rows have the same value for the first expression, they are sorted by the values of the second expression, and so on.
After each item of the ORDER BY
list it is possible to specify if the order should be ascending or descending. That is done by specifying the keywords ASC
or DESC
after the expression. Ascending is the default. NULL values are considered larger than any other values by default but it is possible to explicitly define that NULLs should precede other rows by specifying NULLS FIRST or NULLS LAST if NULLs should be at the end.
It is not required for the ORDER BY
clause to contain the same expressions as the Select-list but it usually does. So, to make it more convenient, it is allowed to use in the ORDER BY
list the output column names that are assigned to the expression in the Select-list instead of fully qualified expressions. Also, it is possible to use the numbers of the columns.
So, these examples are equivalent:
SELECT number_of_owners, manufacture_year, trunc(mileage/1000) as kmiles FROM car_portal_app.car ORDER BY number_of_owners, manufacture_year, trunc(mileage/1000) DESC; SELECT number_of_owners, manufacture_year, trunc(mileage/1000) as kmiles FROM car_portal_app.car ORDER BY number_of_owners, manufacture_year, kmiles DESC; SELECT number_of_owners, manufacture_year, trunc(mileage/1000) as kmiles FROM car_portal_app.car ORDER BY 1, 2, 3 DESC;
Sometimes it is necessary to limit the output of the query to a certain number of rows and discard the rest. That is done by specifying that number after the LIMIT
keyword. For example:
SELECT * FROM car_portal_app.car_model LIMIT 5; car_model_id | marke | model --------------+-------+------- 1 | Audi | A1 2 | Audi | A2 3 | Audi | A3 4 | Audi | A4 5 | Audi | A5 (5 rows)
The preceding code returns only 5
rows regardless of the fact that the actual number of records in the table is bigger. This is sometimes used in scalar subqueries that should not return more than 1 record.
Another similar task is to skip several rows at the beginning of the output. This is done by the using keyword OFFSET
. OFFSET
and LIMIT
can be used together:
SELECT * FROM car_portal_app.car_model OFFSET 5 LIMIT 5; car_model_id | marke | model --------------+-------+------- 6 | Audi | A6 7 | Audi | A8 8 | BMW | 1er 9 | BMW | 3er 10 | BMW | 5er (5 rows)
The typical use case for OFFSET
and LIMIT
is the implementation of paginated output in web applications. For example, if ten rows are displayed on a page, then on third page the rows 21-30 should be shown. Then OFFSET 20 LIMIT 10
is used. In most cases of using OFFSET
and LIMIT
, the rows should be ordered, otherwise it is not determined which records are not shown. The keywords are then specified after the ORDER BY
clause.
Subqueries are a very powerful feature of SQL. They can be used almost everywhere in the query. The most obvious way to use subqueries is in the FROM
clause as a source for the main query:
SELECT * FROM ( SELECT car_model_id, count(*) c FROM car_portal_app.car GROUP BY car_model_id ) subq WHERE c = 1; car_model_id | c --------------+--- 8 | 1 80 | 1 ... (14 rows)
When subqueries are used in the FROM
clause they must have an alias. In the preceding example, the subquery is given the name subq
.
Also, subqueries can be used in SQL conditions in the IN
expressions:
SELECT car_id, registration_number FROM car_portal_app.car WHERE car_model_id IN ( SELECT car_model_id FROM car_portal_app.car_model WHERE marke='Peugeot' ); car_id | registration_number --------+--------------------- 1 | MUWH4675 14 | MTZC8798 18 | VFZF9207 ... (18 rows)
Scalar subqueries can be used everywhere in expressions: in the Select-list, WHERE
clause, GROUP BY
clause, and so on. Even in LIMIT
:
SELECT (SELECT count(*) FROM car_portal_app.car_model) FROM car_portal_app.car LIMIT (SELECT MIN(car_id)+2 FROM car_portal_app.car); count ------- 99 99 99 (3 rows)
This is a PostgreSQL specific feature. Not every RDBMS supports subqueries in every place where expression is allowed.
It is not possible to refer to the internal elements of one subquery from inside of another. But subqueries can refer to the elements of main query. For example, if it is necessary to count cars for each car model, and select the top five most popular models, it can be done by using subquery in this way:
SELECT marke, model, ( SELECT count(*) FROM car_portal_app.car WHERE car_model_id = main.car_model_id ) FROM car_portal_app.car_model main ORDER BY 3 DESC LIMIT 5; marke | model | count ---------+----------+------- Peugeot | 208 | 7 Opel | Corsa | 6 Jeep | Wrangler | 5 Renault | Laguna | 5 Peugeot | 407 | 5 (5 rows)
In the example the subquery in the Select-list refers to the table of the main query by its alias main
. The subquery is executed for each record received from the main table, using the value of car_model_id
in the WHERE
condition.
Subqueries can be nested. It means it is possible to use subqueries inside another subqueries.
Set operations are used to combine the results of several queries. It is different from joining, although the same results often can be achieved by joining.
Joining means placing the records of two tables, one table with another table horizontally. The result of joining is that the number of columns equal to the sum of the numbers of columns of the source tables, and the number of records will depend on the join conditions.
Combining means putting the result of one query on top of the result of another query. The number of columns stays the same, but the number of rows is the sum of the rows from the sources.
There are three set operations:
UNION
: This appends the result of one query to the result of another queryINTERSECT
: This returns the records that exist in the results of both queriesEXCEPT
: This returns the records from the first query that do not exist in the result of the second query—the differenceThe syntax of set operations is as follows:
<query1> UNION <query2>; <query1> INTERSECT <query2>; <query1> EXCEPT <query2>;
It is possible to use several set operations in one statement:
SELECT a, b FROM t1 UNION SELECT c, d FROM t2 INTERSECT SELECT e, f FROM t3;
The priority of all set operations is the same. That means logically they are executed in the same order as used in the code. But the records can be returned in a different order that is not predicted, unless the ORDER BY
clause is used. In this case, the ORDER BY
clause is applied after all of the set operations. And for that reason, it does not make sense to put ORDER BY
into the subqueries.
All set operations by default remove duplicated records as if SELECT DISTINCT
is used. To avoid this and return all the records, the ALL
keyword should be used, which is specified after the name of the set operation:
<query1> UNION ALL <query2>.
The set operations can be used to find the difference between the two tables. For example:
SELECT 'a', * FROM ( SELECT * FROM car_portal_app.a EXCEPT ALL SELECT * FROM car_portal_app.b ) v1 UNION ALL SELECT 'b', * FROM ( SELECT * FROM car_portal_app.b EXCEPT ALL SELECT * FROM car_portal_app.a ) v2; ?column? | a_int | a_text ----------+-------+-------- a | 1 | one b | 4 | four (2 rows)
From the results of that query, you can find out that the row one
exists in the table a
but does not exist in the table b
. And the row four
exists in the table b
, but not in a
.
It is possible to append one set of records to another only when they have the same number of columns and they have respectively the same data types, or compatible data types. The output names for the columns are always taken from the first subquery, even if they are different in subsequent queries.
In other RDBMS, set operations can have different names, for example, in Oracle EXCEPT
is called MINUS
.
NULL is a special value that any field or expression can have, except for when it is explicitly forbidden. NULL means the absence of any value. It can also be treated as an unknown value in some cases. In relation to logical values, NULL is neither true nor false. Working with NULLs can be confusing because almost all operators when take NULL as an argument, return NULL. And if one tries to compare some values and one of them is NULL the result will also be NULL, which is not true. For example, the query
SELECT * FROM t WHERE a > b
This will return no records if a
or b
have NULL
value. That can be expected, but for the following condition this is not so clear:
SELECT * FROM t WHERE a = b
Here, if both a
and b
have a value of NULL
, the record still will not be selected. The equal operator (=
) returns NULL
if any of the arguments is NULL
, and since WHERE
filters rows only when the condition is true, the record is not selected. Similarly, the following will never return any record, even if a
has a NULL
value:
SELECT * FROM t WHERE a = NULL
To check the expression for having NULL
value, a special operator is used: IS NULL
. The previous query, if it is necessary to also select records when both a
and b
are NULL
, should be changed this way:
SELECT * FROM t WHERE a = b OR (a IS NULL AND b IS NULL)
There is a special construct, that can be used to check the equivalence of expressions taking NULL
into account: IS NOT DISTINCT FROM
. The preceding example can be implemented in the following way, which has the same logic:
SELECT * FROM t WHERE a IS NOT DISTINCT FROM b
Logical operators are different. They can sometimes return not null value even if they take NULL
as an argument. For logical operators NULL
means an unknown value. Operator AND
always returns false when one of the operands is false, even if the second is NULL
. Or
always returns true if one of the arguments if true. In all other cases, the result is unknown, therefore NULL
:
SELECT true AND NULL, false AND NULL, true OR NULL, false OR NULL, NOT NULL; ?column? | ?column? | ?column? | ?column? | ?column? ----------+----------+----------+----------+---------- | f | t | |
Functions can treat NULL
values differently. Their behavior is determined by their code. Most built-in functions return NULL
if any of the arguments are NULL
.
Aggregating functions work with NULL
values in a different way. They work with many rows and therefore many values. In general, they ignore NULL
. Sum
calculates the total of all not-null values and ignores NULL
. Sum
returns NULL
only when all the received values are NULL
. For avg
, max
and min
it is the same. But for count
it is different. count
returns the number of not-null values. So if all the values are NULL
, count
returns zero.
In contrast to some other databases in PostgreSQL, empty string is not NULL
. Consider the example:
SELECT a IS NULL, b IS NULL, a = b FROM (SELECT ''::text a, NULL::text b) v; ?column? | ?column? | ?column? ----------+----------+---------- f | t |
There are a couple of functions designed to deal with NULL
values: COALESCE
and NULLIF
.
The COALESCE
function takes any number of arguments of the same data type or compatible types. It returns the value of the first of its arguments that is not null:
COALESCE(a, b, c)
The preceding code is equivalent to:
CASE WHEN a IS NOT NULL THEN a WHEN b IS NOT NULL THEN b ELSE c END
NULLIF
takes two arguments and returns NULL
if they are equal. Otherwise it returns the value of the first argument. This is somehow the opposite of COALESCE
:
NULLIF (a, b)
The preceding code is equivalent to:
CASE WHEN a = b THEN NULL ELSE a END
Another aspect of NULL
values is that they are ignored by unique constraints. This means if a field of a table is defined as unique, it is still possible to create several records having a NULL
value of that field. Also b-tree indexes which are most commonly used, do not index NULL
values. Consider the query as follows:
SELECT * FROM t WHERE a IS NULL
The preceding code will not use an index if it is created on the column a
.
18.117.188.67