Querying the data with the SELECT statement

SELECT statements or SELECT queries or just queries are used to retrieve data from the database. SELECT queries can have different sources: tables, views, functions or the VALUES command. All of them are relations or can be treated as relations or return relations, which functions can do. The output of SELECT is also a relation which in general can have multiple columns and contain many rows. Since the result and the source of the query have the same nature in SQL, it is possible to use one SELECT query as a source for another statement. But in this case, both queries are considered as parts of one bigger query. The source of the data, output format, filters, grouping, ordering and required transformations of the data are specified in the code of the query.

In general, SELECT queries do not change the data in the database and could be considered as read-only, but there is an exception. If a volatile function is used in the query, then the data can be changed by the function.

The structure of the SELECT query

Let's start with a simple example. We will use the sample database of the car web portal, which was described in previous chapters. The sample database is used by the web portal application.

To connect to the database, the following command is used:

> psql -h localhost car_portal

There is a table called car that contains information about the cars registered in the system. Suppose it is necessary to query the database to get the information about cars that have 3 doors. They should be sorted by their ID. The output should be limited to 5 records due to pagination in the user interface. The query will look like this:

SELECT car_id, registration_number, manufacture_year
  FROM car_portal_app.car
  WHERE number_of_doors=3
  ORDER BY car_id
  LIMIT 5;

And the result:

car_id | registration_number | manufacture_year
--------+---------------------+------------------
      2 | VSVW4565            |             2014
      5 | BXGK6290            |             2009
      6 | ORIU9886            |             2007
      7 | TGVF4726            |             2009
      8 | JISW6779            |             2013
(5 rows)

The syntax and the logic of the query are the following. The query starts from the keyword SELECT which determines the type of statement. Therefore, this keyword is always required. The keyword is followed by the comma-separated list of the fields to be retrieved from the database. Instead of the list, it is possible to use an asterisk (*) which would mean that all the fields from the table are selected.

The name of the table is specified after the FROM keyword. It is possible to get the data from several tables at the same time. The filter criteria - predicate - is after the WHERE keyword. The sorting rule is at the end after ORDER BY. And the keyword LIMIT makes the database return not more than 5 rows even if the number of records in the table is larger.

These parts of the query—the keywords and the following expressions are called clauses, such as the FROM clause, the WHERE clause, and so on.

All of these clauses have their own purpose and logic. They must follow each other in the specific order. And none of them is mandatory. The simplified syntax diagram for the SELECT statement is as follows:

SELECT [DISTINCT | ALL]
  <expression>[[AS] <output_name>][, …]
[FROM <table>[, <table>… | <JOIN clause>…]
[WHERE <condition>]
[GROUP BY <expression>|<output_name>|<output_number>
  [,…]]
[HAVING <condition>]
[ORDER BY  <expression>|<output_name>|<output_number>
  [ASC | DESC] [NULLS FIRST | LAST] [,…]]
[OFFSET <expression>]
[LIMIT <expression>];

Some elements were not included here such as the WINDOW clause, the WITH clause or FOR UPDATE. A complete syntax diagram can be found in the documentation at http://www.postgresql.org/docs/current/static/sql-select.html.

Some of the omitted elements will be described in the next chapters.

There is no part of the SELECT statement that is always mandatory. For example, the query might be simpler if no ordering or filtering is needed:

SELECT * FROM car_portal_app.car;

Even the FROM clause is not mandatory. When one needs to evaluate an expression that does not take any data from the database, the query takes this form:

SELECT 1;
 ?column?
----------
        1

That can be considered as Hello world in SQL. Note that the FROM clause is optional in PostgreSQL but in other RDBMS such as Oracle the FROM keyword may be required.

Logically, the sequence of the operations performed by the SELECT query is as follows:

  • Take all the records from all the source tables. If there are subqueries in the FROM clause, they are evaluated first.
  • Build all possible combinations of those records and discard the combinations that do not follow the JOIN conditions or set some fields to NULL in case of outer joins.
  • Filter out the combinations that do not match the condition of the WHERE clause.
  • Build groups based on the values of the expressions of the GROUP BY list.
  • Filter the groups that match the HAVING conditions.
  • Evaluate expressions of the Select-list.
  • Eliminate duplicated rows if DISTINCT is specified.
  • Apply the set operations UNION, EXCEPT, or INTERSECT.
  • Sort rows according to the ORDER BY clause.
  • Discard records according to OFFSET and LIMIT.

In fact PostgreSQL optimizes that algorithm by performing the steps in a different order or even simultaneously. For example, if LIMIT 1 is specified, then it does not make sense to retrieve all the rows from the source tables, but only one that matches the WHERE condition.

Select-list

After the SELECT keyword, one should specify the list of fields or expressions to get from the database. This list is called Select-list. It defines the structure of the query result: the number, names, and type of the selected values.

Every expression in Select-list has a name in the output of the query. The names, when not provided by the user, are assigned automatically by the database and in most cases the name reflects the source of the data: a name of a column when a field from a table is selected, or a name of a function when one is used. In other cases, the name will look like ?column?. It is possible and in many cases, it totally makes sense to provide a different name to a selected expression. This is done using the AS keyword, like this:

SELECT car_id AS identifier_of_the_car ...

In the result, the column car_id from the table car will be named identifier_of_the_car. The keyword AS is optional. The output column name follows the rules for any SQL identifier. It is possible to use the same name for several columns in a single SELECT query. Double quoted names could be used, for example, when some report is generated by the SELECT query without any subsequent processing. In that case, it may make sense to use more human readable column names:

SELECT car_id "Identifier of the car" ...

In many cases, it is convenient to use an asterisk (*) instead of Select-list. Asterisk represents all the fields from all the tables specified in the FROM clause. And it is possible to use an asterisk for each table separately, like this:

SELECT car.*, car_model.marke ...

In this example, all fields are selected from table car and only one field—marke—from car_model.

It is considered a bad practice to use * in situations where the query is used in other code: in applications, stored procedures, view definitions, and so on. It is not recommended because in case of using *, the output format depends not on the code of the query but on the structure of the data. And if the data structure changes, the output format also changes which will break the application using it. But if you explicitly specify all the output fields in the query and afterwards add another column in the input table, this will not change the output of the query and will not break the application.

So, in our example instead of SELECT * ..., it is safer to use:

  SELECT car_id, number_of_owners, regestration_number, number_of_doors, car_model_id, mileage ...

SQL expressions

Expressions in the Select-list are called value expressions or scalar expressions. That is because each expression in the Select-list always returns only one value (but the value can be an array).

Scalar expressions can also be called SQL expressions or simply expressions.

Each expression in SQL has its data type. It is determined by the data type(s) of the input. In many cases, it is possible to explicitly change the type of the data. Each item of the Select-list becomes a column in the output dataset of a type that the corresponding expression has.

SQL expressions can contain:

  • Column names (most of the cases)
  • Constants
  • Operator invocations
  • Parenthesis to control operations precedence
  • Function calls
  • Aggregate expressions (we will discuss them later)
  • Scalar subqueries
  • Type casts
  • Conditional expressions

This list is not complete. There are several other cases of using SQL expressions that are not covered by this chapter.

Column names can be qualified and unqualified. Qualified means that the name of the column is preceded by the table name and optionally the schema name, all separated by the period (.) symbol. Unqualified are just names of the fields without table references. Qualified column names must be used when several tables in the FROM clause have columns with the same name. Unqualified naming in this case will cause an error "column reference is ambiguous". That means the database cannot understand which column is meant there. It is possible to use table alias instead of table name, and in case of using subqueries or functions, the alias must be used.

An example of using qualified names in Select-list is as follows:

SELECT car.car_id, car.number_of_owners
  FROM car_portal_app.car;

SQL supports all common operators as most of the other languages: logical, arithmetic, string, binary, date/time, and so on. We will discuss logical operators later in reference to SQL conditions. An examples of using arithmetic operators in expressions would be:

SELECT 1+1 AS two, 13%4 AS one, -5 AS minus_five,
  5! AS factorial, |/25 AS square_root;
  two | one | minus_five | factorial | square_root
-----+-----+------------+-----------+-------------
   2 |   1 |         -5 |       120 |            5

Also, in PostgreSQL, it is possible to create user-defined operators.

Function calls can also be a part of an SQL expression. To call an SQL function one should use its name and the arguments in parenthesis. Like this:

SELECT substring('this is a string constant',11,6);
 substring
-----------
 string

A function named substring was executed here. Three arguments were passed to the function: a string and two integers. This function extracts a part from, given string starting from the character specified by the second argument and having specified length.

By default, PostgreSQL assigns to the output column the same name as the function.

If a function has no arguments it is still necessary to use parenthesis to indicate that it is a function name and not a field name or another identifier or a keyword.

Another thing that makes SQL very flexible and powerful is scalar subqueries, which can be used as a part of the value expression. It allows the developer to combine the results of different queries together. Scalar subqueries or scalar queries are queries that return exactly one column and one or zero records. They have no special syntax and their difference from non-scalar queries is nominal.

Consider the example:

 SELECT (SELECT 1) + (SELECT 2) AS three;
 three
----------
        3

Here, the result of one scalar query which returns the value of 1 is added to the result of another scalar query returning 2. The result of the whole expression is 3.

Type casting means changing the data type of a value.

Type casts have several syntax patterns with the same meaning:

  • CAST ( <value> AS <type>)
  • <value>::<type>
  • <type> '<value>'
  • <type> (<value>)

The first is a common SQL syntax that is supported in most databases. The second is PostgreSQL specific. The third is only applicable for string constants is and usually used to define constants of other types but string or numeric. The last is function-like and can be applied only for types whose names are also existing function names, which is not very convenient. That's why this syntax is not widely used.

In many cases, PostgreSQL can do implicit type conversion. For example, the concatenation operator || takes two operands of type string. But if one tries to concatenate a string with a number, PostgreSQL will convert the number to a string automatically:

 SELECT 'One plus one equals ' || (1+1) AS str;
          str
-----------------------
 One plus one equals 2

A conditional expression is an expression returning different results depending on some condition. It is similar to an IFTHENELSE statement in other programming languages. The syntax is as follows:

CASE WHEN <condition1> THEN <expression1> [WHEN <condition2> THEN <expression2> ...] [ELSE <expression n>] END

The behavior is understandable from the syntax: if the first condition is met, then the result of the first expression is returned, if the second condition is met, then the second expression is used, and so on. If no condition is met, then the expression specified in the ELSE part is evaluated. Each <condition> is an expression returning Boolean (true or false) result. All expressions used after the keyword THEN should return result of the same type or at least of compatible types.

The number of condition-expression pairs should be one or more. ELSE is optional and when ELSE is not specified and the no condition's result is true, then the CASE expression returns NULL.

CASE can be used in any place where the SQL expression is used. CASE expressions can be nested, that is they can be put one inside another. The order of evaluating conditions is the same as specified in the expression. That means for any condition, it is known that all preceding conditions have returned false. And if any condition returns true, subsequent conditions are not evaluated at all.

There is a simplified syntax for the CASE expression. When all the conditions implement checking of equality of the same expression to several values, it is possible to use it like this:

CASE <checked_expression> WHEN <value1> THEN <result1> [WHEN <value2> THEN <result2> ...] [ELSE <result_n>] END

This means when the value of checked_expression is equal to value1, the result1 is returned, and so on.

This is an example of using CASE:

SELECT
  CASE
    WHEN now() > date_trunc('day', now()) +
      interval '12 hours'
    THEN 'PM'
    ELSE 'AM' END;
 case
------
 PM

Here, the current time is compared to midday (the current time is truncated to day, which gives midnight and then time interval of 12 hours is added). When the current time is more than midday, the expression returns the sting PM, otherwise it returns AM.

A single SQL expression can have many operators, functions, type casts, and so on. The length of an SQL expression has no limits in language specification. The Select-list is not the only place where SQL expressions can be used. In fact, they are used almost everywhere in SQL statements. For example, one can order the results of the query based on some SQL expression, as a sorting key. In an INSERT statement they are used to calculate values of the fields for newly inserted records. SQL expressions that return Boolean values are used as conditions.

PostgreSQL supports the short-circuit evaluation of the expressions and it sometimes it skips the evaluation of the part of the expression when it does not affect the result. For example, when evaluating the expression false AND z(), PostgreSQL will not call the function because the result of the AND operator is false, regardless of the result of the z() function call.

DISTINCT

Another thing related to the Select-list is pair of keywords DISTINCT and ALL, which can be used right after the SELECT keyword. When DISTINCT is specified, only unique rows from the input dataset will be returned. ALL returns all the rows, this is the default.

Consider the examples:

SELECT ALL marke FROM car_portal_app.car_model;
     marke
---------------
 Audi
 Audi
 Audi
 Audi
 BMW
 BMW
...
(99 rows)
and
SELECT DISTINCT marke FROM car_portal_app.car_model;
     marke
---------------
 Ferrari
 GMC
 Citroen
 UAZ
 Audi
 Volvo
...
(25 rows)

The input in both cases is the same: the table with the car models. But the first query returned 99 records while the second only returned 25. This is because the first returned all the rows that were in the input table. The second query selected only the unique rows. DISTINCT removes duplicate records based on the Select-list, not on the data in the input table. For example, if only the first letter of the name of the manufacturer is selected the result will be even shorter because some names start from the same letter:

SELECT DISTINCT substring(marke, 1, 1) FROM car_portal_app.car_model;
 substring
-----------
 H
 S
 C
 J
 L
 I
...
(21 rows)

DISTINCT also works when several columns are selected. In that case, DISTINCT will remove duplicated combinations of all the column values.

FROM clause

The source of the rows for the query is specified after the FROM keyword. It is called the FROM clause. The query can select rows from zero, one, or more sources. When no source is specified, the FROM keyword should be omitted. The source of the rows for the query can be any of:

  • Table
  • View
  • Function
  • Subquery
  • Values

When multiple sources are specified, they should be separated by a comma, or the JOIN clause should be used.

It is possible to set aliases for tables in the FROM clause. The optional keyword AS is used for that:

 SELECT a.car_id, a.number_of_doors
  FROM car_portal_app.car AS a;
 car_id | number_of_doors
--------+-----------------
      1 |               5
      2 |               3
      3 |               5
...

In the preceding example, the table name car_portal_app.car was substituted by its alias a in the Select-list. If an alias is used for a table or view in the FROM clause, in the Select-list (or anywhere else) it is no longer possible to refer to the table by its name. Subqueries when used in the FROM clause must have an alias. Also aliases are often used when self-join is performed, which means using the same table several times in the FROM clause.

Selecting from multiple tables

It is possible to select records from several sources at a time. Consider the following examples. There are two tables each having three rows:

 SELECT * FROM car_portal_app.a;
 a_int | a_text
-------+--------
     1 | one
     2 | two
     3 | three
(3 rows)

SELECT * FROM car_portal_app.b;
 b_int | b_text
-------+--------
     2 | two
     3 | three
     4 | four
(3 rows)

When records are selected from both of them, we get the combinations of all their rows :

 SELECT * FROM car_portal_app.a, car_portal_app.b;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     1 | one    |     2 | two
     1 | one    |     3 | three
     1 | one    |     4 | four
     2 | two    |     2 | two
     2 | two    |     3 | three
     2 | two    |     4 | four
     3 | three  |     2 | two
     3 | three  |     3 | three
     3 | three  |     4 | four
(9 rows)

All possible combinations of records are called Cartesian product and usually it does not make much sense. In most cases, the user is interested in certain combinations of rows, when rows from one table match rows from another table based on some criteria. For example, it may be necessary to select only the combinations when the integer fields of both the tables have equal values. To get it the query should be changed:

SELECT *
  FROM car_portal_app.a, car_portal_app.b
  WHERE a_int=b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     2 | two    |     2 | two
     3 | three  |     3 | three
(2 rows)

This condition (a_int=b_int) join the tables. The joining conditions could be specified in the WHERE clause but in most cases it is better to put them into a FROM clause to make it more visible that they are here for joining and not for filtering the result of the join, though there is no formal difference.

The JOIN keyword is used to add join conditions to the FROM clause. The following query has the same logic and the same results as the previous one:

SELECT *
FROM car_portal_app.a JOIN car_portal_app.b ON a_int=b_int;

The JOIN condition may be specified using any of these three ways:

<first table> JOIN <second table> ON <condition>

The condition can be any. It's even not necessary to include fields from the joined tables.

<first table> JOIN <second table> USING (<field list>)

The join is based on the equality of all the fields specified in the comma separated <field list>. The fields should exist in both tables with the same name. So this syntax is not flexible enough.

<first table> NATURAL JOIN <second table>

Here, the join is based on the equality of all the fields that have the same name in both tables.

Usage of USING or NATURAL JOIN syntax has a drawback which is similar to the usage of * in the Select-list. It is possible to change the structure of the tables, for example, by adding another column or renaming them, in a way which does not make the query invalid but changes the logic of the query. This will cause errors that are very difficult to find.

What if not all rows from the first table can be matched to a row in the second table? In our example, only rows with integer values 2 and 3 exist in both tables. And when we join on the condition a_int=b_int, only those two rows are selected from the tables. The rest of the rows are not selected. This kind of the join is called inner join. It can be shown as a filled area on the diagram, like this:

Selecting from multiple tables

Inner join

When all the records from one table are selected, regardless of the existence of matching records in the other table, it is called an outer join. There are three types of outer joins. Look at the following diagrams:

Selecting from multiple tables

Left outer join

If all records are selected from the first table, along with only those records that match the joining condition from the second—it is a left outer join.

Selecting from multiple tables

Right outer join

When all records from the second table are selected , along with only the matching records from the first table—it is a right outer join.

Selecting from multiple tables

Full outer join

And when all the records from both tables are selected—it is a full outer join.

In SQL syntax, the words inner and outer are optional. Consider the following code example:

SELECT *
FROM car_portal_app.a JOIN car_portal_app.b ON a_int=b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     2 | two    |     2 | two
     3 | three  |     3 | three
(2 rows)

SELECT *
FROM car_portal_app.a LEFT JOIN car_portal_app.b ON a_int=b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     1 | one    |       |
     2 | two    |     2 | two
     3 | three  |     3 | three
(3 rows)

SELECT *
FROM car_portal_app.a RIGHT JOIN car_portal_app.b ON a_int=b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     2 | two    |     2 | two
     3 | three  |     3 | three
       |        |     4 | four
(3 rows)

SELECT *
FROM car_portal_app.a FULL JOIN car_portal_app.b ON a_int=b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     1 | one    |       |
     2 | two    |     2 | two
     3 | three  |     3 | three
       |        |     4 | four
(4 rows)

Note that the Cartesian product is not the same as the result of the full outer join. Cartesian product means all possible combinations of all records from the tables without any specific matching rules. Full outer join returns pairs of records when they that match the join conditions. And the records that do not have a pair in other table are returned separately. Outer joins return empty values (NULLs) in columns that correspond to the table where no matching record is found.

Since it is possible to query not only tables but also views, functions and subqueries, it is also possible to join them using the same syntax as for joining tables:

SELECT *
  FROM car_portal_app.a
  INNER JOIN
    (SELECT * FROM car_portal_app.b WHERE b_text = 'two') subq
    ON a.a_int=subq.b_int;
 a_int | a_text | b_int | b_text
-------+--------+-------+--------
     2 | two    |     2 | two

In the example, the subquery got the alias sq and it was used in the join condition.

Also, it is possible to join more than two tables. In fact, every join clause joins all the tables before the JOIN keyword with one table right after the keyword.

For example, this is correct:

SELECT *
  FROM table_a
    JOIN table_b ON table_a.field1=table_b.field1
    JOIN table_c ON table_a.field2=table_c.field2
      AND table_b.field3=table_c.field3;

At the moment of JOIN table_c, the table table_a has been mentioned already in the FROM clause, therefore it is possible to refer to that table.

However, this is not correct:

SELECT *
  FROM table_a
    JOIN table_b ON table_b.field3=table_c.field3
    JOIN table_c ON table_a.field2=table_c.field2

The code will cause an error because at JOIN table_b, the table table_c has not been there yet.

The Cartesian product can also be implemented using the JOIN syntax. The keywords CROSS JOIN are used for that. The following code:

SELECT * FROM car_portal_app.a CROSS JOIN car_portal_app.b;

Is equivalent to:

SELECT * FROM car_portal_app.a, car_portal_app.b;

The join condition in INNER JOIN in the logic of the query has the same meaning as a condition for filtering the rows in the WHERE clause. So the two following queries are the same:

SELECT * FROM car_portal_app.a INNER JOIN car_portal_app.b ON a.a_int=b.b_int;
SELECT * FROM car_portal_app.a, car_portal_app.b WHERE a.a_int=b.b_int;

But this is not the case for outer joins. There is no way to implement an outer join with the WHERE clause in PostgreSQL, though it may be possible in other databases.

Self-joins

It is possible to join a table with itself. It is called self-join. Self-join has no special syntax. In fact, all the data sources in the query are considered as independent even though they could be the same. Suppose one wants to know about each record of a table and how many records exist with a bigger value than the field a_int. The following query can be used for that:

SELECT t1.a_int AS current, t2.a_int AS bigger
  FROM car_portal_app.a t1
    INNER JOIN car_portal_app.a t2
      ON t2.a_int > t1.a_int;
 current | bigger
--------+--------
      1 |      2
      1 |      3
      2 |      3
(3 rows)

The table a is joined to itself. From the logic of the query, it does not matter if two different tables are joined or if the same table is used twice. To be able to reference the fields of the table and distinguish the instances of the table, the table aliases are used. The first instance is called t1 and the second t2. From the results, it is visible that for the value 1 there are two bigger values: 2 and 3; and for the value 2 only one bigger value of 3 exists. The examined value is in the column named current and the bigger values are in the column of the same name.

The value 3 is not selected because there are no values bigger than 3. But if one wants to explicitly show that, LEFT JOIN is used:

SELECT t1.a_int AS current, t2.a_int AS bigger
  FROM car_portal_app.a t1
    LEFT JOIN car_portal_app.a t2
      ON t2.a_int > t1.a_int;
 current | bigger
--------+--------
      1 |      2
      1 |      3
      2 |      3
      3 |
(4 rows)

WHERE clause

In many cases, after the rows are taken from the input tables, they should be filtered. It is done via the WHERE clause. The filtering condition is specified after the WHERE keyword. The condition is an SQL expression returning a Boolean value. That's why the syntax of the WHERE condition is the same as in the expressions in the Select-list. This is specific for PostgreSQL. In other databases, it may not be possible to use Boolean values in the Select-list, which makes SQL conditions different from SQL expressions. But in PostgreSQL, the difference is only in the data type of a returned value.

Simple examples of the WHERE conditions can be:

SELECT * FROM car_portal_app.car_model
  WHERE marke='Peugeot';
SELECT * FROM car_portal_app.car WHERE mileage < 25000;
SELECT * FROM car_portal_app.car
  WHERE number_of_doors > 3
    AND number_of_owners <= 2;

Although there is no formal difference between SQL expressions and SQL conditions in PostgreSQL, usually all expressions returning a Boolean type are called conditional expressions or just conditions. And they are mostly used in the WHERE clause, in conditional expressions like CASE or in the JOIN clause.

Logical operators are usually used in conditional expressions. They are AND, OR and NOT. They take Boolean arguments and return Boolean values. Logical operators are evaluated in the following order: NOT, AND, OR. But they have lower priority than any other operators. PostgreSQL tries to the optimize evaluation of logical expressions. For example, when the OR operator is evaluated, and it is known that the first operand has the true value, PostgreSQL may not evaluate the second operand at all, because the result of OR is already known. For that reason, PostgreSQL may change the actual order of evaluating expressions on purpose to get the results faster.

Sometimes that might cause problems. For example, it is not possible to divide by zero, and in case one wanted to filter rows based on the result of division, this would not be correct:

SELECT * FROM t WHERE a<>0 AND b/a>0.5

Because it is not guaranteed that PostgreSQL will evaluate the first condition (a<>0) before the second, and in case a = 0, this could cause an error. To be secure one should use a CASE statement because the order of evaluation of CASE conditions is determined by the statement:

SELECT * FROM t
  WHERE CASE WHEN a=0 THEN false ELSE b/a>0.5 END

There are some other operators or expressions returning Boolean values that are used in conditional expressions:

  • Comparison operators
  • Pattern matching operators
  • The OVERLAPS operator
  • Row and array comparison constructs
  • Subquery expressions
  • Any function returning Boolean or convertible to Boolean values

As if in the Select-list, functions can be used in the WHERE clause as well as anywhere in the expression. Suppose one wants to search for car models whose name is four letters long. This can be done by using a length function:

SELECT * FROM car_portal_app.car_model
  WHERE length(model)=4;
 car_model_id |   marke    | model
--------------+------------+-------
           47 | KIA        | Seed
           57 | Nissan     | GT-R
           70 | Renault    | Clio
...

Comparison operators

Comparison operators are < (less), > (more), <= (equal or less), >= (equal or more), = (equal) and <> or != (not equal—those two are synonyms). These operators can compare not only numbers but any values that can be compared, for example dates, or strings.

There is a BETWEEN construct that also relates to comparing:

x BETWEEN a AND b

The preceding code is equivalent to:

x>=a AND a<=b

The OVERLAPS operator checks to see if two ranges of dates overlap or not. An example would be:

SELECT 1 WHERE (date '2014-10-15', date '2014-10-31')
  OVERLAPS (date '2014-10-25', date '2014-11-15');
 ?column?
----------
        1

Comparison operators have different precedence: >= and <= have the highest priority. Then comes BETWEEN, then OVERLAPS, then < and >. And = has the lowest priority.

Pattern matching

Pattern matching is always about strings. There are two similar operators: LIKE and ILIKE. They check if a string matches a given pattern. Only two wildcards can be used in the pattern: underscore (_) for exactly one character (or number) and percent sign (%) for any number of any characters, including an empty string.

LIKE and ILIKE are the same except that the first is case-sensitive and second is not.

For example, to get car models whose names start with s and have exactly four characters, one can use the following query:

SELECT * FROM car_portal_app.car_model
  WHERE model ILIKE 's_ _ _';
 car_model_id |   marke    | model
--------------+------------+-------
           47 | KIA        | Seed

There are another two pattern matching operators: SIMILAR and ~. They check for pattern matching using regular expressions. The difference between them is that SIMILAR uses regular expression syntax defined in SQL standard, while ~ uses Portable Operating System Interface (POSIX) regular expressions.

In this example, one selects all car models whose names consist of exactly two words:

 SELECT * FROM car_portal_app.car_model
  WHERE model ~ '^w+W+w+$';
 car_model_id |     marke     |    model
--------------+---------------+--------------
           21 | Citroen       | C4 Picasso
           33 | Ford          | C-Max
           34 | Ford          | S-Max
...

Row and array comparison constructs

Row and array comparison constructs are used to make multiple comparisons between values, groups of values and arrays.

The expression IN is used for checking if a value equals to any of the values from a list. The expression is as follows:

a IN (1, 2, 3)

The preceding code will return true if a equals to 1, 2 or 3. It is a shorter way of implementing the following:

(a = 1 OR a = 2 OR a = 3)

SQL allows the use of array types that mean several elements as a whole in one single value. This can be used for enriching comparison conditions. For example, this checks if a is bigger than any of x, y or z:

a > ANY (ARRAY[x, y, z])

The preceding code is equivalent to:

(a >  x OR a > y OR a > z)

This checks if a is bigger than either x, y, or z:

a >  ALL (ARRAY[x, y, z])

The preceding code is equivalent to:

(a > x AND a > y AND a > z )

The keywords IN, ALL, and ANY (which has a synonym SOME) can also be used in subquery expressions with the same logic. A result of subquery can be used in any place where it is possible to use a set of values or an array. This makes it possible, for instance, to select records from one table, when some values exist in an other table. For example, here car models are selected when there is a car of that model:

SELECT * FROM car_portal_app.car_model
  WHERE car_model_id IN
    (SELECT car_model_id FROM car_portal_app.car);
 car_model_id |     marke     |    model
--------------+---------------+--------------
            2 | Audi          | A2
            3 | Audi          | A3
            4 | Audi          | A4
...
(86 rows)

Sometimes an IN expression can be replaced by inner join, but not always. Consider the example:

SELECT car_model.*
  FROM car_portal_app.car_model
    INNER JOIN car_portal_app.car USING (car_model_id);

 car_model_id |     marke     |    model
--------------+---------------+--------------
            2 | Audi          | A2
            2 | Audi          | A2
            2 | Audi          | A2
            3 | Audi          | A3
            3 | Audi          | A3
            4 | Audi          | A4
            4 | Audi          | A4
...
(229 rows)

Although the same table is queried and the same columns are returned, the number of records is bigger. That is because there are many cars of the same model, and for them, the model is selected several times.

The NOT IN construct with a subquery is sometimes very slow because the check for the nonexistence of a value is more expensive than the opposite.

Grouping and aggregation

In all previous examples,the number of records returned by the SELECT query is the same as the number of rows from the input table (or tables) after filtering. In other words, every row from any of the source table (or joined tables) becomes exactly one row in the query result. Rows are processed one by one.

But SQL provides a way to get some aggregated result of processing several records at a time and get the result in one single row. The easiest example would be counting the total number of records in the table. The input is all the records of a table. The output is one single record. Grouping and aggregation is used for this.

GROUP BY clause

The GROUP BY clause is used for grouping. Grouping means splitting the whole input set of records into several groups with a view to have only one result row for each group. Grouping is performed on the basis of a list of expressions. All records that have the same combination of values of grouping expressions are grouped together. That means the groups are identified by the values of expressions defined in the GROUP BY clause. Usually, it makes sense to include those expressions in the Select-list, in order to make it visible which group is referred to by the result row.

For example:

SELECT a.marke, a.model
  FROM car_portal_app.car_model a
    INNER JOIN car_portal_app.car b
      ON a.car_model_id=b.car_model_id
  GROUP BY a.marke, a.model;
     marke     |    model
---------------+--------------
 Opel          | Corsa
 Ferrari       | 458 Italia
 Peugeot       | 308
 Opel          | Insignia
...
(86 rows)

Here, the list of all the car models that are used in the table car is selected. Each record in the result set represents a group of records from the source tables relating to the same car model. In fact, this query gives the same result as SELECT DISTINCT marke, model... without GROUP BY but the logic is different. DISTINCT removes duplicated values, but GROUP BY groups duplicated values together.

It is almost useless just to group rows. Usually, it is necessary to do some computing on the groups. In the last case, it would be interesting to know how many cars of which model are in the system. This is done by aggregation. Aggregation means performing calculation, on a group of records returning a single value for the whole group. This is done by the special aggregating functions that are used in the Select-list. To get the number of cars, one needs to use the count function:

SELECT a.marke, a.model, count(*)
  FROM car_portal_app.car_model a
    INNER JOIN car_portal_app.car b
      ON a.car_model_id=b.car_model_id
  GROUP BY a.marke, a.model;
     marke     |    model     | count
---------------+--------------+-------
 Opel          | Corsa        |     6
 Ferrari       | 458 Italia   |     4
 Peugeot       | 308          |     3
 Opel          | Insignia     |     4
...
(86 rows)

There are several aggregating functions available in PostgreSQL. The most frequently used are count, sum, max, min, and avg for computing, respectively the number of records in the group, the sum of any numeric expression for all the records in the group, the biggest, the lowest and average value of any expression. There are some other aggregating functions such as corr, which computes the correlation coefficient of the two given arguments, stddev—for standard deviation, string_agg, which concatenates the string values of an expression, and others.

When grouping and aggregation is used, the records are grouped. This means several records become one. Therefore, no other expressions except the aggregation functions and expressions from the GROUP BY list can be included in the Select-list. If it is done, the database will raise an error:

SELECT a_int, a_text
  FROM car_portal_app.a
  GROUP BY a_int;
ERROR:  column "a.a_text" must appear in the GROUP BY clause or be used in an aggregate function

But it is possible to create new expressions based on the expressions from the GROUP BY list. For example, if we have GROUP BY a, b, it is possible to SELECT a+b.

What if it is needed to group all of the records of the table together, not on the basis of the values of some field, but the whole table? To do that one should include aggregating functions in the Select-list (and only them!) and not use the GROUP BY clause:

SELECT count(*) FROM car_portal_app.car;
 count
-------
   229
(1 row)

If all the records of a table are grouped without any GROUP BY expressions, then exactly one group exists. Note that the SQL queries that have aggregating functions in the Select-list and do not have the GROUP BY clause always return exactly one row, even if there are no rows in the input tables, or if all of them are filtered out. For example:

SELECT count(*)
FROM car_portal_app.car WHERE number_of_doors = 15;
 count
-------
     0
(1 row)

There are no cars with 15 doors. If the table is selected with this WHERE condition, no rows will be returned. But if one uses count(*), the aggregating function will return a row with a value of zero.

It is possible to count the number of unique values of the expression with count(DISTINCT <expression>). For example:

SELECT count(*), count(DISTINCT car_model_id)
  FROM car_portal_app.car;
 count | count
-------+-------
   229 |    86

Here, the first column has the total number of cars. The second column is the number of the car models to which the cars belong. Since some cars are of the same model, the number of models is lower than the total number of the cars.

HAVING clause

Aggregating functions are not allowed in the WHERE clause but it is possible to filter groups that follow a certain condition. This is different from filtering in the WHERE clause because WHERE filters input rows, and groups are calculated afterwards. The filtering of the groups is done by the HAVING clause. That is very similar to the WHERE clause but only aggregating functions are allowed there. The HAVING clause is specified after the GROUP BY clause. Suppose one needs to know which car models have a number of cars greater than 5. This can be done by using a subquery:

SELECT marke, model FROM
(
  SELECT a.marke, a.model, count(*) c
    FROM car_portal_app.car_model a
      INNER JOIN car_portal_app.car b
        ON a.car_model_id=b.car_model_id
    GROUP BY a.marke, a.model
) subq
WHERE c >5;
  marke  | model
---------+-------
 Opel    | Corsa
 Peugeot | 208
(2 rows)

But the simpler and clearer way is to do it via a HAVING clause:

SELECT a.marke, a.model
  FROM car_portal_app.car_model a
    INNER JOIN car_portal_app.car b
      ON a.car_model_id=b.car_model_id
  GROUP BY a.marke, a.model
  HAVING count(*)>5;
  marke  | model
---------+-------
 Opel    | Corsa
 Peugeot | 208

Ordering and limiting the results

The results of the query are not ordered by default. The order of the rows is not defined and may depend on their physical place on the disc, the joining algorithm, or on other factors. But in many cases it is required to have the result set sorted. This is done with the ORDER BY clause. The list of expressions whose values should be sorted is specified after the ORDER BY keywords. At the beginning, the records are sorted on the basis of the first expression of the ORDER BY list. If some rows have the same value for the first expression, they are sorted by the values of the second expression, and so on.

After each item of the ORDER BY list it is possible to specify if the order should be ascending or descending. That is done by specifying the keywords ASC or DESC after the expression. Ascending is the default. NULL values are considered larger than any other values by default but it is possible to explicitly define that NULLs should precede other rows by specifying NULLS FIRST or NULLS LAST if NULLs should be at the end.

It is not required for the ORDER BY clause to contain the same expressions as the Select-list but it usually does. So, to make it more convenient, it is allowed to use in the ORDER BY list the output column names that are assigned to the expression in the Select-list instead of fully qualified expressions. Also, it is possible to use the numbers of the columns.

So, these examples are equivalent:

SELECT number_of_owners, manufacture_year,
    trunc(mileage/1000) as kmiles
  FROM car_portal_app.car
  ORDER BY number_of_owners, manufacture_year,
    trunc(mileage/1000) DESC;

SELECT number_of_owners, manufacture_year,
    trunc(mileage/1000) as kmiles
  FROM car_portal_app.car
  ORDER BY number_of_owners, manufacture_year,
    kmiles DESC;

SELECT number_of_owners, manufacture_year,
    trunc(mileage/1000) as kmiles
  FROM car_portal_app.car
  ORDER BY 1, 2, 3 DESC;

Sometimes it is necessary to limit the output of the query to a certain number of rows and discard the rest. That is done by specifying that number after the LIMIT keyword. For example:

SELECT * FROM car_portal_app.car_model LIMIT 5;
 car_model_id | marke | model
--------------+-------+-------
            1 | Audi  | A1
            2 | Audi  | A2
            3 | Audi  | A3
            4 | Audi  | A4
            5 | Audi  | A5
(5 rows)

The preceding code returns only 5 rows regardless of the fact that the actual number of records in the table is bigger. This is sometimes used in scalar subqueries that should not return more than 1 record.

Another similar task is to skip several rows at the beginning of the output. This is done by the using keyword OFFSET. OFFSET and LIMIT can be used together:

SELECT * FROM car_portal_app.car_model OFFSET 5 LIMIT 5;
 car_model_id | marke | model
--------------+-------+-------
            6 | Audi  | A6
            7 | Audi  | A8
            8 | BMW   | 1er
            9 | BMW   | 3er
           10 | BMW   | 5er
(5 rows)

The typical use case for OFFSET and LIMIT is the implementation of paginated output in web applications. For example, if ten rows are displayed on a page, then on third page the rows 21-30 should be shown. Then OFFSET 20 LIMIT 10 is used. In most cases of using OFFSET and LIMIT, the rows should be ordered, otherwise it is not determined which records are not shown. The keywords are then specified after the ORDER BY clause.

Subqueries

Subqueries are a very powerful feature of SQL. They can be used almost everywhere in the query. The most obvious way to use subqueries is in the FROM clause as a source for the main query:

SELECT * FROM
(
  SELECT car_model_id, count(*) c
    FROM car_portal_app.car
    GROUP BY car_model_id
) subq
WHERE c = 1;
 car_model_id | c
--------------+---
            8 | 1
           80 | 1
...
(14 rows)

When subqueries are used in the FROM clause they must have an alias. In the preceding example, the subquery is given the name subq.

Also, subqueries can be used in SQL conditions in the IN expressions:

SELECT car_id, registration_number
FROM car_portal_app.car
WHERE car_model_id IN
(
  SELECT car_model_id
    FROM car_portal_app.car_model
    WHERE marke='Peugeot'
);
 car_id | registration_number
--------+---------------------
      1 | MUWH4675
     14 | MTZC8798
     18 | VFZF9207
...
(18 rows)

Scalar subqueries can be used everywhere in expressions: in the Select-list, WHERE clause, GROUP BY clause, and so on. Even in LIMIT:

SELECT (SELECT count(*) FROM car_portal_app.car_model)
FROM car_portal_app.car
LIMIT (SELECT MIN(car_id)+2 FROM car_portal_app.car);
 count
-------
    99
    99
    99
(3 rows)

This is a PostgreSQL specific feature. Not every RDBMS supports subqueries in every place where expression is allowed.

It is not possible to refer to the internal elements of one subquery from inside of another. But subqueries can refer to the elements of main query. For example, if it is necessary to count cars for each car model, and select the top five most popular models, it can be done by using subquery in this way:

SELECT marke, model,
  (
    SELECT count(*)
      FROM car_portal_app.car
      WHERE car_model_id = main.car_model_id
  )
  FROM car_portal_app.car_model main
  ORDER BY 3 DESC
  LIMIT 5;
  marke  |  model   | count
---------+----------+-------
 Peugeot | 208      |     7
 Opel    | Corsa    |     6
 Jeep    | Wrangler |     5
 Renault | Laguna   |     5
 Peugeot | 407      |     5
(5 rows)

In the example the subquery in the Select-list refers to the table of the main query by its alias main. The subquery is executed for each record received from the main table, using the value of car_model_id in the WHERE condition.

Subqueries can be nested. It means it is possible to use subqueries inside another subqueries.

Set operations – UNION, EXCEPT, and INTERSECT

Set operations are used to combine the results of several queries. It is different from joining, although the same results often can be achieved by joining.

Joining means placing the records of two tables, one table with another table horizontally. The result of joining is that the number of columns equal to the sum of the numbers of columns of the source tables, and the number of records will depend on the join conditions.

Combining means putting the result of one query on top of the result of another query. The number of columns stays the same, but the number of rows is the sum of the rows from the sources.

There are three set operations:

  • UNION: This appends the result of one query to the result of another query
  • INTERSECT: This returns the records that exist in the results of both queries
  • EXCEPT: This returns the records from the first query that do not exist in the result of the second query—the difference

The syntax of set operations is as follows:

<query1> UNION <query2>;
<query1> INTERSECT <query2>;
<query1> EXCEPT <query2>;

It is possible to use several set operations in one statement:

SELECT a, b FROM t1
UNION
SELECT c, d FROM t2
INTERSECT
SELECT e, f FROM t3;

The priority of all set operations is the same. That means logically they are executed in the same order as used in the code. But the records can be returned in a different order that is not predicted, unless the ORDER BY clause is used. In this case, the ORDER BY clause is applied after all of the set operations. And for that reason, it does not make sense to put ORDER BY into the subqueries.

All set operations by default remove duplicated records as if SELECT DISTINCT is used. To avoid this and return all the records, the ALL keyword should be used, which is specified after the name of the set operation:

<query1> UNION ALL <query2>.

The set operations can be used to find the difference between the two tables. For example:

SELECT 'a', * FROM
(
  SELECT * FROM car_portal_app.a
  EXCEPT ALL
  SELECT * FROM car_portal_app.b
) v1
UNION ALL
SELECT 'b', * FROM
(
  SELECT * FROM car_portal_app.b
  EXCEPT ALL
  SELECT * FROM car_portal_app.a
) v2;
 ?column? | a_int | a_text
----------+-------+--------
 a        |     1 | one
 b        |     4 | four
(2 rows)

From the results of that query, you can find out that the row one exists in the table a but does not exist in the table b. And the row four exists in the table b, but not in a.

It is possible to append one set of records to another only when they have the same number of columns and they have respectively the same data types, or compatible data types. The output names for the columns are always taken from the first subquery, even if they are different in subsequent queries.

In other RDBMS, set operations can have different names, for example, in Oracle EXCEPT is called MINUS.

Dealing with NULLs

NULL is a special value that any field or expression can have, except for when it is explicitly forbidden. NULL means the absence of any value. It can also be treated as an unknown value in some cases. In relation to logical values, NULL is neither true nor false. Working with NULLs can be confusing because almost all operators when take NULL as an argument, return NULL. And if one tries to compare some values and one of them is NULL the result will also be NULL, which is not true. For example, the query

SELECT * FROM t WHERE a > b

This will return no records if a or b have NULL value. That can be expected, but for the following condition this is not so clear:

SELECT * FROM t WHERE a = b

Here, if both a and b have a value of NULL, the record still will not be selected. The equal operator (=) returns NULL if any of the arguments is NULL, and since WHERE filters rows only when the condition is true, the record is not selected. Similarly, the following will never return any record, even if a has a NULL value:

SELECT * FROM t WHERE a = NULL

To check the expression for having NULL value, a special operator is used: IS NULL. The previous query, if it is necessary to also select records when both a and b are NULL, should be changed this way:

SELECT * FROM t WHERE a = b OR (a IS NULL AND b IS NULL)

There is a special construct, that can be used to check the equivalence of expressions taking NULL into account: IS NOT DISTINCT FROM. The preceding example can be implemented in the following way, which has the same logic:

SELECT * FROM t WHERE a IS NOT DISTINCT FROM b

Logical operators are different. They can sometimes return not null value even if they take NULL as an argument. For logical operators NULL means an unknown value. Operator AND always returns false when one of the operands is false, even if the second is NULL. Or always returns true if one of the arguments if true. In all other cases, the result is unknown, therefore NULL:

SELECT true AND NULL, false AND NULL, true OR NULL,
  false OR NULL, NOT NULL;
 ?column? | ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------+----------
          | f        | t        |          |

Functions can treat NULL values differently. Their behavior is determined by their code. Most built-in functions return NULL if any of the arguments are NULL.

Aggregating functions work with NULL values in a different way. They work with many rows and therefore many values. In general, they ignore NULL. Sum calculates the total of all not-null values and ignores NULL. Sum returns NULL only when all the received values are NULL. For avg, max and min it is the same. But for count it is different. count returns the number of not-null values. So if all the values are NULL, count returns zero.

In contrast to some other databases in PostgreSQL, empty string is not NULL. Consider the example:

SELECT a IS NULL, b IS NULL, a = b
  FROM (SELECT ''::text a, NULL::text b) v;
?column? | ?column? | ?column?
----------+----------+----------
 f        | t        |

There are a couple of functions designed to deal with NULL values: COALESCE and NULLIF.

The COALESCE function takes any number of arguments of the same data type or compatible types. It returns the value of the first of its arguments that is not null:

COALESCE(a, b, c)

The preceding code is equivalent to:

CASE
  WHEN a IS NOT NULL THEN a
  WHEN b IS NOT NULL THEN b
  ELSE c END

NULLIF takes two arguments and returns NULL if they are equal. Otherwise it returns the value of the first argument. This is somehow the opposite of COALESCE:

NULLIF (a, b)

The preceding code is equivalent to:

CASE WHEN a = b THEN NULL ELSE a END

Another aspect of NULL values is that they are ignored by unique constraints. This means if a field of a table is defined as unique, it is still possible to create several records having a NULL value of that field. Also b-tree indexes which are most commonly used, do not index NULL values. Consider the query as follows:

SELECT * FROM t WHERE a IS NULL

The preceding code will not use an index if it is created on the column a.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.188.67