An index is a physical database object that is defined on a table column or a list of columns. PostgreSQL provides many index types and several ways to use them. In general, indexes serve two purposes: validating constraints and retrieving data quickly.

A UNIQUE constraint creates a unique index on the constrained columns behind the scenes. Let us take a look at the account_history table in the car web portal example. The unique constraint, UNIQUE (account_id, search_key, search_date), has two purposes: the first is to enforce validity, so that a search key is inserted into the table only once per day, even if the user searches for the same key several times; the second is to retrieve the data quickly. Let us assume that we would like to show the user's last 10 searches. The query for performing this is as follows:
SELECT search_key FROM account_history WHERE account_id = <account>
GROUP BY search_key ORDER BY max(search_date) LIMIT 10;
The preceding query returns at most 10 records containing the distinct search_key values, ordered by the most recent search_date. If we assume that the account_history table contains millions of rows, then reading all of the data would take a long time. Here, however, that is not necessary, because the unique index lets PostgreSQL read only the rows belonging to this particular account. To understand how PostgreSQL uses indexes, let us populate the table with fictional data:
WITH test_account AS (
  INSERT INTO account VALUES
    (1000, 'test_first_name', 'test_last_name', '[email protected]', 'password'),
    (2000, 'test_first_name2', 'test_last_name2', '[email protected]', 'password2')
  RETURNING account_id
), car AS (
  SELECT i AS car_model FROM (VALUES ('brand=BMW'), ('brand=WV')) AS foo(i)
), manufacturing_date AS (
  SELECT 'year=' || i AS date FROM generate_series(2015, 2014, -1) AS foo(i)
)
INSERT INTO account_history (account_id, search_key, search_date)
SELECT account_id, car.car_model || '&' || manufacturing_date.date, current_date
FROM test_account, car, manufacturing_date;
To understand how the data is generated, note that we simply created a cross join, as discussed in Chapter 1, Relational Databases. Since we have two brands and two manufacturing dates, four combinations are generated. To see the generated data, let us execute the preceding query for the test account with account_id equal to 1000:
car_portal=# SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY (search_key) ORDER BY max(search_date) limit 10;
     search_key
---------------------
 brand=BMW&year=2014
 brand=WV&year=2014
 brand=BMW&year=2015
 brand=WV&year=2015
(4 rows)
For small data, PostgreSQL prefers to perform a sequential scan simply due to the execution cost. On a small data set, a sequential scan is faster than an index scan. To get the execution plan, one can execute the following command:
car_portal=# EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit  (cost=38.64..38.67 rows=10 width=23)
   ->  Sort  (cost=38.64..39.43 rows=313 width=23)
         Sort Key: (max(search_date))
         ->  HashAggregate  (cost=28.75..31.88 rows=313 width=23)
               ->  Seq Scan on account_history  (cost=0.00..25.63 rows=625 width=23)
                     Filter: (account_id = 1000)
(6 rows)
To generate a bigger data set, one can repeat the preceding INSERT statement with more accounts, more brands, and a wider range of manufacturing dates, for example:

... (5000, 'test_first_name5', 'test_last_name5', '[email protected]', 'password5') ...
... SELECT i AS car_model FROM (VALUES ('brand=BMW'), ('brand=WV'), ('brand=Opel'), ('brand=Fiat'), ('brand=Siat')) AS foo(i) ...
... SELECT 'year=' || i AS date FROM generate_series(2015, 1980, -1) AS foo(i) ...
With a big data set, PostgreSQL performs an index scan instead of a sequential scan to retrieve the data. The following shows the execution plan with an index scan for the preceding query:
car_portal=# EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=26.07..26.09 rows=10 width=24)
   ->  Sort  (cost=26.07..26.19 rows=50 width=24)
         Sort Key: (max(search_date))
         ->  HashAggregate  (cost=24.49..24.99 rows=50 width=24)
               ->  Bitmap Heap Scan on account_history  (cost=10.18..23.26 rows=246 width=24)
                     Recheck Cond: (account_id = 1000)
                     ->  Bitmap Index Scan on account_history_account_id_search_key_search_date_key  (cost=0.00..10.12 rows=246 width=0)
                           Index Cond: (account_id = 1000)
(8 rows)
The PostgreSQL planner decides whether to use an index based on the execution plan cost. For the same query with different parameters, the planner might pick a different plan based on the data histogram.
PostgreSQL supports several index types; each type suits certain scenarios and data types.
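The index type is chosen with the USING clause of CREATE INDEX; B-tree is the default. The following is a minimal sketch; the tables and columns in the last two statements are hypothetical and not part of the car portal schema:

```sql
-- B-tree is the default access method, so these two statements are equivalent:
CREATE INDEX ON account_history (search_date);
CREATE INDEX ON account_history USING btree (search_date);

-- Hypothetical examples of other access methods:
CREATE INDEX ON some_table USING hash (some_column);  -- supports equality only
CREATE INDEX ON some_docs USING gin (tags);           -- e.g. for arrays or full-text data
```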
The B-tree index is the default index type in PostgreSQL; it is what the CREATE INDEX command builds when no index type is specified explicitly. The "B" stands for balanced, which means that the data on both sides of the tree is roughly equal in size. B-tree indexes can be used for equality, range, and null predicates, and they support all PostgreSQL data types.

A partial index indexes only the subset of the table data that satisfies a certain predicate; the WHERE clause is used in the index definition. The idea behind a partial index is to decrease the index size, thus making it more maintainable and faster to access. Let us assume that we would like to introduce a data life cycle for some tables. For the advertisement table in the car web portal, suppose that we would like to keep an advertisement for a certain period of time after deletion, to enable the seller to revive it easily. The deletion operation will be emulated by adding a column called advertisement_deletion_date and a job scheduler task that cleans up the deleted advertisements. For the SQL part, this can be done as follows:
car_portal=# ALTER TABLE advertisement ADD column advertisement_deletion_date TIMESTAMP WITH TIME ZONE;
ALTER TABLE
In order to retrieve the dead advertisements quickly, one can create a partial index on the advertisement_deletion_date column, as follows:
CREATE INDEX ON advertisement(advertisement_id) WHERE advertisement_deletion_date IS NOT NULL;
CREATE INDEX
The preceding index helps retrieve the dead advertisements quickly without indexing the whole advertisement table.
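A query whose WHERE clause matches, or implies, the index predicate can use this partial index; the following is a sketch, and the actual plan depends on the data statistics:

```sql
-- The planner can use the partial index here, because the query
-- condition matches the index predicate exactly:
SELECT advertisement_id
FROM advertisement
WHERE advertisement_deletion_date IS NOT NULL;

-- A stricter condition that implies the predicate may also use it;
-- the extra condition is then checked against the table rows:
SELECT advertisement_id
FROM advertisement
WHERE advertisement_deletion_date < now() - INTERVAL '1 month';
```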
As stated earlier, an index can be created on a single table column or on multiple columns. It can also be created on expressions and function results. In the account table, let us assume that we would like to search for an account with the first name foo. Let us also assume that names are stored in the database in different letter cases, such as Foo, foo, and fOo, and that there is an index on the first_name column. To get the matching accounts, one can use the lower() and upper() functions or the case-insensitive regular expression operator ~*. The queries can be written as follows:
car_portal=# INSERT INTO account values (DEFAULT, 'fOo', 'bar', '[email protected]', md5('foo'));
INSERT 0 1
car_portal=# SELECT * FROM account WHERE first_name ~* '^foo$';
 account_id | first_name | last_name |    email    |             password
------------+------------+-----------+-------------+----------------------------------
          1 | fOo        | bar       | [email protected] | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)

car_portal=# SELECT * FROM account WHERE lower(first_name) = 'foo';
 account_id | first_name | last_name |    email    |             password
------------+------------+-----------+-------------+----------------------------------
          1 | fOo        | bar       | [email protected] | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)
Neither of the two preceding approaches uses an index scan, even though an index scan would be preferable here because only one row is retrieved. The plain index on first_name cannot serve the lower(first_name) condition: the index stores the original values, not the results of the function, which are not known until the function is executed. Similarly, a regular expression might match one or more stored values. To solve this, an index on the lower() function can be created as follows, and the lower() function should then be used in queries instead of the regular expression:
CREATE INDEX ON account (lower(first_name));
SELECT * FROM account WHERE lower(first_name) = 'foo';
Another use of expression indexes is to find rows by casting one data type to another. For example, the birth of a child can be stored as a timestamp; however, we often search by birth date rather than birth time.
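A sketch of such an expression index follows; the child table and its birth column are hypothetical and not part of the car portal schema:

```sql
-- Hypothetical table for illustration.
CREATE TABLE child (child_id INT PRIMARY KEY, birth TIMESTAMP);

-- Index the date part of the timestamp. The cast from TIMESTAMP
-- (without time zone) to DATE is immutable, so it can be indexed;
-- casting TIMESTAMP WITH TIME ZONE to DATE is not, because the
-- result depends on the session time zone.
CREATE INDEX ON child (CAST(birth AS DATE));

-- A query that repeats the same expression can use the index:
SELECT * FROM child WHERE CAST(birth AS DATE) = DATE '2015-01-01';
```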
A unique index guarantees the uniqueness of a certain value across the whole table. In the account table, the email column has a unique constraint, which is implemented by a unique index, as shown by the \d meta command:
car_portal=# \d account
                              Table "car_portal_app.account"
   Column   |  Type   |                          Modifiers
------------+---------+--------------------------------------------------------------
 account_id | integer | not null default nextval('account_account_id_seq'::regclass)
 first_name | text    | not null
 last_name  | text    | not null
 email      | text    | not null
 password   | text    | not null
Indexes:
    "account_pkey" PRIMARY KEY, btree (account_id)
    "account_email_key" UNIQUE CONSTRAINT, btree (email)
    "account_lower_idx" btree (lower(first_name))
Check constraints:
    ...
Referenced by:
    ...
Note that the index section of the \d output lists three indexes; account_email_key is a B-tree index on the email column.
One can also use the unique and partial indexes together. For example, let us assume that we have a table called employee, where each employee must have a supervisor except for the company head. We can model this as a self-referencing table, as follows:
CREATE TABLE employee (employee_id INT PRIMARY KEY, supervisor_id INT);
ALTER TABLE employee ADD CONSTRAINT supervisor_id_fkey FOREIGN KEY (supervisor_id) REFERENCES employee(employee_id);
To guarantee that at most one employee, namely the company head, has no supervisor, we can add the following unique index:
CREATE UNIQUE INDEX ON employee ((1)) WHERE supervisor_id IS NULL;
The unique index on the constant expression (1) allows only one row with a null supervisor_id. On the first insert of such a row, the constant value 1 is added to the index. A second attempt to add a row with a null supervisor_id causes an error, because the value 1 is already indexed:
car_portal=# INSERT INTO employee VALUES (1, NULL);
INSERT 0 1
car_portal=# INSERT INTO employee VALUES (2, 1);
INSERT 0 1
car_portal=# INSERT INTO employee VALUES (3, NULL);
ERROR:  duplicate key value violates unique constraint "employee_expr_idx"
DETAIL:  Key ((1))=(1) already exists.
A multicolumn index can be used for a certain pattern of query conditions. Suppose a query has a pattern similar to the following:
SELECT * FROM table WHERE column1 = constant1 AND column2 = constant2 AND … columnN = constantN;
In this case, an index can be created on column1, column2, …, and columnN, where N is less than or equal to 32. Currently, only B-tree, GIN, and GiST support multicolumn indexes. When creating a multicolumn index, the column order is important. A multicolumn index is used most efficiently when there are equality constraints on its leading (leftmost) columns, plus, optionally, an inequality constraint on the first column that has no equality constraint.
Since the multicolumn index size is often big, the planner might prefer to perform a sequential scan rather than reading the index.
It is often useful to index columns that are used in predicates and foreign keys, since this enables PostgreSQL to use an index scan instead of a sequential scan. The benefits of indexes are not limited to SELECT statements; DELETE and UPDATE statements can also benefit from them. There are some cases where an index is not used, often because the table is small. For big tables, one should plan disk capacity carefully, since the index size might be very big. Note also that indexes have a negative impact on INSERT statements, since amending the index comes with a cost.
There are several catalogue tables and functions that help in maintaining indexes. The first is the pg_stat_all_indexes view, which gives statistics about index usage. In the following example, we can see that the index account_history_account_id_search_key_search_date_key has never been used, which is fine in this case, since the database has just been created:
car_portal=# SELECT * FROM pg_stat_all_indexes WHERE schemaname = 'car_portal_app' limit 1;
  relid  | indexrelid |   schemaname   |     relname     |                     indexrelname                      | idx_scan | idx_tup_read | idx_tup_fetch
---------+------------+----------------+-----------------+-------------------------------------------------------+----------+--------------+---------------
 8053962 |    8053971 | car_portal_app | account_history | account_history_account_id_search_key_search_date_key |        0 |            0 |             0
(1 row)
The pg_indexes_size function can be used with the pg_size_pretty function to get the index size in a human-readable form, as follows:
car_portal=# SELECT pg_size_pretty(pg_indexes_size('car_portal_app.account_pkey'));
 pg_size_pretty
----------------
 0 bytes
(1 row)
When creating an index, make sure that the index does not already exist; otherwise, one can end up with duplicate indexes. PostgreSQL does not raise a warning when duplicate indexes are created, as shown in the following example:
car_portal=# CREATE index on car_portal_app.account(first_name);
CREATE INDEX
car_portal=# CREATE index on car_portal_app.account(first_name);
CREATE INDEX
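Duplicate indexes can be spotted by grouping the pg_index catalogue entries by table and indexed columns. This rough sketch ignores expressions, predicates, and operator classes, so treat its output as candidates to inspect rather than to drop blindly:

```sql
SELECT indrelid::regclass AS table_name,
       array_agg(indexrelid::regclass) AS duplicate_candidates
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;
```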
In some cases, an index might become bloated; PostgreSQL provides the REINDEX command to rebuild it. Note that REINDEX is a blocking command. To avoid the locks, one can instead concurrently create another index identical to the original one. Creating an index concurrently is often preferred on a live system, but it requires more work than creating a normal index. Moreover, creating an index concurrently has some caveats; in some cases the creation might fail, leaving an invalid index behind. An invalid index can be dropped or rebuilt using the DROP INDEX and REINDEX statements, respectively.
Let us assume that we have a table called a with an index called a_index, that the index is bloated, and that we need to rebuild it. The following two pieces of code can be considered equivalent, with one difference, namely the locking behavior:
--- create a test table
car_portal=# CREATE TABLE a (id int);
CREATE TABLE
car_portal=# CREATE INDEX a_index ON a(id);
CREATE INDEX
--- Piece of code 1: causes locks
car_portal=# REINDEX INDEX a_index;
REINDEX
--- Piece of code 2: no locks
car_portal=# CREATE INDEX CONCURRENTLY a_index_1 ON a(id);
CREATE INDEX
car_portal=# DROP INDEX a_index;
DROP INDEX
car_portal=# ALTER INDEX a_index_1 RENAME TO a_index;
ALTER INDEX
PostgreSQL indexing is a huge topic. Indexes can also be used to answer certain queries without any table scan at all, and to speed up sorting operations. They can also serve anchored and non-anchored text searches through operator classes and operator families. Finally, PostgreSQL supports full-text search using the GIN and GiST indexes together with the text search types tsvector and tsquery.
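A minimal full-text search sketch follows; the documents table and its columns are hypothetical:

```sql
-- Hypothetical table for illustration.
CREATE TABLE documents (doc_id INT PRIMARY KEY, body TEXT);

-- to_tsvector with an explicit configuration is immutable,
-- so the expression can be indexed.
CREATE INDEX ON documents USING gin (to_tsvector('english', body));

-- The query must repeat the same expression to use the index:
SELECT doc_id FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'car & portal');
```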