In these busy times, many people believe that if it ain't broke, don't fix it. I believe that too, but it isn't an excuse for not taking action to maintain your database servers and make sure that nothing breaks.
Database maintenance is about making your database run smoothly.
PostgreSQL benefits from regular maintenance, so please read the Planning maintenance recipe for more information.
We recognize that you're here for a reason and are looking for a quick solution to your needs. You're probably thinking – Fix the problem first, and I'll plan later. So, off we go!
PostgreSQL provides a utility command named VACUUM, which is a reference to a garbage collector that sweeps up all of the bad things and fixes them – or at least most of them. That's the single most important thing you need to remember to do – I say single because closely connected to that is the ANALYZE command, which collects statistics for the SQL optimizer. It's possible to run VACUUM and ANALYZE as a single joint command, VACUUM ANALYZE. These actions are automatically executed for you when appropriate by autovacuum, a special background process that runs as part of the PostgreSQL server.
VACUUM performs a range of cleanup activities, some of them too complex to describe without a whole sideline into their internals. VACUUM has been heavily optimized over 30 years to take the minimum required lock levels on tables and execute them in the most efficient manner possible, skipping all of the unnecessary work and using L2 cache CPU optimizations when work is required.
Many experienced PostgreSQL DBAs prefer to execute their own VACUUM commands, though autovacuum now provides a fine degree of control which, if enabled and tuned, can save you much time. Using both manual and automatic vacuuming gives you control and a safety net.
In this chapter, we will cover the following recipes:
autovacuum is enabled by default in PostgreSQL and mostly does a great job of maintaining your PostgreSQL database. We say mostly because it doesn't know everything you do about the database, such as the best time to perform maintenance actions. Let's explore the settings that can be tuned so that you can use vacuums efficiently.
Exercising control requires some thinking about what you want:
autovacuum = on
track_counts = on
vacuum_cleanup_index_scale_factor
vacuum_cost_delay
vacuum_cost_limit
vacuum_cost_page_dirty
vacuum_cost_page_hit
vacuum_cost_page_miss
vacuum_defer_cleanup_age
vacuum_failsafe_age
vacuum_freeze_min_age
vacuum_freeze_table_age
vacuum_multixact_freeze_min_age
vacuum_multixact_freeze_table_age
autovacuum
autovacuum_analyze_scale_factor
autovacuum_analyze_threshold
autovacuum_freeze_max_age
autovacuum_max_workers
autovacuum_multixact_freeze_max_age
autovacuum_naptime
autovacuum_vacuum_cost_delay
autovacuum_vacuum_cost_limit
autovacuum_vacuum_insert_threshold
autovacuum_vacuum_insert_scale_factor
autovacuum_vacuum_scale_factor
autovacuum_vacuum_threshold
autovacuum_work_mem
log_autovacuum_min_duration
ALTER TABLE mytable SET (storage_parameter = value);
autovacuum_enabled
autovacuum_analyze_scale_factor
autovacuum_analyze_threshold
autovacuum_freeze_min_age
autovacuum_freeze_max_age
autovacuum_freeze_table_age
autovacuum_multixact_freeze_max_age
autovacuum_multixact_freeze_min_age
autovacuum_multixact_freeze_table_age
autovacuum_vacuum_cost_delay
autovacuum_vacuum_cost_limit
autovacuum_vacuum_insert_threshold
autovacuum_vacuum_insert_scale_factor
autovacuum_vacuum_scale_factor
autovacuum_vacuum_threshold
vacuum_truncate (no equivalent postgresql.conf parameter)
log_autovacuum_min_duration
toast.autovacuum_enabled
toast.autovacuum_analyze_scale_factor
toast.autovacuum_analyze_threshold
toast.autovacuum_freeze_min_age
toast.autovacuum_freeze_max_age
toast.autovacuum_freeze_table_age
toast.autovacuum_multixact_freeze_max_age
toast.autovacuum_multixact_freeze_min_age
toast.autovacuum_multixact_freeze_table_age
toast.autovacuum_vacuum_cost_delay
toast.autovacuum_vacuum_cost_limit
toast.autovacuum_vacuum_insert_threshold
toast.autovacuum_vacuum_insert_scale_factor
toast.autovacuum_vacuum_scale_factor
toast.autovacuum_vacuum_threshold
toast.vacuum_truncate
toast.log_autovacuum_min_duration
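As a hedged sketch, this is how such a parameter can be set, inspected, and reverted (mytable follows the earlier example; the values are arbitrary, not recommendations):

```sql
ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.05,
                         autovacuum_vacuum_threshold = 200);
-- Inspect what is currently set at the table level
SELECT reloptions FROM pg_class WHERE relname = 'mytable';
-- Revert to the global settings
ALTER TABLE mytable RESET (autovacuum_vacuum_scale_factor,
                           autovacuum_vacuum_threshold);
```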
If autovacuum is enabled, then it will wake up every autovacuum_naptime seconds and decide whether to run VACUUM, ANALYZE, or both (there is seldom a reason to modify autovacuum_naptime).
There will never be more than autovacuum_max_workers maintenance processes running at any time. As these autovacuum workers perform I/O, they accumulate cost points until they hit the autovacuum_vacuum_cost_limit value, after which they sleep for an autovacuum_vacuum_cost_delay period. This is designed to throttle the resource utilization of autovacuum to prevent it from using all of the available disk I/O bandwidth, which it should never do. So, increasing autovacuum_vacuum_cost_delay will slow down each VACUUM to reduce the impact on user activity, but the general advice is: don't do that.

autovacuum will run ANALYZE when the number of inserted, updated, or deleted rows exceeds autovacuum_analyze_threshold plus the fraction of the table defined by autovacuum_analyze_scale_factor.
autovacuum will run VACUUM when the number of updated or deleted rows exceeds autovacuum_vacuum_threshold plus the fraction of the table defined by autovacuum_vacuum_scale_factor.
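As a worked example using the default values (autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2), a table with 100,000 rows is vacuumed once about 20,050 rows have been updated or deleted:

```sql
-- Trigger point = threshold + scale_factor * reltuples
SELECT 50 + 0.2 * 100000 AS av_vacuum_trigger;  -- 20050
```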
The autovacuum_* parameters only change vacuums and analyze operations that are executed by autovacuum. User-initiated VACUUM and ANALYZE commands are affected by vacuum_cost_delay and other vacuum_* parameters.
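For example, a manual VACUUM can be throttled for just the current session (a sketch; the 10ms value is an arbitrary assumption, not a recommendation):

```sql
SET vacuum_cost_delay = '10ms';   -- throttle only this session's manual VACUUM
VACUUM VERBOSE pgbench_accounts;
```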
If you set log_autovacuum_min_duration, then any autovacuum process that runs for longer than this value will be logged to the server log, like so:
2019-04-19 01:33:55 BST (13130) LOG: automatic vacuum of table "postgres.public.pgbench_accounts": index scans: 1
pages: 0 removed, 3279 remain
tuples: 100000 removed, 100000 remain
system usage: CPU 0.19s/0.36u sec elapsed 19.01 sec
2019-04-19 01:33:59 BST (13130) LOG: automatic analyze of table "postgres.public.pgbench_accounts"
system usage: CPU 0.06s/0.18u sec elapsed 3.66 sec
Most of the preceding global parameters can also be set at the table level. For example, the default autovacuum_vacuum_cost_delay is 2 ms, but if you want big_table to be vacuumed more quickly, then you can set the following:
ALTER TABLE big_table SET (autovacuum_vacuum_cost_delay = 0);
It's also possible to set parameters for toast tables. A toast table is where oversized column values are placed; the documentation refers to them as supplementary storage tables. If there are no oversized values, then the toast table will occupy little space, while tables with very wide values often have large toast tables. The Oversized Attribute Storage Technique (TOAST) is optimized for UPDATE: a heavily updated table often leaves its toast table untouched, so it may make sense to turn off autovacuuming for the toast table, as follows:
ALTER TABLE pgbench_accounts
SET ( toast.autovacuum_enabled = off);
Note
Autovacuuming the toast table is performed completely separately from the main table, even though you can't ask for an explicit include or exclude of the toast table yourself when running VACUUM.
Use the following query to display reloptions for tables and their toast tables:
postgres=#
SELECT n.nspname
, c.relname
, array_to_string(
c.reloptions ||
ARRAY(
SELECT 'toast.' || x
FROM unnest(tc.reloptions) AS x
), ', ')
AS relopts
FROM pg_class c
LEFT JOIN pg_class tc ON c.reltoastrelid = tc.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind = 'r'
AND nspname NOT IN ('pg_catalog', 'information_schema');
An example of the output of this query is shown here:
nspname | relname | relopts
---------+------------------+------------------------------
public | pgbench_accounts | fillfactor=100,
autovacuum_enabled=on,
autovacuum_vacuum_cost_delay=20
public | pgbench_tellers | fillfactor=100
public | pgbench_branches | fillfactor=100
public | pgbench_history |
public | text_archive | toast.autovacuum_enabled=off
Managing parameters for many different tables becomes difficult with tens, hundreds, or thousands of tables. We recommend that these parameter settings are used with caution and only when you have good evidence that they are worthwhile. Undocumented parameter settings will cause problems later.
Note that when multiple workers are running, the autovacuum cost delay parameters are "balanced" among all the running workers, so that the total I/O impact on the system is the same regardless of the number of workers running. However, if you set the per-table storage parameters for autovacuum_vacuum_cost_delay or autovacuum_vacuum_cost_limit, then those tables are not considered in the balancing algorithm.
VACUUM allows insertions, updates, and deletions while it runs, but it prevents DDL commands such as ALTER TABLE and CREATE INDEX. autovacuum can detect whether a user has requested a conflicting lock on the table while it runs, and it will cancel itself if it is getting in the user's way. VACUUM doesn't cancel itself since we expect that the DBA would not want it to be canceled.
From PostgreSQL 13 onward, autovacuum can also be triggered by insertions, so you may see more vacuum activity than before in some workloads, but this is likely to be a good thing and nothing to worry about.
Note that VACUUM does not shrink a table when it runs, unless there is a large run of empty space at the end of the table and nobody is accessing the table when we try to shrink it. If you want to avoid trying to shrink a table when you vacuum it, you can turn this behavior off with the following setting:
ALTER TABLE pgbench_accounts
SET (vacuum_truncate = off);
To shrink a table properly, you'll need VACUUM FULL, but this locks up the whole table for a long time and should be avoided if possible. The VACUUM FULL command will rewrite every row of the table and completely rebuild all indexes. This process is faster than it used to be, though it still takes a long time for larger tables, as well as needing up to twice the current space for the sort and new copy of the table.
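If you do decide a full rewrite is worth the disruption, it can be run as follows (a sketch using the sample table from this chapter; expect the table to be exclusively locked for the duration):

```sql
-- Rewrites every row and rebuilds all indexes;
-- needs up to twice the table's current disk space
VACUUM (FULL, VERBOSE) pgbench_accounts;
```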
The postgresql.conf file also allows include directives, which look as follows:
include 'autovacuum.conf'
These specify another file that will be read at that point, just as if those parameters had been included in the main file.
This can be used to maintain multiple sets of files for the autovacuum configuration. Let's say we have a website that is busy mainly during the daytime, with some occasional nighttime use. We decide to have two profiles – one for daytime, when we want less aggressive autovacuuming, and another for nighttime, where we can allow more aggressive vacuuming:
autovacuum = on
autovacuum_max_workers = 3
include 'autovacuum.conf'
# daytime profile (for example, autovacuum.conf.day) – less aggressive
autovacuum_analyze_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 5
autovacuum_vacuum_scale_factor = 0.2

# nighttime profile (autovacuum.conf.night) – more aggressive
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_scale_factor = 0.1
$ ln -sf autovacuum.conf.night autovacuum.conf
$ pg_ctl reload
The latter command reloads the server configuration, and it must be customized depending on your platform.
This allows us to switch profiles twice per day without needing to edit the configuration files. You can also easily tell which is the active profile simply by looking at the full details of the linked file (using ls -l). The exact details of the schedule are up to you. Night and day was just an example, which is unlikely to suit everybody.
The autovacuum_freeze_max_age parameter is explained in the next recipe, Avoiding auto-freezing and page corruptions, as are the more complex table-level parameters.
In the life cycle of a row, there are two routes that a row can take in PostgreSQL – a row version dies and needs to be removed by VACUUM, or a row version gets old enough and needs to be frozen, a task that is also performed by the VACUUM process. The removal of dead rows is easy to understand, while the second seems strange and surprising.
PostgreSQL uses internal transaction identifiers that are 4 bytes long, so we only have 2^32 transaction IDs (about four billion). PostgreSQL starts again from the beginning when that wraps around, circularly allocating new identifiers. The reason we do this is that moving to an 8-byte identifier has various other negative effects and costs that we would rather not pay, so we keep the 4-byte transaction identifier. The impact is that we need to do regular sweeps of the entire database to mark old tuples as frozen, meaning they are visible to all users – that's why this procedure is known as freezing. Once frozen, rows don't need to be touched again, though they can still be updated or deleted later if desired.
Why do we care? Suppose that we load a table with 100 million rows, and everything is fine. When those rows have been there long enough to begin being frozen, the next VACUUM operation on that table will rewrite all of them to freeze their transaction identifiers. Put another way, autovacuum will wake up and start using lots of I/O to perform the freezing.
The most obvious way to forestall this problem is to explicitly vacuum a table after a major load. Of course, that doesn't remove the problem entirely, because vacuuming doesn't freeze all the rows immediately and so some will remain for later vacuums.
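For instance, an explicit vacuum with the FREEZE option after a bulk load freezes the newly loaded rows straight away (a sketch; big_loaded_table is a hypothetical name):

```sql
-- Freeze and gather statistics in one pass after a major data load
VACUUM (FREEZE, ANALYZE) big_loaded_table;
```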
The knee-jerk reaction for many people is to turn off autovacuum because it keeps waking up at the most inconvenient times. My way of doing this is described in the Controlling automatic database maintenance recipe.
Freezing takes place when a transaction identifier on a row becomes more than vacuum_freeze_min_age transactions older than the current next value, measured in xid values, not time. Normal VACUUM operations will perform a small amount of freezing as you go, and in most cases, you won't notice that at all. As explained in the previous example, large transactions leave many rows with the same transaction identifiers, so those might cause problems when it comes to freezing.
The VACUUM command is normally optimized to only look at the chunks of a table that require cleaning, both for normal vacuum and freezing operations.
If you fiddle with the vacuum parameters to try to forestall heavy VACUUM operations, then you'll notice that the autovacuum_freeze_max_age parameter controls when the table will be scanned by a forced VACUUM command. To put this another way, you can't turn off the need to freeze rows, but you can defer it to a more convenient time. The mistake comes from deferring it completely and then finding that PostgreSQL executes an aggressive, uncancellable vacuum to remedy the lack of freezing. My advice is to control autovacuum, as we described in the previous recipe, or perform explicit VACUUM operations at a time of your choosing, rather than wait for the inevitable emergency freeze operation.
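To see how close each table is to triggering that forced vacuum, you can compare its age against autovacuum_freeze_max_age; a minimal sketch:

```sql
-- Tables with the oldest relfrozenxid are the next freeze candidates
SELECT relname, age(relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```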
The VACUUM command is also an efficient way to confirm the absence of page corruptions, so it is worth scanning the whole database, block by block, from time to time. To do this, you can run the following command on each of your databases:
VACUUM (DISABLE_PAGE_SKIPPING);
You can do this table by table as well. There's nothing special about whole-database VACUUM operations anymore; in earlier versions of PostgreSQL they were important, so you may still read on the web that they are a good idea.
You can focus on only the tables that most need freezing by using the vacuumdb utility with the --min-xid-age and --min-mxid-age options. With those options set, vacuumdb will process only tables whose main table or toast table has a relfrozenxid older than the specified age, skipping tables that don't need freezing yet (there is no corresponding option for these on the VACUUM command, as there is in most other cases).
If you've never had a corrupt block, then you may only need to scan every 2 to 3 months. If you start to get corrupt blocks, then you may want to increase the scan rate to confirm that everything is OK. Corrupt blocks are usually hardware induced, though they show up as database errors. It's possible but rare that the corruption was from a PostgreSQL bug instead.
There's no easy way to fix page corruption at present. There are, however, ways to investigate and extract data from corrupt blocks, for example, by using the pageinspect contrib utility that Simon wrote. You can also detect them automatically by creating the whole cluster using the following code:
initdb --data-checksums
This command initializes the data directory and enables data block checksums. This means that every time something changes in a block, PostgreSQL will compute the new checksum and store it in that same block, so that any later corruption of the block can be detected.
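On a running cluster, you can check whether checksums were enabled at initdb time:

```sql
SHOW data_checksums;  -- returns on or off
```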
Bloat can be caused by long-running queries or long-running write transactions that execute alongside write-heavy workloads. Resolving that is mostly down to understanding the workloads that are running on the server.
Look at the age of the oldest snapshots that are running, like this:
postgres=# SELECT now() -
CASE
WHEN backend_xid IS NOT NULL
THEN xact_start
ELSE query_start END
AS age
, pid
, backend_xid AS xid
, backend_xmin AS xmin
, state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
ORDER BY 1 DESC;
age | pid | xid | xmin | state
----------------+-------+----------+----------+------------------
00:00:25.791098 | 27624 | | 10671262 | active
00:00:08.018103 | 27591 | | | idle in transaction
00:00:00.002444 | 27630 | 10703641 | 10703639 | active
00:00:00.001506 | 27631 | 10703642 | 10703640 | active
00:00:00.000324 | 27632 | 10703643 | 10703641 | active
00:00:00 | 27379 | | 10703641 | active
The preceding example shows an UPDATE workload of three sessions, alongside one session that is waiting in the idle in transaction state, plus two other sessions that are only reading data.
If you have sessions stuck in the idle in transaction state, then you may want to consider setting the idle_in_transaction_session_timeout parameter so that transactions in that state will be canceled. The default is zero, meaning there will be no cancellation.
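A sketch of setting that parameter cluster-wide (the 10-minute value is an assumption; choose one appropriate for your workload):

```sql
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();  -- apply without a restart
```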
If not, try running shorter transactions or shorter queries.
If that is not an option, then consider setting old_snapshot_threshold. This parameter sets a time delay, after which dead rows are at risk of being removed. If a query attempts to read data that has been removed, then we cancel the query. All queries executing in less time than the old_snapshot_threshold parameter will be safe. This is a very similar concept to the way Hot Standby works (see Chapter 12, Replication and Upgrades).
VACUUM cannot remove dead rows until they are invisible to all users. The earliest piece of data that's visible to a session is defined by its oldest snapshot's xmin value, or if that is not set, then by the backend's xid value.
A session that is not running any query is in the idle state if it's outside of a transaction, or in the idle in transaction state if it's inside a transaction; that is, between a BEGIN and the corresponding COMMIT. Recall the Writing a script that either succeeds entirely or fails entirely recipe in Chapter 7, Database Administration, which was about how BEGIN and COMMIT can be used to wrap several commands into one transaction.
The reason to distinguish between these two states is that locks are released at the end of a transaction. Hence, an idle in transaction session is not currently doing anything, but it might be preventing other queries, including VACUUM, from accessing some tables.
You may have been routed here from other recipes, so you might not even know what prepared transactions are, let alone what an old prepared transaction looks like.
The good news is that prepared transactions don't just happen at random; they happen in certain situations. If you don't know what I'm talking about, that's OK! You don't need to, and better still, you probably don't have any prepared transactions either.
Prepared transactions are part of the two-phase commit feature, also known as 2PC. A transaction commits in two stages rather than one, allowing multiple databases to have synchronized commits. Its typical use is to combine multiple so-called resource managers using the XA protocol, which is usually provided by a Transaction Manager (TM), as used by the Java Transaction API (JTA) and others. If none of this means anything to you, then you probably don't have any prepared transactions.
First, check the setting of max_prepared_transactions:
SHOW max_prepared_transactions;
If your setting is more than zero, check whether you have any prepared transactions. As an example, you may find something like the following:
postgres=# SELECT * FROM pg_prepared_xacts;
-[ RECORD 1 ]------------------------------
transaction | 459812
gid | prep1
prepared | 2017-04-11 13:21:51.912374+01
owner | postgres
database | postgres
Here, gid (the global identifier) will usually be automatically generated.
Removing a prepared transaction is also referred to as resolving in-doubt transactions. The transaction is stuck between committing and aborting. The database or TM may have crashed, leaving the transaction midway through the two-phase commit process.
If you have a connection pool of 100 active connections and something crashes, you'll probably find 1 to 20 transactions stuck in the prepared state, depending on how long your average transaction is.
To resolve the transaction, we need to decide whether we want that change or not. The best way to do this is to check what happened externally to PostgreSQL. That should help you decide.
If you need further help, look at the There's more... section of this recipe.
If you wish to commit these changes, then use the following command:
COMMIT PREPARED 'prep1';
If you want to roll back these changes, then use the following command:
ROLLBACK PREPARED 'prep1';
Prepared transactions are persistent across crashes, so you can't just do a fast restart to get rid of them. They have both an internal transaction identifier and an external global identifier. Either of these can be used to locate locked resources and help you decide how to resolve the transactions.
If you're not sure what the prepared transaction did, you can go and look, but this is time-consuming. The pg_locks view shows locks that are held by prepared transactions. You can get a full report of what is being locked by using the following query:
postgres=# SELECT l.locktype, x.database, l.relation, l.page, l.tuple,l.classid, l.objid, l.objsubid, l.mode, x.transaction, x.gid, x.prepared, x.owner
FROM pg_locks l JOIN pg_prepared_xacts x
ON l.virtualtransaction = '-1/' || x.transaction::text;
The documents mention that you can join pg_locks to pg_prepared_xacts, but they don't mention that, if you join directly on the transaction ID, all it tells you is that there is a transaction lock unless there are some row-level locks. The table locks are listed as being held by a virtual transaction. A simpler query is the following:
postgres=# SELECT DISTINCT x.database, l.relation
FROM pg_locks l JOIN pg_prepared_xacts x
ON l.virtualtransaction = '-1/' || x.transaction::text
WHERE l.locktype != 'transactionid';
database | relation
---------+----------
postgres | 16390
postgres | 16401
(2 rows)
This tells you which relations in which databases have been touched by the remaining prepared transactions. We don't know their names because we'd need to connect to those databases to look them up.
It is much harder to check the rows that have been changed by a transaction until it is committed, and even then, deleted rows will be invisible.
If you are a heavy user of temporary tables in your applications, then there are some additional actions that you may need to perform.
There are four main things to check, which are as follows:
When we create a temporary table, we insert entries into the pg_class, pg_type, and pg_attribute catalog tables. These catalog tables and their indexes begin to grow and bloat – an issue that will be covered in further recipes. To control that growth, you can either vacuum those tables manually or let autovacuum do its work. You cannot run ALTER TABLE against system tables, so it is not possible to set specific autovacuum settings for any of these tables.
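A minimal sketch of vacuuming the catalog tables that temporary tables churn the most (VACUUM accepts a list of tables in PostgreSQL 11 and later):

```sql
VACUUM (VERBOSE, ANALYZE) pg_catalog.pg_class,
                          pg_catalog.pg_type,
                          pg_catalog.pg_attribute;
```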
If you vacuum the system catalog tables manually, make sure that you get all of the system tables. You can get the full list of tables to vacuum and a list of their indexes by using the following query:
postgres=# SELECT relname, pg_relation_size(oid) FROM pg_class
WHERE relkind in ('i','r') AND relnamespace = 'pg_catalog'::regnamespace
ORDER BY 2 DESC;
This results in the following output:
relname | pg_relation_size
---------------------------------+------------------
pg_proc | 450560
pg_depend | 344064
pg_attribute | 286720
pg_depend_depender_index | 204800
pg_depend_reference_index | 204800
pg_proc_proname_args_nsp_index | 180224
pg_description | 172032
pg_attribute_relid_attnam_index | 114688
pg_operator | 106496
pg_statistic | 106496
pg_description_o_c_o_index | 98304
pg_attribute_relid_attnum_index | 81920
pg_proc_oid_index | 73728
pg_rewrite | 73728
pg_class | 57344
pg_type | 57344
pg_class_relname_nsp_index | 40960
...(partial listing)
The preceding values are for a newly created database. These tables can become very large if they're not properly maintained, with values of 11 GB for one index being witnessed in one unlucky installation.
PostgreSQL implements Multiversion Concurrency Control (MVCC), which allows users to read data at the same time as writers make changes. This is an important feature for concurrency in database applications as it can allow the following:
Bloated tables and indexes are a natural consequence of MVCC design in PostgreSQL. Bloat is caused mainly by updates, as we must retain both the old and new updates for a certain period. Since these extra row versions are required to provide MVCC, some amount of bloat is normal and acceptable. Tuning to remove bloat completely isn't useful and probably a waste of time.
Bloating results in increased disk consumption, as well as performance loss – if a table is twice as big as it should be, scanning it takes twice as long. VACUUM is one of the best ways of removing bloat.
Many users execute VACUUM far too frequently, while at the same time complaining about the cost of doing so. This recipe is all about understanding when you need to run VACUUM by estimating the amount of bloat in tables and indexes.
MVCC is a core part of PostgreSQL and cannot be turned off, nor would you want it to be. The internals of MVCC have some implications for the DBA that need to be understood: each row represents a row version, so it has two system columns – xmin and xmax – indicating the identifiers of the two transactions when the version was created and deleted, respectively. The value of xmax is NULL if that version has not been deleted yet.
The general idea is that, instead of removing row versions, we alter their visibility by changing their xmin and/or xmax values. To be more precise, when a row is inserted, its xmin value is set to the "XID" or transaction ID of the creating transaction, while xmax is emptied; when a row is deleted, xmax is set to the number of the deleting transaction, without actually removing the row. An UPDATE operation is treated similarly to a DELETE operation, followed by INSERT; the deleted row represents the older version, and the row that's been inserted is the newer version. Finally, when rolling back a transaction, all of its changes are made invisible by marking that transaction ID as aborted.
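You can watch xmin and xmax directly, since they are selectable system columns; a throwaway sketch on a scratch table:

```sql
CREATE TEMP TABLE mvcc_demo (x int);
INSERT INTO mvcc_demo VALUES (1);
SELECT xmin, xmax, x FROM mvcc_demo;  -- xmax shows as 0 while the row is live
UPDATE mvcc_demo SET x = 2;
SELECT xmin, xmax, x FROM mvcc_demo;  -- a new row version, with a newer xmin
```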
In this way, we get faster DELETE, UPDATE, and ROLLBACK statements, but the price of these benefits is that the SQL UPDATE command can cause tables and indexes to grow in size because they leave behind dead row versions. The DELETE and aborted INSERT statements take up space, which must be reclaimed by garbage collection. VACUUM is the command we use to reclaim space in a batch operation, though there is another internal feature named Heap-Only Tuples (HOT), which allows us to clean data blocks one at a time as we scan each data block if that is possible. HOT also reduces index bloat since not all updates require index maintenance.
The best way to understand this is to look at things the same way that autovacuum does, by using a view that's been created with the following query:
CREATE OR REPLACE VIEW av_needed AS
SELECT N.nspname, C.relname
, pg_stat_get_tuples_inserted(C.oid) AS n_tup_ins
, pg_stat_get_tuples_updated(C.oid) AS n_tup_upd
, pg_stat_get_tuples_deleted(C.oid) AS n_tup_del
, CASE WHEN pg_stat_get_tuples_updated(C.oid) > 0
THEN pg_stat_get_tuples_hot_updated(C.oid)::real
/ pg_stat_get_tuples_updated(C.oid)
END
AS HOT_update_ratio
, pg_stat_get_live_tuples(C.oid) AS n_live_tup
, pg_stat_get_dead_tuples(C.oid) AS n_dead_tup
, C.reltuples AS reltuples
, round(COALESCE(threshold.custom, current_setting('autovacuum_vacuum_threshold'))::integer
+ COALESCE(scale_factor.custom, current_setting('autovacuum_vacuum_scale_factor'))::numeric
* C.reltuples)
AS av_threshold
, date_trunc('minute',
greatest(pg_stat_get_last_vacuum_time(C.oid),
pg_stat_get_last_autovacuum_time(C.oid)))
AS last_vacuum
, date_trunc('minute',
greatest(pg_stat_get_last_analyze_time(C.oid),
pg_stat_get_last_autoanalyze_time(C.oid)))
AS last_analyze
, pg_stat_get_dead_tuples(C.oid) >
round( current_setting('autovacuum_vacuum_threshold')::integer
+ current_setting('autovacuum_vacuum_scale_factor')::numeric
* C.reltuples)
AS av_needed
, CASE WHEN reltuples > 0
THEN round(100.0 * pg_stat_get_dead_tuples(C.oid) / reltuples)
ELSE 0 END
AS pct_dead
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
LEFT JOIN LATERAL (
SELECT (regexp_match(unnest,'^[^=]+=(.+)$'))[1]
FROM unnest(reloptions)
WHERE unnest ~ '^autovacuum_vacuum_threshold='
) AS threshold(custom) ON true
LEFT JOIN LATERAL (
SELECT (regexp_match(unnest,'^[^=]+=(.+)$'))[1]
FROM unnest(reloptions)
WHERE unnest ~ '^autovacuum_vacuum_scale_factor='
) AS scale_factor(custom) ON true
WHERE C.relkind IN ('r', 't', 'm')
AND N.nspname NOT IN ('pg_catalog', 'information_schema')
AND N.nspname NOT LIKE 'pg_toast%'
ORDER BY av_needed DESC, n_dead_tup DESC;
We can then use this to look at individual tables, as follows:
postgres=# \x
postgres=# SELECT * FROM av_needed WHERE nspname = 'public' AND relname = 'pgbench_accounts';
We will get the following output:
-[ RECORD 1 ]----+------------------------
nspname | public
relname | pgbench_accounts
n_tup_ins | 100001
n_tup_upd | 117201
n_tup_del | 1
hot_update_ratio | 0.123454578032611
n_live_tup | 100000
n_dead_tup | 0
reltuples | 100000
av_threshold | 20050
last_vacuum | 2010-04-29 01:33:00+01
last_analyze | 2010-04-28 15:21:00+01
av_needed | f
pct_dead | 0
We can compare the number of dead row versions, shown as n_dead_tup, against the required threshold, av_threshold.
The av_needed column in the preceding query doesn't take table-specific autovacuum thresholds into account. It could do so if you need it, but the main purpose of the query is to give us information to understand what is happening, and then set the parameters accordingly – not the other way around.
Notice that the table query shows insertions, updates, and deletions so that you can understand your workload better. There is also something named hot_update_ratio. This shows the fraction of updates that take advantage of the HOT feature, which allows a table to self-vacuum as the table changes. If that ratio is high, then you may avoid VACUUM activities altogether or at least for long periods. If the ratio is low, then you will need to execute VACUUM commands or autovacuum more frequently. Note that the ratio never reaches 1.0, so if you have it above 0.95, then that is very good and you need not think about it further.
HOT updates take place when the UPDATE statement does not change any of the column values that are indexed by any index, and there is enough free space in the disk page where the updated row is located. If you change even one column that is indexed by just one index, then it will be a non-HOT update, and there will be a performance hit. So, carefully selecting indexes can improve update performance and reduce the need for maintenance. Also, if HOT updates do occur, though not often enough for your liking, you might want to try to decrease the fillfactor storage parameter for the table to make more space for them. Remember that this will only be important on your most active tables. Seldom touched tables don't need much tuning.
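A sketch of lowering fillfactor to encourage HOT updates (busy_table is a hypothetical name; 90 is a common starting point, not a universal recommendation):

```sql
-- Leave 10% of each heap page free for HOT updates of existing rows
ALTER TABLE busy_table SET (fillfactor = 90);
-- This only affects newly written pages; a table rewrite applies it everywhere
```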
To recap, non-HOT updates cause indexes to bloat. The following query is useful in investigating the index size and how it changes over time. It runs fairly quickly and can be used to monitor whether your indexes are changing in size over time:
SELECT
nspname,relname,
round(100 * pg_relation_size(indexrelid) /
pg_relation_size(indrelid)) / 100
AS index_ratio,
pg_size_pretty(pg_relation_size(indexrelid))
AS index_size,
pg_size_pretty(pg_relation_size(indrelid))
AS table_size
FROM pg_index I
LEFT JOIN pg_class C ON (C.oid = I.indexrelid)
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE
nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') AND
C.relkind='i' AND
pg_relation_size(indrelid) > 0;
Another route is to use the pgstattuple contrib extension, which provides very detailed statistics on tables and indexes:
CREATE EXTENSION pgstattuple;
You can scan tables using pgstattuple(), as follows:
test=> SELECT * FROM pgstattuple('pg_catalog.pg_proc');
The output will look as follows:
-[ RECORD 1 ]------+-------
table_len | 458752
tuple_count | 1470
tuple_len | 438896
tuple_percent | 95.67
dead_tuple_count | 11
dead_tuple_len | 3157
dead_tuple_percent | 0.69
free_space | 8932
free_percent | 1.95
The downside of pgstattuple is that it derives exact statistics by scanning the whole table and counting everything. If you have time to scan the table, you may as well vacuum the whole table anyway. So, a better idea is to use pgstattuple_approx(), which is much, much faster, and yet is still fairly accurate. It works by accessing the table's visibility map first and then only scanning the pages that need VACUUM, so I recommend that you use it in all cases for checking tables (there is no equivalent for indexes since they don't have a visibility map):
postgres=# select * from pgstattuple_approx('pgbench_accounts');
-[ RECORD 1 ]--------+-----------------
table_len | 268591104
scanned_percent | 0
approx_tuple_count | 1001738
approx_tuple_len | 137442656
approx_tuple_percent | 51.1717082037088
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
approx_free_space | 131148448
approx_free_percent | 48.8282917962912
You can also scan indexes using pgstatindex(), as follows:
postgres=> SELECT * FROM pgstatindex('pg_cast_oid_index');
-[ RECORD 1 ]------+------
version | 2
tree_level | 0
index_size | 8192
root_block_no | 1
internal_pages | 0
leaf_pages | 1
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 50.27
leaf_fragmentation | 0
You may want to set up monitoring for the bloated tables and indexes. Look at the Nagios plugin called check_postgres_bloat, which is a part of the check_postgres plugins.
It provides some flexible options to assess bloat. Unfortunately, it's not that well documented, but if you've read this, it should make sense. You'll need to play with it to get the thresholding correct anyway, so that shouldn't be a problem.
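A hypothetical invocation might look like the following; the thresholds shown are placeholders that you would adjust to your environment:

```shell
# Sketch: check all tables in the postgres database for bloat,
# warning at 100 MB of estimated bloat and going critical at 1 GB.
check_postgres.pl --action=bloat --db=postgres \
    --warning='100 MB' --critical='1 GB'
```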
Also, note that the only way to know the exact bloat of a table or index for certain is to scan the whole relation. Anything else is just an estimate and may lead you to run maintenance either too early or too late.
This recipe covers both the VACUUM command and autovacuum, which I refer to collectively as vacuums (non-capitalized).
If you're currently waiting for a long-running vacuum (or autovacuum) to finish, go straight to the How to do it... section.
If you've just had a long-running vacuum complete, then you may want to think about setting a few parameters for next time, so read the How it works… section.
Let's watch what happens when we run a large VACUUM. Don't run VACUUM FULL, because it runs for a long time while holding an AccessExclusiveLock on the table. Ouch.
First, locate which process is running this VACUUM by using the pg_stat_activity view to identify the specific pid (34399 is just an example).
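A sketch of how you might find it (the ILIKE filter is a loose heuristic for manual VACUUM commands; the backend_type check catches autovacuum workers):

```sql
-- Find backends currently running a vacuum, manual or automatic.
SELECT pid, query_start, backend_type, query
FROM pg_stat_activity
WHERE query ILIKE 'vacuum%'
   OR backend_type = 'autovacuum worker';
```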
Repeatedly execute the following query to see the progress of the VACUUM command, specifying the pid of the process you wish to monitor:
postgres=# SELECT * FROM pg_stat_progress_vacuum WHERE pid = 34399;
The next section explains what this all means.
VACUUM works in various phases. The first phase is scanning the heap, as shown in the phase column of the progress view:
pid | 34399
datid | 12515
datname | postgres
relid | 16422
phase | scanning heap
heap_blks_total | 32787
heap_blks_scanned | 25207
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples | 9541017
num_dead_tuples | 537600
While this phase is happening, the progress data doesn't change until it has vacuumed all of the indexes. This phase can take a long time; more indexes increase the time that is required unless you specify parallelism (more on this later). After this phase, we increment index_vacuum_count. Note that this does not refer to the number of indexes on the table, only how many times we have scanned all the indexes:
pid | 34399
datid | 12515
datname | postgres
relid | 16422
phase | vacuuming indexes
heap_blks_total | 32787
heap_blks_scanned | 32787
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples | 9541017
num_dead_tuples | 999966
pid | 34399
datid | 12515
datname | postgres
relid | 16422
phase | vacuuming heap
heap_blks_total | 32787
heap_blks_scanned | 32787
heap_blks_vacuumed | 25051
index_vacuum_count | 1
max_dead_tuples | 9541017
num_dead_tuples | 999966
To make VACUUM run in minimal time, set maintenance_work_mem to anything up to 1 GB, according to how much memory you can allocate to this task at this time. This minimizes the number of times indexes are scanned. If you put off running vacuums, then more dead rows will have accumulated when one finally runs, so num_dead_tuples may hit the max_dead_tuples limit, forcing extra passes over the indexes and making the vacuum take longer to run.
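For example, a one-off manual vacuum of a large table might look like this (my_table is a placeholder name):

```sql
-- Give this session a large memory allowance for dead-tuple tracking,
-- then vacuum with verbose output so we can watch the index passes.
SET maintenance_work_mem = '1GB';
VACUUM (VERBOSE) my_table;
```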
The INDEX_CLEANUP OFF option requests that the steps after the scanning heap phase be skipped, which makes VACUUM go much faster. This is not an option with autovacuum.
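For example (my_table is a placeholder name):

```sql
-- Skip index and heap vacuuming phases: faster, but leaves dead
-- index entries behind, so use this only as an emergency measure.
VACUUM (INDEX_CLEANUP OFF) my_table;
```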
If your filesystem supports it, you may also be able to set maintenance_io_concurrency to an optimal value for running ANALYZE and VACUUM.
VACUUM can be blocked while waiting for table-level locks by other DDL statements such as a long-running ALTER TABLE or CREATE INDEX. If that happens, the lock waits are not shown in the progress view, so you may also want to look in the pg_stat_activity or pg_locks views. You can request that locked tables be skipped with the SKIP_LOCKED option.
You can request multiple options for a VACUUM command, as shown in these examples, both of which do the same thing:
VACUUM (DISABLE_PAGE_SKIPPING, SKIP_LOCKED, VERBOSE) my_table;
VACUUM (DISABLE_PAGE_SKIPPING ON, SKIP_LOCKED ON, VERBOSE ON, ANALYZE OFF) my_table;
VACUUM doesn't run in parallel on a single table. However, if you have more than one index on a table, the index scanning phases can be conducted in parallel, if specifically requested by the user – autovacuum never does this. To use this feature, add the PARALLEL option and specify the number of workers, which will be limited to the number of indexes, the value of max_parallel_maintenance_workers, and whether we exceed min_parallel_index_scan_size.
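For example, to request up to four parallel workers for the index vacuuming phases:

```sql
-- The effective worker count is capped by the number of eligible
-- indexes and by max_parallel_maintenance_workers.
VACUUM (PARALLEL 4, VERBOSE) pgbench_accounts;
```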
If you want to run multiple VACUUMs at once, you can do this by, for example, running four vacuums, each job with up to two parallel workers to scan indexes, scanning all databases:
$ vacuumdb --jobs=4 --parallel=2 --all
If you run multiple VACUUMs at once, you'll use more memory and I/O, so be careful.
Vacuums can be slowed down by raising vacuum_cost_delay or lowering vacuum_cost_limit. Setting vacuum_cost_delay too high is counterproductive. VACUUM is your friend, not your enemy, so delaying it until it doesn't happen at all just makes things worse. Be careful.
Each vacuum sleeps when the work it has performed takes it over its limit, so the processes running VACUUM do not all sleep at the same time.
VACUUM commands use the value of vacuum_cost_limit as their limit.
For autovacuum workers, their limit is a share of the total autovacuum_vacuum_cost_limit, so the total amount of work that's done is the same no matter what the setting of autovacuum_max_workers.
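Cost-based delay can also be tuned for a single table, which is a useful compromise when only one large table needs gentler treatment. A sketch (my_big_table is a placeholder name, and the value is only an example):

```sql
-- Make autovacuum sleep longer between cost-limited chunks of work
-- on this table only, reducing its I/O impact; milliseconds.
ALTER TABLE my_big_table SET (autovacuum_vacuum_cost_delay = 10);
```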
autovacuum_max_workers should be set to 3 (the default) or more, to ensure that all the tables that need it can begin vacuuming when they need to. Setting it too high may not be very useful, so you need to be careful.
If you need to change the settings to slow down or speed up a running process, then vacuums will pick up any new default settings when you reload the postgresql.conf file.
If you do choose to run VACUUM FULL, the progress for that is available in PostgreSQL 12+ via the pg_stat_progress_cluster catalog view, which also covers the CLUSTER command. Note that you can have multiple jobs running VACUUM FULL, but you should not specify parallel workers when using FULL to avoid deadlocks.
PostgreSQL 13+ allows ANALYZE progress reporting via pg_stat_progress_analyze. ANALYZE ignores any parallel workers that have been set.
Just as tables can become bloated, so can indexes. However, reusing space in indexes is much less effective. In the Identifying and fixing bloated tables and indexes recipe, you saw that non-HOT updates can cause bloated indexes. Non-primary key indexes are also prone to some bloat from normal INSERT commands, as is common in most relational databases. Indexes can become a problem in many database applications that involve a high proportion of INSERT and DELETE commands.
autovacuum does not detect bloated indexes, nor does it do anything to rebuild indexes. Therefore, we need to look at other ways to maintain indexes.
PostgreSQL supports commands that will rebuild indexes for you. The client utility, reindexdb, allows you to execute the REINDEX command conveniently from the operating system:
$ reindexdb
This executes the SQL REINDEX command on every table in the default database. If you want to reindex all your databases, then use the following command:
$ reindexdb -a
That's what the manual says, anyway. My experience is that many indexes don't need rebuilding, so you should probably be more selective about which indexes you rebuild.
Also, REINDEX puts a full table lock (AccessExclusiveLock) on the table while it runs, preventing even SELECTs against the table. You don't want to run that on your whole database!
So, I recommend that you rebuild individual indexes or all the indexes on one table at a time.
DROP TABLE IF EXISTS test;
CREATE TABLE test
(id INTEGER PRIMARY KEY
,category TEXT
,value TEXT);
CREATE INDEX ON test (category);
SELECT oid, relname, relfilenode
FROM pg_class
WHERE oid in (SELECT indexrelid
FROM pg_index
WHERE indrelid = 'test'::regclass);
oid | relname | relfilenode
-------+-------------------+-------------
16639 | test_pkey | 16639
16641 | test_category_idx | 16641
(2 rows)
PostgreSQL supports a command known as REINDEX CONCURRENTLY, which builds an index without taking a painful AccessExclusiveLock:
REINDEX INDEX CONCURRENTLY test_category_idx;
When we check our internal identifiers again, we get the following:
SELECT oid, relname, relfilenode
FROM pg_class
WHERE oid in (SELECT indexrelid
FROM pg_index
WHERE indrelid = 'test'::regclass);
oid | relname | relfilenode
-------+-------------------+-------------
16639 | test_pkey | 16639
16642 | test_category_idx | 16642
(2 rows)
Here, we can see that test_category_idx is now a completely new index.
This seems pretty good, and it works on primary keys too.
If you do choose to use the reindexdb tool, make sure that you use these options to reindex one table at a time, concurrently, with some useful output:
$ reindexdb --concurrently -t test --verbose
INFO: index "public.test_category_idx" was reindexed
INFO: index "public.test_pkey" was reindexed
INFO: index "pg_toast.pg_toast_16414_index" was reindexed
INFO: table "public.test" was reindexed
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
The REINDEX INDEX CONCURRENTLY statement allows the INSERT, UPDATE, and DELETE commands to be used while the index is being created. It cannot be executed inside another transaction, and only one index per table can be created concurrently at any time.
If you perform REINDEX TABLE CONCURRENTLY, then each index will be recreated one after the other. However, each index can be built in parallel, as discussed shortly.
REINDEX will also work on partitioned tables, from PostgreSQL 14+.
You can also now use REINDEX to move indexes to a different tablespace, via the TABLESPACE option.
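For example, in PostgreSQL 14+ a rebuild can relocate an index in one step (fast_ssd is a hypothetical tablespace name):

```sql
-- Rebuild the index and place the new copy in another tablespace.
REINDEX (TABLESPACE fast_ssd) INDEX test_category_idx;
```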
Also new in PostgreSQL 14+ is that VACUUM can ignore long-running transactions that are executing REINDEX on other tables, making REINDEX CONCURRENTLY even more practical to use on production database servers.
CREATE INDEX/REINDEX for B-tree indexes can be run in parallel from PostgreSQL 11+. The amount of parallelism is controlled directly by the table's parallel_workers storage parameter. Be careful, since setting this at the table level affects all queries, not just the index build/rebuild. If the table-level parameter is not set, then the maintenance_work_mem and max_parallel_maintenance_workers parameters determine how many workers are used; the defaults are 64 MB for maintenance_work_mem and 2 for max_parallel_maintenance_workers. Increase both to get further gains in performance and/or concurrency. Note that these workers are shared across all users, so be careful not to over-allocate jobs; otherwise, there won't be enough workers to let everybody run in parallel.
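Putting that together, a session-level sketch for one rebuild might be (the values are examples to adjust for your server):

```sql
-- Allow up to four parallel workers and more sort memory for this
-- session's index rebuilds, without touching table-level settings.
SET max_parallel_maintenance_workers = 4;
SET maintenance_work_mem = '1GB';
REINDEX INDEX CONCURRENTLY test_category_idx;
```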
If you are fairly new to database systems, you may think that rebuilding indexes for performance is something that only PostgreSQL needs to do. Other DBMSes require this as well – they just don't say so.
Indexes are designed for performance and, in all databases, deleting index entries causes contention and loss of performance. PostgreSQL does not remove index entries for a row when that row is deleted, so an index can be filled with dead entries. PostgreSQL attempts to remove dead entries when a block becomes full, but that doesn't stop a small number of dead entries from accumulating in many data blocks.
Selecting the correct set of indexes for a workload is known to be a hard problem. It usually involves trial and error by developers and DBAs to get a good mix of indexes.
Tools for identifying slow queries exist and many SELECT statements can be improved by adding an index.
What many people forget is to check whether the mix of indexes remains valuable over time, which is something for the DBA to investigate and optimize.
PostgreSQL keeps track of each access against an index. We can view that information and use it to see whether an index is unused, as follows:
postgres=# SELECT schemaname, indexrelname, idx_scan
FROM pg_stat_user_indexes ORDER BY idx_scan;
schemaname | indexrelname | idx_scan
------------+--------------------------+----------
public | pgbench_accounts_bid_idx | 0
public | pgbench_branches_pkey | 14575
public | pgbench_tellers_pkey | 15350
public | pgbench_accounts_pkey | 114400
(4 rows)
As shown in the preceding code, there is one unused index, alongside others that have some usage. You now need to decide whether unused means that you should remove the index. That is a more complex question, so we need to explain how it works.
The PostgreSQL statistics accumulate various pieces of useful information. These statistics can be reset to zero using an administrator function. Also, as the data accumulates over time, we usually find that objects that have been there for longer periods have higher apparent usage. So, if we see a low number for idx_scan, then it may be that the index was newly created (as was the case in my preceding demonstration), or that the index is only used by a part of the application that runs only at certain times of the day, week, month, and so on.
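The administrator function mentioned above can be called as follows; note that resetting throws away all accumulated counters for the current database, not just index statistics:

```sql
-- Reset all statistics counters for the current database
-- (requires appropriate privileges); idx_scan starts again from zero.
SELECT pg_stat_reset();
```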
Another important consideration is that the index may be a unique constraint index that exists specifically to safeguard against duplicate INSERT commands. An INSERT operation does not show up as idx_scan, even if the index was used while checking the uniqueness of the newly inserted values, whereas UPDATE or DELETE may show up because they have to locate the row first. So, a table that only has INSERT commands against it will appear to have unused indexes.
Here is an updated version of the preceding query, which excludes unique constraint indexes:
SELECT schemaname
, relname
, indexrelname
, idx_scan
FROM pg_stat_user_indexes i
LEFT JOIN pg_constraint c
ON i.indexrelid = c.conindid
WHERE c.contype IS NULL
ORDER BY idx_scan DESC;
Also, some indexes that show usage might be showing historical usage, and there is no further usage. Alternatively, it might be the case that some queries use an index where they could just as easily and almost as cheaply use an alternative index. Those things are for you to explore and understand before you take action. A very common approach is to regularly monitor such numbers to gain knowledge by examining their evolution over time, both on the master database and any replicated Hot Standby nodes.
In the end, you may decide that you want to remove an index. If only there was a way to try removing an index and then put it back again quickly, in case you cause problems! Rebuilding an index may take hours on a big table, so these decisions can be a little scary. No worries! Just follow the next recipe, Carefully removing unwanted indexes.
Carefully removing? Do you mean pressing Enter gently after typing DROP INDEX? Err, no!
The reasoning is that it takes a long time to build an index and a short time to drop it.
What we want is a way of removing an index so that if we discover that removing it was a mistake, we can put the index back again quickly.
The following query will list all invalid indexes, if any:
SELECT ir.relname AS indexname
, it.relname AS tablename
, n.nspname AS schemaname
FROM pg_index i
JOIN pg_class ir ON ir.oid = i.indexrelid
JOIN pg_class it ON it.oid = i.indrelid
JOIN pg_namespace n ON n.oid = it.relnamespace
WHERE NOT i.indisvalid;
Take note of these indexes, so that later on you can tell whether a given index is invalid because we marked it as such during this recipe (in which case it can safely be marked as valid again), or because it was already invalid for other reasons.
Here, we will describe a procedure that allows us to deactivate an index without actually dropping it so that we can appreciate what its contribution was and possibly reactivate it:
CREATE OR REPLACE FUNCTION trial_drop_index(iname TEXT)
RETURNS VOID LANGUAGE SQL AS $$
UPDATE pg_index
SET indisvalid = false
WHERE indexrelid = $1::regclass;
$$;

CREATE OR REPLACE FUNCTION trial_undrop_index(iname TEXT)
RETURNS VOID LANGUAGE SQL AS $$
UPDATE pg_index
SET indisvalid = true
WHERE indexrelid = $1::regclass;
$$;
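Using the functions looks like this, taking the test_category_idx index from the earlier example:

```sql
-- Hide the index from the planner without dropping it.
SELECT trial_drop_index('test_category_idx');

-- ...run your workload and compare query plans and timings...

-- Put the index back into service if performance suffered.
SELECT trial_undrop_index('test_category_idx');
```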
Note
Be careful not to "undrop" any index that was detected by the query in the Getting ready section; if it was already invalid before you applied this recipe, then marking it as valid could expose an incomplete index that was never safe to use.
This recipe also uses some inside knowledge. When we create an index using CREATE INDEX CONCURRENTLY, it is a two-stage process. The first phase builds the index and then marks it as invalid. The INSERT, UPDATE, and DELETE statements now begin maintaining the index, but we perform a further pass over the table to see whether we missed anything, before declaring the index valid. User queries don't use the index until it says that it is valid.
Once the index has been built and the valid flag has been set, if we set the flag to invalid, the index will still be maintained. It's just that it will not be used by queries. This allows us to turn the index off quickly, though with the option to turn it on again if we realize that we do need the index after all. This makes it practical to test whether dropping the index will alter the performance of any of your most important queries.
Monitoring systems are not a substitute for good planning. They alert you to unplanned situations that need attention. The more unplanned things you respond to, the greater the chance that you will need to respond to multiple emergencies at once. And when that happens, something will break. Ultimately, that is your fault. If you wish to take your responsibilities seriously, you should plan for this.
This recipe is all about planning, so we'll provide discussion points rather than portions of code. We'll cover the main points that should be addressed and provide a list of points as food for thought, around which the actual implementation should be built:
Build a regular cycle of activity around the following tasks:
Many of these activities are mentioned in this chapter or throughout the rest of this cookbook. Some are not because they aren't very technical and are more about planning and understanding your environment.
You may also find time to consider the following: