Identifying important tables

Another aspect of maintaining a highly available database is to know all important information about the contents of the database itself. In this case, we aim to focus on tables and indexes that receive the most activity. If any problems that might require maintenance or a restart arise, the most active portions are the likely origin.

What is activity? Inserts, updates, deletes, and selects are a good start. PostgreSQL collects statistics on all of this activity, making it easy to track. It also tracks how often indexes or tables are scanned and how many rows were affected by each. In addition, we can find out how much disk space any object consumes, and with the help of a couple of contributed tools, we can also find out how much of this space is currently reusable.

Data like this tells us which tables and indexes are the most active, which objects have the highest row turnover, and which objects require high disk I/O. Armed with these statistics, we can properly distribute tables to high-performance tablespaces, direct extra maintenance toward particularly active tables, or remove inefficient indexes.

All of these operations increase the stability, responsiveness, and throughput of a PostgreSQL database. First, however, we need to isolate our targets.

Getting ready

Many of these techniques rely on functions and views described in greater detail within the PostgreSQL documentation. In particular, we use a few system administration functions, such as pg_relation_size and pg_total_relation_size, and system catalogs and views, such as pg_class, pg_index, pg_stat_user_tables, and pg_stat_user_indexes. We also make use of a contributed module named pgstattuple.

We strongly recommend that you get familiar with these functions and views in the PostgreSQL documentation before continuing. After we are finished, we hope to convey just how useful these views are and encourage further exploration. When you are building a highly available database, there is rarely such a thing as too much information about the database.

How to do it...

Follow these steps to learn a little about the database:

  1. Use this query to get a list of the top 20 largest tables in the current database:
    SELECT oid::REGCLASS::TEXT AS table_name,
           pg_size_pretty(
             pg_total_relation_size(oid)
           ) AS total_size
      FROM pg_class
     WHERE relkind = 'r'
       AND relpages > 0
     ORDER BY pg_total_relation_size(oid) DESC
     LIMIT 20;
  2. Use this query to get a list of the top 20 largest indexes in the current database and their parent tables:
    SELECT indexrelid::REGCLASS::TEXT AS index_name,
           indrelid::REGCLASS::TEXT AS table_name,
           pg_size_pretty(
             pg_relation_size(indexrelid)
           ) AS total_size
      FROM pg_index
     ORDER BY pg_relation_size(indexrelid) DESC
     LIMIT 20;
  3. Use this query to find the 20 most active tables, that is, the ones that receive the most inserts, updates, or deletes:
    -- Note that n_tup_upd already includes HOT updates, so adding
    -- n_tup_hot_upd on top of it would count those rows twice.
    SELECT relid::REGCLASS AS table_name,
           n_tup_ins AS inserts,
           n_tup_upd AS updates,
           n_tup_del AS deletes
      FROM pg_stat_user_tables
     ORDER BY (n_tup_ins + n_tup_upd + n_tup_del) DESC
     LIMIT 20;
  4. Use this variant to obtain the top 20 tables by read activity, checking both sequential and index scans:
    SELECT relid::REGCLASS AS table_name,
           coalesce(seq_scan, 0) AS sequential_scans,
           coalesce(idx_scan, 0) AS index_scans,
           coalesce(seq_tup_read, 0) AS table_matches,
           coalesce(idx_tup_fetch, 0) AS index_matches
      FROM pg_stat_user_tables
     ORDER BY (coalesce(seq_scan, 0) +
               coalesce(idx_scan, 0)) DESC,
              (coalesce(seq_tup_read, 0) +
               coalesce(idx_tup_fetch, 0)) DESC
     LIMIT 20;
  5. Use this query for the top 20 indexes with read activity in the current database:
    SELECT indexrelid::REGCLASS AS index_name,
           coalesce(idx_scan, 0) AS index_scans,
           coalesce(idx_tup_read, 0) AS rows_read,
           coalesce(idx_tup_fetch, 0) AS rows_fetched
      FROM pg_stat_user_indexes
     ORDER BY (coalesce(idx_scan, 0) +
               coalesce(idx_tup_read, 0)) DESC
     LIMIT 20;

How it works...

Each of these queries offers a distinct piece of information about the database. Simply executing them in a vacuum offers very little insight; we have to examine the results of each to learn anything. In addition, the statistics views only return data for the database we're currently connected to.

If the PostgreSQL instance has dozens of databases and we're only connected to one, the statistics will only apply to that particular database. To obtain stats on every database in the instance, we would need to connect to each one and collect the information separately.
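
If we just need the list of databases to visit, the pg_database catalog provides it. This is a minimal sketch; actually connecting to each result is left to psql or your favorite scripting tool:

-- Enumerate every connectable, non-template database in the instance.
SELECT datname
  FROM pg_database
 WHERE datistemplate = false
   AND datallowconn;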

The first query returns the 20 largest tables in the database, including their associated indexes and The Oversized-Attribute Storage Technique (TOAST) data. This way, if a table has a large amount of excessively long row data or several indexes, we still get its true size in relation to all other tables. We will likely make use of the pg_size_pretty function several times throughout this book. When given a size in bytes, it converts it to a more convenient and readable notation such as megabytes or gigabytes.
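
To see the difference for a single table, we can compare the bare heap size with the total footprint. Here, my_table is just a placeholder; substitute any name returned by the first query:

-- pg_relation_size counts only the main heap; pg_total_relation_size
-- adds indexes and TOAST data. 'my_table' is a hypothetical name.
SELECT pg_size_pretty(pg_relation_size('my_table')) AS heap_only,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total_size;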

The next query returns the 20 largest indexes in the database. While it is very likely that these will be associated with the largest tables, this won't necessarily be the case. Indeed, large composite indexes, functional indexes, or bloated indexes will also be listed here. Indexes (other than primary keys) that show up in this list are good candidates for optimization, whether by substituting them with partial indexes or replacing them with a more efficient version.
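
As a sketch of one such optimization, imagine a hypothetical sys_order table where the application only ever queries unshipped orders. A partial index stores far fewer entries than a full index on the same column:

-- Index only the rows the application actually searches; on a large
-- table, this can shrink the index dramatically. All names here are
-- illustrative.
CREATE INDEX idx_order_pending ON sys_order (order_dt)
 WHERE shipped = false;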

After size, we move on to table activity. The third query returns the 20 most active tables based on writes. In many cases, this will immediately identify tables with high turnover that will frequently invoke autovacuum or autoanalyze and may require manual adjustment. Often, user session tables appear here due to inefficient storage of web session data; identification provides ammunition for process revision. Overly active tables are bottlenecks and should be minimized if possible.

Then, we may wish to know table select information. The fourth query is somewhat crude, but the intent is to return the 20 tables that are read most often by user sessions. Again, it will likely identify tables whose read activity is extremely inflated in comparison to the database average. These cases can often be reduced by better frontend data caches, and identifying them is the first step down this path.

Finally, we can see the top 20 indexes by read activity. This can further isolate potential indexes that should be monitored. If we invert the sorting of this query, we can also identify indexes that produce few or no matches and are simply wasting space.
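
For instance, sorting the scan count in ascending order surfaces the least-used indexes. This variant also skips unique indexes, since those enforce constraints even when they are never scanned:

-- Indexes that are almost never scanned are candidates for removal.
SELECT s.indexrelid::REGCLASS AS index_name,
       s.idx_scan AS index_scans
  FROM pg_stat_user_indexes s
  JOIN pg_index i ON (i.indexrelid = s.indexrelid)
 WHERE NOT i.indisunique
 ORDER BY s.idx_scan ASC
 LIMIT 20;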

There's more...

Though we've already obtained a wealth of information from PostgreSQL, it still has a few tricks up its sleeve.

Reset stats

If we run these queries several times in a row, it's hard to ignore that the numbers only ever increase, and there's no associated timestamp. Several statistics-tracking systems record the differences between readings and display this as a rate, but if we're doing this by hand, we need another way to zero out the counters for ease of analysis. Use this function to reset all activity statistics to zero:

SELECT pg_stat_reset();

Of course, we suggest that you capture this data before resetting it.
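
One simple way to do that is to copy the counters into an archive table before clearing them. The snapshot table name here is our own invention:

-- Save the current counters with a timestamp, then reset. Later
-- snapshots would use INSERT INTO instead of CREATE TABLE AS.
CREATE TABLE activity_snapshot AS
SELECT now() AS captured_at, *
  FROM pg_stat_user_tables;

SELECT pg_stat_reset();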

Use pgstattuple

The pgstattuple contributed extension is also useful for analysis, though it performs a deep scan of a single object identified through other means. It's best used to get storage-related data about indexes or tables matched by the preceding queries. It must first be installed by a superuser account, and only superusers can use it.

To install the extension, execute this SQL query:

CREATE EXTENSION pgstattuple;

To use it, select from it as if it were a normal table or view. The only difference is that we use it as a function with the name of the table we want to analyze. For example, to obtain storage statistics on the pg_class table, we could execute this:

SELECT * FROM pgstattuple('pg_class');

Of particular interest is the free_percent column. If this is very high, the table mostly has empty space and could benefit from CLUSTER or VACUUM FULL. In addition, we should tell developers if this table becomes bloated frequently, as it is possible that they can modify the application to use it more efficiently.
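
Assuming a bloated table named sys_order (a name we made up for illustration), the rebuild might look like this. Both commands take an exclusive lock for their duration, so schedule them during a maintenance window:

-- Rewrite the table compactly, reclaiming the free space.
VACUUM FULL sys_order;

-- Or rewrite it while physically sorting rows by an index.
CLUSTER sys_order USING idx_order_created;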

If this isn't possible, we can also set autovacuum to be more aggressive for each specific table if necessary.
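
Per-table storage parameters make this easy. The threshold below is only an illustrative starting point, again using our hypothetical sys_order table:

-- Trigger autovacuum after roughly 1% of the table changes, rather
-- than the default 20%.
ALTER TABLE sys_order
  SET (autovacuum_vacuum_scale_factor = 0.01);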

See also

The tools discussed in this section are covered in much greater depth in the PostgreSQL documentation. In particular, refer to the sections on system administration functions, the statistics collector, and the pgstattuple module for more information.
