Chapter 7. PostgreSQL Monitoring

Troubleshooting is virtually impossible if you don't have the information you need to actually track down a problem. For carrying out troubleshooting, data is everything. Without data and without the ability to see what is going on, the situation is practically hopeless.

In this chapter, you will be guided through the following topics:

  • Checking the overall database behavior
  • Checking for conflicts
  • Finding I/O bottlenecks
  • Inspecting system internals
  • Chasing down slow queries
  • External tools

Those topics will help you keep an eye on your system.

Checking the overall database behavior

The first thing to do when approaching a system is to check the overall health and see what is currently going on in the system. Is the system using too much CPU? Or maybe, too much memory? Is the disk I/O fine?

Checking pg_stat_activity

The best place to start is pg_stat_activity, a system view listing open database connections. Since PostgreSQL 9.2, pg_stat_statement has been extended a little, and a lot more information such as the state of a database connection can be seen:

test=# d pg_stat_activity
           View "pg_catalog.pg_stat_activity"
      Column      |           Type           | Modifiers 
------------------+-------------------------------------
 datid            | oid                      | 
 datname          | name                     | 
 pid              | integer                  | 
 usesysid         | oid                      | 
 usename          | name                     | 
 application_name | text                     | 
 client_addr      | inet                     | 
 client_hostname  | text                     | 
 client_port      | integer                  | 
 backend_start    | timestamp with time zone | 
 xact_start       | timestamp with time zone | 
 query_start      | timestamp with time zone | 
 state_change     | timestamp with time zone | 
 waiting          | boolean                  | 
 state            | text                     | 
 backend_xid      | xid                      | 
 backend_xmin     | xid                      | 
 query            | text                     | 

Let's go through those columns. The first two columns tell us which database a person is connected to. The first column contains the object ID of the database, and the second column contains the name of the database in plain text. Then information about the user connected is displayed. Again, two fields are used: one with the object ID, and another with the name of the user in plain text.

Then comes the application_name field. The application_name field can be set freely by the end user (or the application). It is very helpful to debug applications because, if used wisely, it can tell you where a request is generated inside your application. Consider the following example:

test=# SET application_name TO 'script.pl:456';
SET
test=# SHOW application_name;    
 application_name 
------------------
 script.pl:456
(1 row)

As you can see, it is just a plain text name, which can contain anything.

The next three columns (client_*) will tell you where a connection originates from. You can detect the hostname, the port, and so on. In case of trouble, it allows you to solve the problem on the box causing the issue.

The backend_start and xact_start fields tell you when a database connection or the transaction was started.

Tip

Keep an eye on very old and long-running transactions. They can have a serious impact on your VACUUM policy. Remember that a row can only be cleaned out when it cannot be seen by a single transaction anymore. Long-running transactions can therefore have undesirable side effects.

The next important fields are query_start, state, state_change, and query. These four fields have to be seen in combination, and not one by one. The query_start column indicates when the last query has started. When the state is set to active, the query is still active at that point of time. When the state stays idle, the query is not active anymore and was terminated when state_change was set. Therefore, the difference between query_start and state_change is the runtime of the previous query in this case. Of course, query is the query we are talking about.

After the waiting column, which was discussed in the previous chapter, there are backend_xid and backend_xmin. These two columns give you an insight into the transactional behavior of the system (transaction IDs and so on).

To avoid trouble, the following points can be checked:

  • Are there any waiting queries?
  • Are there long-running transactions causing VACUUM issues?
  • Are there no connections at all or is there any excess number of connections?
  • Are there queries listed that have taken (or are taking) way too much time?

Given those valuable insights, you can fix a couple of quite frequent issues.

Checking database-level information

Once you are done with pg_stat_activity, you might want to turn your attention to pg_stat_database. The idea of pg_stat_database is to have one row per database:

test=# d pg_stat_database
          View "pg_catalog.pg_stat_database"
     Column     |           Type           | Modifiers 
----------------+--------------------------+-----------
 datid          | oid                      | 
 datname        | name                     | 
 numbackends    | integer                  | 
 xact_commit    | bigint                   | 
 xact_rollback  | bigint                   | 
 blks_read      | bigint                   | 
 blks_hit       | bigint                   | 
 tup_returned   | bigint                   | 
 tup_fetched    | bigint                   | 
 tup_inserted   | bigint                   | 
 tup_updated    | bigint                   | 
 tup_deleted    | bigint                   | 
 conflicts      | bigint                   | 
 temp_files     | bigint                   | 
 temp_bytes     | bigint                   | 
 deadlocks      | bigint                   | 
 blk_read_time  | double precision         | 
 blk_write_time | double precision         | 
 stats_reset    | timestamp with time zone |  

There are a couple of important fields here. First of all, there are the object ID and the name of the database. Then the view reveals the number of currently open database connections (numbackends). However, there is some more subtle information here that does not seem to be widely used: xact_commit and xact_rollback. The xact_commit and xact_rollback fields contain the number of successful and aborted transactions since the start of time, respectively. Why are these fields so important? In a typical application, the ratio of COMMIT to ROLLBACK tends to stay pretty constant over time. An application may face 5 percent ROLLBACK commands over a long period of time. What does it mean if this value is suddenly jumping to 35 percent or more? Most likely, some software update must have introduced some syntax error somewhere. That bug might have been undetected so far. However, if used properly, pg_stat_database might give clues that something might be wrong.

The next few fields tell us how much data was read, added, updated, deleted, and so on. This might be interesting to see, but usually, it does not reveal deeply hidden problems. These fields are more of a general guideline, telling you how much is going on in your system.

The next field is all about conflicts. More information about conflicts can be found in the next section of this chapter, Checking for conflicts.

The temp_files and temp_bytes fields are two highly important fields. They will tell us whether the database has done extensive I/O through temporary files. When does this happen? Temporary files are written if a lot of data is sorted and so on; for example, ORDER BY clauses, CREATE INDEX, or some OVER clauses lead to heavy sorting. If work_mem or maintenance_work_mem is not properly set, it can manifest in high I/O activity when it comes to temporary files. Another reason for high temporary I/O is simply bad SQL statements (for example, too much data returned).

There are two more fields that can give you a clue about I/O: blk_write_time and blk_read_time. The purpose of these two fields is to measure the time spent on writing and reading data to and from the operating system respectively. However, these two fields are not active by default:

test=# SHOW track_io_timing;
 track_io_timing 
-----------------
 off
(1 row)

Unless track_io_timing has been activated in postgresql.conf, there is no data available. If you are fighting with I/O problems and disk wait, it is definitely recommended to turn this setting on to detect the database that is causing some or all of the I/O trouble.

Tip

Note that in some rare cases, timing can decrease performance by a small amount. In most cases, however, this overhead is not relevant.

Finally, stats_reset tells us when the statistics has been reset.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.132.99