Terminating rogue connections

There comes a time in every DBA's life when they must disconnect a PostgreSQL client from the server; for us, that time is now. There are varying degrees of escalation available for this purpose, and several system catalog views to provide viable targets. Why would we want to forcefully cancel a query or disconnect a user?

To prevent utter havoc, should a user forget an important clause, a query could require several hours to complete. During this time, it is consuming an entire CPU and saturating the storage bandwidth while doing so. A buggy application could start a transaction and stop responding, leaving an idle transaction potentially holding locks and causing a wait backlog.

There are many reasons to evict a connection, and most of them revolve around maintaining a regular flow of queries. If we're unable to maintain low latency and high throughput, our work in building a highly available environment is wasted.

Getting ready

Luckily, PostgreSQL provides most of the tools we need. However, there is a more advanced command-line utility named tcpkill that we may need to use later. If it's not already installed, we recommend that you do so before continuing. Debian or Ubuntu-based systems can use this command as a root-capable user:

sudo apt-get install dsniff

How to do it...

The full escalation path starts very subtly to avoid major disruptive action. Try to follow these steps carefully, assuming eth0 is the network interface that PostgreSQL is using:

  1. Connect to the database as a superuser and execute the following query for PostgreSQL 9.2 and higher versions:
    SELECT pid, client_port, state,
           now() - query_start AS duration, query
      FROM pg_stat_activity
     WHERE now() - query_start > INTERVAL '2 seconds'
       AND state != 'idle'
     ORDER BY duration DESC;
  2. Use this query for 9.1 and lower versions:
    SELECT procpid AS pid, client_port,
           now() - query_start AS duration, current_query
      FROM pg_stat_activity
     WHERE now() - query_start > INTERVAL '2 seconds'
       AND current_query != '<IDLE>'
     ORDER BY duration DESC;
  3. Starting from the top, carefully examine the queries in this list. Make note of pid for any query that should be disconnected.
  4. Stop the currently executing query for the selected pids with the following query:
    SELECT pg_cancel_backend(pid);
  5. Execute the first query again and check the results for the targeted pid.
  6. If the query is still running or the state has switched to idle in transaction, execute the following query:
    SELECT pg_terminate_backend(pid);
  7. Execute the first query again and check the results for the targeted pid.
  8. If the query is still running, disconnect from the database and connect to the server as a root-capable user.
  9. Run the following command to terminate the client's network connection, using the contents of the client_port column:
    sudo tcpkill -i eth0 -9 port client_port
    
  10. Wait until the output from tcpkill resembles several identical lines.

How it works...

We begin the process by getting a list of every process ID, duration, and query currently running for longer than 2 seconds. Though 2 seconds is arbitrary; it helps filter out short and fast queries that we aren't interested in. If we examine the queries listed in these results, we may decide that one or more need to be canceled or disconnected. The results should resemble this output:

How it works...

If this is the case, the pid column conveys important information necessary to target the client connection. We begin by invoking pg_cancel_backend in an attempt to terminate the currently running query. Often, this is enough to clear locks or stop a query from consuming excessive resources. It's important to rerun the status query to ensure that the command successfully stopped the client's activity.

If the target connection is still active, we need to escalate to the next step: disconnect the client from the database. For this, we use pg_terminate_backend instead. This is roughly equivalent to using an operating system utility to terminate the client process, but it is something we can do directly from PostgreSQL. Again, we check for success using the status query, just in case.

In very rare cases, pg_terminate_backend can fail, and the client connection will remain unscathed. How is this possible? Networks, despite their apparent maturity, are notoriously unreliable. Misrouted packets, retransmissions, blocked sockets, timeouts, stalls, and more issues wait to disrupt the communication line between PostgreSQL and a connected client.

Sometimes the network socket is in such a state that PostgreSQL was interrupted while writing output. In this case, PostgreSQL is waiting for the client to acknowledge receipt of the data, or for the operating system to mark the network connection as broken. If this never happens, PostgreSQL will wait patiently forever until the client properly handles the terminate command.

This isn't ideal for us if the process is locking necessary tables or rows. If we can't get PostgreSQL to terminate the client, we need to use another approach. The tcpkill command gives us the ability to interrupt a network connection directly; this causes the operating system to close the network socket. When this happens, the PostgreSQL client exits automatically.

All we need to do is run tcpkill with the -i parameter to tell it about the network interface the database is using, the port to focus on, and how aggressive to be. We know the port from the client_port column of our status query, and specifying -9 tells tcpkill to block all incoming and outgoing packets so that there's no ambiguity regarding our intent.

The output from a tcpkill command should look like this towards the end:

127.0.0.10:5432 > 127.0.0.1:37601: R 315492496:315490496(0) win 0
127.0.0.10:5432 > 127.0.0.1:37601: R 315492538:315490538(0) win 0
127.0.0.10:5432 > 127.0.0.1:37601: R 315492622:315490622(0) win 0

It's important to not be impatient. Sometimes, it can take a minute or two before the connection finally dies.

There's more...

If a connected application encounters a bug and goes haywire, it might be convenient to disconnect several clients simultaneously. PostgreSQL lets us run query results through functions, so we could kill all connections that were idle in the transaction for at least 2 minutes by running this query as a superuser:

SELECT pg_terminate_backend(pid)

  FROM pg_stat_activity

 WHERE now() - query_start > INTERVAL '2 minutes'

   AND state = 'idle in transaction';

The pg_stat_activity view offers a lot of characteristics to differentiate target queries. We could terminate only connections from a specific IP address or those that connected to the database over a week ago. There is a lot of opportunity here to maintain a highly available system through direct intervention.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.152.159