There comes a time in every DBA's life when they must disconnect a PostgreSQL client from the server; for us, that time is now. There are varying degrees of escalation available for this purpose, and several system catalog views to provide viable targets. Why would we want to forcefully cancel a query or disconnect a user?
To prevent utter havoc, should a user forget an important clause, a query could require several hours to complete. During this time, it is consuming an entire CPU and saturating the storage bandwidth while doing so. A buggy application could start a transaction and stop responding, leaving an idle transaction potentially holding locks and causing a wait backlog.
There are many reasons to evict a connection, and most of them revolve around maintaining a regular flow of queries. If we're unable to maintain low latency and high throughput, our work in building a highly available environment is wasted.
Luckily, PostgreSQL provides most of the tools we need. However, there is a more advanced command-line utility named tcpkill
that we may need to use later. If it's not already installed, we recommend that you do so before continuing. Debian or Ubuntu-based systems can use this command as a root-capable user:
sudo apt-get install dsniff
The full escalation path starts very subtly to avoid major disruptive action. Try to follow these steps carefully, assuming eth0
is the network interface that PostgreSQL is using:
SELECT pid, client_port, state, now() - query_start AS duration, query FROM pg_stat_activity WHERE now() - query_start > INTERVAL '2 seconds' AND state != 'idle' ORDER BY duration DESC;
SELECT procpid AS pid, client_port, now() - query_start AS duration, current_query FROM pg_stat_activity WHERE now() - query_start > INTERVAL '2 seconds' AND current_query != '<IDLE>' ORDER BY duration DESC;
pid
for any query that should be disconnected.SELECT pg_cancel_backend(pid);
pid
.SELECT pg_terminate_backend(pid);
pid
.client_port
column:sudo tcpkill -i eth0 -9 port client_port
tcpkill
resembles several identical lines.We begin the process by getting a list of every process ID, duration, and query currently running for longer than 2 seconds. Though 2 seconds is arbitrary; it helps filter out short and fast queries that we aren't interested in. If we examine the queries listed in these results, we may decide that one or more need to be canceled or disconnected. The results should resemble this output:
If this is the case, the pid
column conveys important information necessary to target the client connection. We begin by invoking pg_cancel_backend
in an attempt to terminate the currently running query. Often, this is enough to clear locks or stop a query from consuming excessive resources. It's important to rerun the status query to ensure that the command successfully stopped the client's activity.
If the target connection is still active, we need to escalate to the next step: disconnect the client from the database. For this, we use pg_terminate_backend
instead. This is roughly equivalent to using an operating system utility to terminate the client process, but it is something we can do directly from PostgreSQL. Again, we check for success using the status query, just in case.
In very rare cases, pg_terminate_backend
can fail, and the client connection will remain unscathed. How is this possible? Networks, despite their apparent maturity, are notoriously unreliable. Misrouted packets, retransmissions, blocked sockets, timeouts, stalls, and more issues wait to disrupt the communication line between PostgreSQL and a connected client.
Sometimes the network socket is in such a state that PostgreSQL was interrupted while writing output. In this case, PostgreSQL is waiting for the client to acknowledge receipt of the data, or for the operating system to mark the network connection as broken. If this never happens, PostgreSQL will wait patiently forever until the client properly handles the terminate command.
This isn't ideal for us if the process is locking necessary tables or rows. If we can't get PostgreSQL to terminate the client, we need to use another approach. The tcpkill
command gives us the ability to interrupt a network connection directly; this causes the operating system to close the network socket. When this happens, the PostgreSQL client exits automatically.
All we need to do is run tcpkill
with the -i
parameter to tell it about the network interface the database is using, the port to focus on, and how aggressive to be. We know the port from the client_port
column of our status query, and specifying -9
tells tcpkill
to block all incoming and outgoing packets so that there's no ambiguity regarding our intent.
The output from a tcpkill
command should look like this towards the end:
127.0.0.10:5432 > 127.0.0.1:37601: R 315492496:315490496(0) win 0 127.0.0.10:5432 > 127.0.0.1:37601: R 315492538:315490538(0) win 0 127.0.0.10:5432 > 127.0.0.1:37601: R 315492622:315490622(0) win 0
It's important to not be impatient. Sometimes, it can take a minute or two before the connection finally dies.
If a connected application encounters a bug and goes haywire, it might be convenient to disconnect several clients simultaneously. PostgreSQL lets us run query results through functions, so we could kill all connections that were idle in the transaction for at least 2 minutes by running this query as a superuser:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE now() - query_start > INTERVAL '2 minutes' AND state = 'idle in transaction';
The pg_stat_activity
view offers a lot of characteristics to differentiate target queries. We could terminate only connections from a specific IP address or those that connected to the database over a week ago. There is a lot of opportunity here to maintain a highly available system through direct intervention.
3.133.152.159