New Attribution Meet Behavior.

Introducing the industry's first behavioral attribution.

Heap Blog

Decrypting PgBouncer’s Diagnostic Information

If you use Postgres at scale, at some point you’ll need a connection pooler. Postgres lets you configure a maximum number of concurrent queries via max_connections, but if you need to handle bursts of more than a few dozen connections at a time, you probably don’t want to provision a lot of connection slots to do so – this decreases the memory available per-connection and will lead to worse performance.

Instead, you typically use a connection pooler, which multiplexes a low number of underlying database connections and presents an interface that appears the same to a client. This way, your app can send bursts of thousands of queries at a time, and they’ll queue up in your connection pooler instead of swamping your database. A configuration in which a fixed, low number of database connections processes incoming queries will perform much better than one that allocates a huge number of connections that are only occasionally used.

We have been using PgBouncer as a connection pooler since 2014, and we love it. It’s easy to deploy and configure, and it presents an interface to database clients that is wire-compatible, so you can roll it out without changing your application code!

Monitoring PgBouncer

PgBouncer exposes some pseudo-tables for debugging and monitoring purposes. You can access the information they expose by connecting to a special pseudo-database and running diagnostic commands. One of the most important such diagnostics is SHOW POOLS, which you can use to understand the statuses of the connection pools in your PgBouncer instance.

Unfortunately, the output of SHOW POOLS can look somewhat cryptic, and this is aggravated by the lack of documentation about it. This was making it hard for us to debug some PgBouncer-related issues – there was even some mismatch between how different people on our team understood the same metrics. That’s why I decided to dive into the source code and try and make sense of the data provided.

Overview Of Pools

This is the output from the command SHOW POOLS when executed on one of our database nodes:

SHOW POOLS output

What do all these numbers mean? What should we be looking for? Do any invariants hold between them at all times? Say, is cl_active < cl_waiting, whatever these numbers mean? These are the kinds of questions that we’ll try to answer in this post.

Let’s start with some basic definitions.

  • pool – The docs never really define what a pool is, but you can think of it as a set of connections to the same database for a given user.
  • client connection – A client connection is a connection from some client to PgBouncer. In our case, a client could be our data ingestion code, a background job, or a customer query.
  • server connection – A server connection is a connection from PgBouncer to Postgres, meaning that each server connection in PgBouncer has a corresponding Postgres connection. This is not true for client connections.
  • CL_ACTIVE vs cl_active – as confusing as it might be, some of the terms in this post are case-sensitive since this is the way the PgBouncer code uses them. When you see lowercase cl_active, this refers to the number that SHOW POOLS reports (say, 2928 for the last pool in the screenshot). CL_ACTIVE, on the other hand, is a state that a connection can be in, defined in the [SocketState] enum in include/bouncer.h. While there is usually a one-to-one correspondence between the lowercase number and the uppercase state, that’s not always the case. (See the section on cl_waiting for an example.)

So client connections are pretty lightweight — almost free aside from a file descriptor. Server connections are heavy and need to be configured appropriately. Supporting lots of client connections with relatively few server connections is our main use case for PgBouncer.

Client Connections

There are 7 states that a client connection can be in, but we’ll discuss the two that SHOW POOLS informs us about as they are the most important from an administrative point of view.

Here is a state machine diagram of the different PgBouncer client states:

State machine diagram of PgBouncer client states

State machine transitions legend:

  1. A client tries to execute a query; a server connection is available to pair the client with and the query starts executing.
  2. A client tries to execute a query; a server connection is not available, so the client enters a waiting state.
  3. A server connection becomes available; a client is moved to an active state, from which it can then pair with the server.
  4. A client connection has successfully finished logging into PgBouncer.
  5. The query executing on a client connection completes; the client is moved to the connected state (which, in specific PgBouncer terms, means that the link property is cleared) and the server connection becomes available for other clients.

The states work as follows:

CL_LOGIN – this is the state for client connections that are currently in the process of logging in to PgBouncer. While there is only one actual PgBouncer state for this, the clients can be in two different sub-states depending on whether they are currently running a query against Postgres – we’ll call them CL_LOGIN(IN_PROGRESS) and CL_LOGIN(EXECUTING).

(Wondering why a client that is still logging in would be executing a query? Well, if the auth_user property is configured for the database the client is connecting to, PgBouncer may need to execute a query against pg_shadow to execute login checks. Since pg_shadow is a Postgres table, a server connection to Postgres is required in order to run queries on it.)

CL_ACTIVE – the “active client” state means that there is an established connection from a client to PgBouncer. Some clients in this state are executing queries, but others may not be. For ease of understanding, I have broken this up into a CL_ACTIVE(EXECUTING) state and a CL_ACTIVE(CONNECTED) state, but this split is not explicit in the PgBouncer codebase.

You can distinguish between clients in these two CL_ACTIVE sub-states by running SHOW CLIENTS. If the link value is set for a given connection, that means that the client connection is paired with a server connection, i.e. is in CL_ACTIVE(EXECUTING). If the pool_mode for the corresponding pool is transaction or statement, that means the client is currently executing a transaction (or statement, respectively); if it is session, then the client will stay connected to a server connection (occupying a Postgres connection) even if it’s not currently executing any queries.

CL_WAITING and CL_WAITING_LOGIN – the number displayed in SHOW POOLS as cl_waiting actually combines the counts for clients in these two states. In order for a client connection to enter one of these waiting states, the following sequence of events must happen:

  1. A client has connected to PgBouncer and is in state CL_ACTIVE(CONNECTED) or CL_LOGIN(IN_PROGRESS) (it’s not connected to Postgres yet. That is, it’s not paired with a server connection.)
  2. The client tries to execute a query. This requires PgBouncer to find an available server connection and pair it with our client connection. (In terms of states, this means it tries to enter the CL_ACTIVE(EXECUTING) or CL_LOGIN(EXECUTING) states – both of which are states we define in this post for clarity only.)
  3. PgBouncer fails to find an available server connection. This could happen, for example, if the maximum configured number of connections to Postgres has already been exhausted.
  4. The client connection enters the CL_WAITING (or CL_WAITING_LOGIN, respectively) state.

This is represented by a transition of type 2. in the diagram.

So what happens when a client enters a waiting state? Unsurprisingly, it starts waiting for a server connection to be available. If creating more server connections is allowed (the definition of “allowed” is briefly discussed in the section on Configurable Limits below), PgBouncer will periodically create new ones, and when it does so, it will pick a client that is in a waiting state and activate it. Alternatively, when a client that is currently CL_EXECUTING (or CL_EXECUTING_LOGIN) finishes executing a query, the corresponding server connection is freed and a waiting client is activated again.

Here are some conclusions we can draw from the above definitions:

  • Having a large cl_active value is not a reason for concern by itself. This is because, as discussed above, this represents the number of clients that have successfully logged in to PgBouncer, not the number of clients that are running a query. The only clients running queries are the ones we deemed CL_EXECUTING, or with link set in SHOW CLIENTS.
  • Having a large cl_waiting value can be indicative of problems. If the large value persists for a long time, this could mean that there are consistently more clients trying to execute queries in Postgres than we’ve allowed, or that there are long-running queries using up all the server connections.
  • There’s no direct connection between the values for cl_active and cl_waiting – they account for non-overlapping states of client connections.

Server Connections

  • SV_ACTIVE – A server connection (that is, one from PgBouncer to Postgres) being active means that it’s established and paired with a client connection. Note that this doesn’t necessarily mean that it’s actively executing a query. In the session pooling mode, the server connection will stay active until the client connected to PgBouncer explicitly disconnects.
  • Unlike active client connections, an SV_ACTIVE server connection is always paired with a client connection and has link set.
  • SV_IDLE – A connection from PgBouncer to Postgres that is established and ready to serve clients, and is not currently paired with a client connection.
  • SV_USED – PgBouncer supports checking connections to Postgres regularly to make sure they are still alive. After a server connection has been idle for longer than **server_check_delay**, it’s moved to this state, indicating that it needs to be checked before using it again.
  • SV_TESTED – This state means that the server connection is currently being checked (or will be checked immediately). (Compared to the above, SV_USED only indicates that the connection should be checked, whereas SV_TESTED indicates that it is being checked right now.)
  • SV_LOGIN – A server connection currently trying to connect to Postgres.

Some more thoughts we can get from the above:

  • cl_executing + cl_executing_login == sv_active – if there was a count reported for clients in our made-up CL_EXECUTING (and CL_EXECUTING_LOGIN) state, it would be equal to the number of active server connections. This is because, by definition, a client connection in an executing state is paired with a server connection; and there are no other clients paired with a server connection.
  • Is sv_active <= cl_active? That is, are we guaranteed that the number of active server connections is less than or equal to the number of active client connections? That’s usually the case, but not always. As mentioned immediately above (and back in the discussion on CL_WAITING_LOGIN), it’s possible for a client connection in the CL_LOGIN state to be paired with a server connection and be executing a query. Thus, it’s possible, for example, to have all active clients executing a query, and a CL_LOGIN client executing one as well. This will mean that sv_active = cl_active + 1, making our inequality above not hold in certain cases.

Configurable Limits

PgBouncer comes with a lot of tunables. Here is a brief overview of the most important ones related to connection pool sizing.

  • pool_size – This setting controls how many server connections can be created for a given pool. Note that there can be multiple pools for a given database. A pool is created for each different user connecting to each different database defined in the [databases] section of pgbouncer.ini. (If you specify the user for a given database explicitly in [databases], it will have a single pool only.) This means that sum(sv_*) <= pool_size, where sv_* are the sv_active, sv_idle, etc. numbers reported for a given pool, and pool_size is the respective pool size limit.
  • max_db_connections – total maximum number of server connections for a given PgBouncer database.
  • max_user_connections – total maximum number of server connections for a given user.
  • max_client_conn – The maximum number of client connections to PgBouncer – that is, the sum of cl_* for all pools is less than or equal to max_client_conn. There doesn’t seem to be a setting that allows controlling this at the pool level.

Failure Scenarios

Let’s look at a few different ways PgBouncer might be misbehaving, and how the above definitions could help us figure out what’s going on.

High cl_waiting, high sv_active – Let’s start with a situation where sv_active is close to the pool_size. As discussed in the section on CL_WAITING, there are two ways a client connection can get out of a waiting state – either by PgBouncer creating more server connections to Postgres, or when a connection in the CL_EXECUTING (made-up) state finishes executing its query. Since sv_active is high (and there are probably other server connections as well), PgBouncer can’t just create more server connections and grant them to clients. Instead, it needs to wait for running queries to complete, and then grant the now free server connections. Therefore, this might be indicative of some of the following problems:

  • Queries are running for too long, occupying server connections and blocking queued clients from connecting. This could be because of a temporary spike, or someone running a bunch of huge queries at the same time. If it persists for a long time, the queued client connections might start timing out. Possible ways to solve this would be to set a strict query timeout to prevent outliers from taking up all resources, or designing your queries so none of them are significantly slower than the others.
  • You constantly have more queries coming into PgBouncer than Postgres can handle. If every one of your queries takes 1 second to complete, and you have 120 queries coming in per second with a pool_size of 100, you’ll have a backlog of queries growing at a rate of 20 queries per second. You’ll usually need to tune your queries or Postgres to solve this.

High cl_waiting, low sv_active – I.e., the same scenario as above, except sv_active is low compared to pool_size. If this happens for a short time period, it might just mean that your load is spiky and PgBouncer can’t create new connections to Postgres fast enough. You might want to tweak min_pool_size to make sure your server connections don’t fall too low even if no clients are executing queries at any given moment.

If this situation is prolonged, there might be something preventing PgBouncer from creating new server connections. To give some examples:

  • The sum of the pool_sizes for all pools for a given database is larger than max_db_connections for the same database. Then, even though there appear to be free slots for a pool, PgBouncer can’t really create more connections.
  • Postgres is configured with max_connections lower than the pool_size – or, alternatively, there are clients connecting directly to Postgres that are not accounted for in sv_active.

High cl_active – This is not, by itself, a sign of problems. PgBouncer should be capable of servicing many thousands of clients being connected at the same time, since as mentioned above, they are not necessarily connected to a server connection.

Code Organization

This is a brief overview of the code structure in case anyone is curious to dive deeper.

  • include/bouncer.h – the different connection states are defined here.
  • src/objects.c – most of the interesting stuff happens here. This is where clients are paired with servers, clients are activated when a server becomes available, and so forth.
  • src/client.c and src/server.c – there are some client- and server-specific functionalities in these, respectively – for example, handling client authentication, or parsing client / server queries and responses.
  • src/janitor.c – the code for a number of periodic cleanup tasks: disconnecting idle server connections, rechecking connections, etc.
  • src/takeover.c – PgBouncer allows you to hot-reload your configuration, without dropping connections. This files contains the logic for doing so.

If you like getting deep into the guts of the tools you use, we are hiring! Or, if you have any questions, feel free to reach out to me on Twitter: @IvanVergiliev.

Ivan Vergiliev