Pool exhaustion is not database latency.

A short explanation of why your dashboards lie about which layer is slow, with four checks that tell pool exhaustion apart from real database latency.

A team called us because their API was slow under load and their database dashboard was clean. CPU low, disk idle, query times flat. Every graph said the database was fine, and yet every request that touched it was taking two seconds. They had spent a week tuning queries that were already fast.

The database was fine. The connection pool was not.

The shape of the lie

When a request waits for a connection from an exhausted pool, that wait is not database time. The query itself, once it runs, is as fast as ever, so your query-time metric stays flat. The two seconds are spent in your own process, in a queue, before a single byte reaches the database. The dashboard measures the wrong interval.
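The effect is easy to reproduce without a database. The sketch below is a toy, not a real driver: `ToyPool`, `simulatedRequest`, and `simulate` are hypothetical names, and a `setTimeout` sleep stands in for the query. With a pool of one and four concurrent requests, the measured query time stays flat while the acquire wait grows with each request in line.

```typescript
// Toy pool: a counting semaphore with a FIFO wait queue (illustration only).
class ToyPool {
  private free: number;
  private waiters: Array<() => void> = [];

  constructor(size: number) {
    this.free = size;
  }

  async acquire(): Promise<void> {
    if (this.free > 0) {
      this.free--;
      return;
    }
    // Pool exhausted: the caller waits here, inside our own process.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to the oldest waiter
    else this.free++;
  }
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

interface Timing {
  acquireMs: number; // time spent waiting for a connection
  queryMs: number; // time spent "running the query"
}

async function simulatedRequest(pool: ToyPool, queryMs: number): Promise<Timing> {
  const t0 = performance.now();
  await pool.acquire();
  const t1 = performance.now();
  try {
    await sleep(queryMs); // stands in for the query; its cost never changes
  } finally {
    pool.release();
  }
  const t2 = performance.now();
  return { acquireMs: t1 - t0, queryMs: t2 - t1 };
}

// Start `requests` requests at once against a pool of `poolSize`.
async function simulate(poolSize: number, requests: number, queryMs: number): Promise<Timing[]> {
  const pool = new ToyPool(poolSize);
  return Promise.all(
    Array.from({ length: requests }, () => simulatedRequest(pool, queryMs)),
  );
}
```

Calling `simulate(1, 4, 20)` gives four timings whose `queryMs` all sit near 20 while `acquireMs` climbs by roughly one query duration per request ahead in the queue. A dashboard that only plots `queryMs` sees nothing wrong.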

The slowest part of a request is often the part no one instrumented.

This is why connection-pool problems are so durable. The one place you would look says everything is healthy, because by its own definition it is.

Four checks that tell them apart

The four measurements that distinguish pool exhaustion from real database latency:

  • Connection acquire time. Instrument the wait to check out a connection separately from query execution. If this rises under load, you have found it.
  • Pool saturation. Track in-use connections against pool size. Sustained use near the ceiling is the precondition for every acquire wait above.
  • Query time at the database. From pg_stat_statements, not from the application. If this is flat while the app is slow, the app is the slow part.
  • Hold time per checkout. A connection held across a slow external call starves everyone else, even at low query volume.
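A minimal sketch of how the other three checks might be collected, under stated assumptions: the pool exposes only `acquire` and `release`, the metrics client has `observe` plus a hypothetical `gauge` method, and the metric names `pool.saturation` and `pool.hold_ms` are ours for illustration. The pool size is passed in because not every pool exposes it. The SQL uses PostgreSQL 13+ column names; older versions call the column `mean_time`.

```typescript
interface Metrics {
  observe(name: string, value: number): void;
  gauge(name: string, value: number): void; // assumed; not every client has this
}

interface Pool<C> {
  acquire(): Promise<C>;
  release(conn: C): void;
}

// Wraps a pool so every checkout reports acquire wait, saturation, and hold time.
function instrument<C>(pool: Pool<C>, size: number, metrics: Metrics) {
  let inUse = 0; // counts only checkouts made through this wrapper

  return async function withConnection<T>(fn: (c: C) => Promise<T>): Promise<T> {
    const waitStart = performance.now();
    const conn = await pool.acquire();
    const holdStart = performance.now();
    metrics.observe("pool.acquire_ms", holdStart - waitStart);

    inUse++;
    metrics.gauge("pool.saturation", inUse / size); // 1.0 means the pool is full

    try {
      return await fn(conn);
    } finally {
      // Hold time covers everything done while the connection was checked out,
      // including any slow non-database work inside fn.
      metrics.observe("pool.hold_ms", performance.now() - holdStart);
      inUse--;
      metrics.gauge("pool.saturation", inUse / size);
      pool.release(conn);
    }
  };
}

// Query time as the database sees it. Requires the pg_stat_statements
// extension to be enabled; run this with whatever client you already use.
const SLOWEST_STATEMENTS_SQL = `
  SELECT query, calls, mean_exec_time
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10`;
```

If `pool.hold_ms` is much larger than the mean execution time the database reports for the same statements, the difference is time the connection spent checked out but idle.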

Here is the instrumentation we add first, because it is the cheapest signal:

pool.ts
async function withConnection<T>(fn: (c: Conn) => Promise<T>): Promise<T> {
  const start = performance.now();
  const conn = await pool.acquire();
+ metrics.observe("pool.acquire_ms", performance.now() - start);
  try {
    // Release even if fn throws; a leaked connection shrinks the pool for good.
    return await fn(conn);
  } finally {
    pool.release(conn);
  }
}

Once pool.acquire_ms is on a graph next to query time, the diagnosis takes minutes instead of a week. The fix is usually boring: a larger pool, a shorter hold, or moving a slow external call out from under a held connection. [1]
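The last of those fixes is the least obvious in code, so here is a before-and-after sketch. Everything in it is hypothetical: the `Conn` shape, the `orders` table, the SQL, and the `chargeCard` function are stand-ins, and it assumes the two updates can be committed separately. If they previously shared a transaction, splitting them changes what happens on failure, and that needs its own handling.

```typescript
type Conn = { query(sql: string, params?: unknown[]): Promise<unknown> };
type WithConnection = <T>(fn: (c: Conn) => Promise<T>) => Promise<T>;
type ChargeCard = (orderId: string) => Promise<string>;

// Before: one checkout spans the external call. The connection does no
// database work while chargeCard runs, but nobody else can use it.
async function checkoutHoldingConnection(
  withConnection: WithConnection,
  chargeCard: ChargeCard,
  orderId: string,
): Promise<string> {
  return withConnection(async (conn) => {
    await conn.query("UPDATE orders SET status = 'pending' WHERE id = $1", [orderId]);
    const receipt = await chargeCard(orderId); // slow third-party call
    await conn.query("UPDATE orders SET status = 'paid' WHERE id = $1", [orderId]);
    return receipt;
  });
}

// After: two short checkouts with the external call between them.
// The pool gets the connection back for the whole duration of chargeCard.
async function checkoutReleasingConnection(
  withConnection: WithConnection,
  chargeCard: ChargeCard,
  orderId: string,
): Promise<string> {
  await withConnection((conn) =>
    conn.query("UPDATE orders SET status = 'pending' WHERE id = $1", [orderId]),
  );
  const receipt = await chargeCard(orderId); // no connection held here
  await withConnection((conn) =>
    conn.query("UPDATE orders SET status = 'paid' WHERE id = $1", [orderId]),
  );
  return receipt;
}
```

The second version pays for two acquires instead of one, which is cheap when the pool is healthy and far cheaper than holding a connection through a slow network call.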

Footnotes

  1. The team in the opening was holding a connection across a third-party payment call that took 1.8 seconds. Moving the call outside the checkout returned the pool to health without changing the pool size at all.