Connection Pool Sizing: The Math and the Mistakes

The Symptom

The fare calculation endpoint handles 200 RPS with a p99 of 180ms. The team is satisfied. Product announces a New Year’s Eve promotion that will drive 5x normal traffic. The team scales the application from 3 pods to 15 pods. On New Year’s Eve, the system collapses at 600 RPS, not 1,000.

Each pod has a HikariCP pool of 10 connections. With 15 pods, that is 150 concurrent database connections. PostgreSQL’s max_connections is 100.

The Cause

Connection pool sizing involves two limits that most teams configure independently and discover interact painfully:

Application-side pool size: how many connections each pod can hold
Database-side connection limit: how many total connections PostgreSQL accepts

The defaults:

HikariCP maximum-pool-size: 10
PostgreSQL max_connections: 100

At 3 pods: 3 × 10 = 30 connections. Fine. At 15 pods: 15 × 10 = 150 connections. PostgreSQL rejects connections 101-150 with FATAL: too many connections for role "app".

The pod receives a connection rejection, HikariCP retries, the retry also fails, the request times out after connectionTimeout (30 seconds by default), and the rider stares at a spinner.

The Baseline

# load-tests/pool_sizing_locustfile.py
from locust import HttpUser, task, between, LoadTestShape

class FareUser(HttpUser):
    wait_time = between(0.5, 1.5)

    @task
    def estimate_fare(self):
        self.client.post(
            "/api/fares/estimate",
            json={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7580, "dropoff_lng": -73.9855
            },
            name="/api/fares/estimate"
        )


class RampLoadShape(LoadTestShape):
    stages = [
        {"duration": 60,  "users": 100, "spawn_rate": 20},
        {"duration": 120, "users": 300, "spawn_rate": 20},
        {"duration": 180, "users": 500, "spawn_rate": 20},
        {"duration": 240, "users": 700, "spawn_rate": 20},
    ]

    def tick(self):
        run_time = self.get_run_time()
        for stage in self.stages:
            if run_time < stage["duration"]:
                return (stage["users"], stage["spawn_rate"])
        return None

Results with pool size 10, 15 pods, PostgreSQL max_connections 100:

Step 1 (100 users):  p99=  180ms  Active connections: 28/150  Fail: 0.0%
Step 2 (300 users):  p99=  650ms  Active connections: 85/150  Fail: 0.0%
Step 3 (500 users):  p99= 4200ms  Active connections: 100/150 Fail: 8.2%
                                  ← PostgreSQL rejecting connections
Step 4 (700 users):  p99=28000ms  Active connections: 100/150 Fail: 34.1%
                                  ← Cascading timeouts

The Fix

Step 1: Right-size the application pool

The fare calculation query averages 45ms. The target is 1,000 RPS across all pods. With 6 pods (not 15, because right-sized pools reduce the need for many pods):

Per-pod target: $1000 / 6 = 167$ QPS

Required connections per pod: $167 \times 0.045 / 0.8 = 9.4 \approx 12$ (with headroom)

# SCALED: Right-sized pool for the fare calculation workload
spring:
  datasource:
    hikari:
      maximum-pool-size: 12
      minimum-idle: 6
      connection-timeout: 5000
      leak-detection-threshold: 10000
      pool-name: fare-pool

Step 2: Handle the PostgreSQL limit with PgBouncer

With 6 pods × 12 connections = 72 total connections, the default max_connections = 100 is sufficient. But adding auto-scaling changes the equation. If HPA scales to 12 pods, that is 144 connections.

PgBouncer sits between the application and PostgreSQL, pooling connections at the proxy level:

# pgbouncer.ini
[databases]
ride_hailing = host=postgres port=5432 dbname=ride_hailing

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

pool_mode = transaction           # Return connection after each transaction
max_client_conn = 500             # Accept up to 500 app connections
default_pool_size = 40            # Maintain 40 actual PG connections
reserve_pool_size = 10            # Extra 10 for bursts
reserve_pool_timeout = 3          # Wait 3s before using reserve pool
server_idle_timeout = 300

Transaction pooling mode returns the connection to the pool after each transaction, not after the session ends. This is critical for Spring WebFlux where a reactive chain might span multiple thread hops.

Step 3: Connection leak detection

The 3am page that nobody wants: HikariCP’s leakDetectionThreshold logs a warning when a connection is held longer than the threshold. For the fare calculation service, no query should hold a connection longer than 10 seconds.

// The code that caused the leak:
// BOTTLENECK: Forgotten close() in the driver matching service
@Service
public class DriverMatchingService {

    @Autowired
    private DataSource dataSource;

    public DriverMatch findBestMatch(RideRequest request) {
        Connection conn = dataSource.getConnection();
        try {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM drivers WHERE zone_id = ? AND status = 'available'");
            ps.setString(1, request.zoneId());
            ResultSet rs = ps.executeQuery();
            // Process results...
            return match;
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        // BUG: connection never closed in the happy path
        // Only closed if an exception is caught by a higher-level handler
    }
}

The fix is obvious once found: use try-with-resources. But finding it requires leak detection. Without leakDetectionThreshold, the pool slowly drains over hours until all connections are leaked and every request waits 30 seconds for the timeout.

The Proof

After right-sizing to 12 connections per pod, adding PgBouncer, and fixing the connection leak:

Before (pool=10, no PgBouncer, leak present):
  At 500 users: p99=4200ms, Fail=8.2%

After (pool=12, PgBouncer, leak fixed):
  At 500 users: p99= 280ms, Fail=0.0%
  At 700 users: p99= 420ms, Fail=0.0%
  At 1000 users: p99= 680ms, Fail=0.1%

  Pool utilization at 1000 users:
    Active: 9/12 per pod (75% utilization, 25% headroom)
    PgBouncer: 32/40 server connections (80% utilization)
    Wait time p99: 8ms (down from 4,200ms)

The connection pool is no longer the bottleneck. The p99 at 1,000 users (680ms) is now dominated by actual query execution and surge pricing computation, not by waiting for connections. That is a problem for the caching chapters (CH5-CH7) to solve.