Skip to main content
fast by design

JDBC Connection Pool Sizing with HikariCP

10 min read Chapter 47 of 90

JDBC Connection Pool Sizing with HikariCP

The main chapter showed that a pool of 10 connections outperforms a pool of 100 on a 16-core database server. This section provides the methodology: how to calculate the right pool size for any workload, how HikariCP’s internals affect that calculation, and how to reduce connection hold time so a smaller pool handles more load.

Queuing Theory for Connection Pools

A connection pool is a classic bounded resource with a queue. When all connections are in use, new requests wait. The system behaves as an M/M/c queue, where:

  • M (arrivals): Requests arrive following a Poisson process
  • M (service): Service time (connection hold time) is exponentially distributed
  • c (servers): The pool has $c$ connections

The Erlang C formula gives the probability that a request must wait:

$$P_w = \frac{\frac{(c \rho)^c}{c!} \cdot \frac{1}{1 - \rho}}{\sum_{k=0}^{c-1} \frac{(c \rho)^k}{k!} + \frac{(c \rho)^c}{c!} \cdot \frac{1}{1 - \rho}}$$

Where $\rho = \lambda W / c$ is the utilization per connection, $\lambda$ is the arrival rate, and $W$ is the average service time.

For the content platform:

  • $\lambda$ = 18,000 queries/s
  • $W$ = 2.6 ms = 0.0026 s
  • Target: $P_w < 0.01$ (less than 1% of requests should wait)

Solving for $c$ with $P_w = 0.01$:

$$\rho = \frac{18{,}000 \times 0.0026}{c} = \frac{46.8}{c}$$

The system is stable when $\rho < 1$, which requires $c > 46.8$. But this is for a single node handling all traffic. With 8 nodes, each handles $\lambda / 8 = 2{,}250$ queries/s:

$$\rho = \frac{2{,}250 \times 0.0026}{c} = \frac{5.85}{c}$$

For $P_w < 0.01$ with $\rho = 5.85/c$, numerical evaluation gives $c \approx 9$. With headroom for bursts: $c = 10$.

This matches the empirical result from the main chapter. The formula is not a guess; it is a derivation from first principles.

HikariCP Internals: Why It Is Fast

HikariCP is consistently the fastest JDBC connection pool in benchmarks. Three design decisions explain the performance:

ConcurrentBag: Lock-Free Connection Retrieval

Most connection pools use a lock-protected list to manage connections. HikariCP uses a ConcurrentBag data structure with thread-local affinity:

// Simplified ConcurrentBag behavior
public class ConcurrentBag<T> {
    // Each thread has a list of connections it has used before
    private final ThreadLocal<List<WeakReference<T>>> threadList =
        ThreadLocal.withInitial(ArrayList::new);
    // Shared list for cross-thread borrowing
    private final CopyOnWriteArrayList<T> sharedList =
        new CopyOnWriteArrayList<>();
    // Synchronization for waiting threads
    private final SynchronousQueue<T> handoffQueue =
        new SynchronousQueue<>(true); // Fair mode

    public T borrow(long timeout, TimeUnit unit) {
        // Step 1: Check thread-local list (no contention)
        List<WeakReference<T>> local = threadList.get();
        for (int i = local.size() - 1; i >= 0; i--) {
            T item = local.get(i).get();
            if (item != null && tryReserve(item)) {
                return item;
            }
        }

        // Step 2: Scan shared list (minimal contention)
        for (T item : sharedList) {
            if (tryReserve(item)) {
                threadList.get().add(new WeakReference<>(item));
                return item;
            }
        }

        // Step 3: Wait for handoff from returning thread
        return handoffQueue.poll(timeout, unit);
    }
}

The thread-local affinity means a thread that repeatedly borrows and returns a connection will get the same physical connection. This is cache-friendly: the connection’s internal buffers (statement cache, network buffers) stay hot in the thread’s L1/L2 cache. Under typical workloads, Step 1 succeeds 90%+ of the time, making connection acquisition a zero-contention operation.

Connection Validation: Aliveness Check

JDBC pools must verify that connections are still alive before handing them out. Stale connections (killed by the database, dropped by a network timeout) cause query failures. HikariCP uses Connection.isValid(timeout) which executes a lightweight protocol-level ping, not a full SQL query.

// HikariCP validates connections on borrow
// The JDBC4 isValid() call is significantly faster than executing "SELECT 1"
config.setConnectionTestQuery(null); // Use JDBC4 isValid(), not SQL
config.setValidationTimeout(1000);   // 1 second validation timeout

The validation overhead per connection borrow:

MethodLatency
Connection.isValid() (JDBC4)0.01 ms
SELECT 10.3 ms
No validation0 ms

At 18,000 queries/s, SELECT 1 validation adds 5.4 seconds of database time per second. JDBC4 isValid() adds 0.18 seconds. The difference is significant: SELECT 1 requires a full SQL parse-plan-execute cycle, while isValid() is a TCP-level ping.

Connection Lifecycle

HikariCP enforces connection lifecycle limits to prevent stale connections:

config.setMaxLifetime(1800000);    // 30 minutes
config.setKeepaliveTime(300000);   // 5 minute keepalive
config.setIdleTimeout(600000);     // 10 minute idle timeout
  • maxLifetime: Forcibly closes connections older than this, regardless of activity. This prevents issues with database server restarts, firewall timeouts, and PostgreSQL’s idle_in_transaction_session_timeout. Set this 30 seconds shorter than any timeout the connection might hit.
  • keepaliveTime: Sends a validation ping to idle connections. Prevents firewall/load balancer idle connection drops.
  • idleTimeout: Closes connections that have been idle longer than this threshold (only when pool is above minimumIdle). For fixed-size pools where minimumIdle = maximumPoolSize, this setting has no effect.

HikariCP staggers maxLifetime by adding a random jitter of up to 2.5% to prevent all connections from recycling simultaneously. Without this, a pool created at startup would have all 10 connections expire within milliseconds of each other, causing a burst of connection creation.

Reducing Connection Hold Time

The pool size formula depends on $W$ (average connection hold time). Reducing $W$ means fewer connections handle the same throughput. The fastest way to reduce hold time: do less work while holding the connection.

// SLOW: Connection held during non-database work
public ArticleResponse getArticleWithMetrics(String id) throws Exception {
    try (Connection conn = dataSource.getConnection()) {
        Article article = queryArticle(conn, id);     // 2ms
        ViewCount count = queryViewCount(conn, id);    // 1ms
        List<Tag> tags = queryTags(conn, id);          // 1ms
        // Connection held during serialization (non-DB work)
        ArticleResponse response = buildResponse(article, count, tags); // 3ms
        return response;
    }
    // Total hold time: 7ms (4ms DB + 3ms non-DB)
}

// FAST: Connection released before non-database work
public ArticleResponse getArticleWithMetrics(String id) throws Exception {
    Article article;
    ViewCount count;
    List<Tag> tags;
    try (Connection conn = dataSource.getConnection()) {
        article = queryArticle(conn, id);     // 2ms
        count = queryViewCount(conn, id);      // 1ms
        tags = queryTags(conn, id);            // 1ms
    }
    // Connection returned to pool; serialization runs without holding it
    return buildResponse(article, count, tags); // 3ms
    // Total hold time: 4ms (only DB work)
}

Reducing hold time from 7 ms to 4 ms reduces the required pool size by 43%:

$$L_{before} = \lambda \times 0.007 = 2{,}250 \times 0.007 = 15.75$$ $$L_{after} = \lambda \times 0.004 = 2{,}250 \times 0.004 = 9.0$$

Rounding up: the pool shrinks from 16 to 10 connections. Those 6 freed connections represent 6 fewer PostgreSQL processes, less context switching, and lower database CPU utilization.

Transaction Scope Minimization

Transactions amplify hold time because the connection is held for the entire transaction duration:

// SLOW: Broad transaction scope holds connection during external calls
@Transactional
public void publishArticle(String id) throws Exception {
    Article article = articleRepository.findById(id);      // 2ms
    article.setStatus(Status.PUBLISHED);
    articleRepository.save(article);                       // 3ms
    searchIndex.indexArticle(article);                     // 50ms (HTTP call!)
    notificationService.notifySubscribers(article);        // 30ms (HTTP call!)
    // Connection held for 85ms total
}

// FAST: Minimal transaction scope, external calls outside transaction
public void publishArticle(String id) throws Exception {
    Article article = updateArticleStatus(id);  // Transaction: 5ms
    // These run without holding a database connection
    searchIndex.indexArticle(article);           // 50ms (HTTP call)
    notificationService.notifySubscribers(article); // 30ms (HTTP call)
}

@Transactional
Article updateArticleStatus(String id) {
    Article article = articleRepository.findById(id);
    article.setStatus(Status.PUBLISHED);
    return articleRepository.save(article);
    // Connection held for 5ms, returned immediately after commit
}

The broad transaction holds a connection for 85 ms. The narrow transaction holds it for 5 ms. At 100 publishes/second, the broad approach requires $100 \times 0.085 = 8.5$ connections just for publishing. The narrow approach requires $100 \times 0.005 = 0.5$ connections. Those 8 freed connections are available for the 18,000 queries/second that are the primary workload.

HTTP calls inside transactions are a common anti-pattern. The transaction provides no benefit (the HTTP call is not transactional), but it holds a database connection hostage for the duration of the network call. If the HTTP service is slow or down, the connection pool drains within seconds.

Pool Sizing Under Varying Workloads

The content platform has two workload patterns:

  1. Normal: 15,000 req/s with 1.2 queries/req, Zipfian distribution
  2. Reindexing: 500 req/s user traffic + 2,000 req/s reindexer scanning all articles

The reindexer uses long-running queries (batch reads of 100 articles at 20 ms per batch). During reindexing, the connection profile changes:

MetricNormalReindexing
Queries/s per node2,2502,530
Avg hold time2.6 ms8.1 ms
Required pool size621

The fixed pool of 10 connections is undersized during reindexing. The team has two options:

Option A: Separate pools for user traffic and reindexer.

@Bean("userDataSource")
public DataSource userDataSource() {
    HikariConfig config = new HikariConfig();
    config.setPoolName("user-pool");
    config.setMaximumPoolSize(10);
    config.setMinimumIdle(10);
    return new HikariDataSource(config);
}

@Bean("reindexerDataSource")
public DataSource reindexerDataSource() {
    HikariConfig config = new HikariConfig();
    config.setPoolName("reindexer-pool");
    config.setMaximumPoolSize(5);
    config.setMinimumIdle(2); // Not fixed; scales with demand
    return new HikariDataSource(config);
}

Separate pools prevent the reindexer from starving user traffic. The user pool’s 10 connections are reserved for interactive queries. The reindexer pool’s 5 connections are dedicated to batch operations. Total: 15 connections per node, 120 across the cluster.

Option B: Rate-limit the reindexer to fit within the existing pool.

public class RateLimitedReindexer {
    private final RateLimiter rateLimiter = RateLimiter.create(100); // 100 batches/s

    public void reindex() {
        List<String> allArticleIds = getAllArticleIds();
        Lists.partition(allArticleIds, 100).forEach(batch -> {
            rateLimiter.acquire();
            reindexBatch(batch);
        });
    }
}

Rate-limiting keeps the reindexer’s connection demand within the pool’s capacity. The reindex takes longer (hours instead of minutes), but user traffic is unaffected.

The content platform uses Option A with separate pools. The reindexer runs during off-peak hours, and the dedicated pool prevents any interference with user-facing queries.

Diagnosing Pool Exhaustion

When hikaricp.connections.pending stays above 0, threads are waiting for connections. The diagnosis flowchart:

  1. Check active connections: If active equals maximum, the pool is saturated.
  2. Check hold time P99: If P99 is significantly higher than average, a few slow queries or leaked connections are dominating.
  3. Check for leaks: If connections.active is at maximum but the database shows fewer active queries, connections are leaked (acquired but not executing queries).
  4. Check database CPU: If database CPU is above 80%, adding connections will make things worse, not better.
  5. Check for long transactions: Query pg_stat_activity for long-running transactions holding connections idle.
-- Find long-running queries holding connections
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
  AND state != 'idle'
ORDER BY duration DESC;

-- Find idle-in-transaction connections (potential leaks)
SELECT pid, now() - pg_stat_activity.xact_start AS duration,
       query, state
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND (now() - pg_stat_activity.xact_start) > interval '10 seconds'
ORDER BY duration DESC;

On the content platform, the most common pool exhaustion cause is not insufficient connections. It is a single slow query (missing index, unoptimized join) holding a connection for 500 ms instead of 2 ms. One slow query at 100 req/s consumes $100 \times 0.5 = 50$ connections. Fixing the query (adding an index, reducing hold time to 5 ms) reduces the demand to $100 \times 0.005 = 0.5$ connections. The fix is never “increase the pool size.” The fix is “reduce the hold time.”