Timeout Anti-Patterns

Configuring a timeout is not the same as configuring it correctly. These are the mistakes that pass code review and fail in production.

Anti-Pattern: The 30-Second Timeout

// DANGEROUS DEFAULT
var factory = new SimpleClientHttpRequestFactory();
factory.setReadTimeout(Duration.ofSeconds(30));

A 30-second read timeout means a slow dependency holds your thread for 30 seconds. At 100 requests per second, your 200-thread pool fills in 2 seconds. A 30-second timeout is functionally identical to no timeout for the purpose of preventing cascading failures.
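The arithmetic behind those numbers is Little's law: threads in use equal the arrival rate multiplied by how long each request holds a thread. A minimal sketch of that calculation follows; `PoolSaturation` and its method names are hypothetical helpers, not part of any library, and the inputs are the figures from the paragraph above.

```java
import java.time.Duration;

/** Back-of-the-envelope thread pool math (illustrative helper, not a library class). */
final class PoolSaturation {

    /** Threads needed in steady state: arrival rate x hold time (Little's law). */
    static double threadsNeeded(double requestsPerSecond, Duration holdTime) {
        return requestsPerSecond * holdTime.toMillis() / 1000.0;
    }

    /** Seconds until the pool is full if no held thread is released in the meantime. */
    static double secondsToFill(int poolSize, double requestsPerSecond) {
        return poolSize / requestsPerSecond;
    }

    /** Highest request rate the pool can sustain when every request holds a thread for holdTime. */
    static double maxThroughput(int poolSize, Duration holdTime) {
        return poolSize * 1000.0 / holdTime.toMillis();
    }
}
```

With a 30-second hold time, `threadsNeeded(100, Duration.ofSeconds(30))` gives 3,000 threads against a pool of 200, and `secondsToFill(200, 100)` gives the 2 seconds quoted above.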

The timeout must be derived from the dependency’s expected latency, not from a round number that “feels safe.”

// EXPLICIT AND INTENTIONAL
// Fraud detection normal p99: 120ms
// Fraud detection degraded p99: 5,000ms
// Timeout set to 2x normal p99 = 240ms, rounded up to 500ms for safety
// This means we reject requests during severe degradation (>500ms)
// rather than absorbing them and losing threads
factory.setReadTimeout(Duration.ofMillis(500));

The 500ms timeout causes almost no failures while fraud detection is at its normal p99 of 120ms (120ms < 500ms). When fraud detection degrades to a p99 of 5,000ms, almost all requests will time out. That is the correct behavior: fail fast and protect the payment service’s thread pool.
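The derivation in the comments above (measure the normal p99, multiply by a headroom factor, round up) can be written down so the number is reproducible rather than remembered. This is a sketch under assumptions: `TimeoutFromLatency` is a hypothetical helper, the percentile uses the simple nearest-rank method, and the 2x factor and 500ms rounding step are the ones from the example, not general recommendations.

```java
import java.time.Duration;
import java.util.Arrays;

/** Derives a timeout from observed latency (illustrative helper, not a library class). */
final class TimeoutFromLatency {

    /** Nearest-rank percentile of latency samples in milliseconds; percentile is in (0, 100]. */
    static long percentileMs(long[] samplesMs, double percentile) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Multiplies the normal p99 by a headroom factor, then rounds up to the next multiple of stepMs. */
    static Duration deriveTimeout(long normalP99Ms, double headroom, long stepMs) {
        long raw = (long) Math.ceil(normalP99Ms * headroom);
        long rounded = ((raw + stepMs - 1) / stepMs) * stepMs;
        return Duration.ofMillis(rounded);
    }
}
```

For the fraud detection figures, `deriveTimeout(120, 2.0, 500)` yields 500ms: 2 x 120ms = 240ms, rounded up to the next 500ms step.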

Anti-Pattern: Connection Timeout Without Read Timeout

// DANGEROUS DEFAULT
factory.setConnectTimeout(Duration.ofSeconds(1));
// No read timeout set - defaults to infinity

The connection timeout protects against unreachable hosts. The read timeout protects against slow responses. The slow response scenario is far more common than the unreachable host scenario because slow responses come from services that are “up but degraded,” which is the normal degradation path. Setting a connection timeout without a read timeout protects against the less likely failure mode and leaves the likely one uncovered.

Always set both.
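The same rule applies one level down, on the JDK's own `HttpURLConnection`, where both timeouts default to 0, meaning wait forever. A minimal sketch using only JDK classes; `BothTimeouts` is a hypothetical helper name, and any URL passed to it is the caller's assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Duration;

/** Opens connections with both timeouts set explicitly (illustrative helper). */
final class BothTimeouts {

    /** Creates the connection object without connecting yet; both timeouts are always set. */
    static HttpURLConnection open(URI uri, Duration connectTimeout, Duration readTimeout)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        // Bounds connection establishment: protects against unreachable hosts.
        conn.setConnectTimeout(Math.toIntExact(connectTimeout.toMillis()));
        // Bounds each blocking read: protects against "up but degraded" services.
        conn.setReadTimeout(Math.toIntExact(readTimeout.toMillis()));
        return conn;
    }
}
```

Taking both durations as required parameters is the design choice here: a caller cannot set one and forget the other.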

Anti-Pattern: Retry Without Timeout

// DANGEROUS DEFAULT - each retry holds a thread for the full default timeout
public FraudScore scoreWithRetry(PaymentRequest request) {
    for (int attempt = 0; attempt < 3; attempt++) {
        try {
            return fraudClient.score(request);  // No timeout on this call
        } catch (Exception e) {
            if (attempt == 2) throw e;
            // Retry immediately
        }
    }
    throw new IllegalStateException("unreachable");
}

Three attempts with no timeout. If the fraud service hangs, each attempt holds a thread until something lower in the stack gives up on the connection. Assuming that takes 120 seconds, three attempts = 360 seconds of thread occupation per request, and your 200-thread pool handles about 0.56 requests per second (200 / 360).

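// EXPLICIT AND INTENTIONAL - timeout on every call, total budget checked before each attempt
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.TimeoutException;
import org.springframework.web.client.ResourceAccessException;

// TimeoutException is a checked exception, so the method must declare it
public FraudScore scoreWithRetry(PaymentRequest request) throws TimeoutException {
    Instant deadline = Instant.now().plusMillis(2000); // total budget: 2 seconds

    for (int attempt = 0; attempt < 3; attempt++) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        // Stop when the budget is exactly spent as well as when it is overdrawn
        if (remaining.isNegative() || remaining.isZero()) {
            throw new TimeoutException("Retry budget exhausted after " + attempt + " attempts");
        }
        try {
            return fraudClient.score(request); // configured with 500ms read timeout
        } catch (ResourceAccessException e) {
            if (attempt == 2) throw e; // last attempt: surface the failure
            // Next attempt will check remaining budget
        }
    }
    throw new IllegalStateException("unreachable");
}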

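The total budget of 2 seconds caps how late a new attempt may start. Three attempts with a 500ms read timeout each spend at most 1.5 seconds waiting on reads, and the 2-second budget adds margin for connection overhead. Because the budget is checked only before each attempt, an attempt that starts just inside the deadline can still overrun it by its own connect and read timeouts.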
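To make the budget bind each attempt as well, the wait can be capped at whatever remains of it. A minimal sketch using only JDK classes: `BudgetedRetry` and its parameters are hypothetical, and it assumes the call can be run on a separate `ExecutorService`. It bounds how long the caller waits; the worker thread is freed early only if the task responds to interruption, so the client's own read timeout is still needed.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Retries a task while never waiting past a total time budget (illustrative sketch). */
final class BudgetedRetry {

    static <T> T call(Callable<T> task, int maxAttempts, Duration perAttempt,
                      Duration totalBudget, ExecutorService executor)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + totalBudget.toNanos();
        Exception lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                break; // budget spent: do not start another attempt
            }
            // Never wait longer than what is left of the total budget.
            long wait = Math.min(remaining, perAttempt.toNanos());
            Future<T> future = executor.submit(task);
            try {
                return future.get(wait, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true); // interrupts the worker; helps only if the task is interruptible
                lastFailure = e;
            }
        }
        TimeoutException exhausted = new TimeoutException("No successful attempt within budget");
        if (lastFailure != null) {
            exhausted.initCause(lastFailure);
        }
        throw exhausted;
    }
}
```

With three attempts, a 500ms per-attempt limit, and a 2-second budget, the caller waits at most 1.5 seconds; shrink the budget to 1.2 seconds and the third attempt is given only the remaining 200ms.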

Anti-Pattern: Different Timeouts Per Instance

When you have three instances of the payment service, each deployed from a different commit with different timeout values, their effective capacity becomes uneven as soon as a dependency slows down. Instance A with a 500ms timeout rejects slow requests fast and processes more total requests. Instance B with a 5-second timeout holds threads longer and processes fewer requests. The load balancer sends equal traffic to both, but while the dependency is slow instance B holds each thread up to 10x longer, so its effective capacity is up to 10x lower. It saturates first and starts failing, while instance A appears healthy.

Timeout values must be configuration, not code. Store them in a shared configuration source (Spring Cloud Config, environment variables from a deployment manifest) so all instances of the same service use identical values.

# EXPLICIT AND INTENTIONAL - externalized timeout configuration
fraud:
  client:
    connect-timeout-ms: 1000
    read-timeout-ms: 500
    connection-request-timeout-ms: 500

// PRODUCTION - timeout from configuration, not hardcoded
@ConfigurationProperties(prefix = "fraud.client")
public record FraudClientProperties(
        int connectTimeoutMs,
        int readTimeoutMs,
        int connectionRequestTimeoutMs
) {}

This ensures every instance uses the same timeout values, and changing a timeout requires a configuration change and redeployment, not a code change. The timeout is visible, auditable, and consistent.
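If the values arrive as environment variables from a deployment manifest rather than through Spring, a startup check can reject a missing or non-positive timeout before the instance takes traffic. A minimal sketch using only JDK classes; the record name `FraudClientTimeouts` and the variable names are hypothetical, chosen to mirror the `fraud.client` keys above.

```java
import java.time.Duration;
import java.util.Map;

/** Timeouts loaded from the environment; the variable names are hypothetical. */
record FraudClientTimeouts(Duration connect, Duration read, Duration connectionRequest) {

    /** Builds the settings from an environment map, failing fast on any bad value. */
    static FraudClientTimeouts fromEnv(Map<String, String> env) {
        return new FraudClientTimeouts(
                required(env, "FRAUD_CLIENT_CONNECT_TIMEOUT_MS"),
                required(env, "FRAUD_CLIENT_READ_TIMEOUT_MS"),
                required(env, "FRAUD_CLIENT_CONNECTION_REQUEST_TIMEOUT_MS"));
    }

    private static Duration required(Map<String, String> env, String name) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) {
            throw new IllegalStateException("Missing timeout setting: " + name);
        }
        // A non-numeric value throws NumberFormatException, which also stops startup.
        long millis = Long.parseLong(raw.trim());
        if (millis <= 0) {
            throw new IllegalStateException(name + " must be positive, got " + millis);
        }
        return Duration.ofMillis(millis);
    }
}
```

At startup this would be called as `FraudClientTimeouts.fromEnv(System.getenv())`. Refusing to start is deliberate: silently falling back to a default would reintroduce per-instance drift.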