DNS and TCP: The Hidden Latency Budget
The main chapter showed that connection establishment adds 2-200ms of overhead depending on network topology. This section provides the operational playbook: how to configure DNS caching correctly for your deployment environment, how to measure actual handshake latency with tcpdump, and how to eliminate Nagle’s algorithm interference in service-to-service communication.
DNS Caching Strategy by Deployment Model
Different infrastructure requires different DNS cache TTLs. The wrong choice either causes stale routing (TTL too high) or unnecessary latency (TTL too low):
// DNS configuration matrix for common deployment scenarios:
//
// ┌────────────────────────┬─────────┬────────────────────────────────┐
// │ Environment            │ TTL (s) │ Reason                         │
// ├────────────────────────┼─────────┼────────────────────────────────┤
// │ Kubernetes (CoreDNS)   │ 30      │ Match CoreDNS default TTL      │
// │ AWS ECS/ALB            │ 60      │ ALB IPs change on scaling      │
// │ AWS ElastiCache        │ 5-10    │ Failover requires fast switch  │
// │ Static on-prem servers │ 300     │ IPs never change, minimize DNS │
// │ Consul service mesh    │ 10      │ Health checks deregister fast  │
// │ Cloud SQL/RDS          │ 60      │ Failover promotes new IP       │
// └────────────────────────┴─────────┴────────────────────────────────┘
import java.security.Security;

public class DnsConfigByEnvironment {
    // Call one of these early in startup, before the first hostname lookup:
    // the JVM reads these properties once, when its address cache initializes.

    public static void configureForKubernetes() {
        // Kubernetes CoreDNS returns records with 30s TTL by default.
        // Setting JVM cache to 30s ensures we re-resolve at the same cadence.
        // Setting it lower wastes DNS queries; higher risks stale endpoints.
        Security.setProperty("networkaddress.cache.ttl", "30");
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
    }

    public static void configureForAwsElb() {
        // AWS ELB/ALB can change IP addresses when:
        // - Auto-scaling adds/removes nodes
        // - AZ failure causes rebalancing
        // - ELB pre-warming completes
        // 60s balances freshness with DNS query reduction.
        Security.setProperty("networkaddress.cache.ttl", "60");
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }

    public static void configureForElastiCache() {
        // ElastiCache Multi-AZ failover changes the endpoint IP.
        // Failover completes in 10-30s. JVM must pick up new IP quickly.
        Security.setProperty("networkaddress.cache.ttl", "5");
        Security.setProperty("networkaddress.cache.negative.ttl", "1");
    }
}
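One way to avoid calling the wrong method in the wrong environment is to select the policy from configuration. The sketch below is an illustration, not part of the platform's code: the `DnsTtlPolicy` enum and the `DEPLOY_ENV` variable named in the usage note are hypothetical, and only the three TTL pairs that appear in the class above are included.

```java
import java.security.Security;
import java.util.Locale;

/** Hypothetical helper: picks the JVM DNS cache TTLs for a named environment. */
enum DnsTtlPolicy {
    KUBERNETES(30, 3),
    AWS_ELB(60, 5),
    ELASTICACHE(5, 1);

    final int positiveTtlSeconds;
    final int negativeTtlSeconds;

    DnsTtlPolicy(int positiveTtlSeconds, int negativeTtlSeconds) {
        this.positiveTtlSeconds = positiveTtlSeconds;
        this.negativeTtlSeconds = negativeTtlSeconds;
    }

    /** Accepts names such as "kubernetes" or "aws-elb"; throws IllegalArgumentException otherwise. */
    static DnsTtlPolicy fromName(String name) {
        return valueOf(name.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
    }

    /** Must run before the first hostname lookup in the JVM. */
    void apply() {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl", Integer.toString(negativeTtlSeconds));
    }
}
```

A service could then call `DnsTtlPolicy.fromName(System.getenv().getOrDefault("DEPLOY_ENV", "kubernetes")).apply()` as the first statement of `main`, assuming such a variable is set by the deployment.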
Kubernetes CoreDNS Optimization
The content platform runs on Kubernetes. Every DNS query goes through CoreDNS, which resolves cluster-internal service names. The default search path causes unnecessary lookups:
# Default resolv.conf in a Kubernetes pod:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
# Problem: with ndots:5, ANY hostname with fewer than 5 dots gets the search
# path appended first. The resolver stops at the first answer, so the short
# name "search-service" walks at most:
# search-service.default.svc.cluster.local (1st query)
# search-service.svc.cluster.local (2nd query, if 1st fails)
# search-service.cluster.local (3rd query, if 2nd fails)
# search-service (4th query, absolute)
#
# For external hostnames like "api.openai.com" (2 dots < 5):
# api.openai.com.default.svc.cluster.local (NXDOMAIN, wasted)
# api.openai.com.svc.cluster.local (NXDOMAIN, wasted)
# api.openai.com.cluster.local (NXDOMAIN, wasted)
# api.openai.com (finally resolves)
#
# Fix: Use FQDNs with trailing dot, or reduce ndots
// FAST: Use fully-qualified domain names in service configuration
// The trailing dot prevents search path expansion
public class ServiceEndpoints {
    // SLOW: Short name triggers search path (up to 4 DNS queries)
    private static final String SEARCH_SERVICE_SLOW = "search-service";

    // FAST: FQDN with trailing dot (single DNS query)
    private static final String SEARCH_SERVICE_FAST =
        "search-service.default.svc.cluster.local.";

    // Alternative: reduce ndots in pod spec
    // spec:
    //   dnsConfig:
    //     options:
    //       - name: ndots
    //         value: "2"
}
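The expansion rule can be made concrete with a small model of the resolver's candidate list. This is a sketch of glibc-style behaviour under stated assumptions, not the resolver's actual code; the `SearchPathExpansion` class is hypothetical, and a real resolver stops at the first candidate that returns an answer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of how ndots and the search list order DNS candidates. */
final class SearchPathExpansion {

    /** Returns the names a resolver would try, in order, each written as an absolute name. */
    static List<String> candidates(String name, List<String> searchDomains, int ndots) {
        List<String> out = new ArrayList<>();
        if (name.endsWith(".")) {
            out.add(name); // trailing dot: already absolute, no expansion
            return out;
        }
        long dots = name.chars().filter(c -> c == '.').count();
        boolean absoluteFirst = dots >= ndots;
        if (absoluteFirst) {
            out.add(name + ".");
        }
        for (String domain : searchDomains) {
            out.add(name + "." + domain + ".");
        }
        if (!absoluteFirst) {
            out.add(name + "."); // tried last, after every search domain failed
        }
        return out;
    }
}
```

With the search list from the resolv.conf above, `candidates("api.openai.com", search, 5)` yields four names with the absolute one last, while `ndots` of 2 moves the absolute name to the front and a trailing dot reduces the list to a single entry.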
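The difference is measurable. With the default ndots:5, an external hostname such as api.openai.com costs CoreDNS 4 queries (3 NXDOMAIN + 1 success) instead of 1; a short in-cluster name in the same namespace resolves on the first candidate, but a name in any other namespace also walks the list. Under load, the wasted lookups can roughly quadruple DNS latency and CoreDNS CPU usage for the affected names: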
# Locust script measuring DNS impact on service call latency
from locust import HttpUser, task, between
import time
import socket


class ArticleServiceUser(HttpUser):
    wait_time = between(0.1, 0.5)
    host = "http://article-service.default.svc.cluster.local:8080"

    @task(10)
    def fetch_article_warm(self):
        """Steady-state request over warm connection"""
        self.client.get("/api/articles/12345",
                        name="GET /api/articles/:id (warm)")

    @task(1)
    def fetch_article_cold(self):
        """Force new connection by resetting session"""
        # Close existing connections to force DNS + TCP + TLS
        self.client.close()
        start = time.perf_counter()
        self.client.get("/api/articles/12345",
                        name="GET /api/articles/:id (cold)")
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > 50:
            print(f"COLD connection took {elapsed_ms:.1f}ms")

    @task(1)
    def measure_dns_resolution(self):
        """Measure raw DNS resolution time"""
        # 4 dots < ndots:5, so this name still walks the search path;
        # append a trailing dot to measure a single-query lookup instead.
        hostname = "search-service.default.svc.cluster.local"
        error = None
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, 8080)
        except socket.gaierror as exc:
            error = exc  # report the failure instead of counting it as a success
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Report as custom metric
        self.environment.events.request.fire(
            request_type="DNS",
            name=f"resolve {hostname}",
            response_time=elapsed_ms,
            response_length=0,
            exception=error,
        )
DNS Prefetching on Service Startup
The content platform’s article service knows its downstream dependencies at startup. Pre-resolving DNS names during initialization populates the JVM DNS cache, so a first user request that arrives within the cache TTL does not pay the DNS cost:
// FAST: Prefetch DNS entries during application startup
@Component
public class DnsPrefetcher {
private static final Logger log = LoggerFactory.getLogger(DnsPrefetcher.class);
private final List<String> downstreamServices = List.of(
"search-service.default.svc.cluster.local",
"recommendation-service.default.svc.cluster.local",
"analytics-service.default.svc.cluster.local",
"image-service.default.svc.cluster.local"
);
@PostConstruct
public void prefetchDns() {
log.info("Prefetching DNS for {} downstream services", downstreamServices.size());
long start = System.nanoTime();
for (String service : downstreamServices) {
try {
InetAddress[] addresses = InetAddress.getAllByName(service);
log.debug("Resolved {} to {}", service,
Arrays.stream(addresses)
.map(InetAddress::getHostAddress)
.collect(Collectors.joining(", ")));
} catch (UnknownHostException e) {
log.warn("Failed to prefetch DNS for {}: {}", service, e.getMessage());
}
}
long elapsed = (System.nanoTime() - start) / 1_000_000;
log.info("DNS prefetch completed in {}ms", elapsed);
}
}
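The loop above resolves names one at a time, so a single slow lookup delays startup by its full resolver timeout. A possible variation, sketched here under the assumption that startup time matters more than a complete prefetch, resolves all names concurrently and gives up after a deadline. The `ParallelDnsPrefetch` class and its timeout parameter are hypothetical.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: resolves several hostnames in parallel with an overall deadline. */
final class ParallelDnsPrefetch {

    /** Returns host -> number of resolved addresses; 0 means failure or timeout. */
    static Map<String, Integer> prefetch(List<String> hosts, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, hosts.size()), r -> {
            Thread t = new Thread(r, "dns-prefetch");
            t.setDaemon(true); // a stuck lookup must not block JVM shutdown
            return t;
        });
        Map<String, Integer> results = new ConcurrentHashMap<>();
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String host : hosts) {
                results.put(host, 0);
                tasks.add(() -> {
                    // A successful lookup also populates the JVM address cache.
                    results.put(host, InetAddress.getAllByName(host).length);
                    return null;
                });
            }
            // Waits until all tasks finish or the deadline passes, whichever is first.
            pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return new HashMap<>(results); // snapshot at the deadline
    }
}
```

Hosts that report 0 are worth logging at warn level, as the sequential version does, since they will pay the DNS cost on first use.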
Measuring TCP Handshake with tcpdump
Application-level metrics cannot distinguish DNS latency from TCP latency from TLS latency. Packet capture with tcpdump provides ground truth:
# Capture TCP handshake to search service (run on article-service pod)
tcpdump -i eth0 -n host 10.0.3.42 and port 8443 -w /tmp/handshake.pcap &
# Trigger a cold connection:
curl -k https://search-service.default.svc.cluster.local:8443/health
# Stop capture
kill %1
# Analyze handshake timing:
tcpdump -r /tmp/handshake.pcap -ttt -n | head -20
# Output (timestamps show inter-packet time):
# 00:00:00.000000 IP 10.0.2.15.48832 > 10.0.3.42.8443: Flags [S], seq 12345
# 00:00:00.000487 IP 10.0.3.42.8443 > 10.0.2.15.48832: Flags [S.], seq 67890, ack 12346
# 00:00:00.000023 IP 10.0.2.15.48832 > 10.0.3.42.8443: Flags [.], ack 67891
# 00:00:00.000089 IP 10.0.2.15.48832 > 10.0.3.42.8443: Flags [P.], TLS ClientHello
# 00:00:00.000512 IP 10.0.3.42.8443 > 10.0.2.15.48832: Flags [P.], TLS ServerHello
# 00:00:00.000031 IP 10.0.2.15.48832 > 10.0.3.42.8443: Flags [P.], TLS Finished
#
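# TCP handshake: 0.510ms (SYN to final ACK; the SYN to SYN-ACK round trip is 0.487ms)
# TLS handshake: 0.543ms (ClientHello to client Finished; ServerHello arrives after 0.512ms)
# Total overhead: ~1.1ms from SYN to Finished (same datacenter)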
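Because `-ttt` prints the time since the previous packet, a phase's duration is the sum of the deltas between its first and last packet. The short sketch below works through that arithmetic for the six packets in the sample output; the `HandshakePhases` class is hypothetical and the deltas are copied from the capture above.

```java
/** Hypothetical worked example: sums tcpdump -ttt deltas into handshake phases. */
final class HandshakePhases {
    // Inter-packet deltas in microseconds, in capture order:
    // SYN, SYN-ACK, ACK, ClientHello, ServerHello, Finished.
    static final long[] DELTAS_US = {0, 487, 23, 89, 512, 31};

    /** Elapsed time from packet `from` to packet `to` (indices into DELTAS_US). */
    static long elapsedMicros(int from, int to) {
        long total = 0;
        for (int i = from + 1; i <= to; i++) {
            total += DELTAS_US[i];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("TCP (SYN to ACK): " + elapsedMicros(0, 2) + " us");              // 510 us
        System.out.println("TLS (ClientHello to Finished): " + elapsedMicros(3, 5) + " us"); // 543 us
        System.out.println("Total (SYN to Finished): " + elapsedMicros(0, 5) + " us");       // 1142 us
    }
}
```

The 89 microseconds between the ACK and the ClientHello belong to neither handshake; they are client-side time spent preparing the first TLS record.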
Programmatic Handshake Measurement in Java
When tcpdump is not available (or for continuous monitoring), measure handshake latency from the application:
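import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class HandshakeLatencyMeasurer {
    /**
     * Measures TCP+TLS handshake time to a target host.
     * This creates and immediately closes a connection, measuring only
     * the establishment overhead. Returns a negative Duration on failure.
     */
    public static Duration measureHandshake(String host, int port, boolean useTls) {
        // Resolve before starting the clock; otherwise DNS time is
        // counted as handshake time.
        InetSocketAddress address = new InetSocketAddress(host, port);
        if (address.isUnresolved()) {
            return Duration.ofMillis(-1); // DNS resolution failed
        }
        long start = System.nanoTime();
        try {
            if (useTls) {
                SSLSocketFactory factory =
                    (SSLSocketFactory) SSLSocketFactory.getDefault();
                try (SSLSocket socket = (SSLSocket) factory.createSocket()) {
                    socket.connect(address, 5000);
                    socket.setSoTimeout(5000); // Bound the TLS handshake as well
                    socket.startHandshake(); // Forces TLS negotiation
                }
            } else {
                try (Socket socket = new Socket()) {
                    socket.connect(address, 5000);
                }
            }
        } catch (Exception e) {
            return Duration.ofMillis(-1); // Connection failed
        }
        return Duration.ofNanos(System.nanoTime() - start);
    }

    // Continuous monitoring: report handshake latency to metrics.
    // Returns the scheduler so the caller can shut it down on application stop.
    public static ScheduledExecutorService monitorHandshakeLatency(
            String host, int port, MeterRegistry registry) {
        Timer handshakeTimer = Timer.builder("service.handshake.duration")
            .tag("target", host)
            .tag("port", String.valueOf(port))
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Duration duration = measureHandshake(host, port, true);
            if (!duration.isNegative()) {
                handshakeTimer.record(duration);
            }
        }, 0, 30, TimeUnit.SECONDS);
        return scheduler;
    }
}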
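A probe that performs each step itself can also report the phases separately, which approximates from inside the JVM what the packet capture shows on the wire. The following is a minimal sketch, assuming a plain blocking socket is acceptable for a diagnostic probe; `ConnectPhaseTimer` and `Phases` are hypothetical names, and the timings include JVM and kernel overhead that a capture would not.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical probe: times DNS, TCP connect, and TLS handshake separately. */
final class ConnectPhaseTimer {

    /** Per-phase durations; tls is Duration.ZERO when TLS was not requested. */
    static final class Phases {
        final Duration dns;
        final Duration tcp;
        final Duration tls;

        Phases(Duration dns, Duration tcp, Duration tls) {
            this.dns = dns;
            this.tcp = tcp;
            this.tls = tls;
        }
    }

    static Phases measure(String host, int port, boolean useTls, int timeoutMillis) {
        try {
            long t0 = System.nanoTime();
            InetAddress address = InetAddress.getByName(host); // DNS lookup or JVM cache hit
            long t1 = System.nanoTime();
            try (Socket socket = new Socket()) {
                // Returns once the three-way handshake has completed.
                socket.connect(new InetSocketAddress(address, port), timeoutMillis);
                long t2 = System.nanoTime();
                Duration tls = Duration.ZERO;
                if (useTls) {
                    socket.setSoTimeout(timeoutMillis); // bound the handshake reads
                    SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
                    // Layer TLS over the already-connected socket so only the handshake is timed.
                    try (SSLSocket secure = (SSLSocket) factory.createSocket(socket, host, port, true)) {
                        secure.startHandshake();
                        tls = Duration.ofNanos(System.nanoTime() - t2);
                    }
                }
                return new Phases(Duration.ofNanos(t1 - t0), Duration.ofNanos(t2 - t1), tls);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Recording the three durations under separate timer names would show which phase moves when P99 regresses.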
TCP_NODELAY: The Nagle Problem in Detail
Nagle’s algorithm (RFC 896) holds back small writes until either a full MSS (Maximum Segment Size, typically 1460 bytes) of data has accumulated or all previously sent data has been ACKed. Combined with TCP delayed ACK (where the receiver waits before ACKing in the hope of piggybacking the ACK on a response, typically up to 40ms on Linux), this creates a pathological interaction:
Without TCP_NODELAY (Nagle enabled):
  Client writes HTTP request header (120 bytes)
    → Sent immediately: no unacknowledged data is in flight
  Client writes request body (< MSS)
    → Nagle: the header is still unACKed, so the body is held back
    → Server has TCP delayed ACK: wait up to 40ms before ACKing
    → Client waits up to 40ms before sending request body
    → Total added latency: up to 40ms per small write after the first
With TCP_NODELAY (Nagle disabled):
  Client writes HTTP request header (120 bytes)
    → Sent immediately regardless of outstanding ACKs
  Client writes request body
    → Sent immediately after the header
    → No artificial delay
This matters for HTTP/1.1 where request headers and body may be sent in separate write calls. HTTP/2 frames are typically coalesced into single writes, making Nagle less impactful but still worth disabling:
// Spring Boot: Configure TCP_NODELAY for embedded Tomcat
// (Tomcat enables tcpNoDelay by default; setting it explicitly guards against overrides)
@Configuration
public class TomcatTcpConfig {
@Bean
public WebServerFactoryCustomizer<TomcatServletWebServerFactory> tcpCustomizer() {
return factory -> factory.addConnectorCustomizers(connector -> {
if (connector.getProtocolHandler() instanceof AbstractProtocol<?> protocol) {
protocol.setTcpNoDelay(true);
}
});
}
}
// Spring WebFlux (Netty): Configure TCP_NODELAY
@Configuration
public class NettyTcpConfig {
    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> nettyCustomizer() {
        // childOption applies to accepted connections; option would only
        // configure the listening socket.
        return factory -> factory.addServerCustomizers(server ->
            server.childOption(ChannelOption.TCP_NODELAY, true)
                  .childOption(ChannelOption.SO_KEEPALIVE, true)
        );
    }
}
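For clients that talk over a raw socket, the same two defences apply directly: disable Nagle on the socket, and hand the kernel the header and body in one write so there is no second small segment to delay. The sketch below assumes a plain `java.net.Socket`; the `SmallWriteClient` class is hypothetical and the request it sends is only an example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical client helper: avoids the Nagle / delayed-ACK stall on small writes. */
final class SmallWriteClient {

    static void sendRequest(Socket socket, String headers, byte[] body) {
        try {
            socket.setTcpNoDelay(true); // disable Nagle for this connection
            // Buffer header and body in user space so they reach the kernel
            // in a single write when flush() is called.
            OutputStream out = new BufferedOutputStream(socket.getOutputStream(), 8192);
            out.write(headers.getBytes(StandardCharsets.US_ASCII));
            out.write(body);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Coalescing the writes helps even where TCP_NODELAY cannot be set, because Nagle only delays data that follows an unacknowledged segment.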
Putting It Together: Content Platform Configuration
The complete connection configuration for the article service:
@Configuration
public class ServiceConnectionConfig {
@PostConstruct
public void configureDns() {
// Kubernetes CoreDNS alignment
Security.setProperty("networkaddress.cache.ttl", "30");
Security.setProperty("networkaddress.cache.negative.ttl", "3");
}
@Bean
public HttpClient serviceHttpClient() {
return HttpClient.newBuilder()
.version(HttpClient.Version.HTTP_2)
.connectTimeout(Duration.ofSeconds(2))
.build();
}
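// For WebClient (Spring WebFlux). Reactor Netty's HttpClient is fully
// qualified because it shares a simple name with java.net.http.HttpClient above.
@Bean
public WebClient.Builder webClientBuilder() {
reactor.netty.http.client.HttpClient nettyClient =
reactor.netty.http.client.HttpClient.create()
.option(ChannelOption.TCP_NODELAY, true)
.option(ChannelOption.SO_KEEPALIVE, true)
.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 2000)
.responseTimeout(Duration.ofSeconds(5))
// H2 is HTTP/2 over TLS; use HttpProtocol.H2C for cleartext targets
.protocol(HttpProtocol.H2, HttpProtocol.HTTP11);
return WebClient.builder()
.clientConnector(new ReactorClientHttpConnector(nettyClient));
}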
}
Benchmark Results
Content platform article service latency (P99):
Before optimization:
- First request after deploy: 180ms
- Steady state: 45ms
- After pod reschedule: 120ms (DNS cache miss + cold connections)
After DNS + TCP optimization:
- First request after deploy: 52ms (pre-warmed connections, prefetched DNS)
- Steady state: 14ms
- After pod reschedule: 38ms (DNS TTL aligned, fast re-establishment)
DNS queries to CoreDNS per minute:
Before: 50,000+ (no JVM caching due to SecurityManager legacy)
After: 8 (4 services * 2 re-resolutions per minute at a 30s TTL)
Reduction: 99.98%
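The relative improvements follow directly from the before and after figures. The short calculation below is included only to make the arithmetic checkable; the `ReductionMath` class is hypothetical and the inputs are the numbers listed above.

```java
import java.util.Locale;

/** Hypothetical worked example: percentage reductions from the benchmark figures. */
final class ReductionMath {

    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "First request: %.1f%%%n", reductionPercent(180, 52));  // 71.1%
        System.out.printf(Locale.ROOT, "Steady state:  %.1f%%%n", reductionPercent(45, 14));   // 68.9%
        System.out.printf(Locale.ROOT, "Reschedule:    %.1f%%%n", reductionPercent(120, 38));  // 68.3%
        System.out.printf(Locale.ROOT, "DNS queries:   %.2f%%%n", reductionPercent(50_000, 8)); // 99.98%
    }
}
```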
The hidden latency budget was consuming roughly 70% of P99 latency (45ms down to 14ms at steady state). DNS and TCP optimization recovered it without changing any application logic.