Batching, Compression, and the Bandwidth-Latency Tradeoff
The Black Box
The logistics platform’s event ingestion handles two very different traffic patterns. Normal operations produce a steady 33 events per second. A warehouse batch scan, where a handheld scanner uploads a day’s worth of offline scans, produces 50,000 events in 30 seconds, or roughly 1,667 events per second. A batching configuration tuned for steady-state traffic is wrong for burst traffic, and vice versa.
The Mechanism
Calculating Optimal Batch Size
The optimal batch size depends on three factors:
- Per-message overhead ($O$): bytes of protocol framing per message (Kafka record header: 74 bytes, HTTP header: ~300 bytes).
- Network round-trip time ($RTT$): latency for one send-acknowledge cycle (0.5ms within a datacenter, 50ms cross-region).
- Latency budget ($L$): maximum acceptable delay from event generation to downstream visibility.
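The latency budget caps the batch size: a batch cannot wait longer than $L$ to fill, so it holds at most the number of messages that arrive within $L$. $O$ and $RTT$ then determine how much a batch of that size saves in overhead and round trips:
$$\text{batch\_size} = L \times \text{message\_rate}$$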
For steady-state traffic (33 messages/sec, 100ms latency budget): $$\text{batch\_size} = 0.1 \times 33 = 3.3, \text{ rounded up to } 4 \text{ messages}$$
Batching 4 messages saves 3 round trips (1.5ms at 0.5ms each) and 222 bytes of overhead (3 × 74 bytes). The gain is negligible.
For burst traffic (1,667 messages/sec, 100ms latency budget): $$\text{batch\_size} = 0.1 \times 1667 = 166.7 \approx 167 \text{ messages}$$
Batching 167 messages saves 166 round trips (83ms at 0.5ms each) and 12,284 bytes of overhead (166 × 74 bytes). The gain is significant.
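The two worked examples above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the platform's code: the class and method names are hypothetical, the 74-byte overhead and 0.5ms round trip are the figures quoted in the list above, and the batch size is rounded up as in the worked examples.

```java
// Sketch: latency-bounded batch size and the savings it yields.
// All names here are hypothetical; the constants come from the text above.
final class BatchSizing {
    static final long RECORD_OVERHEAD_BYTES = 74; // per-message framing overhead (O)
    static final long RTT_MICROS = 500;           // 0.5 ms round trip inside a datacenter

    // Number of messages that arrive within the latency budget, rounded up.
    // Integer arithmetic avoids floating-point surprises such as 0.1 * 30 > 3.
    static long batchSize(long latencyBudgetMs, long messagesPerSecond) {
        long product = latencyBudgetMs * messagesPerSecond;
        return Math.max(1, (product + 999) / 1000); // ceiling of product / 1000
    }

    // A batch of n messages needs one round trip instead of n.
    static long roundTripsSaved(long batchSize) {
        return batchSize - 1;
    }

    static long microsSaved(long batchSize) {
        return roundTripsSaved(batchSize) * RTT_MICROS;
    }

    static long overheadBytesSaved(long batchSize) {
        return roundTripsSaved(batchSize) * RECORD_OVERHEAD_BYTES;
    }

    public static void main(String[] args) {
        for (long rate : new long[] {33, 1667}) {
            long size = batchSize(100, rate);
            System.out.printf("%d msg/s -> batch of %d, saves %d round trips (%.1f ms), %d bytes%n",
                rate, size, roundTripsSaved(size), microsSaved(size) / 1000.0,
                overheadBytesSaved(size));
        }
    }
}
```

Plugging in other latency budgets or cross-region round-trip times (50ms) shows how quickly the savings grow once the message rate is high.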
// Concept: adaptive batch sizing based on traffic rate
// Use Kafka's built-in batching with a linger.ms that works for both patterns
import java.util.Properties;

Properties props = new Properties();
props.put("batch.size", 131072); // Upper bound on a batch, in bytes (128 KiB)
props.put("linger.ms", 50);      // Wait up to 50 ms for a batch to fill before sending
// Steady-state: 33 msgs/sec × 0.05 s ≈ 1.65 messages per batch (effectively no batching)
// Burst: 1,667 msgs/sec × 0.05 s ≈ 83 messages per batch (significant savings)
// The Kafka producer adapts automatically:
// - Low rate: each message waits at most 50 ms before it is sent
// - High rate: messages accumulate into larger batches
// linger.ms=50 is a reasonable default for the logistics platform.
Combined Batching + Compression
Compression is more effective on larger payloads because compression algorithms find more redundancy in bigger data blocks. A single 280-byte JSON message compresses poorly (LZ4 reduces it to ~220 bytes, ratio 1.3:1). A batch of 100 similar messages compresses well (LZ4 reduces 28,000 bytes to ~5,600 bytes, ratio 5.0:1).
| Configuration | Bytes per 100 messages | Network round trips |
|---|---|---|
| No batching, no compression | 28,000 | 100 |
| Batching, no compression | 28,000 | 1 |
| No batching, LZ4 | 22,000 | 100 |
| Batching + LZ4 | 5,600 | 1 |
Batching alone saves round trips. Compression alone saves bytes. Combined, they save both, and compression becomes more effective because it operates on a larger input.
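The effect of input size on compression ratio is easy to observe directly. The sketch below is an illustration under stated assumptions: the JDK ships no LZ4 codec, so `java.util.zip.Deflater` (DEFLATE) stands in for it, and the event JSON is a made-up shape rather than the platform's real schema. The absolute ratios will therefore differ from the LZ4 figures in the table, but the direction of the effect is the same.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Sketch: compare the compression ratio of one message against a batch of 100.
// DEFLATE is used only because it is in the JDK; it is a stand-in for LZ4.
final class BatchCompressionDemo {
    // Hypothetical scan event; field names and values are illustrative only.
    static String event(int i) {
        return String.format(
            "{\"eventId\":%d,\"warehouseId\":\"WH-07\",\"type\":\"SCAN\",\"sku\":\"SKU-%05d\",\"scannedAt\":%d}",
            100000 + i, i % 500, 1700000000000L + i * 250L);
    }

    // Concatenates `count` events, one per line, as a batch payload.
    static String batch(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(event(i)).append('\n');
        }
        return sb.toString();
    }

    // Number of bytes DEFLATE produces for the given input.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Uncompressed size divided by compressed size.
    static double ratio(String payload) {
        byte[] raw = payload.getBytes(StandardCharsets.UTF_8);
        return (double) raw.length / compressedSize(raw);
    }

    public static void main(String[] args) {
        System.out.printf("single message ratio: %.2f%n", ratio(event(0)));
        System.out.printf("batch of 100 ratio:   %.2f%n", ratio(batch(100)));
    }
}
```

The batch compresses better because the field names and repeated values from earlier messages are available as back-references for later ones; a single message has almost nothing to refer back to.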
The Observable Consequence
Benchmarking the logistics platform’s Kafka producer under three configurations shows the tradeoff:
# Concept: measuring producer throughput with different configurations
# Using kafka-producer-perf-test with 1 million messages, 280 bytes each
# No batching, no compression:
# 12,400 records/sec, 3.31 MB/sec, avg latency 0.42ms, p99 1.2ms
# Batching (linger.ms=20), no compression:
# 89,200 records/sec, 23.82 MB/sec, avg latency 14.3ms, p99 22.1ms
# Batching (linger.ms=20) + LZ4:
# 142,000 records/sec, 37.92 MB/sec, avg latency 12.8ms, p99 19.4ms
# Batching improves throughput 7.2x and raises average latency from 0.42ms to 14.3ms.
# Adding LZ4 improves throughput by another 1.6x with slightly lower latency
# (fewer bytes to transmit = faster network transfer).
The Decision Rule
For real-time event streams where latency matters (dashboard updates, notifications): set linger.ms to 5-10ms. Accept modest batching gains. The additional 5-10ms of latency is invisible to users.
For bulk data movement (batch scans, ETL, backfills): set linger.ms to 50-200ms and batch.size to 256KB or higher. Throughput is the priority.
Always enable compression on Kafka producers. LZ4 adds negligible CPU overhead and, on batched traffic, cuts bandwidth and storage by a factor of 2-5. There is no scenario where uncompressed Kafka traffic is the right choice for the logistics platform’s data volumes.
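The Kafka rules above can be captured as two producer profiles. This is a sketch only: the class and method names are hypothetical, the specific values (10ms, 100ms, 256KB) are picks from within the ranges given above rather than tuned settings, and connection and serializer settings are deliberately omitted. The keys `linger.ms`, `batch.size`, and `compression.type` are standard Kafka producer configuration names.

```java
import java.util.Properties;

// Sketch: one producer profile per traffic type. Names and exact values are
// assumptions chosen from the ranges in the decision rule.
final class ProducerProfiles {
    // Real-time streams: short linger, compression on, default batch.size.
    static Properties realTime() {
        Properties props = new Properties();
        props.put("linger.ms", 10);              // upper end of the 5-10 ms range
        props.put("compression.type", "lz4");
        return props;
    }

    // Bulk movement: long linger and large batches, compression on.
    static Properties bulk() {
        Properties props = new Properties();
        props.put("linger.ms", 100);             // within the 50-200 ms range
        props.put("batch.size", 262144);         // 256 KiB
        props.put("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        System.out.println("real-time: " + realTime());
        System.out.println("bulk:      " + bulk());
    }
}
```

Either profile would still need `bootstrap.servers` and the key and value serializers added before being passed to a `KafkaProducer`. Keeping the two profiles separate means the batch scanner import can use its own producer instead of forcing one `linger.ms` onto both traffic patterns.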
For JDBC batch inserts, use the largest batch size that fits within your transaction timeout. For the logistics platform’s batch scanner import (50,000 events), batch sizes of 1,000-5,000 rows offer the best tradeoff between round-trip savings and transaction rollback risk. Beyond that range, batches take longer to process and hold locks longer.
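A chunked JDBC import along these lines might look like the following. It is a sketch under assumptions: the `scan_events` table, its columns, and the `ScanEvent` class are hypothetical, and committing once per chunk is one reasonable way to bound rollback risk, not necessarily how the platform does it. Only standard `java.sql` calls (`addBatch`, `executeBatch`, `commit`, `rollback`) are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch: insert events in fixed-size chunks, committing after each chunk.
final class ScanEventImporter {
    // Hypothetical event shape; adjust to the real schema.
    static final class ScanEvent {
        final String scannerId;
        final String sku;
        final long scannedAtMillis;

        ScanEvent(String scannerId, String sku, long scannedAtMillis) {
            this.scannerId = scannerId;
            this.sku = sku;
            this.scannedAtMillis = scannedAtMillis;
        }
    }

    // Hypothetical table and column names.
    static final String INSERT_SQL =
        "INSERT INTO scan_events (scanner_id, sku, scanned_at_ms) VALUES (?, ?, ?)";

    // Number of executeBatch() calls needed for totalRows.
    static int batchCount(int totalRows, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        return (totalRows + batchSize - 1) / batchSize;
    }

    // Returns the number of chunks committed. On failure, the chunk in progress
    // is rolled back; chunks committed earlier stay committed.
    static int insertInBatches(Connection conn, List<ScanEvent> events, int batchSize)
            throws SQLException {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int committed = 0;
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (ScanEvent e : events) {
                stmt.setString(1, e.scannerId);
                stmt.setString(2, e.sku);
                stmt.setLong(3, e.scannedAtMillis);
                stmt.addBatch();
                if (++pending == batchSize) {
                    stmt.executeBatch(); // one round trip for the whole chunk
                    conn.commit();
                    committed++;
                    pending = 0;
                }
            }
            if (pending > 0) {           // flush the final partial chunk
                stmt.executeBatch();
                conn.commit();
                committed++;
            }
            return committed;
        } catch (SQLException ex) {
            conn.rollback();
            throw ex;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

With the 50,000-event import, a batch size of 1,000 yields 50 round trips and 5,000 yields 10, compared with 50,000 for row-by-row inserts.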