Design a Rate Limiter
SummaryCovers token bucket, leaky bucket, fixed window, sliding...
Covers token bucket, leaky bucket, fixed window, sliding...
Covers token bucket, leaky bucket, fixed window, sliding window log, and sliding window counter algorithms with Redis-backed implementations in Java 25.
Design a Rate Limiter
Rate limiting controls the number of requests a client can send to a server within a given time window. Every large-scale API — from Stripe to GitHub to Cloudflare — uses rate limiting to prevent abuse, protect resources, and ensure fair usage. The design is frequently tested in system design interviews because it touches on distributed systems, concurrency, algorithmic trade-offs, and practical API design.
Requirements
Functional Requirements
- Limit requests per user, IP address, or API key within a configurable time window.
- Configurable rules — different limits for different API endpoints, user tiers, or authentication levels.
- Return rate limit headers —
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reseton every response. - Reject excess requests with HTTP 429 (Too Many Requests) and include a
Retry-Afterheader. - Support multiple rate limiting dimensions — a single request can be evaluated against multiple rules simultaneously (e.g., per-user AND per-endpoint limits).
Non-Functional Requirements
- Low latency overhead — rate limit checks must complete in under 5ms (p99) to avoid degrading API response times.
- Distributed — rate limiting must work correctly across multiple API gateway instances without centralized bottlenecks.
- Highly available — a rate limiter failure should default to allowing requests (fail-open) rather than blocking legitimate traffic.
- Accuracy — tolerate a small margin of error (±1-2%) in distributed environments rather than requiring perfect precision.
Capacity Estimation
Assume a platform with 500 million registered users, of which 10 million are daily active. At peak load, the system handles 10,000 requests per second (RPS).
Memory estimation for counters:
- Each rate limit entry stores: user identifier (64 bytes) + counter (8 bytes) + timestamp (8 bytes) + TTL metadata (8 bytes) ≈ 88 bytes per entry.
- With 10M active users and an average of 3 rate limit rules each: 10M × 3 × 88 bytes ≈ 2.64 GB.
- Redis can comfortably hold this in memory on a single node, though we use Redis Cluster for availability.
Throughput:
- 10K RPS × 1-3 Redis operations per request = 10K–30K Redis operations/second — well within Redis’s capability of 100K+ ops/sec on a single instance.
High-Level Design
┌──────────┐ ┌──────────────────────────────┐ ┌─────────────────┐
│ │ │ API Gateway │ │ │
│ Client ├────►│ ┌──────────────────────┐ ├────►│ Backend │
│ │ │ │ Rate Limiter │ │ │ Services │
│ │◄────┤ │ Middleware │ │◄────┤ │
│ │ │ └─────────┬────────────┘ │ │ │
└──────────┘ │ │ │ └─────────────────┘
└────────────┼──────────────────┘
│
┌─────────▼─────────┐
│ Redis Cluster │
│ (Counters + │
│ Token Buckets) │
└────────────────────┘
Middleware vs. Separate Service:
Placing the rate limiter as middleware in the API gateway keeps latency low — there is no additional network hop. The middleware intercepts every incoming request, queries Redis for the current count, and either forwards the request downstream or returns HTTP 429.
A separate rate limiting service provides better separation of concerns and independent scaling, but adds a network hop (1-3ms). For most systems, the middleware approach wins on latency. Use a separate service when you need complex, centralized policy management across heterogeneous backends.
Deep Dive
Algorithm Comparison
| Algorithm | Memory | Accuracy | Burst Handling | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Low (counter + timestamp per key) | Good | Allows controlled bursts | Low | API rate limiting with burst tolerance |
| Leaky Bucket | Low (queue size) | High | Smooths bursts (no spikes) | Low | Traffic shaping, steady throughput |
| Fixed Window Counter | Very Low (single counter per key) | Low (boundary spike issue) | Poor (2x burst at window edges) | Very Low | Simple use cases, approximate limiting |
| Sliding Window Log | High (stores every timestamp) | Perfect | Precise | High | Audit-critical systems |
| Sliding Window Counter | Low (two counters per key) | Good (weighted approximation) | Good | Low | Production API limiting (best trade-off) |
Token Bucket Algorithm
The token bucket algorithm works by maintaining a virtual “bucket” for each rate-limited entity. Tokens accumulate at a fixed rate up to a maximum capacity. Each request consumes one token. When the bucket is empty, requests are rejected.
Walkthrough:
- A bucket starts with capacity tokens (e.g., 10).
- Tokens are added at a refill rate (e.g., 1 token per second).
- When a request arrives, check the bucket: if tokens > 0, consume one token and allow the request. If tokens = 0, reject.
- Tokens never exceed the bucket capacity — excess tokens are discarded.
This design naturally allows bursts: a client that has been idle accumulates tokens up to the capacity, enabling a short burst of rapid requests before the rate limit kicks in.
Java 25 Implementation:
// Token bucket state as an immutable record
record TokenBucket(
String key,
int capacity,
double refillRatePerSecond,
double availableTokens,
long lastRefillTimestamp
) {
TokenBucket refill(long nowMillis) {
double elapsed = (nowMillis - lastRefillTimestamp) / 1000.0;
double newTokens = Math.min(capacity, availableTokens + elapsed * refillRatePerSecond);
return new TokenBucket(key, capacity, refillRatePerSecond, newTokens, nowMillis);
}
record ConsumeResult(TokenBucket bucket, boolean allowed) {}
ConsumeResult tryConsume() {
if (availableTokens >= 1.0) {
var updated = new TokenBucket(key, capacity, refillRatePerSecond, availableTokens - 1, lastRefillTimestamp);
return new ConsumeResult(updated, true);
}
return new ConsumeResult(this, false);
}
}
Redis + Lua Script for Atomic Operations:
In a distributed environment, multiple API gateway instances access the same Redis key simultaneously. A Lua script guarantees atomicity — Redis executes it as a single operation with no interleaving.
-- token_bucket.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity, ARGV[2] = refill rate (tokens/sec), ARGV[3] = current time (ms)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
tokens = capacity
last_refill = now
end
-- Refill tokens based on elapsed time
local elapsed = (now - last_refill) / 1000.0
tokens = math.min(capacity, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= 1 then
tokens = tokens - 1
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return {allowed, math.floor(tokens)}
Calling from Java 25 with virtual threads:
sealed interface RateLimitResult permits RateLimitResult.Allowed, RateLimitResult.Rejected {
record Allowed(int remainingTokens) implements RateLimitResult {}
record Rejected(long retryAfterMs) implements RateLimitResult {}
}
class RedisTokenBucketLimiter {
private final JedisPool jedisPool;
private final String luaScriptSha;
RateLimitResult checkLimit(String key, int capacity, double refillRate) {
try (var jedis = jedisPool.getResource()) {
var result = (List<Long>) jedis.evalsha(
luaScriptSha,
List.of(key),
List.of(
String.valueOf(capacity),
String.valueOf(refillRate),
String.valueOf(System.currentTimeMillis())
)
);
long allowed = result.get(0);
long remaining = result.get(1);
return switch ((int) allowed) {
case 1 -> new RateLimitResult.Allowed((int) remaining);
case 0 -> new RateLimitResult.Rejected((long) (1000.0 / refillRate));
default -> throw new IllegalStateException("Unexpected result from Lua script");
};
}
}
}
Sliding Window Counter
The sliding window counter addresses the boundary problem of fixed windows by using a weighted combination of the current and previous window’s counts.
How it works:
-
Maintain counters for the current window and the previous window.
-
When a request arrives at time
t, calculate the weighted count:weightedCount = previousCount × overlapRatio + currentCountwhere
overlapRatio = (windowSize - elapsedInCurrentWindow) / windowSize. -
If
weightedCount < limit, allow the request and increment the current counter.
Java 25 Implementation:
record SlidingWindowState(
long previousCount,
long currentCount,
long windowStartMs,
long windowSizeMs
) {
double weightedCount(long nowMs) {
long elapsed = nowMs - windowStartMs;
if (elapsed >= windowSizeMs) {
// We've moved past the current window entirely
return 0;
}
double overlapRatio = (windowSizeMs - elapsed) / (double) windowSizeMs;
return previousCount * overlapRatio + currentCount;
}
}
class SlidingWindowRateLimiter {
private final JedisPool jedisPool;
RateLimitResult checkLimit(String key, int maxRequests, long windowSizeMs) {
long now = System.currentTimeMillis();
long currentWindow = now / windowSizeMs;
long previousWindow = currentWindow - 1;
String currentKey = key + ":" + currentWindow;
String previousKey = key + ":" + previousWindow;
try (var jedis = jedisPool.getResource()) {
// Pipeline to reduce round trips
var pipeline = jedis.pipelined();
var prevResponse = pipeline.get(previousKey);
var currResponse = pipeline.get(currentKey);
pipeline.sync();
long prevCount = prevResponse.get() != null ? Long.parseLong(prevResponse.get()) : 0;
long currCount = currResponse.get() != null ? Long.parseLong(currResponse.get()) : 0;
long elapsedInWindow = now % windowSizeMs;
double overlapRatio = (windowSizeMs - elapsedInWindow) / (double) windowSizeMs;
double weightedCount = prevCount * overlapRatio + currCount;
if (weightedCount < maxRequests) {
jedis.incr(currentKey);
jedis.expire(currentKey, (int) (windowSizeMs / 1000) * 2);
int remaining = (int) (maxRequests - weightedCount - 1);
return new RateLimitResult.Allowed(remaining);
}
long retryAfterMs = windowSizeMs - elapsedInWindow;
return new RateLimitResult.Rejected(retryAfterMs);
}
}
}
Distributed Rate Limiting
When multiple API gateway instances share a Redis-backed rate limiter, race conditions become the primary concern. Two instances can read the same counter value, both see “under limit,” and both increment — allowing more requests than intended.
Solving with Lua Scripts:
Lua scripts execute atomically in Redis. The check-and-increment operation happens as a single unit — no other command can interleave. This is the recommended approach and the reason the token bucket implementation above uses Lua.
Redis MULTI/EXEC Alternative:
// Optimistic locking with WATCH
try (var jedis = jedisPool.getResource()) {
jedis.watch(key);
String current = jedis.get(key);
int count = current != null ? Integer.parseInt(current) : 0;
if (count >= limit) {
jedis.unwatch();
return new RateLimitResult.Rejected(retryAfterMs);
}
Transaction tx = jedis.multi();
tx.incr(key);
tx.expire(key, windowSeconds);
List<Object> results = tx.exec();
if (results == null) {
// Transaction aborted due to concurrent modification — retry
return checkLimit(key, limit, windowSeconds);
}
}
Synchronization Across Instances:
In a multi-datacenter deployment, each datacenter can maintain a local rate limiter with a fraction of the global limit. Periodically synchronize counts across datacenters asynchronously. This trades perfect accuracy for lower latency — a request hitting the US datacenter does not need to call the EU datacenter’s Redis.
// Effective limit per datacenter
record DistributedConfig(int globalLimit, int numDatacenters) {
int localLimit() {
return globalLimit / numDatacenters;
}
}
Rate Limiting Rules Engine
A flexible rate limiter supports rules defined externally, allowing operators to change limits without redeploying.
YAML-based rules:
rate_limits:
- name: "default-per-user"
key: "user_id"
algorithm: "token_bucket"
capacity: 100
refill_rate: 10 # tokens per second
- name: "premium-per-user"
key: "user_id"
condition: "user.tier == 'premium'"
algorithm: "token_bucket"
capacity: 1000
refill_rate: 100
- name: "global-per-endpoint"
key: "endpoint"
algorithm: "sliding_window"
max_requests: 50000
window_size_seconds: 60
Java 25 sealed interface for rule types:
sealed interface RateLimitRule permits
RateLimitRule.TokenBucketRule,
RateLimitRule.SlidingWindowRule,
RateLimitRule.FixedWindowRule {
String name();
String keyExtractor();
record TokenBucketRule(String name, String keyExtractor, int capacity, double refillRate)
implements RateLimitRule {}
record SlidingWindowRule(String name, String keyExtractor, int maxRequests, long windowSizeMs)
implements RateLimitRule {}
record FixedWindowRule(String name, String keyExtractor, int maxRequests, long windowSizeMs)
implements RateLimitRule {}
}
// Evaluate all applicable rules using virtual threads for parallel execution
List<RateLimitResult> evaluateRules(List<RateLimitRule> rules, String requestKey) {
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
var futures = rules.stream()
.map(rule -> scope.fork(() -> evaluateRule(rule, requestKey)))
.toList();
scope.join().throwIfFailed();
return futures.stream().map(StructuredTaskScope.Subtask::get).toList();
}
}
HTTP Response Strategy
Every response from a rate-limited API should include informational headers, whether the request was allowed or rejected:
On allowed requests (HTTP 200):
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1707580800
On rejected requests (HTTP 429):
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1707580800
Retry-After: 12
Content-Type: application/json
{
"error": "rate_limit_exceeded",
"message": "Rate limit exceeded. Try again in 12 seconds.",
"retry_after": 12
}
Client-side backoff strategy should use exponential backoff with jitter to prevent thundering herd problems when many clients hit the limit simultaneously:
record BackoffConfig(long baseMs, long maxMs, double jitterFactor) {
long computeDelay(int attempt) {
long exponential = Math.min(maxMs, baseMs * (1L << attempt));
long jitter = (long) (exponential * jitterFactor * ThreadLocalRandom.current().nextDouble());
return exponential + jitter;
}
}
Bottlenecks & Scaling
Redis as a single point of failure:
Deploy Redis Cluster with at least 3 master nodes and 3 replicas. Use hash tags in keys (e.g., {user:123}:rate_limit) to ensure all rate limit keys for a user land on the same shard, avoiding cross-shard Lua script execution.
Clock skew in distributed environments:
Rate limiters that depend on timestamps (token bucket refill, sliding window boundaries) are sensitive to clock drift. Use NTP synchronization across all nodes and design algorithms to tolerate ±100ms drift. Prefer server-side timestamps from Redis (redis.call('TIME')) over client-side System.currentTimeMillis().
Rate limiting at different layers:
| Layer | Use Case | Tools |
|---|---|---|
| L4 (Transport) | Connection rate limiting, SYN flood protection | iptables, IPVS, cloud LB |
| L7 (Application) | Per-user, per-endpoint API limiting | API Gateway, custom middleware |
| Application | Business logic limits (e.g., 5 password attempts) | In-app rate limiter |
A defense-in-depth approach applies rate limiting at multiple layers. L4 limiting protects against volumetric DDoS attacks — these never reach your application-level rate limiter.
Interviewer Tips
Common follow-up questions:
-
“What happens when Redis is down?” — Fail open (allow requests) with local in-memory fallback counters. Log the failure for alerting. Never block legitimate users due to infrastructure failures.
-
“How do you rate limit in a multi-tenant system?” — Use composite keys:
{tenant_id}:{user_id}:{endpoint}. Each tenant can have independent limits defined in the rules engine. -
“How do you handle bursty traffic?” — Token bucket is the go-to. Set the bucket capacity to the burst size and the refill rate to the sustained rate.
-
“Fixed window or sliding window?” — Always prefer sliding window counter in production. Fixed windows allow up to 2× the intended rate at window boundaries.
-
“What about rate limiting WebSocket connections?” — Limit both connection establishment rate and message rate per connection. Use a separate token bucket for message-level throttling.
Red flags interviewers watch for:
- Ignoring the distributed nature of the problem (only designing for a single server).
- Not mentioning atomicity concerns with Redis.
- Forgetting to discuss fail-open vs. fail-closed behavior.
- Overlooking the HTTP response headers and client retry strategy.