Design a Rate Limiter

Rate limiting controls the number of requests a client can send to a server within a given time window. Every large-scale API — from Stripe to GitHub to Cloudflare — uses rate limiting to prevent abuse, protect resources, and ensure fair usage. The design is frequently tested in system design interviews because it touches on distributed systems, concurrency, algorithmic trade-offs, and practical API design.

Requirements

Functional Requirements

Limit requests per user, IP address, or API key within a configurable time window.
Configurable rules — different limits for different API endpoints, user tiers, or authentication levels.
Return rate limit headers — X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset on every response.
Reject excess requests with HTTP 429 (Too Many Requests) and include a Retry-After header.
Support multiple rate limiting dimensions — a single request can be evaluated against multiple rules simultaneously (e.g., per-user AND per-endpoint limits).

Non-Functional Requirements

Low latency overhead — rate limit checks must complete in under 5ms (p99) to avoid degrading API response times.
Distributed — rate limiting must work correctly across multiple API gateway instances without centralized bottlenecks.
Highly available — a rate limiter failure should default to allowing requests (fail-open) rather than blocking legitimate traffic.
Accuracy — tolerate a small margin of error (±1-2%) in distributed environments rather than requiring perfect precision.

Capacity Estimation

Assume a platform with 500 million registered users, of which 10 million are daily active. At peak load, the system handles 10,000 requests per second (RPS).

Memory estimation for counters:

Each rate limit entry stores: user identifier (64 bytes) + counter (8 bytes) + timestamp (8 bytes) + TTL metadata (8 bytes) ≈ 88 bytes per entry.
With 10M active users and an average of 3 rate limit rules each: 10M × 3 × 88 bytes ≈ 2.64 GB.
Redis can comfortably hold this in memory on a single node, though we use Redis Cluster for availability.

Throughput:

10K RPS × 1-3 Redis operations per request = 10K–30K Redis operations/second — well within Redis’s capability of 100K+ ops/sec on a single instance.

High-Level Design

┌──────────┐     ┌──────────────────────────────┐     ┌─────────────────┐
│          │     │       API Gateway             │     │                 │
│  Client  ├────►│  ┌──────────────────────┐     ├────►│  Backend        │
│          │     │  │  Rate Limiter        │     │     │  Services       │
│          │◄────┤  │  Middleware           │     │◄────┤                 │
│          │     │  └─────────┬────────────┘     │     │                 │
└──────────┘     │            │                  │     └─────────────────┘
                 └────────────┼──────────────────┘
                              │
                    ┌─────────▼─────────┐
                    │   Redis Cluster    │
                    │   (Counters +      │
                    │    Token Buckets)   │
                    └────────────────────┘

Middleware vs. Separate Service:

Placing the rate limiter as middleware in the API gateway keeps latency low — there is no additional network hop. The middleware intercepts every incoming request, queries Redis for the current count, and either forwards the request downstream or returns HTTP 429.

A separate rate limiting service provides better separation of concerns and independent scaling, but adds a network hop (1-3ms). For most systems, the middleware approach wins on latency. Use a separate service when you need complex, centralized policy management across heterogeneous backends.

Deep Dive

Algorithm Comparison

Algorithm	Memory	Accuracy	Burst Handling	Complexity	Best For
Token Bucket	Low (counter + timestamp per key)	Good	Allows controlled bursts	Low	API rate limiting with burst tolerance
Leaky Bucket	Low (queue size)	High	Smooths bursts (no spikes)	Low	Traffic shaping, steady throughput
Fixed Window Counter	Very Low (single counter per key)	Low (boundary spike issue)	Poor (2x burst at window edges)	Very Low	Simple use cases, approximate limiting
Sliding Window Log	High (stores every timestamp)	Perfect	Precise	High	Audit-critical systems
Sliding Window Counter	Low (two counters per key)	Good (weighted approximation)	Good	Low	Production API limiting (best trade-off)

Token Bucket Algorithm

The token bucket algorithm works by maintaining a virtual “bucket” for each rate-limited entity. Tokens accumulate at a fixed rate up to a maximum capacity. Each request consumes one token. When the bucket is empty, requests are rejected.

Walkthrough:

A bucket starts with capacity tokens (e.g., 10).
Tokens are added at a refill rate (e.g., 1 token per second).
When a request arrives, check the bucket: if tokens > 0, consume one token and allow the request. If tokens = 0, reject.
Tokens never exceed the bucket capacity — excess tokens are discarded.

This design naturally allows bursts: a client that has been idle accumulates tokens up to the capacity, enabling a short burst of rapid requests before the rate limit kicks in.

Java 25 Implementation:

// Token bucket state as an immutable record
record TokenBucket(
    String key,
    int capacity,
    double refillRatePerSecond,
    double availableTokens,
    long lastRefillTimestamp
) {
    TokenBucket refill(long nowMillis) {
        double elapsed = (nowMillis - lastRefillTimestamp) / 1000.0;
        double newTokens = Math.min(capacity, availableTokens + elapsed * refillRatePerSecond);
        return new TokenBucket(key, capacity, refillRatePerSecond, newTokens, nowMillis);
    }

    record ConsumeResult(TokenBucket bucket, boolean allowed) {}

    ConsumeResult tryConsume() {
        if (availableTokens >= 1.0) {
            var updated = new TokenBucket(key, capacity, refillRatePerSecond, availableTokens - 1, lastRefillTimestamp);
            return new ConsumeResult(updated, true);
        }
        return new ConsumeResult(this, false);
    }
}

Redis + Lua Script for Atomic Operations:

In a distributed environment, multiple API gateway instances access the same Redis key simultaneously. A Lua script guarantees atomicity — Redis executes it as a single operation with no interleaving.

-- token_bucket.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity, ARGV[2] = refill rate (tokens/sec), ARGV[3] = current time (ms)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    tokens = capacity
    last_refill = now
end

-- Refill tokens based on elapsed time
local elapsed = (now - last_refill) / 1000.0
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)

return {allowed, math.floor(tokens)}

Calling from Java 25 with virtual threads:

sealed interface RateLimitResult permits RateLimitResult.Allowed, RateLimitResult.Rejected {
    record Allowed(int remainingTokens) implements RateLimitResult {}
    record Rejected(long retryAfterMs) implements RateLimitResult {}
}

class RedisTokenBucketLimiter {
    private final JedisPool jedisPool;
    private final String luaScriptSha;

    RateLimitResult checkLimit(String key, int capacity, double refillRate) {
        try (var jedis = jedisPool.getResource()) {
            var result = (List<Long>) jedis.evalsha(
                luaScriptSha,
                List.of(key),
                List.of(
                    String.valueOf(capacity),
                    String.valueOf(refillRate),
                    String.valueOf(System.currentTimeMillis())
                )
            );
            long allowed = result.get(0);
            long remaining = result.get(1);

            return switch ((int) allowed) {
                case 1 -> new RateLimitResult.Allowed((int) remaining);
                case 0 -> new RateLimitResult.Rejected((long) (1000.0 / refillRate));
                default -> throw new IllegalStateException("Unexpected result from Lua script");
            };
        }
    }
}

Sliding Window Counter

The sliding window counter addresses the boundary problem of fixed windows by using a weighted combination of the current and previous window’s counts.

How it works:

Maintain counters for the current window and the previous window.
When a request arrives at time t, calculate the weighted count:

weightedCount = previousCount × overlapRatio + currentCount

where overlapRatio = (windowSize - elapsedInCurrentWindow) / windowSize.
If weightedCount < limit, allow the request and increment the current counter.

Java 25 Implementation:

record SlidingWindowState(
    long previousCount,
    long currentCount,
    long windowStartMs,
    long windowSizeMs
) {
    double weightedCount(long nowMs) {
        long elapsed = nowMs - windowStartMs;
        if (elapsed >= windowSizeMs) {
            // We've moved past the current window entirely
            return 0;
        }
        double overlapRatio = (windowSizeMs - elapsed) / (double) windowSizeMs;
        return previousCount * overlapRatio + currentCount;
    }
}

class SlidingWindowRateLimiter {
    private final JedisPool jedisPool;

    RateLimitResult checkLimit(String key, int maxRequests, long windowSizeMs) {
        long now = System.currentTimeMillis();
        long currentWindow = now / windowSizeMs;
        long previousWindow = currentWindow - 1;

        String currentKey = key + ":" + currentWindow;
        String previousKey = key + ":" + previousWindow;

        try (var jedis = jedisPool.getResource()) {
            // Pipeline to reduce round trips
            var pipeline = jedis.pipelined();
            var prevResponse = pipeline.get(previousKey);
            var currResponse = pipeline.get(currentKey);
            pipeline.sync();

            long prevCount = prevResponse.get() != null ? Long.parseLong(prevResponse.get()) : 0;
            long currCount = currResponse.get() != null ? Long.parseLong(currResponse.get()) : 0;

            long elapsedInWindow = now % windowSizeMs;
            double overlapRatio = (windowSizeMs - elapsedInWindow) / (double) windowSizeMs;
            double weightedCount = prevCount * overlapRatio + currCount;

            if (weightedCount < maxRequests) {
                jedis.incr(currentKey);
                jedis.expire(currentKey, (int) (windowSizeMs / 1000) * 2);
                int remaining = (int) (maxRequests - weightedCount - 1);
                return new RateLimitResult.Allowed(remaining);
            }

            long retryAfterMs = windowSizeMs - elapsedInWindow;
            return new RateLimitResult.Rejected(retryAfterMs);
        }
    }
}

Distributed Rate Limiting

When multiple API gateway instances share a Redis-backed rate limiter, race conditions become the primary concern. Two instances can read the same counter value, both see “under limit,” and both increment — allowing more requests than intended.

Solving with Lua Scripts:

Lua scripts execute atomically in Redis. The check-and-increment operation happens as a single unit — no other command can interleave. This is the recommended approach and the reason the token bucket implementation above uses Lua.

Redis MULTI/EXEC Alternative:

// Optimistic locking with WATCH
try (var jedis = jedisPool.getResource()) {
    jedis.watch(key);
    String current = jedis.get(key);
    int count = current != null ? Integer.parseInt(current) : 0;

    if (count >= limit) {
        jedis.unwatch();
        return new RateLimitResult.Rejected(retryAfterMs);
    }

    Transaction tx = jedis.multi();
    tx.incr(key);
    tx.expire(key, windowSeconds);
    List<Object> results = tx.exec();

    if (results == null) {
        // Transaction aborted due to concurrent modification — retry
        return checkLimit(key, limit, windowSeconds);
    }
}

Synchronization Across Instances:

In a multi-datacenter deployment, each datacenter can maintain a local rate limiter with a fraction of the global limit. Periodically synchronize counts across datacenters asynchronously. This trades perfect accuracy for lower latency — a request hitting the US datacenter does not need to call the EU datacenter’s Redis.

// Effective limit per datacenter
record DistributedConfig(int globalLimit, int numDatacenters) {
    int localLimit() {
        return globalLimit / numDatacenters;
    }
}

Rate Limiting Rules Engine

A flexible rate limiter supports rules defined externally, allowing operators to change limits without redeploying.

YAML-based rules:

rate_limits:
  - name: "default-per-user"
    key: "user_id"
    algorithm: "token_bucket"
    capacity: 100
    refill_rate: 10  # tokens per second

  - name: "premium-per-user"
    key: "user_id"
    condition: "user.tier == 'premium'"
    algorithm: "token_bucket"
    capacity: 1000
    refill_rate: 100

  - name: "global-per-endpoint"
    key: "endpoint"
    algorithm: "sliding_window"
    max_requests: 50000
    window_size_seconds: 60

Java 25 sealed interface for rule types:

sealed interface RateLimitRule permits
    RateLimitRule.TokenBucketRule,
    RateLimitRule.SlidingWindowRule,
    RateLimitRule.FixedWindowRule {

    String name();
    String keyExtractor();

    record TokenBucketRule(String name, String keyExtractor, int capacity, double refillRate)
        implements RateLimitRule {}

    record SlidingWindowRule(String name, String keyExtractor, int maxRequests, long windowSizeMs)
        implements RateLimitRule {}

    record FixedWindowRule(String name, String keyExtractor, int maxRequests, long windowSizeMs)
        implements RateLimitRule {}
}

// Evaluate all applicable rules using virtual threads for parallel execution
List<RateLimitResult> evaluateRules(List<RateLimitRule> rules, String requestKey) {
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        var futures = rules.stream()
            .map(rule -> scope.fork(() -> evaluateRule(rule, requestKey)))
            .toList();

        scope.join().throwIfFailed();
        return futures.stream().map(StructuredTaskScope.Subtask::get).toList();
    }
}

HTTP Response Strategy

Every response from a rate-limited API should include informational headers, whether the request was allowed or rejected:

On allowed requests (HTTP 200):

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1707580800

On rejected requests (HTTP 429):

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1707580800
Retry-After: 12
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 12 seconds.",
  "retry_after": 12
}

Client-side backoff strategy should use exponential backoff with jitter to prevent thundering herd problems when many clients hit the limit simultaneously:

record BackoffConfig(long baseMs, long maxMs, double jitterFactor) {
    long computeDelay(int attempt) {
        long exponential = Math.min(maxMs, baseMs * (1L << attempt));
        long jitter = (long) (exponential * jitterFactor * ThreadLocalRandom.current().nextDouble());
        return exponential + jitter;
    }
}

Bottlenecks & Scaling

Redis as a single point of failure:

Deploy Redis Cluster with at least 3 master nodes and 3 replicas. Use hash tags in keys (e.g., {user:123}:rate_limit) to ensure all rate limit keys for a user land on the same shard, avoiding cross-shard Lua script execution.

Clock skew in distributed environments:

Rate limiters that depend on timestamps (token bucket refill, sliding window boundaries) are sensitive to clock drift. Use NTP synchronization across all nodes and design algorithms to tolerate ±100ms drift. Prefer server-side timestamps from Redis (redis.call('TIME')) over client-side System.currentTimeMillis().

Rate limiting at different layers:

Layer	Use Case	Tools
L4 (Transport)	Connection rate limiting, SYN flood protection	iptables, IPVS, cloud LB
L7 (Application)	Per-user, per-endpoint API limiting	API Gateway, custom middleware
Application	Business logic limits (e.g., 5 password attempts)	In-app rate limiter

A defense-in-depth approach applies rate limiting at multiple layers. L4 limiting protects against volumetric DDoS attacks — these never reach your application-level rate limiter.

Interviewer Tips

Common follow-up questions:

“What happens when Redis is down?” — Fail open (allow requests) with local in-memory fallback counters. Log the failure for alerting. Never block legitimate users due to infrastructure failures.
“How do you rate limit in a multi-tenant system?” — Use composite keys: {tenant_id}:{user_id}:{endpoint}. Each tenant can have independent limits defined in the rules engine.
“How do you handle bursty traffic?” — Token bucket is the go-to. Set the bucket capacity to the burst size and the refill rate to the sustained rate.
“Fixed window or sliding window?” — Always prefer sliding window counter in production. Fixed windows allow up to 2× the intended rate at window boundaries.
“What about rate limiting WebSocket connections?” — Limit both connection establishment rate and message rate per connection. Use a separate token bucket for message-level throttling.

Red flags interviewers watch for:

Ignoring the distributed nature of the problem (only designing for a single server).
Not mentioning atomicity concerns with Redis.
Forgetting to discuss fail-open vs. fail-closed behavior.
Overlooking the HTTP response headers and client retry strategy.