Skip to main content
surviving the spike

Distributed Rate Limiting with Redis and Lua

13 min read Chapter 30 of 66

Distributed Rate Limiting with Redis and Lua

The Symptom

The ride-hailing platform runs 12 pods behind a Kubernetes load balancer. Each pod has an in-memory token bucket. Rider “R-4421” sends a burst of 20 requests. The load balancer distributes them round-robin across 12 pods. Each pod sees 1-2 requests from R-4421. No pod rejects anything. R-4421 effectively has a rate limit of 12x the configured value. During the concert surge, every rider gets 12x their limit. The matching service receives 12x the expected load. The rate limiter is decorative.

The Cause

In-memory rate limiting is per-process state. Distributed systems need shared state. The rate limit counter must live in a single location that all pods consult. Redis is that location.

But shared state introduces a new problem: atomicity. Two pods check R-4421’s token count simultaneously. Both read 1 token remaining. Both allow the request. Both decrement. The count goes to -1. The rate limit is violated.

The check-and-decrement must be a single atomic operation. Redis provides three mechanisms:

  1. MULTI/EXEC: Batches commands into a transaction. Cannot read a value and branch on it within the transaction. Useless for “if tokens > 0 then decrement.”
  2. WATCH + MULTI/EXEC: Optimistic locking. WATCH the key, read the value, start MULTI, write the update, EXEC. If another client modified the key between WATCH and EXEC, the transaction fails and the client retries. Under high contention (concert surge), retry storms make this worse than no rate limiting.
  3. Lua scripting (EVAL): The script executes atomically on the Redis server. No other command runs between the read and the write. No retries. No race conditions. This is the correct approach.

The Lua Script, Line by Line

-- rate_limit_token_bucket.lua
local key = KEYS[1]                          -- 1
local capacity = tonumber(ARGV[1])           -- 2
local refill_rate = tonumber(ARGV[2])        -- 3
local now = tonumber(ARGV[3])                -- 4
local requested = tonumber(ARGV[4])          -- 5

local bucket = redis.call('HMGET', key,      -- 6
    'tokens', 'last_refill')                 -- 7
local tokens = tonumber(bucket[1])           -- 8
local last_refill = tonumber(bucket[2])      -- 9

if tokens == nil then                        -- 10
    tokens = capacity                        -- 11
    last_refill = now                        -- 12
end                                          -- 13

local elapsed = now - last_refill            -- 14
local new_tokens = elapsed * refill_rate     -- 15
tokens = math.min(capacity,                  -- 16
    tokens + new_tokens)                     -- 17

local allowed = 0                            -- 18
local remaining = tokens                     -- 19

if tokens >= requested then                  -- 20
    tokens = tokens - requested              -- 21
    allowed = 1                              -- 22
    remaining = tokens                       -- 23
end                                          -- 24

redis.call('HMSET', key,                     -- 25
    'tokens', tokens,                        -- 26
    'last_refill', now)                      -- 27
redis.call('EXPIRE', key,                    -- 28
    math.ceil(capacity / refill_rate) * 2)   -- 29

local retry_after = 0                        -- 30
if allowed == 0 then                         -- 31
    retry_after = (requested - tokens)       -- 32
        / refill_rate                        -- 33
end                                          -- 34

return {allowed, math.floor(remaining),      -- 35
    math.ceil(retry_after)}                  -- 36

Lines 1-5: Input parameters. The key identifies the client and endpoint combination. Capacity and refill rate define the bucket shape. The timestamp comes from the application server (not Redis’s clock) to avoid clock skew between app servers and Redis.

Lines 6-9: Read current bucket state. HMGET retrieves both fields in a single command. On first request, both values are nil.

Lines 10-13: First request initialization. A new client starts with a full bucket. This means the first burst of requests (app open) is always allowed, which matches the desired UX.

Lines 14-17: Token refill calculation. Elapsed time since last refill multiplied by the refill rate gives the number of new tokens. math.min caps at capacity. If a client is inactive for an hour, they get a full bucket, not 36,000 tokens.

Lines 20-24: The atomic check-and-decrement. If enough tokens exist, consume them and mark allowed. This is the operation that cannot be split across two Redis commands without a race condition.

Lines 25-29: Persist the updated state. The EXPIRE sets a TTL of twice the time needed to refill a full bucket. Inactive clients’ keys are automatically cleaned up. At capacity=20 and refill_rate=10, TTL is 4 seconds. Generous enough that active clients never lose their bucket state.

Lines 30-34: Calculate Retry-After for rejected requests. If 0.3 tokens remain and 1 is requested, the client needs 0.7 tokens, which arrive in $0.7 / 10 = 0.07$ seconds. The client can retry in 70ms instead of guessing.

Lines 35-36: Return three values. Lua arrays map to Redis multi-bulk replies. The Java client reads them as List<Long>.

EVAL vs EVALSHA vs SCRIPT LOAD

Every EVAL call sends the full Lua script text to Redis. For a script that runs thousands of times per second, this is wasted bandwidth.

EVAL "local key = KEYS[1]..." 1 "rl:rider-1234:/api/rides/request" 20 10 1716480000.123 1

The script text is ~800 bytes. At 10,000 rate limit checks per second, that is 8 MB/s of repeated script text on the Redis connection.

SCRIPT LOAD registers the script once and returns its SHA1 hash:

SCRIPT LOAD "local key = KEYS[1]..."
→ "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"

EVALSHA executes by hash:

EVALSHA "a1b2c3d4..." 1 "rl:rider-1234:/api/rides/request" 20 10 1716480000.123 1

40-byte hash instead of 800-byte script. At 10,000 RPS: 400 KB/s instead of 8 MB/s.

The failure mode: SCRIPT FLUSH or a Redis restart clears the script cache. EVALSHA returns NOSCRIPT. The client must fall back to EVAL, which re-caches the script. Spring’s RedisScript abstraction handles this automatically:

// SCALED: RedisScript handles EVALSHA with EVAL fallback
@Bean
public RedisScript<List<Long>> tokenBucketScript() {
    DefaultRedisScript<List<Long>> script = new DefaultRedisScript<>();
    script.setLocation(new ClassPathResource("scripts/rate_limit_token_bucket.lua"));
    script.setResultType((Class<List<Long>>) (Class<?>) List.class);
    return script;
}

DefaultRedisScript computes the SHA1 at bean creation, uses EVALSHA for every call, and falls back to EVAL on NOSCRIPT. No application code handles the retry.

Spring WebFlux WebFilter Implementation

The WebFilter intercepts every request before it reaches the controller:

// SCALED: Rate limiting WebFilter with configurable per-endpoint limits
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class RateLimitWebFilter implements WebFilter {

    private final ReactiveStringRedisTemplate redisTemplate;
    private final RedisScript<List<Long>> tokenBucketScript;
    private final Map<String, RateLimitConfig> endpointConfigs;

    public RateLimitWebFilter(
            ReactiveStringRedisTemplate redisTemplate,
            RedisScript<List<Long>> tokenBucketScript) {
        this.redisTemplate = redisTemplate;
        this.tokenBucketScript = tokenBucketScript;
        this.endpointConfigs = Map.of(
            "/api/rides/request", new RateLimitConfig(20, 10),
            "/api/fares/estimate", new RateLimitConfig(20, 10),
            "/api/drivers/nearby", new RateLimitConfig(30, 15),
            "/api/trips/history", new RateLimitConfig(10, 5)
        );
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String path = exchange.getRequest().getPath().value();
        RateLimitConfig config = endpointConfigs.get(path);

        if (config == null) {
            return chain.filter(exchange); // No rate limit for this endpoint
        }

        String clientId = extractClientId(exchange);
        String key = "rl:" + clientId + ":" + path;
        String now = String.valueOf(System.currentTimeMillis() / 1000.0);

        return redisTemplate.execute(
            tokenBucketScript,
            List.of(key),
            List.of(
                String.valueOf(config.capacity()),
                String.valueOf(config.refillRate()),
                now,
                "1"
            )
        )
        .single()
        .flatMap(result -> handleResult(exchange, chain, config, result));
    }

    private Mono<Void> handleResult(
            ServerWebExchange exchange,
            WebFilterChain chain,
            RateLimitConfig config,
            List<Long> result) {

        long allowed = result.get(0);
        long remaining = result.get(1);
        long retryAfter = result.get(2);

        ServerHttpResponse response = exchange.getResponse();
        HttpHeaders headers = response.getHeaders();
        headers.set("X-RateLimit-Limit", String.valueOf(config.capacity()));
        headers.set("X-RateLimit-Remaining", String.valueOf(remaining));

        long resetEpoch = System.currentTimeMillis() / 1000
            + config.capacity() / config.refillRate();
        headers.set("X-RateLimit-Reset", String.valueOf(resetEpoch));

        if (allowed == 1) {
            return chain.filter(exchange);
        }

        // BOTTLENECK: Without this rejection, the request consumes
        // a matching service thread, a DB connection, and a Redis slot
        headers.set("Retry-After", String.valueOf(retryAfter));
        response.setStatusCode(HttpStatus.TOO_MANY_REQUESTS);

        byte[] body = ("{\"error\":\"rate_limit_exceeded\","
            + "\"retry_after\":" + retryAfter + "}")
            .getBytes(StandardCharsets.UTF_8);

        return response.writeWith(
            Mono.just(response.bufferFactory().wrap(body))
        );
    }

    private String extractClientId(ServerWebExchange exchange) {
        String apiKey = exchange.getRequest().getHeaders()
            .getFirst("X-API-Key");
        if (apiKey != null) return "key:" + apiKey;

        String userId = exchange.getRequest().getHeaders()
            .getFirst("X-User-Id");
        if (userId != null) return "user:" + userId;

        InetSocketAddress remote = exchange.getRequest().getRemoteAddress();
        if (remote != null) {
            return "ip:" + remote.getAddress().getHostAddress();
        }
        return "unknown";
    }

    record RateLimitConfig(int capacity, int refillRate) {}
}

Rate Limit Headers

Three headers communicate rate limit state to clients:

  • X-RateLimit-Limit: The bucket capacity. Tells the client the maximum burst size.
  • X-RateLimit-Remaining: Tokens remaining. The client can back off proactively when this approaches zero.
  • X-RateLimit-Reset: Unix epoch when the bucket will be full again. The client can schedule retries precisely.

On rejection (HTTP 429):

  • Retry-After: Seconds until the client should retry. Computed from the Lua script’s remaining token deficit and refill rate. This prevents retry storms: the client waits exactly as long as needed.

A well-behaved mobile client uses these headers:

// Client-side rate limit handling (Android/Kotlin pseudocode)
if (response.code() == 429) {
    val retryAfter = response.header("Retry-After")?.toLong() ?: 1
    delay(retryAfter * 1000)
    retry()
}

A poorly-behaved client ignores them and retries immediately. The rate limiter handles both correctly. The well-behaved client gets served faster. The poorly-behaved client burns through its tokens on retries and gets rejected repeatedly.

Kubernetes Manifest for Rate Limiting Redis

The rate limiting Redis instance must be separate from the caching Redis. Reason: if the caching Redis runs out of memory and evicts keys (using allkeys-lru), rate limit counters are evicted. Every client gets a fresh full bucket. The rate limiter stops working during a memory pressure event, which is exactly when you need it most.

# k8s/redis-ratelimit.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-ratelimit
  namespace: ride-hailing
  labels:
    app: redis-ratelimit
    tier: infrastructure
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-ratelimit
  template:
    metadata:
      labels:
        app: redis-ratelimit
    spec:
      containers:
        - name: redis
          image: redis:7.4-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server"]
          args:
            - "--maxmemory"
            - "256mb"
            - "--maxmemory-policy"
            - "noeviction" # Reject writes when full, never evict
            - "--save"
            - "" # Disable RDB persistence
            - "--appendonly"
            - "no" # Disable AOF persistence
            - "--tcp-backlog"
            - "511"
            - "--timeout"
            - "0"
          resources:
            requests:
              memory: "300Mi"
              cpu: "100m"
            limits:
              memory: "300Mi"
              cpu: "500m"
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: redis-ratelimit
  namespace: ride-hailing
spec:
  selector:
    app: redis-ratelimit
  ports:
    - port: 6379
      targetPort: 6379
  clusterIP: None # Headless for direct pod addressing

Key configuration decisions:

  • noeviction: When Redis reaches 256 MB, EVAL calls return errors instead of silently evicting rate limit keys. The WebFilter catches this and allows the request through (fail-open). Failing open under Redis memory pressure is safer than failing closed and rejecting all traffic.
  • No persistence: Rate limit state is ephemeral. If Redis restarts, every client gets a fresh full bucket. This is a brief burst of un-rate-limited traffic (one bucket capacity per client), not a correctness problem. Persistence adds write latency to every rate limit check.
  • Separate from caching Redis: Failure domain isolation. The caching Redis can be evicting keys, restarting, or running a BGSAVE without affecting rate limiting.

Spring Boot configuration for connecting to the dedicated instance:

# application.yml
spring:
  data:
    redis:
      host: redis-cache # Default Redis for caching
      port: 6379

rate-limit:
  redis:
    host: redis-ratelimit # Dedicated Redis for rate limiting
    port: 6379
// SCALED: Separate RedisTemplate for rate limiting Redis
@Configuration
public class RateLimitRedisConfig {

    @Bean
    public ReactiveStringRedisTemplate rateLimitRedisTemplate(
            @Value("${rate-limit.redis.host}") String host,
            @Value("${rate-limit.redis.port}") int port) {

        LettuceConnectionFactory factory = new LettuceConnectionFactory(
            new RedisStandaloneConfiguration(host, port)
        );
        factory.afterPropertiesSet();
        return new ReactiveStringRedisTemplate(factory);
    }
}

The Baseline: 10x Load Without Rate Limiting

The Locust scenario simulates the concert surge:

# load-tests/ch10s2_surge_locustfile.py
from locust import HttpUser, task, between, tag, events
import random
import string

class ConcertSurgeRider(HttpUser):
    wait_time = between(0.1, 0.3)  # Aggressive: 3-10 requests/second

    def on_start(self):
        self.rider_id = "rider-" + "".join(
            random.choices(string.ascii_lowercase, k=8))

    @tag("surge")
    @task(5)
    def request_ride(self):
        self.client.post(
            "/api/rides/request",
            json={
                "rider_id": self.rider_id,
                "pickup_lat": 40.7505 + random.uniform(-0.01, 0.01),
                "pickup_lng": -73.9934 + random.uniform(-0.01, 0.01),
                "dropoff_lat": 40.7580 + random.uniform(-0.05, 0.05),
                "dropoff_lng": -73.9855 + random.uniform(-0.05, 0.05)
            },
            headers={"X-User-Id": self.rider_id},
            name="/api/rides/request"
        )

    @tag("surge")
    @task(2)
    def fare_estimate(self):
        self.client.post(
            "/api/fares/estimate",
            json={
                "pickup_lat": 40.7505,
                "pickup_lng": -73.9934,
                "dropoff_lat": 40.7580,
                "dropoff_lng": -73.9855
            },
            headers={"X-User-Id": self.rider_id},
            name="/api/fares/estimate"
        )

    @tag("surge")
    @task(1)
    def nearby_drivers(self):
        self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7505, "lng": -73.9934, "radius_km": 3},
            headers={"X-User-Id": self.rider_id},
            name="/api/drivers/nearby"
        )
# Without rate limiting
locust -f load-tests/ch10s2_surge_locustfile.py \
    --host=http://localhost:8080 \
    --users 2000 \
    --spawn-rate 200 \
    --run-time 3m \
    --headless \
    --csv=load-tests/results/ch10s2-no-ratelimit

Results without rate limiting:

Name                     # reqs   Avg   Med    Min    Max     p95     p99    RPS   Fail%
/api/rides/request       42600   3100  1800     18  48000   14000   41000  236.7   21.3%
/api/fares/estimate      17040   2900  1600     22  44000   12000   38000   94.7   19.8%
/api/drivers/nearby       8520   2400  1200     15  38000   10000   32000   47.3   16.2%
Aggregated               68160   2920  1700     15  48000   13000   40000  378.7   20.1%

p99 at 40 seconds. One in five requests fails. The matching service, fare calculator, and geospatial index are all saturated. Every request that gets through is slow because it competes with requests that will eventually fail anyway.

The Fix: Token Bucket Rate Limiting Active

Same scenario, rate limiting enabled:

# With rate limiting (token bucket: capacity=20, refill=10/s per client)
locust -f load-tests/ch10s2_surge_locustfile.py \
    --host=http://localhost:8080 \
    --users 2000 \
    --spawn-rate 200 \
    --run-time 3m \
    --headless \
    --csv=load-tests/results/ch10s2-with-ratelimit

The Proof

Results with token bucket rate limiting:

Name                     # reqs   Avg   Med   Min    Max    p95    p99   RPS    Fail%
/api/rides/request       42600    68    38    10    820    280    580  236.7    0.1%
  (429 responses)        31200     2     2     1     10      6      8  173.3     -
/api/fares/estimate      17040    74    42    12    900    310    620   94.7    0.1%
  (429 responses)        12400     2     2     1     12      7      9   68.9     -
/api/drivers/nearby       8520    52    30     8    680    220    480   47.3    0.0%
  (429 responses)         5900     2     2     1      9      5      7   32.8     -
Aggregated               68160    65    36     8    900    280    590  378.7    0.1%

The numbers tell the story:

MetricBeforeAfterDelta
p99 latency (allowed requests)40,000ms590ms-98.5%
Failure rate (non-429)20.1%0.1%-99.5%
429 response time (avg)n/a2ms-
Matching service RPS378105-72.3%

The 429 responses cost 2ms each. They touch Redis (one EVALSHA round trip) and return. No matching service call. No database query. No fare calculation. The 72.3% reduction in matching service RPS keeps it within its capacity ceiling of ~150 RPS.

The Prometheus panel confirms the backend is protected:

# Matching service thread pool utilization during surge
hikaricp_connections_active{pool="matching-service-pool"}
/ hikaricp_connections_max{pool="matching-service-pool"}

Without rate limiting: 100% utilization, connection wait times > 30 seconds. With rate limiting: 62% utilization, connection wait times < 5ms.

The rate limiter did not reduce the number of riders served. It reduced the number of redundant requests from each rider. Each of the 2,000 simulated riders got their 20-request burst and 10/s sustained rate. The requests beyond that were retries from impatient clients or automated polling that consumed resources without adding value.

The matching service processes fewer requests but serves the same number of unique ride matches. The riders who get through see p99 of 590ms instead of 40 seconds. The riders who get a 429 see a “please wait” screen with a countdown based on the Retry-After header instead of a spinning loader that eventually times out.