Skip to main content
surviving the spike

Rate Limiting Algorithms: From Token Bucket to Sliding Window

10 min read Chapter 29 of 66

Rate Limiting Algorithms: From Token Bucket to Sliding Window

Every rate limiter answers a binary question: allow or reject. The algorithms differ in how they remember the past and what shape of traffic they permit through. Choosing the wrong algorithm for an endpoint is not a correctness bug. It is a capacity planning failure that surfaces at the worst possible moment.

Token Bucket

The Symptom

Riders open the app during a concert surge. The app fires 6 requests in the first second: nearby drivers, fare estimate, surge pricing check, payment methods, promotions, trip history. A naive rate limiter set to 5 RPS rejects the 6th request. The rider sees a partial load state. The promotions panel is blank. The rider taps retry. The retry storm begins.

The Cause

A simple counter-per-second rate limiter cannot distinguish between a legitimate burst (app open) and sustained abuse (scraping). Mobile clients are inherently bursty. A rate limiter that punishes bursts punishes normal users.

The Mechanics

The token bucket maintains two values: the current token count and the last refill timestamp.

Bucket capacity: 10 tokens
Refill rate: 5 tokens/second

Timeline:

t=0.0s  Bucket: 10 tokens (full)
t=0.0s  Request 1 → consume 1 → 9 remaining → ALLOWED
t=0.0s  Request 2 → consume 1 → 8 remaining → ALLOWED
t=0.0s  Request 3 → consume 1 → 7 remaining → ALLOWED
t=0.0s  Request 4 → consume 1 → 6 remaining → ALLOWED
t=0.0s  Request 5 → consume 1 → 5 remaining → ALLOWED
t=0.0s  Request 6 → consume 1 → 4 remaining → ALLOWED
t=0.1s  Request 7 → consume 1 → 3.5 remaining → ALLOWED
        (0.1s elapsed × 5 tokens/s = 0.5 tokens refilled: 3 + 0.5 = 3.5)
t=0.2s  Request 8 → consume 1 → 3.0 remaining → ALLOWED
...
t=2.0s  No requests for 2 seconds → bucket refills to 10 (capped at capacity)

The burst of 6 requests at $t=0$ succeeds because the bucket started full. The sustained rate after the burst is governed by the refill rate. The rider sees all 6 panels load. Subsequent polling requests are throttled to 5/s.

Two parameters control the behavior:

  • Capacity (burst size): how many requests can fire at once before any throttling. Set this to the number of concurrent requests a legitimate client sends on startup.
  • Refill rate (sustained rate): the steady-state requests per second allowed. Set this to the maximum RPS a single client should sustain.

When to Use It

Token bucket is the right default for any endpoint where clients burst on first interaction and then settle into a lower sustained rate. For the ride-hailing platform:

  • /api/rides/request — rider opens app, requests ride, burst of 3-6 requests
  • /api/fares/estimate — triggered alongside ride request
  • /api/drivers/nearby — initial search plus periodic refresh

The burst tolerance prevents false rejections on legitimate users. The sustained rate prevents any single client from consuming disproportionate backend capacity.

Leaky Bucket

The Symptom

Driver location ingestion receives 50,000 GPS updates per second during peak hours. The geospatial index (R-tree backed by PostgreSQL with PostGIS) handles 3,000 writes per second before query latency for nearby driver searches degrades from 45ms to 2,800ms. The ingestion rate varies wildly: drivers crossing cell tower boundaries dump 30 seconds of buffered GPS updates in a single burst.

The Cause

The geospatial index is write-sensitive. Unlike a key-value store, updating spatial data requires rebalancing the R-tree. Bursts of writes to the same geographic region cause tree rebalancing to cascade. The index needs a constant, predictable write rate.

The Mechanics

The leaky bucket is a queue with a fixed drain rate:

Queue capacity: 5000 entries
Drain rate: 3000 entries/second (constant)

Timeline:

t=0.0s  Queue: 0 entries
t=0.0s  Burst: 4000 location updates arrive
        Queue: 4000 entries
        Drain: 3000 entries processed in 1s
t=1.0s  Queue: 1000 remaining + 2500 new arrivals = 3500
        Drain: 3000 entries processed in 1s
t=2.0s  Queue: 500 remaining + 3200 new arrivals = 3700
        Drain: 3000 entries processed in 1s
t=3.0s  Queue: 700 remaining + 6000 new arrivals = 5000 (capped)
        1700 entries rejected
        Drain: 3000 entries processed in 1s

The output rate is always 3,000/s regardless of input. The queue absorbs short bursts. Long sustained overload causes rejections, but the downstream system never sees more than 3,000 writes per second.

The Difference from Token Bucket

Token bucket controls input rate with burst tolerance. Leaky bucket controls output rate with queue buffering. For the rider API, you want burst tolerance (let the app open fast). For driver location ingestion, you want output smoothing (protect the spatial index).

Token bucket:  [burst allowed] → [sustained rate enforced]
Leaky bucket:  [all input queued] → [constant output rate]

A driver sends 30 buffered updates at once. Token bucket would allow all 30 (if capacity >= 30) and they’d hit the spatial index simultaneously. Leaky bucket queues them and drains at 3,000/s aggregate across all drivers.

Fixed Window Counter

The Symptom

The admin analytics endpoint allows 10 requests per minute per API key. An internal dashboard polls every 6 seconds (10 requests/minute, exactly at the limit). Occasionally, the dashboard shows stale data because two requests land at the window boundary and one is rejected.

The Cause

Fixed windows create a boundary discontinuity. A client that sends 10 requests in the last second of one window and 10 requests in the first second of the next window has sent 20 requests in 2 seconds while both windows show 10/minute compliance.

The Mechanics

Window size: 60 seconds
Limit: 10 requests per window
Window [12:00:00 - 12:00:59]:
  12:00:50 → count=1  ALLOWED
  12:00:51 → count=2  ALLOWED
  ...
  12:00:59 → count=10 ALLOWED

Window [12:01:00 - 12:01:59]:
  12:01:00 → count=1  ALLOWED
  12:01:01 → count=2  ALLOWED
  ...
  12:01:09 → count=10 ALLOWED

Between 12:00:50 and 12:01:09 (19 seconds), the client sent 20 requests. The configured limit is 10 per 60 seconds. The boundary allowed 2x the intended rate.

Implementation is one Redis key per window:

Key: "rl:admin-key-1:stats:1716480000"  (window start epoch)
Value: 7 (current count)
TTL: 120 seconds (2x window to handle clock skew)

INCR + EXPIRE in a pipeline. Two Redis commands. Minimal memory: one key per client per active window.

When to Use It

When the boundary problem is acceptable. For the ride-hailing platform, admin analytics endpoints serve internal dashboards with low traffic. The 2x boundary overshoot on a 10 RPS limit means 20 requests hit the analytics database in a boundary second. The analytics database handles 500 RPS. The overshoot is irrelevant.

Do not use fixed windows for rider-facing endpoints where the boundary problem can be exploited by an attacker or triggered accidentally by synchronized mobile client timers.

Sliding Window Log

The Symptom

An API partner integrating with the ride-hailing platform’s fleet management API has a strict contractual limit: 100 requests per 60-second rolling window. The partner disputes every rate limit rejection, claiming their client sends exactly 100 requests per minute. The fixed window counter occasionally allows 200 requests in a boundary burst, then rejects legitimate requests in the next window when the partner has already used their “leaked” quota.

The Cause

The partner’s SLA requires exact enforcement. Fixed windows cannot provide it. The partner needs a rate limiter that tracks the precise rolling window: at any given moment, exactly 100 requests in the trailing 60 seconds.

The Mechanics

Every request timestamp is stored in a Redis sorted set:

Key: "rl:partner-api-key-1:fleet"
Score: timestamp (epoch seconds with millisecond precision)
Member: unique request ID or timestamp

On each request:
1. ZREMRANGEBYSCORE key 0 (now - 60)     // Remove entries older than 60s
2. ZCARD key                              // Count remaining entries
3. If count < 100: ZADD key now now       // Add this request
4. If count >= 100: REJECT                // Over limit

At 100 requests per 60 seconds per client, each sorted set holds at most 100 entries. For 50 API partners, that is 5,000 entries total. Manageable.

The problem scales with client count and request rate. At 10,000 clients with 1,000 requests per 60-second window each, the sorted sets hold 10,000,000 entries. Each ZREMRANGEBYSCORE scans entries. Each ZCARD is $O(1)$, but the memory is $O(n)$ where $n$ is the request count within the window.

When to Use It

When you have few clients with strict accuracy requirements. Fleet management API partners (tens of clients, contractual SLAs) fit this profile. Rider-facing endpoints (millions of clients, statistical accuracy sufficient) do not.

Sliding Window Counter

The Symptom

The trip history endpoint needs rate limiting that is more accurate than fixed windows but cannot afford the memory of sliding window logs. There are 2 million active riders. Each rider’s trip history endpoint needs its own rate limit counter.

The Cause

2 million sorted sets in Redis, each holding up to 60 timestamps, would consume approximately 2.8 GB of memory for rate limiting alone. The trip history endpoint does not justify that memory budget.

The Mechanics

The sliding window counter uses two fixed-window counters and a weighted average:

Window size: 60 seconds
Current window: 12:01:00 - 12:01:59
Previous window: 12:00:00 - 12:00:59

Previous window count: 42 requests
Current window count: 18 requests
Current position in window: 12:01:15 (25% elapsed)

Weighted count = current_count + previous_count × (1 - elapsed_fraction)
               = 18 + 42 × (1 - 0.25)
               = 18 + 31.5
               = 49.5

If the limit is 50, this request brings the weighted count to 50.5 and is rejected.

Memory per client: two Redis keys (current and previous window counters). For 2 million riders: 4 million Redis keys, each holding an integer. Approximately 320 MB with key overhead. Compare to 2.8 GB for sliding window logs.

The accuracy tradeoff: Cloudflare’s analysis measured 0.003% error rate compared to a perfect sliding window log. The weighted average assumes requests in the previous window were uniformly distributed. If all 42 previous requests landed in the last second of the previous window, the approximation undercounts. In practice, across millions of clients, the distribution is close enough to uniform that the error is negligible.

When to Use It

When you need better-than-fixed-window accuracy for a large number of clients. Trip history, payment status checks, promotion lookups. Endpoints where the client count makes sliding window logs impractical and the boundary problem of fixed windows is unacceptable.

Decision Matrix

The choice depends on three factors: burst tolerance needed, number of clients, and accuracy requirement.

FactorToken BucketLeaky BucketFixed WindowSliding LogSliding Counter
Burst toleranceYesNo (queued)Boundary leakNoNo
Memory per client$O(1)$$O(\text{queue})$$O(1)$$O(n)$$O(1)$
AccuracyExactExactBoundary 2xExact~99.997%
Implementation complexityMediumMediumLowMediumLow
Best forAPI endpointsIngestion pipelinesLow-traffic adminSLA-bound partnersHigh-client-count

For the ride-hailing platform:

  • Rider-facing APIs: Token bucket. Burst tolerance handles app-open behavior. One hash key per client in Redis.
  • Driver location ingestion: Leaky bucket. Constant output rate protects the spatial index. Implemented as a bounded queue with a fixed drain worker.
  • Admin/analytics: Fixed window. Low client count, simplicity wins, boundary problem is irrelevant at the traffic level.
  • Partner APIs: Sliding window log. Few clients, contractual SLA enforcement, memory cost is bounded.
  • Trip history, payments, promotions: Sliding window counter. Millions of clients, good accuracy, minimal memory.

The next section implements the token bucket with Redis and Lua, covering the atomic operations that make distributed rate limiting correct under concurrency.