Rate Limiting Algorithms: From Token Bucket to Sliding Window
Rate Limiting Algorithms: From Token Bucket to Sliding Window
Every rate limiter answers a binary question: allow or reject. The algorithms differ in how they remember the past and what shape of traffic they permit through. Choosing the wrong algorithm for an endpoint is not a correctness bug. It is a capacity planning failure that surfaces at the worst possible moment.
Token Bucket
The Symptom
Riders open the app during a concert surge. The app fires 6 requests in the first second: nearby drivers, fare estimate, surge pricing check, payment methods, promotions, trip history. A naive rate limiter set to 5 RPS rejects the 6th request. The rider sees a partial load state. The promotions panel is blank. The rider taps retry. The retry storm begins.
The Cause
A simple counter-per-second rate limiter cannot distinguish between a legitimate burst (app open) and sustained abuse (scraping). Mobile clients are inherently bursty. A rate limiter that punishes bursts punishes normal users.
The Mechanics
The token bucket maintains two values: the current token count and the last refill timestamp.
Bucket capacity: 10 tokens
Refill rate: 5 tokens/second
Timeline:
t=0.0s Bucket: 10 tokens (full)
t=0.0s Request 1 → consume 1 → 9 remaining → ALLOWED
t=0.0s Request 2 → consume 1 → 8 remaining → ALLOWED
t=0.0s Request 3 → consume 1 → 7 remaining → ALLOWED
t=0.0s Request 4 → consume 1 → 6 remaining → ALLOWED
t=0.0s Request 5 → consume 1 → 5 remaining → ALLOWED
t=0.0s Request 6 → consume 1 → 4 remaining → ALLOWED
t=0.1s Request 7 → consume 1 → 3.5 remaining → ALLOWED
(0.1s elapsed × 5 tokens/s = 0.5 tokens refilled: 3 + 0.5 = 3.5)
t=0.2s Request 8 → consume 1 → 3.0 remaining → ALLOWED
...
t=2.0s No requests for 2 seconds → bucket refills to 10 (capped at capacity)
The burst of 6 requests at $t=0$ succeeds because the bucket started full. The sustained rate after the burst is governed by the refill rate. The rider sees all 6 panels load. Subsequent polling requests are throttled to 5/s.
Two parameters control the behavior:
- Capacity (burst size): how many requests can fire at once before any throttling. Set this to the number of concurrent requests a legitimate client sends on startup.
- Refill rate (sustained rate): the steady-state requests per second allowed. Set this to the maximum RPS a single client should sustain.
When to Use It
Token bucket is the right default for any endpoint where clients burst on first interaction and then settle into a lower sustained rate. For the ride-hailing platform:
/api/rides/request— rider opens app, requests ride, burst of 3-6 requests/api/fares/estimate— triggered alongside ride request/api/drivers/nearby— initial search plus periodic refresh
The burst tolerance prevents false rejections on legitimate users. The sustained rate prevents any single client from consuming disproportionate backend capacity.
Leaky Bucket
The Symptom
Driver location ingestion receives 50,000 GPS updates per second during peak hours. The geospatial index (R-tree backed by PostgreSQL with PostGIS) handles 3,000 writes per second before query latency for nearby driver searches degrades from 45ms to 2,800ms. The ingestion rate varies wildly: drivers crossing cell tower boundaries dump 30 seconds of buffered GPS updates in a single burst.
The Cause
The geospatial index is write-sensitive. Unlike a key-value store, updating spatial data requires rebalancing the R-tree. Bursts of writes to the same geographic region cause tree rebalancing to cascade. The index needs a constant, predictable write rate.
The Mechanics
The leaky bucket is a queue with a fixed drain rate:
Queue capacity: 5000 entries
Drain rate: 3000 entries/second (constant)
Timeline:
t=0.0s Queue: 0 entries
t=0.0s Burst: 4000 location updates arrive
Queue: 4000 entries
Drain: 3000 entries processed in 1s
t=1.0s Queue: 1000 remaining + 2500 new arrivals = 3500
Drain: 3000 entries processed in 1s
t=2.0s Queue: 500 remaining + 3200 new arrivals = 3700
Drain: 3000 entries processed in 1s
t=3.0s Queue: 700 remaining + 6000 new arrivals = 5000 (capped)
1700 entries rejected
Drain: 3000 entries processed in 1s
The output rate is always 3,000/s regardless of input. The queue absorbs short bursts. Long sustained overload causes rejections, but the downstream system never sees more than 3,000 writes per second.
The Difference from Token Bucket
Token bucket controls input rate with burst tolerance. Leaky bucket controls output rate with queue buffering. For the rider API, you want burst tolerance (let the app open fast). For driver location ingestion, you want output smoothing (protect the spatial index).
Token bucket: [burst allowed] → [sustained rate enforced]
Leaky bucket: [all input queued] → [constant output rate]
A driver sends 30 buffered updates at once. Token bucket would allow all 30 (if capacity >= 30) and they’d hit the spatial index simultaneously. Leaky bucket queues them and drains at 3,000/s aggregate across all drivers.
Fixed Window Counter
The Symptom
The admin analytics endpoint allows 10 requests per minute per API key. An internal dashboard polls every 6 seconds (10 requests/minute, exactly at the limit). Occasionally, the dashboard shows stale data because two requests land at the window boundary and one is rejected.
The Cause
Fixed windows create a boundary discontinuity. A client that sends 10 requests in the last second of one window and 10 requests in the first second of the next window has sent 20 requests in 2 seconds while both windows show 10/minute compliance.
The Mechanics
Window size: 60 seconds
Limit: 10 requests per window
Window [12:00:00 - 12:00:59]:
12:00:50 → count=1 ALLOWED
12:00:51 → count=2 ALLOWED
...
12:00:59 → count=10 ALLOWED
Window [12:01:00 - 12:01:59]:
12:01:00 → count=1 ALLOWED
12:01:01 → count=2 ALLOWED
...
12:01:09 → count=10 ALLOWED
Between 12:00:50 and 12:01:09 (19 seconds), the client sent 20 requests. The configured limit is 10 per 60 seconds. The boundary allowed 2x the intended rate.
Implementation is one Redis key per window:
Key: "rl:admin-key-1:stats:1716480000" (window start epoch)
Value: 7 (current count)
TTL: 120 seconds (2x window to handle clock skew)
INCR + EXPIRE in a pipeline. Two Redis commands. Minimal memory: one key per client per active window.
When to Use It
When the boundary problem is acceptable. For the ride-hailing platform, admin analytics endpoints serve internal dashboards with low traffic. The 2x boundary overshoot on a 10 RPS limit means 20 requests hit the analytics database in a boundary second. The analytics database handles 500 RPS. The overshoot is irrelevant.
Do not use fixed windows for rider-facing endpoints where the boundary problem can be exploited by an attacker or triggered accidentally by synchronized mobile client timers.
Sliding Window Log
The Symptom
An API partner integrating with the ride-hailing platform’s fleet management API has a strict contractual limit: 100 requests per 60-second rolling window. The partner disputes every rate limit rejection, claiming their client sends exactly 100 requests per minute. The fixed window counter occasionally allows 200 requests in a boundary burst, then rejects legitimate requests in the next window when the partner has already used their “leaked” quota.
The Cause
The partner’s SLA requires exact enforcement. Fixed windows cannot provide it. The partner needs a rate limiter that tracks the precise rolling window: at any given moment, exactly 100 requests in the trailing 60 seconds.
The Mechanics
Every request timestamp is stored in a Redis sorted set:
Key: "rl:partner-api-key-1:fleet"
Score: timestamp (epoch seconds with millisecond precision)
Member: unique request ID or timestamp
On each request:
1. ZREMRANGEBYSCORE key 0 (now - 60) // Remove entries older than 60s
2. ZCARD key // Count remaining entries
3. If count < 100: ZADD key now now // Add this request
4. If count >= 100: REJECT // Over limit
At 100 requests per 60 seconds per client, each sorted set holds at most 100 entries. For 50 API partners, that is 5,000 entries total. Manageable.
The problem scales with client count and request rate. At 10,000 clients with 1,000 requests per 60-second window each, the sorted sets hold 10,000,000 entries. Each ZREMRANGEBYSCORE scans entries. Each ZCARD is $O(1)$, but the memory is $O(n)$ where $n$ is the request count within the window.
When to Use It
When you have few clients with strict accuracy requirements. Fleet management API partners (tens of clients, contractual SLAs) fit this profile. Rider-facing endpoints (millions of clients, statistical accuracy sufficient) do not.
Sliding Window Counter
The Symptom
The trip history endpoint needs rate limiting that is more accurate than fixed windows but cannot afford the memory of sliding window logs. There are 2 million active riders. Each rider’s trip history endpoint needs its own rate limit counter.
The Cause
2 million sorted sets in Redis, each holding up to 60 timestamps, would consume approximately 2.8 GB of memory for rate limiting alone. The trip history endpoint does not justify that memory budget.
The Mechanics
The sliding window counter uses two fixed-window counters and a weighted average:
Window size: 60 seconds
Current window: 12:01:00 - 12:01:59
Previous window: 12:00:00 - 12:00:59
Previous window count: 42 requests
Current window count: 18 requests
Current position in window: 12:01:15 (25% elapsed)
Weighted count = current_count + previous_count × (1 - elapsed_fraction)
= 18 + 42 × (1 - 0.25)
= 18 + 31.5
= 49.5
If the limit is 50, this request brings the weighted count to 50.5 and is rejected.
Memory per client: two Redis keys (current and previous window counters). For 2 million riders: 4 million Redis keys, each holding an integer. Approximately 320 MB with key overhead. Compare to 2.8 GB for sliding window logs.
The accuracy tradeoff: Cloudflare’s analysis measured 0.003% error rate compared to a perfect sliding window log. The weighted average assumes requests in the previous window were uniformly distributed. If all 42 previous requests landed in the last second of the previous window, the approximation undercounts. In practice, across millions of clients, the distribution is close enough to uniform that the error is negligible.
When to Use It
When you need better-than-fixed-window accuracy for a large number of clients. Trip history, payment status checks, promotion lookups. Endpoints where the client count makes sliding window logs impractical and the boundary problem of fixed windows is unacceptable.
Decision Matrix
The choice depends on three factors: burst tolerance needed, number of clients, and accuracy requirement.
| Factor | Token Bucket | Leaky Bucket | Fixed Window | Sliding Log | Sliding Counter |
|---|---|---|---|---|---|
| Burst tolerance | Yes | No (queued) | Boundary leak | No | No |
| Memory per client | $O(1)$ | $O(\text{queue})$ | $O(1)$ | $O(n)$ | $O(1)$ |
| Accuracy | Exact | Exact | Boundary 2x | Exact | ~99.997% |
| Implementation complexity | Medium | Medium | Low | Medium | Low |
| Best for | API endpoints | Ingestion pipelines | Low-traffic admin | SLA-bound partners | High-client-count |
For the ride-hailing platform:
- Rider-facing APIs: Token bucket. Burst tolerance handles app-open behavior. One hash key per client in Redis.
- Driver location ingestion: Leaky bucket. Constant output rate protects the spatial index. Implemented as a bounded queue with a fixed drain worker.
- Admin/analytics: Fixed window. Low client count, simplicity wins, boundary problem is irrelevant at the traffic level.
- Partner APIs: Sliding window log. Few clients, contractual SLA enforcement, memory cost is bounded.
- Trip history, payments, promotions: Sliding window counter. Millions of clients, good accuracy, minimal memory.
The next section implements the token bucket with Redis and Lua, covering the atomic operations that make distributed rate limiting correct under concurrency.