Rate Limiting at Scale: Algorithms, Redis Patterns, and Distributed Coordination

Rate limiting protects services from overload. The algorithm matters less than where you enforce it and whether your counters are consistent across nodes.

Why rate limiting is a systems problem, not a library problem

Most rate limiting tutorials show you how to add middleware to a single server. That works until you have more than one server. With multiple instances, each maintaining its own counter, a client can make N requests to each of your M servers and exceed the intended limit by a factor of M. Rate limiting is only effective when the counter is shared.

The algorithms

Fixed window

Divide time into discrete windows (per minute, per hour). Count requests in the current window. If the count exceeds the limit, reject the request.

Simple. Has a problem. A client can make 100 requests at 12:00:59 and 100 more at 12:01:01 — 200 requests in two seconds, twice the per-minute limit. The window boundary creates a burst opportunity.

Sliding window log

Record a timestamp for each request. Count requests within the last N seconds (rolling backward from now). Reject if count exceeds limit.

Accurate. Eliminates the boundary burst problem. Expensive. You need to store and query a log of timestamps per client. For high-traffic APIs this is prohibitive.

Sliding window counter

A compromise: use two fixed window counters (the current window and the previous one) and compute a weighted count. If the current window is 70% complete, the previous window still covers 30% of the sliding interval, so it contributes 30% of its count:

count = current_window_count + (previous_window_count × 0.30)

Reject the request if the weighted count exceeds the limit.

Approximate but practical. Much cheaper than the log approach. The error is bounded and acceptable for most use cases.
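
A minimal sketch of the weighted check; the two counter values are assumed to come from a shared store such as Redis, and the names are illustrative:

def sliding_window_allow(current_count, previous_count, elapsed_fraction, limit):
    # elapsed_fraction: how far into the current window we are, 0.0 to 1.0
    weighted = current_count + previous_count * (1.0 - elapsed_fraction)
    return weighted < limit

# 70% into the current window with a limit of 100:
# sliding_window_allow(80, 60, 0.70, 100) -> 80 + 60 * 0.30 = 98 -> allowed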

Token bucket

Each client has a bucket with a maximum capacity of N tokens. Tokens refill at a constant rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected.

Allows bursting up to the bucket capacity, then enforces the steady-state rate. Good for APIs where occasional bursts are acceptable but sustained overload is not.
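
A single-process sketch of the refill-on-demand variant; capacity and refill_rate are illustrative, and a multi-node deployment would keep the token count and last-refill time in a shared store instead of process memory:

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill lazily based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: TokenBucket(capacity=20, refill_rate=10) allows bursts of 20, then 10 req/s sustained.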

Leaky bucket

Requests enter a queue (the bucket) and are processed at a fixed output rate. If the queue is full, the request is dropped.

Smooths traffic into a constant output rate. Useful for downstream systems that cannot handle bursts. Less useful for user-facing APIs where you want to allow some burst.
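
A toy single-process sketch of the queue variant, assuming a handler callable; real deployments usually put this in a proxy or message queue rather than application threads:

import queue
import threading
import time

def start_leaky_bucket(handler, capacity, drain_rate):
    bucket = queue.Queue(maxsize=capacity)     # the bucket is a bounded queue

    def worker():
        while True:
            request = bucket.get()
            handler(request)
            time.sleep(1.0 / drain_rate)       # fixed output rate

    threading.Thread(target=worker, daemon=True).start()

    def submit(request):
        try:
            bucket.put_nowait(request)         # queue full -> drop the request
            return True
        except queue.Full:
            return False

    return submit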

Implementing with Redis

Redis is the standard shared counter store. Two patterns:

INCR + EXPIRE (fixed window)

key = "rate:{client_id}:{window}"   # window = Unix timestamp // window_size (the window index)
count = INCR key
if count == 1:
    EXPIRE key window_size_seconds
if count > limit:
    reject

Atomic increment with expiry. Simple, fast. Has the fixed window burst problem.
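
The same pattern as runnable Python, assuming the redis-py client; the gap between INCR and EXPIRE is the race addressed by the Lua section below:

import time
import redis

r = redis.Redis()

def fixed_window_allow(client_id, limit=100, window_size=60):
    window = int(time.time()) // window_size      # index of the current window
    key = f"rate:{client_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_size)                # first request in the window sets the TTL
    return count <= limit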

Sorted set (sliding window log)

key = "rate:{client_id}"
now = Unix timestamp in ms
window_start = now - window_ms

ZREMRANGEBYSCORE key 0 window_start    # remove old entries
count = ZCARD key
if count < limit:
    ZADD key now now                   # score and value both = timestamp
    EXPIRE key window_seconds
else:
    reject

Accurate sliding window. Each timestamp is a member of a sorted set. Removing old entries and counting is O(log N + M) where M is the number of entries removed.

Lua script for atomicity

Both patterns above have a race condition if you use separate commands. Wrap them in a Lua script, which Redis executes atomically:

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
    redis.call('ZADD', key, now, now)
    redis.call('EXPIRE', key, math.ceil(window / 1000))
    return 1
end
return 0
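
Calling the script from application code, assuming the redis-py client; register_script handles SCRIPT LOAD and EVALSHA caching, so each check is a single round trip:

import time
import redis

r = redis.Redis()

SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
    redis.call('ZADD', key, now, now)
    redis.call('EXPIRE', key, math.ceil(window / 1000))
    return 1
end
return 0
"""
sliding_window = r.register_script(SLIDING_WINDOW_LUA)

def allow(client_id, limit=100, window_ms=60_000):
    now_ms = int(time.time() * 1000)
    # the script returns 1 when the request is allowed, 0 when rejected
    return sliding_window(keys=[f"rate:{client_id}"], args=[limit, window_ms, now_ms]) == 1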

Distributed rate limiting

With a Redis cluster, keys are sharded across nodes. A single client's counter may be split across nodes if you use multiple keys. Use Redis Cluster hash tags to force all keys for a client onto the same shard:

key = "rate:{client_id}:window"   # {} is the hash tag — only client_id is hashed

For very high throughput, a single Redis node becomes a bottleneck. Options:

  • Local approximate counters: each app server tracks a local count and syncs it to Redis every 100ms (see the sketch after this list). Accept that the limit may be exceeded by up to one synchronization window's worth of traffic.
  • Dedicated rate limiting service: Envoy, Kong, and nginx all provide rate limiting middleware that centralizes the logic without burdening application code.
  • Redis Cluster with read replicas: for read-heavy rate limit checks (idempotency keys, quota checks), replicas reduce load on the primary.
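
A sketch of the first option, the local approximate counter, assuming the redis-py client; the flush interval and key scheme are illustrative, and window rollover is deliberately simplified:

import threading
import time
import redis

r = redis.Redis()

class LocalApproximateLimiter:
    """Counts locally, pushes increments to Redis every flush_interval seconds."""

    def __init__(self, client_id, limit=100, window_size=60, flush_interval=0.1):
        self.client_id = client_id
        self.limit = limit
        self.window_size = window_size
        self.flush_interval = flush_interval
        self.local_count = 0          # requests not yet pushed to Redis
        self.global_count = 0         # last total read back from Redis
        self.lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def _flush_loop(self):
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                pending, self.local_count = self.local_count, 0
            window = int(time.time()) // self.window_size
            key = f"rate:{self.client_id}:{window}"
            pipe = r.pipeline()
            pipe.incrby(key, pending)                 # merge local increments into the shared counter
            pipe.expire(key, self.window_size)
            self.global_count = pipe.execute()[0]     # refreshed shared total

    def allow(self):
        with self.lock:
            if self.global_count + self.local_count >= self.limit:
                return False
            self.local_count += 1
            return True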

Where to enforce

At the edge (API gateway / load balancer): cheapest — the request never hits application code. Best for denial-of-service protection and unauthenticated traffic.

At the service: necessary for per-user or per-tenant limits that require authentication context. The edge does not know who the user is before auth.

Both: use coarse limits at the edge (IP-based, 1000 req/min) and fine limits in the service (per-user, 100 req/min). Defense in depth.

Return the right headers

Rate limit responses should tell the client what happened and when to retry:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500060
Retry-After: 47
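
A small framework-agnostic helper for producing those headers, assuming the reset timestamp is already known from whichever window calculation you used:

import time

def rate_limit_headers(limit, remaining, reset_epoch):
    # reset_epoch: Unix timestamp at which the current window or quota resets
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
        "Retry-After": str(max(0, reset_epoch - int(time.time()))),
    }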

Well-behaved clients will back off. Poorly-behaved clients will retry immediately regardless, which is why enforcement at the edge matters.