
Rate Limiting in Distributed Systems: A Complete Guide

Rate limiting is a critical technique for controlling the rate of requests that clients can make to a service. It protects your infrastructure from abuse, prevents resource starvation, and ensures fair usage across all consumers. In this comprehensive guide, we explore the major rate limiting algorithms, their trade-offs, and how to implement them at scale.

Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can overwhelm your entire system. Consider these real-world scenarios:

  • API Abuse: A buggy client sending thousands of requests per second can exhaust your database connections.
  • DDoS Mitigation: Rate limiting serves as a first line of defense against distributed denial-of-service attacks.
  • Cost Control: Cloud services charge per request. Uncontrolled traffic leads to unexpected bills.
  • Fair Usage: Ensuring no single tenant monopolizes shared resources in a multi-tenant system.

Rate limiting is closely related to other resilience patterns like circuit breakers and partial failure handling. Together, they form a robust defense against cascading failures.

Rate Limiting Algorithms

1. Token Bucket Algorithm

The token bucket is the most widely used algorithm. A bucket holds tokens that are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Key Properties:

  • Allows short bursts of traffic up to the bucket capacity
  • Smooths out traffic over time
  • Simple to implement and understand
  • Used by AWS API Gateway and Stripe

import time
import threading

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max tokens
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                            self.tokens + elapsed * self.rate)
            self.last_refill = now

            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage: 10 requests/sec, bursts of up to 20
# (process_request and return_429_too_many_requests stand in for
#  your own request handler and rejection path)
limiter = TokenBucket(rate=10, capacity=20)
if limiter.allow_request():
    process_request()
else:
    return_429_too_many_requests()

2. Leaky Bucket Algorithm

The leaky bucket processes requests at a constant rate, like water leaking from a bucket. Incoming requests are added to a queue (the bucket). If the queue is full, new requests are dropped.

Key Difference from Token Bucket: The leaky bucket enforces a strict output rate, while the token bucket allows bursts. This makes leaky bucket ideal for scenarios requiring smooth, consistent throughput.

import time
from collections import deque
import threading

class LeakyBucket:
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.queue = deque()
        self.last_leak = time.monotonic()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_leak
            leak_count = int(elapsed * self.rate)
            if leak_count > 0:
                for _ in range(min(leak_count, len(self.queue))):
                    self.queue.popleft()
                # Advance by the time actually drained so the fractional
                # remainder of elapsed is not lost between calls.
                self.last_leak += leak_count / self.rate

            if len(self.queue) < self.capacity:
                self.queue.append(now)
                return True
            return False

3. Fixed Window Counter

The fixed window algorithm divides time into fixed windows (e.g., 1-minute intervals) and counts requests in each window. Once the counter exceeds the limit, subsequent requests are rejected until the next window.

Pros: Simple, memory-efficient, easy to implement with Redis.
Cons: Suffers from boundary burst problem — a client can send 2x the limit at the window boundary.

import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(client_id, limit, window_seconds):
    window = int(time.time() / window_seconds)
    key = f"rate:{client_id}:{window}"

    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    result = pipe.execute()

    current_count = result[0]
    return current_count <= limit
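
The boundary burst weakness is easy to reproduce. The sketch below uses a minimal in-memory counter rather than Redis; `FixedWindowCounter` is a name introduced here purely for illustration:

```python
# Minimal in-memory fixed window counter, a stand-in for the Redis
# version above, used to illustrate the boundary burst problem.
class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}

    def allow(self, at):
        # Bucket requests by which fixed window the timestamp falls into.
        window = int(at / self.window_seconds)
        self.counts[window] = self.counts.get(window, 0) + 1
        return self.counts[window] <= self.limit

limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests at t=59.9 (end of window 0) and 100 more at t=60.1
# (start of window 1): all 200 are allowed, twice the nominal limit,
# within a fraction of a second of wall-clock time.
allowed = sum(limiter.allow(59.9) for _ in range(100))
allowed += sum(limiter.allow(60.1) for _ in range(100))
print(allowed)  # 200
```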

4. Sliding Window Log

Maintains a sorted set of timestamps for each request. To check the limit, count all timestamps within the sliding window. This eliminates the boundary burst problem but uses more memory.

import uuid

def sliding_window_log(client_id, limit, window_seconds):
    now = time.time()
    key = f"sliding:{client_id}"
    window_start = now - window_seconds

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)
    # Use a unique member so two requests arriving at the same
    # timestamp are not collapsed into a single entry.
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})
    pipe.zcard(key)
    pipe.expire(key, window_seconds)
    result = pipe.execute()

    # Note: the entry is added before the check, so rejected requests
    # also count against the window.
    return result[2] <= limit

5. Sliding Window Counter

A hybrid approach combining fixed window and sliding window. It estimates the count in the current sliding window by weighting the previous and current fixed window counts.

def sliding_window_counter(client_id, limit, window_seconds):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1
    elapsed_in_window = (now % window_seconds) / window_seconds

    current_key = f"rate:{client_id}:{current_window}"
    previous_key = f"rate:{client_id}:{previous_window}"

    current_count = int(r.get(current_key) or 0)
    previous_count = int(r.get(previous_key) or 0)

    weighted_count = previous_count * (1 - elapsed_in_window) + current_count
    if weighted_count < limit:
        # Record the request only when it is allowed; keep the key
        # around for two windows so it can serve as "previous".
        pipe = r.pipeline()
        pipe.incr(current_key)
        pipe.expire(current_key, window_seconds * 2)
        pipe.execute()
        return True
    return False

Algorithm Comparison

| Algorithm | Memory | Burst Handling | Accuracy | Complexity |
| --- | --- | --- | --- | --- |
| Token Bucket | Low | Allows bursts | Good | Low |
| Leaky Bucket | Low | Strict smoothing | Good | Low |
| Fixed Window | Very Low | Boundary burst issue | Moderate | Very Low |
| Sliding Window Log | High | No boundary issues | Exact | Medium |
| Sliding Window Counter | Low | Near-accurate | High | Low |

Distributed Rate Limiting with Redis

In a distributed system with multiple API servers, you need a centralized rate limiter. Redis is the de facto choice because of its atomic operations and low latency. Here is a production-ready implementation using Redis with the sliding window counter approach:

-- Redis Lua script for atomic sliding window counter
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window)
local previous_window = current_window - 1
local elapsed = (now % window) / window

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

local current_count = tonumber(redis.call("GET", current_key) or "0")
local previous_count = tonumber(redis.call("GET", previous_key) or "0")

local weighted = previous_count * (1 - elapsed) + current_count

if weighted < limit then
    redis.call("INCR", current_key)
    redis.call("EXPIRE", current_key, window * 2)
    return 1
else
    return 0
end
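
The script can be invoked from Python with EVAL so the check-and-increment runs atomically on the Redis server. A sketch, assuming a redis-py-style client (`sliding_window_allow` is a name introduced here, and the script is repeated as a string so the snippet stands alone):

```python
import time

# The Lua script from above, stored as a Python string.
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window)
local previous_window = current_window - 1
local elapsed = (now % window) / window

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

local current_count = tonumber(redis.call("GET", current_key) or "0")
local previous_count = tonumber(redis.call("GET", previous_key) or "0")

local weighted = previous_count * (1 - elapsed) + current_count

if weighted < limit then
    redis.call("INCR", current_key)
    redis.call("EXPIRE", current_key, window * 2)
    return 1
else
    return 0
end
"""

def sliding_window_allow(client, client_id, limit, window_seconds):
    # EVAL runs the whole script as one atomic server-side operation.
    result = client.eval(SLIDING_WINDOW_LUA, 1,
                         f"rate:{client_id}",
                         window_seconds, limit, time.time())
    return result == 1
```

`eval` ships the script body on every call; in production you would typically cache it server-side via redis-py's `register_script` (which uses EVALSHA under the hood).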

API Rate Limiting Best Practices

When implementing rate limiting for your API, follow these best practices:

  • Return proper headers: Include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset in responses.
  • Use 429 status code: Return HTTP 429 Too Many Requests with a Retry-After header.
  • Rate limit by multiple dimensions: Per user, per IP, per API key, and per endpoint.
  • Implement graceful degradation: Consider returning cached or partial results instead of outright rejection.
  • Use tiered limits: Different plans get different rate limits (free: 100/hr, pro: 10,000/hr).
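
Tiered limits usually reduce to a plan lookup resolved before the rate check runs. A minimal sketch (the plan names and numbers below are illustrative, not anyone's actual pricing):

```python
# Illustrative tiers; real values would come from your billing system.
PLAN_LIMITS = {
    "free": 100,          # requests per hour
    "pro": 10_000,
    "enterprise": 100_000,
}

def resolve_limit(plan):
    # Unknown or missing plans fall back to the free tier.
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])
```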

from flask import Flask, request, jsonify
import functools
import time

app = Flask(__name__)

# check_rate_limit, get_remaining, and get_reset_time are assumed to
# wrap one of the Redis-backed limiters shown earlier.

def rate_limit(limit_per_minute):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            client_id = request.headers.get("X-API-Key", request.remote_addr)
            if not check_rate_limit(client_id, limit_per_minute, 60):
                remaining = get_remaining(client_id)
                reset_time = get_reset_time(client_id)
                response = jsonify({"error": "Rate limit exceeded"})
                response.status_code = 429
                response.headers["X-RateLimit-Limit"] = str(limit_per_minute)
                response.headers["X-RateLimit-Remaining"] = str(remaining)
                response.headers["X-RateLimit-Reset"] = str(reset_time)
                response.headers["Retry-After"] = str(reset_time - int(time.time()))
                return response
            return f(*args, **kwargs)
        return wrapper
    return decorator

@app.route("/api/data")
@rate_limit(limit_per_minute=100)
def get_data():
    return jsonify({"data": "value"})

Real-World Rate Limiting Architectures

Major tech companies implement rate limiting at multiple layers:

| Company | Algorithm | Implementation |
| --- | --- | --- |
| Stripe | Token Bucket | Per-API-key, 100 req/sec default |
| GitHub | Sliding Window | 5,000 req/hr per authenticated user |
| Twitter/X | Fixed Window | Per-endpoint, 15-min windows |
| Cloudflare | Sliding Window Counter | Edge-level, configurable per route |

For more on protecting distributed systems, see our guides on circuit breakers and idempotency. You can also use the System Design Calculator to estimate the rate limiting requirements for your system.

Frequently Asked Questions

Q: What is the difference between rate limiting and throttling?

Rate limiting rejects requests that exceed a defined threshold, typically returning HTTP 429. Throttling slows down request processing without outright rejection, often using queuing or delayed responses. In practice, the terms are frequently used interchangeably, but throttling implies a softer approach where requests are eventually processed.
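
The distinction can be sketched with the token bucket from earlier (repeated here so the snippet stands alone; `rate_limited_call` and `throttled_call` are illustrative names):

```python
import time
import threading

class TokenBucket:
    """Same token bucket as earlier in the article."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

def rate_limited_call(bucket, handler):
    # Rate limiting: reject immediately; the caller maps this to HTTP 429.
    if bucket.allow_request():
        return handler()
    return "429 Too Many Requests"

def throttled_call(bucket, handler, poll_interval=0.01):
    # Throttling: wait until a token is available, then process anyway.
    while not bucket.allow_request():
        time.sleep(poll_interval)
    return handler()
```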

Q: Which rate limiting algorithm should I use?

For most API use cases, the sliding window counter offers the best balance of accuracy and memory efficiency. Use the token bucket if you need to allow controlled bursts. Use the leaky bucket if you need a strictly smooth output rate, such as for write-heavy database operations.

Q: How do I rate limit in a microservices architecture?

Use a centralized rate limiter with Redis or a dedicated rate limiting service. Place rate limiting at the API gateway level (e.g., Kong, Envoy, AWS API Gateway) for global limits, and at the service level for service-specific limits. This layered approach is connected to service discovery patterns for routing.

Q: How do I handle rate limiting across multiple data centers?

You can use local rate limiting per data center (simpler, but allows total traffic to exceed the global limit) or global rate limiting with a shared Redis cluster (accurate, but adds cross-region latency). A common compromise is to use local counters synced periodically to a global store. Learn more about this in our geo-distribution guide.
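
The compromise described above, local counters synced periodically to a global store, can be sketched as follows (a plain dict stands in for the cross-region store, and all names here are illustrative):

```python
import threading

class LocalCounterWithSync:
    """Per-datacenter counter that periodically flushes its delta to a
    shared global store (a dict here, standing in for e.g. a shared
    Redis cluster)."""
    def __init__(self, client_id, global_store, sync_every=10):
        self.client_id = client_id
        self.global_store = global_store
        self.sync_every = sync_every  # flush after this many local hits
        self.local_count = 0
        self.lock = threading.Lock()

    def record_request(self):
        with self.lock:
            self.local_count += 1
            if self.local_count >= self.sync_every:
                self._flush_locked()

    def _flush_locked(self):
        # Merge the local delta into the global count, then reset.
        current = self.global_store.get(self.client_id, 0)
        self.global_store[self.client_id] = current + self.local_count
        self.local_count = 0

    def global_count(self):
        # Approximate: the global store lags by up to sync_every - 1
        # unflushed requests per data center.
        with self.lock:
            return self.global_store.get(self.client_id, 0) + self.local_count
```

The trade-off is visible in `global_count`: between flushes each data center under-reports, which is exactly why this approach only approximates the global limit.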
