Backend Engineering · 2025-10-17 · 6 min read

Rate Limiting Patterns That Actually Scale: Beyond Token Buckets

Tags: rate limiting, api design, redis, distributed systems

Rate limiting sounds simple until you try to do it correctly in a distributed system. The naive approach -- a per-IP counter in memory with a fixed window -- breaks in at least four ways. It does not work behind a load balancer. Fixed windows create burst problems at boundaries. IP-based limiting punishes everyone behind a corporate NAT. And in-memory counters vanish on deployment.

Here are four patterns we use in production.

Pattern one: sliding window counter with Redis. Our default for most APIs. A sorted set in Redis where each request adds a timestamped entry. Count entries within the window, remove expired ones. Handles window boundary problems, works across servers. Sub-millisecond latency. Good for APIs up to about one hundred thousand requests per hour.

Pattern two: distributed token bucket with Redis Lua scripting. For complex needs -- different tier limits, burst allowances, graduated throttling. The Lua script ensures atomicity without round trips. Each bucket has a fill rate, maximum capacity, and current level. Smooth rate limiting without burst problems.

Pattern three: leaky bucket with queue semantics. For webhook delivery and background dispatch where we want to smooth traffic, not reject it. We queue excess requests using BullMQ and process at a steady rate. Callers get 202 Accepted immediately.
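A toy model of the smoothing behavior, independent of BullMQ: each queued request is dispatched no sooner than it arrived and no sooner than one interval after the previous dispatch. The function name and shape are illustrative, not from our stack.

```python
def drain_schedule(arrival_times: list[float], rate_per_second: float) -> list[float]:
    """Return the dispatch time for each request when the queue
    leaks at a steady rate: a burst of arrivals is spread out at
    fixed intervals instead of being rejected."""
    interval = 1.0 / rate_per_second
    dispatched = []
    next_slot = 0.0
    for t in sorted(arrival_times):
        slot = max(t, next_slot)     # wait for arrival AND the next free slot
        dispatched.append(slot)
        next_slot = slot + interval  # reserve the following slot
    return dispatched
```

Three webhooks arriving at once with a 2/sec drain rate go out at 0.0s, 0.5s, and 1.0s; the caller saw 202 Accepted at arrival time either way.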

Pattern four: adaptive rate limiting. Instead of fixed limits, we measure p95 latency in real time. When it crosses a threshold, limits tighten. When it drops, they relax. A background process samples metrics every five seconds and adjusts Redis-stored limits.
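One adjustment tick of such a feedback loop might look like this. The thresholds, the 10 percent step, and the clamping bounds are assumptions for illustration; the article specifies only that limits tighten above the latency threshold and relax below it.

```python
def adjust_limit(current_limit: int, p95_ms: float, target_ms: float,
                 min_limit: int = 100, max_limit: int = 10_000,
                 step: float = 0.1) -> int:
    """Tighten the limit when observed p95 latency exceeds the
    target; relax it when latency is comfortably below (here,
    under 80% of target); otherwise hold steady. Result is
    clamped so the limit can never collapse to zero or run away."""
    if p95_ms > target_ms:
        new = current_limit * (1 - step)
    elif p95_ms < 0.8 * target_ms:
        new = current_limit * (1 + step)
    else:
        new = current_limit
    return int(min(max_limit, max(min_limit, new)))
```

The dead band between 80 and 100 percent of target keeps the limit from oscillating every sample; the sampler would write the result back to the Redis-stored limit each tick.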

Beyond algorithms, the details matter. Rate limit headers (X-RateLimit-Remaining, Retry-After) are not optional. Return 429 with clear JSON errors. Apply different limits per endpoint. And always put a basic rate limit at the Cloudflare or nginx layer as a safety net, with application-level limiting behind it.
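The shape of a well-behaved 429 response, framework-agnostic: the standard headers plus a JSON body the client can act on. The helper name and error fields are illustrative.

```python
import json


def rate_limit_response(limit: int, remaining: int, retry_after_seconds: int):
    """Build the (status, headers, body) triple for a rejected
    request, using the X-RateLimit-* header convention."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "Retry-After": str(retry_after_seconds),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Rate limit exceeded. Retry after {retry_after_seconds}s.",
    })
    return 429, headers, body
```

Retry-After in particular lets well-written clients back off exactly as long as needed instead of hammering the endpoint blindly.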

The biggest mistake: limiting too aggressively early on. Monitor actual usage for thirty days, then set limits at 150 percent of p95 usage. Tighten from there based on data, not fear.

About the Author

Fordel Studios

AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
