Rate limiting sounds simple until you try to do it correctly in a distributed system. The naive approach -- a per-IP counter in memory with a fixed window -- breaks in at least four ways. Counters do not share state behind a load balancer, so each server enforces its own limit. Fixed windows allow double-sized bursts at window boundaries. IP-based limiting punishes everyone behind a corporate NAT. And in-memory counters vanish on every deploy or restart.
Here are four patterns we use in production.
Pattern one: sliding window counter with Redis. Our default for most APIs. Each request adds a timestamped entry to a Redis sorted set; expired entries are removed first, then the remainder within the window is counted. This avoids fixed-window boundary bursts, works across servers, and adds sub-millisecond latency. Good for APIs up to about one hundred thousand requests per hour.
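To make the algorithm concrete, here is an in-memory sketch of the sliding window counter. The production version keeps the same data in a Redis sorted set (add a timestamped member, remove entries older than the window, count what remains); the class and method names here are illustrative, not from our codebase.

```typescript
class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(
    private readonly limit: number,    // max requests per window
    private readonly windowMs: number, // window length in milliseconds
  ) {}

  // Returns true if a request at time `now` (ms) is allowed.
  allow(now: number): boolean {
    const cutoff = now - this.windowMs;
    // Drop entries that have fallen out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}
```

Because the window slides with every request, a burst straddling a boundary cannot double the effective limit the way a fixed window does.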
Pattern two: distributed token bucket with Redis Lua scripting. For complex needs -- different tier limits, burst allowances, graduated throttling. The Lua script ensures atomicity without round trips. Each bucket has a fill rate, maximum capacity, and current level. Smooth rate limiting without burst problems.
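A single-process sketch of the token bucket logic follows. In production the state (current level, last refill time) lives in Redis and this refill-and-take step runs inside a Lua script so it is atomic; the names and numbers below are illustrative.

```typescript
class TokenBucket {
  private level: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number, // maximum tokens (burst allowance)
    private readonly fillRate: number, // tokens added per second
    now: number,
  ) {
    this.level = capacity; // start full
    this.lastRefill = now;
  }

  // Try to take `tokens` tokens at time `now` (ms). True if allowed.
  take(tokens: number, now: number): boolean {
    // Refill based on elapsed time, capped at capacity.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.level = Math.min(this.capacity, this.level + elapsedSec * this.fillRate);
    this.lastRefill = now;
    if (this.level < tokens) return false;
    this.level -= tokens;
    return true;
  }
}
```

The capacity gives the burst allowance while the fill rate enforces the sustained limit, which is what makes the shaping smooth rather than bursty.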
Pattern three: leaky bucket with queue semantics. For webhook delivery and background dispatch where we want to smooth traffic, not reject it. We queue excess requests using BullMQ and process at a steady rate. Callers get 202 Accepted immediately.
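A minimal in-memory sketch of the leaky bucket semantics: accept everything, drain at a steady rate. In production the queue is BullMQ backed by Redis with a rate-limited worker; this illustrative version drains one job per `tick`, which you would drive from a fixed interval.

```typescript
type Job<T> = { payload: T };

class LeakyBucketQueue<T> {
  private queue: Job<T>[] = [];

  constructor(private readonly handler: (payload: T) => void) {}

  // Caller-facing API: never rejects, analogous to returning 202 Accepted.
  enqueue(payload: T): { status: 202 } {
    this.queue.push({ payload });
    return { status: 202 };
  }

  // Called on a fixed interval (e.g. setInterval) to drain at a steady rate.
  tick(): void {
    const job = this.queue.shift();
    if (job) this.handler(job.payload);
  }

  get pending(): number {
    return this.queue.length;
  }
}
```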
Pattern four: adaptive rate limiting. Instead of fixed limits, we measure p95 latency in real time. When it crosses a threshold, limits tighten. When it drops, they relax. A background process samples metrics every five seconds and adjusts Redis-stored limits.
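Here is a sketch of the adjustment step such a background sampler might run. The thresholds, step size, and bounds are illustrative, not our production values; the resulting limit would then be written back to Redis for the limiters to read.

```typescript
interface AdaptiveConfig {
  latencyThresholdMs: number; // p95 above this => tighten
  minLimit: number;           // never throttle below this
  maxLimit: number;           // never relax above this
  step: number;               // fractional adjustment per sample
}

function adjustLimit(currentLimit: number, p95Ms: number, cfg: AdaptiveConfig): number {
  const next =
    p95Ms > cfg.latencyThresholdMs
      ? currentLimit * (1 - cfg.step)  // overloaded: tighten
      : currentLimit * (1 + cfg.step); // healthy: relax
  // Clamp so a noisy metric cannot drive the limit to zero or infinity.
  return Math.round(Math.min(cfg.maxLimit, Math.max(cfg.minLimit, next)));
}
```

Clamping to a floor and ceiling matters: without bounds, a sustained latency spike can ratchet the limit down until the service is effectively offline.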
Beyond algorithms, the details matter. Rate limit headers (X-RateLimit-Remaining, Retry-After) are not optional. Return 429 with clear JSON errors. Apply different limits per endpoint. And always put a basic rate limit at the Cloudflare or nginx layer as a safety net, with application-level limiting behind it.
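For the 429 path, a small helper like the following shows the shape of the response. The header names follow the common X-RateLimit-* convention plus the standard Retry-After header; the framework wiring (Express, Fastify, etc.) is omitted and the function name is hypothetical.

```typescript
interface RateLimitState {
  limit: number;        // requests allowed per window
  remaining: number;    // requests left in the current window
  resetSeconds: number; // seconds until the window resets
}

function buildRateLimitedResponse(state: RateLimitState) {
  return {
    status: 429,
    headers: {
      "X-RateLimit-Limit": String(state.limit),
      "X-RateLimit-Remaining": String(Math.max(0, state.remaining)),
      "Retry-After": String(state.resetSeconds),
    },
    body: {
      error: "rate_limited",
      message: `Too many requests. Retry in ${state.resetSeconds} seconds.`,
    },
  };
}
```

A machine-readable error code plus Retry-After lets well-behaved clients back off automatically instead of hammering the endpoint.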
The biggest mistake: limiting too aggressively early on. Monitor actual usage for thirty days, then set limits at 150 percent of p95 usage. Tighten from there based on data, not fear.
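The starting-point rule above reduces to a one-line calculation; this tiny helper is purely illustrative.

```typescript
// Observe p95 usage over thirty days, then start the limit at 150% of it.
function initialLimit(observedP95PerHour: number): number {
  return Math.ceil(observedP95PerHour * 1.5);
}
```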
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.