Rate limiting strategies that scale

Rate limiting is one of those features that feels trivial until you have more than one server. A counter in a hash map works great on your laptop and falls apart the moment a load balancer spreads traffic across three instances. This is a walkthrough of the algorithms worth knowing, where to put the limiter, and how to make limits hold up when you scale horizontally.

What you are actually protecting

Before picking an algorithm, get clear on what the limit is for. The answer changes the design.

You might be protecting a fragile downstream resource, like a third-party API that charges per call or a database that buckles under a write storm. You might be defending against abuse: credential stuffing on a login route, scrapers hammering your search endpoint. Or you might be enforcing a billing tier, where a free plan gets 100 requests an hour and a paid plan gets 10,000.

These need different keys. Abuse protection keys on IP address or device. Billing keys on the API key or user ID. Downstream protection keys on the whole service, a single global budget shared by everyone. Mixing them up is the most common mistake we see. A per-user limit does nothing to stop a botnet, and a per-IP limit punishes everyone behind a corporate NAT.
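To make the distinction concrete, here is a minimal sketch that derives the limiter key from the purpose of the limit. The names (LimitPurpose, CallerInfo, limiterKey) and the rl: key layout are hypothetical, chosen only for illustration.

```typescript
// Hypothetical helper: names and key layout are illustrative, not from a library.
type LimitPurpose = "abuse" | "billing" | "downstream";

interface CallerInfo {
  ip: string;
  apiKey?: string; // absent for anonymous traffic
}

// Build the limiter key from what the limit is for, so each kind of limit
// counts against the right identity.
function limiterKey(
  purpose: LimitPurpose,
  caller: CallerInfo,
  service: string = "global",
): string {
  switch (purpose) {
    case "abuse":
      // Abuse protection counts per network origin, even for authenticated callers.
      return `rl:abuse:ip:${caller.ip}`;
    case "billing":
      // Billing tiers count per API key; anonymous callers fall back to IP.
      return caller.apiKey
        ? `rl:billing:key:${caller.apiKey}`
        : `rl:billing:ip:${caller.ip}`;
    case "downstream":
      // One shared budget for everyone calling the protected service.
      return `rl:downstream:${service}`;
  }
}

console.log(limiterKey("abuse", { ip: "203.0.113.7", apiKey: "k1" }));
console.log(limiterKey("billing", { ip: "203.0.113.7", apiKey: "k1" }));
console.log(limiterKey("downstream", { ip: "203.0.113.7" }, "payments"));
```

A single request can be checked against more than one of these keys, for example an abuse limit and a billing limit in sequence.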

The algorithms, ranked by how often you should use them

Fixed window

You count requests in a clock-aligned window: requests between 10:00:00 and 10:00:59 go in one bucket, the next minute starts a fresh count. It is one counter and one expiry, which is why it is the first thing everyone reaches for.

The flaw is the boundary. A client can send the full quota at 10:00:59 and the full quota again at 10:01:00, so you allow double the limit across a two-second span. For coarse protection this is fine. For anything you actually care about, it leaks.
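The boundary burst is easy to reproduce. Below is a small in-memory sketch (single process, illustration only; FixedWindowLimiter is a made-up name) with the clock passed in, so the two bursts can be placed on either side of the boundary.

```typescript
// In-memory fixed window counter, for illustration only.
// Old buckets are never cleaned up here; a real store would expire them.
class FixedWindowLimiter {
  private counts = new Map<string, number>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, nowMs: number): boolean {
    // Clock-aligned window start, e.g. 10:00:00 for any time in that minute.
    const windowStart = nowMs - (nowMs % this.windowMs);
    const bucket = `${key}:${windowStart}`;
    const count = this.counts.get(bucket) ?? 0;
    if (count >= this.limit) return false;
    this.counts.set(bucket, count + 1);
    return true;
  }
}

// Limit of 5 per minute. Send 5 requests in the last second of one window
// and 5 more in the first second of the next.
const limiter = new FixedWindowLimiter(5, 60_000);
let accepted = 0;
for (let i = 0; i < 5; i++) if (limiter.allow("client", 59_000)) accepted++;
for (let i = 0; i < 5; i++) if (limiter.allow("client", 60_000)) accepted++;
console.log(accepted); // 10, double the limit, within about one second
```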

Sliding window

The sliding window fixes the boundary burst by weighting the previous window. You take the current window count, add a fraction of the previous window based on how far into the current window you are, and compare that to the limit. It is an approximation, but a good one, and it costs only two counters per key. This is our default for API rate limiting. It is cheap, it is fair, and it does not have the fixed window's cliff.
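The weighting is easier to see with numbers. Here is a sketch of the estimate as a pure function (slidingWindowEstimate is an illustrative name), assuming a 60-second window and a limit of 100.

```typescript
// Sliding window estimate: the previous window's count is weighted by the
// share of that window still covered by the trailing window.
function slidingWindowEstimate(
  prevCount: number,
  curCount: number,
  elapsedMs: number, // time since the current window started
  windowMs: number,
): number {
  const elapsedFraction = elapsedMs / windowMs;
  return prevCount * (1 - elapsedFraction) + curCount;
}

// 15 s into a 60 s window, 75% of the previous window still counts.
const estimate = slidingWindowEstimate(80, 30, 15_000, 60_000);
console.log(estimate); // 90, i.e. 80 * 0.75 + 30
console.log(estimate < 100); // true: under a limit of 100, so the request is allowed
```

The estimate assumes requests in the previous window were spread evenly, which is why it is an approximation and not an exact count.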

Token bucket

A bucket holds tokens up to a maximum. Tokens refill at a steady rate, and each request spends one. If the bucket is empty, the request is rejected or queued. The nice property is that it allows bursts: a client that has been quiet builds up a reserve and can spend it in one go, which matches how real clients behave. Use this when you want to permit short spikes but cap the sustained rate, for example an endpoint that triggers expensive report generation.
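A minimal single-process sketch of the idea, with the clock passed in so the behavior is easy to follow. The class name and parameters are illustrative, and tokens are refilled lazily on each call instead of by a background timer.

```typescript
// In-memory token bucket, for illustration only.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full, so a new client can burst right away
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    // Lazy refill: add tokens for the time elapsed, capped at capacity.
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);

    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Capacity 3, refilling 1 token per second.
const bucket = new TokenBucket(3, 1, 0);
const burst = [0, 0, 0, 0].map((t) => bucket.tryConsume(t));
console.log(burst); // [true, true, true, false]: the burst drains the bucket

// Two seconds later, two tokens have come back.
const later = [2000, 2000, 2000].map((t) => bucket.tryConsume(t));
console.log(later); // [true, true, false]
```

Capacity sets how large a burst can be, and the refill rate sets the sustained ceiling. The two are tuned independently.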

Leaky bucket

The mirror image of the token bucket: incoming requests fill the bucket, and it drains at a fixed rate regardless of how they arrive. If the bucket is full, new requests are rejected. It smooths traffic into a constant outflow. Reach for it when the thing downstream cannot tolerate bursts at all, like feeding a queue that a worker pool drains at a known speed. It is the rarest of the four in web work.
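A rough sketch of the queue flavor of a leaky bucket, again in memory with the clock passed in. The names are illustrative. The limiter does not hold requests itself; it returns how long each accepted request should wait so that releases are evenly spaced.

```typescript
// In-memory leaky bucket, for illustration only.
class LeakyBucket {
  private level = 0; // requests accepted but not yet drained
  private lastLeakMs: number;
  private readonly capacity: number;
  private readonly drainPerSecond: number;

  constructor(capacity: number, drainPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.drainPerSecond = drainPerSecond;
    this.lastLeakMs = nowMs;
  }

  // Returns the delay in ms before the request should be released
  // downstream, or null if the bucket is full and the request is dropped.
  offer(nowMs: number): number | null {
    // Drain the bucket for the time elapsed since the last call.
    const elapsedSeconds = Math.max(0, nowMs - this.lastLeakMs) / 1000;
    this.level = Math.max(0, this.level - elapsedSeconds * this.drainPerSecond);
    this.lastLeakMs = Math.max(this.lastLeakMs, nowMs);

    if (this.level + 1 > this.capacity) return null;

    // Everything already in the bucket drains first.
    const delayMs = (this.level / this.drainPerSecond) * 1000;
    this.level += 1;
    return delayMs;
  }
}

// Capacity 3, draining 2 requests per second. Four requests arrive at once.
const leaky = new LeakyBucket(3, 2, 0);
const delays = [0, 0, 0, 0].map((t) => leaky.offer(t));
console.log(delays); // [0, 500, 1000, null]: released 500 ms apart, fourth dropped
```

However bursty the input, accepted requests leave at most one every 500 ms here, which is the property the downstream worker pool relies on.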

Where to enforce it

Put the limiter as far out as you reasonably can. Every request you reject at the edge is a request that never touches your application code, your database, or your bill.

If you run on a platform with a managed firewall or WAF, configure rate limits there first. Cloudflare, Vercel's firewall, and most API gateways can drop abusive traffic before it reaches an instance, which is exactly where you want to stop a flood. The catch is that edge limits usually key on IP and have coarse rules, so they handle abuse well but cannot express billing tiers.

For per-user and per-key limits tied to your business logic, you need an application-level limiter, because only your app knows which plan a key belongs to. The realistic setup is both: a blunt edge limit to absorb attacks, and a precise application limit for fairness and billing.

Making it work across many servers

The core problem with horizontal scaling is shared state. With three instances each keeping a local counter, a client can get up to three times the limit, one quota per instance, and the effective limit drifts as the load balancer shifts that client's requests between instances. You need a single source of truth, and Redis is the standard answer because it is fast and gives you atomic operations.

The trap is doing a read, then a write, in two round trips. Between those two calls another request can slip in, and your count is wrong under exactly the load where correctness matters. The fix is to make the check and the increment atomic. A Lua script runs server-side on Redis as a single operation, so no other command interleaves.
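The race can be shown without Redis. In the sketch below, FakeStore is a stand-in with async get and set methods, playing the role of two separate round trips. It is not a real client, and the names are made up for the demonstration.

```typescript
// A fake async store standing in for a remote counter.
class FakeStore {
  private data = new Map<string, number>();

  async get(key: string): Promise<number> {
    return this.data.get(key) ?? 0;
  }

  async set(key: string, value: number): Promise<void> {
    this.data.set(key, value);
  }
}

// Check-then-increment in two steps: the pattern to avoid.
async function racyAllow(store: FakeStore, key: string, limit: number): Promise<boolean> {
  const count = await store.get(key); // round trip 1: read
  if (count >= limit) return false;
  await store.set(key, count + 1); // round trip 2: write
  return true;
}

// Five concurrent requests against a limit of 1.
async function demo(): Promise<number> {
  const store = new FakeStore();
  const results = await Promise.all(
    Array.from({ length: 5 }, () => racyAllow(store, "client", 1)),
  );
  return results.filter(Boolean).length;
}

// Every read happens before any write lands, so all 5 see a count of 0.
demo().then((n) => console.log(`allowed ${n} requests with a limit of 1`));
```

All five requests pass because each one reads the counter before any of the others has written to it. An atomic check-and-increment closes that gap.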

Here is a sliding window limiter as a Lua script, called from a Next.js route handler.

import Redis from "ioredis";
 
const redis = new Redis(process.env.REDIS_URL!);
 
// KEYS[1] = bucket key, ARGV = limit, windowMs, now
const slidingWindow = `
  local key = KEYS[1]
  local limit = tonumber(ARGV[1])
  local window = tonumber(ARGV[2])
  local now = tonumber(ARGV[3])
 
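  -- Start of the clock-aligned window containing now, and of the one before it.
  local current = now - (now % window)
  local previous = current - window
  local curKey = key .. ':' .. current
  local prevKey = key .. ':' .. previous

  local curCount = tonumber(redis.call('GET', curKey) or '0')
  local prevCount = tonumber(redis.call('GET', prevKey) or '0')

  -- Weight the previous window by the share of it still inside the sliding window.
  local elapsed = (now - current) / window
  local weighted = prevCount * (1 - elapsed) + curCount

  -- At or over the limit: reject without counting the request.
  if weighted >= limit then
    return -1
  end

  -- Count the request. The key lives for two windows so it can serve as "previous".
  redis.call('INCR', curKey)
  redis.call('PEXPIRE', curKey, window * 2)
  return limit - math.floor(weighted) - 1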
`;
 
export async function rateLimit(identifier: string, limit: number, windowMs: number) {
  const remaining = (await redis.eval(
    slidingWindow,
    1,
    `rl:${identifier}`,
    limit.toString(),
    windowMs.toString(),
    Date.now().toString(),
  )) as number;
 
  return { allowed: remaining >= 0, remaining: Math.max(remaining, 0) };
}

And the route that uses it. Note that the limiter keys on the API key when present and falls back to IP, so anonymous and authenticated traffic get separate budgets.

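import { NextRequest, NextResponse } from "next/server";
import { rateLimit } from "@/lib/rate-limit";

export async function POST(req: NextRequest) {
  // Validate the key before trusting it in production; otherwise a caller
  // can send made-up keys to get a fresh budget each time.
  const apiKey = req.headers.get("x-api-key");
  // x-forwarded-for can hold a comma-separated proxy chain; the first entry
  // is the client. Only trust it when your platform or proxy sets it.
  const ip = req.headers.get("x-forwarded-for")?.split(",")[0]?.trim() || "anon";
  // Prefix the identifier so an API key and an IP can never share a bucket.
  const identifier = apiKey ? `key:${apiKey}` : `ip:${ip}`;

  const { allowed, remaining } = await rateLimit(identifier, 100, 60_000);

  if (!allowed) {
    return NextResponse.json(
      { error: "Too many requests" },
      { status: 429, headers: { "Retry-After": "60" } },
    );
  }

  const res = NextResponse.json({ ok: true });
  res.headers.set("X-RateLimit-Remaining", remaining.toString());
  return res;
}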

Details that separate a real limiter from a toy

Return a 429 status, not a 400 or 403, and include a Retry-After header so well-behaved clients back off instead of retrying in a tight loop. Add X-RateLimit-Remaining so callers can pace themselves before they hit the wall. These cost nothing and turn an opaque rejection into something a client can actually handle.

Decide what happens when Redis is down. If your limiter throws on a connection error and you have no fallback, a Redis blip takes down every rate-limited route. For most products, failing open (allowing the request when the limiter cannot be reached) is the right call, because a brief loss of limiting is less bad than a full outage. For a login route guarding against credential stuffing, you may want to fail closed instead. Make that choice on purpose rather than discovering it during an incident.
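One way to make that choice explicit is a small wrapper around the limiter call. This is a sketch under assumptions: withFailureMode and its parameters are hypothetical names, LimitResult mirrors the shape returned by rateLimit above, and the timeout is there because some Redis clients queue commands while reconnecting instead of failing fast.

```typescript
type LimitResult = { allowed: boolean; remaining: number };
type FailureMode = "open" | "closed";

// Run a limiter check, turning errors and timeouts into a deliberate decision.
async function withFailureMode(
  check: () => Promise<LimitResult>,
  mode: FailureMode,
  timeoutMs: number = 50, // illustrative default, tune to your latency budget
): Promise<LimitResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("rate limiter timed out")), timeoutMs);
  });

  try {
    return await Promise.race([check(), deadline]);
  } catch {
    // Limiter unreachable or too slow. This is the place to log or emit a metric.
    return { allowed: mode === "open", remaining: 0 };
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in check that always fails, simulating an unreachable Redis.
const unreachable = async (): Promise<LimitResult> => {
  throw new Error("connection refused");
};

withFailureMode(unreachable, "open").then((r) => console.log(r.allowed)); // true
withFailureMode(unreachable, "closed").then((r) => console.log(r.allowed)); // false
```

In a route, the call would look like withFailureMode(() => rateLimit(identifier, 100, 60_000), "open"), with "closed" reserved for routes such as login.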

Watch the cost of the limiter itself. One Redis round trip per request is usually fine, but if you are limiting a high-traffic public endpoint, that round trip is now in your hot path. Keep the script tight, set sensible expiries so keys clean themselves up, and run Redis close to your application, not across a region.

The takeaway

Pick the key first (IP for abuse, API key or user for billing), use a sliding window unless you specifically need bursts, and enforce limits at the edge plus the application. The moment you have more than one server, move the counter to Redis and make the check-and-increment atomic with a Lua script. Get those four things right and your limiter will scale with the rest of your stack instead of becoming the thing that falls over first.

If you want a second pair of eyes on where to draw your limits, we are happy to take a look.
