---
name: forge-rate-limiting
description: Rate-limiting discipline for HTTP and queue handlers. Token bucket as the default, sliding window for per-second precision, fixed window only for low-volume. Per-key (IP, user, API token, tenant) limits with distributed Redis backend. Always-on rate-limit headers, 429 with Retry-After, retry budget. Contains paste-ready Redis token-bucket implementation. Use when designing or auditing rate limits on a service that takes real traffic.
license: MIT
---

# forge-rate-limiting

You are protecting a service from abuse, overload, and accidental thundering herds. Default agent-written rate limiting is "an in-memory `Map<string, number>` keyed by IP that resets every minute" - works for one instance, falls over the moment you scale horizontally, ages out memory faster than it should, and returns a 429 with no `Retry-After`. This skill exists to fix that.

The mental model: **rate-limiting is a counter-with-windowing problem keyed by an identity.** The hard parts are: choosing the identity (IP vs user vs token), choosing the window algorithm (fixed vs sliding vs token bucket), and storing the counter where every replica sees the same number.

## Quick reference (the things you must never ship)

1. An in-memory `Map` rate limiter on a service that runs >1 replica.
2. 429 response without `Retry-After` header.
3. Rate-limit headers only on 429 (should be on every response).
4. Rate-limiting by `req.ip` from a proxied request without trusting `X-Forwarded-For` correctly.
5. Same limit for `/login` and `/static/*` (auth needs much tighter limits).
6. Fixed-window counter that bursts 2x at the window boundary.
7. Per-IP limit on an endpoint that legitimate single users hit concurrently from the same NAT (mobile networks).
8. No retry budget on the CLIENT side either (amplifies upstream outage 10x).
9. Trusting a Bearer-token rate limit when the token's owner can mint new tokens.
10. Rate-limit configuration hardcoded - cannot be tightened during incident.

## Hard rules

### Choose the identity

**1. Pick the rate-limit key with care.** The wrong key locks out legitimate users.

| Endpoint kind | Key | Why |
| --- | --- | --- |
| Public unauthenticated (homepage, search) | IP | Best you have |
| Public auth-attempt (`/login`, `/signup`, `/reset`) | IP + small per-account additional limit | Brute-force protection |
| API with token | Token (or token's owning tenant_id) | Token can be revoked, tenant is the real billing unit |
| Webhook receivers | Sending tenant + endpoint | Per-relationship limit |

**2. For IP-keyed limits behind a proxy, trust the right header.** If you sit behind Cloudflare/Caddy/nginx, `X-Forwarded-For` or `CF-Connecting-IP` is the real IP. Configure the framework's trust setting; do not naively pick the first value of `X-Forwarded-For`.
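
A minimal sketch of what "trust the right header" can mean when no framework setting is available. It assumes every proxy you control appends the peer address it saw to `X-Forwarded-For`, and that you know how many such proxies sit in front of the app. `clientIpFromForwardedFor` and `trustedHops` are hypothetical names, not part of any framework.

```ts
// Hypothetical helper: derive the client IP from X-Forwarded-For.
// Assumption: each trusted proxy appends the address it received the request from.
export function clientIpFromForwardedFor(
  header: string | undefined,
  trustedHops: number,
  socketAddr: string,
): string {
  if (!header || trustedHops < 1) return socketAddr;
  const hops = header
    .split(",")
    .map((h) => h.trim())
    .filter((h) => h.length > 0);
  // Only the last `trustedHops` entries were written by infrastructure you control.
  // The leftmost of those is the client as seen by your outermost proxy; anything
  // further left was supplied by the client and can be spoofed.
  return hops[hops.length - trustedHops] ?? socketAddr;
}
```

With one nginx in front, `clientIpFromForwardedFor("6.6.6.6, 203.0.113.7", 1, "10.0.0.1")` ignores the client-supplied `6.6.6.6` and keys on `203.0.113.7`. If the header is shorter than expected, the request bypassed a proxy, so the sketch falls back to the socket address.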

### Choose the algorithm

**3. Token bucket is the default.** Smooth limits, allows burst, easy to reason about, cheap to implement.

```
Bucket capacity = 60 tokens (max burst)
Refill rate    = 1 token / second  (60 tokens / minute sustained)
```

**4. Sliding window for per-second precision.** Use when "exactly N per second" matters more than burst-tolerance.
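
To make the algorithm concrete, here is a single-process sketch of a sliding-window log. It is illustrative only: a multi-replica service would keep the timestamps in a shared store instead. `createSlidingWindow` is a hypothetical name, and the clock is passed in so the behavior is easy to check.

```ts
// Sliding-window log: remember the timestamps of accepted requests and
// count only those that fall inside the trailing window.
export function createSlidingWindow(limit: number, windowMs: number) {
  const hits = new Map<string, number[]>();

  return function tryAcquire(key: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - windowMs;
    // Drop timestamps that have slid out of the window
    const recent = (hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < limit;
    if (allowed) recent.push(nowMs);
    // Delete empty entries so idle keys do not accumulate
    if (recent.length > 0) hits.set(key, recent);
    else hits.delete(key);
    return allowed;
  };
}
```

The cost is one stored timestamp per accepted request, which is why this variant suits tight per-second limits better than large per-hour ones.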

**5. Fixed window only for low-volume endpoints.** A fixed window of 60 requests/minute allows 120 in two seconds at the boundary (60 in the last second of one window, 60 in the first second of the next). For most use cases this is the wrong choice.
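
The boundary burst is easy to reproduce. The sketch below is a deliberately naive in-memory fixed-window counter (`createFixedWindow` is a hypothetical name, and old windows are never cleaned up) fed 60 requests just before a minute boundary and 60 just after.

```ts
// Naive fixed-window counter, for demonstration only.
export function createFixedWindow(limit: number, windowMs: number) {
  const counts = new Map<string, number>();
  return (key: string, nowMs: number): boolean => {
    // The window id comes from wall-clock time, so every counter resets at the same instant
    const bucket = `${key}:${Math.floor(nowMs / windowMs)}`;
    const n = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, n);
    return n <= limit;
  };
}

const allow = createFixedWindow(60, 60_000);
let accepted = 0;
// 60 requests half a second before the boundary, 60 half a second after
for (let i = 0; i < 60; i++) if (allow("ip:203.0.113.7", 59_500)) accepted++;
for (let i = 0; i < 60; i++) if (allow("ip:203.0.113.7", 60_500)) accepted++;
console.log(accepted); // 120
```

All 120 requests pass inside one second of wall-clock time, twice the nominal per-minute limit.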

### Distributed counters

**6. Redis is the default backend.** Atomic `INCR`, `EXPIRE`, and Lua scripts give you what you need.

**7. Use the framework's rate limiter library, not a hand-roll.** Express: `express-rate-limit` + `rate-limit-redis`. Fastify: `@fastify/rate-limit`. Hono: `hono-rate-limiter`. Nginx: `limit_req_zone`. Cloudflare/Caddy at the edge for the cheapest case.

**8. If you do roll your own, the canonical Redis token-bucket pattern:**

```lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate per second (tokens)
-- ARGV[3] = now (unix ms)
-- ARGV[4] = cost (tokens to consume; usually 1)
-- Returns: { allowed: 0|1, remaining: int, retry_after_ms: int }

local capacity   = tonumber(ARGV[1])
local refill_per_s = tonumber(ARGV[2])
local now        = tonumber(ARGV[3])
local cost       = tonumber(ARGV[4])

local state = redis.call("HMGET", KEYS[1], "tokens", "last_refill")
local tokens      = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now

-- Refill based on elapsed time; clamp at 0 in case callers' clocks disagree
local elapsed_s = math.max(0, now - last_refill) / 1000.0
tokens = math.min(capacity, tokens + elapsed_s * refill_per_s)

local allowed = 0
local retry_after_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  local needed = cost - tokens
  retry_after_ms = math.ceil((needed / refill_per_s) * 1000)
end

redis.call("HSET", KEYS[1], "tokens", tokens, "last_refill", now)
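-- TTL keeps inactive buckets from growing forever. Once a bucket would have
-- refilled to capacity, dropping the key loses nothing, so expire it then.
redis.call("EXPIRE", KEYS[1], math.ceil(capacity / refill_per_s) + 1)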

return { allowed, math.floor(tokens), retry_after_ms }
```

```ts
// TypeScript wrapper
import { createClient } from "redis";

const redis = createClient();
await redis.connect();

const TOKEN_BUCKET_LUA = `... (the script above) ...`;
const SHA = await redis.scriptLoad(TOKEN_BUCKET_LUA);

export async function tryConsume(opts: {
  key: string;
  capacity: number;
  refillPerSecond: number;
  cost?: number;
}): Promise<{ allowed: boolean; remaining: number; retryAfterMs: number }> {
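  const args = {
    keys: [`rl:${opts.key}`],
    arguments: [
      String(opts.capacity),
      String(opts.refillPerSecond),
      String(Date.now()),
      String(opts.cost ?? 1),
    ],
  };
  let reply: unknown;
  try {
    reply = await redis.evalSha(SHA, args);
  } catch (err) {
    // The script cache is emptied by a Redis restart or SCRIPT FLUSH; EVAL re-caches the script
    if (!(err instanceof Error) || !err.message.startsWith("NOSCRIPT")) throw err;
    reply = await redis.eval(TOKEN_BUCKET_LUA, args);
  }
  const [allowed, remaining, retryAfterMs] = reply as [number, number, number];
  return { allowed: allowed === 1, remaining, retryAfterMs };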
}
```

### Per-endpoint tuning

**9. Different endpoints need different limits.**

| Endpoint kind | Suggested limit |
| --- | --- |
| `/login`, `/signup`, `/reset` per IP | 5/min |
| `/login` per account | 5 failures/min, with exponential backoff |
| Unauthenticated read (`/`, `/search`) | 60/min per IP |
| Authenticated read | 600/min per token |
| Authenticated write | 60/min per token |
| Webhook receive | 100/min per sender |
| Internal cron | unlimited (but pace yourself) |

These are starting points; tune from real traffic.
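
One way to keep the table above in a single place is a small config map that is converted into token-bucket parameters. This is a sketch: the names (`LIMITS`, `toBucketParams`, the bucket keys) are hypothetical, and the default burst of one minute's worth of requests is an assumption, not something the table prescribes.

```ts
type Limit = { perMinute: number; burst?: number };
type BucketParams = { capacity: number; refillPerSecond: number };

export function toBucketParams(limit: Limit): BucketParams {
  return {
    // Assumption: without an explicit burst, allow one minute's worth at once
    capacity: limit.burst ?? limit.perMinute,
    refillPerSecond: limit.perMinute / 60,
  };
}

// Starting points from the table; hypothetical bucket names
export const LIMITS = {
  login_ip: { perMinute: 5 },
  unauth_read_ip: { perMinute: 60 },
  auth_read_token: { perMinute: 600, burst: 60 },
  auth_write_token: { perMinute: 60 },
  webhook_sender: { perMinute: 100 },
};
```

For example, `toBucketParams(LIMITS.auth_read_token)` yields a capacity of 60 with a refill of 10 tokens per second, which sustains 600/min while capping the burst.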

### Response shape

**10. Rate-limit headers on EVERY response, not just 429.**

```http
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1716393600          # unix seconds when bucket refills to full
```

**11. On 429, include `Retry-After`.**

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 12                          # seconds (integer) OR HTTP-date
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1716393612
Content-Type: application/json

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests. Retry after 12 seconds.",
    "request_id": "01HXY..."
  }
}
```

**12. `Retry-After` is in seconds, not milliseconds.** Always round up.
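
The header arithmetic in rules 10 to 12 can be collected in one pure function. A sketch, assuming the limiter reports remaining tokens and a retry delay in milliseconds; `rateLimitHeaders` and its parameter names are hypothetical.

```ts
export function rateLimitHeaders(
  s: {
    capacity: number;
    refillPerSecond: number;
    remaining: number;
    retryAfterMs: number;
    allowed: boolean;
  },
  nowMs: number = Date.now(),
): Record<string, string> {
  // Seconds until the bucket is full again, rounded up
  const secondsToFull = Math.ceil((s.capacity - s.remaining) / s.refillPerSecond);
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(s.capacity),
    "X-RateLimit-Remaining": String(s.remaining),
    "X-RateLimit-Reset": String(Math.floor(nowMs / 1000) + secondsToFull),
  };
  if (!s.allowed) {
    // Whole seconds, rounded up: 11_200 ms becomes 12, never 11
    headers["Retry-After"] = String(Math.max(1, Math.ceil(s.retryAfterMs / 1000)));
  }
  return headers;
}
```

Rounding down would tell the client to come back before a token is available, which earns it another 429.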

### CAPTCHA / progressive friction

**13. CAPTCHA after N failures, not on every request.** Friction up only after suspicious activity. CAPTCHA on every login is hostile to legitimate users.

**14. Account-level cooldown for repeated failed logins.** A per-IP limit alone is too coarse - mobile users behind carrier-grade NAT share IPs, and an attacker can rotate IPs against a single account.
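
A sketch of how an account-level cooldown with exponential backoff might be computed from a failure counter. The defaults (5 free attempts, 1 second base, 15 minute cap) are assumptions to tune, and `loginCooldownMs` is a hypothetical name.

```ts
export function loginCooldownMs(
  consecutiveFailures: number,
  opts: { freeAttempts?: number; baseMs?: number; maxMs?: number } = {},
): number {
  const freeAttempts = opts.freeAttempts ?? 5;
  const baseMs = opts.baseMs ?? 1_000;
  const maxMs = opts.maxMs ?? 15 * 60_000;
  // No cooldown while the account is within its free attempts
  if (consecutiveFailures <= freeAttempts) return 0;
  // Double the lockout for each failure past the free attempts, capped at maxMs
  const exponent = consecutiveFailures - freeAttempts - 1;
  return Math.min(maxMs, baseMs * 2 ** exponent);
}
```

Reset the counter on a successful login, and store it per account rather than per IP so that rotating IPs does not reset the cooldown.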

### Client-side retry hygiene (the other half)

**15. Clients that retry need a budget too.** Server rate-limits you; client should also self-rate-limit. Otherwise a 429 wave triggers a thundering herd.

```ts
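// reference: client retry-with-backoff
async function withRateLimitRetry(
  fn: () => Promise<Response>,
  opts: { attempts?: number; baseMs?: number } = {},
): Promise<Response> {
  const attempts = Math.max(1, opts.attempts ?? 3);
  const baseMs = opts.baseMs ?? 200;
  for (let i = 0; ; i++) {
    const res = await fn();
    // Not rate-limited, or attempts exhausted: hand the response back to the caller
    if (res.status !== 429 || i === attempts - 1) return res;
    const retryAfterMs = parseRetryAfterMs(res.headers.get("Retry-After"));
    const backoff = Math.max(retryAfterMs, baseMs * 2 ** i);
    // Jitter only adds delay, so we never retry earlier than Retry-After
    const jitter = backoff * Math.random() * 0.3;
    await new Promise((r) => setTimeout(r, backoff + jitter));
  }
}

// Retry-After is either delta-seconds or an HTTP-date; default to 1s if missing or unparseable
function parseRetryAfterMs(header: string | null): number {
  if (header === null) return 1000;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(header);
  return Number.isNaN(dateMs) ? 1000 : Math.max(0, dateMs - Date.now());
}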
```

**16. Honor `Retry-After`. Never retry faster than the server told you to.**
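
Backoff limits how fast a single call retries; a retry budget limits how much retry traffic the whole client generates. A minimal sketch of one common shape: every first-attempt request earns a fraction of a retry, and every retry spends a whole one. `createRetryBudget`, `ratio`, and `maxTokens` are hypothetical names and the defaults are assumptions.

```ts
export function createRetryBudget(ratio = 0.1, maxTokens = 10) {
  let tokens = maxTokens;
  return {
    // Call once per first-attempt request: earns `ratio` of a retry
    recordRequest(): void {
      tokens = Math.min(maxTokens, tokens + ratio);
    },
    // Call before each retry; false means give up and surface the error
    tryRetry(): boolean {
      if (tokens < 1) return false;
      tokens -= 1;
      return true;
    },
  };
}
```

With `ratio = 0.1`, sustained retries are capped at roughly 10% of normal request volume, so an upstream outage cannot multiply the client's traffic.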

### Operational

**17. Rate-limit configuration is hot-reloadable.** During an incident you might need to tighten `/api/*` from 600/min to 60/min without a deploy. Read config from Redis or a feature-flag system, not from a compiled constant.
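
A sketch of the in-process half of hot reload: a store that holds the current limits and swaps them when a new config arrives. How updates arrive (polling Redis, pub/sub, a feature-flag SDK callback) is left open; `createLimitStore` and its methods are hypothetical names.

```ts
type LimitConfig = { capacity: number; refillPerSecond: number };

export function createLimitStore(defaults: Record<string, LimitConfig>) {
  let current: Record<string, LimitConfig> = { ...defaults };
  return {
    get(name: string): LimitConfig | undefined {
      return current[name];
    },
    // Called by whatever watches the config source
    apply(overrides: Record<string, LimitConfig>): boolean {
      for (const cfg of Object.values(overrides)) {
        const valid =
          Number.isFinite(cfg.capacity) && cfg.capacity > 0 &&
          Number.isFinite(cfg.refillPerSecond) && cfg.refillPerSecond > 0;
        // Reject the whole update so a bad push cannot zero out a limit
        if (!valid) return false;
      }
      // Overrides replace defaults by name; names not overridden keep their defaults
      current = { ...defaults, ...overrides };
      return true;
    },
  };
}
```

Middleware reads `store.get(name)` on each request, so a tightened limit takes effect on the next request without a restart.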

**18. Audit log on rate-limit hits.** Above a threshold (e.g. >100 429s per minute from one key), page someone. Likely an attack or a broken client.
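
A sketch of the threshold check, assuming a per-process counter is accurate enough for alerting (with several replicas the counts would need to live in a shared store). `createSpikeDetector` is a hypothetical name; it uses a simple fixed window per key, which is approximate but adequate for paging, and never evicts idle keys.

```ts
export function createSpikeDetector(threshold = 100, windowMs = 60_000) {
  const windows = new Map<string, { start: number; count: number }>();

  // Returns true exactly once per window, at the moment a key first exceeds the threshold
  return function record429(key: string, nowMs: number = Date.now()): boolean {
    let w = windows.get(key);
    if (!w || nowMs - w.start >= windowMs) {
      // Start a fresh window for this key
      w = { start: nowMs, count: 0 };
      windows.set(key, w);
    }
    w.count += 1;
    return w.count === threshold + 1;
  };
}
```

Call `record429(key)` wherever the 429 is written and trigger the page when it returns `true`; firing once per window avoids paging on every subsequent rejection.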

**19. Allowlist trusted keys.** Your own monitoring, your own webhooks - bypass the rate limit (with a much larger absolute cap).
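
A sketch of an allowlist that raises the cap instead of removing it. `paramsFor` is a hypothetical name and the 100x multiplier is an assumption.

```ts
type BucketParams = { capacity: number; refillPerSecond: number };

export function paramsFor(
  key: string,
  base: BucketParams,
  allowlist: ReadonlySet<string>,
  multiplier = 100,
): BucketParams {
  if (!allowlist.has(key)) return base;
  // Trusted callers still get a cap, just a far larger one
  return {
    capacity: base.capacity * multiplier,
    refillPerSecond: base.refillPerSecond * multiplier,
  };
}
```

Keeping a finite cap means a runaway internal job still hits a wall before it takes the service down.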

### Bypass-prone patterns

**20. Do not rate-limit responses that are cached upstream.** Cloudflare/Caddy can cache and reply without hitting your origin. Your rate-limit count diverges from actual traffic.

**21. Rate-limit at the right layer.** If your nginx already limits, do not double-count at the app. Pick one.

## Common AI-output patterns to reject

| Pattern | Why wrong | Fix |
| --- | --- | --- |
| In-memory `Map` rate limiter | Per-replica state, scales nowhere | Redis token bucket |
| Fixed-window counter | 2x burst at boundary | Token bucket or sliding window |
| 429 with no `Retry-After` | Client guesses, often wrong | Always set `Retry-After` |
| Rate-limit headers only on 429 | Client cannot pace itself | Headers on every response |
| Same limit for `/login` and `/static/*` | Auth needs much tighter | Per-endpoint config |
| `req.ip` no proxy trust | Wrong IP for proxied | Configure framework trust + use correct header |
| One global limit for all endpoints | Coarse | Per-route / per-method config |
| Client retries without honoring `Retry-After` | Thundering herd | Wait what the server said |
| Hardcoded limits in code | Cannot tighten during incident | Hot-reloadable config |
| No allowlist for internal callers | Self-DOSing | Allowlist with a bigger cap |

## Worked example: rate-limit middleware in Hono

```ts
// src/middleware/rate-limit.ts
import type { MiddlewareHandler } from "hono";
import { tryConsume } from "../lib/rate-limit.js";

type Bucket = {
  /** Unique name for the bucket (e.g. "login_ip", "api_token") */
  name: string;
  /** Max tokens (max burst) */
  capacity: number;
  /** Refill rate per second */
  refillPerSecond: number;
  /** How to derive the per-request key */
  key: (c: Parameters<MiddlewareHandler>[0]) => string;
};

export function rateLimit(bucket: Bucket): MiddlewareHandler {
  return async (c, next) => {
    const k = `${bucket.name}:${bucket.key(c)}`;
    const result = await tryConsume({
      key: k,
      capacity: bucket.capacity,
      refillPerSecond: bucket.refillPerSecond,
    });

    // Always set headers, even on success
    c.header("X-RateLimit-Limit", String(bucket.capacity));
    c.header("X-RateLimit-Remaining", String(result.remaining));
    c.header("X-RateLimit-Reset", String(Math.floor(Date.now() / 1000) + Math.ceil((bucket.capacity - result.remaining) / bucket.refillPerSecond)));

    if (!result.allowed) {
      const retryAfter = Math.ceil(result.retryAfterMs / 1000);
      c.header("Retry-After", String(retryAfter));
      return c.json({
        error: {
          code: "rate_limited",
          message: `Too many requests. Retry after ${retryAfter} seconds.`,
          request_id: c.get("requestId") as string,
        },
      }, 429);
    }

    return next();
  };
}

// Usage in routes
app.use("/api/login", rateLimit({
  name: "login_ip",
  capacity: 5,
  refillPerSecond: 5 / 60,       // 5 per minute sustained
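  // Behind Cloudflare use CF-Connecting-IP. Otherwise take the LAST X-Forwarded-For entry:
  // this assumes exactly one trusted proxy appends to it (the first entry is client-supplied).
  key: (c) => c.req.header("cf-connecting-ip") ?? c.req.header("x-forwarded-for")?.split(",").pop()?.trim() ?? "unknown",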
}));

app.use("/api/*", rateLimit({
  name: "api_token",
  capacity: 60,
  refillPerSecond: 10,           // 600/min sustained
  key: (c) => c.get("token_id") as string,
}));
```

What this demonstrates: separate bucket per logical limit (rule 9); client IP read from `CF-Connecting-IP`, falling back to `X-Forwarded-For` (rule 2); headers set on every response (rule 10); `Retry-After` always set on 429 (rule 11); canonical error shape (forge-api-design rule 7).

## Workflow

When adding rate limiting:

1. **Identify endpoints that need different limits.** Auth, write, read, webhooks.
2. **Pick the key per endpoint.** IP for unauthenticated, token for authenticated.
3. **Pick the algorithm.** Token bucket as default.
4. **Pick the backend.** Redis if you have >1 replica. In-memory only for single-replica or local dev.
5. **Configure limits in env or feature flags.** Not in source.
6. **Set rate-limit headers on every response.**
7. **Monitor 429 rate. Alert when one key spikes.**

## Verification

This skill is structural; no shell verifier. Manual checklist:

- [ ] Rate-limit state survives a replica restart (i.e., not in-memory).
- [ ] Headers set on every response (200, 429, etc).
- [ ] `Retry-After` always set on 429.
- [ ] Different limits per route family.
- [ ] Limits configurable without redeploy.
- [ ] Client code (if you control it) honors `Retry-After`.

## When to skip this skill

- Pre-launch projects with no real traffic.
- Internal-only services behind a corporate VPN.
- Services entirely behind Cloudflare/Vercel rate limiting (still configure the limits at that edge).

## Related skills

- [`forge-api-design`](../forge-api-design/SKILL.md) - 429 status + error shape + headers.
- [`forge-auth`](../forge-auth/SKILL.md) - login endpoints need tighter limits.
- [`forge-caddy`](../../infra/forge-caddy/SKILL.md) - edge rate limiting.
- [`forge-error-handling`](../forge-error-handling/SKILL.md) - retry-with-backoff helper.