API Rate Limiting Explained: Reading API Rate Limits, 429s, and Backoff
How API rate limiting works: token bucket vs sliding window, the standard headers, what 429 and Retry-After mean, and how clients should back off safely.
Every API that survives contact with real traffic eventually has to say "no, slow down." Rate limiting is how a server protects itself from one noisy client starving everyone else, and it is also the part of the contract that most clients get wrong. The mechanics are not complicated once you separate three questions: how the server counts requests, how it tells you that you have hit the wall, and how your client should react. This post walks through all three, with the exact headers and a worked retry example you can paste into your own code.
The four algorithms behind every limiter
Most rate limiters are a variation on four shapes.
A fixed window counts requests inside a clock-aligned bucket, say 100 requests per minute starting on the minute. It is trivial to implement with one counter, but it has an ugly edge: a client can fire 100 requests at 12:00:59 and another 100 at 12:01:00, sneaking 200 requests through in two seconds.
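To make the boundary problem concrete, here is a minimal sketch of a fixed-window counter, assuming a single-process, in-memory limiter. The names createFixedWindow, limit, and windowMs are illustrative and not taken from any provider, and the clock is passed in as milliseconds so the burst can be replayed deterministically.

```javascript
// Fixed-window counter: one counter per clock-aligned window.
// `limit` and `windowMs` are illustrative parameters (assumptions).
function createFixedWindow(limit, windowMs) {
  let windowStart = 0;
  let count = 0;
  return function allow(nowMs) {
    // Align the window to the clock, e.g. to the start of the minute.
    const currentStart = Math.floor(nowMs / windowMs) * windowMs;
    if (currentStart !== windowStart) {
      windowStart = currentStart;
      count = 0; // hard reset at the boundary
    }
    if (count >= limit) return false;
    count += 1;
    return true;
  };
}

// Replay the edge case: 100 requests in the last second of one minute,
// then 100 more in the first instant of the next minute.
const allowFixed = createFixedWindow(100, 60_000);
let acceptedFixed = 0;
for (let i = 0; i < 100; i++) if (allowFixed(59_000)) acceptedFixed++;
for (let i = 0; i < 100; i++) if (allowFixed(60_000)) acceptedFixed++;
console.log(acceptedFixed); // 200
```

Both bursts are accepted in full because the counter resets the moment the clock crosses the boundary.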
A sliding window fixes that by smoothing the count over a rolling interval instead of a hard clock boundary. Rather than resetting to zero on the minute, it weighs the previous window's count so the rate stays roughly even no matter when your burst lands.
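One common way to do that weighting is the "sliding window counter" approximation, sketched below under the same single-process assumption. It is only one variant (a sliding log of timestamps is another), and createSlidingWindow and its parameters are hypothetical names.

```javascript
// Sliding-window counter: estimate the rolling count by weighting the
// previous window's total by how much of it still overlaps the interval.
function createSlidingWindow(limit, windowMs) {
  let currentStart = 0;
  let currentCount = 0;
  let previousCount = 0;
  return function allow(nowMs) {
    const start = Math.floor(nowMs / windowMs) * windowMs;
    if (start !== currentStart) {
      // If more than one window was skipped, the previous window was empty.
      previousCount = start - currentStart === windowMs ? currentCount : 0;
      currentStart = start;
      currentCount = 0;
    }
    // Fraction of the previous window still inside the rolling interval.
    const overlap = 1 - (nowMs - currentStart) / windowMs;
    const estimated = previousCount * overlap + currentCount;
    if (estimated >= limit) return false;
    currentCount += 1;
    return true;
  };
}

// Same burst pattern as the fixed-window edge case.
const allowSliding = createSlidingWindow(100, 60_000);
const countAccepted = (nowMs) => {
  let n = 0;
  for (let i = 0; i < 100; i++) if (allowSliding(nowMs)) n++;
  return n;
};
const firstBurst = countAccepted(59_000);  // 100: the window was empty
const secondBurst = countAccepted(60_000); // 0: previous window still counts in full
const thirdBurst = countAccepted(90_000);  // 50: half of the previous window has aged out
console.log(firstBurst, secondBurst, thirdBurst);
```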
A token bucket holds a capacity of tokens that refills at a steady rate. Each request spends one token; when the bucket is empty you are throttled. The key property is that a token bucket allows bursts up to its capacity, then settles back to the steady refill rate. If the bucket holds 60 tokens and refills at one per second, you can fire 60 requests instantly after an idle period, but sustained traffic is capped at 60 per minute.
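The 60-token example can be sketched as follows. This is a minimal in-memory version with an injected clock; createTokenBucket and its parameter names are assumptions for illustration, not any provider's implementation.

```javascript
// Token bucket: refill continuously, spend one token per request.
function createTokenBucket(capacity, refillPerSecond, startMs = 0) {
  let tokens = capacity; // start full, as after an idle period
  let lastMs = startMs;
  return function allow(nowMs) {
    // Credit the tokens earned since the last call, never above capacity.
    const elapsedSeconds = (nowMs - lastMs) / 1000;
    tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
    lastMs = nowMs;
    if (tokens < 1) return false; // empty bucket: throttled
    tokens -= 1;
    return true;
  };
}

const allowToken = createTokenBucket(60, 1);
let burstAccepted = 0;
for (let i = 0; i < 61; i++) if (allowToken(0)) burstAccepted++;
console.log(burstAccepted);     // 60: the burst drains the bucket, the 61st call is throttled
console.log(allowToken(1_000)); // true: one token has refilled after one second
console.log(allowToken(1_000)); // false: and it has just been spent
```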
A leaky bucket is the inverse: requests queue and drain at a fixed rate, which smooths bursts into a constant outflow instead of allowing them. Token bucket optimizes for "let bursts through, cap the average"; leaky bucket optimizes for "never exceed this instantaneous rate."
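For contrast, here is a rough sketch of the queue-style leaky bucket. Instead of answering yes or no, it assigns each request the time it may leave the queue, and rejects only when the wait would exceed a cap. The names createLeakyBucket, drainIntervalMs, and maxWaitMs are hypothetical.

```javascript
// Leaky bucket as a queue: requests drain one per `drainIntervalMs`.
function createLeakyBucket(drainIntervalMs, maxWaitMs) {
  let nextSlotMs = 0; // earliest time the next request may leave the bucket
  return function schedule(nowMs) {
    const sendAtMs = Math.max(nowMs, nextSlotMs);
    if (sendAtMs - nowMs > maxWaitMs) return null; // queue full: overflow
    nextSlotMs = sendAtMs + drainIntervalMs;
    return sendAtMs;
  };
}

// Five requests arrive at the same instant; one may drain per second and
// nothing is allowed to wait longer than three seconds.
const schedule = createLeakyBucket(1_000, 3_000);
const sendTimes = [0, 0, 0, 0, 0].map((nowMs) => schedule(nowMs));
console.log(sendTimes); // [ 0, 1000, 2000, 3000, null ]
```

The burst is not rejected outright as it would be by an empty token bucket; it is spread into an even one-per-second outflow, and only the overflow is dropped.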
You rarely need to know which one a provider uses internally. What you do need is to read what the response tells you.
The headers that actually matter
When you cross a limit, a well-behaved server returns 429 Too Many Requests with headers describing the situation. The modern set, taken from the IETF draft on RateLimit header fields plus the long-standardized Retry-After, looks like this:
RateLimit-Limit: the size of the quota window (how many requests you get).
RateLimit-Remaining: how many requests are left in the current window.
RateLimit-Reset: when the window refreshes, usually as a delay in seconds.
RateLimit-Policy: a machine-readable description of the policy, e.g. 100;w=60.
Retry-After: on a 429 or 503, the explicit wait before retrying, as seconds or an HTTP date.
You will also meet the older X-RateLimit-* family. They carry the same intent but with one nasty difference: X-RateLimit-Reset is frequently a Unix epoch timestamp (like 1717171717), while the unprefixed RateLimit-Reset is a delay in seconds (like 42). Mix them up and you either retry instantly into another wall (a delay of 42 read as a timestamp lands in 1970, long past) or sleep for roughly 54 years (a timestamp read as a delay in seconds). If you want to paste a raw response and have it sorted out for you, the API Rate Limit Cheatsheet reads the headers, decides which reset convention applies, and prints the next safe retry time.
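If you would rather handle the two conventions in your own client, one rough approach is a magnitude check. The sketch below is a heuristic, not a rule from any specification: the ten-year threshold and the name resetToDelayMs are assumptions, and a provider's documentation should override it whenever the convention is stated.

```javascript
// Assumption: no real delay exceeds ~10 years, so anything larger is
// treated as a Unix epoch timestamp in seconds.
const EPOCH_THRESHOLD_SECONDS = 315_360_000;

// Returns the wait in milliseconds, or null if the value is unusable.
function resetToDelayMs(rawValue, nowMs = Date.now()) {
  if (rawValue === null || rawValue === undefined || rawValue === "") return null;
  const value = Number(rawValue);
  if (!Number.isFinite(value) || value < 0) return null;
  if (value > EPOCH_THRESHOLD_SECONDS) {
    // Epoch seconds: convert to a delay relative to now, never negative.
    return Math.max(0, value * 1000 - nowMs);
  }
  return value * 1000; // already a delay in seconds
}

console.log(resetToDelayMs("42"));                        // 42000
console.log(resetToDelayMs("1717171717", 1717171700000)); // 17000
```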
One trap worth naming: RateLimit-Remaining is capacity, not time. It tells you how many calls are left, not how long to wait. Retry timing always comes from Retry-After or a reset value, never from the remaining counter.
A worked example: read the headers, back off
Suppose a background job calls an API and the response comes back:
HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30
Retry-After: 30
The server is explicit: you are out of quota and the window refreshes in 30 seconds. The correct move is to honor Retry-After directly — wait 30 seconds, then retry. No cleverness required.
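Because Retry-After may arrive either as a number of seconds or as an HTTP date, honoring it takes a small amount of parsing. A minimal sketch, assuming the raw header string is already in hand; retryAfterToMs is a hypothetical helper name.

```javascript
// Convert a Retry-After header value to a wait in milliseconds.
// Returns null when the header is missing or unparseable.
function retryAfterToMs(headerValue, nowMs = Date.now()) {
  if (headerValue === null || headerValue === undefined) return null;
  const trimmed = String(headerValue).trim();
  // Form 1: a non-negative integer number of seconds, e.g. "30".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

console.log(retryAfterToMs("30")); // 30000
```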
Now suppose Retry-After is missing and only RateLimit-Reset: 30 is present. You treat the reset value as the delay, and if neither header is usable you fall back to capped exponential backoff. Either way, add jitter so a fleet of workers does not all wake at the same instant and re-collide:
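async function callWithBackoff(doRequest, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doRequest();
    if (res.status !== 429) return res;
    if (attempt === maxRetries) break; // out of retries: fail without one more sleep
    const retryAfter = res.headers.get("retry-after");
    const reset = res.headers.get("ratelimit-reset");
    // base wait: prefer the server's instruction
    let waitMs = NaN;
    if (retryAfter !== null) {
      // Retry-After is either delay-seconds or an HTTP date
      const secs = Number(retryAfter);
      waitMs = Number.isFinite(secs) ? secs * 1000 : Date.parse(retryAfter) - Date.now();
    } else if (reset !== null) {
      waitMs = Number(reset) * 1000; // assumes the delay-in-seconds convention
    }
    // no usable instruction: capped exponential
    if (!Number.isFinite(waitMs) || waitMs < 0) {
      waitMs = Math.min(2 ** attempt * 500, 30000);
    }
    // add jitter so workers don't sync up
    waitMs += Math.random() * 1000;
    await new Promise((r) => setTimeout(r, waitMs));
  }
  throw new Error("Rate limit retries exhausted");
}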
The rules baked in: prefer the server's instruction, fall back to exponential growth capped at a ceiling (here 30 seconds), and always add jitter. Hammering an endpoint the moment a 429 arrives just earns you another 429 and, on stricter providers, a longer ban.
I learned the jitter lesson the hard way. I once shipped a sync worker that retried on a clean doubling schedule (1s, 2s, 4s) with no randomness. It worked fine in testing with one instance. In production we ran twelve workers, and a single rate-limit spike synchronized all twelve onto the identical retry clock. They marched in lockstep, hitting the API together at every interval and re-triggering the limit on each wake. The fix was a single Math.random() term added to the delay. Since then I treat jitter as mandatory, not a nice-to-have.
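For the case where the server gives no timing at all, a popular alternative to adding a fixed random offset is "full jitter", where the whole delay is drawn at random below the exponential ceiling. This is a different technique from the additive jitter in the function above, shown here only as a sketch; backoffWithJitter, baseMs, and capMs are illustrative names, and the random source is injectable so the behavior can be checked.

```javascript
// Full jitter: pick a uniform random delay in [0, ceiling), where the
// ceiling doubles per attempt and is capped at `capMs`.
function backoffWithJitter(attempt, { baseMs = 500, capMs = 30_000, random = Math.random } = {}) {
  const ceilingMs = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceilingMs;
}

// With a fixed "random" value of 0.5 the schedule is easy to read:
const half = () => 0.5;
console.log(backoffWithJitter(0, { random: half }));  // 250
console.log(backoffWithJitter(3, { random: half }));  // 2000
console.log(backoffWithJitter(10, { random: half })); // 15000 (capped at 30000 before scaling)
```

Twelve workers calling this with the real Math.random would each draw a different delay, which is what breaks the lockstep.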
Why you get limited with quota remaining
A confusing case: RateLimit-Remaining says 40, yet you still get a 429. Remaining describes one quota window, not every limiter guarding the API. You may have tripped a concurrency cap (too many in-flight requests at once), a per-endpoint cap, a token-per-minute budget on an AI API, or a workspace-level quota shared across your whole team. Log which limiter dimension fired and check provider-specific reason fields before assuming the counter is broken.
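A cheap first step toward that logging is to capture every limit-related header on a 429, since providers often expose several counters side by side. The sketch below only assumes that headers can be iterated as name/value pairs (as a fetch Headers object or a Map can); collectLimitHeaders is a hypothetical name, and it does not know any provider-specific reason fields.

```javascript
// Gather Retry-After plus anything in the RateLimit-* / X-RateLimit-* families.
function collectLimitHeaders(headers) {
  const found = {};
  for (const [name, value] of headers) {
    const lower = name.toLowerCase();
    if (
      lower === "retry-after" ||
      lower.startsWith("ratelimit") ||
      lower.startsWith("x-ratelimit")
    ) {
      found[lower] = value;
    }
  }
  return found;
}

// Example with made-up header names in the X-RateLimit-* style.
const snapshot = collectLimitHeaders([
  ["Content-Type", "application/json"],
  ["RateLimit-Remaining", "40"],
  ["X-RateLimit-Remaining-Tokens", "0"],
  ["Retry-After", "12"],
]);
console.log(snapshot);
```

Logging the whole snapshot alongside the request makes it visible when one counter still shows headroom while another has hit zero.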
Retries carry a second hazard that no rate-limit header warns you about: retrying a non-idempotent POST after a timeout is dangerous. If the original request actually succeeded but the response was lost, a blind retry double-creates the record or double-charges the card. Where the provider supports it, send an Idempotency-Key so the server can recognize and deduplicate the repeat.
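The important detail is that the key is generated once per logical operation and reused on every attempt. A minimal sketch, assuming the provider honors an Idempotency-Key header (support varies, so check its docs); postWithIdempotencyKey and the send callback are hypothetical, and the key could come from something like crypto.randomUUID().

```javascript
// Retry a non-idempotent request while reusing ONE key across attempts.
// `send` is a caller-supplied function that performs the POST with the
// given extra headers and rejects on timeout or network failure.
async function postWithIdempotencyKey(send, idempotencyKey, maxAttempts = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Same key every time, so the server can spot the repeat.
      return await send({ "Idempotency-Key": idempotencyKey });
    } catch (err) {
      lastError = err; // outcome unknown: the request may have succeeded
    }
  }
  throw lastError;
}
```

A real client would also wait between attempts; the delay is left out here to keep the focus on key reuse.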
Designing limits for your own API
If you are the one issuing 429s, a few choices save your users grief. Publish your policy in headers, not just docs — emit RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset on every response, not only on rejection, so clients can pace themselves before hitting the wall. Always include Retry-After on a 429 so well-behaved clients do not have to guess. Pick a token-bucket-style allowance if your traffic is bursty and human-driven; pick a stricter leaky-bucket shape if a downstream system genuinely cannot absorb spikes. And document which reset convention you use, because the epoch-versus-delay ambiguity is the single most common integration bug.
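On the server side, emitting those headers can be a small pure function over whatever state your limiter already tracks. The sketch below assumes the delay-in-seconds convention for RateLimit-Reset and a limiter that can report limit, remaining, and resetSeconds; those field names and rateLimitHeaders itself are illustrative.

```javascript
// Build rate-limit headers for any response; add Retry-After on rejection.
function rateLimitHeaders({ limit, remaining, resetSeconds }, rejected = false) {
  const wait = String(Math.ceil(resetSeconds)); // round up so clients never wake early
  const headers = {
    "RateLimit-Limit": String(limit),
    "RateLimit-Remaining": String(Math.max(0, remaining)),
    "RateLimit-Reset": wait,
  };
  if (rejected) headers["Retry-After"] = wait;
  return headers;
}

console.log(rateLimitHeaders({ limit: 100, remaining: 37, resetSeconds: 12.4 }));
console.log(rateLimitHeaders({ limit: 100, remaining: 0, resetSeconds: 29.2 }, true));
```

The second call reproduces the shape of the 429 in the worked example: zero remaining, with Retry-After matching the reset.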
Rate limit responses cross paths with the rest of your HTTP surface — caching policy, CORS, and error payloads all show up in the same response. When you are tuning those, the Cache-Control Builder helps you set the freshness directives without fighting the syntax. Keep the headers honest, keep the retry instructions explicit, and most clients will behave.
Made by Toolora · Updated 2026-06-13