The gateway and upstream providers may rate-limit per tenant, API key, or model route. Exact quotas are shown in your console or provided by support. This section covers the HTTP behavior you will see and the recommended client handling.
429 Too Many Requests
When a request is rate-limited, the response status is typically 429. An OpenAI-compatible error body may look like:
    {
      "error": {
        "message": "...",
        "type": "rate_limit_error",
        "code": "upstream_rate_limited"
      }
    }
The code value varies by scenario, so read it from the response body rather than assuming a fixed string.
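As a minimal sketch of reading that body defensively, the helper below extracts the type and code fields from a 429 response. The function name `parse_rate_limit_error` is hypothetical, and the field layout is assumed to match the example above. Missing or malformed fields come back as empty strings so callers do not depend on a specific code value.

```python
import json
from typing import Optional, Tuple


def parse_rate_limit_error(status: int, body: str) -> Optional[Tuple[str, str]]:
    """Return (type, code) for a 429 response, or None for any other status.

    Hypothetical helper: assumes the OpenAI-compatible layout shown above.
    """
    if status != 429:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Body is not JSON (for example, a proxy error page).
        return ("", "")
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return ("", "")
    # Do not assume a fixed code; pass through whatever the body contains.
    return (str(error.get("type") or ""), str(error.get("code") or ""))
```

A caller would typically log both values and branch only on the HTTP status, since the code string may differ between scenarios.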
Client strategy
- Use exponential backoff with jitter (e.g. base delays of 1s → 2s → 4s, each randomized).
- Cap retries (e.g. 3–5) to avoid amplifying load.
- Never retry in a tight loop on 429.
- Make retried calls with side effects idempotent (especially tool calls).
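The strategy above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not the gateway's client library: `send` is a hypothetical callable that performs one request and returns `(status, body)`, the "full jitter" variant (a random delay between zero and the exponential ceiling) is one common way to add jitter, and the sketch assumes your side-effecting endpoints accept some form of idempotency key. How the key is transmitted is left to `send`.

```python
import random
import time
import uuid
from typing import Callable, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[str], Tuple[int, str]],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send(idempotency_key); retry only on 429, at most max_retries times."""
    # One key is reused for every attempt so a server that supports
    # idempotency keys (an assumption) can deduplicate side effects.
    idempotency_key = str(uuid.uuid4())
    attempt = 0
    while True:
        status, body = send(idempotency_key)
        if status != 429 or attempt >= max_retries:
            # Success, a non-retryable status, or the retry cap was reached.
            return status, body
        sleep(backoff_delay(attempt))  # always wait before the next attempt
        attempt += 1
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays. The final 429 is returned to the caller once the cap is hit, so the caller decides whether to surface the error or queue the work for later.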
429 vs 403
| Status | Meaning |
|---|---|
| 429 | Temporary throttle; retry after backoff |
| 403 | Policy block (balance, model, IP); fix the configuration instead of retrying |
See Errors.
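In client code, the distinction in the table can be expressed as a small classifier. The names `Action` and `classify_status` are hypothetical, and the sketch covers only the two statuses discussed here.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # 429: temporary throttle
    FIX_CONFIG = "fix_config"                  # 403: policy block
    OTHER = "other"                            # anything else is out of scope here


def classify_status(status: int) -> Action:
    """Map an HTTP status to the handling suggested in the table above."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF
    if status == 403:
        # Retrying will not help until balance, model access, or IP policy is fixed.
        return Action.FIX_CONFIG
    return Action.OTHER
```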
Key-level vs model-level
Limits may apply both at tenant or key scope and per model or upstream route, so a key that is under its own limit can still receive 429 on a busy route. Usage and trends are shown in the console.
Capacity planning
- Load-test streaming concurrency in a non-production environment before peaks.
- Use queues with concurrency caps for batch jobs instead of unbounded parallel calls.
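A queue with a concurrency cap might look like the sketch below, which uses Python's `asyncio.Queue` and a fixed number of consumer tasks. The names `run_batch`, `worker`, and `max_concurrency` are hypothetical, and `worker` stands in for whatever coroutine sends one request to the gateway.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def run_batch(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Process items with at most max_concurrency worker calls in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for index, item in enumerate(items):
        queue.put_nowait((index, item))
    results: List[Any] = [None] * queue.qsize()

    async def consume() -> None:
        # Each consumer pulls the next job only after finishing the previous
        # one, so the number of consumers is the concurrency cap.
        while True:
            try:
                index, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[index] = await worker(item)

    await asyncio.gather(*(consume() for _ in range(max_concurrency)))
    return results
```

Results are returned in input order. A reasonable starting value for `max_concurrency` is one you have already load-tested, raised gradually while watching for 429 responses.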