
The gateway and upstream providers may rate-limit requests per tenant, API key, or model route. Exact quotas are shown in your console or available from support. This guide covers the HTTP behavior you will see when limited and how clients should handle it.

429 Too Many Requests

When a request is rate-limited, you typically receive HTTP 429. The body uses the OpenAI-compatible error format and may look like:

{
  "error": {
    "message": "...",
    "type": "rate_limit_error",
    "code": "upstream_rate_limited"
  }
}

The code field varies by scenario, so always read it from the response body rather than assuming a fixed value.
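A minimal Python sketch of reading these fields instead of hard-coding them, assuming the body has the shape shown above. parse_error is a hypothetical helper name, not part of any SDK. Missing fields or a non-JSON body come back as None, so the caller can still fall back to the HTTP status.

```python
import json
from typing import Optional, Tuple


def parse_error(body: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (type, code, message) from an OpenAI-compatible error body.

    Absent fields are None; a body that is not JSON or has no "error"
    object yields (None, None, None).
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None, None, None
    err = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(err, dict):
        return None, None, None
    return err.get("type"), err.get("code"), err.get("message")
```

Logging the returned code alongside the status makes it easier to tell scenarios apart when you contact support.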

Client strategy

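Use exponential backoff with jitter (e.g. delay ceilings of 1s → 2s → 4s, each randomized).
Cap retries (e.g. 3–5) to avoid amplifying load.
Never retry a 429 in a tight loop; always wait before the next attempt.
Make retried calls with side effects (especially tool calls) idempotent, so a retry cannot apply the same action twice.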
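The strategy above can be sketched in Python as follows. This is an illustration under stated assumptions, not an official client: call_with_retries, backoff_delay and the send callable are hypothetical names, and the delays use "full jitter" (each wait drawn uniformly between 0 and an exponential ceiling of 1s, 2s, 4s, ...), which is one common jitter variant. The idempotency key is generated once and reused on every attempt; how it reaches the server (for example a request header) depends on what your gateway or tool backend supports, so confirm that with support.

```python
import random
import time
import uuid
from typing import Callable, Optional, Tuple

# send(idempotency_key) -> (http_status, response_body); a hypothetical interface.
Send = Callable[[str], Tuple[int, str]]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay for 0-based retry `attempt`.

    The ceiling doubles each attempt (1s, 2s, 4s, ... with base=1.0) up to
    `cap`; the actual wait is drawn uniformly from [0, ceiling].
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(send: Send, max_retries: int = 4, base: float = 1.0,
                      cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> Tuple[int, str]:
    """Call `send`, retrying only on HTTP 429, at most `max_retries` times."""
    rng = rng if rng is not None else random.Random()
    # One key for the whole logical operation: every retry reuses it, so a
    # backend that deduplicates by key can avoid repeating side effects.
    idempotency_key = str(uuid.uuid4())
    attempt = 0
    while True:
        status, body = send(idempotency_key)
        if status != 429 or attempt >= max_retries:
            return status, body  # success, non-retryable error, or retries exhausted
        sleep(backoff_delay(attempt, base, cap, rng))
        attempt += 1
```

Only 429 is retried; any other status, including 403, goes straight back to the caller. Injecting sleep and rng keeps the loop testable without real waiting.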

429 vs 403

Status	Meaning
429	Temporary throttle; retry after backoff
403	Policy block (for example balance, model access, or IP restriction); fix the configuration instead of retrying
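A small sketch of encoding this table so that a 403 never enters the retry loop. The Action enum and action_for_status are hypothetical names; statuses other than 429 and 403 are left to your general error handling (see Errors).

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # 429: temporary throttle
    FIX_CONFIG = "fix_config"                  # 403: policy block, do not retry
    OTHER = "other"                            # outside the scope of this table


def action_for_status(status: int) -> Action:
    """Map an HTTP status to the handling described in the table."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF
    if status == 403:
        return Action.FIX_CONFIG
    return Action.OTHER
```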

See Errors.

Key-level vs model-level

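Limits may apply at tenant or key scope and, separately, per model or upstream route, so a key can be throttled on one model while requests to other models still succeed. You can view usage and trends in the console.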
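One possible client-side approach, assuming you want a 429 on one model route not to pause traffic to other models: keep a cooldown per (API key, model) pair. RouteCooldowns is a hypothetical helper, not a gateway feature, and it knows nothing about the gateway's real quotas. If the limit is actually key-wide, calls to other models will also return 429 and pick up their own cooldowns.

```python
import time
from typing import Callable, Dict, Tuple

Scope = Tuple[str, str]  # (api_key_id, model); a hypothetical scoping choice


class RouteCooldowns:
    """Remember, per (key, model) scope, when requests may resume after a 429."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._resume_at: Dict[Scope, float] = {}

    def record_429(self, key_id: str, model: str, delay: float) -> None:
        # Extend (never shorten) the pause, and only for this scope.
        scope = (key_id, model)
        resume = self._clock() + delay
        self._resume_at[scope] = max(self._resume_at.get(scope, 0.0), resume)

    def wait_time(self, key_id: str, model: str) -> float:
        """Seconds to wait before the next call on this scope (0.0 if ready)."""
        resume = self._resume_at.get((key_id, model), 0.0)
        return max(0.0, resume - self._clock())
```

The delay passed to record_429 could come from the same backoff calculation used for retries.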

Capacity planning

Load-test streaming concurrency in a non-production environment before expected traffic peaks.
Use queues with concurrency caps for batch jobs instead of unbounded parallel calls.
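A minimal asyncio sketch of a batch queue with a concurrency cap. process_batch and the call coroutine are hypothetical placeholders for your own job runner and API client, and max_concurrency=4 is an arbitrary example value; tune it against your quotas and load-test results. Wrapping call in backoff-and-retry handling, as described under Client strategy, keeps individual 429s from failing the whole batch.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_batch(items: Sequence[T],
                        call: Callable[[T], Awaitable[R]],
                        max_concurrency: int = 4) -> List[R]:
    """Run `call` on every item with at most `max_concurrency` calls in flight.

    Items go into a queue drained by a fixed pool of workers; results keep
    the input order. An exception from `call` propagates out of gather.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    queue: asyncio.Queue = asyncio.Queue()
    for index, item in enumerate(items):
        queue.put_nowait((index, item))
    results: List[Any] = [None] * len(items)

    async def worker() -> None:
        while True:
            try:
                index, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do; this worker exits
            results[index] = await call(item)

    # The size of the worker pool is the concurrency cap.
    await asyncio.gather(*(worker() for _ in range(max_concurrency)))
    return results
```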
