The gateway and upstream providers may apply rate limits per tenant, per API key, or per model route. Exact quotas are shown in your console or available from support. This page covers the HTTP behavior you will see and the recommended client handling.

429 Too Many Requests

When a request is rate-limited, the response status is typically 429. An OpenAI-compatible error body may look like:

{
  "error": {
    "message": "...",
    "type": "rate_limit_error",
    "code": "upstream_rate_limited"
  }
}

The code field varies by scenario, so read it from the response body rather than assuming a fixed value.
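As a minimal sketch of reading those fields defensively, the helper below parses an error body shaped like the example above. The function name parse_rate_limit_error is hypothetical, and the only fields it assumes are the three shown in the sample (message, type, code); anything else in a real response is ignored.

```python
import json


def parse_rate_limit_error(body: str) -> dict:
    """Extract 'type', 'code', and 'message' from an OpenAI-compatible error body.

    Hypothetical helper: each value is None when the body is not JSON
    or does not contain an "error" object.
    """
    empty = {"type": None, "code": None, "message": None}
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return empty
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return empty
    return {key: error.get(key) for key in ("type", "code", "message")}
```

Logging the parsed type and code alongside the HTTP status makes it easier to tell apart the different throttling scenarios later.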

Client strategy

  1. Use exponential backoff with jitter (e.g. 1s → 2s → 4s).
  2. Cap retries (e.g. 3–5 attempts) so that retries do not amplify load.
  3. Never retry in a tight loop on 429.
  4. Make retried calls that have side effects idempotent (especially tool calls).
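The first three points can be sketched roughly as follows. This is an illustration under assumptions, not the gateway's official client: send is a hypothetical callable that performs one request and returns (status, body), the names backoff_delay and call_with_retry are invented for this example, and the "full jitter" variant (a uniform random delay up to the exponential ceiling) is one common choice among several. The base, cap, and retry defaults are illustrative values only.

```python
import random
import time
from typing import Callable, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send(); retry only on 429, at most max_retries additional times."""
    attempt = 0
    while True:
        status, body = send()
        # Any non-429 response, or an exhausted retry budget, is returned as-is.
        if status != 429 or attempt >= max_retries:
            return status, body
        # Always wait before the next attempt, so there is no tight loop.
        sleep(backoff_delay(attempt))
        attempt += 1
```

Passing sleep as a parameter keeps the function easy to test without real waiting. The sketch does not make calls idempotent; that still has to be handled by whatever send does.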

429 vs 403

Status  Meaning
429     Temporary throttle: retry after backoff
403     Policy block (balance, model, IP): fix configuration

See Errors.
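In client code, the distinction above might be expressed as a small classifier so that a 403 never enters the retry path. The Action enum and classify function are hypothetical names used only for illustration; how other status codes should be handled is outside this page and is left as OTHER.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # 429: temporary throttle
    FIX_CONFIG = "fix_config"                  # 403: policy block, retrying will not help
    OTHER = "other"                            # not covered by this sketch


def classify(status: int) -> Action:
    """Map an HTTP status to the handling described above."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF
    if status == 403:
        return Action.FIX_CONFIG
    return Action.OTHER
```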

Key-level vs model-level

Limits may apply at tenant/key scope and per model/upstream route. Usage and trends are in the console.

Capacity planning

  • Load-test streaming concurrency in a non-production environment before peaks.
  • Use queues with concurrency caps for batch jobs instead of unbounded parallel calls.
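One way to apply a concurrency cap to a batch job is a queue drained by a fixed number of workers, sketched below with asyncio. The name run_queue, the worker callable, and the default cap of 4 are assumptions for this example; pick the cap from your own quota and load-test results.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable, List


async def run_queue(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Process items through a queue with at most max_concurrency calls in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for index, item in enumerate(items):
        queue.put_nowait((index, item))
    results: Dict[int, Any] = {}

    async def consume() -> None:
        # Each consumer pulls work until the queue is empty, then exits.
        while True:
            try:
                index, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[index] = await worker(item)

    # The number of consumers is the concurrency cap.
    await asyncio.gather(*(consume() for _ in range(max_concurrency)))
    # Return results in the original item order.
    return [results[i] for i in range(len(results))]
```

Because only max_concurrency consumers exist, the number of simultaneous upstream calls stays bounded no matter how large the batch is.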

Related