Per-request upstream timeouts, automatic retries with exponential backoff, and routing a single model across multiple upstream clusters with failover or weighted load-balancing.
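obleth can keep a model serving when a single upstream call is slow, fails transiently, or a whole cluster goes down. Three controls work together, all configured per model:
- Per-request timeout: request_timeout_secs bounds every upstream attempt.
- Retries with exponential backoff: max_retries and retry_backoff_ms.
- Multiple endpoints: endpoint_selection_mode chooses failover (priority order) or load_balance (weighted).

All three are opt-in and additive: a model with none configured behaves exactly
as before, with one attempt against its api_base, bounded by the global timeout.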
Every upstream attempt is wrapped in a timeout. By default it inherits the
global OBLETH_UPSTREAM_TIMEOUT_SECS (300s, covering the full stream). Set a
per-model request_timeout_secs to override it for slow or latency-sensitive
routes; leave it null to use the global default.
curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/reliability" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "request_timeout_secs": 60,
    "max_retries": 2,
    "retry_backoff_ms": 200,
    "endpoint_selection_mode": "failover"
  }'
When an attempt exceeds the timeout, it counts as a retryable failure: obleth
retries (if attempts remain) or fails over to the next endpoint. If every
attempt times out, the client receives 504 Gateway Timeout.
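A minimal Python sketch of how the effective per-attempt timeout could be resolved and enforced. Only the 300 s default and the "null inherits the global value" rule come from the text above; the function names, the hypothetical `call` argument, and the use of `asyncio.wait_for` to bound the attempt are illustrative assumptions, not obleth's implementation.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Documented default for OBLETH_UPSTREAM_TIMEOUT_SECS.
GLOBAL_UPSTREAM_TIMEOUT_SECS = 300.0


def effective_timeout_secs(
    request_timeout_secs: Optional[float],
    global_timeout_secs: float = GLOBAL_UPSTREAM_TIMEOUT_SECS,
) -> float:
    """A per-model value overrides the global one; None (null) inherits it."""
    if request_timeout_secs is None:
        return global_timeout_secs
    return request_timeout_secs


async def attempt_with_timeout(
    call: Callable[[], Awaitable[str]], timeout_secs: float
) -> str:
    """Run one upstream attempt (hypothetical `call`) under a deadline."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        # Reported as a retryable outcome instead of propagating the exception.
        return "timeout"
```

Returning a `"timeout"` outcome rather than raising lets the caller treat it like any other retryable failure when deciding whether to retry or fail over.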
max_retries (default 0) is the number of additional attempts against the
same endpoint after the first one fails with a retryable error. retry_backoff_ms
(default 200) is the base delay; it grows exponentially per attempt
(base, base×2, base×4, …) and is capped at base×64.
attempt 0 → fail (retryable) → wait 200ms
attempt 1 → fail (retryable) → wait 400ms
attempt 2 → success
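The schedule above can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes the delay depends only on the 0-based attempt number; no jitter is described in the text, so none is modeled.

```python
def backoff_delay_ms(attempt: int, base_ms: int = 200) -> int:
    """Delay after failed attempt `attempt` (0-based).

    Grows as base, base*2, base*4, ... and is capped at base*64.
    """
    return min(base_ms * 2 ** attempt, base_ms * 64)


if __name__ == "__main__":
    print([backoff_delay_ms(n) for n in range(8)])
    # [200, 400, 800, 1600, 3200, 6400, 12800, 12800]
```

With the default base of 200 ms the cap is reached after the seventh failed attempt (attempt 6), so a single wait never exceeds 12.8 s.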
Only transient transport or overload conditions are retried or failed over. Client errors are returned immediately so a bad request isn't amplified into several upstream calls.
| Outcome | Classification |
|---|---|
| Connection / DNS / transport error | Retryable |
| Timeout (per-request) | Retryable |
| HTTP 408, 429 | Retryable |
| HTTP 500, 502, 503, 504 | Retryable |
| HTTP 400, 401, 403, 404, 422 (any other 4xx) | Fatal, returned as-is |
| HTTP 2xx | Success |
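The table can be expressed as a small classifier. This is a sketch with hypothetical names; statuses the table does not list (for example 3xx or other 5xx codes) are returned as "unspecified" here rather than guessed.

```python
# Status codes the table lists as retryable.
RETRYABLE_STATUSES = frozenset({408, 429, 500, 502, 503, 504})


def classify_status(status: int) -> str:
    """Map an upstream HTTP status to the classification in the table."""
    if 200 <= status < 300:
        return "success"
    if status in RETRYABLE_STATUSES:
        return "retryable"
    if 400 <= status < 500:
        # Any other 4xx is a client error and is returned as-is.
        return "fatal"
    # Not covered by the table; behavior is an open assumption.
    return "unspecified"
```

The retryable set is checked before the generic 4xx rule so that 408 and 429 are not misclassified as fatal.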
Retries and failover only happen before the first response byte is streamed to the client. Once obleth has started forwarding the upstream response, the request is committed to that endpoint — a mid-stream failure surfaces to the client rather than silently restarting. This preserves correct streaming (SSE) semantics and avoids duplicate side effects.
Retries replay the JSON request body. Multipart routes (audio transcription/translation) are not replayable, so they always get a single attempt against the first endpoint — no retry, no failover.
A single fairshare permit and a single token-budget reservation cover the whole attempt sequence, so retries and failover never double-charge a tenant.
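To make the replay rule concrete, the following sketch lists the worst-case attempt sequence for one request. The function name and the `replayable` flag are hypothetical; the real sequence stops early at the first success, fatal error, or once response bytes have been streamed.

```python
from typing import List, Tuple


def planned_attempts(
    endpoints: List[str], max_retries: int, replayable: bool
) -> List[Tuple[str, int]]:
    """Upper bound on (endpoint, attempt) pairs for a single request."""
    if not endpoints:
        return []
    if not replayable:
        # Multipart bodies: one attempt against the first endpoint only.
        return [(endpoints[0], 0)]
    # JSON bodies: first attempt plus max_retries retries on each endpoint.
    return [
        (endpoint, attempt)
        for endpoint in endpoints
        for attempt in range(max_retries + 1)
    ]
```

For two endpoints and max_retries of 2, a JSON request can produce at most six upstream calls, all covered by the same permit and token-budget reservation.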
A model can front several upstream clusters that all serve the same
upstream_model — for example a primary and a standby cluster, or two regions
behind one client-facing name. Each endpoint carries its own api_base,
optional api_key, priority, weight, and enabled flag, plus its own
health state.
When a model has one or more eligible endpoints (enabled and not marked
unhealthy), the data plane routes across them. When it has none, it falls back
to the model's own api_base/api_key (the legacy single-upstream path), so
adding endpoints is always optional.
endpoint_selection_mode controls ordering of the eligible endpoints:
| Mode | Behavior |
|---|---|
| failover (default) | Endpoints are tried in ascending priority (lowest number first). Traffic sticks to the highest-priority healthy endpoint; lower ones are standbys used only when those ahead fail. |
| load_balance | Endpoints are ordered by a weighted random shuffle (A-Res) so the first choice is picked in proportion to weight. The remaining endpoints stay available as failover targets for the same request. |
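Both orderings can be sketched in a few lines of Python. The `Endpoint` class and `order_endpoints` function are hypothetical. The load_balance branch uses the standard A-Res keying (draw u uniformly from [0, 1), use u^(1/weight) as the key, sort descending) and assumes strictly positive weights; obleth's actual implementation may differ in detail.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int = 100
    weight: int = 100


def order_endpoints(
    endpoints: List[Endpoint],
    mode: str = "failover",
    rng: Optional[random.Random] = None,
) -> List[Endpoint]:
    """Return eligible endpoints in the order a request would try them."""
    if mode == "failover":
        # Ascending priority: lowest number first.
        return sorted(endpoints, key=lambda e: e.priority)
    if mode == "load_balance":
        rng = rng or random.Random()
        # A-Res: larger weights push the key u**(1/w) closer to 1.
        keyed = [(rng.random() ** (1.0 / e.weight), e) for e in endpoints]
        keyed.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in keyed]
    raise ValueError(f"unknown endpoint_selection_mode: {mode}")
```

Because the shuffle returns a full permutation, not just one pick, the endpoints that lose the draw remain in the list as failover targets for the same request.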
In both modes a single request walks the ordered list: it tries an endpoint
(with its retry budget), and on exhaustion fails over to the next, until one
succeeds or the list is exhausted (502 Bad Gateway, or 504 if the last
failure was a timeout).
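The walk described above can be sketched as a nested loop. Everything here is illustrative: `send` is a hypothetical callback that performs one upstream attempt and returns an (outcome, status) pair, with outcome being one of "success", "fatal", "retryable", or "timeout". The sketch also assumes the backoff sequence restarts on each endpoint, which the text does not state.

```python
import time
from typing import Callable, List, Optional, Tuple


def route_request(
    endpoints: List[str],
    send: Callable[[str], Tuple[str, int]],
    max_retries: int = 0,
    retry_backoff_ms: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Walk the ordered endpoint list and return the status sent to the client."""
    last_outcome: Optional[str] = None
    for endpoint in endpoints:
        for attempt in range(max_retries + 1):
            outcome, status = send(endpoint)
            if outcome in ("success", "fatal"):
                # Success and client errors are returned immediately.
                return status
            last_outcome = outcome
            if attempt < max_retries:
                delay_ms = min(retry_backoff_ms * 2 ** attempt, retry_backoff_ms * 64)
                sleep(delay_ms / 1000.0)
        # Retry budget for this endpoint is exhausted: fail over to the next.
    # List exhausted: 504 if the last failure was a timeout, otherwise 502.
    return 504 if last_outcome == "timeout" else 502
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a caller would supply the ordered list produced by the selection mode.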
From the dashboard, expand a model on the Models page and use the Reliability & endpoints panel to set the timeout/retry policy, choose the selection mode, and add, enable/disable, or remove endpoints.
From the Management API:
# List endpoints for a model
curl "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"

# Add a second cluster (priority 200 = standby behind the primary at 100)
curl -X POST "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cluster-b",
    "api_base": "http://cluster-b.internal/v1",
    "api_key": "sk_cluster_b",
    "priority": 200,
    "weight": 100,
    "enabled": true
  }'

# Update (omit api_key to keep the stored secret; send "" to clear it)
curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cluster-b", "api_base": "http://cluster-b.internal/v1", "priority": 150, "weight": 100, "enabled": false}'

# Remove
curl -X DELETE "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"
Endpoint api_base URLs are SSRF-validated like any admin-registered upstream
(see OBLETH_ALLOWED_PRIVATE_CIDRS). Endpoint secrets are encrypted at rest
when OBLETH_ENCRYPTION_KEY is set. Changes propagate to every gateway pod
immediately via the Redis invalidation channel.
The model health worker probes each enabled endpoint independently with the same
token-free GET {api_base}/models liveness check used for the model itself, and
records a per-endpoint health_status. Only endpoints that are both enabled
and not explicitly unhealthy are eligible for routing — an endpoint whose
probe confirms it is down (or that an operator disables) is removed from rotation
until it recovers, while unknown/degraded states soft-pass. The model is only
reported fully down when all of its endpoints are unhealthy. See
Model Health.
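The eligibility rule can be sketched as a filter. The class and function names are hypothetical, and the health_status strings other than the states named in the text ("unhealthy", "unknown", "degraded") are assumptions. The sketch also assumes "all of its endpoints" means every configured endpoint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    name: str
    enabled: bool = True
    # Assumed values: "healthy", "degraded", "unknown", "unhealthy".
    health_status: str = "unknown"


def eligible_endpoints(endpoints: List[Endpoint]) -> List[Endpoint]:
    """Enabled and not explicitly unhealthy; unknown/degraded soft-pass."""
    return [e for e in endpoints if e.enabled and e.health_status != "unhealthy"]


def model_fully_down(endpoints: List[Endpoint]) -> bool:
    """True only when the model has endpoints and every one is unhealthy."""
    return bool(endpoints) and all(e.health_status == "unhealthy" for e in endpoints)
```

Filtering on "not unhealthy" rather than "healthy" means a freshly added endpoint with no probe result yet can take traffic right away.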
Every upstream attempt increments obleth_upstream_attempts_total{outcome=...}:
| outcome | Meaning |
|---|---|
| success | Attempt returned a usable (non-retryable) response |
| retry | Retryable failure with attempts remaining on the same endpoint |
| timeout | Attempt hit the per-request timeout |
| failover | Endpoint exhausted; moved to the next endpoint |
| exhausted | All endpoints and attempts failed; request returned 502/504 |
# Failover rate (how often the primary endpoint isn't serving)
rate(obleth_upstream_attempts_total{outcome="failover"}[5m])
# Retry amplification
rate(obleth_upstream_attempts_total{outcome="retry"}[5m])
/ rate(obleth_upstream_attempts_total{outcome="success"}[5m])
This feature fails a model over between endpoints of the same model — same
upstream_model, same capabilities, same cost. It deliberately does not
swap to a different model, which would change pricing, capabilities, and
capacity accounting. Routing a request to an entirely different model on failure
(a cross-model fallback chain) is a separate, future capability.