Reliability & Failover

Per-request upstream timeouts, automatic retries with exponential backoff, and routing a single model across multiple upstream clusters with failover or weighted load-balancing.

obleth can keep a model serving when a single upstream call is slow or fails transiently, or when a whole cluster goes down. Three controls work together, all configured per model:

- a per-request timeout that bounds how long any one attempt may run;
- automatic retries with exponential backoff for transient failures; and
- multiple endpoints: the same model fronted by several upstream clusters, selected by failover (priority order), load_balance (weighted), or session_hash (conversation-sticky).

All three are opt-in and additive: a model with none configured behaves exactly as before — one attempt against its api_base, bounded by the global timeout.

Per-request timeout

Every upstream attempt is wrapped in a timeout. By default it inherits the global OBLETH_UPSTREAM_TIMEOUT_SECS (300s, covering the full stream). Set a per-model request_timeout_secs to override it for slow or latency-sensitive routes; leave it null to use the global default.

curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/reliability" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "request_timeout_secs": 60,
    "max_retries": 2,
    "retry_backoff_ms": 200,
    "endpoint_selection_mode": "failover"
  }'

When an attempt exceeds the timeout, it counts as a retryable failure: obleth retries (if attempts remain) or fails over to the next endpoint. If every attempt times out, the client receives 504 Gateway Timeout.
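The inheritance rule for the timeout can be sketched in a few lines. This is an illustrative sketch, not obleth's implementation: the function name `effective_timeout_secs` and the constant name are hypothetical, and only the 300s default and the null-means-inherit behavior come from the text above.

```python
from typing import Optional

# Documented default of OBLETH_UPSTREAM_TIMEOUT_SECS (the constant name is hypothetical).
GLOBAL_UPSTREAM_TIMEOUT_SECS = 300


def effective_timeout_secs(
    request_timeout_secs: Optional[int],
    global_timeout_secs: int = GLOBAL_UPSTREAM_TIMEOUT_SECS,
) -> int:
    """Return the timeout applied to each upstream attempt.

    A per-model value wins when it is set; null (None) inherits the global default.
    """
    if request_timeout_secs is None:
        return global_timeout_secs
    return request_timeout_secs


print(effective_timeout_secs(None))  # model without an override
print(effective_timeout_secs(60))    # model configured as in the curl example
```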

Retries and backoff

max_retries (default 0) is the number of additional attempts against the same endpoint after the first one fails with a retryable error. retry_backoff_ms (default 200) is the base delay; it grows exponentially per attempt (base, base×2, base×4, …) and is capped at base×64.

attempt 0 → fail (retryable) → wait 200ms
attempt 1 → fail (retryable) → wait 400ms
attempt 2 → success
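The delay schedule above follows directly from the two settings. A minimal sketch, assuming the delay after failed attempt `n` (0-based) is `base × 2^n` capped at `base × 64`, as described; the function names are hypothetical.

```python
from typing import List


def backoff_delay_ms(attempt: int, base_ms: int = 200) -> int:
    """Delay to wait after failed attempt number `attempt` (0-based), capped at base x 64."""
    return min(base_ms * 2 ** attempt, base_ms * 64)


def backoff_schedule(max_retries: int, base_ms: int = 200) -> List[int]:
    """Waits between attempts on one endpoint: one wait per retry."""
    return [backoff_delay_ms(attempt, base_ms) for attempt in range(max_retries)]


print(backoff_schedule(2))   # [200, 400], matching the trace above
print(backoff_delay_ms(10))  # 12800, the cap for a 200 ms base
```

With the default base of 200 ms the cap is reached at the seventh failed attempt, so a large max_retries never waits longer than 12.8 s between attempts.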

What counts as retryable

Only transient transport or overload conditions are retried or failed over. Client errors are returned immediately so a bad request isn't amplified into several upstream calls.

| Outcome | Classification |
| --- | --- |
| Connection / DNS / transport error | Retryable |
| Timeout (per-request) | Retryable |
| HTTP `408`, `429` | Retryable |
| HTTP `500`, `502`, `503`, `504` | Retryable |
| HTTP `400`, `401`, `403`, `404`, `422` (and any other 4xx except `408` and `429`) | Fatal, returned as-is |
| HTTP `2xx` | Success |
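The table can be read as a small classification function. The sketch below is illustrative only: the function name and return strings are hypothetical, and statuses the table does not list (1xx, 3xx, other 5xx) are reported as "unclassified" rather than guessed.

```python
from typing import Optional

# Statuses the table marks as retryable.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def classify(status: Optional[int], transport_error: bool = False, timed_out: bool = False) -> str:
    """Map an attempt outcome to 'success', 'retryable', 'fatal', or 'unclassified'."""
    if transport_error or timed_out:
        return "retryable"
    if status is None:
        raise ValueError("status is required when there was no transport error or timeout")
    if 200 <= status < 300:
        return "success"
    if status in RETRYABLE_STATUSES:
        return "retryable"
    if 400 <= status < 500:
        return "fatal"  # returned to the client as-is
    return "unclassified"  # not covered by the table; behavior is not specified here


for code in (200, 429, 404, 503):
    print(code, classify(code))
```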

Point of no return

Retries and failover only happen before the first response byte is streamed to the client. Once obleth has started forwarding the upstream response, the request is committed to that endpoint — a mid-stream failure surfaces to the client rather than silently restarting. This preserves correct streaming (SSE) semantics and avoids duplicate side effects.

Retries replay the JSON request body. Multipart routes (audio transcription/translation) are not replayable, so they always get a single attempt against the first endpoint — no retry, no failover.

A single fairshare permit and a single token-budget reservation cover the whole attempt sequence, so retries and failover never double-charge a tenant.

Multiple endpoints per model

A model can front several upstream clusters that all serve the same upstream_model — for example a primary and a standby cluster, or two regions behind one client-facing name. Each endpoint carries its own api_base, optional api_key, priority, weight, and enabled flag, plus its own health state.

When a model has one or more eligible endpoints (enabled and not marked unhealthy), the data plane routes across them. When it has none, it falls back to the model's own api_base / api_key (the legacy single-upstream path), so adding endpoints is always optional.

Selection modes

endpoint_selection_mode controls ordering of the eligible endpoints:

| Mode | Behavior |
| --- | --- |
| `failover` (default) | Endpoints are tried in ascending `priority` (lowest number first). Traffic sticks to the highest-priority healthy endpoint; lower ones are standbys used only when those ahead fail. |
| `load_balance` | Endpoints are ordered by a weighted random shuffle (A-Res) so the first choice is picked in proportion to `weight`. The remaining endpoints stay available as failover targets for the same request. |
| `session_hash` | A whole conversation is pinned to the same endpoint by a consistent hash of its session id, so a session keeps hitting the upstream replica that already has its KV cache warm. The other endpoints stay available as failover targets, and if the pinned endpoint is removed the session re-pins deterministically. |
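To make the first two orderings concrete, here is a minimal sketch. It assumes positive integer weights and uses the standard A-Res construction (each endpoint draws a key `u^(1/weight)` with `u` uniform in [0, 1), and endpoints are sorted by key, largest first). The `Endpoint` class and function names are hypothetical, and obleth's actual implementation may differ in detail.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int
    weight: int  # assumed to be > 0 in this sketch


def failover_order(endpoints: List[Endpoint]) -> List[Endpoint]:
    """Ascending priority: the lowest number is tried first."""
    return sorted(endpoints, key=lambda e: e.priority)


def load_balance_order(endpoints: List[Endpoint], rng: Optional[random.Random] = None) -> List[Endpoint]:
    """A-Res weighted shuffle: the first element is chosen with probability weight / total weight."""
    if rng is None:
        rng = random.Random()
    keyed = [(rng.random() ** (1.0 / e.weight), e) for e in endpoints]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in keyed]


endpoints = [Endpoint("cluster-b", 200, 100), Endpoint("cluster-a", 100, 300)]
print([e.name for e in failover_order(endpoints)])      # cluster-a first
print([e.name for e in load_balance_order(endpoints)])  # cluster-a first about 75% of the time
```

Because the whole list is shuffled rather than a single winner drawn, the tail of the list doubles as the failover order for that request.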

session_hash needs a conversation id to pin on. obleth derives one from each request (or honors a client-supplied x-session-id), so no client change is required — but a request that genuinely has no session key falls back to the load_balance weighted order for that request (and logs a warning), so the mode degrades safely rather than pinning everything to one endpoint.
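One way to get this pinning behavior is rendezvous (highest-random-weight) hashing, sketched below. This is a stand-in chosen for illustration: the text above only says "a consistent hash", so the specific scheme, the SHA-256 scoring, and the function names are assumptions, and this sketch ignores endpoint weights.

```python
import hashlib
from typing import List, Optional


def _score(session_id: str, endpoint_name: str) -> int:
    """Deterministic per-(session, endpoint) score derived from SHA-256."""
    digest = hashlib.sha256(f"{session_id}|{endpoint_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def session_hash_order(endpoint_names: List[str], session_id: Optional[str]) -> Optional[List[str]]:
    """Order endpoints for a session: pinned endpoint first, the rest as failover targets.

    Returns None when there is no session key, so the caller can fall back to
    the load_balance weighted order as described above.
    """
    if not session_id:
        return None
    return sorted(endpoint_names, key=lambda name: _score(session_id, name), reverse=True)


names = ["cluster-a", "cluster-b", "cluster-c"]
order = session_hash_order(names, "conv-123")
print(order)  # the same order on every call for this session

# Removing the pinned endpoint re-pins the session to its previous second choice.
remaining = [n for n in names if n != order[0]]
print(session_hash_order(remaining, "conv-123")[0] == order[1])  # True
```

With this scheme, removing an endpoint only moves the sessions that were pinned to it; every other session keeps its endpoint.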

In every mode a single request walks the ordered list: it tries an endpoint (with its retry budget), and on exhaustion fails over to the next, until one succeeds or the list is exhausted (502 Bad Gateway, or 504 if the last failure was a timeout).
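The walk over the ordered list can be sketched as two nested loops. This is a simplified model, not obleth's code: `send` is a hypothetical callable that returns "success", "fatal", "retryable", or "timeout" for one attempt, the sketch assumes no extra wait before moving to the next endpoint, and it omits the legacy api_base fallback, streaming, and multipart handling.

```python
import time
from typing import Callable, List, Optional, Tuple


def route(
    endpoints: List[str],
    send: Callable[[str], str],
    max_retries: int = 0,
    retry_backoff_ms: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[str]]:
    """Return (result, endpoint): 'success' or 'fatal' with the endpoint that answered,
    or '502' / '504' with None when every endpoint and attempt failed."""
    last_failure = None
    for endpoint in endpoints:
        for attempt in range(max_retries + 1):
            outcome = send(endpoint)
            if outcome in ("success", "fatal"):
                return outcome, endpoint  # fatal client errors are returned immediately
            last_failure = outcome
            if attempt < max_retries:
                delay_ms = min(retry_backoff_ms * 2 ** attempt, retry_backoff_ms * 64)
                sleep(delay_ms / 1000.0)
        # Retry budget exhausted on this endpoint: fail over to the next one.
    return ("504" if last_failure == "timeout" else "502"), None


# Example: the primary keeps failing, the standby answers.
result = route(
    ["primary", "standby"],
    lambda ep: "success" if ep == "standby" else "retryable",
    max_retries=2,
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(result)  # ('success', 'standby')
```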

Managing endpoints

From the dashboard, expand a model on the Models page and use the Reliability & endpoints panel to set the timeout/retry policy, choose the selection mode, and add, enable/disable, or remove endpoints.

From the Management API:

# List endpoints for a model
curl "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"

# Add a second cluster (priority 200 = standby behind the primary at 100)
curl -X POST "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cluster-b",
    "api_base": "http://cluster-b.internal/v1",
    "api_key": "sk_cluster_b",
    "priority": 200,
    "weight": 100,
    "enabled": true
  }'

# Update (omit api_key to keep the stored secret; send "" to clear it)
curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cluster-b", "api_base": "http://cluster-b.internal/v1", "priority": 150, "weight": 100, "enabled": false}'

# Remove
curl -X DELETE "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"

Endpoint api_base URLs are SSRF-validated like any admin-registered upstream (see OBLETH_ALLOWED_PRIVATE_CIDRS). Endpoint secrets are encrypted at rest when OBLETH_ENCRYPTION_KEY is set. Changes propagate to every gateway pod immediately via the Redis invalidation channel.

Per-endpoint health

The model health worker probes each enabled endpoint independently with the same token-free GET {api_base}/models liveness check used for the model itself, and records a per-endpoint health_status. Only endpoints that are both enabled and not explicitly unhealthy are eligible for routing — an endpoint whose probe confirms it is down (or that an operator disables) is removed from rotation until it recovers, while unknown/degraded states soft-pass. The model is only reported fully down when all of its endpoints are unhealthy. See Model Health.
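The eligibility rule amounts to a simple filter. In the sketch below the `EndpointState` class and the status strings are hypothetical stand-ins for the recorded health_status. It also assumes that "all of its endpoints" refers to the enabled (probed) endpoints, which the text above does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EndpointState:
    name: str
    enabled: bool
    health_status: str  # hypothetical values: "healthy", "degraded", "unknown", "unhealthy"


def eligible(endpoints: List[EndpointState]) -> List[EndpointState]:
    """Enabled endpoints that are not explicitly unhealthy; unknown/degraded soft-pass."""
    return [e for e in endpoints if e.enabled and e.health_status != "unhealthy"]


def model_fully_down(endpoints: List[EndpointState]) -> bool:
    """True only when there are enabled endpoints and every one of them is unhealthy."""
    probed = [e for e in endpoints if e.enabled]
    return bool(probed) and all(e.health_status == "unhealthy" for e in probed)


states = [
    EndpointState("cluster-a", True, "unhealthy"),
    EndpointState("cluster-b", True, "unknown"),
    EndpointState("cluster-c", False, "healthy"),
]
print([e.name for e in eligible(states)])  # ['cluster-b']
print(model_fully_down(states))            # False
```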

Debugging upstream failures

Some 502/504 failures are intermittent and hard to attribute after the fact — the upstream looks up, but a request to it gave up. To turn those cases into concrete evidence, enable Debug upstream failures for a model (its debug_diagnostics flag, in the Reliability → Delivery panel). When it is on and a request to that model exhausts its attempts with a 502/504, obleth runs a quick read-only check of the upstream — does its hostname still resolve in DNS, and is the port reachable over TCP — and records the result as a span in the request trace.

That converts an "it's up, but I got a 502" mystery into a specific finding (e.g. a DNS blip, or a port that stopped accepting connections). It is:

- Opt-in and off by default: models without it enabled behave exactly as before, with no extra checks.
- Diagnose-only: the probe never changes routing or retry behavior. It runs after the request has already failed, purely to annotate the trace.

See Observability for reading the trace span, and Model Health for proactive (pre-failure) health probing.
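The same two questions (does the hostname resolve, and does the port accept a TCP connection) can be checked by hand when triaging a 502/504. The sketch below uses only the Python standard library; it is not the probe obleth runs, and the function name, result keys, and 2-second timeout are assumptions made for illustration.

```python
import socket
from typing import Dict
from urllib.parse import urlsplit


def probe_upstream(api_base: str, timeout_secs: float = 2.0) -> Dict[str, object]:
    """Read-only check: resolve the upstream hostname, then open and close a TCP connection."""
    parts = urlsplit(api_base)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    result: Dict[str, object] = {"host": host, "port": port, "dns_ok": False, "tcp_ok": False, "error": None}
    if host is None:
        result["error"] = "no hostname in api_base"
        return result
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns_ok"] = True
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result
    try:
        # Connect only; no request is sent to the upstream.
        with socket.create_connection((host, port), timeout=timeout_secs):
            result["tcp_ok"] = True
    except OSError as exc:
        result["error"] = f"tcp: {exc}"
    return result


print(probe_upstream("http://cluster-b.internal/v1"))
```

A result with dns_ok false points at name resolution; dns_ok true with tcp_ok false points at the listener or the network path to it.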

Observability

Every upstream attempt increments obleth_upstream_attempts_total{outcome=...}:

| `outcome` | Meaning |
| --- | --- |
| `success` | Attempt returned a usable (non-retryable) response |
| `retry` | Retryable failure with attempts remaining on the same endpoint |
| `timeout` | Attempt hit the per-request timeout |
| `failover` | Endpoint exhausted; moved to the next endpoint |
| `exhausted` | All endpoints and attempts failed; request returned `502`/`504` |

# Failover rate (how often the primary endpoint isn't serving)
rate(obleth_upstream_attempts_total{outcome="failover"}[5m])

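# Retry amplification (retries per successful attempt).
# Both sides are aggregated with sum() because they carry different
# outcome labels and would otherwise not match in the division.
sum(rate(obleth_upstream_attempts_total{outcome="retry"}[5m]))
  / sum(rate(obleth_upstream_attempts_total{outcome="success"}[5m]))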

Relationship to cross-model fallback

This feature fails a model over between endpoints of the same model — same upstream_model, same capabilities, same cost. It deliberately does not swap to a different model, which would change pricing, capabilities, and capacity accounting. Routing a request to an entirely different model on failure (a cross-model fallback chain) is a separate, future capability.
