Reliability & Failover

Per-request upstream timeouts, automatic retries with exponential backoff, and routing a single model across multiple upstream clusters with failover or weighted load-balancing.

obleth can keep a model serving when a single upstream call is slow, fails transiently, or a whole cluster goes down. Three controls work together, all configured per model:

  • a per-request timeout that bounds how long any one attempt may run;
  • automatic retries with exponential backoff for transient failures; and
  • multiple endpoints — the same model fronted by several upstream clusters, selected by failover (priority order) or load_balance (weighted).

All three are opt-in and additive: a model with none configured behaves exactly as before — one attempt against its api_base, bounded by the global timeout.

Per-request timeout

Every upstream attempt is wrapped in a timeout. By default it inherits the global OBLETH_UPSTREAM_TIMEOUT_SECS (300s, covering the full stream). Set a per-model request_timeout_secs to override it for slow or latency-sensitive routes; leave it null to use the global default.

curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/reliability" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "request_timeout_secs": 60,
    "max_retries": 2,
    "retry_backoff_ms": 200,
    "endpoint_selection_mode": "failover"
  }'

When an attempt exceeds the timeout, it counts as a retryable failure: obleth retries (if attempts remain) or fails over to the next endpoint. If every attempt times out, the client receives 504 Gateway Timeout.
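
The override rule above (a per-model value wins, null inherits the global default) can be sketched as follows. This is a minimal illustration, not obleth's implementation: the function name effective_timeout_secs and the env parameter are hypothetical, and only OBLETH_UPSTREAM_TIMEOUT_SECS and its 300s default come from this page.

```python
import os


def effective_timeout_secs(request_timeout_secs, env=os.environ):
    """Resolve the timeout applied to one upstream attempt.

    request_timeout_secs is the per-model setting (None when left null).
    """
    if request_timeout_secs is not None:
        # A per-model override takes precedence over the global value.
        return float(request_timeout_secs)
    # Otherwise inherit the global setting, which defaults to 300 seconds.
    return float(env.get("OBLETH_UPSTREAM_TIMEOUT_SECS", "300"))


print(effective_timeout_secs(60, env={}))    # 60.0
print(effective_timeout_secs(None, env={}))  # 300.0
```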

Retries and backoff

max_retries (default 0) is the number of additional attempts against the same endpoint after the first one fails with a retryable error. retry_backoff_ms (default 200) is the base delay; it grows exponentially per attempt (base, base×2, base×4, …) and is capped at base×64.

attempt 0 → fail (retryable) → wait 200ms
attempt 1 → fail (retryable) → wait 400ms
attempt 2 → success
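
The schedule above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper name backoff_delays_ms is hypothetical, and it applies no jitter, since this page does not say whether obleth adds any.

```python
def backoff_delays_ms(max_retries: int, base_ms: int = 200) -> list[int]:
    """Delay before each retry: base, base*2, base*4, ..., capped at base*64."""
    cap_ms = base_ms * 64
    return [min(base_ms * (2 ** attempt), cap_ms) for attempt in range(max_retries)]


print(backoff_delays_ms(2))       # [200, 400]
print(backoff_delays_ms(8)[-2:])  # [12800, 12800]  (the cap is reached)
```

With the default base of 200 ms the cap works out to 12.8 s, so a large max_retries adds at most that much wait per extra attempt.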

What counts as retryable

Only transient transport or overload conditions are retried or failed over. Client errors are returned immediately so a bad request isn't amplified into several upstream calls.

Outcome                                        Classification
Connection / DNS / transport error             Retryable
Timeout (per-request)                          Retryable
HTTP 408, 429                                  Retryable
HTTP 500, 502, 503, 504                        Retryable
HTTP 400, 401, 403, 404, 422 (any other 4xx)   Fatal (returned as-is)
HTTP 2xx                                       Success
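
As an illustration, the HTTP rows of the table can be expressed as a small classifier. The function name classify_status is hypothetical, and the "unspecified" result is an assumption for status codes the table does not list (for example 3xx or 501); this page does not say how obleth treats them.

```python
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def classify_status(status: int) -> str:
    """Classify an upstream HTTP status according to the table above.

    Transport errors and timeouts carry no status code; per the table they
    are retryable and would be handled before this function is reached.
    """
    if 200 <= status < 300:
        return "success"
    if status in RETRYABLE_STATUSES:
        return "retryable"
    if 400 <= status < 500:
        # Any other 4xx is a client error and is returned as-is.
        return "fatal"
    # Assumption: codes the table does not cover are left undecided here.
    return "unspecified"


for code in (200, 429, 404, 503):
    print(code, classify_status(code))
```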

Point of no return

Retries and failover only happen before the first response byte is streamed to the client. Once obleth has started forwarding the upstream response, the request is committed to that endpoint — a mid-stream failure surfaces to the client rather than silently restarting. This preserves correct streaming (SSE) semantics and avoids duplicate side effects.

Retries replay the JSON request body. Multipart routes (audio transcription/translation) are not replayable, so they always get a single attempt against the first endpoint — no retry, no failover.

A single fairshare permit and a single token-budget reservation cover the whole attempt sequence, so retries and failover never double-charge a tenant.

Multiple endpoints per model

A model can front several upstream clusters that all serve the same upstream_model — for example a primary and a standby cluster, or two regions behind one client-facing name. Each endpoint carries its own api_base, optional api_key, priority, weight, and enabled flag, plus its own health state.

When a model has at least one eligible endpoint (enabled and not marked unhealthy), the data plane routes across the eligible set. When it has none, it falls back to the model's own api_base / api_key (the legacy single-upstream path), so adding endpoints is always optional.

Selection modes

endpoint_selection_mode controls ordering of the eligible endpoints:

Mode                Behavior
failover (default)  Endpoints are tried in ascending priority (lowest number first). Traffic sticks to the highest-priority healthy endpoint; lower ones are standbys used only when those ahead fail.
load_balance        Endpoints are ordered by a weighted random shuffle (A-Res) so the first choice is picked in proportion to weight. The remaining endpoints stay available as failover targets for the same request.
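
The two orderings can be sketched in Python as follows. This is an illustration, not obleth's code: the function name order_endpoints, the dict layout, and the cluster-a endpoint are hypothetical, and sending non-positive weights to the back of the list is an assumption. The load_balance branch uses the standard A-Res key (a uniform draw raised to the power 1/weight, sorted largest first).

```python
import random


def order_endpoints(endpoints, mode="failover", rng=random):
    """Return endpoints in the order a request would try them."""
    if mode == "failover":
        # Ascending priority: lowest number first.
        return sorted(endpoints, key=lambda e: e["priority"])
    if mode == "load_balance":
        # A-Res: draw u in [0, 1) per endpoint and sort by u ** (1 / weight),
        # largest key first, so first place is won in proportion to weight.
        def key(e):
            w = e["weight"]
            # Assumption: non-positive weights sort last instead of erroring.
            return rng.random() ** (1.0 / w) if w > 0 else -1.0
        return sorted(endpoints, key=key, reverse=True)
    raise ValueError(f"unknown endpoint_selection_mode: {mode!r}")


endpoints = [
    {"name": "cluster-b", "priority": 200, "weight": 100},
    {"name": "cluster-a", "priority": 100, "weight": 300},
]
print([e["name"] for e in order_endpoints(endpoints, "failover")])
# ['cluster-a', 'cluster-b']
```

With weights 300 and 100, cluster-a would be the first choice for roughly three requests in four under load_balance, while cluster-b stays in the list as the failover target.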

In both modes a single request walks the ordered list: it tries an endpoint (with its retry budget), and on exhaustion fails over to the next, until one succeeds or the list is exhausted (502 Bad Gateway, or 504 if the last failure was a timeout).
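
The walk described above can be sketched as a nested loop. This is a simplified model under stated assumptions: walk_endpoints and the send callable are hypothetical, send is assumed to report one of "success", "fatal", "retryable", or "timeout", and the sketch covers only the phase before the first response byte is streamed to the client.

```python
import time


def walk_endpoints(ordered, send, max_retries=0, backoff_ms=200, sleep=time.sleep):
    """Try each endpoint in order, retrying within its budget before failing over.

    Returns (endpoint, outcome) on success or a fatal client error,
    otherwise (None, 502) or (None, 504) once the list is exhausted.
    """
    last_failure = None
    for endpoint in ordered:
        for attempt in range(max_retries + 1):
            outcome = send(endpoint)
            if outcome in ("success", "fatal"):
                # Success is served; a fatal client error is returned as-is.
                return endpoint, outcome
            last_failure = outcome
            if attempt < max_retries:
                # Exponential backoff, capped at base x 64.
                delay_ms = min(backoff_ms * 2 ** attempt, backoff_ms * 64)
                sleep(delay_ms / 1000)
        # Retry budget exhausted for this endpoint: fail over to the next one.
    status = 504 if last_failure == "timeout" else 502
    return None, status


# Usage: the primary keeps timing out, the standby answers.
replies = {"primary": "timeout", "standby": "success"}
result = walk_endpoints(["primary", "standby"], replies.get, max_retries=1,
                        sleep=lambda secs: None)
print(result)  # ('standby', 'success')
```

The worst-case number of upstream calls for one request in this model is the number of eligible endpoints multiplied by (max_retries + 1), which is worth keeping in mind when sizing both values.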

Managing endpoints

From the dashboard, expand a model on the Models page and use the Reliability & endpoints panel to set the timeout/retry policy, choose the selection mode, and add, enable/disable, or remove endpoints.

From the Management API:

# List endpoints for a model
curl "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"

# Add a second cluster (priority 200 = standby behind the primary at 100)
curl -X POST "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cluster-b",
    "api_base": "http://cluster-b.internal/v1",
    "api_key": "sk_cluster_b",
    "priority": 200,
    "weight": 100,
    "enabled": true
  }'

# Update (omit api_key to keep the stored secret; send "" to clear it)
curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cluster-b", "api_base": "http://cluster-b.internal/v1", "priority": 150, "weight": 100, "enabled": false}'

# Remove
curl -X DELETE "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/endpoints/$ENDPOINT_ID" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"

Endpoint api_base URLs are SSRF-validated like any admin-registered upstream (see OBLETH_ALLOWED_PRIVATE_CIDRS). Endpoint secrets are encrypted at rest when OBLETH_ENCRYPTION_KEY is set. Changes propagate to every gateway pod immediately via the Redis invalidation channel.

Per-endpoint health

The model health worker probes each enabled endpoint independently, using the same token-free GET {api_base}/models liveness check it uses for the model itself, and records a per-endpoint health_status. Only endpoints that are enabled and not explicitly unhealthy are eligible for routing. An endpoint whose probe confirms it is down is removed from rotation until it recovers, and an endpoint that an operator disables stays out until it is re-enabled; unknown and degraded states soft-pass and remain eligible. The model is reported fully down only when all of its endpoints are unhealthy. See Model Health.
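
The eligibility rule can be sketched as a filter. The function names, the dict layout, and the literal health_status strings ("healthy", "degraded", "unknown", "unhealthy") are assumptions made for illustration; this page names the states but not their exact encoding. Counting only enabled endpoints toward the "fully down" check is likewise an assumption.

```python
def eligible_endpoints(endpoints):
    """Keep endpoints that are enabled and not explicitly unhealthy."""
    return [
        e for e in endpoints
        if e["enabled"] and e.get("health_status", "unknown") != "unhealthy"
    ]


def model_fully_down(endpoints):
    """True only when every probed (enabled) endpoint is unhealthy."""
    probed = [e for e in endpoints if e["enabled"]]
    return bool(probed) and all(e.get("health_status") == "unhealthy" for e in probed)


fleet = [
    {"name": "primary", "enabled": True, "health_status": "unhealthy"},
    {"name": "standby", "enabled": True, "health_status": "degraded"},
    {"name": "parked", "enabled": False, "health_status": "healthy"},
]
print([e["name"] for e in eligible_endpoints(fleet)])  # ['standby']
print(model_fully_down(fleet))                         # False
```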

Observability

Every upstream attempt increments obleth_upstream_attempts_total{outcome=...}:

outcome    Meaning
success    Attempt returned a usable (non-retryable) response
retry      Retryable failure with attempts remaining on the same endpoint
timeout    Attempt hit the per-request timeout
failover   Endpoint exhausted; moved to the next endpoint
exhausted  All endpoints and attempts failed; request returned 502/504

# Failover rate (how often the primary endpoint isn't serving)
rate(obleth_upstream_attempts_total{outcome="failover"}[5m])

# Retry amplification
rate(obleth_upstream_attempts_total{outcome="retry"}[5m])
  / rate(obleth_upstream_attempts_total{outcome="success"}[5m])

Relationship to cross-model fallback

This feature fails a model over between endpoints of the same model — same upstream_model, same capabilities, same cost. It deliberately does not swap to a different model, which would change pricing, capabilities, and capacity accounting. Routing a request to an entirely different model on failure (a cross-model fallback chain) is a separate, future capability.