Reliability & Fail-open

How obleth keeps serving when Redis or ClickHouse is unavailable: the moka fallback cache, the telemetry WAL, and the fail-open vs fail-closed tradeoff.

One of obleth's core design goals is that a Redis or ClickHouse outage should not take down inference traffic. This is the fail-open behavior.

Three caching layers

Key resolution on the data plane is multi-layered:

Bearer token → SHA-256 hash
  → moka (in-process, TTL=5min, cap=100k)  ← zero network latency
  → Redis (sub-ms, shared across pods)       ← shared, always fresh
  → (Postgres on cache miss, never directly from hot path)

The in-process moka cache means that even a total Redis outage doesn't break auth lookups for keys that were recently resolved. Keys stay valid in moka for 5 minutes after their last resolution.
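The lookup order above can be sketched with the standard library alone. This is a simplified illustration, not obleth's code: LocalCache is a hypothetical stand-in for moka (a HashMap with a per-entry TTL), SharedStore is a hypothetical trait standing in for Redis, and InMemoryStore is a test double whose available flag simulates an outage.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical record resolved for a hashed API key.
#[derive(Clone, Debug, PartialEq)]
pub struct KeyRecord {
    pub tenant: String,
}

/// Stand-in for the shared Redis layer; `Err` models an unreachable Redis.
pub trait SharedStore {
    fn get(&self, key_hash: &str) -> Result<Option<KeyRecord>, String>;
}

/// Simplified in-process cache with a per-entry TTL (the role moka plays).
pub struct LocalCache {
    ttl: Duration,
    entries: HashMap<String, (KeyRecord, Instant)>,
}

impl LocalCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the record only if it was inserted less than `ttl` ago.
    pub fn get(&self, key_hash: &str) -> Option<KeyRecord> {
        match self.entries.get(key_hash) {
            Some((rec, inserted)) if inserted.elapsed() < self.ttl => Some(rec.clone()),
            _ => None,
        }
    }

    pub fn insert(&mut self, key_hash: &str, rec: KeyRecord) {
        self.entries.insert(key_hash.to_string(), (rec, Instant::now()));
    }
}

/// Local cache first, shared store second; a shared-store hit warms the local cache.
pub fn resolve(
    local: &mut LocalCache,
    shared: &dyn SharedStore,
    key_hash: &str,
) -> Result<Option<KeyRecord>, String> {
    if let Some(rec) = local.get(key_hash) {
        return Ok(Some(rec)); // served without any network call
    }
    let found = shared.get(key_hash)?;
    if let Some(rec) = &found {
        local.insert(key_hash, rec.clone());
    }
    Ok(found)
}

/// Test double: an in-memory store that can be switched to "unreachable".
pub struct InMemoryStore {
    pub available: bool,
    pub data: HashMap<String, KeyRecord>,
}

impl SharedStore for InMemoryStore {
    fn get(&self, key_hash: &str) -> Result<Option<KeyRecord>, String> {
        if !self.available {
            return Err("shared store unreachable".to_string());
        }
        Ok(self.data.get(key_hash).cloned())
    }
}
```

In this sketch, a key that was resolved once keeps resolving from the local cache after the store goes down, while a key that was never resolved surfaces the store error to the caller.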

The moka cache is kept up-to-date via Redis pub/sub invalidation: when a key is created, disabled, or has its tenant's weight changed via the Management API, a message is published to obleth:invalidate. Each pod's moka evicts the affected key on receipt.
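As a rough sketch of the eviction side of that flow, the function below drains pending invalidation messages and removes the named keys from a local map. It assumes a std::sync::mpsc channel as a stand-in for the Redis subscription and a plain HashMap as a stand-in for moka; the message payload (a bare key hash) and the function name apply_invalidations are illustrative assumptions, not obleth's wire format or API.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

/// Drain the invalidation messages that are already queued and evict the
/// named keys. Returns how many cache entries were actually removed.
pub fn apply_invalidations(
    cache: &mut HashMap<String, String>,
    rx: &Receiver<String>,
) -> usize {
    let mut evicted = 0;
    // try_iter yields queued messages without blocking when the queue is empty.
    for key_hash in rx.try_iter() {
        if cache.remove(&key_hash).is_some() {
            evicted += 1;
        }
    }
    evicted
}
```

Evicting rather than updating in place keeps the handler simple: the next request for the key misses locally and picks up the fresh value from the shared layer.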

Fail-open vs fail-closed

OBLETH_FAIL_OPEN (default: true) controls what obleth does when Redis is unreachable and the moka cache doesn't have the key:

Mode                        Behavior on Redis failure
fail_open=true (default)    Serve from moka cache. If not in moka, log a warning and continue (no budget enforcement).
fail_open=false             Return 503 Service Unavailable for any request that requires a Redis lookup.

For production deployments handling real customer traffic, fail-open is usually the safer default: a Redis blip doesn't cause a customer-visible outage. The cost is that token budget enforcement is bypassed until Redis recovers, while requests continue to be served.

For strict billing scenarios (e.g. metered API products where over-spend must never happen), set OBLETH_FAIL_OPEN=false to reject rather than over-serve.

OBLETH_FAIL_OPEN=true
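The two modes reduce to a small decision function. The sketch below encodes only the behavior described in this section for the case where Redis is unreachable; the Decision enum and on_redis_failure are hypothetical names chosen for illustration.

```rust
/// What happens to a request while Redis is unreachable.
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// The key was already in the in-process cache, so no Redis lookup is needed.
    ServeFromCache,
    /// Fail-open: log a warning and serve without budget enforcement.
    ServeWithoutBudget,
    /// Fail-closed: reject with the given HTTP status.
    Reject(u16),
}

/// Decide how to handle a request during a Redis outage.
/// `in_moka` is whether the key is in the local cache; `fail_open`
/// mirrors the OBLETH_FAIL_OPEN setting.
pub fn on_redis_failure(in_moka: bool, fail_open: bool) -> Decision {
    match (in_moka, fail_open) {
        (true, _) => Decision::ServeFromCache,
        (false, true) => Decision::ServeWithoutBudget,
        (false, false) => Decision::Reject(503),
    }
}
```

Note that the setting only changes the outcome for requests that miss the local cache; cached keys are served in both modes.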

Telemetry WAL

ClickHouse is never on the request hot path. Usage records are sent to a bounded async channel (mpsc::Sender), and a background task batches them into ClickHouse every second.
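A minimal sketch of that hand-off, using the standard library's bounded sync_channel in place of whatever async channel obleth actually uses. The UsageRecord fields and the function names are assumptions for illustration, and what obleth does when the channel is full is not specified here; this sketch simply reports the failure to the caller.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical usage record; the real schema is not shown in this document.
#[derive(Debug, Clone, PartialEq)]
pub struct UsageRecord {
    pub key_hash: String,
    pub tokens: u64,
}

/// Create the bounded telemetry channel.
pub fn telemetry_channel(capacity: usize) -> (SyncSender<UsageRecord>, Receiver<UsageRecord>) {
    sync_channel(capacity)
}

/// Enqueue without blocking the request path.
/// Returns false if the channel is full or the flusher has gone away.
pub fn enqueue(tx: &SyncSender<UsageRecord>, rec: UsageRecord) -> bool {
    match tx.try_send(rec) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Called by the background flusher on each tick: collect up to `max`
/// queued records into one batch without blocking.
pub fn drain_batch(rx: &Receiver<UsageRecord>, max: usize) -> Vec<UsageRecord> {
    rx.try_iter().take(max).collect()
}
```

The bound is what keeps a slow or unavailable sink from growing memory without limit; the request path only ever pays for a non-blocking send.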

If ClickHouse is unavailable:

  1. The background flusher catches the error.
  2. If fail_open=true, each failed batch is appended to a local write-ahead log at OBLETH_WAL_PATH (default: ./obleth-telemetry.wal).
  3. Once ClickHouse recovers, the WAL replayer reads the file and re-inserts the records.
  4. The WAL is then truncated.

With fail_open=true, this means no usage records are dropped during a ClickHouse outage as long as the pod has local disk space for the WAL.
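The append, replay, and truncate steps above can be sketched as follows. The on-disk format is an assumption (one already-serialized record per line), as are the function names; obleth's actual WAL format is not described in this document.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

/// Append one failed batch to the WAL, one record per line.
pub fn append_batch(wal: &Path, batch: &[String]) -> io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(wal)?;
    for rec in batch {
        writeln!(f, "{}", rec)?;
    }
    // Make sure the data reached disk before reporting success.
    f.sync_all()
}

/// Replay every WAL record through `insert`, then truncate the file.
/// If any insert fails, the WAL is left untouched so it can be retried.
pub fn replay<F>(wal: &Path, mut insert: F) -> io::Result<usize>
where
    F: FnMut(&str) -> io::Result<()>,
{
    if !wal.exists() {
        return Ok(0);
    }
    let mut replayed = 0;
    for line in BufReader::new(File::open(wal)?).lines() {
        insert(&line?)?;
        replayed += 1;
    }
    // All records were re-inserted: truncate the WAL to zero length.
    OpenOptions::new().write(true).truncate(true).open(wal)?;
    Ok(replayed)
}
```

Because this sketch truncates only after a fully successful replay, a replay that fails midway will re-send the earlier records on the next attempt. That is at-least-once delivery, so the sink would need to tolerate or deduplicate repeats.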

# Where to spill telemetry when ClickHouse is unavailable
OBLETH_WAL_PATH=/var/lib/obleth/telemetry.wal

Monitor telemetry loss via Prometheus: obleth_telemetry_dropped counts records dropped when even the WAL write fails (e.g. disk full).

What obleth does NOT protect against

  • Total Postgres outage at startup: obleth won't boot if it can't connect to Postgres and run migrations. Once running, Postgres is off the hot path and its outage doesn't affect traffic.
  • Redis outage with fail_open=false: all requests that miss the moka cache return 503.
  • Disk full with fail_open=true: WAL writes fail, telemetry is dropped, and obleth_telemetry_dropped increments.
  • Pod restart with empty moka cache: after a restart, moka is cold. The first request for each key hits Redis (which should be up). If Redis is also down and fail_open is true, the request is served without budget enforcement.