How obleth keeps serving when Redis or ClickHouse is unavailable: the moka fallback cache, the telemetry WAL, and the fail-open vs fail-closed tradeoff.
One of obleth's core design goals is that a Redis or ClickHouse outage should not take down inference traffic. This is the fail-open behavior.
Key resolution on the data plane is multi-layered:
Bearer token → SHA-256 hash
→ moka (in-process, TTL=5min, cap=100k) ← zero network latency
→ Redis (sub-ms, shared across pods) ← shared, always fresh
→ (Postgres on cache miss, never directly from hot path)
The in-process moka cache means that even a total Redis outage doesn't break auth lookups for keys that were recently resolved. Keys stay valid in moka for 5 minutes after their last resolution.
The moka cache is kept up-to-date via Redis pub/sub invalidation: when a key is created, disabled, or has its tenant's weight changed via the Management API, a message is published to obleth:invalidate. Each pod's moka evicts the affected key on receipt.
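The layered lookup and eviction-on-invalidation described above can be sketched with the standard library alone. This is an illustration, not obleth's code: `LocalCache` stands in for moka, the `shared` closure stands in for the Redis lookup, and `KeyInfo`, `Unreachable`, and `resolve` are hypothetical names. The Postgres layer and the capacity limit are left out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical resolved-key record; the real fields are not documented here.
#[derive(Clone, Debug, PartialEq)]
pub struct KeyInfo {
    pub tenant: String,
    pub enabled: bool,
}

/// Hypothetical error meaning "the shared store could not be reached".
pub struct Unreachable;

/// Minimal stand-in for the in-process cache layer (moka in obleth).
pub struct LocalCache {
    ttl: Duration,
    entries: HashMap<String, (KeyInfo, Instant)>,
}

impl LocalCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the entry only if it was inserted less than `ttl` ago.
    pub fn get(&self, key_hash: &str, now: Instant) -> Option<KeyInfo> {
        match self.entries.get(key_hash) {
            Some((info, inserted)) if now.duration_since(*inserted) < self.ttl => {
                Some(info.clone())
            }
            _ => None,
        }
    }

    pub fn insert(&mut self, key_hash: &str, info: KeyInfo, now: Instant) {
        self.entries.insert(key_hash.to_string(), (info, now));
    }

    /// Called when an invalidation message names this key.
    pub fn invalidate(&mut self, key_hash: &str) {
        self.entries.remove(key_hash);
    }
}

/// Layered lookup: local cache first, then the shared store.
/// A successful shared lookup repopulates the local cache.
pub fn resolve<F>(
    cache: &mut LocalCache,
    key_hash: &str,
    now: Instant,
    shared: F,
) -> Option<KeyInfo>
where
    F: FnOnce(&str) -> Result<Option<KeyInfo>, Unreachable>,
{
    if let Some(hit) = cache.get(key_hash, now) {
        return Some(hit);
    }
    match shared(key_hash) {
        Ok(Some(info)) => {
            cache.insert(key_hash, info.clone(), now);
            Some(info)
        }
        // Unknown key, or shared store unreachable: nothing to return here.
        _ => None,
    }
}
```

In this sketch, a key resolved once keeps resolving from `LocalCache` while the shared store returns `Err(Unreachable)`, until either the TTL elapses or `invalidate` removes it.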
OBLETH_FAIL_OPEN (default: true) controls what obleth does when Redis is unreachable and the moka cache doesn't have the key:
| Mode | Behavior on Redis failure |
|---|---|
| fail_open=true (default) | Serve from moka cache. If not in moka, log a warning and continue (no budget enforcement). |
| fail_open=false | Return 503 Service Unavailable for any request that requires a Redis lookup. |
For production deployments handling real customer traffic, fail-open is the safer default — a Redis blip doesn't cause a customer-visible outage. Token budget enforcement is bypassed temporarily, but requests are still served.
For strict billing scenarios (e.g. metered API products where over-spend must never happen), set OBLETH_FAIL_OPEN=false to reject rather than over-serve.
OBLETH_FAIL_OPEN=true
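The two modes can be read as a small decision function. The sketch below is an illustration under assumptions: the `Decision` enum, the `decide` function, and its boolean inputs are hypothetical names, and it treats a request answered from the local cache as one that does not require a Redis lookup.

```rust
/// Hypothetical outcome of handling one request.
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Normal path: served with all checks applied, or from the local cache.
    Serve,
    /// Fail-open path: served with a warning logged and no budget enforcement.
    ServeDegraded,
    /// Fail-closed path: rejected with the given HTTP status.
    Reject(u16),
}

/// Decide what to do with a request given the state of Redis.
/// `requires_redis_lookup` is false when the local cache already has the key.
pub fn decide(fail_open: bool, redis_reachable: bool, requires_redis_lookup: bool) -> Decision {
    if redis_reachable || !requires_redis_lookup {
        return Decision::Serve;
    }
    if fail_open {
        // Availability over strictness: over-serving is possible.
        Decision::ServeDegraded
    } else {
        // Strictness over availability: 503 Service Unavailable.
        Decision::Reject(503)
    }
}
```

The flag only matters in the last branch, where Redis is down and the request cannot be answered locally; every other request is handled the same way in both modes.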
ClickHouse is never on the request hot path. Usage records are sent to a bounded async channel (mpsc::Sender), and a background task batches them into ClickHouse every second.
If ClickHouse is unavailable and fail_open=true, each failed batch is appended to a local write-ahead log at OBLETH_WAL_PATH (default: ./obleth-telemetry.wal). No usage records are dropped during a ClickHouse outage as long as the pod has local disk space.
# Where to spill telemetry when ClickHouse is unavailable
OBLETH_WAL_PATH=/var/lib/obleth/telemetry.wal
Monitor the WAL via Prometheus: obleth_telemetry_dropped counts records dropped when even the WAL write fails (e.g. disk full). Any increment of obleth_telemetry_dropped means usage records have been lost.
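The spill-then-drop order for a failed batch can be sketched as follows. This is a minimal illustration, not obleth's implementation: the `send` closure stands in for the ClickHouse insert, `TELEMETRY_DROPPED` stands in for the obleth_telemetry_dropped counter, and the one-record-per-line WAL format is an assumption.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the `obleth_telemetry_dropped` Prometheus counter.
pub static TELEMETRY_DROPPED: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
pub enum FlushOutcome {
    Sent,
    SpilledToWal,
    Dropped,
}

/// Append each record as one line (assumed format) and sync to disk.
fn append_to_wal(path: &Path, batch: &[String]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    for record in batch {
        writeln!(file, "{}", record)?;
    }
    file.sync_data()
}

/// Try the sink first; on failure spill to the WAL; if the WAL write also
/// fails, count every record in the batch as dropped.
pub fn flush_batch<S>(batch: &[String], wal_path: &Path, send: S) -> FlushOutcome
where
    S: FnOnce(&[String]) -> Result<(), String>,
{
    if send(batch).is_ok() {
        return FlushOutcome::Sent;
    }
    match append_to_wal(wal_path, batch) {
        Ok(()) => FlushOutcome::SpilledToWal,
        Err(_) => {
            TELEMETRY_DROPPED.fetch_add(batch.len() as u64, Ordering::Relaxed);
            FlushOutcome::Dropped
        }
    }
}
```

In this sketch the dropped counter only moves when both the sink and the WAL write fail, which is why any increase is worth alerting on.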