Response Cache

obleth can serve identical requests from an exact-match response cache instead of hitting the upstream. A cache hit skips admission, the token budget, and the upstream entirely: it returns immediately and costs nothing against fairshare or quota. This is the cleanest form of offload, because requests are removed from the backend rather than just smoothed.

The cache is opt-in per model and off by default.

How it works

auth → resolve model → cache check ─ hit ─→ return cached response (no permit, no budget)
                                  └ miss ─→ fairshare admit → reserve budget → upstream → store on success
  • Cache key = sha256(model_name + request_body). Two requests map to the same key, and so produce a hit, only when the client-facing model name and the full request body are byte-identical.
  • Storage: the full upstream response (body + content-type + token counts) is buffered on a miss and written to Redis with the model's TTL. Streaming (SSE) responses are buffered and replayed verbatim on a hit, so streaming clients keep working.
  • Safety: responses larger than 512 KiB are streamed through uncached, so the cache can't be used to balloon Redis memory. Brownout-degraded responses are never cached, so a capped response can't poison later reads made when the system is no longer saturated. Only 200 OK responses are stored.
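
The key and storage rules above can be sketched in Python as follows. This is a minimal illustration, not obleth's actual implementation: the function names and the `degraded` flag are hypothetical, and joining the model name and body as raw bytes with no separator is an assumption (the rule above only says sha256(model_name + request_body)).

```python
import hashlib

# 512 KiB ceiling from the Safety rule; larger responses are not stored.
MAX_CACHEABLE_BYTES = 512 * 1024


def cache_key(model_name: str, request_body: bytes) -> str:
    """Hex SHA-256 over the model name followed by the raw request body.

    Assumption: the two parts are concatenated as bytes with no separator.
    """
    return hashlib.sha256(model_name.encode("utf-8") + request_body).hexdigest()


def is_storable(status: int, body: bytes, degraded: bool) -> bool:
    """Apply the three storage rules: 200 OK only, not brownout-degraded,
    and no larger than 512 KiB."""
    return status == 200 and not degraded and len(body) <= MAX_CACHEABLE_BYTES
```

Because the key covers the model name as well as the body, the same body sent to two different client-facing models yields two separate entries.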

Enabling it

Per model, from the dashboard (Models → Cache toggle) or the Management API:

curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/cache" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cache_enabled": true, "cache_ttl_secs": 300}'
  • cache_enabled — turn the cache on/off for this model.
  • cache_ttl_secs — entry lifetime in seconds (0 = no expiry).

Changes propagate to every gateway pod immediately via the Redis invalidation channel.

Visibility

Each request records a cache_status of hit, miss, or off in the ClickHouse usage ledger.

Dashboard: the Models page shows a cache panel with 24h hit rate, hits, misses, and tokens saved.

Management API:

curl "${OBLETH_ADMIN_BASE_URL}/api/v1/usage/cache?since_ms=..." \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"
# { "hits": 1234, "misses": 5678, "tokens_saved": 987654 }
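
As a rough sketch, the hit rate can be derived from that response like this. The helper name is hypothetical, and it assumes that requests with cache_status off are counted in neither hits nor misses.

```python
def cache_hit_rate(stats: dict) -> float:
    """Fraction of cache lookups that were hits; 0.0 when there were no lookups."""
    lookups = stats["hits"] + stats["misses"]
    return stats["hits"] / lookups if lookups else 0.0


# Sample payload copied from the example response above.
sample = {"hits": 1234, "misses": 5678, "tokens_saved": 987654}
print(f"{cache_hit_rate(sample):.1%}")  # 17.9%
```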

Prometheus:

  • obleth_cache_lookups_total{result="hit|miss"}
  • obleth_cache_tokens_saved_total

Verifying real offload

The mock backend exposes GET /stats with a true request counter. Run a load test twice (cache off, then on) and, for each run, compare the backend's request count against the number of requests you issued. The difference is exactly the number of requests the cache absorbed.
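
The arithmetic behind that comparison, as a small sketch. The helper names and all numbers are illustrative, not measured results. It also assumes the counter is cumulative, so each run reads it before and after the load test.

```python
def backend_requests(counter_before: int, counter_after: int) -> int:
    """Requests the backend actually served during one run."""
    return counter_after - counter_before


def absorbed_by_cache(issued: int, backend_seen: int) -> int:
    """Requests that were answered without reaching the backend."""
    return issued - backend_seen


# Illustrative numbers only: 10,000 requests issued in each run.
issued = 10_000
run_off = backend_requests(0, 10_000)      # cache off: every request reaches the backend
run_on = backend_requests(10_000, 12_500)  # cache on: only misses reach the backend

print(absorbed_by_cache(issued, run_off))  # 0
print(absorbed_by_cache(issued, run_on))   # 7500
```

With the cache off the result should be zero. A non-zero value there points to something other than the cache (errors or rejected requests, for example) keeping traffic from the backend.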

Caveats

  • Exact-match only. Requests that differ by even a single byte miss. (Semantic / vector-based caching is a planned Phase 2 layer on top of this.)
  • The cache trusts that identical inputs yield equivalent outputs. Disable it for models whose responses must be unique per call (e.g. high-temperature sampling where you want variety), or keep the TTL short.
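
To illustrate the exact-match caveat: two JSON payloads with the same fields in a different order serialize to different bytes and therefore hash differently. One possible client-side mitigation, shown here as a sketch and not as an obleth feature, is to serialize request bodies canonically before sending. The `canonical_body` and `digest` helpers and the `example-model` name are hypothetical, and the key shape assumes plain concatenation.

```python
import hashlib
import json


def canonical_body(payload: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal payloads give equal bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(model_name: str, body: bytes) -> str:
    # Same shape as the documented key: sha256(model_name + request_body).
    return hashlib.sha256(model_name.encode("utf-8") + body).hexdigest()


p1 = {"prompt": "hi", "max_tokens": 16}
p2 = {"max_tokens": 16, "prompt": "hi"}  # same fields, different order

naive_match = (
    digest("example-model", json.dumps(p1).encode("utf-8"))
    == digest("example-model", json.dumps(p2).encode("utf-8"))
)
canonical_match = (
    digest("example-model", canonical_body(p1))
    == digest("example-model", canonical_body(p2))
)
print(naive_match, canonical_match)  # False True
```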