Response Cache

obleth can serve identical requests from an exact-match response cache instead of hitting the upstream. A cache hit skips admission, the token budget, and the upstream entirely: it returns immediately and costs nothing against fairshare or quota. This is the cleanest form of offload, because the requests are removed from the backend rather than just smoothed.

The cache is opt-in per model and off by default.

How it works

auth → resolve model → cache check ─ hit ─→ return cached response (no permit, no budget)
                                  └ miss ─→ fairshare admit → reserve budget → upstream → store on success
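The flow above can be sketched in a few lines. This is an illustration only, not obleth's implementation: the in-memory `_store` dict stands in for Redis, and `handle` and `call_upstream` are hypothetical names introduced here.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory stand-in for Redis: key -> (expiry or None, body).
_store: Dict[str, Tuple[Optional[float], bytes]] = {}


def handle(
    key: str,
    ttl_secs: int,
    call_upstream: Callable[[], Tuple[int, bytes]],
) -> Tuple[str, bytes]:
    """Return (cache_status, body), following the hit/miss flow."""
    entry = _store.get(key)
    if entry is not None:
        expires_at, body = entry
        if expires_at is None or time.monotonic() < expires_at:
            # Hit: no permit, no budget reservation, no upstream call.
            return "hit", body
        del _store[key]  # entry outlived its TTL

    # Miss: fairshare admission and budget reservation would happen here.
    status, body = call_upstream()
    if status == 200:  # store on success only
        expires_at = None if ttl_secs == 0 else time.monotonic() + ttl_secs
        _store[key] = (expires_at, body)
    return "miss", body
```

Calling `handle` twice with the same key and a successful upstream invokes `call_upstream` only once. A failing upstream is called every time, because non-200 responses are never stored.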

Cache key = sha256(model_name + request_body). Two requests share a key (so the second is a hit) only when the client-facing model and the full request body are byte-identical.
Storage: the full upstream response (body + content-type + token counts) is buffered on a miss and written to Redis with the model's TTL. Streaming (SSE) responses are buffered and replayed verbatim on a hit, so streaming clients keep working.
Safety: responses larger than 512 KiB are streamed through uncached (the cache can't be used to balloon Redis memory). Only 200 OK responses are stored, so an error or partial response can't poison later reads.
Tool-loop answers are never cached. A request handled by the gateway tool loop depends on live tool results (e.g. a web search), so a cached answer would be wrong by definition — obleth skips the cache entirely for it, even when the model has caching enabled.
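A minimal sketch of the key derivation and the storage rules listed above. The exact byte encoding obleth feeds into SHA-256 is not documented here, so the plain concatenation below is an assumption, and `cache_key` and `is_storable` are hypothetical helper names.

```python
import hashlib

MAX_CACHEABLE_BYTES = 512 * 1024  # the 512 KiB limit from the Safety rule


def cache_key(model_name: str, request_body: bytes) -> str:
    # Assumption: model name and body are concatenated with no separator.
    data = model_name.encode("utf-8") + request_body
    return hashlib.sha256(data).hexdigest()


def is_storable(status: int, body: bytes, used_tool_loop: bool) -> bool:
    # Store only 200 OK responses within the size limit, and never
    # answers produced by the gateway tool loop.
    return (
        status == 200
        and len(body) <= MAX_CACHEABLE_BYTES
        and not used_tool_loop
    )
```

Because the key is a hash of raw bytes, `cache_key("m", b'{"a":1}')` and `cache_key("m", b'{"a": 1}')` differ even though the two bodies are equivalent JSON.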

Enabling it

Per model, from the dashboard (Models → Cache toggle) or the Management API:

curl -X PUT "${OBLETH_ADMIN_BASE_URL}/api/v1/models/$MODEL_ID/cache" \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cache_enabled": true, "cache_ttl_secs": 300}'

cache_enabled — turn the cache on/off for this model.
cache_ttl_secs — entry lifetime in seconds (0 = no expiry).

Changes propagate to every gateway pod immediately via the Redis invalidation channel.

Visibility

Each request records a cache_status of hit, miss, or off in the ClickHouse usage ledger.

Dashboard: the Models page shows a cache panel with 24h hit rate, hits, misses, and tokens saved.

Management API:

curl "${OBLETH_ADMIN_BASE_URL}/api/v1/usage/cache?since_ms=..." \
  -H "Authorization: Bearer $OBLETH_ADMIN_TOKEN"
# { "hits": 1234, "misses": 5678, "tokens_saved": 987654 }
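The response above is enough to derive a hit rate. The sketch below assumes the rate is hits divided by total lookups (hits + misses); whether the dashboard uses exactly this definition is not stated here, and `hit_rate` is a hypothetical helper.

```python
import json


def hit_rate(usage_json: str) -> float:
    """Fraction of lookups served from cache, or 0.0 with no lookups."""
    data = json.loads(usage_json)
    lookups = data["hits"] + data["misses"]
    return data["hits"] / lookups if lookups else 0.0
```

For the sample payload shown above, this gives 1234 / 6912, or roughly 17.9%.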

Prometheus:

obleth_cache_lookups_total{result="hit|miss"}
obleth_cache_tokens_saved_total

Verifying real offload

The benchmark fixture backend exposes GET /stats with a true request counter. Run the same load test twice (cache off, then on) and compare the requests value it reports against the number of requests you issued. The difference is exactly the number of requests the cache absorbed.
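The comparison is simple arithmetic. The sketch below assumes you have already read the counter from GET /stats; `absorbed_requests` is a hypothetical helper, not part of obleth.

```python
def absorbed_requests(issued: int, backend_requests: int) -> int:
    """Requests served from cache: issued minus those the backend saw."""
    if issued < 0 or backend_requests < 0:
        raise ValueError("counts must be non-negative")
    if backend_requests > issued:
        # The backend saw more traffic than the test sent; the counter
        # was probably not reset, or something else is hitting it.
        raise ValueError("backend saw more requests than were issued")
    return issued - backend_requests
```

With the cache off, the result should be 0. With it on, a run of 1000 identical requests that reaches the backend once yields 999.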

Caveats

Exact-match only. Requests that differ by a single token, or by any other byte of the body, miss. (Semantic / vector-based caching is a planned Phase 2 layer on top of this.)
The cache trusts that identical inputs yield equivalent outputs. Disable it for models whose responses must be unique per call (e.g. high-temperature sampling where you want variety), or keep the TTL short.
