Performance

Fast by default, observable by design

Response caching eliminates redundant upstream calls. Fairshare admission prevents any tenant from saturating the cluster. Brownout degradation keeps low-priority traffic flowing instead of failing. And every decision is instrumented — Prometheus gauges, latency histograms, and optional OTLP traces keep the hot path visible without putting analytics on it.

No permit

Cache hits

Redis hits skip fairshare, budget reserve, and upstream calls

Actor

Admission

Single scheduler task owns queue and in-flight state

SSE

Streaming

Response chunks are yielded as they arrive from upstream

No-op

Tracing

Unset OTLP endpoint means logs only, with no trace exporter

Data-plane internals

What makes it fast

response_cache

Response caching

Exact-match cache backed by Redis. Cache hits skip admission, budget, and upstream entirely — zero concurrency cost. Per-model toggle with configurable TTL. Brownout-degraded responses are never cached.

Lookup: Before admission
Permit: Not acquired
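
The lookup-before-admission flow can be sketched roughly as below. This is a minimal illustration, not obleth's implementation: `cache_key`, `handle`, and `admit_and_call` are hypothetical names, a plain dict stands in for Redis, TTL handling is omitted, and the key derivation (SHA-256 over the model name plus canonical JSON) is an assumption, since the text only says the cache is exact-match.

```python
import hashlib
import json

def cache_key(model: str, body: dict) -> str:
    """Hypothetical exact-match key: hash of the model plus canonical JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "resp:" + hashlib.sha256(f"{model}\n{canonical}".encode()).hexdigest()

def handle(model, body, cache, admit_and_call):
    """Look up the cache first; only a miss goes through admission and upstream."""
    key = cache_key(model, body)
    hit = cache.get(key)
    if hit is not None:
        # Cache hit: no permit, no budget reserve, no upstream call.
        return hit, "hit"
    # admit_and_call is a stand-in for admission + budget + upstream.
    response, degraded = admit_and_call(model, body)
    if not degraded:
        # Brownout-degraded responses are never cached.
        cache[key] = response
    return response, "miss"
```

Canonicalizing the body before hashing means two requests that differ only in JSON key order share one entry; whether obleth normalizes requests this way is not stated.
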
fairshare_admission

Fairshare scheduler

A single-owner actor keeps admission deterministic. Hierarchical mode partitions capacity by fairshare group weight; weighted mode picks the queued tenant most behind its share_score.

Owner: Single actor
Permit: Held through stream
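
To make weighted mode concrete, here is a small sketch of picking the queued tenant that is furthest behind its share. How obleth actually computes `share_score` is not described here, so the score below (in-flight requests divided by weight) is an assumption, and `Tenant`, `pick_next`, and `admit` are hypothetical names. A single function mutating the state mirrors the single-owner idea: only one place decides who is admitted next.

```python
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    weight: float
    in_flight: int = 0
    queued: int = 0

def pick_next(tenants):
    """Return the queued tenant furthest behind its share, or None if nothing is queued."""
    waiting = [t for t in tenants if t.queued > 0]
    if not waiting:
        return None
    # Assumed score: occupancy relative to weight; the lowest value is most behind.
    # The name is a tie-breaker so the choice stays deterministic.
    return min(waiting, key=lambda t: (t.in_flight / t.weight, t.name))

def admit(tenants, capacity):
    """Admit queued requests one at a time until capacity is full or the queue is empty."""
    order = []
    while sum(t.in_flight for t in tenants) < capacity:
        tenant = pick_next(tenants)
        if tenant is None:
            break
        tenant.queued -= 1
        tenant.in_flight += 1
        order.append(tenant.name)
    return order
```

With weights 2 and 1 and three free slots, the heavier tenant receives two of the three admissions under this assumed score.
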
brownout_degradation

Brownout degradation

Instead of rejecting overloaded tenants, obleth caps max_tokens and admits them. Requests complete faster, free slots sooner, and the tenant still gets a response rather than a 503.

Behavior: Degrade, not reject
Result: Faster slot turnover
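
The degrade-instead-of-reject decision might look something like the sketch below. The trigger condition (a tenant at or above its share of slots) and the parameter names are assumptions made for illustration; the text only states that `max_tokens` is capped and the request is admitted.

```python
def apply_brownout(requested_max_tokens, tenant_in_flight, tenant_share, brownout_cap):
    """Return (max_tokens, degraded) for a request that will be admitted either way.

    tenant_share and brownout_cap are hypothetical configuration values.
    """
    over_share = tenant_in_flight >= tenant_share
    wants_more = requested_max_tokens is None or requested_max_tokens > brownout_cap
    if over_share and wants_more:
        # Degrade: shorten the response so the slot frees up sooner.
        return brownout_cap, True
    # Within share, or already asking for less than the cap: leave untouched.
    return requested_max_tokens, False
```

The `degraded` flag is what a caller would consult to keep the shortened response out of the response cache.
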
token_budget

Atomic rate limiting

Per-tenant token budgets enforced atomically in Redis — correct across all pods with no coordination layer. Estimated cost reserved at admission; actual cost reconciled after streaming completes.

Scope: Cross-pod consistent
Accuracy: Post-stream reconciled
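
The reserve-then-reconcile pattern can be illustrated with an in-memory model. This sketch is not how obleth talks to Redis: a `threading.Lock` stands in for the atomicity that Redis provides server-side, and the class and method names are hypothetical.

```python
import threading

class TokenBudget:
    """In-memory stand-in for a per-tenant budget counter."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()  # stands in for Redis-side atomicity

    def reserve(self, estimate):
        """Check and increment in one step so concurrent callers cannot overshoot."""
        with self._lock:
            if self.used + estimate > self.limit:
                return False
            self.used += estimate
            return True

    def reconcile(self, estimate, actual):
        """After streaming completes, replace the estimate with the real cost."""
        with self._lock:
            self.used = max(0, self.used + (actual - estimate))
```

For example, reserving 600 tokens against a limit of 1000 blocks a second 600-token request, but once the first request reconciles to an actual cost of 250, the second fits.
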

Two-tier cache

Moka + Redis

Key, model, and MCP resolution hit an in-process Moka cache first, then Redis. At startup, the gateway warms Redis from Postgres for keys and warms both Redis and Moka for models and MCP servers. Pub/sub invalidation propagates config changes across pods; the 300-second TTL is the backstop for any invalidation message a pod misses.

Key cache

100k

max entries · 300 s TTL

Model cache

10k

max entries · 300 s TTL

MCP cache

10k

max entries · 300 s TTL
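
A rough sketch of the two-tier read path follows. It is written in Python for illustration only: a dict stands in for Redis, another dict stands in for the in-process tier, and the eviction step is deliberately crude rather than a model of Moka's policy. `TwoTierCache` and its methods are hypothetical names.

```python
import time

class TwoTierCache:
    def __init__(self, remote, ttl_seconds=300.0, max_entries=10_000, clock=time.monotonic):
        self.remote = remote          # dict standing in for Redis
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self.local = {}               # key -> (value, expires_at)

    def get(self, key):
        """Return (value, tier) where tier is 'local', 'remote', or 'miss'."""
        entry = self.local.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0], "local"
        value = self.remote.get(key)
        if value is None:
            self.local.pop(key, None)
            return None, "miss"
        if key not in self.local and len(self.local) >= self.max_entries:
            # Crude eviction: drop the oldest inserted entry.
            self.local.pop(next(iter(self.local)))
        self.local[key] = (value, self.clock() + self.ttl)
        return value, "remote"

    def invalidate(self, key):
        """Would be called when a pub/sub invalidation message arrives."""
        self.local.pop(key, None)
```

Without an invalidation message, a changed value stays stale in the local tier until the TTL expires, which is why the TTL acts as a backstop rather than the primary freshness mechanism.
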

Resilience

Streaming & fail-open

SSE pass-through

Response chunks stream to the client as they arrive from the upstream. When response caching is enabled, obleth buffers successful responses up to 512 KiB so identical future requests can be replayed from Redis. The fairshare permit is held across the full stream so concurrency accounting matches real occupancy.
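
Pass-through with bounded capture can be sketched with a generator. This is an illustration under stated assumptions: `stream_and_capture` and `on_complete` are hypothetical names, and treating a response of exactly 512 KiB as cacheable is a guess about where the boundary falls.

```python
MAX_CACHEABLE_BYTES = 512 * 1024  # 512 KiB, per the text above

def stream_and_capture(upstream_chunks, on_complete, limit=MAX_CACHEABLE_BYTES):
    """Yield each chunk immediately; keep a copy only while the total fits under limit."""
    buffered = []
    size = 0
    for chunk in upstream_chunks:
        if buffered is not None:
            size += len(chunk)
            if size > limit:
                buffered = None          # too large to cache; keep streaming anyway
            else:
                buffered.append(chunk)
        yield chunk                       # the client sees the chunk without delay
    # Reached only when the upstream iterator finishes without raising.
    on_complete(b"".join(buffered) if buffered is not None else None)
```

In a real proxy, the point after the loop is also where the fairshare permit would be released, since that is when the stream actually ends.
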

Fail-open by default

With fail-open enabled, a Redis budget-check failure logs a warning and the request proceeds. If ClickHouse is unreachable after startup, usage records spill to a local write-ahead log and replay when connectivity returns. At boot, dependency connections retry with backoff, so pods that start before their dependencies are ready recover on their own.
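
The spill-and-replay behavior can be sketched as below. This is a simplified model, not obleth's write-ahead log: `UsageSink` is a hypothetical name, `send` is any callable that raises `ConnectionError` while the store is unreachable, and the log is a newline-delimited JSON file with no fsync or rotation.

```python
import json
import os

class UsageSink:
    def __init__(self, send, wal_path):
        self.send = send            # raises ConnectionError when the store is down
        self.wal_path = wal_path

    def record(self, usage: dict):
        """Try the store first; on failure, append the record to the local log."""
        try:
            self.send(usage)
        except ConnectionError:
            with open(self.wal_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(usage) + "\n")

    def replay(self):
        """Resend spilled records in order; keep any that still fail. Returns the count sent."""
        if not os.path.exists(self.wal_path):
            return 0
        with open(self.wal_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        for usage in pending:
            try:
                self.send(usage)
            except ConnectionError:
                break               # still down; stop and retry later
            sent += 1
        with open(self.wal_path, "w", encoding="utf-8") as f:
            for usage in pending[sent:]:
                f.write(json.dumps(usage) + "\n")
        return sent
```

The request path never waits on the analytics store in this model: a failed write costs one local file append.
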

Observability

See saturation before users feel it

Every data-plane pod exports Prometheus metrics on a dedicated listener. Latency histograms track TTFT and total request duration. Queue depth and in-flight gauges let you see saturation before users feel it.

Throughput spectrum

obleth_ttft_ms · obleth_total_ms

24h window

p50 region · p95 tail

Capacity utilization

22% (gauge scale: 0 to 100)

obleth_in_flight

42

Active requests on the data plane

obleth_queue_depth

7

Requests waiting on fairshare admission

Distributed tracing

OTLP spans

Set OBLETH_OTEL_ENDPOINT to an OTLP/HTTP collector (Jaeger, Tempo, or the OTel Collector). Service name is reported as obleth. Spans are emitted via a background batch exporter so tracing never blocks the proxy hot path. MCP routes emit a separate mcp_request root span tagged with the server name.

OBLETH_OTEL_ENDPOINT=http://collector:4318

Request spans

  • proxy_request: root
  • cache_lookup: optional hit
  • reserve_budget: fairshare
  • upstream_request: SSE stream
  • mcp_request: MCP routes only

Key resolution and auth run inline with no span overhead. Export is OTLP/HTTP binary to {endpoint}/v1/traces.
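
As a small illustration of the configuration rule, the sketch below derives the export URL from the environment. The function name `traces_url` is hypothetical, and stripping surrounding whitespace and a trailing slash is an assumption about input handling rather than documented behavior.

```python
import os

def traces_url(env=os.environ):
    """Return the OTLP/HTTP traces URL, or None when tracing is disabled."""
    endpoint = env.get("OBLETH_OTEL_ENDPOINT", "").strip()
    if not endpoint:
        return None  # unset endpoint: logs only, no trace exporter
    return endpoint.rstrip("/") + "/v1/traces"
```

With `OBLETH_OTEL_ENDPOINT=http://collector:4318`, spans would be posted to `http://collector:4318/v1/traces`.
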

Prove it in your cluster

The benchmark harness drives load across tenants so you can validate fairshare behavior, cache hit rates, brownout thresholds, and admission outcomes before production cutover.

Run the benchmark harness