Performance
Fast by default, observable by design
Response caching eliminates redundant upstream calls. Fairshare admission prevents any tenant from saturating the cluster. Brownout degradation keeps low-priority traffic flowing instead of failing. And every decision is instrumented — Prometheus gauges, latency histograms, and optional OTLP traces keep the hot path visible without putting analytics on it.
No permit
Cache hits
Redis hits skip fairshare, budget reserve, and upstream calls
Actor
Admission
Single scheduler task owns queue and in-flight state
SSE
Streaming
Response chunks are yielded as they arrive from upstream
No-op
Tracing
Unset OTLP endpoint means logs only, with no trace exporter
Data-plane internals
What makes it fast
Response caching
Exact-match cache backed by Redis. Cache hits skip admission, budget, and upstream entirely — zero concurrency cost. Per-model toggle with configurable TTL. Brownout-degraded responses are never cached.
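To make the exact-match idea concrete, here is a minimal sketch that derives a cache key by hashing a canonical JSON form of the request and declines to store brownout-degraded responses. The key layout, the `cache_key` and `maybe_store` names, and the in-memory `FakeStore` standing in for Redis are all assumptions for illustration, not obleth's actual implementation.

```python
import hashlib
import json
import time

def cache_key(model: str, body: dict) -> str:
    """Hypothetical key: SHA-256 over a canonical JSON encoding of the request."""
    canonical = json.dumps(
        {"model": model, "body": body}, sort_keys=True, separators=(",", ":")
    )
    return "resp:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class FakeStore:
    """In-memory stand-in for a Redis-style store with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_s, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (value, now + ttl_s)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

def maybe_store(store, key, response, ttl_s, degraded, now=None):
    """Store only full-quality responses; degraded ones are never cached."""
    if degraded:
        return False
    store.set(key, response, ttl_s, now=now)
    return True
```

Because the key is built from sorted, compact JSON, two requests that differ only in field order map to the same entry, which is what "exact match" means in this sketch.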
Fairshare scheduler
A single-owner actor keeps admission deterministic. Hierarchical mode partitions capacity by fairshare group weight; weighted mode picks the queued tenant most behind its share_score.
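One plausible reading of weighted mode can be sketched as follows. The page does not define how share_score is computed, so this example assumes a stand-in: in-flight requests divided by the tenant's weight, with the lowest ratio treated as "most behind". The `Tenant` type and `pick_next` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    weight: float   # configured share weight (assumed > 0)
    in_flight: int  # requests currently admitted
    queued: int     # requests waiting for a permit

def usage_ratio(t: Tenant) -> float:
    """Assumed stand-in for share_score: in-flight work normalized by weight."""
    return t.in_flight / t.weight

def pick_next(tenants):
    """Return the queued tenant furthest below its share, or None if no one waits."""
    waiting = [t for t in tenants if t.queued > 0]
    if not waiting:
        return None
    # Ties break on name so the choice stays deterministic.
    return min(waiting, key=lambda t: (usage_ratio(t), t.name))
```

Running the selection inside a single owner task, as described above, means no two callers can observe the same queue state and admit conflicting requests.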
Brownout degradation
Instead of rejecting overloaded tenants, obleth caps max_tokens and admits them. Requests complete faster, free slots sooner, and the tenant still gets a response rather than a 503.
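The capping step might look like the sketch below. How overload is detected and what the cap value is are not stated here, so `overloaded` and `cap` are passed in as assumed inputs, and `apply_brownout` is a hypothetical name.

```python
from typing import Optional, Tuple

def apply_brownout(
    requested: Optional[int], overloaded: bool, cap: int
) -> Tuple[Optional[int], bool]:
    """Return (effective max_tokens, degraded flag) for one request."""
    if not overloaded:
        return requested, False
    # An unset max_tokens is treated as unbounded, so it is capped as well.
    if requested is None or requested > cap:
        return cap, True
    return requested, False
```

The degraded flag is what a caller would consult to keep a capped response out of the response cache.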
Atomic rate limiting
Per-tenant token budgets enforced atomically in Redis — correct across all pods with no coordination layer. Estimated cost reserved at admission; actual cost reconciled after streaming completes.
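The reserve-then-reconcile flow can be sketched in a few lines. This is an in-process model only: a `threading.Lock` stands in for whatever makes the Redis-side operation atomic, and the `TokenBudget` class and its method names are assumptions.

```python
import threading

class TokenBudget:
    """Single-process model of a per-tenant token budget."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()  # stand-in for Redis-side atomicity

    def reserve(self, estimate: int) -> bool:
        """Check and increment in one step so concurrent callers cannot overshoot."""
        with self._lock:
            if self.used + estimate > self.limit:
                return False
            self.used += estimate
            return True

    def reconcile(self, estimate: int, actual: int) -> None:
        """After streaming completes, replace the estimate with the real cost."""
        with self._lock:
            self.used = max(self.used + actual - estimate, 0)
```

For example, a request reserved at 60 tokens that actually consumed 20 returns 40 tokens to the budget at reconcile time.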
Two-tier cache
Moka + Redis
Key, model, and MCP resolution hit an in-process Moka cache first, then Redis. At startup, the gateway warms Redis from Postgres for keys and warms both Redis and Moka for models and MCP servers. Pub/sub invalidation propagates config changes across pods; the 300 second TTL is the backstop.
Key cache
100k
max entries · 300 s TTL
Model cache
10k
max entries · 300 s TTL
MCP cache
10k
max entries · 300 s TTL
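The lookup order described above (in-process first, then the shared tier, with TTL as the backstop) can be sketched like this. `TtlCache` is a deliberately simplified stand-in for the in-process tier and evicts in insertion order, which is not how Moka chooses victims; `remote_get` is a hypothetical callable standing in for the Redis lookup.

```python
class TtlCache:
    """Tiny bounded TTL cache (assumes max_entries >= 1)."""

    def __init__(self, max_entries: int, ttl_s: float):
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expired: the TTL backstop
            return None
        return value

    def put(self, key, value, now):
        self._entries.pop(key, None)
        if len(self._entries) >= self.max_entries:
            oldest = next(iter(self._entries))  # simplistic insertion-order eviction
            del self._entries[oldest]
        self._entries[key] = (value, now + self.ttl_s)

    def invalidate(self, key):
        """What a pub/sub invalidation message would trigger on each pod."""
        self._entries.pop(key, None)

def resolve(key, local, remote_get, now):
    """Check the local tier first; on a miss, consult the shared tier and backfill."""
    value = local.get(key, now)
    if value is not None:
        return value
    value = remote_get(key)
    if value is not None:
        local.put(key, value, now)
    return value
```

With a 300 s TTL, a stale local entry survives at most five minutes even if an invalidation message is lost.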
Resilience
Streaming & fail-open
SSE pass-through
Response chunks stream to the client as they arrive from the upstream. When response caching is enabled, obleth buffers successful responses up to 512 KiB so identical future requests can be replayed from Redis. The fairshare permit is held across the full stream so concurrency accounting matches real occupancy.
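Streaming and buffering at the same time can be sketched as a generator that forwards each chunk immediately while keeping a bounded copy. The `tee_stream` name and the `on_complete` callback are hypothetical; only the 512 KiB limit comes from the text above.

```python
MAX_BUFFER_BYTES = 512 * 1024  # 512 KiB, per the limit described above

def tee_stream(chunks, on_complete, max_bytes=MAX_BUFFER_BYTES):
    """Yield each chunk as it arrives; hand the full body to on_complete if it fits."""
    buffered = []
    size = 0
    overflow = False
    for chunk in chunks:
        if not overflow:
            size += len(chunk)
            if size > max_bytes:
                overflow = True
                buffered = []  # too large to cache; release the copy
            else:
                buffered.append(chunk)
        yield chunk  # the client never waits on the buffer
    # Reached only when the upstream iterator finishes without raising.
    if not overflow:
        on_complete(b"".join(buffered))
```

If the upstream fails mid-stream, the exception propagates out of the loop and `on_complete` is never called, so partial responses are not stored in this sketch.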
Fail-open by default
With fail-open enabled, a Redis budget-check failure logs a warning and the request proceeds. If ClickHouse is unreachable after startup, usage records spill to a local write-ahead log and replay when connectivity returns. At boot, connections to dependencies are retried with backoff, so the gateway tolerates dependencies that come up after it does.
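The fail-open decision for the budget check can be sketched as below. `BudgetUnavailable` and `check_budget` are hypothetical names, and rejecting the request when fail-open is disabled is an assumption, since only the fail-open path is described here.

```python
import logging

log = logging.getLogger("gateway")

class BudgetUnavailable(Exception):
    """Hypothetical error raised when the budget store cannot be reached."""

def check_budget(check, fail_open: bool) -> bool:
    """Return True if the request may proceed."""
    try:
        return check()  # a normal answer from the store is always honored
    except BudgetUnavailable as exc:
        if fail_open:
            log.warning("budget check failed, failing open: %s", exc)
            return True
        return False  # assumed fail-closed behavior
```

Note that failing open applies only to an unreachable store: a tenant that is genuinely over budget is still denied.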
Observability
See saturation before users feel it
Every data-plane pod exports Prometheus metrics on a dedicated listener. Latency histograms track TTFT and total request duration. Queue depth and in-flight gauges let you see saturation before users feel it.
Throughput spectrum
obleth_ttft_ms · obleth_total_ms
24h window
Capacity utilization
22%
obleth_in_flight
42
Active requests on the data plane
obleth_queue_depth
7
Requests waiting on fairshare admission
Distributed tracing
OTLP spans
Set OBLETH_OTEL_ENDPOINT to an OTLP/HTTP collector (Jaeger, Tempo, or the OTel Collector). Service name is reported as obleth. Spans are emitted via a background batch exporter so tracing never blocks the proxy hot path. MCP routes emit a separate mcp_request root span tagged with the server name.
OBLETH_OTEL_ENDPOINT=http://collector:4318
Request spans
- proxy_request (root)
- cache_lookup (optional hit)
- reserve_budget (fairshare)
- upstream_request (SSE stream)
- mcp_request (MCP routes only)
Key resolution and auth run inline with no span overhead. Export is OTLP/HTTP binary to {endpoint}/v1/traces.
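How the export URL follows from the environment variable can be sketched as below. The `traces_url` helper is hypothetical, and tolerating a trailing slash on the endpoint is an assumption; the variable name and the /v1/traces suffix come from the text above.

```python
import os

def traces_url(env=None):
    """Return the OTLP/HTTP traces URL, or None when tracing is off."""
    env = os.environ if env is None else env
    endpoint = env.get("OBLETH_OTEL_ENDPOINT", "").strip()
    if not endpoint:
        return None  # unset endpoint: logs only, no trace exporter
    return endpoint.rstrip("/") + "/v1/traces"
```

With OBLETH_OTEL_ENDPOINT=http://collector:4318, this yields http://collector:4318/v1/traces.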
Prove it in your cluster
The benchmark harness drives load across tenants so you can validate fairshare behavior, cache hit rates, brownout thresholds, and admission outcomes before production cutover.
Run the benchmark harness