obleth

Performance

Fast path, visible decisions

Cache hits never touch the GPU. Fairshare holds slots for the full stream so concurrency matches real occupancy. Prometheus and optional OTLP traces record every admission class without putting analytics on the hot path.

Observability guide · Dashboard

Skip GPU

Cache repeat

Ask the same question twice? The second answer comes back without touching the model

Fair queue

Not FIFO

When the cluster is full, work waits its turn by priority — not who connected first

Full stream

Honest counts

A slot stays reserved until the response finishes — occupancy reflects real GPU time

Tune caps

From dashboard

Find a sensible concurrency limit per model and apply it from the Models page

Hot path

Four mechanisms

response_cache

Response caching

Per-model toggle with configurable TTL. A cache hit never reaches fairshare or the upstream — only successful completions are stored for replay.

Key: sha256(model+body)

Hit cost: Zero slots
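A minimal sketch of the cache rules above: key on a SHA-256 of model plus body, expire on a TTL, store only successful completions. The `ResponseCache` class, the in-memory dict, and the plain concatenation of model and body are assumptions for illustration, not the gateway's actual storage or key layout.

```python
import hashlib
import time


def cache_key(model: str, body: bytes) -> str:
    """Hash the model name together with the raw request body."""
    # Assumption: model and body are simply concatenated before hashing.
    return hashlib.sha256(model.encode("utf-8") + body).hexdigest()


class ResponseCache:
    """In-memory stand-in for a per-model response cache with a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, response bytes)

    def get(self, model: str, body: bytes):
        key = cache_key(model, body)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, model: str, body: bytes, status: int, response: bytes) -> bool:
        # Only successful completions are stored for replay.
        if not 200 <= status < 300:
            return False
        expires_at = self.clock() + self.ttl_seconds
        self._entries[cache_key(model, body)] = (expires_at, response)
        return True
```

A caller would try `get` before admission and only fall through to the scheduler on `None`, which is what keeps a hit at zero slots.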

fairshare_admission

Fairshare scheduler

One Tokio task owns all admission state — no lock races. Hierarchical mode partitions slots by group weight; weighted mode picks lowest share_score globally.

Score: served_tokens ÷ weight

Permit: Held through stream
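Weighted mode can be pictured as a single comparison: among everyone waiting, admit whoever has the lowest served-tokens-to-weight ratio. The sketch below assumes a hypothetical `Tenant` record with a positive weight; the real scheduler's data model and tie-breaking are not shown in this page.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    """Hypothetical record for one party competing for slots."""
    name: str
    weight: float  # assumed to be > 0
    served_tokens: int = 0
    waiting: int = 0  # number of queued requests

    @property
    def share_score(self) -> float:
        # Score = served_tokens / weight; a lower score is admitted first.
        return self.served_tokens / self.weight


def pick_next(tenants):
    """Return the waiting tenant with the lowest share_score, or None."""
    candidates = [t for t in tenants if t.waiting > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.share_score)
```

With weights 1 and 4, a tenant that has consumed 2,000 tokens at weight 4 scores 500 and is admitted ahead of one that has consumed 1,000 tokens at weight 1.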

capacity_autotune

Capacity autotune

From the Models page, run a short load probe against your upstream and get a recommended concurrency cap. You review the curve and apply it — nothing changes until you confirm.

Probe: Direct to upstream

Apply: Operator confirms

token_budget

Atomic TPM limits

Redis Lua scripts reserve estimated tokens at admission and reconcile after streaming. Budgets stay consistent across pods without a separate coordination layer between gateway replicas.

Refill: tokens_per_minute

Reject: 429 budget exceeded
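The reserve-then-reconcile flow is easier to see in one process. This sketch is a stand-in only: the real logic runs atomically in Redis Lua, and the fixed one-minute window, the `TokenBudget` class, and its method names are assumptions.

```python
class TokenBudget:
    """Single-process stand-in for a per-minute token budget."""

    def __init__(self, tokens_per_minute: int):
        self.tokens_per_minute = tokens_per_minute
        self.used = 0  # tokens charged in the current window

    def reserve(self, estimated: int) -> bool:
        """Charge the estimate up front; False would map to a 429 response."""
        if self.used + estimated > self.tokens_per_minute:
            return False
        self.used += estimated
        return True

    def reconcile(self, estimated: int, actual: int) -> None:
        """Replace the estimate with real usage once the stream ends."""
        self.used = max(0, self.used + (actual - estimated))

    def refill(self) -> None:
        """Start a new window (assumed fixed-window refill)."""
        self.used = 0
```

Reserving before the stream starts means an over-estimate temporarily blocks other requests, and reconciling afterwards gives the unused tokens back.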

Capacity autotune

Right-size each model

On the Models page, run a short load probe against your self-hosted upstream and see a recommended concurrency cap with a per-step latency curve. Review it, then apply with one click — nothing changes until you confirm. Managed cloud APIs stay on a fixed static cap.
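One way to turn a per-step latency curve into a suggested cap is to keep the highest concurrency whose latency stays within a multiple of the first step. This is a hypothetical heuristic for illustration; the page does not say how the recommendation is computed, and `recommend_cap` and `max_slowdown` are invented names.

```python
def recommend_cap(curve, max_slowdown=2.0):
    """Suggest a concurrency cap from probe results.

    curve: list of (concurrency, p95_latency_ms) pairs, one per probe step.
    max_slowdown: allowed latency multiple relative to the lowest step.
    """
    if not curve:
        raise ValueError("curve must contain at least one step")
    steps = sorted(curve)  # ascending by concurrency
    baseline = steps[0][1]
    cap = steps[0][0]
    for concurrency, latency_ms in steps:
        if latency_ms > baseline * max_slowdown:
            break  # latency has degraded past the allowed multiple
        cap = concurrency
    return cap
```

The function only returns a number; applying it stays a separate, explicit step, in line with the operator-confirms flow described above.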

Models page

Same place

Tune concurrency where you already manage routes — not a separate script or side tool

One minute

Quick test

A short load run against your model gives you a recommended cap and a latency curve to review

Your hardware

Self-hosted

For vLLM, Aibrix, and models you run — not metered cloud endpoints with a fixed cap

You apply

Nothing automatic

Review the suggestion, check the curve, click apply — config does not change on its own

Control plane dashboard · Autotune guide

Resolution cache

Moka + Redis

Keys, models, and MCP servers resolve through in-process moka first, then Redis. Startup warms Redis from Postgres; pub/sub invalidation propagates weight changes across pods within milliseconds.

Key cache

100k

max entries · 5 min TTL

Model cache

10k

max entries · 5 min TTL

MCP cache

10k

max entries · 5 min TTL
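The lookup order reads as: local first, shared second, and drop the local copy when an invalidation arrives. In this sketch a plain dict stands in for Redis and a bounded dict for moka; the `TwoTierCache` class and its oldest-first eviction are assumptions, not moka's real policy.

```python
import time


class TwoTierCache:
    """Local TTL cache in front of a shared store."""

    def __init__(self, shared, max_entries, ttl_seconds=300.0, clock=time.monotonic):
        self.shared = shared  # dict-like stand-in for Redis
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._local = {}  # key -> (expires_at, value), insertion-ordered

    def get(self, key):
        hit = self._local.get(key)
        if hit is not None and self.clock() < hit[0]:
            return hit[1]  # fresh local hit, no shared lookup
        self._local.pop(key, None)  # missing or expired
        value = self.shared.get(key)
        if value is not None:
            if len(self._local) >= self.max_entries:
                # Evict the oldest inserted entry (simplified policy).
                self._local.pop(next(iter(self._local)))
            self._local[key] = (self.clock() + self.ttl_seconds, value)
        return value

    def invalidate(self, key):
        """Drop a local entry, as a pub/sub invalidation message would."""
        self._local.pop(key, None)
```

Without `invalidate`, a changed weight would stay stale locally for up to the full TTL, which is the gap the pub/sub channel closes.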

Resilience

Streaming & fail-open

SSE pass-through

Response chunks stream to the client as they arrive. The fairshare permit stays held for the full stream — concurrency accounting matches real GPU occupancy, not just connection open time.
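Holding the permit for the whole stream comes down to releasing it in a `finally` block around the chunk loop. The sketch uses a `threading.Semaphore` as a stand-in for the fairshare permit and a hypothetical `stream_with_permit` helper; the gateway itself is async and its permit type is not shown here.

```python
import threading


def stream_with_permit(permits: threading.Semaphore, upstream_chunks):
    """Yield chunks as they arrive, holding one permit until the stream ends."""
    # Acquired when the consumer asks for the first chunk.
    permits.acquire()
    try:
        for chunk in upstream_chunks:
            yield chunk  # pass each chunk through without buffering
    finally:
        # Runs on normal completion, upstream error, or early close by the
        # consumer, so the slot is always returned.
        permits.release()
```

If the release happened when response headers were sent instead, the slot would free up while the model was still generating, and occupancy would be undercounted.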

Fail-open by default

If the Redis budget check fails, the gateway logs a warning and lets the request proceed. ClickHouse outages spill usage to a local WAL. At boot, connections to dependencies retry with backoff, so a pod that starts before its dependencies are ready can wait for them instead of failing.
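Fail-open for the budget check can be sketched as one `try`/`except`: a real "over budget" answer still rejects, but an unreachable store does not. The `admit` function, the `check_budget` callable, and the use of `ConnectionError` are assumptions for illustration.

```python
import logging

logger = logging.getLogger("gateway")


def admit(check_budget, request) -> bool:
    """Return True to let the request proceed, False to reject it."""
    try:
        # A normal answer from the budget store is honored as-is.
        return check_budget(request)
    except ConnectionError as exc:
        # Fail open: the store is unreachable, which is not the same as
        # the budget being exceeded.
        logger.warning("budget check failed, admitting request: %s", exc)
        return True
```

The trade-off is explicit: during a Redis outage, limits are not enforced, but traffic keeps flowing.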

Fail-open behavior

Observability

Prometheus on :9091

Low-cardinality labels only on the metrics endpoint — admission class and status class. Per-tenant breakdowns live in ClickHouse, not Prometheus, to avoid label explosion.
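To see why two coarse labels keep the series count small, here is a sketch that buckets every request by admission class and status class. The counter is a plain `collections.Counter`, not a Prometheus client, and the admission class names used in the usage note are hypothetical.

```python
from collections import Counter

# One series per (admission_class, status_class) pair.
requests_total = Counter()


def status_class(status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 429 -> '4xx'."""
    return f"{status // 100}xx"


def record(admission_class: str, status: int) -> None:
    """Count a finished request under its two low-cardinality labels."""
    requests_total[(admission_class, status_class(status))] += 1
```

Recording `("admitted", 201)` and `("admitted", 204)` lands in the same `("admitted", "2xx")` series, so the number of series is bounded by classes, not by tenants or exact status codes.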

Throughput spectrum

obleth_ttft_ms · obleth_total_ms

24h window

p50 region · p95 tail

Capacity utilization (gauge, 0 to 100%)

obleth_in_flight

Active requests on the data plane

obleth_queue_depth

Requests waiting on fairshare admission

Distributed tracing

OTLP spans

Set OBLETH_OTEL_ENDPOINT to an OTLP/HTTP collector. Spans export via a background batch processor — tracing never blocks the proxy. MCP routes get a separate mcp_request root span.

OBLETH_OTEL_ENDPOINT=http://collector:4318

Request spans

proxy_request: root
cache_lookup: optional hit
reserve_budget: fairshare
upstream_request: SSE stream
mcp_request: MCP routes only

Key resolution and auth run inline with no span overhead. Export is OTLP/HTTP binary to {endpoint}/v1/traces.
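The export target follows from the one variable: take `OBLETH_OTEL_ENDPOINT` and append `/v1/traces`. The `traces_url` helper below is a hypothetical illustration; treating an unset variable as "tracing off" and trimming a trailing slash are both assumptions.

```python
import os


def traces_url(env=os.environ):
    """Build the OTLP/HTTP traces URL from OBLETH_OTEL_ENDPOINT."""
    endpoint = env.get("OBLETH_OTEL_ENDPOINT")
    if not endpoint:
        return None  # assumption: no endpoint means tracing is disabled
    # Assumption: a trailing slash is trimmed before appending the path.
    return endpoint.rstrip("/") + "/v1/traces"
```

With `OBLETH_OTEL_ENDPOINT=http://collector:4318`, this yields `http://collector:4318/v1/traces`.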

Watch it in the dashboard

The control plane shows live in-flight and queued counts, per-tenant usage, token throughput, fairshare group pools, and model capacity — including autotune from the Models page. Run the quick start, open the dashboard, and watch admission under real load.

Quick start · Control plane guide