Request Lifecycle

Every step a request takes through obleth: auth, cache, fairshare admission, budget reservation, upstream proxy, cost reconciliation, and telemetry.

Every request that arrives at the obleth data plane runs through the steps below in order. Understanding this pipeline explains how fairshare and cost accounting work together.

Pipeline overview

① Auth
② Parse body + resolve model
③ Response cache check  ──── hit ──→ return immediately (no permit, no budget)
④ Fairshare admission   ──── at capacity ──→ queue until a slot opens
⑤ Token budget reserve  ──── exceeded ──→ 429
⑥ Term budget check     ──── exhausted ──→ 403
⑦ Proxy upstream
⑧ Reconcile actual cost
⑨ Emit telemetry

Step 1 — Auth

obleth extracts the bearer token from either the Authorization: Bearer <token> header or the x-api-key: <token> header.

The raw token is never stored. It is immediately hashed with SHA-256 and the hash is looked up:

moka in-process cache (TTL=5 min, cap=100k keys) — fastest, no network hop.
Redis (obleth:key:{hash}) — shared across all gateway pods.
If neither cache has it, the key doesn't exist → 401 invalid api key.

The cache returns a ResolvedKey containing everything admission needs: tenant_id, weight, tokens_per_minute, max_in_flight, fairshare_group, group_weight, disabled.

If disabled is true, the request is rejected with 403.
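
The lookup order above can be sketched as follows. This is a minimal illustration, not obleth's implementation: the two dictionaries stand in for the moka cache and Redis, and the header precedence, the hex encoding of the hash, and the step that warms the local cache after a Redis hit are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvedKey:
    tenant_id: str
    weight: int
    tokens_per_minute: int
    max_in_flight: int
    fairshare_group: Optional[str]
    group_weight: int
    disabled: bool


def extract_token(headers: dict) -> Optional[str]:
    """Read the token from Authorization: Bearer or x-api-key (lower-cased names)."""
    auth = headers.get("authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):].strip() or None
    return headers.get("x-api-key") or None


def authenticate(headers: dict, local_cache: dict, shared_cache: dict):
    """Return (status, ResolvedKey or None), following the lookup order above."""
    token = extract_token(headers)
    if token is None:
        return 401, None
    # Only the SHA-256 hash is used for lookups; the raw token is not stored.
    key_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    resolved = local_cache.get(key_hash)  # stand-in for the in-process cache
    if resolved is None:
        resolved = shared_cache.get(f"obleth:key:{key_hash}")  # stand-in for Redis
        if resolved is not None:
            local_cache[key_hash] = resolved  # assumed: warm the local cache
    if resolved is None:
        return 401, None
    if resolved.disabled:
        return 403, None
    return 200, resolved
```

A caller would continue to step 2 only when the returned status is 200.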

Step 2 — Parse body and resolve model

obleth reads and parses the request body (limit: 64 MiB). It extracts the model field and looks it up in the model registry (same cache chain: moka → Redis).

For paths that require a registered model (/v1/chat/completions, /v1/completions, etc.):

If model is missing → 400 model is required
If the model isn't registered → 404 model not registered
If the model has enabled: false → 403 model is disabled
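
The three checks can be read as a short decision function. The sketch below assumes the registry is a plain dictionary keyed by model name and that a missing enabled field means enabled; the path set lists only the two paths named above.

```python
from typing import Optional, Tuple

# Partial list: only the paths named in the text.
MODEL_REQUIRED_PATHS = {"/v1/chat/completions", "/v1/completions"}


def check_model(path: str, body: dict, registry: dict) -> Tuple[int, Optional[str]]:
    """Return (status, error message); (200, None) means the request may continue."""
    if path not in MODEL_REQUIRED_PATHS:
        return 200, None
    name = body.get("model")
    if not name:
        return 400, "model is required"
    model = registry.get(name)
    if model is None:
        return 404, "model not registered"
    if not model.get("enabled", True):  # assumed default when the field is absent
        return 403, "model is disabled"
    return 200, None
```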

Models have an admission_weight multiplier that is applied on top of the tenant's weight during fairshare admission, letting you make certain expensive models require proportionally more capacity.

obleth also estimates the token cost at this point — see Token-measured Fairness.

Step 3 — Response cache check

If the matched model has cache_enabled: true, obleth computes a cache key from sha256(model_name + request_body) and checks Redis (obleth:cache:{key}).

A cache hit returns the stored response immediately and exits the pipeline. No fairshare permit is acquired, no budget is consumed, and the upstream is never called. The usage record is written with cache_status = "hit".

A cache miss, or a model with caching disabled, continues to step 4.
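
As a rough illustration of the key derivation, the sketch below hashes the model name followed by the raw request bytes. The hex encoding, the absence of a separator, and the dictionary standing in for Redis are assumptions; the model is represented as a plain dictionary with hypothetical name and cache_enabled fields.

```python
import hashlib
from typing import Optional


def cache_key(model_name: str, request_body: bytes) -> str:
    """Build the Redis key from sha256(model_name + request_body)."""
    digest = hashlib.sha256(model_name.encode("utf-8") + request_body).hexdigest()
    return f"obleth:cache:{digest}"


def lookup_cached_response(model: dict, request_body: bytes, store: dict) -> Optional[bytes]:
    """Return the cached response, or None on a miss or when caching is off."""
    if not model.get("cache_enabled", False):
        return None
    return store.get(cache_key(model["name"], request_body))
```

Because the key is derived from the raw body, two requests hit the same entry only when their bodies are byte-for-byte identical.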

Step 4 — Fairshare admission

obleth calls the fairshare scheduler with:

tenant_id + weight (from the resolved key)
group + group_weight (for hierarchical mode)
cost (estimated token count)

The scheduler holds a global concurrency semaphore (OBLETH_GLOBAL_MAX_IN_FLIGHT, default 256). If a permit is available, the request is admitted immediately (Admission::Fast).

If the cluster is at capacity, the request joins a per-tenant queue. When a slot opens, the scheduler grants it to the tenant most behind its weighted fair share and admits it with Admission::Queued — see Fairshare Engine. A queued request keeps its place until a permit frees up; there is no timeout-based degradation. See Saturation Behavior for what happens when demand exceeds capacity.
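
The admission behavior described above can be approximated with a counter and per-tenant queues. This is a simplified, single-threaded sketch: the class name and the "served cost divided by weight" ordering are assumptions used to illustrate "most behind its weighted fair share", and the real scheduler (see Fairshare Engine) may rank tenants differently. Weights are assumed to be positive.

```python
from collections import deque


class FairshareSketch:
    def __init__(self, max_in_flight: int = 256):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.queues = {}   # tenant_id -> deque of (request_id, cost)
        self.weights = {}  # tenant_id -> weight
        self.served = {}   # tenant_id -> total admitted cost

    def _admit(self, tenant_id, cost):
        self.in_flight += 1
        self.served[tenant_id] = self.served.get(tenant_id, 0) + cost

    def submit(self, tenant_id, weight, request_id, cost):
        """Admit immediately if a permit is free, otherwise queue per tenant."""
        self.weights[tenant_id] = weight
        if self.in_flight < self.max_in_flight:
            self._admit(tenant_id, cost)
            return "fast"
        self.queues.setdefault(tenant_id, deque()).append((request_id, cost))
        return "queued"

    def release(self):
        """Free one permit and hand it to the waiting tenant furthest behind."""
        self.in_flight -= 1
        waiting = [t for t, q in self.queues.items() if q]
        if not waiting:
            return None
        tenant_id = min(waiting, key=lambda t: self.served.get(t, 0) / self.weights[t])
        request_id, cost = self.queues[tenant_id].popleft()
        self._admit(tenant_id, cost)
        return request_id
```

With a single permit, a tenant that has already been served waits behind a tenant that has not, even if it queued first.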

Step 5 — Token budget reserve

Before proxying, obleth atomically reserves the estimated token cost from the tenant's token bucket in Redis using a Lua script.

The token bucket refills at tokens_per_minute / 60000 tokens per millisecond. If the bucket doesn't have enough tokens:

The permit is released (admission slot freed)
The request is finalized with Admission::Rejected
The client receives 429 token budget exceeded

If OBLETH_FAIL_OPEN=true (default) and the Redis call fails, obleth logs a warning and continues — the budget check is skipped rather than rejecting the request.
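
The refill arithmetic can be shown in a few lines. The sketch below runs in process rather than in a Redis Lua script, and the cap of one minute's worth of tokens is an assumption; the text does not state the bucket capacity.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    tokens_per_minute: int
    tokens: float          # current balance
    last_refill_ms: int    # timestamp of the last refill

    def refill(self, now_ms: int) -> None:
        rate_per_ms = self.tokens_per_minute / 60000
        elapsed = max(0, now_ms - self.last_refill_ms)
        # Assumed: the balance is capped at one minute's worth of tokens.
        self.tokens = min(self.tokens_per_minute, self.tokens + elapsed * rate_per_ms)
        self.last_refill_ms = now_ms

    def try_reserve(self, cost: int, now_ms: int) -> bool:
        """Refill, then deduct the estimated cost if the balance covers it."""
        self.refill(now_ms)
        if self.tokens < cost:
            return False   # caller releases the permit and returns 429
        self.tokens -= cost
        return True
```

For example, with tokens_per_minute = 60000 the bucket gains one token per millisecond, so an empty bucket cannot cover a 500-token request after 400 ms but can after 1000 ms.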

Step 6 — Term budget check

If the tenant has a cumulative budget configured (budget_tokens and/or budget_cost_usd over a lifetime, monthly, or term period), obleth reads the tenant's usage so far for the current period from Redis (obleth:term_usage:{tenant}). If either cap is already met:

The permit is released
The request is finalized with Admission::Rejected
The client receives 403 tenant term budget exhausted

Like the per-minute bucket, a Redis read failure here fails open by default (OBLETH_FAIL_OPEN=true) and fails closed (503) when fail-open is disabled. Tenants with no term budget configured skip this step. See Quotas & Rate Limits.
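
The decision logic for this step, including the fail-open branch, might look like the sketch below. The function name and the read_usage callback (standing in for the Redis read of obleth:term_usage:{tenant}) are hypothetical, and a ConnectionError is used to represent any Redis read failure.

```python
from typing import Callable, Optional, Tuple


def term_budget_status(
    read_usage: Callable[[], Tuple[int, float]],
    budget_tokens: Optional[int],
    budget_cost_usd: Optional[float],
    fail_open: bool = True,
) -> int:
    """Return 200 to continue, 403 when a cap is met, 503 when failing closed."""
    if budget_tokens is None and budget_cost_usd is None:
        return 200  # no term budget configured: skip the step
    try:
        used_tokens, used_usd = read_usage()
    except ConnectionError:
        return 200 if fail_open else 503
    if budget_tokens is not None and used_tokens >= budget_tokens:
        return 403
    if budget_cost_usd is not None and used_usd >= budget_cost_usd:
        return 403
    return 200
```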

Step 7 — Proxy upstream

obleth forwards the request to the upstream (Aibrix, vLLM, or a registered model's api_base) using a pooled reqwest HTTP client. Streaming (SSE) responses are streamed through to the client byte-for-byte.

Each attempt is bounded by a timeout (request_timeout_secs, or the global OBLETH_UPSTREAM_TIMEOUT_SECS default). On a transient failure — a connection error, a timeout, or HTTP 408/429/5xx — obleth retries the same endpoint up to max_retries times with exponential backoff, then fails over to the model's next endpoint if one is configured. Retries and failover only happen before the first response byte reaches the client; client errors (4xx) are returned immediately and never retried. A single permit and budget reservation cover the whole attempt sequence. See Reliability & Failover.
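
The retry-then-failover order can be sketched as two nested loops. Everything here is illustrative: the send callback, the base_delay_s parameter, and the doubling backoff schedule are assumptions, and the sketch ignores streaming (it treats each attempt as returning a single status before any byte reaches the client).

```python
import time
from typing import Callable, List, Tuple

RETRYABLE_STATUSES = {408, 429}


def is_transient(status: int) -> bool:
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def proxy_with_failover(
    endpoints: List[str],
    send: Callable[[str], int],
    max_retries: int = 2,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, List[str]]:
    """Try each endpoint in order; return (final status, endpoints attempted)."""
    attempts: List[str] = []
    all_timed_out = True
    for endpoint in endpoints:
        for retry in range(max_retries + 1):
            attempts.append(endpoint)
            try:
                status = send(endpoint)
            except TimeoutError:
                pass                       # transient: attempt timed out
            except ConnectionError:
                all_timed_out = False      # transient: connection error
            else:
                if not is_transient(status):
                    return status, attempts  # success or non-retryable 4xx
                all_timed_out = False
            if retry < max_retries:
                sleep(base_delay_s * (2 ** retry))  # assumed backoff schedule
    # 504 only when every attempt timed out, otherwise 502.
    return (504 if attempts and all_timed_out else 502), attempts
```

Passing sleep as a parameter keeps the sketch testable without real delays.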

The fairshare permit is held for the entire duration of the upstream call, including streaming time. As a result, concurrency accounting reflects real GPU occupancy, not just the time to the first byte.

On upstream success, if the model has cache_enabled and the response is ≤ 512 KiB and status 200, the response body is written to Redis with the configured TTL.

Step 8 — Reconcile cost

After the stream finishes, obleth reads the actual token counts from the upstream's usage field (or counts them from the SSE stream). It runs a second Lua script to reconcile:

reconcile = estimated_tokens - actual_tokens

If the estimate was high (reconcile is positive), the difference is refunded to the bucket. If the actual cost was higher (reconcile is negative), the difference is charged as additional tokens. This keeps the bucket in line with actual usage regardless of estimation error.
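
A worked example of the sign convention, as a minimal sketch. The function operates on an in-memory balance rather than a Redis Lua script, and it assumes the balance may go negative after an undercharge; the text does not say how obleth handles that case.

```python
def reconcile(bucket_tokens: float, estimated_tokens: int, actual_tokens: int) -> float:
    """Apply the signed correction: positive refunds, negative charges extra."""
    delta = estimated_tokens - actual_tokens
    return bucket_tokens + delta
```

If 1200 tokens were reserved and the upstream reports 900, the delta is +300 and the bucket is refunded 300 tokens. If the upstream reports 1500 instead, the delta is -300 and a further 300 tokens are deducted.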

Step 9 — Emit telemetry

obleth sends a UsageRecord to the telemetry channel (a non-blocking async mpsc::send). The hot path returns immediately. A background task batches records and inserts them into ClickHouse every second.

Each record carries a completion timestamp (ts_ms), token counts, latency (queue_wait_ms, ttft_ms, total_ms), status code, frozen cost_usd, the coarse request_type (derived from the API path), and an optional session_id (captured from client headers or body fields). Rejected requests (429 budget, 403 term budget, upstream errors) are logged too — not only successes.

If ClickHouse is unavailable and fail_open is set, records spill to a local WAL file and are replayed once ClickHouse recovers.
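
The batch-and-spill behavior can be sketched with a queue and a file. This is an illustration under several assumptions: queue.Queue stands in for the mpsc channel, the insert callback stands in for the ClickHouse client, and the JSON-lines WAL format is invented for the example. Replay after recovery is not shown.

```python
import json
import queue
from typing import Callable, List


def drain_batch(channel: queue.Queue, max_batch: int = 1000) -> List[dict]:
    """Collect whatever is waiting on the channel without blocking."""
    batch: List[dict] = []
    while len(batch) < max_batch:
        try:
            batch.append(channel.get_nowait())
        except queue.Empty:
            break
    return batch


def flush(
    batch: List[dict],
    insert: Callable[[List[dict]], None],
    wal_path: str,
    fail_open: bool = True,
) -> str:
    """Insert a batch; on failure with fail_open, append it to a local WAL."""
    if not batch:
        return "empty"
    try:
        insert(batch)
        return "inserted"
    except ConnectionError:
        if not fail_open:
            raise
        with open(wal_path, "a", encoding="utf-8") as wal:
            for record in batch:
                wal.write(json.dumps(record) + "\n")  # assumed format: JSON lines
        return "spilled"
```

A background task would call drain_batch and flush roughly once per second, keeping the request path free of database latency.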

Individual rows are queryable via GET /api/v1/usage/logs and the dashboard Request Logs page. They are retained for the configured window (default 180 days) before day-partition pruning; the usage_daily rollup is kept permanently. See ClickHouse Usage.

When tracing_enabled is set on the key or tenant, the proxy also emits a set of span records via the same non-blocking sink — one per pipeline phase (auth_resolve, admission, cache_lookup, upstream, boon:tool_loop, etc.) plus a root proxy_request span. Spans land in a separate ClickHouse spans table and are surfaced in the dashboard's Request Detail panel. See Per-request span tracing.

The permit (fairshare slot) is dropped after reconciliation, freeing capacity for the next queued request.

Error summary

Stage	Status	Reason
Auth	`401`	Missing or invalid bearer token
Auth	`403`	Key is disabled
Parse	`400`	Body too large (>64 MiB) or missing `model`
Model	`404`	Model not registered
Model	`403`	Model disabled
Admission	`503`	Scheduler unavailable
Budget	`429`	Token budget (TPM quota) exceeded
Term budget	`403`	Cumulative token or USD cap exhausted for the period
Upstream	`502`	Upstream request failed after all retries and endpoints
Upstream	`504`	All upstream attempts timed out
