Every step a request takes through obleth: auth, cache, fairshare admission, budget reservation, upstream proxy, cost reconciliation, and telemetry.
Every request that arrives at the obleth data plane runs through eight steps in order. Understanding this pipeline explains how fairshare and cost accounting work together.
① Auth
② Parse body + resolve model
③ Response cache check ──── hit ──→ return immediately (no permit, no budget)
④ Fairshare admission ──── queued / brownout ──→ wait or degrade
⑤ Token budget reserve ──── exceeded ──→ 429
⑥ Proxy upstream
⑦ Reconcile actual cost
⑧ Emit telemetry
obleth extracts the bearer token from either the Authorization: Bearer <token> header or the x-api-key: <token> header.
The raw token is never stored. It is immediately hashed with SHA-256 and the hash is looked up:
- In-process moka cache, checked first.
- Redis (obleth:key:{hash}), shared across all gateway pods.

If no matching key is found, the request is rejected with 401 invalid api key.

The cache returns a ResolvedKey containing everything admission needs: tenant_id, weight, tokens_per_minute, max_in_flight, fairshare_group, group_weight, disabled.
If disabled is true, the request is rejected with 403.
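The token handling in this step can be sketched as follows. This is a minimal illustration, not obleth's actual code: the function names are hypothetical, and both the precedence of Authorization over x-api-key and the hex encoding of the hash are assumptions.

```python
import hashlib
from typing import Mapping, Optional


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from Authorization: Bearer or x-api-key, else None."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme, _, value = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    # Assumption: x-api-key is only consulted when no bearer token is present.
    api_key = lowered.get("x-api-key", "").strip()
    return api_key or None


def key_lookup_name(token: str) -> str:
    """Hash the raw token so that only the digest is used for lookups."""
    # Assumption: the digest is hex-encoded inside the Redis key.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"obleth:key:{digest}"
```

A request with neither header yields None, which corresponds to the 401 case described above.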
obleth reads and parses the request body (limit: 64 MiB). It extracts the model field and looks it up in the model registry (same cache chain: moka → Redis).
For paths that require a registered model (/v1/chat/completions, /v1/completions, etc.):
- model is missing → 400 model is required
- model is not in the registry → 404 model not registered
- enabled: false → 403 model is disabled

Models have an admission_weight multiplier that is applied on top of the tenant's weight during fairshare admission, letting you make certain expensive models require proportionally more capacity.
obleth also estimates the token cost at this point — see Token-measured Fairness.
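The model checks in this step map onto status codes roughly as in the sketch below. The registry is modelled as a plain dictionary and check_model is a hypothetical name; the real lookup goes through the moka and Redis cache chain.

```python
import json
from typing import Any, Dict, Tuple


def check_model(raw_body: bytes, registry: Dict[str, Dict[str, Any]]) -> Tuple[int, str]:
    """Return (status, detail): 200 plus the model name, or an error status and message."""
    body = json.loads(raw_body)
    model = body.get("model")
    if not model:
        return 400, "model is required"
    entry = registry.get(model)
    if entry is None:
        return 404, "model not registered"
    # Assumption: a registry entry without an explicit flag counts as enabled.
    if not entry.get("enabled", True):
        return 403, "model is disabled"
    return 200, model
```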
If the matched model has cache_enabled: true, obleth computes a cache key from sha256(model_name + request_body) and checks Redis (obleth:cache:{key}).
A cache hit returns the stored response immediately and exits the pipeline. No fairshare permit is acquired, no budget is consumed, and the upstream is never called. The usage record is written with cache_status = "hit".
On a cache miss, or when caching is disabled for the model, the request continues to step 4.
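A minimal sketch of the cache key derivation, assuming the model name and the raw request bytes are concatenated without a separator and the digest is hex-encoded (neither detail is specified here). The function name is hypothetical.

```python
import hashlib


def response_cache_key(model_name: str, request_body: bytes) -> str:
    """Build the Redis key for a cached response from the model name and raw body."""
    hasher = hashlib.sha256()
    hasher.update(model_name.encode("utf-8"))
    hasher.update(request_body)
    return f"obleth:cache:{hasher.hexdigest()}"
```

In this sketch the key is computed over the exact body bytes, so two requests that differ only in whitespace or field order would produce different keys and would not share a cache entry.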
obleth calls the fairshare scheduler with:
- tenant_id + weight (from the resolved key)
- group + group_weight (for hierarchical mode)
- cost (estimated token count)

The scheduler holds a global concurrency semaphore (OBLETH_GLOBAL_MAX_IN_FLIGHT, default 256). If a permit is available, the request is admitted immediately (Admission::Fast).
If the cluster is at capacity, the request joins a per-tenant queue. When a slot opens, the scheduler grants it to the tenant most behind its weighted fair share — see Fairshare Engine.
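One common way to define "most behind its weighted fair share" is to compare each waiting tenant's served cost divided by its weight and pick the smallest ratio. The sketch below illustrates that idea only; the rule obleth actually applies is described in Fairshare Engine, and every name here is hypothetical.

```python
from typing import Dict, Optional, Set


def pick_next_tenant(
    served_cost: Dict[str, float],
    weights: Dict[str, float],
    waiting: Set[str],
) -> Optional[str]:
    """Pick the waiting tenant with the lowest served-cost-to-weight ratio."""
    candidates = [t for t in waiting if weights.get(t, 0) > 0]
    if not candidates:
        return None
    # Ties are broken by tenant id so the choice is deterministic.
    return min(candidates, key=lambda t: (served_cost.get(t, 0.0) / weights[t], t))
```

With equal served cost, a tenant with weight 4 has a ratio four times lower than a tenant with weight 1, so it is granted the next slot.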
If the request has been queued for longer than OBLETH_BROWNOUT_WAIT_MS (default 750ms), it is admitted with Admission::Brownout — the scheduler does not reject it, but marks it for degradation in the next step.
For Brownout-admitted requests, obleth rewrites the request body: max_tokens is capped at 256. The request still reaches the upstream and returns a real (shorter) response. This avoids a hard 429 while ensuring low-priority traffic doesn't consume full GPU time under saturation.
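The body rewrite for brownout-admitted requests might look like the following sketch. It assumes a JSON body and that a request without max_tokens also receives the cap; the handling of a missing field is not specified above, and apply_brownout is a hypothetical name.

```python
import json

BROWNOUT_MAX_TOKENS = 256


def apply_brownout(raw_body: bytes) -> bytes:
    """Cap max_tokens at 256, leaving every other field of the request untouched."""
    body = json.loads(raw_body)
    requested = body.get("max_tokens")
    # Assumption: an absent max_tokens is treated as unbounded and therefore capped.
    if requested is None or requested > BROWNOUT_MAX_TOKENS:
        body["max_tokens"] = BROWNOUT_MAX_TOKENS
    return json.dumps(body).encode("utf-8")
```

Requests that already ask for 256 tokens or fewer pass through with their original limit.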
Before proxying, obleth atomically reserves the estimated token cost from the tenant's token bucket in Redis using a Lua script.
The token bucket refills at tokens_per_minute / 60000 tokens per millisecond. If the bucket doesn't have enough tokens:
- the request is marked Admission::Rejected
- the client receives 429 token budget exceeded

If OBLETH_FAIL_OPEN=true (the default) and the Redis call fails, obleth logs a warning and continues: the budget check is skipped rather than rejecting the request.
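The refill-then-reserve logic can be modelled in memory as below. In obleth this runs atomically inside a Redis Lua script; the Python class is only an illustration, and the bucket capacity of one minute's worth of tokens is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    tokens_per_minute: int
    tokens: float
    last_refill_ms: int

    def try_reserve(self, cost: int, now_ms: int) -> bool:
        """Refill for the elapsed time, then deduct cost if enough tokens remain."""
        rate_per_ms = self.tokens_per_minute / 60000
        elapsed_ms = max(0, now_ms - self.last_refill_ms)
        # Assumption: the bucket never holds more than one minute of tokens.
        self.tokens = min(self.tokens_per_minute, self.tokens + elapsed_ms * rate_per_ms)
        self.last_refill_ms = now_ms
        if self.tokens < cost:
            return False  # the caller would answer 429 token budget exceeded
        self.tokens -= cost
        return True
```

For example, with tokens_per_minute = 60000 the bucket gains one token per millisecond, so an empty bucket cannot cover a 500-token reservation after 400 ms but can after 600 ms.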
obleth forwards the request to the upstream (Aibrix, vLLM, or a registered model's api_base) using a pooled reqwest HTTP client. Streaming (SSE) responses are streamed through to the client byte-for-byte.
The fairshare permit is held for the entire duration of the upstream call, including streaming time. This is important: concurrency accounting reflects real GPU occupancy, not just the time to the first byte.
On upstream success, if the model has cache_enabled and the response is ≤ 512 KiB and status 200, the response body is written to Redis with the configured TTL.
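The write-back condition can be expressed as a single predicate. This is a sketch with hypothetical names; it covers only the three conditions listed above.

```python
MAX_CACHEABLE_BYTES = 512 * 1024  # 512 KiB


def should_cache(cache_enabled: bool, status: int, body: bytes) -> bool:
    """True when the response meets the conditions for being written to Redis."""
    return cache_enabled and status == 200 and len(body) <= MAX_CACHEABLE_BYTES
```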
After the stream finishes, obleth reads the actual token counts from the upstream's usage field (or counts them from the SSE stream). It runs a second Lua script to reconcile:
reconcile = estimated_tokens - actual_tokens
If the estimate was high, tokens are refunded to the bucket. If the actual cost was higher, additional tokens are charged. This ensures billing accuracy regardless of estimation error.
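As a worked example of the reconciliation arithmetic, the sketch below applies the difference to a bucket balance. The function name is hypothetical, and whether the real Lua script clamps the balance at zero or at the bucket capacity is not specified here.

```python
def reconcile(bucket_tokens: float, estimated_tokens: int, actual_tokens: int) -> float:
    """Return the bucket balance after correcting the earlier reservation."""
    # Positive delta: the estimate was too high, so tokens are refunded.
    # Negative delta: the request cost more than reserved, so the gap is charged.
    delta = estimated_tokens - actual_tokens
    return bucket_tokens + delta
```

A reservation of 500 tokens that actually used 300 returns 200 tokens to the bucket; one that used 800 removes a further 300.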
obleth sends a UsageRecord to the telemetry channel (a non-blocking async mpsc::send). The hot path returns immediately. A background task batches records and inserts them into ClickHouse every second.
If ClickHouse is unavailable and fail_open is set, records spill to a local WAL file and are replayed once ClickHouse recovers.
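The hand-off between the hot path and the background writer can be illustrated with a bounded queue. This sketch uses Python's standard queue module in place of the Rust mpsc channel; the names are hypothetical, and returning False when the queue is full is an assumption about overflow behaviour, not documented obleth behaviour.

```python
import queue
from typing import Any, Dict, List


def emit(records: "queue.Queue[Dict[str, Any]]", record: Dict[str, Any]) -> bool:
    """Enqueue a usage record without blocking the request path."""
    try:
        records.put_nowait(record)
        return True
    except queue.Full:
        return False  # assumption: the caller decides what to do on overflow


def drain_batch(records: "queue.Queue[Dict[str, Any]]", max_batch: int) -> List[Dict[str, Any]]:
    """Collect up to max_batch queued records for one bulk insert."""
    batch: List[Dict[str, Any]] = []
    while len(batch) < max_batch:
        try:
            batch.append(records.get_nowait())
        except queue.Empty:
            break
    return batch
```

A background task would call drain_batch about once per second and write the returned list in a single insert.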
The permit (fairshare slot) is dropped after reconciliation, freeing capacity for the next queued request.
| Stage | Status | Reason |
|---|---|---|
| Auth | 401 | Missing or invalid bearer token |
| Auth | 403 | Key is disabled |
| Parse | 400 | Body too large (>64 MiB) or missing model |
| Model | 404 | Model not registered |
| Model | 403 | Model disabled |
| Admission | 503 | Scheduler unavailable |
| Budget | 429 | Token budget (TPM quota) exceeded |
| Upstream | 502 | Upstream request failed |