What obleth is, the problem it solves, and where it fits in your AI infrastructure stack.
obleth is a lightweight, high-performance, fairshare-first AI gateway built in Rust. It sits between your front door (HAProxy or a cloud load balancer) and your inference backend (Aibrix, vLLM, or any OpenAI-compatible API), and it owns the layer those tools deliberately don't: multi-tenant identity, contention-based weighted fair queuing, token-accurate cost accounting, and reliability.
Running a shared GPU cluster for multiple teams, products, or customers exposes a fundamental tension: you want all of them to be able to send requests, but you also want important traffic (your chatbot, your SLA-bound production API) to always get through — even when the cluster is saturated.
Standard approaches fail predictably:
obleth solves this with token-measured weighted fairness: capacity under contention is divided proportionally to each tenant's configured weight, measured in the tokens that actually consume GPU time, not request counts.
| Capability | How |
|---|---|
| Multi-tenant identity | Every request carries an API key. obleth resolves it to a tenant with a weight and quota — from a fast in-process cache backed by Redis. |
| Fairshare admission | A single Rust async scheduler controls admission. Under saturation, the tenant most behind its weighted fair share gets the next slot. |
| Priority boosts | Change a tenant's weight live via the Management API or dashboard. The scheduler picks it up immediately — no restart, no config reload. |
| Token-accurate cost accounting | Estimate tokens at admission, atomically reserve a budget in Redis, reconcile the true cost after the stream finishes. Fairness is in GPU-seconds, not request counts. |
| Brownout, not 429 | Instead of rejecting low-priority traffic, obleth degrades it: it caps max_tokens so the request still completes, just with a shorter answer. |
| Reliability under failures | If Redis or ClickHouse go down, obleth keeps serving from its in-process key cache and spills usage telemetry to a local write-ahead log that replays on recovery. |
| OpenAI-compatible proxy | Any client that speaks the OpenAI API (Python SDK, LangChain, custom code) works with obleth with only a base URL change. |
| MCP gateway | obleth also fronts MCP (Model Context Protocol) servers with the same identity and audit layer applied to LLM traffic. |
obleth does not decide which inference pod handles a request — that's Aibrix's job. It does not re-implement model routing, KV-cache affinity, or replica selection. It deliberately composes with a downstream router rather than duplicating it.
See Positioning with Aibrix for how the two layers interact.
obleth uses exactly three datastores, each chosen for a distinct workload:
See Datastores for a deeper explanation of why all three are needed.
Client → HAProxy → obleth data plane (:8080) → Aibrix / vLLM
obleth admin API (:9090) ← dashboard / CLI
obleth metrics (:9091) ← Prometheus
Start with Quick Start to have obleth running in five minutes.