Overview

What obleth is, the problem it solves, and where it fits in your AI infrastructure stack.

obleth is a lightweight, high-performance, fairshare-first AI gateway built in Rust. It sits between your front door (HAProxy or a cloud load balancer) and your inference backend (Aibrix, vLLM, or any OpenAI-compatible API), and it owns the layer those tools deliberately don't: multi-tenant identity, contention-based weighted fair queuing, token-accurate cost accounting, and reliability.

The problem

Running a shared GPU cluster for multiple teams, products, or customers exposes a fundamental tension: you want all of them to be able to send requests, but you also want important traffic (your chatbot, your SLA-bound production API) to always get through — even when the cluster is saturated.

Standard approaches fail predictably:

  • Rate limits (requests per minute) reject traffic outright instead of queuing it. Under a burst, a batch job and a production chatbot compete on equal terms and both degrade.
  • Per-key concurrency limits protect one tenant from another, but the distribution is unweighted: a key with weight 1 and a key with weight 500 compete identically.
  • Round-robin or FIFO queues are fair only in a naive sense. A batch job that generates thousands of cheap, fast requests effectively locks out a slower production API even with equal queue priority.

obleth solves this with token-measured weighted fairness: capacity under contention is divided proportionally to each tenant's configured weight, measured in the tokens that actually consume GPU time, not request counts.

What obleth does

CapabilityHow
Multi-tenant identityEvery request carries an API key. obleth resolves it to a tenant with a weight and quota — from a fast in-process cache backed by Redis.
Fairshare admissionA single Rust async scheduler controls admission. Under saturation, the tenant most behind its weighted fair share gets the next slot.
Priority boostsChange a tenant's weight live via the Management API or dashboard. The scheduler picks it up immediately — no restart, no config reload.
Token-accurate cost accountingEstimate tokens at admission, atomically reserve a budget in Redis, reconcile the true cost after the stream finishes. Fairness is in GPU-seconds, not request counts.
Brownout, not 429Instead of rejecting low-priority traffic, obleth degrades it: it caps max_tokens so the request still completes, just with a shorter answer.
Reliability under failuresIf Redis or ClickHouse go down, obleth keeps serving from its in-process key cache and spills usage telemetry to a local write-ahead log that replays on recovery.
OpenAI-compatible proxyAny client that speaks the OpenAI API (Python SDK, LangChain, custom code) works with obleth with only a base URL change.
MCP gatewayobleth also fronts MCP (Model Context Protocol) servers with the same identity and audit layer applied to LLM traffic.

What obleth is not

obleth does not decide which inference pod handles a request — that's Aibrix's job. It does not re-implement model routing, KV-cache affinity, or replica selection. It deliberately composes with a downstream router rather than duplicating it.

See Positioning with Aibrix for how the two layers interact.

The three datastores

obleth uses exactly three datastores, each chosen for a distinct workload:

  • Postgres — relational source of truth for configuration and audit. Tenants, API keys, model routes, quotas, and weight-change history all live here. Never on the request hot path.
  • Redis — sub-millisecond hot cache for key resolution and atomic token-bucket budget enforcement. The data plane reads only Redis.
  • ClickHouse — append-only usage and cost ledger. Async-inserted in batches, never blocking a request.

See Datastores for a deeper explanation of why all three are needed.

Quick orientation

Client → HAProxy → obleth data plane (:8080) → Aibrix / vLLM
                   obleth admin API   (:9090)  ← dashboard / CLI
                   obleth metrics     (:9091)  ← Prometheus
  • Data plane: proxies inference requests, enforces fairshare, records telemetry.
  • Management API: CRUD for tenants, keys, models, quotas; live weight changes; usage reads.
  • Metrics: Prometheus endpoint, always on.

Start with Quick Start to have obleth running in five minutes.