Overview

What obleth is, the problem it solves, and where it fits in your AI infrastructure stack.

obleth is a lightweight, high-performance, fairshare-first AI gateway built in Rust. It sits between your front door (HAProxy or a cloud load balancer) and your inference backend (Aibrix, vLLM, or any OpenAI-compatible API), and it owns the layer those tools deliberately don't: multi-tenant identity, contention-based weighted fair queuing, token-accurate cost accounting, and reliability.

The problem

Running a shared GPU cluster for multiple teams or workloads exposes a fundamental tension: you want all of them to be able to send requests, but you also want important traffic (your chatbot, latency-sensitive workloads) to always get through — even when the cluster is saturated.

Standard approaches fail predictably:

Rate limits (requests per minute) reject traffic outright instead of queuing it. Under a burst, a batch job and a production chatbot compete on equal terms and both degrade.
Per-key concurrency limits protect one tenant from another, but the distribution is unweighted: a key with weight 1 and a key with weight 500 compete identically.
Round-robin or FIFO queues are fair only in a naive sense. A batch job that generates thousands of cheap, fast requests effectively locks out a slower production API even with equal queue priority.

obleth solves this with token-measured weighted fairness: capacity under contention is divided proportionally to each tenant's configured weight, measured in the tokens that actually consume GPU time, not request counts.

What obleth does

Capability	How
Multi-tenant identity	Every request carries an API key. obleth resolves it to a tenant with a weight and quota — from a fast in-process cache backed by Redis.
Fairshare admission	A single Rust async scheduler controls admission. Under saturation, the tenant most behind its weighted fair share gets the next slot.
Priority boosts	Change a tenant's weight live via the Management API or dashboard. The scheduler picks it up immediately — no restart, no config reload.
Token-accurate cost accounting	Estimate tokens at admission, atomically reserve a budget in Redis, reconcile the true cost after the stream finishes. Fairness is in GPU-seconds, not request counts.
Queue, not instant 429	When the cluster is at capacity, requests wait in the weighted fairshare queue instead of failing at the door. obleth only returns `429`/`403` when a token or term budget is actually exhausted.
Reliability under failures	If Redis or ClickHouse go down, obleth keeps serving from its in-process key cache and spills usage telemetry to a local write-ahead log that replays on recovery.
OpenAI-compatible proxy	Any client that speaks the OpenAI API (Python SDK, LangChain, custom code) works with obleth with only a base URL change.
Model boons	Grant a capability at the gateway to a model that lacks it natively — image input (vision), `response_format` JSON-schema enforcement (structured output), or input-token reduction (compression) — so a basic model can serve advanced requests.
Guardrails	Per-tenant content safety: scan request and response content for prompt injection, PII, banned keywords, or harm, and block, redact, or log matches — fail-open.
MCP gateway + tool loop	obleth fronts MCP (Model Context Protocol) servers with the same identity and audit layer applied to LLM traffic, and can run those tools on a model's behalf — injecting them, executing the tool calls, and looping until the model answers.

What obleth is not

obleth does not decide which inference pod handles a request — that's Aibrix's job. It does not re-implement model routing, KV-cache affinity, or replica selection. It deliberately composes with a downstream router rather than duplicating it.

See Positioning with Aibrix for how the two layers interact.

The three datastores

obleth uses exactly three datastores, each chosen for a distinct workload:

Postgres — relational source of truth for configuration and audit. Tenants, API keys, model routes, quotas, and weight-change history all live here. Never on the request hot path.
Redis — sub-millisecond hot cache for key resolution and atomic token-bucket budget enforcement. The data plane reads only Redis.
ClickHouse — append-only usage and cost ledger. Async-inserted in batches, never blocking a request.

See Datastores for a deeper explanation of why all three are needed.

Quick orientation

Client → HAProxy → obleth data plane (:8080) → Aibrix / vLLM
                   obleth admin API   (:9180)  ← dashboard / CLI
                   obleth metrics     (:9091)  ← Prometheus

Data plane: proxies inference requests, enforces fairshare, records telemetry.
Management API: CRUD for tenants, keys, models, quotas; live weight changes; usage reads.
Metrics: Prometheus endpoint, always on.

Start with Quick Start to have obleth running in five minutes.

PreviousLocal Development

NextWhy obleth?

Getting Started

Concepts

Guides

Reference

Operations