Why obleth?

How obleth compares to LiteLLM, raw vLLM ingress, provider proxies, and Aibrix-only setups — and when to use each.

Before self-hosting a gateway, it's worth understanding what each existing option actually does and where it stops.

The alternatives and their gaps

Raw vLLM or Aibrix with no gateway

vLLM accepts any request that arrives. Aibrix routes it to a replica. Neither has a concept of tenants, API keys, quotas, or fairshare. In a single-team, trusted environment that's fine. In a multi-team or multi-customer setup, a batch job from one team can saturate the cluster and starve everyone else with zero visibility into who is responsible.

LiteLLM

LiteLLM is a provider-abstraction proxy. Its sweet spot is routing across many LLM providers (OpenAI, Anthropic, Azure, Bedrock, local models) with a unified interface and per-key rate limits.

Where it falls short for a dedicated self-hosted GPU cluster:

Rate limits are per-key, not contention-aware. If key_A is limited to 200 RPM and the cluster can handle 1000 RPM right now, LiteLLM rejects key_A at 201 RPM even when key_B is idle and there's headroom. There's no weighted queue that redistributes unused capacity.
No token-measured fairness. A 1-token request and a 100k-token request each count as one request against a requests-per-minute limit, but GPU occupancy scales with tokens, not request count.
Under saturation, it rejects. A saturated LiteLLM returns 429s. Important traffic competes by retrying faster, not by having a higher configured priority.
No live weight adjustment. You can't boost a tenant's priority without a config reload.
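
To make the "redistributes unused capacity" point concrete, here is a minimal sketch of weighted max-min fair sharing (water-filling). It is an illustration only, not obleth's actual scheduler: the function name, the tenant names, and the choice of tokens per minute as the unit are all assumptions made for the example.

```python
def weighted_fair_share(capacity, demands, weights):
    """Split `capacity` across tenants by weight, capping each at its demand.

    Capacity a tenant cannot use is handed to tenants that still want more
    (weighted max-min fairness). All names here are hypothetical.
    """
    allocation = {tenant: 0.0 for tenant in demands}
    active = {t for t, d in demands.items() if d > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        # Tenants whose demand fits inside their weighted share of what is left.
        satisfied = {
            t for t in active
            if demands[t] - allocation[t] <= remaining * weights[t] / total_weight
        }
        if not satisfied:
            # Every active tenant wants more than its share: split by weight.
            for t in active:
                allocation[t] += remaining * weights[t] / total_weight
            break
        for t in satisfied:
            remaining -= demands[t] - allocation[t]
            allocation[t] = float(demands[t])
        active -= satisfied

    return allocation


# An idle key does not strand capacity: key_a can use what key_b leaves.
idle_case = weighted_fair_share(
    1000, {"key_a": 700, "key_b": 0}, {"key_a": 1, "key_b": 1}
)

# Under contention, the tenant with the larger weight is served first.
busy_case = weighted_fair_share(
    1000, {"a": 900, "b": 50, "c": 600}, {"a": 1, "b": 1, "c": 2}
)
```

In the first call key_a receives its full demand of 700 because key_b is idle. In the second, b and c are fully served and a receives the remaining 350. A static per-key limit gives neither behavior.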

Aibrix alone

Aibrix is excellent at what it does: replica selection (least-KV-cache, prefix-cache affinity, power-of-two, throughput-weighted). obleth deliberately does not duplicate this. The right architecture is both: obleth decides who sends and at what priority; Aibrix decides which pod serves it.

What makes obleth different

Feature	LiteLLM	Aibrix alone	obleth
Multi-tenant API keys	✓	—	✓
Per-key rate limits (RPM)	✓	—	✓ (TPM-based)
Weighted fairshare under saturation	—	—	✓
Token-measured admission	—	—	✓
Live priority boost (no restart)	—	—	✓
Queue under saturation (not instant 429)	—	—	✓
Fail-open with WAL	—	—	✓
Multi-provider routing	✓	—	per-model routes
Replica/pod selection	—	✓	delegates to Aibrix
OpenAI-compatible proxy	✓	✓	✓
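
The "weighted fairshare", "token-measured admission", and "queue under saturation" rows can be pictured with a small virtual-finish-time queue, in the style of self-clocked weighted fair queueing. This is a sketch under assumptions, not obleth's implementation: the class name, its methods, and the tenant names are hypothetical, and the token counts are taken as known at enqueue time.

```python
import heapq
import itertools


class TokenFairQueue:
    """Hypothetical weighted fair queue where cost is measured in tokens."""

    def __init__(self, weights):
        self.weights = dict(weights)    # tenant -> weight (read at enqueue time)
        self._last_finish = {}          # tenant -> finish tag of its last request
        self._virtual_time = 0.0        # finish tag of the most recently served request
        self._heap = []
        self._seq = itertools.count()   # tie-breaker for equal finish tags

    def enqueue(self, tenant, tokens, request_id):
        # A request starts after the tenant's previous one, in virtual time.
        start = max(self._virtual_time, self._last_finish.get(tenant, 0.0))
        # Heavier weights advance virtual time more slowly per token.
        finish = start + tokens / self.weights[tenant]
        self._last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant, request_id))

    def dequeue(self):
        finish, _, tenant, request_id = heapq.heappop(self._heap)
        self._virtual_time = finish
        return tenant, request_id


queue = TokenFairQueue({"batch": 1, "chat": 4})
for i in range(3):
    queue.enqueue("batch", 1000, f"batch-{i}")   # large batch requests arrive first
for i in range(2):
    queue.enqueue("chat", 100, f"chat-{i}")      # small chat requests arrive later

order = [queue.dequeue()[1] for _ in range(5)]
# order == ["chat-0", "chat-1", "batch-0", "batch-1", "batch-2"]
```

The chat requests are served first even though they arrived after the batch requests, because their token cost divided by their weight is smaller. Nothing is rejected; the batch requests simply wait. In this sketch a weight change applies only to requests enqueued afterwards, which is a simplification.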

When to pick obleth

You run a shared GPU cluster (on-prem, bare-metal, k8s) with multiple teams or customers.
You need weighted fairness: team A's chatbot should always get through even when team B runs a batch job at 10× the concurrency.
You want a live priority dial: during an incident you can boost the production key's weight in the dashboard and it takes effect immediately.
You want cost accounting in tokens, not requests, so usage attribution across teams reflects actual GPU usage.
You're already using or planning to use Aibrix for inference routing and need the admission/identity layer it doesn't provide.
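
The difference between request-based and token-based accounting can be shown with a short calculation. This is an illustrative sketch, not obleth's accounting code: the record layout is hypothetical, with field names borrowed from the `usage` object in OpenAI-compatible responses.

```python
from collections import defaultdict


def usage_shares(records):
    """Return each team's share of total requests and of total tokens."""
    requests = defaultdict(int)
    tokens = defaultdict(int)
    for record in records:
        requests[record["team"]] += 1
        tokens[record["team"]] += record["prompt_tokens"] + record["completion_tokens"]

    total_requests = sum(requests.values())
    total_tokens = sum(tokens.values())
    return {
        team: {
            "request_share": requests[team] / total_requests,
            "token_share": tokens[team] / total_tokens,
        }
        for team in requests
    }


# Team A sends 1,000 one-token requests; team B sends one 100k-token request.
records = [{"team": "team_a", "prompt_tokens": 1, "completion_tokens": 0}] * 1000
records.append({"team": "team_b", "prompt_tokens": 90_000, "completion_tokens": 10_000})

shares = usage_shares(records)
# By request count, team A looks like nearly all of the traffic.
# By tokens, team B accounts for about 99% of it.
```

Attributing usage by request count would charge almost everything to team A here, while the token view assigns the load to team B, which is the team actually occupying the GPUs.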

When to pick something else

You're routing across many cloud providers and the main value-add is provider abstraction — LiteLLM is better suited.
You have a single team on a dedicated cluster with no contention concerns — a simple reverse proxy is enough.
You need semantic caching today — obleth's response cache is exact-match only (semantic/vector caching is planned for a later version).
