How obleth compares to LiteLLM, raw vLLM ingress, provider proxies, and Aibrix-only setups — and when to use each.
Before self-hosting a gateway, it's worth understanding what each existing option actually does and where it stops.
vLLM accepts any request that arrives. Aibrix routes it to a replica. Neither has a concept of tenants, API keys, quotas, or fairshare. In a single-team, trusted environment that's fine. In a multi-team or multi-customer setup, a batch job from one team can saturate the cluster and starve everyone else with zero visibility into who is responsible.
LiteLLM is a provider-abstraction proxy. Its sweet spot is routing across many LLM providers (OpenAI, Anthropic, Azure, Bedrock, local models) with a unified interface and per-key rate limits.
Where it falls short for a dedicated self-hosted GPU cluster:
If key_A is limited to 200 RPM and the cluster can handle 1000 RPM right now, LiteLLM rejects key_A at 201 RPM even when key_B is idle and there's headroom. There's no weighted queue that redistributes unused capacity.
Aibrix is excellent at what it does: replica selection (least-KV-cache, prefix-cache affinity, power-of-two, throughput-weighted). obleth deliberately does not duplicate this. The right architecture is both: obleth decides who sends and at what priority; Aibrix decides which pod serves it.
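
To make the difference between static per-key limits and weighted fairshare concrete, here is a minimal sketch. It is not obleth's scheduler: `static_admit` and `weighted_fairshare` are hypothetical helper names, and the allocation rule shown is plain weighted max-min fairness (water-filling), used here only as one reasonable reading of "redistributes unused capacity". Rates are in requests per minute, and the demo numbers are illustrative.

```python
from typing import Dict


def static_admit(demand: Dict[str, float], limit: Dict[str, float]) -> Dict[str, float]:
    """Static per-key limits: each key is capped on its own, ignoring idle neighbours."""
    return {key: min(rate, limit[key]) for key, rate in demand.items()}


def weighted_fairshare(
    demand: Dict[str, float], weight: Dict[str, float], capacity: float
) -> Dict[str, float]:
    """Weighted max-min fair allocation of `capacity` across keys (water-filling)."""
    alloc = {key: 0.0 for key in demand}
    active = {key for key, rate in demand.items() if rate > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weight[key] for key in active)
        satisfied = set()
        spent = 0.0
        for key in active:
            share = remaining * weight[key] / total_weight
            need = demand[key] - alloc[key]
            if need <= share:
                # This key wants less than its share: grant it fully.
                alloc[key] += need
                spent += need
                satisfied.add(key)
        if not satisfied:
            # Everyone wants more than their share: split the rest by weight and stop.
            for key in active:
                alloc[key] += remaining * weight[key] / total_weight
            break
        # Capacity left over by satisfied keys is redistributed in the next round.
        remaining -= spent
        active -= satisfied

    return alloc


if __name__ == "__main__":
    demand = {"key_A": 600, "key_B": 0}
    print(static_admit(demand, {"key_A": 200, "key_B": 200}))
    print(weighted_fairshare(demand, {"key_A": 1, "key_B": 1}, capacity=1000))
```

With an idle `key_B`, the static limiter still caps `key_A` at its configured limit, while the fairshare allocation lets `key_A` use the spare capacity. Weights only start to matter once total demand exceeds capacity.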
| Feature | LiteLLM | Aibrix alone | obleth |
|---|---|---|---|
| Multi-tenant API keys | ✓ | — | ✓ |
| Per-key rate limits (RPM) | ✓ | — | ✓ (TPM-based) |
| Weighted fairshare under saturation | — | — | ✓ |
| Token-measured admission | — | — | ✓ |
| Live priority boost (no restart) | — | — | ✓ |
| Brownout instead of 429 | — | — | ✓ |
| Fail-open with WAL | — | — | ✓ |
| Multi-provider routing | ✓ | — | per-model routes |
| Replica/pod selection | — | ✓ | delegates to Aibrix |
| OpenAI-compatible proxy | ✓ | ✓ | ✓ |
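
The "TPM-based" and "token-measured admission" rows can be illustrated with a small sketch. This is an assumption about what such a limiter could look like, not obleth's implementation: `TokenBudget`, its fields, and the caller-supplied `cost_tokens` estimate are all hypothetical. The point it shows is that a budget measured in tokens charges a 10,000-token request far more than a 10-token one, whereas a request-count (RPM) limit treats them the same.

```python
class TokenBudget:
    """Per-key budget in LLM tokens per minute, refilled continuously (token bucket)."""

    def __init__(self, tpm: float, now: float = 0.0) -> None:
        self.tpm = tpm      # allowed tokens per minute for this key
        self.level = tpm    # tokens currently available; the bucket starts full
        self.last = now     # time of the last refill, in seconds

    def admit(self, cost_tokens: float, now: float) -> bool:
        """Return True and charge the budget if the request fits, else False."""
        elapsed = max(0.0, now - self.last)
        # Refill at tpm / 60 tokens per second, never above the per-minute cap.
        self.level = min(self.tpm, self.level + elapsed * self.tpm / 60.0)
        self.last = now
        if cost_tokens <= self.level:
            self.level -= cost_tokens
            return True
        return False


if __name__ == "__main__":
    budget = TokenBudget(tpm=6000)
    print(budget.admit(5000, now=0.0))   # large request, fits in the full bucket
    print(budget.admit(2000, now=0.0))   # only 1000 tokens left at this instant
    print(budget.admit(2000, now=10.0))  # ten seconds of refill later
```

The clock is passed in as `now` so the behaviour is deterministic in tests; a real gateway would more likely read a monotonic clock and reconcile the estimate against the token counts the model actually reports.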