How obleth and Aibrix compose: obleth owns identity and admission, Aibrix owns pod routing. Neither duplicates the other.
Aibrix is an open-source inference gateway that handles replica selection for vLLM clusters: least-loaded routing, power-of-two choices, prefix-cache affinity, and throughput-weighted routing. It is excellent at this.
Aibrix deliberately does not handle multi-tenant identity, API key authentication, weighted fairshare admission, or cost accounting. obleth deliberately does not handle replica selection. They compose cleanly.
Client
↓ HTTPS
HAProxy (TLS + round-robin across obleth pods)
↓ HTTP
obleth ←── owns this layer:
• API key auth + tenant resolution
• Weighted fairshare admission
• Token budget enforcement
• Brownout policy
• Response cache
• MCP gateway
• Cost accounting + telemetry
↓ HTTP (OpenAI-compatible)
Aibrix ←── owns this layer:
• Pod/replica selection
• KV-cache affinity routing
• Prefix-cache routing
• Per-model RPS routing
• Least-loaded replica
↓ HTTP
vLLM replicas
obleth decides who sends and at what priority. Aibrix decides which pod serves it.
Point OBLETH_UPSTREAM_BASE_URL at your Aibrix gateway endpoint:
OBLETH_UPSTREAM_BASE_URL=http://aibrix-gateway:8080/v1
obleth proxies the OpenAI-compatible request to Aibrix, which sees a normal inference request. The Authorization header is forwarded as-is unless the model has an api_key configured in obleth's model registry, in which case that key overrides it.
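The header rule above can be sketched in a few lines of Python. This is a minimal illustration, not obleth's actual code: the function name `upstream_authorization` is hypothetical, and the `Bearer` scheme for the registry key is an assumption.

```python
from typing import Optional


def upstream_authorization(client_header: Optional[str],
                           model_api_key: Optional[str]) -> Optional[str]:
    """Pick the Authorization header value to send upstream (hypothetical helper).

    If the model registry entry carries an api_key, it overrides the
    client's header; otherwise the client's header is forwarded unchanged.
    """
    if model_api_key:
        # Assumption: the registry key is sent as a Bearer token.
        return f"Bearer {model_api_key}"
    return client_header


# No api_key configured: the client's header passes through as-is.
passthrough = upstream_authorization("Bearer client-key", None)

# api_key configured: the registry key replaces the client's header.
overridden = upstream_authorization("Bearer client-key", "model-key")
```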
For per-model upstream overrides, use the model registry:
curl -X POST http://localhost:9090/api/v1/models \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "llama-3-70b",
    "upstream_model": "llama-3-70b-instruct",
    "api_base": "http://aibrix-gateway:8080",
    "input_cost_per_token": 0.0000005,
    "output_cost_per_token": 0.0000015,
    "context_window": 131072,
    "enabled": true
  }'
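To make the effect of a registry entry concrete, here is a rough Python sketch of how a per-model override might be resolved against the global default. It is inferred from the field names in the request above and is not taken from obleth's source: `resolve_upstream`, the in-memory `registry` dict, and the fallback behaviour for unregistered models are all assumptions.

```python
from typing import Dict, Tuple

# Stand-in for the OBLETH_UPSTREAM_BASE_URL value shown earlier.
DEFAULT_UPSTREAM_BASE_URL = "http://aibrix-gateway:8080/v1"


def resolve_upstream(model_name: str,
                     registry: Dict[str, dict],
                     default_base: str = DEFAULT_UPSTREAM_BASE_URL) -> Tuple[str, str]:
    """Return (base_url, upstream_model) for a requested model (hypothetical)."""
    entry = registry.get(model_name)
    if entry is None:
        # Assumption: unregistered models go to the default upstream unchanged.
        return default_base, model_name
    # A registry entry may override the base URL and rename the model upstream.
    base = entry.get("api_base") or default_base
    upstream_model = entry.get("upstream_model") or model_name
    return base, upstream_model


# Mirrors the fields registered by the curl request above.
registry = {
    "llama-3-70b": {
        "upstream_model": "llama-3-70b-instruct",
        "api_base": "http://aibrix-gateway:8080",
    }
}

registered = resolve_upstream("llama-3-70b", registry)
unregistered = resolve_upstream("some-other-model", registry)
```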
Models have an admission_weight multiplier (default 1) that is applied on top of the tenant's weight during fairshare admission:
effective_weight = tenant.weight * model.admission_weight
This lets you make expensive models require proportionally more fairshare credit. For example, with admission_weight=4 on a 70B model and admission_weight=1 on a 7B model, a tenant needs 4× the fairshare capacity to run requests against the 70B model.
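As a worked example of the formula, the following Python sketch computes the effective weight for the two models in the example. The function name and the tenant weight of 2 are illustrative assumptions; how the admission scheduler consumes the result is not shown here.

```python
def effective_weight(tenant_weight: float, admission_weight: float = 1.0) -> float:
    """effective_weight = tenant.weight * model.admission_weight (default 1)."""
    return tenant_weight * admission_weight


tenant_weight = 2.0  # illustrative tenant weight, not from the source

# 7B model with admission_weight=1, 70B model with admission_weight=4.
small = effective_weight(tenant_weight, 1.0)
large = effective_weight(tenant_weight, 4.0)

# The ratio between the two equals the ratio of the admission weights.
ratio = large / small
```

With these numbers, `small` is 2.0, `large` is 8.0, and `ratio` is 4.0, matching the 4× figure in the example.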
obleth works with any OpenAI-compatible upstream: point OBLETH_UPSTREAM_BASE_URL at its endpoint.
Aibrix is the recommended pairing for self-hosted GPU clusters, not a hard dependency.