How to connect obleth to an Aibrix inference gateway, configure per-model routing, and set up admission weights.
Aibrix is the recommended downstream inference router for obleth. It handles replica selection, KV-cache affinity, and prefix-cache routing — the layer obleth deliberately doesn't duplicate.
Point OBLETH_UPSTREAM_BASE_URL at your Aibrix gateway:
OBLETH_UPSTREAM_BASE_URL=http://aibrix-gateway.aibrix.svc.cluster.local:8080
With this set, all requests that don't match a per-model api_base override are forwarded to Aibrix. obleth preserves the model field, and Aibrix routes it to the appropriate vLLM replica.
Register each model in obleth's model registry with its api_base:
curl -X POST http://localhost:9090/api/v1/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model_name": "llama-3-70b",
"upstream_model": "meta-llama/Llama-3-70b-instruct",
"api_base": "http://aibrix-gateway:8080",
"input_cost_per_token": 0.0000005,
"output_cost_per_token": 0.0000015,
"context_window": 131072,
"admission_weight": 4,
"supports_function_calling": true,
"supports_system_messages": true,
"enabled": true
}'
upstream_model is the model identifier sent to Aibrix (and then to vLLM). This lets you rename models for clients without changing your backend configuration.
Not all models cost the same. A 70B model consumes roughly 4–10× more GPU time than a 7B model. The admission_weight multiplier adjusts the effective fairshare cost:
effective_weight = tenant.weight × model.admission_weight
A tenant with weight=100 using a model with admission_weight=4 competes as if they had weight=400.
| Model | admission_weight |
|---|---|
llama-3-8b | 1 |
llama-3-70b | 4 |
mixtral-8x7b | 2 |
If your Aibrix gateway requires its own API key, set it in the model record. obleth strips the client's Authorization and injects the model's key upstream:
curl -X PUT http://localhost:9090/api/v1/models/$MODEL_ID \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{...existing fields..., "api_key": "aibrix-internal-key-xyz"}'
obleth works with any OpenAI-compatible upstream. Point OBLETH_UPSTREAM_BASE_URL at a raw vLLM service, LiteLLM, or the bundled mock backend. Aibrix is recommended for multi-replica GPU clusters, not a hard dependency.