Positioning with Aibrix

How obleth and Aibrix compose: obleth owns identity and admission, Aibrix owns pod routing. Neither duplicates the other.

Aibrix is an open-source inference gateway that handles replica selection for vLLM clusters. Its routing strategies include least-loaded pod selection, power-of-two choices, prefix-cache affinity, and throughput-weighted routing. It is excellent at this.

Aibrix deliberately does not handle multi-tenant identity, API key authentication, weighted fairshare admission, or cost accounting. obleth deliberately does not handle replica selection. They compose cleanly.

Division of responsibility

Client
  ↓ HTTPS
HAProxy  (TLS + round-robin across obleth pods)
  ↓ HTTP
obleth  ←── owns this layer:
  • API key auth + tenant resolution
  • Weighted fairshare admission
  • Token budget enforcement
  • Brownout policy
  • Response cache
  • MCP gateway
  • Cost accounting + telemetry
  ↓ HTTP (OpenAI-compatible)
Aibrix  ←── owns this layer:
  • Pod/replica selection
  • KV-cache affinity routing
  • Prefix-cache routing
  • Per-model RPS routing
  • Least-loaded replica
  ↓ HTTP
vLLM replicas

obleth decides who sends and at what priority. Aibrix decides which pod serves it.

Connecting obleth to Aibrix

Point OBLETH_UPSTREAM_BASE_URL at your Aibrix gateway endpoint:

OBLETH_UPSTREAM_BASE_URL=http://aibrix-gateway:8080/v1

obleth proxies the OpenAI-compatible request to Aibrix, which sees a normal inference request. The Authorization header is forwarded as-is, unless the model has an api_key configured in obleth's model registry, in which case that key overrides it.
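
The forwarding rule for the Authorization header can be sketched as follows. This is an illustration only, not obleth's actual code: the helper name upstream_headers and the use of the Bearer scheme for the override are assumptions.

```python
from typing import Dict, Mapping, Optional


def upstream_headers(
    incoming: Mapping[str, str],
    model_api_key: Optional[str] = None,
) -> Dict[str, str]:
    """Return the headers to send upstream (hypothetical helper)."""
    headers = dict(incoming)
    if model_api_key:
        # Drop any client-supplied Authorization header, whatever its casing,
        # then set the per-model key. The Bearer scheme is an assumption.
        headers = {k: v for k, v in headers.items() if k.lower() != "authorization"}
        headers["Authorization"] = f"Bearer {model_api_key}"
    # With no per-model key, the client's headers pass through unchanged.
    return headers
```

With no api_key configured the client's header reaches Aibrix untouched; with one configured, only the Authorization header changes.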

For per-model upstream overrides, use the model registry:

curl -X POST http://localhost:9090/api/v1/models \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "llama-3-70b",
    "upstream_model": "llama-3-70b-instruct",
    "api_base": "http://aibrix-gateway:8080",
    "input_cost_per_token": 0.0000005,
    "output_cost_per_token": 0.0000015,
    "context_window": 131072,
    "enabled": true
  }'
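
For scripted setups, the same registration call could be built from Python's standard library. This sketch mirrors the curl command above field for field; the function name build_register_request and the placeholder token are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_register_request(base_url: str, token: str, model: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that registers a model (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/models",
        data=json.dumps(model).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Same payload as the curl example.
model = {
    "model_name": "llama-3-70b",
    "upstream_model": "llama-3-70b-instruct",
    "api_base": "http://aibrix-gateway:8080",
    "input_cost_per_token": 0.0000005,
    "output_cost_per_token": 0.0000015,
    "context_window": 131072,
    "enabled": True,
}

# "example-token" is a placeholder for the value of $TOKEN.
req = build_register_request("http://localhost:9090", "example-token", model)
# To actually send it: urllib.request.urlopen(req)
```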

Model admission weights

Models have an admission_weight multiplier (default 1) that is applied on top of the tenant's weight during fairshare admission:

effective_weight = tenant.weight * model.admission_weight

This lets you make expensive models require proportionally more fairshare credit. For example, if a 70B model has admission_weight=4, a tenant needs 4× the fairshare capacity to run requests against it compared with a 7B model that has admission_weight=1.
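
As a worked instance of the formula, the sketch below computes effective_weight for one tenant against two models. The Tenant and Model classes and the example values are hypothetical; only the multiplication and the default of 1 come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    weight: float


@dataclass
class Model:
    name: str
    admission_weight: float = 1.0  # default multiplier, as stated above


def effective_weight(tenant: Tenant, model: Model) -> float:
    """effective_weight = tenant.weight * model.admission_weight"""
    return tenant.weight * model.admission_weight


tenant = Tenant("acme", weight=2.0)
small = Model("llama-3-7b")                         # admission_weight defaults to 1
large = Model("llama-3-70b", admission_weight=4.0)

w_small = effective_weight(tenant, small)  # 2.0 * 1.0 = 2.0
w_large = effective_weight(tenant, large)  # 2.0 * 4.0 = 8.0
```

For the same tenant, the ratio between the two results equals the ratio of the models' admission weights, which is the 4× factor in the example above.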

Without Aibrix

obleth works with any OpenAI-compatible upstream. Point OBLETH_UPSTREAM_BASE_URL at:

  • A raw vLLM service
  • LiteLLM (provider abstraction without fairshare)
  • Any other compatible API
  • The bundled mock backend (for dev/demo)

Aibrix is the recommended pairing for self-hosted GPU clusters, not a hard dependency.