This page covers every data-plane route that obleth proxies, the authentication headers, streaming, model routing, and how to configure popular SDKs.
obleth's data plane is a transparent OpenAI-compatible proxy. Any client that speaks the OpenAI HTTP API works with obleth with only a base_url change.
All data-plane requests must include a tenant API key. Either header is accepted:
Authorization: Bearer sk_<48 hex chars>
x-api-key: sk_<48 hex chars>
The admin token (OBLETH_ADMIN_TOKEN) is for the Management API only. Never send it to the data plane.
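To make the two accepted header forms concrete, here is a minimal sketch of a client-side helper. The function name `auth_headers` and the regular expression are illustrative and not part of obleth. Accepting both upper- and lowercase hex digits is an assumption.

```python
import re

# Key shape taken from the text above: "sk_" followed by 48 hex characters.
# Accepting both letter cases is an assumption; tighten it if your keys differ.
TENANT_KEY_RE = re.compile(r"sk_[0-9a-fA-F]{48}")


def auth_headers(api_key: str, use_x_api_key: bool = False) -> dict[str, str]:
    """Build the auth header for a data-plane request (hypothetical helper)."""
    if not TENANT_KEY_RE.fullmatch(api_key):
        # Catches accidental use of the admin token or a truncated key.
        raise ValueError("expected a tenant key of the form sk_<48 hex chars>")
    if use_x_api_key:
        return {"x-api-key": api_key}
    return {"Authorization": f"Bearer {api_key}"}
```

Checking the key shape before sending is a cheap guard against passing the admin token to the data plane by mistake.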
obleth proxies all standard OpenAI inference routes to the configured upstream:
| Route | Method | Notes |
|---|---|---|
| /v1/chat/completions | POST | Streaming (stream: true) and non-streaming |
| /v1/completions | POST | Legacy completions |
| /v1/embeddings | POST | Proxied, not fairshare-throttled separately |
| /v1/models | GET | Proxied to upstream |
| /health | GET | Returns ok (no auth required) |
| /mcp/{server} | ANY | MCP gateway (see MCP Gateway) |
Any other path is forwarded to the upstream unchanged: obleth acts as a fall-through proxy for paths it doesn't handle specially.
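The route table and the fall-through rule can be summarized as a small classifier. This is only a sketch of the behavior described here, not obleth's actual router. The category names and the `classify` function are invented for illustration.

```python
# Paths come from the route table above; category names are made up for this sketch.
MODEL_ROUTED = {"/v1/chat/completions", "/v1/completions"}  # need model resolution
PROXIED = {"/v1/embeddings", "/v1/models"}                  # proxied to the upstream


def classify(path: str) -> str:
    """Return an illustrative category for a request path."""
    if path == "/health":
        return "health"
    if path.startswith("/mcp/") and len(path) > len("/mcp/"):
        return "mcp"
    if path in MODEL_ROUTED:
        return "model-routed"
    if path in PROXIED:
        return "proxied"
    # Anything else is forwarded to the upstream unchanged.
    return "fall-through"
```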
curl -s http://localhost/v1/chat/completions \
  -H "Authorization: Bearer $SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is obleth?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
curl -N http://localhost/v1/chat/completions \
  -H "Authorization: Bearer $SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "stream": true,
    "messages": [{"role": "user", "content": "Count to 5"}],
    "max_tokens": 32
  }'
obleth streams the SSE response byte-for-byte from the upstream. The fairshare permit is held until the stream closes.
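Because the stream is passed through unmodified, a client can parse it exactly as it would parse an OpenAI chat completion stream. The sketch below assumes the upstream emits the usual OpenAI chunk shape (`data: {...}` lines ending with `data: [DONE]`). The function name `iter_deltas` is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI API
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Any iterable of byte lines works as input, for example the line iterator of an HTTP client's streaming response. Since the fairshare permit is held until the stream closes, read the stream to the end or close the connection promptly.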
obleth routes by the model field in the request body. For paths that require model resolution (/v1/chat/completions, /v1/completions), the model must be registered in obleth's model registry.
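As a rough mental model, model resolution can be pictured as a lookup keyed by the request's model field. The sketch below is not obleth's implementation. The in-memory `REGISTRY`, the `resolve` function, and the rewrite of the model field to `upstream_model` are all assumptions; the field names mirror the registration request shown further down this page.

```python
import json

# Hypothetical in-memory registry keyed by the client-facing model name.
REGISTRY = {
    "llama-3-70b": {
        "upstream_model": "meta-llama/Llama-3-70b-instruct",
        "api_base": "http://my-aibrix:8080",
        "enabled": True,
    },
}


def resolve(body: bytes) -> tuple[str, dict]:
    """Return (api_base, rewritten_request) or raise LookupError."""
    request = json.loads(body)
    name = request.get("model")
    entry = REGISTRY.get(name)
    if entry is None or not entry["enabled"]:
        raise LookupError(f"model not registered: {name!r}")
    # Assumption: the upstream receives upstream_model instead of the alias.
    rewritten = {**request, "model": entry["upstream_model"]}
    return entry["api_base"], rewritten
```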
Register a model:
curl -X POST http://localhost:9090/api/v1/models \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "llama-3-70b",
    "upstream_model": "meta-llama/Llama-3-70b-instruct",
    "api_base": "http://my-aibrix:8080",
    "input_cost_per_token": 0.0000005,
    "output_cost_per_token": 0.0000015,
    "context_window": 131072,
    "enabled": true
  }'
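The two per-token prices in the registration body suggest a simple linear cost model. The sketch below shows that arithmetic under the assumption that cost is prompt tokens times the input price plus completion tokens times the output price. How obleth actually applies these fields is not stated here, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_cost_per_token: float,
    output_cost_per_token: float,
) -> float:
    """Linear cost estimate; the currency unit is whatever the registry uses."""
    return (
        prompt_tokens * input_cost_per_token
        + completion_tokens * output_cost_per_token
    )


# With the prices registered above, 1,000 prompt tokens and 500 completion
# tokens come to roughly 0.0005 + 0.00075 = 0.00125.
example = estimate_cost(1000, 500, 0.0000005, 0.0000015)
```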
In the mock-backend dev stack, mock-model is available automatically, without registration.
admission_weight (default 1) multiplies the tenant's weight for the fairshare score on this model. Set it higher for expensive models:
# Give the 70B model 4x the admission weight of a 7B model
curl -X PUT http://localhost:9090/api/v1/models/$MODEL_ID \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{...existing fields..., "admission_weight": 4}'
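The "...existing fields..." placeholder suggests that the PUT body carries the full model record, not just the changed field. Assuming that is the case, a small helper can build the body from a record you already hold. `with_admission_weight` is a hypothetical name and the record contents below are only an example.

```python
import json


def with_admission_weight(existing: dict, weight: int) -> str:
    """Return a JSON body that keeps every existing field and sets the weight."""
    # The input dict is not mutated; a merged copy is serialized instead.
    return json.dumps({**existing, "admission_weight": weight})


# Example record, using field names from the registration request above.
record = {"model_name": "llama-3-70b", "enabled": True}
body = with_admission_weight(record, 4)
```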
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost/v1",  # or http://localhost:8088/v1 direct
    api_key="sk_...",
)

response = client.chat.completions.create(
    model="llama-3-70b",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost/v1",
    api_key="sk_...",
    model="llama-3-70b",
)
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost/v1',
  apiKey: 'sk_...',
});

const response = await client.chat.completions.create({
  model: 'llama-3-70b',
  messages: [{ role: 'user', content: 'Hello' }],
  max_tokens: 64,
});
| Limit | Value |
|---|---|
| Request body size | 64 MiB |
| MCP body size | 16 MiB |
| Response cache max body | 512 KiB (larger responses stream through uncached) |
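A client can check the request body limit before sending a large payload. The sketch below encodes the limits from the table as constants. The `encode_checked` helper is hypothetical, and treating a body of exactly the limit as acceptable is an assumption.

```python
import json

# Limits from the table above, in bytes (1 MiB = 1024 * 1024 bytes).
MAX_REQUEST_BYTES = 64 * 1024 * 1024
MAX_MCP_BYTES = 16 * 1024 * 1024
MAX_CACHEABLE_RESPONSE_BYTES = 512 * 1024


def encode_checked(payload: dict, limit: int = MAX_REQUEST_BYTES) -> bytes:
    """Serialize a payload and fail early if it exceeds the given limit."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > limit:
        raise ValueError(f"body is {len(body)} bytes, limit is {limit}")
    return body
```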
obleth forwards all non-hop-by-hop headers from the client to the upstream, except the client's Authorization header. If the model has an api_key configured, obleth injects it as the upstream Authorization: Bearer <model_api_key> header instead.
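The forwarding rule can be sketched as a header filter. This is an illustration of the behavior described above, not obleth's code. The hop-by-hop set is the classic HTTP/1.1 list and may differ from what obleth strips, and `upstream_headers` is a hypothetical name.

```python
from typing import Optional

# Classic HTTP/1.1 hop-by-hop headers; obleth's exact list is an assumption.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})


def upstream_headers(
    client_headers: dict[str, str],
    model_api_key: Optional[str] = None,
) -> dict[str, str]:
    """Drop hop-by-hop headers and the client's Authorization, then inject the model key."""
    out = {
        name: value
        for name, value in client_headers.items()
        if name.lower() not in HOP_BY_HOP and name.lower() != "authorization"
    }
    # The text only mentions Authorization being removed, so other headers
    # (including x-api-key) are left untouched in this sketch.
    if model_api_key:
        out["Authorization"] = f"Bearer {model_api_key}"
    return out
```

With this rule the tenant key never reaches the upstream in the Authorization header, and the upstream credential never has to be shared with tenants.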