56 docs indexed
Find each model's in-flight knee with a bounded ramp probe, then apply the recommended max_in_flight slots from the dashboard or Management API.
Each registered model can carry its own max_in_flight slot cap — an optional ceiling on concurrent requests for that route, inside the global fairshare scheduler limit (OBLETH_GLOBAL_MAX_IN_FLIGHT). Auto-tune helps you set that cap for self-hosted chat and embedding models by driving real load directly at the upstream and measuring where throughput stops climbing or tail latency breaches your SLO.
Cloud-hosted models should stay on a static cap so spend stays bounded. Auto-tune is intentionally recommend-only: the probe never writes config; you apply the suggestion explicitly.
| Mode | Meaning |
|---|---|
static (default) | max_in_flight is operator-set. Use for cloud APIs and any model where you want a fixed spend/concurrency ceiling. |
tuned | max_in_flight was last written by auto-tune. capacity_tuned_at records when. You can flip back to static at any time without clearing the slot value. |
Both modes use the same max_in_flight field. The mode is metadata for operators and audit — it does not change admission behavior on the data plane.
The ramp probe runs in the Management API (obleth-admin). For each step it:
api_base, bypassing obleth's gateway admission, so the measurement reflects the upstream's true capacity rather than the gateway's current cap.max_concurrency (hard ceiling 256).target_p99_msThe recommended max_in_flight is the highest concurrency that stayed under the SLO while still meaningfully improving throughput.
| Limit | Value |
|---|---|
| Wall clock | ≤ 60 seconds (all steps combined) |
| Total requests | ≤ 20,000 |
| Max concurrency | ≤ 256 (regardless of request) |
| Chat probe size | max_tokens: 8, non-streaming |
| Supported modalities | chat, embedding only |
Defaults when omitted: target_p99_ms = 2000, max_concurrency = 64.
All three routes are audit-logged. See Management API — Models for the full listing.
curl -X PUT "http://localhost:9180/api/v1/models/$MODEL_ID/capacity-mode" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"capacity_mode": "static"}'
curl -X POST "http://localhost:9180/api/v1/models/$MODEL_ID/autotune" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"target_p99_ms": 2000, "max_concurrency": 64}'
Example response (truncated):
{
"model_id": "...",
"model_name": "qwen3-8b",
"modality": "chat",
"recommended_max_in_flight": 32,
"knee_reason": "slo_breach",
"target_p99_ms": 2000,
"max_concurrency": 64,
"recommended_throughput_rps": 18.4,
"duration_ms": 12400,
"steps": [
{
"concurrency": 16,
"throughput_rps": 17.2,
"p50_ms": 420,
"p99_ms": 890,
"requests": 430,
"errors": 0
},
{
"concurrency": 32,
"throughput_rps": 18.4,
"p50_ms": 510,
"p99_ms": 1120,
"requests": 460,
"errors": 0
}
]
}
curl -X POST "http://localhost:9180/api/v1/models/$MODEL_ID/autotune/apply" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"max_in_flight": 32}'
This sets max_in_flight, flips capacity_mode to tuned, stamps capacity_tuned_at, syncs the data plane, and records an audit entry. You can also apply a value from a saved probe report without re-running the probe.
On the Models page, expand a chat or embedding route:
Static / Tuned toggle in operational controls (calls PUT …/capacity-mode).max_in_flight editor (calls PUT …/capacity).The auto-tune panel is hidden for non-probeable modalities (image, audio_*, etc.).
| Scenario | Recommendation |
|---|---|
| Self-hosted vLLM / Aibrix / local OpenAI-compatible server | Run auto-tune during a quiet window, then apply. Re-run after hardware or replica changes. |
| OpenAI, Together, or other metered cloud APIs | Keep static and set a conservative max_in_flight manually to cap spend. |
| Production model under heavy traffic | Avoid probing during peak — the probe adds real concurrent load at the upstream. |
| After apply | Confirm obleth_queue_depth stays near zero and GPU/backend utilization looks healthy. |
Per-model slots sit inside the global scheduler cap. Effective admission is the minimum of global max_in_flight, per-model slots, tenant/group limits, and fairshare. See Capacity Provider and Scaling.