What obleth does when demand exceeds capacity: weighted queuing, then explicit rejection with HTTP 429 or 403.
When more work arrives than the cluster can serve, obleth does not silently drop or degrade requests. It queues them by weighted fair share, and only rejects when a hard limit is crossed. Every outcome is explicit and recorded in the usage ledger.
The scheduler tags every admitted request with one of three outcomes (Admission in obleth-config):
| Outcome | Meaning |
|---|---|
| fast | Capacity was available; admitted immediately. |
| queued | The cluster was at its in-flight cap; the request waited in the weighted queue until a permit was released. |
| rejected | A budget check failed (or a fail-closed dependency was unavailable); the request was not proxied. |
There is no timeout-based degradation: a queued request keeps its place and is admitted as soon as the scheduler picks it. It is never silently shortened.
When the global in-flight cap (OBLETH_GLOBAL_MAX_IN_FLIGHT) is reached, new requests join a per-tenant queue rather than failing at the door. When an in-flight request completes and releases its permit, the scheduler grants the slot to the tenant most behind its weighted fair share — see Fairshare Engine.
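The permit hand-off described above can be sketched as follows. This is a minimal illustration, not the Fairshare Engine's actual algorithm: it assumes "most behind its weighted fair share" means the lowest ratio of admitted requests to weight, and the names `TenantQueue`, `pick_next`, and `release_permit` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TenantQueue:
    """Per-tenant FIFO queue plus the bookkeeping needed for fair share."""
    weight: float                      # configured share weight (assumed > 0)
    served: int = 0                    # requests admitted so far
    waiting: deque = field(default_factory=deque)


def pick_next(tenants: Dict[str, TenantQueue]) -> Optional[str]:
    """Return the tenant with queued work that is most behind its share.

    Assumption: "most behind" is the lowest served/weight ratio. Ties are
    broken by tenant name so the result is deterministic.
    """
    candidates = [(t.served / t.weight, name)
                  for name, t in tenants.items() if t.waiting]
    if not candidates:
        return None
    return min(candidates)[1]


def release_permit(tenants: Dict[str, TenantQueue]):
    """Called when an in-flight request finishes: admit one queued request."""
    name = pick_next(tenants)
    if name is None:
        return None                    # nothing is waiting; the slot stays free
    tenant = tenants[name]
    tenant.served += 1
    return name, tenant.waiting.popleft()
```

With tenant `a` at weight 3 and tenant `b` at weight 1, successive permits go to `a` about three times as often as to `b`, but `b` is never starved while it has work queued.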
A request is only turned away when it crosses an explicit limit:
| Status | Cause |
|---|---|
| 429 token budget exceeded | The tenant's per-minute token bucket (tokens_per_minute) is empty. |
| 403 tenant term budget exhausted | A cumulative token or USD cap (budget_tokens / budget_cost_usd) for the current period is used up. |
| 403 | The tenant is inactive (outside an access window, suspended, or archived) or the model is not on its allowlist. |
| 503 scheduler unavailable | The fairshare scheduler could not be reached. |
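To make the table concrete, here is a hedged sketch of how the checks could map to status codes. The order of the checks, the `TenantState` fields, and the reason strings for the plain 403 row ("tenant inactive", "model not allowed") are assumptions for illustration; only the status codes and the other three reason strings come from the table.

```python
from dataclasses import dataclass


@dataclass
class TenantState:
    active: bool             # False if suspended, archived, or outside an access window
    allowed_models: set      # model allowlist
    minute_tokens_left: int  # tokens remaining in the per-minute bucket
    term_tokens_left: int    # tokens remaining in the cumulative term budget


def rejection_for(tenant, model, scheduler_up=True):
    """Return (status, reason) if the request must be rejected, else None.

    The check order below is an assumption, not documented behaviour.
    """
    if not scheduler_up:
        return 503, "scheduler unavailable"
    if not tenant.active:
        return 403, "tenant inactive"            # hypothetical reason text
    if model not in tenant.allowed_models:
        return 403, "model not allowed"          # hypothetical reason text
    if tenant.term_tokens_left <= 0:
        return 403, "tenant term budget exhausted"
    if tenant.minute_tokens_left <= 0:
        return 429, "token budget exceeded"
    return None  # no hard limit crossed: the request is admitted or queued
```

A request that returns `None` here is never rejected for being busy; at worst it waits in the weighted queue.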
Rejecting under load doesn't reduce load. A 429 forces the client to back off and retry, which adds traffic and compounds the problem into a retry storm. Holding the request in a weighted queue admits it the moment a slot frees up, so the cluster stays busy with useful work and high-weight tenants keep their share.
Token budgets are the deliberate hard stop. Queuing protects against transient contention. The per-minute and term budgets are where you express real spending limits — when those are crossed, obleth returns an explicit 429/403 rather than serving work the tenant has no allowance for.
If requests spend too long queued, the cluster is under-provisioned for its offered load. Options:
- Raise OBLETH_GLOBAL_MAX_IN_FLIGHT (only if the upstream can serve more concurrency).

You can change the live global cap without a restart:
    curl -X PUT http://localhost:9180/api/v1/capacity \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"max_in_flight": 300}'
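The same call can be scripted, for example from an autoscaling hook. The sketch below builds the identical PUT request with Python's standard library. The endpoint, headers, and payload come from the curl command above; the helper name `capacity_request` and the positive-integer guard are assumptions. The response format is not documented here, so the sketch does not parse it.

```python
import json
import urllib.request


def capacity_request(base_url, token, max_in_flight):
    """Build (but do not send) the PUT request that updates the global cap."""
    # Client-side sanity check; an assumption, not a documented server rule.
    if not isinstance(max_in_flight, int) or max_in_flight <= 0:
        raise ValueError("max_in_flight must be a positive integer")
    body = json.dumps({"max_in_flight": max_in_flight}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/capacity",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Send it with `urllib.request.urlopen(capacity_request("http://localhost:9180", token, 300))`, which raises `urllib.error.HTTPError` on a 4xx or 5xx response.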
Every request is recorded in ClickHouse with its admission class (fast, queued, or rejected) and queue_wait_ms. Query the ledger to see how much traffic is waiting and how long:
    SELECT
        tenant_id,
        admission,
        count() AS requests,
        avg(queue_wait_ms) AS avg_wait_ms
    FROM usage
    WHERE ts_ms > (toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000)
    GROUP BY tenant_id, admission
    ORDER BY requests DESC
A rising share of queued requests with high queue_wait_ms is the signal to add headroom; a rising share of rejected requests points at token or term budgets that are too tight for the workload.
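To turn the query result into the shares discussed above, something like the following could be used. It is a minimal sketch that assumes the rows arrive as `(tenant_id, admission, requests, avg_wait_ms)` tuples in the column order of the query; the function name `admission_shares` is hypothetical.

```python
from collections import defaultdict


def admission_shares(rows):
    """Return {tenant_id: {"fast": f, "queued": q, "rejected": r}} as fractions.

    rows: iterable of (tenant_id, admission, requests, avg_wait_ms) tuples.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for tenant_id, admission, requests, _avg_wait_ms in rows:
        totals[tenant_id] += requests
        counts[tenant_id][admission] += requests
    return {
        tenant: {cls: counts[tenant][cls] / total
                 for cls in ("fast", "queued", "rejected")}
        for tenant, total in totals.items() if total
    }
```

Tracking these fractions per tenant over time separates the two signals: a growing `queued` share suggests a capacity problem, while a growing `rejected` share suggests a budget problem.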