
Saturation Behavior

What obleth does when demand exceeds capacity: weighted queuing, then explicit rejection with HTTP 429 or 403.

When more work arrives than the cluster can serve, obleth does not silently drop or degrade requests. It queues them by weighted fair share, and only rejects when a hard limit is crossed. Every outcome is explicit and recorded in the usage ledger.

The three admission outcomes

The scheduler tags every request with one of three outcomes (Admission in obleth-config):

| Outcome | Meaning |
|---|---|
| fast | Capacity was available; admitted immediately. |
| queued | The cluster was at its in-flight cap; the request waited in the weighted queue until a permit was released. |
| rejected | A budget check failed (or a fail-closed dependency was unavailable); the request was not proxied. |

There is no timeout-based degradation: a queued request keeps its place and is admitted as soon as the scheduler picks it. It is never silently shortened.

Queue first, reject only on a hard limit

When the global in-flight cap (OBLETH_GLOBAL_MAX_IN_FLIGHT) is reached, new requests join a per-tenant queue rather than failing at the door. When an in-flight request completes and releases its permit, the scheduler grants the slot to the tenant most behind its weighted fair share — see Fairshare Engine.
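The queue-then-grant behavior described above can be sketched in a few lines of Python. This is a toy model, not obleth's scheduler: the class name, the weights, and the definition of "most behind" as the lowest served-to-weight ratio are all assumptions made for illustration. The actual selection rule is described in Fairshare Engine.

```python
from collections import deque


class FairShareQueue:
    """Toy weighted fair-share queue: one FIFO per tenant, one shared in-flight cap."""

    def __init__(self, max_in_flight, weights):
        self.max_in_flight = max_in_flight           # stands in for OBLETH_GLOBAL_MAX_IN_FLIGHT
        self.weights = weights                       # tenant -> weight (hypothetical values)
        self.in_flight = 0
        self.served = {t: 0 for t in weights}        # admissions granted so far
        self.queues = {t: deque() for t in weights}  # per-tenant FIFO of waiting requests

    def submit(self, tenant, request_id):
        """Admit immediately if a permit is free, otherwise enqueue."""
        if self.in_flight < self.max_in_flight:
            self._admit(tenant)
            return "fast"
        self.queues[tenant].append(request_id)
        return "queued"

    def release(self):
        """A request finished: free its permit and grant it to the most-behind tenant."""
        self.in_flight -= 1
        waiting = [t for t, q in self.queues.items() if q]
        if not waiting:
            return None
        # Assumption: "most behind" means the lowest served/weight ratio.
        tenant = min(waiting, key=lambda t: self.served[t] / self.weights[t])
        request_id = self.queues[tenant].popleft()
        self._admit(tenant)
        return tenant, request_id

    def _admit(self, tenant):
        self.in_flight += 1
        self.served[tenant] += 1


q = FairShareQueue(max_in_flight=2, weights={"a": 3, "b": 1})
first = [q.submit("a", "a1"), q.submit("b", "b1")]   # both admitted "fast"
second = [q.submit("a", "a2"), q.submit("b", "b2")]  # cap reached, both "queued"
granted = q.release()  # tenant "a" has the lower served/weight ratio
```

In this sketch nothing is rejected for load alone: a request at the cap simply waits until a permit is released.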

A request is only turned away when it crosses an explicit limit:

| Status | Cause |
|---|---|
| 429 token budget exceeded | The tenant's per-minute token bucket (tokens_per_minute) is empty. |
| 403 tenant term budget exhausted | A cumulative token or USD cap (budget_tokens / budget_cost_usd) for the current period is used up. |
| 403 | The tenant is inactive (outside an access window, suspended, or archived) or the model is not on its allowlist. |
| 503 scheduler unavailable | The fairshare scheduler could not be reached. |
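As a rough illustration of how the rejection causes above map to status codes, here is a minimal Python sketch. The `Tenant` fields, the function name, and the order in which the checks run are hypothetical. The source does not say which check obleth evaluates first, and USD caps are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    active: bool            # False if outside an access window, suspended, or archived
    allowed_models: set     # the tenant's model allowlist
    bucket_tokens: int      # tokens left in the per-minute bucket
    term_tokens_left: int   # tokens left in the cumulative term budget


def check_limits(tenant, model, scheduler_up=True):
    """Return (status, reason) for a rejection, or None if no hard limit is crossed."""
    if not scheduler_up:
        return 503, "scheduler unavailable"
    if not tenant.active:
        return 403, "tenant inactive"
    if model not in tenant.allowed_models:
        return 403, "model not on allowlist"
    if tenant.term_tokens_left <= 0:
        return 403, "tenant term budget exhausted"
    if tenant.bucket_tokens <= 0:
        return 429, "token budget exceeded"
    # No hard limit crossed: the request is admitted fast or queued.
    return None
```

A client can treat the two families differently: a 429 comes from a per-minute bucket, so waiting and retrying can succeed, while a 403 for an exhausted term budget will not clear until the budget period changes or the cap is raised.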

Why queue instead of reject?

Rejecting under load doesn't reduce load. A 429 forces the client to back off and retry, which adds traffic and compounds the problem into a retry storm. Holding the request in a weighted queue admits it the moment a slot frees up, so the cluster stays busy with useful work and high-weight tenants keep their share.

Token budgets are the deliberate hard stop. Queuing protects against transient contention. The per-minute and term budgets are where you express real spending limits — when those are crossed, obleth returns an explicit 429/403 rather than serving work the tenant has no allowance for.

Adding headroom

If requests spend too long queued, the cluster is under-provisioned for its offered load. Options:

  1. Increase OBLETH_GLOBAL_MAX_IN_FLIGHT (only if the upstream can serve more concurrency).
  2. Scale the number of obleth pods (horizontal scaling adds capacity).
  3. Review tenant weights — a very high-weight tenant may be consuming most of the capacity.

You can change the live global cap without a restart:

curl -X PUT http://localhost:9180/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 300}'

Telemetry

Every request is recorded in ClickHouse with its admission class (fast, queued, or rejected) and queue_wait_ms. Query the ledger to see how much traffic is waiting and how long:

SELECT
  tenant_id,
  admission,
  count() AS requests,
  avg(queue_wait_ms) AS avg_wait_ms
FROM usage
WHERE ts_ms > (toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000)
GROUP BY tenant_id, admission
ORDER BY requests DESC

A rising share of queued requests with high queue_wait_ms is the signal to add headroom; a rising share of rejected requests points at token or term budgets that are too tight for the workload.
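To turn the query result into a simple check, something like the following Python sketch could work. It assumes the rows arrive as (tenant_id, admission, requests, avg_wait_ms) tuples in the column order of the query above. The function name and both thresholds are hypothetical and should be tuned to your own latency targets.

```python
def queue_pressure(rows, max_queued_share=0.2, max_avg_wait_ms=500.0):
    """Flag tenants whose queued share and average queue wait both exceed a threshold.

    rows: iterable of (tenant_id, admission, requests, avg_wait_ms) tuples.
    Returns {tenant_id: (queued_share, avg_wait_ms)} for flagged tenants.
    """
    totals, queued = {}, {}
    for tenant_id, admission, requests, avg_wait_ms in rows:
        totals[tenant_id] = totals.get(tenant_id, 0) + requests
        if admission == "queued":
            queued[tenant_id] = (requests, avg_wait_ms)

    flagged = {}
    for tenant_id, total in totals.items():
        q_requests, q_wait = queued.get(tenant_id, (0, 0.0))
        share = q_requests / total
        if share > max_queued_share and q_wait > max_avg_wait_ms:
            flagged[tenant_id] = (round(share, 3), q_wait)
    return flagged


rows = [
    ("t1", "fast", 60, 0.0),
    ("t1", "queued", 40, 900.0),
    ("t2", "fast", 95, 0.0),
    ("t2", "queued", 5, 1200.0),
]
flagged = queue_pressure(rows)  # only t1 crosses both thresholds
```

Requiring both conditions avoids flagging a tenant that queues often but briefly, or one with a single slow outlier.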