Brownout Policy

How obleth degrades low-priority traffic by capping max_tokens instead of returning HTTP 429 under saturation.

When a request has been sitting in the admission queue for too long, obleth makes a choice: reject it with 429 Too Many Requests, or let it through in a degraded form. obleth chooses degradation.

What brownout does

A request admitted with Admission::Brownout is not rejected. It is proxied to the upstream normally, but with one modification: the max_tokens field in the request body is capped to 256.

This means:

The client receives a real response (not an error).
The response is shorter than it might have been.
The GPU time consumed by the request is bounded.
The tenant's fairshare score is still charged for the actual tokens used.
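
The capping rule above can be sketched in a few lines of Rust. This is an illustrative sketch, not obleth's actual code: only Admission::Brownout and the value 256 come from this page. The Normal variant, the constant name, and the effective_max_tokens function are hypothetical, and capping a request that omits max_tokens is an assumption.

```rust
/// How the scheduler admitted a request. Only `Brownout` is named on this
/// page; `Normal` is a hypothetical stand-in for the non-degraded case.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission {
    Normal,
    Brownout,
}

/// Cap applied to brownout requests (256, as stated above).
pub const BROWNOUT_MAX_TOKENS: u32 = 256;

/// Returns the `max_tokens` value to forward upstream.
///
/// Assumption: a brownout request that omits `max_tokens` is given the cap
/// too, since an absent limit would leave GPU time unbounded.
pub fn effective_max_tokens(requested: Option<u32>, admission: Admission) -> Option<u32> {
    match admission {
        // Non-degraded requests pass through unchanged.
        Admission::Normal => requested,
        // Degraded requests keep a smaller limit but never exceed the cap.
        Admission::Brownout => Some(
            requested.map_or(BROWNOUT_MAX_TOKENS, |n| n.min(BROWNOUT_MAX_TOKENS)),
        ),
    }
}
```

Under these assumptions, a brownout request asking for 4096 tokens is forwarded with 256, while one asking for 100 keeps its own, smaller limit.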

When it triggers

Brownout triggers when a request has waited in the admission queue for longer than OBLETH_BROWNOUT_WAIT_MS (default: 750 ms).

The scheduler checks the enqueue timestamp when it selects a request for admission. If now - enqueued > brownout_wait, the request is tagged Admission::Brownout before its permit is granted.
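
A minimal sketch of that check, assuming the scheduler keeps the enqueue time as a std::time::Instant. The classify function and the Normal variant are hypothetical names used for illustration. The strict comparison mirrors the now - enqueued > brownout_wait condition above.

```rust
use std::time::{Duration, Instant};

/// Only `Brownout` is named on this page; `Normal` is a hypothetical
/// stand-in for a request admitted without degradation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission {
    Normal,
    Brownout,
}

/// Tags a request at the moment the scheduler selects it for admission.
pub fn classify(enqueued: Instant, now: Instant, brownout_wait: Duration) -> Admission {
    // Saturating subtraction yields zero if `now` is earlier than `enqueued`,
    // so a clock oddity can never turn into a spurious brownout.
    let waited = now.saturating_duration_since(enqueued);
    if waited > brownout_wait {
        Admission::Brownout
    } else {
        Admission::Normal
    }
}
```

With the default threshold, a request that waited exactly 750 ms is still admitted normally in this sketch; only a strictly longer wait is tagged as brownout.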

Why not 429?

429 forces the client to retry. Under saturation, retries add load and compound the problem. A client that gets a 429 typically backs off and retries, so the request ends up taking more total time than if it had waited and been served a short answer.

Brownout keeps clients happy. A chatbot user who gets a short but real answer in under 1s has a much better experience than one who gets a 429 and has to retry. For batch jobs, a shorter answer is still usable data; the job can fetch more context in the next request.

Brownout preserves system stability. Rejecting traffic under saturation doesn't reduce load; it turns that load into retry storms. Brownout admits and completes the request with bounded GPU time, releasing the slot for the next queued request.

Configuring the threshold

# Lower = degrade sooner (more aggressive protection of high-weight tenants)
# Higher = tolerate longer queues before degrading (better quality for all)
OBLETH_BROWNOUT_WAIT_MS=750
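
For illustration, a service could read this variable roughly as follows. This is a sketch, not obleth's actual startup code: the function and constant names are hypothetical, and falling back to the default on an unset or unparsable value is an assumption.

```rust
use std::time::Duration;

/// Default threshold in milliseconds (750, as documented above).
pub const DEFAULT_BROWNOUT_WAIT_MS: u64 = 750;

/// Parses the raw variable value into a wait threshold.
/// Assumption: an unset or non-numeric value falls back to the default.
pub fn parse_brownout_wait(raw: Option<&str>) -> Duration {
    let ms = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_BROWNOUT_WAIT_MS);
    Duration::from_millis(ms)
}

/// Reads OBLETH_BROWNOUT_WAIT_MS from the process environment.
pub fn brownout_wait_from_env() -> Duration {
    let raw = std::env::var("OBLETH_BROWNOUT_WAIT_MS").ok();
    parse_brownout_wait(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behaviour easy to test without mutating process state.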

You can also reduce how often brownout triggers, without a restart, by raising the concurrency limit via the capacity endpoint:

curl -X PUT http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 300}'

Increasing max_in_flight gives more concurrency headroom, which reduces the frequency of brownout.

Telemetry

Every brownout request is recorded in ClickHouse with admission = "brownout". You can query the ledger to see how often brownout is triggering and for which tenants:

SELECT
  tenant_id,
  count() AS brownout_requests,
  avg(queue_wait_ms) AS avg_wait_ms
FROM usage
WHERE admission = 'brownout'
  AND ts_ms > (toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000)
GROUP BY tenant_id
ORDER BY brownout_requests DESC

If brownout is frequent, consider:

Increasing OBLETH_GLOBAL_MAX_IN_FLIGHT (more concurrency headroom).
Scaling the number of obleth pods (horizontal scaling; capacity is additive across pods).
Reviewing tenant weights: a very high-weight tenant may be consuming most of the capacity.
