Brownout Policy

How obleth degrades low-priority traffic by capping max_tokens instead of returning HTTP 429 under saturation.

When a request has waited in the admission queue for too long, obleth has two options: reject it with 429 Too Many Requests, or let it through in a degraded form. obleth chooses degradation.

What brownout does

A request admitted with Admission::Brownout is not rejected. It is proxied to the upstream normally, with one modification: the max_tokens field in the request body is capped at 256.

This means:

  • The client receives a real response (not an error).
  • The response is shorter than it might have been.
  • The GPU time consumed by the request is bounded.
  • The tenant's fairshare score is still charged for the actual tokens used.
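
The capping rule can be sketched as a small pure function. This is a minimal illustration, not obleth's actual code: the `Normal` variant, the constant name, and the function name are hypothetical, and the handling of a request that omits max_tokens (setting it to the cap) is an assumption the text above does not confirm.

```rust
/// Token cap applied to brownout requests (256, per the text above).
/// The constant name is hypothetical.
const BROWNOUT_MAX_TOKENS: u32 = 256;

/// How a request was admitted. Only `Brownout` is named in this document;
/// `Normal` is a hypothetical name for the non-degraded case.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission {
    Normal,
    Brownout,
}

/// Returns the `max_tokens` value to forward upstream.
///
/// Assumption: a brownout request that omits `max_tokens` has the cap set
/// explicitly, so that GPU time stays bounded.
pub fn effective_max_tokens(admission: Admission, requested: Option<u32>) -> Option<u32> {
    match admission {
        // Non-degraded requests are forwarded unchanged.
        Admission::Normal => requested,
        // Degraded requests never exceed the cap; smaller values are kept.
        Admission::Brownout => Some(
            requested.map_or(BROWNOUT_MAX_TOKENS, |n| n.min(BROWNOUT_MAX_TOKENS)),
        ),
    }
}
```

Under this sketch, a brownout request asking for 4096 tokens is forwarded with 256, while one asking for 100 keeps its 100.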

When it triggers

Brownout triggers when a request has been waiting in the admission queue for longer than OBLETH_BROWNOUT_WAIT_MS (default: 750ms).

The scheduler checks the enqueue timestamp when it selects a request for admission. If now - enqueued > brownout_wait, the request is tagged Admission::Brownout before its permit is granted.
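
The check described above amounts to a single comparison. The sketch below shows it with `std::time`; the function name `classify` and the `Normal` variant are hypothetical, and the real scheduler may track timestamps differently.

```rust
use std::time::{Duration, Instant};

/// Only `Brownout` is named in this document; `Normal` is a hypothetical
/// name for the non-degraded case.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission {
    Normal,
    Brownout,
}

/// Tags a request at selection time, before its permit is granted.
/// Implements `now - enqueued > brownout_wait` with a strict comparison,
/// so a request that waited exactly the threshold is not degraded.
pub fn classify(enqueued: Instant, now: Instant, brownout_wait: Duration) -> Admission {
    // Yields zero instead of panicking if `now` is earlier than `enqueued`.
    let waited = now.saturating_duration_since(enqueued);
    if waited > brownout_wait {
        Admission::Brownout
    } else {
        Admission::Normal
    }
}
```

With the default threshold of 750ms, a request selected after 751ms in the queue is tagged Brownout, and one selected after 10ms is not.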

Why not 429?

429 forces the client to retry. Under saturation, retries add load and compound the problem. A client that receives a 429 typically backs off and retries, so the request often takes more total time than if it had waited and been served a short answer.

Brownout gives clients a usable response. A chatbot user who gets a short but real answer in under 1s has a much better experience than one who gets a 429 and has to retry. For batch jobs, a shorter answer is still usable data; the job can fetch more context in the next request.

Brownout preserves system stability. Rejecting traffic under saturation doesn't reduce load; it shifts the load into retry storms. Brownout admits and completes the request with bounded GPU time, releasing the slot for the next queued request.

Configuring the threshold

# Lower = degrade sooner (more aggressive protection of high-weight tenants)
# Higher = tolerate longer queues before degrading (better quality for all)
OBLETH_BROWNOUT_WAIT_MS=750
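
As an illustration of how a setting like this might be read at startup, here is a minimal sketch. The function and constant names are hypothetical, and falling back to the default when the variable is unset or unparsable is an assumption; obleth may instead refuse to start on an invalid value.

```rust
use std::time::Duration;

/// Default threshold from the text above (750ms). The constant name is hypothetical.
const DEFAULT_BROWNOUT_WAIT_MS: u64 = 750;

/// Parses a raw `OBLETH_BROWNOUT_WAIT_MS` value into a `Duration`.
/// Assumption: a missing or unparsable value falls back to the default.
pub fn parse_brownout_wait(raw: Option<&str>) -> Duration {
    let ms = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_BROWNOUT_WAIT_MS);
    Duration::from_millis(ms)
}

/// Reads the threshold from the process environment.
pub fn brownout_wait_from_env() -> Duration {
    let raw = std::env::var("OBLETH_BROWNOUT_WAIT_MS").ok();
    parse_brownout_wait(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behaviour easy to test without mutating process state.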

You can also relieve queue pressure live, without a restart, via the capacity endpoint:

curl -X PUT http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 300}'

Increasing max_in_flight gives more concurrency headroom, which reduces the frequency of brownout.

Telemetry

Every brownout request is recorded in ClickHouse with admission = "brownout". You can query the ledger to see how often brownout is triggering and for which tenants:

SELECT
  tenant_id,
  count() AS brownout_requests,
  avg(queue_wait_ms) AS avg_wait_ms
FROM usage
WHERE admission = 'brownout'
  AND ts_ms > (toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000)
GROUP BY tenant_id
ORDER BY brownout_requests DESC

If brownout is frequent, consider:

  1. Increasing OBLETH_GLOBAL_MAX_IN_FLIGHT (more concurrency headroom).
  2. Scaling out the number of obleth pods (horizontal scaling; capacity is additive across pods).
  3. Reviewing tenant weights: a very high-weight tenant may be consuming most of the capacity.