How obleth's admission scheduler chooses slots under saturation across weighted and hierarchical fairshare.
The fairshare engine is obleth's heart. It answers one question: when the cluster is at capacity and a new request arrives, which tenant's queued request gets admitted next?
Admission model
A single scheduler owns queue order and in-flight counts. This example shows two tenants competing for an eight-slot pool: chatbot has weight 500, while api-batch has weight 50.
Admit
obleth identifies the tenant and policy before routing upstream.
Queue
When the pool is full, requests wait by tenant instead of racing.
Pick
The next open slot goes to the tenant most entitled to it.
Release
The slot returns when the model response finishes streaming.
Hierarchical mode - current runtime default
With an eight-slot pool and active groups weighted 500:50, the gateway allocates seven slots to the chatbot group and reserves one slot for the api group. api-batch still queues, but it is never shut out entirely.
Example split of the 8-slot pool:
chatbot: group chatbot, weight 500, cap 7, queued 57
api-batch: group api, weight 50, cap 1, queued 31
Inside a group
After group caps reserve capacity, tenants inside the eligible group are balanced by how much work they have already received. A busy tenant cannot permanently crowd out its peers.
chatbot: served 8.6k, queued 28 (ahead)
chatbot-2: served 8.4k, queued 29 (next)
batch-job: served 9.1k, queued 12 (waits)
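The in-group rule can be sketched in a few lines of Python. This is an illustrative model, not obleth's code: the `TenantState` type and `pick_within_group` function are hypothetical names, and the sketch assumes "least served among tenants with queued work" is the whole rule.

```python
from dataclasses import dataclass


@dataclass
class TenantState:
    name: str
    served: int  # work already admitted for this tenant
    queued: int  # requests currently waiting


def pick_within_group(tenants):
    """Return the backlogged tenant that has received the least work, or None."""
    waiting = [t for t in tenants if t.queued > 0]
    if not waiting:
        return None
    return min(waiting, key=lambda t: t.served)


# The three tenants from the example above.
group = [
    TenantState("chatbot", served=8600, queued=28),
    TenantState("chatbot-2", served=8400, queued=29),
    TenantState("batch-job", served=9100, queued=12),
]
next_tenant = pick_within_group(group)
```

With these numbers the pick is chatbot-2, the tenant with the smallest served count, which matches the "next" label in the example.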
Fast
If capacity is open and no backlog exists, the request is admitted immediately.
Queued
If the pool is full, the request waits until a released permit triggers dispatch.
Brownout
If it waited past the brownout threshold, it is still admitted but marked degraded.
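A minimal sketch of how the three outcomes relate, assuming the only inputs are pool occupancy, backlog, and time spent waiting. The `Outcome` enum and both function names are hypothetical and are not part of obleth's API.

```python
from enum import Enum


class Outcome(Enum):
    FAST = "fast"
    QUEUED = "queued"
    BROWNOUT = "brownout"


def can_skip_queue(in_flight, capacity, backlog):
    """A request is admitted immediately only if a slot is free and nobody is waiting."""
    return in_flight < capacity and backlog == 0


def classify(was_queued, waited_ms, brownout_wait_ms):
    """Label an admitted request by how it got its slot."""
    if not was_queued:
        return Outcome.FAST
    if waited_ms > brownout_wait_ms:
        return Outcome.BROWNOUT  # still admitted, but marked degraded
    return Outcome.QUEUED
```

Note that all three outcomes are admissions: a brownout request is served like any other, just flagged as degraded.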
Under the flat weighted algorithm, each tenant accumulates a served counter in tokens as requests are admitted. The scheduler tracks the ratio served / weight for every active tenant. When a slot opens, it goes to the tenant with the lowest ratio — the one most behind its proportional fair share.
A tenant with weight=500 gets roughly ten times the throughput of one with weight=50. No slots are reserved for it; the ratio emerges because the scheduler always picks the most under-served tenant next.
All tenants compete globally. Admission scores are served_tokens / weight. The tenant with the lowest score wins each slot.
This is start-time fair queuing (STFQ) applied to token cost: a tenant that just used a lot of tokens won't be selected again until the others catch up relative to their weights.
Set with OBLETH_FAIRSHARE_ALGORITHM=weighted.
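The lowest-ratio rule is easy to simulate. The sketch below is a toy model under stated assumptions: both tenants are always backlogged, every request costs the same number of tokens (`cost_tokens`, a made-up figure), and ties go to the first tenant listed. The `simulate` function is hypothetical; exact fractions are used so that ties are not disturbed by floating-point rounding.

```python
from fractions import Fraction


def simulate(weights, picks, cost_tokens=100):
    """Admit `picks` requests, always choosing the lowest served/weight ratio."""
    served = {name: 0 for name in weights}
    admitted = {name: 0 for name in weights}
    for _ in range(picks):
        winner = min(weights, key=lambda n: Fraction(served[n], weights[n]))
        served[winner] += cost_tokens
        admitted[winner] += 1
    return admitted


counts = simulate({"chatbot": 500, "api-batch": 50}, picks=1100)
print(counts)
```

Over a long run the admission counts settle at roughly 10:1 in favour of chatbot, even though no slot is ever reserved for either tenant.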
Global capacity is first partitioned among fairshare groups by group weight, then split evenly among tenants within each group.
For example, with OBLETH_GLOBAL_MAX_IN_FLIGHT=8 and two groups:
prod group (weight 500) → ~7 slots
dev group (weight 50) → ~1 slot
Within each group, slots are distributed evenly regardless of individual tenant weights. This lets you protect an entire team or product category as a unit.
Set with OBLETH_FAIRSHARE_ALGORITHM=hierarchical.
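One way to arrive at the 7/1 split in the example is proportional allocation with largest-remainder rounding. The rounding rule is an assumption made for illustration; the text above only states the approximate result, and `group_caps` is a hypothetical helper.

```python
def group_caps(total_slots, group_weights):
    """Split slots across groups in proportion to weight (largest-remainder rounding)."""
    total_weight = sum(group_weights.values())
    caps = {}
    remainders = {}
    for group, weight in group_weights.items():
        # Integer arithmetic keeps the floor and remainder exact.
        caps[group], remainders[group] = divmod(total_slots * weight, total_weight)
    leftover = total_slots - sum(caps.values())
    # Hand the unassigned slots to the groups with the largest fractional share.
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        caps[group] += 1
    return caps


caps = group_caps(8, {"prod": 500, "dev": 50})
print(caps)
```

The exact shares are about 7.27 and 0.73 slots, so prod floors to 7 and the single leftover slot goes to dev. This sketch does not enforce a minimum of one slot per group, so a very small weight in a larger set of groups could still round down to zero.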
A single Tokio async task owns all admission state. All requests send an Admit message and receive a Permit back. There are no locks on the hot path, no races between concurrent admission decisions, and the scheduler's view of in-flight counts is always consistent.
The permit is a Rust RAII type: dropping it automatically sends a Release message to the scheduler, freeing the slot. This means the slot is held for the full duration of the upstream call including streaming time. Concurrency accounting reflects real GPU occupancy.
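The single-owner pattern can be modelled in Python with asyncio. This is an analogue, not obleth's Rust code: the `Scheduler` class, its message tuples, and the FIFO queue order are all simplifications invented here, and a context manager stands in for the RAII permit, whose drop sends the release.

```python
import asyncio
from contextlib import asynccontextmanager


class Scheduler:
    """All admission state lives here and is mutated only by run()."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.waiting = []  # futures of queued requests (FIFO for brevity)
        self.inbox = asyncio.Queue()

    async def run(self):
        while True:
            kind, fut = await self.inbox.get()
            if kind == "stop":
                return
            if kind == "admit":
                self.waiting.append(fut)
            elif kind == "release":
                self.in_flight -= 1
            # Dispatch as many waiters as free slots allow.
            while self.waiting and self.in_flight < self.capacity:
                self.in_flight += 1
                self.waiting.pop(0).set_result(None)

    @asynccontextmanager
    async def permit(self):
        fut = asyncio.get_running_loop().create_future()
        await self.inbox.put(("admit", fut))
        await fut  # resolved by run() when a slot is granted
        try:
            yield
        finally:
            # Leaving the block plays the role of dropping the permit.
            self.inbox.put_nowait(("release", None))


async def demo():
    sched = Scheduler(capacity=2)  # created inside the running event loop
    runner = asyncio.create_task(sched.run())
    active = 0
    peak = 0

    async def worker():
        nonlocal active, peak
        async with sched.permit():
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the upstream call
            active -= 1

    await asyncio.gather(*(worker() for _ in range(5)))
    await sched.inbox.put(("stop", None))
    await runner
    return peak, sched.in_flight


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because only `run()` touches `in_flight` and `waiting`, no lock is needed, and the slot stays occupied for as long as the body of the `async with` block runs.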
Because tenant weight is just a field in Postgres and Redis, changing it takes effect on the next admission decision after the cache is invalidated. You don't need to restart obleth or reload config.
# Double the chatbot's priority during a traffic spike
curl -X PATCH http://localhost:9090/api/v1/tenants/$TID/weight \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"weight": 1000}'
The pub/sub invalidation channel propagates the change to every gateway pod's moka cache within milliseconds.
Fairshare admission via share_score means no tenant with non-zero weight can be starved indefinitely. A low-weight tenant's score grows faster per token served than a high-weight tenant's, so it wins slots less often, but every admission also raises the winner's score. Once a heavy tenant's score rises past the light tenant's, the light tenant is the one most behind its fair share and receives the next slot.
The scheduler implements a DRR-inspired policy; see DRR Algorithm for the mapping between classic deficit counters and share_score.
Idle tenants don't accumulate "credit" they can spend in a burst: the scheduler resets the served-counter context for a tenant that has been idle, so it cannot coast on a long quiet period.
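One common way to get this behaviour in fair-queuing schedulers is to raise a returning tenant's counter to the lowest score among currently active tenants. Whether obleth does exactly this is not stated here; the sketch below is an assumption, and `reactivate` and the `night-job` tenant are hypothetical names.

```python
from fractions import Fraction


def reactivate(tenant, served, weights, active):
    """Bring a returning tenant's counter up to the current active floor."""
    if active:
        floor = min(Fraction(served[t], weights[t]) for t in active)
        catch_up = int(floor * weights[tenant])
        # Never lower a counter; only remove credit banked while idle.
        served[tenant] = max(served[tenant], catch_up)
    active.add(tenant)


served = {"chatbot": 50_000, "night-job": 0}
weights = {"chatbot": 500, "night-job": 50}
active = {"chatbot"}
reactivate("night-job", served, weights, active)
print(served)
```

After the call, night-job starts level with chatbot's score instead of far behind it, so it competes fairly from that point on rather than monopolising the next run of slots.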
If a request has been queued for longer than OBLETH_BROWNOUT_WAIT_MS (default 750ms), it is admitted with Admission::Brownout — its max_tokens is capped to 256. This is a deliberate policy: the request still completes, just with a shorter answer, rather than receiving a hard 429.
See Brownout Policy for more detail.
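The effect on a request can be sketched as follows. The 750 ms default and the 256-token cap come from the text above; the request shape (a plain dict with an optional `max_tokens` key) and the `apply_brownout` function are assumptions made for illustration.

```python
BROWNOUT_MAX_TOKENS = 256


def apply_brownout(request, waited_ms, brownout_wait_ms=750):
    """Return (request, degraded); cap max_tokens if the wait exceeded the threshold."""
    if waited_ms <= brownout_wait_ms:
        return request, False
    capped = dict(request)  # leave the caller's request untouched
    requested = request.get("max_tokens", BROWNOUT_MAX_TOKENS)
    capped["max_tokens"] = min(requested, BROWNOUT_MAX_TOKENS)
    return capped, True


slow, degraded = apply_brownout({"model": "m", "max_tokens": 2048}, waited_ms=900)
print(slow, degraded)
```

A request that already asks for fewer than 256 tokens passes through unchanged apart from the degraded flag.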
The Management API exposes a real-time snapshot of the scheduler:
curl http://localhost:9090/api/v1/fairshare/live \
-H "Authorization: Bearer $TOKEN"
Returns per-tenant and per-group in_flight, queued, served_tokens, share_score, and weight_share — exactly what the control-plane dashboard displays.