Fairshare
Who gets the next slot?
When the cluster hits its global in-flight cap, new requests queue instead of failing. A single Tokio scheduler task picks the next tenant to admit, so admissions never race on a lock, and each permit stays held until the upstream response finishes streaming.
Higher priority does not mean more concurrent slots — it means that tenant gets picked next more often when the cluster is busy.
Change tenant priority in the dashboard; every gateway pod honors it on the next request — no restart.
Prod first
Priority sticks
When GPUs are scarce, important workloads keep advancing — noisy neighbors can't take over indefinitely
By team
Separate pools
Give prod, research, and sandbox their own slice of the cluster instead of one shared free-for-all
Live dial
Dashboard
Change tenant priority from the UI — no redeploy, no config reload, no maintenance window
See it
Queue visibility
Watch in-flight work, queue depth, and who is waiting — right in the dashboard
Algorithms
Hierarchical or weighted
Hierarchical
Default
Global capacity partitions among fairshare groups by group weight, then splits evenly within each group. Protect an entire team or product category as a unit.
Weighted
Alternative
All tenants compete globally. Higher priority means more throughput over time: not more reserved slots, but picked next more often when behind.
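One way to picture weighted mode is a pick rule that always admits the waiting tenant furthest behind its share. The sketch below is illustrative Python, not the gateway's code: the names pick_weighted and simulate, the served/weight ratio, and the name-based tie-break are all assumptions made for the example.

```python
from fractions import Fraction


def pick_weighted(served, weights, waiting):
    """Return the waiting tenant with the lowest served/weight ratio, or None."""
    # Assumption: tenants with zero weight are never picked by this rule.
    candidates = [t for t in waiting if weights.get(t, 0) > 0]
    if not candidates:
        return None
    # Exact fractions avoid float ties; the tenant name breaks remaining ties.
    return min(candidates, key=lambda t: (Fraction(served.get(t, 0), weights[t]), t))


def simulate(weights, picks):
    """Admit `picks` requests while every tenant always has work waiting."""
    served = {t: 0 for t in weights}
    for _ in range(picks):
        tenant = pick_weighted(served, weights, list(weights))
        if tenant is None:
            break
        served[tenant] += 1
    return served
```

Under these assumptions, simulate({"chatbot": 500, "api-batch": 50}, 550) admits chatbot 500 times and api-batch 50 times: throughput follows the weights, while neither tenant holds a reserved slot.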
Admission model
Every open slot is assigned deliberately.
A single scheduler owns queue order and in-flight counts. This example shows two tenants competing for an eight-slot pool: chatbot has weight 500, while api-batch has weight 50.
Admit
obleth identifies the tenant and policy before routing upstream.
Queue
When the pool is full, requests wait by tenant instead of racing.
Pick
The next open slot goes to the tenant most entitled to it.
Release
The slot returns when the model response finishes streaming.
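The four steps above can be sketched as a single object that owns both the per-tenant queues and the in-flight count. This is a simplified, synchronous Python illustration, not the Tokio implementation: the Scheduler class, its method names, and the least-served pick policy are hypothetical stand-ins.

```python
from collections import deque


class Scheduler:
    """Toy single-owner scheduler: one place holds queues and in-flight counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.queues = {}  # tenant -> deque of waiting request ids
        self.served = {}  # tenant -> number of requests admitted so far

    def admit(self, tenant, request_id):
        """Admit immediately if a slot is free and nobody is queued; else queue."""
        backlog = any(self.queues.values())
        if self.in_flight < self.capacity and not backlog:
            self._start(tenant)
            return "admitted"
        self.queues.setdefault(tenant, deque()).append(request_id)
        return "queued"

    def release(self):
        """Return a slot once a response has finished streaming, then dispatch."""
        self.in_flight -= 1
        return self._dispatch()

    def _dispatch(self):
        waiting = [t for t, q in self.queues.items() if q]
        if not waiting or self.in_flight >= self.capacity:
            return None
        # Placeholder policy (an assumption): the least-served tenant goes next.
        tenant = min(waiting, key=lambda t: (self.served.get(t, 0), t))
        request_id = self.queues[tenant].popleft()
        self._start(tenant)
        return tenant, request_id

    def _start(self, tenant):
        self.in_flight += 1
        self.served[tenant] = self.served.get(tenant, 0) + 1
```

Because every state change goes through one owner, queue order and slot counts cannot drift apart. In the real gateway that owner is described as a single scheduler task; here it is just a plain object driven by one caller.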
Hierarchical mode - current runtime default
Group caps keep api-batch alive under a 10x weight gap.
With an eight-slot pool and active groups weighted 500:50, the gateway allocates seven slots to chatbot and one reserved slot to the api group. api-batch still queues, but it does not vanish.
8-slot pool, example split
chatbot (group chatbot, weight 500): cap 7, queued 57
api-batch (group api, weight 50): cap 1, queued 31
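The 7 and 1 split can be reproduced with a proportional allocation that hands leftover slots to the largest remainders. The rounding rule the gateway actually uses is not stated here, so treat the largest-remainder method and the function name group_caps as assumptions for illustration.

```python
def group_caps(pool_size, group_weights):
    """Split pool_size slots across active groups in proportion to weight."""
    active = {g: w for g, w in group_weights.items() if w > 0}
    total = sum(active.values())
    if total == 0:
        return {}
    # Integer share first: floor(pool_size * weight / total).
    caps = {g: pool_size * w // total for g, w in active.items()}
    leftover = pool_size - sum(caps.values())
    # Leftover slots go to the groups with the largest fractional remainder.
    by_remainder = sorted(
        active, key=lambda g: (pool_size * active[g]) % total, reverse=True
    )
    for g in by_remainder[:leftover]:
        caps[g] += 1
    return caps
```

For an eight-slot pool with weights 500 and 50, the exact shares are about 7.27 and 0.73. Flooring gives 7 and 0, and the one leftover slot goes to the api group because its remainder is larger, which yields the 7 and 1 split shown above. Largest remainder alone does not guarantee a slot for every group in every configuration; a minimum of one slot per active group would be a separate rule.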
Inside a group
Each group gets its own fairshare queue.
After group caps reserve capacity, tenants inside the eligible group are balanced by how much work they have already received. A busy tenant cannot permanently crowd out its peers.
chatbot: served 8.6k, queued 28 (ahead)
chatbot-2: served 8.4k, queued 29 (next)
batch-job: served 9.1k, queued 12 (waits)
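The ordering in this example is consistent with a rule that picks the tenant with queued work that has been served least so far. A minimal Python sketch under that assumption follows; the function name, the dictionary shape, and the rounded counts (8.6k written as 8600, and so on) are illustrative only.

```python
def pick_within_group(tenants):
    """Pick the least-served tenant that still has requests queued, or None.

    tenants maps a tenant name to {"served": int, "queued": int}.
    """
    waiting = {name: t for name, t in tenants.items() if t["queued"] > 0}
    if not waiting:
        return None
    # Ties on served count fall back to the tenant name (an assumption).
    return min(waiting, key=lambda name: (waiting[name]["served"], name))


# Rounded figures taken from the example above.
example_group = {
    "chatbot": {"served": 8600, "queued": 28},
    "chatbot-2": {"served": 8400, "queued": 29},
    "batch-job": {"served": 9100, "queued": 12},
}
next_tenant = pick_within_group(example_group)
```

Here next_tenant is chatbot-2, the tenant marked as next above. batch-job has received the most work, so it waits until the others catch up.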
Fast
If capacity is open and no backlog exists, the request is admitted immediately.
Queued
If the pool is full, the request waits until a released permit triggers dispatch.
Rejected
The request is rejected only when a hard limit is crossed: the per-minute token budget (429) or a term budget (403).
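The three outcomes can be read as one decision function. The sketch below is a hedged Python illustration: the parameter names, the order in which the budgets are checked, and the use of >= as the threshold are assumptions, not documented gateway behavior.

```python
def admission_outcome(
    in_flight,
    capacity,
    backlog,
    minute_tokens_used,
    minute_token_budget,
    term_spent,
    term_budget,
):
    """Return (outcome, http_status) for one incoming request."""
    # Hard limits are the only reason to reject (check order is an assumption).
    if term_spent >= term_budget:
        return ("rejected", 403)
    if minute_tokens_used >= minute_token_budget:
        return ("rejected", 429)
    # Fast path: a slot is open and no one is waiting ahead of this request.
    if in_flight < capacity and backlog == 0:
        return ("fast", None)
    # Otherwise the request waits until a released permit triggers dispatch.
    return ("queued", None)
```

A full pool never produces a rejection in this model. It only moves the request from the fast path to the queue.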
Starvation-free, live-tunable.
No tenant with non-zero weight can be starved indefinitely. The dashboard shows real-time slot counts, queue depth, and group pools — tune weights live and watch admission respond.