Fairshare admission & slot control

Fairshare

Who gets the next slot?

When the cluster hits its global in-flight cap, new requests queue instead of failing. A single Tokio scheduler task picks the next tenant to admit, so there are no lock races, and each permit stays held until the upstream response has finished streaming.

Higher priority does not mean more concurrent slots — it means that tenant gets picked next more often when the cluster is busy.

Change tenant priority in the dashboard; every gateway pod honors it on the next request — no restart.

Prod first

Priority sticks

When GPUs are scarce, important workloads keep advancing — noisy neighbors can't take over indefinitely

By team

Separate pools

Give prod, research, and sandbox their own slice of the cluster instead of one shared free-for-all

Live dial

Dashboard

Change tenant priority from the UI — no redeploy, no config reload, no maintenance window

See it

Queue visibility

Watch in-flight work, queue depth, and who is waiting — right in the dashboard

Algorithms

Hierarchical or weighted

Hierarchical

Default

Global capacity is partitioned among fairshare groups by group weight, then split evenly within each group. This protects an entire team or product category as a unit.

Weighted

Alternative

All tenants compete globally. Higher priority means more throughput over time: not more reserved slots, but being picked next more often when the tenant has fallen behind its share.
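One common way to get this behavior is to track how much each tenant has been served and always pick the waiting tenant with the lowest served-to-weight ratio. The sketch below assumes that rule; it is not confirmed to be the gateway's actual algorithm, and `Tenant`, `pick_next`, and `simulate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Tenant:
    name: str
    weight: int      # priority weight; higher means picked more often
    served: int = 0  # requests admitted so far
    queued: int = 0  # requests currently waiting


def pick_next(tenants):
    """Return the waiting tenant furthest behind its weighted share, or None."""
    waiting = [t for t in tenants if t.queued > 0 and t.weight > 0]
    if not waiting:
        return None
    # Assumed rule: lowest served/weight ratio wins; the name breaks ties.
    # Fraction keeps the comparison exact instead of relying on floats.
    return min(waiting, key=lambda t: (Fraction(t.served, t.weight), t.name))


def simulate(tenants, picks):
    """Hand out `picks` slot openings one at a time and count admissions."""
    for _ in range(picks):
        chosen = pick_next(tenants)
        if chosen is None:
            break
        chosen.queued -= 1
        chosen.served += 1
    return {t.name: t.served for t in tenants}


tenants = [
    Tenant("chatbot", weight=500, queued=1000),
    Tenant("api-batch", weight=50, queued=1000),
]
result = simulate(tenants, picks=110)
```

Under this rule neither tenant holds reserved slots, yet over 110 slot openings the admissions settle at the 10:1 ratio of the weights, and the lower-weight tenant is still picked at regular intervals.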

Admission model

Every open slot is assigned deliberately.

A single scheduler owns queue order and in-flight counts. This example shows two tenants competing for an eight-slot pool: chatbot has weight 500, while api-batch has weight 50.

Admit

obleth identifies the tenant and policy before routing upstream.

Queue

When the pool is full, requests wait by tenant instead of racing.

Pick

The next open slot goes to the tenant most entitled to it.

Release

The slot returns when the model response finishes streaming.
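The four steps above can be sketched as a small asyncio program. The page describes the real scheduler as a Tokio task; this Python version only mirrors its shape, with all queue and in-flight state owned by one object. `SlotPool`, `handle`, and the least-served-first pick rule are assumptions made for illustration, and cancellation of waiting requests is not handled.

```python
import asyncio
from collections import deque


class SlotPool:
    """Toy admission pool: queue order and in-flight counts live in one place."""

    def __init__(self, size):
        self.size = size
        self.in_flight = 0
        self.queues = {}  # tenant -> deque of futures for waiting requests
        self.served = {}  # tenant -> requests admitted so far

    async def acquire(self, tenant):
        """Admit and Queue: register the request, then wait for a slot."""
        fut = asyncio.get_running_loop().create_future()
        self.queues.setdefault(tenant, deque()).append(fut)
        self.served.setdefault(tenant, 0)
        self._dispatch()
        await fut

    def release(self):
        """Release: the response finished streaming, so the slot returns."""
        self.in_flight -= 1
        self._dispatch()

    def _dispatch(self):
        """Pick: fill open slots, least-served waiting tenant first (assumed rule)."""
        while self.in_flight < self.size:
            waiting = [t for t, q in self.queues.items() if q]
            if not waiting:
                return
            tenant = min(waiting, key=lambda t: (self.served[t], t))
            self.queues[tenant].popleft().set_result(None)
            self.served[tenant] += 1
            self.in_flight += 1


async def handle(pool, tenant, log):
    await pool.acquire(tenant)
    try:
        log.append(tenant)
        await asyncio.sleep(0)  # stand-in for streaming the upstream response
    finally:
        pool.release()  # the slot is held until this point


async def demo():
    pool = SlotPool(size=1)
    log = []
    arrivals = ["a", "a", "a", "b"]
    await asyncio.gather(*(handle(pool, t, log) for t in arrivals))
    return log


order = asyncio.run(demo())
```

With a one-slot pool and arrivals a, a, a, b, tenant b is admitted second even though it arrived last, because requests wait per tenant rather than in one shared line.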

Hierarchical mode - current runtime default

Group caps keep api-batch alive under a 10x weight gap.

With an eight-slot pool and active groups weighted 500:50, the gateway allocates seven slots to chatbot and one reserved slot to the api group. api-batch still queues, but it does not vanish.

chatbot: group chatbot / weight 500 / cap 7 / queued 57

api-batch: group api / weight 50 / cap 1 / queued 31

8-slot pool, example split: chatbot 7 slots, api-batch 1 slot
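The 7/1 split in this example is consistent with a proportional share (8 × 500/550 ≈ 7.27 and 8 × 50/550 ≈ 0.73) rounded to whole slots with a floor of one slot per active group. The sketch below reproduces it under that assumption; the exact rounding rule the gateway uses is not stated here, and `group_caps` is a hypothetical helper.

```python
def group_caps(pool_size, group_weights):
    """Split pool_size slots across groups by weight, at least one slot each."""
    active = {g: w for g, w in group_weights.items() if w > 0}
    if not active:
        return {}
    if pool_size < len(active):
        raise ValueError("pool too small to reserve one slot per active group")

    total = sum(active.values())
    # Whole-slot part of each group's proportional share.
    caps = {g: pool_size * w // total for g, w in active.items()}

    # Hand leftover slots to the groups with the largest fractional remainder.
    leftover = pool_size - sum(caps.values())
    by_remainder = sorted(
        active, key=lambda g: (-(pool_size * active[g] % total), g)
    )
    for g in by_remainder[:leftover]:
        caps[g] += 1

    # Assumed floor: a group that still has zero takes one slot from the largest.
    for g in sorted(caps):
        if caps[g] == 0:
            donor = max(caps, key=lambda d: (caps[d], d))
            caps[donor] -= 1
            caps[g] = 1
    return caps


caps = group_caps(8, {"chatbot": 500, "api": 50})
```

The one-slot floor is what keeps a low-weight group from rounding down to zero capacity when the weight gap is large.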

Inside a group

Each group gets its own fairshare queue.

After group caps reserve capacity, tenants inside the eligible group are balanced by how much work they have already received. A busy tenant cannot permanently crowd out its peers.

chatbot: served 8.6k / queued 28 / ahead

chatbot-2: served 8.4k / queued 29

batch-job: served 9.1k / queued 12 / waits
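As a rough illustration of balancing inside one group, the sketch below picks the waiting tenant that has received the least work, and picks nobody while the group is at its cap. The least-served-first rule, the `pick_in_group` helper, and the tenant names and numbers are all assumptions for illustration, not the gateway's documented behavior.

```python
def pick_in_group(cap, in_flight, tenants):
    """Return the next tenant to admit from one group, or None.

    tenants maps a tenant name to {"served": int, "queued": int}.
    """
    if in_flight >= cap:
        return None  # group is at its cap; its tenants keep waiting
    waiting = {name: t for name, t in tenants.items() if t["queued"] > 0}
    if not waiting:
        return None
    # Assumed rule: least work received so far goes first; name breaks ties.
    return min(waiting, key=lambda name: (waiting[name]["served"], name))


# Hypothetical group state.
group = {
    "tenant-a": {"served": 120, "queued": 4},
    "tenant-b": {"served": 95, "queued": 6},
    "tenant-c": {"served": 80, "queued": 0},
}
next_tenant = pick_in_group(cap=2, in_flight=1, tenants=group)
```

Here tenant-c has received the least work but has nothing queued, so the pick falls to tenant-b; tenant-a, the busiest, waits its turn rather than crowding out its peers.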

Fast

If capacity is open and no backlog exists, the request is admitted immediately.

Queued

If the pool is full, the request waits until a released permit triggers dispatch.

Rejected

Only when a hard limit is crossed: the per-minute token budget (HTTP 429) or a term budget (HTTP 403).
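The three outcomes can be read as a small decision function. The sketch below is a minimal illustration, not the gateway's code: the order of the checks, the `decide` function, and all parameter names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FAST = "fast"
    QUEUED = "queued"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Decision:
    outcome: Outcome
    status: Optional[int] = None  # HTTP status, set only for rejections


def decide(in_flight, pool_size, backlog,
           minute_tokens_used, minute_token_limit,
           term_used, term_limit):
    """Classify one incoming request as fast, queued, or rejected."""
    # Assumed order: hard limits are checked before any queueing.
    if term_used >= term_limit:
        return Decision(Outcome.REJECTED, 403)
    if minute_tokens_used >= minute_token_limit:
        return Decision(Outcome.REJECTED, 429)
    # Fast path needs both an open slot and an empty backlog.
    if in_flight < pool_size and backlog == 0:
        return Decision(Outcome.FAST)
    # Otherwise the request waits for a released permit.
    return Decision(Outcome.QUEUED)


fast = decide(3, 8, 0, 100, 1000, 0, 10)
queued = decide(8, 8, 0, 100, 1000, 0, 10)
```

Note that a request arriving while a backlog exists is queued even if a slot happens to be open, so it cannot jump ahead of requests that are already waiting.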

Starvation-free, live-tunable.

No tenant with non-zero weight can be starved indefinitely. The dashboard shows real-time slot counts, queue depth, and group pools — tune weights live and watch admission respond.
