Module 02 - Fairshare admission
Slots under load
When inference saturates, obleth does not immediately reject requests with HTTP 429. A global in-flight cap bounds the pool, and fairshare admission then decides which queued request receives the next slot.
Hierarchical mode protects groups with reserved slot caps. Within an eligible group, obleth admits the tenant that has received the least work so peers keep making progress.
The diagram below combines three views in one visual model: the scheduler loop, an example eight-slot split, and the per-group tenant queue.
Admission model
Every open slot is assigned deliberately.
A single scheduler owns queue order and in-flight counts. This example shows two tenants competing for an eight-slot pool: chatbot has weight 500, while api-batch has weight 50.
Admit
obleth identifies the tenant and policy before routing upstream.
Queue
When the pool is full, requests wait in per-tenant queues instead of racing for the next slot.
Pick
The next open slot goes to the tenant most entitled to it.
Release
The slot returns to the pool when the model response finishes streaming.
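The four steps above can be sketched as a toy scheduler. This is a minimal illustration, not obleth's implementation: the `Scheduler` class, its method names, and the pick rule (lowest in-flight count per unit of weight) are assumptions made for the example.

```python
from collections import deque


class Scheduler:
    """Toy single-owner scheduler: admit, queue, pick, release."""

    def __init__(self, capacity, weights):
        self.capacity = capacity                      # global in-flight cap
        self.weights = weights                        # tenant -> weight
        self.in_flight = {t: 0 for t in weights}      # running requests per tenant
        self.queues = {t: deque() for t in weights}   # waiting requests per tenant

    def admit(self, tenant, request_id):
        """Admit: run now if a slot is open and nothing is queued, else queue."""
        backlog = any(self.queues.values())
        if sum(self.in_flight.values()) < self.capacity and not backlog:
            self.in_flight[tenant] += 1
            return "running"
        self.queues[tenant].append(request_id)
        return "queued"

    def _pick(self):
        """Pick: among tenants with queued work, choose the most entitled one.

        Assumed rule: lowest in-flight count per unit of weight.
        """
        waiting = [t for t, q in self.queues.items() if q]
        if not waiting:
            return None
        return min(waiting, key=lambda t: self.in_flight[t] / self.weights[t])

    def release(self, tenant):
        """Release: return the slot, then dispatch the next queued request."""
        self.in_flight[tenant] -= 1
        chosen = self._pick()
        if chosen is None:
            return None
        request_id = self.queues[chosen].popleft()
        self.in_flight[chosen] += 1
        return chosen, request_id
```

Because one object owns both the queues and the in-flight counts, every dispatch decision sees a consistent view of the pool. A real gateway would also need locking or a single event loop around these methods.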
Hierarchical mode - current runtime default
Group caps keep api-batch alive under a 10x weight gap.
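With an eight-slot pool and active groups weighted 500:50, a purely proportional split would leave the api group with about 0.7 of a slot (8 × 50/550). The gateway instead allocates seven slots to chatbot and one reserved slot to the api group. api-batch still queues, but it is never starved.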
chatbot
group chatbot / weight 500
cap 7 / queued 57
api-batch
group api / weight 50
cap 1 / queued 31
8-slot pool
example split
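One way to arrive at the 7/1 split is to round proportional shares while guaranteeing each active group a minimum. The sketch below is an assumption about how such a rule could work. The page does not state obleth's exact rounding rule, and `group_caps` and `min_slots` are hypothetical names.

```python
def group_caps(total_slots, weights, min_slots=1):
    """Split total_slots across groups by weight, reserving min_slots for each.

    Hypothetical rule: floor each proportional share, raise it to the reserved
    minimum, then correct the total one slot at a time.
    """
    if total_slots < min_slots * len(weights):
        raise ValueError("not enough slots to reserve the minimum for every group")

    total_weight = sum(weights.values())
    # Ideal fractional share per group, e.g. 8 * 500 / 550 = 7.27 for chatbot.
    ideal = {g: total_slots * w / total_weight for g, w in weights.items()}
    # Floor the share, but never go below the reserved minimum.
    caps = {g: max(min_slots, int(ideal[g])) for g in weights}

    # Too few slots handed out: give extras to groups furthest below their share.
    while sum(caps.values()) < total_slots:
        g = max(caps, key=lambda name: ideal[name] - caps[name])
        caps[g] += 1

    # Too many handed out (reservations overshot): take from groups most above
    # their share, never dipping below the reserved minimum.
    while sum(caps.values()) > total_slots:
        g = max(
            (name for name in caps if caps[name] > min_slots),
            key=lambda name: caps[name] - ideal[name],
        )
        caps[g] -= 1

    return caps
```

Calling `group_caps(8, {"chatbot": 500, "api": 50})` returns `{"chatbot": 7, "api": 1}`, matching the example split.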
Inside a group
Each group gets its own fairshare queue.
After group caps reserve capacity, tenants inside the eligible group are balanced by how much work they have already received. A busy tenant cannot permanently crowd out its peers.
chatbot
served 8.6k
queued 28
ahead
chatbot-2
served 8.4k
queued 29
next
batch-job
served 9.1k
queued 12
waits
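The selection inside a group can be sketched as "pick the waiting tenant with the least work served". This is an illustrative assumption, not obleth's code. The `Tenant` type and both functions are hypothetical, and `served` is counted in requests here because the page does not say which unit the dashboard uses.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    served: int  # work already received (assumed unit: requests)
    queued: int  # requests currently waiting


def next_tenant(tenants):
    """Return the waiting tenant that has received the least work, or None."""
    waiting = [t for t in tenants if t.queued > 0]
    if not waiting:
        return None
    return min(waiting, key=lambda t: t.served)


def dispatch(tenants):
    """Admit one request from the least-served waiting tenant and update counts."""
    chosen = next_tenant(tenants)
    if chosen is None:
        return None
    chosen.queued -= 1
    chosen.served += 1
    return chosen.name
```

With the figures shown above (chatbot at 8,600 served, chatbot-2 at 8,400, batch-job at 9,100), `next_tenant` selects chatbot-2, which is the tenant the diagram marks as next.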
Fast
If capacity is open and no backlog exists, the request is admitted immediately.
Queued
If the pool is full, the request waits until a released slot triggers dispatch.
Brownout
If a request has waited past the brownout threshold, it is still admitted but marked degraded.
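The three outcomes can be expressed as a small classification made at the moment a request is admitted. This is a sketch under stated assumptions: the `Path` and `Admission` types, the `classify` function, and the threshold parameter are hypothetical, and the page does not give the actual brownout threshold.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    FAST = "fast"
    QUEUED = "queued"
    BROWNOUT = "brownout"


@dataclass
class Admission:
    path: Path
    degraded: bool


def classify(was_queued, waited_s, brownout_threshold_s):
    """Label an admitted request by how it reached its slot.

    brownout_threshold_s is a hypothetical setting; the real value and its
    name are not given on this page.
    """
    if not was_queued:
        # Capacity was open and there was no backlog.
        return Admission(Path.FAST, degraded=False)
    if waited_s > brownout_threshold_s:
        # Still admitted, but flagged so later stages can react.
        return Admission(Path.BROWNOUT, degraded=True)
    return Admission(Path.QUEUED, degraded=False)
```

All three paths end in admission. Brownout only adds a degraded marker, and this page does not specify what later stages do with it.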
Control your inference economy.
The production dashboard shows real-time slot counts, queue depth, and group pools. This page mirrors that mental model so you can see how fairshare behaves before deploying.