DRR Algorithm

How classic deficit round-robin maps to obleth's share_score fairshare admission.

Product term: the control plane and GET /api/v1/fairshare/live expose share_score (served_tokens / weight). Lower score means more behind fair share and admitted next. This page explains the DRR lineage; day-to-day operations use Fairshare Engine terminology.

obleth's admission scheduler adapts Deficit Round Robin (DRR) to concurrent, token-measured requests instead of packet queues. The observable metric is share_score, not a per-tenant deficit counter.

Classic DRR

In the original DRR algorithm, each queue has a deficit counter. In each round, a queue can send up to quantum + deficit bytes. If it sends less, the remainder carries into the next round. Larger quanta produce proportionally more bandwidth while preserving starvation freedom.

obleth's adaptation

obleth applies the same fairness principle to admission. "Sending" means holding a concurrency permit for the duration of a request.

For each active tenant, obleth tracks served_tokens, the estimated token cost of admitted requests. The admission score is:

share_score = served_tokens / weight

When a concurrency slot opens, it goes to the queued tenant with the lowest share_score. A tenant with weight=500 gets roughly ten times the opportunity of one with weight=50 because its score grows ten times more slowly per token.

Hierarchical scheduling

With OBLETH_FAIRSHARE_ALGORITHM=hierarchical, obleth first partitions global concurrency among fairshare groups:

slots_for_group = floor(group_weight / sum(group_weights) * max_in_flight)

Fractional remainders are distributed largest-first so the total equals max_in_flight, and every active group gets at least one slot if capacity allows.

Within each group, slots are distributed among that group's tenants using the tenant-level fairshare score.

Worked example

max_in_flight = 8
groups: prod (weight=500), dev (weight=50)
total_weight = 550

prod: floor(8 * 500 / 550) = floor(7.27) = 7
dev:  floor(8 *  50 / 550) = floor(0.72) = 0 -> bumped to 1
sum  = 8

The min-one guarantee means every active group eventually gets capacity.

Starvation freedom

Every tenant with a non-zero weight eventually gets admitted. The served counter grows for every tenant that gets slots. A low-weight tenant whose score is lowest will be selected once higher-weight tenants have received enough service to bring their scores above it.

Idle tenants are not in the queue, so they cannot bank unused credit. When they re-enter, they start from the current virtual-time baseline.

Scheduler loop

  1. A request sends Ctl::Admit { req, respond, enqueued } to the scheduler channel.
  2. If in_flight < max_in_flight, the permit is granted immediately.
  3. Otherwise, the request is appended to the tenant's queue.
  4. When an in-flight request completes, its permit sends Ctl::Release.
  5. The scheduler finds the queued tenant with the lowest served / weight score.
  6. If the winning request waited longer than brownout_wait, it is admitted with brownout limits.
  7. The slot is granted and served[tenant] += cost is updated.

All of this runs on a single async task, so admission order is deterministic and avoids lock races.

Proving it

The benchmark harness seeds two tenants with weight 500 and 50, lets the lower-weight api-batch tenant flood first, then starts the boosted chatbot tenant while the scheduler is already under contention.

node bench/run-benchmark.mjs

The command passes when chatbot makes visible progress after joining, api-batch is not starved, and ClickHouse usage stays close to client-observed completions. See Benchmark Harness for the full walkthrough.