Token-measured Fairness

Why fairness in obleth is measured in tokens, and how the estimate-reserve-reconcile cycle works.

Most gateways count requests. obleth counts tokens. The difference matters because one inference request can consume anywhere from a few tokens to tens of thousands — a single long-context request can occupy a GPU slot 1000× longer than a short one.

The estimate-reserve-reconcile cycle

obleth can't know how many tokens a response will consume before it runs. It uses a three-step process:

1. Estimate

Before fairshare admission, obleth estimates the total token cost from the request body:

  • Input tokens: HeuristicTokenizer counts ~4 characters per token, with 4 overhead tokens per message for role/formatting. Deterministic and fast.
  • Output tokens: the max_tokens field from the request, capped at 8192. If max_tokens is not set, the estimate defaults to 512.

The total (input + estimated_output) is the request's cost, used for both the fairshare score and the Redis budget reservation.
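The estimate step can be sketched as below. This is a minimal illustration, not obleth's actual code: the constant names, the free functions, and the round-up behavior of the character count are assumptions, and messages are reduced to their text content.

```rust
// Hypothetical constants mirroring the numbers quoted in the text.
const CHARS_PER_TOKEN: usize = 4;
const PER_MESSAGE_OVERHEAD: u32 = 4;
const DEFAULT_OUTPUT_TOKENS: u32 = 512;
const MAX_OUTPUT_TOKENS: u32 = 8192;

/// Illustrative cost estimate; the real CostEstimate type may differ.
#[derive(Debug, PartialEq)]
pub struct CostEstimate {
    pub input_tokens: u32,
    pub output_tokens: u32,
}

impl CostEstimate {
    /// Total cost used for the fairshare score and the budget reservation.
    pub fn total(&self) -> u32 {
        self.input_tokens.saturating_add(self.output_tokens)
    }
}

/// Roughly 4 characters per token. Rounding up is an assumption.
pub fn count_text(text: &str) -> u32 {
    let chars = text.chars().count();
    ((chars + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN) as u32
}

/// `messages` holds the text content of each chat message.
pub fn estimate(messages: &[&str], max_tokens: Option<u32>) -> CostEstimate {
    // Each message pays its text cost plus a fixed role/formatting overhead.
    let input_tokens: u32 = messages
        .iter()
        .map(|content| count_text(content) + PER_MESSAGE_OVERHEAD)
        .sum();
    // Use max_tokens when present (capped), otherwise the default.
    let output_tokens = max_tokens
        .unwrap_or(DEFAULT_OUTPUT_TOKENS)
        .min(MAX_OUTPUT_TOKENS);
    CostEstimate { input_tokens, output_tokens }
}
```

Under these assumptions, a single 13-character message with no max_tokens costs 4 + 4 input tokens plus 512 output tokens, for a total of 520.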

2. Reserve (atomic, cross-pod)

After admission but before the upstream call, a Redis Lua script atomically:

  1. Reads the tenant's token bucket: tokens + last-refill timestamp ts.
  2. Refills: tokens = min(capacity, tokens + (now_ms - ts) × rate) where rate = tokens_per_minute / 60000.
  3. Checks if tokens >= estimated_cost.
  4. If yes: subtracts the estimate, stores the new state, returns allowed=1.
  5. If no: returns allowed=0, and the request is rejected with 429 token budget exceeded.

The script runs in a single Redis round-trip. Because Redis executes a Lua script atomically, the check-and-subtract is correct across all pods with no application-level locking.
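The logic inside the script can be modeled in memory as follows. This is only a sketch of the refill-then-check arithmetic in Rust, not the Lua script itself: the Bucket type and reserve function are hypothetical names, and writing back the refilled state on a denied request is an assumption the text does not confirm.

```rust
/// In-memory stand-in for the per-tenant state kept in Redis.
#[derive(Debug, Clone, PartialEq)]
pub struct Bucket {
    pub tokens: f64,
    pub ts_ms: u64, // last-refill timestamp in milliseconds
}

/// Returns true when the reservation is admitted (allowed=1).
pub fn reserve(
    bucket: &mut Bucket,
    now_ms: u64,
    tokens_per_minute: f64,
    estimated_cost: f64,
) -> bool {
    let capacity = tokens_per_minute; // burst ceiling equals the per-minute budget
    let rate = tokens_per_minute / 60_000.0; // tokens per millisecond

    // Refill for the elapsed time, never exceeding capacity.
    let elapsed_ms = now_ms.saturating_sub(bucket.ts_ms) as f64;
    bucket.tokens = (bucket.tokens + elapsed_ms * rate).min(capacity);
    bucket.ts_ms = bucket.ts_ms.max(now_ms);

    // Check and subtract in one step; in Redis the script makes this atomic.
    if bucket.tokens >= estimated_cost {
        bucket.tokens -= estimated_cost;
        true
    } else {
        false // the caller would answer with 429
    }
}
```

In this model a 60,000 tokens-per-minute tenant refills at exactly 1 token per millisecond, so a bucket drained to 10,000 tokens needs another 10 seconds before a 20,000-token reservation is admitted.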

3. Reconcile

After the stream finishes, obleth reads actual token counts from the upstream's usage field and runs a second Lua script:

delta = estimated_output_tokens - actual_output_tokens
bucket_tokens = min(capacity, bucket_tokens + delta)

Positive delta (over-reserved) refunds tokens. Negative delta (under-reserved) charges more. The bucket can briefly go negative (bounded by −capacity) and is paid back over subsequent requests.
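A sketch of the reconcile arithmetic, assuming integer token counts. The function name is hypothetical, and the explicit clamp at -capacity is one reading of "bounded by −capacity"; the actual Lua script may enforce that bound differently.

```rust
/// Applies the refund or extra charge after the stream finishes.
/// Returns the new bucket level.
pub fn reconcile(
    bucket_tokens: i64,
    capacity: i64,
    estimated_output_tokens: i64,
    actual_output_tokens: i64,
) -> i64 {
    // Positive delta: over-reserved, tokens are refunded.
    // Negative delta: under-reserved, more tokens are charged.
    let delta = estimated_output_tokens - actual_output_tokens;
    (bucket_tokens + delta)
        .min(capacity) // a refund never lifts the bucket above capacity
        .max(-capacity) // assumed lower bound on the debt
}
```

For example, a request that reserved 512 output tokens but produced 100 returns 412 tokens to the bucket, while one that produced 900 costs an extra 388 and can push a nearly empty bucket below zero.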

Why estimation accuracy matters less than you'd think

The estimate only affects admission ordering and budget reservation — not billing. Billing always uses reconciled actual cost. Slight estimation error just means the reservation is off, which reconciliation corrects. The heuristic is good enough for English text; for dense code or CJK content you may want a real BPE tokenizer.

Pluggable tokenizer

The Tokenizer trait in obleth-tokenizer is the seam:

pub trait Tokenizer: Send + Sync {
    fn count_text(&self, text: &str) -> u32;
    fn estimate_request(&self, body: &Value) -> CostEstimate { ... }
}

Drop in a tiktoken-rs or HuggingFace tokenizers implementation for model-accurate counting without changing anything else in the pipeline.
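To illustrate the seam, here is a self-contained sketch that uses a reduced copy of the trait. It keeps only count_text, because estimate_request depends on Value and CostEstimate, which are not reproduced here. Both implementations and the Pipeline type are hypothetical; NonAsciiAwareTokenizer is a toy heuristic standing in for a real BPE tokenizer, not an accurate one.

```rust
use std::sync::Arc;

/// Reduced copy of the trait for illustration (count_text only).
pub trait Tokenizer: Send + Sync {
    fn count_text(&self, text: &str) -> u32;
}

/// The ~4 characters per token heuristic described earlier.
pub struct FourCharHeuristic;

impl Tokenizer for FourCharHeuristic {
    fn count_text(&self, text: &str) -> u32 {
        ((text.chars().count() + 3) / 4) as u32
    }
}

/// Toy replacement: one token per non-ASCII character,
/// 4 characters per token for the ASCII remainder.
pub struct NonAsciiAwareTokenizer;

impl Tokenizer for NonAsciiAwareTokenizer {
    fn count_text(&self, text: &str) -> u32 {
        let non_ascii = text.chars().filter(|c| !c.is_ascii()).count();
        let ascii = text.chars().count() - non_ascii;
        (non_ascii + (ascii + 3) / 4) as u32
    }
}

/// Hypothetical consumer that only knows about the trait object.
pub struct Pipeline {
    pub tokenizer: Arc<dyn Tokenizer>,
}

impl Pipeline {
    pub fn input_cost(&self, text: &str) -> u32 {
        self.tokenizer.count_text(text)
    }
}
```

Swapping the Arc<dyn Tokenizer> held by the consumer changes the count without touching the rest of the code: for the five-character string "こんにちは", the 4-character heuristic reports 2 tokens while the toy replacement reports 5.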

Token bucket parameters

Field                             Source                      Effect
tokens_per_minute                 Tenant.tokens_per_minute    Sustained throughput budget per tenant
Bucket capacity (burst ceiling)   same as tokens_per_minute   Maximum tokens held at once
Refill rate                       tokens_per_minute / 60000   Tokens added per millisecond

A tenant with tokens_per_minute=2000000 can burst up to 2M tokens, then refills at about 33.3 tokens/ms (roughly 33k tokens per second). Adjust via PUT /api/v1/tenants/{id}/quota.
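The refill arithmetic can be checked with a small helper. Both function names are hypothetical, and the wait-time calculation assumes the bucket only changes through refill while waiting.

```rust
/// Tokens added per millisecond: tokens_per_minute / 60000.
pub fn refill_rate_per_ms(tokens_per_minute: u64) -> f64 {
    tokens_per_minute as f64 / 60_000.0
}

/// Milliseconds until `needed` tokens are available, or None when the
/// request can never fit (it exceeds capacity, or the budget is zero).
pub fn ms_until_available(current: f64, needed: f64, tokens_per_minute: u64) -> Option<u64> {
    let capacity = tokens_per_minute as f64; // capacity equals tokens_per_minute
    if tokens_per_minute == 0 || needed > capacity {
        return None;
    }
    if current >= needed {
        return Some(0);
    }
    let deficit = needed - current;
    // deficit / rate, written as deficit * 60000 / tokens_per_minute.
    Some((deficit * 60_000.0 / capacity).ceil() as u64)
}
```

With tokens_per_minute=2000000 the rate is about 33.3 tokens per millisecond, and an empty bucket takes 60,000 ms (one minute) to return to its full 2M burst capacity.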