Why fairness in obleth is measured in tokens, and how the estimate-reserve-reconcile cycle works.
Most gateways count requests. obleth counts tokens. The difference matters because one inference request can consume anywhere from a few tokens to tens of thousands — a single long-context request can occupy a GPU slot 1000× longer than a short one.
obleth can't know how many tokens a response will consume before it runs. It uses a three-step process:
Before fairshare admission, obleth estimates the total token cost from the request body:
- Input: HeuristicTokenizer counts ~4 characters per token, plus 4 overhead tokens per message for role/formatting. Deterministic and fast.
- Estimated output: the max_tokens field from the request, capped at 8192. The default ceiling is 512 if max_tokens is not set.

The total (input + estimated_output) is the request's cost, used for both the fairshare score and the Redis budget reservation.
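The estimate described above can be sketched in a few lines of Rust. This is a minimal illustration, not obleth's actual HeuristicTokenizer: the function names are hypothetical, messages are reduced to plain content strings, and rounding up on the 4-characters-per-token division is an assumption.

```rust
const CHARS_PER_TOKEN: u32 = 4; // "~4 characters per token"
const PER_MESSAGE_OVERHEAD: u32 = 4; // role/formatting overhead per message
const MAX_OUTPUT_CAP: u32 = 8192; // cap applied to max_tokens
const DEFAULT_OUTPUT: u32 = 512; // ceiling used when max_tokens is absent

/// Estimated input tokens for a list of message contents.
/// Rounding up is an assumption; the real heuristic may round differently.
fn estimate_input(messages: &[&str]) -> u32 {
    messages
        .iter()
        .map(|content| {
            let chars = content.chars().count() as u32;
            (chars + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN + PER_MESSAGE_OVERHEAD
        })
        .sum()
}

/// Estimated output tokens: max_tokens capped at 8192, or 512 if unset.
fn estimate_output(max_tokens: Option<u32>) -> u32 {
    max_tokens.unwrap_or(DEFAULT_OUTPUT).min(MAX_OUTPUT_CAP)
}

/// Total cost used for the fairshare score and the budget reservation.
fn estimate_total(messages: &[&str], max_tokens: Option<u32>) -> u32 {
    estimate_input(messages) + estimate_output(max_tokens)
}

fn main() {
    let messages = ["You are terse.", "Summarize this paragraph in one line."];
    // 14 chars -> 4 tokens + 4 overhead; 37 chars -> 10 tokens + 4 overhead.
    println!("input = {}", estimate_input(&messages));
    println!("total (no max_tokens) = {}", estimate_total(&messages, None));
    println!("total (max_tokens=20000) = {}", estimate_total(&messages, Some(20_000)));
}
```

Under these assumptions, a request that omits max_tokens is always charged at least 512 tokens up front, regardless of how short its prompt is.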
After admission but before the upstream call, a Redis Lua script atomically:
1. Reads the bucket's tokens and last-refill timestamp ts.
2. Refills: tokens = min(capacity, tokens + (now_ms - ts) × rate), where rate = tokens_per_minute / 60000.
3. Checks tokens >= estimated_cost.
4. If the check passes, deducts estimated_cost and returns allowed=1.
5. Otherwise returns allowed=0 → 429 token budget exceeded.

The script is a single Redis round-trip and is correct across all pods with no application-level locking.
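The refill-then-reserve logic can be modelled in memory to make the arithmetic concrete. The sketch below is not the Lua script itself: the Bucket struct and try_reserve function are hypothetical names, the state lives in a local struct rather than Redis, and it assumes the timestamp is advanced even when the request is denied. It also does not model the atomicity that the single-script round-trip provides.

```rust
/// In-memory stand-in for the per-tenant state kept in Redis.
struct Bucket {
    tokens: f64,
    ts_ms: u64, // last-refill timestamp in milliseconds
}

/// Refill the bucket, then try to reserve `estimated_cost` tokens.
/// Returns true for allowed=1 and false for allowed=0 (which maps to HTTP 429).
fn try_reserve(
    bucket: &mut Bucket,
    capacity: f64,
    tokens_per_minute: f64,
    now_ms: u64,
    estimated_cost: f64,
) -> bool {
    let rate = tokens_per_minute / 60_000.0; // tokens per millisecond
    let elapsed_ms = now_ms.saturating_sub(bucket.ts_ms) as f64;
    bucket.tokens = (bucket.tokens + elapsed_ms * rate).min(capacity);
    bucket.ts_ms = now_ms; // assumption: updated on both outcomes

    if bucket.tokens >= estimated_cost {
        bucket.tokens -= estimated_cost;
        true
    } else {
        false
    }
}

fn main() {
    // 60_000 tokens/minute is exactly 1 token per millisecond.
    let (capacity, tpm) = (60_000.0, 60_000.0);
    let mut bucket = Bucket { tokens: 1_000.0, ts_ms: 0 };

    // 1000 available, reserve 800: allowed, 200 left.
    println!("{}", try_reserve(&mut bucket, capacity, tpm, 0, 800.0));
    // 100 ms later the bucket holds 300, which is less than 500: denied.
    println!("{}", try_reserve(&mut bucket, capacity, tpm, 100, 500.0));
    // 200 ms after that it holds 500: allowed, 0 left.
    println!("{}", try_reserve(&mut bucket, capacity, tpm, 300, 500.0));
}
```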
After the stream finishes, obleth reads actual token counts from the upstream's usage field and runs a second Lua script:

    delta = estimated_output_tokens - actual_output_tokens
    bucket_tokens = min(capacity, bucket_tokens + delta)

A positive delta (over-reserved) refunds tokens to the bucket. A negative delta (under-reserved) charges more. The bucket can briefly go negative (bounded by −capacity); the deficit is paid back over subsequent requests.
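A small worked example may help with the sign conventions. The function below is a hypothetical in-memory version of the reconciliation step, not the second Lua script. The upper bound comes from the formula above; enforcing the lower bound as a simple clamp at −capacity is an assumption about how that bound is applied.

```rust
/// Apply the reconciliation delta to a bucket balance.
/// Signed integers are used because the balance may go negative.
fn reconcile(
    bucket_tokens: i64,
    capacity: i64,
    estimated_output_tokens: i64,
    actual_output_tokens: i64,
) -> i64 {
    let delta = estimated_output_tokens - actual_output_tokens;
    (bucket_tokens + delta)
        .min(capacity) // never refund past the burst ceiling
        .max(-capacity) // assumption: debt is clamped at -capacity
}

fn main() {
    let capacity = 10_000;

    // Over-reserved: estimated 512, used 100, so 412 tokens are refunded.
    println!("{}", reconcile(1_000, capacity, 512, 100)); // 1412

    // Under-reserved: estimated 512, used 900, so 388 more are charged.
    println!("{}", reconcile(100, capacity, 512, 900)); // -288

    // A refund cannot push the balance above capacity.
    println!("{}", reconcile(9_900, capacity, 512, 12)); // 10000
}
```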
The estimate only affects admission ordering and budget reservation — not billing. Billing always uses reconciled actual cost. Slight estimation error just means the reservation is off, which reconciliation corrects. The heuristic is good enough for English text; for dense code or CJK content you may want a real BPE tokenizer.
The Tokenizer trait in obleth-tokenizer is the seam:

    pub trait Tokenizer: Send + Sync {
        fn count_text(&self, text: &str) -> u32;
        fn estimate_request(&self, body: &Value) -> CostEstimate { ... }
    }

Drop in a tiktoken-rs or HuggingFace tokenizers implementation for model-accurate counting without changing anything else in the pipeline.
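To show the shape of an alternative implementation without pulling in an external crate, here is a sketch with a reduced local copy of the trait. estimate_request is left out because Value and CostEstimate are not defined here. CjkAwareTokenizer is a hypothetical stand-in, not a real BPE tokenizer: it assumes one token per CJK character and ~4 characters per token for everything else, which is only a rough approximation.

```rust
/// Reduced local copy of the trait for illustration only.
pub trait Tokenizer: Send + Sync {
    fn count_text(&self, text: &str) -> u32;
}

/// Hypothetical tokenizer that treats CJK characters as one token each.
pub struct CjkAwareTokenizer;

fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF      // Hiragana and Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

impl Tokenizer for CjkAwareTokenizer {
    fn count_text(&self, text: &str) -> u32 {
        let cjk = text.chars().filter(|c| is_cjk(*c)).count() as u32;
        let other = text.chars().filter(|c| !is_cjk(*c)).count() as u32;
        // Non-CJK text keeps the ~4 chars/token heuristic, rounded up.
        cjk + (other + 3) / 4
    }
}

fn main() {
    // Used through a trait object, as a pipeline holding `dyn Tokenizer` might.
    let tokenizer: Box<dyn Tokenizer> = Box::new(CjkAwareTokenizer);
    println!("{}", tokenizer.count_text("hello world")); // 11 chars -> 3
    println!("{}", tokenizer.count_text("こんにちは世界")); // 7 CJK chars -> 7
}
```

A plain 4-characters-per-token heuristic would count the second string as 2 tokens, which illustrates why CJK-heavy traffic tends to be under-reserved and then corrected at reconciliation.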
| Field | Source | Effect |
|---|---|---|
| tokens_per_minute | Tenant.tokens_per_minute | Sustained throughput budget per tenant |
| Bucket capacity (burst ceiling) | same as tokens_per_minute | Maximum tokens held at once |
| Refill rate | tokens_per_minute / 60000 | Tokens added per millisecond |
A tenant with tokens_per_minute=2000000 can burst up to 2M tokens, then refills at about 33 tokens/ms (roughly 33k tokens per second). Adjust via PUT /api/v1/tenants/{id}/quota.
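The refill arithmetic for a given quota can be checked with a short calculation. The helper names below are hypothetical; the only inputs taken from the table are the tokens_per_minute / 60000 refill rate and the capacity being equal to tokens_per_minute.

```rust
/// Refill rate in tokens per millisecond for a given quota.
fn refill_per_ms(tokens_per_minute: u64) -> f64 {
    tokens_per_minute as f64 / 60_000.0
}

/// Milliseconds needed to refill `deficit_tokens` at the quota's rate.
fn ms_to_refill(deficit_tokens: u64, tokens_per_minute: u64) -> f64 {
    deficit_tokens as f64 / refill_per_ms(tokens_per_minute)
}

fn main() {
    let tpm = 2_000_000;
    // About 33.3 tokens per millisecond, i.e. about 33,333 per second.
    println!("{:.1} tokens/ms", refill_per_ms(tpm));
    // Because capacity equals tokens_per_minute, an empty bucket takes
    // about one minute to refill completely.
    println!("{:.0} ms to refill from empty", ms_to_refill(tpm, tpm));
}
```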