How to configure tokens_per_minute (TPM) and per-tenant max_in_flight caps, and what each one controls.
obleth has two independent throttling mechanisms per tenant: a token budget (TPM) and an optional concurrency cap (max_in_flight). They operate at different layers and can be used together.
tokens_per_minute is a token-bucket rate limit. It caps how many tokens a tenant can consume per minute on a sustained basis, while still allowing short bursts.
Token bucket mechanics:
- Capacity: tokens_per_minute
- Refill rate: tokens_per_minute / 60000 tokens per millisecond
- Burst: up to tokens_per_minute tokens in a single burst

When the bucket is empty, the next request that would exceed it gets 429 token budget exceeded. The fairshare permit is released and the request is not proxied.
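The mechanics above can be modeled with a few lines of shell arithmetic. This is only an illustrative in-memory sketch: the names try_consume, bucket, last_ms, and result are hypothetical, the per-request token cost is assumed to be known up front, and obleth's actual enforcement may differ in its details.

```sh
# Hypothetical in-memory model of a token bucket; not obleth's real code.
TPM=120000      # tokens_per_minute: bucket capacity and per-minute refill
bucket=$TPM     # assumption: the bucket starts full, so a full burst is allowed
last_ms=0       # timestamp of the previous update, in milliseconds

# try_consume NOW_MS COST
# Refills the bucket for the elapsed time, then sets `result` to "ok"
# (tokens deducted) or "429" (bucket left as refilled).
try_consume() {
  tc_now=$1
  tc_cost=$2
  # Refill at tokens_per_minute / 60000 tokens per elapsed millisecond.
  # Integer division drops fractional tokens, which is fine for a sketch.
  tc_refill=$(( (tc_now - last_ms) * TPM / 60000 ))
  bucket=$(( bucket + tc_refill ))
  # Never hold more than the capacity.
  if [ "$bucket" -gt "$TPM" ]; then
    bucket=$TPM
  fi
  last_ms=$tc_now
  if [ "$tc_cost" -le "$bucket" ]; then
    bucket=$(( bucket - tc_cost ))
    result=ok
  else
    result=429
  fi
}
```

In this model, spending the full 120,000 tokens at time 0 empties the bucket, a further request at the same instant is rejected, and one second later 2,000 tokens (120000 / 60) are available again.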
Create a tenant with a TPM budget:
curl -X POST http://localhost:9090/api/v1/tenants \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "chatbot",
"weight": 500,
"tokens_per_minute": 2000000
}'
Update the TPM budget of an existing tenant:
curl -X PUT http://localhost:9090/api/v1/tenants/$TID/quota \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"tokens_per_minute": 5000000,
"max_in_flight": null
}'
| Use case | Suggested TPM |
|---|---|
| Internal dev/test tenant | 50,000–200,000 |
| Single chat user | 100,000–500,000 |
| Production chatbot service | 1,000,000–5,000,000 |
| Batch processing job | 500,000–2,000,000 |
A 70B model at typical throughput consumes ~1,000–3,000 tokens/second per concurrent request.
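As a rough sizing aid, the per-request rate above can be converted into a TPM estimate by multiplying by the expected concurrency and by 60 seconds. The helper name tpm_estimate and the concurrency of 10 are illustrative assumptions, not obleth defaults.

```sh
# tpm_estimate CONCURRENCY TOKENS_PER_SECOND
# Prints the sustained tokens per minute that CONCURRENCY simultaneous
# requests would consume at the given per-request rate.
tpm_estimate() {
  echo $(( $1 * $2 * 60 ))
}

# Hypothetical tenant running 10 concurrent requests:
tpm_estimate 10 1000   # low end of the per-request range
tpm_estimate 10 3000   # high end of the per-request range
```

For that hypothetical tenant the two calls print 600000 and 1800000, so a tokens_per_minute in that range would roughly match its sustained load.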
max_in_flight is an optional hard cap on how many requests a single tenant can have in flight simultaneously, regardless of the global limit.
With max_in_flight=null (default), a tenant is only constrained by the global OBLETH_GLOBAL_MAX_IN_FLIGHT and the fairshare scheduler. Setting it adds an additional per-tenant ceiling.
Set a per-tenant cap:
curl -X PUT http://localhost:9090/api/v1/tenants/$TID/quota \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"tokens_per_minute": 2000000,
"max_in_flight": 10
}'
Remove the cap:
curl -X PUT http://localhost:9090/api/v1/tenants/$TID/quota \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"tokens_per_minute": 2000000, "max_in_flight": null}'
The limits are enforced independently, and every request passes through each of the following checks:
| Check | Where | Error if failed |
|---|---|---|
| Global concurrency | Fairshare scheduler | Queued (not rejected) |
| Per-tenant max_in_flight | Fairshare scheduler | Queued (not rejected) |
| Token budget (tokens_per_minute) | Redis Lua | 429 token budget exceeded |
Note: concurrency limits queue the request (it may get admitted later); the TPM limit is a hard rejection.
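The difference between queueing and rejection can be sketched as a small decision function. This is a simplified model, not obleth's scheduler: check_request and its arguments are hypothetical names, the global concurrency check is omitted because it behaves like the per-tenant one, and the ordering (concurrency before token budget) is an assumption inferred from the fairshare permit being released when the TPM check fails.

```sh
# check_request IN_FLIGHT MAX_IN_FLIGHT BUCKET COST
# Prints "queued", "rejected", or "admitted".
# MAX_IN_FLIGHT may be the literal string "null" for no per-tenant cap.
check_request() {
  cr_in_flight=$1
  cr_max=$2
  cr_bucket=$3
  cr_cost=$4
  if [ "$cr_max" != "null" ] && [ "$cr_in_flight" -ge "$cr_max" ]; then
    # Concurrency cap reached: the request waits and may be admitted later.
    echo queued
  elif [ "$cr_cost" -gt "$cr_bucket" ]; then
    # Token budget exceeded: hard rejection (429), no waiting.
    echo rejected
  else
    echo admitted
  fi
}

check_request 10 10 5000 100    # tenant already at its cap
check_request 3 null 50 100     # no cap, but too few tokens left
check_request 3 10 5000 100     # under the cap, enough tokens
```

The three calls print queued, rejected, and admitted respectively.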
Query ClickHouse for per-tenant token usage and rejections over the last hour:
SELECT
tenant_id,
sum(input_tokens + output_tokens) AS tokens_used,
count() AS requests,
countIf(admission = 'rejected') AS rejected
FROM usage
WHERE ts_ms > (toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000)
GROUP BY tenant_id
ORDER BY tokens_used DESC
Or use the Management API:
curl "http://localhost:9090/api/v1/usage?since_ms=$(date -d '1 hour ago' +%s000)" \
-H "Authorization: Bearer $TOKEN" | jq .