Capacity Provider

The CapacityProvider trait that sets obleth's global concurrency budget, and how to tune or replace the static v1 implementation.

The CapacityProvider trait is the seam between the fairshare scheduler and the cluster's actual capacity. It answers one question: how many requests may be in flight right now?

The trait

pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

The fairshare scheduler calls this every time it considers granting a permit. The result is the global concurrency ceiling.
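Because the scheduler consults the provider on every permit decision, an implementation should return quickly and avoid blocking. As a minimal sketch of what a replacement provider could look like, the wrapper below bounds another provider's answer between a floor and a ceiling. `Fixed` and `ClampedCapacity` are hypothetical names used for illustration; they are not part of obleth.

```rust
/// The trait as documented above, repeated so the sketch compiles on its own.
pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

/// Hypothetical constant provider, used only to drive the example.
pub struct Fixed(pub usize);

impl CapacityProvider for Fixed {
    fn max_in_flight(&self) -> usize {
        self.0
    }
}

/// Hypothetical wrapper: clamps the inner provider's answer into [floor, ceiling].
pub struct ClampedCapacity<P: CapacityProvider> {
    inner: P,
    floor: usize,
    ceiling: usize,
}

impl<P: CapacityProvider> ClampedCapacity<P> {
    pub fn new(inner: P, floor: usize, ceiling: usize) -> Self {
        // usize::clamp panics if min > max, so reject that up front.
        assert!(floor <= ceiling, "floor must not exceed ceiling");
        Self { inner, floor, ceiling }
    }
}

impl<P: CapacityProvider> CapacityProvider for ClampedCapacity<P> {
    fn max_in_flight(&self) -> usize {
        // Cheap and non-blocking: one delegated call plus a comparison.
        self.inner.max_in_flight().clamp(self.floor, self.ceiling)
    }
}
```

A wrapper like this keeps a dynamic provider from collapsing the limit to zero or growing it past what the backend has been tested with.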

v1: StaticCapacity

The v1 implementation is a fixed limit backed by an atomic integer:

pub struct StaticCapacity {
    max: AtomicUsize,
}

impl StaticCapacity {
    pub fn set(&self, max: usize) { ... }  // runtime-tunable
}
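The body of `set` is elided above. One plausible way to fill it in, shown here only as a sketch, is a plain atomic store paired with an atomic load in `max_in_flight`. The `new` constructor and the choice of `Ordering::Relaxed` are assumptions made for this example, not details confirmed by the obleth source.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// The trait as documented above, repeated so the sketch compiles on its own.
pub trait CapacityProvider: Send + Sync + 'static {
    fn max_in_flight(&self) -> usize;
}

pub struct StaticCapacity {
    max: AtomicUsize,
}

impl StaticCapacity {
    /// Hypothetical constructor; the real one is not shown in this page.
    pub fn new(max: usize) -> Self {
        Self { max: AtomicUsize::new(max) }
    }

    /// Runtime-tunable: takes `&self`, so it works through a shared reference.
    pub fn set(&self, max: usize) {
        // Assumption: Relaxed is enough because the limit is a standalone
        // value and no other memory is published along with it.
        self.max.store(max, Ordering::Relaxed);
    }
}

impl CapacityProvider for StaticCapacity {
    fn max_in_flight(&self) -> usize {
        self.max.load(Ordering::Relaxed)
    }
}
```

Because `set` takes `&self`, a handler can update the limit on the same instance the scheduler is reading, without a lock.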

Set it at startup:

OBLETH_GLOBAL_MAX_IN_FLIGHT=256

Or tune it live without restarting:

curl -X PUT http://localhost:9180/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 300}'

Read the current value:

curl http://localhost:9180/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN"
# {"max_in_flight": 300}

Sizing the limit

max_in_flight should approximately match the number of concurrent requests your inference backend can serve without queuing. Practical starting points:

Backend	Suggested starting point
Single vLLM instance (A100, bf16, 7B model)	32–64
Aibrix cluster, 4 replicas, 70B model	16–32 per obleth pod
Benchmark fixture backend (dev/demo)	64–256

If you see high obleth_queue_depth and a growing share of queued admissions, the limit is too low for your offered load. If requests are rarely queued but GPU utilization is low, you may have room to increase it.
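If you have measured your backend, Little's Law gives a rough first estimate: average concurrency equals throughput multiplied by average latency. The helper below is an illustrative sketch, not an obleth API; the function name is hypothetical, and the result is a steady-state average that you should still validate against the queue and GPU signals described above.

```rust
/// Hypothetical helper: estimates a starting `max_in_flight` from measured
/// backend throughput (requests per second) and mean request latency (seconds),
/// using Little's Law: concurrency = throughput * latency.
pub fn estimate_max_in_flight(throughput_rps: f64, mean_latency_secs: f64) -> usize {
    let concurrency = throughput_rps * mean_latency_secs;
    // Round up, and never suggest a limit below 1.
    concurrency.ceil().max(1.0) as usize
}
```

For example, a backend that sustains 8 requests per second at a mean latency of 4 seconds holds about 32 requests in flight, so 32 is a reasonable first value to try. These figures are made up for illustration.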

Per-model max_in_flight

In addition to the global CapacityProvider limit, each model route can carry an optional max_in_flight slot cap. Admission uses the minimum of the global ceiling, the per-model slot cap, tenant/group limits, and the fairshare allocation, so the tightest bound wins.
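The "minimum of all limits" rule can be sketched as follows. This is an illustration of the arithmetic only: the `AdmissionLimits` struct and its field names are hypothetical, and how obleth represents the fairshare bound internally is an assumption.

```rust
/// Hypothetical bundle of the bounds that apply to one admission decision.
pub struct AdmissionLimits {
    /// Global ceiling from the CapacityProvider.
    pub global: usize,
    /// Optional per-model slot cap; None means no cap is configured.
    pub per_model: Option<usize>,
    /// Optional tenant/group limit; None means no limit is configured.
    pub tenant_or_group: Option<usize>,
    /// Bound derived from fairshare (assumed here to be a plain number).
    pub fairshare: usize,
}

impl AdmissionLimits {
    /// The tightest bound wins; unset optional limits are ignored.
    pub fn effective(&self) -> usize {
        let mut limit = self.global.min(self.fairshare);
        if let Some(model_cap) = self.per_model {
            limit = limit.min(model_cap);
        }
        if let Some(tenant_cap) = self.tenant_or_group {
            limit = limit.min(tenant_cap);
        }
        limit
    }
}
```

With a global ceiling of 256 and a per-model cap of 32, the model is limited to 32 no matter how much global headroom remains.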

Set slots manually:

curl -X PUT http://localhost:9180/api/v1/models/$MODEL_ID/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 32}'

For self-hosted chat and embedding backends, use the auto-tune ramp probe to find the upstream knee instead of guessing. The probe is recommend-only; apply the result with POST …/autotune/apply or the control-plane Auto-tune panel. Cloud models should stay on capacity_mode: static with a conservative manual cap. See Capacity Auto-tune.

Future: metrics-driven capacity

The trait is the seam for a MetricsCapacity implementation that reads live signals from the inference backend:

// planned, not yet implemented
pub struct MetricsCapacity {
    aibrix_client: AibrixMetricsClient,
}

impl CapacityProvider for MetricsCapacity {
    fn max_in_flight(&self) -> usize {
        // Read Aibrix queue depth + KV-cache utilization
        // Return a dynamic limit that shrinks when the backend is under pressure
        self.compute_safe_limit()
    }
}

Swapping in a MetricsCapacity changes only the CapacityProvider binding; the fairshare scheduler and the rest of the pipeline are unchanged. This is by design.
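To illustrate why only the binding changes, the sketch below holds the provider behind `Arc<dyn CapacityProvider>` and swaps implementations without touching the caller. `Gate`, `Fixed`, and `PressureAware` are hypothetical stand-ins invented for this example; they are not the real scheduler or the planned MetricsCapacity.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// The trait as documented above, repeated so the sketch compiles on its own.
pub trait CapacityProvider: Send + Sync + 'static {
    fn max_in_flight(&self) -> usize;
}

/// Hypothetical constant provider.
pub struct Fixed(pub usize);

impl CapacityProvider for Fixed {
    fn max_in_flight(&self) -> usize {
        self.0
    }
}

/// Hypothetical dynamic provider: halves its base limit while a pressure flag is set.
pub struct PressureAware {
    base: usize,
    under_pressure: AtomicBool,
}

impl PressureAware {
    pub fn new(base: usize) -> Self {
        PressureAware { base, under_pressure: AtomicBool::new(false) }
    }

    pub fn set_pressure(&self, on: bool) {
        self.under_pressure.store(on, Ordering::Relaxed);
    }
}

impl CapacityProvider for PressureAware {
    fn max_in_flight(&self) -> usize {
        if self.under_pressure.load(Ordering::Relaxed) {
            self.base / 2
        } else {
            self.base
        }
    }
}

/// Hypothetical stand-in for the scheduler's admission check.
/// It depends only on the trait, never on a concrete provider type.
pub struct Gate {
    capacity: Arc<dyn CapacityProvider>,
}

impl Gate {
    pub fn new(capacity: Arc<dyn CapacityProvider>) -> Self {
        Gate { capacity }
    }

    /// A new request may be admitted only while in-flight work is below the limit.
    pub fn may_admit(&self, in_flight: usize) -> bool {
        in_flight < self.capacity.max_in_flight()
    }
}
```

`Gate::new(Arc::new(Fixed(256)))` and `Gate::new(Arc::new(PressureAware::new(256)))` construct the same `Gate` type, which is the property the trait seam is meant to provide.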

Horizontal scaling note

Each obleth pod has its own StaticCapacity instance. If you run three pods, each with OBLETH_GLOBAL_MAX_IN_FLIGHT=256, the effective cluster-wide limit is 3 × 256 = 768 concurrent requests. HAProxy (or your ingress) distributes incoming requests across pods, so the per-pod limits add up.
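The arithmetic works in both directions: you can sum per-pod limits to get the cluster ceiling, or divide a target ceiling across pods to get a per-pod value. The two helpers below are a hypothetical sketch of that calculation and are not part of obleth.

```rust
/// Hypothetical helper: the cluster-wide ceiling is the sum of per-pod limits.
pub fn cluster_ceiling(per_pod_limits: &[usize]) -> usize {
    per_pod_limits.iter().sum()
}

/// Hypothetical helper: splits a target cluster-wide ceiling evenly across pods.
/// Rounds down so the sum of per-pod limits never exceeds the target.
pub fn per_pod_limit(target_ceiling: usize, pods: usize) -> usize {
    if pods == 0 {
        return 0;
    }
    target_ceiling / pods
}
```

If you add or remove pods and want the cluster ceiling to stay fixed, recompute the per-pod value and update OBLETH_GLOBAL_MAX_IN_FLIGHT on each pod.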

See Scaling for more.
