Capacity Provider

The CapacityProvider trait that sets obleth's global concurrency budget, and how to tune or replace the static v1 implementation.

The CapacityProvider trait is the seam between the fairshare scheduler and the cluster's actual capacity. It answers one question: how many requests may be in flight right now?

The trait

pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

The fairshare scheduler calls this every time it considers granting a permit. The result is the global concurrency ceiling.
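To make the call pattern concrete, here is a minimal sketch of an admission gate that re-reads the ceiling on every permit attempt. `FixedCapacity` and `Gate` are hypothetical names invented for this illustration; they are not part of obleth, and the real fairshare scheduler also weighs per-tenant fairness, which this sketch ignores.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

/// Hypothetical fixed provider, used only for this illustration.
pub struct FixedCapacity(pub usize);

impl CapacityProvider for FixedCapacity {
    fn max_in_flight(&self) -> usize {
        self.0
    }
}

/// Hypothetical admission gate (assumption: not obleth's scheduler).
pub struct Gate<P: CapacityProvider> {
    provider: P,
    in_flight: AtomicUsize,
}

impl<P: CapacityProvider> Gate<P> {
    pub fn new(provider: P) -> Self {
        Self { provider, in_flight: AtomicUsize::new(0) }
    }

    /// Try to take one permit; returns false when the ceiling is reached.
    pub fn try_acquire(&self) -> bool {
        // Re-read the ceiling on every attempt so live changes take effect.
        let max = self.provider.max_in_flight();
        self.in_flight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Return a permit; must be paired with a successful `try_acquire`.
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Because the provider is consulted on each attempt, any implementation of `max_in_flight` should be cheap and non-blocking.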

v1: StaticCapacity

The v1 implementation is a fixed limit backed by an atomic integer:

pub struct StaticCapacity {
    max: AtomicUsize,
}

impl StaticCapacity {
    pub fn set(&self, max: usize) { ... }  // runtime-tunable
}
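The body of `set` is elided above. A plausible complete version, assuming a plain atomic store and load, could look like the following. The `new` constructor and the choice of `Ordering::Relaxed` are assumptions for this sketch, not confirmed details of obleth's source.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

pub struct StaticCapacity {
    max: AtomicUsize,
}

impl StaticCapacity {
    /// Assumed constructor; the source shows only `set`.
    pub fn new(max: usize) -> Self {
        Self { max: AtomicUsize::new(max) }
    }

    /// Replace the limit at runtime; later loads observe the new value.
    /// Relaxed ordering is assumed to be enough because the integer
    /// does not guard any other data.
    pub fn set(&self, max: usize) {
        self.max.store(max, Ordering::Relaxed);
    }
}

impl CapacityProvider for StaticCapacity {
    fn max_in_flight(&self) -> usize {
        self.max.load(Ordering::Relaxed)
    }
}
```

Since `set` takes `&self`, the same instance can be shared behind an `Arc` between the scheduler and whatever handles the runtime update.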

Set it at startup:

OBLETH_GLOBAL_MAX_IN_FLIGHT=256

Or tune it live without restarting:

curl -X PUT http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 300}'

Read the current value:

curl http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN"
# {"max_in_flight": 300}

Sizing the limit

max_in_flight should approximately match the number of concurrent requests your inference backend can serve without queuing. Practical starting points:

Backend                                        Suggested starting point
Single vLLM instance (A100, bf16, 7B model)    32–64
Aibrix cluster, 4 replicas, 70B model          16–32 per obleth pod
Mock backend (dev/demo)                        64–256

If you see high obleth_queue_depth and frequent brownout, the limit is too low for your offered load. If requests are rarely queued but GPU utilization is low, you may have room to increase it.
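One way to sanity-check a starting point is Little's law: average concurrency equals throughput multiplied by average latency. The sketch below is a back-of-envelope estimate only. obleth does not compute this, and the function name is an illustrative assumption. Both inputs should be measured on your own backend while it is busy but not yet queuing.

```rust
/// Estimate the concurrency a backend sustains, using Little's law:
/// L = lambda * W (requests per second times seconds per request).
/// Hypothetical helper for sizing; not part of obleth.
pub fn estimate_in_flight(requests_per_sec: f64, avg_latency_secs: f64) -> usize {
    // Round up so a fractional request still gets a slot.
    (requests_per_sec * avg_latency_secs).ceil() as usize
}
```

For example, a backend that completes 8 requests per second at an average latency of 4 seconds holds about 32 requests in flight, so 32 would be a reasonable first value to try before adjusting against observed queue depth and GPU utilization.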

Future: metrics-driven capacity

The trait is the seam for a MetricsCapacity implementation that reads live signals from the inference backend:

// planned, not yet implemented
pub struct MetricsCapacity {
    aibrix_client: AibrixMetricsClient,
}

impl CapacityProvider for MetricsCapacity {
    fn max_in_flight(&self) -> usize {
        // Read Aibrix queue depth + KV-cache utilization
        // Return a dynamic limit that shrinks when the backend is under pressure
        self.compute_safe_limit()
    }
}

Swapping in a MetricsCapacity changes only the CapacityProvider binding; the fairshare scheduler and the rest of the pipeline are unchanged. This is by design.
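Because `max_in_flight` is synchronous and is called on every permit decision, one plausible shape for a metrics-driven provider is to serve a cached limit that a background poller refreshes. The sketch below illustrates that idea only. `CachedMetricsCapacity`, `BackendSignals`, the field names, and the scaling policy are all assumptions made for illustration, not the planned obleth design or any Aibrix API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub trait CapacityProvider: Send + Sync + 'static {
    /// Maximum number of requests allowed in flight right now.
    fn max_in_flight(&self) -> usize;
}

/// Hypothetical snapshot of backend signals; field names are assumptions.
pub struct BackendSignals {
    pub queue_depth: usize,
    /// Fraction of KV cache in use, expected in 0.0..=1.0.
    pub kv_cache_utilization: f64,
}

/// Hypothetical provider that serves a cached, periodically updated limit.
pub struct CachedMetricsCapacity {
    floor: usize,
    ceiling: usize,
    current: AtomicUsize,
}

impl CachedMetricsCapacity {
    /// Starts at `ceiling` until the first update arrives.
    pub fn new(floor: usize, ceiling: usize) -> Self {
        Self { floor, ceiling, current: AtomicUsize::new(ceiling) }
    }

    /// Intended to be called by a background poller, not by the scheduler.
    pub fn update(&self, s: &BackendSignals) {
        // Illustrative policy: scale the ceiling by free KV cache, then
        // halve it if the backend already has requests queued.
        let free = (1.0 - s.kv_cache_utilization).clamp(0.0, 1.0);
        let mut limit = (self.ceiling as f64 * free) as usize;
        if s.queue_depth > 0 {
            limit /= 2;
        }
        // Keep the result between floor and ceiling.
        let bounded = limit.max(self.floor).min(self.ceiling);
        self.current.store(bounded, Ordering::Relaxed);
    }
}

impl CapacityProvider for CachedMetricsCapacity {
    fn max_in_flight(&self) -> usize {
        // Cheap read; no network call on the scheduler's hot path.
        self.current.load(Ordering::Relaxed)
    }
}
```

The floor keeps the limit from collapsing to zero when the backend reports full utilization, so some traffic can still drain.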

Horizontal scaling note

Each obleth pod has its own StaticCapacity instance. If you run three pods each with OBLETH_GLOBAL_MAX_IN_FLIGHT=256, the effective global limit is 768 concurrent requests. HAProxy (or your ingress) distributes incoming requests across pods, so per-pod limits add up naturally.
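The arithmetic above can be written down as two small helpers, which is handy when you want to hold the cluster-wide ceiling steady while changing the pod count. Both function names are hypothetical and exist only for this sketch.

```rust
/// Effective cluster-wide ceiling when every pod uses the same limit.
pub fn effective_global_limit(pods: usize, per_pod_limit: usize) -> usize {
    pods.saturating_mul(per_pod_limit)
}

/// Per-pod limit that keeps the sum at or under a target (rounds down).
/// Returns None when there are no pods to divide across.
pub fn per_pod_limit(target_global: usize, pods: usize) -> Option<usize> {
    if pods == 0 {
        None
    } else {
        Some(target_global / pods)
    }
}
```

Three pods at 256 give 768. To keep a 768 ceiling after scaling to four pods, each pod would need a limit of 192.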

See Scaling for more.