Scaling

How to scale obleth horizontally, tune the global concurrency limit, and scale the supporting datastores.

obleth is stateless (all shared state is in Redis and Postgres) and scales horizontally by running more pods behind a load balancer.

Horizontal scaling

Each obleth pod runs independently with its own:

In-process moka key cache (TTL=5min)
Fairshare scheduler (per-pod, not global)
Concurrency budget (OBLETH_GLOBAL_MAX_IN_FLIGHT per pod)

The effective global capacity is num_pods × OBLETH_GLOBAL_MAX_IN_FLIGHT. HAProxy (or your Ingress) distributes requests across pods round-robin.
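
As a quick worked example of the capacity formula, the sketch below multiplies pod count by the per-pod limit. The per-pod value of 128 and the pod counts are illustrative only, not defaults or recommendations, and the function name is hypothetical.

```python
def effective_capacity(num_pods: int, max_in_flight_per_pod: int) -> int:
    """Total concurrent requests admitted across all pods."""
    if num_pods < 0 or max_in_flight_per_pod < 0:
        raise ValueError("pod count and per-pod limit must be non-negative")
    return num_pods * max_in_flight_per_pod


PER_POD_LIMIT = 128  # illustrative value for OBLETH_GLOBAL_MAX_IN_FLIGHT

for pods in (3, 20):
    print(f"{pods} pods -> {effective_capacity(pods, PER_POD_LIMIT)} in-flight requests")
```

Because the limit is enforced per pod, adding or removing pods changes the total load the backend can receive even when the env var stays the same.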

Kubernetes HPA

The Helm chart ships with an HPA enabled by default:

hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

obleth is CPU-light for most workloads (it proxies bytes rather than doing heavy computation), so CPU utilization is often a weak scaling signal. Admission queue depth is a better one. With Prometheus Adapter you can expose it as a custom HPA metric:

# Scale when average queue depth exceeds 20 per pod
metrics:
  - type: Pods
    pods:
      metric:
        name: obleth_queue_depth
      target:
        type: AverageValue
        averageValue: 20
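
To get a feel for how this metric would drive replica counts, here is a rough sketch of the scaling rule documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas. It ignores the HPA's tolerance band, stabilization windows, and unready-pod handling, so treat it as an approximation. The function name is hypothetical; the defaults mirror the values shown on this page.

```python
import math


def desired_replicas(
    current_replicas: int,
    avg_queue_depth: float,
    target_avg: float = 20.0,   # averageValue from the metric above
    min_replicas: int = 3,      # chart default shown earlier
    max_replicas: int = 20,     # chart default shown earlier
) -> int:
    """Approximate the HPA's replica calculation for an AverageValue target."""
    raw = math.ceil(current_replicas * avg_queue_depth / target_avg)
    # The HPA never goes outside the configured replica bounds.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 35 queued requests each against a target of 20
print(desired_replicas(4, 35.0))
```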

Docker Compose scaling

docker compose -f deploy/docker/docker-compose.yml \
  up --scale obleth=3 -d

Scaling this way requires an HAProxy backend that includes every obleth instance. The bundled Docker Compose HAProxy config discovers the obleth containers through DNS round-robin on the Compose network.

Tuning max_in_flight

The OBLETH_GLOBAL_MAX_IN_FLIGHT limit controls concurrency per pod. Setting it too high makes the inference backend queue requests internally (hidden queuing that is hard to observe). Setting it too low makes obleth's own, visible queue grow while the backend still has spare capacity.

Target: set it equal to the number of concurrent requests your inference backend (Aibrix/vLLM) can handle without internal queuing.
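
If several obleth pods share the same backend pool, one way to reason about the target is to split the backend's concurrent capacity across the pods, since the effective total is num_pods × OBLETH_GLOBAL_MAX_IN_FLIGHT. The sketch below assumes you already know the backend figure (256 here is made up) and a fixed pod count. With the HPA enabled the pod count changes, so the result is only a starting point. The function name is hypothetical.

```python
def per_pod_limit(backend_capacity: int, num_pods: int) -> int:
    """Split a shared backend's concurrent capacity evenly across pods."""
    if num_pods <= 0:
        raise ValueError("num_pods must be positive")
    if backend_capacity < 0:
        raise ValueError("backend_capacity must be non-negative")
    # Round down so the pods together never exceed the backend's capacity,
    # but keep at least one slot per pod so no pod is fully blocked.
    return max(1, backend_capacity // num_pods)


print(per_pod_limit(backend_capacity=256, num_pods=3))
```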

# Increase live (no restart)
curl -X PUT http://localhost:9180/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 128}'

# Check current
curl http://localhost:9180/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN"

This only affects the pod you're calling. To change all pods, either restart them with the new env var or call each pod's admin endpoint.
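
Calling each pod's admin endpoint can be scripted. The following is a minimal sketch using only the Python standard library. It reuses the PUT /api/v1/capacity call shown above; the helper names are hypothetical, and the list of pod base URLs is something you would supply from your own service discovery.

```python
import json
import urllib.request


def build_capacity_request(base_url: str, token: str, max_in_flight: int) -> urllib.request.Request:
    """Build the PUT /api/v1/capacity request for one pod without sending it."""
    body = json.dumps({"max_in_flight": max_in_flight}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/capacity",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def set_capacity_on_all(pod_urls, token, max_in_flight, timeout=5.0):
    """Apply the limit to every pod; returns {pod_url: HTTP status or error text}."""
    results = {}
    for url in pod_urls:
        request = build_capacity_request(url, token, max_in_flight)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[url] = response.status
        except OSError as exc:  # URLError, HTTPError and timeouts are all OSError subclasses
            results[url] = f"error: {exc}"
    return results
```

For example, `set_capacity_on_all(["http://10.0.0.11:9180", "http://10.0.0.12:9180"], token, 128)` would update two pods (placeholder addresses) and report a per-pod result, so a pod that failed to update is easy to spot.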

Per-model capacity

Each registered model can also define max_in_flight — a per-route slot cap inside the global limit. Tune it manually or use the auto-tune ramp probe (chat and embedding only) from the Management API or control plane. See Capacity Auto-tune.

# Set per-model slots
curl -X PUT http://localhost:9180/api/v1/models/$MODEL_ID/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 32}'

Redis scaling

Redis is used for sub-millisecond key lookups and atomic Lua budget operations. For most deployments, a single Redis with replica reads is sufficient.

For high-traffic deployments:

Redis Sentinel: automatic failover, no client-side sharding needed.
Redis Cluster: horizontal sharding. Redis Cluster requires every key touched by a single Lua script to hash to the same slot. obleth's Lua scripts (used for the token budget) operate on one key per call, so they work correctly on Redis Cluster.
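
For background on the same-slot rule: Redis Cluster maps every key to one of 16384 hash slots using CRC16 (the XMODEM variant) modulo 16384, and hashes only the text inside the first non-empty {...} hash tag when one is present. The sketch below reimplements that mapping for illustration. It is not obleth code, and the key names are made up.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot (0..16383) for a key."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Two different keys usually land on different slots ...
print(key_slot("budget:team-a"), key_slot("budget:team-b"))
# ... unless they share a hash tag.
print(key_slot("{team-a}:budget"), key_slot("{team-a}:usage"))
```

A script that touches a single key can never span two slots, which is why single-key scripts are safe on a cluster without any hash-tag conventions.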

ClickHouse scaling

ClickHouse is append-only for obleth. It handles extremely high insert rates natively. For most deployments, a single ClickHouse instance handles tens of thousands of rows per second without tuning.

For high-traffic or large-history deployments:

Use a replicated ClickHouse cluster (2–3 replicas with ZooKeeper or ClickHouse Keeper).
Configure data retention TTLs on usage to control disk growth.
Use materialized views for pre-aggregated reporting.

Postgres scaling

Postgres handles only config mutations (low frequency) and audit log appends. It is rarely the bottleneck. Use CloudNativePG or a managed service for HA. Read scaling (replicas) is not needed for obleth's workload.

Monitoring capacity signals

| Metric | What it tells you |
| --- | --- |
| `obleth_in_flight` | Current concurrent requests per pod |
| `obleth_queue_depth` | Requests waiting for admission (should be near 0 at steady state) |
| `obleth_requests_total{admission="queued"}` | Requests that had to wait for a slot; sustained growth means the cluster is at capacity |
| `obleth_requests_total{admission="fast"}` | Requests admitted immediately; a good signal of headroom |
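
One way to turn the two obleth_requests_total series into a single headroom signal is the share of requests that had to queue during a time window. The sketch below works on counter increases you have already fetched from Prometheus. The function name and sample numbers are illustrative, and counter resets are not handled.

```python
def queued_share(queued_delta: float, fast_delta: float) -> float:
    """Fraction of admitted requests that had to wait for a slot in the window."""
    if queued_delta < 0 or fast_delta < 0:
        raise ValueError("counter increases must be non-negative")
    total = queued_delta + fast_delta
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was queued
    return queued_delta / total


# 25 queued and 75 fast admissions over the window
print(f"{queued_share(25, 75):.0%} of requests queued")
```

A share that stays near zero suggests headroom, while a share that keeps climbing matches the "at capacity" reading of the queued counter.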
