How to scale obleth horizontally, tune the global concurrency limit, and scale the supporting datastores.
obleth is stateless (all shared state is in Redis and Postgres) and scales horizontally by running more pods behind a load balancer.
Each obleth pod runs independently with its own:
OBLETH_GLOBAL_MAX_IN_FLIGHT per pod)
The global effective capacity is num_pods × OBLETH_GLOBAL_MAX_IN_FLIGHT. HAProxy (or your Ingress) distributes requests across pods round-robin.
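As a quick worked example of the capacity formula above, the following Python sketch multiplies the pod count by the per-pod limit. The function name and the per-pod value of 128 are illustrative assumptions; 3 and 20 are the Helm chart's default HPA replica bounds.

```python
def effective_capacity(num_pods: int, max_in_flight_per_pod: int) -> int:
    """Global effective capacity: num_pods x OBLETH_GLOBAL_MAX_IN_FLIGHT."""
    if num_pods < 0 or max_in_flight_per_pod < 0:
        raise ValueError("pod count and per-pod limit must be non-negative")
    return num_pods * max_in_flight_per_pod


# Illustrative per-pod limit (an assumption, not a recommended value).
PER_POD_LIMIT = 128

# Capacity at the chart's default HPA bounds (minReplicas=3, maxReplicas=20).
capacity_at_min = effective_capacity(3, PER_POD_LIMIT)
capacity_at_max = effective_capacity(20, PER_POD_LIMIT)
print(capacity_at_min, capacity_at_max)
```

Because the HPA changes the pod count, the global capacity moves with it; the per-pod limit is the only part you set directly.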
The Helm chart ships an HPA enabled by default:
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
obleth is CPU-light for most workloads (it proxies bytes rather than computing on them), so CPU utilization is a weak scaling signal. A better signal is admission queue depth. With Prometheus Adapter you can expose it as a custom HPA metric:
# Scale when average queue depth exceeds 20 per pod
metrics:
  - type: Pods
    pods:
      metric:
        name: obleth_queue_depth
      target:
        type: AverageValue
        averageValue: 20
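To get a feel for how this metric drives replica counts, the sketch below applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with the default 10% tolerance and clamping to the replica bounds. It is a simplified model under those assumptions: it ignores stabilization windows, unready pods, and missing metrics, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int, avg_queue_depth: float,
                     target_avg: float = 20.0, min_replicas: int = 3,
                     max_replicas: int = 20, tolerance: float = 0.1) -> int:
    """Simplified model of the HPA replica calculation for a Pods metric."""
    ratio = avg_queue_depth / target_avg
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 35 queued requests each, against a target of 20.
print(desired_replicas(current_replicas=4, avg_queue_depth=35))
```

With a target of 20, an average depth near 0 at steady state pulls the deployment back toward minReplicas, while sustained queuing pushes it toward maxReplicas.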
docker compose -f deploy/docker/docker-compose.yml \
  up --scale obleth=3 -d
This requires HAProxy to be configured with a backend that includes all obleth instances. The Docker Compose HAProxy config discovers the obleth containers automatically via DNS round-robin in the Compose network.
The OBLETH_GLOBAL_MAX_IN_FLIGHT limit controls concurrency per pod. Setting it too high causes the inference backend to queue requests internally (hidden queuing, hard to observe). Setting it too low means obleth's visible queue grows.
Target: set it equal to the number of concurrent requests your inference backend (Aibrix/vLLM) can handle without internal queuing.
# Increase live (no restart)
curl -X PUT http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_in_flight": 128}'
# Check current
curl http://localhost:9090/api/v1/capacity \
  -H "Authorization: Bearer $TOKEN"
This only affects the pod you're calling. To change all pods, either restart them with the new env var or call each pod's admin endpoint.
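One way to apply the same limit to every pod is to loop over each pod's admin address and issue the same PUT request shown above. The sketch below assumes you already have the list of pod admin URLs (the addresses in the usage note are hypothetical) and that the endpoint answers with a 2xx status on success, which this page does not confirm. The helper names are illustrative.

```python
import json
import urllib.request


def build_capacity_request(base_url: str, token: str,
                           max_in_flight: int) -> urllib.request.Request:
    """Build the PUT /api/v1/capacity request for one pod's admin endpoint."""
    body = json.dumps({"max_in_flight": max_in_flight}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/capacity",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def set_capacity_on_all(pod_urls: list[str], token: str,
                        max_in_flight: int,
                        timeout: float = 5.0) -> dict[str, str]:
    """Send the request to every pod; return a per-pod outcome keyed by URL."""
    results = {}
    for base_url in pod_urls:
        request = build_capacity_request(base_url, token, max_in_flight)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[base_url] = f"HTTP {response.status}"
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[base_url] = f"failed: {exc}"
    return results
```

For example, `set_capacity_on_all(["http://10.0.0.11:9090", "http://10.0.0.12:9090"], token, 128)` would try both (hypothetical) pods and report which calls failed, so a partially applied change is visible rather than silent.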
Redis is used for sub-ms key lookups and atomic Lua budget operations. For most deployments, a single Redis with replica reads is sufficient.
For high-traffic deployments:
ClickHouse is append-only for obleth. It handles extremely high insert rates natively. For most deployments, a single ClickHouse instance handles tens of thousands of rows per second without tuning.
For high-traffic or large-history deployments:
usage to control disk growth.
Postgres handles only config mutations (low frequency) and audit log appends. It is rarely the bottleneck. Use CloudNativePG or a managed service for HA. Read scaling (replicas) is not needed for obleth's workload.
| Metric | What it tells you |
|---|---|
| obleth_in_flight | Current concurrent requests per pod |
| obleth_queue_depth | Requests waiting for admission (should be near 0 at steady state) |
| obleth_requests_total{admission="brownout"} | Requests being degraded; the queue is regularly hitting the wait threshold |
| obleth_requests_total{admission="fast"} | Requests admitted immediately; a good signal of headroom |
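
As a sketch of how the two admission counters might be read together, the snippet below parses Prometheus text-format samples for obleth_requests_total and reports each admission label's share of the total. The sample values are made up, and the series is assumed to carry an admission label as shown in the table; real output may include additional labels or admission values.

```python
import re
from collections import defaultdict

# Matches lines such as: obleth_requests_total{admission="fast"} 950
SAMPLE_RE = re.compile(
    r'^obleth_requests_total\{[^}]*admission="(?P<admission>[^"]+)"[^}]*\}'
    r'\s+(?P<value>\S+)'
)


def admission_shares(exposition_text: str) -> dict[str, float]:
    """Return each admission label's fraction of obleth_requests_total."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            totals[match["admission"]] += float(match["value"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: count / grand_total for label, count in totals.items()}


# Made-up sample values, for illustration only.
sample = '''
obleth_requests_total{admission="fast"} 950
obleth_requests_total{admission="brownout"} 50
'''
print(admission_shares(sample))
```

If these series are cumulative counters, as the `_total` suffix suggests, compare deltas between two scrapes (or use `rate()` in PromQL) rather than raw totals, so that old traffic does not mask a recent rise in brownouts.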