Production Checklist

This checklist covers the minimum steps to harden an obleth deployment before it takes production traffic: datastores, secrets, TLS and network exposure, capacity, reliability, monitoring, initial configuration, and pre-launch testing.

Datastores

  • Postgres: Use a managed service (RDS, Cloud SQL), a Kubernetes operator such as CloudNativePG, or another HA setup. Do not use the bundled Docker Compose Postgres in production.
  • Redis: Use Redis Sentinel or Redis Cluster for HA. obleth treats Redis as a hot cache — all data is derivable from Postgres — so a Redis failover is recoverable without data loss.
  • ClickHouse: Use a replicated cluster or a managed ClickHouse service. obleth's WAL (see OBLETH_WAL_PATH under Reliability) provides short-term durability during single-node outages.
  • Set all connection URLs via environment variables or Kubernetes Secrets; do not hardcode them.
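
A pre-deploy check for the last item can be sketched as below. The variable names are hypothetical placeholders, not obleth's documented settings; substitute whichever names your deployment actually reads.

```python
import os

# Hypothetical variable names, used only for illustration.
REQUIRED_URL_VARS = ("DATABASE_URL", "REDIS_URL", "CLICKHOUSE_URL")


def missing_connection_urls(env=os.environ, required=REQUIRED_URL_VARS):
    """Return the names of required connection URL variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]
```

Running this in a container entrypoint or CI step and failing when the returned list is non-empty catches a missing Secret mount before the service starts.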

Secrets

  • OBLETH_ADMIN_TOKEN: strong random token (≥32 chars). Use a Kubernetes Secret or Vault. Never commit to source control.
  • DASHBOARD_SESSION_SECRET: change from the default "dev-session-secret-change-in-production".
  • DASHBOARD_PASSWORD: change from the default "obleth".
  • ClickHouse password: change from the default "obleth".
  • Postgres password: change from the default "obleth".
  • Model api_key and MCP auth_header values: these are stored in Postgres, so enable encryption at rest for the database.
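
The token and default-value items above can be checked mechanically. The following is a minimal sketch using Python's standard secrets module; the helper names are illustrative, and treating an unset variable as "still on the default" is an assumption about how obleth falls back.

```python
import secrets

# Default values named in this checklist.
KNOWN_DEFAULTS = {
    "DASHBOARD_SESSION_SECRET": "dev-session-secret-change-in-production",
    "DASHBOARD_PASSWORD": "obleth",
}


def generate_admin_token(nbytes=32):
    """Return a URL-safe random token; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(nbytes)


def insecure_settings(env):
    """Return the names of settings that are too short or still on a known default."""
    problems = []
    if len(env.get("OBLETH_ADMIN_TOKEN", "")) < 32:
        problems.append("OBLETH_ADMIN_TOKEN")
    for name, default in KNOWN_DEFAULTS.items():
        # Assumption: an unset variable means the default is in effect.
        if env.get(name, default) == default:
            problems.append(name)
    return problems
```

Passing os.environ to insecure_settings in a deploy pipeline gives a quick list of settings that still need attention. It does not cover the ClickHouse and Postgres passwords, which usually live inside connection URLs.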

TLS and network

  • TLS terminated at HAProxy or your Ingress controller — the data plane is HTTP-only.
  • obleth data plane (:8080) reachable only from HAProxy/Ingress, not from the internet.
  • Admin port (:9090) not publicly accessible — restrict to internal network or VPN. The admin token is the only auth.
  • Metrics port (:9091) accessible only to your Prometheus scraper.
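
One way to verify the exposure rules above is to attempt TCP connections from a machine that should not have access. This is a minimal sketch; the function name is illustrative, and a connection failure only tells you about the network path from where the script runs.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from outside the cluster network, port_is_reachable(host, 8080), port_is_reachable(host, 9090), and port_is_reachable(host, 9091) should all return False for any obleth pod or node address.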

Capacity planning

  • Set OBLETH_GLOBAL_MAX_IN_FLIGHT to match real inference backend concurrency (start conservative: 64, increase based on queue depth).
  • Enable HPA in Helm (hpa.enabled=true); set maxReplicas based on traffic patterns.
  • Remember: each obleth pod has its own independent concurrency budget. 3 pods × 64 = 192 total slots.
  • Set OBLETH_BROWNOUT_WAIT_MS to a value appropriate to your SLA.
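
Because the concurrency budget is per pod, the total scales with the replica count. The arithmetic can be sketched as follows; the function names are illustrative, and dividing backend concurrency evenly across maxReplicas is one conservative sizing choice, not an obleth recommendation.

```python
def total_slots(replicas, max_in_flight_per_pod):
    """Total in-flight slots across all pods, since each pod has its own budget."""
    return replicas * max_in_flight_per_pod


def per_pod_limit(backend_concurrency, max_replicas):
    """Per-pod limit that keeps the fleet within backend concurrency at full scale-out."""
    return max(1, backend_concurrency // max_replicas)
```

For example, total_slots(3, 64) gives 192, and a backend that handles 100 concurrent requests behind up to 8 replicas gives per_pod_limit(100, 8) of 12.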

Reliability

  • Decide on OBLETH_FAIL_OPEN: true (keep serving under Redis failure) or false (strict budget enforcement).
  • Set OBLETH_WAL_PATH to a persistent volume path (not /tmp). The WAL must survive pod restarts.
  • Verify the WAL volume has adequate disk space for your traffic volume.
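
The two WAL checks above can be scripted. This sketch assumes OBLETH_WAL_PATH points at a directory (if it names a file, check the parent directory instead); the minimum free space is a number you choose for your traffic, not a value from obleth.

```python
import os
import shutil
import tempfile


def check_wal_path(path, min_free_bytes):
    """Return a list of problems with a WAL directory; an empty list means it passed."""
    problems = []
    real = os.path.realpath(path)
    temp_roots = {os.path.realpath("/tmp"), os.path.realpath(tempfile.gettempdir())}
    if any(real == root or real.startswith(root + os.sep) for root in temp_roots):
        problems.append("path is under a temporary directory")
    if not os.path.isdir(real):
        problems.append("path is not an existing directory")
    elif shutil.disk_usage(real).free < min_free_bytes:
        problems.append("free space is below the required minimum")
    return problems
```

This confirms only that the path is outside a temporary directory and has room. Whether the volume actually persists across pod restarts depends on the volume type in your pod spec.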

Monitoring

  • Prometheus scraping :9091/metrics.
  • Alert on obleth_queue_depth > threshold (admission saturation).
  • Alert on obleth_telemetry_dropped > 0 (WAL pressure).
  • Alert on obleth_requests_total{status="5xx"} spike (upstream failures).
  • Set serviceMonitor.enabled=true if using Prometheus Operator.
  • (Optional) Set OBLETH_OTEL_ENDPOINT for distributed tracing.
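
In production these alerts belong in Prometheus alerting rules. As a quick smoke check of the same conditions, the text served at :9091/metrics can be parsed directly. The sketch below assumes the plain Prometheus text format without timestamps and uses a threshold you pick yourself; it does not cover the 5xx spike, which needs a rate over time rather than a single scrape.

```python
def parse_metrics(text):
    """Map each series (name plus any labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def firing_alerts(samples, queue_depth_threshold):
    """Return the metric names whose values breach the checklist conditions."""
    firing = []
    for series, value in samples.items():
        name = series.split("{", 1)[0]
        if name == "obleth_queue_depth" and value > queue_depth_threshold:
            firing.append(name)
        elif name == "obleth_telemetry_dropped" and value > 0:
            firing.append(name)
    return firing
```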

Initial configuration

  • Create your tenant hierarchy and fairshare groups before sending production traffic.
  • Register all models with accurate input_cost_per_token and output_cost_per_token.
  • Set conservative tokens_per_minute quotas initially; increase based on observed usage.
  • Verify the audit log is recording changes: GET /api/v1/audit.
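
To sanity-check the registered per-token costs, it helps to compute what a typical request should cost. The sketch below assumes cost is linear in token counts, which is the natural reading of input_cost_per_token and output_cost_per_token but is not confirmed here; the prices in the example are made up.

```python
from decimal import Decimal


def request_cost(input_tokens, output_tokens, input_cost_per_token, output_cost_per_token):
    """Cost of one request, assuming a linear per-token price for input and output."""
    return (input_tokens * Decimal(input_cost_per_token)
            + output_tokens * Decimal(output_cost_per_token))
```

With hypothetical prices of 0.000003 per input token and 0.000015 per output token, request_cost(1000, 500, "0.000003", "0.000015") comes to 0.0105. Passing the prices as strings keeps Decimal from inheriting binary floating-point rounding.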

Pre-launch testing

  • Run the benchmark harness against staging to verify fairshare behavior.
  • Run the chaos harness against staging to verify fail-open and WAL replay.
  • Verify ClickHouse rows appear after test requests.
  • Smoke test the dashboard against the production Management API.