Troubleshooting

Common obleth errors and how to diagnose them: auth failures, 503s, stalled queues, Redis issues, and ClickHouse connectivity.

Enable debug logging

RUST_LOG=obleth=debug

This adds verbose output for request routing, cache hits/misses, Redis operations, and fairshare decisions.

HTTP error reference

Status                  | Meaning                             | Common cause
401 Unauthorized        | Missing or malformed API key        | No Authorization header; key doesn't start with sk_
403 Forbidden           | Key is disabled or tenant not found | Key was deleted/disabled; tenant deleted
404 Not Found           | Model not found or not enabled      | Model not in registry, or enabled=false
429 Too Many Requests   | Tenant quota exceeded               | tokens_per_minute budget exhausted for this billing window
502 Bad Gateway         | Upstream returned an error          | Inference backend is down or returned a non-2xx response
503 Service Unavailable | Request brownout or queue full      | System is overloaded; obleth_queue_depth is high

Auth failures (401/403)

# 1. Confirm the key exists
curl -s http://localhost:9090/api/v1/keys \
  -H "Authorization: Bearer $TOKEN" | jq '.[] | select(.key_prefix == "sk_abc1")'

# 2. Confirm it's not disabled
# If "disabled": true, re-enable it (the body is JSON, so send a JSON content type):
curl -X PUT "http://localhost:9090/api/v1/keys/$KEY_ID/disabled" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"disabled": false}'

# 3. Check the key format: it must be the full secret (sk_...), not the prefix

Key resolution flow: the gateway hashes the full key with SHA-256, then looks up obleth:key:{hash} in Redis. If Redis is down, it falls back to the in-process moka cache, then to Postgres. If all three miss, the request is rejected.
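To check by hand whether a key is present in Redis, you can reproduce the lookup key yourself. The sketch below rests on an assumption about the encoding, not a confirmed detail: it takes {hash} to be the lowercase hex SHA-256 of the raw key, unsalted and without a trailing newline. obleth_redis_key is a hypothetical helper name. It relies on sha256sum from GNU coreutils (on macOS, shasum -a 256 is the usual substitute).

```shell
# Hypothetical helper: print the Redis key under which an API key would be cached.
# Assumption: {hash} is the lowercase hex SHA-256 of the raw key bytes (unsalted).
obleth_redis_key() {
  # printf '%s' avoids the trailing newline that echo would add to the hashed input.
  hash=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  printf 'obleth:key:%s\n' "$hash"
}
```

For example, redis-cli -h $REDIS_HOST exists "$(obleth_redis_key "$API_KEY")" prints 1 if that key is present and 0 if it is not, where $API_KEY stands for the full secret. A 0 may also just mean the encoding assumption above is wrong.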

404 on model

# List models
curl -s http://localhost:9090/api/v1/models \
  -H "Authorization: Bearer $TOKEN" | jq '.[] | {name: .model_name, enabled: .enabled}'

The model field in the request body must match a model_name in the registry exactly. If the model exists but enabled=false, re-enable it.
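Because the match is exact, a difference in case or a truncated name is enough to cause a 404. A minimal sketch of that check, assuming one model name per line on standard input; model_registered is a hypothetical helper, not part of obleth:

```shell
# Hypothetical helper: succeed only if $1 equals one of the names on stdin exactly.
# -F treats the name as a literal string, -x requires a whole-line match,
# and -q suppresses output so only the exit status is used.
model_registered() {
  grep -Fxq -- "$1"
}
```

It can be fed from the model listing, for example by piping jq -r '.[].model_name' into model_registered "$MODEL", where $MODEL is whatever the client sends in the model field.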

429 quota exceeded

# Check the tenant's quota and group
curl -s http://localhost:9090/api/v1/tenants/$TENANT_ID \
  -H "Authorization: Bearer $TOKEN" | jq '{tpm: .tokens_per_minute, group: .group_name}'

# Increase quota
curl -X PUT http://localhost:9090/api/v1/tenants/$TENANT_ID/quota \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tokens_per_minute": 200000}'

The token budget resets every 60 seconds via a Redis Lua script. If a tenant consistently hits 429, either increase its quota or check whether it has a runaway workload.
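To tell an undersized quota from a runaway workload, it can help to compare the tenant's expected demand with its tokens_per_minute. The arithmetic below is only a back-of-the-envelope sketch; tpm_headroom is a hypothetical helper, and the request rate and tokens-per-request figures are inputs you would estimate from your own traffic.

```shell
# Hypothetical helper: tokens_per_minute left over after expected demand.
# Usage: tpm_headroom <tokens_per_minute> <requests_per_minute> <avg_tokens_per_request>
# A negative result means the expected workload alone exceeds the quota.
tpm_headroom() {
  echo $(( $1 - $2 * $3 ))
}
```

tpm_headroom 200000 50 3000 prints 50000: a tenant sending 50 requests per minute at about 3000 tokens each fits within a quota of 200000. If the result is comfortably positive and the tenant still hits 429, a runaway workload is the more likely explanation.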

Requests stuck in queue (503 brownout)

The queue fills when OBLETH_GLOBAL_MAX_IN_FLIGHT is too low for the offered load. Check:

# Live stats
curl http://localhost:9090/api/v1/stats -H "Authorization: Bearer $TOKEN"
# Look at: in_flight, queued, max_in_flight

# Prometheus
curl -s http://localhost:9091/metrics | grep -E 'obleth_queue_depth|obleth_in_flight'

Options:

  1. Increase OBLETH_GLOBAL_MAX_IN_FLIGHT (only if the backend can handle more concurrency)
  2. Add more obleth pods
  3. Lower OBLETH_BROWNOUT_WAIT_MS to fail faster instead of queuing
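Before raising the limit, a rough target can be estimated with Little's law: average concurrency equals arrival rate times average time in the system. The sketch below applies it on the assumption that OBLETH_GLOBAL_MAX_IN_FLIGHT caps the number of concurrent upstream requests; needed_in_flight is a hypothetical helper and both inputs are your own measurements.

```shell
# Hypothetical helper: estimate the concurrency needed to sustain a load.
# Usage: needed_in_flight <requests_per_second> <avg_latency_ms>
# Little's law: concurrency = arrival rate * latency, rounded up to a whole request.
needed_in_flight() {
  echo $(( ($1 * $2 + 999) / 1000 ))
}
```

needed_in_flight 40 2500 prints 100: 40 requests per second at 2.5 seconds each keep about 100 requests in flight. If the configured limit is well below that figure, queuing and 503s are to be expected.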

Redis connectivity issues

obleth fails open when Redis is unavailable: requests are still served even though the budget reservation fails. If you see repeated log lines like:

WARN obleth_redis: Redis error: Connection refused
WARN obleth_proxy: budget reserve failed; failing open

Check:

docker logs obleth-redis-1
redis-cli -h $REDIS_HOST ping

When Redis reconnects, obleth resumes normal operation automatically. If it does not reconnect, check that OBLETH_REDIS_URL is correct: the scheme must be redis://, or rediss:// only when TLS is configured.
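A quick check of the URL scheme can catch this misconfiguration early. This is a minimal sketch; check_redis_scheme is a hypothetical helper that only inspects the URL prefix and does not contact Redis.

```shell
# Hypothetical helper: classify a Redis URL by its scheme.
# Returns 0 for redis:// and rediss://, 1 for anything else.
check_redis_scheme() {
  case "$1" in
    redis://*)  echo "plain" ;;
    rediss://*) echo "tls" ;;
    *)          echo "invalid"; return 1 ;;
  esac
}
```

check_redis_scheme "$OBLETH_REDIS_URL" prints plain, tls, or invalid. A result of tls is only appropriate when TLS is actually configured.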

ClickHouse connectivity issues

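If telemetry inserts fail, obleth logs a warning like:

WARN obleth_telemetry: clickhouse insert failed; spilling N records to WAL

This is non-fatal: requests continue and the records are held in the WAL. Check that ClickHouse is reachable: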

curl http://$CLICKHOUSE_HOST:8123/ping
# Expected: Ok.

Common causes: wrong OBLETH_CLICKHOUSE_URL, wrong credentials, ClickHouse not started. After connectivity is restored, obleth replays the WAL automatically.

Health check

curl http://localhost:9090/api/v1/health
# {"status": "ok", "redis": "ok", "postgres": "ok"}

If redis or postgres reports anything other than "ok", check that service. ClickHouse is not included in the health check because telemetry failures are non-fatal (fail-open design).