Chaos Testing

Verify obleth's fail-open behavior and WAL replay by pausing Redis and ClickHouse during the benchmark.

obleth is designed to keep serving requests when Redis or ClickHouse is briefly unavailable. Chaos mode runs the normal benchmark and pauses those services while load is active.

What chaos mode verifies

Scenario | Expected behavior
ClickHouse pause | Usage records spill to the local WAL; client requests continue
ClickHouse resume | WAL records replay into ClickHouse
Redis pause | Auth uses the in-process key cache; budget checks fail open; requests continue
Redis resume | Normal Redis-backed auth and budgeting resume

Run chaos mode

CHAOS=1 node bench/run-benchmark.mjs

With Podman Compose:

CONTAINER_CLI=podman CHAOS=1 node bench/run-benchmark.mjs

For a longer real-backend run:

CHAOS=1 \
CAPACITY=16 \
DURATION_S=120 \
OUTPUT_TOKENS=150 \
MODEL=gemma4-31b-it \
CONC=64 \
node bench/run-benchmark.mjs

The benchmark still exits non-zero if tenants stop making progress, client error rates exceed the threshold, or ClickHouse usage does not line up with client completions after recovery.

Manual chaos

You can also pause services yourself while node bench/run-benchmark.mjs is running.

# Pause Redis, watch the obleth logs while requests continue, then resume.
docker compose -f deploy/docker/docker-compose.yml pause redis
docker logs -f obleth-obleth-1   # Ctrl-C to stop following
docker compose -f deploy/docker/docker-compose.yml unpause redis

# Pause ClickHouse, check the size of the WAL file, then resume.
docker compose -f deploy/docker/docker-compose.yml pause clickhouse
docker exec obleth-obleth-1 ls -lh /tmp/obleth-telemetry.wal
docker compose -f deploy/docker/docker-compose.yml unpause clickhouse

Replace docker with podman if you are using Podman Compose.
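
The pause and unpause steps above can be scripted so that the outage window is repeatable. The following is a minimal sketch, not part of obleth: the helper names compose and pause_for, the COMPOSE_FILE and DRY_RUN variables, and the reuse of CONTAINER_CLI outside the benchmark are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with obleth) for timed pause windows.
# COMPOSE_FILE defaults to the compose file used in the commands above.
COMPOSE_FILE=${COMPOSE_FILE:-deploy/docker/docker-compose.yml}

# compose ARGS...
# Runs "<cli> compose -f <file> ARGS...". With DRY_RUN=1 it only prints
# the command, which is handy for checking the sequence before a real run.
compose() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${CONTAINER_CLI:-docker} compose -f $COMPOSE_FILE $*"
  else
    "${CONTAINER_CLI:-docker}" compose -f "$COMPOSE_FILE" "$@"
  fi
}

# pause_for SERVICE SECONDS
# Pauses SERVICE, waits SECONDS, then unpauses it.
pause_for() {
  compose pause "$1" || return 1
  sleep "$2"
  compose unpause "$1"
}

# Example, run while the benchmark is active (not executed here):
#   pause_for redis 15 && pause_for clickhouse 30
```

If the script is interrupted during the sleep, the service stays paused until you run the matching unpause command by hand.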

Verify WAL replay

After ClickHouse resumes, check that buffered records were inserted. The query counts rows from the last 10 minutes; the date -d option it uses requires GNU date:

docker exec -it obleth-clickhouse-1 clickhouse-client \
  --user obleth --password obleth \
  --query "SELECT count() FROM obleth.usage WHERE ts_ms > $(date -d '10 minutes ago' +%s)000"

The count should include records created during the outage, and obleth logs should report WAL replay.
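
On systems without GNU date (macOS, for example), the millisecond cutoff in the query above can be computed with shell arithmetic instead. This is a hedged sketch: the helper names cutoff_ms and wal_replay_query and the optional "now" argument are illustrative and not part of obleth; the table, container name, and credentials are the ones used above.

```shell
#!/bin/sh
# cutoff_ms WINDOW_SECONDS [NOW_SECONDS]
# Prints the epoch time in milliseconds that lies WINDOW_SECONDS before
# NOW_SECONDS (default: the current time). Hypothetical helper.
cutoff_ms() {
  window_s=$1
  now_s=${2:-$(date +%s)}
  echo $(( (now_s - window_s) * 1000 ))
}

# wal_replay_query WINDOW_SECONDS [NOW_SECONDS]
# Prints the count query from this section with a computed cutoff.
wal_replay_query() {
  echo "SELECT count() FROM obleth.usage WHERE ts_ms > $(cutoff_ms "$@")"
}

# Example for a 10-minute window (not executed here):
#   docker exec -it obleth-clickhouse-1 clickhouse-client \
#     --user obleth --password obleth \
#     --query "$(wal_replay_query 600)"
```

The second argument exists only so the cutoff can be pinned to a known time, for instance the moment ClickHouse was paused.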