fragJulia
Changelog

2026-04-25 — Voice compose canonicalization (R-5 + R-6 + R-8)

Folds the 2026-04-24 EC2 hand-edits to docker-compose.yml back into main — vllm-omni image swap for Voxtral, vllm-guard sampler tuning, voice-agent healthcheck port + whisper CPU mode + env-var rename, and a busybox-compatible caddy healthcheck against the admin API. Closes R-5 #666, R-6 #667, R-8 #668, #366, #528.

What changed

voice/docker-compose.yml

vllm-guard (R-5 #666) — three-flag tuning, all needed on L4 24 GB:

  • --gpu-memory-utilization 0.15 → 0.25. At 0.15 only ~800 MB remained for the KV cache after the 2.81 GB Llama-Guard-3-1B model load, hitting ValueError: No available memory for the cache blocks.
  • --max-num-seqs 4 (was vLLM default 256). At the default, sampler warmup OOMed even after KV alloc succeeded. A 1B output guard does not need 256-way concurrency.
  • --enforce-eager — disables CUDA graph capture, which on Llama-Guard-3-1B at this concurrency level is overhead, not headroom.
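The headroom figure in the first bullet can be reproduced with a rough back-of-the-envelope calculation. This is a minimal sketch that assumes the nominal 24 GB figure and treats the budget as total × utilization minus model weights; the function name kv_headroom_gb is hypothetical, and real vLLM accounting also reserves memory for activations, so actual numbers will differ somewhat.

```shell
# Approximate VRAM left for the KV cache, in GB:
#   total_gb * gpu_memory_utilization - model_weights_gb
# Assumption: nominal 24 GB; usable VRAM on a real L4 is somewhat lower.
kv_headroom_gb() {
  awk -v total="$1" -v util="$2" -v model="$3" \
    'BEGIN { printf "%.2f\n", total * util - model }'
}

kv_headroom_gb 24 0.15 2.81   # old setting, prints 0.79
kv_headroom_gb 24 0.25 2.81   # new setting, prints 3.19
```

At 0.15 the estimate lands on the ~800 MB mentioned above; at 0.25 it leaves roughly 3.2 GB before the sampler warmup and other overheads are subtracted.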

vllm-voxtral (R-6 #667) — runtime swap from vllm/vllm-openai:latest to vllm/vllm-omni:v0.18.0:

  • vllm/vllm-openai resolves Voxtral-4B-TTS-2603 as plain MistralForCausalLM and crashes with ValueError: There is no module or parameter named 'acoustic_transformer'. Voxtral-4B-TTS-2603 declares model_type: voxtral_tts with acoustic_transformer_args; the matching architecture lives in vllm-omni. Tier-1 confirmation: HF model card for mistralai/Voxtral-4B-TTS-2603 documents vllm_omni v0.18.0 as the recommended runtime.
  • vllm-omni's image has an empty ENTRYPOINT, so we set entrypoint: ["vllm", "serve"] and rewrite the command to vllm-omni form: ["/models/voxtral-4b-tts", "--omni", "--port", "8001"].
  • NOT included in this PR: the optional voxtral_tts.yaml per-stage VRAM override bind-mount. The vllm-omni defaults (stage 0 = 0.8, stage 1 = 0.1) produce a working ~19.4 GB allocation on L4 24 GB. Tighter tuning to 0.68/0.1 was applied on EC2 on 2026-04-24 for headroom, but extracting the canonical YAML from the image and committing it to the repo is deferred to a follow-up, because fabricating the file from memory would violate feedback_infra_ids_repo_canonical.md. RT-1 #673 tracks the broader VRAM question.
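To illustrate how the entrypoint and command from the second bullet fit together, here is a sketch of a roughly equivalent one-off docker run invocation. The function voxtral_run_cmd only prints the command rather than running it. The host model directory, the --gpus all flag, the port publishing, and the read-only mount are assumptions for illustration and are not taken from the compose file.

```shell
# Print a `docker run` command that mirrors the compose service:
#   entrypoint ["vllm", "serve"] + command ["/models/voxtral-4b-tts", "--omni", "--port", "8001"]
# `docker run --entrypoint` takes a single executable, so "serve" moves into the arguments.
# Assumptions: host model directory ($1), --gpus all, -p 8001:8001, read-only mount.
voxtral_run_cmd() {
  model_dir="$1"
  printf '%s ' docker run --rm --gpus all -p 8001:8001 \
    -v "${model_dir}:/models/voxtral-4b-tts:ro" \
    --entrypoint vllm \
    vllm/vllm-omni:v0.18.0 \
    serve /models/voxtral-4b-tts --omni --port 8001
  printf '\n'
}

# Example with a hypothetical host path:
voxtral_run_cmd /srv/models/voxtral-4b-tts
```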

voice-agent:

  • Healthcheck (R-8 #668) — the livekit-agents framework binds the worker health server on :8081 by default. The compose healthcheck targeted :8080/health, so voice-agent reported starting indefinitely while functioning correctly. Switched to a TCP-only connectivity check on :8081 (the framework's HTTP path schema is version-dependent; port-open is the most stable signal).
  • Env-var rename (SSOT): FASTER_WHISPER_MODEL_PATH → FASTER_WHISPER_MODEL. voice/agent/config.py reads FASTER_WHISPER_MODEL; the previous _PATH-suffixed variable was silently ignored and only worked because the dataclass default happened to match the path. Renaming enforces the single source of truth: if someone changes the compose env to a different path, the agent now picks it up.
  • Whisper on CPU: FASTER_WHISPER_DEVICE=cpu + FASTER_WHISPER_COMPUTE=int8. With vllm-voxtral + vllm-guard saturating the GPU at ~22.4 GB (97%), whisper-large-v3 on CUDA would OOM. CPU + int8 fits within the agent container's CPU budget at acceptable latency for streaming STT.
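A TCP-only connectivity check of the kind described in the healthcheck bullet can be sketched as follows. The exact command used in the compose file is not shown here, so this is one possible approach rather than the shipped probe. It assumes bash is available in the image, since it relies on bash's /dev/tcp pseudo-device; the helper name port_open is hypothetical.

```shell
# Return 0 if a TCP connection to host ($1) and port ($2) succeeds, non-zero otherwise.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed (assumption).
port_open() {
  bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example: probe the worker health port from inside the container.
if port_open 127.0.0.1 8081; then
  echo "port 8081 is accepting connections"
else
  echo "port 8081 is closed"
fi
```

A port-open check like this says nothing about the HTTP path or response body, which is the point: it stays valid even if the framework changes its health route.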

caddy (R-8 #668 — second healthcheck fix in this PR):

The previous probe wget --spider https://localhost:443/healthz could not work on the actual container:

  • caddy:2-alpine ships busybox wget, which does NOT support --spider (that's a GNU-wget-only flag). The CMD would have errored on every interval.
  • The Let's Encrypt cert is bound to livekit.fragjulia.de, so probing localhost would fail TLS verification even with a real wget.

Caddy has been reporting (unhealthy) for the 41-hour run captured in #672 §2 because of these two issues, NOT because /healthz was missing; the route has been in voice/config/Caddyfile since PR #655.

New probe uses Caddy's admin API on 127.0.0.1:2019 (default-on, plaintext HTTP, no auth) — present whenever the caddy process is alive and not deadlocked, busybox-wget-compatible:

test: ["CMD-SHELL", "wget -qO- http://localhost:2019/config/ >/dev/null"]

Header comment: updated from "Three services" to "Five services" with an explicit VRAM budget table; caddy and vllm-voxtral were missing from the original.

voice/config/Caddyfile — unchanged

An earlier draft of this PR cherry-picked a Caddyfile diff from the abandoned claude/672-voice-bringup-docs branch. Reverted after review caught that those "improvements" reverse PR #655's production fixes:

  • The transport http { versions h2c 1.1 } block on reverse_proxy was a known 502 trigger against the LiveKit upstream and was removed in #655.
  • The global protocols h1 h2 block was added in #655 to disable HTTP/3, which had been contending with LiveKit's TURN server on UDP/443.
  • The claude/672-voice-bringup-docs Caddyfile predates #655 and re-introduces both regressions.

Caddyfile stays as merged in PR #655 — already correct, no edit needed.

Why

Six of the nine "live-on-instance" divergences from #672 §6 collapse into this one PR (the two voice/agent/Dockerfile lines land in PR-C #683; the seventh — Caddyfile — turned out to be a regression after closer review of PR #655 history and is NOT in scope here). Each compose change individually was a tiny fix; the bundle exists because they're all on the same file and grouping them avoids a cascade of merge conflicts.

The R-5 + R-6 + R-8 grouping also matches the close-via map in #672:

  • R-5 #666 closes #366 (voice pipeline guardrail Llama-Guard-3-1B alignment).
  • R-8 #668 closes #528 (no healthcheck for voice-agent) AND fixes the broken caddy healthcheck (busybox wget + TLS cert mismatch — both pre-existing on main, neither was the /healthz route gap that #672 finding #11 alleged).
  • R-6 #667 stands alone but is logically inseparable from the vllm-voxtral block edits.

Scope

  • voice/docker-compose.yml (full canonicalization of vllm-guard, vllm-voxtral, voice-agent + caddy healthcheck + header comment).
  • Changelog entry + meta.json.
  • Does NOT touch voice/config/Caddyfile (deliberately — see above), voice/agent/Dockerfile (PR-C scope), voice/.env.example (PR-B scope), voice/scripts/ (PR-E scope).

Test plan

  • CI: docker compose -f voice/docker-compose.yml config exits 0 (YAML/schema sanity).
  • After PR-B + PR-C + this PR merge and EC2 redeploys: docker compose pull fetches vllm/vllm-omni:v0.18.0. Pre-merge: docker manifest inspect vllm/vllm-omni:v0.18.0 succeeds (Risk #1 mitigation).
  • After redeploy: docker compose ps shows all 5 services as (healthy). voice-agent specifically transitions out of starting because the :8081 probe matches the listening port.
  • Direct-endpoint TTS: curl -X POST http://<host>:8001/v1/audio/speech -H 'Content-Type: application/json' -d '{"input":"test","model":"voxtral"}' -o test.wav returns 200 with non-zero bytes (validates R-6).
  • Direct-endpoint Guard: curl http://<host>:8000/v1/models returns Llama-Guard-3-1B (validates R-5).
  • docker compose ps caddy shows (healthy) (validates the new admin-API healthcheck — was broken on main, not a regression introduced here).
  • wget -qO- https://livekit.fragjulia.de/healthz returns OK from outside the box (independent confirmation that the public route works; uses real TLS cert).
  • docker compose config | grep -E 'PYTHONPATH|FASTER_WHISPER_MODEL_PATH' returns empty (workarounds gone).
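The direct-endpoint TTS check above has two conditions (status 200 and a non-empty file), which a plain curl -o does not verify on its own. The sketch below separates the pass/fail logic into a small helper so it can be reused in a smoke-test script. The helper name check_tts_response and the HOST variable are hypothetical, and the Content-Type header in the usage comment assumes the endpoint expects a JSON body.

```shell
# Succeed only when the HTTP status is 200 and the output file is non-empty.
check_tts_response() {
  status="$1"
  file="$2"
  [ "$status" = "200" ] && [ -s "$file" ]
}

# Usage against a live host (HOST is a hypothetical variable):
#   status=$(curl -sS -o test.wav -w '%{http_code}' -X POST \
#     -H 'Content-Type: application/json' \
#     -d '{"input":"test","model":"voxtral"}' \
#     "http://${HOST}:8001/v1/audio/speech")
#   check_tts_response "$status" test.wav && echo "TTS ok"
```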

Rollout / reversibility

Reversible via revert. The vllm-omni image swap is the only piece with non-trivial blast radius: if vllm-omni v0.18.0 is yanked from Docker Hub between merge and deploy, redeploy fails. Mitigation: pin the image digest from EC2's locally-cached tag if needed.
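One way to obtain a digest for pinning is to read it from the locally cached image on the EC2 host. The following is a sketch under that assumption: the helper is_digest_ref is hypothetical and only sanity-checks the shape of the reference before it is pasted into the compose file. RepoDigests is only populated for images that were pulled from a registry.

```shell
# True when the argument looks like <repo>@sha256:<64 lowercase hex chars>.
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}

# On the EC2 host, with the tag already cached locally:
#   ref=$(docker image inspect --format '{{index .RepoDigests 0}}' vllm/vllm-omni:v0.18.0)
#   is_digest_ref "$ref" && echo "image: $ref"
```

Pinning by digest trades convenience for reproducibility: the compose file no longer follows the tag, so a later upgrade has to update the digest explicitly.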

EC2 redeploy after this lands requires docker compose up -d --force-recreate since the vllm-voxtral image, entrypoint, and command all change. Expect ~3-5 min of voice service downtime during the recreate.

Follow-ups

  • voice/config/voxtral_tts.yaml: extract the canonical default from the running vllm-omni v0.18.0 container, apply the 0.68/0.1 (stage 0 / stage 1) override, commit it to the repo, and add the bind-mount to compose. Separate PR; the defaults work for direct-endpoint TTS, so this is tuning, not a blocker.
  • PR-D drops the EC2-side PYTHONPATH=/usr/local/local/lib/python3.12/dist-packages workaround once PR-C #683 lands. If PR-C ships first, no compose change needed; if PR-D ships first, EC2 will fail until PR-C lands.
  • R-10 #670 verification probes (PR-F) gate the bring-up.
