2026-04-25 — Voice compose canonicalization (R-5 + R-6 + R-8)
Folds the 2026-04-24 EC2 hand-edits to docker-compose.yml back into main — vllm-omni image swap for Voxtral, vllm-guard sampler tuning, voice-agent healthcheck port + whisper CPU mode + env-var rename, and a busybox-compatible caddy healthcheck against the admin API. Closes R-5 #666, R-6 #667, R-8 #668, #366, #528.
What changed
voice/docker-compose.yml
vllm-guard (R-5 #666) — three-flag tuning, all needed on L4 24 GB:
- --gpu-memory-utilization 0.15 → 0.25. At 0.15 only ~800 MB remained for the KV cache after the 2.81 GB Llama-Guard-3-1B model load, hitting ValueError: No available memory for the cache blocks.
- --max-num-seqs 4 (was vLLM default 256). At the default, sampler warmup OOMed even after KV alloc succeeded. A 1B output guard does not need 256-way concurrency.
- --enforce-eager: disables CUDA graph capture, which on Llama-Guard-3-1B at this concurrency level is overhead, not headroom.
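The ~800 MB figure can be reproduced with simple arithmetic. The sketch below assumes the full nominal 24 GB is addressable and counts only the 2.81 GB weight load against the budget; activation and profiling memory are not modeled, so real headroom would be somewhat lower.

```python
# Back-of-envelope KV-cache headroom for vllm-guard on an L4.
# Assumptions (not measured here): 24 GB nominal VRAM, and the only
# consumer inside the budget is the 2.81 GB weight load quoted above.
TOTAL_VRAM_GB = 24.0
MODEL_WEIGHTS_GB = 2.81


def kv_headroom_gb(gpu_memory_utilization: float) -> float:
    """Return the GB left for KV cache after weights at a given fraction."""
    budget = TOTAL_VRAM_GB * gpu_memory_utilization
    return budget - MODEL_WEIGHTS_GB


for fraction in (0.15, 0.25):
    print(f"{fraction:.2f} -> {kv_headroom_gb(fraction):.2f} GB for KV cache")
# prints:
# 0.15 -> 0.79 GB for KV cache
# 0.25 -> 3.19 GB for KV cache
```

Under these assumptions, raising the fraction to 0.25 roughly quadruples the space available for cache blocks.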
vllm-voxtral (R-6 #667) — runtime swap from vllm/vllm-openai:latest to vllm/vllm-omni:v0.18.0:
- vllm/vllm-openai resolves Voxtral-4B-TTS-2603 as plain MistralForCausalLM and crashes with ValueError: There is no module or parameter named 'acoustic_transformer'. Voxtral-4B-TTS-2603 declares model_type: voxtral_tts with acoustic_transformer_args; the matching architecture lives in vllm-omni. Tier-1 confirmation: HF model card for mistralai/Voxtral-4B-TTS-2603 documents vllm_omni v0.18.0 as the recommended runtime.
- vllm-omni's image has an empty ENTRYPOINT, so we set entrypoint: ["vllm", "serve"] and rewrite the command to vllm-omni form: ["/models/voxtral-4b-tts", "--omni", "--port", "8001"].
- NOT included in this PR: the optional voxtral_tts.yaml per-stage VRAM override bind-mount. The vllm-omni defaults (stage 0=0.8, stage 1=0.1) produce a working ~19.4 GB allocation on L4 24 GB. Tighter tuning to 0.68/0.1 was applied on EC2 during 2026-04-24 for headroom, but extracting the canonical YAML from the image and committing it to repo is deferred to a follow-up: fabricating the file from memory would violate feedback_infra_ids_repo_canonical.md. RT-1 #673 tracks the broader VRAM question.
voice-agent:
- Healthcheck (R-8 #668): the livekit-agents framework binds the worker health server on :8081 by default. The compose healthcheck targeted :8080/health, so voice-agent reported starting indefinitely while functioning correctly. Switched to a TCP-only connectivity check on :8081 (the framework's HTTP path schema is version-dependent; port-open is the most stable signal).
- Env-var rename (SSOT): FASTER_WHISPER_MODEL_PATH → FASTER_WHISPER_MODEL. voice/agent/config.py reads FASTER_WHISPER_MODEL; the old _PATH-suffixed name was never read, and the setup only worked because the dataclass default happened to match the path. Renaming enforces the single source of truth: if someone changes the compose env to a different path, the agent now picks it up.
- Whisper on CPU: FASTER_WHISPER_DEVICE=cpu + FASTER_WHISPER_COMPUTE=int8. With vllm-voxtral + vllm-guard saturating the GPU at ~22.4 GB (97%), whisper-large-v3 on CUDA would OOM. CPU + int8 fits within the agent container's CPU budget at acceptable latency for streaming STT.
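The env-var failure mode described above is easy to reproduce in isolation. This is a minimal sketch, not the contents of voice/agent/config.py: the class name, the `from_env` helper, and the default path are hypothetical, and only the variable names come from this changelog.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical default; the real one lives in voice/agent/config.py.
DEFAULT_MODEL = "/models/faster-whisper-large-v3"


@dataclass(frozen=True)
class WhisperConfig:
    model: str = DEFAULT_MODEL

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "WhisperConfig":
        # Only FASTER_WHISPER_MODEL is read; any other name falls
        # through to the default without an error or a warning.
        return cls(model=env.get("FASTER_WHISPER_MODEL", DEFAULT_MODEL))


old = WhisperConfig.from_env({"FASTER_WHISPER_MODEL_PATH": "/models/other"})
new = WhisperConfig.from_env({"FASTER_WHISPER_MODEL": "/models/other"})
print(old.model)  # the default: the _PATH variable was never consulted
print(new.model)  # /models/other
```

With the old name, a changed path in compose would have had no effect; with the new name, the override reaches the agent.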
caddy (R-8 #668 — second healthcheck fix in this PR):
The previous probe wget --spider https://localhost:443/healthz could not work on the actual container:
- caddy:2-alpine ships busybox wget, which does NOT support --spider (that's a GNU-wget-only flag). The CMD would have errored on every interval.
- The Let's Encrypt cert is bound to livekit.fragjulia.de, so probing localhost would fail TLS verification even with a real wget.

Caddy has been reporting (unhealthy) for the 41-hour run captured in #672 §2 because of these two issues, NOT because /healthz was missing; the route has been in voice/config/Caddyfile since PR #655.
New probe uses Caddy's admin API on 127.0.0.1:2019 (default-on, plaintext HTTP, no auth) — present whenever the caddy process is alive and not deadlocked, busybox-wget-compatible:
    test: ["CMD-SHELL", "wget -qO- http://localhost:2019/config/ >/dev/null"]

Header comment: updated from "Three services" to "Five services" with explicit VRAM budget table; caddy and vllm-voxtral were missing from the original.
voice/config/Caddyfile — unchanged
An earlier draft of this PR cherry-picked a Caddyfile diff from the abandoned claude/672-voice-bringup-docs branch. Reverted after review caught that those "improvements" reverse PR #655's production fixes:
- The transport http { versions h2c 1.1 } block on reverse_proxy was a known 502 trigger against the LiveKit upstream and was removed in #655.
- The global protocols h1 h2 block was added in #655 to disable HTTP/3, which had been contending with LiveKit's TURN server on UDP/443.
- The claude/672-voice-bringup-docs Caddyfile predates #655 and re-introduces both regressions.
Caddyfile stays as merged in PR #655 — already correct, no edit needed.
Why
Six of the nine "live-on-instance" divergences from #672 §6 collapse into this one PR (the two voice/agent/Dockerfile lines land in PR-C #683; the seventh — Caddyfile — turned out to be a regression after closer review of PR #655 history and is NOT in scope here). Each compose change individually was a tiny fix; the bundle exists because they're all on the same file and grouping them avoids a cascade of merge conflicts.
The R-5 + R-6 + R-8 grouping also matches the close-via map in #672:
- R-5 #666 closes #366 (voice pipeline guardrail Llama-Guard-3-1B alignment).
- R-8 #668 closes #528 (no healthcheck for voice-agent) AND fixes the broken caddy healthcheck (busybox wget + TLS cert mismatch; both pre-existing on main, neither was the /healthz route gap that #672 finding #11 alleged).
- R-6 #667 stands alone but is logically inseparable from the vllm-voxtral block edits.
Scope
- voice/docker-compose.yml (full canonicalization of vllm-guard, vllm-voxtral, voice-agent + caddy healthcheck + header comment).
- Changelog entry + meta.json.
- Does NOT touch voice/config/Caddyfile (deliberately, see above), voice/agent/Dockerfile (PR-C scope), voice/.env.example (PR-B scope), voice/scripts/ (PR-E scope).
Test plan
- CI: docker compose -f voice/docker-compose.yml config exits 0 (YAML/schema sanity).
- After PR-B + PR-C + this PR merge and EC2 redeploys: docker compose pull fetches vllm/vllm-omni:v0.18.0. Pre-merge: docker manifest inspect vllm/vllm-omni:v0.18.0 succeeds (Risk #1 mitigation).
- After redeploy: docker compose ps shows all 5 services (healthy). voice-agent specifically transitions out of starting because the :8081 probe matches the listening port.
- Direct-endpoint TTS: curl -X POST http://<host>:8001/v1/audio/speech -H 'Content-Type: application/json' -d '{"input":"test","model":"voxtral"}' -o test.wav returns 200 with non-zero bytes (validates R-6).
- Direct-endpoint Guard: curl http://<host>:8000/v1/models returns Llama-Guard-3-1B (validates R-5).
- docker compose ps caddy shows (healthy) (validates the new admin-API healthcheck; it was broken on main, not a regression introduced here).
- wget -qO- https://livekit.fragjulia.de/healthz returns OK from outside the box (independent confirmation that the public route works; uses real TLS cert).
- docker compose config | grep -E 'PYTHONPATH|FASTER_WHISPER_MODEL_PATH' returns empty (workarounds gone).
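The voice-agent check in this plan comes down to whether something accepts TCP connections on :8081. As a rough sketch of what such a port-open probe does, the helper below attempts a connection and reports the result; `port_is_open` and the `localhost` target are illustrative names, not part of the compose file or the livekit-agents API.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 8081 is the worker health port named above; "localhost" is a
    # placeholder for wherever the agent container is reachable.
    print("voice-agent :8081 open:", port_is_open("localhost", 8081))
```

A connect-only probe says nothing about the HTTP response, which is the point: it stays valid even if the framework's health path changes between versions.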
Rollout / reversibility
Reversible via revert. The vllm-omni image swap is the only piece with non-trivial blast radius: if vllm-omni v0.18.0 is yanked from Docker Hub between merge and deploy, redeploy fails. Mitigation: pin the image digest from EC2's locally-cached tag if needed.
EC2 redeploy after this lands requires docker compose up -d --force-recreate since the vllm-voxtral image, entrypoint, and command all change. Expect ~3-5 min of voice service downtime during the recreate.
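If the digest-pinning mitigation is needed, the pinned reference can be read from the JSON that `docker image inspect` prints for the cached image. The helper below is a sketch under that assumption; `pinned_reference` is a hypothetical name and the digest in the sample input is made up.

```python
import json


def pinned_reference(inspect_output: str, repository: str) -> str:
    """Pick the repo@sha256 reference for `repository` from inspect JSON."""
    images = json.loads(inspect_output)  # inspect prints a JSON array
    for image in images:
        # RepoDigests can be empty or null for images never pulled or pushed.
        for ref in image.get("RepoDigests") or []:
            name, _, digest = ref.partition("@")
            if name == repository and digest.startswith("sha256:"):
                return ref
    raise LookupError(f"no repo digest recorded for {repository}")


# Placeholder input: this digest is invented for illustration only.
sample = json.dumps([{"RepoDigests": ["vllm/vllm-omni@sha256:" + "0" * 64]}])
print(pinned_reference(sample, "vllm/vllm-omni"))
```

The returned string can replace the tag in the compose `image:` field, so a later change to the tag on the registry does not alter what gets deployed.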
Follow-ups
- voice/config/voxtral_tts.yaml: extract the canonical default from the running vllm-omni v0.18.0 container, apply the 0.68/0.1 override, commit to repo, add the bind-mount to compose. Separate PR; the defaults work for direct-endpoint TTS, so this is tuning, not a blocker.
- PR-D drops the EC2-side PYTHONPATH=/usr/local/local/lib/python3.12/dist-packages workaround once PR-C #683 lands. If PR-C ships first, no compose change needed; if PR-D ships first, EC2 will fail until PR-C lands.
- R-10 #670 verification probes (PR-F) gate the bring-up.
2026-04-25 — R-4 weights provisioning script + Voxtral CC BY-NC policy note
Adds voice/scripts/provision-weights.sh, an idempotent downloader for the three model weights the voice stack needs (Voxtral-4B-TTS-2603, Llama-Guard-3-1B, faster-whisper-large-v3). Closes R-4 #664; partial-closes #521 with the local-deploy CC BY-NC policy note.
2026-04-25 — R-3 HF_TOKEN canonical path documented (voice/.env)
Adds HF_TOKEN to voice/.env.example with consumption-path comment, and documents the secret surface in voice/docker-compose.yml header. Closes R-3 #663 + #526.