2026-04-25 — voxtral_tts.yaml stage-0 override moved to repo SSOT
Captures the EC2-live /models/voxtral-tts-config/voxtral_tts.yaml into voice/config/voxtral_tts.yaml and adds the bind-mount to vllm-voxtral. PR-D deliberately deferred this; with Tier-1 verification of the live file vs the upstream default, it is now safe to commit.
What changed
- voice/config/voxtral_tts.yaml (new): the per-stage config that bind-mounts over the vllm-omni:v0.18.0 image default. The diff vs the upstream default is exactly one line: stage 0 (audio_generation) gpu_memory_utilization: 0.8 → 0.68. Stage 1 (audio_tokenizer), runtime, connectors, edges, and every other field are byte-identical to the upstream default extracted from vllm/vllm-omni:v0.18.0 via docker run --rm --entrypoint cat … stage_configs/voxtral_tts.yaml.
- voice/docker-compose.yml: adds the bind-mount line to vllm-voxtral.volumes:

      - ./config/voxtral_tts.yaml:/usr/local/lib/python3.12/dist-packages/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml:ro

  The mount target is the canonical path inside the v0.18.0 image, verified live.
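As a rough illustration of what the new volume entry encodes, the sketch below splits a Compose short-syntax bind-mount string into source, target, and mode. The helper name parse_bind_mount and the BindMount type are hypothetical and not part of this PR; the sketch assumes POSIX-style paths with no colons in them.

```python
from typing import NamedTuple


class BindMount(NamedTuple):
    source: str
    target: str
    read_only: bool


def parse_bind_mount(spec: str) -> BindMount:
    """Split a short-syntax volume entry of the form SOURCE:TARGET[:MODE].

    Hypothetical helper for illustration; assumes paths contain no colons.
    """
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected volume spec: {spec!r}")
    source, target = parts[0], parts[1]
    # A missing mode means the mount is read-write.
    mode = parts[2] if len(parts) == 3 else "rw"
    return BindMount(source, target, "ro" in mode.split(","))


# The volume entry added to vllm-voxtral in this PR.
SPEC = (
    "./config/voxtral_tts.yaml:"
    "/usr/local/lib/python3.12/dist-packages/vllm_omni/"
    "model_executor/stage_configs/voxtral_tts.yaml:ro"
)
mount = parse_bind_mount(SPEC)
```

Here mount.read_only is True, which is what keeps the container from writing back to the repo copy.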
Why now (and not in PR-D)
PR-D #685 deliberately deferred this file. The reasoning in the PR-D body: "extracting the canonical YAML from the image and committing it to repo is deferred to a follow-up — fabricating the file from memory would violate feedback_infra_ids_repo_canonical.md." That was correct discipline at the time — memory only said "stage 0 0.68, stage 1 0.1," which is one bullet, not 110 lines of YAML.
Today's EC2 read produced the live file (/models/voxtral-tts-config/voxtral_tts.yaml, 3480 bytes, root-owned, currently bind-mounted into the running vllm-voxtral container) AND the upstream default extracted from the v0.18.0 image. The two were diffed line-by-line; the only difference is the documented stage-0 number. Committing is now a Tier-1 capture, not a fabrication.
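The line-by-line comparison can be sketched roughly as follows, assuming both files have been read into strings. The function name changed_lines is hypothetical, and the sample YAML is a made-up fragment (only gpu_memory_utilization and the 0.8 / 0.68 values come from this PR; the stage_args and stage_id keys are placeholders, not the real file layout).

```python
import difflib


def changed_lines(upstream: str, live: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for lines that were replaced one-for-one.

    Raises ValueError if the files differ by inserted or deleted lines,
    since the claim being checked is "same shape, one value changed".
    """
    a = upstream.splitlines()
    b = live.splitlines()
    pairs: list[tuple[str, str]] = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            raise ValueError("files differ by added or removed lines")
        pairs.extend(zip(a[i1:i2], b[j1:j2]))
    return pairs


# Hypothetical fragment standing in for the upstream default.
UPSTREAM = (
    "stage_args:\n"
    "  - stage_id: 0\n"
    "    gpu_memory_utilization: 0.8\n"
    "  - stage_id: 1\n"
    "    gpu_memory_utilization: 0.1\n"
)
# Stand-in for the EC2-live file: only the stage-0 value differs.
LIVE = UPSTREAM.replace("0.8", "0.68", 1)

pairs = changed_lines(UPSTREAM, LIVE)
```

A check like this only confirms textual equality outside the changed line; it says nothing about whether the YAML is semantically valid.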
Why this matters for redeploy
PR-D folded the vllm-omni runtime swap into compose (image, entrypoint, command) but had no override file. A clean docker compose up -d --force-recreate vllm-voxtral from current main would start vllm-omni at the default stage-0 value of 0.8, target ~13.8 GB at startup while vllm-guard already holds 5.7 GB, and OOM. With this PR landed, a clean redeploy from main matches the EC2-live tuning byte-for-byte and the OOM risk is gone.
The bring-up plan deferred the actual EC2 redeploy (the EC2 git tree turned out to be 489 files divergent from main — far more than the bring-up scope), so the immediate value of this PR is making future clean deploys reproducible. The current EC2 stack is already running with this exact file at /models/voxtral-tts-config/voxtral_tts.yaml.
SSOT outcomes
- Repo main now contains every byte the EC2 /models/voxtral-tts-config/voxtral_tts.yaml had.
- Once a future clean deploy lands the bind-mount onto a fresh box, the host-side /models/voxtral-tts-config/ directory becomes redundant and can be deleted.
- The voice/config/ directory is now the SSOT for all bind-mounted runtime config (Caddyfile, livekit.yaml, voxtral_tts.yaml, the cloudwatch/gpu sidecars).
Test plan
- docker compose -f voice/docker-compose.yml config exits 0; bind-mount path resolves.
- diff voice/config/voxtral_tts.yaml /models/voxtral-tts-config/voxtral_tts.yaml returns empty when run against the running EC2 (this is the Tier-1 source).
- On a future clean redeploy: docker compose up -d --force-recreate vllm-voxtral reaches (healthy) within 3-5 min and nvidia-smi shows ~22.4 / 23 GiB GPU usage (matches current observed), no OOM.
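The GPU-usage check in the last item could be scripted along these lines. This is a sketch, not part of the PR: the function names are hypothetical, the sample reading is an invented value chosen to land near the ~22.4 GiB figure above, and it assumes a single GPU whose memory nvidia-smi reports in MiB.

```python
import subprocess


def parse_gpu_memory(csv_line: str) -> tuple[float, float]:
    """Convert one 'used, total' line (MiB, no units) into GiB values."""
    used_mib, total_mib = (float(field.strip()) for field in csv_line.split(","))
    return used_mib / 1024, total_mib / 1024


def query_gpu_memory() -> tuple[float, float]:
    """Read used/total memory for the first GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: single-GPU host, so only the first line matters.
    return parse_gpu_memory(result.stdout.splitlines()[0])


# Hypothetical reading for illustration; not captured from the EC2 box.
SAMPLE = "22938, 23034"
used_gib, total_gib = parse_gpu_memory(SAMPLE)
```

On the real host, query_gpu_memory() would replace the sample string; comparing used_gib against the expected figure with a small tolerance is probably more robust than an exact match.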
Rollout / reversibility
Pure addition. Reversible via revert. The bind-mount is read-only so it cannot affect host-side state. The existing EC2 stack continues running with its current bind-mount until someone explicitly recreates the container.
Related
- Deferred from PR-D #685 (compose canonicalization).
- Doesn't close any R-* issue on its own — it's the missing piece of R-6 #667 that PR-D footnoted.
- RT-1 #673 still tracks the broader VRAM ceiling question; this PR is the specific tuning that keeps the current single-L4 layout viable.