Changelog

2026-04-25 — voxtral_tts.yaml stage-0 override moved to repo SSOT

Captures the EC2-live /models/voxtral-tts-config/voxtral_tts.yaml into voice/config/voxtral_tts.yaml and adds the bind-mount to vllm-voxtral. PR-D deliberately deferred this; with Tier-1 verification of the live file vs the upstream default, it is now safe to commit.

What changed

  • voice/config/voxtral_tts.yaml (new) — the per-stage config that bind-mounts over the vllm-omni:v0.18.0 image default. Diff vs the upstream default is exactly one line: stage 0 (audio_generation) gpu_memory_utilization: 0.8 → 0.68. Stage 1 (audio_tokenizer), runtime, connectors, edges, and every other field are byte-identical to the upstream default extracted from vllm/vllm-omni:v0.18.0 via docker run --rm --entrypoint cat … stage_configs/voxtral_tts.yaml.

  • voice/docker-compose.yml — adds the bind-mount line to vllm-voxtral.volumes:

    - ./config/voxtral_tts.yaml:/usr/local/lib/python3.12/dist-packages/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml:ro

    Mount target is the canonical path inside the v0.18.0 image, verified live.
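
The one-line-diff claim above can be checked mechanically. The sketch below is a minimal illustration, not the tooling used for this PR: it compares two config texts line by line with Python's difflib and returns the changed pairs. The function name changed_lines and both sample strings are hypothetical stand-ins; the real file layout is not reproduced here.

```python
import difflib


def changed_lines(default_text: str, override_text: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for lines that differ between two texts.

    An inserted or deleted line is reported with an empty string on the
    missing side, so every changed line appears in some pair.
    """
    old = default_text.splitlines()
    new = override_text.splitlines()
    pairs = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = old[i1:i2]
        added = new[j1:j2]
        # Pad the shorter side so the two lists zip without dropping lines.
        width = max(len(removed), len(added))
        removed += [""] * (width - len(removed))
        added += [""] * (width - len(added))
        pairs.extend(zip(removed, added))
    return pairs


# Illustrative stand-ins only; NOT the real file contents.
SAMPLE_DEFAULT = "stage: 0\ngpu_memory_utilization: 0.8\nstage: 1\n"
SAMPLE_OVERRIDE = "stage: 0\ngpu_memory_utilization: 0.68\nstage: 1\n"

if __name__ == "__main__":
    diff = changed_lines(SAMPLE_DEFAULT, SAMPLE_OVERRIDE)
    assert len(diff) == 1, f"expected exactly one changed line, got {len(diff)}"
    print(diff[0])
```

In practice the two inputs would be the text of the upstream default extracted from the image and the text of voice/config/voxtral_tts.yaml; a result with more than one pair would mean the override has drifted beyond the documented stage-0 change.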

Why now (and not in PR-D)

PR-D #685 deliberately deferred this file. The reasoning in the PR-D body: "extracting the canonical YAML from the image and committing it to repo is deferred to a follow-up — fabricating the file from memory would violate feedback_infra_ids_repo_canonical.md." That was correct discipline at the time — memory only said "stage 0 0.68, stage 1 0.1," which is one bullet, not 110 lines of YAML.

Today's EC2 read produced both the live file (/models/voxtral-tts-config/voxtral_tts.yaml, 3480 bytes, root-owned, currently bind-mounted into the running vllm-voxtral container) and the upstream default extracted from the v0.18.0 image. The two were diffed line-by-line; the only difference is the documented stage-0 number. Committing is now a Tier-1 capture, not a fabrication.

Why this matters for redeploy

PR-D folded the vllm-omni runtime swap into compose (image, entrypoint, command) but had no override file. A clean docker compose up -d --force-recreate vllm-voxtral from current main would therefore start vllm-omni at the default stage-0 gpu_memory_utilization of 0.8, target ~13.8 GB at startup while vllm-guard already holds 5.7 GB, and fail with an out-of-memory (OOM) error. With this PR landed, a clean redeploy from main matches the EC2-live tuning byte-for-byte and that OOM risk is gone.

The bring-up plan deferred the actual EC2 redeploy (the EC2 git tree turned out to be 489 files divergent from main — far more than the bring-up scope), so the immediate value of this PR is making future clean deploys reproducible. The current EC2 stack is already running with this exact file at /models/voxtral-tts-config/voxtral_tts.yaml.

SSOT outcomes

  • Repo main now contains every byte the EC2 /models/voxtral-tts-config/voxtral_tts.yaml had.
  • Once a future clean deploy lands the bind-mount onto a fresh box, the host-side /models/voxtral-tts-config/ directory becomes redundant and can be deleted.
  • The voice/config/ directory is now the SSOT for all bind-mounted runtime config (Caddyfile, livekit.yaml, voxtral_tts.yaml, the cloudwatch/gpu sidecars).

Test plan

  • docker compose -f voice/docker-compose.yml config exits 0; bind-mount path resolves.
  • diff voice/config/voxtral_tts.yaml /models/voxtral-tts-config/voxtral_tts.yaml returns empty when run on the running EC2 instance (the live file is the Tier-1 source).
  • On a future clean redeploy: docker compose up -d --force-recreate vllm-voxtral reaches (healthy) within 3-5 min, nvidia-smi shows ~22.4 / 23 GiB GPU usage (matching the currently observed figure), and there is no OOM.
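
The GPU-usage check in the last item can be scripted. This is a minimal sketch, assuming the reading comes from nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits, which prints one "used, total" line per GPU in MiB. The function names, the 0.5 GiB tolerance, and the sample reading are hypothetical and chosen only for illustration; only the ~22.4 GiB figure comes from the test plan.

```python
MIB_PER_GIB = 1024


def parse_gpu_memory(csv_line: str) -> tuple[float, float]:
    """Parse one 'used, total' line (values in MiB) into (used_gib, total_gib)."""
    used_mib, total_mib = (float(field) for field in csv_line.strip().split(","))
    return used_mib / MIB_PER_GIB, total_mib / MIB_PER_GIB


def usage_matches(csv_line: str, expected_used_gib: float,
                  tolerance_gib: float = 0.5) -> bool:
    """True when used memory is within tolerance_gib of the expected value."""
    used_gib, _total_gib = parse_gpu_memory(csv_line)
    return abs(used_gib - expected_used_gib) <= tolerance_gib


if __name__ == "__main__":
    # Hypothetical reading; on the box, capture the real line from nvidia-smi.
    sample = "22938, 23034"
    print(usage_matches(sample, expected_used_gib=22.4))
```

Taking the reading as a string keeps the check runnable on a machine without a GPU; the tolerance is a judgment call and would need tuning against what the box actually reports.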

Rollout / reversibility

Pure addition. Reversible via revert. The bind-mount is read-only so it cannot affect host-side state. The existing EC2 stack continues running with its current bind-mount until someone explicitly recreates the container.

  • Follow-up to PR-D #685 (compose canonicalization), which deferred this file.
  • Doesn't close any R-* issue on its own — it's the missing piece of R-6 #667 that PR-D footnoted.
  • RT-1 #673 still tracks the broader VRAM ceiling question; this PR is the specific tuning that keeps the current single-L4 layout viable.
