Handover 2026-04-24 — Voice stack bring-up

Bring-up outcome on the single-L4 GPU instance, 18 concrete findings, live-vs-repo divergence, research topics left open, and guidance for the next session.

Epic: neid404/fragjulia#660 (Voice Deploy Repair — Ground-Truth Reconciliation, R-0).

Parent tracker for this handover: the MEGA close-out issue, filed as a native sub-issue of #660. It contains the full 18-finding table, per-R-* comment payloads, and pointers to the attached session memory files and raw transcript.

Status at handover: services are healthy on the GPU instance and direct-endpoint TTS is validated. The end-to-end pipeline through voice-agent is NOT yet tested. Zero PRs have been filed: every stabilisation change lives on the instance only and has not been mirrored back to main.

What the session was about

This was the 2026-04-24 bring-up session for the voice stack: get Voxtral TTS, Llama Guard, and the LiveKit voice-agent all running on one L4 GPU, resolve the ambiguities flagged in epic R-0 (#660), and unblock R-10's bring-up verification (#670). The preceding session (2026-04-22) had already done the SSOT consolidation work (see Handover 2026-04-22) and filed R-0 itself.

Inbound state before the session:

voice/docker-compose.yml:66,106 pinned vllm/vllm-openai:latest for the TTS service.
/models/voxtral-4b-tts/ on the GPU instance was 18 GB of the wrong Voxtral variant (Voxtral-Mini-3B-2507, an ASR model) left over from an earlier exploration.
/models/llama-guard-3-1b/ was empty; HuggingFace gates were ungranted.
The voice-agent Dockerfile had never successfully built on any host.
HF_TOKEN rotation was mid-flight, tracked in #654.

What actually worked — outcome

Service	Result
LiveKit SFU	Healthy (up continuously)
Caddy TLS proxy	Reported unhealthy — known, `/healthz` route missing in Caddyfile (R-8 scope)
`vllm-voxtral`	Healthy after image + config swap; direct-endpoint TTS validated (HTTP 200, 24 kHz mono PCM WAV output on `POST /v1/audio/speech`)
`vllm-guard`	Healthy after memory + concurrency tuning; `GET /health` returns HTTP 200
`voice-agent`	Running; registered as LiveKit worker; `livekit-plugins-turn-detector` model downloaded manually (lifecycle note below)
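The "24 kHz mono PCM WAV" check in the table can be repeated on any saved response body. The following is a minimal sketch using only the Python standard library; the helper names are illustrative and not part of the repo, and fetching the bytes from `POST /v1/audio/speech` is left out because the request payload depends on the deployed image.

```python
import io
import wave


def describe_wav(payload: bytes) -> dict:
    """Parse WAV bytes and report the header fields relevant to the TTS check."""
    with wave.open(io.BytesIO(payload), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "frames": wav.getnframes(),
        }


def is_expected_tts_output(payload: bytes) -> bool:
    """True when the payload is non-empty mono audio at 24 kHz."""
    info = describe_wav(payload)
    return (
        info["channels"] == 1
        and info["sample_rate_hz"] == 24000
        and info["frames"] > 0
    )


def make_silence_wav(sample_rate_hz: int = 24000, channels: int = 1,
                     seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit WAV in memory, used here only as a stand-in."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate_hz * seconds))
    return buf.getvalue()
```

Passing the bytes of a real response to `is_expected_tts_output` gives a quick pass/fail. A streamed response may carry a placeholder length in its header, so the frame count is best treated as a sanity check rather than an exact duration.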

GPU memory on the L4 (24 GiB card): ~22.4 GiB used of ~23.0 GiB usable, leaving ~0.6 GiB free (about 97% utilisation). The stack fits, but there is effectively no headroom for warmup spikes, KV-cache growth, or any fourth workload. This is the single biggest open architectural question, captured in RT-1 (see below).
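The headroom arithmetic can be written out as a small sketch. It uses the 22.4 GiB "used" figure from this section and the 23.0 GiB usable figure quoted under RT-1; the function names are illustrative only.

```python
GIB_USED = 22.4    # reported usage on the L4 during this session
GIB_USABLE = 23.0  # usable capacity as quoted in RT-1 ("22.4/23.0 GiB")


def headroom(used_gib: float, usable_gib: float) -> dict:
    """Return free memory and utilisation for a single GPU."""
    return {
        "free_gib": round(usable_gib - used_gib, 2),
        "utilisation_pct": round(100 * used_gib / usable_gib, 1),
    }


def fits(extra_gib: float, used_gib: float = GIB_USED,
         usable_gib: float = GIB_USABLE) -> bool:
    """Would an additional allocation of extra_gib still fit on the card?"""
    return used_gib + extra_gib <= usable_gib
```

With these numbers, `fits(1.0)` is false: a transient spike of 1 GiB would already exceed the card, which is the concern RT-1 is meant to settle.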

End-to-end flow through the agent (WS → STT → LLM via Bedrock Mistral → Guard → Voxtral TTS → WS) was not exercised this session. That's R-10 (#670) and it remains open.

Findings summary (18 items)

Full detail — with Tier-1 evidence citations and the R-* child each maps to — lives in the MEGA close-out issue (parent = #660). The short version:

Resolved this session:

1. Voxtral-4B-TTS-2603 requires vllm/vllm-omni, not vllm/vllm-openai. Proven by ValueError: no module or parameter named 'acoustic_transformer' in MistralForCausalLM after the weights swap. (Maps to R-4 #664, R-6 #667.)
2. vllm/vllm-omni:v0.18.0 ships with an empty ENTRYPOINT; a compose entrypoint: ["vllm", "serve"] override is required.
3. The --gpu-memory-utilization CLI flag is silently ignored by the vllm-omni multi-stage pipeline; the per-stage YAML in vllm_omni/model_executor/stage_configs/voxtral_tts.yaml takes precedence.
4. --stage-overrides and --deploy-config are main-branch-only in vllm-omni; the released v0.18.0 image doesn't accept them. Workaround: bind-mount overlay of the stage YAML.
5. Llama Guard's default --max-num-seqs 256 OOMs the sampler-warmup phase on a 1B guardrail model. Dropping to --max-num-seqs 4 with --enforce-eager fixes it.
6. Raising Llama Guard's --gpu-memory-utilization from 0.15 to 0.25 was needed to give the KV cache enough room after the 2.8 GB weights load.
7. Python 3.12 isn't in Ubuntu 22.04's default apt repositories, so the deadsnakes PPA is required. Ubuntu 24.04 variants of the nvidia/cuda:12.4.1-* base images don't exist on Docker Hub.
8. Python 3.12 removed distutils, so the voice-agent Dockerfile needs python3.12 -m ensurepip --upgrade before pip install.
9. faster-whisper defaults to cuda + int8; on 583 MiB of free VRAM it would OOM. Forced to CPU via compose env overrides.
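The effect of the Llama Guard --gpu-memory-utilization change can be roughed out with simple arithmetic. This is a back-of-envelope sketch under loud assumptions: the fraction is applied to the ~23.0 GiB usable figure quoted under RT-1, the "2.8 GB" weights figure is treated as GiB, and the CUDA context and activation peaks are ignored. It is not a model of vLLM's actual allocator.

```python
USABLE_GIB = 23.0         # assumption: usable capacity as quoted in RT-1
GUARD_WEIGHTS_GIB = 2.8   # assumption: the "2.8 GB" weights figure, read as GiB


def guard_budget(fraction: float, usable_gib: float = USABLE_GIB) -> float:
    """Memory the Guard service may claim at a given utilisation fraction."""
    return round(fraction * usable_gib, 2)


def left_after_weights(fraction: float,
                       weights_gib: float = GUARD_WEIGHTS_GIB,
                       usable_gib: float = USABLE_GIB) -> float:
    """Remainder for KV cache and warmup buffers once the weights are loaded."""
    return round(fraction * usable_gib - weights_gib, 2)
```

Under these assumptions the remainder grows from `left_after_weights(0.15)` to `left_after_weights(0.25)`, which is the room the KV cache and sampler warmup have to share. Lowering --max-num-seqs shrinks the warmup side of that demand.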

Resolved with a workaround that needs proper repair:

10. A classic Debian posix_local pip scheme interaction means pip install --prefix=/install lands packages at /install/local/lib/python3.12/dist-packages/. The Dockerfile's COPY --from=builder /install /usr/local then puts them at /usr/local/local/lib/python3.12/dist-packages/ (double local), which is outside the default sys.path. Currently papered over by a compose PYTHONPATH env var; proper fix is a Dockerfile change.
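The double-local path in finding 10 follows mechanically from how COPY maps a source directory onto a destination. The sketch below reproduces that mapping with pure path arithmetic. The helper name is hypothetical, and the "expected" location is inferred from the fix options listed under open work rather than read from the live image.

```python
from pathlib import PurePosixPath

# Where pip places packages in the builder stage under the posix_local scheme.
BUILDER_SITE = PurePosixPath("/install/local/lib/python3.12/dist-packages")

# Assumption: the location the interpreter is expected to search by default.
EXPECTED_SITE = PurePosixPath("/usr/local/lib/python3.12/dist-packages")


def copy_destination(path: PurePosixPath, copy_src: str, copy_dst: str) -> PurePosixPath:
    """Map a builder-stage path through `COPY --from=builder <copy_src> <copy_dst>`."""
    return PurePosixPath(copy_dst) / path.relative_to(copy_src)


# Current Dockerfile: COPY --from=builder /install /usr/local
current = copy_destination(BUILDER_SITE, "/install", "/usr/local")

# One of the candidate fixes: COPY --from=builder /install/local /usr/local
fixed = copy_destination(BUILDER_SITE, "/install/local", "/usr/local")
```

Here `current` ends up under /usr/local/local/ while `fixed` equals `EXPECTED_SITE`, which is why changing the COPY source would let the compose PYTHONPATH override be dropped.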

Discovered, not yet addressed:

11. livekit-plugins-turn-detector downloads model_q8.onnx at runtime into the container's ephemeral HF cache. Every docker compose up -d --force-recreate loses it. Seeded RT-2.
12. faster-whisper env var mismatch: code reads FASTER_WHISPER_MODEL, compose sets FASTER_WHISPER_MODEL_PATH. Harmless today (default path matches) but a trap.
13. Caddy /healthz route is missing from the Caddyfile.
14. voice-agent healthcheck in compose hits :8080 but the agent binds :8081.
15. voice/.env line 5 was corrupt pre-session: DEEPGRAM_API_KEY=HF_TOKEN=... (two keys merged).
16. No SSM agent on the GPU instance. EC2 Instance Connect is required (60s ephemeral key window).
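The merged-keys corruption in finding 15 is easy to miss by eye. A small heuristic check, sketched below, flags any line whose value itself starts with something that looks like a second KEY= assignment. The function name and the regular expression are illustrative assumptions, and the heuristic can produce false positives on unusual values.

```python
import re

# A value that begins with UPPER_SNAKE= followed by a non-"=" character
# looks like a second assignment fused onto the same line.
_NESTED_ASSIGNMENT = re.compile(r"^[A-Z][A-Z0-9_]*=[^=]")


def find_merged_assignments(env_text: str) -> list:
    """Return (line_number, key) pairs for lines that appear to hold two keys."""
    suspects = []
    for lineno, raw in enumerate(env_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        if _NESTED_ASSIGNMENT.match(value):
            suspects.append((lineno, key))
    return suspects
```

Running this over the contents of an env file before a restart surfaces the pattern described above, while values that merely end in base64 padding are left alone.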

Methodology lessons (captured in local session memory, not the public site):

17. Browser line-wraps in a web-shell paste become real \n bytes in quoted remote ssh commands, which broke multiple long one-liners.
18. Tier-4 handoff claims deserve verification against the intended variant, not dismissal based on the wrong variant. A prior handoff said "vllm-omni ≥0.18.0 is required" and was dismissed mid-session because vllm/vllm-openai:latest loaded the then-on-disk Voxtral-Mini-3B-2507 fine. That was a wrong-variant false positive: Mini-3B is an ASR arch (VoxtralForConditionalGeneration) and supported by the generic image; the actual target, Voxtral-4B-TTS-2603, is a different arch (voxtral_tts + acoustic_transformer). Once the correct weights were in place, the generic image failed with the error listed above. The handoff was right all along.

Architecture context

The three-service single-GPU arrangement, the VRAM accounting behind it, and the runtime-download lifecycle problem are covered in Voice stack architecture. Read that before opening any of the pending PRs.

Open work (ordered)

Live-on-instance, not-yet-in-repo (the single biggest SSOT liability)

Stabilisation changes applied during bring-up live on the GPU instance's working tree only. Each needs to land in main via its R-*-scoped PR:

R-5 PR (#666 → also closes #366). Guard tuning: --max-num-seqs 4, --enforce-eager, --gpu-memory-utilization 0.25. Smallest, most self-contained — recommended first.
R-4 + R-6 PR (#664 + #667, atomic). Voxtral image swap to vllm/vllm-omni:v0.18.0, empty-ENTRYPOINT override, bind-mount overlay of the stage YAML, command rewrite, and the weights-swap record. Ships the custom voxtral_tts.yaml as a new file under voice/config/.
Voice-agent Dockerfile PR. Adds the deadsnakes PPA install, adds the ensurepip bootstrap, and fixes the --prefix=/install / posix_local bug properly (e.g. --target=/install/lib/python3.12/dist-packages in the builder, or COPY --from=builder /install/local /usr/local) so the compose PYTHONPATH workaround can be removed. Best bundled with R-4.
R-9 PR (#669 → closes #527). Whisper doc-only close — the agent already mounts whisper from the host, so the "baked in image" concern from #527 is already resolved at image level. Normalise the FASTER_WHISPER_MODEL vs FASTER_WHISPER_MODEL_PATH env-var name at the same time.
R-8 PR (#668 → closes #528). Add a /healthz route to voice/config/Caddyfile; change the voice-agent healthcheck port from 8080 to 8081.
HF-cache persistence. Once RT-2 lands a decision (image-bake vs host bind-mount vs plugin lifecycle hook), ship the implementation. Covers turn-detector today and any future livekit-plugin that does runtime HF downloads.

Research topics filed

RT-1 — GPU VRAM architecture viability on L4. Why 22.4/23.0 GiB is where we are, what the ceiling actually is under load, and whether the right answer is to squeeze further, separate the guardrail onto a second GPU, or move to a larger single GPU. Filed as a sub-issue of the MEGA close-out.
RT-2 — Turn-detector (and livekit-plugins) HF-cache lifecycle. The pattern problem behind finding #11. Filed as a sub-issue of the MEGA close-out.

R-10 — the end-to-end verification gate

R-10 (#670) runs three probes: LiveKit WS ping / worker registration; rotation probe (clean agent restart on .env change); full ASR→LLM→Guard→TTS through the agent. This closes last — it depends on R-3/4/5/6/12 plus whatever RT-1 concludes.

R-12 — this doc and its sibling

R-12 (#671) is "migrate voice ops docs into the SSOT site." This handover file and voice-stack-architecture.mdx are the first concrete deliverables for that issue — they're what it's been pointing at. Legacy voice/DEPLOY-AWS.md and voice/CREDENTIALS-CHECKLIST.md migration remains; track as follow-up.

Public-discipline redaction

Per apps/docs/content/docs/operations/ssot-discipline.mdx, this page is on the public site (/docs/operations). It therefore does not name:

the GPU instance's EC2 ID
the public or private IP
any security-group ID or name
the operator-whitelisted source IP
any file-system path that encodes a credential filename

Those details live in the session memory file attached to the MEGA close-out issue (private repo) and in the raw session transcript (also attached to the close-out issue). Operators running the stack find them in the private tracker, not here.

If you're the next session

Read this page and voice-stack-architecture.
Read the MEGA close-out issue (native sub-issue of #660; top of #660's sub-issue list) for the per-R-* comment payloads and the full 18-finding table with citations.
Pull the attached session memory file from the MEGA issue for instance-specific details (ID, IP, SG, /models/ layout).
Re-verify the live state before assuming anything: docker compose ps, nvidia-smi, /health probes. SSM is unavailable on the instance; use EC2 Instance Connect.
Open PRs in the order above. Each closes one R-* child and ships a changelog entry per ssot-discipline.mdx.
Do not assume any claim in a HANDOFF-*.md, CLAUDE.md, voice/DEPLOY-AWS.md, or voice/CREDENTIALS-CHECKLIST.md is still accurate — those are Tier 4 per the discipline rules. Corroborate via the live code or the live instance before acting.
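The re-verification step in the list above can be scripted so that each session starts from the same evidence. The sketch below wraps the two CLI checks and a /health probe using the standard library. The health URLs are left to the caller because hosts and ports are instance-specific, and the function names are illustrative.

```python
import subprocess
import urllib.request

COMPOSE_PS = ["docker", "compose", "ps"]
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_line: str) -> tuple:
    """Parse one 'used, total' line (values in MiB) from the nvidia-smi query."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_mib() -> tuple:
    """Run nvidia-smi on the instance and return (used_mib, total_mib) for GPU 0."""
    out = subprocess.run(NVIDIA_SMI_QUERY, check=True,
                         capture_output=True, text=True).stdout
    return parse_gpu_memory(out.splitlines()[0])


def health_ok(url: str, timeout: float = 5.0) -> bool:
    """True when a GET on the given health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection errors, timeouts, and non-2xx responses
        return False


def live_state(health_urls: list) -> dict:
    """Collect compose status, GPU memory, and per-URL health in one pass."""
    ps = subprocess.run(COMPOSE_PS, check=True, capture_output=True, text=True)
    return {
        "compose_ps": ps.stdout,
        "gpu_memory_mib": gpu_memory_mib(),
        "health": {url: health_ok(url) for url in health_urls},
    }
```

`live_state` is meant to be run on the instance itself, from the directory that holds the compose file, after connecting via EC2 Instance Connect.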

Session artifacts

Epic: #660
MEGA close-out issue (sub-issue of #660): filed via the issue write that produced this page; find it at the top of #660's sub-issue list.
Research tickets RT-1 and RT-2: sub-issues of the MEGA close-out.
Companion page: Voice stack architecture.
Session memory (attached as .md to the MEGA close-out via GitHub web UI): project_voice_bringup_state_2026-04-24.md plus ~12 workflow feedback files.
Raw session transcript (attached as .txt to the MEGA close-out via GitHub web UI).
