Handover 2026-04-24 — Voice stack bring-up
Bring-up outcome on the single-L4 GPU instance, 18 concrete findings, live-vs-repo divergence, research topics left open, and guidance for the next session.
Epic: neid404/fragjulia#660 (Voice Deploy Repair — Ground-Truth Reconciliation, R-0).
This handover's parent tracker: the MEGA close-out issue filed as a native sub-issue of #660 — contains the full 18-finding table, per-R-* comment payloads, and pointers to attached session memory files + raw transcript.
Status at handover: Services healthy on the GPU instance; direct-endpoint TTS validated; end-to-end pipeline through voice-agent is NOT yet tested; zero PRs filed — every stabilisation change lives on the instance only and has not been mirrored back to main.
What the session was about
This was the 2026-04-24 bring-up session for the voice stack: get Voxtral TTS, Llama Guard, and the LiveKit voice-agent all running on one L4 GPU, resolve the ambiguities flagged in epic R-0 (#660), and unblock R-10's bring-up verification (#670). The preceding session (2026-04-22) had already done the SSOT consolidation work (see Handover 2026-04-22) and filed R-0 itself.
Inbound state before the session:
- voice/docker-compose.yml:66,106 pinned vllm/vllm-openai:latest for the TTS service.
- /models/voxtral-4b-tts/ on the GPU instance was 18 GB of the wrong Voxtral variant (Voxtral-Mini-3B-2507, an ASR model) left over from an earlier exploration.
- /models/llama-guard-3-1b/ was empty; HuggingFace gates were ungranted.
- The voice-agent Dockerfile had never successfully built on any host.
- HF_TOKEN rotation was mid-flight, tracked in #654.
What actually worked — outcome
| Service | Result |
|---|---|
| LiveKit SFU | Healthy (up continuously) |
| Caddy TLS proxy | Reported unhealthy — known, /healthz route missing in Caddyfile (R-8 scope) |
| vllm-voxtral | Healthy after image + config swap; direct-endpoint TTS validated (HTTP 200, 24 kHz mono PCM WAV output on POST /v1/audio/speech) |
| vllm-guard | Healthy after memory + concurrency tuning; GET /health returns HTTP 200 |
| voice-agent | Running; registered as LiveKit worker; livekit-plugins-turn-detector model downloaded manually (lifecycle note below) |
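The direct-endpoint TTS check in the table can be repeated offline on a saved response body. The sketch below is one possible way to confirm the "24 kHz mono PCM WAV" property using only the standard library; the function name is hypothetical, and how the session actually validated the output is not recorded here.

```python
import io
import wave


def is_24khz_mono_pcm_wav(payload: bytes) -> bool:
    """Return True if payload parses as a PCM WAV at 24 kHz with one channel."""
    try:
        with wave.open(io.BytesIO(payload), "rb") as wav:
            return wav.getframerate() == 24_000 and wav.getnchannels() == 1
    except (wave.Error, EOFError):
        # Not a RIFF/WAVE container, an unsupported encoding, or a truncated header.
        return False
```

Feed it the raw bytes returned by POST /v1/audio/speech; a False result means the body is either not a WAV file or has a different sample rate or channel count.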
GPU memory on the L4 (24 GiB card): ~22.4 GiB of ~23.0 GiB used, ~0.6 GiB free, roughly 97% utilisation. The stack fits, but there is effectively no headroom for warmup spikes, KV-cache growth, or any fourth workload. This is the single biggest open architectural question, captured in RT-1 (see below).
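For reference, the headroom figures can be recomputed from two numbers. This is a minimal sketch that assumes the 23.0 GiB total quoted in RT-1 (22.4/23.0 GiB) is the denominator behind the 97% figure; the function name is hypothetical.

```python
def vram_headroom(used_gib: float, total_gib: float) -> tuple[float, float]:
    """Return (free GiB, utilisation as a fraction) for a single GPU."""
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    return total_gib - used_gib, used_gib / total_gib


# Figures from this handover: ~22.4 GiB used out of ~23.0 GiB (assumed total).
free_gib, utilisation = vram_headroom(used_gib=22.4, total_gib=23.0)
print(f"free: {free_gib:.1f} GiB, utilisation: {utilisation:.0%}")
```

Re-running this with a fresh nvidia-smi reading is a quick way to see whether the margin has moved since the session.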
End-to-end flow through the agent (WS → STT → LLM via Bedrock Mistral → Guard → Voxtral TTS → WS) was not exercised this session. That's R-10 (#670) and it remains open.
Findings summary (18 items)
Full detail — with Tier-1 evidence citations and the R-* child each maps to — lives in the MEGA close-out issue (parent = #660). The short version:
Resolved this session:
1. Voxtral-4B-TTS-2603 requires vllm/vllm-omni, not vllm/vllm-openai. Proven by ValueError: no module or parameter named 'acoustic_transformer' in MistralForCausalLM after the weights swap. (Maps to R-4 #664, R-6 #667.)
2. vllm/vllm-omni:v0.18.0 ships with an empty ENTRYPOINT; a compose entrypoint: ["vllm", "serve"] override is required.
3. The --gpu-memory-utilization CLI flag is silently ignored by the vllm-omni multi-stage pipeline; the per-stage YAML in vllm_omni/model_executor/stage_configs/voxtral_tts.yaml takes precedence.
4. --stage-overrides and --deploy-config are main-branch-only in vllm-omni; the released v0.18.0 image doesn't accept them. Workaround: bind-mount overlay of the stage YAML.
5. Llama Guard's default --max-num-seqs 256 OOMs the sampler-warmup phase on a 1B guardrail model. Dropping to --max-num-seqs 4 with --enforce-eager fixes it.
6. Raising Llama Guard's --gpu-memory-utilization from 0.15 to 0.25 was needed to give the KV cache enough room after the 2.8 GB weights load.
7. Python 3.12 isn't in Ubuntu 22.04's default apt repositories; the deadsnakes PPA is required. Ubuntu 24.04 variants of the nvidia/cuda:12.4.1-* base images don't exist on Docker Hub.
8. Python 3.12 removed distutils; the voice-agent Dockerfile needs python3.12 -m ensurepip --upgrade before pip install.
9. faster-whisper defaults to cuda + int8; on 583 MiB of free VRAM it would OOM. Forced to CPU via compose env overrides.
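The Llama Guard memory finding can be made concrete with rough arithmetic. The sketch below assumes that --gpu-memory-utilization caps the process at that fraction of total GPU memory, that the total is the ~23.0 GiB quoted in RT-1, and that "2.8 GB" is a decimal figure. It ignores activations and warmup buffers, so real KV-cache room is smaller than what it prints. The function name is hypothetical.

```python
GIB = 2**30  # bytes per GiB


def kv_budget_gib(total_gib: float, utilisation: float, weights_bytes: float) -> float:
    """GiB left under the memory cap once the weights are resident.

    Upper bound only: activations and warmup buffers are not subtracted.
    """
    return total_gib * utilisation - weights_bytes / GIB


for fraction in (0.15, 0.25):
    left = kv_budget_gib(total_gib=23.0, utilisation=fraction, weights_bytes=2.8e9)
    print(f"--gpu-memory-utilization {fraction}: about {left:.2f} GiB before overheads")
```

Under these assumptions the 0.15 setting leaves well under 1 GiB after weights, which is in line with the need to raise the cap and shrink --max-num-seqs.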
Resolved with a workaround that needs proper repair:
10. A classic Debian posix_local pip scheme interaction means pip install --prefix=/install lands packages at /install/local/lib/python3.12/dist-packages/. The Dockerfile's COPY --from=builder /install /usr/local then puts them at /usr/local/local/lib/python3.12/dist-packages/ (double local) — outside default sys.path. Currently papered over by a compose PYTHONPATH env var; proper fix is a Dockerfile change.
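The double-local path in finding 10 follows mechanically from how COPY maps a source directory's contents onto the destination. The sketch below models only that path mapping with pathlib; the helper name is hypothetical and it does not touch Docker or pip.

```python
from pathlib import PurePosixPath


def copied_path(src_file: str, copy_src: str, copy_dest: str) -> PurePosixPath:
    """Where src_file ends up after COPY copy_src copy_dest (directory source)."""
    return PurePosixPath(copy_dest) / PurePosixPath(src_file).relative_to(copy_src)


installed = "/install/local/lib/python3.12/dist-packages"

# Current Dockerfile: COPY --from=builder /install /usr/local
print(copied_path(installed, "/install", "/usr/local"))

# One of the proposed repairs: COPY --from=builder /install/local /usr/local
print(copied_path(installed, "/install/local", "/usr/local"))
```

The first call shows the extra local segment; the second shows the path the interpreter is expected to search, which is why the compose PYTHONPATH override becomes unnecessary after the repair.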
Discovered, not yet addressed:
11. livekit-plugins-turn-detector downloads model_q8.onnx at runtime into the container's ephemeral HF cache. Every docker compose up -d --force-recreate loses it. Seeded RT-2.
12. faster-whisper env var mismatch: code reads FASTER_WHISPER_MODEL, compose sets FASTER_WHISPER_MODEL_PATH. Harmless today (default path matches) but a trap.
13. Caddy /healthz route is missing from the Caddyfile.
14. voice-agent healthcheck in compose hits :8080 but the agent binds :8081.
15. voice/.env line 5 was corrupt pre-session: DEEPGRAM_API_KEY=HF_TOKEN=... (two keys merged).
16. No SSM agent on the GPU instance. EC2 Instance Connect is required (60s ephemeral key window).
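A merged line like the one in finding 15 is easy to catch before a deploy. The sketch below is a heuristic check, not an existing project tool: it flags any assignment whose value itself begins with an upper-case name followed by an equals sign. The function name and regex are assumptions, and an all-upper-case value containing "=" would be a false positive.

```python
import re

# KEY=OTHER_KEY=... : the value starts with what looks like a second variable name.
_MERGED = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=[A-Z][A-Z0-9_]*=")


def find_merged_assignments(env_text: str) -> list[int]:
    """Return 1-based line numbers that look like two assignments fused together."""
    hits = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments are never flagged.
        if stripped and not stripped.startswith("#") and _MERGED.match(stripped):
            hits.append(number)
    return hits
```

Run it over the contents of voice/.env and treat any reported line as needing a manual look; it only reports line numbers, so no secret values are printed.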
Methodology lessons (captured in local session memory, not the public site):
17. Browser line-wraps in a web-shell paste become real \n bytes in quoted remote ssh commands — broke multiple long one-liners.
18. Tier-4 handoff claims deserve verification against the intended variant, not dismissal based on the wrong variant. A prior handoff said "vllm-omni ≥0.18.0 is required" and was dismissed mid-session because vllm/vllm-openai:latest loaded the then-on-disk Voxtral-Mini-3B-2507 fine. That was a wrong-variant false positive — Mini-3B is an ASR arch (VoxtralForConditionalGeneration) and supported by the generic image; the actual target, Voxtral-4B-TTS-2603, is a different arch (voxtral_tts + acoustic_transformer). Once the correct weights were in place, the generic image failed with the error listed above. The handoff was right all along.
Architecture context
The three-service single-GPU arrangement, the VRAM accounting behind it, and the runtime-download lifecycle problem are covered in Voice stack architecture. Read that before opening any of the pending PRs.
Open work (ordered)
Live-on-instance, not-yet-in-repo (the single biggest SSOT liability)
Stabilisation changes applied during bring-up live on the GPU instance's working tree only. Each needs to land in main via its R-*-scoped PR:
- R-5 PR (#666 → also closes #366). Guard tuning: --max-num-seqs 4, --enforce-eager, --gpu-memory-utilization 0.25. Smallest, most self-contained; recommended first.
- R-4 + R-6 PR (#664 + #667, atomic). Voxtral image swap to vllm/vllm-omni:v0.18.0, empty-ENTRYPOINT override, bind-mount overlay of the stage YAML, command rewrite, and the weights-swap record. Ships the custom voxtral_tts.yaml as a new file under voice/config/.
- Voice-agent Dockerfile PR. Adds the deadsnakes PPA install, adds the ensurepip bootstrap, and fixes the --prefix=/install / posix_local bug properly (e.g. --target=/install/lib/python3.12/dist-packages in the builder, or COPY --from=builder /install/local /usr/local) so the compose PYTHONPATH workaround can be removed. Best bundled with R-4.
- R-9 PR (#669 → closes #527). Whisper doc-only close: the agent already mounts whisper from the host, so the "baked in image" concern from #527 is already resolved at image level. Normalise the FASTER_WHISPER_MODEL vs FASTER_WHISPER_MODEL_PATH env-var name at the same time.
- R-8 PR (#668 → closes #528). Add a /healthz route to voice/config/Caddyfile; change the voice-agent healthcheck port from 8080 to 8081.
- HF-cache persistence. Once RT-2 lands a decision (image-bake vs host bind-mount vs plugin lifecycle hook), ship the implementation. Covers turn-detector today and any future livekit-plugin that does runtime HF downloads.
Research topics filed
- RT-1 — GPU VRAM architecture viability on L4. Why 22.4/23.0 GiB is where we are, what the ceiling actually is under load, and whether the right answer is to squeeze further, separate the guardrail onto a second GPU, or move to a larger single GPU. Filed as a sub-issue of the MEGA close-out.
- RT-2 — Turn-detector (and livekit-plugins) HF-cache lifecycle. The pattern problem behind finding #11. Filed as a sub-issue of the MEGA close-out.
R-10 — the end-to-end verification gate
R-10 (#670) runs three probes: LiveKit WS ping / worker registration; rotation probe (clean agent restart on .env change); full ASR→LLM→Guard→TTS through the agent. This closes last — it depends on R-3/4/5/6/12 plus whatever RT-1 concludes.
R-12 — this doc and its sibling
R-12 (#671) is "migrate voice ops docs into the SSOT site." This handover file and voice-stack-architecture.mdx are the first concrete deliverables for that issue — they're what it's been pointing at. Legacy voice/DEPLOY-AWS.md and voice/CREDENTIALS-CHECKLIST.md migration remains; track as follow-up.
Public-discipline redaction
Per apps/docs/content/docs/operations/ssot-discipline.mdx, this page is on the public site (/docs/operations). It therefore does not name:
- the GPU instance's EC2 ID
- the public or private IP
- any security-group ID or name
- the operator-whitelisted source IP
- any file-system path that encodes a credential filename
Those details live in the session memory file attached to the MEGA close-out issue (private repo) and in the raw session transcript (also attached to the close-out issue). Operators running the stack find them in the private tracker, not here.
If you're the next session
- Read this page and voice-stack-architecture.
- Read the MEGA close-out issue (native sub-issue of #660; top of #660's sub-issue list) for the per-R-* comment payloads and the full 18-finding table with citations.
- Pull the attached session memory file from the MEGA issue for instance-specific details (ID, IP, SG, /models/ layout).
- Re-verify the live state before assuming anything: docker compose ps, nvidia-smi, /health probes. SSM is unavailable on the instance; use EC2 Instance Connect.
- Open PRs in the order above. Each closes one R-* child and ships a changelog entry per ssot-discipline.mdx.
- Do not assume any claim in a HANDOFF-*.md, CLAUDE.md, voice/DEPLOY-AWS.md, or voice/CREDENTIALS-CHECKLIST.md is still accurate; those are Tier 4 per the discipline rules. Corroborate via the live code or the live instance before acting.
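The /health re-verification step can be scripted. The sketch below is a minimal standard-library probe that treats anything other than an HTTP 200 as unhealthy; the function name is hypothetical, and the host and port in the usage note are placeholders, since the real service ports are not listed on this page.

```python
import http.client
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connections, timeouts, non-2xx replies and malformed URLs
        # are all reported as unhealthy rather than raised.
        return False
```

For example, probe_health("http://localhost:<port>/health") run on the instance for each vLLM service, with <port> taken from the live compose file rather than from any handoff document.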
Session artifacts
- Epic: #660
- MEGA close-out issue (sub-issue of #660): filed via the issue write that produced this page; find it at the top of #660's sub-issue list.
- Research tickets RT-1 and RT-2: sub-issues of the MEGA close-out.
- Companion page: Voice stack architecture.
- Session memory (attached as .md to the MEGA close-out via GitHub web UI): project_voice_bringup_state_2026-04-24.md plus ~12 workflow feedback files.
- Raw session transcript (attached as .txt to the MEGA close-out via GitHub web UI).