fragJulia
Changelog

2026-04-24 — Voice stack bring-up + MEGA close-out tracker

Bring-up on the single-L4 GPU instance reached a healthy state for all services, and direct-endpoint TTS was validated. The session's 18 findings, 9 live-not-PR'd changes, and 2 open research questions are consolidated into a single close-out tracker.

What changed

  • Added apps/docs/content/docs/operations/handover-2026-04-24-voice-bringup.mdx — session-level handover for the 2026-04-24 voice-stack bring-up.
  • Added apps/docs/content/docs/operations/voice-stack-architecture.mdx — three-service layout on a single NVIDIA L4 (24 GiB), VRAM accounting under vLLM-Omni and vLLM, runtime-download lifecycle gap, and the two architectural questions that seeded the research tickets below.
  • Filed #672 (MEGA close-out) as a native sub-issue of R-0 #660, consolidating: 18 findings with Tier-1 evidence citations; per-R-* comment payloads; a live-on-instance PR backlog of 9 divergences; session-memory + raw-transcript attachment slots.
  • Filed #673 (RT-1: GPU VRAM architecture viability on L4) and #674 (RT-2: turn-detector / livekit-plugins HF-cache lifecycle) as native sub-issues of #672. Both are decision-only tickets — they produce writeups that feed subsequent R-*-scoped implementation PRs.
  • Updated apps/docs/content/docs/operations/meta.json to list both new pages in sidebar order.

Why

The 2026-04-24 bring-up session stabilised the voice stack on the target GPU instance — all services healthy, POST /v1/audio/speech against vllm-voxtral returning a valid 24 kHz WAV — but left two kinds of debt behind:

  1. SSOT divergence. Seven voice/docker-compose.yml changes and two voice/agent/Dockerfile changes are live on the instance only; none have been mirrored back to main. Together they are the single biggest source of risk if the instance needs to be re-provisioned, because a rebuild from main would drop all nine. Each divergence maps to an existing R-* child of #660 and will land via that child's PR.

  2. Open architectural questions. The stack runs at 22.4 / 23.0 GiB = 97% VRAM utilisation. It works at steady state but leaves only about 0.6 GiB of headroom. Separately, livekit-plugins-turn-detector downloads its model into a container-ephemeral HF cache and loses it on every --force-recreate. Both deserve researched decisions, not rushed fixes, hence RT-1 and RT-2.
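
The VRAM figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name vram_headroom is hypothetical, and the inputs are the 22.4 GiB used and 23.0 GiB usable values quoted in this entry, not a live reading from the instance.

```python
def vram_headroom(used_gib: float, usable_gib: float) -> tuple[float, float]:
    """Return (utilisation as a fraction, free headroom in GiB)."""
    if usable_gib <= 0:
        raise ValueError("usable_gib must be positive")
    return used_gib / usable_gib, usable_gib - used_gib


# Figures taken from this changelog entry (steady state on the L4).
utilisation, headroom = vram_headroom(22.4, 23.0)
print(f"{utilisation:.1%} used, {headroom:.1f} GiB free")
```

A headroom of roughly 0.6 GiB is the number RT-1 has to weigh against any transient allocation on the GPU.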

Consolidating this into a single MEGA close-out tracker (#672) with two research sub-issues gives the next session one entry point and a clear split between "open PR now" vs "decide first, then PR."
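
For anyone repeating the direct-endpoint check, the "valid 24 kHz WAV" condition can be asserted on the response body using only the Python standard library. The sketch below is an assumption-laden illustration, not the procedure used in the session: the helper names are hypothetical, the request itself is omitted because this entry does not record its payload, and the wave module only parses PCM-encoded WAV data. A locally generated silent clip stands in for the real response bytes.

```python
import io
import wave


def wav_sample_rate(payload: bytes) -> int:
    """Parse a RIFF/WAVE payload and return its sample rate in Hz.

    Raises wave.Error if the bytes are not a WAV file the module can read.
    """
    with wave.open(io.BytesIO(payload), "rb") as wav:
        return wav.getframerate()


def make_silence(rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a mono 16-bit PCM WAV of silence (stand-in for a response body)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    return buf.getvalue()


# Replace `body` with the bytes returned by POST /v1/audio/speech.
body = make_silence(24000)
assert wav_sample_rate(body) == 24000
```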

Scope of this entry

This is documentation + tracker scaffolding. No voice/** code change lands with this entry. The nine live-on-instance divergences remain to be PR'd under their individual R-* children (#663, #664, #666, #667, #668, #669, with one healthcheck fix in #668 also touching #528).

R-10 (#670) bring-up verification — the end-to-end test through voice-agent — has NOT been run; it remains the closing gate for R-0 (#660).

Follow-ups

  • Open the R-5 (#666) guard-tuning PR first (smallest, most self-contained).
  • Open the R-4 (#664) + R-6 (#667) atomic PR (voxtral image + stage-overlay + Dockerfile fixes).
  • Close R-9 (#669) as whisper-already-not-baked (doc-only); the turn-detector sibling issue stays on RT-2.
  • Close R-8 (#668) with a one-line Caddyfile fix (/healthz route) + healthcheck port 8080 → 8081.
  • Close R-3 (#663) after the trivial .env line-5 text fix.
  • Complete the R-12 (#671) docs migration. This entry starts it; the legacy voice/DEPLOY-AWS.md and voice/CREDENTIALS-CHECKLIST.md pages still need to be migrated.
  • Attach the session memory files and the raw session transcript to #672 via the GitHub web UI (the API cannot attach files).
  • Run R-10 (#670) only after RT-1 has produced its decision — that decision may alter the stack being verified.
