Voice stack architecture — single-L4 GPU layout
Three-service arrangement on one NVIDIA L4 (24 GiB), VRAM accounting under vLLM and vLLM-Omni, the runtime-download lifecycle gap for agent-side HF models, and the open architectural questions that seeded RT-1 and RT-2.
Status: current as of 2026-04-24 bring-up. See Handover 2026-04-24 for what was achieved in the session that produced this architecture writeup. This page is Tier-2 SSOT per SSOT discipline.
Service layout
Three GPU-adjacent services plus two network services run on a single g6.xlarge-class EC2 instance (16 GiB host RAM, ~290 GB EBS root) with one NVIDIA L4 GPU (24 GiB VRAM). Instance-specific identifiers (instance ID, IP addresses, security group, whitelisted source IP) are redacted from this public page; they live in the private tracker and in the session memory attached to the MEGA close-out issue under epic #660.
    LiveKit Server ──────────────┐
    (WebRTC SFU, :7880)          │
                                 ▼
    Caddy (:443, TLS) ──► voice-agent ◄──┬── vllm-guard (:8000, Llama Guard 3 1B)
                         (LiveKit agent, │
                          faster-whisper │
                          on CPU,        └── vllm-voxtral (:8001, Voxtral-4B-TTS-2603)
                          LiveKit plugins)
                                 │
                                 └── Bedrock Mistral Large (external, sole outbound API)

- `livekit-server`: WebRTC SFU. Host networking for UDP.
- `caddy`: TLS reverse proxy. Host networking.
- `vllm-guard`: Llama Guard 3 1B output guardrail. Docker bridge network, exposes `:8000`.
- `vllm-voxtral`: Voxtral-4B-TTS-2603 TTS engine. Docker bridge, exposes `:8001`.
- `voice-agent`: Python LiveKit agent process. Host networking. Runs `faster-whisper` for STT on CPU, calls Bedrock (Mistral Large) for the LLM turn, calls `vllm-guard:8000` for output guardrailing, and calls `vllm-voxtral:8001` for TTS.
Model weights live on the host's /models/ directory and are mounted read-only into each container. No weights are baked into images.
- `/models/faster-whisper-large-v3`: 2.9 GB (STT)
- `/models/llama-guard-3-1b`: 2.8 GB (Guard)
- `/models/voxtral-4b-tts`: 7.5 GB (TTS, `mistralai/Voxtral-4B-TTS-2603`, Mistral-native format)
- `/models/voxtral-tts-config/voxtral_tts.yaml`: custom stage-config overlay (see "VRAM tuning" below)
VRAM accounting on the L4
Observed post-bring-up (2026-04-24):
| Workload | VRAM |
|---|---|
| vllm-voxtral (TTS) | ~16.7 GiB |
| vllm-guard (Guardrail) | ~5.7 GiB |
| voice-agent whisper (CPU) | ~0 GiB on GPU |
| voice-agent turn-detector (CPU ONNX runtime) | ~0 GiB on GPU |
| Total | ~22.4 / 23.0 GiB = 97% |
This is stable at steady state but leaves effectively no headroom for warmup spikes, KV-cache growth under load, or any fourth workload. The 97% figure is the principal architectural concern behind RT-1 (filed as a sub-issue of the bring-up MEGA close-out).
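The headroom arithmetic behind the table can be reproduced with a minimal sketch. The per-service figures and the 23.0 GiB denominator are taken from the table as given; this page does not say how that denominator was measured, so treat it as an assumption. The function and variable names are illustrative only.

```python
# Observed per-service GPU memory from the table above, in GiB.
observed_gib = {
    "vllm-voxtral": 16.7,
    "vllm-guard": 5.7,
    "voice-agent whisper (CPU)": 0.0,
    "voice-agent turn-detector (CPU)": 0.0,
}

# Denominator used in the table (assumption: usable VRAM, not the nominal 24 GiB).
USABLE_GIB = 23.0


def vram_summary(workloads: dict[str, float], usable_gib: float) -> tuple[float, float, float]:
    """Return (total used, headroom, utilisation fraction), rounded for display."""
    total = sum(workloads.values())
    headroom = usable_gib - total
    return round(total, 1), round(headroom, 1), round(total / usable_gib, 3)


total, headroom, utilisation = vram_summary(observed_gib, USABLE_GIB)
print(f"used {total} GiB, headroom {headroom} GiB, utilisation {utilisation:.0%}")
# prints: used 22.4 GiB, headroom 0.6 GiB, utilisation 97%
```

The remaining 0.6 GiB is the whole budget for warmup spikes and KV-cache growth, which is why the figure is treated as a concern rather than a comfortable steady state.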
Why Voxtral is ~16.7 GiB
The raw Voxtral-4B-TTS weights themselves are estimated at 11–12 GiB (the sibling Voxtral-Mini-4B-Realtime, Apache-2.0, has a published pure-C implementation by antirez documenting 10.4 GiB raw for the Ministral-3 backbone; Voxtral-4B-TTS adds the acoustic_transformer plus FlowMatching stages). The remaining ~5 GiB is vLLM-Omni's engine overhead:
- Pre-allocated KV cache, sized to the configured `gpu_memory_utilization` target.
- CUDA-graph capture for ~51 sizes (`[1, 2, 4, …, 512]`).
- Inductor compile cache + scratch buffers.
- Sampler-warmup scratch.
The HuggingFace model card's "≥ 16 GiB GPU" requirement for Voxtral-4B-TTS is a post-overhead figure, not raw-weight. That explains why there is no realistic path to running this model on a 16 GiB card like a T4: the floor is the overhead, not the weights.
Why Guard is ~5.7 GiB
The Llama Guard 3 1B weights are 2.8 GB. The remaining ~3 GiB is vLLM's standard engine overhead — KV-cache pre-allocation (sized by --gpu-memory-utilization), CUDA-graph capture, and sampler-warmup dummy requests.
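The overhead figures for both services come from the same subtraction: observed footprint minus raw weights. The sketch below works it through, converting the Guard's on-disk size from GB to GiB first. It assumes the weights occupy the same number of bytes in VRAM as on disk (no dtype conversion at load), which this page does not confirm; the helper names are illustrative.

```python
GIB = 2**30  # bytes per GiB (VRAM figures)
GB = 10**9   # bytes per GB (disk sizes)


def gb_to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


def engine_overhead_gib(observed_gib: float, weights_gib: float) -> float:
    """Observed GPU footprint minus raw weights; the rest is engine overhead."""
    return observed_gib - weights_gib


# Guard: 2.8 GB on disk, ~5.7 GiB observed on the GPU.
guard_weights_gib = gb_to_gib(2.8)
guard_overhead = engine_overhead_gib(5.7, guard_weights_gib)

# Voxtral: weights are an estimate (11 to 12 GiB), so the overhead is a range.
voxtral_overhead_low = engine_overhead_gib(16.7, 12.0)
voxtral_overhead_high = engine_overhead_gib(16.7, 11.0)

print(f"guard weights ~{guard_weights_gib:.1f} GiB, overhead ~{guard_overhead:.1f} GiB")
print(f"voxtral overhead ~{voxtral_overhead_low:.1f} to {voxtral_overhead_high:.1f} GiB")
```

Under those assumptions the Guard's overhead comes out at roughly 3.1 GiB and Voxtral's at 4.7 to 5.7 GiB, consistent with the "~3 GiB" and "~5 GiB" quoted in the text.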
With vllm-guard's defaults on a 24 GiB card, sampler warmup tried to reserve space for 256 concurrent requests (the default --max-num-seqs) and OOM'd after the KV cache had already allocated. Dropping to --max-num-seqs 4 plus --enforce-eager (no CUDA-graph capture for the guardrail model) was sufficient. A 1B guardrail serving the output of one voice agent has no need for 256-way concurrency.
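As a rough illustration, the guardrail's server arguments could be assembled as below. The flags `--model`, `--port`, `--max-num-seqs` and `--enforce-eager` are standard vLLM server options; the helper function is hypothetical, and the `--gpu-memory-utilization` value is deliberately omitted because this page does not state what the live service uses.

```python
def guard_server_args(model_path: str, port: int, max_num_seqs: int) -> list[str]:
    """Build the vLLM server argument list for the guardrail container."""
    return [
        "--model", model_path,
        "--port", str(port),
        # Cap concurrent sequences so sampler warmup does not size for 256 requests.
        "--max-num-seqs", str(max_num_seqs),
        # Skip CUDA-graph capture; eager mode gives up some speed to save VRAM.
        "--enforce-eager",
    ]


args = guard_server_args("/models/llama-guard-3-1b", port=8000, max_num_seqs=4)
print(" ".join(args))
# prints: --model /models/llama-guard-3-1b --port 8000 --max-num-seqs 4 --enforce-eager
```

Keeping the arguments in one place makes it easier to review the memory-relevant flags together when the concurrency assumption changes.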
VRAM tuning mechanisms that work (and don't) on vLLM-Omni v0.18.0
The voxtral service runs vllm/vllm-omni:v0.18.0 — not the generic vllm/vllm-openai. This was determined during the bring-up because Voxtral-4B-TTS-2603 declares "model_type": "voxtral_tts" with acoustic_transformer_args in its params.json, which vllm-openai's Mistral loader resolves to a plain MistralForCausalLM and fails with ValueError: no module or parameter named 'acoustic_transformer'. The HuggingFace model card for mistralai/Voxtral-4B-TTS-2603 confirms: "Measured using vllm_omni/examples/offline_inference/voxtral_tts/end2end.py … vllm version: v0.18.0".
Practical consequences for operators tuning memory:
- The top-level `--gpu-memory-utilization` CLI flag is silently ignored by vllm-omni's multi-stage pipeline. The per-stage `gpu_memory_utilization` in the loaded deploy config takes precedence.
- The `--stage-overrides` JSON flag is main-branch only. The released v0.18.0 image doesn't accept it (`vllm: error: unrecognized arguments: --stage-overrides {...}`).
- The `--deploy-config <yaml-path>` flag is also main-branch only and is not available in v0.18.0.
- The working mechanism on v0.18.0 is to bind-mount a modified copy of `vllm_omni/model_executor/stage_configs/voxtral_tts.yaml` over the container's default path. Extract it once via `docker run --rm --entrypoint cat vllm/vllm-omni:v0.18.0 <that-path>`, edit the `gpu_memory_utilization` per stage, save it to a host path under `voice/config/`, and bind-mount it with `:ro`.
- Default per-stage budgets in `voxtral_tts.yaml` are not symmetric: stage 0 (audio generation) defaults to `0.8`, stage 1 (audio tokenizer) defaults to `0.1`. Total default target = `0.9 × 24 = 21.6 GiB`; observed actual on the default is ~19.4 GiB.
- Current live tuning: stage 0 dropped to `0.68`, stage 1 unchanged at `0.1`. Total target `0.78 × 24 = 18.7 GiB`; observed ~16.7 GiB actual.
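The per-stage targets above can be checked with a short sketch that sums the stage fractions and scales them by the nominal 24 GiB card size, then compares each target with the observed footprint. The function name is illustrative; the fractions and observed values are the ones quoted on this page.

```python
CARD_GIB = 24.0  # nominal L4 VRAM, as used in the targets above


def target_gib(stage_fractions: list[float], card_gib: float = CARD_GIB) -> float:
    """Sum per-stage gpu_memory_utilization fractions and scale to GiB."""
    return sum(stage_fractions) * card_gib


default_target = target_gib([0.8, 0.1])   # stage 0 and stage 1 defaults
tuned_target = target_gib([0.68, 0.1])    # stage 0 lowered, stage 1 unchanged

# Observed footprints quoted on this page, for comparison with the targets.
default_observed, tuned_observed = 19.4, 16.7

print(f"default: target {default_target:.1f} GiB, "
      f"observed/target {default_observed / default_target:.2f}")
print(f"tuned:   target {tuned_target:.1f} GiB, "
      f"observed/target {tuned_observed / tuned_target:.2f}")
```

In both configurations the observed footprint lands at roughly 0.9 of the target. That is two data points, not a rule, so any further reduction of stage 0 should be measured rather than extrapolated.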
Watch the vllm-omni release cadence — once the main-branch --deploy-config / --stage-overrides flags ship to a tagged Docker Hub image, the bind-mount overlay can be dropped. As of 2026-04-24, v0.18.0 (2026-03-29) is still the latest published tag.
Runtime-download lifecycle gap
Three agent-side model sources default to fetching from HuggingFace at runtime:
- `faster-whisper-large-v3`: loaded via `WhisperModel(...)`. In our stack it is pre-seeded on the host under `/models/` and mounted read-only, so it is not a problem.
- `livekit/turn-detector` (`model_q8.onnx`, ~281 MB): fetched by `livekit-plugins-turn-detector` at agent startup into `/root/.cache/huggingface/hub/` inside the running container. Every `docker compose up -d --force-recreate voice-agent` destroys the download. The first boot after a recreate emits `RuntimeError: livekit-plugins-turn-detector initialization failed. Could not find file "model_q8.onnx"`. The error is non-fatal (the worker registers with LiveKit anyway), but conversation turn detection degrades until the lazy fetch succeeds.
- Any future LiveKit plugin that follows the same pattern.
This is a pattern problem, not a turn-detector-only problem. The canonical options are:
- Image-bake at build time: `RUN python main.py download-files` in the voice-agent Dockerfile. Pro: air-gap safe, container start is instant. Con: image size grows, every model update forces a rebuild.
- Host bind-mount: mount `/models/hf_cache:/root/.cache/huggingface` from the host. Pro: image stays small, cache survives container recreates, cold host still downloads once. Con: depends on filesystem layout agreement between image and host.
- Per-plugin lifecycle hook: some plugins expose their own pre-download entry points. Pro: targeted. Con: inconsistent across plugins; not all support it.
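Whichever option is chosen, a startup preflight could make the gap visible before the first conversation rather than after it. The sketch below is one hypothetical way to do that with the standard library only: it searches a cache directory recursively for required filenames, so it does not depend on the cache's internal layout. The function names are illustrative and not part of any LiveKit or HuggingFace API.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_cached_file(cache_root: Path, filename: str) -> Optional[Path]:
    """Return the first file named `filename` anywhere under `cache_root`, else None."""
    if not cache_root.is_dir():
        return None
    # Search recursively rather than assuming the cache's directory structure.
    return next((p for p in sorted(cache_root.rglob(filename)) if p.is_file()), None)


def preflight(cache_root: Path, required: list[str]) -> list[str]:
    """Return the required filenames that are missing from the cache."""
    return [name for name in required if find_cached_file(cache_root, name) is None]


# Demonstration on an empty temporary directory standing in for a fresh container.
with tempfile.TemporaryDirectory() as tmp:
    missing = preflight(Path(tmp), ["model_q8.onnx"])
    print(f"missing from cache: {missing}")
```

In the agent this might be called with `Path("/root/.cache/huggingface/hub")` at startup, logging a clear warning when the list is non-empty so a degraded first boot is not mistaken for a plugin bug.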
This is RT-2, filed as a sub-issue of the MEGA bring-up close-out. Decision owed before the next voice-agent Dockerfile PR lands.
Open architectural questions (the ones that need research, not implementation)
RT-1: Is the L4 sized right for this three-workload mix?
The live stack is at 97% VRAM and nominally works. Open questions:
- What is the ceiling under realistic load? Concurrent WS sessions, turn-endings triggering simultaneous guard + TTS calls, KV-cache expansion across longer utterances.
- Can we squeeze further without quality regression? Obvious candidates: quantised Voxtral (none published at 2026-04-24), lower per-stage utilisation (fragile under overhead floor), smaller guardrail model.
- Or is the right split a two-box architecture — e.g. Guard on a smaller (T4-class) second GPU, Voxtral alone on the L4? What's the cross-instance latency bill for guardrail RPCs?
- Or is the right answer a different single GPU: an L40S (48 GiB), or an A10G (also 24 GiB, so it adds no VRAM headroom over the L4)? What is the pricing and availability delta?
- How does a disposable offline-batch Voxtral instance (separate from the live agent, used for non-realtime synthesis) change the live calculus? This is a preference captured in the 2026-04-24 session.
RT-1's acceptance produces a decision, not a PR. A subsequent R-* child implements whatever RT-1 concludes.
RT-2: What's the canonical lifecycle pattern for agent-side HF models?
See the previous section for the three candidate patterns. Decision owed. Acceptance is again a decision, not code.
Services that stay on vllm-openai vs ones that move to vllm-omni
- `vllm-guard` (Llama Guard 3 1B) stays on `vllm/vllm-openai:latest`. Llama Guard is a plain `LlamaForCausalLM` and loads fine under the generic image.
- `vllm-voxtral` (Voxtral-4B-TTS-2603) must run on `vllm/vllm-omni:v0.18.0`. The `voxtral_tts` architecture is specific to vllm-omni.
This split does mean two distinct vLLM images on disk. Image pull bandwidth is paid once at provisioning.
What is not on this page
- Instance ID, public or private IP address, security-group ID or name, whitelisted source IP. Those are captured in the private tracker (the MEGA close-out issue's attached session memory) per SSOT discipline — not suitable for the public docs site.
- Secret values, secret file paths, or the literal contents of `voice/.env`.
- Specific Dockerfile diffs. Those go in each R-* PR and land in the changelog with a pointer back here.
References
- Handover 2026-04-24 — Voice stack bring-up — session-level narrative.
- SSOT discipline — why this page exists and what it can say.
- Voice repair epic #660 — the tracker this architecture serves.
- MEGA close-out issue (sub-issue of #660) — per-R-* comment payloads, raw session transcript, session memory attachments.
- RT-1 (GPU VRAM architecture viability) and RT-2 (turn-detector / HF cache lifecycle) — sub-issues of the MEGA close-out.
- External: antirez/voxtral.c — pure-C Voxtral-Mini-4B-Realtime implementation; referenced for raw-weight memory accounting on the Ministral-3 backbone.
- External: Voxtral-4B-TTS-2603 model card; authoritative for the `vllm_omni` + `v0.18.0` requirement.