Voice stack architecture — single-L4 GPU layout

Three-service arrangement on one NVIDIA L4 (24 GiB), VRAM accounting under vLLM and vLLM-Omni, the runtime-download lifecycle gap for agent-side HF models, and the open architectural questions that seeded RT-1 and RT-2.

Status: current as of 2026-04-24 bring-up. See Handover 2026-04-24 for what was achieved in the session that produced this architecture writeup. This page is Tier-2 SSOT per SSOT discipline.

Service layout

Five services run on a single g6.xlarge-class EC2 instance (16 GiB host RAM, ~290 GB EBS root) with one NVIDIA L4 GPU (24 GiB VRAM): three GPU-adjacent services (vllm-guard, vllm-voxtral, voice-agent) and two network services (livekit-server, caddy). Only the two vLLM services hold VRAM; voice-agent runs its own models on CPU. Instance-specific identifiers (ID, IP, SG, whitelisted source) are redacted from this public page and live in the private tracker + session memory attached to the MEGA close-out issue under epic #660.

  LiveKit Server  ──────────────┐
      (WebRTC SFU, :7880)       │
                                ▼
  Caddy (:443, TLS)  ──►  voice-agent  ◄──┬── vllm-guard   (:8000, Llama Guard 3 1B)
                          (LiveKit agent, │
                           faster-whisper │
                           on CPU,        └── vllm-voxtral (:8001, Voxtral-4B-TTS-2603)
                           LiveKit plugins)
                                │
                                └── Bedrock Mistral Large (external, sole outbound API)

livekit-server: WebRTC SFU. Host networking for UDP.
caddy: TLS reverse proxy. Host networking.
vllm-guard: Llama Guard 3 1B output guardrail. Docker bridge network, exposes :8000.
vllm-voxtral: Voxtral-4B-TTS-2603 TTS engine. Docker bridge, exposes :8001.
voice-agent: Python LiveKit agent process. Host networking. Runs faster-whisper for STT on CPU, calls Bedrock (Mistral Large) for the LLM turn, calls vllm-guard:8000 for output guardrailing, and calls vllm-voxtral:8001 for TTS.
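
The call order described for voice-agent can be sketched as one function over four injected callables. This is only an illustration of the data flow, not the agent's actual code, which is built on LiveKit plugins. The names TurnPipeline, transcribe, generate, is_safe, synthesize, and the fallback reply are all hypothetical, and what the real agent does on a guard rejection is not described on this page.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical fallback text; the real behaviour on a guard rejection
# is an assumption made for this sketch.
FALLBACK_REPLY = "Sorry, I can't help with that."


@dataclass
class TurnPipeline:
    transcribe: Callable[[bytes], str]   # STT: faster-whisper on CPU
    generate: Callable[[str], str]       # LLM turn: Bedrock (Mistral Large)
    is_safe: Callable[[str], bool]       # output guardrail: vllm-guard on :8000
    synthesize: Callable[[str], bytes]   # TTS: vllm-voxtral on :8001

    def run_turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn: STT -> LLM -> guard -> TTS."""
        text = self.transcribe(audio_in)
        reply = self.generate(text)
        if not self.is_safe(reply):
            reply = FALLBACK_REPLY
        return self.synthesize(reply)
```

Injecting the four stages as callables keeps the sketch independent of any particular client library and makes the ordering easy to exercise with stubs.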

Model weights live on the host's /models/ directory and are mounted read-only into each container. No weights are baked into images.

/models/faster-whisper-large-v3 — 2.9 GB (STT)
/models/llama-guard-3-1b — 2.8 GB (Guard)
/models/voxtral-4b-tts — 7.5 GB (TTS, mistralai/Voxtral-4B-TTS-2603, Mistral-native format)
/models/voxtral-tts-config/voxtral_tts.yaml — custom stage-config overlay (see "VRAM tuning" below)
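
Because every container depends on these host paths being present before it starts, a small preflight check can be useful. The following is a minimal sketch, assuming the layout listed above; the function name missing_model_paths is hypothetical and nothing like it is confirmed to exist in the repository.

```python
from pathlib import Path

# Entries relative to the host model root, as listed above.
EXPECTED = (
    "faster-whisper-large-v3",
    "llama-guard-3-1b",
    "voxtral-4b-tts",
    "voxtral-tts-config/voxtral_tts.yaml",
)


def missing_model_paths(root: str = "/models") -> list[str]:
    """Return the expected entries that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]
```

An empty list means all four entries are in place; anything else names what still has to be seeded on the host.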

VRAM accounting on the L4

Observed post-bring-up (2026-04-24):

Workload	VRAM
`vllm-voxtral` (TTS)	~16.7 GiB
`vllm-guard` (Guardrail)	~5.7 GiB
`voice-agent` whisper (CPU)	~0 GiB on GPU
`voice-agent` turn-detector (CPU ONNX runtime)	~0 GiB on GPU
Total	~22.4 / 23.0 GiB = 97%

This is stable at steady state but leaves effectively no headroom for warmup spikes, KV-cache growth under load, or any fourth workload. The 97% figure is the principal architectural concern behind RT-1 (filed as a sub-issue of the bring-up MEGA close-out).
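
The arithmetic behind the 97% figure can be reproduced with a short sketch. The per-workload numbers and the 23.0 GiB denominator are the ones from the table above; the function and variable names are illustrative only.

```python
# Observed VRAM per workload in GiB (2026-04-24), from the table above.
OBSERVED_GIB = {
    "vllm-voxtral": 16.7,
    "vllm-guard": 5.7,
    "voice-agent whisper": 0.0,
    "voice-agent turn-detector": 0.0,
}
USABLE_GIB = 23.0  # denominator used in the table


def vram_summary(observed: dict[str, float], usable: float) -> tuple[float, float, float]:
    """Return (total GiB, headroom GiB, utilisation as a fraction)."""
    total = sum(observed.values())
    return total, usable - total, total / usable


total, headroom, util = vram_summary(OBSERVED_GIB, USABLE_GIB)
print(f"total={total:.1f} GiB headroom={headroom:.1f} GiB util={util:.0%}")
```

On these inputs the headroom works out to roughly 0.6 GiB, which is the margin RT-1 has to reason about.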

Why Voxtral is ~16.7 GiB

The raw Voxtral-4B-TTS weights are estimated at 11–12 GiB. The estimate is derived from the sibling Voxtral-Mini-4B-Realtime (Apache-2.0): the pure-C implementation published by antirez documents 10.4 GiB raw for the Ministral-3 backbone, and Voxtral-4B-TTS adds the acoustic_transformer and FlowMatching stages on top of that. The remaining ~5 GiB is vLLM-Omni's engine overhead:

Pre-allocated KV cache, sized to the configured gpu_memory_utilization target.
CUDA-graph capture for ~51 sizes ([1, 2, 4, …, 512]).
Inductor compile cache + scratch buffers.
Sampler-warmup scratch.

The HuggingFace model card's "≥ 16 GiB GPU" requirement for Voxtral-4B-TTS is a post-overhead figure, not a raw-weight figure. That is why there is no realistic path to running this model on a 16 GiB card like a T4: the estimated weights alone would fit, but weights plus engine overhead do not.

Why Guard is ~5.7 GiB

The Llama Guard 3 1B weights are 2.8 GB. The remaining ~3 GiB is vLLM's standard engine overhead — KV-cache pre-allocation (sized by --gpu-memory-utilization), CUDA-graph capture, and sampler-warmup dummy requests.

With vllm-guard's defaults on a 24 GiB card, sampler warmup tried to reserve space for 256 concurrent requests (the default --max-num-seqs) and OOM'd after the KV cache had already been allocated. Dropping to --max-num-seqs 4 and adding --enforce-eager (no CUDA-graph capture for the guardrail model) was sufficient. A 1B guardrail serving the output of one voice agent has no need for 256-way concurrency.
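
As a rough sketch, the guard tuning above corresponds to an argument list along these lines. --max-num-seqs and --enforce-eager are the flags named in the text, and --model and --port are standard vLLM server options. How the real compose file passes them is an assumption, and the helper name guard_server_args is hypothetical.

```python
def guard_server_args(
    model_path: str = "/models/llama-guard-3-1b",
    port: int = 8000,
    max_num_seqs: int = 4,
) -> list[str]:
    """Build the argument list for the vLLM OpenAI-compatible server."""
    return [
        "--model", model_path,
        "--port", str(port),
        # The default of 256 OOM'd during sampler warmup on this card.
        "--max-num-seqs", str(max_num_seqs),
        # Skip CUDA-graph capture for the guardrail model.
        "--enforce-eager",
    ]


print(" ".join(guard_server_args()))
```

Keeping the list in one place makes it easier to review the two memory-relevant flags together when the guard service definition changes.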

VRAM tuning mechanisms that work (and don't) on vLLM-Omni v0.18.0

The voxtral service runs vllm/vllm-omni:v0.18.0, not the generic vllm/vllm-openai. This was established during bring-up: Voxtral-4B-TTS-2603 declares "model_type": "voxtral_tts" with acoustic_transformer_args in its params.json, and vllm-openai's Mistral loader resolves that to a plain MistralForCausalLM, which fails with ValueError: no module or parameter named 'acoustic_transformer'. The HuggingFace model card for mistralai/Voxtral-4B-TTS-2603 confirms the pairing: "Measured using vllm_omni/examples/offline_inference/voxtral_tts/end2end.py … vllm version: v0.18.0".

Practical consequences for operators tuning memory:

The top-level --gpu-memory-utilization CLI flag is silently ignored by vllm-omni's multi-stage pipeline. The per-stage gpu_memory_utilization in the loaded deploy config takes precedence.
The --stage-overrides JSON flag is main-branch only. The released v0.18.0 image doesn't accept it (vllm: error: unrecognized arguments: --stage-overrides {...}).
The --deploy-config <yaml-path> flag is also main-branch only; the released v0.18.0 image does not accept it either.
The working mechanism on v0.18.0 is to bind-mount a modified copy of vllm_omni/model_executor/stage_configs/voxtral_tts.yaml over the container's default path. Extract once via docker run --rm --entrypoint cat vllm/vllm-omni:v0.18.0 <that-path>, edit the gpu_memory_utilization per stage, save to a host path under voice/config/, bind-mount with :ro.
Default per-stage budgets in voxtral_tts.yaml are not symmetric: stage 0 (audio generation) defaults to 0.8, stage 1 (audio tokenizer) defaults to 0.1. Total default target = 0.9 × 24 = 21.6 GiB; observed usage with the defaults is ~19.4 GiB.
Current live tuning: stage 0 dropped to 0.68, stage 1 unchanged at 0.1. Total target 0.78 × 24 = 18.7 GiB; observed ~16.7 GiB actual.
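
The per-stage budget arithmetic above can be checked with a few lines. This is a sketch of the calculation only; it does not read or write voxtral_tts.yaml, and the function name stage_target_gib is illustrative.

```python
CARD_GIB = 24.0  # nominal L4 VRAM used in the targets above


def stage_target_gib(stage_fractions: dict[int, float], card_gib: float = CARD_GIB) -> float:
    """Sum per-stage gpu_memory_utilization fractions into a GiB target."""
    return sum(stage_fractions.values()) * card_gib


default_target = stage_target_gib({0: 0.8, 1: 0.1})   # about 21.6 GiB
tuned_target = stage_target_gib({0: 0.68, 1: 0.1})    # about 18.7 GiB

print(f"default={default_target:.1f} GiB tuned={tuned_target:.1f} GiB")
```

In both configurations the observed usage (~19.4 GiB and ~16.7 GiB) sits roughly 2 GiB below the computed target, so the target is best read as an upper bound rather than a prediction.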

Watch the vllm-omni release cadence — once the main-branch --deploy-config / --stage-overrides flags ship to a tagged Docker Hub image, the bind-mount overlay can be dropped. As of 2026-04-24, v0.18.0 (2026-03-29) is still the latest published tag.

Runtime-download lifecycle gap

Three kinds of agent-side model would, by default, be downloaded from HuggingFace at runtime:

faster-whisper-large-v3 — via WhisperModel(...); in our stack pre-seeded on the host under /models/ and mounted read-only. Not a problem.
livekit/turn-detector (model_q8.onnx, ~281 MB) — via livekit-plugins-turn-detector at agent startup. Downloads into /root/.cache/huggingface/hub/ inside the running container. On every docker compose up -d --force-recreate voice-agent the download is destroyed. First boot after a recreate emits RuntimeError: livekit-plugins-turn-detector initialization failed. Could not find file "model_q8.onnx" — non-fatal (the worker registers with LiveKit anyway), but conversation turn detection degrades until the lazy fetch succeeds.
Any future LiveKit plugin that follows the same pattern.

This is a pattern problem, not a turn-detector-only problem. The canonical options are:

Image-bake at build time — RUN python main.py download-files in the voice-agent Dockerfile. Pro: air-gap safe, container start is instant. Con: image size grows, every model update forces rebuild.
Host bind-mount — mount /models/hf_cache:/root/.cache/huggingface from the host. Pro: image stays small, cache survives container recreates, cold host still downloads once. Con: depends on filesystem layout agreement between image and host.
Per-plugin lifecycle hook — some plugins expose their own pre-download entry points. Pro: targeted. Con: inconsistent across plugins; not all support it.
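
Whichever option is chosen, the failure mode described above (model_q8.onnx missing after a recreate) can be detected before the worker starts. The following is a minimal sketch, assuming the cache root named above; it searches by filename rather than relying on a particular directory layout inside the cache, and both function names are hypothetical.

```python
from pathlib import Path


def find_cached_file(cache_root: str, filename: str = "model_q8.onnx") -> list[Path]:
    """Return every copy of filename found under the given cache root."""
    root = Path(cache_root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


def turn_detector_is_cached(cache_root: str = "/root/.cache/huggingface") -> bool:
    """True if at least one copy of the turn-detector model is present."""
    return bool(find_cached_file(cache_root))
```

With the host bind-mount option, the same check can be run against the host-side directory before recreating the container, so a cold cache is noticed ahead of the first degraded boot.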

This is RT-2, filed as a sub-issue of the MEGA bring-up close-out. Decision owed before the next voice-agent Dockerfile PR lands.

Open architectural questions (the ones that need research, not implementation)

RT-1: Is the L4 sized right for this three-workload mix?

The live stack is at 97% VRAM and nominally works. Open questions:

What is the ceiling under realistic load? Concurrent WS sessions, turn-endings triggering simultaneous guard + TTS calls, KV-cache expansion across longer utterances.
Can we squeeze further without quality regression? Obvious candidates: quantised Voxtral (none published at 2026-04-24), lower per-stage utilisation (fragile, because the engine overhead sets a floor), smaller guardrail model.
Or is the right split a two-box architecture — e.g. Guard on a smaller (T4-class) second GPU, Voxtral alone on the L4? What's the cross-instance latency bill for guardrail RPCs?
Or is the right answer a bigger single GPU such as the L40S (48 GiB)? The A10G has the same 24 GiB as the L4, so it would not add VRAM headroom. Pricing and availability delta?
How does a disposable offline-batch Voxtral instance (separate from the live agent, used for non-realtime synthesis) change the live calculus? This is a preference captured in the 2026-04-24 session.

RT-1's acceptance produces a decision, not a PR. A subsequent R-* child implements whatever RT-1 concludes.

RT-2: What's the canonical lifecycle pattern for agent-side HF models?

See the previous section for the three candidate patterns. Decision owed. Acceptance is again a decision, not code.

Services that stay on `vllm-openai` vs ones that move to `vllm-omni`

vllm-guard (Llama Guard 3 1B) stays on vllm/vllm-openai:latest. Llama Guard is a plain LlamaForCausalLM and loads fine under the generic image.
vllm-voxtral (Voxtral-4B-TTS-2603) must run on vllm/vllm-omni:v0.18.0. The voxtral_tts architecture is specific to vllm-omni.

This split does mean two distinct vLLM images on disk. Image pull bandwidth is paid once at provisioning.

What is not on this page

Instance ID, public or private IP address, security-group ID or name, whitelisted source IP. Those are captured in the private tracker (the MEGA close-out issue's attached session memory) per SSOT discipline — not suitable for the public docs site.
Secret values, secret file paths, or the literal contents of voice/.env.
Specific Dockerfile diffs. Those go in each R-* PR and land in the changelog with a pointer back here.

References

Handover 2026-04-24 — Voice stack bring-up — session-level narrative.
SSOT discipline — why this page exists and what it can say.
Voice repair epic #660 — the tracker this architecture serves.
MEGA close-out issue (sub-issue of #660) — per-R-* comment payloads, raw session transcript, session memory attachments.
RT-1 (GPU VRAM architecture viability) and RT-2 (turn-detector / HF cache lifecycle) — sub-issues of the MEGA close-out.
External: antirez/voxtral.c — pure-C Voxtral-Mini-4B-Realtime implementation; referenced for raw-weight memory accounting on the Ministral-3 backbone.
External: Voxtral-4B-TTS-2603 model card — authoritative for the vllm_omni + v0.18.0 requirement.
