
Voice stack architecture — single-L4 GPU layout

Three-service arrangement on one NVIDIA L4 (24 GiB), VRAM accounting under vLLM and vLLM-Omni, the runtime-download lifecycle gap for agent-side HF models, and the open architectural questions that seeded RT-1 and RT-2.

Status: current as of 2026-04-24 bring-up. See Handover 2026-04-24 for what was achieved in the session that produced this architecture writeup. This page is Tier-2 SSOT per SSOT discipline.


Service layout

Three GPU-adjacent services plus two network services run on a single g6.xlarge-class EC2 instance (16 GiB host RAM, ~290 GB EBS root) with one NVIDIA L4 GPU (24 GiB VRAM). Only vllm-guard and vllm-voxtral hold VRAM; the voice-agent's models run on CPU, and the two network services do not touch the GPU. Instance-specific identifiers (ID, IP, SG, whitelisted source) are redacted from this public page and live in the private tracker + session memory attached to the MEGA close-out issue under epic #660.

  LiveKit Server  ──────────────┐
      (WebRTC SFU, :7880)       │

  Caddy (:443, TLS)  ──►  voice-agent  ◄──┬── vllm-guard   (:8000, Llama Guard 3 1B)
                          (LiveKit agent, │
                           faster-whisper │
                           on CPU,        └── vllm-voxtral (:8001, Voxtral-4B-TTS-2603)
                           LiveKit plugins)

                                └── Bedrock Mistral Large (external, sole outbound API)
  • livekit-server: WebRTC SFU. Host networking for UDP.
  • caddy: TLS reverse proxy. Host networking.
  • vllm-guard: Llama Guard 3 1B output guardrail. Docker bridge network, exposes :8000.
  • vllm-voxtral: Voxtral-4B-TTS-2603 TTS engine. Docker bridge, exposes :8001.
  • voice-agent: Python LiveKit agent process. Host networking. Runs faster-whisper for STT on CPU, calls Bedrock (Mistral Large) for the LLM turn, calls vllm-guard:8000 for output guardrailing, and calls vllm-voxtral:8001 for TTS.
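
The call order described for voice-agent can be sketched as a small pipeline object. This is a minimal illustration, not the agent's actual code: the class name, the callable signatures, and the fallback behaviour on an unsafe guard verdict are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnPipeline:
    """Hypothetical per-turn flow: STT -> LLM -> output guard -> TTS."""

    stt: Callable[[bytes], str]      # stands in for faster-whisper on CPU
    llm: Callable[[str], str]        # stands in for the Bedrock Mistral Large call
    guard: Callable[[str], bool]     # stands in for vllm-guard:8000; True means "safe"
    tts: Callable[[str], bytes]      # stands in for vllm-voxtral:8001
    # Assumption: an unsafe reply is swapped for a canned sentence.
    fallback_text: str = "Sorry, I can't help with that."

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        if not self.guard(reply):
            reply = self.fallback_text
        return self.tts(reply)


# Stub wiring so the flow can be exercised without any of the real services.
demo = TurnPipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: "echo: " + text,
    guard=lambda text: "forbidden" not in text,
    tts=lambda text: text.encode("utf-8"),
)
```

Only the guard and TTS steps reach the GPU in this layout, which is why they are the only two rows with non-zero VRAM in the accounting further down.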

Model weights live on the host's /models/ directory and are mounted read-only into each container. No weights are baked into images.

  • /models/faster-whisper-large-v3 — 2.9 GB (STT)
  • /models/llama-guard-3-1b — 2.8 GB (Guard)
  • /models/voxtral-4b-tts — 7.5 GB (TTS, mistralai/Voxtral-4B-TTS-2603, Mistral-native format)
  • /models/voxtral-tts-config/voxtral_tts.yaml — custom stage-config overlay (see "VRAM tuning" below)

VRAM accounting on the L4

Observed post-bring-up (2026-04-24):

  Workload                                        VRAM
  vllm-voxtral (TTS)                              ~16.7 GiB
  vllm-guard (Guardrail)                          ~5.7 GiB
  voice-agent whisper (CPU)                       ~0 GiB on GPU
  voice-agent turn-detector (CPU ONNX runtime)    ~0 GiB on GPU
  Total                                           ~22.4 / 23.0 GiB = 97%

This is stable at steady state but leaves effectively no headroom for warmup spikes, KV-cache growth under load, or any fourth workload. The 97% figure is the principal architectural concern behind RT-1 (filed as a sub-issue of the bring-up MEGA close-out).
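
The headroom arithmetic behind the 97% figure can be reproduced with a few lines. The numbers are the observed values from the table; the helper name and the 23.0 GiB usable denominator (taken as given from the table) are the only assumptions.

```python
# Observed post-bring-up figures from the table above (GiB).
OBSERVED_GIB = {
    "vllm-voxtral (TTS)": 16.7,
    "vllm-guard (Guardrail)": 5.7,
    "voice-agent whisper (CPU)": 0.0,
    "voice-agent turn-detector (CPU ONNX runtime)": 0.0,
}
USABLE_GIB = 23.0  # denominator used in the table


def vram_summary(workloads: dict, usable_gib: float) -> dict:
    """Hypothetical helper: total, headroom and utilisation for one GPU."""
    total = sum(workloads.values())
    return {
        "total_gib": round(total, 1),
        "headroom_gib": round(usable_gib - total, 1),
        "utilisation_pct": round(100 * total / usable_gib),
    }


summary = vram_summary(OBSERVED_GIB, USABLE_GIB)
print(summary)  # {'total_gib': 22.4, 'headroom_gib': 0.6, 'utilisation_pct': 97}
```

The 0.6 GiB of headroom is what has to absorb any warmup spike or additional workload.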

Why Voxtral is ~16.7 GiB

The raw Voxtral-4B-TTS weights themselves are estimated at 11–12 GiB (the sibling Voxtral-Mini-4B-Realtime, Apache-2.0, has a published pure-C implementation by antirez documenting 10.4 GiB raw for the Ministral-3 backbone; Voxtral-4B-TTS adds the acoustic_transformer plus FlowMatching stages). The remaining ~5 GiB is vLLM-Omni's engine overhead:

  • Pre-allocated KV cache, sized to the configured gpu_memory_utilization target.
  • CUDA-graph capture for ~51 sizes ([1, 2, 4, …, 512]).
  • Inductor compile cache + scratch buffers.
  • Sampler-warmup scratch.

The HuggingFace model card's "≥ 16 GiB GPU" requirement for Voxtral-4B-TTS is a post-overhead figure, not raw-weight. That explains why there is no realistic path to running this model on a 16 GiB card like a T4: the floor is the overhead, not the weights.

Why Guard is ~5.7 GiB

The Llama Guard 3 1B weights are 2.8 GB. The remaining ~3 GiB is vLLM's standard engine overhead — KV-cache pre-allocation (sized by --gpu-memory-utilization), CUDA-graph capture, and sampler-warmup dummy requests.

With vllm-guard's defaults on a 24 GiB card, sampler warmup tried to reserve space for 256 concurrent requests (the default --max-num-seqs) and OOM'd after the KV cache had already allocated. Dropping to --max-num-seqs 4 plus --enforce-eager (no CUDA-graph capture for the guardrail model) was sufficient. A 1B guardrail serving the output of one voice agent has no need for 256-way concurrency.
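
As a sketch of the tuning just described, the snippet below assembles the server arguments for the guard container. `--max-num-seqs` and `--enforce-eager` are the flags named above, and `--model` and `--port` are standard vLLM server arguments. The helper name is hypothetical, and passing the arguments as the container command is an assumption about how the compose service is wired.

```python
import shlex


def guard_server_args(
    model_path: str = "/models/llama-guard-3-1b",
    port: int = 8000,
    max_num_seqs: int = 4,
) -> list:
    """Hypothetical helper: argument list for the vllm-guard container."""
    return [
        "--model", model_path,
        "--port", str(port),
        # One voice agent does not need the default 256-way concurrency.
        "--max-num-seqs", str(max_num_seqs),
        # Skip CUDA-graph capture for the small guardrail model.
        "--enforce-eager",
    ]


guard_args = guard_server_args()
print(shlex.join(guard_args))
```

The `--gpu-memory-utilization` value is deliberately left out because this page does not state the one used for the guard.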

VRAM tuning mechanisms that work (and don't) on vLLM-Omni v0.18.0

The voxtral service runs vllm/vllm-omni:v0.18.0 — not the generic vllm/vllm-openai. This was determined during the bring-up because Voxtral-4B-TTS-2603 declares "model_type": "voxtral_tts" with acoustic_transformer_args in its params.json, which vllm-openai's Mistral loader resolves to a plain MistralForCausalLM and fails with ValueError: no module or parameter named 'acoustic_transformer'. The HuggingFace model card for mistralai/Voxtral-4B-TTS-2603 confirms: "Measured using vllm_omni/examples/offline_inference/voxtral_tts/end2end.py … vllm version: v0.18.0".

Practical consequences for operators tuning memory:

  • The top-level --gpu-memory-utilization CLI flag is silently ignored by vllm-omni's multi-stage pipeline. The per-stage gpu_memory_utilization in the loaded deploy config takes precedence.
  • The --stage-overrides JSON flag is main-branch only. The released v0.18.0 image doesn't accept it (vllm: error: unrecognized arguments: --stage-overrides {...}).
  • The --deploy-config <yaml-path> flag is likewise main-branch only; the released v0.18.0 image does not accept it.
  • The working mechanism on v0.18.0 is to bind-mount a modified copy of vllm_omni/model_executor/stage_configs/voxtral_tts.yaml over the container's default path. Extract once via docker run --rm --entrypoint cat vllm/vllm-omni:v0.18.0 <that-path>, edit the gpu_memory_utilization per stage, save to a host path under voice/config/, bind-mount with :ro.
  • Default per-stage budgets in voxtral_tts.yaml are not symmetric: stage 0 (audio generation) defaults to 0.8, stage 1 (audio tokenizer) defaults to 0.1. Total default target = 0.9 × 24 = 21.6 GiB, observed actual on the default is ~19.4 GiB.
  • Current live tuning: stage 0 dropped to 0.68, stage 1 unchanged at 0.1. Total target 0.78 × 24 = 18.7 GiB; observed ~16.7 GiB actual.
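
The per-stage budget arithmetic in the last two bullets can be checked with a short script. The stage values and observed figures come from the bullets above; the function name and the dictionary shape are illustrative and are not the layout of voxtral_tts.yaml.

```python
CARD_GIB = 24.0  # nominal L4 VRAM used in the bullets above


def stage_target_gib(stage_utilisation: dict, card_gib: float = CARD_GIB) -> float:
    """Hypothetical helper: summed per-stage utilisation as a GiB target."""
    return round(sum(stage_utilisation.values()) * card_gib, 2)


default_stages = {"stage 0": 0.8, "stage 1": 0.1}
tuned_stages = {"stage 0": 0.68, "stage 1": 0.1}

default_target = stage_target_gib(default_stages)  # 21.6
tuned_target = stage_target_gib(tuned_stages)      # 18.72

# Observed actual usage as a fraction of the configured target.
default_ratio = round(19.4 / default_target, 3)  # 0.898
tuned_ratio = round(16.7 / tuned_target, 3)      # 0.892

print(default_target, tuned_target, default_ratio, tuned_ratio)
```

In both configurations the observed footprint lands at roughly 89 to 90% of the summed target. Two data points are too few to treat that ratio as a rule, but it is a reasonable first guess when picking the next stage 0 value to try.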

Watch the vllm-omni release cadence — once the main-branch --deploy-config / --stage-overrides flags ship to a tagged Docker Hub image, the bind-mount overlay can be dropped. As of 2026-04-24, v0.18.0 (2026-03-29) is still the latest published tag.

Runtime-download lifecycle gap

Agent-side models that are fetched from HuggingFace at runtime by default fall into three cases:

  1. faster-whisper-large-v3 — via WhisperModel(...); in our stack pre-seeded on the host under /models/ and mounted read-only. Not a problem.
  2. livekit/turn-detector (model_q8.onnx, ~281 MB) — via livekit-plugins-turn-detector at agent startup. Downloads into /root/.cache/huggingface/hub/ inside the running container. On every docker compose up -d --force-recreate voice-agent the download is destroyed. First boot after a recreate emits RuntimeError: livekit-plugins-turn-detector initialization failed. Could not find file "model_q8.onnx" — non-fatal (the worker registers with LiveKit anyway), but conversation turn detection degrades until the lazy fetch succeeds.
  3. Any future livekit-plugin that does the same pattern.

This is a pattern problem, not a turn-detector-only problem. The canonical options are:

  • Image-bake at build time: RUN python main.py download-files in the voice-agent Dockerfile. Pro: air-gap safe, container start is instant. Con: image size grows, every model update forces rebuild.
  • Host bind-mount: mount /models/hf_cache:/root/.cache/huggingface from the host. Pro: image stays small, cache survives container recreates, cold host still downloads once. Con: depends on filesystem layout agreement between image and host.
  • Per-plugin lifecycle hook: some plugins expose their own pre-download entry points. Pro: targeted. Con: inconsistent across plugins; not all support it.
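
Whichever option is chosen, a preflight check at container start can confirm the file is already in the cache before the worker registers. The sketch below uses only the standard library and assumes the plugin stores the file in the standard huggingface_hub cache layout (`models--<org>--<name>/snapshots/<revision>/...`) under the hub directory named above. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_cached_file(hub_dir: str, repo_id: str, filename: str) -> Optional[Path]:
    """Hypothetical helper: locate a file inside a huggingface_hub-style cache.

    Returns the first matching path under any snapshot of the repo, or None
    if the repo or the file is absent from the cache.
    """
    repo_dir = Path(hub_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    for candidate in sorted(snapshots.rglob(filename)):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    hit = find_cached_file(
        "/root/.cache/huggingface/hub", "livekit/turn-detector", "model_q8.onnx"
    )
    print("turn-detector model cached" if hit else "turn-detector model missing")
```

Run at the start of the entrypoint, this makes a cold cache visible as an explicit message rather than as the RuntimeError quoted above.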

This is RT-2, filed as a sub-issue of the MEGA bring-up close-out. Decision owed before the next voice-agent Dockerfile PR lands.

Open architectural questions (the ones that need research, not implementation)

RT-1: Is the L4 sized right for this three-workload mix?

The live stack is at 97% VRAM and nominally works. Open questions:

  1. What is the ceiling under realistic load? Concurrent WS sessions, turn-endings triggering simultaneous guard + TTS calls, KV-cache expansion across longer utterances.
  2. Can we squeeze further without quality regression? Obvious candidates: quantised Voxtral (none published at 2026-04-24), lower per-stage utilisation (fragile under overhead floor), smaller guardrail model.
  3. Or is the right split a two-box architecture — e.g. Guard on a smaller (T4-class) second GPU, Voxtral alone on the L4? What's the cross-instance latency bill for guardrail RPCs?
  4. Or is the right answer a bigger single GPU, such as an L40S (48 GiB)? An A10G has the same 24 GiB as the L4, so it would not add VRAM headroom. Pricing and availability delta?
  5. How does a disposable offline-batch Voxtral instance (separate from the live agent, used for non-realtime synthesis) change the live calculus? This is a preference captured in the 2026-04-24 session.

RT-1's acceptance produces a decision, not a PR. A subsequent R-* child implements whatever RT-1 concludes.

RT-2: What's the canonical lifecycle pattern for agent-side HF models?

See the previous section for the three candidate patterns. Decision owed. Acceptance is again a decision, not code.

Services that stay on vllm-openai vs ones that move to vllm-omni

  • vllm-guard (Llama Guard 3 1B) stays on vllm/vllm-openai:latest. Llama Guard is a plain LlamaForCausalLM and loads fine under the generic image.
  • vllm-voxtral (Voxtral-4B-TTS-2603) must run on vllm/vllm-omni:v0.18.0. The voxtral_tts architecture is specific to vllm-omni.

This split does mean two distinct vLLM images on disk. Image pull bandwidth is paid once at provisioning.

What is not on this page

  • Instance ID, public or private IP address, security-group ID or name, whitelisted source IP. Those are captured in the private tracker (the MEGA close-out issue's attached session memory) per SSOT discipline — not suitable for the public docs site.
  • Secret values, secret file paths, or the literal contents of voice/.env.
  • Specific Dockerfile diffs. Those go in each R-* PR and land in the changelog with a pointer back here.

References

  • Handover 2026-04-24 — Voice stack bring-up — session-level narrative.
  • SSOT discipline — why this page exists and what it can say.
  • Voice repair epic #660 — the tracker this architecture serves.
  • MEGA close-out issue (sub-issue of #660) — per-R-* comment payloads, raw session transcript, session memory attachments.
  • RT-1 (GPU VRAM architecture viability) and RT-2 (turn-detector / HF cache lifecycle) — sub-issues of the MEGA close-out.
  • External: antirez/voxtral.c — pure-C Voxtral-Mini-4B-Realtime implementation; referenced for raw-weight memory accounting on the Ministral-3 backbone.
  • External: Voxtral-4B-TTS-2603 model card — authoritative for the vllm_omni + v0.18.0 requirement.
