
Container System (Voice Pipeline)

EC2 g6.xlarge voice stack — 5 services sharing one NVIDIA L4, host/bridge networking, GPU/VRAM allocation, healthchecks, startup order, ops runbook.

Status: Current as of 2026-04-18 (repo commit 06d84cb). Audience: Ops engineer, BfArM reviewer (infrastructure + data-residency surface), anyone troubleshooting voice.

Scope: the single-box voice stack on AWS EC2. Web (Vercel) and mobile (EAS) are not containerized and out of scope here.

Authoritative source: voice/docker-compose.yml, voice/DEPLOY-AWS.md, voice/CREDENTIALS-CHECKLIST.md.


1. Target host

| Field | Value |
| --- | --- |
| Instance type | g6.xlarge |
| GPU | 1× NVIDIA L4 (24 GB VRAM) |
| vCPU / RAM | 4 / 16 GB |
| Region | eu-central-1 (Frankfurt) |
| Residency | All audio + inference stays in the EU. Bedrock calls go to eu-central-1. |
| OS | Ubuntu + NVIDIA drivers + Docker + NVIDIA Container Toolkit |

EU residency on EC2 is a DiGA / DSGVO Art. 9 requirement — never move this to a US region for cost reasons.


2. Container topology

Five services, all on one host, defined in voice/docker-compose.yml:

┌───────────────────────────────────────────────────────────────┐
│ EC2 g6.xlarge (eu-central-1)                                  │
│                                                               │
│  ┌─────────────┐  host network                                │
│  │ Caddy       │  :443 (TLS termination)                      │
│  └─────┬───────┘                                              │
│        │ depends on                                           │
│  ┌─────▼──────────────┐  host network                         │
│  │ livekit-server     │  :7880 (WS) / :7881 / UDP             │
│  └─────┬──────────────┘                                       │
│        │                                                      │
│  ┌─────▼──────────────┐  host network                         │
│  │ voice-agent        │  (Python LiveKit agent)               │
│  │  + faster-whisper  │  large-v3 INT8, German medical kwds   │
│  │  + Bedrock client  │  → eu-central-1 Mistral Large         │
│  │  + Voxtral client  │  → vllm-voxtral local                 │
│  │  + Guard client    │  → vllm-guard local                   │
│  └─────┬──────┬───────┘                                       │
│        │      │                                               │
│  ┌─────▼──┐ ┌─▼──────────┐  bridge network (ports exposed)    │
│  │ vllm-  │ │ vllm-      │                                    │
│  │ guard  │ │ voxtral    │                                    │
│  │ :8000  │ │ :8001      │                                    │
│  │ Llama  │ │ Voxtral    │                                    │
│  │ Guard  │ │ TTS 4B     │                                    │
│  │ 3 1B   │ │ bfloat16   │                                    │
│  └────────┘ └────────────┘                                    │
│                                                               │
│  Shared: NVIDIA L4 (24 GB VRAM)                               │
└───────────────────────────────────────────────────────────────┘

Services:

| Name | Image | Network | Purpose |
| --- | --- | --- | --- |
| livekit-server | livekit/livekit-server:latest | host | WebRTC SFU for client ↔ agent |
| caddy | caddy:2-alpine | host | TLS reverse proxy, Let's Encrypt certs |
| vllm-guard | vllm/vllm-openai:latest | bridge (:8000) | Llama Guard 3 1B output guardrail |
| vllm-voxtral | vllm/vllm-openai:latest | bridge (:8001) | Voxtral TTS 4B (Julia cloned voice) |
| voice-agent | ./agent (local build) | host | faster-whisper STT + Bedrock orchestration + TTS client |

3. GPU / VRAM allocation

The GPU consumers share the single NVIDIA L4 (24 GB): the two vLLM services plus faster-whisper inside voice-agent (caddy and livekit-server are CPU-only). vLLM's --gpu-memory-utilization flag partitions its share deterministically:

| Service | Allocation | Notes |
| --- | --- | --- |
| vllm-voxtral | 0.45 × 24 GB ≈ 10.8 GB | Voxtral TTS 4B (bfloat16) |
| vllm-guard | 0.15 × 24 GB ≈ 3.6 GB | Llama Guard 3 1B (float16), --max-model-len 2048 |
| voice-agent (faster-whisper) | ~4–5 GB | large-v3 INT8, loaded at agent startup |
| Slack | ~5–6 GB | burst headroom, faster-whisper transient batches |

Failure mode: CUDA OOM. If the process crashes, restart: unless-stopped brings the container back; note that a failing healthcheck alone only marks the container unhealthy, Docker does not restart it for that. If you add a new model, recompute this table first.
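
The split above can be sanity-checked with a few lines before touching the compose file. The fractions mirror the --gpu-memory-utilization values and the faster-whisper figure is the worst-case estimate from the table; the 4 GB floor is an assumption, not a documented requirement.

```python
# Sanity-check the L4 VRAM budget before adding or resizing a model.
TOTAL_VRAM_GB = 24.0  # NVIDIA L4

allocations = {
    "vllm-voxtral": 0.45 * TOTAL_VRAM_GB,   # Voxtral TTS 4B (bfloat16)
    "vllm-guard": 0.15 * TOTAL_VRAM_GB,     # Llama Guard 3 1B (float16)
    "faster-whisper": 5.0,                  # large-v3 INT8, upper estimate
}

used = sum(allocations.values())
slack = TOTAL_VRAM_GB - used
print(f"used={used:.1f} GB, slack={slack:.1f} GB")
# 4 GB minimum headroom is an assumed safety margin -- adjust to taste.
assert slack >= 4.0, "less than 4 GB headroom, recompute before deploying"
```

Run it with the new model's footprint added to `allocations`; if the assertion fires, shrink a fraction before deploying.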


4. Networking

Host networking

livekit-server, caddy, and voice-agent use network_mode: host. Reasons:

  • LiveKit SFU needs UDP (RTP) which is painful through Docker port mapping
  • Keeps latency low (no userspace NAT)
  • Caddy binding :443 on the host means no Docker port bridge indirection

Bridge-network ports

vllm-guard and vllm-voxtral are called from voice-agent via http://localhost:8000 / :8001. They expose those ports on the host (ports: - "8000:8000"), so they are reachable from host-network containers.
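
For orientation, the bridge-side wiring has roughly this shape in compose. This is a sketch, not a verbatim copy of voice/docker-compose.yml, and the model path is a hypothetical example; check the real file before editing.

```yaml
# Illustrative sketch of a bridge-network vLLM service (not verbatim).
services:
  vllm-guard:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"   # published on the host so host-network containers reach it
    command: >
      --model /models/llama-guard-3-1b   # hypothetical weights directory
      --gpu-memory-utilization 0.15
      --max-model-len 2048
```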

External connections from EC2

| Target | Protocol | Purpose |
| --- | --- | --- |
| AWS Bedrock (bedrock-runtime.eu-central-1.amazonaws.com) | HTTPS | Mistral Large inference |
| LiveKit cloud (if hybrid) | WSS | fallback / test rigs |
| Deepgram EU (api.eu.deepgram.com) | HTTPS | legacy STT fallback — verify still used before rotating key |
| Mistral API (api.mistral.ai) | HTTPS | bridge while Voxtral license pending |

All of these outbound connections must be permitted by the EC2 security group / NACL.

Ingress

Only :443 (Caddy) is exposed to the public internet. :7880, :7881, :8000, and :8001 are intended to be reachable only from the host itself; note that ports published via "8000:8000" bind all interfaces, so the EC2 security group is what actually keeps them private. Double-check the security group before assuming this.


5. Volumes

| Volume | Mount | Purpose |
| --- | --- | --- |
| Host /models/ | read-only into vllm-guard, vllm-voxtral, voice-agent | Pre-downloaded model weights |
| caddy_data (named) | /data in caddy | Let's Encrypt certs, HTTP-01 state |
| caddy_config (named) | /config in caddy | Caddy persistent config |
| Host ./config/livekit.yaml | /etc/livekit.yaml (ro) in livekit-server | LiveKit config |
| Host ./config/Caddyfile | /etc/caddy/Caddyfile (ro) in caddy | Caddy routing |
| Host .env | env-file into voice-agent | Runtime secrets (see configuration-system §3) |

Disk monitoring: CloudWatch alarms on / and /models/ — alert >80% per voice/DEPLOY-AWS.md.


6. Health + startup order

Docker healthchecks (from docker-compose.yml)

| Service | Test | Interval / timeout | Start period |
| --- | --- | --- | --- |
| livekit-server | wget --spider http://localhost:7880 | 15 s / 5 s | 10 s |
| caddy | wget --spider https://localhost:443/healthz | 30 s / 5 s | 30 s |
| vllm-guard | curl -f http://localhost:8000/health | 30 s / 10 s | 120 s (model load) |
| vllm-voxtral | curl -f http://localhost:8001/health | 30 s / 10 s | 180 s (larger model) |
| voice-agent | python -c "urllib.request.urlopen('http://localhost:8080/health')" | 30 s / 10 s | 60 s |
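
In compose terms, one row of the table corresponds to a block like the following. This is a sketch mirroring the table values; the retries count is an assumption not stated above, and the canonical definitions live in voice/docker-compose.yml.

```yaml
# Illustrative healthcheck for vllm-voxtral (values from the table above).
services:
  vllm-voxtral:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
      interval: 30s
      timeout: 10s
      start_period: 180s   # model load window before failures count
      retries: 3           # assumption -- not documented in the table
```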

Dependency graph

livekit-server (healthy) ──┐
                           ├─► voice-agent
vllm-guard    (healthy) ───┤
vllm-voxtral  (healthy) ───┘

livekit-server (healthy) ──► caddy
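
The graph maps onto compose depends_on with health conditions, roughly as follows (a sketch under the assumption that the compose file uses condition: service_healthy; verify against voice/docker-compose.yml):

```yaml
# Sketch of the startup ordering shown in the dependency graph.
services:
  voice-agent:
    depends_on:
      livekit-server:
        condition: service_healthy
      vllm-guard:
        condition: service_healthy
      vllm-voxtral:
        condition: service_healthy
  caddy:
    depends_on:
      livekit-server:
        condition: service_healthy
```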

Cold start budget: ~3 minutes end-to-end, dominated by Voxtral model load.

No /api/health aggregator on the web side that covers voice yet — see health-check.


7. Build

voice-agent is built locally from ./agent:

build:
  context: ./agent
  dockerfile: Dockerfile

Other services pull public images:

  • livekit/livekit-server:latest
  • caddy:2-alpine
  • vllm/vllm-openai:latest (×2)

:latest is acceptable here because images are refreshed only by an explicit docker compose pull at deploy time, never on restart. If you need reproducibility for an audit snapshot, override with SHA-pinned tags (or digests) at deploy time.
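
One way to pin for an audit snapshot is a compose override file with image digests. The file name and the digest placeholders below are hypothetical; substitute real digests from docker images --digests on the deploy host.

```yaml
# docker-compose.pin.yml -- hypothetical override for SHA-pinned deploys.
services:
  livekit-server:
    image: livekit/livekit-server@sha256:<digest>
  caddy:
    image: caddy@sha256:<digest>
  vllm-guard:
    image: vllm/vllm-openai@sha256:<digest>
  vllm-voxtral:
    image: vllm/vllm-openai@sha256:<digest>
```

Apply with docker compose -f voice/docker-compose.yml -f docker-compose.pin.yml up -d.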


8. Ops runbook (quick reference)

Full runbook: voice/DEPLOY-AWS.md. Quick tasks:

Restart a service

docker compose -f voice/docker-compose.yml restart voice-agent
docker compose logs -f voice-agent

GPU status

nvidia-smi          # realtime
# Or, inside vllm container:
docker compose exec vllm-voxtral nvidia-smi

Inspect memory split

# vLLM prints utilization at startup
docker compose logs vllm-guard   | grep -i gpu
docker compose logs vllm-voxtral | grep -i gpu

CUDA OOM recovery

  1. Identify which container logged OOM (docker compose logs --tail 200)
  2. Restart just that container — its restart: unless-stopped policy usually handles it
  3. If persistent, one model has drifted in memory footprint — recompute §3 table, adjust --gpu-memory-utilization
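
Step 1 can be scripted. A minimal sketch that scans captured log output for the usual CUDA OOM signatures; the "service | message" line shape matches typical docker compose logs output, but that format (and the exact error strings your containers emit) is an assumption to verify locally.

```python
import re

# Common substrings emitted on CUDA OOM by PyTorch/vLLM workloads (assumed).
OOM_PATTERN = re.compile(
    r"CUDA out of memory|torch\.cuda\.OutOfMemoryError", re.IGNORECASE
)

def find_oom_services(log_lines):
    """Return sorted service names whose log lines match the OOM signature.

    Expects lines shaped like `docker compose logs` output:
    'service-name  | message...' (format assumed; verify locally).
    """
    hits = set()
    for line in log_lines:
        if OOM_PATTERN.search(line):
            hits.add(line.split("|", 1)[0].strip())
    return sorted(hits)

sample = [
    "vllm-voxtral  | INFO: engine warm-up complete",
    "vllm-guard    | torch.cuda.OutOfMemoryError: CUDA out of memory",
]
print(find_oom_services(sample))  # -> ['vllm-guard']
```

Feed it docker compose logs --tail 200 output and restart only the services it names.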

Credential rotation

See voice/CREDENTIALS-CHECKLIST.md + configuration-system §8.

Model refresh

  1. Download new weights to /models/<name>/ on the host
  2. Update command: --model /models/<name> in docker-compose.yml
  3. docker compose up -d <service> — it picks up the change
  4. Monitor docker compose logs for successful warm-up

9. CloudWatch alarms (baseline from voice/DEPLOY-AWS.md)

| Alarm | Threshold | Action |
| --- | --- | --- |
| CPU | >90% for 5 min | Page |
| Memory | >85% for 5 min | Page |
| Disk | >80% | Warn |
| GPU utilization (via cron) | >95% sustained | Warn |
| Log retention | 30 d | n/a |

10. What's NOT in containers (deliberate)

  • Supabase / Postgres — managed (out of scope)
  • Vercel edge — Next.js is not containerized; Vercel manages the runtime
  • Stripe, Resend, Upstash — SaaS dependencies
  • Mobile app — native builds via EAS

Only the voice stack is containerized because it's the only part with GPU + long-lived processes.


11. Gaps + hardening candidates

  1. No Dockerfile HEALTHCHECK in voice-agent image itself — healthcheck lives only in compose. Add for defense in depth.
  2. No /api/health aggregator on the web side that surfaces voice status — see health-check.
  3. restart: unless-stopped means a crash-looping container can churn silently. Add alerting on restart count.
  4. No systemd watchdog for docker-compose itself — if the Docker daemon dies, nothing restarts. Document the fallback.
  5. Log rotation is set at 50 MB × 5 files per service — fine for development, consider shipping to CloudWatch Logs for compliance retention (§9 says 30 d retention; confirm).
  6. Model weights mounted read-only — good. But host /models/ should have restrictive permissions to prevent tampering; document ownership.

| Reference | Relevance |
| --- | --- |
| #579 | Parent docs epic |
| #586 | Pillar B parent |
| #337 | Voice AWS migration (source of current topology) |
| architecture §3.2 | Higher-level voice pipeline view |
| configuration-system §3 | Voice env vars |
| health-check | Healthcheck design (separate doc) |
| voice/docker-compose.yml | Source of truth |
| voice/DEPLOY-AWS.md | Deployment runbook |
| voice/CREDENTIALS-CHECKLIST.md | Secrets rotation |
| docs/HETZNER-HANDOVER.md + docs/HETZNER-OPS-RUNBOOK.md | Historical — kept for reference, no longer authoritative after AWS migration |

Changelog

  • 2026-04-18 — Initial version. Reverses bubbly-bubbling-quilt.md §4 claim that docker-compose.yml was removed during Hetzner decommission: it exists and drives the AWS deployment (see voice/docker-compose.yml). Five services, not three as quilt listed — adds caddy and vllm-voxtral (Voxtral TTS runs locally on EC2, not externally).
