Container System (Voice Pipeline)
EC2 g6.xlarge voice stack — 5 services sharing one NVIDIA L4, host/bridge networking, GPU/VRAM allocation, healthchecks, startup order, ops runbook.
Status: Current as of 2026-04-18 (repo commit 06d84cb).
Audience: Ops engineer, BfArM reviewer (infrastructure + data-residency surface), anyone troubleshooting voice.
Scope: the single-box voice stack on AWS EC2. Web (Vercel) and mobile (EAS) are not containerized and out of scope here.
Authoritative source: voice/docker-compose.yml, voice/DEPLOY-AWS.md, voice/CREDENTIALS-CHECKLIST.md.
1. Target host
| Field | Value |
|---|---|
| Instance type | g6.xlarge |
| GPU | 1× NVIDIA L4 (24 GB VRAM) |
| vCPU / RAM | 4 / 16 GB |
| Region | eu-central-1 (Frankfurt) |
| Residency | All audio + inference stays in the EU. Bedrock calls go to eu-central-1. |
| OS | Ubuntu + NVIDIA drivers + Docker + NVIDIA Container Toolkit |
EU residency on EC2 is a DiGA / DSGVO Art. 9 requirement — never move this to a US region for cost reasons.
2. Container topology
Five services, all on one host, defined in voice/docker-compose.yml:
┌───────────────────────────────────────────────────────────────┐
│ EC2 g6.xlarge (eu-central-1) │
│ │
│ ┌─────────────┐ host network │
│ │ Caddy │ :443 (TLS termination) │
│ └─────┬───────┘ │
│ │ depends on │
│ ┌─────▼──────────────┐ host network │
│ │ livekit-server │ :7880 (WS) / :7881 / UDP │
│ └─────┬──────────────┘ │
│ │ │
│ ┌─────▼──────────────┐ host network │
│ │ voice-agent │ (Python LiveKit agent) │
│ │ + faster-whisper │ large-v3 INT8, German medical kwds │
│ │ + Bedrock client │ → eu-central-1 Mistral Large │
│ │ + Voxtral client │ → vllm-voxtral local │
│ │ + Guard client │ → vllm-guard local │
│ └─────┬──────┬───────┘ │
│ │ │ │
│ ┌─────▼──┐ ┌─▼──────────┐ bridge network (ports exposed) │
│ │ vllm- │ │ vllm- │ │
│ │ guard │ │ voxtral │ │
│ │ :8000 │ │ :8001 │ │
│ │ Llama │ │ Voxtral │ │
│ │ Guard │ │ TTS 4B │ │
│ │ 3 1B │ │ bfloat16 │ │
│ └────────┘ └────────────┘ │
│ │
│ Shared: NVIDIA L4 (24 GB VRAM) │
└───────────────────────────────────────────────────────────────┘

Services:
| Name | Image | Network | Purpose |
|---|---|---|---|
| livekit-server | livekit/livekit-server:latest | host | WebRTC SFU for client ↔ agent |
| caddy | caddy:2-alpine | host | TLS reverse proxy, Let's Encrypt certs |
| vllm-guard | vllm/vllm-openai:latest | bridge (:8000) | Llama Guard 3 1B output guardrail |
| vllm-voxtral | vllm/vllm-openai:latest | bridge (:8001) | Voxtral TTS 4B (Julia cloned voice) |
| voice-agent | ./agent (local build) | host | faster-whisper STT + Bedrock orchestration + TTS client |
3. GPU / VRAM allocation
Three services load models onto the single NVIDIA L4 (24 GB): vllm-voxtral, vllm-guard, and voice-agent (faster-whisper). vLLM's `--gpu-memory-utilization` flag partitions the VRAM deterministically:
| Service | Allocation | Notes |
|---|---|---|
| vllm-voxtral | 0.45 × 24 GB ≈ 10.8 GB | Voxtral TTS 4B (bfloat16) |
| vllm-guard | 0.15 × 24 GB ≈ 3.6 GB | Llama Guard 3 1B (float16), --max-model-len 2048 |
| voice-agent (faster-whisper) | ~4–5 GB | large-v3 INT8, loaded at agent startup |
| Slack | ~5–6 GB | Burst headroom, faster-whisper transient batches |
Failure mode: CUDA OOM, handled by the `restart: unless-stopped` policy plus the Docker healthcheck tripping and restarting the affected container. If you add a new model, recompute this table first.
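The split above can be sanity-checked with quick arithmetic before adding a model; a minimal sketch (the fractions are the `--gpu-memory-utilization` values and 24 GB is the L4's VRAM, both from this section):

```shell
# Recompute the deterministic vLLM partition of the 24 GB L4.
TOTAL_GB=24
for entry in "vllm-voxtral:0.45" "vllm-guard:0.15"; do
  name=${entry%%:*}
  frac=${entry##*:}
  awk -v n="$name" -v f="$frac" -v t="$TOTAL_GB" \
      'BEGIN { printf "%-14s %4.1f GB\n", n, f * t }'
done
# Whatever remains after ~4-5 GB for faster-whisper is the burst headroom.
```

If the fractions plus the faster-whisper footprint push past ~19 GB, the headroom row in the table above is gone and OOM risk rises.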
4. Networking
Host networking
livekit-server, caddy, and voice-agent use network_mode: host. Reasons:
- The LiveKit SFU needs UDP (RTP), which is painful through Docker port mapping
- Keeps latency low (no userspace NAT)
- Caddy binding `:443` directly on the host means no Docker port-bridge indirection
Bridge-network ports
vllm-guard and vllm-voxtral are called from voice-agent via http://localhost:8000 / :8001. They expose those ports on the host (ports: - "8000:8000"), so they are reachable from host-network containers.
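In compose terms, the host/bridge split looks roughly like this (a sketch of the relevant stanzas based on this section, not a verbatim copy of voice/docker-compose.yml):

```yaml
services:
  voice-agent:
    network_mode: host          # shares the host stack with livekit-server and caddy
  vllm-guard:
    ports:
      - "8000:8000"             # bridge network, published so localhost:8000 works
  vllm-voxtral:
    ports:
      - "8001:8001"
```

Because the vLLM ports are published on the host, the host-networked voice-agent reaches them as plain `http://localhost:8000` / `:8001`.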
External connections from EC2
| Target | Protocol | Purpose |
|---|---|---|
| AWS Bedrock (bedrock-runtime.eu-central-1.amazonaws.com) | HTTPS | Mistral Large inference |
| LiveKit Cloud (if hybrid) | WSS | Fallback / test rigs |
| Deepgram EU (api.eu.deepgram.com) | HTTPS | Legacy STT fallback — verify it is still used before rotating the key |
| Mistral API (api.mistral.ai) | HTTPS | Bridge while the Voxtral license is pending |
All outbound must succeed from the EC2 security group / NACL.
Ingress
Only :443 (Caddy) is exposed to the public internet. :7880, :7881, :8000, :8001 are bound to localhost-only or reached only via voice-agent on the same host. Double-check EC2 security group before assuming this.
5. Volumes
| Volume | Mount | Purpose |
|---|---|---|
| Host /models/ | read-only into vllm-guard, vllm-voxtral, voice-agent | Pre-downloaded model weights |
| caddy_data (named) | /data in caddy | Let's Encrypt certs, HTTP-01 state |
| caddy_config (named) | /config in caddy | Caddy persistent config |
| Host ./config/livekit.yaml | /etc/livekit.yaml (ro) in livekit-server | LiveKit config |
| Host ./config/Caddyfile | /etc/caddy/Caddyfile (ro) in caddy | Caddy routing |
| Host .env | env-file into voice-agent | Runtime secrets (see configuration-system §3) |
Disk monitoring: CloudWatch alarms on / and /models/ — alert >80% per voice/DEPLOY-AWS.md.
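As compose stanzas, the mounts above look roughly like this (a sketch matching the table; the authoritative file is voice/docker-compose.yml):

```yaml
services:
  vllm-voxtral:
    volumes:
      - /models:/models:ro                          # pre-downloaded weights, read-only
  caddy:
    volumes:
      - caddy_data:/data                            # Let's Encrypt certs survive recreation
      - caddy_config:/config
      - ./config/Caddyfile:/etc/caddy/Caddyfile:ro  # routing config from the repo
volumes:
  caddy_data:
  caddy_config:
```

Named volumes persist across `docker compose down`/`up`, which is why the cert state lives there rather than in the container layer.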
6. Health + startup order
Docker healthchecks (from docker-compose.yml)
| Service | Test | Interval / timeout | Start period |
|---|---|---|---|
| livekit-server | wget --spider http://localhost:7880 | 15 s / 5 s | 10 s |
| caddy | wget --spider https://localhost:443/healthz | 30 s / 5 s | 30 s |
| vllm-guard | curl -f http://localhost:8000/health | 30 s / 10 s | 120 s (model load) |
| vllm-voxtral | curl -f http://localhost:8001/health | 30 s / 10 s | 180 s (larger model) |
| voice-agent | python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" | 30 s / 10 s | 60 s |
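The vllm-voxtral row would translate to a compose stanza roughly like this (a sketch; `retries` is an assumed value not listed in the table, and the exact stanza lives in voice/docker-compose.yml):

```yaml
services:
  vllm-voxtral:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
      interval: 30s
      timeout: 10s
      start_period: 180s   # generous: the larger model takes minutes to load
      retries: 3           # assumed, not from the table above
```

`start_period` matters most here: failures during it do not count against `retries`, so slow model loads do not flap the container.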
Dependency graph
livekit-server (healthy) ──┐
├─► voice-agent
vllm-guard (healthy) ───┤
vllm-voxtral (healthy) ───┘
livekit-server (healthy) ──► caddy

Cold start budget: ~3 minutes end-to-end, dominated by the Voxtral model load.
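The dependency graph maps onto compose `depends_on` conditions; a sketch of the assumed shape (gated on the healthchecks listed above):

```yaml
services:
  voice-agent:
    depends_on:
      livekit-server:
        condition: service_healthy
      vllm-guard:
        condition: service_healthy
      vllm-voxtral:
        condition: service_healthy
  caddy:
    depends_on:
      livekit-server:
        condition: service_healthy
```

With `condition: service_healthy`, the agent is not started until both vLLM services pass their `/health` probes, which is what pushes the cold start toward the ~3 minute budget.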
No /api/health aggregator on the web side that covers voice yet — see health-check.
7. Build
voice-agent is built locally from ./agent:

```yaml
build:
  context: ./agent
  dockerfile: Dockerfile
```

Other services pull public images:

- livekit/livekit-server:latest
- caddy:2-alpine
- vllm/vllm-openai:latest (×2)
`:latest` is acceptable here because images only change when the EC2 host runs `docker compose pull` during a deploy, never on a plain restart. If you need reproducibility for an audit snapshot, override with SHA-pinned tags at deploy time.
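One way to snapshot pins for an audit (an assumption, not an existing project file) is a merge-time override with digest references; docker compose merges a `docker-compose.override.yml` automatically, and the digest can be read from `docker image inspect` after the deploy-time pull:

```yaml
# docker-compose.override.yml (hypothetical audit snapshot; digests are placeholders)
services:
  vllm-guard:
    image: vllm/vllm-openai@sha256:<digest-from-docker-image-inspect>
  vllm-voxtral:
    image: vllm/vllm-openai@sha256:<digest-from-docker-image-inspect>
```

Digest references resolve to exactly one image manifest, so the snapshot stays byte-identical even if `:latest` moves upstream.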
8. Ops runbook (quick reference)
Full runbook: voice/DEPLOY-AWS.md. Quick tasks:
Restart a service
```shell
docker compose -f voice/docker-compose.yml restart voice-agent
docker compose logs -f voice-agent
```

GPU status
```shell
nvidia-smi                                   # realtime
# Or, inside a vLLM container:
docker compose exec vllm-voxtral nvidia-smi
```

Inspect memory split
```shell
# vLLM prints utilization at startup
docker compose logs vllm-guard | grep -i gpu
docker compose logs vllm-voxtral | grep -i gpu
```

CUDA OOM recovery
- Identify which container logged the OOM (`docker compose logs --tail 200`)
- Restart just that container — its `restart: unless-stopped` policy usually handles it
- If it persists, one model has drifted in memory footprint — recompute the §3 table and adjust `--gpu-memory-utilization`
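The identification step can be scripted; a sketch (service names from §2; the grep patterns are common CUDA/vLLM OOM phrasings, not an exhaustive or guaranteed list):

```shell
# Scan the recent log tail of each GPU-using service for OOM markers.
for svc in vllm-guard vllm-voxtral voice-agent; do
  if docker compose -f voice/docker-compose.yml logs --tail 200 "$svc" 2>/dev/null \
       | grep -qiE "out of memory|CUDA error"; then
    echo ">>> $svc logged a CUDA OOM; restart this one first"
  fi
done
```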
Credential rotation
See voice/CREDENTIALS-CHECKLIST.md + configuration-system §8.
Model refresh
- Download new weights to `/models/<name>/` on the host
- Update `command: --model /models/<name>` in `docker-compose.yml`
- `docker compose up -d <service>` — it picks up the change
- Monitor `docker compose logs` for successful warm-up
9. CloudWatch alarms (baseline from voice/DEPLOY-AWS.md)
| Alarm | Threshold | Action |
|---|---|---|
| CPU | >90% for 5m | Page |
| Memory | >85% for 5m | Page |
| Disk | >80% | Warn |
| GPU utilization (via cron) | >95% sustained | Warn |
| Log retention | 30 d | — |
10. What's NOT in containers (deliberate)
- Supabase / Postgres — managed (out of scope)
- Vercel edge — Next.js is not containerized; Vercel manages the runtime
- Stripe, Resend, Upstash — SaaS dependencies
- Mobile app — native builds via EAS
Only the voice stack is containerized because it's the only part with GPU + long-lived processes.
11. Gaps + hardening candidates
- No Dockerfile `HEALTHCHECK` in the `voice-agent` image itself — the healthcheck lives only in compose. Add one for defense in depth.
- No `/api/health` aggregator on the web side that surfaces voice status — see health-check.
- `restart: unless-stopped` means a crash-looping container can churn silently. Add alerting on restart count.
- No systemd watchdog for docker-compose itself — if the Docker daemon dies, nothing restarts. Document the fallback.
- Log rotation is set at 50 MB × 5 files per service — fine for development; consider shipping to CloudWatch Logs for compliance retention (§9 says 30 d retention; confirm).
- Model weights are mounted read-only — good. But host `/models/` should have restrictive permissions to prevent tampering; document ownership.
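The image-level healthcheck gap could be closed with a few lines; a sketch, assuming the agent keeps serving `/health` on :8080 as §6 describes (to be appended to ./agent/Dockerfile):

```dockerfile
# Defense in depth: mirrors the compose-level probe from §6, so the image
# reports health even when run outside this compose file.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1
```

If both an image `HEALTHCHECK` and a compose `healthcheck` exist, the compose one wins, so this adds coverage without changing current behavior.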
12. Related
| # | Relevance |
|---|---|
| #579 | Parent docs epic |
| #586 | Pillar B parent |
| #337 | Voice AWS migration (source of current topology) |
| architecture §3.2 | Higher-level voice pipeline view |
| configuration-system §3 | Voice env vars |
| health-check | Healthcheck design (separate doc) |
| voice/docker-compose.yml | Source of truth |
| voice/DEPLOY-AWS.md | Deployment runbook |
| voice/CREDENTIALS-CHECKLIST.md | Secrets rotation |
| docs/HETZNER-HANDOVER.md + docs/HETZNER-OPS-RUNBOOK.md | Historical — kept for reference, no longer authoritative after the AWS migration |
Changelog
- 2026-04-18 — Initial version. Reverses the `bubbly-bubbling-quilt.md` §4 claim that `docker-compose.yml` was removed during the Hetzner decommission: it exists and drives the AWS deployment (see `voice/docker-compose.yml`). Five services, not three as quilt listed — adds `caddy` and `vllm-voxtral` (Voxtral TTS runs locally on EC2, not externally).