Health Checks

Liveness/readiness contract across web + voice, response shape, degraded-mode matrix, and P0–P3 implementation plan.

Status: Current as of 2026-04-18 (repo commit 06d84cb). Audience: Ops engineer, SRE designing alerting, BfArM reviewer (availability surface).

This doc defines the endpoint contract for liveness and readiness across web and voice, the response shape, degraded-mode behavior, and a staged implementation plan. Current infrastructure is partial; §1 lists what exists today.

1. Current state

Web (`apps/web/app/api/`)

Only GET /api/studies/health exists. It checks data staleness for the clinical-studies feature and is not a generic liveness probe.
No /api/health, /healthz, or /readyz on the web app.

Voice (EC2)

Per-container Docker healthchecks defined in voice/docker-compose.yml (§6 of container-system):
- livekit-server — wget http://localhost:7880
- caddy — wget https://localhost:443/healthz
- vllm-guard — curl http://localhost:8000/health
- vllm-voxtral — curl http://localhost:8001/health
- voice-agent — urllib.urlopen('http://localhost:8080/health')
No Dockerfile HEALTHCHECK in the voice-agent image itself.
No systemd watchdog for the Docker daemon or compose stack.
No aggregated probe accessible from off-host (only Docker itself reads these).

Mobile

N/A — client app, no server-side probe.

Summary

Most of the stack has no externally-observable liveness signal. Vercel internally checks Next.js health via edge runtime; beyond that, nothing alerts on partial failures.

2. Design — liveness vs. readiness

Standard Kubernetes-style split, useful even without k8s.

Probe	Purpose	Passes if	Fails →
`/healthz` (liveness)	"This process is alive"	Event loop responds in <1 s	External watchdog restarts container / process
`/readyz` (readiness)	"This process can serve real traffic"	Dependencies reachable, models loaded, GPU allocated	Load balancer drains / shows maintenance

Separating them matters because a slow dependency (Supabase, Bedrock) should not cause a container restart — it should cause traffic drain.
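
A minimal sketch of the split, assuming hypothetical handler names (`healthz`, `readyz`) and a dependency-flag map that is kept up to date elsewhere. It is meant only to show why a dependency outage should fail readiness while liveness keeps passing.

```typescript
// Illustrative only: handler names and the readiness-flag map are assumptions.
type ProbeResult = { httpStatus: 200 | 503; status: "ok" | "fail" };

// Liveness: no I/O and no dependency calls. If this runs, the process is alive.
function healthz(): ProbeResult {
  return { httpStatus: 200, status: "ok" };
}

// Readiness: passes only when every dependency flag is true.
// The flags are assumed to be maintained elsewhere (e.g. by background checks).
function readyz(dependencies: Record<string, boolean>): ProbeResult {
  const allReady = Object.values(dependencies).every((ready) => ready);
  return allReady
    ? { httpStatus: 200, status: "ok" }
    : { httpStatus: 503, status: "fail" };
}

// During a dependency outage readiness fails (drain traffic)
// while liveness still passes (no restart).
const duringOutage = { supabase: false, bedrock: true };
console.log(healthz().httpStatus, readyz(duringOutage).httpStatus); // prints: 200 503
```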

3. Proposed endpoints

Web — `GET /api/health` (new aggregator)

200 OK
{
  "status": "degraded",      // ok | degraded | fail
  "service": "fragjulia-web",
  "version": "<git-sha>",    // from NEXT_PUBLIC_BUILD_SHA
  "uptime_seconds": 1234,
  "timestamp": "2026-04-18T17:59:30Z",
  "checks": {
    "supabase":  { "status": "ok",       "latency_ms": 42 },
    "upstash":   { "status": "ok",       "latency_ms": 11 },
    "openai":    { "status": "degraded", "error": "timeout" },
    "bedrock":   { "status": "ok",       "latency_ms": 89 }   // once #358 lands
  }
}

Status mapping:

ok — all checks pass
degraded — any non-critical check fails (200 OK, warning banner)
fail — critical check fails (503 Service Unavailable)
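
The mapping can be sketched as a pure function. This is an illustration, not the implemented aggregator: the function name `aggregateStatus` is hypothetical, and this doc does not yet say which checks are critical, so the critical list is passed in. The sketch also assumes that any non-ok result on a critical check counts as a failure.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

// Hypothetical helper: derives the top-level status from the sub-checks.
// `critical` lists the check names whose failure must produce "fail".
function aggregateStatus(
  checks: Record<string, CheckResult>,
  critical: readonly string[],
): Status {
  let overall: Status = "ok";
  for (const name of Object.keys(checks)) {
    if (checks[name].status === "ok") continue;
    if (critical.includes(name)) return "fail"; // critical check failed
    overall = "degraded"; // non-critical check failed
  }
  return overall;
}

// Assumption for the example only: supabase and upstash are treated as critical.
const exampleChecks: Record<string, CheckResult> = {
  supabase: { status: "ok", latency_ms: 42 },
  upstash: { status: "ok", latency_ms: 11 },
  openai: { status: "degraded", error: "timeout" },
};
console.log(aggregateStatus(exampleChecks, ["supabase", "upstash"])); // prints: degraded
```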

Voice — per-service `/healthz` (already via Docker) + `/readyz` (new, where missing)

voice-agent should expose both:

/healthz — process alive, pure in-memory check
/readyz — faster-whisper model loaded, LiveKit connection established, Llama Guard reachable, Voxtral reachable

vllm-guard and vllm-voxtral already expose /health per vLLM defaults. That's enough.

4. Response shape

Fields (required)

status: "ok" | "degraded" | "fail"
service: service identifier ("fragjulia-web", "voice-agent", etc.)
timestamp: ISO 8601

Fields (recommended)

version: git SHA / build ID
uptime_seconds: integer
checks: object of named sub-checks, each with status + latency_ms (or error)

HTTP codes

200 — ok or degraded
503 — fail
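
One way to express the required fields, recommended fields, and HTTP codes above as types. This is a sketch under assumptions: `HealthBody`, `httpCodeFor`, and `buildHealthResponse` are hypothetical names, and trimming milliseconds from the timestamp is a choice made only to match the example in §3.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

interface HealthBody {
  // Required fields
  status: Status;
  service: string;
  timestamp: string; // ISO 8601
  // Recommended fields
  version?: string;
  uptime_seconds?: number;
  checks?: Record<string, CheckResult>;
}

// ok and degraded both answer 200; only fail answers 503.
function httpCodeFor(status: Status): 200 | 503 {
  return status === "fail" ? 503 : 200;
}

// Hypothetical builder: assembles the body and picks the HTTP code.
function buildHealthResponse(
  service: string,
  status: Status,
  now: Date,
  recommended: Pick<HealthBody, "version" | "uptime_seconds" | "checks"> = {},
): { code: 200 | 503; body: HealthBody } {
  // toISOString() includes milliseconds; drop them to match the §3 example.
  const timestamp = now.toISOString().replace(/\.\d{3}Z$/, "Z");
  return {
    code: httpCodeFor(status),
    body: { status, service, timestamp, ...recommended },
  };
}
```

Keeping `httpCodeFor` separate from the body builder means a probe that only looks at the status code and a dashboard that parses the body cannot disagree about what "fail" means.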

5. Degraded-mode matrix

Component down	Behavior
faster-whisper (STT)	Return German error TTS to user, close session gracefully
Voxtral TTS	Text fallback — agent sends plain text via LiveKit data channel (no voice). Not implemented yet.
Llama Guard	Block all new sessions (safety-critical). Voice answers without output guardrails are not acceptable. BfArM-relevant.
LiveKit server	Web shows maintenance banner; cannot start new session
Bedrock Mistral	Session errors out; no secondary model fallback today (planned in #358)

These are guidelines, not fully implemented contracts. Each row is a candidate issue for follow-up.

Fail-open vs. fail-closed

Upstash Redis rate limiter — fails open (documented in architecture §5)
Llama Guard — currently fails open → must become fail-closed before DiGA filing
Bedrock Guardrails (planned #358) — fails closed per design

Fail-closed on safety-critical guardrails is a DiGA non-negotiable.
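
The target posture can be sketched as a small policy table. The dependency keys and the `mayProceed` helper are hypothetical, and defaulting unknown dependencies to fail-closed is an assumption of this sketch. The voice-agent healthcheck in §1 suggests a Python service, so treat this as logic, not as drop-in code.

```typescript
type FailurePolicy = "fail-open" | "fail-closed";

// Target state described above. Keys are illustrative names, not real config.
const targetPolicies: Record<string, FailurePolicy> = {
  "upstash-rate-limiter": "fail-open",
  "llama-guard": "fail-closed", // target; currently fails open
  "bedrock-guardrails": "fail-closed", // planned in #358
};

// Decides whether a request or session may proceed given dependency reachability.
// When the dependency is reachable its own verdict applies (not modeled here).
function mayProceed(dependency: string, reachable: boolean): boolean {
  if (reachable) return true;
  // Assumption: anything not listed is treated as safety-critical.
  const policy = targetPolicies[dependency] ?? "fail-closed";
  return policy === "fail-open";
}

console.log(mayProceed("llama-guard", false)); // prints: false
console.log(mayProceed("upstash-rate-limiter", false)); // prints: true
```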

6. Implementation priority

P0 (before DiGA filing)

GET /api/health on web with supabase + upstash checks
/healthz on voice-agent (distinct from existing /health)
Fail-closed posture on Llama Guard — block sessions when guard unreachable
Document the degraded-mode matrix (§5) as an incident-response checklist

P1

Dockerfile HEALTHCHECK on voice-agent image
/readyz on voice-agent with model-loaded + GPU-allocated checks
/api/health includes openai / bedrock sub-check
systemd watchdog for docker-compose on EC2

P2

CloudWatch Synthetics probe hitting /api/health + voice-agent /readyz every 60 s
PagerDuty / email alert on sustained 503 or fail status
Voxtral text-fallback when TTS container is unhealthy
Bedrock secondary model fallback (bundled with #358)
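
"Sustained" is not defined yet. As a starting point for discussion, the sketch below treats N consecutive failing probes as sustained. The threshold of 3 (about three minutes at a 60 s interval) and the names `ProbeSample` and `isSustainedFailure` are assumptions, not decisions.

```typescript
interface ProbeSample {
  httpStatus: number;
  status?: "ok" | "degraded" | "fail"; // parsed from the body when available
}

// Returns true when the most recent `threshold` samples all failed.
// `samples` is ordered oldest first.
function isSustainedFailure(
  samples: readonly ProbeSample[],
  threshold = 3,
): boolean {
  if (!Number.isInteger(threshold) || threshold < 1) {
    throw new RangeError("threshold must be a positive integer");
  }
  if (samples.length < threshold) return false;
  return samples
    .slice(-threshold)
    .every((s) => s.httpStatus === 503 || s.status === "fail");
}
```

Requiring consecutive failures keeps a single slow probe from paging anyone, at the cost of delaying the alert by `threshold` probe intervals.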

P3

Latency budgets per sub-check, p50/p95 alerts
Automated degraded-mode rehearsal (chaos-style)

7. How to add a new health check

Minimal checklist when extending /api/health:

Add a checks.<name> property — include status + latency_ms or error
Classify: critical (impacts status) vs. informational (reported, doesn't fail)
Budget: your check must complete in ≤500 ms (aggregator must stay <2 s total)
Timeout independently — use AbortController with its own budget
Never log a check's response body, since it may contain PHI; log only pass/fail and latency
Update this doc's §3 example
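
The checklist above can be sketched as a small wrapper. This is a hedged illustration, not the aggregator's actual code: `runCheck` and `CheckFn` are hypothetical names, and returning only "timeout" or "error" (never the underlying message) is one way to keep PHI out of the response and logs.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

// A check receives an AbortSignal and resolves on success or rejects on failure.
type CheckFn = (signal: AbortSignal) => Promise<void>;

// Runs one sub-check with its own timeout budget (default 500 ms).
// Pass "degraded" as failureStatus for informational checks.
async function runCheck(
  check: CheckFn,
  budgetMs = 500,
  failureStatus: Exclude<Status, "ok"> = "fail",
): Promise<CheckResult> {
  const controller = new AbortController();
  const started = Date.now();
  let timedOut = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      timedOut = true;
      controller.abort(); // lets the check cancel in-flight I/O
      reject(new Error("timeout"));
    }, budgetMs);
  });

  try {
    await Promise.race([check(controller.signal), deadline]);
    return { status: "ok", latency_ms: Date.now() - started };
  } catch {
    // Deliberately generic: never echo the upstream error or response body.
    return { status: failureStatus, error: timedOut ? "timeout" : "error" };
  } finally {
    clearTimeout(timer);
  }
}
```

A check would typically forward the signal to its I/O call, for example `fetch(url, { signal })`, so that an expired budget also cancels the request. Running all checks with `Promise.all` keeps the aggregator's total time close to the slowest single budget instead of the sum.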

Cross-references

Reference	Relevance
#579	Parent docs epic
#586	Pillar B parent
#358	Chat Bedrock migration — extends `/api/health` checks
architecture §5	Integration failure modes table (upstream context)
container-system §6	Container-level healthchecks
configuration-system	`CRON_SECRET` etc. for hardening any authed health paths
`apps/web/app/api/studies/health/route.ts`	Existing (feature-specific) health route — reference only
`voice/docker-compose.yml`	Source of §1 voice healthcheck state

Changelog

2026-04-18 — Initial version. Current state verified: only /api/studies/health on web; Docker healthchecks on all 5 voice containers but no unified /api/health. Emphasis on fail-closed Llama Guard as a DiGA requirement.