Health Checks
Liveness/readiness contract across web + voice, response shape, degraded-mode matrix, and P0–P3 implementation plan.
Status: Current as of 2026-04-18 (repo commit 06d84cb).
Audience: Ops engineer, SRE designing alerting, BfArM reviewer (availability surface).
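This doc defines the endpoint contract for liveness and readiness across web and voice, the response shape, degraded-mode behavior, and a staged implementation plan. Current infrastructure is partial: the web app has only a feature-specific health route, and the voice stack has only host-local Docker healthchecks (see §1).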
1. Current state
Web (apps/web/app/api/)
- Only GET /api/studies/health exists. It checks data-staleness for the clinical-studies feature, not a generic liveness probe.
- No /api/health, /healthz, or /readyz on the web app.
Voice (EC2)
- Per-container Docker healthchecks defined in voice/docker-compose.yml (§6 of container-system):
  - livekit-server: wget http://localhost:7880
  - caddy: wget https://localhost:443/healthz
  - vllm-guard: curl http://localhost:8000/health
  - vllm-voxtral: curl http://localhost:8001/health
  - voice-agent: urllib.urlopen('http://localhost:8080/health')
- No Dockerfile HEALTHCHECK in the voice-agent image itself.
- No systemd watchdog for the Docker daemon or compose stack.
- No aggregated probe accessible from off-host (only Docker itself reads these).
Mobile
- N/A — client app, no server-side probe.
Summary
Most of the stack has no externally-observable liveness signal. Vercel internally checks Next.js health via edge runtime; beyond that, nothing alerts on partial failures.
2. Design — liveness vs. readiness
Standard Kubernetes-style split, useful even without k8s.
| Probe | Purpose | Passes if | Fails → |
|---|---|---|---|
| /healthz (liveness) | "This process is alive" | Event loop responds in <1 s | External watchdog restarts container / process |
| /readyz (readiness) | "This process can serve real traffic" | Dependencies reachable, models loaded, GPU allocated | Load balancer drains / shows maintenance |
Separating them matters because a slow dependency (Supabase, Bedrock) should not cause a container restart — it should cause traffic drain.
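The split can be illustrated with two pure functions. This is a minimal sketch, not the project's implementation: the type names, the `ReadinessDeps` flags, and the `failing` field are assumptions made for illustration.

```typescript
// Hypothetical readiness preconditions, named after the "Passes if" column above.
interface ReadinessDeps {
  dependenciesReachable: boolean;
  modelsLoaded: boolean;
  gpuAllocated: boolean;
}

interface ProbeResult {
  httpStatus: 200 | 503;
  body: { status: "ok" | "fail"; failing?: string[] };
}

// Liveness: answered purely in memory. If this handler runs at all,
// the event loop is responsive; no dependency is consulted.
function healthz(): ProbeResult {
  return { httpStatus: 200, body: { status: "ok" } };
}

// Readiness: 503 as soon as any precondition for serving traffic is unmet.
function readyz(deps: ReadinessDeps): ProbeResult {
  const failing = (Object.keys(deps) as (keyof ReadinessDeps)[]).filter(
    (key) => !deps[key],
  );
  return failing.length === 0
    ? { httpStatus: 200, body: { status: "ok" } }
    : { httpStatus: 503, body: { status: "fail", failing } };
}
```

With an unreachable dependency, `readyz` returns 503 (drain traffic) while `healthz` still returns 200 (no restart), which is the behavior the split is meant to guarantee.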
3. Proposed endpoints
Web — GET /api/health (new aggregator)
200 OK

```jsonc
{
  "status": "degraded",             // ok | degraded | fail
  "service": "fragjulia-web",
  "version": "<git-sha>",           // from NEXT_PUBLIC_BUILD_SHA
  "uptime_seconds": 1234,
  "timestamp": "2026-04-18T17:59:30Z",
  "checks": {
    "supabase": { "status": "ok", "latency_ms": 42 },
    "upstash": { "status": "ok", "latency_ms": 11 },
    "openai": { "status": "degraded", "error": "timeout" },
    "bedrock": { "status": "ok", "latency_ms": 89 }  // once #358 lands
  }
}
```

Status mapping:
- ok: all checks pass
- degraded: any non-critical check fails (200 OK, warning banner)
- fail: a critical check fails (503 Service Unavailable)
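The status mapping can be sketched as a small aggregation function. This is an illustration under assumptions: the doc does not say which checks are critical, so the caller passes the critical set explicitly, and any critical check that is not ok is treated as a failure.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

// Assumption: the set of critical check names is supplied by the caller;
// this doc does not fix it.
function aggregate(
  checks: Record<string, CheckResult>,
  critical: ReadonlySet<string>,
): { status: Status; httpStatus: 200 | 503 } {
  let status: Status = "ok";
  for (const [name, result] of Object.entries(checks)) {
    if (result.status === "ok") continue;
    // A critical check that is not ok fails the whole probe.
    if (critical.has(name)) return { status: "fail", httpStatus: 503 };
    // A non-critical problem only downgrades the overall status.
    status = "degraded";
  }
  return { status, httpStatus: 200 };
}
```

For example, a non-critical timeout such as `openai` yields `degraded` with HTTP 200, while a failing `supabase` check (if classified as critical) yields `fail` with HTTP 503.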
Voice — per-service /healthz (already via Docker) + /readyz (new, where missing)
voice-agent should expose both:
- /healthz: process alive, pure in-memory check
- /readyz: faster-whisper model loaded, LiveKit connection established, Llama Guard reachable, Voxtral reachable
vllm-guard and vllm-voxtral already expose /health per vLLM defaults. That's enough.
Web → voice visibility
/api/health on the web app should optionally ping voice-agent's /readyz (via the LiveKit cloud URL or a dedicated internal proxy). Don't make it a hard dependency — voice downtime should not fail the web app probe.
4. Response contract
Fields (required)
- status: "ok" | "degraded" | "fail"
- service: service identifier ("fragjulia-web", "voice-agent", etc.)
- timestamp: ISO 8601
Fields (recommended)
- version: git SHA / build ID
- uptime_seconds: integer
- checks: object of named sub-checks, each with status + latency_ms (or error)
HTTP codes
- 200: ok or degraded
- 503: fail
Auth
Health endpoints should be unauthenticated so external probes can hit them. They must return no PHI, no user-identifying data, no secrets — only boolean-ish service state.
Caching
Cache-Control: no-store on health endpoints.
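Taken together, the required fields, HTTP codes, and caching rule could be assembled roughly as follows. This is a sketch only: `buildHealthResponse` and the `HealthResponse` shape are hypothetical names, and a real Next.js route would wrap the result in its own response object.

```typescript
type Status = "ok" | "degraded" | "fail";

interface HealthBody {
  status: Status;
  service: string;
  timestamp: string; // ISO 8601
}

interface HealthResponse {
  httpStatus: 200 | 503;
  headers: Record<string, string>;
  body: HealthBody;
}

// Builds a framework-neutral response that follows the contract above.
// The body carries service state only: no PHI, user data, or secrets.
function buildHealthResponse(
  status: Status,
  service: string,
  now: Date = new Date(),
): HealthResponse {
  return {
    httpStatus: status === "fail" ? 503 : 200,
    headers: {
      "Cache-Control": "no-store",
      "Content-Type": "application/json",
    },
    body: { status, service, timestamp: now.toISOString() },
  };
}
```

Passing the clock in as a parameter keeps the function deterministic in tests; in production the default `new Date()` applies.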
5. Degraded-mode behavior
What the voice stack should do when a dependency is unhealthy — each decision is safety-weighted.
| Down | Voice behavior |
|---|---|
| faster-whisper (STT) | Return German error TTS to user, close session gracefully |
| Voxtral TTS | Text fallback — agent sends plain text via LiveKit data channel (no voice). Not implemented yet. |
| Llama Guard | Block all new sessions (safety-critical). Voice answers without output guardrails are not acceptable. BfArM-relevant. |
| LiveKit server | Web shows maintenance banner; cannot start new session |
| Bedrock Mistral | Session errors out; no secondary model fallback today (planned in #358) |
These are guidelines, not fully implemented contracts. Each row is a candidate issue for follow-up.
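One way to keep the matrix usable during an incident is to encode it as a lookup table and render a checklist from it. The sketch below copies the behaviors from the table; the dependency keys and the `incidentChecklist` helper are illustrative names, not identifiers from the codebase.

```typescript
// Keys are illustrative; the behaviors are taken from the table above.
type VoiceDependency =
  | "faster-whisper"
  | "voxtral-tts"
  | "llama-guard"
  | "livekit-server"
  | "bedrock-mistral";

const DEGRADED_ACTIONS: Record<VoiceDependency, string> = {
  "faster-whisper": "Return German error TTS to user, close session gracefully",
  "voxtral-tts": "Text fallback via LiveKit data channel (not implemented yet)",
  "llama-guard": "Block all new sessions (safety-critical)",
  "livekit-server": "Web shows maintenance banner; cannot start new session",
  "bedrock-mistral": "Session errors out; no secondary model fallback today",
};

// Renders one checklist line per unhealthy dependency.
function incidentChecklist(down: VoiceDependency[]): string[] {
  return down.map((dep) => `[ ] ${dep} down: ${DEGRADED_ACTIONS[dep]}`);
}
```

Because the record is keyed by the `VoiceDependency` union, adding a dependency to the type without adding its row fails at compile time.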
Fail-open vs. fail-closed
- Upstash Redis rate limiter — fails open (documented in architecture §5)
- Llama Guard — currently fails open → must become fail-closed before DiGA filing
- Bedrock Guardrails (planned #358) — fails closed per design
Fail-closed on safety-critical guardrails is a DiGA non-negotiable.
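The difference between the two postures comes down to what happens when the guard itself cannot be reached. A minimal sketch, assuming a three-valued guard outcome (the type names, the `TARGET_POLICY` keys, and `shouldProceed` are hypothetical):

```typescript
// "unreachable" means the guard could not be consulted at all.
type GuardOutcome = "pass" | "reject" | "unreachable";
type FailurePolicy = "fail-open" | "fail-closed";
type GuardName = "upstash-rate-limiter" | "llama-guard" | "bedrock-guardrails";

// Target policies as listed above. Llama Guard is shown in its required
// target state; per this doc it currently fails open.
const TARGET_POLICY: Record<GuardName, FailurePolicy> = {
  "upstash-rate-limiter": "fail-open",
  "llama-guard": "fail-closed",
  "bedrock-guardrails": "fail-closed",
};

// Decides whether the request or session may proceed.
function shouldProceed(outcome: GuardOutcome, policy: FailurePolicy): boolean {
  // Only an unreachable guard consults the policy; an explicit verdict wins.
  if (outcome === "unreachable") return policy === "fail-open";
  return outcome === "pass";
}
```

Under this sketch, an unreachable Llama Guard blocks the session, while an unreachable rate limiter lets the request through.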
6. Implementation priority
P0 (before DiGA filing)
- GET /api/health on web with supabase + upstash checks
- /healthz on voice-agent (distinct from existing /health)
- Fail-closed posture on Llama Guard: block sessions when guard unreachable
- Document the degraded-mode matrix (§5) as an incident-response checklist
P1
- Dockerfile HEALTHCHECK on voice-agent image
- /readyz on voice-agent with model-loaded + GPU-allocated checks
- /api/health includes openai / bedrock sub-check
- systemd watchdog for docker-compose on EC2
P2
- CloudWatch Synthetics probe hitting /api/health + voice-agent /readyz every 60 s
- PagerDuty / email alert on sustained 503 or fail status
- Voxtral text-fallback when TTS container is unhealthy
- Bedrock secondary model fallback (bundled with #358)
P3
- Latency budgets per sub-check, p50/p95 alerts
- Automated degraded-mode rehearsal (chaos-style)
7. How to add a new health check
Minimal checklist when extending /api/health:
- Add a checks.<name> property: include status + latency_ms or error
- Classify: critical (impacts status) vs. informational (reported, doesn't fail)
- Budget: your check must complete in ≤500 ms (aggregator must stay <2 s total)
- Timeout independently: use AbortController with its own budget
- Never log the check's response body, which may contain PHI; log only pass/fail + latency
- Update this doc's §3 example
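The checklist above could translate into a wrapper along these lines. It is a sketch under assumptions: `runCheck` and the `Probe` callback type are hypothetical, and a non-critical failure is reported as degraded to mirror the §3 example.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

// Hypothetical callback: performs one dependency call and should forward
// the signal (for example to fetch) so the call is actually cancelled.
type Probe = (signal: AbortSignal) => Promise<void>;

async function runCheck(
  probe: Probe,
  critical: boolean,
  budgetMs = 500,
): Promise<CheckResult> {
  const controller = new AbortController();
  const started = Date.now();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  // Rejects when the budget expires, even if the probe ignores the signal.
  const deadline = new Promise<never>((_, reject) => {
    controller.signal.addEventListener("abort", () => reject(new Error("timeout")));
  });
  try {
    await Promise.race([probe(controller.signal), deadline]);
    return { status: "ok", latency_ms: Date.now() - started };
  } catch {
    // Report a coarse reason only: never the response body or error message.
    return {
      status: critical ? "fail" : "degraded",
      error: controller.signal.aborted ? "timeout" : "error",
    };
  } finally {
    clearTimeout(timer);
  }
}
```

Running all checks concurrently (for instance with `Promise.all`) keeps the aggregator's total time close to the slowest single budget rather than the sum of all budgets.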
8. Related
| # | Relevance |
|---|---|
| #579 | Parent docs epic |
| #586 | Pillar B parent |
| #358 | Chat Bedrock migration — extends /api/health checks |
| architecture §5 | Integration failure modes table (upstream context) |
| container-system §6 | Container-level healthchecks |
| configuration-system | CRON_SECRET etc. for hardening any authed health paths |
| apps/web/app/api/studies/health/route.ts | Existing (feature-specific) health route, reference only |
| voice/docker-compose.yml | Source of §1 voice healthcheck state |
Changelog
- 2026-04-18: Initial version. Current state verified: only /api/studies/health on web; Docker healthchecks on all 5 voice containers but no unified /api/health. Emphasis on fail-closed Llama Guard as a DiGA requirement.