
Health Checks

Liveness/readiness contract across web + voice, response shape, degraded-mode matrix, and P0–P3 implementation plan.

Status: Current as of 2026-04-18 (repo commit 06d84cb). Audience: Ops engineer, SRE designing alerting, BfArM reviewer (availability surface).

This doc defines the endpoint contract for liveness and readiness across web and voice, the response shape, degraded-mode behavior, and a staged implementation plan. The current infrastructure covers only part of this contract; §1 lists what exists today.


1. Current state

Web (apps/web/app/api/)

  • Only GET /api/studies/health exists. It checks data staleness for the clinical-studies feature; it is not a generic liveness probe.
  • No /api/health, /healthz, or /readyz on the web app.

Voice (EC2)

  • Per-container Docker healthchecks defined in voice/docker-compose.yml (§6 of container-system):
    • livekit-server: wget http://localhost:7880
    • caddy: wget https://localhost:443/healthz
    • vllm-guard: curl http://localhost:8000/health
    • vllm-voxtral: curl http://localhost:8001/health
    • voice-agent: urllib.urlopen('http://localhost:8080/health')
  • No Dockerfile HEALTHCHECK in the voice-agent image itself.
  • No systemd watchdog for the Docker daemon or compose stack.
  • No aggregated probe accessible from off-host (only Docker itself reads these).

Mobile

  • N/A — client app, no server-side probe.

Summary

Most of the stack has no externally observable liveness signal. Vercel internally checks Next.js health via edge runtime; beyond that, nothing alerts on partial failures.


2. Design — liveness vs. readiness

Standard Kubernetes-style split, useful even without k8s.

| Probe | Purpose | Passes if | On failure |
| --- | --- | --- | --- |
| /healthz (liveness) | "This process is alive" | Event loop responds in <1 s | External watchdog restarts container / process |
| /readyz (readiness) | "This process can serve real traffic" | Dependencies reachable, models loaded, GPU allocated | Load balancer drains / shows maintenance |

Separating them matters because a slow dependency (Supabase, Bedrock) should not cause a container restart — it should cause traffic drain.
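The split can be sketched as two small functions. This is a minimal illustration, not the project's implementation: `ProcessState` and its field names are hypothetical, and a real readiness probe would populate them from actual dependency checks.

```typescript
// Hypothetical in-memory state; field names are illustrative, not from the codebase.
interface ProcessState {
  startedAtMs: number;
  dependenciesReachable: boolean;
  modelsLoaded: boolean;
}

interface ProbeResult {
  httpStatus: 200 | 503;
  body: { status: "ok" | "fail"; uptime_seconds?: number; reason?: string };
}

// Liveness: answers from memory only. It never touches a dependency,
// so a slow Supabase or Bedrock call cannot trigger a restart.
function healthz(state: ProcessState, nowMs: number): ProbeResult {
  const uptimeSeconds = Math.floor((nowMs - state.startedAtMs) / 1000);
  return { httpStatus: 200, body: { status: "ok", uptime_seconds: uptimeSeconds } };
}

// Readiness: fails while dependencies are unreachable or models are still
// loading, which should drain traffic rather than restart the process.
function readyz(state: ProcessState): ProbeResult {
  if (!state.dependenciesReachable) {
    return { httpStatus: 503, body: { status: "fail", reason: "dependency unreachable" } };
  }
  if (!state.modelsLoaded) {
    return { httpStatus: 503, body: { status: "fail", reason: "models not loaded" } };
  }
  return { httpStatus: 200, body: { status: "ok" } };
}
```

With a dependency down, `healthz` still returns 200 while `readyz` returns 503, so a watchdog leaves the process alone and only traffic routing reacts.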


3. Proposed endpoints

Web — GET /api/health (new aggregator)

200 OK
{
  "status": "degraded",      // ok | degraded | fail (degraded here: the openai check timed out)
  "service": "fragjulia-web",
  "version": "<git-sha>",    // from NEXT_PUBLIC_BUILD_SHA
  "uptime_seconds": 1234,
  "timestamp": "2026-04-18T17:59:30Z",
  "checks": {
    "supabase":  { "status": "ok",       "latency_ms": 42 },
    "upstash":   { "status": "ok",       "latency_ms": 11 },
    "openai":    { "status": "degraded", "error": "timeout" },
    "bedrock":   { "status": "ok",       "latency_ms": 89 }   // once #358 lands
  }
}

Status mapping:

  • ok — all checks pass
  • degraded — any non-critical check fails (200 OK, warning banner)
  • fail — critical check fails (503 Service Unavailable)
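One possible way to derive the top-level status from the sub-checks is sketched below. It assumes that any sub-check whose status is not "ok" counts as not passing, and that the caller supplies the list of critical check names; which checks are critical is not fixed by this doc, so the list used in the usage note is purely illustrative.

```typescript
type Status = "ok" | "degraded" | "fail";

interface CheckResult {
  status: Status;
  latency_ms?: number;
  error?: string;
}

// Assumption: a sub-check that is not "ok" counts as not passing.
// A non-passing critical check yields "fail"; any other non-passing
// check yields "degraded"; otherwise the overall status is "ok".
function aggregateStatus(
  checks: Record<string, CheckResult>,
  criticalChecks: readonly string[],
): Status {
  let overall: Status = "ok";
  for (const name of Object.keys(checks)) {
    if (checks[name].status === "ok") continue;
    if (criticalChecks.indexOf(name) !== -1) return "fail";
    overall = "degraded";
  }
  return overall;
}
```

For the example payload above, with a hypothetical critical list of `["supabase"]`, the timed-out `openai` check produces `"degraded"`; a non-passing `supabase` check would produce `"fail"`.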

Voice — per-service /healthz (already via Docker) + /readyz (new, where missing)

voice-agent should expose both:

  • /healthz — process alive, pure in-memory check
  • /readyz — faster-whisper model loaded, LiveKit connection established, Llama Guard reachable, Voxtral reachable

vllm-guard and vllm-voxtral already expose /health per vLLM defaults, which is sufficient; they need no new endpoints.

Web → voice visibility

/api/health on the web app should optionally ping voice-agent's /readyz (via the LiveKit cloud URL or a dedicated internal proxy). Don't make it a hard dependency — voice downtime should not fail the web app probe.


4. Response contract

Fields (required)

  • status: "ok" | "degraded" | "fail"
  • service: service identifier ("fragjulia-web", "voice-agent", etc.)
  • timestamp: ISO 8601
  • version: git SHA / build ID
  • uptime_seconds: integer
  • checks: object of named sub-checks, each with status + latency_ms (or error)

HTTP codes

  • 200 — ok or degraded
  • 503 — fail

Auth

Health endpoints should be unauthenticated so external probes can hit them. They must return no PHI, no user-identifying data, no secrets — only boolean-ish service state.

Caching

Cache-Control: no-store on health endpoints.
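The contract in this section could be captured as a type plus a small response builder. This is a sketch under assumptions: `HealthBody`, `HttpResponse`, and `toHttpResponse` are hypothetical names, and the plain-object response stands in for whatever the framework expects.

```typescript
type Status = "ok" | "degraded" | "fail";

// Required fields from the contract above.
interface HealthBody {
  status: Status;
  service: string;
  timestamp: string; // ISO 8601
  version: string; // git SHA / build ID
  uptime_seconds: number;
  checks: Record<string, { status: Status; latency_ms?: number; error?: string }>;
}

// Framework-neutral stand-in for an HTTP response (hypothetical shape).
interface HttpResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// 200 for ok or degraded, 503 for fail; never cacheable.
function toHttpResponse(body: HealthBody): HttpResponse {
  return {
    status: body.status === "fail" ? 503 : 200,
    headers: { "Content-Type": "application/json", "Cache-Control": "no-store" },
    body: JSON.stringify(body),
  };
}
```

In a Next.js route handler the result could presumably be passed on as `new Response(res.body, { status: res.status, headers: res.headers })`; that wiring is an assumption about the eventual implementation, not something the repo contains today.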


5. Degraded-mode behavior

What the voice stack should do when a dependency is unhealthy — each decision is safety-weighted.

| Down | Voice behavior |
| --- | --- |
| faster-whisper (STT) | Return German error TTS to user, close session gracefully |
| Voxtral TTS | Text fallback: agent sends plain text via LiveKit data channel (no voice). Not implemented yet. |
| Llama Guard | Block all new sessions (safety-critical). Voice answers without output guardrails are not acceptable. BfArM-relevant. |
| LiveKit server | Web shows maintenance banner; cannot start new session |
| Bedrock Mistral | Session errors out; no secondary model fallback today (planned in #358) |

These are guidelines, not fully implemented contracts. Each row is a candidate issue for follow-up.
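If the matrix is later turned into code, a lookup table is one straightforward encoding. The sketch below is illustrative only: the dependency keys, action names, and the `mayStartSession` helper are hypothetical, and treating only Llama Guard and LiveKit outages as session-blocking is this sketch's reading of the table.

```typescript
type VoiceDependency =
  | "faster-whisper"
  | "voxtral-tts"
  | "llama-guard"
  | "livekit-server"
  | "bedrock-mistral";

type DegradedAction =
  | "error-tts-then-close"
  | "text-fallback"
  | "block-new-sessions"
  | "maintenance-banner"
  | "session-error";

// One entry per row of the degraded-mode matrix.
const DEGRADED_MODE: Record<VoiceDependency, DegradedAction> = {
  "faster-whisper": "error-tts-then-close",
  "voxtral-tts": "text-fallback", // not implemented yet per the matrix
  "llama-guard": "block-new-sessions", // safety-critical
  "livekit-server": "maintenance-banner",
  "bedrock-mistral": "session-error", // no secondary model today
};

// Maps each unhealthy dependency to its documented behavior.
function actionsFor(down: VoiceDependency[]): DegradedAction[] {
  return down.map((dep) => DEGRADED_MODE[dep]);
}

// Assumption: only these two actions prevent new sessions from starting.
function mayStartSession(down: VoiceDependency[]): boolean {
  const blocking: DegradedAction[] = ["block-new-sessions", "maintenance-banner"];
  return actionsFor(down).every((action) => blocking.indexOf(action) === -1);
}
```

Keeping the matrix in a single typed record means a new dependency cannot be added without also deciding its degraded behavior, since the compiler rejects a missing key.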

Fail-open vs. fail-closed

  • Upstash Redis rate limiter — fails open (documented in architecture §5)
  • Llama Guard — currently fails open → must become fail-closed before DiGA filing
  • Bedrock Guardrails (planned #358) — fails closed per design

Fail-closed on safety-critical guardrails is a DiGA non-negotiable.
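The difference between the two postures comes down to what happens when the guard gives no answer at all. A minimal sketch, with hypothetical type and function names:

```typescript
type FailurePolicy = "fail-open" | "fail-closed";

// Outcome of asking a guard (e.g. a rate limiter or Llama Guard) about a request.
// "unreachable" means no verdict was obtained (timeout, connection error).
type GuardOutcome = "allowed" | "blocked" | "unreachable";

// Decides whether the request may proceed under the given policy.
function permitRequest(outcome: GuardOutcome, policy: FailurePolicy): boolean {
  if (outcome === "allowed") return true;
  if (outcome === "blocked") return false;
  // No verdict: fail-open lets the request through, fail-closed refuses it.
  return policy === "fail-open";
}
```

Under this sketch, `permitRequest("unreachable", "fail-open")` is `true` (the current Upstash and Llama Guard behavior described above), while `permitRequest("unreachable", "fail-closed")` is `false` (the target posture for Llama Guard).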


6. Implementation priority

P0 (before DiGA filing)

  • GET /api/health on web with supabase + upstash checks
  • /healthz on voice-agent (distinct from existing /health)
  • Fail-closed posture on Llama Guard — block sessions when guard unreachable
  • Document the degraded-mode matrix (§5) as an incident-response checklist

P1

  • Dockerfile HEALTHCHECK on voice-agent image
  • /readyz on voice-agent with model-loaded + GPU-allocated checks
  • /api/health includes openai / bedrock sub-check
  • systemd watchdog for docker-compose on EC2

P2

  • CloudWatch Synthetics probe hitting /api/health + voice-agent /readyz every 60 s
  • PagerDuty / email alert on sustained 503 or fail status
  • Voxtral text-fallback when TTS container is unhealthy
  • Bedrock secondary model fallback (bundled with #358)

P3

  • Latency budgets per sub-check, p50/p95 alerts
  • Automated degraded-mode rehearsal (chaos-style)

7. How to add a new health check

Minimal checklist when extending /api/health:

  • Add a checks.<name> property — include status + latency_ms or error
  • Classify: critical (impacts status) vs. informational (reported, doesn't fail)
  • Budget: your check must complete in ≤500 ms (aggregator must stay <2 s total)
  • Timeout independently — use AbortController with its own budget
  • Never log a check's response body, since it may contain PHI; log only pass/fail + latency
  • Update this doc's §3 example
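The budget and timeout items in the checklist could look roughly like the following. This is a sketch under assumptions: `runCheck` and `probe` are hypothetical names, `probe` stands for any dependency call that accepts an AbortSignal, and the 500 ms default mirrors the per-check budget above.

```typescript
interface CheckResult {
  status: "ok" | "fail";
  latency_ms?: number;
  error?: string;
}

// Runs one sub-check under its own time budget.
// `probe` is a hypothetical dependency call that should honor the signal.
async function runCheck(
  probe: (signal: AbortSignal) => Promise<void>,
  budgetMs = 500,
): Promise<CheckResult> {
  const controller = new AbortController();
  const startedAt = Date.now();
  const timer = setTimeout(() => controller.abort(), budgetMs);

  // Rejects when the budget expires, so the check ends on time even if
  // the probe ignores the abort signal.
  const timeout = new Promise<never>((_, reject) => {
    controller.signal.addEventListener("abort", () => reject(new Error("timeout")));
  });

  try {
    await Promise.race([probe(controller.signal), timeout]);
    return { status: "ok", latency_ms: Date.now() - startedAt };
  } catch {
    // Only a coarse reason is reported; the response body is never recorded.
    return { status: "fail", error: controller.signal.aborted ? "timeout" : "error" };
  } finally {
    clearTimeout(timer);
  }
}
```

An aggregator could start all checks at once with `Promise.all`, so the total time stays close to the slowest single budget rather than the sum of all budgets.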

| Reference | Relevance |
| --- | --- |
| #579 | Parent docs epic |
| #586 | Pillar B parent |
| #358 | Chat Bedrock migration; extends /api/health checks |
| architecture §5 | Integration failure modes table (upstream context) |
| container-system §6 | Container-level healthchecks |
| configuration-system | CRON_SECRET etc. for hardening any authed health paths |
| apps/web/app/api/studies/health/route.ts | Existing (feature-specific) health route, reference only |
| voice/docker-compose.yml | Source of §1 voice healthcheck state |

Changelog

  • 2026-04-18 — Initial version. Current state verified: only /api/studies/health on web; Docker healthchecks on all 5 voice containers but no unified /api/health. Emphasis on fail-closed Llama Guard as a DiGA requirement.
