---
name: service-health-check
description: "Service health monitoring: Discover, Check, Report in 3 phases."
user-invocable: false
allowed-tools:
  - Bash
  - Read
  - Glob
  - Grep
routing:
  triggers:
    - "service status"
    - "process health"
    - "uptime check"
    - "is service running"
    - "check health"
  category: infrastructure
  pairs_with:
    - kubernetes-debugging
    - endpoint-validator
    - condition-based-waiting
---

# Service Health Check Skill

## Overview

This skill provides deterministic service health monitoring using the **Discover-Check-Report** pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services.

**Core principle**: Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional — always cross-check against health files and port binding.

---

## Instructions

### Phase 1: DISCOVER

**Goal**: Identify all services to check before running any health probes.

**Step 1: Locate service definitions**

Search for service configuration in this order:
1. `services.json` in project root (example schema below)
2. Docker/docker-compose files for service definitions
3. systemd unit files or process manager configs
4. User-provided service specification
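
No fixed schema is assumed for `services.json`; a minimal example consistent with the manifest fields below might look like this (field names are illustrative, not a standard):

```json
{
    "services": [
        {
            "name": "api-server",
            "process_pattern": "gunicorn.*app:app",
            "health_file": "/tmp/api_health.json",
            "port": 8000,
            "stale_threshold_seconds": 300
        },
        {
            "name": "cache",
            "process_pattern": "redis-server",
            "port": 6379
        }
    ]
}
```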

**Step 2: Build service manifest**

For each service, establish:

```markdown
## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |
```

**Validation constraints**:
- Each process pattern must be specific enough to avoid false matches: a bare "python" matches every Python process, so prefer full command paths, distinguishing arguments, or specific binary names
- Health file paths must be absolute
- Port numbers must be valid (1-65535)

**Step 3: Validate manifest**

Confirm each entry passes the constraints above. If a pattern is too broad, use `ps aux | grep` to identify distinguishing arguments, then update the pattern.

**Gate**: Service manifest complete with at least one service. Proceed only when gate passes.

### Phase 2: CHECK

**Goal**: Gather health signals for every service in the manifest. Always check process status independently of health file content—a running process and a healthy health file are separate signals.

**Step 1: Check process status**

For each service, run process check:
```bash
pgrep -f "<process_pattern>"
```
Record: running (true/false), PIDs, process count.

**Rationale**: Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient: it may be hung, stuck in crash recovery, or have failed to bind its port.
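
A sketch of recording these signals in shell (the pattern is taken from the example manifest; adapt per service):

```bash
# Record the Step 1 signals for one service.
pattern="gunicorn.*app:app"            # from the service manifest
pids=$(pgrep -f "$pattern" || true)    # empty string when nothing matches
count=$(pgrep -fc "$pattern" || true)  # prints 0 on no match (procps pgrep)
running=false
[ -n "$pids" ] && running=true
```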

**Step 2: Parse health files (if configured)**

Read and parse JSON health files. Evaluate:
- Does the file exist?
- Does it parse as valid JSON?
- How old is the timestamp (staleness)? Default stale threshold is 300 seconds.
- What status does the service self-report?
- What is the connection state?

**Critical constraint**: Never trust health file content alone. The file could be stale from before a process crash. Always verify:
1. Process is still running
2. Health file timestamp is fresh (within configured threshold)
3. Self-reported status is cross-checked against the other signals (e.g., a status of "error" maps to ERROR and a restart recommendation)
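
A minimal staleness check, assuming the health file layout shown under References, GNU `date`, and `jq` on the PATH:

```bash
HEALTH_FILE=/tmp/worker_health.json   # path from the manifest
THRESHOLD=300                         # stale threshold in seconds

if [ ! -f "$HEALTH_FILE" ]; then
    echo "WARNING: health file missing"
elif ! ts=$(jq -er '.timestamp' "$HEALTH_FILE" 2>/dev/null); then
    echo "WARNING: health file unreadable or not valid JSON"
else
    # GNU date shown; BSD date needs -j -f for parsing.
    age=$(( $(date +%s) - $(date -d "$ts" +%s) ))
    if [ "$age" -gt "$THRESHOLD" ]; then
        echo "WARNING: health file stale (${age}s old)"
    else
        echo "fresh (${age}s); self-reported status: $(jq -r '.status' "$HEALTH_FILE")"
    fi
fi
```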

**Step 3: Probe ports (if configured)**

Check if expected ports are listening:
```bash
ss -tlnp "sport = :<port>"
```

**Rationale**: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY.
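
If the installed `ss` rejects filter expressions, grepping the listening table is an equivalent probe (sketch):

```bash
# Fallback: match the port in ss's local-address column.
ss -tln | grep -E ":<port>\b"
```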

**Step 4: Evaluate health per service**

Apply this decision tree (constraints embedded in logic):

1. **Process not running** → **DOWN** (definitive)
2. **Process running + health file missing** → **WARNING** (limited visibility, but process is alive)
3. **Process running + health file stale** (> threshold) → **WARNING** (file hasn't updated in configured time, suggests no activity or crash recovery in progress)
4. **Process running + status=error** → **ERROR** (restart recommended immediately)
5. **Process running + disconnected ≥ 30 minutes** → **WARNING** (long disconnect suggests stuck state, restart recommended)
6. **Process running + disconnected < 30 minutes** → **DEGRADED** (allow reconnection window, monitor)
7. **Process running + port not listening** (when port is configured) → **ERROR** (process running but failed to bind port)
8. **Process running + healthy** → **HEALTHY** (all checks pass)
9. **Process running + no health file configured** → **RUNNING** (limited visibility, process verified only)
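
The same tree as a shell sketch. Argument names and encodings are assumptions (`port_ok` is "true", "false", or "n/a"; `file_state` is "unconfigured", "missing", "stale", or "ok"; `disc_min` is minutes disconnected, 0 meaning connected); the port check is hoisted ahead of the health-file checks so rule 7 applies even when no health file is configured:

```bash
evaluate_service() {
    local running=$1 port_ok=$2 file_state=$3 status=$4 disc_min=${5:-0}
    if [ "$running" != "true" ]; then echo DOWN; return; fi   # rule 1
    if [ "$port_ok" = "false" ]; then echo ERROR; return; fi  # rule 7
    case "$file_state" in
        unconfigured)  echo RUNNING; return ;;                # rule 9
        missing|stale) echo WARNING; return ;;                # rules 2-3
    esac
    if [ "$status" = "error" ]; then echo ERROR; return; fi   # rule 4
    if [ "$disc_min" -ge 30 ]; then echo WARNING              # rule 5
    elif [ "$disc_min" -gt 0 ]; then echo DEGRADED            # rule 6
    else echo HEALTHY; fi                                     # rule 8
}
```

For example, `evaluate_service true true ok healthy 0` prints `HEALTHY`, while `evaluate_service true false ok healthy 0` prints `ERROR`.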

**Gate**: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes.

### Phase 3: REPORT

**Goal**: Produce structured, actionable health report with specific remediation commands.

**Step 1: Generate summary**

```
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name         [OK  ] HEALTHY     PID 12345, uptime 2d 4h
  background-worker    [WARN] WARNING     Health file stale (15 min)
  cache-service        [DOWN] DOWN        Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service
```

**Step 2: Set exit status**
- All HEALTHY/RUNNING → exit 0
- Any WARNING/DEGRADED/ERROR/DOWN → exit 1
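
A sketch of the mapping, assuming per-service statuses were collected into a space-separated variable:

```bash
# $all_statuses is an assumed list such as "HEALTHY WARNING DOWN".
exit_code=0
for s in $all_statuses; do
    case "$s" in
        HEALTHY|RUNNING) ;;    # healthy states keep exit 0
        *) exit_code=1 ;;      # any other state fails the check
    esac
done
exit "$exit_code"
```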

**Step 3: Present to user**
- Lead with the summary line (X/N healthy)
- Highlight any services needing action
- Provide copy-pasteable commands for remediation
- Never auto-restart without an explicit user flag. Always report findings first and let the user decide.

**Gate**: Report delivered with actionable recommendations for all non-healthy services.

---

## Examples

### Example 1: Routine Health Check
User says: "Are all services up?"
Actions:
1. Locate services.json, build manifest (DISCOVER)
2. Check each process, parse health files, probe ports (CHECK)
3. Output structured report showing 3/3 healthy (REPORT)
Result: Clean report, no action needed

### Example 2: Stale Worker Detection
User says: "The background worker seems stuck"
Actions:
1. Identify worker service from config (DISCOVER)
2. Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree
3. Report WARNING with restart recommendation (REPORT)
Result: Specific diagnosis with actionable command

---

## Error Handling

### Error: "No Service Configuration Found"
Cause: No services.json, docker-compose, or systemd units discovered
Solution:
1. Ask user for service name and process pattern
2. Build minimal manifest from user input
3. Proceed with manual configuration

### Error: "Process Pattern Matches Too Many PIDs"
Cause: Pattern too broad (e.g., "python" matches all Python processes)
Solution:
1. Narrow pattern with full command path or arguments
2. Use `ps aux | grep` to identify distinguishing arguments
3. Update manifest with more specific pattern
4. Rationale: False positives hide real failures. Specificity is required to avoid misdiagnosis.
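
For example (narrow pattern taken from the example manifest; adjust to what `ps aux | grep` actually shows):

```bash
# Too broad: matches every Python process on the host.
pgrep -f "python"
# Narrowed with distinguishing arguments from the real command line:
pgrep -f "gunicorn.*app:app"
```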

### Error: "Health File Exists But Cannot Parse"
Cause: Malformed JSON, permissions issue, or file being written during read
Solution:
1. Check file permissions with `ls -la`
2. Attempt raw read to inspect content
3. If mid-write, retry after 2-second delay
4. Report as WARNING with parse error details
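
A retry sketch for the mid-write case (assumes `jq`):

```bash
# One re-read after a 2-second delay in case the file was mid-write.
if ! jq -e . "$HEALTH_FILE" >/dev/null 2>&1; then
    sleep 2
    jq -e . "$HEALTH_FILE" >/dev/null 2>&1 \
        || echo "WARNING: $HEALTH_FILE still unparseable after retry"
fi
```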

---

## References

### Health File Format Reference

Services should write health files as:
```json
{
    "timestamp": "ISO8601, updated every 30-60s",
    "status": "healthy|degraded|error",
    "connection": "connected|disconnected|reconnecting",
    "last_activity": "ISO8601 of last meaningful action",
    "running": true,
    "uptime_seconds": 12345,
    "metrics": {}
}
```
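
A concrete instance of that format (values are illustrative):

```json
{
    "timestamp": "2025-01-15T10:32:05Z",
    "status": "healthy",
    "connection": "connected",
    "last_activity": "2025-01-15T10:31:40Z",
    "running": true,
    "uptime_seconds": 183022,
    "metrics": {}
}
```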

### Key Constraints Summary

| Constraint | Rationale | Application |
|-----------|-----------|-------------|
| Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file |
| Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold |
| Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified |
| No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it |
| Narrow process patterns required | "python" matches all processes, giving false matches | Use full paths or specific args; validate with `ps aux \| grep` |
| Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) |
