---
name: backend-reliability-enforcer
description: Use when implementing backend APIs, data persistence, or external integrations. Enforces TodoWrite with 25+ items. Triggers: "backend API", "REST endpoint", "external integration", "data persistence". If thinking "just a quick endpoint" - use this.
---

# Backend Reliability Enforcer

## Payment Operations?

**If implementing payment, billing, subscription, or financial transactions:**
→ Use `payment-operations-enforcer` skill FIRST (NON-NEGOTIABLE)
→ Then return here for general reliability requirements

---

## TodoWrite Requirements

**CREATE TodoWrite with 5 sections (25+ items total):**

| Section | Min Items |
|---------|-----------|
| Fault Tolerance | 5+ |
| Error Handling | 5+ |
| Data Integrity | 5+ |
| Security | 5+ |
| Observability | 5+ |

---

## Verification Checkpoint

Verify 3 random items have ALL THREE:
- ✓ Concrete numbers ("5 failures/10s", "15s timeout", "100 req/min")
- ✓ Specific tools ("Opossum", "pino", "Joi", "Knex", "Redis")
- ✓ Measurable outcome ("generates UUID v4 correlation ID")

| ❌ FAILS | ✅ PASSES |
|----------|-----------|
| "Add error handling" | "pino logging: UUID v4 correlation ID, format `{correlationId, level, timestamp, service, message}`" |
| "Validate input" | "Joi schema: `amount` (number, positive, max 999999), `currency` (ISO 4217). Return 400 with `{error, fields}`" |
| "Add circuit breaker" | "Opossum: 5 failures/10s threshold, 30s reset, fallback to cached data" |

---

## Section Requirements

### Fault Tolerance (5+ items)

- [ ] Circuit breaker for ALL external calls (Opossum: 5 failures/10s, 30s reset)
- [ ] Retry with exponential backoff (3 retries: 100ms, 200ms, 400ms)
- [ ] Timeouts for all I/O (15s API calls, 5s database)
- [ ] Graceful degradation (cached data, disable non-critical features)
- [ ] Health checks (`/health` endpoint with dependency status)

### Error Handling (5+ items)

- [ ] Comprehensive error handling (try/catch, error middleware)
- [ ] Correlation IDs (UUID v4 on ingress, propagate to all logs/services)
- [ ] Structured logging (pino/winston with correlation ID)
- [ ] Error responses (400/401/403/404/422/500, never expose stack traces)
- [ ] Error metrics/alerts (Prometheus counter, alert if error rate > 5%)

### Data Integrity (5+ items)

- [ ] Input validation at API boundaries (Joi/Zod schemas)
- [ ] Database transactions for consistency (Knex/Prisma transactions)
- [ ] Idempotency keys for retryable operations (`Idempotency-Key` header, Redis 24h)
- [ ] Data retention policies (auto-delete soft-deleted after 90 days)
- [ ] Audit trails for sensitive modifications (who, what, when, old/new)

### Security (5+ items)

- [ ] Authentication on all endpoints (JWT/OAuth/API keys)
- [ ] Authorization checks (RBAC, verify permissions)
- [ ] Input sanitization (parameterized SQL, XSS prevention)
- [ ] Encryption (AES-256 at rest, TLS 1.3 in transit)
- [ ] Rate limiting (100 req/min per IP, express-rate-limit)

### Observability (5+ items)

- [ ] Structured logging (INFO/WARN/ERROR levels)
- [ ] Key metrics (latency P50/P95/P99, throughput, error rate)
- [ ] Distributed tracing (OpenTelemetry with trace/span IDs)
- [ ] Dashboards/alerts (Grafana, PagerDuty for SLO breaches)
- [ ] Runbook documentation (investigation, rollback, escalation)

---

## Red Flags - STOP When You Think:

| Thought | Reality |
|---------|---------|
| "Add reliability later" | 80% never added, retrofitting costs 10x |
| "That edge case is unlikely" | Unlikely × scale = frequent |
| "Circuit breaker is overkill" | One slow service = cascade failure |
| "Customer needs it in 3 days" | Broken code causes MORE delay |

---

## Response Templates

### "Add Reliability Later"

❌ **BLOCKED**: Cannot defer reliability requirements.

- 80% of "later" items never get added
- Retrofitting costs 10x more than building in from start
- Production incidents cost $5K-50K per incident + customer trust

**Required to override:**
1. Specific retrofit date (not "later")
2. Budget allocated (engineer-weeks)
3. Risk acceptance signed by decision maker
4. Interim mitigation plan (reduced traffic limits? 24/7 on-call?)

---

## Self-Grading Before Complete

```
[ ] 25+ items across 5 sections
[ ] 80%+ have concrete numbers
[ ] 80%+ name specific tools
[ ] 100% have measurable outcomes
[ ] 3 random items passed specificity test
[ ] Payment operations: delegated to payment-operations-enforcer

Grade 7+/8: Ready to proceed
Grade <7: Revise TodoWrite
```
