---
name: sre-reliability-audit
description: Assess Site Reliability maturity across five dimensions — SLOs/SLIs, runbooks, on-call, postmortems, game days — with per-dimension commentary and uplift path. Static, live (PagerDuty/Opsgenie), and runtime (game day) modes.
argument-hint: [repo-path]
allowed-tools: Read Grep Glob Write Edit Bash(bash:*)
effort: medium
---

# SRE Reliability Audit

<!-- anthril-output-directive -->
> **Output path directive (canonical — overrides in-body references).**
> All file outputs from this skill MUST be written under `.anthril/audits/sre-reliability-audit/`.
> Run `mkdir -p .anthril/audits/sre-reliability-audit` before the first `Write` call.
> Primary artefact: `.anthril/audits/sre-reliability-audit/<artefact>`.
> Do NOT write to the project root or to bare filenames at cwd.
> Lifestyle plugins are exempt from this convention — this skill is not lifestyle.

## When to use

Run this skill when the user mentions:
- SRE audit, reliability maturity
- SLO review, SLI definition, error-budget alerts
- Runbook audit, runbook quality
- On-call review, rotation, escalation
- Postmortem quality, blameless template
- Game day, chaos engineering at the organisational level

Narrative assessment — scores five dimensions 0”“4 rather than emitting findings-with-IDs. SLOs and SLIs (defined, error budgets computed, burn-rate alerts, review cadence), runbook quality (one per paging alert, executable steps, freshness SLA), on-call readiness (rotation, escalation policy, handover template, fair schedule), postmortem culture (blameless template, action-item tracking, retrospective cadence), and game days (scoped chaos experiments, scheduled, learnings documented).

## Before You Start

1. **Determine operating mode.** `--live` reads from PagerDuty/Opsgenie/incident.io via API keys in env (`PD_API_KEY`, `OG_API_KEY`). `--runtime` runs a scoped game-day exercise — **requires non-prod target** and explicit opt-in via `--gameday-confirmed`.
2. **Discover artefacts.** Run the three discovery scripts — `scripts/find-runbooks.sh`, `scripts/find-slos.sh`, `scripts/find-postmortems.sh`.
3. **Load `.sre-ignore`** for suppression.

## User Context

$ARGUMENTS

Runbooks: !`bash "${CLAUDE_PLUGIN_ROOT}/skills/sre-reliability-audit/scripts/find-runbooks.sh"`

SLO files: !`bash "${CLAUDE_PLUGIN_ROOT}/skills/sre-reliability-audit/scripts/find-slos.sh"`

Postmortems: !`bash "${CLAUDE_PLUGIN_ROOT}/skills/sre-reliability-audit/scripts/find-postmortems.sh"`

---

## Audit Phases

### Phase 1: Artefact Discovery

Catalogue:

- SLO/SLI files (`slo.yaml`, `slis.yaml`, `error-budget.md`, OpenSLO)
- Runbook corpus (`runbooks/`, `docs/runbooks/`, `RUNBOOK.md`)
- On-call documentation (`on-call/`, `ONCALL.md`, schedules, escalation policies)
- Postmortem archive (`postmortems/`, `incidents/`)
- Game day logs (`gameday/`, `chaos/`, `game-day-*.md`)
- Alert rule files (cross-referenced from observability-audit if available)

### Phase 2: Dimension Scoring (narrative)

For each of five dimensions, produce a maturity-level narrative (0”“4) and a short prose assessment.

#### D1. SLOs/SLIs (0”“4)
- **0** — No SLOs defined.
- **1** — SLOs defined in a doc but no measurement.
- **2** — SLOs defined and measured; error budget implicit.
- **3** — 2 + multi-window multi-burn-rate alerts.
- **4** — 3 + quarterly review cadence documented; budget-driven release policy.

#### D2. Runbooks (0”“4)
- **0** — No runbooks.
- **1** — A few runbooks exist but coverage is spotty.
- **2** — Every paging alert has a linked runbook with executable steps.
- **3** — 2 + freshness SLA (last-updated < 90 days); runbooks tested in game days.
- **4** — 3 + runbooks generate tickets on drift; auto-linked from alerts.

#### D3. On-call (0”“4)
- **0** — Whoever notices fixes it.
- **1** — One person is oncall, informally.
- **2** — Rotation documented; escalation policy; handover notes.
- **3** — 2 + fair schedule; defined response time targets; compensation / time-in-lieu.
- **4** — 3 + oncall sustainability metrics (pages per shift, false-positive rate) tracked and managed.

#### D4. Postmortems (0”“4)
- **0** — No postmortems written.
- **1** — Occasional postmortems; ad-hoc format.
- **2** — Blameless template used; action items tracked in a tracker.
- **3** — 2 + retrospective cadence; action items completed > 80%.
- **4** — 3 + postmortem library surfaced to onboarding; recurring patterns escalated to platform-level.

#### D5. Game days (0”“4)
- **0** — No game days.
- **1** — Occasional ad-hoc chaos.
- **2** — Scheduled game days (quarterly); scoped to one service.
- **3** — 2 + learnings documented; runbook gaps closed as a result.
- **4** — 3 + automated chaos baseline tests on non-prod CI.

### Phase 3: Cross-dimension Synthesis

- Find the weakest dimension — this is the uplift-first target.
- Find evidence of strong coupling: strong SLOs but weak runbooks = alerts without response.
- Produce a one-paragraph narrative assessment.

### Phase 4: Live Mode (opt-in)

If `--live`:
- Pull incident history from PagerDuty / Opsgenie.
- Compute MTTR by service and trend over 90 days.
- Count Dependabot-style "acknowledge and ignore" on paging alerts.
- Check that on-call rotation is currently populated (nobody missing from the schedule).

### Phase 5: Game Day (opt-in)

If `--runtime --gameday-confirmed`:
- Pick a scoped failure: kill one pod, inject 500ms latency on one service, drop a DNS entry.
- Confirm non-prod context.
- Run the exercise for a pre-set duration (default 20 minutes).
- Measure detection time, response time, rollback time.
- Write `gameday-report.md` with the narrative, timeline, findings, and follow-up actions.

### Phase 6: Reporting

Render `sre-reliability-audit.md` and `sre-reliability-audit.json`. Write `gameday-report.md` if game day was run.

---

## Scoring

Five dimensions Ã— 0”“4 scale. Maturity aggregate:

| Avg score | Maturity tier |
|---|---|
| 3.5+ | Mature — maintain and iterate |
| 2.5”“3.4 | Effective — tighten runbook + postmortem loops |
| 1.5”“2.4 | Developing — focus on one dimension at a time |
| <1.5 | Nascent — start with SLOs and runbooks |

---

## Important Principles

- **An SLO that isn't alerted on is aspirational.** Maturity requires teeth.
- **A runbook that hasn't been tested in a game day probably doesn't work.** Test by running.
- **Oncall without rotation is burnout.** Flag single-person oncall as HIGH.
- **Postmortem = no-blame.** Any language in the archive that names individuals as causes is a tooling/culture problem. Flag it.
- **Game day without learnings recorded is theatre.** Require documented follow-ups.
- **Runtime game day requires explicit opt-in.** Never run against prod without `--i-really-mean-prod --gameday-confirmed`.
- **Australian English. DD/MM/YYYY. Markdown-first.**

---

## Edge Cases

1. **Small team (<5 engineers).** Single-person oncall may be unavoidable; recommend page-time constraints rather than rotation.
2. **No production workload.** Many dimensions N/A — report says so and recommends revisit after launch.
3. **Strong observability + weak SRE.** Instrumentation without process is common; recommend the process work.
4. **Many postmortems but nothing changed.** Action-item backlog is the red flag; flag as HIGH.
5. **Postmortems marked "private".** Can't audit without access. Record as limitation.
6. **On-call via external vendor (eg incident response firm).** Different audit shape; flag for custom assessment.
