---
name: audit-observability
description: Use when the user says "audit my observability" / "audit logging/tracing/alerting" / "what are we blind to" / "review my Sentry/Datadog setup" — reads the connected observability stack + engineering-context architecture and produces a 3-column matrix (signal × coverage × gap) across errors / traces / logs / alerts / SLOs, plus a top-5 fix list ranked by blast-radius reduction.
---

# Audit Observability

Inventory of what's actually being observed versus what should be. The
user thinks they're covered; the audit shows which critical paths are
silent.

## When to use

- "Audit my observability stack"
- "Audit logging / tracing / alerting"
- "What are we blind to"
- "Review my Sentry / Datadog / Honeycomb / PostHog setup"
- "Am I alerting on the right things"

## Steps

1. **Read engineering context** at
   `../head-of-engineering/engineering-context.md`. If missing or
   empty, tell the user to run Head of Engineering's
   `define-engineering-context` first and stop. I need the
   architecture section — I can't audit coverage without knowing
   what should be covered (which services, which critical paths).

2. **Read config:** `config/observability.json`. If missing, ask
   ONE question naming the best modality, in order of preference:
   connect your observability tool via Composio > paste dashboard
   URLs > describe what's connected. Update
   `config/observability.json` and continue. If the stack is
   "stdout-only" or "none", that's a finding: proceed and frame the
   audit around "foundational gaps".

3. **Pull signal inventory.** Run `composio search observability`.
   For each connected provider, enumerate what it's actually
   collecting:
   - **Sentry / error trackers** — which projects, which
     environments, release tracking on/off, source maps uploaded?
   - **APM / tracing** (Datadog / New Relic / Honeycomb) — which
     services instrumented, trace sample rate, which endpoints
     covered?
   - **Logs** — structured or plain, log level in prod, retention,
     indexable fields.
   - **Metrics / dashboards** — which are actually viewed weekly vs
     stale.
   - **Alerts** — list of active alerts, routes (where they page),
     recent fire history (P1 / P2 / noise).
   - **SLOs** — any defined? If yes, error budgets, burn rate
     alerts.

4. **Enumerate expected coverage.** From the architecture section of
   engineering-context.md, list the critical paths — user flows and
   services where an outage would hurt. For each, ask "what would we
   see if this broke?" That's the coverage bar.

5. **Build the matrix.** 3 columns: **Signal · Coverage · Gap**.
   Rows for: errors, traces on critical paths, log structure, alert
   signal-to-noise, page-worthy thresholds, SLO coverage, release
   tracking, database / external-dependency health. Each row is one
   observation grounded in the signal inventory from step 3 and the
   expected-coverage from step 4.
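
One way to hold the matrix before drafting is a small row type rendered to a Markdown table. This is a minimal sketch: `MatrixRow`, `render_matrix`, and the sample rows are hypothetical names and values for illustration, not part of any connected tool.

```python
from dataclasses import dataclass


@dataclass
class MatrixRow:
    signal: str    # what is being audited, e.g. "Errors"
    coverage: str  # what the step 3 inventory shows
    gap: str       # what step 4 expects that is missing


def render_matrix(rows: list[MatrixRow]) -> str:
    """Render rows as a 3-column Markdown table."""
    lines = ["| Signal | Coverage | Gap |", "| --- | --- | --- |"]
    for row in rows:
        # Escape pipes so cell text cannot break the table layout.
        cells = [c.replace("|", "\\|") for c in (row.signal, row.coverage, row.gap)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical sample rows, for illustration only.
rows = [
    MatrixRow("Errors", "Sentry on web app, prod only", "No release tracking"),
    MatrixRow("SLO coverage", "UNKNOWN", "UNKNOWN"),
]
table = render_matrix(rows)
```

Keeping each row as one observation makes it easy to point a fix back at the exact gap it closes.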

6. **Rank fixes.** Top-5 fixes prioritized by **blast-radius
   reduction** (how much detection / containment time does this save
   when the thing breaks?). For each fix:
   - Title.
   - Gap it closes (row from the matrix).
   - Concrete change (e.g. "add 3 alerts on `/auth` p95 latency,
     error rate, and success rate").
   - Effort estimate (S / M / L).
   - Metric it will move (e.g. "MTTD on auth outages from ~15min
     to <2min").

7. **Flag hard gaps.** If there's no error tracking at all, that's
   #1 regardless of what else the matrix says. Same for "no alerts
   go anywhere a human sees" or "critical path X is not
   instrumented."
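
Steps 6 and 7 together amount to a sort: hard gaps first, then the largest time saving. The sketch below assumes a numeric `minutes_saved` estimate and an effort tie-break; both are illustrative assumptions, since the skill does not prescribe a scoring formula, and the `Fix` type and sample values are hypothetical.

```python
from dataclasses import dataclass

EFFORT_ORDER = {"S": 0, "M": 1, "L": 2}


@dataclass
class Fix:
    title: str
    gap: str                # matrix row this fix closes
    change: str             # concrete change to make
    effort: str             # "S", "M", or "L"
    minutes_saved: float    # assumed estimate of detection/containment time saved
    hard_gap: bool = False  # e.g. no error tracking at all


def rank_fixes(fixes: list[Fix], limit: int = 5) -> list[Fix]:
    """Hard gaps first, then largest time saving, then smallest effort."""
    ordered = sorted(
        fixes,
        key=lambda f: (not f.hard_gap, -f.minutes_saved, EFFORT_ORDER[f.effort]),
    )
    return ordered[:limit]


# Hypothetical candidate fixes, for illustration only.
fixes = [
    Fix("Alert on /auth error rate", "page-worthy thresholds",
        "add error-rate alert on /auth", "S", 13),
    Fix("Add error tracking", "errors",
        "install an error tracker in prod", "M", 5, hard_gap=True),
    Fix("Structured logs", "log structure",
        "emit JSON logs", "L", 13),
]
top = rank_fixes(fixes)
```

Here the error-tracking fix ranks first despite its smaller estimate, because a hard gap overrides the matrix ordering.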

8. **Draft the audit** in this structure:
   - **Summary** — 3 bullets; top gap, top quick-win, overall shape.
   - **What's connected** — brief inventory.
   - **Signal × coverage × gap matrix** — the big table.
   - **Alert signal-to-noise** — list of alerts firing >N times/week
     that aren't page-worthy (alert fatigue).
   - **SLO posture** — what's defined, what's missing.
   - **Top-5 fixes** — ranked, with effort + expected impact.
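
The alert signal-to-noise list can be derived from the fire history gathered in step 3. A rough sketch, assuming each fire is a record with `alert` and `page_worthy` fields (a hypothetical shape); the threshold N is left as a parameter because the skill does not fix a value.

```python
from collections import Counter


def noisy_alerts(fires: list[dict], max_per_week: int, weeks: int = 1) -> list[tuple[str, float]]:
    """Return (alert name, fires per week) for non-page-worthy alerts above the threshold."""
    counts = Counter(f["alert"] for f in fires if not f["page_worthy"])
    noisy = [
        (name, count / weeks)
        for name, count in counts.items()
        if count / weeks > max_per_week
    ]
    # Noisiest first; name as a tie-break for stable output.
    return sorted(noisy, key=lambda item: (-item[1], item[0]))


# Hypothetical fire history covering one week.
history = (
    [{"alert": "disk-usage-warn", "page_worthy": False}] * 12
    + [{"alert": "auth-5xx", "page_worthy": True}] * 3
    + [{"alert": "cron-late", "page_worthy": False}] * 2
)
result = noisy_alerts(history, max_per_week=5)
```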

9. **Write** atomically to `observability-audits/{YYYY-MM-DD}.md`
   (`*.tmp` → rename).
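
The temp-then-rename write might look like the sketch below. `write_audit` and its `root` and `day` parameters are hypothetical, added so the example can run against any directory.

```python
import os
from datetime import date
from pathlib import Path
from typing import Optional


def write_audit(markdown: str, root: Path = Path("."), day: Optional[date] = None) -> Path:
    """Write the audit to a temp file, then rename it into place."""
    day = day or date.today()
    out_dir = root / "observability-audits"
    out_dir.mkdir(parents=True, exist_ok=True)
    final = out_dir / f"{day.isoformat()}.md"
    tmp = final.with_name(final.name + ".tmp")
    tmp.write_text(markdown, encoding="utf-8")
    # os.replace is atomic when source and target are on the same filesystem,
    # so a reader never sees a half-written audit.
    os.replace(tmp, final)
    return final
```

Writing the temp file in the same directory as the target keeps the rename on one filesystem.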

10. **Append to `outputs.json`** — new entry `{ id, type:
    "observability-audit", title, summary, path, status: "ready",
    createdAt, updatedAt }`.
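
A sketch of the append, under stated assumptions: `outputs.json` is taken to hold a top-level JSON array, ids are generated with `uuid4`, and timestamps are ISO-8601 UTC. None of these is confirmed by the skill; only the field names and the `type` and `status` values come from the step above. The helper name `append_output` is hypothetical.

```python
import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path


def append_output(title: str, summary: str, audit_path: str,
                  outputs_file: Path = Path("outputs.json")) -> dict:
    """Append an observability-audit entry, assuming outputs.json holds a JSON array."""
    if outputs_file.exists():
        entries = json.loads(outputs_file.read_text(encoding="utf-8"))
    else:
        entries = []
    now = datetime.now(timezone.utc).isoformat()  # assumed timestamp format
    entry = {
        "id": str(uuid.uuid4()),  # assumed id scheme
        "type": "observability-audit",
        "title": title,
        "summary": summary,
        "path": audit_path,
        "status": "ready",
        "createdAt": now,
        "updatedAt": now,
    }
    entries.append(entry)
    # Assumption: reuse the temp-then-rename pattern so the index is never half-written.
    tmp = outputs_file.with_name(outputs_file.name + ".tmp")
    tmp.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    os.replace(tmp, outputs_file)
    return entry
```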

11. **Summarize to user** — one paragraph covering top gap, top
    quick-win, and the path to the audit. Flag anything marked
    UNKNOWN so the user can fill the gaps.

## Outputs

- `observability-audits/{YYYY-MM-DD}.md`
- Appends to `outputs.json` with `type: "observability-audit"`.
