---
name: incident-recap
description: >
  Generate a monthly incident recap page in Notion, following the standard
  Productboard SRE Guild format. Pulls incident data from the Notion
  "Incident Retrospectives" database, follow-up tasks from Rootly, and
  deploy counts from local git repos (pb-manifests, pb-frontend). Use when
  creating monthly incident recaps, deploy statistics, or reliability summaries.
disable-model-invocation: true
argument-hint: "[month year]"
allowed-tools: >
  Read, Grep, Glob, Bash, AskUserQuestion,
  mcp__notion__notion-fetch,
  mcp__notion__notion-create-pages,
  mcp__notion__notion-update-page,
  mcp__notion__notion-search,
  mcp__notion__notion-query-data-sources,
  mcp__rootly__search_incidents,
  mcp__rootly__listIncidentActionItems
---

# Monthly Incident Recap Generator

Generate a complete monthly incident recap following the Productboard SRE Guild format. The output is a Notion page under the "Monthly Incident Recap" collection.

## Phase 1: Determine the Month

If arguments are provided (`$ARGUMENTS`), parse the month and year from them.

If no arguments are provided, calculate the previous calendar month from today's date (when run in January, that is December of the previous year) and ask the user to confirm:

```
"I'll generate the incident recap for {previous_month} {year}. Is that correct, or would you like a different month?"
```

Use `AskUserQuestion` to confirm. The default suggestion should always be the previous calendar month relative to the current date.

Store the confirmed month as:
- `MONTH_NAME` (e.g., `February`)
- `MONTH_NUM` (e.g., `02`)
- `YEAR` (e.g., `2026`)
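- `NEXT_MONTH_START` (e.g., `2026-03-01`), the exclusive upper bound for date range queries
- `LAST_DAY_OF_MONTH` (e.g., `2026-02-28`), used by the boundary check in Phase 5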
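
The default values can be derived mechanically. The following is a minimal sketch, not part of the skill itself: `recap_period` is a hypothetical helper name, and it also returns the `LAST_DAY_OF_MONTH` value that the Phase 5 boundary check references.

```python
import calendar
from datetime import date


def recap_period(today: date) -> dict[str, str]:
    """Return the Phase 1 values for the calendar month before `today`."""
    # Step back one month, rolling over to December of the previous year.
    if today.month == 1:
        year, month = today.year - 1, 12
    else:
        year, month = today.year, today.month - 1
    # First day of the month that follows the target month.
    if month == 12:
        next_year, next_month = year + 1, 1
    else:
        next_year, next_month = year, month + 1
    # monthrange returns (weekday of day 1, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return {
        "MONTH_NAME": calendar.month_name[month],
        "MONTH_NUM": f"{month:02d}",
        "YEAR": str(year),
        "NEXT_MONTH_START": date(next_year, next_month, 1).isoformat(),
        "LAST_DAY_OF_MONTH": date(year, month, last_day).isoformat(),
    }
```

For example, `recap_period(date(2026, 3, 15))` yields February 2026 with `NEXT_MONTH_START` set to `2026-03-01`. Month names come from `calendar.month_name`, which is English under the default locale.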

## Phase 2: Gather Reference Format

Fetch the previous month's recap to use as a format reference:

1. Use `notion-fetch` to load the "Monthly Incident Recap" parent page: `https://www.notion.so/21697e9c7d4f80afabb4ecd727c007e5`
2. Look for the most recent recap page (the month before the one being generated)
3. Fetch it to extract the exact structure, tone, and formatting

If you cannot find a previous recap, use the template in [recap-template.md](./recap-template.md).

## Phase 3: Query Incidents from Notion Database

Query the **Incident Retrospectives** database to get all incidents for the target month:

```sql
SELECT url, Name, Severity, Status, "Created time", Services
FROM "collection://9081f798-508e-4978-9ca8-5c3aa4a58c4f"
WHERE "Created time" >= '{YEAR}-{MONTH_NUM}-01'
  AND "Created time" < '{NEXT_MONTH_START}'
ORDER BY "Created time" ASC
```

Data source ID: `collection://9081f798-508e-4978-9ca8-5c3aa4a58c4f`

This returns all incident pages for the month with their severity, status, and services.

Then, for **each incident URL** returned, use `notion-fetch` to retrieve the full postmortem page content. Fetch all pages **in parallel** for speed.

From each page, extract:
- **Incident number** (from title, e.g., `#319`)
- **Title** (short description)
- **Date** (from `Started at` timestamp in the page content)
- **Severity** (SEV1/SEV2/SEV3 — already from the query)
- **Service** (from query results)
- **Cause category** — classify as one of:
  - Faulty Deploy (code defect introduced by a deploy; fixed by reverting PR or deploying a fix)
  - Load/Resource (CPU saturation, OOM, Kafka lag from traffic, bulk operations)
  - 3rd Party Outage (external provider failure)
- **Time to Mitigation (TTM)** = `Mitigated at` - `Started at`
- **Time to Resolution (TTR)** = `Resolved at` - `Started at`
- **Root cause summary** (from Summary or Root Cause Analysis section)
- **Follow-up items** (from Ideas for Improvement / Follow Up Items)
- **Discovery source** — classify as one of:
  - Alerting/Monitoring (automated alert, on-call page)
  - Internal detection (Slack report from employee)
  - Customer reports (support ticket, customer message)
- **Was it caused by a faulty deploy?** (yes if reverting a deploy/PR fixed it)
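
The TTM and TTR subtractions above can be sketched as follows. This assumes the `Started at`, `Mitigated at`, and `Resolved at` values can be normalized to ISO 8601 strings; the actual format in the postmortem pages may differ. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def duration_between(started_at: str, ended_at: str) -> timedelta:
    """Elapsed time between two ISO 8601 timestamps.

    Both values must be either timezone-aware or naive; mixing the two
    raises TypeError when they are compared.
    """
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(ended_at)
    if end < start:
        raise ValueError(f"{ended_at} is earlier than {started_at}")
    return end - start


def format_duration(delta: timedelta) -> str:
    """Render a duration as hours and minutes, dropping leftover seconds."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"


# Hypothetical incident: started 14:05, mitigated 14:47, resolved 15:30 (UTC).
ttm = duration_between("2026-02-10T14:05:00+00:00", "2026-02-10T14:47:00+00:00")
ttr = duration_between("2026-02-10T14:05:00+00:00", "2026-02-10T15:30:00+00:00")
```

Here `format_duration(ttm)` gives `42m` and `format_duration(ttr)` gives `1h 25m`. Rejecting an end time that precedes the start surfaces data-entry mistakes in the postmortem instead of producing a negative duration.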

### Classification tips
- A code change that causes a production incident (fixed by reverting or deploying a fix) is a **Faulty Deploy**, not a "Bug"
- Resource exhaustion triggered by traffic patterns, bulk operations, or data volume is **Load/Resource** — even if a code change made the system more vulnerable to load
- Scheduler issues creating duplicate tasks or bulk processing overload = **Load/Resource**
- Frontend build/migration issues (e.g., Vite/Rolldown) that break functionality = **Faulty Deploy**

## Phase 4: Enrich with Rootly Data

Use Rootly MCP to retrieve action items and follow-up tasks for each incident:

1. For each incident, search Rootly using `search_incidents` with the incident title or number
   - If the exact title doesn't match, try broader search terms (e.g., service name, keywords)
   - Rootly titles may differ from Notion titles
2. For each found Rootly incident, use `listIncidentActionItems` to get follow-up tasks
3. Collect all action items across incidents with their status (open/done/cancelled)
4. Note which incidents have completed retrospectives vs pending/skipped

## Phase 5: Pull Deploy Statistics

Run these git commands to count production deploys for the month:

### pb-manifests (Argo deploys)
```bash
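# Explicit midnight times: git resolves a date-only --since/--until using the current time of day.
cd ~/pb/pb-manifests && git fetch --quiet origin main && \
git log origin/main --oneline --since="{YEAR}-{MONTH_NUM}-01T00:00:00" --until="{NEXT_MONTH_START}T00:00:00" | \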
grep 'Update production/' | grep -v 'pb-frontend-fallback' | grep -v 'pb-backend-debug.yaml' | \
sed 's|.*Update production/\([^ ]*\) .*|\1|' | sort | uniq -c | sort -rn
```

### pb-frontend (CDN deploys)
Note: `pb-frontend` uses **`master`** as its default branch (not `main`).
```bash
cd ~/pb/pb-frontend && git fetch --quiet origin master && \
git log origin/master --oneline --since="{YEAR}-{MONTH_NUM}-01T00:00:00" --until="{NEXT_MONTH_START}T00:00:00" | wc -l
```

### Verify boundary dates
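Spot-check the month boundary. The command lists production updates between 20:00 on the last day of the target month (`{LAST_DAY_OF_MONTH}`, e.g., `2026-02-28`) and 12:00 on the first day of the next month. Confirm that commits dated in the next month were not included in the counts above: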
```bash
cd ~/pb/pb-manifests && git log origin/main --oneline --format="%ai %s" \
--since="{LAST_DAY_OF_MONTH}T20:00:00" --until="{NEXT_MONTH_START}T12:00:00" | \
grep 'Update production/' | head -10
```

**Exclusions** (always exclude from counts):
- `pb-frontend-fallback` — not a real deploy
- `pb-backend-debug.yaml` — not a real service
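
The per-service output of the pb-manifests command still has to be summed and combined with the pb-frontend count. A minimal sketch of that step, assuming the `uniq -c` output is captured as text; the function names and the service names in the sample are hypothetical.

```python
def parse_deploy_counts(uniq_output: str) -> dict[str, int]:
    """Parse `uniq -c` lines such as '  42 service-a.yaml' into {name: count}."""
    counts: dict[str, int] = {}
    for line in uniq_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        counts[parts[1].strip()] = int(parts[0])
    return counts


def total_deploys(manifest_counts: dict[str, int], frontend_count: int) -> int:
    """Total deploys = pb-manifests production count + pb-frontend count."""
    return sum(manifest_counts.values()) + frontend_count


# Hypothetical captured output; real service names will differ.
sample_output = """\
  42 service-a.yaml
  17 service-b.yaml
   3 service-c.yaml
"""
manifest_counts = parse_deploy_counts(sample_output)
```

With this sample and a hypothetical pb-frontend count of 120, `total_deploys(manifest_counts, 120)` is 182.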

## Phase 6: Compute Metrics

### Incident categorization
Group incidents by cause:
- Faulty Deploys (code defects fixed by reverting/deploying a fix)
- Load/Resource issues
- 3rd Party Outages

### Time metrics
- **Median TTM**: Sort all TTM values, take median. Report fastest and slowest.
- **Median TTR**: Sort all TTR values, take median. Report fastest and slowest.
- For 3rd party outages, note them separately if they skew metrics.
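
As a sketch of the median step, assuming each TTM or TTR has already been converted to minutes (the helper name `summarize_durations` is hypothetical):

```python
from statistics import median


def summarize_durations(minutes: list[float]) -> dict[str, float]:
    """Median, fastest, and slowest of a list of durations in minutes."""
    if not minutes:
        raise ValueError("no durations to summarize")
    return {
        # statistics.median sorts internally and averages the two middle
        # values when the list has an even length.
        "median": median(minutes),
        "fastest": min(minutes),
        "slowest": max(minutes),
    }
```

For the hypothetical TTM values `[12, 95, 30, 48]`, the sorted middle pair is 30 and 48, so the median is 39 minutes, with 12 as fastest and 95 as slowest. To see how much 3rd party outages skew the result, run the function once on all incidents and once on the list with those incidents removed.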

### Discovery sources
Count how incidents were discovered:
- Alerting/Monitoring: X/N -> Y%
- Internal detection: X/N -> Y%
- Customer reports: X/N -> Y%
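
The `X/N -> Y%` lines can be produced from the per-incident classifications. This is a minimal sketch; `discovery_breakdown` is a hypothetical helper, and rounding to whole percentages is an assumption about the desired presentation.

```python
from collections import Counter

# The three categories defined in Phase 3, in reporting order.
DISCOVERY_SOURCES = ("Alerting/Monitoring", "Internal detection", "Customer reports")


def discovery_breakdown(sources: list[str]) -> list[str]:
    """Format one 'X/N -> Y%' line per discovery source."""
    total = len(sources)
    counts = Counter(sources)
    lines = []
    for name in DISCOVERY_SOURCES:
        # Guard against a month with no incidents.
        share = round(100 * counts[name] / total) if total else 0
        lines.append(f"{name}: {counts[name]}/{total} -> {share}%")
    return lines
```

Because each share is rounded independently, the three percentages may not sum to exactly 100.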

### Follow-up tasks
From Rootly action items:
- Total action items planned
- Completed vs pending count
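
A sketch of the tally, assuming each Rootly action item has been reduced to one of the status strings `open`, `done`, or `cancelled` used in Phase 4; the exact values returned by the Rootly tools may differ, and `action_item_summary` is a hypothetical name.

```python
from collections import Counter


def action_item_summary(statuses: list[str]) -> dict[str, int]:
    """Count action items by status, ignoring case and stray whitespace."""
    counts = Counter(status.strip().lower() for status in statuses)
    return {
        "total": len(statuses),
        "completed": counts["done"],
        "pending": counts["open"],
        "cancelled": counts["cancelled"],
    }
```

Any status outside these three values still counts toward the total, so a gap between the total and the three buckets signals an unexpected status worth checking.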

### Deployment failure rate
- Count incidents caused by faulty deploys (where reverting fixed the issue)
- Total deploys = pb-manifests production count + pb-frontend count
- Rate = faulty deploys / total deploys
- Also express as "1 broken deploy per ~N production deploys"
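
The arithmetic above, as a minimal sketch (`deployment_failure_rate` is a hypothetical helper, and the wording for a month without faulty deploys is an assumption):

```python
def deployment_failure_rate(faulty_deploys: int, total_deploys: int) -> tuple[float, str]:
    """Return the failure rate and its '1 broken deploy per ~N' phrasing."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    rate = faulty_deploys / total_deploys
    if faulty_deploys == 0:
        return rate, "no broken deploys"
    # N is the average number of deploys per faulty deploy, rounded.
    per_faulty = round(total_deploys / faulty_deploys)
    return rate, f"1 broken deploy per ~{per_faulty} production deploys"
```

For hypothetical figures of 3 faulty deploys out of 1200, the rate is 0.0025 (0.25%) and the phrase is "1 broken deploy per ~400 production deploys".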

## Phase 7: Generate the Recap

### Structure
Follow the exact format from the reference recap (Phase 2). The standard structure is:

1. **Header** — `## ℹ️ **Incident Recap – {Month} {Year}**`
2. **Greeting and summary paragraph** — total incidents, total deploys, deploy failure rate
3. **Cause breakdown** — bulleted list grouping incidents by cause with brief descriptions. Categories: faulty deploys, load/resource issues, 3rd party outages.
4. **Affected services** — comma-separated list
5. **Key Metrics** — TTM, TTR, follow-up tasks (from Rootly), discovery sources
6. **Highlights** — 2 focused bullet points on notable patterns and announcements (e.g., recurring themes like load issues, new tooling improvements)
7. **Divider** — `---`
8. **Production Deployment Statistics** — data source description, date collected, deploy count table by service with TOTAL row in `blue_bg`, deployment failure rate section with faulty deploy table, exclusions list, and callout box.

### Formatting rules (Notion-flavored Markdown)
- Tables use `<table>` XML format with `fit-page-width="true"` and `header-row="true"`
- TOTAL row uses `<tr color="blue_bg">`
- Deployment failure rate callout: `::: callout {icon="📈"}`
- Deploy count table: first row is `pb-frontend (CDN updated count)`, then services sorted by count descending
- Faulty deploy table uses `<colgroup>` with `<col width="238.75">` for the Title column
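
The row-ordering and TOTAL rules for the deploy count table can be sketched as below. The `<table>` attributes and the `blue_bg` row color come from the rules above; the `<td>` cell markup and the `Service` / `Deploys` header labels are assumptions, so check them against the reference recap before relying on this output. `render_deploy_table` is a hypothetical helper.

```python
def render_deploy_table(frontend_count: int, service_counts: dict[str, int]) -> str:
    """Build the deploy count table: pb-frontend first, then services by count."""
    rows = [("pb-frontend (CDN updated count)", frontend_count)]
    # Sort services by count, descending; ties keep their original order.
    rows += sorted(service_counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in rows)

    lines = ['<table fit-page-width="true" header-row="true">']
    lines.append("<tr><td>Service</td><td>Deploys</td></tr>")  # assumed header labels
    for name, count in rows:
        lines.append(f"<tr><td>{name}</td><td>{count}</td></tr>")
    lines.append(f'<tr color="blue_bg"><td>TOTAL</td><td>{total}</td></tr>')
    lines.append("</table>")
    return "\n".join(lines)
```

Computing the TOTAL row from the same `rows` list that is rendered keeps the displayed sum consistent with the visible rows.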

## Phase 8: Create Notion Page

1. Create the page under the "Monthly Incident Recap" parent page
2. Use `notion-create-pages` with `parent: { page_id: "21697e9c7d4f80afabb4ecd727c007e5" }`
3. Title: `Incident Recap – {Month} {Year}`
4. Present the Notion page URL to the user

## Phase 9: Review and Adjust

After creating the page, ask the user:

1. "Does everything look correct? Any incidents miscategorized?"
2. "Should I adjust any metrics or highlights?"

If adjustments are needed, use `notion-update-page` with the `update_content` command.

## Notes

- If a postmortem has status "Skipped" and no root cause filled in, note this in the recap
- For follow-up tasks, pull from Rootly action items first, then supplement from Notion "Ideas for improvement" sections
- Always compare key metrics to the previous month when available
- The Incident Retrospectives database data source ID is: `collection://9081f798-508e-4978-9ca8-5c3aa4a58c4f`
- The Monthly Incident Recap parent page ID is: `21697e9c7d4f80afabb4ecd727c007e5`
- Rootly incident titles may differ from Notion titles — use broader search if exact match fails
