---
name: agent-activity-audit
description:
  Audit recent agent transcripts (Claude Code and Codex) to learn how a tool, system, or skill is
  actually being used in the wild. Surfaces failure modes, friction, success patterns, and concrete
  improvement candidates from real session data. Use this when you want to improve a
  developer-facing system that agents interact with regularly.
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent
---

# Agent Activity Audit

This skill executes a structured pass over recent agent transcripts to learn what's working and
what's hurting. The original audit (May 2026) examined ~30 days of Claude Code and Codex sessions to
improve Sibyl itself — see `EXAMPLES.md` for the full reproducible run.

The output is a synthesis report grounded in real session evidence, plus per-group findings files
you can act on directly.

---

## When to use

- You maintain a system that agents call (CLI, MCP server, library, skill) and want signal beyond
  "did it work?"
- You suspect agents are stumbling on something but can't name what.
- A planning cycle is about to start and you want product priorities grounded in usage data, not
  vibes.
- A new release shipped and you want to see how it landed in the wild.

**Not for:** general code review, security audits, performance benchmarking. This skill reads
session transcripts; it doesn't analyze code.

---

## Agent rules (READ FIRST)

1. **Always write artifacts under `contexts/<analysis-name>-<date>/`.** Keep raw scans, episode
   extracts, and findings in one tree so the analysis is reproducible and the user can replay or
   extend it.

2. **Filter early, filter hard.** Most transcripts are noise. Triage with cheap grep before spinning
   up parallel subagents — the goal is to give each subagent ~50-100 KB of focused episode data, not
   raw multi-MB JSONLs.

3. **Partition by date for the swarm.** Date-based partitions are mutually exclusive, cover the full
   window, and make convergence across groups easy to spot (same theme in 4+ date ranges = durable
   issue).

4. **Each subagent writes findings to a file.** Don't let agents return giant prose back to the main
   thread. Their job: produce `findings/group_<X>.md`, return a ≤250-word summary.

5. **Convergence-first synthesis.** A pain point in 4+ groups is durable. Single-group findings
   warrant a sanity check before they're elevated. Count evidence; don't trust impressions.

6. **Verify before recommending fixes.** Inspect current source for the surfaces the audit
   implicates. A finding like "the CLI rejects `--kind gotcha`" should point at the enum's actual
   location.

7. **Capture durable learnings to Sibyl after synthesis.** The point is to feed back into the
   product graph; use `sibyl remember --kind pattern` (or `--kind decision`) on the substantive
   findings.

---

## The Workflow

```
inventory → triage → extract episodes → parallel swarm → synthesis → capture
```

Each step has fall-back behavior if data shape varies between Claude and Codex transcripts.

### Step 1: Inventory

Find all JSONLs in the target window. Claude lives in
`~/.claude/projects/<project-slug>/<uuid>.jsonl`, Codex lives in
`~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. Filter by mtime.

```bash
mkdir -p contexts/<name>-$(date +%F)/triage
cd contexts/<name>-$(date +%F)

# Last 30 days
find ~/.claude/projects -name '*.jsonl' -newermt "$(date -d '30 days ago' +%F)" > triage/claude_files.txt
find ~/.codex/sessions -name '*.jsonl' -newermt "$(date -d '30 days ago' +%F)" > triage/codex_files.txt
cat triage/claude_files.txt triage/codex_files.txt > triage/all_files.txt

wc -l triage/*.txt
```

### Step 2: Triage with the scanner

Use `scripts/scan.py` (ships with this skill) to extract per-file statistics: total events, target
tool-call counts, error counts, user corrections. Runs in parallel.

```bash
SKILL_DIR=$(dirname "$(realpath "$0")")  # or hard-code the path
cat triage/all_files.txt | xargs -P 12 -n 5 python3 "$SKILL_DIR/scripts/scan.py" \
  --target-name <tool-or-skill-keyword> > triage/scan_results.jsonl
```

The scanner detects three shapes of usage:

- **CLI**: Bash/exec_command calls whose command line matches the target's CLI name
- **MCP**: tool names matching `mcp__<target>__*` or `<target>_*`
- **Skill**: Skill invocations whose name matches the target

Look at the output to confirm signal quality before proceeding.

### Step 3: Extract focused episodes

Each "episode" is a target-tool call plus its preceding user message, assistant text, and tool
result. Use `scripts/extract_episodes.py` to write per-session markdown files (~10-30 KB each, vs
the 50-500 KB raw transcripts).

```bash
# Filter to files with actual usage
python3 -c "
import json
for line in open('triage/scan_results.jsonl'):
    r = json.loads(line)
    if r.get('target_total', 0) > 0:
        print(r['path'])
" > triage/using_files.txt

mkdir -p episodes
cat triage/using_files.txt | xargs -P 12 -n 3 python3 \
  "$SKILL_DIR/scripts/extract_episodes.py" --target <name> episodes
```

### Step 4: Partition for the swarm

Partition episodes by date (or by file count if the window is shorter). Aim for groups of 20-60
files each, ~500-1500 KB total payload per group. One agent per partition.

```bash
# Date-based partitions (adjust ranges to your window)
grep -E '^claude-' episodes_dir | awk '{print "episodes/"$0}' > triage/group_A_claude.txt
grep -E '^codex-2026-04-(16|17|18|19|20|21)-' ... > triage/group_B_codex_apr_early.txt
# ... etc

# Pull out outliers as their own groups. If one session has 8 MB+ episodes, give it its own agent.
```

### Step 5: Dispatch the swarm (in parallel)

Send all agents in ONE message with multiple Agent tool calls. Use `run_in_background: true`. Each
agent's prompt should include:

- Goal context (what system, why we're auditing, what good looks like)
- The exact file list (paste it inline; agents won't always reach for files outside their context)
- The output schema (structured headings — see template below)
- The exit shape (≤250 word return summary, full findings to file)

**Findings file template (use this verbatim in agent prompts):**

```markdown
# Group <id> — <description>

## At-a-glance

- Sessions analyzed: N
- Total target tool calls: N (with CLI / MCP / skill breakdown)
- Errored calls: N
- Date range: first → last ts
- Projects represented: list
- Net assessment: Helping / Hurting / Mixed (one sentence)

## Usage patterns (ranked by frequency)

What did agents reach for the target to do? How often? How well?

## Top failure modes (with evidence)

Verbatim error message, frequency, blast radius, session refs.

## What genuinely helped

Concrete wins with citations.

## UX friction

Confusing CLI/output, subcommand naming, output formatting, etc.

## User reactions

Direct user messages about the target — corrections, complaints, praise.

## Improvement ideas (ranked by impact)

1. [Issue] → [Specific fix]
   - Evidence (session refs)
   - Why it matters
2. ...

## Surprises
```

### Step 6: Build cross-cutting data

While agents work, do the prep that needs the full corpus, not partitions:

- **Error catalog**: classify all error outputs by pattern. Most-common categories should match what
  subagents independently find.
- **Workflow stats**: did sessions follow the full lifecycle?
  `sessions_using_target / sessions_capturing_knowledge / sessions_completing_lifecycle`.
- **Retry loops**: same command run 3+ times in a row in any session → signals stuck behavior.
- **User corrections**: short user messages mentioning the target tool + reaction words ("ugh",
  "broken", "wrong", "stop") → real feedback.

### Step 7: Synthesize

Read all `findings/group_*.md`, the cross-cutting data, and current source code for the surfaces
implicated. Write `SYNTHESIS.md` with:

- Executive summary (≤200 words) with net assessment
- Methodology
- Baseline metrics
- What's working (defend these surfaces)
- What's broken (P0/P1/P2/P3 with evidence)
- Counterintuitive findings
- Recommended fixes table (priority × effort × impact)
- Cross-cutting observations
- Process notes
- Artifact appendix

**Cardinal rule:** every claim should cite specific session files. "Internal Server Error" with no
file reference is a vibe; "21 ISE responses in 35 minutes across 5 sessions, e.g.
`codex-2026-04-21-019db33d.md` ep.3" is evidence.

### Step 8: Capture durable findings to Sibyl

The audit is itself a learning opportunity. For each P0/P1 finding:

```bash
sibyl remember "Sibyl gap: --kind enum drift" "CLI --help lists 9 kinds, API accepts 29; agents
hit Pydantic enum rejections on 'gotcha', 'learning', 'review'. Source: entities.py EntityType
vs main.py remember --help. Audit: contexts/sibyl-analysis-2026-05-14/SYNTHESIS.md §4 P1." \
  --kind error_pattern --tags audit,cli,enum
```

Keep these scoped to the project being audited; future sessions on that project should find them via
`sibyl recall`.

---

## Quality bar

A good audit:

- Has at least 3 convergent findings (same theme in 4+ partitions).
- Quantifies impact (calls/month, sessions affected, minutes wasted) rather than naming severity in
  the abstract.
- Names current code locations for every recommended fix.
- Distinguishes design issues from operational issues from documentation issues.
- Identifies what's working so the team knows what _not_ to change.
- Captures surprises — the patterns that contradict the team's prior model.

A bad audit:

- Reads like a list of complaints.
- Has findings that only appear in one session.
- Recommends fixes without naming code paths.
- Conflates "the system is bad" with "I'm bad at using the system."
- Misses what's working.

---

## Scaling considerations

- **Big sessions**: any single transcript > 5 MB of episodes deserves its own subagent. The May 10
  monster session (8.4 MB, 4409 episodes spanning 4 days) needed strategic sampling — read
  start/middle/end + all error blocks, not top-to-bottom.
- **Cold sessions**: transcripts where the target tool was barely used are still data. They tell you
  the agent _didn't reach_ for the tool. That's its own finding.
- **Cross-project bleed**: if the target lives in one repo but is called from many, partition by cwd
  as well as date.
- **Multi-client**: Claude and Codex have different transcript schemas. The scanner handles both but
  findings should note any client-specific patterns (e.g., Codex agents read SKILL.md every session;
  Claude agents launch the skill differently).

---

## Caveats

- **Survivorship bias**: agents who got stuck and gave up early produce shorter transcripts. Don't
  conclude the system is fine from a sample of finished work.
- **User-message false positives**: filter out boilerplate (`# AGENTS.md`, `<INSTRUCTIONS>`, long
  task prompts) before flagging "user reactions." Real reactions are short, in lowercase, and often
  profane.
- **"OUTPUT (ERROR)" over-inclusion**: the episode extractor flags errors heuristically. Filter
  again on `Process exited with code 1` or `✗` markers before counting real failures.
- **Don't fix surfaces the team is already redesigning.** Check `sibyl recall <topic>` before
  writing up a recommendation — the work might already be in flight.

---

## See also

- `EXAMPLES.md` — full worked example: the 2026-05-14 Sibyl self-audit
- `scripts/scan.py` — the parallel JSONL scanner
- `scripts/extract_episodes.py` — focused-context episode extractor
- The `sibyl` skill — for capturing audit findings back into the graph
