---
name: run-evals
description: Run benchmark evaluations for any skill. Reads evals/evals.json from the target skill, spawns parallel with-skill and without-skill agent runs, grades outputs against assertions, aggregates benchmark statistics, and launches the eval viewer. Use when you want to measure or compare skill performance without entering the full skill-creator create/improve flow. Trigger on phrases like "run evals", "run the evals", "benchmark this skill", "run the eval loop", "test this skill against baseline", or when a skill has an evals.json and the user wants to see how it performs.
license: MIT
---

# Run Evals

Runs the benchmarking eval loop for an existing skill. Extracted from `skill-creator` — use this when the skill already exists and you only need to measure it, not create or improve it.

This is one continuous sequence — do not stop partway through.

---

## Inputs

- **Skill path** — absolute path to the skill directory being evaluated (e.g. `~/.claude/skills/writing-architecture-and-design`)
- **Iteration** — auto-detect from existing `evals/workspace/iteration-*/` dirs; use next number, default 1

---

## File Structure

All eval artefacts live inside the skill directory being evaluated:

```
<skill-dir>/
  SKILL.md
  evals/
    evals.json                         ← eval definitions (pre-existing; read by this skill)
    workspace/
      iteration-N/
        <eval-name>/                   ← one directory per eval case
          eval_metadata.json
          with_skill/
            outputs/                   ← skill output files + transcript.md go here
            grading.json               ← grader saves here (one level up from outputs/)
            timing.json                ← captured from task notification
          without_skill/
            outputs/
            grading.json
            timing.json
        benchmark.json                 ← aggregate_benchmark.py output
        benchmark.md
        feedback.json                  ← downloaded from viewer
```

Supporting tools:
- Grader: `~/.claude/skills/skill-creator/agents/grader.md`
- Benchmark: `~/.claude/skills/run-evals/scripts/aggregate_benchmark.py`
- Viewer: `~/.claude/skills/skill-creator/eval-viewer/generate_review.py`

---

## Step 1 — Read evals and prepare

1. Read `<skill-dir>/evals/evals.json`
2. **Validate tier field — mandatory.** For every assertion/expectation in every eval case, verify a `tier` field is present with value `"structural"` or `"nuanced"`. If any assertion is missing `tier`, **stop immediately** and surface an error:

   > "evals.json validation failed: assertion `<name or text>` in eval `<eval_name>` is missing a required `tier` field. Add `\"tier\": \"structural\"` or `\"tier\": \"nuanced\"` to every assertion before running evals."

   Do not proceed until every assertion has a valid tier.

3. Determine the iteration number: check existing `evals/workspace/iteration-*/` dirs; use the next number (1 if none exist)
4. Create `<skill-dir>/evals/workspace/iteration-N/`

---

## Step 2 — Spawn all runs in the same turn

For every eval case in `evals.json`, spawn two subagents **in the same turn** — one with the skill, one without. Launch everything at once; do not do with-skill runs first and come back for baselines.

**With-skill run prompt:**

```
Read the skill at <skill-dir>/SKILL.md and execute this task.

Task: <prompt from evals.json>
Input files: <files from evals.json, or "none">

Save all output files to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/with_skill/outputs/

Save a transcript of your work (what you did, decisions made, tool calls) to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/with_skill/outputs/transcript.md
```

**Without-skill run prompt** (same task, no skill):

```
Execute this task without any skill guidance.

Task: <prompt from evals.json>
Input files: <files from evals.json, or "none">

Save all output files to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/without_skill/outputs/

Save a transcript of your work to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/without_skill/outputs/transcript.md
```

Write `eval_metadata.json` for each eval case immediately (assertions can be empty if not yet in `evals.json`):

```json
{
  "eval_id": 0,
  "eval_name": "<name from evals.json>",
  "prompt": "<prompt>",
  "assertions": []
}
```

Save to `<skill-dir>/evals/workspace/iteration-N/<eval-name>/eval_metadata.json`.

---

## Step 3 — Draft assertions while runs are in progress

Do not wait idle. Use this time to draft assertions from `evals.json` or from observation. Explain each assertion to the user.

If `evals.json` already has assertions in a `description` or `assertions` field, review and explain them — surface any that look non-discriminating (would pass even for a clearly wrong output).

Update `eval_metadata.json` with the confirmed assertions before runs complete.

---

## Step 4 — Capture timing as runs complete

When each subagent task completes, save timing **immediately** — this data comes through the task notification and is not persisted elsewhere.

Save to `<skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/timing.json`:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

---

## Step 5 — Grade all runs

Assertions in `evals.json` carry a required `tier` field: `"structural"` or `"nuanced"` (validated in Step 1). For each run, spawn **two grader subagents in parallel** — one per tier — then merge their results into `grading.json`.

**Structural grader prompt** (spawn with `model: haiku`):

```
Read ~/.claude/skills/skill-creator/agents/grader.md and grade this run.

You are grading STRUCTURAL assertions only — section/gate presence checks, absence of
placeholder text, and explicit text-pattern assertions. You do NOT need to read the
transcript. Grade by examining the output files directly. Skip Steps 4–6 of grader.md
(claims extraction, user notes, eval feedback) — leave those fields empty or omit them.

Expectations (structural tier — assertions where tier = "structural"):
  <list each structural assertion description as a string>

Outputs dir (read output files from here):
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/outputs/

Save grading results to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/grading_structural.json
```

**Nuanced grader prompt** (spawn with `model: sonnet`):

```
Read ~/.claude/skills/skill-creator/agents/grader.md and grade this run.

You are grading NUANCED assertions only — assertions that require understanding
architectural concepts (application layer design, composition root, typing.Protocol vs ABC,
layer isolation semantics, grep command correctness). Read both the transcript and the
output files. Perform full Steps 4–6 of grader.md (claims extraction, user notes, eval
feedback).

Expectations (nuanced tier — assertions where tier = "nuanced"):
  <list each nuanced assertion description as a string>

Transcript path:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/outputs/transcript.md

Outputs dir (read output files from here):
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/outputs/

Save grading results to:
  <skill-dir>/evals/workspace/iteration-N/<eval-name>/{with_skill|without_skill}/grading_nuanced.json
```

**After both graders complete**, merge into `grading.json`:

1. Read `grading_structural.json` and `grading_nuanced.json`
2. Combine `expectations` arrays (structural first, then nuanced)
3. Recalculate `summary`: recount passed/failed/total, recompute pass_rate
4. Take `execution_metrics`, `timing`, `claims`, `user_notes_summary`, `eval_feedback` from the nuanced grader result
5. Write merged result to `grading.json`
6. Delete `grading_structural.json` and `grading_nuanced.json`

If an eval has no nuanced assertions, skip the nuanced grader entirely — write `grading.json` directly from the structural grader output. If an eval has no structural assertions, skip the structural grader and write `grading.json` directly from the nuanced grader output.

Grade all runs before moving to Step 6.

The `grading.json` schema requires `text`, `passed`, and `evidence` fields per expectation (not `name`/`met`/`details`). The viewer depends on these exact field names.

---

## Step 6 — Aggregate benchmark

Run from the `skill-creator` directory:

```bash
cd ~/.claude/skills/run-evals
python -m scripts.aggregate_benchmark \
  <skill-dir>/evals/workspace/iteration-N \
  --skill-name <skill-name>
```

This writes `benchmark.json` and `benchmark.md` to `evals/workspace/iteration-N/`. Show the summary output to the user.

---

## Step 7 — Launch viewer

```bash
nohup python ~/.claude/skills/skill-creator/eval-viewer/generate_review.py \
  <skill-dir>/evals/workspace/iteration-N \
  --skill-name "<skill-name>" \
  --benchmark <skill-dir>/evals/workspace/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
```

For iteration 2+, also pass:
```
--previous-workspace <skill-dir>/evals/workspace/iteration-<N-1>
```

Tell the user the viewer is open and ask them to return when done reviewing.

---

## Step 8 — Read feedback

When the user returns, read `<skill-dir>/evals/workspace/iteration-N/feedback.json`:

```json
{
  "reviews": [
    {"run_id": "bloom-filter-with_skill", "feedback": "...", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the output was fine. Surface patterns and suggest next steps.

Kill the viewer:
```bash
kill $VIEWER_PID 2>/dev/null
```

---

## Notes

- All file paths in subagent prompts must be **absolute**
- Timing data is only available at task completion — capture it immediately per notification
- `evals.json` assertions may use `name` + `description` fields; translate `description` → `text` when building the grader's expectations list
- `aggregate_benchmark.py` reads `<eval-name>/with_skill/grading.json` directly — if grading.json is missing, that eval produces no data point
