---
name: fortify
description: Systematic ablation study runner. After research:run finds improvements, fortify identifies component candidates from git diff + diary, creates isolated git worktrees per ablation (main repo never modified), runs metric+guard in each worktree, ranks component importance, and optionally generates reviewer Q&A calibrated to a target venue.
argument-hint: '[|] [--venue ] [--max-ablations ] [--skip-run]'
effort: high
allowed-tools: Read, Write, Edit, Bash, Grep, Glob, Agent, TaskCreate, TaskUpdate, AskUserQuestion
disable-model-invocation: true
---

Systematic ablation study runner — after `/research:run` finds improvements, fortify identifies which components contributed, generates ablation variants (remove one component at a time), runs each variant in **isolated git worktrees** (main repo never modified), ranks component importance, and optionally generates reviewer Q&A material calibrated to a venue.

NOT for: running the initial optimization loop (use `/research:run`); validating methodology before running (use `/research:judge`); verifying paper-vs-code consistency (use `/research:verify`); hypothesis generation (use `research:scientist` directly). Fortify exclusively runs ablation studies on completed runs.

```yaml
MAX_ABLATION_CANDIDATES: 8 (ceiling — scientist produces 3–8; --max-ablations caps further)
METRIC_TIMEOUT_MS: 360000 (6 min — same as run SKILL.md)
GUARD_TIMEOUT_MS: 360000
GIT_OP_TIMEOUT_MS: 15000
SANITY_DIVERGENCE_PCT: 2.0 (full-variant vs best_metric mismatch threshold)
IMPORTANCE_CLASS_CRITICAL: 50.0 (% of full metric lost)
IMPORTANCE_CLASS_SIGNIFICANT: 10.0
FORTIFY_DIR_BASE: .experiments
STATE_DIR_BASE: .experiments/state
```

## Agent Resolution

> **Foundry plugin check**: run `Glob(pattern="foundry*", path="$HOME/.claude/plugins/cache/")`; any results mean foundry is installed. If the check fails, proceed as if foundry is available (the common case); fall back only if agent dispatch explicitly fails.


`research:scientist` is in the same plugin as this skill — no fallback needed if the research plugin is installed.

## CRITICAL: Worktree-based isolation

**Do NOT use `git checkout -b ` for ablations** — this dirties the main working tree and corrupts concurrent tool calls. Each ablation gets its own git worktree under `$FORTIFY_DIR/worktrees/`, created from `best_commit`. Main working tree is NEVER modified. Cleanup: `git worktree remove --force` per variant; `git worktree prune` on interrupt.

## Fortify Mode (Steps F1–F8)

Triggered by `fortify` or `fortify `.

**Task tracking**: create tasks for F1, F2, F3, F4, F5, F6, F7, F8 at start — before any tool calls.

## Step F1: Locate source run and validate judge approval

**Input resolution** (priority order):

1. Explicit `` argument → read `.experiments/state//state.json`
2. Explicit `` argument → scan `.experiments/state/*/state.json` for matching `program_file`, pick latest with `status: completed` or `status: goal-achieved`
3. No argument → scan `.experiments/state/`, pick latest with `status: completed` or `status: goal-achieved`
4. None found → stop:

```text
fortify: No completed run found. Run /research:run first.
```

**Guard: judge approval required.** The judge skill writes its verdict to `.temp/output-judge--.md` — scan that location for an APPROVED verdict line:

```bash
JUDGE_VERDICT_FILE=$(ls -t .temp/output-judge-*.md 2>/dev/null | head -1)  # timeout: 5000
if [ -z "$JUDGE_VERDICT_FILE" ]; then
    echo "fortify: BLOCKED — no judge verdict found in .temp/."
    echo "Ablation studies require an approved baseline. Run: /research:judge "
    exit 1
fi
JUDGE_VERDICT=$(grep -i '^[*]*[Vv]erdict[*]*:' "$JUDGE_VERDICT_FILE" | head -1 | sed -E 's/.*[Vv]erdict[*: ]+//;s/[* ].*//')
```

Verify `JUDGE_VERDICT == "APPROVED"` AND the verdict file references the same `program_file` (grep the file for the program path). If verdict is not APPROVED or file mismatches:

```text
fortify: BLOCKED — no APPROVED judge verdict found for this program.
Ablation studies require an approved baseline. Run: /research:judge
```

> Note: do NOT infer from `methodology.md` alone — `methodology_rating: sound` is one input to the verdict, not the verdict itself. Only the `## Verdict` line in the judge output file is authoritative.

Read from `state.json`: `goal`, `best_metric`, `best_commit`, `config` (including `metric_cmd`, `guard_cmd`, `compute`), `program_file`. Also read `baseline_commit` — iteration 0 commit from `experiments.jsonl` (first line, `status: "baseline"`, field `"commit"`).

**Pre-compute run directory** (each in separate Bash call):

```bash
# The fallback must test for empty output: `pipeline || echo main` never fires,
# because the pipeline's exit status is that of `tr`, which succeeds.
BRANCH=$(b=$(git branch --show-current 2>/dev/null | tr '/' '-'); echo "${b:-main}")  # timeout: 3000
TS=$(date -u +%Y-%m-%dT%H-%M-%SZ)  # timeout: 3000
FORTIFY_DIR=".experiments/fortify-$TS"  # timeout: 5000
WORKTREE_BASE="$FORTIFY_DIR/worktrees"
mkdir -p "$FORTIFY_DIR" "$WORKTREE_BASE"
```

## Step F2: Identify ablation candidates via scientist

Gather two inputs for the scientist:

1. **Git diff**: run `git diff ... --stat` (summary) and full `git diff ...`. If full diff exceeds ~200 lines, write to `$FORTIFY_DIR/diff.txt` via Write tool; otherwise inline in prompt.
2. **Experiment history**: paths to `experiments.jsonl` and `diary.md` from the source run directory.

Spawn `research:scientist` via `Agent(subagent_type="research:scientist", prompt="...")` with health monitoring (15-min cutoff, one 5-min extension — same pattern as judge J3):

```markdown
Act as an ML ablation study designer.

Read:
- git diff at /diff.txt (or inline if small)
- experiments.jsonl at (filter for entries with status: "kept")
- diary.md at (if exists)

Identify 3–8 distinct logical components that were changed during this run.
A component = a logically independent change that can be removed independently.

For each component produce one JSON line to /ablation-candidates.jsonl:
{ "component_id": , "name": "", "description": "", "files": [""], "revert_commits": [""], "expected_importance": "HIGH|MEDIUM|LOW" }

Write your analysis to /candidates-analysis.md.
Include ## Confidence block.
Return ONLY: {"status":"done","components":N,"file":"/ablation-candidates.jsonl","confidence":0.N}
```

**Health monitoring** (CLAUDE.md §8):

```bash
LAUNCH_AT_F2=$(date +%s)
CHECKPOINT_F2="/tmp/fortify-check-$LAUNCH_AT_F2"
touch "$CHECKPOINT_F2"  # timeout: 3000
```

Poll every 5 min: `find -newer "$CHECKPOINT_F2" -type f | wc -l` (`timeout: 5000`) — new files = alive; zero = stalled.

- **Hard cutoff: 15 min** no file activity → timed out
- **One extension (+5 min)**: if `tail -20 /candidates-analysis.md` shows active progress, grant one extension; second stall = hard cutoff
- **On timeout**: stop with `"fortify: Scientist timed out. Check / for partial output."`; surface with ⏱

Read `ablation-candidates.jsonl` after scientist completes.

If `--max-ablations ` specified and component count + 1 (for full variant) exceeds M: sort by `expected_importance` (HIGH first, then MEDIUM, then LOW), keep top M-1 components plus always include the `full` sanity-check variant.

**`--skip-run` early exit**: if the `--skip-run` flag is present, print the candidate table (component_id, name, description, files, expected_importance), then print `"fortify: --skip-run — candidates identified. Next: /research:fortify without --skip-run"`. Do not execute any ablations: skip F3–F7 and jump to F8 (skip-run variant).

## Step F3: Generate ablation variants

For each component from F2, there is one ablation variant: `no-` (slugified — lowercase, spaces replaced with hyphens). Plus one `full` variant (sanity check — should reproduce `best_metric`).

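The `--max-ablations` trimming from F2 and the variant naming rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the skill: `slugify` and `plan_variants` are hypothetical helper names, the component names in the usage note are made up, and collapsing runs of whitespace into a single hyphen is an assumption beyond the stated "lowercase, spaces replaced with hyphens" rule.

```python
from typing import Dict, List, Optional

# Sort key for expected_importance: HIGH first, then MEDIUM, then LOW.
IMPORTANCE_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def slugify(name: str) -> str:
    """Lowercase the name and join its words with hyphens.

    Assumption: runs of whitespace collapse to a single hyphen.
    """
    return "-".join(name.lower().split())


def plan_variants(components: List[Dict[str, str]],
                  max_ablations: Optional[int] = None) -> List[str]:
    """Return variant names: always `full`, then one `no-<slug>` per kept component."""
    selected = list(components)
    if max_ablations is not None and len(selected) + 1 > max_ablations:
        # list.sort is stable, so components of equal importance keep their order.
        selected.sort(
            key=lambda c: IMPORTANCE_ORDER.get(c["expected_importance"], len(IMPORTANCE_ORDER))
        )
        # Keep the top M-1 components; the remaining slot belongs to `full`.
        selected = selected[: max(max_ablations - 1, 0)]
    return ["full"] + ["no-" + slugify(c["name"]) for c in selected]
```

For example, with three hypothetical components rated LOW, HIGH, and MEDIUM and `max_ablations=3`, the sketch keeps `full` plus the HIGH and MEDIUM components and drops the LOW one.
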
Write variant configs to `$FORTIFY_DIR/variants.jsonl` via Write tool — one JSON line per variant:

```json
{"variant_name": "full", "component_removed": null, "revert_commits": [], "revert_strategy": "none"}
{"variant_name": "no-", "component_removed": "", "revert_commits": ["", ""], "revert_strategy": "git-revert"}
```

## Step F4: Run ablation variants via worktrees

Run each variant **sequentially** to avoid git worktree conflicts.

**Before loop — store original working directory:**

```bash
ORIG_DIR="$(pwd)"  # timeout: 3000
```

**On interrupt** (user abort or unexpected error mid-loop): `cd "$ORIG_DIR"` first, then run `git worktree prune` (`timeout: 15000`) to clean up any partially created worktrees before exiting.

For each variant in `variants.jsonl`:

**4a. Create isolated worktree at best_commit:**

```bash
git worktree add "$WORKTREE_BASE/"  # timeout: 15000
```

**4b. Navigate into worktree** (two separate Bash calls — cd first, then command):

```bash
cd "$WORKTREE_BASE/"  # timeout: 3000
```

**4c. Apply revert (skip for `full` variant):**

For `full` variant: no changes — proceed directly to 4d.

For `no-` variant: revert the component's commits. **IMPORTANT — order matters**: revert in **reverse chronological order** (newest first) to avoid conflicts. If `revert_commits` from `variants.jsonl` is in chronological order (oldest first, e.g. as scientist returned them), reverse before reverting:

```bash
# Sort newest-first for conflict-free revert
REVERT_COMMITS_SORTED=$(echo " ..." | tr ' ' '\n' | tac | tr '\n' ' ')
git revert $REVERT_COMMITS_SORTED --no-edit  # timeout: 15000
```

If revert produces merge conflicts: append `{"variant":"","status":"revert-conflict",...}` to `results.jsonl`, jump to 4f (cleanup).

**4d. Run metric_cmd in worktree:**

```bash
# timeout: 360000
```

Parse stdout for numeric metric value. If command fails or no numeric output: record `status: "metric-failed"`, jump to 4f.

**4e. Run guard_cmd in worktree:**

```bash
# timeout: 360000
```

Record guard result: `"pass"` (exit 0) or `"fail"` (non-zero).

**4f. Cleanup worktree (INVARIANT — must execute even if 4c/4d/4e fail):**

```bash
cd "$ORIG_DIR"  # timeout: 3000
```

```bash
git worktree remove --force "$WORKTREE_BASE/"  # timeout: 15000
```

**4g. Record result** — append one JSON line to `$FORTIFY_DIR/results.jsonl`:

```json
{"variant":"","component_removed":"","metric":0.0,"delta_from_full":0.0,"delta_pct":0.0,"guard":"pass|fail","status":"completed|revert-conflict|metric-failed|timeout","timestamp":""}
```

`delta_from_full` and `delta_pct` are placeholders here — computed in post-loop step below.

After all variants processed:

```bash
git worktree prune  # timeout: 15000
```

**Post-loop delta computation**: read `results.jsonl`, find `full` variant metric. For each completed `no-` variant:

- `delta_from_full = ablated_metric - full_metric`
- `delta_pct = (delta_from_full / abs(full_metric)) * 100` (signed; for a `higher`-is-better metric, negative means removing the component hurt. F5 applies `metric_direction`, so this raw delta is not direction-adjusted). If `full_metric == 0`: set `delta_pct = 0` (avoid division by zero; metric is already at zero baseline).

Update `results.jsonl` with computed deltas via Write tool (rewrite full file).

## Step F5: Rank component importance

For each `no-` variant with `status: "completed"`:

- Read `metric_direction` from `## Metric` block in `program_file` (`higher` or `lower`). If absent, default to `higher`.
- Compute **signed delta** (positive = removal hurt the metric → component is helpful):

```python
# direction is metric_direction ('higher' or 'lower'); importance is the signed delta in percent
signed_delta = (full_metric - ablated_metric) * (1 if direction == 'higher' else -1)
importance = signed_delta / abs(full_metric) * 100 if full_metric != 0 else 0
```

- Importance class (helpful components — `signed_delta >= 0`):
  - **CRITICAL**: importance > 50%
  - **SIGNIFICANT**: importance 10–50%
  - **MARGINAL**: importance < 10%
- **Potentially Harmful** class: signed delta below −5% of the full metric (`importance < -5`), meaning that removing the component IMPROVED the metric. Surface in dedicated `Potentially Harmful Components` report section; not ranked in main table.

Sort by importance descending (helpful components only).

Write ranked results to `$FORTIFY_DIR/importance-ranking.json` via Write tool — JSON array of objects with fields: `rank`, `component`, `full_metric`, `ablated_metric`, `signed_delta_pct`, `importance_pct`, `class` (`CRITICAL`/`SIGNIFICANT`/`MARGINAL`/`HARMFUL`).

**Sanity check**: compare `full` variant metric against `best_metric` from `state.json`. If divergence exceeds `SANITY_DIVERGENCE_PCT` (2%):

```text
Warning: Sanity check failed: full-variant metric= differs from best_metric= by %.
Results may be unreliable (non-deterministic metric or environment change).
```

Include this warning prominently in the F7 report.

## Step F6: Reviewer Q&A (optional — `--venue` only)

Skip entirely if no `--venue` flag. Supported venues: `CVPR`, `NeurIPS`, `ICML`, `workshop`.

Spawn `research:scientist` via `Agent(subagent_type="research:scientist", prompt="...")` with health monitoring (same 15-min cutoff, one 5-min extension):

```markdown
Act as a peer reviewer for .

Read:
- ablation results at /results.jsonl
- importance ranking at /importance-ranking.json
- original program.md at

Generate:
1. 5–7 likely reviewer questions calibrated to standards (CVPR/NeurIPS/ICML: expect thorough ablations, statistical significance, compute budget justification; workshop: lighter bar)
2. For each question: a data-backed answer referencing specific ablation results
3. A supplementary material draft section with the ablation table (LaTeX-ready)

Write to /reviewer-qa.md.
Include ## Confidence block.
Return ONLY: {"status":"done","questions":N,"file":"/reviewer-qa.md","confidence":0.N}
```

**Health monitoring**: same as F2 (15-min cutoff, one extension). On timeout: note `"Reviewer Q&A: timed out"` in report, continue to F7.

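The F5 ranking and sanity check feed the tables in the report written next, so here is a minimal Python sketch of that arithmetic. It is an illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, the −5%..0% band (which F5 leaves unclassified) is treated as MARGINAL, and the sanity divergence is assumed to be measured relative to `best_metric`.

```python
from typing import Dict, List, Tuple

CRITICAL_PCT = 50.0          # IMPORTANCE_CLASS_CRITICAL
SIGNIFICANT_PCT = 10.0       # IMPORTANCE_CLASS_SIGNIFICANT
HARMFUL_PCT = -5.0           # Potentially Harmful threshold from F5
SANITY_DIVERGENCE_PCT = 2.0  # full-variant vs best_metric mismatch threshold


def importance_pct(full_metric: float, ablated_metric: float,
                   direction: str = "higher") -> float:
    """Signed importance in percent; positive means removal hurt the metric."""
    if full_metric == 0:
        return 0.0
    sign = 1 if direction == "higher" else -1
    return (full_metric - ablated_metric) * sign / abs(full_metric) * 100


def classify(importance: float) -> str:
    """Map a signed importance percentage to an F5 class."""
    if importance < HARMFUL_PCT:
        return "HARMFUL"
    if importance > CRITICAL_PCT:
        return "CRITICAL"
    if importance >= SIGNIFICANT_PCT:
        return "SIGNIFICANT"
    # Assumption: the -5%..0% band is not classified in F5; treated as MARGINAL here.
    return "MARGINAL"


def rank_components(full_metric: float, ablated: Dict[str, float],
                    direction: str = "higher") -> Tuple[List[dict], List[dict]]:
    """Return (helpful rows ranked by importance descending, harmful rows)."""
    rows = []
    for name, metric in ablated.items():
        pct = importance_pct(full_metric, metric, direction)
        rows.append({"component": name, "importance_pct": pct, "class": classify(pct)})
    helpful = sorted((r for r in rows if r["class"] != "HARMFUL"),
                     key=lambda r: r["importance_pct"], reverse=True)
    for position, row in enumerate(helpful, start=1):
        row["rank"] = position
    harmful = [r for r in rows if r["class"] == "HARMFUL"]
    return helpful, harmful


def sanity_divergence_pct(full_metric: float, best_metric: float) -> float:
    """Divergence of the full variant from best_metric, in percent.

    Assumption: measured relative to best_metric.
    """
    if best_metric == 0:
        return 0.0 if full_metric == 0 else float("inf")
    return abs(full_metric - best_metric) / abs(best_metric) * 100
```

With a hypothetical full metric of 80.0 and ablated metrics of 30.0, 76.0, and 88.0, the three components come out as CRITICAL, MARGINAL, and HARMFUL respectively; the harmful one is returned separately so it stays out of the main ranking table.
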
## Step F7: Write fortify report

Pre-compute branch (if not already set):

```bash
# Fallback tests for empty output; `pipeline || echo main` never fires because `tr` succeeds.
BRANCH=$(b=$(git branch --show-current 2>/dev/null | tr '/' '-'); echo "${b:-main}")  # timeout: 3000
```

Write full report to `.temp/output-fortify-$BRANCH-$(date +%Y-%m-%d).md` via Write tool (never overwrite — append counter suffix if slug exists, e.g. `-2.md`):

```markdown
## Fortify Report:

**Source run**:
**Date**:
**Baseline commit**:
**Components identified**:
**Ablations run**: of

### Sanity Check (full variant)

Full metric: (expected from run: ) — PASS | Warning MISMATCH (% divergence)

### Component Importance Ranking

| Rank | Component | Full Metric | Ablated Metric | Signed Δ | Importance | Class    |
|------|-----------|-------------|----------------|----------|------------|----------|
| 1    | ...       | ...         | ...            | +X.X%    | X.X%       | CRITICAL |

### Potentially Harmful Components

Components whose removal **improved** the metric (signed_delta < −5%) — likely added noise or hurt performance.

| Component | Full Metric | Ablated Metric | Signed Δ |
|-----------|-------------|----------------|----------|
| ...       | ...         | ...            | −X.X%    |

(Omit section entirely if no harmful components found.)

### Ablation Matrix

| Variant       | Metric | Guard | Status           | Delta from Full |
|---------------|--------|-------|------------------|-----------------|
| full          | ...    | pass  | completed        | baseline        |
| no-component1 | ...    | pass  | completed        | -X.X%           |
| no-component2 | ...    | n/a   | revert-conflict  | n/a             |

### Skipped Variants

### Reviewer Q&A

Full artifacts: /

## Confidence

**Score**: 0.N — [high|moderate|low]
**Gaps**:
- [specific limitation]
```

## Step F8: Terminal summary

Print compact terminal summary:

```text
---
Fortify —

Source run:
Sanity: full= (expected ) — PASS | Warning MISMATCH
Components: identified · ablations completed
Top: (importance: X.X% · CRITICAL|SIGNIFICANT|MARGINAL)
Marginal: components < 10% each
Venue Q&A: generated for | n/a

-> saved to .temp/output-fortify--.md
-> ablation artifacts: /
---
Next: simplify model by removing marginal components, re-run /research:run
```

If `--skip-run` was used (early exit at F2): replace ablation lines with:

```text
---
Fortify — (--skip-run)

Source run:
Components: candidates identified — ablations not executed

-> candidates: /ablation-candidates.jsonl
-> analysis: /candidates-analysis.md
---
Next: run /research:fortify without --skip-run to execute ablations
```

- **Worktree invariant** — cleanup (`git worktree remove --force`) must execute even if metric/guard fails. Never leave stale worktrees. Final `git worktree prune` catches any missed cleanup.
- **Main repo never modified** — all ablation work happens in worktrees. Main working tree stays clean throughout.
- **Sequential execution** — variants run one at a time. Parallel worktrees would require separate detached HEADs and complicate cleanup.
- **No compound Bash commands** — always two separate Bash calls (cd then command). CWD persists between calls.
- **Bash tool `timeout` parameter** — never shell `timeout` wrapper. Pass `timeout: ` on Bash tool call.
- **Judge prerequisite** — fortify refuses to run without an APPROVED judge verdict. This prevents ablation studies on unapproved methodologies.
- **`--skip-run` for planning** — generates candidate list without running ablations. Useful for reviewing what would be ablated before committing compute.
- **`--skip-run` scope**: this flag skips ablation *execution* only — the source run (`research:run`) must already be complete before invoking fortify with `--skip-run`. It does not affect the source run.
- **Fortify run directories** don't write `result.jsonl`, so they are exempt from the automated 30-day TTL cleanup (per the `.claude/rules/artifact-lifecycle.md` TTL policy: no `result.jsonl` = cleanup skipped). Remove them manually when no longer needed (`rm -rf .experiments/fortify-*/`).
- **Compute mode**: local execution only. `--compute` and `--colab` passthrough not implemented — contributions welcome. Until then, fortify runs `metric_cmd`/`guard_cmd` directly in each worktree on the local machine.
- **Revert conflicts expected** — when commits are interleaved (component A's commit touches same lines as component B's), revert may conflict. This is recorded as `revert-conflict` and reported, not treated as an error.
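
The worktree invariant and the revert-conflict handling in the notes above amount to a try/finally shape. The Python sketch below illustrates only that control flow; the skill itself issues these steps as separate Bash tool calls. `run_variant`, `Runner`, and `subprocess_runner` are hypothetical names, `--detach` on `git worktree add` is an assumption (the skill's exact command arguments are not shown here), and metric parsing and timeouts are omitted.

```python
import subprocess
from typing import Callable, Optional, Sequence

# (argv, cwd) -> exit code. Injectable so the flow can be exercised without git.
Runner = Callable[[Sequence[str], Optional[str]], int]


def subprocess_runner(argv: Sequence[str], cwd: Optional[str] = None) -> int:
    """Run argv without a shell and return its exit code."""
    return subprocess.run(list(argv), cwd=cwd).returncode


def run_variant(worktree: str, commit: str, revert_commits: Sequence[str],
                metric_argv: Sequence[str], guard_argv: Sequence[str],
                run: Runner = subprocess_runner) -> dict:
    """Run one ablation variant; the worktree is removed on every exit path."""
    result = {"status": "completed", "guard": "n/a"}
    # Assumption: a detached worktree at the given commit (step 4a).
    if run(["git", "worktree", "add", "--detach", worktree, commit], None) != 0:
        raise RuntimeError(f"could not create worktree {worktree}")
    try:
        # revert_commits is assumed to be newest-first already (step 4c).
        if revert_commits and run(["git", "revert", "--no-edit", *revert_commits], worktree) != 0:
            result["status"] = "revert-conflict"  # recorded and reported, not an error
            return result
        if run(metric_argv, worktree) != 0:
            result["status"] = "metric-failed"
            return result
        result["guard"] = "pass" if run(guard_argv, worktree) == 0 else "fail"
        return result
    finally:
        # Step 4f: runs even when a step above returned early or raised.
        run(["git", "worktree", "remove", "--force", worktree], None)
```

Because the cleanup sits in `finally`, a failed metric command or a conflicting revert still ends with `git worktree remove --force`, which is the behaviour the invariant asks for.
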