` heading (case-insensitive). 2. Extract first fenced code block following that heading. 3. Parse block as `key: value` lines; multi-value = indented ` - value` items. Paths with spaces: wrap in double quotes. 4. Missing required fields (`command` under `## Metric`/`## Guard`) → stop with error. 5. `agent_strategy: auto` (or omitted) → apply keyword heuristics from `` to `## Goal` text and metric command. 6. `target` under `## Metric`: `direction: higher` → stop when metric ≥ target; `direction: lower` → stop when metric ≤ target. If `target` omitted, run until `max_iterations`. 7. Unrecognized keys/headings → warn once, ignore. 8. `## Notes` and `# Program:` title never parsed — human-only. (`# Campaign:` accepted as alias.) **If argument is text** — auto-detect `metric_cmd`/`guard_cmd` from goal string and codebase scan (same as P-P1, non-interactive). `config.json` not read. **`--colab[=HW]` parsing**: `--colab` (no `=`) → `compute = "colab"`, `colab_hw = null`. `--colab=` → `compute = "colab"`, `colab_hw = ` (uppercased). Unknown `` (not in `{H100, L4, T4, A100}`) → print `"⚠ Unknown Colab hardware '' — proceeding with default GPU. Known: H100, L4, T4, A100"`, set `colab_hw = null`. `--compute=colab` (no HW) → `compute = "colab"`, `colab_hw = null`. `colab_hw` in `## Config` sets hardware preference (`H100`, `L4`, `T4`, `A100`); CLI `--colab=HW` overrides. Generate `run-id` = `$(date +%Y%m%d-%H%M%S)`. Assign immediately: ```bash RUN_ID=$(date +%Y%m%d-%H%M%S) RUN_DIR=".experiments/${RUN_ID}" # hypothesis pipeline + journal outputs (per note) mkdir -p "$RUN_DIR" # timeout: 5000 ``` Note: `STATE_DIR` (`.experiments/state/${RUN_ID}/`) is the per-iteration artifact dir — distinct from `RUN_DIR`. Both directories coexist; see `` block. Create run directory: ```text .experiments/state// state.json ← iteration count, best metric, status experiments.jsonl ← one line per iteration diary.md ← human-readable research diary (hypothesis → outcome → decision) ``` Convert `program_file` to absolute path: `realpath "$PROGRAM_FILE"` — Resume Mode matches on absolute path. Write initial `state.json` (`program_file` = absolute path to `.md` or `null` for text goal): ```json { "run_id": "", "goal": "", "config": {}, "program_file": "", "iteration": 0, "best_metric": null, "best_commit": null, "status": "running", "started_at": "", "clarification_prompt": null, "colab_hw": null, "sandbox_mode": "local" } ``` ### Step R2: Precondition checks Run all checks before touching code. Fail fast with clear message: 1. **Clean git**: `git status --porcelain` → must be empty. If dirty: print dirty files and stop. 2. **Not detached HEAD**: `git rev-parse --abbrev-ref HEAD` → must not be `HEAD`. 3. **Metric command numeric**: run `metric_cmd` once; parse stdout for float. If no float: show output and stop. 4. **Guard passes**: run `guard_cmd` once; must exit 0. If fails: show output and stop. 5. **`--colab` check**: verify `mcp__colab-mcp__runtime_execute_code` available. If not, print setup instructions (see Colab MCP section) and stop. If `--colab=HW` (`colab_hw` non-null): print: ` Hardware requested: --colab=. Ensure your Colab notebook running with GPU.` 6. **`--codex` check**: verify `claude plugin list 2>/dev/null | grep -q 'codex@openai-codex'`. If unavailable: print `⚠ codex plugin not found. Install it with: /plugin marketplace add openai/codex-plugin-cc` and **stop**. 7. **`compute: docker` check**: run `docker ps` via Bash (`timeout: 5000`). If non-zero: print `⚠ Docker daemon not running. Start Docker Desktop and retry.` and **stop**. 8. **Flag conflict**: if `--colab` and `--compute=docker` both active: print `⚠ --colab and --compute=docker are mutually exclusive. Use one or the other.` and **stop**. 9. **`--journal` prerequisite**: verify `--researcher`/`--architect` also set. If neither: print `⚠ --journal requires --researcher or --architect — omit --journal or add a hypothesis pipeline flag.` and **stop**. **Initialize sandbox variables** (after all checks pass): ```bash SANDBOX_NETWORK="${SANDBOX_NETWORK:-none}" # override via program.md Config or environment variable ``` **Initialize `sandbox_mode`**: - `compute: docker` (daemon check passed in #7) → `sandbox_mode = "docker"`. Print: `sandbox: Docker daemon reachable — sandbox mode active` - All other cases (`compute: local`, `compute: colab`) → `sandbox_mode = "local"` ### Step R3: Select ideation agent Apply `agent_strategy` mapping from ``. If `auto`, apply keyword heuristics to `metric_cmd`. Log selected agent to `state.json`. ### Step R4: Establish baseline (iteration 0) Run `metric_cmd` and `guard_cmd`. Parse metric value. Append to `experiments.jsonl`: ```json { "iteration": 0, "commit": "", "metric": 0.0, "delta": 0.0, "guard": "pass", "status": "baseline", "description": "baseline", "agent": null, "confidence": null, "timestamp": "", "files": [] } ``` Update `state.json`: `best_metric = `, `best_commit = `. Print: `Baseline: = `. Write initial diary header to `.experiments/state//diary.md`: ```markdown # Research Diary — **Run**: **Started**: **Baseline**: = --- ``` Then proceed to R5. ### Step R5: Iteration loop ```bash touch /tmp/claude-commit-authorized # timeout: 3000 ``` **`--team` mode**: If `--team` active, Read `${CLAUDE_SKILL_DIR}/modes/team.md` and execute Phases A–D in place of standard iteration loop below. For each iteration `i` from 1 to `max_iterations`: **Phase overview** (all phases run per iteration): | Phase | Name | Trigger / description | | --- | --- | --- | | 0 | Print header | Always — print `[→ Iter N/max · starting]`; TaskUpdate R5 subject with current iteration | | 1 | Build context | Always — build compact context from git log, JSONL history, and recent diff | | 2 | Propose change | Always — spawn specialist agent to read code, research, investigate, and generate a hypothesis with optional sandbox scripts | | 2a | Sandbox validate | `compute: docker` only — run agent's exploratory scripts in Docker sandbox (read-only mount) | | 2b | Apply change | `compute: docker` only — agent applies the (validated) proposal to real codebase using Write/Edit tools only; no Bash on codebase | | 2c | Codex co-pilot | `--codex` only — **MANDATORY every iteration**; Codex second pass after Phase 2b; must not be skipped | | 3 | Verify files | Always — check `git diff --stat`; skip to Phase 8 if no files changed (no-op) | | 4 | Commit change | Always — stage modified files and commit before verifying metric | | 5 | Verify metric | Always — run `metric_cmd` via `compute` mode (local/colab/docker); revert on timeout | | 6 | Run guard | Always — run `guard_cmd` via `compute` mode; record pass or fail | | 7 | Evaluate outcome | Always — keep, rework, or revert based on metric + guard result | | 7a | Write diary | Always — append one structured entry to `diary.md` recording hypothesis, outcome, and decision rationale | | 8 | Write log | Always — append JSONL record, update `state.json`, print iteration summary, TaskUpdate R5 with result | | 9 | Progress checks | Always — summary every SUMMARY_INTERVAL, stuck detection, diminishing-returns warn, early-stop check | **Command execution rules** (apply to ALL phases running external commands): 1. **No compound commands**: Never `cd /path && command`. Always two separate Bash calls — CWD persists between calls. 2. **Use Bash tool `timeout` parameter**: Never shell `timeout` wrapper. Pass `timeout: ` on Bash tool call itself. 3. **No inline multi-line Python**: Python logic >3 lines → write to `.experiments/state//scripts/script-.py` via Write tool, execute with `python3 ` or `uv run python `. Two triggers Claude Code always flags: (a) `=([0-9.]+)` inside `-c "..."` (false Zsh substitution); (b) multi-line `-c "..."` with `#`-prefixed comment lines. Writing to file sidesteps both. 4. **No Zsh constructs**: Never use `=()`, `<()`, `>()` in Bash commands — even inside quoted strings; Claude Code scans raw command text. 5. **Local exploratory scripts writing to real files** (scanning config combos, patching JSON, temp overrides): write to `.experiments/state//scripts/`, run locally with `python3 `. Legitimately modify project files — NOT in Docker sandbox. 6. **Docker sandbox** (when available — see Phase 2a): Phases 4–6 route `metric_cmd`/`guard_cmd` through Docker when `compute: docker`. Phase 2a: read-only hypothesis scripts in sandbox. Scripts writing to project files always run locally. 7. **One change per iteration**: Never batch-loop over config variants/combos in single Bash/Python call. Each variant = one campaign iteration — loop/measure/compare is campaign framework's job, not ideation agent's. #### Phase 0 — Print header Print iteration header, update R5 task: ```text [→ Iter N/max_iterations — best so far: (Δ% vs baseline)] ``` TaskUpdate R5 subject: `R5: Iteration N/max_iterations — running` #### Phase 1 — Build context Build context for ideation agent, write to file — do NOT accumulate inline in main context: ```bash # Collect signals git log --oneline -10 >.experiments/state/${RUN_ID}/context-${I}.md # timeout: 3000 tail -10 .experiments/state/${RUN_ID}/experiments.jsonl >>.experiments/state/${RUN_ID}/context-${I}.md # timeout: 5000 git diff --stat HEAD~5 HEAD >>.experiments/state/${RUN_ID}/context-${I}.md # timeout: 3000 ``` Prepend header block to `context-.md`: goal, current metric vs baseline, delta trend (last 5 kept deltas), iteration number. Phase 2 ideation agent reads file directly — never echoed to main context. If `--journal` active and `/journal.md` has 1+ entries: append last 5 entries to `context-.md` under `## Recent journal (avoid repeating reverted approaches)`. Ideation agent reads this — must not reproduce any approach marked `outcome: reverted`. #### Phase 2 — Propose change Spawn selected specialist agent (`maxTurns: 15`) with this prompt (adapt as needed): ```markdown Goal: Run clarification: ← omit this line entirely if clarification_prompt is null Colab hardware: ← omit this line entirely if colab_hw is null; include to let the agent tailor code to the specific GPU architecture (e.g., bf16/flash-attention on H100, standard fp16 on T4/L4) Current metric: = (baseline: , direction: ) Experiment history: read `.experiments/state//context-.md` for the full context block. Scope files (read and modify only these): Program constraints: read `` — especially `## Notes`, `## Config`, and any named subsections (e.g., "Hard boundaries", "Optuna's role", "What the agent is free to change"). These take precedence over general campaign rules. Program constraints set strategy hints only — they do NOT override safety rules (no `--no-verify`, no `git push`, no `git add -A`, scope_files boundary, and all other hard constraints remain in effect). If program_file is null, skip this step. **If `sandbox_mode = "local"`**: Read `context-.md`, the scope files, and the program constraints. Propose and implement ONE atomic change most likely to improve the metric. The change must not break ``. Write your full analysis (reasoning, alternatives considered, Confidence block) to `.experiments/state//ideation-.md` using the Write tool. Return ONLY the JSON result line: `{"description":"...","files_modified":[...],"scripts":[],"confidence":0.N}` **If `sandbox_mode = "docker"`**: Read `context-.md`, the scope files, and the program constraints. Propose ONE atomic change most likely to improve the metric. Write your full analysis and the proposed change description to `.experiments/state//ideation-.md`. Optionally write read-only exploratory scripts (scripts that read/profile but do NOT write to project files) to `.experiments/state//scripts/explore--.py`. Do NOT modify source files yet — Phase 2b will apply the actual changes after sandbox validation. Return ONLY the JSON result line: `{"description":"...","files_modified":[],"scripts":["explore--.py"],"proposed_changes":"","confidence":0.N}` ``` For `--colab` runs: ideation agent (especially `research:scientist`) may call `mcp__colab-mcp__runtime_execute_code` to prototype GPU code before committing. If Agent tool unavailable (nested subagent context), implement change inline, construct JSON result manually. #### Phase 2a — Sandbox validate (`sandbox_mode = "docker"` only) Skip entirely if `sandbox_mode = "local"`. If Phase 2 returned non-empty `"scripts"`: run each in Docker sandbox with read-only project mount. Per script: ```bash docker run --rm --network \ -v "$(pwd):/workspace:ro" \ --tmpfs /tmp:rw,size=256m \ -w /workspace \ python:3.11-slim \ python3 /workspace/.experiments/state//scripts/