---
name: scaffold-experiment
description: Set up complete experimental infrastructure for all runs in a designed experiment. Orchestrates parallel generation of fine-tuning configs (via scaffold-torchtune) and evaluation configs (via scaffold-inspect). Use after design-experiment to prepare configs before running experiments.
---

# Scaffold Experiment

You help users automatically set up the complete experimental infrastructure - both fine-tuning and evaluation configurations - for all runs in a designed experiment.

## Your Task

Orchestrate the scaffolding process by reading tool specifications from experiment_summary.yaml and launching the appropriate subagents:

1. Read experiment_summary.yaml to identify which tools are being used
2. Launch the preparation subagent (`scaffold-torchtune`) **only if at least one run has `type: "fine-tuned"`** (see "Deciding Which Subagents to Launch" below)
3. Launch the evaluation subagent (`scaffold-inspect`)
4. Wait for the launched subagent(s) to complete and report their results

This ensures the entire experiment is ready to execute from training through evaluation. When both subagents are launched they run in parallel in separate context windows since their outputs do not depend on one another.

## Deciding Which Subagents to Launch

Preparation (training) is only needed for runs that train in *this* experiment. Eval-only experiments — where every run is a base model (`type: "control"`) or a pre-existing checkpoint (`type: "eval-only"`) — have nothing to fine-tune, so launching `scaffold-torchtune` would do nothing.

**Rule:** Launch `scaffold-torchtune` **if and only if** at least one run has `type: "fine-tuned"`. Always launch `scaffold-inspect`.

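```python
import yaml

# experiment_dir: path to the experiment directory (see "Finding the Experiment")
with open(f"{experiment_dir}/experiment_summary.yaml") as f:
    config = yaml.safe_load(f) or {}  # an empty file loads as None

# `runs` may be missing or null; non-dict entries are ignored
runs = config.get("runs", []) or []
finetuned_runs = [r for r in runs if isinstance(r, dict) and r.get("type") == "fine-tuned"]
needs_torchtune = len(finetuned_runs) > 0
```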

- **`needs_torchtune` true:** launch both subagents in parallel (one message, two Task calls).
- **`needs_torchtune` false:** launch `scaffold-inspect` alone. Note in the summary and log that torchtune was skipped because the experiment has no fine-tuned runs (it is eval-only).

**Post-decision assert (cheap insurance against a silent miss):** after deciding, confirm the partition is correct — if `needs_torchtune` is false, assert that `finetuned_runs` is genuinely empty before skipping. A wrongly-skipped fine-tuned run would silently lose its training configs and only surface as a failure at `run-experiment`. If the assert ever trips, do **not** skip — launch `scaffold-torchtune`.
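
The decision and its guard can be folded into one helper. This is a minimal sketch, not part of the skill: `decide_subagents` is a hypothetical name, and it assumes `runs` has the shape loaded from experiment_summary.yaml above. The re-scan deliberately fails safe by launching `scaffold-torchtune` rather than raising.

```python
def decide_subagents(runs):
    """Return the subagent names to launch for a list of run dicts.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    runs = runs or []
    finetuned_runs = [
        r for r in runs
        if isinstance(r, dict) and r.get("type") == "fine-tuned"
    ]
    needs_torchtune = len(finetuned_runs) > 0

    if not needs_torchtune:
        # Post-decision check: re-scan the runs before skipping.
        # If any fine-tuned run turns up, do not skip torchtune.
        missed = any(
            isinstance(r, dict) and r.get("type") == "fine-tuned"
            for r in runs
        )
        if missed:
            needs_torchtune = True

    subagents = ["scaffold-inspect"]  # evaluation is always launched
    if needs_torchtune:
        subagents.insert(0, "scaffold-torchtune")
    return subagents
```

An eval-only list such as `[{"type": "control"}, {"type": "eval-only"}]` yields only `scaffold-inspect`; adding a single `{"type": "fine-tuned"}` run yields both subagents.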

**Current tool support:**
- **Preparation:** torchtune only (via `scaffold-torchtune` subagent)
- **Evaluation:** inspect-ai only (via `scaffold-inspect` subagent)

**Future tool support:** This orchestrator is designed to route to different worker subagents based on tool choices documented in experiment_summary.yaml. Future iterations may support additional frameworks.

## Dependency version check

Run before proceeding to catch stale envs (user pulled new pins but didn't re-run `pip install -e .`):

```bash
python scripts/check_env.py
```

- **Exit 0**: proceed.
- **Exit 1**: show the printed `STALE ENV` table to the user and ask whether to run `pip install -e .` first or continue anyway.
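
If the check is driven from Python rather than the shell, the exit-code branching might look like the sketch below. `env_is_fresh` is a hypothetical helper, and it assumes the script prints its report to stdout; it treats any non-zero exit as "stale", which is slightly broader than the exit-1 case described above.

```python
import subprocess
import sys


def env_is_fresh(script="scripts/check_env.py"):
    """Run the env check; return (ok, report), where ok means exit code 0.

    Hypothetical wrapper: assumes the report is written to stdout.
    """
    result = subprocess.run(
        [sys.executable, script], capture_output=True, text=True
    )
    # Exit 0 -> proceed; any other exit -> show the report and ask the user.
    return result.returncode == 0, result.stdout
```

When `ok` is false, the returned report is what should be shown to the user before asking how to proceed.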

## Workflow

1. **Locate experiment** - Find the experiment directory (usually current directory or ask user)
2. **Verify experiment_summary.yaml exists** - Ensure design phase is complete
3. **Read tool specifications** - Parse experiment_summary.yaml to identify preparation and evaluation tools
4. **Validate tool support** - Ensure the specified tools have corresponding worker subagents
5. **Prepare data (if applicable)** - If `data.data_generation` block is present, run `src/tools/experiment/prepare_data.py` to materialize the declared dataset before subagents launch
6. **Launch subagents** - Use the Task tool to launch `scaffold-inspect`, together with `scaffold-torchtune` in the same message when at least one run has `type: "fine-tuned"`
7. **Wait for the launched subagent(s) to complete** - Each will report back when done
8. **Create orchestration log** - Document the scaffolding process in `logs/scaffold-experiment.log`
9. **Report combined summary** - Show user complete status of each scaffolding phase that ran

## Finding the Experiment

**If user runs skill without arguments:**
- Check if current directory contains `experiment_summary.yaml`
- If not, ask user for the experiment directory path

**If user provides a path:**
- Use that path as the experiment directory
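
The two cases above can be sketched as a small resolver. This is illustrative only: `locate_experiment` and its `cwd` parameter are hypothetical, and a return value of `None` stands for "ask the user for the path". A user-supplied path is returned unchecked because the existence of experiment_summary.yaml is verified in the next section.

```python
from pathlib import Path


def locate_experiment(path=None, cwd=None):
    """Return the experiment directory, or None if the user must be asked.

    Hypothetical helper mirroring the two cases described above.
    """
    if path:
        # User supplied a path: use it as-is; verification happens later.
        return Path(path)
    base = Path(cwd) if cwd else Path.cwd()
    if (base / "experiment_summary.yaml").is_file():
        return base
    return None  # caller should ask the user for the directory
```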

## Verification Before Starting

Before beginning scaffolding, perform **minimal structural validation**:

1. **experiment_summary.yaml exists:**
   ```bash
   ls {experiment_dir}/experiment_summary.yaml
   ```
   If missing, report error and suggest running `design-experiment` skill first.
   DO NOT launch subagents.

2. **experiment_summary.yaml is readable:**
   ```python
   import yaml
   with open(f"{experiment_dir}/experiment_summary.yaml") as f:
       config = yaml.safe_load(f)
   ```
   If unreadable or invalid YAML, report error. DO NOT launch subagents.

**Note on validation division:**
- **Skill validates:** Structure only (file existence, readability, tool recognition)
- **Agents validate:** Domain-specific content (parameters, paths, configuration)

The subagents (scaffold-torchtune, scaffold-inspect) will perform complete validation of:
- Required parameters presence and validity
- Path accessibility
- Configuration correctness
- Environment settings from claude.local.md

## Tool to Subagent File Mapping

This orchestrator routes to different subagent specifications based on tool choices in experiment_summary.yaml:

**Preparation tools:**
- `torchtune` → [optimizers/torchtune_agent.md](optimizers/torchtune_agent.md)

**Evaluation tools:**
- `inspect-ai` → [evaluators/inspect_agent.md](evaluators/inspect_agent.md)

**Adding new tools:** Create the corresponding agent file (optimizers/{tool}_agent.md or evaluators/{tool}_agent.md) and add to this mapping.
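
As a rough illustration, the mapping above amounts to a two-level lookup. The phase keys (`"preparation"`, `"evaluation"`) and the `agent_file_for` helper are assumptions made for this sketch; the actual parsing rules are in parsing.md.

```python
# Tool -> agent file, grouped by phase (paths taken from the mapping above).
AGENT_FILES = {
    "preparation": {"torchtune": "optimizers/torchtune_agent.md"},
    "evaluation": {"inspect-ai": "evaluators/inspect_agent.md"},
}


def agent_file_for(phase, tool):
    """Return the agent file for a tool, or raise if it is unsupported.

    Hypothetical helper: an unsupported tool should stop scaffolding
    before any subagent is launched.
    """
    try:
        return AGENT_FILES[phase][tool]
    except KeyError:
        raise ValueError(f"Unsupported {phase} tool: {tool!r}") from None
```

Supporting a new tool then means adding one entry to the relevant phase alongside its agent file.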

## Reading Tool Specifications

Read experiment_summary.yaml to determine which subagents to launch.

**See [parsing.md](parsing.md) for:**
- How to parse the tools section
- Tool to subagent mapping logic
- Error handling for missing/unsupported tools
- Logging examples

## Orchestration Steps

### How to Launch Worker Subagents

**IMPORTANT:** Use the `Task` tool to launch worker subagents (NOT the `SlashCommand` tool).

**Correct approach for parallel execution:**

When both subagents are needed, launch them in a single message with multiple Task tool calls. This runs them in parallel. For an eval-only experiment, launch `scaffold-inspect` on its own.

**Example:**
```
I'll launch both the torchtune and inspect-ai scaffolding subagents in parallel.

[Use Task tool with subagent_type="scaffold-torchtune"]
[Use Task tool with subagent_type="scaffold-inspect"]
```

**Subagent prompts should:**
- Specify the experiment directory path
- Ask the subagent to read experiment_summary.yaml
- Request generation of all necessary configuration files
- Ask for a detailed log file (logs/scaffold-torchtune.log or logs/scaffold-inspect.log)
- Request a summary report of created files and any errors
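
A prompt covering the points above could be assembled as follows. This is only a sketch of the checklist: `build_subagent_prompt` is a hypothetical helper, and the authoritative prompt templates live in the agent files.

```python
def build_subagent_prompt(subagent, experiment_dir):
    """Assemble a minimal prompt covering the points listed above.

    Hypothetical helper; the wording is illustrative, not the real template.
    """
    log_path = f"logs/{subagent}.log"
    return "\n".join([
        f"Experiment directory: {experiment_dir}",
        "Read experiment_summary.yaml in that directory.",
        "Generate all necessary configuration files.",
        f"Write a detailed log to {log_path}.",
        "Report back a summary of created files and any errors.",
    ])
```

Because subagents cannot be sent follow-up messages, everything they need has to be in this initial prompt.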

**Why this matters:**
- Worker subagents like `scaffold-torchtune` and `scaffold-inspect` are launched via the Task tool
- They run in separate context windows (not the main conversation)
- They execute independently and report back when complete
- Running them in parallel saves time since they don't depend on each other

### Step 0: Prepare Data (if applicable)

If experiment_summary.yaml contains a `data.data_generation` block, run the prepare_data tool before launching any subagents:

```bash
python -m cruijff_kit.tools.experiment.prepare_data {experiment_dir}
```

**Behavior:**
- Exits 0 if no `data.data_generation` block exists or the declared dataset is generated successfully.
- Exits 1 on any failure. If this happens, **do not launch subagents** — report the error and direct the user to `logs/scaffold-prepare-data.log`.

**Currently supported generators:**
- `model_organism` — cheap, deterministic sequence datasets (`src/tools/model_organisms/`). See template schema for parameters.
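
Whether Step 0 has any work to do can be read from the parsed summary. The sketch below assumes `config` is the dict loaded from experiment_summary.yaml; `has_data_generation` is a hypothetical helper. Since the tool itself exits 0 when the block is absent, this check is mainly useful for deciding what to record in the orchestration log.

```python
def has_data_generation(config):
    """Return True if the summary declares a data.data_generation block.

    Hypothetical helper: tolerates a missing or null `data` section.
    """
    data = (config or {}).get("data") or {}
    if not isinstance(data, dict):
        return False
    return data.get("data_generation") is not None
```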

### Step 1: Launch Preparation Subagent (only if there are fine-tuned runs)

**First check `needs_torchtune`** (see "Deciding Which Subagents to Launch" above). If no run has `type: "fine-tuned"`, **skip this step entirely** — the experiment is eval-only — and proceed to Step 2 with `scaffold-inspect` alone.

Otherwise, invoke the preparation subagent based on tool specification in experiment_summary.yaml.

**For torchtune:** See [optimizers/torchtune_agent.md](optimizers/torchtune_agent.md) for:
- Complete prompt template
- Subagent responsibilities and execution details
- Expected output structure
- Error handling

Launch the subagent using the Task tool with the prompt template from the agent file.

### Step 2: Launch Evaluation Subagent

Invoke the appropriate evaluation subagent based on tool specification in experiment_summary.yaml.

**For inspect-ai:** See [evaluators/inspect_agent.md](evaluators/inspect_agent.md) for:
- Complete prompt template
- Subagent responsibilities and execution details
- Expected output structure
- Error handling

Launch the subagent using the Task tool with the prompt template from the agent file.

### Step 3: Wait for Subagent Completion

**After launching the subagent(s):**
- Each subagent will execute autonomously in its own context window
- You will receive a report back from each launched subagent when it completes
- The reports are returned as tool results from the Task tool calls
- Do NOT proceed until every launched subagent has reported back

**Processing subagent reports:**
1. Read the response from the scaffold-torchtune subagent (skip this if it was not launched)
   - Extract list of created runs
   - Note any errors or warnings
   - Identify path to logs/scaffold-torchtune.log
2. Read the response from the scaffold-inspect subagent
   - Extract list of created evaluation scripts
   - Note any errors or warnings
   - Identify path to logs/scaffold-inspect.log
3. Verify that every launched subagent completed successfully, or note failures

## Logging

Create an orchestration log at `{experiment_dir}/logs/scaffold-experiment.log` that records the high-level scaffolding process.

**See [logging.md](logging.md) for:**
- Complete log format specification
- Action types and when to log them
- What to log vs what goes in subagent logs
- Example log entries for all scenarios
- Error handling patterns

**Key principle:** The orchestration log tracks coordination and timing. Detailed implementation goes in subagent logs (logs/scaffold-torchtune.log, logs/scaffold-inspect.log).

## Error Handling

**If experiment_summary.yaml not found:**
- Report error to user
- Suggest running `design-experiment` skill first
- Do not proceed

**If the preparation subagent (`scaffold-torchtune`) fails:**
- Log the failure with details from subagent report
- Ask user: "Fine-tuning scaffolding failed. Do you want to continue with evaluation scaffolding?"
- If yes, evaluation can still be scaffolded for base model runs
- If no, stop and report failure

**If the evaluation subagent (`scaffold-inspect`) fails:**
- Log the failure with details from subagent report
- Note that fine-tuning can still proceed independently
- Report in summary which evaluations couldn't be configured
- Still consider overall scaffolding partially successful

**If both subagents fail:**
- Report complete failure
- Direct user to individual subagent logs (logs/scaffold-torchtune.log, logs/scaffold-inspect.log)
- Suggest checking experiment_summary.yaml for completeness
- May need to run subagents individually for debugging

**If a subagent doesn't report back:**
- This should not happen with the Task tool
- If it does, report the issue and suggest checking the agent configuration
- User may need to run the subagent manually
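
The failure cases above reduce to a three-way outcome. In this sketch the status labels and the `overall_status` helper are hypothetical; `None` for `torchtune_ok` marks an eval-only experiment where `scaffold-torchtune` was skipped rather than failed.

```python
def overall_status(torchtune_ok, inspect_ok):
    """Classify the scaffolding outcome from per-subagent results.

    torchtune_ok: True/False, or None if scaffold-torchtune was skipped.
    inspect_ok:   True/False (scaffold-inspect is always launched).
    Labels are illustrative, not a format defined by the skill.
    """
    # Only subagents that were actually launched count toward the outcome.
    launched = [ok for ok in (torchtune_ok, inspect_ok) if ok is not None]
    if all(launched):
        return "success"
    if any(launched):
        return "partial"
    return "failure"
```

A skipped preparation phase does not downgrade the result: `overall_status(None, True)` is a full success, while one failure out of two launched subagents is reported as partial.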

## Output Summary

After completing orchestration, provide a comprehensive summary:

```markdown
## Scaffold Experiment Complete

Successfully scaffolded experiment:
`/scratch/gpfs/MSALGANIK/niznik/ck-projects/capitalization/cap_4L_lora_lr_sweep_2025-10-22/`

### Fine-Tuning Configurations (scaffold-torchtune)

✓ 2 runs configured successfully

**Created runs:**
- Llama-3.2-1B-Instruct_rank4/
- Llama-3.2-1B-Instruct_rank8/

**Each run contains:**
- setup_finetune.yaml (configuration)
- finetune.yaml (torchtune config)
- finetune.slurm (SLURM script)

### Evaluation Configurations (scaffold-inspect)

✓ 2 evaluation cells configured successfully

**Created cells:** (per-cell layout, issue #498 — one directory per (task, epoch))
- Llama-3.2-1B-Instruct_rank4/eval/capitalization_epoch0/
- Llama-3.2-1B-Instruct_rank8/eval/capitalization_epoch0/

**Each cell directory contains:**
- eval_config.yaml (per-cell evaluation configuration)
- cell.slurm (SLURM script)
- logs/ (for inspect-ai `.eval` output)

### Logs Created

- `logs/scaffold-experiment.log` - Orchestration log (this process)
- `logs/scaffold-prepare-data.log` - Data-generation details (only if `data.data_generation` block present)
- `logs/scaffold-torchtune.log` - Fine-tuning scaffolding details
- `logs/scaffold-inspect.log` - Evaluation scaffolding details

### Next Steps

**Recommended workflow:**
1. Review the generated configurations (optional)
2. Run `run-experiment` skill to execute the complete workflow:
   - Fine-tuning via `run-torchtune`
   - Evaluation via `run-inspect`
3. Run `explore-experiment` skill to interpret results
```

## Validation Before Completion

Before reporting success, verify:
- ✓ experiment_summary.yaml was found and read
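- ✓ Preparation subagent was launched and reported back (not applicable to eval-only experiments, where it is skipped)
- ✓ Evaluation subagent was launched and reported back
- ✓ A log file exists for each launched subagent (logs/scaffold-inspect.log, plus logs/scaffold-torchtune.log when fine-tuning was scaffolded)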
- ✓ Run directories exist with expected structure (check 1-2 examples)
- ✓ Evaluation directories exist with expected structure (check 1-2 examples)
- ✓ Orchestration log was created

**Note:** You don't need to verify every file - the subagents have already done detailed verification. Just spot-check a few directories to confirm the structure is correct.

## Important Notes

### Orchestration Principles

- This skill **orchestrates** rather than implements - it launches autonomous subagents
- Each subagent maintains its own detailed log
- The orchestration log tracks high-level flow and timing
- Subagents can be run independently if needed (outside of this skill)
- Partial success is acceptable (e.g., fine-tuning configs generated but eval fails)

### Parallel Execution

- **When both subagents are needed** (the experiment has fine-tuned runs), launch them in a **single message** with multiple Task tool calls
- For an eval-only experiment (no fine-tuned runs), launch `scaffold-inspect` alone — there is no preparation subagent to parallelize with
- Do NOT launch them sequentially in separate messages
- **Do NOT use `run_in_background: true`** — background agents cannot surface permission prompts to the user, so all tool calls get auto-denied. Foreground parallel (multiple Task calls in one message) works correctly.
- The subagents run independently in separate context windows
- They can work simultaneously because their outputs don't depend on each other
- Wait for the launched subagent(s) to complete before proceeding to create the orchestration log

### Subagent Communication

- Each subagent receives its own prompt with specific instructions
- Subagents have full access to tools (Read, Write, Edit, Bash, etc.)
- Subagents report back once in a final message when complete
- You cannot send follow-up messages to subagents
- If a subagent needs more information, include it in the initial prompt

### Error Recovery

If scaffolding fails:
1. Check orchestration log (logs/scaffold-experiment.log) for high-level flow
2. Check individual subagent logs (logs/scaffold-torchtune.log, logs/scaffold-inspect.log) for details
3. Fix the issue (e.g., missing inspect-ai task script, incorrect paths in claude.local.md)
4. Re-run this skill (subagents should handle existing files gracefully)
5. Or run individual subagents directly via Task tool for targeted fixes
