---
name: sherlock-handoff
description: Migrate an in-progress Sherlock training to iris-hi by rsync-ing the resume checkpoint and re-submitting with the same wandb.run_name so it picks up where Sherlock left off. ~4× speedup (Sherlock 2080Ti → iris-hi H100). Invoke when a Sherlock training is end-to-end blocking AND iris-hi has a free QOS slot.
disable-model-invocation: false
argument-hint: --run-name <wandb-run-name> --script <training-script-path> --array <seed-int> [--from sherlock] [--to iris-hi]
---

# `/sherlock-handoff` — Live Training Migration

Cancels a running Sherlock training, rsyncs its `_resume/<hash>/` state to iris, and resubmits on iris-hi with the same `wandb.run_name`. The training's `AutoResumeManager` (see `sir.training.resume_manager`) keys resume identity on a hash of the W&B entity, project, and run name. Because the resubmitted job uses the same `wandb.run_name`, it computes the same hash, finds the rsync'd directory, loads `resume_state.pt`, and continues from `last_checkpoint_step`.

**Arguments**: `$ARGUMENTS`

## When to use

ALL of these must hold:

1. The Sherlock training is **on the critical path** (a missing seed blocks pillar completion or user's end-to-end timeline).
2. **iris-hi has a free QOS slot** (cap=6 GPU/user). Check with `squeue -u $USER -p iris-hi -h | wc -l`; this counts jobs, so it equals GPUs in use only when each job holds a single GPU.
3. **Sherlock remaining ETA > 30 min**. Compare `(300000 - current_step) / sherlock_rate` against `(300000 - current_step) / iris_hi_rate`, with rates in steps per hour: Sherlock 2080Ti ~19k/h, iris-hi H100 ~75k/h (4× speedup), iris-hi L40s ~60k/h (3×). For Sherlock step < 250k, almost always a win.

If any gate fails, **do not hand off**: the migration overhead would be wasted.
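
The ETA comparison in gate 3 can be sketched as a few lines of Python. This is only an illustration of the arithmetic: the rates and the 300k step budget are the figures quoted above, while the function names and the `min_eta_min` parameter are hypothetical and not part of the training code.

```python
TOTAL_STEPS = 300_000          # training length used in the gate math above
SHERLOCK_2080TI_RATE = 19_000  # steps/hour, as quoted above
IRIS_HI_H100_RATE = 75_000     # steps/hour, as quoted above


def remaining_minutes(current_step: int, rate_steps_per_hour: float,
                      total_steps: int = TOTAL_STEPS) -> float:
    """Minutes of training left at the given throughput."""
    return max(total_steps - current_step, 0) / rate_steps_per_hour * 60


def eta_gate(current_step: int, min_eta_min: float = 30.0) -> bool:
    """Gate 3 only: is the remaining Sherlock ETA above the threshold?"""
    return remaining_minutes(current_step, SHERLOCK_2080TI_RATE) > min_eta_min


if __name__ == "__main__":
    step = 250_000  # example value; use last_checkpoint_step in practice
    print(f"Sherlock: {remaining_minutes(step, SHERLOCK_2080TI_RATE):.0f} min")
    print(f"iris-hi:  {remaining_minutes(step, IRIS_HI_H100_RATE):.0f} min")
    print("gate 3 passes:", eta_gate(step))
```

Gates 1 and 2 still need a human (or `squeue`) check; this covers the timing gate only.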

## Procedure (6 steps plus 3b, in order)

### 1. Confirm the training and find its resume hash on Sherlock

```bash
RUN_NAME="<wandb-run-name>"  # e.g. idql-n128-square-d1-blind-mining-v2u-v2-r15-actor1024-ch5-seed-1

# Find resume hash by grepping resume_meta.json
ssh sherlock "for h in \$(ls /scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/ 2>/dev/null); do
  meta=/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/\$h/resume_meta.json
  if [ -f \"\$meta\" ] && grep -q \"$RUN_NAME\" \"\$meta\"; then
    step=\$(python -c \"import json; d=json.load(open('\$meta')); print(d['last_checkpoint_step'])\")
    echo \"\$h step=\$step\"
  fi
done"
```

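You should see exactly one hash with its `last_checkpoint_step`. If none appears, the run hasn't checkpointed yet: abort the handoff. The `grep` is a substring match, so a name ending in `seed-1` also matches `seed-10`; if several hashes print, inspect each `resume_meta.json` and keep only the one that belongs to `$RUN_NAME`.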

### 2. scancel Sherlock FIRST (avoid mid-write rsync)

```bash
ssh sherlock "scancel <sherlock-jobid>"
```

Verify with `ssh sherlock "squeue -j <jobid>"` — should be empty.

### 3. rsync `_resume/<hash>/` to iris

```bash
HASH="<resume-hash>"
rsync -a --quiet \
  "sherlock:/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/$HASH/" \
  "checkpoints/_resume/$HASH/"
```

Verify:
```bash
ls -la "checkpoints/_resume/$HASH/"
# Expect resume_meta.json + resume_state.pt (~600MB)
```
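
If the handoff is scripted, the same check can be done programmatically. The sketch below assumes only what this step states, namely that a usable resume directory holds `resume_meta.json` and `resume_state.pt`; the helper name `missing_resume_files` is hypothetical.

```python
from pathlib import Path

# The two files this skill expects after the rsync.
REQUIRED_FILES = ("resume_meta.json", "resume_state.pt")


def missing_resume_files(resume_dir: Path) -> list[str]:
    """Return the required files that are absent or empty in resume_dir."""
    problems = []
    for name in REQUIRED_FILES:
        f = resume_dir / name
        if not f.is_file() or f.stat().st_size == 0:
            problems.append(name)
    return problems
```

An empty list means both files arrived; anything else is a reason to rerun the rsync before submitting.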

### 3b. ⚠ Rewrite `run_dir` in `resume_meta.json` to an iris-side path

**The Sherlock `resume_meta.json` has `run_dir` pointing at `/scratch/groups/cbfinn/...` which iris can't `mkdir` — the training will crash in ~30s with `PermissionError: [Errno 13] Permission denied: '/scratch'`.** Always patch the path post-rsync.

```bash
python << EOF
import json
from pathlib import Path

# Absolute iris-side checkpoint root; each run_dir is re-rooted under it.
iris_ckpt_base = Path('/iris/u/ankile/self-improving-robots-workspace/self-improving-robots/checkpoints').resolve()
for h in ('$HASH',):  # add more hashes if doing multiple handoffs at once
    meta_path = Path(f'checkpoints/_resume/{h}/resume_meta.json')
    m = json.loads(meta_path.read_text())
    # Keep only the final path component (the run's directory name).
    new_dir = iris_ckpt_base / Path(m['run_dir']).name
    print(f'{h}: {m["run_dir"]} -> {new_dir}')
    m['run_dir'] = str(new_dir)
    new_dir.mkdir(parents=True, exist_ok=True)  # pre-create so train.py doesn't try mid-run
    meta_path.write_text(json.dumps(m, indent=2))
EOF
```

This is the silent-failure mode: ExitCode=0:0, job COMPLETED in 28-30 seconds, error buried in the log. Verify the patch by `grep run_dir checkpoints/_resume/$HASH/resume_meta.json` — should show the iris path.
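
Because the failure is silent, a pre-submit check that goes beyond `grep` can be worthwhile. The following is a minimal sketch, assuming `resume_meta.json` carries a `run_dir` key as described above and that the Sherlock-side path does not exist on iris; `run_dir_problem` is a hypothetical helper, not part of the repo.

```python
import json
import os
from pathlib import Path
from typing import Optional


def run_dir_problem(meta_path: Path) -> Optional[str]:
    """Describe what is wrong with run_dir, or return None if it looks usable."""
    meta = json.loads(meta_path.read_text())
    run_dir = Path(meta["run_dir"])
    # Step 3b pre-creates the directory, so a missing one means "not patched".
    if not run_dir.is_dir():
        return f"run_dir does not exist on this host: {run_dir}"
    if not os.access(run_dir, os.W_OK):
        return f"run_dir is not writable: {run_dir}"
    return None
```

Run it on iris against `checkpoints/_resume/$HASH/resume_meta.json`; any non-`None` result means step 3b has not taken effect.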

### 4. Submit on iris-hi with same script and array index

```bash
sbatch --partition=iris-hi \
       --nodelist=iris9,iris10,iris-hgx-1 \
       --time=18:00:00 \
       --array=<seed-int> \
       <training-script-path>
```

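**`--nodelist=iris9,iris10,iris-hgx-1` is mandatory**: it keeps the job off iris-hgx-2 per [[iris_hgx2_hf_timeout]] (silent HF ConnectTimeout kills trainings in ~10 min). Note that stock Slurm documents `--nodelist` as requesting every listed host; if a single-node job is rejected or over-allocated, `--exclude=iris-hgx-2` expresses the same intent directly.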

### 5. Wait for it to start, watch the log

Once SLURM allocates the job, `tail` the log:
```bash
ls -lt logs/$(basename <training-script-path> .sh)-*.out | head -1
tail -30 <newest-log>
```

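**Confirm "Resuming from step <N>"** in the first 1-2 min after job start. If the log shows `step: 1000/300000` instead, the job started from scratch: either the hash dir didn't transfer or the `wandb.run_name` differs from the Sherlock run. Cancel the job, fix the cause, and resubmit.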
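
The log check can be automated with a small parser. This sketch assumes the log line reads `Resuming from step <N>` exactly as quoted above; the real message may differ in punctuation, so treat the regex and the name `resumed_step` as illustrative.

```python
import re
from typing import Optional

# Wording taken from this skill's description of the log; adjust if it differs.
RESUME_RE = re.compile(r"Resuming from step\s+(\d+)")


def resumed_step(log_text: str) -> Optional[int]:
    """Return the step from the first 'Resuming from step N' line, or None."""
    match = RESUME_RE.search(log_text)
    return int(match.group(1)) if match else None
```

Comparing the returned value with the `last_checkpoint_step` printed in step 1 confirms the job picked up the transferred state rather than an older one.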

### 6. Clean up race-pair tracking

- Update `.overnight_ledger.md` race-pair table — the iris-hi job is now the sole copy.
- If other races still have pending Sherlock copies, this is a good moment to consider cancelling them too (iris-hi is faster regardless).

## Common pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Rsync before scancel | Truncated resume_state.pt | scancel first, verify, then rsync |
| Wrong checkpoint dir path | "No such file" | Sherlock uses `$GROUP_SCRATCH/$USER/self-improving-robots/checkpoints/_resume/` (not `$GROUP_HOME`) |
| **Forgot to patch run_dir** in resume_meta.json | Silent fail ExitCode=0 in ~30s with `PermissionError: [Errno 13] Permission denied: '/scratch'` in log | **Always do step 3b** — rewrite `run_dir` to iris-side path before submitting |
| Submitted to iris main (default partition) | Job lands on flaky iris-hgx-2 or gets preempted | Always pass `--partition=iris-hi` and the `--nodelist` from step 4 |
| Same job array submitted on both clusters | Dispatcher sees 2 finished runs with same name → ambiguity error | scancel Sherlock BEFORE iris-hi submit |
| `_resume/<hash>/` already exists on iris | rsync may merge with stale state | OK if iris dir is from an earlier run with the same hash (same wandb.run_name); the rsync overwrites with newer Sherlock state |

## Mechanics reference

- Resume key: SHA1 of `{entity}/{project}/{run_name}` — generated by `AutoResumeManager` in `sir/training/resume_manager.py`.
- `resume_state.pt`: policy + optimizer + scheduler + rng + step. ~600MB for IDQL n=128 actor1024.
- `resume_meta.json`: small bookkeeping (`last_checkpoint_step`, run_id, `run_dir`, completed flag).
- Cluster paths:
  - Sherlock `_resume/`: `/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/`
  - iris `_resume/`: `./checkpoints/_resume/` (working dir at repo root)
- Auto-resume requires `cfg.training.resume_checkpoint_freq == cfg.wandb.log_freq` (typically 5000 steps). At resume_checkpoint_freq=5000, worst-case loss is 5000 steps ≈ 15 min on Sherlock.
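
To make the resume key concrete, here is a sketch of the hashing described in the first bullet. It assumes UTF-8 encoding and the full 40-character hex digest; `AutoResumeManager` may truncate or format the string differently, so the directory name found in step 1 remains the source of truth. The function name `resume_key` is hypothetical.

```python
import hashlib


def resume_key(entity: str, project: str, run_name: str) -> str:
    """SHA1 hex digest of 'entity/project/run_name' (assumed UTF-8, untruncated)."""
    return hashlib.sha1(f"{entity}/{project}/{run_name}".encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder entity/project; only the run name decides seed-level identity.
    print(resume_key("my-entity", "my-project", "example-run-seed-1"))
    print(resume_key("my-entity", "my-project", "example-run-seed-10"))
```

The two printed keys differ, which is why each seed gets its own `_resume/<hash>/` directory and why the run name must match exactly on resubmission.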

## Speed table for handoff ROI

| Source GPU | Target GPU | Speedup | Break-even ETA |
|---|---|---|---|
| Sherlock 2080Ti (19k/h) | iris-hi H100 (75k/h) | 4× | 12 min |
| Sherlock 2080Ti | iris-hi L40s (60k/h) | 3× | 16 min |
| Sherlock 2080Ti | Delta A100 (28k/h) | 1.5× | 25 min |
| Sherlock 2080Ti | Delta A40 (16k/h) | 0.85× | NO HANDOFF |

(Per [[bigactor_training_step_rates]]. Pascal Titan Xp/P100 run at ~1.8k/h; these need an immediate cancel+resubmit with the `[GPU_GEN:VLT|TUR|AMP|HPR|LOV]` constraint, not a handoff.)

## Reverse direction (iris → Sherlock)

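Rarely useful, since Sherlock is slower. Only do this if iris-hi is under sustained preemption pressure and Sherlock has free TUR/AMP slots. The procedure is the same with source and target paths swapped, including the `run_dir` rewrite in step 3b, which must then point at a Sherlock-side path.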

## Related skills + memories
- `/iris` — partition map, nodelist conventions
- `/sherlock` — gpu partition, scratch layout
- `/job-monitor` — uses this skill when handoff opportunity detected
- [[feedback_sherlock_to_iris_checkpoint_handoff]] — policy memo (when to migrate)
- [[iris_hgx2_hf_timeout]] — why drop iris-hgx-2
- [[sherlock_gpu_constraint]] — Pascal-avoidance for fresh Sherlock submits
- [[training_resume_mechanism]] — how wandb.run_name keys auto-resume
- [[bigactor_training_step_rates]] — speed table
