---
name: sherlock-handoff
description: Migrate an in-progress Sherlock training to iris-hi by rsync-ing the resume checkpoint and re-submitting with the same wandb.run_name so it picks up where Sherlock left off. ~4× speedup (Sherlock 2080Ti → iris-hi H100). Invoke when a Sherlock training is end-to-end blocking AND iris-hi has a free QOS slot.
disable-model-invocation: false
argument-hint: --run-name --script --array [--from sherlock] [--to iris-hi]
---

# `/sherlock-handoff`: Live Training Migration

Cancels a running Sherlock training, rsyncs its `_resume/<hash>/` state to iris, and resubmits on iris-hi with the same `wandb.run_name`. The training's `AutoResumeManager` (see `sir.training.resume_manager`) keys resume identity by a hash of the W&B entity, project, and run_name, so the resubmitted job matches the rsync'd directory, loads `resume_state.pt`, and continues from `last_checkpoint_step`.

**Arguments**: `$ARGUMENTS`

## When to use

ALL of these must hold:

1. The Sherlock training is **on the critical path** (a missing seed blocks pillar completion or the user's end-to-end timeline).
2. **iris-hi has a free QOS slot** (cap=6 GPU/user). Check with `squeue -u $USER -p iris-hi -h | wc -l`.
3. **Sherlock remaining ETA > 30 min**. Math: compare `(300000 - current_step) / sherlock_steps_per_hour` with `(300000 - current_step) / iris_hi_steps_per_hour`. Sherlock 2080Ti runs ~19k steps/h, iris-hi H100 ~75k/h (4× speedup), iris-hi L40s ~60k/h (3×). For a Sherlock step < 250k, the handoff is almost always a win.

If any gate fails, **do not hand off**: the migration overhead is wasted.

## Procedure (6 steps plus 3b, in order)

### 1. Confirm the training and find its resume hash on Sherlock

```bash
RUN_NAME="<run_name>"  # e.g. idql-n128-square-d1-blind-mining-v2u-v2-r15-actor1024-ch5-seed-1

# Find resume hash by grepping resume_meta.json
ssh sherlock "for h in \$(ls /scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/ 2>/dev/null); do
  meta=/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/\$h/resume_meta.json
  if [ -f \"\$meta\" ] && grep -q \"$RUN_NAME\" \"\$meta\"; then
    step=\$(python -c \"import json; d=json.load(open('\$meta')); print(d['last_checkpoint_step'])\")
    echo \"\$h step=\$step\"
  fi
done"
```

You should see exactly one hash with its `last_checkpoint_step`. If none appears, the run has not checkpointed yet: abort the handoff.

### 2. scancel Sherlock FIRST (avoid mid-write rsync)

```bash
ssh sherlock "scancel <job_id>"
```

Verify with `ssh sherlock "squeue -j <job_id>"`; the output should be empty.

### 3. rsync `_resume/<hash>/` to iris

```bash
HASH="<hash>"
rsync -av --quiet \
  "sherlock:/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/$HASH/" \
  "checkpoints/_resume/$HASH/"
```

Verify:

```bash
ls -la "checkpoints/_resume/$HASH/"
# Expect resume_meta.json + resume_state.pt (~600MB)
```

### 3b. ⚠ Rewrite `run_dir` in `resume_meta.json` to an iris-side path

**The Sherlock `resume_meta.json` has `run_dir` pointing at `/scratch/groups/cbfinn/...`, which iris can't `mkdir`. The training will crash in ~30s with `PermissionError: [Errno 13] Permission denied: '/scratch'`.** Always patch the path post-rsync.

```bash
python << EOF
import json
from pathlib import Path

iris_ckpt_base = Path('/iris/u/ankile/self-improving-robots-workspace/self-improving-robots/checkpoints').resolve()
for h in ('$HASH',):  # add more hashes if doing multiple handoffs at once
    meta_path = Path(f'checkpoints/_resume/{h}/resume_meta.json')
    m = json.loads(meta_path.read_text())
    # Keep the run directory's name, but re-root it under the iris checkpoint base
    new_dir = iris_ckpt_base / Path(m['run_dir']).name
    print(f'{h}: {m["run_dir"]} -> {new_dir}')
    m['run_dir'] = str(new_dir)
    new_dir.mkdir(parents=True, exist_ok=True)  # pre-create so train.py doesn't try mid-run
    meta_path.write_text(json.dumps(m, indent=2))
EOF
```

Skipping this patch produces the silent-failure mode: ExitCode=0:0, job COMPLETED in 28-30 seconds, error buried in the log. Verify the patch with `grep run_dir checkpoints/_resume/$HASH/resume_meta.json`; it should show the iris path.

### 4. Submit on iris-hi with same script and array index

```bash
sbatch --partition=iris-hi \
  --nodelist=iris9,iris10,iris-hgx-1 \
  --time=18:00:00 \
  --array=<array_index> \
  <script>
```

**`--nodelist=iris9,iris10,iris-hgx-1` is mandatory**: it omits iris-hgx-2 per [[iris_hgx2_hf_timeout]] (a silent HF ConnectTimeout kills trainings in ~10 min).

### 5. Wait for it to start, watch the log

Once SLURM allocates the job, `tail` the log:

```bash
ls -lt logs/$(basename <script> .sh)-*.out | head -1
tail -30 <log_file>
```

**Confirm "Resuming from step <N>"** in the first 1-2 min after job start. If the log shows `step: 1000/300000` instead of a resume message, the hash dir didn't transfer: abort and restart.

### 6. Clean up race-pair tracking

- Update the `.overnight_ledger.md` race-pair table: the iris-hi job is now the sole copy.
- If you had Sherlock pending versions of other races, this is a good moment to evaluate cancelling them too (iris-hi is faster regardless).

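
### Optional: scripting the step 5 log check

The manual check in step 5 could be automated along these lines. This is a minimal sketch, not part of the skill: the function name `check_resume` and both regular expressions are assumptions, built only from the two log fragments quoted in step 5 (`Resuming from step` and `step: 1000/300000`). Adjust the patterns if the real log lines differ.

```python
import re

# Assumed log formats, inferred from the fragments quoted in step 5.
RESUME_RE = re.compile(r"Resuming from step\s+(\d+)")
PROGRESS_RE = re.compile(r"step:\s*(\d+)/(\d+)")


def check_resume(log_text: str, expected_step: int) -> str:
    """Classify a log excerpt as 'resumed', 'wrong_step', 'fresh_start', or 'unknown'."""
    resumed = RESUME_RE.search(log_text)
    if resumed:
        # A resume message was printed; make sure it matches the Sherlock checkpoint.
        return "resumed" if int(resumed.group(1)) == expected_step else "wrong_step"
    progress = PROGRESS_RE.search(log_text)
    if progress and int(progress.group(1)) < expected_step:
        # Progress is being logged below the checkpoint step with no resume message.
        return "fresh_start"
    return "unknown"


if __name__ == "__main__":
    print(check_resume("Resuming from step 185000", expected_step=185000))
    print(check_resume("step: 1000/300000", expected_step=185000))
```

Feed it the tail of the job log together with the `last_checkpoint_step` printed in step 1. Anything other than `resumed` is a reason to stop and inspect the log by hand.
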
## Common pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Rsync before scancel | Truncated resume_state.pt | scancel first, verify, then rsync |
| Wrong checkpoint dir path | "No such file" | Sherlock uses `$GROUP_SCRATCH/$USER/self-improving-robots/checkpoints/_resume/` (not `$GROUP_HOME`) |
| **Forgot to patch run_dir** in resume_meta.json | Silent fail ExitCode=0 in ~30s with `PermissionError: [Errno 13] Permission denied: '/scratch'` in log | **Always do step 3b**: rewrite `run_dir` to an iris-side path before submitting |
| Submitted to iris main (default partition) | Job lands on flaky iris-hgx-2 or gets preempted | Always use `--partition=iris-hi` plus the step 4 `--nodelist`, which omits iris-hgx-2 |
| Same job array submitted on both clusters | Dispatcher sees 2 finished runs with same name → ambiguity error | scancel Sherlock BEFORE iris-hi submit |
| `_resume/<hash>/` already exists on iris | rsync may merge with stale state | OK if the iris dir is from an earlier run with the same hash (same wandb.run_name); the rsync overwrites it with the newer Sherlock state |

## Mechanics reference

- Resume key: SHA1 of `{entity}/{project}/{run_name}`, generated by `AutoResumeManager` in `sir/training/resume_manager.py`.
- `resume_state.pt`: policy + optimizer + scheduler + rng + step. ~600MB for IDQL n=128 actor1024.
- `resume_meta.json`: small bookkeeping (`last_checkpoint_step`, run_id, `run_dir`, completed flag).
- Cluster paths:
  - Sherlock `_resume/`: `/scratch/groups/cbfinn/ankile/self-improving-robots/checkpoints/_resume/`
  - iris `_resume/`: `./checkpoints/_resume/` (working dir at repo root)
- Auto-resume requires `cfg.training.resume_checkpoint_freq == cfg.wandb.log_freq` (typically 5000 steps). At resume_checkpoint_freq=5000, the worst-case loss is 5000 steps ≈ 15 min on Sherlock.

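
To make the resume-key bullet concrete, here is a minimal sketch of how such a key could be derived. It assumes UTF-8 encoding and the full 40-character hex digest; the actual `AutoResumeManager` may encode, truncate, or format the hash differently, so treat `resume_key` as a hypothetical helper and compare its output against a real `_resume/` directory name before relying on it. The entity and project values in the usage line are placeholders.

```python
import hashlib


def resume_key(entity: str, project: str, run_name: str) -> str:
    """Hypothetical helper: SHA1 hex digest of '{entity}/{project}/{run_name}'."""
    identity = f"{entity}/{project}/{run_name}"
    # Assumption: UTF-8 bytes and an untruncated hex digest.
    return hashlib.sha1(identity.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder entity/project; substitute the real W&B values.
    print(resume_key("my-entity", "my-project", "my-run-seed-1"))
```

Because the key depends only on the entity, project, and run name, resubmitting with the same `wandb.run_name` on iris-hi resolves to the same `_resume/` subdirectory that was rsync'd from Sherlock.
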
## Speed table for handoff ROI

| Source GPU | Target GPU | Speedup | Break-even ETA |
|---|---|---|---|
| Sherlock 2080Ti (19k/h) | iris-hi H100 (75k/h) | 4× | 12 min |
| Sherlock 2080Ti | iris-hi L40s (60k/h) | 3× | 16 min |
| Sherlock 2080Ti | Delta A100 (28k/h) | 1.5× | 25 min |
| Sherlock 2080Ti | Delta A40 (16k/h) | 0.85× | NO HANDOFF |

(Per [[bigactor_training_step_rates]]. Pascal Titan Xp/P100 run at ~1.8k/h; these need an immediate cancel+resubmit with a `[GPU_GEN:VLT\|TUR\|AMP\|HPR\|LOV]` constraint, not a handoff.)

## Reverse direction (iris → Sherlock)

Rarely useful, because Sherlock is slower. Only do this if iris-hi is under sustained preemption pressure and Sherlock has free TUR/AMP slots. Use the same procedure but swap the source and target paths.

## Related skills + memories

- `/iris`: partition map, nodelist conventions
- `/sherlock`: gpu partition, scratch layout
- `/job-monitor`: uses this skill when a handoff opportunity is detected
- [[feedback_sherlock_to_iris_checkpoint_handoff]]: policy memo (when to migrate)
- [[iris_hgx2_hf_timeout]]: why drop iris-hgx-2
- [[sherlock_gpu_constraint]]: Pascal-avoidance for fresh Sherlock submits
- [[training_resume_mechanism]]: how wandb.run_name keys auto-resume
- [[bigactor_training_step_rates]]: speed table
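
## Worked example: handoff ETA comparison

The ETA gate from "When to use" and the rates in the speed table can be turned into a quick calculation. This is a rough sketch under stated assumptions: `remaining_minutes` and `handoff_eta` are hypothetical helper names, the defaults are the approximate 2080Ti and H100 rates quoted above, and migration overhead (scancel, rsync, queue wait) is deliberately not modeled, so the saved time is an upper bound.

```python
TOTAL_STEPS = 300_000  # training length used throughout this document


def remaining_minutes(current_step: int, steps_per_hour: float,
                      total_steps: int = TOTAL_STEPS) -> float:
    """Minutes left at a given rate; never negative."""
    return max(total_steps - current_step, 0) / steps_per_hour * 60


def handoff_eta(current_step: int,
                source_rate: float = 19_000,   # approx. Sherlock 2080Ti steps/h
                target_rate: float = 75_000,   # approx. iris-hi H100 steps/h
                min_source_eta_min: float = 30) -> dict:
    """Compare remaining time on source vs. target and apply the 30-minute gate."""
    src = remaining_minutes(current_step, source_rate)
    tgt = remaining_minutes(current_step, target_rate)
    return {
        "source_eta_min": src,
        "target_eta_min": tgt,
        "saved_min": src - tgt,  # upper bound: ignores migration overhead
        "eta_gate_ok": src > min_source_eta_min,
    }


if __name__ == "__main__":
    for step in (150_000, 295_000):
        print(step, handoff_eta(step))
```

At step 150k the Sherlock job still needs several hours, so the gate passes; at step 295k it has roughly a quarter of an hour left and the gate fails, which matches the "do not hand off" rule.
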