---
name: experiment-queue
description: SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
argument-hint: [manifest-or-grid-spec]
allowed-tools: Bash(*), Read, Grep, Glob, Edit, Write, Agent, Skill(run-experiment), Skill(monitor-experiment)
---

# Experiment Queue

Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.

## When to Use This Skill

Use when `/run-experiment` is insufficient:
- **≥10 jobs** that need batching across GPUs
- **Multi-seed sweeps** (e.g., 21 seeds × 12 cells)
- **Wave transitions** (run wave 1, wait, run wave 2, wait, run wave 3...)
- **Teacher+student chains** (train teacher then distill; auto-trigger student after teacher done)
- **OOM-prone configs** where you need to retry with different GPU or wait
- **Mixed seed grids** where failed cells need re-running

Do NOT use for:
- Single ad-hoc experiment (use `/run-experiment`)
- Modal/Vast.ai deployments (those have their own orchestration)
- Experiments that need manual inspection between runs

## Why This Exists

Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:

1. **Stale screens** — python finishes and the wandb upload completes, but the screen stays detached and blocks the next wave
2. **OOM on shared GPU** — previous job's memory not yet released
3. **Wave race** — new wave launches before previous wave fully settles
4. **Missing checkpoints** — student launches before teacher saved
5. **Parser duplication** — rewriting multi-seed analysis python every batch

All of these are pure engineering friction that a scheduler can automate away.

## Core Concepts

### Job Manifest

A manifest lists jobs with explicit state:

```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4

preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt

gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3

jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```

### Job State Machine

```
pending → running → completed
                 ↘ failed_oom → pending (after delay) [retry up to N]
                 ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → screen killed → completed (output exists) | failed_other
```
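
A minimal sketch of these transitions in code (field names such as `output_exists`, `oom_detected`, and `oom_attempts` are illustrative, not the actual `queue_manager.py` schema):

```python
from enum import Enum

class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED_OOM = "failed_oom"
    FAILED_OTHER = "failed_other"
    STUCK = "stuck"

def transition(job: dict, max_attempts: int) -> JobStatus:
    """Re-evaluate a job whose python process has exited."""
    if job.get("output_exists"):
        return JobStatus.COMPLETED
    if job.get("oom_detected"):
        # failed_oom jobs are requeued as pending after the configured delay,
        # unless the retry budget is already exhausted
        if job.get("oom_attempts", 0) >= max_attempts:
            return JobStatus.STUCK
        return JobStatus.FAILED_OOM
    return JobStatus.FAILED_OTHER
```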

### Wave Orchestration

A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
1. All current-wave python processes have exited
2. No stale screens remain for current-wave tags
3. GPU memory has dropped below `gpu_free_threshold_mib` (default 500 MiB)
4. Precondition checks pass for next-wave jobs
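
A sketch of the settlement check for the first three conditions, assuming the scheduler runs directly on the GPU host and each job carries an illustrative `screen_tag` field (next-wave precondition checks happen separately):

```python
import subprocess

def gpu_memory_used_mib(gpu_id: int) -> int:
    """Query memory.used for one GPU via nvidia-smi (runs on the remote host)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits",
         "-i", str(gpu_id)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def wave_settled(wave_jobs: list, gpus: list, threshold_mib: int = 500) -> bool:
    """True once every job has exited, no wave screens remain, and GPUs are idle."""
    if any(job["status"] == "running" for job in wave_jobs):
        return False
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    if any(job["screen_tag"] in screens for job in wave_jobs):
        return False
    return all(gpu_memory_used_mib(g) < threshold_mib for g in gpus)
```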

## Workflow

### Step 1: Parse Manifest / Build from Grid

Input can be:
- **YAML manifest** (explicit job list, recommended for complex cases)
- **Grid spec** (Cartesian product of param values, e.g., `N=[64,128,256] × n=[50K,150K,500K,652K]`)
- **Natural language description** (Claude parses into manifest)

Save the built manifest to `<project>/experiment_queue/<timestamp>/manifest.json` for reproducibility.

### Step 2: Pre-flight

- Check SSH connection works
- Check conda env exists on remote
- Check `cwd` exists on remote
- Check all preconditions (checkpoints, input files)
- Check GPU availability (at least `max_parallel` free GPUs)

If any precondition fails, show user which jobs are blocked and why.
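
A hedged sketch of the pre-flight, driven from the local side over plain SSH. The helper names are illustrative, precondition paths with `{N}`-style placeholders would need expanding first, and `conda env list` assumes conda is on the remote PATH (or that the manifest's `conda_hook` is sourced beforehand):

```python
import subprocess

def ssh_ok(host: str, cmd: str) -> bool:
    """Run a command on the remote host; True if it exits 0."""
    return subprocess.run(["ssh", host, cmd], capture_output=True).returncode == 0

def preflight(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means clear to launch."""
    host, problems = manifest["ssh"], []
    if not ssh_ok(host, "true"):
        problems.append(f"cannot reach {host} over SSH")
    if not ssh_ok(host, f"test -d {manifest['cwd']}"):
        problems.append(f"cwd missing: {manifest['cwd']}")
    if not ssh_ok(host, f"conda env list | grep -qw {manifest['conda']}"):
        problems.append(f"conda env missing: {manifest['conda']}")
    for pre in manifest.get("preconditions", []):
        if pre["type"] == "checkpoint_exists" and not ssh_ok(host, f"test -f {pre['path']}"):
            problems.append(f"missing checkpoint: {pre['path']}")
    return problems
```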

### Step 3: Launch Scheduler

Copy `tools/queue_manager.py` (bundled with this skill) to the remote host (e.g. `~/.aris_queue/`) and run it as a detached `nohup` process:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

The scheduler:
- Reads manifest
- Loops: for each pending job, assign to free GPU, launch via `screen`
- Polls job status (every 60s)
- Detects stale screens (python exited but screen detached → kill)
- Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
- Detects completion (expected output JSON/file exists) → mark completed
- Launches next wave when current wave settles
- Writes state to `queue_state.json` continuously
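
The loop structure, roughly. This is an illustrative skeleton rather than the bundled `queue_manager.py`; `refresh_running_job`, `preconditions_met`, `find_free_gpu`, and `launch_in_screen` are placeholders for the checks described above:

```python
import json
import time

def scheduler_loop(manifest: dict, state: dict, state_path: str, poll_s: int = 60) -> None:
    """Simplified poll loop; the real scheduler adds OOM retry and wave gating."""
    while True:
        for job in state["jobs"]:
            if job["status"] == "running":
                refresh_running_job(job)           # stale-screen / OOM / completion checks
            elif job["status"] == "pending" and preconditions_met(job, state):
                gpu = find_free_gpu(manifest)      # memory.used below gpu_free_threshold_mib
                if gpu is not None:
                    launch_in_screen(job, gpu)     # screen -dmS <tag> ...
                    job["status"] = "running"
        with open(state_path, "w") as f:           # persist state every pass
            json.dump(state, f, indent=2)
        if all(j["status"] in ("completed", "stuck") for j in state["jobs"]):
            return
        time.sleep(poll_s)
```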

### Step 4: Monitoring

User can check state anytime:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

Or invoke `/monitor-experiment` which reads the state file.

### Step 5: Post-completion

When every job in the queue is `completed` or `stuck`:
- Scheduler exits cleanly
- Write final summary to `<project>/experiment_queue/<timestamp>/summary.md`
- Invoke `/analyze-results` if `analyze_on_complete: true`

## Grid Spec Syntax

Instead of writing 36 job entries manually:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: "${seed}", n_hidden: "${N}", n_train_subset: "${n}"}
```

Expands to 36 jobs automatically.
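
Expansion is a plain Cartesian product plus string templating; a minimal sketch of what `build_manifest.py` could do for the grid above (everything beyond the field names shown is illustrative):

```python
from itertools import product
from string import Template

grid = {"N": [64, 128, 256], "n": [50000, 150000, 500000, 652000], "seed": [42, 200, 201]}
id_tpl = Template("s${seed}_N${N}_n${n}")

jobs = []
for values in product(*grid.values()):
    cell = dict(zip(grid.keys(), values))
    jobs.append({
        "id": id_tpl.substitute(cell),
        "args": {"seed": cell["seed"], "n_hidden": cell["N"], "n_train_subset": cell["n"]},
    })

assert len(jobs) == 3 * 4 * 3  # 36 jobs
```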

## Wave Chaining

For sequential phases (teacher → student):

```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

Scheduler enforces `depends_on`: `distill_students` jobs stay `pending` until all
`train_teachers` jobs are `completed`.
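
A sketch of the gating check, assuming each job in the state file carries an illustrative `phase` field:

```python
def phase_ready(phase: dict, state: dict) -> bool:
    """A phase may launch only once every job in the phase it depends on has completed."""
    dep = phase.get("depends_on")
    if dep is None:
        return True
    dep_jobs = [j for j in state["jobs"] if j["phase"] == dep]
    return bool(dep_jobs) and all(j["status"] == "completed" for j in dep_jobs)
```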

## OOM Handling

Detect OOM from the job's stdout/stderr log:
```regex
CUDA out of memory
```
The message text is the stable part to match on; the exception class is `torch.cuda.OutOfMemoryError` on older PyTorch releases and `torch.OutOfMemoryError` on newer ones.

On detection:
1. Mark job `failed_oom`
2. Kill screen
3. Wait `oom_retry.delay` seconds
4. Check if current GPU is free; if not, try another free GPU
5. Requeue as `pending`
6. Max `oom_retry.max_attempts` before marking `stuck`
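
A sketch of the retry bookkeeping; `kill_screen` is a placeholder, and rather than sleeping inside the handler the job records a `retry_at` timestamp so the poll loop stays responsive:

```python
import re
import time

OOM_PATTERN = re.compile(r"CUDA out of memory")  # message text is stable across PyTorch versions

def handle_oom(job: dict, oom_retry: dict, log_text: str) -> None:
    """Apply the retry policy above when an OOM shows up in a job's log."""
    if not OOM_PATTERN.search(log_text):
        return
    job["oom_attempts"] = job.get("oom_attempts", 0) + 1
    kill_screen(job["screen_tag"])                      # placeholder: screen -S <tag> -X quit
    if job["oom_attempts"] > oom_retry["max_attempts"]:
        job["status"] = "stuck"
        return
    job["status"] = "failed_oom"
    job["retry_at"] = time.time() + oom_retry["delay"]  # requeued as pending after the delay
    job["gpu"] = None                                   # next pass may pick any free GPU
```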

## Stale Screen Detection

Every 60s, for each running screen:
1. Check screen exists (`screen -ls`)
2. Check python PID still running (`ps -p`)
3. If screen exists but python exited:
   - If expected output file exists → mark `completed`, kill stale screen
   - If no output file → mark `failed_other`, kill screen
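
A sketch of that check, assuming each running job records an illustrative `screen_tag`, launcher `pid`, and `output_check` path (which may be a glob pattern):

```python
import glob
import subprocess

def check_stale(job: dict) -> None:
    """Reconcile a running job whose python process may have exited."""
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    screen_alive = job["screen_tag"] in screens
    pid_alive = subprocess.run(["ps", "-p", str(job["pid"])], capture_output=True).returncode == 0
    if screen_alive and not pid_alive:
        # python exited but the screen stayed detached, blocking the wave
        subprocess.run(["screen", "-S", job["screen_tag"], "-X", "quit"])
        job["status"] = "completed" if glob.glob(job["output_check"]) else "failed_other"
```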

## Resume-on-restart

If scheduler crashes / is killed:
1. Read `queue_state.json`
2. For each `running` job: check screen; if still alive, keep; if not, re-evaluate state
3. For each `pending`: continue normally
4. Idempotent: safe to restart scheduler without losing state
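
A sketch of the resume path; `screen_alive` and `output_exists` are placeholders for the same checks used during normal polling:

```python
import json

def resume(state_path: str) -> dict:
    """Reload queue_state.json and re-evaluate jobs that were running at crash time."""
    with open(state_path) as f:
        state = json.load(f)
    for job in state["jobs"]:
        if job["status"] == "running" and not screen_alive(job["screen_tag"]):
            # the screen died while the scheduler was down; fall back to the usual
            # output-file check to decide completed vs failed_other
            job["status"] = "completed" if output_exists(job) else "failed_other"
    return state
```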

## Output: Summary Report

```markdown
# Experiment Queue Summary

**Project**: dllm_distill
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck

## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

## Results Files
- 42 JSON files in `figures/pcdistill_sw_*.json`

## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)
```

## Comparison with `/run-experiment`

| Feature | `/run-experiment` | `experiment-queue` |
| --- | --- | --- |
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |

**Rule**: Use `/run-experiment` for ≤5 jobs. Use `experiment-queue` for ≥10 jobs or anything with phases.

## Key Rules

- **Never overlap screens on the same GPU** — always wait for `memory.used` to drop below `gpu_free_threshold_mib` (default 500 MiB) before launching a new job
- **Always write state to disk** — every state change flushed to `queue_state.json`
- **Idempotent scheduler** — safe to restart; picks up from state file
- **Expected-output-based completion** — don't trust screen state alone; verify output file exists
- **Bounded retry** — max N OOM retries, then mark `stuck` and alert
- **Dependencies enforced at launch** — never launch student before teacher checkpoint exists

## Known Failure Modes

- **SSH connection drop during scheduling**: scheduler keeps running on remote (nohup), just reconnect and check
- **GPU reservation by another user**: scheduler waits, does not pre-empt
- **Disk full on remote**: scheduler detects write failure, marks all pending `stuck`, alerts

## Example Session

User: "跑 T5+T6 全部实验：T5 = N∈{80,192} × n 4 values × seed {200,201}, T6 = N∈{384,512} × n 4 values × seed {42,200,201}; T6 需要先 train teacher"

Claude invokes `/experiment-queue`:
1. Parses description into 2-phase manifest
2. Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
3. Phase 2: T6 distillation (24 jobs, depends on teachers)
4. Deploys scheduler via nohup
5. Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"

The user can then check progress at any time or wait for the summary report.

## See Also

- `/run-experiment` — single experiment deployment
- `/monitor-experiment` — check progress (now reads from queue_state.json)
- `/analyze-results` — post-hoc analysis
- `tools/queue_manager.py` (bundled) — the scheduler implementation
- `tools/build_manifest.py` (bundled) — build manifest from grid spec

## Rationale / Source

Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day
multi-seed paper experiment session:

- Wall-clock sink: stale screens, OOM, wave transitions, manual parser
- Token sink: re-writing orchestration code each session
- Cognitive sink: tracking which cells succeeded, which failed, which to retry

This skill targets the wall-clock sink specifically; see `artifact-sync` and
`paper-fix-auto-apply` for the other two.
