---
name: multi-node-slurm
description: Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
when_to_use: Writing or converting Slurm sbatch scripts, scaling to multiple nodes, debugging NCCL/launch failures, or investigating a commit that caused multi-node training failures; 'run on multiple nodes', 'sbatch script', 'NCCL timeout', 'multi-node OOM'.
---

# Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

## Two Approaches: srun-native vs uv run torch.distributed

| Approach | `ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| **srun-native** (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| **uv run torch.distributed** (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |

**Prefer srun-native**: simpler, avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.

## Cluster Environment

### Container

```bash
CONTAINER_IMAGE=".sqsh"
CONTAINER_MOUNTS=":,:/opt/Megatron-Bridge,:/opt/data"
```

### Standard Paths

```bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="/dclm_01_01_text_document"
```

### Tokens / Caches

```bash
export GH_TOKEN=
export HF_TOKEN=
export HF_HOME=/HF_HOME
export UV_CACHE_DIR="/uv_cache"
export NEMO_HOME="/cache/nemo"
```

**Important**: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs.
The default (`/root/.cache/nemo`) is container-local and not shared across nodes. Without a shared `NEMO_HOME`, packed-sequence data files prepared on node 0 are invisible to other nodes, causing `TypeError: 'NoneType' object is not an iterator`.

### Log Directory

```text
/logs/_
```

## srun-native Approach (Preferred)

Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.

### SBATCH Headers

```bash
#SBATCH --job-name=-
#SBATCH --nodes=
#SBATCH --ntasks-per-node=8     # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=
#SBATCH --partition=batch
#SBATCH --output=/logs/_%j.log
#SBATCH --exclusive
```

### Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.

```bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync && uv run --no-sync python "
```

### srun-native Key Points

- Phase 1 runs `uv sync` once on a single node/process, building all wheels into the shared cache on Lustre
- Phase 2's `uv sync` is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards
- `initialize.py` + `common_utils.py` auto-set `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars
- Env vars like `HF_TOKEN`, `HF_HOME`, `UV_CACHE_DIR` exported at sbatch level are inherited by srun tasks
- Reference: `examples/models/glm/glm_45v/slurm_sft.sh`,
  `examples/models/minimax/minimax_m2/slurm_conversion.sh`

---

## uv run torch.distributed Approach (Legacy)

Use when the script requires `torch.distributed.run` (e.g., MLM pretrain_gpt.py) or when Bridge's `initialize.py` is not in the call path.

### 1. Add SBATCH Headers

```bash
#SBATCH --job-name=-
#SBATCH --nodes=
#SBATCH --ntasks-per-node=1     # ALWAYS 1: torchrun handles per-node spawning
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=
#SBATCH --partition=batch
#SBATCH --output=/logs/_%j.log
#SBATCH --exclusive
```

**Critical**: `--ntasks-per-node=1`, NOT 8. `uv run python -m torch.distributed.run --nproc_per_node=8` spawns 8 processes per node. Using `ntasks-per-node=8` causes EADDRINUSE port collisions (8 tasks x 8 procs = 64 per node).

### 2. Convert to Multi-Node

Replace single-node:

```bash
uv run python -m torch.distributed.run --nproc_per_node=8 \