---
name: cluster-scripts
description: Generate Slurm job scripts and submission scripts for the Snellius HPC cluster from DVC pipeline stages. Use this skill whenever the user mentions "cluster", "snellius", "slurm", "sbatch", "submit job", "run on cluster", "HPC", or wants to run experiments on a remote GPU cluster. Also trigger when the user adds new DVC stages and might need corresponding cluster scripts, or asks about job submission, array jobs, or cluster resource allocation.
---

# Generating Snellius Cluster Scripts from DVC Stages

This skill translates DVC pipeline stages (defined in `dvc.yaml`) into Slurm batch scripts for the Snellius HPC cluster. The project already has a working set of cluster scripts in `cluster/` -- read them first to stay consistent with established conventions.

## Before writing anything

1. Read `dvc.yaml` to understand which stages need cluster scripts
2. Read `params.yaml` to know which parameters each stage uses
3. Read the existing cluster scripts in `cluster/` to match the project's conventions
4. Identify the DVC stage's `cmd`, `params`, `deps`, and `outs` -- these determine the job script's command, parameter reads, and output paths

## Architecture: Two-tier orchestration

DVC defines the *logical* pipeline (what to run, with what params, producing what outputs). Cluster scripts define the *physical* execution (resource allocation, environment setup, Slurm orchestration). The cluster scripts replicate DVC's commands but replace `uv run python` with direct `python` calls (the venv has everything installed).

There are three kinds of scripts, each with a distinct role:

### 1. Job scripts (`job_.sh`)

These are the actual Slurm batch scripts that run on compute nodes. Each one handles a single *environment* (like T-maze or epistemic maze) but dispatches to different stages via environment variables.

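As a rough illustration of dispatching on an environment variable, the sketch below selects a command from `STAGE_TYPE`. The stage names `train` and `evaluate` and the script paths are hypothetical placeholders, not stages taken from this project's `dvc.yaml`, and the function only echoes the command it would run so it can be tried without a cluster.

```bash
#!/usr/bin/env bash
# Minimal sketch of env-var dispatch inside a job script.
# ASSUMPTION: "train" and "evaluate" and the scripts/*.py paths are
# hypothetical names used only for illustration.
set -euo pipefail

dispatch_stage() {
    # ${VAR:?msg} aborts with msg if STAGE_TYPE is unset or empty.
    case "${STAGE_TYPE:?STAGE_TYPE not set}" in
        train)
            echo "would run: python scripts/train.py"
            ;;
        evaluate)
            # Optional secondary variable with a fallback value.
            echo "would run: python scripts/evaluate.py --mode ${INFERENCE_MODE:-default}"
            ;;
        *)
            # Reject unknown stages instead of silently doing nothing.
            echo "Unknown STAGE_TYPE: ${STAGE_TYPE}" >&2
            return 1
            ;;
    esac
}
```

In a real job script the `echo` lines would be the actual `python` invocations, and the variables would arrive from the submit script through `sbatch --export`. The catch-all `*)` arm is a design choice worth keeping: a typo in `STAGE_TYPE` then fails the job immediately rather than letting it exit successfully with no output.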
**The dispatch pattern:** Rather than writing one job script per DVC stage, group related stages into a single job script that uses a `case` statement on `STAGE_TYPE` (and optionally `ANALYSIS`, `INFERENCE_MODE`, `STRATEGY`) to select what to run. This keeps the cluster directory manageable and makes the relationship between stages explicit.

**When to use a separate job script instead:** If the stage needs fundamentally different Slurm resources (different memory, time limits, array jobs), it needs its own job script. For example, MiniGrid episodes need 64GB RAM and array jobs, while aggregation needs 1 CPU and 4GB -- these can't share a script because SBATCH directives are fixed at submission time (though they can be overridden with `sbatch` flags).

### 2. Submit scripts (`submit_.sh`)

These run on the login node and orchestrate job submission. They:

- Check if outputs already exist (skip completed stages)
- Submit jobs with the right env vars via `sbatch --export`
- Wire up dependencies between jobs via `--dependency=afterok:`
- Collect job IDs for dependency chaining

### 3. Top-level dispatcher (`submit_all.sh`)

Delegates to environment-specific submit scripts. Handles MiniGrid's array job logic inline because it's structurally different from the dispatch-based environments.

## Template: Job script

```bash
#!/usr/bin/env bash
#SBATCH --job-name=aif-
#SBATCH --partition=gpu_a100
#SBATCH --gpus=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=16G
#SBATCH --time=00:30:00
#SBATCH --output=logs/_%x_%j.out
#SBATCH --error=logs/_%x_%j.err
#
# Dispatches based on STAGE_TYPE env var:
#   -- requires ()
#   -- requires ()
#   ...

set -euo pipefail

PROJECT_DIR="${SLURM_SUBMIT_DIR:-.}"
cd "$PROJECT_DIR"
mkdir -p logs

export JAX_PLATFORMS="cuda"
source cluster/setup_env.sh

echo "Running ${STAGE_TYPE} on $(hostname) at $(date)"
python -c "import jax; print(f'JAX devices: {jax.devices()}')"

case "${STAGE_TYPE:?STAGE_TYPE not set}" in
    )
        python scripts//