---
name: fabric-gpu
description: Provision GPU nodes on FABRIC with driver installation and CUDA setup
allowed-tools:
  - Read
  - Grep
  - Glob
  - Write
  - Edit
  - Bash
---

# Instructions

When invoked, generate code to provision a FABRIC node with GPU(s) and install drivers. The workflow is:

1. Find a site with available GPUs
2. Create a slice with a GPU node
3. Submit and wait for the slice
4. Install NVIDIA/CUDA drivers
5. Reboot the node so the driver loads
6. Verify with `nvidia-smi`

Before generating code, ask the user which GPU model they need and whether they need the full CUDA toolkit or only the NVIDIA driver.

# API Reference

## Available GPU Models

| Model String | GPU | Filter Field |
|-------------|-----|-------------|
| `GPU_TeslaT4` | NVIDIA Tesla T4 | `tesla_t4_available` |
| `GPU_RTX6000` | NVIDIA RTX 6000 | `rtx6000_available` |
| `GPU_A30` | NVIDIA A30 | `a30_available` |
| `GPU_A40` | NVIDIA A40 | `a40_available` |

## Available FPGA Models

| Model String | FPGA | Filter Field |
|-------------|------|-------------|
| `FPGA_Xilinx_U280` | Xilinx Alveo U280 | `fpga_u280_available` |
| `FPGA_Xilinx_SN1022` | Xilinx SN1022 | `fpga_sn1022_available` |

## Finding GPU Sites

```python
fablib.get_random_site(
    filter_function=lambda x: x["rtx6000_available"] > 0
)
```
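
The model strings and filter fields in the GPU table can be kept in one lookup so the two never drift apart. The following is a minimal sketch: `GPU_FILTER_FIELDS` and `make_gpu_filter` are hypothetical helper names, not part of fablib, and the sketch assumes the filter function receives a dict-like site record, as in the lambda shown in this section.

```python
# Model string -> site filter field, copied from the GPU table in this document.
GPU_FILTER_FIELDS = {
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_RTX6000": "rtx6000_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available",
}


def make_gpu_filter(model, count=1):
    """Return a filter function matching sites with at least `count` free GPUs.

    Hypothetical helper: the returned callable takes one dict-like site record.
    """
    try:
        field = GPU_FILTER_FIELDS[model]
    except KeyError:
        known = ", ".join(sorted(GPU_FILTER_FIELDS))
        raise ValueError(f"Unknown GPU model {model!r}; expected one of: {known}") from None
    return lambda site: site[field] >= count
```

The result can be passed directly as `filter_function=make_gpu_filter("GPU_A30")` to `fablib.get_random_site`.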

## Adding GPU to Node

```python
node.add_component(model="GPU_RTX6000", name="gpu1")
```

# Patterns

## Complete GPU Node with CUDA Driver Install

```python
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()

# Step 1: Find a site with the desired GPU
GPU_MODEL = "GPU_RTX6000"
GPU_FILTER_FIELD = "rtx6000_available"

site = fablib.get_random_site(
    filter_function=lambda x: x[GPU_FILTER_FIELD] > 0
)
print(f"Selected site: {site}")

# Step 2: Create slice
slice = fablib.new_slice(name="gpu-experiment")

node = slice.add_node(
    name="gpu-node",
    site=site,
    cores=8,
    ram=32,
    disk=100,
    image="default_ubuntu_22",
)
node.add_component(model=GPU_MODEL, name="gpu1")

# Step 3: Submit (by default, submit() blocks until the slice is provisioned)
slice.submit()
print("Slice is ready!")
```

```python
# Step 4: Install NVIDIA CUDA drivers
node = slice.get_node(name="gpu-node")

# Verify GPU PCI device is visible
stdout, stderr = node.execute("sudo apt-get install -y -q pciutils && lspci | grep -i 'nvidia\\|3d controller'")
print(stdout)

# Install prerequisites
commands = [
    "sudo apt-get -q update",
    "sudo apt-get -q install -y linux-headers-$(uname -r) gcc",
]
for cmd in commands:
    node.execute(cmd)

# Install CUDA (adjust distro/version as needed)
distro = "ubuntu2204"
version = "12.6"
architecture = "x86_64"

commands = [
    f"wget https://developer.download.nvidia.com/compute/cuda/repos/{distro}/{architecture}/cuda-keyring_1.1-1_all.deb",
    "sudo dpkg -i cuda-keyring_1.1-1_all.deb",
    "sudo apt-get -q update",
    f"sudo apt-get -q install -y cuda-{version.replace('.', '-')}",
]
for cmd in commands:
    stdout, stderr = node.execute(cmd)

# Step 5: Reboot to load driver
node.execute("sudo reboot")

# Wait for node to come back
slice.wait_ssh(timeout=360, interval=10, progress=True)
slice.update()
slice.test_ssh()

# Step 6: Verify
stdout, stderr = node.execute("nvidia-smi")
print(stdout)
```
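
The install loops above ignore failures, so a broken `apt-get` step only surfaces later when `nvidia-smi` is missing. One possible way to stop early is sketched below. It assumes only that `node.execute(cmd)` returns a `(stdout, stderr)` tuple of strings, as used throughout this document. `run_checked` and the `__RC__` marker are hypothetical names, and the exit status is recovered by echoing `$?` after each command rather than through any fablib feature.

```python
def run_checked(node, commands, marker="__RC__"):
    """Run shell commands in order on `node`; raise on the first failure.

    Hypothetical helper. Each command is wrapped in a subshell and followed
    by `echo "<marker>$?"`, so the exit status appears at the end of stdout.
    """
    outputs = []
    for cmd in commands:
        wrapped = f'( {cmd} ); echo "{marker}$?"'
        stdout, stderr = node.execute(wrapped)
        # Split off the trailing marker; `sep` is empty if it never appeared.
        body, sep, status = stdout.rpartition(marker)
        status = status.strip() if sep else "unknown"
        if status != "0":
            raise RuntimeError(f"Command failed (exit {status}): {cmd}\n{stderr}")
        outputs.append(body)
    return outputs
```

With this in place, a loop such as `for cmd in commands: node.execute(cmd)` could be replaced by `run_checked(node, commands)`. The reboot command should stay outside the helper, because the SSH session is expected to drop before any status is printed.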

## Quick GPU Setup (Ubuntu 24.04)

```python
# Driver-only install (no CUDA toolkit) for Ubuntu 24.04
node = slice.get_node(name="gpu-node")

commands = [
    "sudo apt-get update -qq",
    "sudo apt-get install -y -qq linux-headers-$(uname -r)",
    "sudo apt-get install -y -qq nvidia-driver-535",
]
for cmd in commands:
    node.execute(cmd)

node.execute("sudo reboot")
slice.wait_ssh(timeout=360, interval=10, progress=True)

stdout, stderr = node.execute("nvidia-smi")
print(stdout)
```
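
Printing the `nvidia-smi` table is fine for a human check, but generated code often needs to assert on the result. The sketch below uses `nvidia-smi`'s `--query-gpu` and `--format=csv,noheader` options to get one CSV line per GPU and parses it into dicts. `QUERY_FIELDS`, `NVIDIA_SMI_QUERY`, and `parse_gpu_query` are hypothetical names introduced for illustration.

```python
import csv
import io

# Fields requested from nvidia-smi; one CSV column per field, one row per GPU.
QUERY_FIELDS = ("name", "driver_version", "memory.total")
NVIDIA_SMI_QUERY = (
    "nvidia-smi --query-gpu=" + ",".join(QUERY_FIELDS) + " --format=csv,noheader"
)


def parse_gpu_query(stdout):
    """Parse the CSV output of NVIDIA_SMI_QUERY into a list of dicts."""
    rows = csv.reader(io.StringIO(stdout), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]
```

A possible use is `stdout, stderr = node.execute(NVIDIA_SMI_QUERY)` followed by `gpus = parse_gpu_query(stdout)`, after which `len(gpus)` can be compared with the number of GPU components requested.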

## Tesla T4 Node

```python
site = fablib.get_random_site(
    filter_function=lambda x: x["tesla_t4_available"] > 0
)

slice = fablib.new_slice(name="t4-experiment")
node = slice.add_node(name="t4-node", site=site, cores=8, ram=32, disk=100, image="default_ubuntu_22")
node.add_component(model="GPU_TeslaT4", name="gpu1")
slice.submit()
```

## Multiple GPUs on One Node

```python
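# The filter checks the site-wide count; both GPUs must also be free on the
# same worker host for the request to succeed.
site = fablib.get_random_site(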
    filter_function=lambda x: x["rtx6000_available"] >= 2
)

slice = fablib.new_slice(name="multi-gpu")
node = slice.add_node(name="gpu-node", site=site, cores=16, ram=64, disk=200, image="default_ubuntu_22")
node.add_component(model="GPU_RTX6000", name="gpu1")
node.add_component(model="GPU_RTX6000", name="gpu2")

slice.submit()
```
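
When the GPU count is a user-supplied parameter, the repeated `add_component` calls can be generated in a loop. This is a small sketch that relies only on the `node.add_component(model=..., name=...)` call shown above. `add_gpus` is a hypothetical helper, and the `gpu1`..`gpuN` naming simply follows the convention used in this document.

```python
def add_gpus(node, model, count):
    """Attach `count` GPUs of `model` to `node`, named gpu1..gpuN.

    Hypothetical helper; returns whatever add_component returns for each GPU.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        node.add_component(model=model, name=f"gpu{i}")
        for i in range(1, count + 1)
    ]
```

For the example above, `add_gpus(node, "GPU_RTX6000", 2)` would issue the same two calls.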

## GPU Node on Specific Host

```python
site = "CERN"
worker = "cern-w6.fabric-testbed.net"

slice = fablib.new_slice(name="gpu-specific-host")
node = slice.add_node(
    name="gpu-node",
    site=site,
    host=worker,
    cores=10,
    ram=32,
    disk=100,
    image="default_ubuntu_22",
)
node.add_component(model="GPU_A30", name="gpu1")
slice.submit()
```

## FPGA Provisioning

```python
site = fablib.get_random_site(
    filter_function=lambda x: x["fpga_u280_available"] > 0
)

slice = fablib.new_slice(name="fpga-experiment")
node = slice.add_node(name="fpga-node", site=site, cores=8, ram=32, disk=100)
node.add_component(model="FPGA_Xilinx_U280", name="fpga1")
slice.submit()
```

## Driver Install for CentOS/Rocky

```python
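# For CentOS Stream 9 or Rocky 9
node = slice.get_node(name="gpu-node")  # node name as used in the patterns above

commands = [
    # dnf-plugins-core provides the `dnf config-manager` subcommand used below
    "sudo dnf install -y epel-release dnf-plugins-core",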
    "sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo",
    "sudo dnf install -y kernel-devel kernel-headers",
    "sudo dnf install -y nvidia-driver",
]
for cmd in commands:
    stdout, stderr = node.execute(cmd)

node.execute("sudo reboot")
slice.wait_ssh(timeout=360, interval=10, progress=True)

stdout, stderr = node.execute("nvidia-smi")
print(stdout)
```

## Cleanup

```python
slice.delete()
```
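
GPUs are a scarce resource, so it can help to guarantee that the slice is deleted even when a driver install step raises. A minimal sketch using a context manager follows. `cleanup_on_exit` is a hypothetical helper, and it assumes only that the slice object has the `delete()` method shown above.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_on_exit(slice_obj):
    """Yield `slice_obj` and always call its delete() when the block exits.

    Hypothetical helper: delete() runs on both normal exit and exceptions,
    so it is only suitable for short-lived, scripted experiments.
    """
    try:
        yield slice_obj
    finally:
        slice_obj.delete()
```

Usage would look like `with cleanup_on_exit(fablib.new_slice(name="gpu-experiment")) as slice:` wrapped around the provisioning and install steps. For interactive work where the slice should outlive the script, calling `slice.delete()` explicitly remains the simpler choice.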
