---
name: nvidia-datacenter-bringup
description: Bring up NVIDIA HGX/DGX datacenter GPU hosts on Ubuntu 24.04 LTS — air-gapped or connected, Secure Boot enabled. Covers B300/B200/H100/A100/L40S/L4 driver+fabricmanager+NVLSM+DOCA-OFED install order and exact package set from NVIDIA CUDA repo + DOCA repo. Triggers on B300/B200/HGX/DGX install, "fabricmanager won't start", "system not yet initialized" / cudaErrorSystemNotReady, NVLSM missing, ib_umad not loading, DOCA-OFED before NVIDIA driver, nvidia-driver-pinning-XXX, nvlink5-XXX, nvidia-open vs cuda-drivers, "Blackwell requires open kernel modules", ConnectX-7/8 bridge device, FM exact-version-match, gpu-operator cuda-validator CrashLoopBackOff, B300 PCI ID 0x3182, air-gap CUDA + DOCA mirror, three-tier DOCA GPG key, MOK enrollment, DKMS sign, Dell PowerEdge XE9780/XE9785 baseboard firmware v1.4.30, iDRAC Redfish virtual AC cycle DellOemChassis.ExtendedReset, generic "install nvidia driver ubuntu 24.04 datacenter".
when_to_use: Use for bringing up an NVIDIA datacenter GPU host (HGX, DGX, or inference-card servers) on Ubuntu 24.04 LTS. Covers the full air-gapped path from clean OS install to the gpu-operator cuda-validator pod passing. Trigger when fabricmanager fails to start, nvidia-smi sees no GPUs on B200/B300, the cuda-validator loops with "system not yet initialized", a host needs coherent DOCA-OFED + NVLSM + NVIDIA driver install, MOK enrollment for Secure Boot is needed, or a Dell XE9780/XE9785 chassis needs baseboard firmware updated.
---

# nvidia-datacenter-bringup

Opinionated greenfield recipe for **NVIDIA datacenter GPUs on Ubuntu 24.04 LTS** — get from a clean OS install to a healthy host where `nvidia-smi` reports all GPUs, `nvidia-fabricmanager` is `active (running)`, and the gpu-operator `cuda-validator` pod passes. Air-gap is the primary case; connected sites use the same packages from the same upstream URLs.

## Decision tree

| Question | Answer | Read |
|---|---|---|
| Has Blackwell silicon (B300/B200/B100)? | Yes | Open kernel modules **mandatory** — proprietary is unsupported [[open-modules-transition]] |
| Grace Hopper (GH200)? | Yes | Open kernel modules **mandatory** (same as Blackwell). Otherwise 8-GPU SXM path [[hopper-recipe]] |
| HGX 8-GPU SXM with 3rd-gen NVSwitch (H100/H200/H800 in XE9680 or similar)? | Yes | Open recommended (not mandatory). Use `cuda-drivers-fabricmanager-<branch>` meta; **skip** `nvlink5-<branch>`, NVLSM, DOCA-OFED entirely. Min driver 525+ for H100, 535+ for H200 [[hopper-recipe]] |
| HGX 4-GPU SXM (H100 in XE8640, A100 4-GPU)? | Yes | **No NVSwitch on this baseboard**: the 4 GPUs form a direct NVLink mesh. **Skip fabricmanager, DOCA, and NVLSM entirely.** Two-package install: driver + container-toolkit [[hopper-recipe]] |
| HGX A100 8-GPU (2nd-gen NVSwitch)? | Yes | Same as 8-GPU SXM H100 path. Min driver 450.xx. ALI not available — FM trains NVLinks at boot |
| L40S, L40, L4, H200 NVL, or other PCIe-only? | Yes | Driver + container-toolkit only. **Skip** DOCA-OFED, fabricmanager, NVLSM entirely. Validation reduces to `nvidia-smi` clean — no fabric registration to check |
| 4th-gen NVSwitch (B200/B300/B100)? | Yes | Must install `nvlink5-<branch>` meta (or `nvlsm + libnvsdm + libibumad3 + infiniband-diags` separately). FM talks to switches via CX7 bridge — DOCA-OFED is non-optional [[fabric-manager-guide]] §"Additional Steps for B200/B300" |
| Dell XE9780 or XE9785 chassis? | Yes | **Step 0 is firmware ≥ v1.4.30**. Use Redfish `DellOemChassis.ExtendedReset`, NOT BIOS Full Power Cycle [[dell-firmware]] |
| Dell XE9680 / XE9640 / XE8640? | Yes | Enable **iDRAC Direct USB Port in BIOS** before GPU baseboard FW updates (KB 000308105). Same `DellOemChassis.ExtendedReset` activation pattern [[dell-firmware]] |
| Secure Boot enabled? | Yes | Enroll one DKMS MOK — signs both NVIDIA and DOCA-OFED modules [[secure-boot]] |
| Air-gapped? | Yes | Mirror three repos. Total budget ~3 GB. Three-tier GPG keys for DOCA [[airgap-mirror]] |
| gpu-operator on top? | Yes | Pre-installed driver mode: `driver.enabled=false`, `toolkit.enabled=false`. Know about B300 issue #2231 [[gpu-operator]] |
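
The first row can be answered on a live host from PCI IDs. The sketch below is illustrative only: `has_b300` is a hypothetical helper name, and it recognizes just the B300 device ID `3182` (the ID quoted in gpu-operator issue #2231) under NVIDIA's vendor ID `10de`. Other Blackwell SKUs would need their own IDs added.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `lspci -n` output on stdin and succeeds if any
# device carries NVIDIA's vendor ID (10de) with the B300 device ID (3182).
has_b300() {
  grep -qiE '(^|[[:space:]])10de:3182([[:space:]]|$)'
}

# Typical use on a host (not run here):
#   lspci -n -d 10de: | has_b300 && echo "B300 found: open kernel modules mandatory"
```

A non-match does not prove the host is non-Blackwell; it only means no device with ID `3182` was listed.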

## The 10-step install order

Order matters: DOCA before driver, MOK before any DKMS build, firmware before everything.

```
0. Dell baseboard firmware ≥ v1.4.30 + Redfish virtual AC cycle      [[dell-firmware]]
1. OS prep: kernel headers, blacklist nouveau                         [[recipe]]
2. Secure Boot: generate + import MOK (once, signs both stacks)       [[secure-boot]]
3. Add repos: NVIDIA CUDA + DOCA (+ Ubuntu archive subset)            [[airgap-mirror]]
4. Install DOCA — modules-load.d generated, ib_umad autoloads         [[recipe]]
5. Pin NVIDIA driver branch: apt install nvidia-driver-pinning-580    [[packages]]
6. Install driver: apt install nvidia-open-580 (open is mandatory)    [[packages]]
7. Install fabric stack: apt install nvlink5-580                      [[packages]]
8. Install nvidia-container-toolkit (lives in same CUDA repo)         [[packages]]
9. Validate: nvidia-smi clean, FM active, ib_umad loaded, Fabric Completed/Success
```
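
To make the package order in steps 4 to 8 concrete, here is a hedged dry-run sketch. `install_plan` is a hypothetical helper that only prints commands and runs nothing. The DOCA package name `doca-ofed` is an assumption, not taken from this file; confirm it against [[recipe]] before use.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run planner: prints the install commands for steps 4-8,
# in order, for a given driver branch. It executes nothing itself.
# ASSUMPTION: "doca-ofed" as the DOCA package name; verify in [[recipe]].
# apt-get is printed instead of apt because apt's CLI is not meant for scripts.
install_plan() {
  local branch="${1:?usage: install_plan <branch>}"
  local pkg
  for pkg in \
    "doca-ofed" \
    "nvidia-driver-pinning-${branch}" \
    "nvidia-open-${branch}" \
    "nvlink5-${branch}" \
    "nvidia-container-toolkit"
  do
    printf 'apt-get install -y %s\n' "$pkg"
  done
}
```

`install_plan 580` prints five lines, DOCA first and the container toolkit last. Review the output before running any of it with `sudo`.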

`nvlink5-580` is the **compute-only** fabric meta. It pulls `nvidia-fabricmanager + nvlsm + libnvsdm + libnvidia-nscq + nvidia-imex + collectx-bringup + mft{,-oem,-autocomplete} + nvidia-dkms-580-open + libnvidia-compute-580` (and transitively `libibumad3` + `infiniband-diags`). It does NOT pull `nvidia-utils-580` / `nvidia-smi`, which is why step 6 also installs `nvidia-open-580` (full userland). Dependency tree verified live from the NVIDIA CUDA repo for Ubuntu 24.04 amd64; see [[packages]].
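
Because the fabric meta and the userland come from different packages, it can help to check that both halves actually landed. The sketch below is a minimal illustration: `missing_packages` is a hypothetical helper, and the package names in the usage comment are the ones listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: given installed package names on stdin (one per line),
# print each required package (passed as arguments) that is not installed.
missing_packages() {
  local installed pkg
  installed="$(cat)"
  for pkg in "$@"; do
    # -x matches whole lines and -F disables regex, so "nvlsm" != "nvlsm-dev".
    printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
  done
}

# Typical use on a host (not run here); keeps only fully installed ("ii") packages:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' \
#     | awk '$1 == "ii" {print $2}' \
#     | missing_packages nvidia-fabricmanager nvlsm libnvsdm libibumad3 \
#         infiniband-diags nvidia-utils-580
```

Empty output means every named package is present. Any printed name points at a step that was skipped or a package that was later removed.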

## Pitfalls catalogue

The mistakes that cost hours, ordered by frequency in the wild:

1. **Proprietary modules on Blackwell.** `nvidia-driver-580-server` (no `-open`) installs cleanly but the kernel module silently doesn't bind to B300 silicon. `nvidia-smi` reports no GPUs. NVIDIA's open-modules transition blog: proprietary is **unsupported** on Blackwell and Grace Hopper. Cure: use `nvidia-open-<branch>` from the CUDA repo. See [[open-modules-transition]].

2. **DOCA installed AFTER the driver.** `ib_umad` does not autoload at boot; `nvidia-fabricmanager.service` fails. NVIDIA forum 353369. Cure: install DOCA first. If already broken, reinstall DOCA OR `echo ib_umad | sudo tee /etc/modules-load.d/ib_umad.conf && sudo modprobe ib_umad && sudo systemctl restart nvidia-fabricmanager`. See [[troubleshooting]].

3. **Dell BIOS Full Power Cycle.** The iDRAC web-UI "Full Power Cycle" and the standard Redfish `Chassis.Reset` `FullPowerCycle` do **not** reliably activate the GPU baseboard subcomponents (HMC, NVSwitch, CX bridge firmware). Symptom: firmware update "succeeded" but old version still reports. Cure: Dell-specific `DellOemChassis.ExtendedReset`. See [[dell-firmware]].

4. **Mixing Ubuntu archive `nvidia-driver-XXX-server` with CUDA-repo `nvlsm`.** The Ubuntu archive's FM version lags NVIDIA's by patch releases, and NVLSM ships only in the CUDA repo. Rip out the Ubuntu archive NVIDIA packages and go single-source from the CUDA repo. See [[packages]].

5. **MOK enrolled for NVIDIA but not for DOCA-OFED (or vice versa).** Both stacks are DKMS-built, and the one DKMS key pair (`/var/lib/dkms/mok.key`, with `/var/lib/dkms/mok.pub` as the certificate you enroll) signs both: enroll once and everything is covered. There is no Canonical-signed precompiled HWE module for ConnectX, so MOK is unavoidable on B300 + Secure Boot. See [[secure-boot]].

6. **`cudaErrorSystemNotReady` blamed on driver.** This is the textbook FM-not-running signature on any NVSwitch system. The first check is always the same: confirm `nvidia-fabricmanager.service` is `active (running)` and that its version matches the loaded driver. See [[troubleshooting]] and [[gpu-operator]].

7. **gpu-operator B300 PCI ID warning loop (#2231).** Cosmetic, not fatal: pods still work. `nvidia-operator-validator` logs `unable to get device name: failed to find device with id '3182'`. Open upstream since 2026-03-18; NVIDIA acknowledged that "B300 should be supported" but is waiting on a must-gather. See [[gpu-operator]].

8. **Driver 570.158.01 + operator-managed driver mode.** FM start script broken (gpu-operator #1595). Cure: pin ≥570.172.08 OR move to 580+. Or just use pre-installed-driver mode and let the host's systemd unit own FM. See [[gpu-operator]].
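
The version check behind pitfall 6 can be scripted. The sketch below is a hedged illustration with two hypothetical helpers, `upstream_version` and `fm_matches_driver`. It assumes the FM package is named `nvidia-fabricmanager` (as pulled by `nvlink5-580`) and that its Debian version differs from the driver string only by an optional epoch and revision suffix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: strip a Debian epoch ("1:") and revision ("-1") so a
# package version can be compared with the bare driver version string.
upstream_version() {
  local v="$1"
  v="${v#*:}"    # drop the epoch if present
  v="${v%-*}"    # drop the Debian revision (text after the last hyphen)
  printf '%s\n' "$v"
}

# Succeeds only when the FM version equals the driver version exactly.
fm_matches_driver() {
  [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Typical use on a host (not run here):
#   fm_matches_driver \
#     "$(dpkg-query -W -f='${Version}' nvidia-fabricmanager)" \
#     "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" \
#     || echo "FM/driver version mismatch"
```

The comparison is deliberately an exact string match rather than a "newer than" test, since FM is expected to match the loaded driver, not merely exceed it.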

## Validation

After step 9 these should all be true. Expected output on a healthy 8-GPU B300:

```bash
$ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
NVIDIA B300 SXM6 270GB, 580.126.20
NVIDIA B300 SXM6 270GB, 580.126.20
# ... 8 identical lines

$ systemctl is-active nvidia-persistenced nvidia-fabricmanager nvidia-nvlsm
active
active
active

$ lsmod | grep -E '^(nvidia|nvidia_uvm|nvidia_peermem|ib_umad) ' | awk '{print $1}'
nvidia_peermem
nvidia_uvm
nvidia
ib_umad

$ nvidia-smi -q -i 0 | grep -A 2 Fabric
        Fabric
            State                  : Completed
            Status                 : Success

$ nvidia-smi nvlink --status -i 0 | head -3
GPU 0: NVIDIA B300 SXM6 270GB (UUID: ...)
         Link 0: 200 GB/s
         Link 1: 200 GB/s
```
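
For unattended checks, the Fabric block can be parsed rather than read by eye. This is a minimal sketch, assuming the `nvidia-smi -q` layout shown above. `fabric_ok` is a hypothetical helper and handles one GPU's output at a time.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi -q -i <n>` output on stdin and
# succeeds only if the Fabric block shows State Completed and Status Success.
fabric_ok() {
  awk '
    /^[[:space:]]*Fabric[[:space:]]*$/ { in_fabric = 1; next }
    in_fabric && $1 == "State"         { state  = $NF }
    in_fabric && $1 == "Status"        { status = $NF; in_fabric = 0 }
    END { exit !(state == "Completed" && status == "Success") }
  '
}

# Typical use on an 8-GPU host (not run here):
#   for i in 0 1 2 3 4 5 6 7; do
#     nvidia-smi -q -i "$i" | fabric_ok || echo "GPU $i: fabric not ready"
#   done
```

Output with no Fabric block at all counts as a failure, which is the safer default on an NVSwitch system.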

If any value is missing or wrong (e.g. `State: Not Initialized`, `lsmod` missing `ib_umad`, `systemctl is-active` reports `inactive`), see [[troubleshooting]] keyed by the symptom.

On a fresh 8-GPU B300, FM takes ~30–90 s to finish fabric registration on first boot. The gpu-operator validator pod may enter CrashLoopBackOff transiently during this window; only sustained crashes lasting more than 3 min are real failures.
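
One way to avoid false alarms during that window is to poll with a deadline instead of checking once. The sketch below is an illustration, not part of the recipe. `wait_until` is a hypothetical helper, and the 180 s timeout in the usage comment simply mirrors the 3 min threshold above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or a deadline passes.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    (( SECONDS >= deadline )) && return 1
    sleep "$interval"
  done
}

# Typical use on a host (not run here):
#   wait_until 180 5 systemctl is-active --quiet nvidia-fabricmanager \
#     || echo "FM still not active after 3 min: treat as a real failure"
```

The same wrapper can gate any of the validation checks, for example a per-GPU Fabric state check, before declaring the host unhealthy.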

## Hard out-of-scope

- Windows, workstation/Optimus, non-Ubuntu (RHEL/SLES handled by NVIDIA's own guides)
- vGPU / MIG runtime config (separate concern from bring-up)
- DGX OS (NVIDIA's preinstalled bundle — different workflow, see DGX OS docs)
- gpu-operator helm-values for full cluster config (host-side stops at validator passing)
- Multi-node NVLink Switch (NVL72-style); single-chassis NVL8 only
- Data-plane CX-8 NIC config (bridge role only — those are the management PFs)
- Vendor firmware for non-Dell chassis (Supermicro/HPE/Lenovo — defer to vendor docs)

## References

| Topic | File |
|---|---|
| Full install recipe with commands | [[recipe]] |
| Exact apt package matrix (verified live) | [[packages]] |
| Air-gap mirror setup + GPG keys + sizing | [[airgap-mirror]] |
| Secure Boot + MOK + DKMS pipeline | [[secure-boot]] |
| Dell baseboard firmware + Redfish AC cycle | [[dell-firmware]] |
| gpu-operator host-contract + cuda-validator triage | [[gpu-operator]] |
| Symptom → cause → fix playbook | [[troubleshooting]] |
| Dated URL index | [[sources]] |
| Open follow-ups / ceiling findings | [[improvement-backlog]] |
| NVIDIA Fabric Manager User Guide (offline copy) | [[fabric-manager-guide]] |
| NVIDIA Driver Install Guide — Ubuntu chapter | [[driver-install-ubuntu]] |
| NVIDIA Driver Install Guide — Kernel modules | [[driver-install-kernel-modules]] |
| NVIDIA Driver Install Guide — Advanced Options | [[driver-install-advanced-options]] |
| DOCA Host Installation Guide — Ubuntu 24.04 | [[doca-install-ubuntu]] |
| NVIDIA blog: open kernel modules transition | [[open-modules-transition]] |
| Dell HGX B300 SXM6 firmware release notes (v1.4.30) | [[b300-firmware-release-notes]] |
| Hopper recipe — H100 / H200 / GH200 simplified install paths | [[hopper-recipe]] |
| One-shot host health check (`bash`) | `scripts/health-check.sh` |
