---
name: hyperpod-ssm
description: Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.
metadata:
  version: "1.0.0"
---

# HyperPod SSM Access

## Prerequisites

- **`aws` CLI v2**, authenticated for the target account/Region.
- **`session-manager-plugin`** — installed alongside the AWS CLI.
- **`jq`** — the scripts build JSON payloads with it.
- **`unbuffer`** (from the `expect` package) — wraps `aws ssm start-session` with a PTY so the session-manager-plugin flushes stdout instead of racing to close. Without it, calls intermittently return empty output with `Cannot perform start session: EOF` even when the command ran. Install with `sudo yum install expect`, `sudo apt install expect`, or `brew install expect`. `ssm-exec.sh` detects and uses it automatically; falls back with a warning if missing.

## SSM Target Format

Target: `sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>`

- `CLUSTER_ID`: Last segment of cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- `GROUP_NAME`: Instance group name — retrieve via `list-nodes.sh`.
- `INSTANCE_ID`: EC2 instance ID (e.g., `i-0123456789abcdef0`)

## Scripts

Three scripts under `scripts/`. Resolve cluster info and nodes **once**, then execute per node.

### get-cluster-info.sh — Resolve cluster name → ID (call once)

```bash
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}
```

### list-nodes.sh — List all nodes with pagination (call once)

```bash
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
```

`list-cluster-nodes` paginates at 100 nodes. This script handles pagination automatically.

### ssm-exec.sh — Execute command on a node (call per node)

```bash
# Execute — with pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute — with parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]
```

## Running Commands Across Many Nodes

SSM `start-session` rate limit: **3 TPS** per account. Plan batch size and delay accordingly.

`aws ssm send-command` does NOT support `sagemaker-cluster:` targets — only `start-session` works.

## Manual SSM Commands

When the scripts aren't suitable, use `aws ssm start-session` directly with `AWS-StartNonInteractiveCommand`. Wrap every invocation in `unbuffer` — without it, stdout is intermittently empty (see Prerequisites).

```bash
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF

unbuffer aws ssm start-session \
  --target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json
```

- Always use a JSON file for `--parameters` — inline parameters break with special characters.
- The document's `command` parameter is argv, not shell input. Wrap multi-statement scripts in `bash -c '...'` so pipes, semicolons, and redirects evaluate.

## Common Diagnostic Commands

| Task             | Command                                                        |
| ---------------- | -------------------------------------------------------------- |
| Lifecycle logs   | `cat /var/log/provision/provisioning.log`                      |
| Memory           | `free -h`                                                      |
| Disk/mounts      | `df -h && lsblk`                                               |
| GPU status       | `nvidia-smi`                                                   |
| GPU memory       | `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` |
| EFA/network      | `fi_info -p efa`                                               |
| CloudWatch agent | `sudo systemctl status amazon-cloudwatch-agent`                |
| Top processes    | `ps aux --sort=-%mem \| head -20`                              |

## Key Details

- Default SSM non-interactive user is `root`.
- SSM rate limit: **3 TPS** per account.
- For interactive sessions (rare), omit `--document-name` to get a shell.
- Interactive commands (vim, top) are not supported via `AWS-StartNonInteractiveCommand`.
- Large outputs may be truncated by SSM.
- For troubleshooting common errors, see [references/troubleshooting.md](references/troubleshooting.md).
