---
name: system-tuning
description: 'Restore system responsiveness via safe, ordered process cleanup and agent-swarm hygiene.'
---
# System Tuning Skill

> **Quick Ref:** Diagnose → reap free wins → kill stuck children → fix confused parents → renice survivors → verify. Output: `.agents/system-tuning/YYYY-MM-DD-triage.md`.

**YOU MUST EXECUTE THIS WORKFLOW. Do not just describe it.**

When a dev box turns sluggish (high load, swap pressure, agent sprawl, or compile storms), work the kill ladder from the safest move outward. The cleanup target is usually a confused parent agent, not the visible child.

## When To Use

| Symptom | Use this skill |
|---|---|
| Load average drifts above core count | yes |
| Killed processes respawn within seconds | yes — see [references/whack-a-mole-anti-pattern.md](references/whack-a-mole-anti-pattern.md) |
| Tmux / zellij sessions piling up | yes — see [references/agent-swarm-cleanup.md](references/agent-swarm-cleanup.md) |
| One genuinely hot path | run `$perf` first |

## Loops

Four loops, in order. Each loop has a check, an action, and a verification.

### 1. Diagnose Without Touching

```bash
uptime && nproc
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory 2>/dev/null
ps -eo stat | grep -c '^Z'
ps aux --sort=-%cpu | head -10
```

Capture the baseline. Every kill below should move at least one number; if nothing moves, you killed the wrong thing.
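
A minimal sketch of capturing that baseline as plain `key: value` lines, so the same function can be rerun after each kill and compared. The names `psi_avg10` and `baseline_snapshot` are hypothetical. The parsing assumes the `some avg10=... avg60=...` line layout that Linux PSI files use, and the load figure assumes `/proc/loadavg` is available.

```bash
# psi_avg10: read PSI file contents on stdin, print the avg10 value of the "some" line.
psi_avg10() {
  awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2; exit }'
}

# baseline_snapshot: print the three numbers this skill tracks, one per line.
# Missing files (no PSI support, non-Linux host) yield an empty value, not an error.
baseline_snapshot() {
  echo "load1: $(cut -d' ' -f1 /proc/loadavg 2>/dev/null)"
  echo "cpu_pressure_avg10: $(cat /proc/pressure/cpu 2>/dev/null | psi_avg10)"
  echo "zombies: $(ps -eo stat 2>/dev/null | grep -c '^Z')"
}
```

Calling `baseline_snapshot > before.txt` before the first kill and `baseline_snapshot > after.txt` afterwards gives two files that `diff` can compare directly.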

### 2. Walk The Kill Hierarchy

Least-invasive first; escalate only when the previous rung did not help. Full ordering and signal escalation rules in [references/kill-hierarchy.md](references/kill-hierarchy.md).

```
zombies & exited sessions   →  reap, no risk
stuck child processes       →  SIGTERM, wait 3s, then SIGKILL if needed
confused parent agents      →  kill the parent so it stops respawning children
renice the survivors        →  deprioritize the heavy work without killing it
```

Stop at the first rung that restores responsiveness.
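
The stuck-child rung (SIGTERM, wait, then SIGKILL only if needed) could be sketched as below. The helper names `is_running` and `term_then_kill` are hypothetical, and the 3-second default mirrors the ladder above. Treating a zombie as already dead is an assumption of this sketch, since a zombie no longer responds to signals.

```bash
# is_running PID: succeed if PID exists and is not a zombie.
is_running() {
  local state
  state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d '[:space:]')
  case $state in
    ''|Z*) return 1 ;;   # gone, or only waiting to be reaped
  esac
  return 0
}

# term_then_kill PID [GRACE_SECONDS]: polite signal first, forced signal as a last resort.
term_then_kill() {
  local pid=$1 grace=${2:-3} waited=0
  is_running "$pid" || return 0
  kill -TERM "$pid" 2>/dev/null
  while [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$((waited + 1))
    is_running "$pid" || return 0   # exited on SIGTERM, no SIGKILL needed
  done
  kill -KILL "$pid" 2>/dev/null
  return 0
}
```

Polling once per second lets the function return as soon as the process exits, so a well-behaved child never receives SIGKILL.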

### 3. Clean Up The Swarm

If the box hosts multiple coding agents, second-order problems dominate: duplicated build trees, orphaned helpers, stale multiplexer sessions, agents older than the work that spawned them. See [references/agent-swarm-cleanup.md](references/agent-swarm-cleanup.md) for shapes and the cleanup move per shape.
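
As an illustration of one of those shapes, stale multiplexer sessions, the filter below lists tmux sessions that have no attached client and have been idle longer than a threshold. The function name `stale_sessions` and the 12-hour threshold in the usage line are assumptions, not rules from the reference file. Session names containing spaces are not handled. It only lists candidates and signals nothing.

```bash
# stale_sessions NOW MAX_IDLE_SECONDS
# stdin: one session per line as "name attached_clients last_activity_epoch"
# stdout: names of sessions with no attached client that are idle longer than MAX_IDLE_SECONDS
stale_sessions() {
  awk -v now="$1" -v max="$2" '$2 == 0 && (now - $3) > max { print $1 }'
}
```

A possible invocation, using tmux's own format variables:

```bash
tmux list-sessions -F '#{session_name} #{session_attached} #{session_activity}' 2>/dev/null \
  | stale_sessions "$(date +%s)" 43200
```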

### 4. Verify

After every loop, re-read the same metrics and write a one-line delta:

```
load: 38.4 -> 12.1   |   cpu_pressure_avg10: 61% -> 8%   |   zombies: 14 -> 0
```

No delta written → cleanup not done.
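
A small sketch for producing that line in the same layout as the example above. The function names `fmt_delta` and `join_deltas` are hypothetical; the before and after values are whatever was captured during diagnosis and verification.

```bash
# fmt_delta LABEL BEFORE AFTER: format one metric as "label: before -> after".
fmt_delta() {
  printf '%s: %s -> %s' "$1" "$2" "$3"
}

# join_deltas PART...: join already formatted parts with the "   |   " separator.
join_deltas() {
  local out=$1 part
  shift
  for part in "$@"; do
    out="$out   |   $part"
  done
  printf '%s\n' "$out"
}
```

For example, `join_deltas "$(fmt_delta load 38.4 12.1)" "$(fmt_delta zombies 14 0)"` produces a line that can be appended to the triage report as-is.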

## Quick-Start Checklist

```bash
# Baseline
uptime; cat /proc/pressure/cpu

# Reap free wins
ps -eo stat | grep -c '^Z'
zellij list-sessions 2>&1 | grep -c EXITED
zellij delete-all-sessions --yes 2>/dev/null

# Stuck children (12h+)
ps -eo pid,etimes,args --sort=-etimes | awk '$2 > 43200' | head

# Confused parents (16h+ agents)
ps -eo pid,etimes,args | grep -E 'claude|codex' | awk '$2 > 57600'

# Renice live compilation: lowest CPU priority (19) and idle I/O class (3)
for pid in $(pgrep -f /bin/cargo) $(pgrep cc1plus); do
  renice -n 19 -p "$pid" 2>/dev/null
  ionice -c 3 -p "$pid" 2>/dev/null
done

# Verify
sleep 10 && uptime && cat /proc/pressure/cpu
```
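
The age filters above find long-lived agents, but when a killed child keeps coming back it can help to trace that child to whatever spawned it. A sketch, assuming a hypothetical helper named `ancestry`, that walks the PPID chain one `ps` call at a time:

```bash
# ancestry PID: print "pid ppid args" for PID and each ancestor, stopping at PID 1.
ancestry() {
  local pid=$1 ppid
  while [ -n "$pid" ] && [ "$pid" -gt 0 ] 2>/dev/null; do
    # Separate -o options keep the empty-header form portable across ps variants.
    ps -o pid= -o ppid= -o args= -p "$pid" 2>/dev/null || break
    ppid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d '[:space:]')
    [ "$pid" -eq 1 ] && break
    pid=$ppid
  done
}
```

Running `ancestry <child-pid>` on a respawning child shows the chain from the child upward. The first agent process in that chain is the likely confused parent, subject to the protected list below.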

## Protected Processes

Never signal these without explicit operator approval:

```
systemd, sshd, dbus, cron
postgres, mysql, redis, nginx, caddy
docker, containerd, k3d, kubelet
the multiplexer holding your sessions (tmux server, wezterm-mux-server, zellij)
```

If unsure, document the candidate and leave it running.
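
One way to enforce this in scripts is a guard that is checked before any signal is sent. The sketch below is an assumption-laden illustration: the `is_protected` name is hypothetical, it matches on the command name only, and it uses prefix matching so that variants such as `dbus-daemon` or `redis-server` are also caught. That errs on the side of leaving processes alone.

```bash
# Names taken from the protected list above. "wezterm-mux" is shortened on purpose
# because ps truncates the comm field to 15 characters on Linux.
PROTECTED="systemd sshd dbus cron postgres mysql redis nginx caddy docker containerd k3d kubelet tmux wezterm-mux zellij"

# is_protected NAME: succeed if NAME starts with any protected name.
is_protected() {
  local name=$1 p
  for p in $PROTECTED; do
    case $name in
      "$p"*) return 0 ;;
    esac
  done
  return 1
}
```

A caller might write `is_protected "$(ps -o comm= -p "$pid")" && continue` inside a kill loop, logging the skipped candidate for the operator instead of signalling it.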

## Output

Triage reports go to `.agents/system-tuning/YYYY-MM-DD-triage.md` with:

1. Baseline metrics
2. Kill log with reason per signal
3. Renice / ionice changes
4. Post-cleanup metrics
5. Anything escalated for operator decision
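
A sketch of creating the report skeleton with those five sections. The function name `init_triage_report` and the heading wording are assumptions; only the path pattern comes from this skill.

```bash
# init_triage_report [BASE_DIR]: create today's report if missing, print its path.
init_triage_report() {
  local dir=${1:-.agents/system-tuning}
  local file
  file="$dir/$(date +%F)-triage.md"   # %F expands to YYYY-MM-DD
  mkdir -p "$dir" || return 1
  # Keep an existing report from earlier in the day instead of overwriting it.
  [ -e "$file" ] || cat > "$file" <<'EOF'
## Baseline metrics

## Kill log

## Renice / ionice changes

## Post-cleanup metrics

## Escalated for operator decision
EOF
  printf '%s\n' "$file"
}
```

Because the function prints the path, later steps can append to it, for example `report=$(init_triage_report)` followed by `echo "..." >> "$report"`.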

## See Also

- [perf](../perf/SKILL.md) — Optimize a single hot path once the system is responsive
- [bug-hunt](../bug-hunt/SKILL.md) — Investigate why a process loops or hangs
- [scope](../scope/SKILL.md) — Lock edit scope before running cleanup in a shared workspace

## Local Resources

### references/

- [references/kill-hierarchy.md](references/kill-hierarchy.md)
- [references/whack-a-mole-anti-pattern.md](references/whack-a-mole-anti-pattern.md)
- [references/agent-swarm-cleanup.md](references/agent-swarm-cleanup.md)

## Attribution

Methodology pattern-adopted from the jsm `system-performance-remediation` skill. See `LICENSE.md` in this skill directory. No source text reused.
