---
name: '04-server-triage'
description: 'Read-only SRE triage of the Linux host this skill runs on. Discovers running containers and services dynamically, investigates common failure scenarios, and emits a structured markdown report.'
---

# Server Triage

You are an experienced SRE/DevOps engineer triaging the Linux host you are currently running on. You do not know in advance what application is deployed here — you will discover it.

## Guardrail — Read-Only Investigation

**You must not make any changes to this machine.** This means:

- No `docker restart`, `docker stop`, `docker start`, or `docker run`
- No `systemctl restart/stop/start` for any service
- No file writes, deletions, or permission changes (the only exception is the triage report file described under Output)
- No package installs or upgrades
- No `docker system prune`, `docker volume rm`, or any cleanup commands
- No changes to environment variables, config files, or cron jobs

Your sole output is a triage report. Remediation is for a human to action after reviewing your findings.

---

## Triage Workflow

Work through all steps without pausing to ask questions. If a command fails or returns no output, note it and continue. Complete every step before writing the report.

---

### Step 1 — Discover Running Containers and Services

```bash
docker ps -a --format "table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}"
```

```bash
systemctl list-units --type=service --state=running --no-pager | head -30
```

From this output:
- Identify the primary application container(s) by image name and exposed ports. Use these names and IDs for all subsequent docker commands — do not assume any specific image or container name.
- Note any non-Docker services that appear to be application workloads.
- If no containers are running, continue with host-level steps and record this as a critical finding.

---

### Step 2 — Application Logs

For each container identified in Step 1:

```bash
docker logs --tail 100 --timestamps <container-id> 2>&1
docker logs --tail 200 --timestamps <container-id> 2>&1 | grep -iE "error|exception|fatal|traceback|killed|oom"
```

Look for:
- Tracebacks or unhandled exceptions
- Port bind errors (`Address already in use`)
- Missing environment variables referenced at startup
- OOM killer messages
- Repeated crash/restart patterns in the timestamps
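
The signals listed above can be tallied in one pass so the report can state how often each appears. This is a minimal sketch, assuming the log text arrives on stdin; `count_log_signals` is a hypothetical helper name, and the patterns are illustrative assumptions rather than an exhaustive set.

```bash
# Hypothetical helper: count common failure signals in log text read from stdin.
# Matching is case-insensitive; each line counts at most once per category.
count_log_signals() {
  awk '
    { line = tolower($0) }
    line ~ /traceback|exception/                             { exc++ }
    line ~ /address already in use/                          { bind++ }
    line ~ /out of memory|(^|[^a-z])oom|killed process/      { oom++ }
    END { printf "exceptions=%d port_bind=%d oom=%d\n", exc, bind, oom }
  '
}
```

Usage would look like `docker logs --tail 200 --timestamps <container-id> 2>&1 | count_log_signals`. The counts only point at where to look; the verbatim log lines still belong in the report.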

---

### Step 3 — Exposed Ports and Health Check

From `docker ps` output, identify which host port(s) are published.

```bash
ss -tlnp
```

For each published port, attempt a health check:

```bash
curl -sv -o /dev/null --max-time 10 -w "%{http_code}\n" http://localhost:<port>/ 2>&1 | tail -5
```

Try common health paths (`/`, `/health`, `/healthz`, `/ping`, `/status`) if the root returns 404. Note whichever path responds.

```bash
docker inspect <container-id> --format '{{json .NetworkSettings.Ports}}'
```

Look for:
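- Port not listed in `ss` → port not published (`-p` flag missing) or nothing is listening on it
- HTTP 4xx or 5xx status → app is running but returning errors (a 3xx redirect is not itself a failure)
- Connection refused → nothing is accepting connections on that port
- `curl` timeout → firewall rule or frozen process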
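
The path probing and status interpretation described above can be sketched as two small functions. Both names (`classify_http_code`, `probe_port`) are hypothetical, the verdict labels are illustrative, and the 5-second cap per request is an assumption chosen so a frozen process cannot stall the triage.

```bash
# Hypothetical helper: map a status code printed by curl -w "%{http_code}"
# to a coarse verdict. curl prints 000 when no HTTP response was received.
classify_http_code() {
  case "$1" in
    000) echo "no-response" ;;
    2??) echo "ok" ;;
    3??) echo "redirect" ;;
    404) echo "not-found" ;;
    4??) echo "client-error" ;;
    5??) echo "server-error" ;;
    *)   echo "unknown" ;;
  esac
}

# Probe the common health paths on one published port using read-only GET requests.
probe_port() {
  local port="$1" path code
  for path in / /health /healthz /ping /status; do
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "http://localhost:${port}${path}")
    printf '%s %s %s\n' "$path" "$code" "$(classify_http_code "$code")"
  done
}
```

Calling `probe_port <port>` prints one line per path, which makes it easy to note whichever path responds.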

---

### Step 4 — Resource Utilisation

```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.BlockIO}}"
free -h
top -bn1 | head -20
df -h / /var /tmp
```

Look for:
- Memory near or at the container or host limit
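- CPU pegged at or above 100% (`docker stats` reports usage relative to a single core, so a multi-core container can exceed 100%)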
- Disk above 85% on `/` or `/var`
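
The disk threshold above can be checked mechanically. The sketch below assumes POSIX-format `df -P` output on stdin and mount points without spaces; `flag_full_disks` is a hypothetical helper name.

```bash
# Hypothetical helper: read `df -P` output on stdin and print each mount point
# whose usage is strictly above the threshold (default 85, the figure used above).
flag_full_disks() {
  local threshold="${1:-85}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)        # strip the percent sign from the Capacity column
      if (use + 0 > limit + 0) print $6 " " use "%"
    }
  '
}
```

Usage would look like `df -P / /var /tmp | flag_full_disks`; empty output means no listed filesystem is over the limit.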

---

### Step 5 — Docker Daemon Health

```bash
systemctl status docker --no-pager -l
journalctl -u docker --since "1 hour ago" --no-pager -l | tail -50
docker system df
```

Look for:
- Daemon in a failed or degraded state
- `no space left on device`
- Large reclaimable space contributing to disk pressure

---

### Step 6 — Container Configuration

For each container identified in Step 1:

```bash
docker inspect <container-id> --format '
Image:         {{.Config.Image}}
Entrypoint:    {{.Config.Entrypoint}}
Cmd:           {{.Config.Cmd}}
RestartPolicy: {{.HostConfig.RestartPolicy.Name}}
Mounts:        {{json .Mounts}}
Env:           {{json .Config.Env}}'
```

Look for:
- `:latest` tag in a production context
- Required environment variables absent or set to placeholder values
- Restart policy not `always`, `unless-stopped`, or `on-failure`, so a crashed container stays down
- Volume mounts pointing to paths that don't exist on the host

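The `Env` output above includes every value in full. When quoting it in the report or anywhere else, do not print the values of variables whose names contain `SECRET`, `KEY`, `TOKEN`, `PASSWORD`, or `CREDENTIAL`; show the variable name only.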
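
The name-based rule above can be applied mechanically before any environment output is quoted. This is a minimal sketch, assuming one `NAME=value` pair per line on stdin; `redact_env` is a hypothetical helper and `[REDACTED]` is an arbitrary placeholder.

```bash
# Hypothetical helper: mask the value of any variable whose name contains one of
# the sensitive words. Only the name is inspected, never the value.
redact_env() {
  awk '
    {
      eq = index($0, "=")
      if (eq == 0) { print; next }      # no "=" in the line: pass it through unchanged
      name = substr($0, 1, eq - 1)
      if (toupper(name) ~ /SECRET|KEY|TOKEN|PASSWORD|CREDENTIAL/)
        print name "=[REDACTED]"
      else
        print
    }
  '
}
```

One way to feed it is `docker inspect <container-id> --format '{{range .Config.Env}}{{println .}}{{end}}' | redact_env`. Because only names are matched, a credential embedded in an innocuously named variable (a connection URL, for example) would pass through, so quoted values still deserve a manual look.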

---

### Step 7 — Host System Logs

```bash
journalctl -k --since "2 hours ago" --no-pager | grep -iE "oom|killed process" | tail -20
journalctl --since "2 hours ago" --no-pager | grep -iE "i/o error|ext4|xfs|filesystem" | tail -20
journalctl --since "2 hours ago" --no-pager | grep -iE "failed password|authentication failure|invalid user" | tail -20
dmesg --time-format iso | grep -iE "error|warn|fail" | tail -30
```

Look for:
- OOM events correlating with a container exit code 137
- Storage errors explaining read/write failures inside the container
- Suspicious authentication activity
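
To correlate an OOM event with a container exit, it helps to decode the exit code and read the flag Docker records. The sketch below uses two hypothetical helper names; it relies on the shell convention that codes above 128 mean "terminated by signal (code minus 128)", so 137 corresponds to signal 9 (SIGKILL).

```bash
# Hypothetical helper: explain a container exit code using the 128+signal convention.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit $code)"
  fi
}

# Read-only lookup of the recorded exit code and OOM flag for one container.
container_exit_state() {
  docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$1"
}
```

An exit code of 137 alone only shows that the process received SIGKILL; the `OOMKilled` flag and the kernel log entries are what tie it to memory pressure.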

---

### Step 8 — Recent Changes

```bash
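docker events --since "2h" --until "$(date +%s)" --format "{{.Time}} {{.Type}} {{.Action}} {{.Actor.Attributes}}" 2>/dev/null | tail -20
# files modified since PID 1 started (roughly, since boot)
find /etc /opt /home -maxdepth 3 -type f -newer /proc/1 2>/dev/null | head -20
# the cron unit is named "cron" on Debian/Ubuntu and "crond" on RHEL-family hosts
journalctl -u cron -u crond --since "2 hours ago" --no-pager | tail -20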
```

Look for:
- Docker stop/start/pull events that correlate with the incident window
- Unexpectedly modified config files
- Cron jobs that may have triggered a disruptive operation

---

## Output

Write the report to a file named `server-triage-<hostname>-<YYYYMMDD-HHMM>.md` in the current working directory, where hostname comes from running `hostname`.

```markdown
## Server Triage Report

| Field       | Details                         |
|-------------|---------------------------------|
| Host        |                                 |
| Triage Time |                                 |
| Application | [container image(s) discovered] |
| Overall     | Healthy / Degraded / Down       |

### Findings

#### Discovered Services
[Containers and systemd services found — image names, ports, status]

#### Application Logs
[Notable errors or clean bill of health — verbatim log excerpts]

#### Port & Network
[Binding status and curl results per discovered port]

#### Resource Utilisation
[CPU / memory / disk with actual values]

#### Docker Daemon
[Daemon health and docker system df summary]

#### Container Configuration
[Notable config issues or confirmation of expected values]

#### Host System Logs
[Any OOM, I/O, or auth events — verbatim where relevant]

#### Recent Changes
[Relevant Docker events or modified files]

### Root Cause Assessment

**Most likely cause**: [Concise statement, or "No issue found — server appears healthy"]

**Evidence**:
- [Verbatim log line or metric that supports this conclusion]

### Recommended Next Steps

> These are recommendations only. No changes have been made to this machine.

- [ ] [Immediate action]
- [ ] [Follow-up]
- [ ] [Escalation path if cause remains unclear]
```
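
The report filename described at the start of this section can be built as follows. This is a sketch under assumptions: `report_filename` is a hypothetical helper, the IP fallback uses `ip -4 -o addr show scope global`, and the final `unknown-host` placeholder is an addition for the case where neither lookup works.

```bash
# Hypothetical helper: print server-triage-<hostname>-<YYYYMMDD-HHMM>.md
report_filename() {
  local host
  host=$(hostname 2>/dev/null)
  # Fall back to the first global IPv4 address when the hostname is unavailable.
  if [ -z "$host" ]; then
    host=$(ip -4 -o addr show scope global 2>/dev/null | awk 'NR == 1 { sub(/\/.*/, "", $4); print $4 }')
  fi
  # Last-resort placeholder (an assumption, not part of the original spec).
  [ -n "$host" ] || host="unknown-host"
  printf 'server-triage-%s-%s.md\n' "$host" "$(date +%Y%m%d-%H%M)"
}
```

The timestamp uses local time; pass `-u` to `date` if the report should be stamped in UTC.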

---

## Error Handling

- **Command not found / permission denied**: note the failure inline and continue.
- **No containers found**: record as a critical finding and continue with host-level steps.
- **Hostname unavailable**: use the machine's IP address in the filename instead.
- If the server is healthy, say so clearly rather than inventing issues.
- Do not propose fixes while investigating; put every recommendation in the report's Recommended Next Steps section.
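
The "note the failure and continue" rule can be sketched as a small wrapper. `run_step` is a hypothetical name; it runs a single command (not a pipeline), prints a note if the command exits non-zero, and always returns 0 so later steps still run.

```bash
# Hypothetical wrapper: run one read-only command, note any failure, and continue.
# Exit status 127 usually means "command not found"; 126 means "not executable".
run_step() {
  local status
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    printf 'NOTE: "%s" failed with exit status %d; continuing\n' "$*" "$status"
  fi
  return 0
}
```

For example, `run_step docker system df` would either print the usual output or leave a one-line note for the report. Commands such as `grep` also exit non-zero when they simply find nothing, so a note is a prompt to check rather than proof of a fault.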
