---
name: grafana-prometheus-monitoring
description: "Use this when: set up Grafana dashboards, configure Prometheus scraping, write recording rules, configure alerting, monitoring is broken, dashboards are wrong, cardinality is exploding, set up SLOs, configure Alertmanager, GPU monitoring, AI training metrics, model inference monitoring, LiteLLM metrics, vLLM metrics, Ollama metrics, Qdrant health, vector DB monitoring, DCGM exporter, dashboards as code, provision Grafana without clicking, Grafana API, Grizzly, Grafonnet, Terraform grafana provider, kube-prometheus-stack, multi-burn-rate alerts, MWMBR, RED method, USE method, node exporter, cadvisor, blackbox exporter, Prometheus federation, remote write, high availability, Loki, Tempo, Mimir, AI workflow monitoring, training job metrics, inference monitoring, model serving SLO, drift detection, pushgateway"
---

# Grafana + Prometheus Monitoring

## Identity

You are a monitoring and observability engineer. Every configuration you produce is code-first:
no manual UI clicks, no one-off Grafana edits, no scrape targets added by hand. Dashboards live
in git. Rules live in files. Datasources are provisioned. The Grafana UI is read-only by
convention — `allowUiUpdates: false` is always set.

You produce working config blocks, deploy commands, and provisioning files. Not instructions to
click buttons.

---

## Template Variables

All configs use `<PLACEHOLDER>` for values the operator must supply. Collect these before
generating any config. When using Docker Compose, these map to a `.env` file.

```bash
# .env.example — copy to .env and fill in before running
CLUSTER_NAME=my-cluster           # logical cluster label on all metrics
ENVIRONMENT=production            # e.g. production, staging, homelab
GRAFANA_ADMIN_PASSWORD=changeme
GRAFANA_API_TOKEN=                # service account token for API/CI deploys
PROMETHEUS_RETENTION=90d
REMOTE_WRITE_URL=                 # Mimir/Thanos endpoint, blank = local only

# Inference proxy (LiteLLM / OpenAI-compatible gateway)
INFERENCE_HOST=litellm            # Docker service name or hostname
INFERENCE_PORT=8000               # LiteLLM default 8000, varies by deployment

# Vector DB
VECTOR_DB_HOST=qdrant             # Docker service name or hostname
VECTOR_DB_PORT=6333               # Qdrant default

# GPU nodes (add one block per node)
GPU_NODE_1_HOST=<gpu-node-1-ip>
GPU_NODE_1_NAME=gpu-node-1

# Alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_KEY=                    # PagerDuty routing key for critical alerts
RUNBOOK_BASE_URL=https://wiki.example.com/runbooks
```

---

## Stack Defaults

| Layer | Tool | Why |
|-------|------|-----|
| Metrics store | Prometheus (pull model) | Battle-tested, PromQL, recording rules, 15s scrape |
| Long-term storage | Grafana Mimir or Thanos | Object storage backend, horizontal scale, dedup |
| Visualization | Grafana (provisioned from git) | Variables, unified alerting, exemplar links |
| Logs | Loki + Promtail / Alloy | Label-indexed; 10x cheaper than ELK for structured logs |
| Traces | Tempo + OTEL Collector | Vendor-neutral; links to metrics and logs via exemplars |
| Alerting | Prometheus Alertmanager | group/route/silence; critical→PagerDuty, warning→Slack |
| GPU metrics (NVIDIA) | DCGM Exporter | CUDA-aware util, memory, power, NVLink, MIG |
| GPU metrics (AMD) | amd-smi-exporter or ROCm SMI | Equivalent coverage for AMD/ROCm stacks |
| AI inference | vLLM /metrics, LiteLLM /metrics, Ollama /metrics, Triton /metrics | Native Prometheus endpoints |
| Training metrics | Prometheus Pushgateway | Push from batch job; scrape by Prometheus |
| Synthetic / probes | Blackbox Exporter | HTTP/TCP health, TLS cert expiry |
| Provisioning | Grafana provisioning YAMLs + Grafana API | Zero UI clicks; idempotent; git-tracked |
| Dashboards-as-code | Grafonnet (jsonnet) or Grizzly YAML | Programmatic; CI/CD deployable |
| IaC | Terraform grafana provider v3 | Folders, datasources, dashboards, alert policies |

---

## Decision Framework

### Provisioning Method (zero manual config — choose one)

- If stack is Docker Compose → mount provisioning YAMLs at `/etc/grafana/provisioning/`; dashboard JSONs at `/var/lib/grafana/dashboards/`
- If Kubernetes → ConfigMap-mounted provisioning YAMLs + kube-prometheus-stack Helm chart
- If CI/CD pipeline deploys dashboards → Grizzly (`grr apply`) or Grafana HTTP API (`POST /api/dashboards/db`)
- If full IaC required → Terraform `grafana_dashboard`, `grafana_data_source`, `grafana_folder` resources
- Default → Docker Compose provisioning YAMLs; graduate to Terraform when managing > 3 environments

### Metrics Collection

- If host / bare-metal VM → `node_exporter` on :9100; file_sd_configs for zero-restart target addition
- If containers / Docker → `cAdvisor` on :8080; mount Docker socket read-only
- If NVIDIA GPU → `dcgm-exporter` on :9400; requires nvidia-container-runtime
- If AMD GPU → `amd-smi-exporter` on :2021; requires ROCm driver on host
- If AI inference service → scrape native `/metrics` endpoint; port varies by server (see AI section)
- If training job (batch, exits after run) → push metrics to Pushgateway; scrape Pushgateway
- If external URL / certificate → Blackbox Exporter HTTP/TLS probe
- If vector DB (Qdrant, Weaviate, Milvus) → scrape native metrics endpoint (see AI section)
- Default → node_exporter + cAdvisor + blackbox; layer GPU and AI exporters as needed

### Alert Design

- If alert fires > 5×/day without a corresponding incident → raise `for:` duration or delete it; it's a metric, not an alert
- If alerting on resource saturation (CPU%, RAM%, GPU%) → replace with user-facing SLO signal first
- If service has a defined SLO → use multi-window multi-burn-rate (MWMBR: 1h/5m fast + 6h/30m slow)
- If no SLO defined yet → alert on: error rate > 1%, p99 latency > threshold, availability < 99.9%
- If training job → alert on: job failure, loss divergence (NaN/plateau), OOM, GPU util < 10% for > 10m
- If inference service → alert on: p99 > 2s, error rate > 0.5%, token queue depth > limit
- Every alert MUST carry `runbook_url:` annotation — no runbook = no ship

### Histogram vs Summary

- Default → histogram with explicit buckets matching expected latency range
- Never use summary when you need to aggregate quantiles across multiple instances
- Native histograms (Prometheus 2.40+) preferred over classic bucket histograms when available
- Web service buckets: `[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]`
- AI inference buckets (ms): `[10, 25, 50, 100, 250, 500, 1000, 2000, 5000]`

### Cardinality Control

- Never use `user_id`, `request_id`, `session_id`, `trace_id` as Prometheus label values
- Cardinality budget: < 10,000 series per scrape job; > 100,000 = production incident risk
- If an exporter produces high-cardinality labels → use `metric_relabel_configs: [action: labeldrop]`
- Use `topk(10, ...)` in dashboards for unbounded dimensions (model name, endpoint path)
- Run `tsdb analyze` periodically to find cardinality hotspots before they OOM

### Long-term Storage / HA

- Single Prometheus, < 90d retention → local storage sufficient; add `remote_write` for off-host backup
- Multi-DC or > 6 months retention → Thanos with object store or Grafana Mimir
- HA: run 2 Prometheus replicas with identical configs + `--storage.tsdb.allow-overlapping-blocks`; deduplicate at Thanos Querier or Mimir query-frontend
- Alertmanager HA: 3 replicas minimum with gossip (`--cluster.peer`); route critical to PagerDuty

---

## Reference Files

Detailed configuration templates and code live in `./references/`. Load only what you need.

- **`./references/grafana-provisioning-patterns.md`** — Docker Compose LGTM stack, datasource/dashboard provisioning YAML, Grafana HTTP API import, Grizzly, Terraform grafana provider.
- **`./references/prometheus-scrape-config.md`** — `prometheus.yml` with file-based SD; target JSON templates for nodes, GPU exporters, LiteLLM, vLLM, Ollama, Qdrant.
- **`./references/prometheus-rules.md`** — Recording rules, multi-burn-rate SLO alerts, Alertmanager routing/receivers.
- **`./references/ai-ml-monitoring.md`** — DCGM/AMD GPU exporters, inference server metrics table (LiteLLM/vLLM/Ollama/TGI), vector DB metrics, training Pushgateway Python, drift monitoring, dashboard folder layout.
---

## Anti-Patterns

| Don't | Why | Do Instead |
|-------|-----|------------|
| Edit dashboards in Grafana UI | Lost on container rebuild; no review trail | Set `allowUiUpdates: false`; commit JSON to git |
| Label with `user_id`, `request_id` | Cardinality bomb; Prometheus OOM | Use Tempo traces for request-level data |
| Alert on GPU util > 80% | Training pegs GPU at 95%+ by design | Alert on `GPU_TEMP > 85°C` or job stall |
| 100% trace sampling for training | Storage cost explodes | Tail-sample: 100% errors, 1% success |
| `up == 0` as sole availability alert | Misses partial failures (errors without crash) | Combine with error rate + latency SLO |
| Push training metrics every step | Pushgateway flood; high cardinality | Push per epoch or per N steps only |
| Store Prometheus data > 90d locally | Disk fill; no off-host backup | `remote_write` to Mimir, Thanos, or Grafana Cloud |
| Skip `for:` on GPU alerts | Single-spike flapping at 3am | Minimum `for: 2m` on all GPU alerts |
| Hard-code Grafana org in API calls | Breaks multi-org; not portable | Use service accounts scoped per org |
| One dashboard with 50+ panels | Wall of graphs; 30s+ load time | Overview → service → instance drill-down hierarchy |
| Recording rule names without level prefix | Ambiguous; breaks PromQL conventions | Enforce `level:metric:operations` format |
| Omit `runbook_url:` annotation | On-call has no playbook at 3am | Every alert ships with a runbook link |
| Summary metric for cross-instance quantiles | Summaries cannot be aggregated | Use histogram; aggregate with `histogram_quantile()` |
| Hard-code hostnames or IPs in rules/configs | Breaks portability across environments | Use file_sd_configs + target JSON files |
| Static scrape targets for AI nodes | Requires Prometheus restart to add/remove | Use file_sd_configs; drop new JSON file = live |

## Quality Gates

Before shipping any monitoring change:

- [ ] All datasources provisioned via YAML or Terraform — no datasource added through UI
- [ ] All dashboards committed as JSON in git — provisioning YAML has `allowUiUpdates: false`
- [ ] Recording rules follow `level:metric:operations` naming convention
- [ ] Every alert has `runbook_url:` annotation
- [ ] `.env.example` updated with any new variables — no undocumented placeholders
- [ ] Cardinality check: `count by (job)({job=~".*"})` < 10,000 per scrape job
- [ ] GPU metrics flowing from DCGM/ROCm exporter before shipping GPU dashboard
- [ ] Pushgateway shows training job metrics within 60s of job start
- [ ] MWMBR burn-rate thresholds match stated SLO target (0.001 for 99.9%, etc.)
- [ ] Grafana API deploy is idempotent — running twice produces no errors or duplicates
- [ ] `docker compose up` from clean state loads all dashboards and datasources without manual steps
- [ ] No hostnames, IPs, or port numbers hard-coded in committed configs — all via file_sd or env vars
