---
name: forge-observability
description: Production observability via OpenTelemetry. Three pillars (traces, metrics, logs) with explicit correlation, OTel SDK setup, span discipline (semantic conventions, error capture), RED + USE metrics, sampling strategy, SLOs over SLIs, alert design (page on user pain, not on cause). Contains paste-ready OTel SDK init, span helpers, a Prom scraper-friendly metrics block. Use when adding observability to a service or auditing an existing one.
license: MIT
---

# forge-observability

You are wiring observability into a service that real users depend on. Default agent-written observability is a `console.log` and a Prometheus counter named `requests_total`. When it breaks at 3am, the operator finds neither the trace nor the cause - just a graph that went red. This skill exists to fix that.

The mental model: **three pillars - traces, metrics, logs - linked by IDs.** Traces tell you "what one request did." Metrics tell you "what the system did in aggregate." Logs tell you "what one specific event was about." Without IDs that link them, you have three disconnected datasets and no story.

## Quick reference (the things you must never ship)

1. A request handler that does not produce a trace span.
2. An error path that logs the message but not the trace ID.
3. Metrics named `count`, `time`, `errors` without dimensions.
4. Alerts on every error (alert fatigue) instead of on user impact.
5. Sampling rate hardcoded at 100% in production (cost) or 1% in staging (debugging-blind).
6. Spans named after code locations (`handler.go:42`) instead of operations (`http.GET /orders`).
7. Cardinality bombs: dimensions like `user_id` on a counter.
8. SLO with no error budget burn alert.
9. Logging the full request body at info level (PII + bytes + redaction risk).
10. Trace context not propagated across an HTTP / queue boundary.

## Hard rules

### Three pillars + IDs

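**1. Traces, metrics, logs. Each pillar links back to the correlation ID.** Spans and log lines carry `trace_id` directly; metrics link to traces through exemplars, never through a `trace_id` dimension (rule 12).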

| Pillar | Cardinality | Storage cost | When to use |
| --- | --- | --- | --- |
| **Traces** | High (one per request) | $$$ - sampled | What one request did, end to end |
| **Metrics** | Low (aggregated) | $ | Aggregate behavior, alerting |
| **Logs** | Medium | $$ | Specific event detail |

**2. Every request gets a `trace_id`.** Every log line and span carries it. Every error includes it in the user-facing error response (forge-api-design rule 7).

```ts
// reference: getting the active trace id
import { trace } from "@opentelemetry/api";

function getTraceId(): string | undefined {
  return trace.getActiveSpan()?.spanContext().traceId;
}

// include it in every log
log.info({ trace_id: getTraceId(), order_id }, "order.created");

// and in error responses
return errorResponse(c, 500, "internal_error", "Something went wrong.", { trace_id: getTraceId() });
```

### OpenTelemetry SDK setup

**3. Use OpenTelemetry, not vendor SDKs.** OTel is the standard; vendor SDKs are tied to one vendor.

```ts
// reference: minimal OTel SDK init for a Node service
// instrumentation.ts (loaded BEFORE app code)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "orders-api",
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA ?? "dev",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? "development",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT }),
    exportIntervalMillis: 30_000,
  }),
  sampler: new TraceIdRatioBasedSampler(Number(process.env.OTEL_SAMPLE_RATIO ?? "0.1")),
  instrumentations: [getNodeAutoInstrumentations({
    "@opentelemetry/instrumentation-fs": { enabled: false },  // fs is too noisy
  })],
});

sdk.start();
```

**4. Auto-instrumentation handles 90% of spans.** HTTP, fetch, pg, redis, etc. are all traced automatically. Add manual spans only for custom operations.

### Span discipline

**5. Spans named after operations, not code locations.**

```
GOOD                       BAD
http.GET /orders           handler.ts:42
db.query SELECT users      query
stripe.charge.create       chargeCustomer
```

**6. Use semantic conventions for attributes.** [OpenTelemetry semantic conventions](https://opentelemetry.io/docs/specs/semconv/).

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-api");

await tracer.startActiveSpan("order.create", async (span) => {
  span.setAttributes({
    "order.customer_id": input.customer_id,
    "order.currency": input.currency,
    "order.item_count": input.items.length,
    "order.total_cents": total,
  });

  try {
    const order = await createOrder(input);
    span.setAttribute("order.id", order.id);
    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
    throw err;
  } finally {
    span.end();
  }
});
```

**7. Record exceptions on the span, not just in logs.**

**8. Status code is OK or ERROR.** No `WARN`. A span you never set stays `UNSET`, which backends treat as "no error".

### Metrics

**9. RED for request-driven services.**

| Metric | Meaning |
| --- | --- |
| **R**ate | Requests per second |
| **E**rrors | Error rate (or error count) |
| **D**uration | Latency distribution (percentiles) |
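
To make the three definitions concrete, here is a minimal sketch that derives RED values from raw request samples. In a real service the backend computes these from counters and histograms; `RequestSample`, `summarizeRed`, and the nearest-rank percentile are illustrative choices, not part of OTel.

```ts
// Hypothetical shape of one observed request.
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate
  errorRatio: number;    // Errors (5xx only; 4xx is the caller's fault)
  p50Ms: number;         // Duration
  p99Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAscending: number[], p: number): number {
  if (sortedAscending.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedAscending.length);
  const clamped = Math.min(Math.max(rank, 1), sortedAscending.length);
  return sortedAscending[clamped - 1];
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p50Ms: percentile(durations, 50),
    p99Ms: percentile(durations, 99),
  };
}
```

Percentiles, not averages, are the point of the Duration row: one slow request out of four moves p99 but barely moves the mean.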

**10. USE for resource-driven services.**

| Metric | Meaning |
| --- | --- |
| **U**tilization | % of resource in use |
| **S**aturation | Queue depth / wait time |
| **E**rrors | Resource errors |
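
The same idea for USE, sketched against a connection pool. The `PoolSnapshot` fields are assumptions about what a pool exposes; map them to whatever your driver actually reports.

```ts
// Hypothetical snapshot of a connection pool over one collection interval.
interface PoolSnapshot {
  capacity: number;        // maximum connections
  inUse: number;           // connections currently checked out
  waiting: number;         // callers queued for a connection
  acquireAttempts: number; // acquisitions attempted in the interval
  acquireErrors: number;   // acquisitions that failed or timed out
}

interface UseSummary {
  utilization: number; // fraction of capacity in use
  saturation: number;  // queue depth: work that cannot be served yet
  errorRatio: number;  // failed acquisitions / attempts
}

function summarizeUse(pool: PoolSnapshot): UseSummary {
  return {
    utilization: pool.capacity === 0 ? 0 : pool.inUse / pool.capacity,
    saturation: pool.waiting,
    errorRatio: pool.acquireAttempts === 0 ? 0 : pool.acquireErrors / pool.acquireAttempts,
  };
}
```

Utilization and saturation answer different questions: a pool at 75% with three callers waiting is already hurting someone.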

**11. Names follow OTel semantic conventions.** `http.server.request.duration`, `db.client.connections.usage`. Not `time_elapsed_ms_avg`.

**12. Dimensions (attributes) are LOW cardinality.** `route`, `status_class` (2xx/4xx/5xx), `method`. Never `user_id`, `request_id`, full URL with IDs.

```ts
// BAD: cardinality bomb
counter.add(1, { user_id: "...", path: "/orders/01HXY..." });

// GOOD: low cardinality
counter.add(1, { route: "/orders/:id", method: "GET", status_class: "2xx" });
```
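
A small sketch of why this matters. The worst-case number of time series is the product of distinct values per dimension. `toStatusClass` and `maxSeries` are hypothetical helpers, and the route, method, and user counts are made-up inputs.

```ts
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx" | "unknown";

// Collapse ~60 possible status codes into at most 6 dimension values.
function toStatusClass(statusCode: number): StatusClass {
  const isInteger = Math.floor(statusCode) === statusCode;
  if (!isInteger || statusCode < 100 || statusCode > 599) return "unknown";
  return `${Math.floor(statusCode / 100)}xx` as StatusClass;
}

// Upper bound on series for one metric: product of distinct values per dimension.
function maxSeries(distinctValuesPerDimension: number[]): number {
  return distinctValuesPerDimension.reduce((product, n) => product * n, 1);
}

// 20 routes x 5 methods x 5 status classes
const boundedSeries = maxSeries([20, 5, 5]); // 500
// Same metric with a user_id dimension over 100,000 users
const explodedSeries = maxSeries([20, 5, 5, 100_000]); // 50,000,000
```

Per-user detail belongs on span attributes, where it costs one attribute on a sampled trace rather than a new series.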

**13. Choose histogram bucket boundaries explicitly.** Default buckets often miss your actual distribution.

```ts
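import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("orders-api");
// Semantic conventions define http.server.request.duration in seconds,
// so the boundaries below are seconds (5 ms to 10 s).
const httpDuration = meter.createHistogram("http.server.request.duration", {
  description: "HTTP request duration",
  unit: "s",
  advice: { explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] },
});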
```
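
To see what "misses your distribution" means, the sketch below counts values per bucket using upper-inclusive boundaries, which is how OTel explicit-bucket histograms are specified. `bucketCounts` is a hypothetical helper and the observed values are invented; boundaries and values only need to share a unit.

```ts
// Returns one count per bucket. Bucket i covers (boundaries[i-1], boundaries[i]];
// the final bucket is the overflow above the last boundary.
function bucketCounts(boundaries: number[], values: number[]): number[] {
  const counts: number[] = [];
  for (let i = 0; i <= boundaries.length; i++) counts.push(0);
  for (const value of values) {
    let index = boundaries.length; // overflow bucket unless a boundary fits
    for (let i = 0; i < boundaries.length; i++) {
      if (value <= boundaries[i]) {
        index = i;
        break;
      }
    }
    counts[index] += 1;
  }
  return counts;
}

const observedLatencies = [120, 180, 240, 310, 480];
const coarseCounts = bucketCounts([100, 1000], observedLatencies);                 // [0, 5, 0]
const targetedCounts = bucketCounts([100, 150, 200, 300, 500], observedLatencies); // [0, 1, 1, 1, 2, 0]
```

With the coarse boundaries every request lands in one bucket, so any percentile the backend interpolates is a guess somewhere between 100 and 1000.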

### Sampling

**14. Head-based sampling by default. Tail-based for high-signal traces only.**

```
Default:     1-10% head sampling    (decided when the trace starts, before the outcome is known)
High-signal: tail-based             (1% sampled by default, but 100% of errors + 100% of slow requests)
```

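**15. Always trace errors. Always trace slow requests.** Even at 1% baseline sampling. A head sampler decides before the outcome exists, so this guarantee needs tail sampling (typically in the OTel Collector), and the SDK must export every trace the tail sampler is meant to choose from.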

**16. Sample rate per environment.**
- Production: 1-10% baseline + 100% errors/slow
- Staging: 100%
- Dev: 100%
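
The decision logic behind rules 14-16 can be sketched as a pure function. This is an illustration of the policy only, not the Collector's tail-sampling processor and not the algorithm inside `TraceIdRatioBasedSampler`; `TraceSummary`, `TailPolicy`, and the slow threshold are assumptions.

```ts
// What a tail sampler knows once a trace has finished.
interface TraceSummary {
  traceId: string; // 32 lowercase hex characters
  hasError: boolean;
  durationMs: number;
}

interface TailPolicy {
  baselineRatio: number;   // e.g. 0.01 in production, 1 in staging and dev
  slowThresholdMs: number; // assumed cutoff for "slow"
}

// Deterministic baseline: map the first 8 hex chars of the trace ID onto [0, 1).
// Every service that sees the same trace ID reaches the same decision.
function baselineKeeps(traceId: string, ratio: number): boolean {
  const prefix = parseInt(traceId.slice(0, 8), 16);
  return prefix / 0x100000000 < ratio;
}

function keepTrace(t: TraceSummary, policy: TailPolicy): boolean {
  if (t.hasError) return true;                              // 100% of errors
  if (t.durationMs >= policy.slowThresholdMs) return true;  // 100% of slow requests
  return baselineKeeps(t.traceId, policy.baselineRatio);    // baseline for the rest
}
```

A staging policy is the same function with `baselineRatio: 1`, so the per-environment difference stays in configuration rather than code.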

### SLIs / SLOs

**17. Define SLIs from user pain, not from infrastructure.**

| Good SLI | Bad SLI |
| --- | --- |
| "% of homepage requests completing in <1s" | "CPU utilization <80%" |
| "% of orders processed within 5min of submission" | "Queue depth <1000" |
| "% of API requests returning a non-5xx response" | "Memory usage <2GB" |

**18. SLO is your target.** "99.9% of homepage requests <1s, over a 30-day window."
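
A minimal sketch of the relationship: the SLI is a measured ratio of good events to all events, the SLO is the threshold it is compared against. `TimedRequest`, `latencySli`, and `meetsSlo` are hypothetical names, and treating an empty window as compliant is an assumption.

```ts
interface TimedRequest {
  durationMs: number;
}

// SLI: fraction of requests that completed under the latency threshold.
function latencySli(requests: TimedRequest[], thresholdMs: number): number {
  if (requests.length === 0) return 1; // assumption: no traffic means no violations
  const good = requests.filter((r) => r.durationMs < thresholdMs).length;
  return good / requests.length;
}

// SLO: the target the SLI must meet over the window.
function meetsSlo(sli: number, sloTarget: number): boolean {
  return sli >= sloTarget;
}
```

For the homepage example, `meetsSlo(latencySli(requests, 1000), 0.999)` is the whole check, evaluated over 30 days of requests.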

**19. Error budget = (1 - SLO) over the window.** 99.9% = 43m of error per month. Alert on burn rate, not on individual errors.
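
The arithmetic behind the 43-minute figure, as a short sketch. Both function names are illustrative; the request-based variant is included because most services budget in failed requests rather than in minutes.

```ts
// Time-based budget: minutes of full outage the SLO tolerates over the window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Request-based budget: how many requests may fail over the window.
function errorBudgetRequests(sloTarget: number, totalRequests: number): number {
  return (1 - sloTarget) * totalRequests;
}

const monthlyMinutes = errorBudgetMinutes(0.999, 30);          // ≈ 43.2
const monthlyFailures = errorBudgetRequests(0.999, 10_000_000); // ≈ 10,000
```

Each extra nine divides the budget by ten: 99.99% over the same 30 days leaves about 4.3 minutes.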

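**20. Burn-rate alerts.** Burn rate is the observed error rate divided by the rate the SLO allows; at burn rate 1 the budget lasts exactly the window. Page on a high burn rate over a short window (fast burn) and on a lower one over a longer window (slow burn). A commonly cited pair for a 30-day SLO, from the Google SRE Workbook: 14.4x over 1 hour (2% of the budget spent) or 6x over 6 hours (5% spent).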
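
A sketch of the quantities a burn-rate alert evaluates. The function names are illustrative, and the example error ratio is invented; thresholds are inputs you choose, not something this code decides.

```ts
// Burn rate: how many times faster than "allowed" the budget is being spent.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// Hours until the whole budget is gone if the current burn rate holds.
function hoursToExhaustion(rate: number, windowDays: number): number {
  return rate <= 0 ? Infinity : (windowDays * 24) / rate;
}

// Fraction of the window's budget consumed by burning at `rate` for `hours`.
function budgetConsumed(rate: number, hours: number, windowDays: number): number {
  return (rate * hours) / (windowDays * 24);
}

// SLO 99.9% over 30 days, 1.44% of requests currently failing:
const currentBurn = burnRate(0.0144, 0.999);                // ≈ 14.4
const hoursLeft = hoursToExhaustion(currentBurn, 30);       // ≈ 50
const spentInOneHour = budgetConsumed(currentBurn, 1, 30);  // ≈ 0.02
```

Evaluating the same burn rate over a short and a long window is what keeps a brief spike from paging while a sustained leak still does.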

### Alert design

**21. Page on user pain, not on cause.** "Login endpoint p99 >3s" is pain. "Database CPU 90%" is cause - it might be fine if the cache is hot.

**22. Every alert has a runbook.** What does the alert mean, how to diagnose, how to mitigate. If you cannot write five lines, you do not have an alert; you have a notification.

**23. Severity tiers.**

| Tier | Response time | Channel |
| --- | --- | --- |
| **Page** | Immediately (wake the on-call) | PagerDuty / Opsgenie |
| **Alert** | Look within 24h | Email / Slack |
| **Ticket** | Look within a week | Issue tracker |

**24. Alert hygiene.** A page that fires more than once a week without being a real incident is broken. Tune or delete.

### Logs (in context of observability)

See [`forge-logging`](../backend/forge-logging/SKILL.md) for the full discipline. Critical points here:

**25. Every log line carries `trace_id`, `span_id`, `service.name`.**
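
One way to make this hard to forget is to build the correlation fields in a single helper and spread them into every log call. The sketch below uses a structural stand-in for the span context so it runs without the SDK; in a service the value would come from `trace.getActiveSpan()?.spanContext()`. `SpanContextLike` and `correlationFields` are hypothetical names.

```ts
// Structural stand-in for the parts of OTel's SpanContext used here.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Returns the fields every log line should carry. Outside a span, only
// service.name is set, so the line is still attributable to a service.
function correlationFields(
  serviceName: string,
  spanContext: SpanContextLike | undefined,
): Record<string, string> {
  const fields: Record<string, string> = { "service.name": serviceName };
  if (spanContext) {
    fields["trace_id"] = spanContext.traceId;
    fields["span_id"] = spanContext.spanId;
  }
  return fields;
}

// Usage: merge into the structured payload before handing it to the logger.
const orderLogFields = {
  ...correlationFields("orders-api", {
    traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
    spanId: "00f067aa0ba902b7",
  }),
  order_id: "ord_123",
};
```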

**26. Errors logged once at the catch site.** Span records the exception (rule 7); log captures the additional context.

### Cost

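**27. Trace storage is expensive. Sample.** A 100% sample at 1000 RPS is 86.4M requests/day, so ~100M spans/day even at barely more than one span per request. At $0.50/M ingested, that's ~$50/day for one service, and every extra span per request multiplies it.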
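
A back-of-the-envelope estimator for the numbers above. The field names are illustrative and the price is the same example figure, not a quote from any vendor.

```ts
interface TraceCostInput {
  requestsPerSecond: number;
  spansPerRequest: number;    // average spans in one trace
  sampleRatio: number;        // 1 = keep everything, 0.1 = keep 10%
  usdPerMillionSpans: number; // example ingest price
}

const SECONDS_PER_DAY = 86_400;

function dailySpans(input: TraceCostInput): number {
  return input.requestsPerSecond * SECONDS_PER_DAY * input.spansPerRequest * input.sampleRatio;
}

function dailyCostUsd(input: TraceCostInput): number {
  return (dailySpans(input) / 1_000_000) * input.usdPerMillionSpans;
}

const unsampled: TraceCostInput = {
  requestsPerSecond: 1000,
  spansPerRequest: 1,
  sampleRatio: 1,
  usdPerMillionSpans: 0.5,
};
const unsampledCost = dailyCostUsd(unsampled);                          // ≈ 43.2
const sampledCost = dailyCostUsd({ ...unsampled, sampleRatio: 0.1 });   // ≈ 4.32
```

Plug in your real spans-per-request before trusting the result; a handler that touches a database and two downstream services is rarely a one-span trace.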

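**28. Metrics storage is cheap. Use them freely.** A counter is a few bytes per scrape per time series, so cost tracks the number of series: keep dimensions low-cardinality (rule 12) and add metrics without worry.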

**29. Log retention tiers.** Hot (search) for 7-14 days, cold (compliance) for 90+ days. Most queries hit the last 24 hours.

### Distributed tracing

**30. Context propagation across HTTP / queue boundaries.** W3C `traceparent` header is the standard.

```ts
// outgoing HTTP - OTel auto-instrumentation does this for fetch / axios / undici
// for queues, propagate manually:
import { context, propagation } from "@opentelemetry/api";

// producer
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await queue.publish(message, { headers: carrier });

// consumer
const ctx = propagation.extract(context.active(), message.headers);
// await so failures surface here and the extracted context covers the whole handler
await context.with(ctx, () => processMessage(message));
```
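
For reference, this is what actually travels in the carrier. The sketch parses and formats a version `00` `traceparent` value by hand to show the wire format; in a service, `propagation.inject` / `propagation.extract` do this for you. `TraceParent`, `parseTraceparent`, and `formatTraceparent` are hypothetical helpers, and only the sampled flag is modelled.

```ts
interface TraceParent {
  version: string;  // "00" is the only version handled here
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars (the caller's span ID)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_PATTERN = /^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns undefined for anything that is not a valid version-00 header.
function parseTraceparent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return undefined;
  const traceId = match[2];
  const parentId = match[3];
  // All-zero IDs are explicitly invalid in the W3C Trace Context spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  const flags = parseInt(match[4], 16);
  return { version: match[1], traceId, parentId, sampled: (flags & 0x01) === 1 };
}

function formatTraceparent(tp: TraceParent): string {
  return `${tp.version}-${tp.traceId}-${tp.parentId}-${tp.sampled ? "01" : "00"}`;
}

const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsedHeader = parseTraceparent(incomingHeader);
```

If a queue consumer's traces start with a fresh trace ID instead of the producer's, a missing or malformed value in this header is the first thing to check.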

## Common AI-output patterns to reject

| Pattern | Why wrong | Fix |
| --- | --- | --- |
| Vendor-specific SDK (Datadog, New Relic) hardcoded | Lock-in | OTel SDK + vendor as exporter |
| Span named `handler.ts:42` | Code location, not operation | `http.GET /orders/:id` |
| Counter with `user_id` dimension | Cardinality bomb | Aggregate at handler level: `route` + `status_class` |
| Alert on every 5xx | Alert fatigue | SLO + burn-rate |
| `console.log(err)` no trace_id | Cannot correlate | Structured log with trace_id |
| 100% sample rate in production | $$$ | 1-10% + 100% errors/slow |
| 1% sample in staging | Debug-blind | 100% in staging |
| No `service.version` on spans | Cannot tell what code emitted | Set in Resource |
| Histogram auto-buckets only | Misses real distribution | Explicit bucket boundaries |
| Alert without a runbook | Pages without action | Five-line runbook minimum |

## Worked example: a fully-traced handler

```ts
import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-api");
const meter  = metrics.getMeter("orders-api");

const ordersCreated = meter.createCounter("orders.created", {
  description: "Number of orders created",
});
const ordersDuration = meter.createHistogram("orders.create.duration", {
  unit: "ms",
  description: "Order creation latency",
  advice: { explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000] },
});

app.post("/v1/orders", async (c) => {
  return tracer.startActiveSpan("http.POST /v1/orders", async (span) => {
    const t0 = performance.now();
    const traceId = span.spanContext().traceId;

    try {
      const body = await c.req.json();
      const parsed = CreateOrderInput.safeParse(body);
      if (!parsed.success) {
        span.setAttribute("http.response.status_code", 400);
        span.setStatus({ code: SpanStatusCode.OK });  // 4xx is NOT a server error
        return errorResponse(c, 400, "validation_failed", "Validation failed.", {
          trace_id: traceId,
          fields: issuesToFields(parsed.error),
        });
      }

      span.setAttributes({
        "order.customer_id": parsed.data.customer_id,
        "order.currency": parsed.data.currency,
        "order.item_count": parsed.data.items.length,
      });

      const order = await createOrder(parsed.data);
      span.setAttribute("order.id", order.id);

      ordersCreated.add(1, { currency: parsed.data.currency, status_class: "2xx" });
      ordersDuration.record(performance.now() - t0, { status_class: "2xx" });

      span.setStatus({ code: SpanStatusCode.OK });
      return c.json({ data: order }, 201);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      ordersCreated.add(1, { currency: "unknown", status_class: "5xx" });
      ordersDuration.record(performance.now() - t0, { status_class: "5xx" });

      logger.error({ err, trace_id: traceId }, "order.create.failed");
      return errorResponse(c, 500, "internal_error", "Something went wrong.", { trace_id: traceId });
    } finally {
      span.end();
    }
  });
});
```

What this shows: span named after operation + route (rule 5); semantic attributes (rule 6); 4xx is `OK` for the span (it's the caller's fault, not the server's) (rule 8); exception recorded on the span AND logged separately with the same trace_id (rules 7, 25-26); metrics with low-cardinality dimensions (rule 12); explicit histogram buckets (rule 13); `trace_id` returned to the client so support can pull the trace from logs.

## Workflow

When adding observability:

1. **Set up the OTel SDK at the entry point.** `instrumentation.ts` loaded before app code (Node `--require` for CommonJS, `--import` for ESM).
2. **Auto-instrumentation first.** Covers HTTP, fetch, DB, redis - 90% of spans.
3. **Add manual spans for custom operations.** Use semantic conventions.
4. **Define your SLIs and SLOs.** User-facing.
5. **Wire RED metrics on the boundary.** Rate, Errors, Duration with low-cardinality dimensions.
6. **Wire SLO burn-rate alerts.** Page on burn; ticket on individual errors.
7. **Write runbooks for every page-able alert.**

## Verification

Manual checklist:

- [ ] Every request produces a trace with the operation name.
- [ ] Every log line carries `trace_id`.
- [ ] Errors recorded on spans (not just in logs).
- [ ] No high-cardinality dimensions on metrics.
- [ ] Sampling rate set per environment.
- [ ] SLOs defined and burn-rate alerts configured.
- [ ] Every page-able alert has a runbook.

## When to skip this skill

- Personal projects with <10 daily users.
- Static sites (no requests to trace).
- Pure CLIs (already log to stdout).

## Related skills

- [`forge-logging`](../backend/forge-logging/SKILL.md) - structured logging with trace_id propagation.
- [`forge-error-handling`](../backend/forge-error-handling/SKILL.md) - error boundaries that record exceptions on the span.
- [`forge-api-design`](../backend/forge-api-design/SKILL.md) - request_id correlation is established there.
- [`forge-kubernetes`](forge-kubernetes/SKILL.md) - sidecar / agent patterns for OTel collection.
