---
name: ai-observability
description: Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas
version: "1.0"
owner: platform-governance
tier: full
source: .enterprise/governance/agent-skills/ai-observability/SKILL.md
quick: .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md
portable: true
---

# AI Observability — Full Reference

> Tier 2: complete observability model, metric schemas, mission-control integration, and KPI dashboard.
> Source: AI-SDLC v1.0 §9 (Observability), §10 (KPIs), mission-control API (see `/opt/references/mission-control/`).

---

## 1. Metric Schema (AI-SDLC §9)

Full metric set as defined by AI-SDLC:

```yaml
metrics:
  session:
    tokens_input: integer        # tokens in context window at session start
    tokens_output: integer       # tokens generated by model in session
    context_usage_pct: float     # context_used / context_max × 100
    session_duration_s: integer  # wall-clock seconds

  task:
    task_id: string
    agent: string                # GHOST, ORBIT, KUBE, etc.
    tasks_executed: integer
    execution_time_s: integer
    status: success | failure | partial
    rework: boolean              # true if task was re-executed after failure

  workflow:
    epic_id: string
    phases_completed: integer
    total_phases: integer
    gate_failures: integer
    gate_warnings: integer
    cycle_time_s: integer        # phase_0.start → phase_10.end

  cost:
    model: string                # claude-sonnet-4-6, etc.
    tokens_total: integer        # tokens_input + tokens_output
    cost_usd: float              # tokens_total × model_price (USD per token)
```
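The computed fields in the schema (`context_usage_pct`, `tokens_total`, `cost_usd`) can be derived from raw counters. The following is a minimal sketch, not part of AI-SDLC: the function name `derive_metrics` and the `price_per_mtok_usd` argument (a single blended price per million tokens) are assumptions made for illustration. Real model pricing usually differs for input and output tokens.

```bash
#!/usr/bin/env bash
# Sketch: derive the computed schema fields from raw counters.
# Args: tokens_input tokens_output context_used context_max price_per_mtok_usd
# price_per_mtok_usd is a hypothetical blended USD price per million tokens.
derive_metrics() {
  awk -v ti="$1" -v to="$2" -v cu="$3" -v cm="$4" -v p="$5" '
    BEGIN {
      total = ti + to
      printf "tokens_total: %d\n", total
      # Skip the percentage when the context limit is unknown (zero).
      if (cm > 0) printf "context_usage_pct: %.1f\n", cu / cm * 100
      printf "cost_usd: %.4f\n", total * p / 1000000
    }'
}

# Example values only, not measurements.
derive_metrics 42000 8000 50000 200000 3.00
```

If input and output tokens are billed at different rates, the cost line would need two price arguments instead of one.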

---

## 2. Native HSEOS Metrics Collection

### 2.1 Workflow state file

Path: `.hseos-output/<epic-id>/state.yaml`

Extract delivery metrics:

```bash
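# Cycle time (assumes ISO 8601 timestamps and GNU date; drop the
# date conversion if state.yaml already stores epoch seconds)
STATE=".hseos-output/<epic-id>/state.yaml"
START=$(date -d "$(yq '.phases.preflight.started_at' "$STATE")" +%s)
END=$(date -d "$(yq '.phases.consolidation.completed_at' "$STATE")" +%s)
echo "Cycle time: $(( END - START )) seconds"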

# Gate failures across epic
grep -r "FAILURES" .logs/validation/ | grep -v "FAILURES : 0" | wc -l
```

### 2.2 Quality gate logs

Path: `.logs/validation/gate-<timestamp>.log`

```bash
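# Aggregate gate results (reported as a percentage to match the KPI target)
PASSES=$(grep "PASS" .logs/validation/*.log | wc -l)
FAILS=$(grep "FAIL\b" .logs/validation/*.log | wc -l)
TOTAL=$(( PASSES + FAILS ))
# Guard: bc aborts on division by zero when no gate results exist yet
[ "$TOTAL" -gt 0 ] && echo "Gate failure rate: $(echo "scale=2; $FAILS * 100 / $TOTAL" | bc)%"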
```

### 2.3 Story throughput

```bash
# Completed tasks in epic (path quoted so the <feature> placeholder
# is not parsed as a shell redirection)
grep -c '^- \[x\]' ".specs/features/<feature>/tasks.md"
```
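The count above gives completed tasks only. The story completion rate KPI also needs the open tasks. A minimal sketch, assuming `tasks.md` uses `- [x]` for done and `- [ ]` for open items at the start of a line; the function name `completion_rate` is hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: completion rate = checked tasks / all checkbox tasks.
completion_rate() {
  local file=$1 done_count open_count total
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  done_count=$(grep -c '^- \[x\]' "$file" || true)
  open_count=$(grep -c '^- \[ \]' "$file" || true)
  total=$(( done_count + open_count ))
  if [ "$total" -eq 0 ]; then
    echo "n/a"
    return 0
  fi
  awk -v d="$done_count" -v t="$total" 'BEGIN { printf "%.1f%%\n", d / t * 100 }'
}

# Demo against a throwaway file: 3 of 4 tasks are checked.
tmp=$(mktemp)
printf '%s\n' '- [x] T1' '- [x] T2' '- [ ] T3' '- [x] T4' > "$tmp"
completion_rate "$tmp"
rm -f "$tmp"
```

In a real epic the argument would be the feature's `tasks.md` path instead of the temporary file.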

---

## 3. mission-control Integration

### 3.1 What mission-control provides

Mission-control (`/opt/references/mission-control/`) is an AI agent orchestration dashboard with:
- Agent registry and heartbeat tracking
- Task assignment and status reporting
- Skill registry
- Real-time SSE/WebSocket events
- REST API at `http://localhost:3000` (default)

Auth: `x-api-key: $MC_API_KEY`

### 3.2 Installation

```bash
cd /opt/references/mission-control
npm install
cp .env.example .env   # configure MC_API_KEY, DB path
npm run dev            # or docker-compose up
```

### 3.3 SABLE integration — register + report

SABLE registers itself and reports metrics at the end of each workflow phase:

```bash
# Register SABLE at workflow start
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "register",
    "payload": {
      "agentId": "sable-01",
      "name": "SABLE",
      "metadata": { "epic": "<epic-id>", "capabilities": ["runtime-verify", "finops-audit"] }
    }
  }'

# Report task completion (per phase)
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "task_complete",
    "payload": {
      "agentId": "sable-01",
      "taskId": "phase-9-runtime-verify",
      "status": "success",
      "metrics": {
        "gate_failures": 0,
        "cycle_time_s": 1240,
        "tasks_executed": 3
      }
    }
  }'
```

### 3.4 ORBIT integration — workflow orchestration events

ORBIT emits phase transitions to mission-control:

```bash
# On each phase completion
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "framework": "generic", "action": "heartbeat",
        "payload": { "agentId": "orbit-01", "status": "online",
                     "metrics": { "current_phase": 8, "phases_completed": 7 } } }'
```

---

## 4. KPI Dashboard (AI-SDLC §10)

### 4.1 KPIs available today (no mission-control)

| KPI | Target | How to measure |
|---|---|---|
| Gate failure rate | < 5% per epic | `.logs/validation/` aggregation |
| Delivery cycle time | Trending down | Workflow state timestamps |
| Story completion rate | > 95% per sprint | `tasks.md` `[x]` count |

### 4.2 KPIs with mission-control

| KPI | Target | How to measure |
|---|---|---|
| Cost per feature (USD) | Trending down | `tokens_total × model_price` |
| % stateless execution | > 90% | Sessions without history dependency |
| Context budget adherence | > 80% of sessions at or below 60% context | `context_usage_pct ≤ 60` |
| Rework rate | < 10% | Tasks re-executed after failure |
| Average tasks/session | Stable | `tasks_executed` per session |
| Error rate per agent | < 2% | Failed tasks per agent role |
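Two of the KPIs in the table reduce to simple ratios over exported records. The sketch below assumes hypothetical export formats, since this document does not define how mission-control exposes the data: one task per line as `agent,status,rework`, and one `context_usage_pct` value per line. The function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: rework rate (%) from lines of "agent,status,rework" on stdin.
rework_rate() {
  awk -F, 'NF >= 3 { n++; if ($3 == "true") r++ }
           END { if (n) printf "%.1f\n", r / n * 100; else print "n/a" }'
}

# Sketch: share of sessions (%) with context_usage_pct <= 60, one value per line.
budget_adherence() {
  awk 'NF { n++; if ($1 + 0 <= 60) ok++ }
       END { if (n) printf "%.1f\n", ok / n * 100; else print "n/a" }'
}

# Illustrative records, not real measurements.
printf '%s\n' 'GHOST,success,false' 'KUBE,failure,false' \
              'KUBE,success,true' 'ORBIT,success,false' | rework_rate
printf '%s\n' 41.5 58.0 72.3 12.0 60.0 | budget_adherence
```

Both functions print `n/a` on empty input so that a missing export is not reported as a 0% rate.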

### 4.3 Suggested dashboard layout

```
┌─ AI-SDLC Dashboard ──────────────────────────────────────────┐
│                                                               │
│  DELIVERY                     COST                           │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Cycle time      │          │ Cost/feature    │           │
│  │ Gate fail rate  │          │ Tokens/session  │           │
│  │ Story rate      │          │ Cost trend      │           │
│  └─────────────────┘          └─────────────────┘           │
│                                                               │
│  CONTEXT                      QUALITY                        │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Budget adherence│          │ % stateless     │           │
│  │ Avg context %   │          │ Rework rate     │           │
│  │ Sessions >60%   │          │ Error rate      │           │
│  └─────────────────┘          └─────────────────┘           │
└───────────────────────────────────────────────────────────────┘
```

---

## 5. Escalation

Escalate to SABLE (governance audit) when:
- Gate failure rate > 10% in a sprint
- Any session exceeds 60% context (without middleware enforcement)
- Cost per feature increases > 20% week-over-week
- Rework rate > 15% (indicates task contracts are too loose)
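The four triggers above can be checked mechanically once the KPI values are available. This is a minimal sketch with a hypothetical function name and argument order; it does not model whether middleware enforcement is active, so the context trigger fires on the raw percentage alone.

```bash
#!/usr/bin/env bash
# Sketch: evaluate the escalation triggers. All arguments are percentages:
#   $1 gate failure rate (sprint)   $2 highest session context usage
#   $3 week-over-week cost change   $4 rework rate
# Prints one line per breached threshold; exits 1 if any threshold is breached.
check_escalation() {
  awk -v gate="$1" -v ctx="$2" -v cost="$3" -v rework="$4" '
    BEGIN {
      n = 0
      if (gate > 10)   { print "ESCALATE: gate failure rate " gate "% > 10%"; n++ }
      if (ctx > 60)    { print "ESCALATE: session context " ctx "% > 60%"; n++ }
      if (cost > 20)   { print "ESCALATE: cost per feature up " cost "% > 20%"; n++ }
      if (rework > 15) { print "ESCALATE: rework rate " rework "% > 15%"; n++ }
      exit (n > 0)
    }'
}

# Illustrative values: gate failures and cost growth breach their thresholds.
check_escalation 12 55 25 8 || echo "Escalation to SABLE required"
```

The non-zero exit status makes the check usable as a gate in a pipeline step.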


## Quick Mode

For low-context activation, load `.enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md` or `QUICK.md` first. Load this full skill for deep analysis, violation fixing, or formal review gates.

