---
name: ai-observability
description: Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas
version: "1.0"
owner: platform-governance
tier: full
source: .enterprise/governance/agent-skills/ai-observability/SKILL.md
quick: .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md
portable: true
---

# AI Observability: Full Reference

> Tier 2: complete observability model, metric schemas, mission-control integration, and KPI dashboard.
> Source: AI-SDLC v1.0 §9 (Observability), §10 (KPIs), mission-control API (see `/opt/references/mission-control/`).

---

## 1. Metric Schema (AI-SDLC §9)

Full metric set as defined by AI-SDLC:

```yaml
metrics:
  session:
    tokens_input: integer        # tokens in context window at session start
    tokens_output: integer       # tokens generated by model in session
    context_usage_pct: float     # context_used / context_max × 100
    session_duration_s: integer  # wall clock seconds
  task:
    task_id: string
    agent: string                # GHOST, ORBIT, KUBE, etc.
    tasks_executed: integer
    execution_time_s: integer
    status: success | failure | partial
    rework: boolean              # true if task was re-executed after failure
  workflow:
    epic_id: string
    phases_completed: integer
    total_phases: integer
    gate_failures: integer
    gate_warnings: integer
    cycle_time_s: integer        # phase_0.start → phase_10.end
  cost:
    model: string                # claude-sonnet-4-6, etc.
    tokens_total: integer        # tokens_input + tokens_output
    cost_usd: float              # tokens_total × model_price
```

---

## 2. Native HSEOS Metrics Collection

### 2.1 Workflow state file

Path: `.hseos-output//state.yaml`

Extract delivery metrics:

```bash
# Cycle time
# Assumes started_at / completed_at are Unix epoch seconds; if they are
# ISO 8601 strings, convert them first (e.g. GNU date: date -d "$START" +%s).
START=$(yq '.phases.preflight.started_at' state.yaml)
END=$(yq '.phases.consolidation.completed_at' state.yaml)
echo "Cycle time: $(( END - START )) seconds"

# Gate failures across epic
grep -r "FAILURES" .logs/validation/ | grep -v "FAILURES : 0" | wc -l
```

### 2.2 Quality gate logs

Path: `.logs/validation/gate-.log`

```bash
# Aggregate gate results (the rate is printed as a fraction, e.g. .05 = 5%)
PASSES=$(grep "PASS" .logs/validation/*.log | wc -l)
FAILS=$(grep "FAIL\b" .logs/validation/*.log | wc -l)
TOTAL=$(( PASSES + FAILS ))
# Guard against division by zero when no gate results exist
[ "$TOTAL" -gt 0 ] && echo "Gate failure rate: $(echo "scale=2; $FAILS / $TOTAL" | bc)"
```

### 2.3 Story throughput

```bash
# Completed tasks in epic
grep -c "^\- \[x\]" .specs/features//tasks.md
```

---

## 3. mission-control Integration

### 3.1 What mission-control provides

Mission-control (`/opt/references/mission-control/`) is an AI agent orchestration dashboard with:

- Agent registry and heartbeat tracking
- Task assignment and status reporting
- Skill registry
- Real-time SSE/WebSocket events
- REST API at `http://localhost:3000` (default)

Auth: `x-api-key: $MC_API_KEY`

### 3.2 Installation

```bash
cd /opt/references/mission-control
npm install
cp .env.example .env   # configure MC_API_KEY, DB path
npm run dev            # or docker-compose up
```

### 3.3 SABLE integration: register + report

SABLE registers itself at workflow start and reports metrics at the end of each workflow phase:

```bash
# Register SABLE at workflow start
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "register",
    "payload": {
      "agentId": "sable-01",
      "name": "SABLE",
      "metadata": {
        "epic": "",
        "capabilities": ["runtime-verify", "finops-audit"]
      }
    }
  }'

# Report task completion (per phase)
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "task_complete",
    "payload": {
      "agentId": "sable-01",
      "taskId": "phase-9-runtime-verify",
      "status": "success",
      "metrics": {
        "gate_failures": 0,
        "cycle_time_s": 1240,
        "tasks_executed": 3
      }
    }
  }'
```

### 3.4 ORBIT integration: workflow orchestration events

ORBIT emits phase transitions to mission-control:

```bash
# On each phase completion
# Content-Type is set explicitly: with -d alone, curl sends
# application/x-www-form-urlencoded rather than JSON.
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "heartbeat",
    "payload": {
      "agentId": "orbit-01",
      "status": "online",
      "metrics": {
        "current_phase": 8,
        "phases_completed": 7
      }
    }
  }'
```

---

## 4. KPI Dashboard (AI-SDLC §10)

### 4.1 KPIs available today (no mission-control)

| KPI | Target | How to measure |
|---|---|---|
| Gate failure rate | < 5% per epic | `.logs/validation/` aggregation |
| Delivery cycle time | Trending down | Workflow state timestamps |
| Story completion rate | > 95% per sprint | `tasks.md` `[x]` count |

### 4.2 KPIs with mission-control

| KPI | Target | How to measure |
|---|---|---|
| Cost per feature (USD) | Trending down | `tokens_total × model_price` |
| % stateless execution | > 90% | Sessions without history dependency |
| Context budget adherence | > 80% of sessions within 60% context | `context_usage_pct ≤ 60` |
| Rework rate | < 10% | Tasks re-executed after failure |
| Average tasks/session | Stable | `tasks_executed` per session |
| Error rate per agent | < 2% | Failed tasks per agent role |

### 4.3 Suggested dashboard layout

```
┌─ AI-SDLC Dashboard ──────────────────────────────────────────┐
│                                                              │
│  DELIVERY                     COST                           │
│  ┌─────────────────┐          ┌─────────────────┐            │
│  │ Cycle time      │          │ Cost/feature    │            │
│  │ Gate fail rate  │          │ Tokens/session  │            │
│  │ Story rate      │          │ Cost trend      │            │
│  └─────────────────┘          └─────────────────┘            │
│                                                              │
│  CONTEXT                      QUALITY                        │
│  ┌─────────────────┐          ┌─────────────────┐            │
│  │ Budget adherence│          │ % stateless     │            │
│  │ Avg context %   │          │ Rework rate     │            │
│  │ Sessions >60%   │          │ Error rate      │            │
│  └─────────────────┘          └─────────────────┘            │
└──────────────────────────────────────────────────────────────┘
```

---

## 5. Escalation

Escalate to SABLE (governance audit) when:

- Gate failure rate > 10% in a sprint
- Any session exceeds 60% context (without middleware enforcement)
- Cost per feature increases > 20% week-over-week
- Rework rate > 15% (indicates task contracts are too loose)

## Quick Mode

For low-context activation, load `.enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md` or `QUICK.md` first. Load this full skill for deep analysis, violation fixing, or formal review gates.
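
The four escalation thresholds in Section 5 can be checked mechanically once the sprint KPIs have been aggregated. The following is a minimal Python sketch, not part of AI-SDLC or mission-control: the `SprintKpis` container, its field names, and the `escalation_reasons` helper are all hypothetical, and the "without middleware enforcement" caveat on the context rule is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SprintKpis:
    """Hypothetical container; field names are illustrative only."""
    gate_failures: int
    gate_passes: int
    max_context_usage_pct: float      # highest context_usage_pct of any session
    cost_per_feature_usd: float       # current week
    prev_cost_per_feature_usd: float  # previous week
    tasks_total: int
    tasks_reworked: int               # tasks whose rework flag is true


def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def escalation_reasons(k: SprintKpis) -> List[str]:
    """Apply the Section 5 thresholds and list every one that is breached."""
    reasons = []
    if pct(k.gate_failures, k.gate_failures + k.gate_passes) > 10:
        reasons.append("gate failure rate > 10%")
    if k.max_context_usage_pct > 60:
        reasons.append("session exceeded 60% context")
    cost_delta = k.cost_per_feature_usd - k.prev_cost_per_feature_usd
    if pct(cost_delta, k.prev_cost_per_feature_usd) > 20:
        reasons.append("cost per feature up > 20% week-over-week")
    if pct(k.tasks_reworked, k.tasks_total) > 15:
        reasons.append("rework rate > 15%")
    return reasons


# Example: 3 failures out of 20 gate results is 15%, so only the gate rule fires.
kpis = SprintKpis(gate_failures=3, gate_passes=17, max_context_usage_pct=58.0,
                  cost_per_feature_usd=12.0, prev_cost_per_feature_usd=11.0,
                  tasks_total=40, tasks_reworked=2)
print(escalation_reasons(kpis))  # prints ['gate failure rate > 10%']
```

An empty list means no escalation is needed; any non-empty result could be attached to the report sent to SABLE.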