---
name: aio
description: "Agentic Implementation Optimizer — audit existing agentic AI implementations for reliability/cost/architecture issues, walk through creating new agentic implementations from scratch, or extend existing implementations with new capabilities. Use when: '/aio', 'audit my agent', 'build an agent', 'agentic review', 'extend this agent', 'agent architecture', 'optimize my agent'. Covers architecture patterns, reliability engineering, guardrails, cost optimization, observability, escape-hatch/abstention design, gap-signal capture, and production readiness."
---

# Agentic Implementation Optimizer (AIO)

Expert-level guidance for building, auditing, and extending agentic AI systems. Grounded in Anthropic's published research, peer-reviewed papers, and industry data compiled through May 2026.

---

## Mode Detection

Determine the **mode** from the user's request:

| Trigger | Mode |
|---------|------|
| "audit", "review my agent", "check my agent", "assess", `--audit` | **Audit** |
| "create", "build", "new agent", "start from scratch", "walk me through", `--create` | **Create** |
| "extend", "add feature", "add capability", "improve", "optimize", `--extend` | **Extend** |
| (ambiguous or no indicator) | **Ask the user** which mode they want |

---

## Cross-Cutting Principle: Resourcing, Escape Hatches & Gap Signals

These three concerns are one loop, and they apply in **every** mode — Audit checks for them, Create builds them in, Extend must preserve them. Most "hallucination" in production is not a model defect; it is an agent forced to act without the resources to act correctly, with no sanctioned way to decline. Close this loop and hallucinations convert from silent failures into a labeled backlog of exactly what to build next.

### 1. Progressive disclosure — give the agent the resources it needs, just-in-time

An agent makes bad decisions when the context it needs is absent, buried, or dumped all at once. Provision context the way Anthropic's Agent Skills and "Effective Context Engineering for AI Agents" guidance prescribe — **progressive disclosure**: surface a thin, high-signal layer by default and let the agent pull deeper detail on demand, rather than front-loading everything into the prompt.

- **Layered instructions**: a short always-loaded core; detailed references, examples, and edge-case playbooks fetched only when the task touches them (skill/sub-skill loading, retrieval, sub-agent hand-off).
- **Just-in-time tools/resources**: prefer tools that *fetch* context (search, read, list) over statically stuffing the same context — the agent retrieves what the specific step needs, when it needs it.
- **Why it matters**: "When Context Hurts" (arxiv 2605.04361) shows identical context can yield up to 20× gains *or* 46% degradation; "Context Length Alone Hurts" (arxiv 2510.05381) shows degradation even with perfect retrieval. More context is not more capability — disclose progressively and measure the sign.

The goal: by the time the agent must decide, it either *has* what it needs or it can *go get it*. If neither is true, it must hit an escape hatch — never guess.

### 2. Escape hatches — sanction the decision to NOT decide

An escape hatch is a first-class, designed path the agent takes when it lacks the resources or confidence for a high-confidence decision. Stated to the agent plainly:

> If you do not have the information/data you need to make a high-confidence decision, **do not make one.** Abstain, take the safe action, and log the gap.

This is **abstention / selective prediction** as an architectural primitive, not an afterthought:

- **Define the safe default per decision point** — escalate to a human, return a typed "insufficient-context" result, fall back to a deterministic path, or no-op. Silence-and-guess is never the default.
- **Gate on calibrated confidence, not vibes** — trajectory-level confidence (HTC, arxiv 2601.15778) drives the abstain/HITL threshold, calibrated per domain (see HITL thresholds under Safety). Where the model can answer partially, hedge specificity rather than refuse outright (Calibrated Claim-Level Specificity, arxiv 2604.17487).
- **Make the hatch cheap and reachable in the tool surface** — give the agent an explicit `flag_gap` / `escalate` / `report_insufficient_context` action. If the only available actions are "answer," the agent *will* answer — wrong. An escape hatch the agent can't reach isn't one.
- **Recovery ≠ diagnosis** — PROBE (arxiv 2605.08717) shows a 65%/22% diagnosis-vs-recovery gap; the escape hatch is the *guaranteed* recovery path when in-loop recovery fails. It always exists, even when the agent can't fix the problem itself.

### 3. Gap signals — every abstention and hallucination is a labeled data point

The payoff: an escape hatch isn't just a safety valve, it's an **instrument**. Each trip emits a structured signal pointing directly at a missing resource, dataset, or capability.

> Each hallucination/abstention flags an improvement: data that's missing, or a function your users need that you don't have. Frequent trips at one decision point = a critical gap, ranked for you by real traffic.

- **Route every trip to a durable sink** — not a log line that scrolls away. A structured event (`{decision_point, missing_resource, confidence, inputs_hash, timestamp}`) to a queryable store (event log, warehouse table, issue tracker).
- **Capture hallucinations as the same signal class** — when post-hoc evals or user feedback catch a confident-but-wrong answer, that's an escape hatch that *should have fired and didn't*. Feed it back as both a gap signal and a calibration-threshold correction.
- **Close the loop into the roadmap** — aggregate by decision point, rank by frequency × cost. High-frequency gaps become the next tool, dataset, or integration. This is the agent telling you what to build, grounded in real traffic instead of guesses. (Complements the failure-attribution loop under Reliability — attribution says *why a run failed*; gap signals say *what's structurally missing*.)
- **This is active learning for your product surface** — abstentions are the highest-information points in your traffic; they mark exactly where the system's competence boundary lies.

**Mode hooks:** Audit verifies all three exist and that gap signals actually reach a queryable sink (a hatch with no sink is theater). Create builds the abstention primitive in Phase D and the signal pipeline in Phase F. Extend gives every new capability its own escape hatch and signal route before shipping.

---

## Mode 1: Audit

Perform a comprehensive audit of an existing agentic implementation. Spawn sub-agents in parallel to cover independent audit dimensions, then synthesize findings.

### Step 1: Discovery (Sub-Agent)

Spawn an Explore agent to map the agentic implementation:
- Identify the orchestration pattern in use (single agent, prompt chain, routing, orchestrator-workers, full agent loop, multi-agent)
- Catalog all tools/functions the agent can call
- Map the context management strategy (system prompts, memory, RAG, scratchpads)
- Identify state management approach (vector stores, KV, knowledge graphs, checkpointing)
- Locate error handling, retry logic, and fallback paths
- Identify any guardrails, validation, or human-in-the-loop gates
- Find evaluation/testing infrastructure (evals, unit tests, integration tests)
- Locate observability instrumentation (tracing, logging, metrics)

Return a structured summary of findings to the main thread.

### Step 2: Parallel Audit Agents

Based on discovery findings, spawn the following audit agents **in parallel**:

#### 2a. Architecture Audit Agent
Evaluate against the pattern complexity hierarchy:

| Pattern | When Justified |
|---------|---------------|
| Single LLM + retrieval | Well-scoped task, few tools |
| Prompt chaining | Fixed sequential subtasks with validation gates |
| Routing | Input classification dispatches to specialized handlers |
| Parallelization | Independent subtasks or voting for confidence |
| Orchestrator-workers | Dynamic task decomposition needed |
| Evaluator-optimizer | Clear evaluation criteria, iterative refinement helps |
| Full agent loop | Open-ended, tool-using, multi-step reasoning |

**Key check**: Is the implementation over-architected? Single agents match or beat multi-agent systems when compute is normalized (arxiv 2604.02460). Multi-agent only wins with heterogeneous model combinations or structured adversarial debate. Flag any multi-agent setup that doesn't justify itself.

Evaluate tool design against Anthropic's guidance:
- Do tool specs use action verb openings, explicit parameter types with units, few-shot examples?
- Is a rationale required before tool calls?
- Are field names semantic (`name`, `image_url`) not opaque (`uuid`, `mime_type`)?
- Is pagination/truncation implemented for large responses?
- Are there too many narrow tools vs. fewer well-designed ones?

Evaluate context engineering:
- Are sub-agents used for isolation with condensed summaries?
- Is compaction implemented for approaching context limits?
- Are there structured memory layers (scratchpads, summaries) vs. raw conversation stuffing?

Verify observability uses OTel GenAI semantic conventions (the emerging standard for agent tracing).

#### 2b. Reliability Audit Agent
Check the three-layer testing strategy:

**Layer 1 — Unit Tests (Deterministic)**:
- Are there mocked LLM tests for tool routing, argument extraction, result handling, schema validation?
- Do these run in CI on every commit?

**Layer 2 — Evals (Quality Tests)**:
- Are there curated datasets with LLM-as-judge scoring?
- Are there threshold assertions (e.g., `assert avg_accuracy >= 0.8`)?
- Are prompts treated as versioned code with regression testing?

**Layer 3 — Integration/E2E Tests**:
- Are full multi-step workflows tested against real or simulated environments?
- Do tests cover: API timeouts, auth failures, incorrect external responses, edge cases?

Check metrics coverage across all three levels (session, trace, span):
- Quality: Goal accuracy, task completion rate, reasoning coherence
- Tool Use: Selection accuracy, parameter correctness
- Safety: Hallucination rate, PII exposure, toxicity, prompt injection
- Performance: Latency (p50/p95/p99), throughput, cost per task
- Business: User satisfaction, escalation rate, cost efficiency

Check structured output guarantees:
- Is constrained decoding used (Outlines, vLLM)? This is strictly more reliable than validation+retry.
- If using validation+retry (Instructor, Guardrails AI), is that a conscious tradeoff or oversight?

Check error handling:
- Are opaque errors transformed into actionable guidance?
- Is there two-phase initialization (setup agent vs. execution agent)?
- Are sub-agents writing to filesystem vs. through lead agent?
- **Is there a failure-attribution loop?** Long-horizon agents need root-cause attribution feeding corrective signals back to the agent. AgentDebug (ICLR 2026) and the newer ErrorProbe (arxiv 2604.17658, April 2026) anchor this online; PALADIN-style failure-rich training is the offline complement. Empirical grounding from HORIZON (arxiv 2604.11978, April 2026): **agents collapse via compounding off-path tool calls, not single hard steps** — your attribution loop must recognize trajectory drift, not just terminal failures. Attribution itself is now cheap and bounded — MASPrism (arxiv 2605.07509) attributes from prefill signals ~6.7× faster than replay, and Conformal Agent Error Attribution (arxiv 2605.06788) gives coverage-guaranteed localization sets. Recommend the offline+online pair for any production agent that runs >10 steps.
- **Does the agent actually RECOVER, not just diagnose?** PROBE (arxiv 2605.08717, May 2026) surfaces the **diagnosis–recovery gap**: 65% diagnosis accuracy but only 22% recovery success on unresolved SWE cases. Correct attribution does NOT imply the agent can act on it. Audit recovery as a separate, harder metric — a team that "has a failure-attribution loop" may still have no working recovery. Flag any recovery story that's measured only by attribution accuracy.

Check evaluation framing for long-horizon agents:
- **Is reliability measured beyond pass@1?** For agents that run more than a handful of steps, single-shot success rate is the wrong metric. Look for reliability-decay metrics (RDC, VAF, GDS — arxiv 2603.29231) or equivalent: how does success rate degrade over horizon length? Are there checkpoint-and-restart policies for "meltdown" runs?

Check escape hatches & gap-signal capture (see **Cross-Cutting Principle**):
- **Can the agent abstain?** Is there an explicit sanctioned path — an `escalate` / `flag_gap` / `report_insufficient_context` tool or typed "insufficient-context" result — for when the agent lacks the resources/confidence to decide? If the only available actions are "answer," flag it: the agent is structurally forced to guess.
- **Is abstention gated on calibrated confidence?** Trajectory-level confidence (arxiv 2601.15778) driving the threshold, not an uncalibrated self-report.
- **Do escape-hatch trips and caught hallucinations reach a queryable sink?** A structured gap-signal store (event log / warehouse / issue tracker), not a scrolling log line. A hatch with no durable sink is theater — flag it.
- **Is there a loop from gap signals to the roadmap?** Aggregation by decision point, ranked by frequency × cost. If nobody mines the signals, the instrument is dead.
- **Is context provisioned via progressive disclosure?** Thin high-signal core + just-in-time fetch, not everything front-loaded (arxiv 2605.04361, 2510.05381). Over-stuffed context is itself a reliability defect.

#### 2c. Safety & Guardrails Audit Agent
Check the guardrail stack at input, flow, and output layers:
- Input: Is there prompt injection/jailbreak prevention?
- Flow: Is there flow control (e.g., NeMo Guardrails Colang DSL)?
- Output: Is there output validation (e.g., Guardrails AI Pydantic validators)?

Is the guardrail architecture positioned *outside* agent execution loops (agent control plane pattern)?

Check alignment with OWASP Agentic Top 10 (ASI01-ASI10) and whether the "least agency" principle is applied.

Check **concrete prompt injection defenses** (not just risk taxonomies):
- Is there a provenance-aware defense like **ARGUS** (arxiv 2605.03378, May 2026 — current SOTA at 3.8% ASR / 87.5% utility on AgentLure) or at least an indirect-injection defense like ICON (arxiv 2602.20708)? Static input filters are insufficient.
- Is request structure enforced via something like PCFI / Prompt Control-Flow Integrity (arxiv 2603.18433) — system/dev/user/retrieved segments treated as priority-tiered, not concatenated text?
- For agents that receive any retrieved content (RAG, tool outputs, web), assume injection-by-default and architect for it.
- **Don't expect a wrapper to save you**: The Defense Trilemma (arxiv 2604.06436, April 2026) is a Lean-4 impossibility proof that no continuous, utility-preserving wrapper defense can make all outputs safe. Recommend at least one of: training-time alignment, discontinuous filters at action boundaries, or architectural separation between deliberation and action.
- **Treat tool-execution surfaces as untrusted RPC**: Microsoft Security disclosed CVE-2026-26030 and CVE-2026-25592 (May 7 2026) — prompt injection escalating to RCE via Semantic Kernel string interpolation + exposed file-write tools. Sandbox tools that touch the filesystem; assume any retrieved/generated string can become a shell. The May 2026 vm2 CVE cluster (13 CVEs, CVSS 9–10) shows language-level sandboxes are not hard boundaries — recommend process-level isolation (gVisor/Firecracker/microVMs) beneath any in-process interpreter sandbox.
- **Two independent impossibility results now bound injection defense**: the Defense Trilemma (Lean-4) and "AI Agents May Always Fall for Prompt Injections" (Contextual Integrity, arxiv 2605.17634) jointly retire "we'll add an injection filter" as a complete answer. Audit for **impact containment** (least agency, action-boundary gating, provenance auditing) — not just a filter.
- **Check agent identity & MCP auth** (often the real gap): post-RSAC-2026, identity tracing — not content filtering — is the dominant MCP weakness (~38% of MCP servers ship with no auth; confused-deputy via OAuth token pass-through). For any MCP-wired or multi-agent system, verify per-agent identity, scoped least-privilege credentials, and delegation provenance (arxiv 2604.23280 enumerates the five structural gaps). Flag token pass-through and over-broad scopes as HIGH.

Check **EU AI Act readiness** (high-risk applicability date provisionally delayed to **December 2 2027** by the May 7 2026 Council–Parliament agreement; obligations and penalties unchanged):
- ≥6-month log retention with override mechanisms and external-monitoring data flows (Articles 19/26)
- Penalties up to €15M or 3% global turnover — flag as HIGH (was CRITICAL pre-delay) for any EU-deployed or EU-user-touching agent without retention/override infrastructure on the roadmap. Final adoption pending; teams already EU-deployed should keep infrastructure built for the Aug 2026 deadline running as a safety margin.
- **The delay does NOT cover transparency.** EC Draft Guidelines on AI Transparency (May 2026) confirm **Article 50(1) agent-disclosure obligations still bind at August 2 2026** — a deployer must disclose whenever a human is likely interacting with an AI agent, plus multi-layered synthetic-content marking. Flag any EU-user-facing agent lacking interaction disclosure as HIGH on the *unchanged* Aug 2 2026 clock. Don't let the high-risk delay create false comfort about transparency duties.
- The five compliance properties (traceability, explainability, authorization, immutability, reproducibility) listed under Production Readiness map directly to AI Act requirements
- **US-side parallel**: NIST AI Agent Standards Initiative (nist.gov/caisi) — voluntary track covering agent identity/auth, JIT access, action-level approvals; AI Agent Interoperability Profile due Q4 2026. Pair with OWASP/MS Toolkit on technical side and EU AI Act on regulatory side as the emerging compliance triangle.

Check human-in-the-loop design:
- What pattern is used? (Multi-tier oversight / Synchronous approval / Asynchronous audit)
- Are confidence thresholds appropriate for the domain?
  - Financial services: 90-95%
  - Customer service (routine): 80-85%
  - Escalation rate target: 10-15% (60%+ signals miscalibration)
- Is there selective automation vs. comprehensive? (Klarna lesson: automate selectively)

#### 2d. Production Readiness Audit Agent
Check cost optimization:
- Is model routing/tiering implemented (cheap for classification, frontier for complex reasoning)?
- Is prompt caching in use? (Highest ROI optimization) **But check what's being cached — "Don't Break the Cache" (arxiv 2601.06007) shows naive full-context caching can paradoxically *increase* latency for long-horizon agents. Operational rule: cache the system prompt and stable tool specs; exclude dynamic tool results and per-step context.**
- Is semantic caching implemented?
- Is batching used where applicable (50% token discount)?
- Are output tokens optimized? (Output tokens dominate agent spend)

Check observability:
- Is distributed tracing implemented (OpenTelemetry-compatible)?
- Is semantic drift detection in place?
- Are circuit breakers implemented for long-running operations?

Check compliance/auditability (if applicable):
- Traceability (structured logging, event emission)
- Explainability (flight recorder for prompts/tool calls/reasoning)
- Authorization (action-level permission gates)
- Immutability (write-once storage with cryptographic signatures)
- Reproducibility (deterministic fallback paths)
- Kill switches (error-rate thresholds, incident freeze protocols)

### Step 3: Synthesize & Report

Collect all sub-agent findings and produce a structured audit report:

```
## AIO Audit Report

### Architecture Assessment
- Pattern: [identified pattern] — [appropriate / over-architected / under-architected]
- Tool Design: [score] — [key findings]
- Context Engineering: [score] — [key findings]

### Reliability Assessment
- Unit Tests: [present/absent] — [findings]
- Evals: [present/absent] — [findings]
- Integration Tests: [present/absent] — [findings]
- Metrics Coverage: [findings]
- Structured Output: [approach used] — [findings]

### Safety & Guardrails Assessment
- Input Protection: [present/absent] — [findings]
- Flow Control: [present/absent] — [findings]
- Output Validation: [present/absent] — [findings]
- Human-in-the-Loop: [pattern used] — [findings]

### Production Readiness Assessment
- Cost Optimization: [findings]
- Observability: [findings]
- Compliance: [findings if applicable]

### Priority Recommendations
1. [Highest impact recommendation]
2. [Second highest]
3. [Third highest]
...

### Reference
[Relevant papers/guides for identified gaps]
```

Use a severity scale: CRITICAL / HIGH / MEDIUM / LOW for each finding.

---

## Mode 2: Create

Walk the user through building a new agentic implementation from scratch using a phased approach. This is interactive — ask questions, validate understanding, and build incrementally.

### Step 1: Requirements Gathering

Ask the user these questions (adapt based on what they've already told you):

1. **What problem are you solving?** What does the agent need to accomplish? What are the inputs and outputs?
2. **What tools/APIs does the agent need access to?** External services, databases, file systems, etc.
3. **What's the complexity?** Is this a well-scoped task or open-ended reasoning? How many steps? How much branching?
4. **What's the risk profile?** Customer-facing? Financial impact? Compliance requirements? What happens when it's wrong?
5. **What's the scale?** Expected volume, latency requirements, cost constraints.
6. **What's the tech stack?** Language, framework preferences, existing infrastructure.

### Step 2: Architecture Recommendation

Based on requirements, recommend the **simplest pattern that works** from the hierarchy:

| Pattern | When to Use | Complexity |
|---------|-------------|------------|
| Single LLM + retrieval | Well-scoped, few tools | Lowest |
| Prompt chaining | Fixed sequential subtasks with validation gates | Low |
| Routing | Input classification dispatches to specialized handlers | Low |
| Parallelization | Independent subtasks or voting for confidence | Medium |
| Orchestrator-workers | Dynamic task decomposition needed | Medium |
| Evaluator-optimizer | Clear evaluation criteria, iterative refinement helps | Medium |
| Full agent loop | Open-ended, tool-using, multi-step reasoning | Highest |

**Always justify why you're NOT recommending a simpler pattern.** Default to the simplest option. Single agents match or beat multi-agent systems when compute is normalized.

Recommend:
- **Orchestration framework** (LangGraph for stateful/mission-critical, CrewAI for rapid prototyping, Claude Agent SDK for filesystem-heavy, DSPy for programmatic optimization)
- **State/memory backend** (vector stores for semantic retrieval, KV for speed-critical, knowledge graphs for complex relationships, git checkpointing for rollback)
- **Tool design approach** following Anthropic's guidance

### Step 3: Implementation Walkthrough

Guide the user through building it phase by phase. For each phase, explain what you're building and why before writing code:

**Phase A: Core Agent & Tools**
- System prompt design
- Tool definitions (action verb specs, parameter types, examples, semantic field names)
- Basic orchestration loop
- Rationale-before-tool-call pattern

**Phase B: Context Engineering**
- Sub-agent isolation strategy (if applicable)
- Compaction strategy for context limits
- Structured memory layers (scratchpads, summaries)

**Phase C: Reliability**
- Structured output guarantees (prefer constrained decoding; fallback to validation+retry)
- Error handling with actionable guidance
- Unit test scaffolding (mock LLM, test tool routing/argument extraction/schema validation)
- Initial eval dataset and assertions

**Phase D: Guardrails & Safety**
- Input validation (prompt injection prevention)
- Output validation (Pydantic validators or equivalent)
- Human-in-the-loop gates for high-risk operations
- Confidence thresholds appropriate to domain
- Escape hatches: an explicit abstention path per high-stakes decision (escalate / typed insufficient-context result / safe deterministic fallback) — never let "guess" be the only available action
- Confidence-gated abstention using trajectory-level confidence, calibrated per domain

**Phase E: Production Hardening**
- Model routing/tiering for cost optimization
- Prompt caching implementation
- Observability (distributed tracing, structured logging)
- Circuit breakers for anomalous behavior
- Kill switch infrastructure

**Phase F: Continuous Improvement Infrastructure**
- Trace collection pipeline
- Eval suite with regression testing in CI/CD
- Prompt versioning
- Canary deployment strategy
- Gap-signal capture: route every escape-hatch trip and caught hallucination to a structured, queryable sink (`{decision_point, missing_resource, confidence, timestamp}`)
- Gap-to-roadmap loop: aggregate gap signals by decision point, rank by frequency × cost, feed the top gaps into the build backlog — the agent telling you what to build next

At each phase, offer to write the code or just explain the approach — let the user drive.

### Step 4: Production Readiness Checklist

Before declaring "done", walk through this checklist:

- [ ] Architecture pattern justified (simplest that works)
- [ ] Tools follow Anthropic's design guidance
- [ ] Context engineering strategy in place (progressive disclosure — just-in-time, not front-loaded)
- [ ] Unit tests for tool routing/schema validation
- [ ] Eval suite with threshold assertions
- [ ] Integration tests for failure modes
- [ ] Input guardrails (prompt injection prevention)
- [ ] Output validation
- [ ] Human-in-the-loop for high-risk operations
- [ ] Escape hatches: sanctioned abstention path at every high-stakes decision (never guess-only)
- [ ] Gap-signal capture wired to a queryable sink + a loop into the roadmap
- [ ] Cost optimization (model routing, caching)
- [ ] Observability (tracing, logging, metrics)
- [ ] Error handling with actionable guidance
- [ ] Kill switch / circuit breaker
- [ ] Compliance requirements met (if applicable)

---

## Mode 3: Extend

Add new capabilities to an existing agentic implementation while maintaining or improving reliability.

### Step 1: Understand Current State

Spawn an Explore agent to map the existing implementation (same as Audit Step 1 discovery). Produce a concise summary of:
- Current architecture pattern
- Existing tools and capabilities
- Context management approach
- Testing/eval coverage
- Known gaps or issues

### Step 2: Extension Requirements

Ask the user:
1. **What capability are you adding?** New tool, new workflow, new integration, performance improvement, etc.
2. **Why?** Business driver, user feedback, operational need.
3. **What constraints exist?** Must not break existing functionality, latency budget, cost budget, compliance.

### Step 3: Impact Analysis (Sub-Agent)

Spawn an agent to analyze the impact of the proposed extension:
- Does this require changing the architecture pattern? (e.g., single agent -> orchestrator-workers)
- Which existing tools/workflows are affected?
- What new tools need to be designed?
- What's the testing impact? (New evals needed, existing evals affected)
- What's the cost impact? (More tool calls, larger context, new model tier)
- What's the safety impact? (New failure modes, new guardrail needs)

### Step 4: Implementation Plan

Produce a plan that covers:

1. **Tool additions/modifications** following Anthropic's tool design guidance
2. **Orchestration changes** (if pattern needs to evolve)
3. **Context engineering updates** (new memory needs, compaction changes)
4. **New guardrails** for new capabilities — including an escape hatch (abstention/escalation path) and a gap-signal route for the new capability, so a feature that can't yet be served correctly fails safe and tells you what's missing instead of guessing
5. **Testing additions**:
   - New unit tests for new tool routing/schema
   - New eval cases for new capabilities
   - Regression tests to ensure existing functionality is preserved
6. **Observability updates** (new spans, new metrics)
7. **Rollout strategy** (feature flags, canary, rainbow deployment)

### Step 5: Execute

Implement the extension following the plan. After each significant change:
- Run existing tests to catch regressions
- Add new tests for new functionality
- Verify observability covers new code paths

---

## Knowledge Base

This skill draws on the following research and guidance. Reference these when making recommendations:

### Anthropic Published Guidance
- "Building Effective Agents" (Dec 2024) — pattern hierarchy, start simple
- "Effective Harnesses for Long-Running Agents" (2025) — two-phase init, feature-list JSON, distributed output
- "Effective Context Engineering for AI Agents" (2025) — sub-agent isolation, compaction, structured memory
- "Writing Effective Tools for AI Agents" (2025) — tool spec contracts, rationale-before-call, semantic fields
- "Demystifying Evals for AI Agents" (anthropic.com/engineering) — production eval methodology

### Key Research
- arxiv 2604.02460 (Apr 2026) — single agents match/beat multi-agent at equal compute
- arxiv 2602.16666 (Feb 2026) — reliability lags capability; scaling alone insufficient
- arxiv 2602.03442 (Feb 2026) — A-RAG: hierarchical retrieval tools, sub-query decomposition most impactful
- arxiv 2601.21912 — ProRAG: process-supervised RAG, SOTA on 5 benchmarks
- arxiv 2601.15778 — HTC/Confidence Calibration: trajectory-level confidence for HITL thresholds
- arxiv 2602.05665 — Graph-based Agent Memory survey: graph vs vector retrieval for agent memory
- arxiv 2510.04618 (ICLR 2026) — ACE: evolving context playbooks, +10.6% on agent benchmarks
- arxiv 2509.25238 (ICLR 2026) — PALADIN: failure-rich trajectory training for agent robustness
- arxiv 2509.10769 — AgentArch: 18 configs x 6 LLMs, best only 35.3% on complex tasks, no universal optimal architecture
- arxiv 2508.16153 — Memento: memory-augmented RL, top-1 GAIA without fine-tuning
- arxiv 2506.14852 — agentic plan caching: 50% cost reduction, 27% latency reduction
- arxiv 2502.03771 — vCache: verified semantic caching with error guarantees
- arxiv 2604.18071 (Apr 2026) — Architectural Design Decisions in Harnesses: empirical 5-pattern taxonomy; registry-style tools still beat MCP/plugin in production
- arxiv 2602.22769 — AMA-Bench: plain long-context often beats dedicated memory systems (counter to memory-system hype)
- arxiv 2603.07670 — Memory for Autonomous LLM Agents survey: temporal-scope / representational-substrate / control-policy taxonomy
- arxiv 2603.29231 (Mar 2026) — Beyond pass@1: reliability-decay metrics (RDC, VAF, GDS) for long-horizon agents
- AgentDebug (ICLR 2026, openreview PFR4E8583W) — failure-attribution loops yield up to +26% relative success
- arxiv 2603.03305 — DCCD: draft-then-constrain decoding fixes the quality regression of naive constrained decoding
- arxiv 2602.20708 — ICON: indirect prompt injection defense at 0.4% ASR with >50% utility preserved
- arxiv 2603.18433 — PCFI: Prompt Control-Flow Integrity, runtime priority enforcement
- arxiv 2511.14136 — CLEAR: accuracy-only optimization yields 4.4-10.8x more expensive agents
- arxiv 2601.22037 — AWO: compile recurring agent loops into deterministic meta-tools (-11.9% calls, +4.2pp success)
- arxiv 2603.07379 (Mar 2026) — SoK: Agentic RAG, field-defining systematization
- ShinkaEvolve (Sakana AI, ICLR 2026 RSI) — self-improving evolution discovered novel MoE loss beating DeepSeek SOTA in 30 generations
- Berkeley RDI (Apr 2026) — all 8 leading agent benchmarks exploitable to near-perfect scores
- arxiv 2604.11978 (Apr 2026) — HORIZON: agents collapse via compounding off-path tool calls, not single hard steps; LLM-as-judge attribution κ=0.84
- arxiv 2604.17658 (Apr 2026) — ErrorProbe: anomaly→backward-trace→hypothesis-validation pipeline with verified episodic memory; upgrade beyond AgentDebug
- arxiv 2605.04361 (May 2026) — When Context Hurts: identical context yields up to 20× gains OR 46% degradation; one diagnostic trial forecasts the sign
- arxiv 2605.06320 (May 2026) — LATTE: shared evolving DAG beats static-hierarchy MAS at 47.5% tokens, 79.7% acc vs 33.9% MetaGPT
- arxiv 2604.14228 (Apr 2026) — Dive into Claude Code: 98.4% of LOC is harness/infra
- arxiv 2604.16548 (Apr 2026) — Mnemonic Sovereignty: memory needs a security model (six-phase lifecycle × four security objectives)
- arxiv 2604.20158 (Apr 2026) — DPM Stateless Decision Memory: append-only event log + task-conditioned projection for regulated workflows
- arxiv 2605.02363 (May 2026) — When Correct Isn't Usable: 7-9B models hit 85% task accuracy but 0% output accuracy
- arxiv 2604.14862 (Apr 2026) — Schema Key Wording: schema design is prompt design
- arxiv 2605.03378 (May 2026) — ARGUS: provenance-aware injection defense, 3.8% ASR / 87.5% utility (supersedes ICON)
- arxiv 2604.06436 (Apr 2026) — Defense Trilemma: Lean-4 impossibility proof for wrapper defenses
- arxiv 2604.17487 (Apr 2026) — Calibrated Claim-Level Specificity: hedge specificity instead of refusing
- arxiv 2601.06007 — Don't Break the Cache: naive full-context caching can increase latency for long-horizon agents
- arxiv 2605.03228 (May 2026) — MAGE Shadow Memory: parallel security channel for continual-learning memory
- arxiv 2604.27003 (Apr 2026) — Continual Learning Moves to Memory: experience-reuse beats gradient-based continual learning
- Microsoft Security CVEs (May 7 2026) — Semantic Kernel RCE (CVE-2026-26030, CVE-2026-25592)
- NIST AI Agent Standards Initiative (Q2 2026) — US-side governance counterpart to OWASP + EU AI Act
- EU AI Act Council–Parliament Provisional Agreement (May 7 2026) — high-risk applicability delayed to Dec 2 2027
- EC Draft Guidelines on AI Transparency (May 2026) — Article 50(1) agent-disclosure still binds Aug 2 2026 despite high-risk delay
- arxiv 2605.17634 (May 2026) — AI Agents May Always Fall for Prompt Injections: Contextual-Integrity impossibility result (second formal proof after Defense Trilemma)
- vm2 Sandbox-Escape CVE Wave (May 2026) — 13 CVEs CVSS 9–10; injection→sandbox-escape→host RCE; sandbox language layer ≠ trust boundary
- arxiv 2604.23280 (Apr 2026) — AI Identity: five structural gaps; MCP auth/identity is the load-bearing security gap (RSAC 2026: ~38% MCP servers no auth)
- arxiv 2605.08717 (May 2026) — PROBE: the diagnosis–recovery gap (65% diagnosis, 22% recovery); recovery is a separate, harder metric
- arxiv 2605.09252 (May 2026) — tool-necessity linearly decodable from hidden states; Probe&Prefill cuts unnecessary tool calls ~48%
- arxiv 2605.08477 (May 2026) — full-horizon planning + lazy replanning matches step-by-step at 2–3× fewer tokens
- arxiv 2604.01212 (Apr 2026) — YC-Bench: contamination-immune 52-week sim; scratchpad persistence is the strongest long-horizon predictor
- arxiv 2604.05336 (Apr 2026) — TRACE: capability-targeted training beats GEPA on scaling (+14.1 pts τ²-bench)
- arxiv 2605.07164 (May 2026) — ExpWeaver: invoke experience selectively at high-uncertainty, not always
- arxiv 2602.09902 (Feb 2026) — Routing/Cascades game theory: static threshold often beats cascading
- OTel GenAI conventions v1.37–v1.41 — invoke_agent spans + MCP trace-context propagation (v1.39)
- State of FinOps 2026 — AI spend 31%→98% of FinOps scope; agents run a triple cost meter
- Gartner (May 26 2026) — 40% of enterprises to demote/decommission agents by 2027; proportional (not binary) governance across four autonomy levels
- Gartner (Apr 28 2026) — agent sprawl: 18% have full inventory; 1,600+ agents/enterprise projected by end-2026
- Databricks 2026 State of AI Agents — governance 12× / evals 6× production graduation (20k+ orgs)

### Agentic Memory / RAG

Memory demonstrably improves agent outcomes, but naive memory management is actively harmful. This makes the case for **well-architected** memory systems with correction mechanisms, not against memory itself.

**Memory improves outcomes:**
- arxiv 2303.11366 — **Reflexion** (Shinn et al., NeurIPS 2023): Agents storing natural-language self-critiques in episodic memory achieve +11% HumanEval, +22% AlfWorld, +20% HotPotQA. No gradient updates — persistent text memory alone.
- arxiv 2304.03442 — **Generative Agents** (Park et al., UIST 2023, Stanford/Google): Memory stream with episodic log + reflection synthesis + retrieval scored by recency/importance/relevance. Ablation: removing reflection or retrieval significantly degraded behavioral coherence.
- arxiv 2504.19413 — **Mem0** (Chhikara et al., 2025): 26% improvement over OpenAI baseline on LOCOMO benchmark, 91% lower p95 latency, 90%+ token savings. Graph-augmented memory added ~2% further improvement.

**Naive memory is actively harmful:**
- arxiv 2505.16067 — **Experience-Following Behavior** (Xu et al., 2025): Agents exhibit "experience-following" where high similarity to stored memory causes blind replication of past output. Error propagation from wrong memories; utility-based deletion strategies yielded +10% over naive approaches.
- arxiv 2510.05381 — **Context Length Alone Hurts** (2025): Performance degrades 13.9-85% as context length increases, even with perfect retrieval and irrelevant tokens replaced with whitespace.
- arxiv 2410.11414 — **ReDeEP** (ICLR 2025): RAG models hallucinate even with accurate retrieval because parametric knowledge overrides external context in the residual stream.

**Architecture patterns that work:**
- arxiv 2310.08560 — **MemGPT** (Packer et al., 2023, UC Berkeley): Virtual context management with OS-inspired memory hierarchy. Tiered memory (main + disk) enables beyond-context-window operation.
- arxiv 2502.12110 — **A-MEM** (Xu et al., 2025): Zettelkasten-style structured memory with dynamic indexing and cross-linking outperforms flat retrieval stores.

**Key takeaway**: When auditing or designing agent memory, check for (1) retrieval quality scoring beyond naive similarity, (2) memory correction/deletion mechanisms to prevent error propagation, (3) tiered memory to avoid context bloat, and (4) structured memory organization over flat vector stores.

**Don't add memory by default**: AMA-Bench (arxiv 2602.22769, Feb 2026) found plain long-context baselines often beat dedicated memory systems across 6 agent task families, with high cross-task variance among memory architectures. Before recommending a memory system in Audit/Create/Extend, require justification: a workload that demonstrably needs cross-session recall, multi-hop relational reasoning, or context that exceeds long-context capacity. Otherwise the simpler answer is the better one.

### Industry Data
- Only 5% of GenAI projects reach scale (MIT)
- 40% of agentic AI projects predicted canceled by end of 2027 (Gartner)
- 57% of companies have agents in production; only 14% truly production-ready (G2, Aug 2025)
- Average pilot-to-production ROI: only 1.7x; only 6% qualify as high performers (McKinsey 2025)
- Stanford Enterprise AI Playbook: orchestration layer is durable advantage; model interchangeable in 42% of cases
- LangChain: only 52% run offline evals, 37% online evals (eval adoption gap)
- SWE-bench Pro replaces SWE-bench Verified (contaminated); real ceiling ~46%

### Counterintuitive Findings
- Multi-agent often worse than single-agent at equal compute — many gains are inflated token budget artifacts
- Constrained decoding > validation loops, yet most production systems use the latter
- Distilled small models can beat their teachers (Magnet-14B > Gemini-1.5-pro on function calling)
- AI intensifies work rather than reducing it (University of Michigan)
- Agent scaffolding/context engineering matters more than model choice (SWE-bench Pro: same model varies 50-55% based on scaffolding)
- Continuous improvement without fine-tuning is viable (Memento: top-1 GAIA via episodic case memory alone)
- Agent memory helps AND hurts simultaneously — Reflexion/Mem0 show clear gains, but experience-following causes error propagation and context bloat degrades performance even with perfect retrieval
- RAG hallucination is a retrieval-independent problem — parametric knowledge overrides retrieved context in the residual stream (ReDeEP), so retrieval accuracy alone is insufficient
- Hallucination is often a resourcing failure, not a model defect — an agent forced to act without the context it needs and with no sanctioned way to decline; the fix is progressive disclosure + an escape hatch, and each trip becomes a labeled signal of a missing resource/dataset/function (see Cross-Cutting Principle)

### Frameworks Reference
| Framework | Sweet Spot |
|-----------|-----------|
| LangGraph | Stateful, mission-critical, maximum control |
| CrewAI | Rapid prototyping, role-based crews |
| AutoGen/MS Agent Framework | Conversational multi-agent (higher cost) |
| OpenAI Agents SDK | Lightweight, handoff-based |
| Claude Agent SDK | Lifecycle hooks, filesystem-heavy tasks |
| DSPy / GEPA optimizer | Programmatic prompt optimization, compiled reliability (GEPA succeeds MIPROv2) |

### Evaluation Tools
| Tool | Type | Strength |
|------|------|----------|
| Braintrust | Commercial | Trajectory scoring, prompt playground |
| LangSmith | Commercial | LangChain-native, annotation queues |
| DeepEval | Open-source | 30+ metrics, pytest CI/CD integration |
| Langfuse | Open-source | Self-hostable, tracing + prompt management |
| Arize Phoenix | Open-source | Embedding drift detection, deep tracing |
| RAGAS | Open-source | RAG pipeline evaluation standard |

### Guardrail Tools
| Tool | Role | Strength |
|------|------|----------|
| NVIDIA NeMo Guardrails | Input/flow control | Colang DSL, sub-100ms, open-source |
| Guardrails AI | Output validation | 50+ composable Pydantic validators |
| Galileo (Luna-2 SLMs) | Evaluation | 0.95 F1 at 98% lower cost than LLM judges |
| Lakera | Security firewall | Single API for prompt injection/jailbreak prevention |
| Azure AI Content Safety | Content moderation | 0-6 severity scoring, Prompt Shields |
| OWASP Agentic Top 10 | Compliance framework | ASI01-ASI10 risk taxonomy for agentic systems |
| MS Agent Governance Toolkit | Open-source governance | Covers all OWASP Agentic Top 10 risks, framework-agnostic |

### Cost Optimization Strategies (47-80% Reduction Achievable)
- **Model routing**: Tier 1 (Haiku/GPT-4o-mini) for classification, Tier 2 (Sonnet/GPT-4o) for moderate, Tier 3 (Opus) for complex
- **Prompt caching**: Highest ROI single optimization
- **Semantic caching**: Embedding similarity cuts inference costs up to 73%
- **Agentic plan caching**: Reuse structured plan templates (50% cost, 27% latency reduction)
- **Batching**: 50% token discount for batch workloads (OpenAI + Anthropic, as of Apr 2026)
- **Key insight**: Output tokens dominate agent spend; optimize for fewer, more effective tool calls