---
name: agent-review-panel
description: >
  Orchestrate a multi-agent adversarial review panel where several Claude Code
  subagents with different perspectives independently review a piece of work,
  debate with each other, reach (or fail to reach) consensus, then a supreme
  judge renders the final verdict. Use this skill whenever the user asks for a
  "review panel", "multi-agent review", "adversarial review", "have agents
  debate this", "review with multiple perspectives", "panel review", "get
  different opinions on this code/plan/doc", or invokes /agent-review-panel.
  Also trigger when a user says things like "I want thorough feedback from
  different angles", "stress-test this design", "red team this", "get a second
  (third, fourth) opinion", "fresh eyes on this", "multiple reviewers",
  "devil's advocate perspective", "every angle covered", "I want agents to
  argue pros and cons", "independently evaluate", "critical look from security
  and performance angles", "high-stakes — cover every angle", or "debate the
  pros and cons". This skill is specifically about launching multiple reviewer
  agents with distinct personas who discuss and debate — NOT for single-reviewer
  code review, quick sanity checks, bug fixes, deployment tasks, addressing
  existing PR comments, skill improvement, peer review, code explanation, or
  writing tests. Supports "deep research mode" when user says "deep review",
  "thorough review", "research review", or passes "deep" to
  /agent-review-panel — adds web research for domain best practices before
  launching reviewers. Supports "multi-run union mode" when user says
  "multi-run review", "run N times and merge", "run twice", "run 3 times",
  or "maximum coverage review" — repeats the panel with rotated persona sets
  and merges results with stability scoring. Supports "data flow trace
  tiers" (Standard/Thorough/Exhaustive) when user says "thorough review",
  "exhaustive review", "trace everything", or "catch all bugs" — dedicates
  a pre-review phase to tracing data through critical paths and flagging
  composition/seam bugs.
---

# Agent Review Panel v3.5.0

A multi-agent adversarial review system based on nine research foundations:
ChatEval (ICLR 2024), AutoGen, Du et al. (ICML 2024), MachineSoM (ACL 2024),
DebateLLM, DMAD (ICLR 2025), "Talk Isn't Always Cheap" (ICML 2025),
CONSENSAGENT (ACL 2025), Trust or Escalate (ICLR 2025 Oral).

## When NOT to Use This Skill

Do NOT trigger for these requests — they need single-agent handling or other skills:
- Single code review ("review this function for bugs")
- Quick sanity checks ("just a quick look before I push")
- Bug fixes ("fix the type errors", "fix the failing test")
- Peer review without multi-perspective signal ("peer review this doc")
- Code explanation ("what does this code do?")
- Deployment tasks ("deploy to staging")
- Addressing existing feedback ("address the PR comments")
- Skill improvement ("make this skill better") → use schliff
- Writing tests, READMEs, or documentation
- Asking for a single opinion ("what do you think?", "is this any good?")

The key signal is **multiple independent perspectives** — if the user wants one opinion, don't launch a panel.

## Input

This skill takes as input one or more of: file paths to review, inline code/text
in the conversation, a git diff or PR reference, or a plan/design document.
It expects the user to specify (or let it auto-detect) what to review.

## Dependencies

This skill depends on the Agent tool to launch parallel subagent reviewers and
requires bash for context gathering (grep, file reads). All agents MUST use
`model: "opus"`. This includes VoltAgent specialist agents launched via
`subagent_type` — always pass `model: "opus"` explicitly alongside
`subagent_type` to override the agent's default model. Omitting it causes
the launched agent to fall through to its own frontmatter-declared model
(which may be sonnet or haiku), introducing cross-run reasoning variance.
Knowledge mining reads from memory paths if they exist; if not available,
it degrades gracefully — no hard dependency.

**HTML report CDN dependencies (Phase 15.3 output file only):** The generated
`review_panel_report.html` loads Tailwind CSS, Chart.js, and — new in v2.15 —
Prism.js from CDN for syntax highlighting in the Code Evidence sections of
expandable issue cards. If the CDNs are unreachable, the HTML degrades
gracefully: layout and text remain readable, charts show a placeholder, code
blocks render as unstyled monospace.

**Optional enhancement:** When VoltAgent specialist agents are installed, the
panel can use them instead of generic persona-prompted agents for stronger
domain-specific reviews. See "VoltAgent Integration" section below.

This skill is scoped to multi-perspective adversarial review. For skill
improvement requests, use schliff instead. For post-review plan updates,
use plan-review-integrator. Supported versions: Claude Code v1.0+.

## Examples

**Example 1: Code review panel**
Input: "Do a review panel on src/auth/middleware.ts — I want multiple perspectives before merging"
Output: Classifies as pure code → selects Correctness Hawk + Architecture Critic + Security Auditor + Devil's Advocate → gathers context → 4 parallel reviewers → 2 debate rounds → completeness audit → claim verification → supreme judge → writes `review_panel_report.md`

**Example 2: Mixed content with deep research**
Input: "Deep review of our migration plan — it includes SQL and Terraform"
Output: Classifies as mixed → adds Code Quality Auditor + Data Quality Auditor (SQL signal) + Reliability/SRE (infra signal) → runs web research for best practices → full panel → report with epistemic labels

## Process Overview

```
Phase 1:    Setup                     → Identify work, pick personas, define criteria
Phase 2:    Data Flow Trace           → Trace critical path(s), document schemas [code only] (v2.14)
Phase 3:    Independent Review        → All reviewers evaluate in parallel (no cross-talk)
Phase 4:    Private Reflection        → Each reviewer re-reads source, rates own confidence
Phase 5:    Debate (rounds 1–3)       → Reviewers engage with each other + find new issues
Phase 6:    Round Summarization       → Distill resolved/unresolved points between rounds
Phase 7:    Blind Final               → Each reviewer gives final score independently
Phase 8:    Completeness Audit        → Dedicated agent scans for what the panel missed
Phase 9:    Verify Commands           → Run up to 5 reviewer verification commands (advisory)
Phase 10:   Claim Verification        → Verify all line-number citations against source
Phase 11:   Severity Verification     → Read actual code for every P0/P1, downgrade if overstated + web-verify external domain claims (v2.16.3)
Phase 12:   Verification Tier Assign  → Confidence draft (12a) + judge-advised refinement (12b)
Phase 13:   Targeted Verification     → Persona-matched agents dispatched per dispute point
Phase 14:   Supreme Judge             → Opus arbitrates everything including verification round
Phase 14.5: Post-Judge Verification   → Re-verify judge-introduced P0/P1 against ground truth (v3.2.0)
Phase 15:   Output Generation         → (parent) Three output files (all sequential: 15.1 → 15.2 → 15.3)
  Phase 15.1: Primary Markdown Report → Structured markdown summary (review_panel_report.md)
  Phase 15.2: Process History         → Full director's-cut log (review_panel_process.md)
  Phase 15.3: HTML Report             → Interactive dashboard (review_panel_report.html)

[Multi-Run mode (--runs N > 1): repeat Phases 2–15 with rotated personas, then:]
Phase 16:   Merge                     → Deduplicate, score stability, produce merged report (v2.14)
```

---

## Phase 1: Setup

### Identify the Work

The user provides: file paths, inline content, git diff/PR, or a plan/design doc.
Collect full content, then run Context Gathering (below).

**Classify content type** (matters for persona selection):
- **Pure code** — only code files
- **Pure plan/design** — architecture docs, proposals, RFCs
- **Mixed** — plans with code snippets, SQL, or config
- **Documentation** — READMEs, guides, API docs

### Review Mode Detection (v2.8)

Auto-detect review mode from content type. No user toggle.

| Content Type | Review Mode | Behavior |
|---|---|---|
| Pure code | **Precise** | Every finding MUST cite a specific file, line number, or code snippet. Findings without concrete evidence are demoted to [UNVERIFIED]. |
| Pure plan/design | **Exhaustive** | Broader risk identification allowed. Findings may reference design sections or architectural patterns without line-number evidence. |
| Mixed | **Precise** for code, **Exhaustive** for prose | Reviewers label each finding with its mode. Code findings without line citations are demoted. |
| Documentation | **Exhaustive** | Same as plan/design. |

The detected mode is injected into Phase 3 reviewer prompts and the judge prompt.
Report header states the detected mode.

### Detect Content Signals

Scan work for technology-specific signals (case-insensitive, 3+ keyword threshold).
See `references/signals-and-checklists.md` for the full detection table and domain
checklists. Signal detection only fires when auto-selecting personas.
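
A minimal sketch of the 3+ keyword threshold, assuming a hypothetical `count_signal_hits` helper and an illustrative SQL keyword group. The real detection table lives in `references/signals-and-checklists.md` and is not reproduced here, so treat the keywords as placeholders.

```shell
# Hypothetical helper: count how many DISTINCT keywords (passed as arguments)
# appear in the text on stdin. Matching is case-insensitive and literal.
# The body runs in a subshell, so its variables do not leak to the caller.
count_signal_hits() (
  text=$(cat)
  hits=0
  for kw in "$@"; do
    if printf '%s\n' "$text" | grep -qiF -- "$kw"; then
      hits=$((hits + 1))
    fi
  done
  printf '%s\n' "$hits"
)

# Illustrative input; the keyword group below is an assumption, not the
# skill's real SQL signal list.
sample='SELECT id FROM users JOIN orders ON users.id = orders.user_id GROUP BY id'
hits=$(printf '%s\n' "$sample" |
  count_signal_hits "select" "join" "group by" "insert into" "create table")

# The signal fires only at 3 or more distinct keyword matches.
if [ "$hits" -ge 3 ]; then
  echo "SQL signal fired ($hits keywords)"
else
  echo "SQL signal not fired ($hits keywords)"
fi
```

Counting distinct keywords rather than total occurrences keeps a file that repeats one word many times from tripping the threshold. Literal substring matching is a simplification: `join` would also match `joined`.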

### Context Gathering

**Run these steps before launching reviewers for file-path reviews. Skipping
them is the #1 cause of incorrect [CRITICAL] recommendations.**

1. **Sibling Directory Scan** — From reviewed files' parent, scan for `docs/`,
   `README*`, `CLAUDE.md`, `config.py`, `package.json`, etc. Read first 50 lines
   of each. If files are nested, scan both immediate parent and project root.

2. **Reference Tracing** — Scan for imports, config references, cross-file
   references in comments, SQL table references, file path strings.

3. **Safety Mechanism Discovery** — Grep reviewed code + imports for: `_valid`,
   `_flag`, `_guard`, `_check`, `_mask`, `<= target_date`, `BETWEEN`, `fillna`,
   `COALESCE`, `try/except`, `DELETE FROM`, `MERGE`, `WRITE_TRUNCATE`,
   `upsert`, `idempoten`, `--dry-run`, `duplicate`, `assertion`. Note what each
   guards against. **Critical:** When a finding claims "X is missing", verify
   the claim by grepping the actual code — existing safety mechanisms are the
   #1 thing panels miss (v2.6 benchmark: panel claimed "non-idempotent writes"
   but DELETE-then-INSERT with duplicate validation already existed).

3b. **Temporal Scope Verification** — When the work contains ANY temporal
   claims (e.g., "excludes Christmas", "masks winter period", "filters out
   weekends", "pre-period starts after X"), verify that the exclusion applies
   to ALL instances across the full date range, not just the first/most-obvious
   one. Common failure: "excludes Christmas" via a Jan 6 start date only
   excludes the first Christmas — a second Christmas 12 months later may still
   be in the training window. This class of bug evaded 3 rounds of adversarial
   review (12 reviewers) in a real engagement — the user caught it 6 days later.
   **Inject into reviewer prompts:** "For any temporal exclusion claim, count
   how many instances of the excluded event exist in the date range and verify
   ALL are excluded, not just one."

3c. **Codebase State Check (v2.10)** — When reviewing code that lives in a git
   repository, determine the exact codebase state being reviewed. This prevents
   the panel from flagging code as "missing" when it exists on main but not in
   the reviewed branch/worktree.

   **Why this matters:** In a real engagement, a 4-reviewer panel + completeness
   auditor unanimously flagged a class as "non-existent" — but it existed on
   `main` (merged via a PR after the worktree branched). All reviewers checked
   the worktree files, none checked `main`. The finding was confidently wrong.

   **Steps:**
   ```bash
   # 1. Confirm we're inside a git working tree, then record its root and branch
   git rev-parse --is-inside-work-tree 2>/dev/null && \
   WORKTREE=$(git rev-parse --show-toplevel) && \
   BRANCH=$(git rev-parse --abbrev-ref HEAD)

   # 2. Find the default branch (main or master). The fallback is applied on
   #    its own line: a pipeline's exit status is sed's, which is 0 even when
   #    git symbolic-ref fails, so `|| DEFAULT_BRANCH="main"` would never run.
   DEFAULT_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@')
   DEFAULT_BRANCH=${DEFAULT_BRANCH:-main}

   # 3. Find the branch point and count divergence
   MERGE_BASE=$(git merge-base HEAD "origin/$DEFAULT_BRANCH" 2>/dev/null)
   COMMITS_BEHIND=$(git rev-list --count "HEAD..origin/$DEFAULT_BRANCH" 2>/dev/null || echo "unknown")

   # 4. List commits merged to the default branch since the branch point
   git log --oneline "$MERGE_BASE..origin/$DEFAULT_BRANCH" 2>/dev/null | head -20
   ```

   **If `COMMITS_BEHIND` > 0:** Include a `[STALE_BRANCH]` warning in the context
   brief listing what was merged to main since the branch point. Inject into ALL
   reviewer prompts: "The code under review is {N} commits behind {default_branch}.
   These changes were merged since: {list}. Before claiming code or features are
   'missing', check whether they exist on {default_branch} via
   `git show {default_branch}:{filepath}`."

   **If in a git worktree specifically** (detected via `git worktree list`):
   add extra emphasis — worktrees are commonly used for isolated development and
   are especially prone to divergence from main.

   **Record in Context Brief:** Add a "Codebase State" section with: branch name,
   commits behind main, key PRs merged since branch point, worktree status.

4. **Knowledge Mining (tiered loading)** — Mine local knowledge using a 3-tier
   approach to minimize token waste while maximizing relevant context:

   **L0 — Index scan (~100 tokens each).** Read only index lines and frontmatter
   `description` fields. Filter for relevance to the work under review.
   - `MEMORY.md` — read index lines only (each ~150 chars). Match keywords from
     the work's content type, domain, and technology signals.
   - `~/.claude/skills/*/SKILL.md` — read only the YAML `description:` field
     (glob + grep for `^description:`). Match against detected content signals.
   - `CLAUDE.md` — always load (small, high-authority).

   **L1 — Summary scan (~500 tokens each).** For L0-matched items only, read
   frontmatter + first paragraph to confirm relevance.
   - Memory files (`feedback_*.md`, `project_*.md`) — read first 20 lines.
     `feedback_*.md` files matching the review domain get automatic L2 promotion
     (past corrections are HIGHEST PRIORITY).
   - Skill files — read the `description:` + `## When NOT to Use` section.
   - `lessons.md` — scan for lines matching review domain keywords.

   **L2 — Full content (no limit).** Only for confirmed-relevant items from L1.
   - Read the complete file for items that passed L1 relevance check.
   - Typical yield: 3-8 files at L2 out of 50+ candidates at L0.

   **Deduplication:** if the same insight appears in multiple sources, include
   only the most specific version (project > global > skill).

5. **Web Research** (deep research mode only) — Triggers when user requests
   "deep review", or when 5+ keywords match a signal group that has no built-in checklist.
   Cap at 2 web searches. Tag findings with [WEB]. If the built-in domain
   checklist already covers the signal group, web research is skipped unless
   explicitly requested.

6. **Context Brief** — Compile into structured brief with sections: Codebase
   State, System Documentation Found, Referenced Files, Safety Mechanisms,
   Knowledge Mining Results, Web Research Findings, Domain Checklist, Context Gaps.

7. **User Confirmation** — If significant context gaps exist or deep research
   is available but not requested, ask before proceeding.
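
The temporal scope check in step 3b can be made mechanical. The sketch below assumes a hypothetical `count_event_instances` helper and an illustrative 18-month training window. It only handles single-day events given as `MM-DD`; multi-day exclusions would need a range comparison.

```shell
# Hypothetical helper: count how many times a recurring MM-DD event falls
# inside an inclusive ISO date range (YYYY-MM-DD). Dates are compared as
# YYYYMMDD integers. Runs in a subshell so variables stay local.
count_event_instances() (
  start=$1 end=$2 event=$3
  start_n=$(printf '%s' "$start" | tr -d '-')
  end_n=$(printf '%s' "$end" | tr -d '-')
  year=${start%%-*}
  last_year=${end%%-*}
  count=0
  while [ "$year" -le "$last_year" ]; do
    candidate=$(printf '%s-%s' "$year" "$event" | tr -d '-')
    if [ "$candidate" -ge "$start_n" ] && [ "$candidate" -le "$end_n" ]; then
      count=$((count + 1))
    fi
    year=$((year + 1))
  done
  printf '%s\n' "$count"
)

# Illustrative window (an assumption): starts Jan 6 and runs 18 months.
# A non-zero result means the "excludes Christmas" claim does not hold.
count_event_instances 2023-01-06 2024-06-30 12-25
```

For this window the helper prints `1`: the Jan 6 start skips the Christmas just before the window, but Dec 25 of the same year still falls inside it.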

### Select Personas with Agreement Intensity

If user specifies personas, use those. Otherwise select 4 from content-type sets:

**For code/implementation:**
1. **Correctness Hawk** (30%) — Bugs, logic errors, edge cases
2. **Architecture Critic** (50%) — Design patterns, coupling, extensibility
3. **Security Auditor** (30%) — Vulnerabilities, injection, auth gaps
4. **Devil's Advocate** (20%) — Challenges everything, proposes alternatives

**For plans/designs (pure — no code):**
1. **Feasibility Analyst** (60%) — Technical feasibility, timeline realism
2. **Stakeholder Advocate** (50%) — Business perspective, ROI
3. **Risk Assessor** (30%) — Failure modes, dependencies
4. **Devil's Advocate** (20%)

**For mixed content (plans WITH code/SQL/config) — CRITICAL:**
1. **Feasibility Analyst** (60%)
2. **Code Quality Auditor** (40%) — Line-by-line scrutiny of every snippet
3. **Risk Assessor** (30%)
4. **Devil's Advocate** (20%)

**For documentation:**
1. **Clarity Editor** (60%)
2. **Technical Accuracy Reviewer** (30%)
3. **Completeness Checker** (40%)
4. **Devil's Advocate** (20%)

After base selection, auto-add signal-detected personas (up to 6 total).
Replace Devil's Advocate first if at cap (keep ≥1 DA if panel ≥4).

**CRITICAL: If work contains ANY code/SQL/config snippets, always include
Code Quality Auditor. Omitting it was the #1 cause of missed details in v1.**

### Reasoning Strategy Assignment (DMAD, ICLR 2025)

| Persona Type | Strategy | Injection |
|---|---|---|
| Correctness Hawk / Code Quality Auditor | Systematic enumeration | "Enumerate every code path, constant, edge case." |
| Architecture Critic / Feasibility Analyst | Backward reasoning | "Start from desired outcome, trace backward." |
| Security Auditor / Risk Assessor | Adversarial simulation | "Imagine you are an attacker. How would you break this?" |
| Devil's Advocate | Analogical reasoning | "Compare to known failure patterns from similar projects." |
| Stakeholder Advocate / Clarity Editor | First-principles | "Question every assumption from scratch." |
| Auto-added specialists | Checklist verification | "Use your domain checklist. Verify each item." |

### Default Evaluation Criteria

Correctness, Completeness, Quality, Edge Cases (override if user specifies).

### VoltAgent Integration (v2.9)

VoltAgent specialist agents (130+ across 10 families) have built-in domain
expertise via their system prompts, making them stronger reviewers than generic
persona-prompted agents. When available, the panel should **upgrade** personas
to VoltAgent agents. Full catalog: github.com/VoltAgent/awesome-claude-code-subagents
(a point-in-time snapshot of every agent the mapping tables below may reference
is vendored at `references/voltagent-catalog.json`; run `scripts/refresh-voltagent-catalog.sh`
to regenerate it and `scripts/voltagent-catalog-check.sh` to detect drift).

**Step 1: Check availability.** During Phase 1 setup, check whether VoltAgent
agents are available by scanning the system-reminder agent list for any
`voltagent-*` prefixed agents. Note which families are installed.

**Step 2: Map personas to specialists.** Use this mapping table.
*(When launching any persona via `subagent_type`, ALWAYS pass `model: "opus"`. v2.14.)*

#### Core Persona Mapping (review panel built-in personas)

| Persona | Primary VoltAgent | Alt VoltAgent | Fallback |
|---|---|---|---|
| Correctness Hawk | `voltagent-qa-sec:code-reviewer` | `voltagent-qa-sec:debugger` | Generic + prompt |
| Architecture Critic | `voltagent-qa-sec:architect-reviewer` | `voltagent-infra:cloud-architect` | Generic + prompt |
| Security Auditor | `voltagent-qa-sec:security-auditor` | `voltagent-qa-sec:penetration-tester` | Generic + prompt |
| Code Quality Auditor | `voltagent-qa-sec:code-reviewer` | | Generic + prompt |
| Feasibility Analyst | `voltagent-data-ai:data-scientist` | `voltagent-biz:business-analyst` | Generic + prompt |
| Risk Assessor | `voltagent-qa-sec:chaos-engineer` | `voltagent-domains:risk-manager` | Generic + prompt |
| Performance Specialist | `voltagent-qa-sec:performance-engineer` | `voltagent-infra:sre-engineer` | Generic + prompt |
| Stakeholder Advocate | `voltagent-biz:product-manager` | `voltagent-biz:business-analyst` | Generic + prompt |
| Devil's Advocate | Generic + prompt | | (intentionally generic) |
| Data Quality Auditor | `voltagent-data-ai:data-analyst` | `voltagent-data-ai:data-engineer` | Generic + prompt |
| Reliability/SRE | `voltagent-infra:sre-engineer` | `voltagent-infra:devops-incident-responder` | Generic + prompt |
| DevOps/Infra | `voltagent-infra:devops-engineer` | `voltagent-infra:platform-engineer` | Generic + prompt |
| Database Specialist | `voltagent-data-ai:database-optimizer` | `voltagent-data-ai:postgres-pro` | Generic + prompt |
| Clarity Editor | `voltagent-dev-exp:documentation-engineer` | `voltagent-biz:technical-writer` | Generic + prompt |
| Technical Accuracy | `voltagent-qa-sec:code-reviewer` | | Generic + prompt |
| Completeness Checker | `voltagent-qa-sec:qa-expert` | | Generic + prompt |
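
One way to read the Primary / Alt / Fallback columns is as an ordered lookup. The sketch below assumes that order, a hypothetical `resolve_agent` helper, and an installed-agent list that has already been collected from the system-reminder agent list. None of these names come from the skill itself.

```shell
# Hypothetical helper: pick the first installed agent from primary, then alt;
# otherwise print "generic" to signal the persona-prompted fallback.
# $1 = newline-separated list of installed agent names
# $2 = primary subagent_type, $3 = alt subagent_type (may be empty)
resolve_agent() (
  installed=$1
  for candidate in "$2" "$3"; do
    [ -n "$candidate" ] || continue
    # -x -F: the whole line must equal the candidate, with no regex handling
    if printf '%s\n' "$installed" | grep -qxF -- "$candidate"; then
      printf '%s\n' "$candidate"
      exit 0   # leaves only the function's subshell
    fi
  done
  printf 'generic\n'
)

# Illustrative installed set (an assumption).
installed_agents='voltagent-qa-sec:debugger
voltagent-infra:sre-engineer'

# Correctness Hawk: primary is missing, alt is installed.
resolve_agent "$installed_agents" \
  "voltagent-qa-sec:code-reviewer" "voltagent-qa-sec:debugger"
# Completeness Checker: no alt listed and primary is missing.
resolve_agent "$installed_agents" "voltagent-qa-sec:qa-expert" ""
```

With this installed set the two calls print `voltagent-qa-sec:debugger` and `generic`. Whichever branch is taken, the launch still passes `model: "opus"` explicitly.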

#### Signal-Detected Specialist Mapping (auto-added by content signals)

When content signals trigger auto-addition of specialist reviewers, use these
VoltAgent agents instead of generic personas:

| Content Signal | Auto-Add Persona | VoltAgent `subagent_type` |
|---|---|---|
| SQL / database queries | Data Quality Auditor | `voltagent-data-ai:database-optimizer` |
| Terraform / IaC | Infrastructure Reviewer | `voltagent-infra:terraform-engineer` |
| Terragrunt | Infrastructure Reviewer | `voltagent-infra:terragrunt-expert` |
| Docker / containers | Container Reviewer | `voltagent-infra:docker-expert` |
| Kubernetes / k8s | K8s Reviewer | `voltagent-infra:kubernetes-specialist` |
| CI/CD / pipelines | Pipeline Reviewer | `voltagent-infra:deployment-engineer` |
| ML / model training | ML Reviewer | `voltagent-data-ai:ml-engineer` |
| LLM / prompts | LLM Reviewer | `voltagent-data-ai:llm-architect` |
| NLP / text processing | NLP Reviewer | `voltagent-data-ai:nlp-engineer` |
| React / frontend | Frontend Reviewer | `voltagent-lang:react-specialist` |
| TypeScript | TS Reviewer | `voltagent-lang:typescript-pro` |
| Python | Python Reviewer | `voltagent-lang:python-pro` |
| Go / Golang | Go Reviewer | `voltagent-lang:golang-pro` |
| Rust | Rust Reviewer | `voltagent-lang:rust-engineer` |
| Java / Spring | Java Reviewer | `voltagent-lang:java-architect` |
| .NET / C# | .NET Reviewer | `voltagent-lang:csharp-developer` |
| Ruby / Rails | Rails Reviewer | `voltagent-lang:rails-expert` |
| PHP / Laravel | PHP Reviewer | `voltagent-lang:laravel-specialist` |
| Swift / iOS | iOS Reviewer | `voltagent-lang:swift-expert` |
| Flutter / Dart | Flutter Reviewer | `voltagent-lang:flutter-expert` |
| GraphQL | GraphQL Reviewer | `voltagent-core-dev:graphql-architect` |
| WebSocket / real-time | Real-time Reviewer | `voltagent-core-dev:websocket-engineer` |
| Microservices | Architecture Reviewer | `voltagent-core-dev:microservices-architect` |
| API design | API Reviewer | `voltagent-core-dev:api-designer` |
| Network / DNS / routing | Network Reviewer | `voltagent-infra:network-engineer` |
| Azure | Azure Reviewer | `voltagent-infra:azure-infra-engineer` |
| Active Directory | AD Security Reviewer | `voltagent-qa-sec:ad-security-reviewer` |
| PowerShell | PowerShell Reviewer | `voltagent-qa-sec:powershell-security-hardening` |
| Compliance / GDPR / SOC2 | Compliance Reviewer | `voltagent-qa-sec:compliance-auditor` |
| Accessibility / a11y | Accessibility Reviewer | `voltagent-qa-sec:accessibility-tester` |
| Error handling / logging | Error Reviewer | `voltagent-qa-sec:error-detective` |
| Test automation | Test Reviewer | `voltagent-qa-sec:test-automator` |
| Blockchain / Web3 | Blockchain Reviewer | `voltagent-domains:blockchain-developer` |
| Payment / fintech | Fintech Reviewer | `voltagent-domains:fintech-engineer` |
| IoT / embedded | Embedded Reviewer | `voltagent-domains:embedded-systems` |
| SEO | SEO Reviewer | `voltagent-domains:seo-specialist` |
| Quant / financial models | Quant Reviewer | `voltagent-domains:quant-analyst` |
| Vue / Nuxt | Vue Reviewer | `voltagent-lang:vue-expert` |
| Angular | Angular Reviewer | `voltagent-lang:angular-architect` |
| Next.js | Next.js Reviewer | `voltagent-lang:nextjs-developer` |
| React Native / Expo | Mobile Reviewer | `voltagent-lang:expo-react-native-expert` |
| Electron / desktop apps | Desktop Reviewer | `voltagent-core-dev:electron-pro` |
| Django | Django Reviewer | `voltagent-lang:django-developer` |
| FastAPI | FastAPI Reviewer | `voltagent-lang:fastapi-developer` |
| Spring Boot | Spring Boot Reviewer | `voltagent-lang:spring-boot-engineer` |
| Symfony | Symfony Reviewer | `voltagent-lang:symfony-specialist` |
| C / C++ | C++ Reviewer | `voltagent-lang:cpp-pro` |
| Kotlin | Kotlin Reviewer | `voltagent-lang:kotlin-specialist` |
| Elixir / Phoenix | Elixir Reviewer | `voltagent-lang:elixir-expert` |
| MLOps / model deployment | MLOps Reviewer | `voltagent-data-ai:mlops-engineer` |
| Reinforcement learning | RL Reviewer | `voltagent-data-ai:reinforcement-learning-engineer` |
| Prompt optimization / evals | Prompt Reviewer | `voltagent-data-ai:prompt-engineer` |
| AI systems / agentic apps | AI Systems Reviewer | `voltagent-data-ai:ai-engineer` |
| Database admin / replication / HA | DBA Reviewer | `voltagent-infra:database-administrator` |
| Incident response / outages | Incident Reviewer | `voltagent-infra:incident-responder` |
| Windows Server / IIS | Windows Reviewer | `voltagent-infra:windows-infra-admin` |
| Cloud security / secrets mgmt | Cloud Security Reviewer | `voltagent-infra:security-engineer` |
| CLI tools / TUIs | CLI Reviewer | `voltagent-dev-exp:cli-developer` |
| MCP servers / tools | MCP Reviewer | `voltagent-dev-exp:mcp-developer` |
| Refactoring / legacy modernization | Refactoring Reviewer | `voltagent-dev-exp:refactoring-specialist` |
| Build systems / bundlers | Build Reviewer | `voltagent-dev-exp:build-engineer` |
| Dependencies / supply chain | Dependency Reviewer | `voltagent-dev-exp:dependency-manager` |
| Git workflows / branching | Git Workflow Reviewer | `voltagent-dev-exp:git-workflow-manager` |
| Game dev / engines | Game Reviewer | `voltagent-domains:game-developer` |
| API docs / OpenAPI | API Docs Reviewer | `voltagent-domains:api-documenter` |
| Legal / licensing / contracts | Legal Reviewer | `voltagent-biz:legal-advisor` |
| UX research / usability | UX Research Reviewer | `voltagent-biz:ux-researcher` |

#### Multi-Agent Orchestration Mapping (for pre/post-panel phases)

All launches below MUST pass `model: "opus"` explicitly (v2.14).

| Review Phase | VoltAgent `subagent_type` | Use When |
|---|---|---|
| Data Flow Trace (Phase 2) | `voltagent-data-ai:data-engineer`, `model: "opus"` | Trace data paths, document schemas at boundaries (v2.14) |
| Completeness Audit (Phase 8) | `voltagent-meta:knowledge-synthesizer`, `model: "opus"` | Synthesize what the panel missed |
| Claim Verification (Phase 10) | `voltagent-qa-sec:code-reviewer`, `model: "opus"` | Verify line-number citations |
| Severity Verification (Phase 11) | `voltagent-qa-sec:debugger`, `model: "opus"` | Read actual code for P0/P1 findings |
| Tier Refinement Advisor (Phase 12b) | Generic, `model: "opus"` | (must be domain-neutral to refine tiers) |
| Verification Agents (Phase 13) | Persona-matched — see Phase 13 table, `model: "opus"` | Each agent matched to claim type |
| Supreme Judge (Phase 14) | Generic, `model: "opus"` | (judge must be domain-neutral) |
| Judge-Output Verifier (Phase 14.5) | Generic, `model: "opus"` | Re-verifies judge-introduced P0/P1 against ground truth via grep/Read/git (v3.2.0) |
| HTML Report Agent (Phase 15.3) | `voltagent-lang:javascript-pro`, `model: "opus"` | Generate interactive HTML dashboard with expandable issue cards (v2.15). Reads from disk: Phase 15.1 report + Phase 15.2 process history + rendering spec from prompt-templates.md (v2.16.4). Loads Tailwind, Chart.js, and Prism.js via CDN. |
| Merge Agent (Phase 16) | `voltagent-meta:knowledge-synthesizer`, `model: "opus"` | Deduplicate + score stability in multi-run mode (v2.14) |

**Step 3: Suggest installation when beneficial.** If a selected persona would
benefit from a VoltAgent agent but the agent family is not available, suggest
installation to the user:

> "This review would benefit from VoltAgent specialist agents for deeper
> domain-specific analysis. You can install the relevant families with:
>
> **Quick install (CLI):**
> `claude plugin install voltagent-qa-sec`  — security, code review, testing
> `claude plugin install voltagent-data-ai` — data science, ML, databases
> `claude plugin install voltagent-infra`   — DevOps, cloud, Terraform
> `claude plugin install voltagent-lang`    — language specialists (TS, Python, Go, Rust)
> `claude plugin install voltagent-biz`     — product, business analysis
> `claude plugin install voltagent-domains` — fintech, blockchain, IoT
>
> **Or browse via marketplace:**
> `/plugin marketplace add VoltAgent/awesome-claude-code-subagents`
> then `/plugin install <name>@voltagent-subagents`
>
> Continue without them? They're optional — the review will still work
> with generic persona-prompted agents."

Only suggest installation **once per session**. List only the families relevant
to the detected content signals, not all 10. If the user declines or the agents
aren't available, proceed with the generic fallback silently.

**Step 4: Launch with `subagent_type` AND `model: "opus"`.** When launching Phase 3 agents:
- `subagent_type: "voltagent-qa-sec:code-reviewer", model: "opus"` (when available)
- Omit `subagent_type`, pass `model: "opus"` explicitly (generic agent fallback)

**CRITICAL (v2.14):** ALWAYS pass `model: "opus"` even when using `subagent_type`.
VoltAgent agents may declare their own default model (sonnet, haiku) in their
frontmatter. Without an explicit override, the panel silently runs on mixed
models, producing different reasoning depths across runs. The VoltAgent
agent's value lives in its system prompt and tool access, NOT its default
model. Forcing opus preserves the domain expertise while guaranteeing
consistent reasoning depth. This fix resolves an invisible source of
cross-run variance documented in the v2.10→v2.14 consistency analysis.

The persona prompt is STILL included even when using VoltAgent agents — it
provides the review-panel-specific context (agreement intensity, reasoning
strategy, evaluation criteria) that the VoltAgent agent doesn't have natively.

---

## Live-State Claim Discipline (v3.3.0)

Resolves [#40](https://github.com/wan-huiyan/agent-review-panel/issues/40). A
panel reading source code can verify what a script or manifest WILL do if
executed — it cannot verify what production infrastructure IS doing right now.
Conflating the two produced a false-positive P0 "IAM/IAP divergence" finding
that survived all 5 reviewers, 3 debate rounds, and the Supreme Judge: the
agents read `echo "gcloud ... --role=..."` lines in two deploy scripts
(operator-facing documentation printed to the terminal at deploy-completion
time) as if they were the live IAM bindings of the deployed service. A single
`gcloud run services describe` would have falsified it in 30 seconds.

This discipline applies to any finding that asserts a fact about **live
state** — deployed IAM/IAP/auth config, a running cron schedule, a BigQuery
table's partition key, a production env var, a load balancer's routing. It is
NOT limited to security. It is injected into Phase 3 reviewer prompts, the
Phase 5 debate prompt, Phase 11 severity verification, and the Phase 14 judge
prompt.

### Rule 1 — Declarative vs. imperative vs. documentation

Reviewers must distinguish three things a source file can contain:

| Category | Example | What it proves |
|---|---|---|
| **Declarative config** | `gcloud run deploy ... --no-allow-unauthenticated`, a Terraform resource, a YAML manifest | The deploy WILL create this config if run — still not proof it WAS run |
| **Imperative documentation** | `echo "  gcloud beta run services add-iam-policy-binding ..."`, a comment, a README snippet, a printed "next steps" blurb | A human is being TOLD to run this later. The script itself does not. |
| **Live state** | output of `gcloud ... describe` / `... get-iam-policy`, `bq show`, `aws ... describe-*`, `kubectl get`, `crontab -l` | What production actually is right now |

Lines inside `echo "..."`, `print(...)`, comments, heredocs echoed to a
terminal, or string literals in "usage" / "next steps" blocks are
**documentation, not configuration**. They are never evidence for a live-state
claim. Configuration claims must come from declarative deploy flags / manifests;
live-state claims must come from live `describe`-class output.
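
As a rough illustration of Rule 1, the sketch below sorts individual script lines into the documentation and declarative-config buckets. The function name `classify_source_line` and both pattern lists are assumptions made for this example. They are a heuristic, not an exhaustive grammar, and they do not handle heredocs or multi-line strings. Live state has no bucket here because it cannot be read from a source file at all.

```python
import re

# Hypothetical heuristic patterns, chosen for illustration only.
DOC_PATTERNS = [
    re.compile(r"^\s*#"),           # shell or Python comment
    re.compile(r"^\s*echo\b"),      # text printed to the terminal
    re.compile(r"^\s*print\s*\("),  # Python print call
]
CONFIG_PATTERNS = [
    re.compile(r"^\s*gcloud\s+run\s+deploy\b"),
    re.compile(r"^\s*kubectl\s+apply\b"),
]


def classify_source_line(line: str) -> str:
    """Return 'documentation', 'declarative-config', or 'unclassified'."""
    # Documentation is checked first so that an echoed command is never
    # mistaken for a deploy flag.
    if any(p.search(line) for p in DOC_PATTERNS):
        return "documentation"
    if any(p.search(line) for p in CONFIG_PATTERNS):
        return "declarative-config"
    return "unclassified"
```

The ordering is the point of the sketch: a line such as `echo "  gcloud beta run services add-iam-policy-binding ..."` matches the `echo` pattern before any config pattern is tried, so it can never be cited as configuration.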

### Rule 2 — Live-state claims need live evidence

Every finding that asserts a live-infrastructure or runtime-state fact carries
one of two epistemic tags:

- **`[LIVE-VERIFIED]`** — backed by output from a live-state command
  (`gcloud ... describe` / `... get-iam-policy`, `bq show`, `aws ... describe-*`,
  `kubectl get`, `crontab -l`, etc.) that the panel actually ran or that the
  user supplied.
- **`[STATIC-INFERENCE]`** — inferred from source code, config files, or deploy
  scripts only. The panel did not (or could not) observe live state.

A `[STATIC-INFERENCE]` live-state claim is capped at **P1**, no matter how many
reviewers cite it. P0 ("block the demo") requires `[LIVE-VERIFIED]` — or the
finding must be reworded as "the deploy script would configure X" (a
`[PLAN_RISK]`, not an `[EXISTING_DEFECT]`). When the panel lacks the tools to
obtain live evidence, the finding must say so explicitly rather than inferring.
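
The cap in Rule 2 can be sketched as a small function. The name `cap_live_state_severity` and the `SEVERITY_ORDER` list are assumptions for illustration; only the tags and the P1 ceiling come from the rule itself.

```python
SEVERITY_ORDER = ["P0", "P1", "P2", "P3"]  # most severe first
LIVE_STATE_TAGS = ("[LIVE-VERIFIED]", "[STATIC-INFERENCE]")


def cap_live_state_severity(severity: str, tag: str) -> str:
    """Apply the Rule 2 ceiling to a live-state finding."""
    if tag not in LIVE_STATE_TAGS:
        raise ValueError(f"live-state findings need one of {LIVE_STATE_TAGS}")
    is_above_p1 = SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index("P1")
    if tag == "[STATIC-INFERENCE]" and is_above_p1:
        return "P1"  # reviewer count is deliberately not an input
    return severity
```

Reviewer count does not appear in the signature, which mirrors the rule: no amount of agreement lifts a static inference above P1.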

### Rule 3 — Consensus does not compound on a shared artifact

When 2+ reviewers reach the same conclusion by reading the **same** source
lines, that is consensus on an *interpretation*, not independent verification
of a *fact*. It must not be promoted to `[VERIFIED]` or used to justify P0.
Phase 6 (Sycophancy Detection) flags this pattern; the judge tags it
`[STATIC-INFERENCE-CONSENSUS]` and requires independent live verification
before any P0 promotion. Cross-citation chains (Security F3 → Architecture F4
→ DA CF2) over the same artifact lines are a single source, not three.
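
One way to make "a single source, not three" concrete is to resolve each finding back to the artifact lines it ultimately rests on and count the distinct results. This is a minimal sketch under assumptions: the record shape (`artifact`, `cites_finding`), the file name `deploy.sh`, and the line numbers are all hypothetical, and only identical line spans are merged (overlapping spans are not).

```python
from typing import Optional, Tuple

Artifact = Tuple[str, int, int]  # (file, first line, last line)


def root_artifact(finding_id: str, findings: dict) -> Optional[Artifact]:
    """Follow cross-citations until a finding that cites source lines directly."""
    seen = set()
    current = finding_id
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        finding = findings.get(current)
        if finding is None:
            return None
        if finding.get("artifact") is not None:
            return finding["artifact"]
        current = finding.get("cites_finding")
    return None


def count_independent_sources(findings: dict) -> int:
    """Count distinct artifact spans behind a set of agreeing findings."""
    roots = {root_artifact(fid, findings) for fid in findings}
    roots.discard(None)
    return len(roots)


# Hypothetical chain mirroring Security F3 -> Architecture F4 -> DA CF2.
chain = {
    "Security F3": {"artifact": ("deploy.sh", 40, 44), "cites_finding": None},
    "Architecture F4": {"artifact": None, "cites_finding": "Security F3"},
    "DA CF2": {"artifact": None, "cites_finding": "Architecture F4"},
}
```

For the `chain` above, `count_independent_sources(chain)` is 1: three reviewers, one set of source lines.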

### Rule 4 — Pre-promotion falsification check

Before any finding is promoted to P0 — in debate (Phase 5) or by the judge
(Phase 14) — answer two questions:

1. **What single observation would prove this finding wrong?**
2. **Is that observation cheap to obtain?**

If a P0 can be falsified by one read-only command (a `describe`, a `show`, a
`grep`) and no agent ran it, the finding is at most P1 until verified. Record
the falsification test alongside the finding.

---

## Phase 2: Data Flow Trace (v2.14)

A dedicated agent traces data through the critical path(s) of the work
BEFORE reviewers begin, producing a structured Data Flow Map. This phase
specifically targets **composition defects**: bugs where two
individually-correct functions produce incorrect results together. These bugs
are structurally invisible to reviewers who read each function in isolation.

Research foundations: Meta semi-formal certificate prompting (2026, 78%→93%
accuracy), LLMDFA (NeurIPS 2024, 87% precision), RepoAudit (ICML 2025,
78% precision with demand-driven exploration), BugLens (ASE 2025, 7x false
positive reduction), ZeroFalse (2025, F1 0.955).

### Skip Conditions

- Pure plans/design (no code)
- Pure documentation (no code)
- Code with no detectable data transforms (pure API routing, static config,
  declarative-only files)

When Phase 2 is skipped, note the reason in the Context Brief and the report
header. Proceed directly to Phase 3.

### Tier System

Three tiers, user-selectable via `--trace {tier}` or natural language:

| Tier | Trigger Phrases | Paths Traced | Overhead | Token Budget |
|------|----------------|--------------|----------|--------------|
| **Standard** (default) | no modifier, "review" | Single most important path | ~5 min | ~8k |
| **Thorough** | "thorough review", "thorough trace", `--trace thorough` | Top 3 paths + transform completeness | ~15 min | ~20k |
| **Exhaustive** | "exhaustive review", "trace everything", "catch all bugs", `--trace exhaustive` | ALL paths from every entry point | No limit | No limit |

**Tier detection priority:**
1. Explicit `--trace {tier}` flag
2. Natural language keywords in user's original prompt
3. Default: Standard

"Deep review" (which triggers web research) combines with Standard trace
unless the user also specifies a trace tier.
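
The detection priority can be sketched as below. The function name `detect_trace_tier` is hypothetical, and checking Exhaustive keywords before Thorough ones when both appear is an assumption of this sketch, not something the tier table specifies.

```python
import re

TIERS = ("standard", "thorough", "exhaustive")
KEYWORDS = {
    "exhaustive": ("exhaustive review", "trace everything", "catch all bugs"),
    "thorough": ("thorough review", "thorough trace"),
}


def detect_trace_tier(prompt: str) -> str:
    """Resolve the trace tier: explicit flag, then keywords, then default."""
    # 1. Explicit --trace {tier} flag wins over everything else.
    flag = re.search(r"--trace\s+(\w+)", prompt)
    if flag and flag.group(1).lower() in TIERS:
        return flag.group(1).lower()
    # 2. Natural-language keywords (assumed order: exhaustive, then thorough).
    lowered = prompt.lower()
    for tier in ("exhaustive", "thorough"):
        if any(keyword in lowered for keyword in KEYWORDS[tier]):
            return tier
    # 3. Default. "deep review" alone lands here, as described above.
    return "standard"
```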

### Critical Path Identification (orchestrator, not subagent)

Before launching the Data Flow Tracer, the orchestrator identifies entry
points and ranks them by data complexity:

1. **Find entry points.** Scan for structural markers:
   - Web frameworks: `@app.route`, `@router.get/post`, `@api_view`, Django CBVs
   - CLI: `@click.command`, `@app.command` (Typer), `if __name__ == "__main__":`, argparse
   - Background: `@app.task` (Celery), AWS `lambda_handler`, Kafka/SQS consumers
   - Scripts: `main()`, top-level script execution

2. **Rank by data complexity.** Count on each path:
   - Number of function calls
   - Number of data transforms (map/filter/reduce/apply/merge/join/groupby/pivot)
   - Number of I/O boundaries (DB, HTTP, file, queue)
   - Presence of transform/back-transform pairs

3. **Select paths per tier:**
   - Standard: top-ranked path only
   - Thorough: top 3 paths
   - Exhaustive: all paths
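
The ranking and selection steps above can be sketched as follows. The steps list what to count but not how to weight it, so the unweighted sum in `complexity` is an assumption, as are the `PathStats` and `select_paths` names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathStats:
    entry_point: str
    function_calls: int
    transforms: int
    io_boundaries: int
    has_transform_pair: bool


def complexity(path: PathStats) -> int:
    """Assumed scoring: plain sum of the four counted signals."""
    return (path.function_calls + path.transforms + path.io_boundaries
            + (1 if path.has_transform_pair else 0))


def select_paths(paths: List[PathStats], tier: str) -> List[PathStats]:
    """Pick the paths to trace for a tier, most complex first."""
    ranked = sorted(paths, key=complexity, reverse=True)
    limit = {"standard": 1, "thorough": 3, "exhaustive": len(ranked)}[tier]
    return ranked[:limit]
```

`sorted` is stable, so entry points with equal scores keep their discovery order.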

### The Data Flow Tracer Agent

Single agent (`model: "opus"`). VoltAgent mapping: `voltagent-data-ai:data-engineer`
primary, `voltagent-qa-sec:code-reviewer` fallback. **Always pass `model: "opus"`**
even when using `subagent_type`.

Uses the **semi-formal certificate approach** from Meta's 2026 agentic code
reasoning research. At each function boundary on the critical path, the agent
produces a certificate:

```
FUNCTION: {name} ({file}:{line})
INPUT_SCHEMA:
  - parameter types (declared or inferred)
  - known constraints at call site
  - which parameters are externally controlled
TRANSFORM:
  - what the function does
  - key assignments and branches
  - external calls (I/O, DB, modules)
OUTPUT_SCHEMA:
  - return type
  - tainted/derived fields
  - guaranteed invariants
COMPOSITION_CHECK: (vs next function)
  - Does OUTPUT_SCHEMA satisfy next INPUT_SCHEMA?
  - Fields required but not guaranteed?
  - Tainted fields reaching sensitive parameters?
INVARIANT_STATUS:
  - preserved or violated invariants
  - violations flagged as P0 candidates
```

See `references/prompt-templates.md` for the full Phase 2 Data Flow Tracer
prompt.
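
The `COMPOSITION_CHECK` step of the certificate amounts to comparing one function's output schema with the next function's input schema. A minimal sketch, assuming schemas are represented as plain `field -> type name` dictionaries (a simplification made for this example; the tracer agent reasons over richer information):

```python
from typing import Dict, List


def composition_gaps(output_schema: Dict[str, str],
                     next_input_schema: Dict[str, str]) -> List[str]:
    """List fields the next function requires but the previous one does not guarantee."""
    gaps = []
    for field, expected in next_input_schema.items():
        if field not in output_schema:
            gaps.append(f"{field}: required but not guaranteed")
        elif output_schema[field] != expected:
            gaps.append(f"{field}: expected {expected}, got {output_schema[field]}")
    return gaps
```

An empty list means the boundary composes cleanly; any entry is a candidate for the invariant violations table.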

### Mandatory Invariant Checks (at every boundary)

1. **Schema preservation** — output schema matches next function's expected input
2. **Transform/back-transform completeness** — list forward transforms (log,
   encode, serialize) and back-transforms (exp, decode, deserialize). Any
   field in forward but not back is a P0 candidate. See the
   Transform/Back-Transform Completeness checklist in
   `references/signals-and-checklists.md`.
3. **Row count stability** — joins/merges/reindex/groupby should not silently
   add or remove rows
4. **Null semantics** — `fillna(0)` does not destroy meaningful missingness
5. **Temporal consistency** — date filters applied to all date columns;
   ALL instances of an excluded event (e.g., BOTH Christmases) handled
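
Check 2 (transform/back-transform completeness) reduces to a per-field comparison. The sketch below assumes each field maps to a single named operation and uses only the three pairs named in the check; `INVERSES` and `missing_back_transforms` are hypothetical names.

```python
from typing import Dict, List

# Forward operation -> expected back-transform, taken from the pairs listed above.
INVERSES = {"log": "exp", "encode": "decode", "serialize": "deserialize"}


def missing_back_transforms(forward: Dict[str, str],
                            back: Dict[str, str]) -> List[str]:
    """Fields transformed forward without the matching back-transform."""
    missing = []
    for field, operation in forward.items():
        expected = INVERSES.get(operation)
        # Unknown operations are flagged too, so a human looks at them.
        if expected is None or back.get(field) != expected:
            missing.append(field)
    return sorted(missing)
```

Every field returned is a P0 candidate under the check, pending reviewer validation.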

### Output and Integration with Phase 3

The Data Flow Tracer produces a **Data Flow Map** containing:
- List of paths traced
- Per-function certificates
- Invariant violations table (P0 candidates)
- Transform completeness table
- Clean paths (where all invariants hold)

**Integration with Phase 3:** The Data Flow Map is injected into every
reviewer's Phase 3 prompt as dedicated context. Invariant violations are
flagged as P0 candidates; reviewers must either validate them (agree they're
real P0s) or explicitly challenge them with reasoning. Reviewers are NOT
required to agree with the tracer — this is an additional input, not a
mandate.

When no violations are found, reviewers receive a short "clean trace"
confirmation instead.

---

## Phase 3: Independent Review (Round 0)

Launch ALL reviewer agents **in parallel** using Agent tool with `model: "opus"`.
When VoltAgent integration is active, use `subagent_type` from the mapping table.
Each gets the structured prompt from `references/prompt-templates.md` (Phase 3
template) with their persona, agreement intensity, reasoning strategy, context
brief, and the full work content inside injection boundaries. The prompt also
carries the Live-State Claim Discipline (Rules 1–2): reviewers must tag every
live-infrastructure/runtime-state claim `[LIVE-VERIFIED]` or `[STATIC-INFERENCE]`
and must not treat `echo`/comment/usage-blurb lines as configuration.

Collect all N independent reviews.

**Output (v3.1.0+):** Each reviewer subagent writes its full review to
`state/reviewer_<name>_phase_3.md` and returns only the path + a 100-word
summary. The orchestrator does NOT hold verbatim reviews in its window.

---

## Phase 4: Private Reflection

Launch all reviewers **in parallel**, each receiving ONLY their own review.
They re-read source, rate confidence per finding (High/Medium/Low), note new
issues, identify most/least defensible findings. See `references/prompt-templates.md`.

**Output (v3.1.0+):** Each reviewer's reflection is written to
`state/reviewer_<name>_phase_4.md`. Subagent returns only path + 100-word
summary.

---

## Phase 5: Debate (Rounds 1-3, adaptive)

Launch all reviewers **in parallel** each round. Each receives their own review
and reflection, all others' feedback, and unresolved points from the previous round.

**Output (v3.1.0+):** Each reviewer's per-round debate response is written
to `state/reviewer_<name>_phase_5_round<R>.md` (R = 1, 2, or 3). Round 1 is
mandatory; rounds 2 and 3 follow the existing convergence-based skip rules.
Subagent returns only path + 100-word summary.

**Pre-promotion falsification check (v3.3.0).** Before any finding is promoted
to — or kept at — P0 in any debate round, the reviewer must state the single
observation that would falsify it and whether that observation is cheap to
obtain (see Live-State Claim Discipline Rule 4). A P0 that one read-only command
could falsify, with no agent having run it, is capped at P1 until verified.

### Phase 6: Round Summarization

After each round, summarize (no agent needed):
- **Resolved this round** — who agreed, what convinced them
- **Still in dispute** — with inlined source excerpts (max 10 lines per dispute,
  first 5 + last 5 if longer; max 3 disputes). If a reviewer's claim cannot be
  traced to a specific source location, tag `[source not cited by reviewer]`.
- **New discoveries** — from which reviewer
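
The excerpt limit for disputed points (at most 10 lines, first 5 plus last 5 when the source span is longer) can be sketched as below. `dispute_excerpt` is a hypothetical helper name, and the sketch adds no elision marker between the two halves.

```python
from typing import List


def dispute_excerpt(source_lines: List[str], max_lines: int = 10) -> List[str]:
    """Trim a cited source span to the head and tail halves of the budget."""
    if len(source_lines) <= max_lines:
        return list(source_lines)
    half = max_lines // 2
    return source_lines[:half] + source_lines[-half:]
```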

### Sycophancy Detection (CONSENSAGENT)

Count position changes toward the majority. If more than 50% of those changes
cite no new evidence → inject a sycophancy alert into the next round's prompt
for all reviewers.
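
A minimal sketch of the threshold, reading it as "more than half of the moves toward the majority came without new evidence". The `PositionChange` record and the function name are assumptions for illustration; how a change is judged to cite new evidence is left to the orchestrator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PositionChange:
    reviewer: str
    toward_majority: bool
    cites_new_evidence: bool


def needs_sycophancy_alert(changes: List[PositionChange]) -> bool:
    """True when over 50% of majority-ward changes lack new evidence."""
    toward = [c for c in changes if c.toward_majority]
    if not toward:
        return False
    unsupported = sum(1 for c in toward if not c.cites_new_evidence)
    return unsupported / len(toward) > 0.5
```

Changes away from the majority are ignored on purpose: a reviewer who leaves the consensus is not exhibiting the pattern this check targets.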

**Shared-artifact consensus (v3.3.0).** Also flag when 2+ reviewers agree on a
claim by reading the **same** source lines without independent verification —
including cross-citation chains where each reviewer cites the previous one's
finding rather than the source. This is consensus on an *interpretation*, not
on a *fact*: it does not compound to `[VERIFIED]` and must not justify P0. Tag
such points `[STATIC-INFERENCE-CONSENSUS]` (Live-State Claim Discipline Rule 3)
and route them to verification before any P0 promotion.

### Convergence Check

- All disputes minor/stylistic → stop
- Substantive disagreements remain → continue
- New discoveries still emerging → continue
- Maximum 3 rounds regardless

---

## Phase 7: Blind Final Assessment

Launch all reviewers one final time in parallel. Each gives a final score, top 3
points, a recommendation, and a one-line verdict. Reviewers do NOT see each
other's finals.

**Output (v3.1.0+):** Each reviewer's blind final is written to
`state/reviewer_<name>_phase_7.md`. Subagent returns only path + 100-word
summary of new findings.

---

## Phase 8: Completeness Audit

Single agent (`model: "opus"`) hunts for what the entire panel missed. Does NOT
evaluate quality — only finds overlooked details, edge cases, constants, code.
See `references/prompt-templates.md` for full prompt.

**Mandatory audit checks (in addition to general completeness):**
- **Temporal scope verification:** For every claim that excludes, filters, or
  masks a time period, count all instances in the full date range and verify
  each is handled. Example: "excludes Christmas" with 2 years of data must
  exclude BOTH Christmases. This is the #1 class of bug that reviewers miss
  because they focus on the method, not the temporal arithmetic.
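
The temporal arithmetic can be sketched as follows: enumerate every occurrence of the excluded date in the data range, then subtract the ones the code actually handles. The helper names are hypothetical, and the sketch assumes a fixed calendar date that exists in every year (it would need adjusting for 29 February or for movable holidays).

```python
from datetime import date
from typing import List, Set


def occurrences_in_range(month: int, day: int, start: date, end: date) -> List[date]:
    """Every occurrence of a fixed calendar date within [start, end]."""
    candidates = (date(year, month, day) for year in range(start.year, end.year + 1))
    return [d for d in candidates if start <= d <= end]


def unhandled_occurrences(month: int, day: int, start: date, end: date,
                          excluded: Set[date]) -> List[date]:
    """Occurrences in range that the reviewed code does not exclude."""
    return [d for d in occurrences_in_range(month, day, start, end)
            if d not in excluded]
```

With two years of data from 2022-01-01 to 2023-12-31 and only the 2023 Christmas excluded, the second function returns the 2022 Christmas as the missed instance.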

**Output (v3.1.0+):** Subagent writes full output to `state/phase_8_audit.md`
and returns only path + 100-word summary.

---

## Phase 9: Verification Command Execution (v2.8)

Run up to 5 reviewer `verification_command` entries for P0/P1 findings (P0 first).
Validate read-only (grep/cat/head/tail/wc only), execute via Bash, annotate:
`[CMD_CONFIRMED]`, `[CMD_CONTRADICTED]` (demote 1 level), `[CMD_INCONCLUSIVE]`,
`[CMD_FAILED]`. **Advisory, not gating** — demotes but does not delete.
Skip this phase if no verification commands were provided.
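
A conservative sketch of the read-only validation and the one-level demotion. The allowlist comes from the phase description; rejecting every shell metacharacter (including pipes and `$`) is an assumption of this sketch and is stricter than the description requires, so some legitimate `grep` patterns would be refused. Function names are hypothetical.

```python
import shlex

READ_ONLY_BINARIES = {"grep", "cat", "head", "tail", "wc"}
FORBIDDEN_CHARS = "\n;&|<>`$"  # assumed: refuse chaining, redirection, substitution
SEVERITIES = ["P0", "P1", "P2", "P3"]


def is_read_only(command: str) -> bool:
    """Accept only a single allowlisted binary with plain arguments."""
    if any(ch in command for ch in FORBIDDEN_CHARS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in READ_ONLY_BINARIES


def apply_outcome(severity: str, outcome: str) -> str:
    """Demote one level on contradiction; never delete the finding."""
    if outcome == "[CMD_CONTRADICTED]":
        index = min(SEVERITIES.index(severity) + 1, len(SEVERITIES) - 1)
        return SEVERITIES[index]
    return severity
```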

---

## Phase 10: Claim Verification

Single agent (`model: "opus"`) checks all reviewer citations against source.
Classifies each as `[VERIFIED]`, `[INACCURATE]`, `[MISATTRIBUTED]`,
`[HALLUCINATED]`, or `[UNVERIFIABLE]`. Results feed into the judge prompt.

**Output (v3.1.0+):** Subagent writes full output to `state/phase_10_claim_verification.md`
and returns only path + 100-word summary.

---

## Phase 11: Severity Verification (v2.7)

Single agent (`model: "opus"`) that reads the actual codebase to verify every
P0 and P1 finding before the judge sees them. This phase exists because panels
systematically overstate severity when they lack runtime context (v2.6
benchmark: 2/3 P0 findings were overstated after code investigation).

**For each P0/P1 finding, the agent must:**

1. **Classify as `[EXISTING_DEFECT]` or `[PLAN_RISK]`**
   - `[EXISTING_DEFECT]`: The bug exists in the current running code right now
   - `[PLAN_RISK]`: The risk would only materialise if the plan is implemented as written
   - P0 severity requires `[EXISTING_DEFECT]`. A `[PLAN_RISK]` is at most P1.

1b. **Live-state claim classification (v3.3.0)** — If the finding asserts a fact
   about live infrastructure or runtime state (deployed IAM/IAP/auth config, a
   running cron schedule, a production env var, a BigQuery partition key, a load
   balancer's routing), apply the Live-State Claim Discipline:
   - Tag `[LIVE-VERIFIED]` only if backed by live `describe`-class command
     output the panel ran or the user supplied; otherwise tag `[STATIC-INFERENCE]`.
   - A `[STATIC-INFERENCE]` live-state claim is capped at **P1** regardless of
     reviewer count. Do NOT let it stay P0 on consensus alone.
   - Reject `echo`/comment/usage-blurb lines as evidence — those are
     documentation, not configuration. A claim resting only on such lines is
     `[STATIC-INFERENCE]` at best and is frequently `[INACCURATE]`.

2. **Verify the claim against actual code**
   - If the finding says "X is missing", grep for X in the actual codebase
   - If the finding says "X pattern is wrong", read the referenced code and check
   - If the finding cites a specific file/line, read that file and verify
   - If no reviewer cited a specific line number, flag as `[UNCITED]`

3. **Check for existing safety mechanisms**
   - Grep for DELETE, MERGE, upsert, idempotent, dry-run, duplicate, assertion
     patterns near the referenced code
   - A finding about "missing safety" is invalid if the safety exists but the
     reviewer didn't look for it

4. **Output a severity verification table:**

```
| Finding | Panel Severity | Verified? | Actual Severity | Reason |
|---------|---------------|-----------|-----------------|--------|
| ...     | P0            | No        | Not a bug       | Grep found no bf/af COALESCE pattern |
| ...     | P0            | Partial   | P1              | DELETE-then-INSERT already exists |
```

5. **External domain claim detection and web verification (v2.16.3)**

   **Why this exists:** Consensus P0 findings that depend on external domain
   knowledge bypass the Phase 12/13 dispute-verification pipeline entirely
   (because there is no dispute to trigger it). But all reviewers can be wrong
   the same way — shared model bias or shared domain knowledge gaps. In a real
   engagement (PUMA GA4 audit, 2026-04-09), all 4 reviewers unanimously flagged
   "50 months = GA4 360" as P0 without verifying whether 50 months is even a
   valid GA4 setting. The claim happened to be correct, but the panel had no
   mechanism to verify it. If the source data had been wrong, the panel would
   have confidently presented an incorrect P0.

   **For each P0/P1 finding, classify whether it depends on external knowledge:**

   - **External domain claim:** The finding's validity depends on facts outside
     the reviewed codebase — product feature limits, API behavior, regulatory
     jurisdiction, pricing tiers, platform capabilities, protocol specifications,
     third-party documentation. Examples: "50 months retention means GA4 360",
     "GDPR applies to Mexico", "this API rate-limits at 100 req/s."
   - **Internal claim:** The finding is fully verifiable from the reviewed code,
     config, or documentation. No external knowledge needed.

   **For each finding classified as external domain claim:**
   - Run a web search to verify the specific factual premise (cap: 2 searches
     per claim, 5 claims max per review)
   - Tag result: `[WEB-VERIFIED]` (confirmed by authoritative source),
     `[WEB-CONTRADICTED]` (external source disagrees — demote severity by 1 level),
     `[WEB-INCONCLUSIVE]` (no authoritative source found — flag for judge)
   - Include the source URL and key quote in the verification table
   - Regulatory/jurisdiction claims (e.g., "GDPR applies to X country") are
     ALWAYS classified as external domain claims

   **Extended severity verification table:**

   ```
   | Finding | Severity | Domain Type | Web Result | Source | Adjusted Severity |
   |---------|----------|-------------|------------|--------|-------------------|
   | ...     | P0       | External    | [WEB-VERIFIED] | support.google.com/... | P0 (confirmed) |
   | ...     | P1       | External    | [WEB-CONTRADICTED] | gdpr.eu/... | P2 (demoted) |
   | ...     | P0       | Internal    | N/A        | N/A    | P0 (code-verified) |
   ```

   **Skip condition:** If all P0/P1 findings are internal claims (fully
   verifiable from the reviewed content), skip web verification.

Results feed into the Supreme Judge prompt. The judge MUST reference the
verification table when ruling on disagreements.

**Output (v3.1.0+):** Subagent writes full output to `state/phase_11_severity_verification.md`
and returns only path + 100-word summary.

---

## Phase 12: Verification Tier Assignment (v2.11)

After Phases 8–11, collect all **unresolved dispute points** from Phase 6
summaries plus any **high-uncertainty action items** bearing `[SINGLE-SOURCE]`,
`[DISPUTED]`, or `[UNVERIFIED]` labels. Each point is assigned a depth tier that
controls the verification agent's budget and capabilities in Phase 13.

**Skip condition:** If there are zero unresolved disputes and zero unverified
action items, skip Phases 12 and 13 entirely.

### Tier Definitions

| Tier | Budget | Capabilities | When to Use | Example |
|---|---|---|---|---|
| **Light** | ~2k tokens | grep/read only, no web search | Factual claim checkable in a single file or constant lookup | "Reviewer A claims the threshold constant is 0.05 but the report says 0.5 — check the code." |
| **Standard** | ~8k tokens | Multi-file reads, import tracing, static analysis | Claim requires following logic across files or comparing multiple outputs | "Two reviewers disagree on whether the rate-limiter handles concurrent requests — trace the implementation across its dependencies." |
| **Deep** | ~32k tokens | Web search, multi-round reasoning | Requires external knowledge, novel domain, or fundamental disagreement unresolvable from code alone | "Security reviewer claims the PRNG is cryptographically weak for this use case — requires researching current best practices for the specific algorithm." |

### Assignment Pipeline (default: both steps; quick mode: step 1 only)

Tier assignment runs as a two-step pipeline. Step 1 is always fast; step 2
(the judge refinement) is the default but can be skipped by requesting
"quick tier assignment" or "confidence-based tiers only".

**Step 1 — Confidence-Based Draft (always runs; no agent needed):**

The orchestrator derives initial tier assignments from existing Phase 4
confidence ratings and debate round signals:

- **Deep**: Any reviewer rated the claim Low confidence in Phase 4, OR the
  point remained unresolved across 2+ debate rounds, OR the claim requires
  external or runtime knowledge (e.g., production behavior, third-party API
  semantics, literature validation)
- **Standard**: Any reviewer rated Medium/mixed confidence, OR unresolved for
  exactly 1 debate round, OR claim requires cross-file logic tracing
- **Light**: All reviewers rated the claim High confidence AND it is a simple
  checkable fact (file exists, value matches, line present)

Produces a draft tier table:
```
| Point # | Summary | Draft Tier | Signal (confidence ratings + rounds unresolved) |
|---------|---------|------------|------------------------------------------------|
```
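
The Step 1 rules can be sketched as a pure function over the existing signals. The parameter names are assumptions, and the final fallback to Standard (for points that match none of the three rules, such as all-High confidence on a claim that is not a simple checkable fact) is a choice made for this sketch rather than something the rules state.

```python
from typing import List


def draft_tier(confidences: List[str], rounds_unresolved: int,
               needs_external: bool = False, needs_cross_file: bool = False,
               simple_fact: bool = False) -> str:
    """Derive the confidence-based draft tier for one dispute point."""
    levels = {c.lower() for c in confidences}
    if "low" in levels or rounds_unresolved >= 2 or needs_external:
        return "Deep"
    if "medium" in levels or rounds_unresolved == 1 or needs_cross_file:
        return "Standard"
    if levels == {"high"} and simple_fact:
        return "Light"
    return "Standard"  # assumed fallback for uncovered combinations
```

Rules are evaluated from most to least expensive tier, so a single Low rating outweighs any number of High ratings.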

**Step 2 — Judge-Advised Refinement (default: on):**

A single Opus agent (Phase 12b) receives the confidence-based draft table and
all supporting context (context brief, Phase 6 summaries, Phase 7 blind finals,
completeness audit, claim and severity verification results). Its job is to
**review and refine** the draft — upgrade, downgrade, or confirm each tier with
reasoning. It also assigns the verification persona per point.

The advisor works from the draft rather than from scratch: the confidence ratings
give it the "ground-level" signal from reviewers who lived through the debate,
and the advisor's role is oversight and correction, not cold assessment from zero.

Final tier table:
```
| Point # | Summary | Draft Tier | Final Tier | Override Reason | Suggested Persona |
|---------|---------|------------|------------|-----------------|-------------------|
```

---

## Phase 13: Targeted Verification Agents (v2.11)

Dispatch one verification agent per collected dispute/action item. All Light and
Standard agents launch **in parallel**; Deep agents can also parallelize unless
they share a scarce resource (e.g., web search rate limits).

### Persona Matching

Classify each claim's type and select the matching verification persona. VoltAgent
agents are preferred when available; fall back to generic + focused prompt.

| Claim Type | Verification Persona | VoltAgent (preferred) |
|---|---|---|
| Statistical / numerical | Data Scientist | `voltagent-data-ai:data-scientist` |
| Code correctness / logic | Code Reviewer | `voltagent-qa-sec:code-reviewer` |
| Architecture / design | Architect Reviewer | `voltagent-qa-sec:architect-reviewer` |
| Security vulnerability | Security Auditor | `voltagent-qa-sec:security-auditor` |
| Performance / scalability | Performance Engineer | `voltagent-qa-sec:performance-engineer` |
| Database / SQL | Database Expert | `voltagent-data-ai:database-optimizer` |
| Infrastructure / ops | SRE | `voltagent-infra:sre-engineer` |
| Framing / narrative | Domain expert | Generic + domain context |
| Business logic / feasibility | Business Analyst | Generic + business context |
| Default / unclear | Verification Agent | Generic + focused prompt |

### Capability Limits by Tier

- **Light** (~2k tokens): May only grep/read/head/tail. Single focused query.
  Return one of `[VR_CONFIRMED]`, `[VR_REFUTED]`, `[VR_INCONCLUSIVE]` with one
  piece of quoted evidence. Do not expand scope beyond the specific claim.
- **Standard** (~8k tokens): May read multiple files, trace imports, run static
  analysis commands. Return verdict with supporting evidence from multiple sources.
  Explore adjacent code only if directly relevant to the dispute.
- **Deep** (~32k tokens): Full agent capabilities including web search and multiple
  reasoning rounds. Return a comprehensive verdict; cite external sources when they
  resolve the dispute. Scope limited to the specific dispute — do not produce a
  second full review.

### Verdict Labels

- `[VR_CONFIRMED]` — Evidence confirms the original claim
- `[VR_REFUTED]` — Evidence contradicts the claim
- `[VR_PARTIAL]` — Claim is partially supported; the agent qualifies what holds
- `[VR_INCONCLUSIVE]` — Insufficient evidence to verify either way
- `[VR_NEW_FINDING]` — Verification revealed an additional issue beyond the dispute

### Verification Round Summary

After all agents complete, compile into a summary table:

```
| Point | Tier | Persona | Verdict | Key Evidence |
|-------|------|---------|---------|--------------|
```

This table is passed to Phase 14 as input item 8.

---

## Phase 13.5: Pre-Judge Verification Gate (v3.1.0)

Before launching the Supreme Judge (Phase 14), the orchestrator MUST verify
that all mandatory phase outputs exist on disk. This gate is the load-bearing
guardrail against silent compression of Phases 4 / 5 / 7.

**Gate logic (orchestrator-executed, no subagent dispatch):**

For each reviewer in the panel, verify these files exist under `state/`
(or `state/run_<N>/` in multi-run mode):

| Required file | Phase | Mandatory |
|---|---|---|
| `reviewer_<name>_phase_3.md` | Independent review | Always |
| `reviewer_<name>_phase_4.md` | Private reflection | Always |
| `reviewer_<name>_phase_5_round1.md` | Debate round 1 | Always (rounds 2/3 per existing skip rules) |
| `reviewer_<name>_phase_7.md` | Blind final | Always |

Plus panel-level files:
- `phase_8_audit.md`
- `phase_10_claim_verification.md`
- `phase_11_severity_verification.md`

**For each required file, run three checks:**

1. **Existence check** — file is present on disk.
2. **Minimum-bytes check** — file size ≥ 500 bytes. Below this is empirically
   a stub (subagent crashed mid-write or returned a placeholder).
3. **Required-headers check** — parse the file and confirm it contains the
   required schema sections for that phase (e.g., a Phase 3 review must
   contain a Score, a Findings section, and severity tags). The exact required
   sections per phase are defined in `references/prompt-templates.md`.

**On gate failure for any file:**

1. Log loudly: `GATE FAIL: <file> missing | stub | malformed`
2. Re-dispatch the subagent for the missing/malformed phase output.
3. Re-run the gate after re-dispatch.
4. **Single retry only.** If the second attempt also fails, do NOT block the
   run. Mark the phase as unrecoverable, write the COMPRESSED RUN header in
   Phase 15.1 (see Phase 15.1 spec), and proceed to Phase 14 with the
   partial input. The deliverable is produced with explicit warning rather
   than failing entirely — partial review with loud warning beats no review.

**On full gate pass:** proceed to Phase 14. The COMPRESSED RUN header is NOT
emitted (its absence is the green light).
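
The three checks and the single-retry rule can be sketched as below. The real required sections live in `references/prompt-templates.md`, so the header strings a caller passes in are placeholders, and `redispatch` is a hypothetical callback standing in for re-running the subagent.

```python
from pathlib import Path
from typing import Callable, List

MIN_BYTES = 500  # threshold from the minimum-bytes check


def check_phase_file(path: Path, required_headers: List[str]) -> str:
    """Return 'ok', 'missing', 'stub', or 'malformed' for one state file."""
    if not path.is_file():
        return "missing"
    if path.stat().st_size < MIN_BYTES:
        return "stub"
    text = path.read_text(encoding="utf-8", errors="replace")
    if any(header not in text for header in required_headers):
        return "malformed"
    return "ok"


def gate_with_single_retry(path: Path, required_headers: List[str],
                           redispatch: Callable[[Path], object]) -> str:
    """Run the gate, re-dispatch once on failure, never block the run."""
    status = check_phase_file(path, required_headers)
    if status == "ok":
        return "ok"
    print(f"GATE FAIL: {path.name} {status}")
    redispatch(path)  # hypothetical hook: re-run the subagent for this file
    if check_phase_file(path, required_headers) == "ok":
        return "ok"
    return "unrecoverable"  # caller then writes the COMPRESSED RUN header
```

Returning `"unrecoverable"` instead of raising keeps the behaviour described above: the run continues to Phase 14 with partial input and a loud warning.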

**Debate-presence assertion (v3.5.0) — distinct from per-file compression.**
The per-file gate above re-dispatches an *individual* missing file. But the
dominant real-world failure (2026-06-06 audit: 50/51 runs had no debate) is
the *wholesale* skip: the orchestrator never runs Phase 5 at all and jumps
from independent reviews straight to the judge. So, separately from the
per-file check, **count the `reviewer_*_phase_5_round1.md` files across the
whole panel** under `state/` (or `state/run_<N>/` in multi-run mode):

- **If the count is ZERO when mode = full panel** (the entire debate phase is
  absent, not just one reviewer's file), this is the **NO-DEBATE** condition.
  Do NOT proceed to Phase 14 silently. Instead:
  1. **Preferred — run Phase 5 now.** Debate was skipped; execute it (round 1
     is non-skippable per the protocol) and re-run this assertion.
  2. **If debate is genuinely unavailable for this execution shape** (e.g.,
     the run is executing as a parallel Workflow / ultracode fan-out with no
     sequential cross-talk primitive — see *Debate inside a Workflow* below),
     stamp the **`[NO-DEBATE]` banner** (Phase 15.1 / 15.3) and lower the
     verdict confidence (cap at Medium). The judge still rules, but the report
     announces that no adversarial cross-examination happened.
- **If the count is ≥ 1**, debate ran; no NO-DEBATE banner. (Individual
  missing round-1 files for *some* reviewers remain a per-file COMPRESSED
  case, handled above.)
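
The panel-wide count is a single glob. A minimal sketch, assuming the mode is passed in as the string `"full panel"` and that `debate_status` (a hypothetical name) only reports the condition, leaving the choice between running Phase 5 and stamping the banner to the orchestrator:

```python
from pathlib import Path


def debate_status(state_dir: Path, mode: str = "full panel") -> str:
    """Return 'NO-DEBATE' when no round-1 debate file exists for any reviewer."""
    round1_files = list(state_dir.glob("reviewer_*_phase_5_round1.md"))
    if mode == "full panel" and not round1_files:
        return "NO-DEBATE"
    return "OK"
```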

**Detection is not solely anchored here.** Phase 13.5 does not fire on every
execution shape (an inline/workflow-shaped run can skip this gate entirely —
that is exactly how the audit's silent skips slipped through). The
load-bearing NO-DEBATE check therefore *also* runs at the Phase 15.1
report-write chokepoint (every completed run passes through it). See Phase
15.1.

**Why bytes + headers, not just existence:** A subagent can write a stub and
crash, leaving an empty/partial file. Existence alone passes the gate on a
stub. Bytes + required-headers makes the check load-bearing. This mirrors
how the Phase 15 verification gate (v2.16.4) validates HTML output
structurally, not just by file presence.

---

## Phase 14: Supreme Judge

Single agent (`model: "opus"`). The launch prompt is ~200 tokens of metadata:
the paths to the state files produced by Phases 3, 4, 5, 7, 8, 10, 11, and
13. The judge **reads state files on demand** using the Read tool — it does
NOT receive verbatim phase outputs pre-stuffed into its launch prompt. This
mirrors the Phase 15.3 HTML-agent pattern (v2.16.4) and caps the judge's
window load even when the panel has produced hundreds of kilobytes of
material.

The judge's ruling is materialized to `state/phase_14_judge_ruling.md` so
Phase 15.1 can later consume it from disk (rather than from chat).

Steps (in order):
0. Review verification results (claims, severity, commands, **and verification round**)
0.5a-b. Verify audit findings, anti-rhetoric assessment
0.5c. Severity dampening — minimum evidence-justified severity. **In Precise mode, findings without code citations cannot exceed P2.** **Live-State Claim Discipline (v3.3.0): a live-infrastructure/runtime-state claim tagged `[STATIC-INFERENCE]` cannot exceed P1, and a P0 that one cheap read-only command could falsify (with no agent having run it) is capped at P1 until verified.**
0.5d. Coverage check — flag unexamined risk categories, scan source for gaps
1-3. Debate quality, disagreement rulings, consensus correctness. **A `[STATIC-INFERENCE-CONSENSUS]` point — multiple reviewers agreeing off the same artifact lines — counts as one source, not independent verification.**
4-5. Absent-safeguard check, independent gap scan, score assessment
6-7. Epistemic label classification (including `[LIVE-VERIFIED]` / `[STATIC-INFERENCE]` / `[STATIC-INFERENCE-CONSENSUS]` for live-state claims), final verdict
8-9. Action items, meta-observation
10. **Write ruling to `{state_dir}/phase_14_judge_ruling.md`** (v3.1.0+).

See `references/prompt-templates.md` for the full judge prompt.

---

## Phase 14.5: Post-Judge Verification Gate (v3.2.0)

The Supreme Judge in Phase 14 can introduce **new** P0/P1 findings as a
side effect of its Step-0 Verification Review — findings the panel never
raised. Phase 11 (Severity Verification) only re-verifies panel-raised
P0/P1, so judge-introduced findings bypass every prior verification phase.
A 2026-04-27 README review run produced a hallucinated "12 unresolved git
conflict markers" P0 (the file was clean — `wc -l` and `grep -c` both
confirmed) and that single fabricated finding drove a 3/10
REJECT-AND-REWRITE verdict (issue [#41](https://github.com/wan-huiyan/agent-review-panel/issues/41)).

Phase 14.5 closes this gap by re-verifying every judge-introduced P0/P1
against ground truth before Phase 15.1 generates the report.

A single Opus agent runs after Phase 14 and before Phase 15.1. Inputs are
the paths to `{state_dir}/phase_14_judge_ruling.md`,
`{state_dir}/phase_11_severity_verification.md`, and
`{state_dir}/phase_8_audit.md`. The agent has grep / Read / Bash tools.

Steps:
1. Classify each P0/P1 finding in the judge ruling as `[PANEL-RAISED]`
   (skip — covered by Phase 11) or `[JUDGE-INTRODUCED]` (verify here).
2. For every `[JUDGE-INTRODUCED]` finding, run a ground-truth check
   appropriate to the claim type (location, state, existence, external
   domain) using grep/Read/git/Bash. Quote actual command output.
3. Issue a verdict per finding: `[JUDGE-CONFIRMED]` (passes through),
   `[JUDGE-HALLUCINATED]` (demote to P3 or remove if actively
   contradicted), or `[JUDGE-PARTIAL]` (demote one level, edit to retain
   only the replicated sub-claim).
4. If any P0 was demoted/removed, recompute the verdict score against
   the panel mean and document the override.
5. Write the full verification table to
   `{state_dir}/phase_14_5_judge_verification.md`. Phase 15.1 reads it.
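
Step 3 above maps each gate verdict to an effect on the finding. A minimal sketch of that mapping follows, assuming hypothetical field names (`origin`, `severity`, `gateVerdict`) and a caller-supplied `activelyContradicted` flag; the real agent works from the prompt in `references/prompt-templates.md`.

```javascript
const LEVELS = ["P0", "P1", "P2", "P3"];

// Demote by one severity level, never below P3.
function demoteOne(severity) {
  const i = LEVELS.indexOf(severity);
  return LEVELS[Math.min(i + 1, LEVELS.length - 1)];
}

// Returns the finding as it should reach Phase 15.1, or null when removed.
function applyGateVerdict(finding, verdict, activelyContradicted = false) {
  // Panel-raised findings were already re-verified in Phase 11: skip.
  if (finding.origin === "PANEL-RAISED") return finding;
  switch (verdict) {
    case "JUDGE-CONFIRMED":
      return finding;
    case "JUDGE-PARTIAL":
      return { ...finding, severity: demoteOne(finding.severity), gateVerdict: verdict };
    case "JUDGE-HALLUCINATED":
      // Remove when ground truth actively contradicts the claim, else demote to P3.
      if (activelyContradicted) return null;
      return { ...finding, severity: "P3", gateVerdict: verdict };
    default:
      throw new Error(`Unknown verdict: ${verdict}`);
  }
}
```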

**Phase 15.1 banner.** When the gate produces any `[JUDGE-HALLUCINATED]`
entry, Phase 15.1 MUST emit this block immediately after the
Executive Summary (and after the Compressed Run banner if present):

```markdown
> ⚠️ **Judge Verification:** N judge-introduced finding(s) flagged as
> [JUDGE-HALLUCINATED] in Phase 14.5. Verdict score replaced with panel
> mean (X/10 → Y/10). Affected action items below carry the
> [JUDGE-HALLUCINATED] suffix.
```

Affected action items keep the `[JUDGE-HALLUCINATED]` epistemic-label
suffix in both the markdown report and the HTML dashboard expandable
issue card metadata.
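
As a rough sketch of how the banner placeholders could be filled, the helper below computes a panel mean and renders the block only when at least one finding was flagged. The function names and the one-decimal rounding are assumptions made for illustration; the source does not specify how the mean is rounded.

```javascript
// Mean of the reviewers' final scores, rounded to one decimal (assumed rounding).
function panelMean(finalScores) {
  const sum = finalScores.reduce((acc, s) => acc + s, 0);
  return Math.round((sum / finalScores.length) * 10) / 10;
}

// Returns the markdown banner, or null when no finding was flagged.
function judgeVerificationBanner(hallucinatedCount, judgeScore, meanScore) {
  if (hallucinatedCount === 0) return null;
  return [
    `> ⚠️ **Judge Verification:** ${hallucinatedCount} judge-introduced finding(s) flagged as`,
    `> [JUDGE-HALLUCINATED] in Phase 14.5. Verdict score replaced with panel`,
    `> mean (${judgeScore}/10 → ${meanScore}/10). Affected action items below carry the`,
    `> [JUDGE-HALLUCINATED] suffix.`,
  ].join("\n");
}
```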

**Empty case.** If the gate produces zero `[JUDGE-INTRODUCED]` findings
(every P0/P1 was already panel-raised), it writes a stub file with the
single line "No judge-introduced findings to verify" so Phase 15.1's
disk-read still succeeds.

See `references/prompt-templates.md` for the full Judge-Output
Verification Agent prompt.

---

## Phase 15: Output Generation

Three output files are written at the end of every review. They are produced
in strict sequence: Phase 15.1 first, then Phase 15.2, then Phase 15.3.
Phase 15.3 runs AFTER Phase 15.2 (not in parallel) so that the Phase 15.3
agent can read the already-written Phase 15.1 and 15.2 files from disk,
avoiding the need for the orchestrator to inject all structured data and
process history into the agent prompt from its own context window.

---

### Phase 15.1: Primary Markdown Report

Write structured summary to `review_panel_report.md` (or user-specified name).
This is the main deliverable — concise, structured, action-oriented.

**Compressed-run warning (v3.1.0+):** If the Phase 13.5 verification gate
detected any unrecoverable missing phase output, Phase 15.1 MUST emit this
block as the FIRST content of the report (before any other section,
including Executive Summary):

```markdown
> ⚠️ **COMPRESSED RUN — Phases skipped: <comma-separated list, e.g., "4 (security), 5 (security, devils-advocate)">**
>
> This run did not complete the full panel protocol. The Supreme Judge ruled
> on partial input. Findings below should be treated as **lower confidence**
> than a full-run report. Re-run the panel for a complete review.
```

Additionally, in compressed runs, every action item MUST have `[COMPRESSED]`
appended to its epistemic label (e.g., `[CONSENSUS][COMPRESSED]`,
`[VERIFIED][COMPRESSED]`).

For full runs, the warning block is absent. Its absence is the green-light
signal that the panel completed the full protocol.

**No-debate warning (v3.5.0) — the load-bearing debate-skip chokepoint.**
Phase 15.1 is the terminal step every completed run passes through, so the
NO-DEBATE detection is anchored HERE (not only in the Phase 13.5 gate, which
an inline/workflow-shaped run can skip — that is precisely how the audit's
silent skips slipped past). **Before writing the report, the orchestrator
MUST independently check whether any `reviewer_*_phase_5_round1.md` state
files exist** (under `state/`, or any `state/run_<N>/` in multi-run mode). If
**none exist** — the adversarial debate (Phase 5) did not run for this panel —
Phase 15.1 MUST emit this block as report content (immediately AFTER the
COMPRESSED RUN block if one is present, otherwise FIRST, before the Executive
Summary):

```markdown
> ⚠️ **[NO-DEBATE] — adversarial debate (Phase 5) did not run.**
>
> Reviewers evaluated independently but never cross-examined each other's
> findings. The Supreme Judge reconciled disagreements alone, without a debate
> record. Treat consensus and disagreement rulings as **lower confidence** —
> no reviewer had the chance to revise a verdict in light of a peer's. For a
> high-stakes or adversarial-tradeoff decision, re-run the **full** panel with
> debate (invoke as a skill, not a workflow), or use the debate-in-Workflow
> recipe.
```

In a no-debate run:
- Every action item MUST have `[NO-DEBATE]` appended to its epistemic label
  (e.g., `[CONSENSUS][NO-DEBATE]`, `[SINGLE-SOURCE][NO-DEBATE]`).
- The `**Confidence:**` header field MUST be capped at **Medium** (if the
  judge ruled High, lower it one level to Medium and note why); a no-debate
  run can never report High confidence.
- The "Debate Rounds + Summaries" collapsible in Detailed Reviews renders the
  placeholder "No debate rounds — Phase 5 did not run for this panel."
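
The detection and the two report adjustments above can be sketched as follows. The helper takes a list of state-file paths relative to the review output directory, so the caller decides how to list the directory; the report shape (`confidence`, `actionItems`, `labels`) is hypothetical.

```javascript
// Matches state/reviewer_<name>_phase_5_round1.md, optionally under state/run_<N>/.
const ROUND1_FILE = /^state\/(run_\d+\/)?reviewer_[^/]+_phase_5_round1\.md$/;

// True when at least one Phase 5 round-1 state file exists.
function debateRan(stateFiles) {
  return stateFiles.some((f) => ROUND1_FILE.test(f));
}

// Apply the no-debate rules: cap confidence at Medium, tag every action item.
function applyNoDebate(report) {
  const order = ["Low", "Medium", "High"];
  const capped = order.indexOf(report.confidence) > order.indexOf("Medium");
  return {
    ...report,
    confidence: capped ? "Medium" : report.confidence,
    confidenceNote: capped ? "Lowered from High: Phase 5 debate did not run." : null,
    actionItems: report.actionItems.map((item) => ({
      ...item,
      labels: [...item.labels, "NO-DEBATE"],
    })),
  };
}

// Render labels in the report's bracket style, e.g. [CONSENSUS][NO-DEBATE].
function labelString(labels) {
  return labels.map((l) => `[${l}]`).join("");
}
```

A Phase 15.1 implementation would call `applyNoDebate` only when `debateRan` returns `false`.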

**Banner stacking & the COMPRESSED overlap.** COMPRESSED (per-file loss) and
NO-DEBATE (wholesale Phase-5 absence) are distinct signals and **stack**: when
both apply, render NO-DEBATE first (it is the more specific, higher-severity
signal for the verdict), then COMPRESSED. NO-DEBATE is the named signal for
zero Phase-5 output, so a COMPRESSED block need not also enumerate "5" in its
phases-skipped list when the NO-DEBATE banner is present. For full runs where
debate ran, the NO-DEBATE block is absent; its absence is the green-light
signal that adversarial debate occurred.

```markdown
# Review Panel Report
**Work reviewed:** {title/path}  |  **Date:** {today}
**Panel:** {N} reviewers + Auditor + Judge
**Verdict:** {recommendation}  |  **Confidence:** {High|Medium|Low}
**Auto-detected signals:** {list or "None — base set used"}
**Review mode:** {Precise|Exhaustive|Mixed} (auto-detected from content type)
**Data flow trace:** {Standard|Thorough|Exhaustive} tier | {N} paths traced | {M} invariant violations (v2.14)
{If skipped: "**Data flow trace:** Skipped ({reason — pure docs / no transforms / plan-only})"}
**Codebase state:** {branch name} | {N commits behind {default_branch}} | {worktree: yes/no}
{If multi-run: "**Runs:** {N} (personas rotated per schedule)"}
{If multi-run: "**Run stability:** {X}% of findings appeared in 2+ runs | {Y} single-run findings"}
{If stale: "⚠️ STALE BRANCH — {N} commits merged to {default_branch} since branch point. Findings about missing code should be verified against {default_branch}."}

## Executive Summary
{Judge's verdict, 3-5 sentences. Score X/10.}
{If score spread < 2: Correlation Notice about shared model biases}
{If Low confidence: "⚠️ HUMAN REVIEW RECOMMENDED"}

## Scope & Limitations
{What was reviewed. What CANNOT be evaluated: runtime behavior, production
data, security via dynamic analysis. Structural limitation: shared base model.}
Epistemic labels: [VERIFIED] [CONSENSUS] [SINGLE-SOURCE] [UNVERIFIED] [DISPUTED] [WEB-VERIFIED] [WEB-CONTRADICTED] [WEB-INCONCLUSIVE] [JUDGE-HALLUCINATED] [LIVE-VERIFIED] [STATIC-INFERENCE] [STATIC-INFERENCE-CONSENSUS]
Defect type labels: [EXISTING_DEFECT] (bug in current code) [PLAN_RISK] (risk if plan is implemented as written)

## Score Summary
| Reviewer | Persona | Intensity | Initial | Final | Recommendation |

## Consensus Points
{Bullet list of points all/most reviewers agreed on, confirmed by judge}

## Disagreement Points (with judge rulings)
{Each disagreement: Side A, Side B, Verification Round result if run, Judge's ruling with reasoning}

## Completeness Audit Findings
{New issues found by auditor, verified by judge}

## Coverage Gaps (if any)
{Risk categories no reviewer examined, with judge's independent assessment}

{If multi-run: "## Run Comparison"}
{If multi-run: Table showing which findings appeared in which runs, with stability labels}

## Action Items (with severity AND epistemic labels{, and stability labels if multi-run})

## Detailed Reviews (collapsible sections)
- Data Flow Map (Phase 2, v2.14) — if tracer ran
- Round 0: Independent Reviews
- Private Reflections
- Debate Rounds + Summaries
- Final Blind Assessments
- Completeness Audit
- Verification Command Execution Results
- Claim Verification Report
- Severity Verification Table
- Verification Tier Assignment (Phase 12)
- Targeted Verification Results (Phase 13)
- Supreme Judge Full Analysis
```

---

### Phase 15.2: Full Agent Process History

Write `review_panel_process.md` — the "director's cut". This is a complete,
chronological, verbatim log of every agent's output with nothing summarized away.
The orchestrator assembles this from accumulated outputs; no new agent needed.

**Persona profiles are embedded** at the point each agent first enters the flow:
before each agent's output, a structured "Persona Profile" block documents that
agent's role, expertise, reasoning strategy, agreement intensity (for panelists),
matched-claim-type (for Phase 13 agents), and which phases they participated in.
This makes the process history fully self-explanatory to a reader who wasn't present.

Structure (in order, verbatim for each):

```
Persona Profiles Registry (at top)
  - All panelist profiles listed before any review output
  - Phase 12b tier advisor profile
  - Phase 13 verification agent profiles (added as they are assigned)
  - Supreme judge profile

Phase 1: Setup
  - Context Brief (full)
  - Persona selection rationale
  - Review mode detection

Phase 3: Independent Reviews
  - [Persona Profile — Persona A] full profile block
  - [Persona A] Full review text
  - [Persona Profile — Persona B] full profile block
  - [Persona B] Full review text
  - ... (all N)

Phase 4: Private Reflections
  - [Persona A] Full reflection + per-finding confidence ratings
  - [Persona B] Full reflection
  - ... (all N)

Phase 5: Debate Rounds
  - Round 1: All reviewer responses (verbatim)
  - Phase 6 Summary: Resolved / Still in dispute / New discoveries
  - Round 2: All reviewer responses (if run)
  - Phase 6 Summary: ...
  - Round 3: ... (if run)

Phase 7: Blind Final Assessments
  - [Persona A] Final score, top 3 points, recommendation, verdict
  - [Persona B] ...
  - ... (all N, unsealed)

Phase 8: Completeness Audit
  - Full auditor output

Phase 9: Verification Command Execution
  - Each command run, raw output, annotation

Phase 10: Claim Verification
  - Full verification table + flagged claims

Phase 11: Severity Verification
  - Full severity verification table + reasoning per finding

Phase 12: Verification Tier Assignment
  - Phase 12a: Confidence-based draft table (with signals)
  - [Persona Profile — Tier Refinement Advisor] profile block
  - Phase 12b: Tier refinement advisor full output (overrides + reasoning)

Phase 13: Targeted Verification Agents
  - [Persona Profile — Verification Agent: Point #1] full profile block
    (role, matched-claim-type, why matched, tier, VoltAgent subagent or generic)
  - [Point #1 — Tier — Persona] Full investigation trail, what was searched,
    what was found, full reasoning, verdict
  - [Persona Profile — Verification Agent: Point #2] ...
  - [Point #2 ...] (all N verification agents, verbatim)

Phase 14: Supreme Judge Deliberation
  - [Persona Profile — Supreme Judge] profile block
  - Full judge output (all steps, unabridged)
```

See `references/prompt-templates.md` for the Phase 15.2 assembly spec.

---

### Phase 15.3: Interactive HTML Report

Launch a single Opus agent to write `review_panel_report.html` — a polished,
self-contained single-file interactive dashboard with **expandable issue
cards** (v2.15).

**CRITICAL — Data passing strategy (v2.16.4 context-pressure fix):** Do NOT
inject the structured data or process history into the agent prompt from the
orchestrator's context. Instead, the agent prompt MUST instruct the agent to
read from disk:
1. Read `review_panel_report.md` (already written by Phase 15.1) for all
   structured summary data (verdict, scores, action items, consensus, etc.)
2. Read `review_panel_process.md` (already written by Phase 15.2) for
   verbatim reviewer narratives, debate transcripts, judge rulings, and
   verification agent trails — extracting per-finding content for the
   10-section accordion
3. Read `references/prompt-templates.md` starting from the line
   `## Phase 15.3: HTML Report Generation Prompt` for the full rendering
   spec (HTML structure, CSS, JS, expandable card schema, filter logic,
   Prism.js setup, print styles)

**Path resolution (CRITICAL):** The orchestrator MUST resolve all paths to
absolute paths before including them in the Phase 15.3 agent prompt. The
subagent has no knowledge of the skill installation directory or the user's
output directory. Substitute:
- `{output_dir}` → the actual resolved output directory (where Phase 15.1
  wrote `review_panel_report.md`)
- `{skill_dir}` → the absolute path to the skill's installation directory (the
  one that contains `references/`, so `{skill_dir}/references/prompt-templates.md` resolves)
- If the user specified a custom output name (e.g., `--output my_review.md`),
  use the actual filenames, not the defaults

The orchestrator's Phase 15.3 launch prompt should be SHORT (~10 lines):
```
You are the Phase 15.3 HTML Report Agent. Generate `{output_dir}/{html_filename}`
by reading these files:
1. {output_dir}/{report_filename} — structured review data
2. {output_dir}/{process_filename} — verbatim narratives and transcripts
3. {skill_dir}/references/prompt-templates.md (search for "Phase 15.3: HTML
   Report Generation Prompt") — the authoritative rendering spec
Follow the rendering spec exactly. Write the complete HTML file.
```

This keeps the orchestrator's launch prompt to roughly 10 lines (under 200
tokens) instead of a 700+ line injected prompt, eliminating the
context-pressure failure mode.
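
One way to assemble that short prompt from resolved paths is sketched below. `buildLaunchPrompt` and its parameter names are hypothetical, and `skillDir` is taken to be the directory that contains `references/`, matching the path used in the launch prompt above. The function rejects relative paths rather than guessing a base directory.

```javascript
// Accept POSIX absolute paths and Windows drive paths.
function isAbsolute(p) {
  return p.startsWith("/") || /^[A-Za-z]:[\\/]/.test(p);
}

function buildLaunchPrompt({
  outputDir,
  skillDir,
  reportFilename = "review_panel_report.md",
  processFilename = "review_panel_process.md",
  htmlFilename = "review_panel_report.html",
}) {
  for (const dir of [outputDir, skillDir]) {
    if (typeof dir !== "string" || !isAbsolute(dir)) {
      throw new Error(`Path must be absolute: ${dir}`);
    }
  }
  // Drop trailing separators so joined paths do not contain "//".
  const out = outputDir.replace(/[\\/]+$/, "");
  const skill = skillDir.replace(/[\\/]+$/, "");
  return [
    `You are the Phase 15.3 HTML Report Agent. Generate \`${out}/${htmlFilename}\``,
    "by reading these files:",
    `1. ${out}/${reportFilename} (structured review data)`,
    `2. ${out}/${processFilename} (verbatim narratives and transcripts)`,
    `3. ${skill}/references/prompt-templates.md (search for "Phase 15.3: HTML`,
    "   Report Generation Prompt\") for the authoritative rendering spec",
    "Follow the rendering spec exactly. Write the complete HTML file.",
  ].join("\n");
}
```

Passing custom filenames (for example when the user supplied `--output my_review.md`) overrides the defaults without changing the prompt shape.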

**Features:**
- Dashboard overview: verdict, score, panel composition at a glance
- Stats row: issue counts by severity (P0–P3), tier (Light/Standard/Deep),
  verdict (VR_CONFIRMED/VR_REFUTED/VR_PARTIAL/VR_INCONCLUSIVE/VR_NEW_FINDING)
- Charts: confidence distribution, tier breakdown (donut), verdict breakdown
  (horizontal bar), pipeline flow (issues entering/surviving each verification phase)
- **Panel Gallery**: collapsible section with avatar cards for every agent —
  panelists (role, agreement intensity, reasoning strategy, phase badges), Phase
  13 verification specialists (matched claim type, why matched, tier, "verified
  N items" count), and support agents (auditor, judge, tier advisor). Clicking a
  panelist card filters the issue list to items they raised.
- **Expandable issue cards (v2.15)**: each card is a native `<details>` element.
  The collapsed state shows the one-line summary; the expanded state reveals a
  10-section accordion (each section is its own nested `<details>`):
  1. 📖 **Narrative** — full reviewer reasoning (verbatim, not summarized)
  2. 📄 **Code Evidence** — file:line snippets with Prism.js syntax highlighting
  3. 👥 **Raised by** — per-reviewer severity + reasoning grid
  4. 🔍 **Verification Trail** — full VR agent output (if verified)
  5. 💬 **Debate** — round-by-round transcript (if disputed)
  6. ⚖️ **Judge Ruling** — full reasoning + severity-change explanation
  7. 🛠️ **Fix Recommendation** — proposed change + before/after code + regression test + blast radius + effort
  8. 🔗 **Cross-references** — related findings with relationship labels
  9. 🏷️ **Epistemic Tags** — hover tooltips explaining each label
  10. 📊 **Prior Runs** — meta-review comparison (if multi-run)

  Empty sections render "No {section} data" placeholders — all 10 sections
  always present for consistent card structure.
- **Deep-link support**: `report.html#issue-A1` auto-opens that card and scrolls
- **Keyboard navigation**: ↑/↓ between cards, Enter expands, Home/End jump to first/last, `/` focuses search
- **Expand all / Collapse all** controls at the top of the Issues tab
- **Print-friendly**: `@media print` forces all details open, inverts theme, hides charts
- Filter bar: filter by severity, tier, verdict, epistemic label simultaneously
- Sort controls: by severity, confidence, tier
- Inline CSS/JS; Tailwind CSS, Chart.js, and **Prism.js** (v2.15, new) loaded via CDN

**Chart.js wrapper-div mandate (v3.2.0).** Every Chart.js `<canvas>` MUST
be wrapped in a `<div style="position: relative; height: 220px; width: 100%;">`
(or equivalent class) with explicit pixel height. A bare `height` attribute
on the `<canvas>` is a no-op when `responsive: true`, and the dashboard always
sets `maintainAspectRatio: false`. Without a height-bounded parent, the canvas
therefore grows on every layout pass, producing unbounded vertical growth on
open, scroll, resize, or interaction. See issue
[#42](https://github.com/wan-huiyan/agent-review-panel/issues/42) for the
2026-04-27 reproduction. The Phase 15.3 prompt enforces this; the test
suite asserts every `<canvas>` has a position-relative height-bounded
parent.
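
A minimal sketch of a helper that emits a compliant wrapper is shown below. `chartContainer` and `CHART_OPTIONS` are hypothetical names; only the wrapper style and the two Chart.js options come from the mandate above.

```javascript
// Options the dashboard is described as using for every chart.
const CHART_OPTIONS = { responsive: true, maintainAspectRatio: false };

// Wrap a canvas in a position-relative parent with an explicit pixel height.
function chartContainer(canvasId, heightPx = 220) {
  if (!Number.isInteger(heightPx) || heightPx <= 0) {
    throw new Error(`heightPx must be a positive integer, got ${heightPx}`);
  }
  return [
    `<div style="position: relative; height: ${heightPx}px; width: 100%;">`,
    `  <canvas id="${canvasId}"></canvas>`,
    `</div>`,
  ].join("\n");
}
```

Generating every chart slot through one helper like this makes the "every `<canvas>` has a height-bounded parent" assertion hold by construction.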

See `references/prompt-templates.md` for the Phase 15.3 agent prompt with the
full 10-section schema and rendering spec.

**Compressed-run banner (v3.1.0+):** If the source Phase 15.1 markdown
report begins with the `⚠️ COMPRESSED RUN` blockquote, Phase 15.3 MUST render
a prominent red banner at the top of the HTML body containing the same
warning text. Suggested CSS:

```html
<div role="alert" style="background:#FEE2E2; color:#991B1B; padding:1rem 1.25rem; margin:1rem 0; border:2px solid #DC2626; border-radius:6px;">
  <strong>⚠️ COMPRESSED RUN — Phases skipped: <list></strong>
  <p>This run did not complete the full panel protocol. ... Re-run the panel for a complete review.</p>
</div>
```

The banner appears above the report header summary card.

**No-debate banner (v3.5.0):** If the source Phase 15.1 markdown report
contains the `⚠️ [NO-DEBATE]` blockquote, Phase 15.3 MUST render a prominent
amber banner at the top of the HTML body with the same warning text. Use a
distinct amber/orange palette so it reads as separate from the red COMPRESSED
banner; when both are present, render NO-DEBATE first, then COMPRESSED.
Suggested CSS:

```html
<div role="alert" style="background:#FEF3C7; color:#92400E; padding:1rem 1.25rem; margin:1rem 0; border:2px solid #D97706; border-radius:6px;">
  <strong>⚠️ [NO-DEBATE] — adversarial debate (Phase 5) did not run.</strong>
  <p>Reviewers evaluated independently but never cross-examined each other. The judge reconciled disagreements without a debate record — treat rulings as lower confidence. Re-run the full panel with debate for high-stakes decisions.</p>
</div>
```

The banner appears above the report header summary card (and above the
COMPRESSED banner if both are present).

---

### Phase 15 Verification Gate (MANDATORY — v2.16.4)

Before reporting completion, verify ALL THREE output files exist by checking
that each file was successfully written (e.g., `ls -la review_panel_report.md
review_panel_process.md review_panel_report.html`).

**If all three files exist:** proceed to the completion message below.

**If `review_panel_report.html` is missing (Phase 15.3 failed):**
1. Log: "Phase 15.3 HTML report generation failed. Retrying..."
2. Retry Phase 15.3 ONCE with the same disk-reading prompt (the agent reads
   from disk, so no orchestrator context re-assembly is needed)
3. After retry, verify again
4. If the file now exists: proceed to completion message
5. If still missing after retry: report the two files that DO exist, and
   tell the user: "The HTML report could not be generated automatically.
   To generate it manually, say: **generate the HTML review report**"
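
The gate logic above can be sketched with the file check and the retry injected as callbacks, which keeps the control flow easy to test. `runOutputGate` is a hypothetical name, and the sketch only implements the HTML-missing recovery path that the steps describe; any other missing file is simply reported.

```javascript
const DEFAULT_NAMES = {
  report: "review_panel_report.md",
  process: "review_panel_process.md",
  html: "review_panel_report.html",
};

// fileExists(name) -> boolean; retryHtml() re-launches the Phase 15.3 agent.
function runOutputGate(fileExists, retryHtml, names = DEFAULT_NAMES) {
  const all = Object.values(names);
  const findMissing = () => all.filter((n) => !fileExists(n));
  let missing = findMissing();
  let retried = false;
  // Retry exactly once, and only when the HTML report is the sole missing file.
  if (missing.length === 1 && missing[0] === names.html) {
    console.log("Phase 15.3 HTML report generation failed. Retrying...");
    retryHtml();
    retried = true;
    missing = findMissing();
  }
  return {
    ok: missing.length === 0,
    retried,
    missing,
    written: all.filter((n) => !missing.includes(n)),
  };
}
```

When `ok` is `false` after a retry, `written` holds the files to report to the user alongside the manual recovery instruction.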

**Completion message (only after verification passes):**
Tell user:
- Paths to all output files that were successfully written
- Verdict + score (from primary report)
- Counts: consensus points, disagreements, action items, verification verdicts
- Top P0 action item (if any)
- Note: HTML report requires internet connection for Tailwind CSS, Chart.js, and Prism.js CDNs
- HTML footer should read "Agent Review Panel v3.5.0" (MUST match the full semver from `plugin.json` — update this line whenever the version is bumped)

---

### Manual HTML Report Recovery (v2.16.4)

If the user asks to "generate the HTML report" or "generate the HTML review
report" after a review has completed (whether Phase 15.3 failed or the user
wants to regenerate), launch the Phase 15.3 agent with the same disk-reading
prompt described above. Resolve all paths to absolute paths. The agent MUST:
1. Read the Phase 15.1 output file (e.g., `review_panel_report.md`) for
   structured data — use the actual filename from the completed review
2. Read the Phase 15.2 output file (e.g., `review_panel_process.md`) for
   verbatim content
3. Read the skill's `references/prompt-templates.md` (absolute path) starting
   from "Phase 15.3: HTML Report Generation Prompt" for the rendering spec

Do NOT write a generic styled HTML page from the orchestrator's memory of the
review. The spec in `references/prompt-templates.md` is authoritative — it
specifies Tailwind CSS, Chart.js, Prism.js, the 10-section expandable accordion,
Panel Gallery, filter logic, keyboard navigation, deep-linking, and print styles.
Any HTML report that does not follow this spec is non-compliant.

---

## Review-Mode Spectrum & Debate-in-Workflow (v3.5.0)

This skill is the **full adversarial panel** — its distinguishing feature is
Phase 5–7 **debate** (reviewers cross-examine each other before a judge rules).
Debate is expensive (sequential cross-talk) and only pays off when reviewers
would genuinely change each other's verdicts. A 2026-06-06 audit of 51 real
runs found debate ran in only 1 — most reviews were (correctly or not) routed
to debate-less fast modes. Pick the mode deliberately:

| Want | Use | Debate? |
|---|---|---|
| Fast eyes on a tiny PR | `code-review` / `single-agent-multi-persona-review` | no |
| Independent parallel lenses, small PR, autonomous multi-PR run | `parallel-panel-streamlined-no-debate` | no (by design) |
| **Adversarial tradeoff, high-stakes gating, debate would change the verdict** (security vs perf, "is this P0 real", merge go/no-go) | **this skill (full panel) — invoke as a skill, NOT a workflow** | **yes (Phase 5–7)** |

**Why "invoke as a skill, not a workflow" matters.** Debate lives in this
skill's Agent-tool orchestration. The Workflow / ultracode engine is a
*parallel fan-out* engine (`parallel()` / `pipeline()` — agents never see each
other) whose canonical recipe is literally "find → verify → judge". Running
"review this in ultracode" therefore produces a structurally debate-less run,
and the panel's NO-DEBATE banner will fire. If you *want* the panel's depth
under a Workflow, you must author debate as an explicit phase.

### Debate-in-Workflow recipe (ultracode-mode)

Debate IS achievable inside a Workflow — it just isn't the default shape. The
trick: a debate "round" is just re-spawning each reviewer agent **with its
peers' prior-round findings injected** (it reads peer state files). Sequential
cross-talk becomes a pipeline where stage N reads stage N−1's siblings:

```js
// 1. Independent reviews (Phase 3), in parallel. No cross-talk yet.
const reviews = await parallel(PERSONAS.map(p => () =>
  agent(`Review the work as ${p.name}. Write findings to state/reviewer_${p.key}_phase_3.md`,
        {phase: 'Review', schema: FINDINGS_SCHEMA})));

// 2. Debate round 1 (Phase 5): each reviewer re-runs WITH every peer's findings
//    as input, and rebuts/revises. This is the cross-examination, and it is the
//    step that writes the round-1 state files the NO-DEBATE check looks for.
const debate = await parallel(PERSONAS.map((p, i) => () =>
  agent(`You are ${p.name}. Your independent findings: ${JSON.stringify(reviews[i])}.
         Your peers' findings: ${JSON.stringify(reviews.filter((_, j) => j !== i))}.
         Where do you concede, push back, or find a NEW issue their angle exposes?
         Write to state/reviewer_${p.key}_phase_5_round1.md`,
        {phase: 'Debate', schema: REBUTTAL_SCHEMA})));

// 3. Judge reconciles WITH the debate record (not alone).
const ruling = await agent(`Adjudicate. Read the phase_3 and phase_5_round1 state files;
  rule on each disagreement citing how the debate moved (or didn't).`,
  {phase: 'Judge', schema: RULING_SCHEMA});
```

Authoring an explicit `Debate` phase (the audit's one debating run used
`phases [Review, Debate, Audit+Verify, Judge]`) is what makes the round-1
state files exist — which in turn satisfies the NO-DEBATE check. A Workflow
that skips the `Debate` phase will (correctly) get the NO-DEBATE banner.

---

## Multi-Run Union Protocol (v2.14)

A single panel run catches ~60–70% of discoverable issues. Independent runs
with rotated persona compositions have only ~30% finding overlap — meaning
each run catches issues the others miss. For high-stakes reviews, the
Multi-Run Union Protocol runs the panel N times and merges results.

### Invocation

- **Flag:** `--runs N` (explicit count)
- **Natural language:** "run 3 times and merge", "multi-run review",
  "run twice with different reviewers", "maximum coverage review"
- **Default:** N=1 (no merge, single-run mode)
- **"Multi-run" without N:** defaults to 2
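
The defaults above reduce to a small resolution rule, sketched here. `resolveRunCount` and its parameters are hypothetical; `explicitCount` stands for a count taken from `--runs N` or from a phrase such as "run 3 times", and detecting those phrases is left to the orchestrator.

```javascript
function resolveRunCount({ explicitCount, multiRunRequested = false } = {}) {
  if (explicitCount !== undefined) {
    if (!Number.isInteger(explicitCount) || explicitCount < 1) {
      throw new RangeError(`Run count must be a positive integer, got ${explicitCount}`);
    }
    return explicitCount;
  }
  // "Multi-run" without a number defaults to 2; otherwise single-run mode.
  return multiRunRequested ? 2 : 1;
}
```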

### Persona Rotation Schedule

Deterministic given the run number. Run 1 uses the base set; subsequent
runs use complementary sets to maximize coverage diversity.

| Run # | Persona Set | Purpose |
|-------|------------|---------|
| 1 | Standard content-type base set + signal specialists | Canonical review |
| 2 | Complementary: Code Quality Auditor, Performance Specialist, Methodology Analyst, DA + DIFFERENT signal specialists than Run 1 | Catch what Run 1 missed |
| 3 | Adversarial-heavy: 3 Devil's Advocates (different reasoning strategies) + 1 Correctness Hawk | Stress-test consensus |
| 4+ | Cycle through 1–3 with shuffled signal specialists | Diminishing returns |

**Run 3 Devil's Advocates** use different reasoning strategies:
1. Analogical reasoning ("compare to known failure patterns from similar projects")
2. Adversarial simulation ("imagine you are an attacker / malicious user")
3. Failure mode enumeration ("list every way this could fail in production")
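
Because the schedule is deterministic in the run number, it can be expressed as a lookup. The sketch below uses hypothetical set names (`canonical`, `complementary`, `adversarial-heavy`) as shorthand for the three rows of the table, and reads "Cycle through 1–3" as a simple modulo.

```javascript
const SCHEDULE = ["canonical", "complementary", "adversarial-heavy"];

function personaSetForRun(runNumber) {
  if (!Number.isInteger(runNumber) || runNumber < 1) {
    throw new RangeError(`Run number must be a positive integer, got ${runNumber}`);
  }
  return {
    set: SCHEDULE[(runNumber - 1) % SCHEDULE.length],
    // Runs 4+ reuse sets 1-3 but shuffle the signal specialists.
    shuffleSignalSpecialists: runNumber > SCHEDULE.length,
  };
}
```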

### Key Rules for Multi-Run Mode

1. **Content classification runs ONCE** (in Run 1). The classification is
   FIXED for all subsequent runs. This eliminates the primary source of
   cross-run non-determinism documented in the consistency analysis.
2. **Phase 2 (Data Flow Trace) runs ONCE** (in Run 1). The Data Flow Map is
   cached and shared with all subsequent runs. The trace is deterministic
   for a given codebase; re-running would not produce different paths.
3. **Each run independently executes Phases 3–15** with its own persona set.
4. **Per-run reports** are written to `review_panel_report_run{N}.md`.
5. **After all runs complete**, Phase 16 (Merge) runs once to produce the
   final merged `review_panel_report.md`.
6. **Runs MAY execute in parallel** if the orchestrator supports it (launching
   multiple run orchestrations as parallel background agents). Sequential
   execution is also acceptable.

---

## Phase 16: Merge (v2.14, multi-run only)

Single agent (`model: "opus"`). VoltAgent mapping:
`voltagent-meta:knowledge-synthesizer` (always pass `model: "opus"`).

The Merge Agent receives all N per-run reports and executes:

1. **Collect all findings** from all runs, preserving severity, location,
   bug class, epistemic label, and source run number.

2. **Deduplicate by semantic similarity.** Two findings are duplicates if AND
   ONLY IF both conditions hold:
   - Same location (same file AND same function, OR lines within 10 of each other)
   - AND same bug class

   In every other case, keep both:
   - Different bug classes at same location → keep both
   - Same bug class at different locations → keep both
   - When in doubt, prefer keeping duplicates over false merging

3. **Score stability.** For each merged finding, count how many runs produced it:
   - `[N/N RUNS]` — found in every run, highest confidence
   - `[K/N RUNS]` (1 < K < N) — found in multiple runs, medium-high confidence
   - `[1/N RUNS]` — single-run finding, NOT demoted. Single-run findings
     often represent unique persona insights that only one configuration
     surfaced. The consistency analysis proved single-run P0s are often the
     most valuable findings.

4. **Resolve severity disagreements.** When runs disagree on severity for a
   merged finding, use the HIGHEST severity from any run (conservative:
   false negatives are invisible while false positives are visible and
   dismissible). Note the range: "P0 (Run 1) / P1 (Run 2)".

5. **Resolve judge divergence.** If per-run judges gave scores more than 2
   points apart, flag `[JUDGE_DIVERGENCE]`, explain what drove the difference
   (different persona focus? different threat model?), and provide an
   independent merged assessment.

6. **Produce the merged report** at `review_panel_report.md`. Per-run reports
   remain at `review_panel_report_run{N}.md` for audit trail.
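
Steps 2 to 5 can be sketched as plain functions. The finding fields (`run`, `file`, `fn`, `line`, `bugClass`, `severity`) are hypothetical, and two readings are assumed: "lines within 10" is checked within the same file, and each finding is compared against the first member of an existing group. Bug-class equality is a plain string match here, whereas the Merge Agent judges similarity semantically.

```javascript
const RANK = { P0: 0, P1: 1, P2: 2, P3: 3 }; // lower rank = more severe

function sameLocation(a, b) {
  if (a.file !== b.file) return false;
  const sameFn = Boolean(a.fn) && a.fn === b.fn;
  return sameFn || Math.abs(a.line - b.line) <= 10;
}

// Duplicates need BOTH the same location and the same bug class.
function isDuplicate(a, b) {
  return sameLocation(a, b) && a.bugClass === b.bugClass;
}

function mergeFindings(findings, totalRuns) {
  const groups = [];
  for (const f of findings) {
    const group = groups.find((g) => isDuplicate(g[0], f));
    if (group) group.push(f);
    else groups.push([f]);
  }
  return groups.map((g) => {
    const runs = [...new Set(g.map((f) => f.run))].sort((x, y) => x - y);
    // Conservative rule: keep the HIGHEST severity reported by any run.
    const severity = g.reduce(
      (worst, f) => (RANK[f.severity] < RANK[worst] ? f.severity : worst),
      g[0].severity
    );
    const disagree = new Set(g.map((f) => f.severity)).size > 1;
    return {
      ...g[0],
      severity,
      runs,
      stability: `[${runs.length}/${totalRuns} RUNS]`,
      severityRange: disagree
        ? g.map((f) => `${f.severity} (Run ${f.run})`).join(" / ")
        : null,
    };
  });
}

// Step 5: flag when per-run judge scores are more than 2 points apart.
function judgeDivergence(scores) {
  return Math.max(...scores) - Math.min(...scores) > 2;
}
```

Single-run findings keep their severity; the `[1/N RUNS]` label records provenance without demoting them.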

### Merged Report Additions

The single-run Phase 15.1 report format is extended with:

**New header fields:**
```
**Runs:** {N} (personas rotated per schedule)
**Run stability:** {X}% of findings appeared in 2+ runs
**Unique to single run:** {Y} findings
```

**New required section:**
```
## Run Comparison
| Finding | Run 1 | Run 2 | Run 3 | Merged Severity | Stability |
```

**New label type in Scope & Limitations:**
```
Stability labels: [N/N RUNS] (high confidence) [K/N RUNS] (medium) [1/N RUNS] (single-angle)
```

**Action items gain a stability label:**
```
1. **[P0] [VERIFIED] [2/2 RUNS]** Add mutex lock around token refresh
2. **[P1] [CONSENSUS] [1/2 RUNS]** Sanitize error messages *(Run 2 only: Security Auditor)*
```

See `references/prompt-templates.md` for the full Phase 16 Merge Agent prompt.

---

## Implementation Notes

### State files (v3.1.0+)

Subagent outputs for Phases 3, 4, 5, 7, 8, 10, 11, and 14 are written to disk
under a `state/` subdirectory of the review output directory, then the
subagent returns only the file path plus a 100-word summary. The orchestrator
reads files on demand rather than holding verbatim subagent outputs in its
context window.

Reviewer state files use the naming convention
`state/reviewer_<name>_phase_<N>.md` (where `<name>` is the persona slug and
`<N>` is the phase number). Phase 5 debate files add a round suffix, e.g.
`state/reviewer_<name>_phase_5_round1.md`. Orchestrator-level state files include
`state/phase_8_audit.md`, `state/phase_10_claim_verification.md`,
`state/phase_11_severity_verification.md`, `state/phase_14_judge_ruling.md`,
and `state/phase_14_5_judge_verification.md` (v3.2.0).

**Single-run layout:**

```
docs/reviews/<date>-<topic>/
├── state/
│   ├── reviewer_<name>_phase_3.md         # independent review
│   ├── reviewer_<name>_phase_4.md         # private reflection
│   ├── reviewer_<name>_phase_5_round1.md  # debate response
│   ├── reviewer_<name>_phase_7.md         # blind final assessment
│   ├── phase_8_audit.md
│   ├── phase_10_claim_verification.md
│   ├── phase_11_severity_verification.md
│   ├── phase_14_judge_ruling.md
│   └── phase_14_5_judge_verification.md      # v3.2.0 — post-judge gate
├── review_panel_report.md                  # Phase 15.1
├── review_panel_process.md                 # Phase 15.2
└── review_panel_report.html                # Phase 15.3
```

**Multi-run layout (Phase 16):**

```
docs/reviews/<date>-<topic>/
├── state/
│   ├── run_1/reviewer_<name>_phase_3.md
│   ├── run_1/reviewer_<name>_phase_4.md
│   ├── ...
│   ├── run_2/reviewer_<name>_phase_3.md
│   └── ...
```

Each run's state lives under `state/run_<N>/` (e.g.
`state/run_1/reviewer_<name>_phase_3.md`,
`state/run_2/reviewer_<name>_phase_3.md`). The merge step (Phase 16) reads
state files from each run independently when computing union findings.
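
The two layouts differ only in an optional `run_<N>/` directory, which a path helper can capture. `stateFilePath` is a hypothetical name; paths are relative to the review output directory.

```javascript
// Build a reviewer state-file path for single-run or multi-run layouts.
function stateFilePath({ reviewer, phase, round, run }) {
  const dir = run === undefined ? "state" : `state/run_${run}`;
  // Only debate files (Phase 5) carry a round suffix in the layouts above.
  const suffix = round === undefined ? "" : `_round${round}`;
  return `${dir}/reviewer_${reviewer}_phase_${phase}${suffix}.md`;
}
```

For example, `stateFilePath({ reviewer: "security", phase: 5, round: 1, run: 2 })` yields `state/run_2/reviewer_security_phase_5_round1.md`.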

This pattern mirrors `overnight-insight-discovery`, `successor-handoff`, and
`cloud-run-results-bq-postsync` — every long-running multi-agent skill in the
local catalog routes intermediate outputs through disk to keep the
orchestrator window small.

- **Parallel execution:** Phases 3, 4, 5, 7 use a single message with multiple
  Agent tool calls. Phases 2, 8, 9, 10, 11, 12, 13, 14 run sequentially, one
  after another (Phase 9 is orchestrator-driven via Bash, not a subagent).
  Phase 12a is orchestrator logic (no agent). Phase 12b is a single Opus
  agent. Within Phase 13, agents launch in parallel (single message with one
  Agent call per dispute point). Phases 15.1, 15.2, and 15.3 run in strict
  sequence (15.1 → 15.2 → 15.3). Phase 15.3 runs AFTER 15.2 so its agent can
  read the already-written files from disk instead of requiring the
  orchestrator to inject all data in-context.
- **Context management:** Pass full content in Phases 2, 3, 8, 14. For long
  works (>500 lines), debate rounds use Phase 6 summaries with source
  excerpts instead of full content.
- **Error handling:** Retry failed agents once. Proceed with a minimum of 2
  reviewers and note any gaps in the report. Phase 15.3 has an explicit
  verification gate (v2.16.4): if the HTML file is missing after the agent
  returns, retry once before degrading to 2-file output with a manual
  recovery instruction for the user.
- **Idempotent:** Safe to re-run on the same content — each invocation produces
  an independent panel with no side effects from previous runs.
- **Auto-persona algorithm:** Classify → base set → signal scan → add up to 6 →
  replace DA first. See `references/signals-and-checklists.md` for signal table.
- **Multi-run execution (v2.14):** When `--runs N` is passed with N > 1,
  Phase 1 runs once (shared classification + signal detection + context
  brief), Phase 2 runs once in Run 1 (the Data Flow Map is cached for later
  runs), Phases 3–15 repeat N times with rotated personas, and Phase 16 then
  merges the results. Runs MAY execute in parallel (independent
  orchestrations) or sequentially.
- **Force opus (v2.14):** ALWAYS pass `model: "opus"` when launching agents,
  even with `subagent_type`. VoltAgent agents may have sonnet/haiku defaults
  in their frontmatter; without explicit override, reviewer reasoning depth
  varies across runs. This was an invisible source of cross-run variance
  in v2.9–v2.13.
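
The error-handling rule above (retry once, proceed with at least 2 reviewers, note gaps) might look like this in outline. `launch` is a hypothetical callable standing in for an Agent tool call, and the loop is sequential for readability even though the real phases launch reviewers in parallel.

```python
from typing import Callable, Dict, List, Optional, Tuple

MIN_REVIEWERS = 2  # from the text: proceed with a minimum of 2 reviewers


def run_with_retry(launch: Callable[[str], str], persona: str) -> Optional[str]:
    """Call launch(persona); on failure retry exactly once, then give up."""
    for _attempt in range(2):  # first try plus one retry
        try:
            return launch(persona)
        except Exception:  # broad on purpose: any agent failure counts
            continue
    return None


def collect_reviews(launch: Callable[[str], str],
                    personas: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (reviews by persona, personas that failed twice)."""
    results: Dict[str, str] = {}
    gaps: List[str] = []
    for persona in personas:
        output = run_with_retry(launch, persona)
        if output is None:
            gaps.append(persona)  # to be noted as a gap in the report
        else:
            results[persona] = output
    if len(results) < MIN_REVIEWERS:
        raise RuntimeError(
            f"only {len(results)} reviewer(s) succeeded; need {MIN_REVIEWERS}"
        )
    return results, gaps
```

The returned `gaps` list is what a report writer would use to record which personas are missing from the panel.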

## Edge Cases

- **No content provided:** Ask user what to review. Do not launch a panel with empty input.
- **Very large files (>500 lines):** Use Phase 6 summaries with excerpts instead of full content in debate rounds. Cap at 20k lines total.
- **Binary/image files:** Skip. Note in report: "Binary files excluded from review."
- **Single tiny file (<20 lines):** Reduce to 2 reviewers (minimum). Full panel is overkill.
- **No P0/P1 findings:** Skip Phases 9 and 11. Proceed directly to claim verification (Phase 10).
- **No unresolved disputes or unverified action items:** Skip Phases 12 and 13. Proceed directly to Phase 14.
- **All reviewers agree (score spread < 2):** Flag correlated-bias warning in report. Do NOT skip debate — unanimous agreement is the most dangerous failure mode.
- **Phase 2 skipped (v2.14):** For pure docs/plans, or code with no data transforms (pure API routing, static config), skip Phase 2 entirely. Note reason in Context Brief and report header: "Data flow trace: Skipped ({reason})". Proceed directly to Phase 3.
- **Single-run mode (v2.14):** `--runs 1` (default) skips Phase 16 (Merge). Report is written directly to `review_panel_report.md` by Phase 15.1. No stability labels. No Run Comparison section.
- **Multi-run with N > 3 (v2.14):** Persona rotation cycles through the Run 1/2/3 schedule with shuffled signal specialists. N > 4 has diminishing returns: warn the user that marginal finding discovery drops sharply after Run 3.
- **Multi-run judge divergence (v2.14):** If per-run judge scores span > 2 points, Phase 16 flags `[JUDGE_DIVERGENCE]` and provides an independent merged assessment rather than averaging.
- **Exhaustive trace on very large codebases (v2.14):** No token budget limit. If the file is > 20k lines, Phase 2 may take > 30 min. Warn the user and offer Thorough tier as alternative.
- **HTML report soft size cap (v2.15):** Target 150–250KB, soft cap 500KB. If the combined structured data (all 10 expandable sections across all findings) exceeds 500KB, the Phase 15.3 agent SHOULD offer a "slim" mode that drops verbatim `fullEvidence` and `debateTranscript` content (replacing with summaries). Slim mode is indicated in the report header and footer.
- **Prism.js CDN unreachable (v2.15):** If the Prism.js CDN fails to load, code evidence blocks render as unstyled `<pre><code>` elements (still readable, just without syntax colors). Wrap Prism calls in `try/catch` to prevent a CDN failure from breaking the page. This is consistent with the existing graceful-degradation approach for Tailwind and Chart.js CDN failures.
- **Empty expandable sections (v2.15):** When a finding lacks data for any of the 10 accordion sections (e.g., no debate, no prior runs), render a "No {section} data" placeholder instead of omitting the section. Every expanded card must show all 10 sections in the same order for consistent structure. This prevents the v2.13 nice-shtern compliance gap where agents silently omitted the expand button when evidence fields were empty.
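
Several of the edge cases above reduce to simple predicates. The sketch below encodes the skip rules and the two score-spread checks as stated in the list; the function names, parameters, and string phase labels are illustrative assumptions rather than part of the skill.

```python
from typing import List, Set


def phases_to_skip(has_p0_p1: bool, has_open_disputes: bool,
                   runs: int, phase2_applicable: bool) -> Set[str]:
    """Collect the phases the edge-case list says to skip."""
    skipped: Set[str] = set()
    if not phase2_applicable:   # pure docs/plans or no data transforms
        skipped.add("2")
    if not has_p0_p1:           # no P0/P1 findings
        skipped.update({"9", "11"})
    if not has_open_disputes:   # no disputes or unverified action items
        skipped.update({"12", "13"})
    if runs == 1:               # single-run mode has no merge step
        skipped.add("16")
    return skipped


def correlated_bias_warning(reviewer_scores: List[float]) -> bool:
    """True when the reviewer score spread is < 2 (debate still runs)."""
    if len(reviewer_scores) < 2:
        return False
    return max(reviewer_scores) - min(reviewer_scores) < 2


def judge_divergence(per_run_judge_scores: List[float]) -> bool:
    """True when per-run judge scores span more than 2 points."""
    if len(per_run_judge_scores) < 2:
        return False
    return max(per_run_judge_scores) - min(per_run_judge_scores) > 2
```

Note that `correlated_bias_warning` only flags the report; per the list above, unanimous agreement never skips the debate.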

For full prompt templates, see `references/prompt-templates.md`.
For version history, see `references/changelog.md`.
