---
name: forge-subagent-eval
description: Evaluating subagent outputs before trusting them. Shape and budget validation, refusal handling, partial-result recovery, hallucination defenses (URL/quote/file verification), retry policy with sharper brief, fan-out aggregation, audit trail. Contains ready-to-paste verification helper. Use when an orchestrator agent has received output from a subagent and needs to decide whether to use, retry, or fall back.
license: MIT
---

# forge-subagent-eval

You are the parent of a subagent that just returned a result. Default orchestrators treat that result as ground truth and act on it. That is how multi-agent systems fail silently: a subagent says "deployed successfully" when it deployed nothing, and the parent reports the deploy done. This skill exists to put a check between subagent output and parent action.

The rule: **never act irreversibly on a subagent claim without verifying.** Trust scales inversely with stakes.

## Quick reference (the things you must never ship)

1. Subagent output passed to `JSON.parse(...)` then trusted directly.
2. A cited URL/file/function used by the parent without verifying it exists.
3. A subagent's "I checked everything, it's fine" treated as a verification.
4. Retry on the same brief that failed (will fail again).
5. Fan-out result concatenated without deduplication.
6. Partial fan-out failures treated as total failure (or ignored entirely).
7. Action taken on a subagent claim of an external change ("deployed", "sent email", "wrote file") without checking the artifact.
8. No log of the parent's decision (accept / retry / reject) per spawn.
9. Unbounded retry loop on a refused subagent.
10. Quote attributed to a source without checking the quote appears in the source.

## Hard rules

### Shape and budget

**1. Validate the return shape before reading the content.** Asked for JSON with specific fields and got prose? Reject and retry with a sharper brief.

```ts
import { z } from "zod";

const ExpectedSchema = z.object({
  findings: z.array(z.object({
    file: z.string(),
    line: z.number().int().positive(),
    snippet: z.string().max(500),
  })).max(50),
  open_questions: z.array(z.string()).max(10),
});

function validate(subagentOutput: string) {
  let parsed: unknown;
  try { parsed = JSON.parse(subagentOutput); }
  catch { return { ok: false, reason: "not_json" as const }; }
  const result = ExpectedSchema.safeParse(parsed);
  if (!result.success) return { ok: false, reason: "shape_mismatch" as const, issues: result.error.issues };
  return { ok: true, data: result.data };
}
```

**2. Validate the size budget.** Asked for under 300 words and got 2000? The subagent failed to scope. Use only the relevant portion or retry. Tolerating budget overruns trains the subagent to keep ignoring them.

**3. Reject empty success.** `{ "result": "done" }` with no evidence is suspect. Demand the artifact: file path that was changed, row that was created, diff, URL.

### Content checks

**4. Check internal consistency.** "I checked 30 files" but result lists 3? Something is wrong. Surface the discrepancy.

**5. Check for failure modes your brief warned against.** Brief said "do not propose architectural changes"? Scan output for architectural proposals. Brief said "JSON only"? Scan for prose.

**6. Verify external claims by cross-check.** Subagent says "the deploy succeeded"? Check the deployment system directly. Says "the test passes"? Run the test. The subagent's claim is a hypothesis, not a fact.

### Refusal and inability

**7. Subagent refusals are signals, not failures.** "I cannot complete this because X" is more valuable than garbage. Use the refusal: change the brief and retry, or skip.

**8. Distinguish "did not try" from "tried and failed."** Returned without using any tools → probably did not engage. Ran 10 tools then gave up → has done work you can learn from.

### Hallucination defenses

**9. Filenames, function names, URLs, citations are checked.**

```ts
// reference: verify every cited file exists
import { existsSync } from "node:fs";

function verifyClaims(findings: Array<{ file: string; line: number }>) {
  const broken = [];
  for (const f of findings) {
    if (!existsSync(f.file)) {
      broken.push({ ...f, why: "file_not_found" });
    }
  }
  return broken;
}
```

```ts
// verify cited URLs actually resolve
async function verifyUrl(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { method: "HEAD", redirect: "follow", signal: AbortSignal.timeout(5000) });
    return res.ok;
  } catch { return false; }
}
```

**10. Numerical claims are checked against the source.** "I found 23 instances" - count them yourself if the count matters.

**11. Quotes are checked against the source.** A quoted function body should be verified word-for-word against the actual file when the quote drives decisions.

### Partial-result recovery

**12. Partial success is usable. Design for it.** 8 of 10 fan-out subagents return, 2 fail? Use the 8 and decide separately about the 2. Do not treat the whole fan-out as failed.

```ts
type FanOutResult<T> = { item: string; result: T | null; error: unknown };

function aggregate<T>(results: FanOutResult<T>[]) {
  const successes = results.filter((r) => r.result !== null);
  const failures  = results.filter((r) => r.result === null);
  return {
    success_rate: successes.length / results.length,
    successes:    successes.map((r) => ({ item: r.item, data: r.result })),
    failures:     failures.map((r) => ({ item: r.item, error: String(r.error) })),
  };
}
```

**13. Aggregate before judging.** Fan-out of 10 each finding 3 violations should not just concatenate to 30 - deduplicate, normalize, present as one set.

**14. Reconcile conflicts between subagents explicitly.** Subagent A says "function X exists at line 42," B says "X does not exist." Surface the conflict; do not silently pick one.

### When to retry

**15. Retry once with a sharper brief, not twice with the same brief.** Same brief produces same failure.

```ts
async function withRetry<T>(
  spawn: (brief: string) => Promise<T>,
  initialBrief: string,
  validate: (out: T) => { ok: true } | { ok: false; reason: string },
  refineBrief: (originalBrief: string, reason: string) => string,
): Promise<{ ok: true; data: T } | { ok: false; reason: string }> {
  // First attempt
  let result = await spawn(initialBrief);
  let validation = validate(result);
  if (validation.ok) return { ok: true, data: result };

  // One retry with refined brief
  const sharper = refineBrief(initialBrief, validation.reason);
  result = await spawn(sharper);
  validation = validate(result);
  if (validation.ok) return { ok: true, data: result };

  return { ok: false, reason: `retry_failed: ${validation.reason}` };
}
```

**16. Escalate model or grant more autonomy on retry.** Haiku timed out? Try Sonnet. Read-only could not gather enough? Grant edit access.

**17. After two failures, fall back to inline work or surface to the user.** Three retries is wasted compute.

### Audit trail

**18. Record every subagent call: brief, raw return, accept/retry/reject decision, reason.** Without this, debugging "why did the system do that?" is impossible.

```ts
logger.info({
  parent_call_id: parentId,
  subagent_role: role,
  brief_hash: hash(brief),
  return_size_bytes: result.length,
  validation_outcome: validation.ok ? "accepted" : `rejected:${validation.reason}`,
  retry_attempt: attempt,
  duration_ms,
}, "subagent.eval");
```

**19. Sample a fraction of subagent outputs for offline review.** Daily or weekly, look at actual returns. Drift in quality is the most common silent failure mode.

### When verification is too expensive

**20. Sometimes verification costs more than the work itself.** A subagent summarizing a 100-page document - re-reading the document defeats the purpose. Accept low-stakes outputs after lightweight checks (shape, length, presence of expected keywords); reserve full verification for high-stakes ones.

**21. Define "high-stakes" up front.** Outputs that influence file edits, external API calls, money, or user-facing messages are high-stakes. Outputs that go into a scratchpad are low-stakes.

## Common AI-output patterns to reject

| Pattern | Why wrong | Fix |
| --- | --- | --- |
| `JSON.parse(output)` then use | Crashes on prose, hallucination, malformed | safeParse with schema, retry on failure |
| Trust "I checked everything" | Unverifiable claim | Demand the artifact (diff, file path, count) |
| Retry with same brief | Same failure | Refine brief based on validation reason |
| Fan-out concat without dedup | Duplicates, overlapping findings | Aggregate + deduplicate |
| Partial fan-out treated as fail | Loses 8 good results to 2 bad | Use successes, surface failures separately |
| Action on claim without verify | Subagent says "deployed", parent says "deploy done" | Check deployment system directly |
| No retry cap | Wasted compute on stuck subagent | Hard cap at 2 attempts |
| Cited URL used unchecked | Hallucinated URLs are common | HEAD-request URLs before quoting |
| Quote used without source check | Fabricated quotes look plausible | Open the file and grep for the quote |

## Worked example: evaluating a research subagent

```ts
import { z } from "zod";

const FindingSchema = z.object({
  topic: z.string(),
  summary: z.string().max(2000),
  sources: z.array(z.string().url()).max(10),
});

type Finding = z.infer<typeof FindingSchema>;

async function evaluateResearch(raw: string, originalBrief: string): Promise<
  | { ok: true; data: Finding; verified_urls: string[] }
  | { ok: false; reason: string }
> {
  // 1. Shape
  let parsed: unknown;
  try { parsed = JSON.parse(raw); }
  catch { return { ok: false, reason: "not_json" }; }

  const validation = FindingSchema.safeParse(parsed);
  if (!validation.success) {
    return { ok: false, reason: `shape_mismatch: ${validation.error.issues[0]?.message}` };
  }

  // 2. Length budget
  if (validation.data.summary.split(/\s+/).length > 220) {
    return { ok: false, reason: "summary_over_budget" };
  }

  // 3. Cross-check: do the cited URLs actually resolve?
  const verified: string[] = [];
  for (const url of validation.data.sources) {
    if (await verifyUrl(url)) verified.push(url);
  }
  if (verified.length === 0 && validation.data.sources.length > 0) {
    return { ok: false, reason: "all_sources_unreachable" };
  }

  return { ok: true, data: validation.data, verified_urls: verified };
}

// orchestrator usage
const raw = await spawn(brief);
const result = await evaluateResearch(raw, brief);

if (!result.ok) {
  logger.warn({ reason: result.reason }, "research.subagent.rejected");
  // refine the brief, retry once
  const sharper = brief + `\n\nIMPORTANT: ${result.reason}. Adjust your response.`;
  const raw2 = await spawn(sharper);
  const result2 = await evaluateResearch(raw2, sharper);
  if (!result2.ok) {
    // give up; surface failure
    return { ok: false, reason: result2.reason };
  }
  return result2;
}

return result;
```

What this shows: shape validation (rule 1); budget check (rule 2); cross-check that cited URLs resolve (rule 9); retry once with a sharper brief (rule 15); fall back after second failure (rule 17); structured logging on rejection (rule 18).

## Workflow

When evaluating a subagent return:

1. **Shape check.** Matches expected return shape?
2. **Budget check.** Within length/time budget?
3. **Content sanity.** Internal consistency, expected fields populated.
4. **Spot-verify claims that drive irreversible action.** Open files, run tests, hit APIs.
5. **Decide: accept, retry with sharper brief, fall back, or escalate.**
6. **Log the decision.**

## Verification

This skill is about *doing* verification; verification of the skill itself is by review. Manual checklist for orchestrator code:

- [ ] Every subagent result is shape-validated before use.
- [ ] High-stakes claims (file edits, deploys, money) cross-checked against the source.
- [ ] Retry policy bounded (max 2 retries).
- [ ] Subagent returns logged with the parent's decision.
- [ ] Cited URLs verified to resolve before quoting.

## When to skip this skill

- Single-agent contexts.
- Low-stakes pipelines where output is reviewed by the user before action.
- Synthetic / test fixtures where the subagent is mocked.

## Related skills

- [`forge-multi-agent`](../forge-multi-agent/SKILL.md) - the orchestration patterns this skill complements.
- [`forge-agent-prompt`](../forge-agent-prompt/SKILL.md) - the brief whose output you are evaluating.
- [`forge-citation`](../../research/forge-citation/SKILL.md) - the citation discipline subagents should follow.
