---
name: forge-evals
description: Designing evals for LLM features. Golden datasets, rubric scoring, LLM-as-judge calibration, regression detection in CI, online A/B tests, cost+latency budgets, adversarial cases. Contains paste-ready eval harness with promptfoo / vitest, judge prompt template. Use when shipping an LLM feature to production, not for one-shot experimentation.
license: MIT
---

# forge-evals

You are shipping an LLM feature whose quality cannot be measured by a conventional pass/fail test suite. By default, agent-written LLM apps go to production with zero evals; quality is "looks good." Then a prompt change two weeks later silently regresses the median output, and nobody knows until users complain. This skill exists to put real numbers on subjective quality.

The mental model: **an eval is a test suite for prose.** It is a held-out dataset, a scoring function, a number you watch over time. The number tells you whether yesterday's change was an improvement, a regression, or noise.

## Quick reference (the things you must never ship)

1. An LLM feature shipped without any eval.
2. Eval set of <30 examples.
3. Same set used for tuning AND measuring.
4. Single aggregate quality score with no per-criterion breakdown.
5. LLM-as-judge with no calibration against human ratings.
6. Eval that has not been re-run in 3 months.
7. Eval threshold lowered to make a regression pass.
8. Eval that measures Opus output while shipping on Haiku.
9. Offline eval as the only signal (no online A/B).
10. Eval set with no refusal-expected cases.

## Hard rules

### The golden dataset

**1. Start with 50+ real or realistic examples.** Below 50, variance dominates. Above 500, diminishing returns.

**2. The dataset is held out from training/tuning.** Iterate prompts against a separate dev set. Train/dev/test split applies to prompt engineering too.
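
One way to keep the split honest is to assign each example to a set deterministically from its `id`, so an example never migrates between dev and test as the dataset grows. A minimal sketch, assuming string ids; `assignSplit`, the FNV-1a hash, and the 30% dev fraction are illustrative choices, not something this skill prescribes:

```ts
// Hypothetical helper: deterministic dev/test assignment from the example id.
type Split = "dev" | "test";

// FNV-1a 32-bit hash: small, and stable across runs and machines.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// devFraction is an assumption; pick a ratio that fits your dataset size.
function assignSplit(id: string, devFraction = 0.3): Split {
  const bucket = fnv1a(id) / 0x100000000; // map the hash to [0, 1)
  return bucket < devFraction ? "dev" : "test";
}
```

Because the assignment depends only on the id, adding or removing other examples never moves an existing example across the boundary.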

**3. Each example has: input, expected behavior (not exact output), evaluation criteria.** Exact-match scoring is rarely possible.

```yaml
# examples/01_refund_request.yaml
id: 01_refund_request
input: "Charge of $42.99 on my card but I never placed an order. Refund please."
expected:
  category: refund
  confidence: high
  refuses: false
criteria:
  - correct_category: { weight: 0.5 }
  - high_confidence:  { weight: 0.3 }
  - terse_reason:     { weight: 0.2, max_words: 12 }
```
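
The weights in `criteria` can be folded into one per-example score while the per-criterion results stay available for breakdowns. A minimal sketch, assuming every criterion resolves to pass/fail; the `Criterion` shape and `weightedScore` name are illustrative:

```ts
// Hypothetical shape mirroring the criteria list in the YAML above.
interface Criterion {
  name: string;
  weight: number;
}

// Weighted fraction of criteria that passed, normalized by total weight
// so the weights do not have to sum to exactly 1.
function weightedScore(
  criteria: Criterion[],
  passed: Record<string, boolean>,
): number {
  const total = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) return 0;
  const earned = criteria.reduce(
    (sum, c) => sum + (passed[c.name] ? c.weight : 0),
    0,
  );
  return earned / total;
}

const criteria: Criterion[] = [
  { name: "correct_category", weight: 0.5 },
  { name: "high_confidence", weight: 0.3 },
  { name: "terse_reason", weight: 0.2 },
];

// Right category and confidence, but the reason ran long: 0.8 of the weight.
const score = weightedScore(criteria, {
  correct_category: true,
  high_confidence: true,
  terse_reason: false,
});
```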

**4. Cover the long tail.** Median cases are easy. Half the value is coverage of: ambiguous queries, adversarial inputs, refusal cases, edge formats, out-of-domain.
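
Long-tail coverage is easy to assert mechanically if each example carries a tag. A minimal sketch; the `tag` field, the tag names, and the minimum counts are assumptions for illustration, since the dataset format above does not define tags:

```ts
// Hypothetical: each example is labeled with the kind of case it represents.
type Tag =
  | "median"
  | "ambiguous"
  | "adversarial"
  | "refusal"
  | "edge_format"
  | "out_of_domain";

interface TaggedExample {
  id: string;
  tag: Tag;
}

// Returns the tags that have fewer examples than required.
function underCoveredTags(
  examples: TaggedExample[],
  minimums: Partial<Record<Tag, number>>,
): Tag[] {
  const counts = new Map<Tag, number>();
  for (const ex of examples) {
    counts.set(ex.tag, (counts.get(ex.tag) ?? 0) + 1);
  }
  return (Object.keys(minimums) as Tag[]).filter(
    (tag) => (counts.get(tag) ?? 0) < (minimums[tag] ?? 0),
  );
}
```

Running this as a test next to the dataset would fail the build when, for example, someone deletes the last refusal case.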

**5. Maintain the dataset like code.** Version-controlled, reviewed, updated when the product spec changes.

### Scoring

**6. Pick the simplest scoring that captures the goal.** Then add complexity only when needed.

| Task type | Scoring |
| --- | --- |
| Classification | Accuracy |
| Extraction | Precision / Recall / F1 against ground truth |
| Open-ended generation | Rubric scoring (1-5 or pass/fail per criterion) |
| Code generation | Functional correctness via test suite |

**7. Multi-criterion rubrics beat a single quality score.** "Factual? Well-formatted? Cites sources? Tone appropriate?" - score each separately.
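
Scoring each criterion separately mostly comes down to keeping one tally per criterion. A minimal sketch, assuming pass/fail criteria; the `RubricResult` type and `passRates` name are illustrative:

```ts
// One entry per evaluated example: criterion name -> pass/fail.
type RubricResult = Record<string, boolean>;

// Pass rate per criterion across all examples, so a drop in one
// criterion stays visible even when the overall average barely moves.
function passRates(results: RubricResult[]): Record<string, number> {
  const tally: Record<string, { pass: number; total: number }> = {};
  for (const result of results) {
    for (const [criterion, ok] of Object.entries(result)) {
      const t = tally[criterion] ?? { pass: 0, total: 0 };
      t.total += 1;
      if (ok) t.pass += 1;
      tally[criterion] = t;
    }
  }
  const rates: Record<string, number> = {};
  for (const [criterion, t] of Object.entries(tally)) {
    rates[criterion] = t.pass / t.total;
  }
  return rates;
}
```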

**8. LLM-as-judge is fine; LLM-as-only-judge is fragile.** Use as one signal in a panel, calibrate against human ratings on a sample, audit periodically.

### LLM-as-judge specifics

**9. Judge sees rubric + input + output, not which model generated the output.** Hide the source.

**10. Judge returns structured output: per-criterion score + brief reason.**

```ts
// reference judge prompt
const JUDGE_PROMPT = `
You evaluate the quality of a classifier output.

<rubric>
- correct_category:   Was the output category the same as the expected?
- confidence_appropriate: For unambiguous inputs, was confidence "high"? For ambiguous inputs, was it "low"?
- terse_reason:       Is "reason" under 12 words?
</rubric>

<input>{{input}}</input>
<expected>{{expected}}</expected>
<output>{{output}}</output>

Return strict JSON:
{
  "correct_category":        { "pass": boolean, "note": "<10 word reason" },
  "confidence_appropriate":  { "pass": boolean, "note": "..." },
  "terse_reason":            { "pass": boolean, "note": "..." }
}
`;
```
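
Judges do not always return the JSON they were asked for, so it helps to validate the response before scoring. A minimal sketch in plain TypeScript with no validation library; `parseJudgeVerdict` is a hypothetical helper, and the criterion names are copied from the prompt above:

```ts
// Shape the judge prompt above asks for.
interface CriterionVerdict {
  pass: boolean;
  note: string;
}

const CRITERIA = [
  "correct_category",
  "confidence_appropriate",
  "terse_reason",
] as const;

type JudgeVerdict = Record<(typeof CRITERIA)[number], CriterionVerdict>;

// Parses the judge's raw text. Returns null instead of throwing so the caller
// can count malformed judge responses as their own failure category.
function parseJudgeVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  for (const name of CRITERIA) {
    const entry = record[name];
    if (typeof entry !== "object" || entry === null) return null;
    const { pass, note } = entry as Record<string, unknown>;
    if (typeof pass !== "boolean" || typeof note !== "string") return null;
  }
  return record as unknown as JudgeVerdict;
}
```

Tracking the rate of `null` results separately keeps judge formatting failures from being silently counted as model failures.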

**11. Judge prompt is version-controlled and eval'd.** If the judge drifts, all metrics drift. Calibrate against humans periodically.
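
Calibration needs a number too. One common agreement statistic is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, assuming the judge and a human gave pass/fail labels to the same sample; the function name and the choice of kappa over other statistics are illustrative, not prescribed by this skill:

```ts
// Cohen's kappa for two raters giving pass/fail labels to the same items.
// 1 = perfect agreement, 0 = no better than chance.
function cohensKappa(judge: boolean[], human: boolean[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("need two non-empty label arrays of equal length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i]) judgePass++;
    if (human[i]) humanPass++;
  }
  const observed = agree / n;
  const pj = judgePass / n;
  const ph = humanPass / n;
  const expected = pj * ph + (1 - pj) * (1 - ph);
  // Kappa is formally undefined when both raters gave one identical label
  // throughout; this sketch chooses to report that case as full agreement.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

Recording this value each time the judge prompt or judge model changes makes judge drift visible as a trend rather than a surprise.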

**12. Different model for judging than generating, where feasible.** Reduces shared-failure-mode bias.

### Regression detection

**13. Run eval on every prompt change, every model change, every retrieval change.** Regression test in CI.

```yaml
# .github/workflows/eval.yml
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@SHA
      - run: npm ci
      - run: npm run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail on regression
        run: |
          npm run eval:compare-baseline -- \
            --max-aggregate-drop 0.02 \
            --min-per-criterion 0.80
```

**14. Set ship threshold and block threshold. Document them.** "Aggregate must be within 2 percentage points of baseline; no individual criterion below 80%." Without thresholds, evals are decoration.
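
The thresholds only matter if something enforces them. A minimal sketch of the check a baseline-comparison script could perform; the `EvalSummary` and `Thresholds` shapes and the function name are assumptions, not the contents of any existing script:

```ts
interface EvalSummary {
  aggregate: number; // overall score in [0, 1]
  perCriterion: Record<string, number>; // pass rate per criterion in [0, 1]
}

interface Thresholds {
  maxAggregateDrop: number; // e.g. 0.02
  minPerCriterion: number; // e.g. 0.80
}

// Returns human-readable reasons to block; an empty array means ship.
function regressionReasons(
  baseline: EvalSummary,
  current: EvalSummary,
  t: Thresholds,
): string[] {
  const reasons: string[] = [];
  const drop = baseline.aggregate - current.aggregate;
  if (drop > t.maxAggregateDrop) {
    reasons.push(
      `aggregate dropped ${drop.toFixed(3)} (max ${t.maxAggregateDrop})`,
    );
  }
  for (const [name, rate] of Object.entries(current.perCriterion)) {
    if (rate < t.minPerCriterion) {
      reasons.push(`${name} at ${rate.toFixed(2)} (min ${t.minPerCriterion})`);
    }
  }
  return reasons;
}
```

A CI step would exit non-zero when the returned array is non-empty and print the reasons, so the failure explains itself.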

**15. Track per-example outcomes, not just aggregates.** A 90% aggregate that's 100% on easy and 50% on hard hides regression in the long tail.
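
Tracking per-example outcomes pays off when two runs are diffed. A minimal sketch, assuming each run is stored as a map from example id to pass/fail; the `Outcomes` type and `flippedExamples` name are illustrative:

```ts
type Outcomes = Record<string, boolean>; // example id -> passed

// Examples that passed in the baseline run but fail now, and vice versa.
function flippedExamples(
  baseline: Outcomes,
  current: Outcomes,
): { regressed: string[]; fixed: string[] } {
  const regressed: string[] = [];
  const fixed: string[] = [];
  for (const [id, passedNow] of Object.entries(current)) {
    if (!(id in baseline)) continue; // new example, nothing to compare
    const passedBefore = baseline[id];
    if (passedBefore && !passedNow) regressed.push(id);
    if (!passedBefore && passedNow) fixed.push(id);
  }
  return { regressed, fixed };
}
```

An unchanged aggregate with several entries in both `regressed` and `fixed` means the change traded some cases for others, which the aggregate alone would not show.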

**16. A failing example is a story, not a number.** Investigate. Often the eval surfaces a real product issue, not a model issue.

### A/B tests in production

**17. Offline evals predict online performance imperfectly. Run online A/B before full rollout.**

**18. Online metric is user-visible: thumbs up/down, task completion, time-on-task, downstream conversion.** Not "the LLM answered something."

**19. Statistical significance matters.** A 3% improvement on 100 samples is noise. Set sample sizes per expected effect size.
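
To make "set sample sizes per expected effect size" concrete, the standard normal-approximation formula for comparing two proportions gives a rough per-arm sample size. A minimal sketch, assuming the online metric is a success rate such as thumbs-up rate and reading "3%" as 3 percentage points; the function name and the default significance and power levels are illustrative:

```ts
// Approximate samples needed per arm to detect a change in a success rate
// with a two-sided two-proportion z-test.
// Defaults: alpha = 0.05 (z = 1.96), power = 0.80 (z = 0.8416).
function samplesPerArm(
  baselineRate: number,
  expectedRate: number,
  zAlpha = 1.96,
  zBeta = 0.8416,
): number {
  const effect = expectedRate - baselineRate;
  if (effect === 0) throw new Error("rates must differ");
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / effect ** 2);
}

// Detecting a lift from 70% to 73% needs thousands of samples per arm, not 100.
const needed = samplesPerArm(0.7, 0.73);
```

Treat the result as a planning estimate; a proper power analysis for your actual metric and test may give a different number.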

### Cost and latency

**20. Track p50/p95 latency and per-call cost as eval dimensions.** A 5% quality lift at 3x cost may not be a win.
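
Latency percentiles can be computed from the per-call timings the eval run already has. A minimal sketch using the nearest-rank definition; other percentile definitions interpolate and will differ slightly on small samples:

```ts
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1]!;
}

// Hypothetical latencies in milliseconds from one eval run.
const latenciesMs = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
const p50 = percentile(latenciesMs, 50); // 500
const p95 = percentile(latenciesMs, 95); // 1000
```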

**21. Eval at the price point you will ship at.** Eval on Opus, ship on Haiku = meaningless.

### Specific frameworks

- **promptfoo** - YAML-defined evals, multiple providers, web UI for diffs. Good default for prompt-heavy apps.
- **Anthropic Workbench** - integrated for Claude, good for iteration.
- **DeepEval** - Python, integrates with pytest.
- **Braintrust / LangSmith / Phoenix** - hosted, support online traffic capture.
- **Plain pytest/vitest + JSON dataset** - fine for small projects. Do not over-tool.

### Dataset hygiene

**22. Examples are realistic, not toy.** Real user queries from logs (with redaction) beat handcrafted "what is the capital of France?"

**23. Sensitive data redacted or synthetic.** Eval datasets get checked into repos.

**24. Document each example's intent.** Why is this in the dataset? What does it test? Without intent, examples rot.

**25. Examples expire.** A 2023 product-policy question may no longer apply. Date examples; refresh annually.

### Categories

**26. Accuracy evals: did the model get the right answer?** For factoid, extraction, classification.

**27. Behavior evals: did the model follow format/tone/safety rules?** Refused inappropriate requests? Used correct citation format? Stayed in scope?

**28. Robustness evals: does the model produce the same answer with paraphrased input?** Quality should not depend on exact wording.
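
For tasks with discrete answers, such as the category labels used elsewhere in this skill, robustness can be measured as agreement within groups of paraphrases. A minimal sketch; the function name is illustrative, and free-text outputs would need a semantic comparison instead of string equality:

```ts
// Each group holds the model's answers to several paraphrases of one question.
// Returns the fraction of groups where every paraphrase got the same answer.
function paraphraseConsistency(groups: string[][]): number {
  if (groups.length === 0) return 0;
  const consistent = groups.filter(
    (answers) => new Set(answers).size <= 1,
  ).length;
  return consistent / groups.length;
}
```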

**29. Adversarial evals: prompt injection, jailbreaks, edge cases.** Important if your app handles user-supplied input that flows near the LLM.

## Common AI-output patterns to reject

| Pattern | Why wrong | Fix |
| --- | --- | --- |
| Shipping with "we tested manually" | No regression detection | 50+ example eval set in CI |
| Same set for tuning + measuring | Overfit | Train/dev/test split |
| Single "quality" score | Hides per-criterion regressions | Rubric, score each separately |
| LLM-judge with no human calibration | Drifts silently | Periodic human review on a sample |
| Eval that has not run in months | Probably already broken | Cron in CI |
| Threshold lowered to pass | Cheating | Investigate the regression instead |
| Aggregate only | Long-tail regression hidden | Per-example outcomes tracked |
| Eval on a different model than prod | Meaningless | Eval at the price point you ship |
| No refusal-expected cases | Hallucination undetected | At least 6 refusal cases in 50 |

## Worked example: eval harness with vitest

```ts
// eval/dataset.ts
import { z } from "zod";

export const EvalExample = z.object({
  id: z.string(),
  input: z.string(),
  expected: z.object({
    category: z.enum(["billing", "technical", "refund", "other"]),
    confidence: z.enum(["high", "low"]),
  }),
});
export type EvalExample = z.infer<typeof EvalExample>;

export const dataset: EvalExample[] = [
  { id: "01", input: "Refund please.", expected: { category: "refund", confidence: "high" } },
  { id: "02", input: "App crashes on iPhone.", expected: { category: "technical", confidence: "high" } },
  { id: "03", input: "Can I get a discount?", expected: { category: "other", confidence: "low" } },
  // ... 47 more examples
];
```

```ts
// eval/run.test.ts
import { describe, expect, it } from "vitest";
import { dataset } from "./dataset.js";
import { classifySupportEmail } from "../src/classifier.js";

describe("support-classifier eval", () => {
  const results: Array<{ id: string; correct: boolean; output: unknown }> = [];

  for (const ex of dataset) {
    it(`${ex.id}: ${ex.input.slice(0, 40)}...`, async () => {
      const output = await classifySupportEmail(ex.input);
      const correct = output.category === ex.expected.category &&
                      output.confidence === ex.expected.confidence;
      results.push({ id: ex.id, correct, output });
      // do not fail individual tests - accumulate, then check aggregate
    });
  }

  it("aggregate accuracy >= 0.85", () => {
    const accuracy = results.filter((r) => r.correct).length / results.length;
    console.log(`Aggregate accuracy: ${(accuracy * 100).toFixed(1)}% (${results.filter((r) => r.correct).length}/${results.length})`);
    expect(accuracy).toBeGreaterThanOrEqual(0.85);
  });

  it("per-category accuracy >= 0.80", () => {
    const byCategory = new Map<string, { ok: number; total: number }>();
    for (const r of results) {
      const expectedCategory = dataset.find((d) => d.id === r.id)!.expected.category;
      const bucket = byCategory.get(expectedCategory) ?? { ok: 0, total: 0 };
      bucket.total++;
      if (r.correct) bucket.ok++;
      byCategory.set(expectedCategory, bucket);
    }
    for (const [cat, b] of byCategory) {
      const acc = b.ok / b.total;
      expect(acc, `category=${cat} accuracy ${acc.toFixed(2)} < 0.80`).toBeGreaterThanOrEqual(0.80);
    }
  });
});
```

What this demonstrates: structured eval set (rule 3); per-example results captured (rule 15); aggregate threshold (rule 14); per-category threshold (rule 14), which prevents one good category from masking a regressed one; runs in CI via `npm run eval` (rule 13).

## Workflow

When designing an eval for an LLM feature:

1. **Define the goal.** What does "good" look like? Write it down.
2. **Collect 50 real examples.** From logs, real-case mocks, domain-expert input. Half median, half edge.
3. **Write the rubric.** 3-7 criteria, each pass/fail or 1-5.
4. **Pick scoring approach.** Exact-match if possible, LLM-judge otherwise, human for calibration.
5. **Run baseline. Record the number.**
6. **Iterate the prompt on the DEV set, not the eval set.**
7. **Re-run eval after each meaningful change. Track over time.**
8. **Wire into CI before shipping.**

## Verification

Manual checklist (these are structural properties of the eval setup, so check them by hand):

- [ ] Golden dataset version-controlled, 50+ examples.
- [ ] Rubric has 3+ criteria, scored independently.
- [ ] Baseline recorded for at least one model.
- [ ] CI runs eval on prompt/model changes.
- [ ] Ship/block thresholds documented.
- [ ] Human review on a sample at least monthly.

## When to skip this skill

- One-shot LLM use that ships once.
- Research prototypes with no production exposure.
- Code where output correctness is checkable by unit tests.

## Related skills

- [`forge-prompt-engineering`](../forge-prompt-engineering/SKILL.md) - the prompt being eval'd.
- [`forge-rag`](../forge-rag/SKILL.md) - retrieval + generation evals.
- [`forge-tests`](../../testing/forge-tests/SKILL.md) - test hygiene principles also apply to evals.
