---
name: forge-evals
description: Designing evals for LLM features. Golden datasets, rubric scoring, LLM-as-judge calibration, regression detection in CI, online A/B tests, cost+latency budgets, adversarial cases. Contains paste-ready eval harness with promptfoo / vitest, judge prompt template. Use when shipping an LLM feature to production, not for one-shot experimentation.
license: MIT
---

# forge-evals

You are shipping an LLM feature whose quality cannot be measured by a conventional test suite. By default, agent-written LLM apps go to production with zero evals; quality is "looks good." Then a prompt change two weeks later silently regresses the median output, and nobody knows until users complain. This skill exists to put real numbers on subjective quality.

The mental model: **an eval is a test suite for prose.** It is a held-out dataset, a scoring function, and a number you watch over time. The number tells you whether yesterday's change was an improvement, a regression, or noise.

## Quick reference (the things you must never ship)

1. An LLM feature shipped without any eval.
2. Eval set of <30 examples.
3. Same set used for tuning AND measuring.
4. Single aggregate quality score with no per-criterion breakdown.
5. LLM-as-judge with no calibration against human ratings.
6. Eval that has not been re-run in 3 months.
7. Eval threshold lowered to make a regression pass.
8. Eval that measures Opus output while shipping on Haiku.
9. Offline eval as the only signal (no online A/B).
10. Eval set with no refusal-expected cases.

## Hard rules

### The golden dataset

**1. Start with 50+ real or realistic examples.** Below 50, variance dominates. Above 500, diminishing returns.

**2. The dataset is held out from training/tuning.** Iterate prompts against a separate dev set. Train/dev/test split applies to prompt engineering too.

**3. Each example has: input, expected behavior (not exact output), evaluation criteria.** Exact-match scoring is rarely possible.

```yaml
# examples/01_refund_request.yaml
id: 01_refund_request
input: "Charge of $42.99 on my card but I never placed an order. Refund please."
expected:
  category: refund
  confidence: high
  refuses: false
criteria:
  - correct_category: { weight: 0.5 }
  - high_confidence: { weight: 0.3 }
  - terse_reason: { weight: 0.2, max_words: 12 }
```

**4. Cover the long tail.** Median cases are easy. Half the value is coverage of: ambiguous queries, adversarial inputs, refusal cases, edge formats, out-of-domain.

**5. Maintain the dataset like code.** Version-controlled, reviewed, updated when the product spec changes.

### Scoring

**6. Pick the simplest scoring that captures the goal.** Then add complexity only when needed.

| Task type | Scoring |
| --- | --- |
| Classification | Accuracy |
| Extraction | Precision / Recall / F1 against ground truth |
| Open-ended generation | Rubric scoring (1-5 or pass/fail per criterion) |
| Code generation | Functional correctness via test suite |

**7. Multi-criterion rubrics beat a single quality score.** "Factual? Well-formatted? Cites sources? Tone appropriate?" - score each separately.

**8. LLM-as-judge is fine; LLM-as-only-judge is fragile.** Use as one signal in a panel, calibrate against human ratings on a sample, audit periodically.

### LLM-as-judge specifics

**9. Judge sees rubric + input + output, not which model generated it.** Hide the source.

**10. Judge returns structured output: per-criterion score + brief reason.**

```ts
// reference judge prompt
const JUDGE_PROMPT = `
You evaluate the quality of a classifier output.

<rubric>
- correct_category: Was the output category the same as the expected?
- confidence_appropriate: For unambiguous inputs, was confidence "high"? For ambiguous inputs, was it "low"?
- terse_reason: Is "reason" under 12 words?
</rubric>

<input>{{input}}</input>
<expected>{{expected}}</expected>
<output>{{output}}</output>

Return strict JSON:
{
  "correct_category": { "pass": boolean, "note": "<10 word reason" },
  "confidence_appropriate": { "pass": boolean, "note": "..." },
  "terse_reason": { "pass": boolean, "note": "..." }
}
`;
```

**11. Judge prompt is version-controlled and eval'd.** If the judge drifts, all metrics drift. Calibrate against humans periodically.

**12. Different model for judging than generating, where feasible.** Reduces shared-failure-mode bias.

### Regression detection

**13. Run eval on every prompt change, every model change, every retrieval change.** Regression test in CI.

```yaml
# .github/workflows/eval.yml
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      # SHA is a placeholder: pin the action to a full commit SHA
      - uses: actions/checkout@SHA
      - run: npm ci
      - run: npm run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail on regression
        run: |
          npm run eval:compare-baseline -- \
            --max-aggregate-drop 0.02 \
            --min-per-criterion 0.80
```

**14. Set a ship threshold and a block threshold. Document them.** "Aggregate must be within 2% of baseline; no individual criterion below 80%." Without thresholds, evals are decoration.

**15. Track per-example outcomes, not just aggregates.** A 90% aggregate that's 100% on easy and 50% on hard hides regression in the long tail.

**16. A failing example is a story, not a number.** Investigate. Often the eval surfaces a real product issue, not a model issue.

### A/B tests in production

**17. Offline evals predict online performance imperfectly. Run online A/B before full rollout.**

**18. Online metric is user-visible: thumbs up/down, task completion, time-on-task, downstream conversion.** Not "the LLM answered something."

**19. Statistical significance matters.** A 3% improvement on 100 samples is noise. Set sample sizes per expected effect size.

### Cost and latency

**20. Track p50/p95 latency and per-call cost as eval dimensions.** A 5% quality lift at 3x cost may not be a win.

**21. Eval at the price point you will ship at.** Eval on Opus, ship on Haiku = meaningless.

### Specific frameworks

- **promptfoo** - YAML-defined evals, multiple providers, web UI for diffs. Good default for prompt-heavy apps.
- **Anthropic Workbench** - integrated for Claude, good for iteration.
- **DeepEval** - Python, integrates with pytest.
- **Braintrust / LangSmith / Phoenix** - hosted, support online traffic capture.
- **Plain pytest/vitest + JSON dataset** - fine for small projects. Do not over-tool.

### Dataset hygiene

**22. Examples are realistic, not toy.** Real user queries from logs (with redaction) beat handcrafted "what is the capital of France?"

**23. Sensitive data redacted or synthetic.** Eval datasets get checked into repos.

**24. Document each example's intent.** Why is this in the dataset? What does it test? Without intent, examples rot.

**25. Examples expire.** A 2023 product-policy question may no longer apply. Date examples; refresh annually.

### Categories

**26. Accuracy evals: did the model get the right answer?** For factoid, extraction, classification.

**27. Behavior evals: did the model follow format/tone/safety rules?** Refused inappropriate requests? Used correct citation format? Stayed in scope?

**28. Robustness evals: does the model produce the same answer with paraphrased input?** Quality should not depend on exact wording.

**29. Adversarial evals: prompt injection, jailbreaks, edge cases.** Important if your app handles user-supplied input that flows near the LLM.

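A robustness eval (rule 28) can be scored mechanically once the outputs for each paraphrase group are collected. The block below is a minimal sketch, not part of this skill's harness: `ParaphraseGroup`, `RobustnessReport`, and `scoreRobustness` are hypothetical names, and it assumes the output being compared is a single category string per paraphrase. Calling the model to collect those outputs is left out.

```typescript
// Hypothetical shape: one group = several wordings of the same request,
// with the category the model returned for each wording.
interface ParaphraseGroup {
  id: string;
  outputs: string[];
}

interface RobustnessReport {
  consistencyRate: number; // fraction of groups where every paraphrase agreed
  inconsistentIds: string[]; // groups to investigate by hand
}

function scoreRobustness(groups: ParaphraseGroup[]): RobustnessReport {
  if (groups.length === 0) {
    throw new Error("cannot score an empty set of paraphrase groups");
  }
  // A group is inconsistent when its outputs contain more than one distinct value.
  const inconsistentIds = groups
    .filter((g) => new Set(g.outputs).size > 1)
    .map((g) => g.id);
  const consistencyRate = (groups.length - inconsistentIds.length) / groups.length;
  return { consistencyRate, inconsistentIds };
}

// Usage: two groups, one of which flips category under rewording.
const report = scoreRobustness([
  { id: "refund", outputs: ["refund", "refund", "refund"] },
  { id: "discount", outputs: ["other", "billing", "other"] },
]);
console.log(report.consistencyRate); // 0.5
console.log(report.inconsistentIds); // [ 'discount' ]
```

Returning the inconsistent group ids alongside the rate keeps the per-example view that rule 15 asks for: the rate is the number to threshold, and the ids are the cases to read.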
## Common AI-output patterns to reject

| Pattern | Why wrong | Fix |
| --- | --- | --- |
| Shipping with "we tested manually" | No regression detection | 50+ example eval set in CI |
| Same set for tuning + measuring | Overfit | Train/dev/test split |
| Single "quality" score | Hides per-criterion regressions | Rubric, score each separately |
| LLM-judge with no human calibration | Drifts silently | Periodic human review on a sample |
| Eval that has not run in months | Probably already broken | Cron in CI |
| Threshold lowered to pass | Cheating | Investigate the regression instead |
| Aggregate only | Long-tail regression hidden | Per-example outcomes tracked |
| Eval on a different model than prod | Meaningless | Eval at the price point you ship |
| No refusal-expected cases | Hallucination undetected | At least 6 refusal cases in 50 |

## Worked example: eval harness with vitest

```ts
// eval/dataset.ts
import { z } from "zod";

export const EvalExample = z.object({
  id: z.string(),
  input: z.string(),
  expected: z.object({
    category: z.enum(["billing", "technical", "refund", "other"]),
    confidence: z.enum(["high", "low"]),
  }),
});
// Derive the static type from the schema so the two cannot drift apart.
export type EvalExample = z.infer<typeof EvalExample>;

export const dataset: EvalExample[] = [
  { id: "01", input: "Refund please.", expected: { category: "refund", confidence: "high" } },
  { id: "02", input: "App crashes on iPhone.", expected: { category: "technical", confidence: "high" } },
  { id: "03", input: "Can I get a discount?", expected: { category: "other", confidence: "low" } },
  // ... 47 more examples
];
```

```ts
// eval/run.test.ts
import { describe, expect, it } from "vitest";
import { dataset } from "./dataset.js";
import { classifySupportEmail } from "../src/classifier.js";

describe("support-classifier eval", () => {
  // Filled by the per-example tests; vitest runs the tests in a file
  // sequentially, so the threshold checks below see the full set.
  const results: Array<{ id: string; correct: boolean; output: unknown }> = [];

  for (const ex of dataset) {
    it(`${ex.id}: ${ex.input.slice(0, 40)}...`, async () => {
      const output = await classifySupportEmail(ex.input);
      const correct =
        output.category === ex.expected.category &&
        output.confidence === ex.expected.confidence;
      results.push({ id: ex.id, correct, output });
      // do not fail individual tests - accumulate, then check aggregate
    });
  }

  it("aggregate accuracy >= 0.85", () => {
    const correctCount = results.filter((r) => r.correct).length;
    const accuracy = correctCount / results.length;
    console.log(`Aggregate accuracy: ${(accuracy * 100).toFixed(1)}% (${correctCount}/${results.length})`);
    expect(accuracy).toBeGreaterThanOrEqual(0.85);
  });

  it("per-category accuracy >= 0.80", () => {
    // Bucket results by the expected category of each example.
    const byCategory = new Map<string, { ok: number; total: number }>();
    for (const r of results) {
      const expectedCategory = dataset.find((d) => d.id === r.id)!.expected.category;
      const bucket = byCategory.get(expectedCategory) ?? { ok: 0, total: 0 };
      bucket.total++;
      if (r.correct) bucket.ok++;
      byCategory.set(expectedCategory, bucket);
    }
    for (const [cat, b] of byCategory) {
      const acc = b.ok / b.total;
      expect(acc, `category=${cat} accuracy ${acc.toFixed(2)} < 0.80`).toBeGreaterThanOrEqual(0.80);
    }
  });
});
```

What this demonstrates: structured eval set (rule 3); per-example results captured (rule 15); aggregate threshold (rule 14); per-category threshold (rules 14 and 15) - prevents one good category from masking a regressed one; runs in CI via `npm run eval`.

## Workflow

When designing an eval for an LLM feature:

1. **Define the goal.** What does "good" look like? Write it down.
2. **Collect 50+ real examples.** From logs, real-case mocks, domain-expert input. Half median, half edge.
3. **Write the rubric.** 3-7 criteria, each pass/fail or 1-5.
4. **Pick scoring approach.** Exact-match if possible, LLM-judge otherwise, human for calibration.
5. **Run baseline. Record the number.**
6. **Iterate the prompt on the DEV set, not the eval set.**
7. **Re-run eval after each meaningful change. Track over time.**
8. **Wire into CI before shipping.**

## Verification

Manual checklist (the requirements are structural, so verify them by inspection):

- [ ] Golden dataset version-controlled, 50+ examples.
- [ ] Rubric has 3+ criteria, scored independently.
- [ ] Baseline recorded for at least one model.
- [ ] CI runs eval on prompt/model changes.
- [ ] Ship/block thresholds documented.
- [ ] Human review on a sample at least monthly.

## When to skip this skill

- One-shot LLM use that ships once.
- Research prototypes with no production exposure.
- Code where output correctness is checkable by unit tests.

## Related skills

- [`forge-prompt-engineering`](../forge-prompt-engineering/SKILL.md) - the prompt being eval'd.
- [`forge-rag`](../forge-rag/SKILL.md) - retrieval + generation evals.
- [`forge-tests`](../../testing/forge-tests/SKILL.md) - test hygiene principles also apply to evals.