---
name: prompt-eval
description: Run a prompt against a golden dataset and report per-rubric regression vs the prior pinned version, not just an aggregate score
allowed-tools: Bash Read Edit
argument-hint: " [--dataset ] [--baseline ] [--model ]"
mode: [report]
---

# Prompt Eval

## Purpose

Score a prompt change against a fixed golden dataset and surface regressions per rubric: the failure modes that hide inside an aggregate "92% accuracy" number. Pins three things so the run is reproducible:

- **Model version.** The exact provider model id (e.g., `claude-opus-4-7`, `gpt-5.1-2026-04-15`), not an alias.
- **Rubric version.** A content hash of the rubric file at eval time.
- **Dataset version.** A content hash of the dataset at eval time.

Without those three pins, week-over-week comparisons drift silently when the model alias rolls forward or someone edits a rubric line.

## Scope

- Reads a prompt from `/prompts//{system.md,user.tmpl}`.
- Reads a dataset of `(input, expected, rubric_subset)` triples from `/eval/datasets/.jsonl`.
- Reads rubrics from `/eval/rubrics/.yaml`. Each rubric is a named pass/fail (or 0–1 scalar) check applied to the model's output for a given input.
- Runs the prompt over the dataset against the configured model, scores each example per rubric, writes results to `/eval/runs//`.
- Diffs against the baseline run (default: most recent run on `main`) and reports per-rubric deltas. Aggregate score is included but secondary.
- Designed for `eval-engineer` and `prompt-engineer` to share. The prompt-engineer iterates the prompt; the eval-engineer owns the rubric and dataset.

## When to use

- Before merging a prompt change, to confirm no rubric regressed.
- After a model version bump, to re-baseline (run, accept the new baseline, commit).
- During prompt iteration, to keep a tight feedback loop on which rubric is improving and which is being traded away.
- As the data source for `llm-output-gate` (CI-side gate).
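
To make the Scope description concrete, here is a minimal Python sketch of how a dataset row's `rubric_subset` could limit which rubrics are applied to an output. It is an illustration under stated assumptions, not the harness's actual implementation: the JSON key names (`id`, `input`, `expected`, `rubric_subset`), the two rubric functions, and the rule that a scalar score counts as a pass only at 1.0 are all hypothetical. The real schemas are defined in `docs/eval-format.md`.

```python
import json
from typing import Callable, Dict, List

# A rubric check takes (model_output, expected) and returns a score in [0, 1].
# Pass/fail rubrics simply return 0.0 or 1.0.
Rubric = Callable[[str, str], float]

# Hypothetical rubrics, for illustration only; real ones live in the rubric YAML.
RUBRICS: Dict[str, Rubric] = {
    "exact_match": lambda out, exp: 1.0 if out.strip() == exp.strip() else 0.0,
    "max_length": lambda out, exp: 1.0 if len(out) <= 2 * len(exp) else 0.0,
}


def score_row(row: dict, output: str) -> List[dict]:
    """Score one dataset row, applying only the rubrics the row lists."""
    results = []
    for name in row["rubric_subset"]:
        score = RUBRICS[name](output, row["expected"])
        results.append({
            "row_id": row["id"],
            "rubric": name,
            "pass": score >= 1.0,  # assumed pass threshold
            "score": score,
        })
    return results


# One JSONL line; the key names are assumed, not taken from docs/eval-format.md.
line = '{"id": "r1", "input": "2+2?", "expected": "4", "rubric_subset": ["exact_match"]}'
row = json.loads(line)
scores = score_row(row, output="4")
print(scores)
```

Because `r1` lists only `exact_match`, the `max_length` rubric is never evaluated for it. This is the behaviour that keeps a tone-only row from being scored on factual accuracy.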

## When NOT to use

- For one-off "does this look right" spot checks: eyeball the output, don't run the whole dataset.
- For ranking models against each other (model bake-off): that's a different harness with different controls.
- Without a dataset. If the project hasn't built golden examples yet, the answer is "build the dataset first." A 10-example dataset is better than no dataset; a zero-example "vibes" eval is worse than not running at all.

## Automated pass

1. **Resolve pins.**

   ```sh
   MODEL_ID=$(yakos runtime-info --model "${MODEL:-claude}" --resolve-version)
   RUBRIC_HASH=$(sha256sum "eval/rubrics/${RUBRIC:-default}.yaml" | cut -c1-12)
   DATASET_HASH=$(sha256sum "eval/datasets/${DATASET:-default}.jsonl" | cut -c1-12)
   RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)-${MODEL_ID}-${DATASET_HASH}"
   ```

2. **Run the prompt over the dataset.** One call per row. Cache the system prompt (it's identical across rows) so the cost stays reasonable on a 500-row dataset.

   ```sh
   # Use the same defaults as step 1 so the hashed files are the files that run
   yakos eval run \
     --prompt "$PROMPT_ID" \
     --dataset "eval/datasets/${DATASET:-default}.jsonl" \
     --rubric "eval/rubrics/${RUBRIC:-default}.yaml" \
     --model "$MODEL_ID" \
     --out "eval/runs/${RUN_ID}/"
   ```

3. **Score per rubric.** Each row in the dataset declares which rubrics apply (a row testing tone doesn't get scored on factual accuracy). Output is `eval/runs//scores.jsonl` with `{row_id, rubric, pass, score, notes}` per line.

4. **Diff against baseline.** Default baseline is the most recent run tagged on `main`; override with `--baseline ` to compare against a specific commit's run.

   ```sh
   # BASELINE falls back to main, matching the default described above
   yakos eval diff \
     --baseline "eval/runs/$(git show "${BASELINE:-main}":eval/.latest-run)" \
     --candidate "eval/runs/${RUN_ID}/" \
     --by rubric > "eval/runs/${RUN_ID}/diff.md"
   ```

5. **Compose the report.** Markdown table:
   - Per rubric: baseline pass-rate, candidate pass-rate, delta, sign.
   - Highlight any rubric with delta < 0 in red (regression).
   - Aggregate at the bottom, but flagged "secondary: rubric deltas above are the signal."
   - Pin block at top: `model=`, `rubric=`, `dataset=`, `run-id=`, `baseline=`.

6. **Exit code.** Zero if no rubric regressed beyond the project's tolerance (default: 0%, so any rubric regression is a failure). Non-zero if any rubric regressed. The CI gate (`llm-output-gate`) consumes this exit code.

## Manual pass

If the harness isn't wired yet:

```sh
# 1. Pin the run (model, rubric, and dataset)
MODEL_ID=$(yakos runtime-info --model claude --resolve-version)
echo "model=$MODEL_ID rubric=$(sha256sum eval/rubrics/default.yaml | cut -c1-12)" \
     "dataset=$(sha256sum eval/datasets/default.jsonl | cut -c1-12)"

# 2. Run the prompt over the dataset by hand (small N)
while IFS= read -r row; do
  # printf avoids echo mangling backslash escapes inside the JSON row;
  # </dev/null keeps the command from consuming the loop's stdin
  yakos dispatch prompt-runner "$(printf '%s\n' "$row" | jq -r .input)" \
    --model "$MODEL_ID" --json >> eval/runs/manual.jsonl </dev/null
done < eval/datasets/default.jsonl

# 3. Open the prior run side-by-side and eyeball per-rubric deltas
#    (process substitution requires bash or zsh, not plain sh)
diff <(jq -s 'group_by(.rubric) | ...' eval/runs/baseline.jsonl) \
     <(jq -s 'group_by(.rubric) | ...' eval/runs/manual.jsonl)
```

## Known gotchas

- **Model alias drift.** Pinning to `claude-opus-4` (alias) instead of `claude-opus-4-7` (concrete version) means a silent re-baseline the next time the alias rolls. The skill resolves the alias to a concrete id and stores that. If the runtime won't expose the concrete id, mark the run `unpinned` in the report and warn loudly.
- **Rubric edits invalidate baselines.** A 1-character change to a rubric file changes its hash; a diff against an old baseline is then an apples-to-oranges comparison. The skill refuses to compare runs with different `rubric_hash` values unless `--force` is set. Editing a rubric should trigger a re-baseline, not a "let me just compare anyway."
- **Rubric subset coverage.** If the dataset adds a row that doesn't list any rubric, scoring silently skips it. The skill warns when any row has zero applicable rubrics.
- **Stochastic outputs.** Even at temperature 0, providers are not bitwise-deterministic. The dataset should be large enough that a single-row flip doesn't cross the regression threshold.
  Below ~50 rows per rubric, consider averaging across N runs.
- **Cost.** A 500-row dataset on opus is ~$5–$15 per run; on cheap models it's <$1. Don't run on every commit; gate to PR open or prompt-touched paths.
- **Eval set leakage.** If the same examples train your few-shot block, the eval is meaningless. Keep `eval/datasets/` separate from `prompts//examples.md`, and document the firewall in the project's `decisions.md`.

## References

- `lib/agents/eval-engineer.md`: owns the rubric and dataset.
- `lib/agents/prompt-engineer.md`: owns the prompt body.
- `lib/skills/llm-output-gate/SKILL.md`: CI-side gate that consumes this skill's exit code.
- `docs/eval-format.md`: dataset / rubric file schemas.
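
## Sketch: per-rubric diff and exit code

The diff, tolerance, and `rubric_hash` guard described in the automated pass and the gotchas can be sketched in Python as follows. This is a hedged illustration, not the behaviour of `yakos eval diff`. The in-memory run layout (a dict with `rubric_hash` and `scores` keys) and the function names are hypothetical. The score records reuse the `rubric` and `pass` fields from the `scores.jsonl` format above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rates(scores: Iterable[dict]) -> Dict[str, float]:
    """Fraction of passing rows per rubric, from scores.jsonl-style records."""
    passed, total = defaultdict(int), defaultdict(int)
    for rec in scores:
        total[rec["rubric"]] += 1
        passed[rec["rubric"]] += 1 if rec["pass"] else 0
    return {name: passed[name] / total[name] for name in total}


def diff_runs(baseline: dict, candidate: dict, tolerance: float = 0.0,
              force: bool = False) -> Tuple[List[tuple], int]:
    """Return (table rows, exit code).

    The exit code is 1 if any rubric's pass-rate dropped by more than
    `tolerance`, else 0. Rubrics present in only one run are skipped.
    """
    if baseline["rubric_hash"] != candidate["rubric_hash"] and not force:
        raise ValueError("rubric_hash differs; re-baseline or pass force=True")
    base = pass_rates(baseline["scores"])
    cand = pass_rates(candidate["scores"])
    rows, exit_code = [], 0
    for name in sorted(set(base) & set(cand)):
        delta = cand[name] - base[name]
        regressed = delta < -tolerance
        rows.append((name, base[name], cand[name], delta, regressed))
        if regressed:
            exit_code = 1
    return rows, exit_code


def aggregate(scores: List[dict]) -> float:
    """Overall pass-rate across all (row, rubric) records."""
    return sum(1 for rec in scores if rec["pass"]) / len(scores)


# Hypothetical runs: same rubric hash, two rubrics, two rows each.
baseline = {"rubric_hash": "abc123abc123", "scores": [
    {"rubric": "tone", "pass": True}, {"rubric": "tone", "pass": True},
    {"rubric": "facts", "pass": True}, {"rubric": "facts", "pass": False},
]}
candidate = {"rubric_hash": "abc123abc123", "scores": [
    {"rubric": "tone", "pass": True}, {"rubric": "tone", "pass": False},
    {"rubric": "facts", "pass": True}, {"rubric": "facts", "pass": True},
]}

rows, code = diff_runs(baseline, candidate)
for name, b, c, delta, regressed in rows:
    print(f"{name}: {b:.2f} -> {c:.2f} ({delta:+.2f}){' REGRESSION' if regressed else ''}")
print("aggregate:", aggregate(baseline["scores"]), "->", aggregate(candidate["scores"]))
print("exit code:", code)
```

In this toy data the aggregate pass-rate is 0.75 in both runs, yet `tone` drops from 1.00 to 0.50 while `facts` rises by the same amount. With the default tolerance of 0 the sketch returns exit code 1, which is the case an aggregate-only report would miss.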