---
name: evidence-driven-agent-rules
description: Use when building or operating coding-agent infrastructure where the feedback signal exists (eval suites, run-level telemetry, A/B baselines, skill catalogs under test) and you want to capture observed agent failures, promote recurring patterns into rules, and score advanced (Level 4–5) agent-readiness. Triggers on "reflection log," "agent failure log," "evidence-driven scaffolding," "W1 ≥3 floor," "promote pattern to rule," "post-incident eval case," "harness maturity Level 4 / Level 5," "Sovereign Engineering," "Spec Architecture," or "set up a feedback loop for agent failures." Pair with `project-agentification` for project-context-first AGENTS.md scaffolding — this skill layers the evidence-driven workflow on top.
license: MIT
metadata:
  status: experimental
---

# Evidence-Driven Agent Rules (experimental)

Capture observed agent failures, promote recurring patterns into rules,
and score advanced (Level 4–5) agent-readiness. **For teams with a
feedback signal** — eval suites, run-level telemetry, A/B baselines, or
skill catalogs under test.

Pair with `project-agentification`:
- `project-agentification` scaffolds the project-context-first AGENTS.md
  any repo can use (stack, layout, commands, gates, sandbox). No
  failure-log dependency.
- This skill layers the evidence-driven workflow on top — reflection
  log + ≥3 promotion floor + Mündler citations — for the subset of
  repos where the feedback signal actually exists.

If the repo has no benchmark, no eval suite, no telemetry baseline, and
no "we observed the agent doing X" stream, use `project-agentification`
alone. This skill needs a signal to drive promotion.

## Core principle

**Single observations are cheap to record; rules cost trust to ship.**
The reflection-log workflow gates *promotion* (turning a logged pattern
into an `AGENTS.md` rule, hook, or CI gate) on ≥3 same-gap entries
because scaffolding from one observation easily overfits to boilerplate
(W1 — Mündler et al., arXiv:2602.11988, Feb 2026: autogenerated context
files drop task success ~3% and inflate cost >20%). Recording is the
opposite — low-friction, opt-in, always worth doing if you can write a
non-trivial `What to do differently` line.
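
The promotion gate can be sketched in a few lines. This is an illustration under stated assumptions, not the skill's implementation: it assumes each entry contains a line of the form `sub-surface: <name>` (the field the `promote` step greps for), and the helper names `group_by_sub_surface` and `promotable` are hypothetical.

```python
import re
from collections import defaultdict

PROMOTION_FLOOR = 3  # W1 floor: at least three same-gap entries

# Assumed entry format: a line "sub-surface: <name>" somewhere in the entry.
SUB_SURFACE_RE = re.compile(r"^sub-surface:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def group_by_sub_surface(entries):
    """Map each sub-surface name to the sorted ids of entries declaring it.

    `entries` maps an entry id (for example a file stem) to the entry text.
    Entries without a sub-surface line are skipped, not rejected: recording
    has no pattern requirement.
    """
    groups = defaultdict(list)
    for entry_id, text in entries.items():
        match = SUB_SURFACE_RE.search(text)
        if match:
            groups[match.group(1)].append(entry_id)
    return {name: sorted(ids) for name, ids in groups.items()}


def promotable(entries, floor=PROMOTION_FLOOR):
    """Return only the groups large enough to justify proposing a rule."""
    return {
        name: ids
        for name, ids in group_by_sub_surface(entries).items()
        if len(ids) >= floor
    }
```

Groups below the floor are simply absent from the result; their entries stay in the log and may qualify later as more observations arrive.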

## Bootstrap order

The W1 ≥3 floor creates a hard staging dependency for repos using this
workflow. **Scaffold in this order, never out of it:**

1. **Stage 0** — reflection-log directory (`docs/reflection-log/`) with
   a `README.md` index and a `_template.md` entry template, plus a
   `README.md §Agents` pointer at the repo root that links into the log. The
   directory is exempt from the W1 floor because it does not *contain*
   agent instructions; it *captures the observations* future
   instructions will be hand-curated from. **The Stage-0 README MUST
   explicitly distinguish the recording bar (low — one observation with
   a `What to do differently` line is enough) from the promotion bar
   (high — ≥3 entries describing the same gap).** Conflating the two
   causes reviewers and agents to self-filter entries.
2. **Stage 1** — `AGENTS.md` exists. **This skill does not scaffold it
   — use `project-agentification` for that.** Once it's there, this
   skill's `promote` workflow can add rules to it that trace to
   reflection-log entries.
3. **Stage 2+** — hooks, gates, evals — each grounded in a promoted
   pattern from the log.

**Refuse to run `promote` without Stage 0 in place** (no log → no
patterns → no evidence-driven rule).
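
A precondition check for that refusal might look like the sketch below. The directory and file names come from Stage 0 above; the function names are hypothetical, and the pointer test (the root `README.md` mentions the log path literally) is an assumption about how the `§Agents` block is written.

```python
from pathlib import Path

LOG_DIR = Path("docs/reflection-log")
REQUIRED_FILES = ("README.md", "_template.md")


def stage0_missing(repo_root):
    """Return the Stage-0 pieces that are absent under repo_root."""
    root = Path(repo_root)
    missing = [
        (LOG_DIR / name).as_posix()
        for name in REQUIRED_FILES
        if not (root / LOG_DIR / name).is_file()
    ]
    readme = root / "README.md"
    # Assumption: the root README pointer contains the literal log path.
    has_pointer = readme.is_file() and LOG_DIR.as_posix() in readme.read_text(
        encoding="utf-8"
    )
    if not has_pointer:
        missing.append("README.md pointer to docs/reflection-log/")
    return missing


def can_promote(repo_root):
    """`promote` is refused unless every Stage-0 piece is present."""
    return not stage0_missing(repo_root)
```

Returning the list of missing pieces, not just a boolean, lets the refusal message say exactly what to scaffold first.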

## Activation

- **Bare invocation** (`"set up reflection log"`, `"evidence-driven
  rules"`, `"use evidence-driven-agent-rules"`): show the intent menu
  inline (`capture` / `promote` / `assess-l4l5` — see §Workflow step 1
  below for what each does) and wait. No file inspection, no network
  calls, no writes. (Unlike `project-agentification`, this skill is
  narrow enough that it ships no separate router CSV; the SKILL.md body
  is the router.)
- **Concrete invocation** with intent inferable: skip to step 2 of the
  workflow.
- **Concrete invocation with ambiguous scope**: ask one blocker question
  identifying the intent; do not inspect private systems first.

## Workflow

1. **Pick intent.** Match the prompt to `capture` (scaffold the
   reflection log + README pointer), `promote` (find ≥3 same-gap
   entries and propose a rule/hook/gate that closes them), or
   `assess-l4l5` (score Level 4–5 maturity using the extension rubric).
   Ambiguous → ask once.
2. **Load grounded context.** Always load
   `references/empirical-warnings-w1.md` and the
   `references/playbooks/reflection-log.md` playbook. For `promote`,
   also load the reflection log itself (read every
   `docs/reflection-log/[0-9]*.md`). For `assess-l4l5`, also load
   `references/core/maturity-rubric.md`.
3. **For `capture`:** scaffold `docs/reflection-log/README.md` +
   `_template.md` from `templates/artifacts/reflection-log/`; add the
   `README.md §Agents` pointer block if it doesn't exist; refuse if
   AGENTS.md exists but doesn't link to the log (Stage 0 must be wired
   into the always-loaded surface — see `project-agentification` for the
   AGENTS.md `§Reflection-log workflow` template section).
4. **For `promote`:** group entries by `sub-surface:` (use `grep -l
   'sub-surface: <name>' docs/reflection-log/[0-9]*.md`); for any group
   with ≥3 entries, present them to the user; propose the smallest
   closing change (rule sentence in AGENTS.md, PreToolUse hook, CI
   check, or template edit); ask for confirmation before writing.
   Refuse promotion if the group has <3 entries (W1 floor).
5. **For `assess-l4l5`:** assume Levels 1–3 already scored by
   `project-agentification` (require user confirmation); score Levels
   4–5 against `references/core/maturity-rubric.md`; report ceiling and
   gaps. Assign stable IDs like `ED-L4L5-001` to trackable gaps using
   `references/trackable-findings.md`.
6. **Emit output.** Per-intent shape:
   - `capture` → list of files written + the §Agents pointer block
     diff + validation checklist.
   - `promote` → the proposed rule/hook/gate + the reflection-log
     entries it cites + the AGENTS.md / hook diff for confirmation.
   - `assess-l4l5` → maturity score per layer + ceiling + prioritized
     gaps.
7. **Create tracking state.** For `assess-l4l5` outputs with 7+ gaps, any
   level-ceiling blocker, or a save/track/workflow-state request, load
   `references/trackable-findings.md` and write both artifacts now: Markdown
   ledger at
   `docs/audits/evidence-driven-agent-rules-findings-ledger-<YYYY-MM-DD>-<scope-slug>.md`
   and workflow state at
   `docs/audits/evidence-driven-agent-rules-workflow-state-<YYYY-MM-DD>-<scope-slug>.json`.
   If the target is not a repo or `docs/audits/` is not writable, use
   `audit-artifacts/evidence-driven-agent-rules-{findings-ledger|workflow-state}-<YYYY-MM-DD>-<scope-slug>.{md|json}`.
   Report both paths. Roadmaps, issues, promotion changes, and non-tracking
   project-file edits still require confirmation.
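
The naming scheme and fallback in step 7 can be expressed as a small helper. This is a sketch only: `tracking_paths` and `needs_tracking` are hypothetical names, and whether `docs/audits/` is usable is passed in as a flag instead of being probed on disk.

```python
from datetime import date
from pathlib import Path

SKILL = "evidence-driven-agent-rules"


def needs_tracking(gap_count, has_ceiling_blocker=False, user_requested=False):
    """Step 7 trigger: 7+ gaps, a level-ceiling blocker, or an explicit request."""
    return gap_count >= 7 or has_ceiling_blocker or user_requested


def tracking_paths(scope_slug, on, primary_writable=True):
    """Return (ledger_path, workflow_state_path) for one assessment.

    `on` is the assessment date; `primary_writable` is False when the target
    is not a repo or docs/audits/ cannot be written.
    """
    base = Path("docs/audits") if primary_writable else Path("audit-artifacts")
    stamp = on.isoformat()  # YYYY-MM-DD
    ledger = base / f"{SKILL}-findings-ledger-{stamp}-{scope_slug}.md"
    state = base / f"{SKILL}-workflow-state-{stamp}-{scope_slug}.json"
    return ledger, state
```

For example, `tracking_paths("payments-api", date(2026, 5, 16))` yields a ledger path under `docs/audits/` ending in `-findings-ledger-2026-05-16-payments-api.md`; the scope slug here is made up for illustration.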

## Recording bar vs. promotion bar

These two bars are different. The earlier (pre-2026-05-16) single-file
reflection-log shape conflated them, and at least one agent applied the
high promotion bar at recording time and filtered out entries that
should have been recorded.

- **Recording bar (low).** If a contributor can write a non-trivial
  `## What to do differently` section, the entry is worth recording.
  One observation is enough. Do **not** filter on "is this a class /
  pattern / recurring?" at recording time — that filter belongs at the
  promotion step (`promote` intent), not at recording.
- **Promotion bar (high).** At least three entries describing the same
  gap are required (the W1 floor) before scaffolding a rule, hook, or
  AGENTS.md sentence to close it.

When in doubt: record. The `promote` workflow searches later.
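
The recording bar deliberately checks one thing only. The sketch below makes that concrete; whether a section is "non-trivial" is ultimately a human judgment, so the word-count threshold `MIN_WORDS` is a placeholder assumption, and both function names are hypothetical.

```python
HEADING = "## What to do differently"
MIN_WORDS = 5  # assumed stand-in for "non-trivial"; tune or replace with review


def what_to_do_differently(entry_text):
    """Return the text under the heading, up to the next '## ' heading."""
    body = []
    capturing = False
    for line in entry_text.splitlines():
        if line.strip() == HEADING:
            capturing = True
            continue
        if capturing and line.startswith("## "):
            break
        if capturing:
            body.append(line)
    return "\n".join(body).strip()


def worth_recording(entry_text, min_words=MIN_WORDS):
    """Recording bar: the section exists and says something.

    Note what is absent: no check for recurrence, class, or pattern.
    That filter belongs to the `promote` step.
    """
    return len(what_to_do_differently(entry_text).split()) >= min_words
```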

## Empirical warnings

W1 (the ≥3 floor) lives in this skill at
`references/empirical-warnings-w1.md`. W2–W10 are cross-cutting and
live in the shared `references/empirical-warnings.md` (symlink to
`skills/_shared/empirical-warnings.md`).

## Reference map

- `references/empirical-warnings-w1.md` — W1 ≥3 floor + Mündler
  citations (sole-tenant of this skill).
- `references/empirical-warnings.md` — symlink to shared W2–W10.
- `references/lenses.md` — symlink to shared four-lens parallel
  dispatch.
- `references/agent-friendly-architecture.md` — shared note on repo structure graphs + boundary enforcement (used when promoting boundary-violation patterns into gates).
- `references/core/maturity-rubric.md` — Levels 4–5 (extends
  `project-agentification`'s Levels 1–3).
- `references/playbooks/reflection-log.md` — the only sub-surface this
  skill owns directly.
- `references/trackable-findings.md` — ledger, workflow-state, and closeout
  rules for `assess-l4l5` gaps.
- `templates/artifacts/reflection-log/` — `README.md` (index +
  recording-bar / promotion-bar callout) + `_template.md` (per-entry).
- `templates/findings-ledger.md` and `templates/workflow-state.json` — saved
  tracking artifacts for advanced maturity assessments.
- `evals/` — static checks, trigger evals, activation cases.

## See also

- `project-agentification` — scaffolds the project-context-first
  AGENTS.md this skill's `promote` intent adds rules to. Pair them.
- `skills/_shared/empirical-warnings.md` — W2–W10.
- `skills/_shared/lenses.md` — the four pre-write + one post-write
  reviewer lenses.
