---
name: improve-loop
description: Briarwood's diagnostic-driven iteration loop. Use when the user wants to execute a multi-cycle improvement pass with measurement at the end (e.g., "execute Cycles K''/K'''/M'/O", "ship the followup #1 stack", "run the next §3.11 cycle"). Encodes the proven workflow from the 2026-05-03 session that shipped 30 commits and dropped bug_loss_rate 79.5% → 69.9%. Trigger when the user references a NEXT_CYCLES_HANDOFF doc, asks to execute a roadmap cycle, or asks to follow up on a measurement that surfaced specific failure modes.
---

# Briarwood Diagnostic-Driven Improvement Loop

This skill encodes the iteration loop that's been the most productive pattern for shipping Briarwood architecture in §3.11. The 2026-05-03 session shipped 30 commits using this loop, dropped bug_loss_rate 9.6 percentage points on the micro pool, and closed 6 of 14 followup items in the same day.

The loop is: **handoff → per-cycle execution → harness measurement → diagnose gaps → file followups → repeat.** This skill owns the discipline at each step.

**Read this skill in full at the start of any cycle execution.** Don't skip steps to save time — the discipline IS the productivity.

---

## Job 0: Pre-flight (5 min)

**Before opening any code:**

1. Read `CLAUDE.md` startup contract.
2. Run the `.claude/skills/readme-discipline/SKILL.md` Job 1 drift check.
3. Read `docs/_NEXT_CYCLES_HANDOFF.md` (the handoff doc that defines the cycles you're about to execute). If no handoff doc exists, ask the user to draft one before proceeding.
4. Read the relevant `roadmap/strategic_initiatives/_*.md` umbrella for full context.
5. Read the latest `CURRENT_STATE.md` entry to know what shipped last.
6. Run `git log -10 --oneline` and `git status` — confirm `HEAD` is where you expect and the working tree is the state the handoff assumes. Surface any unexpected state to the user before editing.

**Verify infrastructure:**

7. Confirm `pgrep -fl uvicorn` returns clean (no stale procs).
8. Confirm load-bearing docs are under 80% of the 256KB Read-tool cap:

   ```bash
   for f in CLAUDE.md DECISIONS.md ROADMAP.md CURRENT_STATE.md \
            roadmap/*.md roadmap/*/*.md decisions/*.md; do
     bytes=$(wc -c < "$f"); pct=$((bytes * 100 / 262144))
     [ "$pct" -ge 80 ] && echo "⚠ $f at $pct%"
   done
   ```

   If any file is ≥80%, file a split as a §4 entry before adding more to it. (DECISIONS.md hit 280KB on 2026-05-03 and required an emergency split — don't let that pattern repeat.)

---

## Job 1: Per-Cycle Execution

For EACH cycle in the handoff doc, in order:

### 1.1 Pre-cycle sample (when the cycle targets a residual cluster)

When a cycle targets "the residual F2 cluster" or "the F4 voice gap" or similar, **sample the actual cluster from the most recent harness diagnostic before writing code.** Don't trust the handoff's prediction of where the cluster lives — it may have shifted since.

```python
import json
from collections import Counter
from pathlib import Path

# Load the most recent harness diagnostic.
diag = json.loads(Path('data/eval/pressure_test/results_/diagnostic.json').read_text())
mode = 'F2_wiring_gap'  # or whichever
runs = [c for c in diag['classified_runs'] if c['mode'] == mode]

# Group runs by scenario family: the first two '_'-separated tokens
# of the run_id segment that precedes the first '__'.
fam = Counter()
for c in runs:
    family = '_'.join(c['run_id'].split('__')[0].split('_')[:2])
    fam[family] += 1
for f, n in fam.most_common(20):
    print(f' {f}: {n}')
```

The cross-tab tells you which scenarios dominate the residual. Aim your code at THOSE, not the handoff's hypothetical clusters.

### 1.2 Read affected module READMEs

Per `readme-discipline` Job 2 — read the README for every module you're about to modify, plus immediate dependency READMEs that shape the contract.

### 1.3 Write code + tests

- One logical change per cycle. Don't batch unrelated fixes.
- Pin the contract change with focused tests BEFORE committing.
- Use the same testing pattern the rest of the suite uses (see `tests/synthesis/test_llm_synthesizer.py` for the canonical scaffolding pattern).
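
As a rough illustration of "pin the contract change with focused tests", the sketch below pins a strict row schema so an unexpected field fails loudly in a unit test instead of in a live request. `CompRow` and `parse_comp_row` are hypothetical stand-ins written for this example, not Briarwood's actual models, and the real suite's scaffolding may look different.

```python
from dataclasses import dataclass, fields


# Hypothetical strict row schema for illustration; not Briarwood's real model.
@dataclass(frozen=True)
class CompRow:
    address: str
    sale_price: int
    sqft: int


def parse_comp_row(row: dict) -> CompRow:
    """Build a CompRow, rejecting any key the schema does not declare."""
    allowed = {f.name for f in fields(CompRow)}
    unknown = sorted(set(row) - allowed)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    return CompRow(**row)


def test_known_fields_parse():
    row = {"address": "1 Example St", "sale_price": 500_000, "sqft": 1_250}
    assert parse_comp_row(row).sqft == 1_250


def test_unknown_field_is_rejected():
    # An extra audit-style field must fail the contract, not pass silently.
    row = {"address": "1 Example St", "sale_price": 500_000, "sqft": 1_250,
           "geocode_source": "manual"}
    try:
        parse_comp_row(row)
    except ValueError as exc:
        assert "geocode_source" in str(exc)
    else:
        raise AssertionError("unknown field should have been rejected")
```

Two small tests like these cover both directions of the contract: the fields the change relies on still parse, and anything outside the schema is refused.
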

### 1.4 Restart uvicorn (when prompt or dispatch changes)

```bash
scripts/eval/restart_api.sh
```

If the cycle touches `briarwood/synthesis/`, `briarwood/agent/dispatch.py`, `briarwood/agent/required_coverage.py`, or any system-prompt code, the uvicorn restart is **mandatory** before any smoke test or harness. Local uvicorn doesn't auto-reload.

### 1.5 Smoke test via live /api/chat (MANDATORY, ~30 sec)

**Before any harness pass, smoke test the dispatch path with a single live curl-equivalent call.** This catches schema bugs / wiring failures that would otherwise burn 22-80 min of harness time.

The 2026-05-03 session lost ~30 min to a `ComparableSale extra="forbid"` schema bug that would have been caught by a 30-second smoke test. The geocode audit fields took down `/api/chat` for every query.

Pattern:

```python
import json, urllib.request

# Single-turn chat request; the user question goes in 'content'.
payload = {'messages': [{'role': 'user', 'content': ''}]}
req = urllib.request.Request(
    'http://127.0.0.1:8000/api/chat',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json',
             'Accept': 'text/event-stream'},
)
# Stream the server-sent events and print text deltas as they arrive.
with urllib.request.urlopen(req, timeout=180) as resp:
    for raw in resp:
        line = raw.decode('utf-8', errors='replace').rstrip('\n')
        if line.startswith('data:'):
            try:
                ev = json.loads(line[5:].strip())
            except json.JSONDecodeError:
                continue  # skip non-JSON data lines
            if ev.get('type') == 'text_delta':
                print(ev.get('content', ''), end='', flush=True)
```

If this call errors, the cycle isn't shipped. Fix the wiring before moving on.

### 1.6 Commit + README changelog

- One commit per cycle.
- Use structured commit messages (see [`commit-template.md`](references/commit-template.md)).
- For contract-level changes, run `readme-discipline` Job 3 — update the module README's prose + Last Updated date + append a dated changelog entry.

### 1.7 Restart uvicorn AGAIN before the next cycle

If the next cycle in the sequence touches the same paths, you need a fresh uvicorn before that cycle's smoke test too.
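
The stream handling in the 1.5 smoke test can also be factored into a small pass/fail helper that is checkable without a running server. This is a minimal sketch, assuming the event shape shown in 1.5 (`data:` lines carrying JSON objects with `type` and `content`); the helper names are hypothetical, and treating a reply with no text as a failure is an added assumption, not something the loop mandates.

```python
import json
from typing import Iterable


def collect_text_deltas(raw_lines: Iterable[bytes]) -> str:
    """Concatenate the content of every text_delta event in an SSE byte stream."""
    chunks = []
    for raw in raw_lines:
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        try:
            ev = json.loads(line[5:].strip())
        except json.JSONDecodeError:
            continue  # tolerate non-JSON data lines
        if isinstance(ev, dict) and ev.get("type") == "text_delta":
            chunks.append(ev.get("content") or "")
    return "".join(chunks)


def smoke_passed(raw_lines: Iterable[bytes]) -> bool:
    """Assumption: a smoke test passes only if some non-blank text came back."""
    return bool(collect_text_deltas(raw_lines).strip())


# Canned stream standing in for a live response body.
sample = [
    b'data: {"type": "text_delta", "content": "Hello"}\n',
    b'\n',
    b'data: {"type": "text_delta", "content": " world"}\n',
    b'data: not-json\n',
    b'data: {"type": "done"}\n',
]
print(collect_text_deltas(sample))  # Hello world
```

Because an HTTP response object iterates as byte lines, the same helper could be handed the `resp` object from the 1.5 pattern in place of the canned list.
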

---

## Job 2: Harness Measurement

**Choose your harness pool by scope:**

| When | Pool | Wall clock | Use for |
|---|---|---|---|
| Iterating on a single fix | 5-fixture micro pool | ~22 min | Fast signal between cycles |
| Shipping a measurement to the cycle log | 20-fixture pool | ~80 min | Comparable-to-prior-baseline |
| Validating end-to-end after a stack | Either, then escalate | varies | Ship micro first, full only if numbers warrant |

**Micro pool path:**

```bash
caffeinate -i venv/bin/python3 -m briarwood.eval.pressure_test.runner \
  --workers 4 \
  --fixtures-pool data/saved_properties/_pressure_test_micro_pool.json
```

**Full 20-fixture pool path:**

```bash
caffeinate -i venv/bin/python3 -m briarwood.eval.pressure_test.runner \
  --workers 4 \
  --fixtures-pool data/saved_properties/_pressure_test_pool_.json
```

(Don't pass `--out` — get a fresh timestamped dir. Resume support is built in if a run dies mid-flight.)

**Operational note:** keep the laptop **open + plugged in**. `caffeinate -i` blocks idle sleep but NOT lid close. Resume support survives a kill.

---

## Job 3: Diagnostic + Per-Run Delta + Cycle Log

After the harness completes:

```bash
venv/bin/python3 -m briarwood.eval.diagnostic \
  --results-dir data/eval/pressure_test/ \
  --prior-dir data/eval/pressure_test/ \
  --cycle-label ""

venv/bin/python3 -m briarwood.eval.diagnostic.per_run_delta \
  --prior data/eval/pressure_test//diagnostic.json \
  --current data/eval/pressure_test//diagnostic.json

venv/bin/python3 -m briarwood.eval.cycle \
  --results-dir data/eval/pressure_test/ \
  --label "" \
  --signature ''
```

**Signature key shape (load-bearing):** the comparator dict uses `F_` keys (NOT `F__count`). The resolver appends `_count`. Example:

```json
{
  "F3_substrate_gap": "<= -25",
  "F8_policy_aligned_loss": ">= 5",
  "honest_gap_rate": ">= 0.85"
}
```

**v2 Cycle 4 with_ci qualifier:** when you want statistical rigor, append ` with_ci` to the comparator (`"<= -15 with_ci"`).
The verdict-classifier requires both 95% CI bounds to clear.

**Updating CURRENT_STATE.md:** prepend a new dated entry summarizing the cycle outcome. Old entries roll into "Earlier:" suffix prose. Don't grow CURRENT_STATE.md past 80% of the cap (~209KB) — split if approaching.

---

## Job 4: Diagnose `partial_match` / `target_unchanged`

When the verdict isn't `target_hit`, the diagnostic question is: **substrate, synthesizer, or detector?**

**Substrate** (data layer): the underlying records don't carry the signal the cycle assumed. Examples from 2026-05-03:

- Bradley Beach narrative_notes lacked east-of-Main framing.
- 130 Beacon Hill Manasquan was geocoded 10mi north (geocoder error).
- Sea Girt $/sqft zone was empty because only 21 of 180 comps had lat/lon.

**Synthesizer** (prose layer): the prompt didn't direct the LLM to the substrate, OR the LLM doesn't comply with the prompt directive. Examples:

- Cycle K' tightened required-coverage but LLM still paraphrased "keep" instead of "hold" (needed structural section template, K'').
- General_user persona got investor-style verdict framing because the router didn't detect the persona (reframe alone wasn't enough).

**Detector** (eval layer): the F-mode classifier reads a different signal than the production code reads. Example:

- `is_substrate_thin` checked `unified.comparable_sales.metrics.comp_count` but the F3 detector reads `events.rent_outlook.zillow_rental_comp_count` — two different fields. Cycle O followup A closed the gap.

**Always trace the failure to one of these three layers BEFORE proposing a fix.** Mis-attribution wastes a cycle.

---

## Job 5: Human-Judge Calibration

**Trigger this when:**

- Two LLM judges disagree on >30% of cases.
- A cycle produced no `briarwood_wins` flips despite passing signature targets.
- You want to validate the moat before a launch decision.
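
The first trigger is a simple ratio, and a short sketch shows how it might be checked. The input shape (a list of `(judge_a, judge_b)` verdict pairs) and the function name are assumptions made for this example, and the verdict pairs are made up for illustration, not measured results.

```python
def disagreement_rate(verdicts):
    """Fraction of cases where the two judges gave different verdicts."""
    if not verdicts:
        raise ValueError("no cases to compare")
    differing = sum(1 for a, b in verdicts if a != b)
    return differing / len(verdicts)


# Made-up (judge A, judge B) verdict pairs, for illustration only.
pairs = [
    ("briarwood", "briarwood"),
    ("briarwood", "baseline"),
    ("baseline", "baseline"),
    ("tied", "briarwood"),
    ("baseline", "baseline"),
]
rate = disagreement_rate(pairs)
print(f"{rate:.0%} disagreement; calibrate: {rate > 0.30}")
# 40% disagreement; calibrate: True
```
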

**Pick 12 stratified cases** from the latest 20-fixture pool:

- 5-6 from the disagree-cluster (Claude=win, GPT-5=loss, OR vice versa)
- 3 anchor cases where both LLMs agree Briarwood wins
- 3 anchor cases where both LLMs agree baseline wins

Save picks to `data/eval/pressure_test/_human_judge_picks_.json` (this path is gitignored).

**Walk through cases one at a time conversationally:**

For each case, surface to the user:

- Question + persona + tier_observed
- Briarwood prose (full)
- Both baseline prose (Claude + GPT-5)
- Both judge rationales

User responds with `briarwood` / `baseline` / `tied` and (optional) one-line reasoning. Record verdicts.

**Aggregate at the end:**

- Per-judge agreement rate vs human
- Note empty-baseline forfeits separately — they invalidate the judge's win count
- Surface the architectural-merit principle if it shows up: "structure right, data fixable = win" (locked 2026-05-03)

**Persist the calibration as a dated doc** at `data/eval/learning/human_judge_calibration_.md`. Include the followup queue with priority + effort + status.

---

## Job 6: Followup Queue Discipline

**Every cycle that doesn't fully land surfaces followups.** File each one immediately:

1. Add to the calibration doc's followup queue (or the §3.X strategic initiative file if no calibration is open).
2. Tag with `[size: S/M/L/XL/XS]`, `[priority: 1-N]`, `[impact: ]`.
3. Mark status: `OPEN` / `RESOLVED — ` / `PARKED — `.
4. Cross-reference any prior cycle that surfaced or partially closed it.

**Don't promise a followup without filing it.** If you tell the user "I'll address this next session" without a roadmap entry, it's ephemeral.

---

## Job 7: Closeout (end of session)

**Before stopping:**

1. **Update `roadmap/resolved_index.md`** with rows for everything that closed this session.
2. **Update `CURRENT_STATE.md`** with the dated session-outcome entry.
3. **Update the strategic initiative file** (`roadmap/strategic_initiatives/_*.md`) with a "Post- outcome" section if a measurement landed.
4. **Write a session doc** at `docs/sessions/_session_doc.md` capturing every commit (see [`session-doc-template.md`](references/session-doc-template.md)).
5. **Run the size check** (Job 0 step 8) and split any file ≥80%.
6. **Commit the docs**:

   ```bash
   git add roadmap/ CURRENT_STATE.md docs/sessions/
   git commit -m "docs(session): session record ( commits)"
   ```

---

## Anti-patterns

The 2026-05-03 session encountered each of these. Avoid them.

- **Skipping the smoke test before the harness.** Cost: 30+ min lost to a schema bug that broke every chat query.
- **Trusting the headline metric without checking baseline non-empty counts.** "64% Claude wins" was 100% measurement artifact (7/12 baselines were empty forfeits).
- **Adding fields to JSON rows without scanning consumer schemas.** The `geocode_source` audit field violated `ComparableSale.extra="forbid"` on every query.
- **Writing code from the handoff's predicted cluster instead of the current diagnostic's actual cluster.** Failure modes shift between cycles. Sample first, code second.
- **Running the full 20-fixture pool when the micro pool would give faster signal.** 80min vs 22min. Reserve the full pool for shipping measurements, not iteration.
- **Filing a followup without a priority + effort estimate.** Future sessions can't act on "investigate the X bug" without a size prediction.
- **Promising a followup verbally without writing it down.** Followups evaporate without paper.
- **Letting load-bearing docs grow past 80% of the cap.** DECISIONS.md hit 280KB before anyone noticed; the emergency split cost 30 min.
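
The headline-metric anti-pattern is easy to guard against mechanically. The sketch below reports the win rate twice, once over all cases and once with empty-baseline forfeits removed. The record fields (`verdict`, `baseline_text`) and the function name are hypothetical, and the four cases are toy data, not results from any Briarwood run.

```python
def win_rates(cases):
    """Return (headline_rate, adjusted_rate, forfeit_count).

    headline_rate counts every case; adjusted_rate drops cases whose
    baseline prose was empty, since those wins are forfeits, not judgments.
    """
    def rate(rows):
        if not rows:
            return None
        return sum(r["verdict"] == "briarwood" for r in rows) / len(rows)

    scored = [c for c in cases if c["baseline_text"].strip()]
    return rate(cases), rate(scored), len(cases) - len(scored)


# Toy cases for illustration only.
cases = [
    {"verdict": "briarwood", "baseline_text": ""},       # forfeit
    {"verdict": "briarwood", "baseline_text": "   "},    # forfeit (blank)
    {"verdict": "briarwood", "baseline_text": "Hold."},  # real win
    {"verdict": "baseline", "baseline_text": "Buy."},    # real loss
]
headline, adjusted, forfeits = win_rates(cases)
print(headline, adjusted, forfeits)  # 0.75 0.5 2
```

A large gap between the two rates, or a high forfeit count, is the signal to distrust the headline number before reporting it.
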

---

## Reference

- `references/commit-template.md` — commit message scaffolding
- `references/session-doc-template.md` — session-doc scaffolding
- `references/diagnostic-snippets.md` — Python one-liners for sampling F2/F3/F4 clusters from a diagnostic.json
- `data/eval/learning/human_judge_calibration_20260503.md` — canonical example of a calibration doc + followup queue

## Slash command

Invoke this skill via `/improve-loop` to walk the user through the full session workflow from a handoff doc to commit + measurement.