---
name: improve-loop
description: Briarwood's diagnostic-driven iteration loop. Use when the user wants to execute a multi-cycle improvement pass with measurement at the end (e.g., "execute Cycles K''/K'''/M'/O", "ship the followup #1 stack", "run the next §3.11 cycle"). Encodes the proven workflow from the 2026-05-03 session that shipped 30 commits and dropped bug_loss_rate 79.5% → 69.9%. Trigger when the user references a NEXT_CYCLES_HANDOFF doc, asks to execute a roadmap cycle, or asks to follow up on a measurement that surfaced specific failure modes.
---

# Briarwood Diagnostic-Driven Improvement Loop

This skill encodes the iteration loop that's been the most productive
pattern for shipping Briarwood architecture in §3.11. The 2026-05-03
session shipped 30 commits using this loop, dropped bug_loss_rate
9.6 percentage points on the micro pool, and closed 6 of 14 followup
items in the same day.

The loop is: **handoff → per-cycle execution → harness measurement →
diagnose gaps → file followups → repeat.** This skill owns the
discipline at each step.

**Read this skill in full at the start of any cycle execution.**
Don't skip steps to save time — the discipline IS the productivity.

---

## Job 0: Pre-flight (5 min)

**Before opening any code:**

1. Read `CLAUDE.md` startup contract.
2. Run the `.claude/skills/readme-discipline/SKILL.md` Job 1 drift check.
3. Read `docs/<X.Y>_NEXT_CYCLES_HANDOFF.md` (the handoff doc that
   defines the cycles you're about to execute). If no handoff doc
   exists, ask the user to draft one before proceeding.
4. Read the relevant `roadmap/strategic_initiatives/<X.Y>_*.md` umbrella
   for full context.
5. Read the latest `CURRENT_STATE.md` entry to know what shipped last.
6. Run `git log -10 --oneline` and `git status` — confirm `HEAD` is
   where you expect and the working tree is the state the handoff
   assumes. Surface any unexpected state to the user before editing.

**Verify infrastructure:**

7. Confirm `pgrep -fl uvicorn` prints nothing (no stale uvicorn processes).
8. Confirm load-bearing docs are under 80% of the 256KB Read-tool cap:
   ```bash
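   # 262144 bytes = the 256KB Read-tool cap
   for f in CLAUDE.md DECISIONS.md ROADMAP.md CURRENT_STATE.md \
            roadmap/*.md roadmap/*/*.md decisions/*.md; do
     [ -f "$f" ] || continue   # skip glob patterns that matched nothing
     bytes=$(wc -c < "$f"); pct=$((bytes * 100 / 262144))
     if [ "$pct" -ge 80 ]; then echo "⚠ $f at $pct%"; fi
   done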
   ```
   If any file is ≥80%, file a split as a §4 entry before adding more
   to it. (DECISIONS.md hit 280KB on 2026-05-03 and required an
   emergency split; don't let that pattern repeat.)

---

## Job 1: Per-Cycle Execution

For EACH cycle in the handoff doc, in order:

### 1.1 Pre-cycle sample (when the cycle targets a residual cluster)

When a cycle targets "the residual F2 cluster" or "the F4 voice gap"
or similar, **sample the actual cluster from the most recent harness
diagnostic before writing code.** Don't trust the handoff's prediction
of where the cluster lives — it may have shifted since.

```python
import json
from collections import Counter
from pathlib import Path

diag = json.loads(Path('data/eval/pressure_test/results_<latest>/diagnostic.json').read_text())
mode = 'F2_wiring_gap'  # or whichever mode the cycle targets
runs = [c for c in diag['classified_runs'] if c['mode'] == mode]

# Family = first two underscore-separated tokens of the run_id prefix before '__'.
fam = Counter(
    '_'.join(c['run_id'].split('__')[0].split('_')[:2]) for c in runs
)
for f, n in fam.most_common(20):
    print(f'  {f}: {n}')
```

The cross-tab tells you which scenarios dominate the residual. Aim
your code at THOSE, not the handoff's hypothetical clusters.
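
To avoid hand-editing the `results_<latest>` placeholder, the newest diagnostic can be located programmatically. This is a minimal sketch, assuming results directories match `results_*` under `data/eval/pressure_test/` and that file modification time is an acceptable proxy for "most recent". `latest_diagnostic` is a hypothetical helper, not part of the Briarwood codebase.

```python
from pathlib import Path


def latest_diagnostic(root="data/eval/pressure_test"):
    """Return the most recently modified results_*/diagnostic.json, or None.

    Assumption: results directories are named results_<stamp>, and mtime
    ordering matches run ordering.
    """
    candidates = list(Path(root).glob("results_*/diagnostic.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

If the function returns `None`, no diagnostic has been generated yet, so run Job 3 on the latest harness results before sampling.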

### 1.2 Read affected module READMEs

Per `readme-discipline` Job 2 — read the README for every module
you're about to modify, plus immediate dependency READMEs that shape
the contract.

### 1.3 Write code + tests

- One logical change per cycle. Don't batch unrelated fixes.
- Pin the contract change with focused tests BEFORE committing.
- Use the same testing pattern the rest of the suite uses (see
  `tests/synthesis/test_llm_synthesizer.py` for the canonical
  scaffolding pattern).

### 1.4 Restart uvicorn (when prompt or dispatch changes)

```bash
scripts/eval/restart_api.sh
```

If the cycle touches `briarwood/synthesis/`, `briarwood/agent/dispatch.py`,
`briarwood/agent/required_coverage.py`, or any system-prompt code,
the uvicorn restart is **mandatory** before any smoke test or harness.
Local uvicorn doesn't auto-reload.
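
The restart rule can be checked mechanically against a list of changed files. This is a rough sketch, assuming the changed paths come from something like `git diff --name-only` and are repo-relative. `needs_restart` is a hypothetical helper, and it cannot detect "system-prompt code" that lives outside the listed paths, so that part still needs a manual check.

```python
# Restart triggers named in this section. System-prompt code has no fixed
# path in this document, so it is NOT covered here; check that by hand.
RESTART_PREFIXES = ("briarwood/synthesis/",)
RESTART_FILES = (
    "briarwood/agent/dispatch.py",
    "briarwood/agent/required_coverage.py",
)


def needs_restart(changed_paths):
    """Return the changed paths that make a uvicorn restart mandatory."""
    hits = []
    for raw in changed_paths:
        path = raw.strip()
        if path.startswith("./"):
            path = path[2:]  # tolerate a leading "./"
        if path in RESTART_FILES or path.startswith(RESTART_PREFIXES):
            hits.append(path)
    return hits


# Example with literal paths; in practice, feed in the output lines of
# `git diff --name-only`.
print(needs_restart(["briarwood/synthesis/prompt.py", "README.md"]))
```

A non-empty result means `scripts/eval/restart_api.sh` has to run before the smoke test.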

### 1.5 Smoke test via live /api/chat (MANDATORY, ~30 sec)

**Before any harness pass, smoke test the dispatch path with a single
live curl-equivalent call.** This catches schema bugs / wiring failures
that would otherwise burn 22-80 min of harness time.

The 2026-05-03 session lost ~30 min to a `ComparableSale` `extra="forbid"`
schema bug that a 30-second smoke test would have caught: the geocode
audit fields took down `/api/chat` for every query.

Pattern:
```python
import json
import urllib.request

# One question that exercises the code path the cycle changed.
payload = {'messages': [{'role': 'user', 'content': '<question that exercises the cycle>'}]}
req = urllib.request.Request(
    'http://127.0.0.1:8000/api/chat',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json', 'Accept': 'text/event-stream'},
)
# Stream the SSE response and print text deltas as they arrive.
with urllib.request.urlopen(req, timeout=180) as resp:
    for raw in resp:
        line = raw.decode('utf-8', errors='replace').rstrip('\n')
        if not line.startswith('data:'):
            continue
        try:
            ev = json.loads(line[5:].strip())
        except json.JSONDecodeError:
            continue  # skip frames that are not JSON
        if isinstance(ev, dict) and ev.get('type') == 'text_delta':
            print(ev.get('content', ''), end='', flush=True)
```

If this call errors, the cycle isn't shipped. Fix the wiring before
moving on.

### 1.6 Commit + README changelog

- One commit per cycle.
- Use structured commit messages (see [`commit-template.md`](references/commit-template.md)).
- For contract-level changes, run `readme-discipline` Job 3 — update
  the module README's prose + Last Updated date + append a dated
  changelog entry.

### 1.7 Restart uvicorn AGAIN before the next cycle

If the next cycle in the sequence touches the same paths, you need a
fresh uvicorn before that cycle's smoke test too.

---

## Job 2: Harness Measurement

**Choose your harness pool by scope:**

| When | Pool | Wall clock | Use for |
|---|---|---|---|
| Iterating on a single fix | 5-fixture micro pool | ~22min | Fast signal between cycles |
| Shipping a measurement to the cycle log | 20-fixture pool | ~80min | Comparable-to-prior-baseline |
| Validating end-to-end after a stack | Either, then escalate | varies | Run micro first, full only if numbers warrant |

**Micro pool path:**
```bash
caffeinate -i venv/bin/python3 -m briarwood.eval.pressure_test.runner \
    --workers 4 \
    --fixtures-pool data/saved_properties/_pressure_test_micro_pool.json
```

**Full 20-fixture pool path:**
```bash
caffeinate -i venv/bin/python3 -m briarwood.eval.pressure_test.runner \
    --workers 4 \
    --fixtures-pool data/saved_properties/_pressure_test_pool_<latest_stamp>.json
```

(Don't pass `--out`, so the run gets a fresh timestamped dir. Resume
support is built in if a run dies mid-flight.)

**Operational note:** keep the laptop **open + plugged in**. `caffeinate -i`
blocks idle sleep but NOT sleep on lid close. Resume support survives a kill.

---

## Job 3: Diagnostic + Per-Run Delta + Cycle Log

After the harness completes:

```bash
venv/bin/python3 -m briarwood.eval.diagnostic \
    --results-dir data/eval/pressure_test/<new_results_dir> \
    --prior-dir data/eval/pressure_test/<prior_results_dir> \
    --cycle-label "<short label>"

venv/bin/python3 -m briarwood.eval.diagnostic.per_run_delta \
    --prior data/eval/pressure_test/<prior>/diagnostic.json \
    --current data/eval/pressure_test/<new>/diagnostic.json

venv/bin/python3 -m briarwood.eval.cycle \
    --results-dir data/eval/pressure_test/<new_results_dir> \
    --label "<short label>" \
    --signature '<JSON dict of mode → comparator>'
```

**Signature key shape (load-bearing):** the comparator dict uses
`F<n>_<mode>` keys (NOT `F<n>_<mode>_count`). The resolver appends
`_count`. Example:

```json
{
  "F3_substrate_gap": "<= -25",
  "F8_policy_aligned_loss": ">= 5",
  "honest_gap_rate": ">= 0.85"
}
```

**v2 Cycle 4 with_ci qualifier:** when you want statistical rigor,
append ` with_ci` to the comparator (`"<= -15 with_ci"`). The
verdict-classifier then requires both 95% CI bounds to clear the threshold.
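
A signature dict can be linted before it is passed to `--signature`. This is a minimal sketch under a stated assumption: the comparator grammar is inferred only from the examples in this section (`<=` or `>=`, a number, and an optional ` with_ci` suffix). The real resolver may accept more forms, and `check_signature` is a hypothetical helper.

```python
import re

# Assumed comparator grammar, inferred only from the examples in this section:
# "<op> <number>" with an optional " with_ci" suffix.
_COMPARATOR = re.compile(r"^(<=|>=)\s*(-?\d+(?:\.\d+)?)( with_ci)?$")


def check_signature(signature):
    """Return a list of problems found in a mode -> comparator dict."""
    problems = []
    for key, comparator in signature.items():
        if key.endswith("_count"):
            problems.append(f"{key}: drop the _count suffix; the resolver appends it")
        if not _COMPARATOR.match(comparator):
            problems.append(f"{key}: unrecognized comparator {comparator!r}")
    return problems


signature = {
    "F3_substrate_gap": "<= -25",
    "F8_policy_aligned_loss": ">= 5 with_ci",
    "honest_gap_rate": ">= 0.85",
}
print(check_signature(signature))
```

An empty list means the keys and comparators match the shape described above. It says nothing about whether the targets themselves are sensible.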

**Updating CURRENT_STATE.md:** prepend a new dated entry summarizing
the cycle outcome. Older entries roll into the trailing "Earlier:" prose.
Don't grow CURRENT_STATE.md past 80% of the cap (209,715 bytes, ~205KB);
split if approaching.

---

## Job 4: Diagnose `partial_match` / `target_unchanged`

When the verdict isn't `target_hit`, the diagnostic question is:
**substrate, synthesizer, or detector?**

**Substrate** (data layer): the underlying records don't carry the
signal the cycle assumed. Examples from 2026-05-03:
- Bradley Beach narrative_notes lacked east-of-Main framing.
- 130 Beacon Hill Manasquan was geocoded 10mi north (geocoder error).
- Sea Girt $/sqft zone was empty because only 21 of 180 comps had
  lat/lon.

**Synthesizer** (prose layer): the prompt didn't direct the LLM to
the substrate, OR the LLM doesn't comply with the prompt directive.
Examples:
- Cycle K' tightened required-coverage but LLM still paraphrased
  "keep" instead of "hold" (needed structural section template, K'').
- General_user persona got investor-style verdict framing because the
  router didn't detect the persona (reframe alone wasn't enough).

**Detector** (eval layer): the F-mode classifier reads a different
signal than the production code reads. Example:
- `is_substrate_thin` checked `unified.comparable_sales.metrics.comp_count`
  but the F3 detector reads `events.rent_outlook.zillow_rental_comp_count`
  — two different fields. Cycle O followup A closed the gap.

**Always trace the failure to one of these three layers BEFORE
proposing a fix.** Mis-attribution wastes a cycle.

---

## Job 5: Human-Judge Calibration

**Trigger this when:**
- Two LLM judges disagree on >30% of cases.
- A cycle produced no `briarwood_wins` flips despite passing signature targets.
- You want to validate the moat before a launch decision.

**Pick 11-12 stratified cases** from the latest 20-fixture pool:

- 5-6 from the disagree-cluster (Claude=win, GPT-5=loss, OR vice versa)
- 3 anchor cases where both LLMs agree Briarwood wins
- 3 anchor cases where both LLMs agree baseline wins

Save picks to `data/eval/pressure_test/_human_judge_picks_<stamp>.json`
(this path is gitignored).
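
The stratification above can be sketched as a seeded sample over judged cases. The field names (`run_id`, `claude`, `gpt5`) and the verdict vocabulary (`briarwood` / `baseline` / `tied`) are assumptions for illustration, and the actual results schema may differ. Cases where either judge said `tied` are left out of every stratum here, which is also an assumption.

```python
import random


def pick_cases(cases, n_disagree=6, n_anchor=3, seed=0):
    """Sample disagree-cluster and anchor cases for human judging.

    Each case is assumed to be a dict with hypothetical keys "claude" and
    "gpt5" holding that judge's verdict.
    """
    rng = random.Random(seed)  # seeded so the picks are reproducible
    disagree = [c for c in cases
                if {c["claude"], c["gpt5"]} == {"briarwood", "baseline"}]
    both_briarwood = [c for c in cases
                      if c["claude"] == c["gpt5"] == "briarwood"]
    both_baseline = [c for c in cases
                     if c["claude"] == c["gpt5"] == "baseline"]

    def take(pool, n):
        # Never ask for more cases than the stratum contains.
        return rng.sample(pool, min(n, len(pool)))

    return {
        "disagree": take(disagree, n_disagree),
        "anchor_briarwood": take(both_briarwood, n_anchor),
        "anchor_baseline": take(both_baseline, n_anchor),
    }
```

If a stratum comes back short, note that in the calibration doc so the agreement rates are read against the right denominator.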

**Walk through cases one at a time conversationally:**

For each case, surface to the user:
- Question + persona + tier_observed
- Briarwood prose (full)
- Both baselines' prose (Claude + GPT-5)
- Both judge rationales

User responds with `briarwood` / `baseline` / `tied` and (optional)
one-line reasoning. Record verdicts.

**Aggregate at the end:**
- Per-judge agreement rate vs human
- Note empty-baseline forfeits separately — they invalidate the
  judge's win count
- Surface the architectural-merit principle if it shows up: "structure
  right, data fixable = win" (locked 2026-05-03)
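
The per-judge agreement rate can be computed with forfeits held out. This is a minimal sketch, assuming each recorded row carries the human verdict, each judge's verdict, and a `baseline_empty` flag. All field names are hypothetical, and excluding forfeits from the denominator is one reasonable reading of "note separately".

```python
def agreement(rows, judge):
    """Return (agreement_rate, forfeit_count) for one judge.

    Rows with an empty baseline are excluded from the rate and reported
    separately. The rate is None when no scorable rows remain.
    """
    scored = [r for r in rows if not r["baseline_empty"]]
    forfeits = len(rows) - len(scored)
    if not scored:
        return None, forfeits
    matches = sum(1 for r in scored if r[judge] == r["human"])
    return matches / len(scored), forfeits


rows = [
    {"human": "briarwood", "claude": "briarwood", "gpt5": "baseline", "baseline_empty": False},
    {"human": "baseline", "claude": "briarwood", "gpt5": "baseline", "baseline_empty": False},
    {"human": "tied", "claude": "tied", "gpt5": "tied", "baseline_empty": False},
    {"human": "briarwood", "claude": "briarwood", "gpt5": "briarwood", "baseline_empty": True},
]
for judge in ("claude", "gpt5"):
    rate, forfeits = agreement(rows, judge)
    print(judge, rate, forfeits)
```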

**Persist the calibration as a dated doc** at
`data/eval/learning/human_judge_calibration_<stamp>.md`. Include the
followup queue with priority + effort + status.

---

## Job 6: Followup Queue Discipline

**Every cycle that doesn't fully land surfaces followups.** File each
one immediately:

1. Add to the calibration doc's followup queue (or the §3.X strategic
   initiative file if no calibration is open).
2. Tag with `[size: XS/S/M/L/XL]`, `[priority: 1-N]`, `[impact: <area>]`.
3. Mark status: `OPEN` / `RESOLVED <date> — <commit/where>` / `PARKED — <why>`.
4. Cross-reference any prior cycle that surfaced or partially closed it.
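
The tagging rules above can be enforced when an entry is written. The sketch below is illustrative only: the single-line bullet layout and the `format_followup` helper are assumptions, not the format the calibration docs necessarily use.

```python
SIZES = ("XS", "S", "M", "L", "XL")
STATUS_PREFIXES = ("RESOLVED ", "PARKED ")


def format_followup(title, size, priority, impact, status="OPEN"):
    """Build one followup line, rejecting entries with missing or bad tags."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if not isinstance(priority, int) or priority < 1:
        raise ValueError("priority must be an integer >= 1")
    if not impact:
        raise ValueError("impact area is required")
    if status != "OPEN" and not status.startswith(STATUS_PREFIXES):
        raise ValueError("status must be OPEN, 'RESOLVED <date> ...' or 'PARKED ...'")
    return f"- {title} [size: {size}] [priority: {priority}] [impact: {impact}] {status}"


print(format_followup("Geocode audit", "S", 1, "substrate"))
```

Raising on a missing size or priority makes it harder to file the kind of unactionable followup that the anti-patterns list warns about.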

**Don't promise a followup without filing it.** If you tell the user
"I'll address this next session" without a roadmap entry, it's
ephemeral.

---

## Job 7: Closeout (end of session)

**Before stopping:**

1. **Update `roadmap/resolved_index.md`** with rows for everything that
   closed this session.
2. **Update `CURRENT_STATE.md`** with the dated session-outcome entry.
3. **Update the strategic initiative file** (`roadmap/strategic_initiatives/<X.Y>_*.md`)
   with a "Post-<cycle> outcome" section if a measurement landed.
4. **Write a session doc** at `docs/sessions/<YYYY-MM-DD>_session_doc.md`
   capturing every commit (see [`session-doc-template.md`](references/session-doc-template.md)).
5. **Run the size check** (Job 0 step 8) and split any file ≥80%.
6. **Commit the docs**:
   ```bash
   git add roadmap/ CURRENT_STATE.md docs/sessions/
   git commit -m "docs(session): <stamp> session record (<N> commits)"
   ```

---

## Anti-patterns

The 2026-05-03 session encountered each of these. Avoid them.

- **Skipping the smoke test before the harness.** Cost: 30+ min lost
  to a schema bug that broke every chat query.
- **Trusting the headline metric without checking baseline non-empty
  counts.** "64% Claude wins" was entirely a measurement artifact (7/12
  baselines were empty forfeits).
- **Adding fields to JSON rows without scanning consumer schemas.**
  The `geocode_source` audit field violated `ComparableSale.extra="forbid"`
  on every query.
- **Writing code from the handoff's predicted cluster instead of the
  current diagnostic's actual cluster.** Failure modes shift between
  cycles. Sample first, code second.
- **Running the full 20-fixture pool when the micro pool would give
  faster signal.** 80min vs 22min. Reserve the full pool for shipping
  measurements, not iteration.
- **Filing a followup without a priority + effort estimate.** Future
  sessions can't act on "investigate the X bug" without a size
  prediction.
- **Promising a followup verbally without writing it down.** Followups
  evaporate without paper.
- **Letting load-bearing docs grow past 80% of the cap.** DECISIONS.md
  hit 280KB before anyone noticed; the emergency split cost 30 min.

---

## Reference

- `references/commit-template.md` — commit message scaffolding
- `references/session-doc-template.md` — session-doc scaffolding
- `references/diagnostic-snippets.md` — Python one-liners for
  sampling F2/F3/F4 clusters from a diagnostic.json
- `data/eval/learning/human_judge_calibration_20260503.md` — canonical
  example of a calibration doc + followup queue

## Slash command

Invoke this skill via `/improve-loop` to walk the user through the
full session workflow from a handoff doc to commit + measurement.
