---
name: gloss-review
description: Stage 8 interlinear evaluator — audit completeness and correctness of candidate interlinear artifacts, benchmark them against ground truth, and decide pass/block/promote. Use for evaluation and promotion, not primary generation or import.
allowed-tools: Bash(python3:*), Bash(sqlite3:*), Bash(go:*), Bash(find:*), Bash(wc:*), Bash(ls:*), Bash(grep:*), Bash(jq:*), Bash(curl:*), Bash(git:*), Read, Write, Edit, Grep, Glob
---

# Gloss Review (Stage 8 Evaluator)

Audit, benchmark, and promote candidate Stage 8 interlinear artifacts.

This skill is the **evaluator/promotion gate** for Stage 8.
It should answer:
- is the candidate output complete enough?
- is it correct enough?
- what remains unresolved?
- should this text/range be blocked, kept as draft, or promoted?

Future preferred name: `interlinear-eval`.

## ⚠️ THIS IS A MANDATORY GATE

**This skill MUST run after every `/alignment build` and BEFORE Stage 9 or Stage 10.**

Stage 8 = build (alignment) + evaluate (gloss-review). Never proceed without evaluation.

```
/alignment build [work]
    ↓
/gloss-review evaluate [work]   ← YOU ARE HERE
    ↓
/reader-reliability [work]      (Stage 9)
```

## Quick Status

Total alignment files: !`find ${LYCEUM_TEXTS_DIR:-output/texts} -path "*/interlinear/*" -name "*.json" | wc -l`
Files needing review: !`grep -rl '"needs_gloss_review": true' ${LYCEUM_TEXTS_DIR:-output/texts}/*/interlinear/ 2>/dev/null | wc -l`
Static dictionary entries: not available (`lsj_lookup.go` has been removed)
DB aligned words: !`nix-shell -p sqlite --run "sqlite3 data/editions.db 'SELECT COUNT(*) FROM aligned_words'" 2>/dev/null`

## Commands

- `/gloss-review audit [text]` — Audit completeness and visible quality problems in candidate Stage 8 output
- `/gloss-review benchmark [text]` — Benchmark candidate output against Hamilton/Clark ground truth where available
- `/gloss-review promote [text]` — Record pass/block/promote decision for Stage 8 output
- `/gloss-review fix [text|file-pattern]` — Run QA-driven remediation passes before another audit/benchmark cycle
- `/gloss-review status [text]` — Summarize Stage 8 evaluation status and promotion readiness
- `/gloss-review dictionary [word]` — Inspect or extend static dictionary entries when audit results justify it

## Command Compatibility

Legacy commands should map like this:
- old `/gloss-review audit` -> keep
- old `/gloss-review benchmark` -> keep
- old `/gloss-review fix` -> keep, but as QA-driven remediation only
- old `/gloss-review import` -> future `/new-text-ship promote`
- old `/alignment validate` -> `/gloss-review benchmark`
- old `/alignment audit` -> `/gloss-review audit`

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- Stage 8 completeness audit
- benchmark/evidence-based evaluation
- syntax/lemma agreement checks where applicable
- pass/block/promote decision for Stage 8
- QA metrics and unresolved-issue reporting

### Does not own
- primary Stage 8 generation
- treebank generation ownership
- source extraction or witness normalization
- DB import/build promotion
- reader/product QA

---

## Evaluation Model

### Evaluate for
1. completeness
2. contextual correctness
3. morphology/lemma agreement
4. benchmark performance where gold data exists
5. unresolved high-severity issue count

### Hard rule
A Stage 8 artifact is not done because a JSON file exists.

It is only ready for promotion when this skill records an explicit decision.

### Possible decisions
- `blocked`
- `draft`
- `promoted-for-ship`

---

## Workflows

## `/gloss-review audit`

Use to inspect a candidate Stage 8 artifact before or after repairs.

### Check for
- empty glosses / silent gaps
- morphology-number disagreement
- glosses too long for Hamilton conventions
- dictionary cruft
- Greek leaking into English glosses
- participle/non-participle mismatch
- known recurring failure types
- unresolved flags emitted by the builder

### Typical audit method
1. find files matching the target text/range/pattern
2. sample candidate words, especially non-static or changed glosses
3. compare against morphology and, where available, treebank constraints
4. count failure categories
5. report severity and likely remediation path
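
The counting step can be sketched as below. This is a minimal illustration, not the project's audit code: the word-record schema (a dict with a `gloss` field), the category names, and the `MAX_GLOSS_WORDS` limit are all assumptions, and only three of the checks listed above are covered.

```python
import re
from collections import Counter

# Greek and Coptic (U+0370-03FF) plus Greek Extended (U+1F00-1FFF).
GREEK_RE = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

# Hypothetical brevity limit; the real one belongs in docs/hamilton-interlinear.md.
MAX_GLOSS_WORDS = 4


def audit_words(words, max_gloss_words=MAX_GLOSS_WORDS):
    """Count failure categories over a list of word records.

    Each record is assumed to be a dict with an optional "gloss" string;
    this schema is an illustration, not the builder's actual format.
    """
    counts = Counter()
    for word in words:
        gloss = (word.get("gloss") or "").strip()
        if not gloss:
            counts["empty_gloss"] += 1
            continue  # nothing else to check on an empty gloss
        if GREEK_RE.search(gloss):
            counts["greek_in_gloss"] += 1
        # Hyphen-joined words count separately toward the length limit.
        if len(re.split(r"[\s-]+", gloss)) > max_gloss_words:
            counts["gloss_too_long"] += 1
    return counts


sample = [
    {"form": "μῆνιν", "gloss": "wrath"},
    {"form": "ἑλώρια", "gloss": ""},
    {"form": "ψυχάς", "gloss": "souls ψυχή"},
    {"form": "ἔθηκε", "gloss": "he put down in that very place"},
]
print(dict(audit_words(sample)))
```

Counting by category rather than by file keeps the report aligned with the severity and remediation step: each category maps to a different repair path.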

---

## `/gloss-review benchmark`

Use to compare candidate output against verified gloss resources.

### Primary benchmark sources
- `data/hamilton_ground_truth/anabasis_glosses.json`
- `data/hamilton_ground_truth/aesop_glosses.json`
- `data/hamilton_ground_truth/iliad_glosses.json`
- `data/hamilton_ground_truth/combined_glosses.json`

### Benchmark command
```bash
nix-shell -p python3 --run "python3 scripts/benchmark_glosses.py"
```

### Report
- coverage
- exact match
- semantic match
- mismatch / unresolved rates
- benchmark availability limits

If no ground truth exists for the target, record that explicitly and fall back to sampled semantic review plus treebank agreement where possible.
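
As a rough sketch of how coverage and exact match might be computed, assume candidate and gold glosses are both available as dicts keyed by a token id. That shape is an assumption for illustration; it is not the documented format of the ground-truth files or of `scripts/benchmark_glosses.py`. Semantic match needs human or model judgment and is left out.

```python
def benchmark(candidate, gold):
    """Compare candidate glosses against gold glosses keyed by token id.

    Both arguments are assumed to be dicts of token id -> gloss string.
    Rates are computed over the gold tokens.
    """
    if not gold:
        # No ground truth: say so explicitly instead of reporting a fake score.
        return {"ground_truth": False, "coverage": None, "exact_match": None}

    def norm(text):
        # Case- and whitespace-insensitive comparison.
        return " ".join(text.lower().split())

    covered = [k for k in gold if norm(candidate.get(k) or "")]
    exact = [k for k in covered if norm(candidate[k]) == norm(gold[k])]
    return {
        "ground_truth": True,
        "coverage": len(covered) / len(gold),
        "exact_match": len(exact) / len(gold),
    }


gold = {"t1": "wrath", "t2": "sing", "t3": "goddess", "t4": "of-Achilles"}
candidate = {"t1": "Wrath", "t2": "chant", "t3": ""}
print(benchmark(candidate, gold))
print(benchmark(candidate, {}))
```

Returning `ground_truth: False` with empty scores keeps the "no ground truth" case visible in the report rather than letting it look like a zero or a pass.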

---

## `/gloss-review promote`

Use to decide whether candidate Stage 8 output is ready for downstream promotion.

### Promotion checklist
- completeness metrics computed
- benchmark attached or absence explained
- unresolved high-severity issues reviewed
- treebank/lemma agreement checked where applicable
- decision written explicitly

### Outcomes
- **blocked**: severe gaps or correctness issues remain
- **draft**: candidate is usable for more review but not ship-ready
- **promoted-for-ship**: Stage 8 passed and may proceed to `reader-reliability` and `new-text-ship`

### Hard rule
Do not import into `editions.db` or rebuild the server here.
Promotion in this skill means **evaluation approval**, not shipping.
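
One way to make the checklist mechanical is a small decision function. The sketch below is illustrative only: the `Stage8Evidence` fields and the `max_blank_rate` threshold are hypothetical names and values, not part of the pipeline, and the function only returns a decision string. It does not import or build anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage8Evidence:
    blank_rate: Optional[float]        # None means completeness was never computed
    benchmark_attached: bool           # a benchmark result is present
    benchmark_absence_explained: bool  # or a no-ground-truth note is recorded
    severe_unresolved: int             # open high-severity issues
    agreement_checked: bool            # treebank/lemma agreement, where applicable


def decide(e, max_blank_rate=0.0):
    """Return 'blocked', 'draft', or 'promoted-for-ship'.

    max_blank_rate is a hypothetical threshold, not a documented project value.
    """
    if e.blank_rate is None:
        return "blocked"  # no metrics, so nothing to promote on
    if e.severe_unresolved > 0 or e.blank_rate > max_blank_rate:
        return "blocked"  # severe gaps or correctness issues remain
    if not (e.benchmark_attached or e.benchmark_absence_explained):
        return "draft"    # usable for more review, evidence incomplete
    if not e.agreement_checked:
        return "draft"
    return "promoted-for-ship"


print(decide(Stage8Evidence(0.0, True, False, 0, True)))
print(decide(Stage8Evidence(0.0, False, False, 0, True)))
print(decide(Stage8Evidence(0.03, True, False, 0, True)))
```

The function has no default "pass" path: every promotion requires each checklist item to be affirmatively true, which matches the rule that promotion is explicit and never inferred from file existence.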

---

## `/gloss-review fix`

Use for remediation in service of another audit/benchmark cycle.

### Appropriate uses
- dictionary propagation
- morphology-aware cleanup
- lemma correction from gold data
- text-specific recurring fixes
- formatting/style normalization

### Typical scripts
```bash
nix-shell -p python3 --run "python3 scripts/update_glosses_from_dict.py 'PATTERN'"
# (improve_glosses.py has been removed — gloss improvement is now inline in the pipeline)
```

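Homer lemma correction, when applicable. The Key Files table marks the Homer lemma repair script and its gold data as deleted, so confirm the script exists before running it: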
```bash
nix-shell -p python3 --run "python3 scripts/correct_lemmas_from_parrish.py"
```

### Constraint
Fixes here are subordinate to evaluation.
This skill should not become a second Stage 8 builder.

---

## `/gloss-review dictionary`

Use when benchmark/audit evidence shows a static dictionary addition or correction is warranted.

### Check entry
No standalone lookup command is currently available: `lsj_lookup.go` has been removed, and dictionary lookup is now handled by the pipeline.

### After editing dictionary entries
Rerun the dictionary update:
```bash
nix-shell -p python3 --run "python3 scripts/update_glosses_from_dict.py"
```

Then re-audit and re-benchmark before any promotion decision.

---

## `/gloss-review status`

Summarize:
- files/text ranges in scope
- completeness findings
- benchmark availability and latest result
- unresolved severe issues
- current decision state: blocked/draft/promoted-for-ship
- whether the next step is more repair, `reader-reliability`, or `new-text-ship`

---

## Quality Metrics

Track these at minimum:

| Metric | Use |
|---|---|
| completeness / blank-rate | detect visible Stage 8 gaps |
| exact + semantic benchmark match | estimate contextual correctness |
| number agreement | catch morphology mismatches |
| participle form quality | catch grammar/style failures |
| gloss length | enforce Hamilton-style brevity |
| unresolved severe issue count | gate promotion |

---

## Outputs

### Current repo-facing outputs
- evaluation findings against `${LYCEUM_TEXTS_DIR:-output/texts}/<slug>/interlinear/` (pipeline workspace)
- benchmark results from `scripts/benchmark_glosses.py`
- explicit promotion/block notes

### Target pipeline outputs
Eventually this skill should write to the canonical workspace:

```text
$LYCEUM_TEXTS_DIR/<slug>/
├── state.json
├── interlinear/
├── qa/interlinear-report.md
└── replay/stage-history.json
```

At minimum, record:
- completeness metrics
- benchmark summary
- unresolved issue summary
- pass/block/promote decision
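
A minimal sketch of recording those four items in `state.json` follows. The `stage8_evaluation` key and the shape of the record are hypothetical; the actual `state.json` schema is not specified here. The example writes into a temporary directory rather than a real workspace.

```python
import json
import tempfile
from pathlib import Path

ALLOWED = {"blocked", "draft", "promoted-for-ship"}


def record_decision(workspace, decision, metrics, benchmark, unresolved):
    """Merge a Stage 8 evaluation record into <workspace>/state.json."""
    if decision not in ALLOWED:
        raise ValueError(f"unknown decision: {decision!r}")
    path = Path(workspace) / "state.json"
    # Preserve whatever other stages already wrote.
    state = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # "stage8_evaluation" is a hypothetical key, not a confirmed schema.
    state["stage8_evaluation"] = {
        "decision": decision,
        "completeness": metrics,
        "benchmark": benchmark,   # a summary dict, or a no-ground-truth note
        "unresolved": unresolved,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(state, ensure_ascii=False, indent=2) + "\n", encoding="utf-8"
    )
    return state


with tempfile.TemporaryDirectory() as tmp:
    state = record_decision(
        tmp,
        "draft",
        {"blank_rate": 0.02},
        "no ground truth for this text",
        ["12 empty glosses"],
    )
    print(state["stage8_evaluation"]["decision"])
```

Rejecting any decision outside the three allowed values keeps the recorded state consistent with the possible decisions this skill defines.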

---

## Verification Contract

This skill follows the Stage 8 evaluator contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- completeness metrics were computed
- benchmark against ground truth ran where available
- syntax/lemma agreement checks ran where applicable
- Stage 8 pass/block/promote decision was written explicitly

### Minimum evidence
- interlinear evaluation report
- benchmark result or explicit no-ground-truth note
- decision state recorded in notes/state

### Pass criteria
- metrics are present and scoped to the target text/range
- benchmark output is attached or absence is explained
- unresolved high-severity issues block promotion
- promotion is explicit, not inferred from file existence

### Failure examples
- Stage 8 treated as done because artifacts exist
- benchmark/completeness skipped without explanation
- promotion occurs while severe unresolved issues remain

### Required next steps
After `promoted-for-ship`, run:
- `/reader-reliability audit <work>` / `/reader-reliability verify <work>` when available
- future `/new-text-ship promote <work>` for final import/build/review-pack

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_8.sh "${SLUG}"
```

Exit codes: 0=PASS (advance), 1=FAIL (block), 2=WARN (advance with notes).
The orchestrator runs this automatically; when executing manually, check the output for [FAIL] or [WARN] lines.
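
For manual runs driven from Python, the exit-code contract above could be wrapped as in this sketch. The helper names are hypothetical, and treating unknown exit codes as FAIL is an assumption rather than documented behavior of `verify_stage_8.sh`.

```python
import subprocess

EXIT_STATUS = {0: "PASS", 1: "FAIL", 2: "WARN"}


def interpret_exit(code):
    """Map a verification exit code to (status, may_advance)."""
    status = EXIT_STATUS.get(code, "FAIL")  # assumption: unknown codes block
    return status, status in ("PASS", "WARN")


def run_verification(slug):
    """Run the script and return (status, may_advance, flagged_lines)."""
    proc = subprocess.run(
        ["bash", "scripts/verify_stage_8.sh", slug],
        capture_output=True,
        text=True,
        check=False,  # non-zero exits are expected outcomes, not exceptions
    )
    flagged = [
        line
        for line in proc.stdout.splitlines()
        if "[FAIL]" in line or "[WARN]" in line
    ]
    status, may_advance = interpret_exit(proc.returncode)
    return status, may_advance, flagged
```

Calling `run_verification(slug)` from the repository root returns the flagged lines alongside the status, so a WARN result can carry its notes forward.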

---

## Key Files

| File | Purpose |
|---|---|
| `scripts/benchmark_glosses.py` | Ground-truth benchmark path |
| `scripts/update_glosses_from_dict.py` | QA-driven remediation |
| *(deleted — gloss improvement now handled by pipeline)* | Morphology/style remediation |
| *(deleted)* | Homer lemma repair |
| *(deleted — dictionary lookup now handled by pipeline)* | Static dictionary maintenance |
| `data/hamilton_ground_truth/` | Verified gloss benchmarks |
| *(deleted — gold standard data no longer vendored)* | Verified Homer lemma data |
| `${LYCEUM_TEXTS_DIR:-output/texts}/<slug>/interlinear/` (pipeline workspace) | Candidate/reviewed Stage 8 artifacts |
| `scripts/eval_token_glosses.py` | Binary eval runner for autoresearch (token-level quality checks) |
| `scripts/eval_segment_translations.py` | Binary eval runner for autoresearch (segment-level quality checks) |
| `docs/ground-truth/morphology-card-rubric.md` | Per-word morphology card quality rubric |
| `docs/hamilton-interlinear.md` | Style standard |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Canonical ownership model |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Canonical verification contract |

---

## Error Type Reference

| Error Type | Example | Likely remediation |
|---|---|---|
| wrong LSJ sense | προΐαψεν → "project" | treebank evidence + repair + re-benchmark |
| missing rare word | ἑλώρια → empty | dictionary expansion or explicit unresolved flag |
| generic dictionary default | ἔθηκε → "put, place" | contextual evidence + re-evaluate |
| singular-for-plural | ψυχάς → "soul" | morphology-aware cleanup |
| non-ing participle | φέρων → "bring, carry" | morphology-aware cleanup |
| no hyphens | "of Achilles" | Hamilton-style cleanup |
| etymology over usage | δῖος → "of Zeus" | contextual evidence + benchmark review |
