---
name: alignment
description: Stage 8 interlinear builder — assemble candidate word-level alignment/interlinear artifacts from dictionary, treebank, witness, and ground-truth inputs. Use for build/repair/status of Stage 8 artifacts, not evaluation or import.
allowed-tools: Bash(python3:*), Bash(sqlite3:*), Bash(go:*), Bash(find:*), Bash(wc:*), Bash(ls:*), Bash(curl:*), Bash(git:*), Read, Write, Edit, Grep, Glob
---

# Alignment (Stage 8 Builder)

Build candidate Stage 8 interlinear artifacts for one text at a time.

This skill is the **builder/orchestrator** for Stage 8.

**Primary tool:** `scripts/llm_interlinear.py` — an LLM-powered
Creator→Skeptic→Referee adversarial pipeline that generates contextual
glosses. Supports two modes:
- `--mode agent` — spawns `pi -p` subprocesses (no API key needed)
- `--mode api` — calls Anthropic API directly (needs `ANTHROPIC_API_KEY`)

The pipeline combines evidence from:
- English translations (from `editions.db`)
- morphology (from `morph.db`)
- static dictionary glosses (looked up inside the pipeline; the standalone dictionary file was deleted)
- existing alignment glosses
- adversarial LLM review (Skeptic challenges, Referee adjudicates)

It does **not** own:
- final Stage 8 evaluation or promotion
- ground-truth benchmarking authority (though it runs Hamilton benchmarks inline)
- treebank generation/validation ownership
- DB import or server rebuild (though `--import` is available as a convenience)
- reader/product QA

Future preferred name: `interlinear-build`.

## Quick Status

Alignment files: !`find ${LYCEUM_TEXTS_DIR:-output/texts} -path "*/interlinear/*" -name "*.json" | wc -l`
Files needing review: !`grep -rl '"needs_gloss_review": true' ${LYCEUM_TEXTS_DIR:-output/texts}/*/interlinear/ 2>/dev/null | wc -l`
DB aligned segments: !`nix-shell -p sqlite --run "sqlite3 data/editions.db 'SELECT COUNT(*) FROM aligned_segments'" 2>/dev/null`
DB aligned words: !`nix-shell -p sqlite --run "sqlite3 data/editions.db 'SELECT COUNT(*) FROM aligned_words'" 2>/dev/null`

## Commands

- `/alignment plan [work]` — Decide Stage 8 build scope, evidence sources, and artifact targets
- `/alignment build [work]` — Build candidate Stage 8 interlinear/alignment artifacts
- `/alignment repair [work|file-pattern]` — Apply targeted repair passes to existing Stage 8 artifacts
- `/alignment status [work]` — Summarize Stage 8 build status and unresolved flags

## Command Compatibility

Legacy commands should map like this:
- old `/alignment generate` -> `/alignment build`
- old `/alignment review` -> `/alignment repair`
- old `/alignment audit` -> `/gloss-review audit`
- old `/alignment validate` -> `/gloss-review benchmark` or future `/interlinear-eval benchmark`
- old `/alignment import` -> future `/new-text-ship promote`
- old `/alignment treebank` -> `/treebank run --scope import` or `/treebank export`
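
As an illustration only, the mapping above can be written as a small lookup table. `LEGACY_COMMANDS` and `translate_legacy` are hypothetical names, not part of the repo. Where the list offers two targets, this sketch picks the first, and targets marked "future" may not exist yet.

```python
# Legacy /alignment subcommands mapped to their replacements.
# Assumption: when two targets are listed, the first one is used.
LEGACY_COMMANDS = {
    "generate": "/alignment build",
    "review": "/alignment repair",
    "audit": "/gloss-review audit",
    "validate": "/gloss-review benchmark",
    "import": "/new-text-ship promote",  # marked "future" in the list above
    "treebank": "/treebank run --scope import",
}


def translate_legacy(command: str) -> str:
    """Rewrite a legacy /alignment command; return other commands unchanged."""
    parts = command.split()
    if len(parts) < 2 or parts[0] != "/alignment":
        return command
    replacement = LEGACY_COMMANDS.get(parts[1])
    if replacement is None:
        return command  # already a current subcommand (plan, build, ...)
    # Keep any trailing arguments such as the work name.
    return " ".join([replacement, *parts[2:]])
```

For example, `translate_legacy("/alignment generate iliad")` returns `"/alignment build iliad"`, while current subcommands such as `status` pass through unchanged.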

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- Stage 8 artifact assembly
- combining dictionary, witness, treebank, and benchmark-derived evidence
- generating candidate alignment/interlinear JSON
- recording unresolved review flags and evidence provenance
- repair passes that improve candidate Stage 8 artifacts without promoting them

### Does not own
- final pass/block/promote decision
- benchmark authority
- treebank ownership
- final DB import/build steps
- reader reliability verification

---

## Performance

Stage 8 interlinear generation is the most time-intensive pipeline stage due to LLM processing overhead.

### Timing Benchmarks

**Agent mode** (`--mode agent`):
- ~5 minutes per 100 words
- Dominated by subprocess cold-start overhead (3 subprocesses per batch: Creator, Skeptic, Referee)
- Each `pi -p` subprocess incurs ~5-10s initialization cost
- Example: Meditations Book 1 (1600 words, 17 sections) = ~90 minutes wall time

**API mode** (`--mode api`):
- ~30 seconds per 100 words
- 5-10x faster than agent mode
- Requires `ANTHROPIC_API_KEY` environment variable
- Recommended for texts with >500 total words
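
The throughput figures above allow a back-of-envelope estimate before choosing a mode. This is a minimal sketch that assumes wall time scales linearly with word count. `estimate_wall_seconds` is a hypothetical helper, not a function in `llm_interlinear.py`.

```python
# Rough throughput quoted in this section, in seconds per 100 words.
SECONDS_PER_100_WORDS = {"agent": 5 * 60, "api": 30}


def estimate_wall_seconds(word_count: int, mode: str) -> float:
    """Return a linear wall-time estimate in seconds for the given mode."""
    if mode not in SECONDS_PER_100_WORDS:
        raise ValueError(f"unknown mode: {mode!r}")
    return word_count / 100 * SECONDS_PER_100_WORDS[mode]


if __name__ == "__main__":
    for mode in ("agent", "api"):
        minutes = estimate_wall_seconds(1600, mode) / 60
        print(f"{mode}: ~{minutes:.0f} min for 1600 words")
```

For 1600 words this gives about 80 minutes in agent mode and about 8 minutes in API mode. The agent figure is somewhat under the ~90 minutes observed for Meditations Book 1, so treat the estimate as rough rather than as an upper bound.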

### Automatic Section Splitting

The LLM pipeline (`llm_interlinear.py`) automatically splits sections >200 words into smaller batches
to maintain reasonable context size and avoid token limits. Very large sections (e.g., 400-500 words)
are handled gracefully without manual intervention.
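
The exact splitting logic lives in `llm_interlinear.py` and is not reproduced here. The sketch below shows one plausible approach under stated assumptions: a section is a list of words, the limit is 200 words, and an oversized section is cut into near-equal batches. `split_section` and `MAX_WORDS_PER_BATCH` are hypothetical names.

```python
import math

# Threshold quoted above; the real constant name in the pipeline is unknown.
MAX_WORDS_PER_BATCH = 200


def split_section(words: list[str], max_words: int = MAX_WORDS_PER_BATCH) -> list[list[str]]:
    """Split a section into near-equal batches of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    if not words:
        return []
    if len(words) <= max_words:
        return [words]
    # Use the fewest batches that respect the limit, then even out their sizes
    # so the last batch is not a tiny remainder.
    n_batches = math.ceil(len(words) / max_words)
    size = math.ceil(len(words) / n_batches)
    return [words[i:i + size] for i in range(0, len(words), size)]
```

Under these assumptions a 450-word section becomes three batches of 150 words instead of 200 + 200 + 50, which keeps the context size similar across calls.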

### Performance Recommendations

1. **For small texts (<500 words)**: Agent mode is acceptable if no API key is available
2. **For medium/large texts (>500 words)**: Strongly recommended to use API mode
3. **Set `ANTHROPIC_API_KEY`** in your environment before running Stage 8 if available
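4. **Timeout configuration**: In agent mode, allow at least 1800s (30 min) for small texts and at least 3600s (1 hour) for texts >1000 words. At ~5 minutes per 100 words, texts well over 1000 words need more (the 1600-word Meditations Book 1 run took ~90 minutes).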

### Cost Estimation

Before running expensive builds, use the dry-run mode to estimate token usage and cost:

```bash
# Workspace pipeline
python3 scripts/generate_workspace_interlinear.py --workspace $LYCEUM_TEXTS_DIR/<slug> --dry-run

# Existing texts in DB
python3 scripts/llm_interlinear.py --text iliad --book 1 --dry-run
```

---

## ⚠️ MANDATORY GATE: gloss-review

**After completing any `/alignment build`, you MUST run `/skill:gloss-review evaluate` before proceeding to Stage 9 or Stage 10.**

Stage 8 has two phases:

| Phase | Skill | Purpose |
|-------|-------|---------|
| 8a | `alignment build` | Generate candidate glosses |
| 8b | `gloss-review evaluate` | Audit and approve/block |

**Never skip 8b.** The builder optimizes for completion; the evaluator optimizes for correctness. Self-approval bias is why these are separate skills.

Correct sequence:
```
/alignment build [work]
    ↓
/skill:gloss-review evaluate [work]   ← REQUIRED
    ↓
/skill:reader-reliability [work]      (Stage 9)
    ↓
/skill:new-text-ship [work]           (Stage 10)
```
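
The ordering can be expressed as a simple precondition check. This is a sketch, not the orchestrator's actual logic. `SEQUENCE` and `can_run` are hypothetical names, and the step labels are shortened forms of the commands in the sequence.

```python
# Required order of steps, from Stage 8a through Stage 10.
SEQUENCE = [
    "alignment build",        # Stage 8a
    "gloss-review evaluate",  # Stage 8b (mandatory gate)
    "reader-reliability",     # Stage 9
    "new-text-ship",          # Stage 10
]


def can_run(step: str, completed: set[str]) -> bool:
    """A step may run only when every earlier step in SEQUENCE has completed."""
    idx = SEQUENCE.index(step)  # raises ValueError for an unknown step
    return all(prev in completed for prev in SEQUENCE[:idx])
```

With only `alignment build` completed, `can_run("reader-reliability", ...)` is false, which is the "never skip 8b" rule in code form.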

---

## Primary Build Model

The primary builder is `scripts/llm_interlinear.py`, which runs a
Creator→Skeptic→Referee adversarial loop. For existing texts already in
`editions.db` (Iliad, Odyssey, John, Aesop, Meditations), this script reads
directly from the DB — stages 1-7 are already done for those texts.

### Quick start (existing texts)

```bash
# Dry-run to see cost estimate
python3 scripts/llm_interlinear.py --text iliad --book 1 --start 1 --end 50 --dry-run

# Generate glosses (agent mode, no API key needed)
python3 scripts/llm_interlinear.py --text iliad --book 1 --start 1 --end 50 --mode agent -v -y

# Generate + import into editions.db + rebuild server
python3 scripts/llm_interlinear.py --text iliad --book 1 --mode agent -v -y --import
```

### Evidence consumed (per batch)
1. English translation text (from `editions.db`)
2. Greek morphology — lemma, POS, morphological features (from `morph.db`)
3. Static dictionary glosses (looked up inside the pipeline; the standalone dictionary file was deleted)
4. Existing alignment glosses (from `aligned_words` table)
5. Adversarial LLM review (Skeptic challenges → Referee adjudicates)

### Build outputs
- Candidate alignment JSON (pipeline format) in `${LYCEUM_TEXTS_DIR:-output/texts}/<slug>/interlinear/` (pipeline workspace)
- Per-word provenance: which agent, which model, challenged or not, reasoning
- Hamilton benchmark score (when ground truth exists)
- Run logs in `data/interlinear_runs/<run_id>/` (calls, batches, summary)

### Legacy builders (still available)
- `scripts/text_pipeline_alignment.go`: heuristic-based pipeline (LSJ scoring); it replaced an earlier script that has been deleted
- `scripts/update_glosses_from_dict.py`: dictionary propagation
- morphology-aware cleanup: the standalone script was deleted; gloss improvement is now handled by the pipeline

### Hard rule
Do not silently treat file creation as completion.

This skill produces **candidate** Stage 8 outputs.
Promotion belongs to `gloss-review` / future `interlinear-eval`.

---

## Workflows

## `/alignment plan`

Use before building or repairing a text.

### Decide
1. target text / range / file scope
2. current Stage 8 artifact state
3. which evidence sources exist and should be consumed
4. whether build mode should be:
   - `default`
   - `treebank-first`
   - `repair-only`
5. which downstream evaluator run is expected next

### Report
- input artifacts available
- missing evidence sources
- expected outputs
- expected unresolved flags

---

## `/alignment build`

Use to create or refresh candidate Stage 8 artifacts.

### CRITICAL: Use LLM Pipeline for All Texts

**The LLM adversarial pipeline is REQUIRED for quality glosses.** Do NOT use
the Go-based `text_pipeline_alignment.go` script — it produces dictionary-style
glosses ("to be", "he, she, it; self") instead of contextual glosses ("was", "his").

### Workspace texts (text pipeline)

For texts going through the `$LYCEUM_TEXTS_DIR/<slug>/` workspace pipeline, use:

```bash
# Generate all chapters (recommended)
python3 scripts/generate_workspace_interlinear.py --workspace $LYCEUM_TEXTS_DIR/<slug> --mode agent -v

# Generate a single chapter
python3 scripts/llm_interlinear.py --workspace $LYCEUM_TEXTS_DIR/<slug> --chapter 1 --mode agent -v -y

# Dry-run to estimate cost
python3 scripts/generate_workspace_interlinear.py --workspace $LYCEUM_TEXTS_DIR/<slug> --dry-run
```

This produces `interlinear/chapter_XX_llm.json` files with contextual glosses that
`scripts/import_workspace.go` will read during Stage 10 import.

### Existing texts in editions.db

For texts already in the database (Iliad, Odyssey, etc.):

```bash
# Generate glosses for a range (agent mode, no API key)
python3 scripts/llm_interlinear.py --text iliad --book 1 --start 1 --end 50 --mode agent -v -y

# Generate glosses for a full book
python3 scripts/llm_interlinear.py --text iliad --book 1 --mode agent -v -y

# Generate + import into DB + rebuild server (convenience)
python3 scripts/llm_interlinear.py --text iliad --book 1 --mode agent -v -y --import

# API mode (requires ANTHROPIC_API_KEY)
python3 scripts/llm_interlinear.py --text iliad --book 1 --mode api -v -y

# Check cost estimate first
python3 scripts/llm_interlinear.py --text iliad --book 1 --dry-run
```

Available texts: `iliad`, `odyssey`, `john`, `aesop`, `meditations`

### Pipeline flow
1. Extract batches from `editions.db` + `morph.db` (via `crane_extract.py`)
2. Creator agent generates contextual glosses per word
3. Skeptic agent challenges incorrect/weak glosses
4. Referee agent adjudicates disputes + spot-checks
5. Merge results with per-word provenance
6. Benchmark against Hamilton ground truth (where available)
7. Write candidate JSON + run logs + token-level GT
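
The core of steps 2 to 5 can be sketched as a single function that takes the three agents as plain callables. This is an illustration under stated assumptions, not the code in `llm_interlinear.py`: `WordGloss`, `run_batch`, and the callable signatures are hypothetical, the Referee's spot-checks are omitted, and the real agents work on whole batches through LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WordGloss:
    """One glossed token plus minimal provenance (hypothetical shape)."""
    index: int
    token: str
    gloss: str
    decided_by: str = "creator"
    challenged: bool = False


def run_batch(
    tokens: list[str],
    create: Callable[[list[str]], list[str]],
    challenge: Callable[[str, str], Optional[str]],
    adjudicate: Callable[[str, str, str], str],
) -> list[WordGloss]:
    """Run one Creator -> Skeptic -> Referee pass over a batch of tokens."""
    proposed = create(tokens)  # Creator: one gloss per token
    if len(proposed) != len(tokens):
        raise ValueError("creator must return exactly one gloss per token")
    results = []
    for i, (token, gloss) in enumerate(zip(tokens, proposed)):
        word = WordGloss(index=i, token=token, gloss=gloss)
        alternative = challenge(token, gloss)  # Skeptic: None means "accept"
        if alternative is not None:
            word.challenged = True
            # Referee: choose between the original gloss and the challenge.
            word.gloss = adjudicate(token, gloss, alternative)
            word.decided_by = "referee"
        results.append(word)
    return results
```

With stub callables in which the Creator proposes the dictionary-style "to be" for ἦν, the Skeptic counters with "was", and the Referee sides with the Skeptic, the merged result carries "was" with `challenged=True` and `decided_by="referee"`. Unchallenged words keep the Creator's gloss.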

### Legacy path (still available)

```bash
# Old heuristic pipeline
nix-shell -p go --run "go run scripts/text_pipeline_alignment.go --text iliad"

# Dictionary propagation
nix-shell -p python3 --run "python3 scripts/update_glosses_from_dict.py 'PATTERN'"

# Morphology-aware cleanup
# (improve_glosses.py has been removed — gloss improvement is now inline in the pipeline)
```

### Build guidance
- prefer the LLM pipeline for new/replacement gloss generation
- use legacy repair scripts for targeted fixes to existing files
- do not silently overwrite reviewed outputs without recording what changed

---

## `/alignment repair`

Use for targeted remediation of candidate Stage 8 artifacts.

### Appropriate repairs
- dictionary propagation
- morphology-aware transformations
- lemma corrections from known gold data
- normalization or formatting cleanup
- targeted gap filling where evidence exists

### Not appropriate here
- declaring the artifact promoted
- treating local repairs as a substitute for benchmark/evaluation
- importing into DB or rebuilding the server

### Typical repair scripts
```bash
nix-shell -p python3 --run "python3 scripts/update_glosses_from_dict.py 'PATTERN'"
# (improve_glosses.py has been removed — gloss improvement is now inline in the pipeline)
nix-shell -p python3 --run "python3 scripts/correct_lemmas_from_parrish.py"
```

For regenerating a range with the LLM pipeline instead of repairing:
```bash
python3 scripts/llm_interlinear.py --text iliad --book 1 --start 50 --end 100 --mode agent -v -y
```

---

## `/alignment status`

Summarize for one text or a file set:
- candidate artifact coverage
- unresolved flag counts
- evidence sources available
- which `gloss-review` step is required next (`audit`, `benchmark`, or `promote`)
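
A per-text version of the Quick Status counts can be sketched in a few lines. This assumes the workspace layout `<texts_dir>/<slug>/interlinear/*.json` and the literal flag text `"needs_gloss_review": true` used by the grep in Quick Status. `stage8_status` is a hypothetical helper.

```python
from pathlib import Path

# Same literal string the Quick Status grep looks for.
FLAG = '"needs_gloss_review": true'


def stage8_status(texts_dir: str, slug: str) -> dict[str, int]:
    """Count interlinear JSON files for one text and how many carry the flag."""
    interlinear = Path(texts_dir) / slug / "interlinear"
    files = sorted(interlinear.glob("*.json")) if interlinear.is_dir() else []
    flagged = [f for f in files if FLAG in f.read_text(encoding="utf-8")]
    return {"files": len(files), "needs_review": len(flagged)}
```

A substring match keeps the sketch equivalent to the grep. It would miss flags serialized with different whitespace, so parsing the JSON is the safer choice once the artifact schema is confirmed.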

---

## Known Inputs

| Input | Source |
|---|---|
| Greek text + English translations | `editions.db` (via `crane_extract.py` batching) |
| Morphology (lemma, POS, features) | `morph.db` |
| Static dictionary / LSJ heuristics | Handled inside the pipeline (standalone dictionary file deleted) |
| Existing alignment glosses | `aligned_words` table in `editions.db` |
| Treebank constraints | `treebank` skill (acquired on demand) |
| Homer lemma corrections | No longer vendored (gold-standard data deleted) |

---

## Outputs

### Current repo-facing outputs
- candidate/updated alignment JSON in `${LYCEUM_TEXTS_DIR:-output/texts}/<slug>/interlinear/` (pipeline workspace)
- unresolved review flags in the same artifacts

### Target pipeline outputs
Eventually this skill should read/write the canonical workspace:

```text
$LYCEUM_TEXTS_DIR/<slug>/
├── manifest.json
├── state.json
├── interlinear/
├── qa/interlinear-report.md
└── replay/stage-history.json
```

At minimum, the skill should update:
- Stage 8 build status
- inputs consumed
- outputs written
- notes on unresolved flags

---

## Verification Contract

This skill follows the Stage 8 builder contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- candidate Stage 8 artifacts were produced
- provider inputs were incorporated as intended
- unresolved review flags are recorded
- missing values are visible and classified, not silent
- state/history updates are correct when using the canonical workspace

### Minimum evidence
- candidate alignment/interlinear JSON
- evidence notes on provider inputs used
- unresolved flag list
- updated Stage 8 build state

### Pass criteria
- output covers the requested text/range/file scope
- provider inputs used are recorded in notes or state
- words lacking confident glosses are flagged explicitly
- output is structurally consumable by `gloss-review` / future `interlinear-eval`
- no DB import/build step is treated as part of success

### Failure examples
- candidate output exists but does not record evidence sources
- empty gloss/morph fields appear without unresolved flags
- builder silently promotes/imports output

### Required next step
After a successful build or repair pass, run:
- `/gloss-review audit <work>`
- `/gloss-review benchmark <work>`
- `/gloss-review promote <work>` when ready

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_8.sh "${SLUG}"
```

Exit codes: 0=PASS (advance), 1=FAIL (block), 2=WARN (advance with notes).
The orchestrator runs this automatically; when executing manually, check the output for [FAIL] or [WARN] lines.
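
When scripting the manual check, the exit-code contract can be wrapped as follows. This is a sketch: `interpret_exit_code` and `verify_stage_8` are hypothetical helpers, and treating any code other than 0, 1, or 2 as blocking is an assumption, not documented behavior of `verify_stage_8.sh`.

```python
import subprocess

# Exit-code contract stated above.
VERDICTS = {0: "PASS", 1: "FAIL", 2: "WARN"}


def interpret_exit_code(code: int) -> tuple[str, bool]:
    """Map an exit code to (verdict, may_advance)."""
    verdict = VERDICTS.get(code, "UNKNOWN")  # assumption: unknown codes block
    return verdict, verdict in ("PASS", "WARN")


def verify_stage_8(slug: str) -> tuple[str, bool]:
    """Run the verification script and surface its [FAIL]/[WARN] lines."""
    proc = subprocess.run(
        ["bash", "scripts/verify_stage_8.sh", slug],
        capture_output=True,
        text=True,
    )
    for line in proc.stdout.splitlines():
        if "[FAIL]" in line or "[WARN]" in line:
            print(line)
    return interpret_exit_code(proc.returncode)
```

A `WARN` verdict still advances, so the printed `[WARN]` lines should be copied into the stage notes instead of being discarded.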

---

## Key Files

| File | Purpose |
|---|---|
| `scripts/llm_interlinear.py` | **Primary builder** — LLM adversarial pipeline (Creator→Skeptic→Referee) |
| `scripts/crane_extract.py` | Batch extraction from `editions.db` + `morph.db` (used by `llm_interlinear.py`) |
| `scripts/crane_write.py` | Convert pipeline output to import-ready alignment JSON |
| `scripts/import_workspace.go` | Import alignment JSON into `editions.db` |
| *(deleted — dictionary lookup now handled by pipeline)* | Static dictionary + LSJ scoring |
| `scripts/interlinear_cost_report.py` | Cost dashboard for LLM pipeline runs |
| `scripts/text_pipeline_alignment.go` | Legacy heuristic-based pipeline (LSJ scoring); replaced an earlier, now-deleted script |
| `scripts/update_glosses_from_dict.py` | Legacy: propagate dictionary updates |
| *(deleted — gloss improvement now handled by pipeline)* | Legacy: morphology-aware cleanup |
| `${LYCEUM_TEXTS_DIR:-output/texts}/<slug>/interlinear/` (pipeline workspace) | Current candidate/reviewed alignment artifacts |
| `data/interlinear_runs/` | LLM pipeline run logs (calls, batches, summaries) |
| `scripts/eval_token_glosses.py` | Binary eval runner for autoresearch (token-level quality checks) |
| `scripts/eval_segment_translations.py` | Binary eval runner for autoresearch (segment-level quality checks) |
| `docs/interlinear-overhaul-plan-2026-03-14.md` | Full pipeline architecture and rollout plan |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Canonical ownership model |

---

## Reference Notes

- The LLM pipeline (`llm_interlinear.py`) is the primary builder for existing
  texts. It reads from `editions.db`/`morph.db` directly — stages 1-7 are already
  complete for Iliad, Odyssey, John, Aesop, and Meditations.
- For new texts going through the full `$LYCEUM_TEXTS_DIR/<slug>/` workspace pipeline,
  Stage 8 runs `llm_interlinear.py --workspace` on the workspace data (or
  `generate_workspace_interlinear.py` to cover all chapters).
- The old `crane-gloss` provider skill is deprecated. `llm_interlinear.py`
  replaces it and produces higher-quality output (90.2% Hamilton accuracy in
  manual pilot vs ~58% from old pipeline).
- Treebank evidence remains the strongest disambiguation layer where available,
  but the adversarial LLM loop achieves good results without it.
- Hamilton ground truth is used as an inline benchmark, not a generation source.
- Token-level ground truth accumulates with each successful pipeline run.
