---
name: segmentation
description: Stage 4 structural segmentation skill — choose the canonical reference system, segment Greek and witness texts into stable units, produce a reference inventory, and verify that the structure is honest and reproducible.
allowed-tools: Bash(find:*), Bash(ls:*), Bash(jq:*), Bash(go:*), Bash(git:*), Read, Write, Edit, Grep, Glob
---

# Segmentation (Stage 4 Structural Segmentation)

Assign stable structure before alignment.

This skill is the **Stage 4 owner** for structural segmentation.
It should answer:
- what is the canonical reference system for this work?
- what unit is appropriate: book/chapter/section/line/verse/sentence/etc.?
- are Greek and witness structures reproducible and auditable?
- is the segmentation honest for the source material and downstream reader needs?

It does **not** own:
- witness fitness classification
- cross-language alignment mode selection
- interlinear generation

## Quick Status

Structured artifact dirs: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -path '*/structured' -type d 2>/dev/null | wc -l`
Segmentation reports: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -path '*/qa/*segmentation*' 2>/dev/null | wc -l`

## Commands

- `/segmentation run [work]` — Produce or refresh structured Greek and witness artifacts
- `/segmentation validate [work]` — Check reference stability, counts, and structural honesty
- `/segmentation inventory [work]` — Emit the reference inventory for the current segmentation
- `/segmentation status [work]` — Summarize the canonical reference system, structure quality, and blockers

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- canonical reference system selection
- structural segmentation of Greek and witnesses
- reference inventory generation
- segmentation report and stability checks

### Does not own
- witness fitness classification
- direct/DP/range alignment decisions
- word-level interlinear work

---

## Segmentation Units

Possible segmentation units include:
- book
- chapter
- section
- line
- verse
- sentence
- stanza
- speech
- liturgical unit
- fable unit

### Important rule
Choose the coarsest structure that is still honest and useful.

Do not invent false precision the source material cannot support.

### Segment size policy
**Read `docs/segmentation-policy.md` before choosing a segmentation unit.** It defines the target word count per segment and when/how to sub-segment prose texts whose source reference system is too coarse for the reader.
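
A size check of this kind can be sketched as follows. The real target comes from the policy document; `max_words=120` is only a placeholder, and the `(ref, text)` pair shape for segments is an assumption made for illustration.

```python
def oversized_segments(segments, max_words=120):
    """Return (ref, word_count) for every segment above the word budget.

    `segments` is an iterable of (ref, text) pairs. `max_words` is a
    placeholder, not the value defined in docs/segmentation-policy.md.
    """
    flagged = []
    for ref, text in segments:
        count = len(text.split())  # whitespace tokens are enough for a size check
        if count > max_words:
            flagged.append((ref, count))
    return flagged
```

Flagged segments are candidates for sub-segmentation, not automatic splits; the policy document decides when and how to split.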

## Reference System Selection Guide

Choose the reference system that matches the scholarly tradition for the text class:

| Text class | Reference system | Example refs | Notes |
|---|---|---|---|
| Biblical (NT) | book.chapter.verse | John 3.16, Rom 8.28 | Honor NA28/UBS5 verse numbering. Note 'ghost verses' (Matt 17:21, Acts 8:37 etc.) — follow the base edition's numbering. |
| Biblical (OT/LXX) | book.chapter.verse | Gen 1.1, Ps 23.1 | Note LXX vs MT psalm numbering differences (LXX Ps 9 = MT Ps 9-10). Always document which tradition you follow. |
| Epic poetry | book.line | Il. 1.1, Od. 1.1 | Line numbers from critical editions (OCT, Teubner). Never sub-segment lines. |
| Prose philosophy | book.chapter.section, or the edition's page reference | Med. 1.1, Rep. 327a | Casaubon (Meditations), Stephanus (Plato), Bekker (Aristotle). |
| History/biography | book.chapter.section | Thu. 1.1.1, Plut. Alex. 1.1 | Follow Loeb/OCT section numbering. |
| Drama | play.line | Ag. 1, OT 1 | Line numbers from critical editions. |
| Fables | collection.number | Aes. 1, Perry 1 | Perry numbers for Aesop. |
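
The example refs in the table share one shape: a work label, a space, then a dot-separated locator. A minimal sketch of splitting that display form is shown below. The function name is hypothetical, and the pipeline's stored reference format may differ from these display strings.

```python
def split_display_ref(ref):
    """Split 'John 3.16' into ('John', ['3', '16']).

    The work label is everything before the last space. The locator after it
    is split on dots, and its parts stay strings so forms like '327a' survive.
    """
    label, _, locator = ref.strip().rpartition(" ")
    if not label or not locator:
        raise ValueError(f"expected '<label> <locator>', got {ref!r}")
    return label, locator.split(".")
```

Multi-word labels such as `Plut. Alex. 1.1` work because only the last space separates label from locator.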

### Biblical Versification Notes

- **Honor existing verse numbers**, even when a verse division splits a clause (common in the Pauline epistles). The verse system is the universal reference standard.
- **Ghost verses**: Some editions include verses absent from the critical text (Matt 17:21, Mark 7:16, John 5:4, Acts 8:37, etc.). Follow the base edition's numbering and document any gaps.
- **Psalm numbering**: LXX and MT numbering agree for Ps 1-8 and 148-150. In between, the LXX number is usually one lower than the MT number, with joins and splits at MT 9-10 (LXX 9), MT 114-115 (LXX 113), MT 116 (LXX 114-115), and MT 147 (LXX 146-147). Always note which system you use.
- **Sub-verse notation**: Support `a`/`b` markers if the edition uses them (e.g., `3.16a`, `3.16b`).
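
The psalm offset and the sub-verse markers can both be handled mechanically. The sketch below is illustrative only: the function names are hypothetical, and the sub-verse pattern assumes a `chapter.verse` locator with at most one lowercase suffix letter.

```python
import re

def mt_to_lxx_psalm(mt):
    """Map a Masoretic psalm number to the LXX psalm number(s) covering it."""
    if not 1 <= mt <= 150:
        raise ValueError(f"psalm out of range: {mt}")
    if mt <= 8 or mt >= 148:
        return [mt]            # numbering coincides
    if mt in (9, 10):
        return [9]             # MT 9-10 form one psalm in the LXX
    if mt <= 113:
        return [mt - 1]
    if mt in (114, 115):
        return [113]           # MT 114-115 are joined as LXX 113
    if mt == 116:
        return [114, 115]      # MT 116 is split in the LXX
    if mt <= 146:
        return [mt - 1]
    return [146, 147]          # MT 147 is split in the LXX

SUBVERSE = re.compile(r"^(\d+)\.(\d+)([a-z]?)$")

def split_subverse(locator):
    """Split '3.16a' into (3, 16, 'a'); the suffix is '' when absent."""
    match = SUBVERSE.match(locator)
    if match is None:
        raise ValueError(f"not a chapter.verse locator: {locator!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)
```

Returning a list from `mt_to_lxx_psalm` keeps the split cases explicit instead of hiding them behind a single number.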

---

## Workflows

## `/segmentation run`

Produce structured artifacts for the target text.

### Decide
1. canonical reference system
2. segmentation unit(s)
3. how Greek structure should be represented
4. what witness-side structural information should be preserved for later stages

### Output should make later stages easier
- Stage 5 witness comparison
- Stage 6 versification
- Stage 9 reader/reference behavior

---

## `/segmentation validate`

Check that the segmentation is stable and honest.

### Validate for
- stable references
- internal count consistency
- sample reference resolution
- segmentation granularity appropriate to the text class
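
The first two checks lend themselves to automation. A minimal sketch, assuming references are available as an ordered list of strings and that a declared segment count can be read from the segmentation report (both are assumptions about the artifact layout):

```python
from collections import Counter

def validate_refs(refs, declared_count=None):
    """Return a list of problems found in an ordered list of reference strings."""
    problems = []
    dupes = sorted(ref for ref, n in Counter(refs).items() if n > 1)
    if dupes:
        problems.append(f"duplicate refs: {', '.join(dupes)}")
    empty = sum(1 for ref in refs if not ref.strip())
    if empty:
        problems.append(f"{empty} empty ref(s)")
    if declared_count is not None and declared_count != len(refs):
        problems.append(f"declared {declared_count} segments, found {len(refs)}")
    return problems
```

An empty result only shows that the grid is internally consistent. Whether the granularity fits the text class still needs a human judgment.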

### Important rule
A pretty reference grid is not success if it misrepresents the source structure.

---

## `/segmentation inventory`

Emit the reference inventory and counts for review.

### Useful contents
- reference list per book/chapter/unit
- unit counts
- notable merges/splits or ambiguities
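
One possible shape for the counts, as a sketch: group each reference under everything before its final component and record the size and range of each group. The output keys (`total`, `units`, `parent`, and so on) are hypothetical, not a format the pipeline prescribes.

```python
def build_inventory(refs):
    """Group refs by their parent unit, keeping source order."""
    groups = {}
    for ref in refs:
        parent, _, _leaf = ref.rpartition(".")  # '1.2.3' -> parent '1.2'
        groups.setdefault(parent, []).append(ref)
    return {
        "total": len(refs),
        "units": [
            {"parent": parent, "count": len(members),
             "first": members[0], "last": members[-1]}
            for parent, members in groups.items()
        ],
    }
```

Merges, splits, and ambiguities still need to be listed by hand, since they cannot be inferred from the reference strings alone.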

---

## `/segmentation status`

Summarize:
- canonical reference system chosen
- segmentation unit
- inventory size/counts
- known structural ambiguities
- whether Stage 5 can begin or segmentation needs revision

---

## Outputs

### Canonical outputs
```text
$LYCEUM_TEXTS_DIR/<slug>/
├── structured/
├── qa/
├── state.json
└── replay/stage-history.json
```

### Segmentation artifacts should include
- structured Greek file(s)
- structured witness file(s) where applicable
- reference inventory
- segmentation report

---

## Database Import (Required)

After producing structured artifacts, Stage 4 **must** import Greek segments into `data/editions.db`. Downstream stages (Stage 6 translation generation, etc.) depend on segments being in the database.

### Import Command

```bash
nix-shell -p go --run "go run scripts/import_workspace.go --workspace $LYCEUM_TEXTS_DIR/<slug> --greek-only"
```

### What gets imported
- Author (created if missing, looked up by URN)
- Work (created if missing, looked up by URN)
- Edition (created with `type="edition"`, `language="grc"`)
- Segments (all Greek segments from `structured/` files)

### Import verification
After import, verify with:
```bash
sqlite3 data/editions.db "SELECT COUNT(*) FROM segments WHERE edition_id = (SELECT id FROM editions WHERE urn LIKE '%<work-urn>%')"
```
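
The same check can be scripted so the database count is compared against the inventory total. This sketch uses only the table and column names that appear in the query above (`segments.edition_id`, `editions.id`, `editions.urn`). It uses `IN` rather than `=` so that every edition matching the pattern is counted, which is a deliberate difference from the one-liner.

```python
import sqlite3

def imported_segment_count(conn, work_urn):
    """Count segments across every edition whose URN contains `work_urn`."""
    row = conn.execute(
        "SELECT COUNT(*) FROM segments WHERE edition_id IN "
        "(SELECT id FROM editions WHERE urn LIKE ?)",
        (f"%{work_urn}%",),  # bound parameter, so the URN is never spliced into SQL
    ).fetchone()
    return row[0]
```

A typical use is `imported_segment_count(sqlite3.connect("data/editions.db"), work_urn)`, followed by a comparison with the inventory's segment total.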

### URN derivation
The import script derives URNs from the workspace manifest:
- Author URN: `manifest.author_urn`
- Work URN: `manifest.work_urn`
- Greek Edition URN: `manifest.work_urn + ".workspace-grc1"` (or `manifest.greek_edition_urn` if set)
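
The derivation rules above amount to a few lines. This is a sketch of the rule as described, not the import script's code. Treating an empty `greek_edition_urn` the same as a missing one is an assumption.

```python
def derive_urns(manifest):
    """Return the author, work, and Greek edition URNs for a manifest dict."""
    work_urn = manifest["work_urn"]
    return {
        "author_urn": manifest["author_urn"],
        "work_urn": work_urn,
        # An explicit greek_edition_urn wins; otherwise fall back to the suffix rule.
        "greek_edition_urn": manifest.get("greek_edition_urn")
        or work_urn + ".workspace-grc1",
    }
```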

### Idempotence
The import is idempotent: running it again clears and re-inserts the segments for the edition, so it is safe to re-run after segmentation changes.
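
The clear-and-reinsert pattern can be sketched with Python's `sqlite3` module. The `ref` and `text` columns are hypothetical; only `segments.edition_id` is confirmed by this document, and the actual import is done by the Go script.

```python
import sqlite3

def replace_segments(conn, edition_id, segments):
    """Delete and re-insert all segments for one edition in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM segments WHERE edition_id = ?", (edition_id,))
        conn.executemany(
            "INSERT INTO segments (edition_id, ref, text) VALUES (?, ?, ?)",
            [(edition_id, ref, text) for ref, text in segments],
        )
```

Running both statements in one transaction means a failed re-import leaves the previous segments in place rather than an empty edition.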

### Failure handling
If import fails:
1. Check that `structured/` contains valid JSON files.
2. Verify that `manifest.json` has `author_urn` and `work_urn` set.
3. Check that `data/editions.db` exists and is writable.
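
These three checks can be run before the import as a preflight. The sketch below assumes `manifest.json` sits at the workspace root and that structured files are `*.json` directly under `structured/`; neither location is confirmed here.

```python
import json
import os
from pathlib import Path

def preflight(workspace, db_path="data/editions.db"):
    """Return a list of problems that would make the import fail."""
    workspace = Path(workspace)
    problems = []

    files = sorted((workspace / "structured").glob("*.json"))
    if not files:
        problems.append("structured/ contains no JSON files")
    for path in files:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            problems.append(f"{path.name}: {exc}")

    try:
        manifest = json.loads((workspace / "manifest.json").read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        problems.append(f"manifest.json: {exc}")
    else:
        for key in ("author_urn", "work_urn"):
            if not (isinstance(manifest, dict) and manifest.get(key)):
                problems.append(f"manifest.json: {key} is not set")

    if not os.access(db_path, os.W_OK):  # False when the file is missing
        problems.append(f"{db_path} is missing or not writable")
    return problems
```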

---

## Verification Contract

This skill follows the Stage 4 contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- a canonical reference system was selected
- Greek and witness structure is reproducible
- reference inventory is complete
- segmentation granularity matches the text class

### Minimum evidence
- structured artifacts under `structured/`
- reference inventory
- segmentation report

### Pass criteria
- references are stable and non-ambiguous
- counts per book/chapter/section are internally consistent
- sample references resolve correctly across structured files
- segmentation is honest for the text type and downstream reader needs
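
The sample-resolution criterion can be checked with a small helper. This sketch assumes the references of each structured file have already been loaded into a mapping from file name to reference list; how they are loaded depends on the structured file format, which is not specified here.

```python
def unresolved_samples(sample_refs, refs_by_file):
    """Map each file name to the sample refs it cannot resolve."""
    report = {}
    for name, refs in refs_by_file.items():
        known = set(refs)
        missing = [ref for ref in sample_refs if ref not in known]
        if missing:
            report[name] = missing
    return report
```

A witness may legitimately lack a reference, so misses are items to review and document rather than automatic failures.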

### Failure examples
- references drift unexpectedly across artifacts
- segmentation is too coarse for alignment or too precise for the source
- segmentation invents unsupported structure

### Required next steps
After successful segmentation:
1. **Import to DB**: Run `nix-shell -p go --run "go run scripts/import_workspace.go --workspace $LYCEUM_TEXTS_DIR/<slug> --greek-only"` to import Greek segments to editions.db
2. `/translation-witness ...` when witness comparison is needed
3. `/versify run <work>` when structural alignment is next
4. Invalidate downstream stages if the structure changes materially

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_4.sh "${SLUG}"
```

Exit codes: 0=PASS (advance), 1=FAIL (block), 2=WARN (advance with notes).
The orchestrator runs this automatically; when executing manually, check the output for [FAIL] or [WARN] lines.
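
For manual runs, the exit-code convention can be wrapped as follows. This is a sketch: the function names are hypothetical, reading the `[FAIL]` and `[WARN]` lines from standard output is an assumption, and so is treating unknown exit codes as blocking.

```python
import subprocess

OUTCOMES = {0: "PASS", 1: "FAIL", 2: "WARN"}

def classify_exit(code):
    """Translate an exit code into (outcome, may_advance)."""
    outcome = OUTCOMES.get(code, "FAIL")  # unknown codes block (assumption)
    return outcome, outcome in ("PASS", "WARN")

def run_stage_4_verification(slug):
    """Run the script and return (outcome, may_advance, flagged_lines)."""
    proc = subprocess.run(
        ["bash", "scripts/verify_stage_4.sh", slug],
        capture_output=True, text=True, check=False,
    )
    flagged = [line for line in proc.stdout.splitlines()
               if "[FAIL]" in line or "[WARN]" in line]
    outcome, may_advance = classify_exit(proc.returncode)
    return outcome, may_advance, flagged
```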

---

## Key Files

| File | Purpose |
|---|---|
| `docs/text-pipeline-master-plan-2026-03-13.md` | Canonical Stage 4 requirements |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Ownership and command surface |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Verification contract |
| `$LYCEUM_TEXTS_DIR/<slug>/structured/` | Canonical structural artifacts |
