---
name: treebank
description: Stage 8 provider skill — generate or import CoNLL-U treebank data, validate it against Perseus/PROIEL where available, and export morphology/syntax constraints for interlinear-build.
allowed-tools: Bash(python3:*), Bash(sqlite3:*), Bash(go:*), Bash(curl:*), Bash(git:*), Bash(wc:*), Bash(ls:*), Bash(jq:*), Bash(nix-shell:*), Read, Write, Edit, Grep, Glob
---

# Treebank (Stage 8 Provider)

Provide lemma, morphology, and syntax constraints for Stage 8.

This skill is a **provider**, not a stage owner.
It supplies treebank-derived evidence to `alignment` / future `interlinear-build`.

It should answer:
- what treebank or CoNLL-U data exists for this text?
- how well does it match gold data where available?
- what morphology/syntax constraints can be exported downstream?
- where do current lemma or parse assumptions disagree with treebank evidence?

It does **not** own:
- final Stage 8 assembly
- final Stage 8 promotion
- DB import/build promotion
- reader/product QA

## Quick Status

Treebank data files: !`find data/treebank -type f 2>/dev/null | wc -l`
Lyceum CoNLL-U files: !`find *(on-demand)*lyceum -name '*.conllu' 2>/dev/null | wc -l`
Perseus corpora present: !`find data/treebank -maxdepth 1 -name 'grc_perseus*.conllu' 2>/dev/null | wc -l`

## Commands

- `/treebank run [work] [--scope generate|import]` — Generate new treebank output or import existing gold/provider data for the target scope
- `/treebank validate [work]` — Compare against Perseus/PROIEL gold data where available
- `/treebank export [work]` — Export morphology/syntax constraints consumable by Stage 8 builders
- `/treebank status [work]` — Summarize coverage, validation results, and downstream readiness

## Command Compatibility

Legacy commands should map like this:
- old `/treebank generate` -> `/treebank run --scope generate`
- old `/treebank import` -> `/treebank export` for Stage 8 consumption, not DB promotion
- old `/treebank validate` -> keep
- old behavior of writing directly into alignment JSON -> treat the result as a provider export, not as a completed Stage 8 build
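
The mapping above can be sketched as a small lookup table. This is only an illustration: `LEGACY_COMMANDS` and `translate_legacy` are hypothetical names, not part of the repo, and the sketch assumes legacy arguments carry over unchanged.

```python
# Hypothetical translation table for legacy /treebank subcommands.
# Keys are the old subcommand names; values are the new argument lists.
LEGACY_COMMANDS = {
    "generate": ["run", "--scope", "generate"],
    "import": ["export"],      # Stage 8 consumption, not DB promotion
    "validate": ["validate"],  # unchanged
}


def translate_legacy(subcommand, args):
    """Rewrite a legacy invocation; unknown subcommands pass through as-is."""
    new_prefix = LEGACY_COMMANDS.get(subcommand, [subcommand])
    return [*new_prefix, *args]
```

For example, `translate_legacy("generate", ["iliad"])` yields `["run", "--scope", "generate", "iliad"]`, where `iliad` is a placeholder work name.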

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- importing or generating CoNLL-U/treebank artifacts
- validation against Perseus/PROIEL where available
- exporting morphology/syntax constraints for Stage 8
- mismatch reporting when lemma/parse evidence disagrees

### Does not own
- final Stage 8 assembly
- gloss promotion
- DB import/build promotion

---

## Provider Role

Use treebank data to:
- resolve lemma conflicts
- constrain morphology selection
- supply syntax clues for sense disambiguation
- improve popup-facing morphology strings
- identify disagreement between existing pipeline assumptions and verified treebank evidence

### Important rule
Treebank evidence is authoritative where gold data exists, but this skill still provides **constraints/evidence**, not Stage 8 promotion.

---

## Workflows

## `/treebank run`

Generate or import treebank artifacts.

### Current repo extraction path
```bash
nix-shell -p python3 --run "python3 scripts/treebank_extract.py --text TEXT --book BOOK [--start N --end M] --output /tmp/treebank_extract.json"
```

### Current repo write path
```bash
nix-shell -p python3 --run "python3 scripts/treebank_write.py --input /tmp/treebank_annotations.json"
```

### Output should preserve
- lemma
- UPOS
- FEATS
- HEAD
- DEPREL
- human-readable morphology strings for downstream consumption
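
As a rough illustration of the fields listed above, the sketch below reads one CoNLL-U token line (ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and derives a readable morphology string. `Token`, `parse_token_line`, and `morphology_string` are hypothetical helpers, and the string format is an assumption; neither reflects what `scripts/treebank_write.py` actually does.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: str
    form: str
    lemma: str
    upos: str
    feats: dict
    head: str
    deprel: str


def parse_feats(raw):
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_token_line(line):
    """Parse one CoNLL-U token line, keeping the fields Stage 8 needs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    return Token(
        id=cols[0], form=cols[1], lemma=cols[2], upos=cols[3],
        feats=parse_feats(cols[5]), head=cols[6], deprel=cols[7],
    )


def morphology_string(tok):
    """Assumed popup format: lowercased UPOS, then feature values sorted by feature name."""
    values = [tok.feats[name].lower() for name in sorted(tok.feats)]
    return " ".join([tok.upos.lower(), *values])
```

Given the line `1<TAB>μῆνιν<TAB>μῆνις<TAB>NOUN<TAB>n-s---fa-<TAB>Case=Acc|Gender=Fem|Number=Sing<TAB>2<TAB>obj<TAB>_<TAB>_`, `morphology_string` returns `noun acc fem sing`.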

---

## `/treebank validate`

Validate treebank output against gold data where available.

### Current gold sources
- `*(acquired on-demand from Universal Dependencies)*`

### Typical targets
- LEMMA match
- UPOS match
- FEATS accuracy
- DEPREL accuracy

### Important rule
If no gold data exists, record that explicitly and emit mismatch/uncertainty notes instead of claiming validation success.
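
A minimal sketch of such a comparison, assuming predicted and gold tokens are already aligned one-to-one and represented as plain dicts. The function name `compare_to_gold` and the report keys are hypothetical, not the repo's actual report schema.

```python
FIELDS = ("lemma", "upos", "feats", "deprel")


def compare_to_gold(predicted, gold):
    """Return per-field accuracy plus mismatches, or an explicit no-gold report."""
    if not gold:
        # Never claim success without comparison data.
        return {"gold_available": False,
                "note": "no gold data; validation not performed",
                "mismatches": []}
    if len(predicted) != len(gold):
        raise ValueError("token counts differ; align tokens before comparing")

    matches = {field: 0 for field in FIELDS}
    mismatches = []
    for index, (pred, ref) in enumerate(zip(predicted, gold)):
        for field in FIELDS:
            if pred[field] == ref[field]:
                matches[field] += 1
            else:
                mismatches.append({"index": index, "field": field,
                                   "predicted": pred[field], "gold": ref[field]})

    total = len(gold)
    accuracy = {field: matches[field] / total for field in FIELDS}
    return {"gold_available": True, "accuracy": accuracy, "mismatches": mismatches}
```

Keeping the mismatch list alongside the accuracy figures means disagreements are reported rather than hidden behind an aggregate score.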

---

## `/treebank export`

Export provider constraints for Stage 8.

### Downstream consumers need
- morphology strings for popup/interlinear use
- disambiguated lemma choices
- syntax clues usable by Stage 8 builders/evaluators

### Important rule
Do not treat export as DB promotion or ship completion.
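
One way to picture an exported constraint record is sketched below. The JSON layout, the key names, and the `promoted` flag are all assumptions made for illustration; the real export schema consumed by `alignment` / `interlinear-build` is not specified here.

```python
import json


def build_constraint(token_ref, lemma, upos, feats, head, deprel, morphology):
    """Build one hypothetical provider constraint for a single token."""
    return {
        "token": token_ref,        # e.g. a sentence/token locator
        "lemma": lemma,            # disambiguated lemma choice
        "upos": upos,
        "feats": feats,
        "head": head,              # syntax clue: governing token
        "deprel": deprel,          # syntax clue: relation to the head
        "morphology": morphology,  # popup/interlinear string
        "source": "treebank",
        "promoted": False,         # export is evidence, never promotion
    }


def export_constraints(records):
    """Serialize constraints as UTF-8-friendly JSON text."""
    payload = {"provider": "treebank", "constraints": records}
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Marking every record as not promoted keeps the provider boundary visible to whatever Stage 8 builder reads the file.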

---

## `/treebank status`

Summarize:
- corpus coverage
- generated/imported artifacts available
- validation state
- whether Stage 8 has usable morphology/syntax constraints
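
Corpus coverage can be approximated by counting sentences and tokens in a CoNLL-U file. The sketch below relies only on the standard CoNLL-U layout (blank lines separate sentences, `#` starts a comment line, IDs containing `-` are multiword ranges, and IDs containing `.` are empty nodes). `conllu_coverage` is a hypothetical helper, not an existing script.

```python
def conllu_coverage(text):
    """Count sentences and syntactic tokens in CoNLL-U text."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment/metadata
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword ranges and empty nodes
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return {"sentences": sentences, "tokens": tokens}
```

The resulting counts could feed a status summary alongside the validation state and the number of exported constraints.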

---

## Outputs

### Current repo-facing outputs
- CoNLL-U files in `$LYCEUM_TEXTS_DIR/<slug>/interlinear/`
- provider-side morphology/syntax data sourced from *(acquired on-demand by treebank skill)*

### Target pipeline outputs
Eventually this skill should write provider artifacts under:

```text
$LYCEUM_TEXTS_DIR/<slug>/
├── interlinear/
├── structured/
├── qa/
├── state.json
└── replay/stage-history.json
```

### Expected outputs
- CoNLL-U or equivalent treebank artifact
- validation report
- exported morphology/syntax constraints for Stage 8

---

## Verification Contract

This skill follows the provider contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- CoNLL-U or equivalent morphology constraints were generated/imported
- gold comparison passes where available
- mismatch report exists where gold is unavailable or disagreement remains
- output is consumable by Stage 8

### Minimum evidence
- treebank artifacts
- validation report against Perseus/PROIEL gold data where applicable
- exported morphology/syntax constraints

### Pass criteria
- gold-covered texts produce explicit LEMMA/UPOS/FEATS/DEPREL comparison reports
- mismatches are reported, not hidden
- export format is usable by `alignment` / `interlinear-build`
- morphology strings are consistent with popup needs

### Failure examples
- treebank output exists but cannot be consumed by Stage 8
- validation claims success without comparison data
- treebank export is treated as promoted Stage 8 or ship output

### Required next steps
After successful provider output:
- `/alignment build <work>` to consume treebank constraints
- `/gloss-review benchmark <work>`, where relevant, once Stage 8 has been rebuilt

---

## Key Files

| File | Purpose |
|---|---|
| `scripts/treebank_extract.py` | Extract sentence-level treebank input |
| `scripts/treebank_write.py` | Write CoNLL-U/provider artifacts |
| *(acquired on-demand by treebank skill)* | Gold and generated treebank data |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Canonical provider ownership |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Verification contract |

---

## Reference Notes

Retained from the earlier skill:
- Perseus remains the primary gold reference
- UD-compatible CoNLL-U remains the preferred exchange format
- morphology strings still matter for popup/interlinear use
- treebank evidence is one of the strongest non-LLM disambiguation layers
