---
description: "Stage 8 ground-truth provider — acquire external benchmark corpora, build gloss benchmarks, and generate dictionary-gap reports as read-only evidence for gloss-review evaluation."
---

# Ground Truth — Stage 8 Provider/Evidence Skill

Extract and combine verified gloss corpora from external sources, maintain benchmark artifacts, and generate dictionary-gap reports for downstream evaluation.

## Role

The ground-truth skill is a **provider** within Stage 8 (interlinear/morphology). It does NOT generate glosses or decide pass/block — that's the gloss-review skill's job.

Ground truth provides independently curated benchmark data that the gloss-review evaluator uses to measure interlinear quality.

## Key Principle: On-Demand Acquisition

Ground truth corpora are **acquired on-the-fly** by the pipeline, not vendored in the repo. The skill should:

1. Identify the best available ground-truth corpus for the target text
2. Acquire it if not already present in the workspace (download from known sources, extract from APIs, etc.)
3. Build a benchmark document in the workspace's `interlinear/ground-truth-benchmark.json`
4. Report findings to `qa/ground-truth-report.md`

## Holdout vs Ground Truth

These are distinct concepts:

- **Holdout data** (`data/holdout/`) — curated evaluation benchmarks used to measure pipeline quality *after* generation. These live in the repo.
- **Ground truth corpora** — independently verified reference data used *during* Stage 8 to benchmark candidate glosses. These are acquired on-demand.

The pipeline degrades gracefully when ground truth is unavailable — it reports "unavailable" and the gloss-review stage relies on other signals.

## Corpus Selection Logic

The Go runtime (`internal/textpipeline/groundtruthruntime.go`, function `selectGroundTruthCorpus`) selects corpora in priority order:

1. **Exact-work corpus** — a corpus specifically curated for this text (e.g., OpenGNT Berean for John's Gospel)
2. **Combined fallback** — a merged cross-text corpus when no exact match exists
3. **Unavailable** — no corpus found; benchmark reports absence explicitly

## Known External Sources

| Source | Texts Covered | Format |
|--------|--------------|--------|
| OpenGNT Berean | NT (all books) | JSONL token-level |
| Perseus Treebank | Homer, Herodotus, Plato, Sophocles, etc. | CoNLL-U |

## Workspace Artifacts

| Artifact | Purpose |
|----------|---------|
| `interlinear/ground-truth-benchmark.json` | Benchmark document with gloss entries |
| `qa/ground-truth-report.md` | Provider summary and notes |
| `interlinear/ground-truth-gaps.json` | Dictionary gap report (if applicable) |
| `qa/ground-truth-gaps.md` | Human-readable gap report |

## Pipeline Scripts

```bash
# Run ground truth provider for a workspace
go run scripts/text_pipeline_ground_truth.go benchmark -root ${LYCEUM_TEXTS_DIR:-output/texts} -work "Gospel of John Book 1"

# Combine multiple ground truth sources
go run scripts/text_pipeline_ground_truth.go combine -root ${LYCEUM_TEXTS_DIR:-output/texts} -work "Gospel of John Book 1"

# Generate gap report
go run scripts/text_pipeline_ground_truth.go gaps -root ${LYCEUM_TEXTS_DIR:-output/texts} -work "Gospel of John Book 1"
```

## Evaluation Holdout Packs

The repo ships curated holdout data for post-pipeline quality measurement:

```
data/holdout/
├── token_level/          # word-level gloss benchmarks
│   ├── epic/             # Iliad opening
│   ├── fable/            # Aesop fables 1-8
│   ├── ionic-history/    # Herodotus opening
│   ├── nt/               # John chapter 1
│   ├── attic-prose/      # Plato Apology opening
│   └── tragedy/          # Sophocles OT opening
└── segment_level/        # verse/section translation benchmarks
    ├── epic/             # Iliad opening
    ├── fable/            # Aesop sample
    ├── nt/               # John chapter 1
    ├── philosophy/       # Meditations book 1
    └── prose-history/    # Anabasis 1.1
```

## Critical Rules

1. Ground truth is **read-only evidence** — it informs gloss-review decisions but does not self-promote Stage 8
2. The ground-truth provider writes benchmark evidence only; gloss-review remains the Stage 8 gate
3. When no corpus is available, report "unavailable" explicitly — do not fabricate ground truth
4. Do not use LLM output as ground truth — that creates circular validation
