---
name: source-hunt
description: Stage 1 source discovery skill — find, rank, and record Greek, English, and auxiliary sources with provenance, parseability, license, and alignment-fitness notes.
allowed-tools: Bash(find:*), Bash(ls:*), Bash(curl:*), Bash(git:*), Bash(jq:*), Read, Write, Edit, Grep, Glob
---

# Source Hunt (Stage 1 Source Discovery)

Discover and rank the best available sources for one target text.

This skill is the **Stage 1 owner** for source discovery.
It should answer:
- what is the best Greek base source?
- which English witnesses are worth carrying forward?
- which auxiliary resources exist?
- what provenance, licensing, parseability, and quality risks matter?

It does **not** own:
- full acquisition/extraction
- cleaning
- segmentation
- structural alignment

## Quick Status

Pipeline workspaces: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l`
Source catalog placeholders: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -path '*/sources/*.json' 2>/dev/null | wc -l`

## Commands

- `/source-hunt run [work]` — Gather Greek, English, and auxiliary candidates for a target work
- `/source-hunt rank [work]` — Rank discovered candidates by quality, provenance, parseability, and fit
- `/source-hunt accept [work] --greek <source-id> [--english <source-id> ...]` — Record the recommended/accepted sources for downstream extraction
- `/source-hunt status [work]` — Summarize current candidate coverage and recommendation state

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- source discovery
- source ranking
- provenance capture
- parseability/license/quality notes
- recommendation of Greek, English, and auxiliary resources

### Does not own
- downloading or extracting full source content beyond lightweight probes
- cleaning or normalization
- witness structural classification at Stage 5 depth

---

## Discovery Priorities

### Greek source priority
1. Perseus / OGL / CTS-compatible corpora
2. public scholarly plain text / TEI
3. Wikisource / Gutenberg / XML-like sources
4. Internet Archive text / DjVu / EPUB
5. scanned PDFs with OCR fallback
6. manual transcription only as a last resort

### English witness priority
1. public-domain scholarly translations with stable structure
2. translations known to align well by book/section/line
3. looser literary translations only as secondary witnesses

### Auxiliary resources to seek
- treebanks
- born-aligned translations
- Hamilton / Clark interlinears
- commentary/vocabulary aids
- canonical division metadata
- known-good sentence or verse references

---

## Workflows

## `/source-hunt run`

Collect candidates for the target text.

### For each candidate, try to record
- source ID
- title
- language
- URL
- editor / translator
- publication year
- license / public-domain status
- format
- likely extraction method
- quality notes
- confidence notes

### Outputs should separate
- Greek candidates
- English candidates
- auxiliary resources
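
A minimal sketch of how collected candidates could be routed into the three catalogs. It assumes each catalog file holds a JSON array of records; the `kind` routing key, the helper names, and the sample values are hypothetical and not part of the pipeline.

```python
import json
from pathlib import Path

# File names follow the canonical outputs layout; the "kind" key is a
# hypothetical routing field that is dropped before records are written.
CATALOG_FILES = {
    "greek": "greek_candidates.json",
    "english": "english_candidates.json",
    "auxiliary": "auxiliary_resources.json",
}


def split_candidates(candidates):
    """Group candidate records by kind, removing the routing key."""
    catalogs = {kind: [] for kind in CATALOG_FILES}
    for cand in candidates:
        record = dict(cand)  # copy so the caller's record is untouched
        kind = record.pop("kind")
        catalogs[kind].append(record)
    return catalogs


def write_catalogs(workspace, catalogs):
    """Write one JSON array per catalog under <workspace>/sources/."""
    sources_dir = Path(workspace) / "sources"
    sources_dir.mkdir(parents=True, exist_ok=True)
    for kind, records in catalogs.items():
        text = json.dumps(records, ensure_ascii=False, indent=2)
        (sources_dir / CATALOG_FILES[kind]).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Placeholder records only; ids and URLs are illustrative.
    found = [
        {"kind": "greek", "id": "grc-example", "url": "https://example.org/grc"},
        {"kind": "english", "id": "eng-example", "url": "https://example.org/eng"},
    ]
    print({kind: len(recs) for kind, recs in split_candidates(found).items()})
```

Keeping the split as a pure function makes it easy to inspect the grouping before anything is written to the workspace.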

---

## `/source-hunt rank`

Rank candidates using explicit criteria.

### Ranking criteria
- provenance completeness
- parseability
- structural usefulness for downstream alignment
- text quality / OCR risk
- license safety
- expected reproducibility

### Important rule
Do not treat “easy to fetch” as the same thing as “best source”.

Record why a candidate is preferred and why others were rejected or downgraded.
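
One way to keep rankings traceable is to score each criterion explicitly and store a short rationale next to the total. The sketch below is illustrative only: the weights, the 0-5 score scale, the shortened criterion keys, and the `scores` field are all assumptions, not values defined by the pipeline.

```python
# Keys mirror the ranking criteria above; the weights are hypothetical.
WEIGHTS = {
    "provenance": 3,
    "parseability": 3,
    "structure": 2,
    "text_quality": 2,
    "license": 2,
    "reproducibility": 1,
}


def rank_candidates(candidates):
    """Return candidates sorted best-first, each with a total and rationale."""
    ranked = []
    for cand in candidates:
        scores = cand["scores"]
        # A criterion with no recorded score counts as 0.
        total = sum(WEIGHTS[name] * scores.get(name, 0) for name in WEIGHTS)
        weakest = min(WEIGHTS, key=lambda name: scores.get(name, 0))
        rationale = (
            f"total={total}; weakest criterion: "
            f"{weakest}={scores.get(weakest, 0)}"
        )
        ranked.append({"id": cand["id"], "total": total, "rationale": rationale})
    return sorted(ranked, key=lambda r: r["total"], reverse=True)


if __name__ == "__main__":
    # Placeholder candidates; ids and scores are illustrative.
    sample = [
        {"id": "scan", "scores": {"provenance": 2, "parseability": 1,
                                  "structure": 2, "text_quality": 1,
                                  "license": 5, "reproducibility": 3}},
        {"id": "tei", "scores": {"provenance": 5, "parseability": 5,
                                 "structure": 4, "text_quality": 4,
                                 "license": 4, "reproducibility": 4}},
    ]
    for row in rank_candidates(sample):
        print(row["id"], row["rationale"])
```

Ease of fetching is deliberately absent from `WEIGHTS`, so it cannot influence the order.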

---

## `/source-hunt accept`

Record which candidates downstream stages should treat as selected.

### Typical output
- one recommended Greek base source
- one or more English witnesses
- relevant auxiliary resources flagged for later stages

### Important rule
Acceptance here is not extraction.
It records a planning and provenance decision that Stage 2 acts on.
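
As a rough sketch, acceptance can be modeled as nothing more than setting the `recommended` field on catalog records. The function name and the strict handling of unknown ids are assumptions; nothing is downloaded or extracted.

```python
def accept_sources(catalog, accepted_ids):
    """Return a copy of the catalog with `recommended` set from accepted_ids.

    Raises ValueError when an accepted id is not in the catalog, so a typo
    on the command line cannot silently accept nothing.
    """
    known = {rec["id"] for rec in catalog}
    missing = sorted(set(accepted_ids) - known)
    if missing:
        raise ValueError(f"unknown source ids: {missing}")
    # Build new records rather than mutating the input catalog.
    return [{**rec, "recommended": rec["id"] in accepted_ids} for rec in catalog]


if __name__ == "__main__":
    # Placeholder ids for illustration.
    greek = [{"id": "grc-a"}, {"id": "grc-b"}]
    print(accept_sources(greek, ["grc-b"]))
```

The function would be applied once per catalog, for example once with the `--greek` id and once with the `--english` ids.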

---

## `/source-hunt status`

Summarize:
- candidate counts
- top-ranked Greek source
- top-ranked English witnesses
- auxiliary resources found
- provenance completeness gaps
- whether Stage 2 can start or is blocked
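
A status summary could be derived directly from the three catalogs. The sketch below covers only the counts, the recommended ids, and the blocked decision; it assumes catalogs are already loaded as lists of records, and the rule that Stage 2 is blocked without a recommended Greek source is an assumption drawn from the pass criteria.

```python
def summarize(catalogs):
    """Summarize candidate coverage for one workspace's catalogs."""
    recommended = {
        kind: [rec["id"] for rec in records if rec.get("recommended")]
        for kind, records in catalogs.items()
    }
    return {
        "counts": {kind: len(records) for kind, records in catalogs.items()},
        "recommended": recommended,
        # Assumption: Stage 2 needs at least one recommended Greek source.
        "stage2_blocked": not recommended.get("greek"),
    }


if __name__ == "__main__":
    # Placeholder catalogs for illustration.
    print(summarize({
        "greek": [{"id": "grc-a", "recommended": True}, {"id": "grc-b"}],
        "english": [{"id": "eng-a"}],
        "auxiliary": [],
    }))
```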

---

## Outputs

### Canonical outputs
```text
$LYCEUM_TEXTS_DIR/<slug>/
├── sources/
│   ├── greek_candidates.json
│   ├── english_candidates.json
│   └── auxiliary_resources.json
├── provenance.md
├── state.json
└── replay/stage-history.json
```

### Expected source catalog fields
- `id`
- `title`
- `language`
- `url`
- `editor`
- `translator`
- `publication_year`
- `license`
- `format`
- `extraction_method`
- `quality_notes`
- `confidence_notes`
- `recommended`

---

## Verification Contract

This skill follows the Stage 1 contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- Greek, English, and auxiliary candidate lists exist
- provenance fields are complete
- ranking rationale is present
- recommended sources are reproducible and appropriate

### Minimum evidence
- `sources/greek_candidates.json`
- `sources/english_candidates.json`
- `sources/auxiliary_resources.json`
- provenance notes

### Pass criteria
- every recommended source records URL, title, editor or translator, publication year, license, and format
- at least one recommended Greek source exists unless blocked
- English witness recommendations include fit notes for later alignment work
- source rankings are traceable to explicit criteria
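
The provenance part of these criteria lends itself to a mechanical check. The following is a sketch, not the contract's actual verifier: treating `editor` and `translator` as alternatives, and reading alignment fit notes from `quality_notes`, are both assumptions.

```python
# Required keys for a recommended source, using the catalog field names.
REQUIRED = ("url", "title", "publication_year", "license", "format")


def check_recommended(records, language):
    """Return failure messages for recommended records; empty means pass."""
    failures = []
    for rec in records:
        if not rec.get("recommended"):
            continue  # only recommended sources are held to the criteria
        for key in REQUIRED:
            if not rec.get(key):
                failures.append(f"{rec['id']}: missing {key}")
        # Assumption: either an editor or a translator satisfies the criterion.
        if not (rec.get("editor") or rec.get("translator")):
            failures.append(f"{rec['id']}: missing editor/translator")
        # Assumption: fit notes for English witnesses live in quality_notes.
        if language == "english" and not rec.get("quality_notes"):
            failures.append(f"{rec['id']}: missing fit notes")
    return failures


if __name__ == "__main__":
    # Placeholder record with the license left out on purpose.
    record = {
        "id": "eng-a", "recommended": True, "url": "https://example.org/eng",
        "title": "Example", "translator": "A. Translator",
        "publication_year": 1900, "format": "TEI",
        "quality_notes": "section numbering matches the Greek base",
    }
    print(check_recommended([record], "english"))
```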

### Failure examples
- provenance metadata missing for recommended sources
- best parseable source ignored without explanation
- witness recommendation too vague for Stage 2 or Stage 5 to use

### Required next steps
After accepted source discovery:
- `/source-extract run <work>`
- rerun source discovery when better witnesses or editions are later found

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_1.sh "${SLUG}"
```

Exit codes: 0=PASS (advance), 1=FAIL (block), 2=WARN (advance with notes).
The orchestrator runs this automatically; when executing manually, check the output for `[FAIL]` or `[WARN]` lines.
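
For manual runs driven from a script, the exit codes can be mapped to an advance/block decision. This is a hedged sketch: the function names are hypothetical, and treating any undocumented exit code as a blocking failure is an assumption.

```python
import subprocess

# Exit code -> (label, may_advance), as documented for verify_stage_1.sh.
OUTCOMES = {0: ("PASS", True), 1: ("FAIL", False), 2: ("WARN", True)}


def interpret_exit_code(code):
    """Map a verifier exit code to (label, may_advance)."""
    # Assumption: unknown codes (e.g. script not found) block the pipeline.
    return OUTCOMES.get(code, ("FAIL", False))


def run_verification(slug):
    """Run the Stage 1 verifier for a slug and interpret its exit code."""
    result = subprocess.run(["bash", "scripts/verify_stage_1.sh", slug])
    return interpret_exit_code(result.returncode)


if __name__ == "__main__":
    print(interpret_exit_code(2))
```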

---

## Key Files

| File | Purpose |
|---|---|
| `docs/text-pipeline-master-plan-2026-03-13.md` | Canonical Stage 1 requirements |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Ownership and command surface |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Verification contract |
| `$LYCEUM_TEXTS_DIR/<slug>/sources/*.json` | Canonical candidate catalogs |