---
name: source-extract
description: Stage 2 acquisition and extraction skill — acquire approved sources, extract parseable text, record the extraction path, and preserve raw/extracted artifacts with confidence notes.
allowed-tools: Bash(find:*), Bash(ls:*), Bash(curl:*), Bash(go:*), Bash(python3:*), Bash(jq:*), Bash(git:*), Read, Write, Edit, Grep, Glob
---

# Source Extract (Stage 2 Acquisition and Extraction)

Acquire approved sources and extract parseable text for one target work.

This skill is the **Stage 2 owner** for acquisition and extraction.
It consolidates the planned format-specific paths (HTML/TEI/plain text, EPUB, PDF, OCR/DjVu, manual transcription) into one stage skill.

It should answer:
- which approved sources were acquired?
- which extraction path was used?
- what raw and extracted artifacts were produced?
- what confidence or OCR warnings need to be carried forward?

It does **not** own:
- semantic cleaning
- structural segmentation
- source ranking decisions

## Quick Status

Raw artifact dirs: !`find ${LYCEUM_TEXTS_DIR:-output/texts} -path '*/raw' -type d 2>/dev/null | wc -l`
Extracted artifact dirs: !`find ${LYCEUM_TEXTS_DIR:-output/texts} -path '*/extracted' -type d 2>/dev/null | wc -l`
OCR helper script present: !`ls scripts/fix_ocr_errors.py 2>/dev/null | wc -l`

## Commands

- `/source-extract run [work] [--source greek|english:<id>|all]` — Acquire and extract approved sources
- `/source-extract probe [work] [--source <id>]` — Determine likely extraction path before full extraction
- `/source-extract validate [work]` — Spot-check raw vs extracted output and extraction metadata
- `/source-extract status [work]` — Summarize acquired sources, extraction modes, and warnings

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- acquisition of approved source files
- extraction into parseable text
- extraction-path recording
- confidence/OCR warning recording
- preservation of both raw and extracted artifacts

### Does not own
- semantic cleaning
- structural segmentation
- source ranking decisions

---

## Supported Extraction Modes

### HTML / TEI / plain text
- fetch source
- strip obvious navigation/boilerplate only where needed for extraction
- preserve headings/references where possible
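
A minimal sketch of the HTML path, using only the standard library. The set of skipped tags and the Markdown-style `#` heading markers are assumptions for illustration, not something this skill prescribes; `extract_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Elements dropped entirely (assumed boilerplate set; adjust per source).
SKIP_TAGS = {"script", "style", "nav"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "div", "li", "br", "blockquote"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []        # finished output lines
        self._buf = []         # text of the block being built
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._heading = None   # current heading tag, if any

    def _flush(self):
        # Collapse whitespace and emit the buffered block as one line.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            if self._heading:
                text = "#" * int(self._heading[1]) + " " + text
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._flush()
            if tag in HEADING_TAGS:
                self._heading = tag

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._flush()
            if tag in HEADING_TAGS:
                self._heading = None

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)


def extract_text(html: str) -> str:
    """Return block-level text with headings kept as '#'-prefixed lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Keeping the skip set small matches the "only where needed" rule above: anything removed here is invisible to later stages, so doubtful elements are better left in and handled during cleaning.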

### EPUB
- unpack EPUB
- extract XHTML in spine order
- preserve meaningful section structure
- note nav/TOC/notes stripping decisions
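
Spine order can be read from the EPUB package document rather than from file names. The sketch below assumes a standard OCF layout (`META-INF/container.xml` pointing at an OPF file) and does not handle percent-encoded hrefs or encrypted archives; `spine_documents` is a hypothetical helper name.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_path):
    """Return archive paths of the content documents in spine (reading) order."""
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))

    # Manifest maps item ids to hrefs relative to the OPF file's directory.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in opf.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib.get("idref") in manifest
    ]
```

Manifest items that never appear in the spine (a nav document, for example) are left out of the result, which gives a natural place to record the nav/TOC stripping decision.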

### PDF with text layer
- extract text directly
- preserve page boundaries as metadata if practical
- detect headers/footers/marginalia risk
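
One way to surface header/footer risk, once per-page text is available from whatever PDF extractor is in use, is to look for first and last lines that repeat across pages. This is a heuristic sketch; the 60% threshold and the function name `repeated_edge_lines` are assumptions.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digit runs so "Page 12" and "Page 13" count as the same line.
    return re.sub(r"\d+", "#", line.strip().lower())


def repeated_edge_lines(pages, min_share=0.6):
    """Return normalized first/last lines recurring on at least min_share of pages.

    pages: list of page texts, one string per page.
    """
    counts = Counter()
    for text in pages:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue
        # A set avoids double-counting when a page has a single line.
        for edge in {_normalize(lines[0]), _normalize(lines[-1])}:
            counts[edge] += 1
    threshold = max(2, int(min_share * len(pages)))
    return sorted(line for line, n in counts.items() if n >= threshold)
```

The result is meant to be recorded as a warning, not used to delete lines: removing running heads is a cleaning decision and belongs to the next stage.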

### OCR / DjVu fallback
- extract OCR text
- measure OCR quality heuristics
- flag suspect spans/pages
- compare with other witnesses when possible
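
As an illustration of an OCR quality heuristic, the sketch below scores each page by the share of tokens that do not look like ordinary words. The regular expression, the 0.25 threshold, and the function names are assumptions; it also treats bare numbers as suspect and expects NFC-normalized text, so it is a rough signal rather than a measurement.

```python
import re

WORD_RE = re.compile(r"\S+")
# "Clean" token: letters of any script (so Greek counts), optionally joined by an
# apostrophe or hyphen, with optional leading/trailing punctuation.
CLEAN_RE = re.compile(r"^[^\w\s]*[^\W\d_]+(?:['’\-][^\W\d_]+)*[^\w\s]*$")


def suspect_ratio(text):
    """Share of tokens that do not look like ordinary words (0.0 for empty text)."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 0.0
    suspect = sum(1 for tok in tokens if not CLEAN_RE.match(tok))
    return suspect / len(tokens)


def flag_pages(pages, threshold=0.25):
    """Return 1-based page numbers whose suspect ratio exceeds the threshold."""
    return [i for i, text in enumerate(pages, start=1) if suspect_ratio(text) > threshold]
```

The flagged page numbers are the kind of value that would go into the extraction report as suspect pages, so later stages know where to compare against other witnesses.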

### Manual transcription fallback
- only when necessary
- must include provenance note and explicit warning

---

## Workflows

## `/source-extract probe`

Use before extraction when the format path is unclear.

### Determine
- source format
- likely extraction mode
- whether OCR is required
- whether extraction is low-risk or likely noisy
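
A probe can start from the first bytes of the acquired file. The sketch below guesses the format from well-known signatures; the returned labels and the name `sniff_format` are hypothetical. It cannot tell whether a PDF has a text layer, so the OCR decision still needs a follow-up check on actual page content.

```python
def sniff_format(head: bytes) -> str:
    """Guess a source format from the first bytes of a file."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"AT&TFORM"):
        return "djvu"
    if head.startswith(b"PK\x03\x04"):
        # EPUB stores an uncompressed "mimetype" entry first in the archive,
        # so its content normally appears within the first bytes.
        return "epub" if b"application/epub+zip" in head[:100] else "zip"
    # Drop a UTF-8 byte order mark and leading whitespace before looking at markup.
    text = head.lstrip(b"\xef\xbb\xbf").lstrip().lower()
    if b"<tei" in text[:2048]:
        return "tei"
    if text.startswith(b"<!doctype html") or b"<html" in text[:2048]:
        return "html"
    if text.startswith(b"<?xml"):
        return "xml"
    return "text"
```

Passing the first few kilobytes is enough for these checks, which keeps a probe cheap compared with a full extraction run.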

---

## `/source-extract run`

Acquire and extract the approved sources.

### For each source
1. preserve the raw artifact under `raw/`
2. record the extraction path used
3. write parseable output under `extracted/`
4. note confidence/warnings
5. preserve enough metadata to make later auditing possible

### Important rule
Do not discard the raw source after extraction.
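
The per-source steps can be sketched as a single helper that writes both artifacts and a metadata sidecar. The field names, the `.meta.json` sidecar location, and the name `store_source` are assumptions for illustration; the real report format is defined by the pipeline docs, not by this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def store_source(work_dir, source_id, raw_name, raw_bytes, extracted_text, mode, warnings=()):
    """Write raw/ and extracted/ artifacts plus a JSON sidecar; return the record."""
    work_dir = Path(work_dir)
    safe_id = source_id.replace(":", "_")  # ids like "english:<id>" are not safe file names everywhere
    raw_path = work_dir / "raw" / raw_name
    out_path = work_dir / "extracted" / f"{safe_id}.txt"
    for path in (raw_path, out_path):
        path.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: the raw artifact is written unchanged and never removed.
    raw_path.write_bytes(raw_bytes)
    # Step 3: parseable output goes under extracted/.
    out_path.write_text(extracted_text, encoding="utf-8")

    record = {
        "source_id": source_id,
        "extraction_mode": mode,  # step 2: extraction path used
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "raw_path": str(raw_path.relative_to(work_dir)),
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),  # step 5: lets an audit confirm the raw file
        "extracted_path": str(out_path.relative_to(work_dir)),
        "warnings": list(warnings),  # step 4: confidence/OCR notes
    }
    sidecar = out_path.parent / (out_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

Recording a hash of the raw bytes is one inexpensive way to make later auditing possible: a reviewer can confirm that the file under `raw/` is the one the extraction was run against.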

---

## `/source-extract validate`

Spot-check extraction quality.

### Verify for samples
- extracted text matches the corresponding source page or section
- headers/footers/marginalia are recognized as risk areas
- references/headings were not lost unnecessarily
- OCR suspect spans are explicitly noted
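
A spot check can be made roughly quantitative by asking how much of a hand-picked source sample appears verbatim in the extracted text. The sketch below compares word n-grams after whitespace and case normalization; the window size of 4 and the name `sample_coverage` are assumptions, and a low score is a prompt for manual inspection, not a verdict.

```python
import re


def _words(text):
    # Normalize whitespace and case so line breaks in the source do not matter.
    return re.sub(r"\s+", " ", text).strip().casefold().split()


def _shingles(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def sample_coverage(sample, extracted, n=4):
    """Share of the sample's word n-grams that occur verbatim in the extracted text."""
    sample_words = _words(sample)
    if not sample_words:
        return 0.0
    n = min(n, len(sample_words))  # short samples fall back to a smaller window
    wanted = _shingles(sample_words, n)
    have = _shingles(_words(extracted), n)
    return len(wanted & have) / len(wanted)
```

A single OCR error lowers the score for every n-gram that contains it, so the measure is deliberately sensitive to small corruptions inside an otherwise matching passage.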

### Useful repo helpers
- `scripts/fix_ocr_errors.py` for OCR-heavy workflows
- `read_pdf` when source PDFs need targeted inspection

---

## `/source-extract status`

Summarize:
- acquired sources
- extraction mode used per source
- warnings/confidence notes
- whether Stage 3 can begin or extraction must be retried

---

## Outputs

### Canonical outputs
```text
$LYCEUM_TEXTS_DIR/<slug>/
├── raw/
├── extracted/
├── qa/
├── state.json
└── replay/stage-history.json
```

### Expected report content
- source ID
- extraction mode
- acquisition timestamp
- confidence notes
- suspect spans/pages
- structural preservation notes

---

## Verification Contract

This skill follows the Stage 2 contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- approved sources were acquired
- extraction path used was recorded
- raw and extracted artifacts exist
- extraction quality/confidence was measured
- sampled extracted text matches source content

### Minimum evidence
- files under `raw/`
- files under `extracted/`
- extraction report with format path used

### Pass criteria
- raw and extracted files exist for all approved sources unless blocked
- extraction mode is explicit
- sample spot checks confirm extracted text matches source pages/sections
- suspect OCR spans or headers/footers are flagged where relevant

### Failure examples
- extracted files exist without preserved raw sources
- OCR was used but confidence/suspect spans were not recorded
- extraction silently lost important structural markers
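
The pass criteria and the first two failure examples can be expressed as checks over per-source report records. This is a sketch: the record fields (`raw_exists`, `extracted_exists`, `suspect_spans_recorded`, `blocked`), the `"ocr"` mode label, and the name `contract_failures` are all hypothetical, and lost structural markers still need a human spot check.

```python
def contract_failures(reports):
    """Return human-readable failures for a list of per-source report dicts."""
    failures = []
    for r in reports:
        sid = r.get("source_id", "<unknown>")
        if r.get("blocked"):
            continue  # blocked sources are exempt under the pass criteria
        if not r.get("extraction_mode"):
            failures.append(f"{sid}: extraction mode not recorded")
        if r.get("extracted_exists") and not r.get("raw_exists"):
            failures.append(f"{sid}: extracted output without preserved raw source")
        elif not (r.get("raw_exists") and r.get("extracted_exists")):
            failures.append(f"{sid}: raw or extracted artifact missing")
        if r.get("extraction_mode") == "ocr" and not r.get("suspect_spans_recorded"):
            failures.append(f"{sid}: OCR used but no confidence/suspect spans recorded")
    return failures
```

An empty list means none of these mechanical checks failed; it does not replace the sampled comparison of extracted text against the source.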

### Required next steps
After successful extraction:
- `/text-cleaning run <work>`
- rerun extraction if a better source or cleaner extraction path becomes available

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_2.sh "${SLUG}"
```

Exit codes: `0` = PASS (advance), `1` = FAIL (block), `2` = WARN (advance with notes).
The orchestrator runs this automatically; when executing manually, check the output for `[FAIL]` or `[WARN]` lines.
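
When the script is driven from Python instead of the orchestrator, the exit codes can be mapped to an advance/block decision as sketched below. Treating any undocumented exit code as blocking is an assumption, as are the function names; the script's output format beyond the `[FAIL]` and `[WARN]` markers is not assumed.

```python
import subprocess

# Mapping taken from the documented exit codes: label and whether the stage may advance.
OUTCOMES = {0: ("PASS", True), 1: ("FAIL", False), 2: ("WARN", True)}


def interpret(returncode):
    """Map a verifier exit code to (label, may_advance); unknown codes block."""
    return OUTCOMES.get(returncode, ("UNKNOWN", False))


def run_verifier(slug, script="scripts/verify_stage_2.sh"):
    """Run the Stage 2 verifier and collect its [FAIL]/[WARN] lines."""
    proc = subprocess.run(["bash", script, slug], capture_output=True, text=True)
    label, may_advance = interpret(proc.returncode)
    notes = [ln for ln in proc.stdout.splitlines() if "[FAIL]" in ln or "[WARN]" in ln]
    return label, may_advance, notes
```

On a WARN result the collected notes are what should be carried forward alongside the extraction report.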

---

## Key Files

| File | Purpose |
|---|---|
| `scripts/fix_ocr_errors.py` | OCR cleanup helper for noisy sources |
| `scripts/parse_hamilton_ocr.py` | Example OCR-to-structured parsing path |
| `scripts/init_text_pipeline_workspace.go` | Bootstrap workspace layout |
| `docs/text-pipeline-master-plan-2026-03-13.md` | Canonical Stage 2 requirements |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Ownership and command surface |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Verification contract |