---
name: text-cleaning
description: Stage 3 cleaning skill — turn extracted Greek and English sources into trustworthy clean text by removing contamination, normalizing Unicode, preserving real structure, and producing an auditable cleaning report.
allowed-tools: Bash(find:*), Bash(ls:*), Bash(python3:*), Bash(jq:*), Bash(git:*), Read, Write, Edit, Grep, Glob
---

# Text Cleaning (Stage 3 Cleaning and Normalization)

Turn extracted source text into trustworthy clean text.

This skill is the **Stage 3 owner** for cleaning and normalization.
It should answer:
- what was removed?
- what was preserved?
- did cleaning damage real text?
- is the source now safe for segmentation?

It does **not** own:
- source discovery
- segmentation
- witness ranking

## Quick Status

Clean artifact dirs: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -path '*/clean' -type d 2>/dev/null | wc -l`
Cleaning report placeholders: !`find "${LYCEUM_TEXTS_DIR:-output/texts}" -path '*/qa/cleaning-report.md' 2>/dev/null | wc -l`

## Commands

- `/text-cleaning run [work] [--side greek|english|all]` — Produce or refresh cleaned text artifacts
- `/text-cleaning audit [work]` — Audit whether the cleaning pass under-cleaned, over-cleaned, or corrupted text
- `/text-cleaning diff [work]` — Compare extracted vs clean outputs for targeted review
- `/text-cleaning status [work]` — Summarize cleaned artifacts, contamination findings, and blockers

Target: $ARGUMENTS

---

## Owned Responsibilities

### Owns
- removal of notes/headers/footers/apparatus/bilingual contamination
- Unicode normalization
- OCR anomaly awareness during cleanup
- raw/extracted -> clean audit trail
- cleaning report production

### Does not own
- structural reference assignment
- witness ranking
- alignment decisions

---

## Cleaning Goals

### Greek
- remove English bleed-through and editorial notes
- preserve canonical references
- preserve meaningful punctuation and speaker labels
- normalize Unicode consistently

### English
- remove notes, page headers, summaries, footers
- preserve actual translation text and meaningful divisions
- normalize whitespace and punctuation consistently
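
As a rough illustration of the normalization goals above, the following Python sketch applies one Unicode normalization form and tidies whitespace. The function name `normalize_text`, the choice of NFC, and the whitespace rules are assumptions made for illustration, not the pipeline's actual implementation.

```python
import re
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply one Unicode normalization form, then tidy whitespace.

    Hypothetical helper: the form and the whitespace policy are assumptions.
    """
    text = unicodedata.normalize(form, text)
    # Collapse runs of spaces, tabs, and no-break spaces inside each line,
    # then strip leading/trailing whitespace (this discards indentation).
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines()]
    joined = "\n".join(lines)
    # Keep paragraph breaks, but reduce longer blank runs to one blank line.
    joined = re.sub(r"\n{3,}", "\n\n", joined)
    return joined.strip() + "\n"


if __name__ == "__main__":
    # Decomposed alpha + combining acute becomes one precomposed code point.
    sample = "\u03b1\u0301  \u03b2\t \u03b3 \n\n\n\n\u03b4"
    print(repr(normalize_text(sample)))
```

NFC is only one reasonable choice. It folds the Greek Extended oxia code points (for example U+1F71) into their tonos equivalents (U+03AC), so whichever form is chosen should be applied to every source and recorded in the cleaning report.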

---

## Workflows

## `/text-cleaning run`

Produce cleaned source files from extracted artifacts.

### Required loop
- cleaner removes obvious contamination
- auditor checks for over-cleaning or under-cleaning
- notes record unresolved uncertainties

### Important rule
Do not collapse extraction and cleaning into one opaque step.

Keep the audit trail visible from `extracted/` to `clean/`.

---

## `/text-cleaning audit`

Use to test whether the cleaned text is trustworthy.

### Check for
- remaining contamination
- accidental deletion of real text
- OCR anomalies carried through untouched
- lost references or structural markers
- Unicode normalization damage

### Typical evidence
- sampled raw/extracted -> clean comparison
- contamination scan notes
- section-count sanity checks
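
A contamination scan for the Greek side could look something like the sketch below, which flags lines dominated by Latin letters as possible English bleed-through. The function name `scan_greek_for_latin`, the character ranges, and the 0.5 threshold are all assumptions; a real scan would need tuning per source.

```python
import re

LATIN = re.compile(r"[A-Za-z]")
# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def scan_greek_for_latin(text: str, threshold: float = 0.5) -> list[tuple[int, float, str]]:
    """Return (line number, Latin ratio, line) for suspicious lines.

    Hypothetical helper: the threshold is an assumed starting point.
    """
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        latin = len(LATIN.findall(line))
        greek = len(GREEK.findall(line))
        total = latin + greek
        if total == 0:
            continue  # blank lines or lines with only digits/punctuation
        ratio = latin / total
        if ratio >= threshold:
            findings.append((lineno, round(ratio, 2), line.strip()))
    return findings


if __name__ == "__main__":
    sample = "ἐν ἀρχῇ ἦν ὁ λόγος\nSee note 3 on p. 12\n\nκαὶ θεὸς ἦν ὁ λόγος"
    for finding in scan_greek_for_latin(sample):
        print(finding)
```

Flagged lines are candidates for review rather than automatic deletion, since removing them blindly risks the over-cleaning this audit is meant to catch.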

---

## `/text-cleaning diff`

Use when you need a focused review of what changed.

### Goal
Make over-cleaning and under-cleaning obvious enough to inspect.
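
One way to produce such a review artifact is a plain unified diff plus line counts, sketched here with Python's standard `difflib`. The helper names `cleaning_diff` and `diff_stats` and the `extracted/` and `clean/` labels in the diff header are illustrative assumptions.

```python
import difflib


def cleaning_diff(extracted: str, clean: str, name: str = "source.txt") -> str:
    """Unified diff from the extracted text to the clean text."""
    diff = difflib.unified_diff(
        extracted.splitlines(keepends=True),
        clean.splitlines(keepends=True),
        fromfile=f"extracted/{name}",
        tofile=f"clean/{name}",
        n=1,  # one line of context keeps the review focused
    )
    return "".join(diff)


def diff_stats(extracted: str, clean: str) -> tuple[int, int]:
    """Count (removed, added) lines between the two versions."""
    a, b = extracted.splitlines(), clean.splitlines()
    removed = added = 0
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed += i2 - i1
        if tag in ("insert", "replace"):
            added += j2 - j1
    return removed, added


if __name__ == "__main__":
    before = "Page 12\nreal text\nFootnote 1\n"
    after = "real text\n"
    print(cleaning_diff(before, after))
    print(diff_stats(before, after))
```

A large removed count against a small source suggests over-cleaning, and a removed count near zero on a source known to carry notes or headers suggests under-cleaning.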

---

## `/text-cleaning status`

Summarize:
- cleaned artifacts present
- known contamination risks
- audit status
- whether Stage 4 can begin or another cleaning pass is required

---

## Outputs

### Canonical outputs
```text
$LYCEUM_TEXTS_DIR/<slug>/
├── clean/
├── qa/cleaning-report.md
├── state.json
└── replay/stage-history.json
```

### Cleaning report should include
- sources cleaned
- categories of removed material
- remaining uncertainties
- sample audit notes
- recommendation: pass / another pass / blocked
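
The report fields listed above could be rendered into Markdown along the following lines. This is a minimal sketch: the function name `render_cleaning_report`, the heading wording, and the example values are assumptions, and the real report layout may differ.

```python
ALLOWED_RECOMMENDATIONS = ("pass", "another pass", "blocked")


def _bullets(items):
    """Render a list as Markdown bullets, with a fallback for empty lists."""
    return "\n".join(f"- {item}" for item in items) or "- none recorded"


def render_cleaning_report(work, sources, removed, uncertainties, audit_notes, recommendation):
    """Build the text of a cleaning report (hypothetical layout)."""
    if recommendation not in ALLOWED_RECOMMENDATIONS:
        raise ValueError(f"recommendation must be one of {ALLOWED_RECOMMENDATIONS}")
    sections = [
        ("Sources cleaned", sources),
        ("Categories of removed material", removed),
        ("Remaining uncertainties", uncertainties),
        ("Sample audit notes", audit_notes),
    ]
    parts = [f"# Cleaning report: {work}"]
    parts += [f"## {title}\n\n{_bullets(items)}" for title, items in sections]
    parts.append(f"## Recommendation\n\n{recommendation}")
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_cleaning_report(
        "example-work",
        ["greek/source-a.txt"],
        ["page headers", "footnotes"],
        [],
        ["sampled 3 passages, no loss found"],
        "pass",
    ))
```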

---

## Verification Contract

This skill follows the Stage 3 contract from `docs/text-pipeline-skill-verification-2026-03-13.md`.

### Verify
- contamination/boilerplate was removed
- Unicode normalization is consistent
- no over-cleaning occurred on sampled passages
- raw -> clean diff is auditable

### Minimum evidence
- cleaned source files under `clean/`
- `qa/cleaning-report.md`
- contamination scan results
- raw -> clean diff summary

### Pass criteria
- sampled passages preserve real text while removing unwanted material
- no unresolved high-confidence contamination regions remain
- canonical references/speaker labels survive when intended
- editorial cruft is removed where applicable
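
The criterion that references and speaker labels survive can be spot-checked mechanically. The sketch below compares marker counts before and after cleaning; the bracketed pattern (for example `[17a]`) is a hypothetical convention chosen for illustration, so substitute whatever marker format the source actually uses.

```python
import re
from collections import Counter

# Hypothetical marker format: bracketed references such as [17a] or [1094a1].
DEFAULT_MARKER = re.compile(r"\[\d+[a-e]?\d*\]")


def lost_markers(extracted: str, clean: str, pattern: re.Pattern = DEFAULT_MARKER) -> Counter:
    """Return markers that occur fewer times in clean than in extracted."""
    before = Counter(pattern.findall(extracted))
    after = Counter(pattern.findall(clean))
    # Counter subtraction keeps only positive differences.
    return before - after


if __name__ == "__main__":
    extracted = "[17a] first passage [17b] second passage [17c] third passage"
    clean = "[17a] first passage second passage [17c] third passage"
    print(dict(lost_markers(extracted, clean)))
```

An empty result does not prove that the text is intact, only that no marker of the assumed shape disappeared.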

### Failure examples
- clean text still includes notes/page headers/apparatus
- real text was deleted as noise
- normalization changed words incorrectly

### Required next steps
After successful cleaning:
- run `/segmentation run <work>`
- if the clean text later changes materially, rerun cleaning and invalidate downstream stages

---

## Verification

After completing this stage, run the automated verification script:

```bash
bash scripts/verify_stage_3.sh "${SLUG}"
```

Exit codes: `0` = PASS (advance), `1` = FAIL (block), `2` = WARN (advance with notes).
The orchestrator runs this script automatically; when running it manually, check the output for `[FAIL]` or `[WARN]` lines.
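
When the script is run by hand from Python, the exit codes can be mapped to a decision as in the sketch below. The helper names are hypothetical, and treating any unlisted exit code as blocking is an assumption rather than documented behavior.

```python
import subprocess

# Mapping taken from the exit codes documented above: code -> (status, advance?)
OUTCOMES = {0: ("PASS", True), 1: ("FAIL", False), 2: ("WARN", True)}


def interpret_exit_code(code: int) -> tuple[str, bool]:
    """Translate an exit code into (status, may_advance)."""
    # Assumption: unknown codes block the pipeline.
    return OUTCOMES.get(code, ("UNKNOWN", False))


def run_stage_3_verification(slug: str) -> tuple[str, bool, list[str]]:
    """Run the verification script and collect flagged output lines."""
    proc = subprocess.run(
        ["bash", "scripts/verify_stage_3.sh", slug],
        capture_output=True,
        text=True,
    )
    flagged = [
        line for line in proc.stdout.splitlines()
        if "[FAIL]" in line or "[WARN]" in line
    ]
    status, advance = interpret_exit_code(proc.returncode)
    return status, advance, flagged


if __name__ == "__main__":
    print(interpret_exit_code(2))
```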

---

## Key Files

| File | Purpose |
|---|---|
| `docs/text-pipeline-master-plan-2026-03-13.md` | Canonical Stage 3 requirements |
| `docs/text-pipeline-skill-architecture-2026-03-13.md` | Ownership and command surface |
| `docs/text-pipeline-skill-verification-2026-03-13.md` | Verification contract |
| `$LYCEUM_TEXTS_DIR/<slug>/qa/cleaning-report.md` | Canonical cleaning report |
