---
name: oma-hwp
description: >
  Convert HWP / HWPX / HWPML files to Markdown using kordoc. Extracts text, headings, tables,
  lists, images, footnotes, and hyperlinks. Use for Korean word processor files (Hangul),
  government documents, and AI-ready data preparation.
---

# HWP Skill - HWP / HWPX / HWPML to Markdown Conversion

## Scheduling

### Goal
Convert Korean HWP-family documents into readable Markdown or structured JSON while preserving document structure for LLM context, RAG, government-document review, or enterprise document processing.

### Intent signature
- User asks to convert, parse, read, extract, or transform `.hwp`, `.hwpx`, or `.hwpml`.
- User mentions Korean word processor files, Hangul documents, government forms, or "한글 파일".
- User needs headings, tables, nested tables, lists, images, footnotes, or hyperlinks extracted from HWP-family files.

### When to use
- Converting Korean HWP documents (`.hwp`, `.hwpx`, `.hwpml`) to Markdown
- Preparing Korean government/enterprise documents for LLM context or RAG
- Extracting structured content (tables, headings, lists, images) from HWP
- User says "convert this HWP", "parse hwpx", "HWP to markdown", "한글 파일"

### When NOT to use
- PDF files -> use `oma-pdf` (OCR + Tagged PDF specialization)
- XLSX / DOCX files -> currently out of scope (may be covered by a future `oma-docs`)
- Generating or editing HWP documents -> out of scope
- Already-text files -> use Read tool directly

### Expected inputs
- `input_path`: `.hwp`, `.hwpx`, or `.hwpml` file path
- `output_path` or `output_dir`: optional explicit output target
- `format`: optional output format, default `markdown`
- `page_range`: optional page or section range
- `kordoc_version`: optional pinned kordoc version

### Expected outputs
- Markdown output next to the input file or in the requested directory
- Optional JSON output when requested
- Post-processed Markdown with flattened GFM tables and stripped Private Use Area glyphs by default
- A short report with output path, detected source format, and conversion issues

### Dependencies
- `bun` and `bunx`
- `bunx kordoc@latest` or configured pinned kordoc version
- `resources/flatten-tables.ts` for Markdown cleanup
- Local filesystem access to input and output paths

### Control-flow features
- Branches by file extension, output target, format, page range, encryption/DRM state, and post-processing requirements
- Calls external CLI tools through `bunx` and `bun run`
- Reads local HWP-family files and writes local Markdown or JSON output
- Routes non-HWP inputs to other skills instead of stretching this skill's scope

## Structural Flow

### Entry
1. Confirm the input path exists.
2. Confirm the extension is `.hwp`, `.hwpx`, or `.hwpml`.
3. Resolve output path or directory and default filename.
4. Check that `bun` is available.

### Scenes
1. **PREPARE**: Validate path, extension, size, output target, and requested format.
2. **ACQUIRE**: Detect source format and runtime availability.
3. **ACT**: Run `kordoc` with explicit output target and requested options.
4. **VERIFY**: Post-process Markdown and inspect structure for headings, tables, lists, images, and footnotes.
5. **FINALIZE**: Report output path, source format, and any conversion limitations.

### Transitions
- If the input is `.pdf`, stop and route to `oma-pdf`.
- If the input is `.xlsx` or `.docx`, explain that this skill does not advertise those formats.
- If `bun` is unavailable, stop and ask the user to install Bun.
- If Markdown is produced, run `resources/flatten-tables.ts` unless the caller explicitly needs HTML tables or PUA glyphs preserved.
- If output is empty or garbled, consult `resources/troubleshooting.md`.

### Failure and recovery
| Failure | Recovery |
|---------|----------|
| `bun` or `bunx` unavailable | Ask user to install Bun |
| Unsupported or mismatched format | Check extension and magic bytes, then route or stop |
| Encrypted or DRM-locked document | Report limitation and request an accessible copy when needed |
| Empty Markdown output | Treat as possible scanned-image content and recommend OCR outside this skill |
| Complex merged tables | Accept flattened Markdown or HTML fallback as best effort |
| Stale kordoc cache | Use `bunx kordoc@latest` or configured pinned version |

### Exit
- Success: output file exists and structure is readable after post-processing.
- Partial success: output exists with explicitly reported table, glyph, encryption, or fidelity limitations.
- Failure: no reliable output is produced and the blocking cause is reported.

## Logical Operations

### Actions
| Action | SSL primitive | Evidence |
|--------|---------------|----------|
| Validate file path and extension | `VALIDATE` | Input preflight in execution protocol |
| Check runtime availability | `VALIDATE` | `bun --version` |
| Select output target and format | `SELECT` | Output behavior and config |
| Run converter | `CALL_TOOL` | `bunx kordoc@latest` |
| Write output artifact | `WRITE` | Markdown or JSON output |
| Flatten tables and strip PUA glyphs | `CALL_TOOL` | `resources/flatten-tables.ts` |
| Inspect extraction quality | `VALIDATE` | Verification step |
| Report result | `NOTIFY` | Final user-facing summary |

### Tools and instruments
- `kordoc`: primary HWP-family conversion CLI
- `flatten-tables.ts`: post-processing for GFM tables and Hancom PUA cleanup
- `bun` / `bunx`: runtime and CLI executor

### Canonical command path
```bash
bunx kordoc@latest "{input_path}" -o "{output_path}"
bun run ".agents/skills/oma-hwp/resources/flatten-tables.ts" "{output_path}"
```

For batch conversion, use an explicit output directory:
```bash
bunx kordoc@latest "{input_pattern}" -d "{output_dir}"
```

### Resource scope
| Scope | Resource target |
|-------|-----------------|
| `LOCAL_FS` | Input HWP-family files and generated outputs |
| `PROCESS` | `bunx kordoc` and `bun run` subprocesses |
| `MEMORY` | Format decisions, validation notes, and final report |

### Preconditions
- Input file exists and is readable.
- Output location is writable or can be created.
- `bun` is installed.
- `kordoc` can parse the document or fail with a reportable error.

### Effects and side effects
- Creates Markdown or JSON output files.
- May flatten merged-cell tables, trading cell fidelity for Markdown compatibility.
- Strips Private Use Area characters by default because they render as blanks without Hancom fonts.
- Does not intentionally modify the source HWP-family document.

### Guardrails
1. Always pass `@latest` or an explicit pinned version to avoid stale `bunx` cache.
2. Always pass an explicit output target when the user expects a file.
3. Do not add custom security defenses around kordoc's ZIP, XML, SSRF, or XSS defenses.
4. Report missing tables, garbled text, empty output, encrypted segments, and best-effort DRM extraction.
5. Keep full CLI details in `resources/execution-protocol.md` and troubleshooting branches in `resources/troubleshooting.md`.

### Supported Formats
| Format | Extension | Notes |
|--------|-----------|-------|
| HWP 5.x binary | `.hwp` | Full support (incl. DRM-locked via kordoc's rhwp-algorithm port) |
| HWPX | `.hwpx` | Full support incl. nested tables, merged cells |
| HWPML | `.hwp` (XML variant) | Auto-detected by signature |

> kordoc also parses PDF / XLSX / DOCX. Those are intentionally outside this skill's scope — see "When NOT to use".

## References
- Execution protocol: `resources/execution-protocol.md`
- Troubleshooting: `resources/troubleshooting.md`
- Configuration: `config/hwp-config.yaml`
- Upstream: https://github.com/chrisryugj/kordoc
- Related: `../oma-pdf/SKILL.md` (use for `.pdf` inputs)
