---
name: document-gen-resilient-multiformat
description: Resilient multi-format document generation with environment checks, auto-sanitization, and fallback engines
---

# Resilient Document Generation Workflow (Multi-Format + Unicode-Safe)

## When to Use

Use this skill when generating documents in multiple formats (`.docx`, `.pdf`, `.html`) and you need:
- **Reliable execution** with pre-flight environment verification
- **Automatic Unicode handling** based on target format
- **Fallback options** when primary tools (pandoc/LaTeX) are unavailable
- **Clear error isolation** through discrete, observable steps

**Trigger this workflow when:**
- `shell_agent` returns unknown/unclear errors on document generation
- You need multiple output formats from one source
- Documents contain special characters, symbols, or non-ASCII text
- Previous PDF generation attempts failed due to encoding or missing dependencies

## Pre-Flight Environment Check

**Before starting**, verify your environment has the required tools:

```
run_shell
command: which pandoc && pandoc --version | head -1
```

```
run_shell
command: which pdflatex || which xelatex || which wkhtmltopdf || echo "No PDF engine found"
```

```
run_shell
command: python3 -c "import fpdf; print('fpdf2 available')" 2>/dev/null || echo "fpdf2 not available"
```

```
run_shell
command: python3 -c "import reportlab; print('reportlab available')" 2>/dev/null || echo "reportlab not available"
```

**If pandoc is missing:** Install via `apt-get install pandoc` or `brew install pandoc`

**If no PDF engine found:** Choose fallback approach:
- Install LaTeX: `apt-get install texlive-latex-recommended texlive-fonts-recommended`
- Or use wkhtmltopdf: `apt-get install wkhtmltopdf`
- Or use Python fallback (fpdf2/reportlab) - see Alternative PDF Generation section

## Core Technique

Split the workflow into discrete, observable steps with automatic format detection:
1. **Environment check** → Verify tools are available
2. **Content creation** → Use `write_file` to create source Markdown
3. **Auto-sanitization** → Conditionally sanitize based on target format (PDF needs it, DOCX/HTML don't)
4. **Format conversion** → Use appropriate tool for each format with fallback options
5. **Verification** → Check output files exist and validate content

## ⚠️ Unicode & Format Compatibility Matrix

| Format | Unicode Support | Sanitization Needed | Recommended Engine |
|--------|-----------------|---------------------|-------------------|
| `.docx` | Excellent | No | pandoc (default) |
| `.html` | Excellent | No | pandoc (default) |
| `.pdf` | Limited (LaTeX) | **Yes** | xelatex > pdflatex > wkhtmltopdf > fpdf2 |
| `.pptx` | Good | No | pandoc (if available) or python-pptx |

### Critical Unicode Characters for PDF

| Character | Issue | Safe Replacement |
|-----------|-------|------------------|
| `—` (em dash) | May not render | `--` or `-` |
| `–` (en dash) | May not render | `-` |
| `" "` (curly quotes) | Encoding errors | `" "` (straight quotes) |
| `' '` (curly apostrophe) | Encoding errors | `'` (straight apostrophe) |
| `…` (ellipsis) | May not render | `...` |
| `→` `←` `↑` `↓` (arrows) | LaTeX incompatibility | `->` `<-` `^` `v` |
| `✓` `✗` (checkmarks) | May not render | `[x]` `[ ]` |
| `★` `●` (symbols) | May not render | `*` `-` |
| `©` `®` `™` | May require packages | `(c)` `(r)` `(tm)` |
| Non-ASCII letters (é, ñ, ü) | Font-dependent | Use xeLaTeX or replace |

## Step-by-Step Workflow

### Step 0: Pre-Flight Check

Verify environment before proceeding:

```
run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"
```

**If PANDOC_MISSING**: Either install pandoc or use Alternative PDF Generation (Python-based)

### Step 1: Create Source Content with write_file

Write your document content as Markdown to a source file:

```
write_file
path: /tmp/document_source.md
content: |
  # Document Title
  
  ## Section 1
  Content with original unicode characters...
  
  ## Section 2
  More content...
```

### Step 2: Conditional Unicode Sanitization

**Only sanitize if generating PDF**. Create sanitized version for PDF conversion:

```
write_file
path: /tmp/document_source_sanitized.md
content: |
  # Document Title
  
  ## Section 1
  Content with unicode replaced (em-dash -> --, curly quotes -> straight, etc.)
  
  ## Section 2
  More content...
```

**Optional: Use sanitization script** (see Unicode Sanitization Script section below):

```
run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md
```

**Note**: Keep the original unsanitized file for DOCX/HTML conversion (these formats handle Unicode better).

### Step 3: Convert to Target Formats with run_shell

Use appropriate commands for each format. **Use sanitized source for PDF, original for others**.

**DOCX (from original):**
```
run_shell
command: pandoc /tmp/document_source.md -o output.docx
```

**PDF (from sanitized, with engine priority):**
```
run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=xelatex
```

**If xelatex fails, try pdflatex:**
```
run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=pdflatex
```

**If LaTeX engines fail, try wkhtmltopdf:**
```
run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=wkhtmltopdf
```

**HTML (from original):**
```
run_shell
command: pandoc /tmp/document_source.md -o output.html
```

### Step 4: Verify Outputs

Check that files were created and have content:

```
run_shell
command: ls -lh output.* && file output.*
```

```
run_shell
command: test -s output.pdf && echo "PDF has content" || echo "PDF is empty"
```

## PDF Engine Decision Tree

```
Is pandoc available?
├─ NO → Use Python fallback (fpdf2 or reportlab)
└─ YES → Is LaTeX available?
    ├─ YES (xelatex) → Use: pandoc --pdf-engine=xelatex (best Unicode)
    ├─ YES (pdflatex) → Use: pandoc --pdf-engine=pdflatex + sanitization
    ├─ YES (wkhtmltopdf) → Use: pandoc --pdf-engine=wkhtmltopdf
    └─ NO → Install LaTeX or use Python fallback
```

## Alternative PDF Generation (Python Fallback)

When pandoc or LaTeX is unavailable, use Python libraries directly:

### Option A: fpdf2 (Simple, fast)

```
run_shell
command: python3 << 'EOF'
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
# Note: fpdf2 has limited Unicode support - use Latin-1 or embed fonts
pdf.cell(200, 10, txt="Document Title", ln=True, align='C')
pdf.output("output.pdf")
EOF
```

### Option B: reportlab (More control, better Unicode)

```
run_shell
command: python3 << 'EOF'
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# Register a Unicode font if needed
# pdfmetrics.registerFont(TTFont('UnicodeFont', 'path/to/font.ttf'))

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Document Title")
c.save()
EOF
```

### Option C: Markdown to PDF with Markdown2 + ReportLab

```
run_shell
command: python3 << 'EOF'
import markdown
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

# Read markdown
with open('/tmp/document_source.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

# Convert to HTML
html_content = markdown.markdown(md_content)

# Create PDF
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Parse HTML and add to story (simplified - use html2text for production)
story.append(Paragraph("Document Content", styles['Normal']))

doc.build(story)
print("PDF created successfully")
EOF
```

## Complete Example

```markdown
# Generate Multi-Format Report

## Step 0: Check environment
run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

## Step 1: Write Markdown source
write_file
path: /tmp/report.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  — Major finding with em-dash
  "Quote" with curly quotes
  ✓ Completed items

## Step 2: Create sanitized version (for PDF only)
write_file
path: /tmp/report_sanitized.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  -- Major finding with em-dash
  "Quote" with straight quotes
  [x] Completed items

## Step 3: Convert to DOCX (from original)
run_shell
command: pandoc /tmp/report.md -o report.docx

## Step 4: Convert to PDF (from sanitized, with xelatex)
run_shell
command: pandoc /tmp/report_sanitized.md -o report.pdf --pdf-engine=xelatex

## Step 5: Convert to HTML (from original)
run_shell
command: pandoc /tmp/report.md -o report.html

## Step 6: Verify all outputs
run_shell
command: ls -lh report.* && echo "All files created"
```

## Error Handling & Recovery

| Error | Likely Cause | Recovery Action |
|-------|-------------|-----------------|
| `pandoc: command not found` | Pandoc not installed | Install pandoc or use Python fallback |
| `pdflatex not found` | LaTeX missing | Use `--pdf-engine=xelatex` or `wkhtmltopdf` |
| `! LaTeX Error: File X.sty not found` | Missing LaTeX package | Install package or use xelatex/wkhtmltopdf |
| `Encoding error` | Unicode in PDF | Use sanitized file + xelatex engine |
| `wkhtmltopdf: command not found` | wkhtmltopdf missing | Install or use xelatex/pdflatex |
| `PDF is empty (0 bytes)` | Conversion silently failed | Check pandoc stderr, try alternative engine |
| `fpdf2 Unicode error` | Non-Latin characters | Use reportlab with Unicode font or sanitize |

### Recovery Workflow

1. **Check error message** for specific tool/engine failure
2. **Try next engine in priority**: xelatex → pdflatex → wkhtmltopdf → fpdf2/reportlab
3. **If all pandoc attempts fail**: Switch to Python fallback (fpdf2/reportlab)
4. **If Unicode issues persist**: Apply stricter sanitization or use xelatex with Unicode font
5. **Document which approach succeeded** for future reference

## Unicode Sanitization Script (Reusable)

Save as `sanitize_for_pdf.sh`:

```bash
#!/bin/bash
# sanitize_for_pdf.sh - Replace problematic unicode chars for LaTeX/PDF
if [ -z "$1" ]; then
  echo "Usage: $0 <input.md> [output.md]"
  exit 1
fi
INPUT="$1"
OUTPUT="${2:-${1%.md}_sanitized.md}"

sed -e 's/—/--/g' \
    -e 's/–/-/g' \
    -e 's/"([^"]*)"/"\1"/g' \
    -e "s/'([^']*)/'\1'/g" \
    -e 's/…/.../g' \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/✓/[x]/g' \
    -e 's/✗/[ ]/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(r)/g' \
    -e 's/™/(tm)/g' \
    "$INPUT" > "$OUTPUT"

echo "Sanitized: $INPUT -> $OUTPUT"
```

Make executable: `chmod +x sanitize_for_pdf.sh`

Usage:
```
run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md
```

## Common pandoc Commands Reference

```bash
# Markdown to Word
pandoc input.md -o output.docx

# Markdown to PDF (LaTeX - default pdflatex)
pandoc input.md -o output.pdf

# Markdown to PDF with xelatex (best Unicode support)
pandoc input.md -o output.pdf --pdf-engine=xelatex

# Markdown to PDF with wkhtmltopdf (HTML-based, no LaTeX)
pandoc input.md -o output.pdf --pdf-engine=wkhtmltopdf

# Markdown to HTML
pandoc input.md -o output.html

# With metadata
pandoc input.md -o output.pdf --metadata title="Document Title"

# With custom reference doc (DOCX)
pandoc input.md -o output.docx --reference-doc=template.docx

# With custom template (HTML/PDF)
pandoc input.md --template=template.html -o output.html

# Force UTF-8 input
pandoc -f markdown+utf8 input.md -o output.pdf
```

## Troubleshooting Quick Reference

**PDF generation fails with encoding error:**
- Use sanitized markdown file
- Try `--pdf-engine=xelatex` for better Unicode support
- Add `-f markdown+utf8` to pandoc command

**PDF generation fails: LaTeX not found:**
- Install: `apt-get install texlive-latex-recommended texlive-fonts-recommended`
- Or use: `--pdf-engine=wkhtmltopdf`
- Or use: Python fallback (fpdf2/reportlab)

**DOCX formatting issues:**
- Add `--reference-doc=template.docx` for custom styles
- Check markdown structure (headings, lists)

**Unicode/encoding errors in any format:**
- Ensure source file is UTF-8: `file -i source.md`
- Add `-f markdown+utf8` to pandoc command
- Use xelatex for PDF

**Special characters not rendering in PDF:**
- Use the character replacement table
- Create sanitized version before PDF conversion
- Try xelatex with Unicode font

**Missing pandoc:**
- Ubuntu/Debian: `apt-get install pandoc`
- macOS: `brew install pandoc`
- Fallback: Use Python libraries directly

## When to Use shell_agent vs Manual Workflow

| Scenario | Recommended Approach |
|----------|---------------------|
| Simple DOCX/HTML, no Unicode | `shell_agent` is fine |
| PDF generation required | Manual workflow (this skill) |
| Multiple formats from one source | Manual workflow (this skill) |
| Heavy Unicode/special characters | Manual workflow with sanitization |
| `shell_agent` failed with unknown error | Manual workflow (this skill) |
| Need explicit error visibility | Manual workflow (this skill) |
| Environment constraints (no pandoc) | Python fallback (this skill) |

**After successful manual workflow**: You can attempt `shell_agent` for similar future tasks, but keep this workflow as your known-working fallback.

## Related Skills

- `document-gen-fallback`: Original fallback workflow (less Unicode guidance)
- `write-file-fallback-report`: For when multiple tools fail simultaneously
- `spreadsheet-direct-python`: For Excel/CSV generation with Python

## Skill Health & Metrics

**Expected success rate**: 85%+ with proper environment setup
**Fallback activation**: Use Python fallback if pandoc/LaTeX unavailable
**Verification**: Always verify output files exist AND have content (>0 bytes)
*** End Files
*** Begin Files