---
name: academic-pdf-to-gfm
description: Convert academic PDF papers to GitHub-renderable GFM markdown with math equations. TRIGGERS - PDF, GitHub markdown, math
allowed-tools: Read, Grep, Glob, Bash, Write, Edit
---

# Academic PDF → GitHub GFM Conversion

A battle-tested workflow for converting academic/research PDF papers into GitHub-renderable GFM markdown with inline figures, mathematically correct LaTeX, and validated output.

**Battle-tested on**: López de Prado (2026) "How to Use the Sharpe Ratio" — 51 pages, 82 equations, 8 figures.

> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

## Quick Start (3 Steps)

```bash
# Step 1: Extract prose (best structure preservation)
uv run --python 3.13 --with pymupdf4llm python3 -c "
import pymupdf4llm
md = pymupdf4llm.to_markdown('paper.pdf')
open('paper-raw.md', 'w').write(md)
"

# Step 2: Extract images
uv run --python 3.13 --with pymupdf python3 references/extract-images.py paper.pdf

# Step 3: Validate math before pushing
node references/validate-math.mjs paper.md
```

---

## CRITICAL: Detect PDF Type First

**This determines the entire workflow.** Getting it wrong wastes hours.

### Type A — Word-Generated PDF (Most Modern Academic Papers)

**Signs**: Embedded fonts, copyable text, Unicode math chars when you copy-paste (∑, π, α, β, γ, →)

**Math encoding**: Math is Unicode text in PDF stream — NOT images, NOT glyph maps

**Consequence**: OCR tools like `marker-pdf` **cannot extract LaTeX** — they see text like "γ₄" not `\gamma_4`. They may return empty output or crash silently.

**Required approach**:

1. Use `pymupdf4llm` for prose extraction
2. **Manually transcribe all equations** from PDF screenshots — there is no shortcut
3. Read each formula visually, write LaTeX by hand

**How to confirm**: Run `marker-pdf` — if output is empty or has zero math content, it's Type A.

### Type B — LaTeX-Generated PDF

**Signs**: Computer Modern fonts, precise mathematical spacing, arxiv.org source available

**Math encoding**: Glyph-mapped — structure is partially extractable

**Approach**: `pymupdf4llm` or `pdftotext` for text. If arxiv source exists, extract directly from `.tex` (vastly preferred over PDF conversion).

### Type C — Scanned/Image PDF

**Signs**: All pages are raster images, zero copyable text

**Approach**: OCR pipeline — `marker-pdf` is best option, or `tesseract`

---

## Tool Comparison

| Tool          | Best For                        | Install                           | Key Limitation                                  |
| ------------- | ------------------------------- | --------------------------------- | ----------------------------------------------- |
| `pymupdf4llm` | Type A/B prose (best structure) | `uv run --with pymupdf4llm`       | Math as Unicode, not LaTeX                      |
| `pdftotext`   | Quick plain text                | `brew install poppler`            | Loses table structure                           |
| `markitdown`  | Alternative prose               | `uv run --with 'markitdown[pdf]'` | Slight over-spacing; same math limit            |
| `marker-pdf`  | Type C scanned only             | `pip install marker-pdf`          | **Fails silently on Type A** (Unicode text bug) |

**Never trust `marker-pdf` output on Type A/B PDFs** — the apparent "success" with empty math sections is the failure mode.

---

## Image Extraction

Save `references/extract-images.py`:

```python
import fitz, os, sys

doc = fitz.open(sys.argv[1])
os.makedirs("references/media", exist_ok=True)
saved = []
for page_num in range(len(doc)):
    for img_idx, img in enumerate(doc[page_num].get_images(full=True)):
        xref = img[0]
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]
        if len(img_bytes) < 2048:   # skip icons/logos/watermarks/rules
            continue
        ext = base_image["ext"]
        fname = f"fig-p{page_num+1:02d}-{img_idx+1:02d}.{ext}"
        with open(f"references/media/{fname}", "wb") as f:
            f.write(img_bytes)
        saved.append((page_num+1, fname, base_image.get("width"), base_image.get("height")))
        print(f"Saved: {fname} ({len(img_bytes)//1024}KB, {base_image.get('width')}×{base_image.get('height')})")
doc.close()
print(f"\n{len(saved)} images saved to references/media/")
```

**Naming**: `fig-p{page:02d}-{idx:02d}.{ext}` — page number in name for easy location matching.

**Size filter**: Skip `< 2 KB` (captures icons, watermarks, horizontal rules). Review everything ≥ 2 KB — some are decorative but most are figures.

**Insert in markdown**:

```markdown
![Figure 1: Variance of Sharpe ratio estimates](./media/fig-p12-01.png)
```

Place immediately after the nearest section heading or the paragraph that references the figure.

---

## GitHub GFM Math Rendering Rules

### The `$$` vs ` ```math ``` ` Decision — Root Cause

**GitHub's Markdown pre-processor runs BEFORE the math renderer.** It treats `\\` as an escaped backslash and collapses it to `\`. This breaks LaTeX line breaks in display math.

**The rule is simple**:

| Equation type                                           | Use             | Reason                                     |
| ------------------------------------------------------- | --------------- | ------------------------------------------ |
| Single-line display                                     | `$$...$$`       | No `\\` → pre-processor safe               |
| Multi-line (contains `\\`, `\begin{aligned}`, matrices) | ` ```math ``` ` | Pre-processor does NOT process code fences |
| Inline                                                  | `$...$`         | Standard                                   |

````markdown
# BROKEN on GitHub — \\ stripped by pre-processor:

$$
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
$$

# CORRECT on GitHub:

```math
\begin{aligned}
a &= b + c \\
d &= e + f
\end{aligned}
```
````

````

### Display Block Formatting Rules

- `$$` must be on its **own line** — not `$$formula$$` on one line
- **Blank line required** before AND after every `$$` block
- **Blank line required between consecutive** `$$` blocks
- These rules do NOT apply to `` ```math ``` `` blocks

### Supported/Unsupported LaTeX

See [references/github-math-support-table.md](./references/github-math-support-table.md) for the full table.

**Key things to avoid**:

| Command | Problem | Fix |
|---------|---------|-----|
| `\begin{align}` | ❌ Not supported by GitHub | Use `\begin{aligned}` |
| `\boxed{}` | ⚠️ Can cause raw LaTeX passthrough | Remove or use bold text |
| `\operatorname{}` | ⚠️ Active GitHub bug, inconsistent | Use `\text{}` or `\mathrm{}` |
| `\newcommand` | ❌ Was briefly available, then pulled | Expand all macros inline |
| `x^_y` | Superscript immediately before subscript | Write `x^{*}_{i}` with braces |

### Common Gotchas

- `\\[8pt]` vertical spacing inside `$$` → eaten by pre-processor → move to `` ```math ``` ``
- `\frac{1}{T}:\left(` → spurious colon after fraction → remove colon
- Pearson vs excess kurtosis: most finance formulas need Pearson (γ₄ = 3 for Gaussian), not excess. **Always document the kurtosis convention in the formula comment.**
- `\begin{pmatrix}` with `\\` → must use `` ```math ``` ``
- `\begin{cases}` with multiple rows → must use `` ```math ``` ``

---

## GitLab: No Workarounds Needed

**Empirically verified 2026-03-15** on GitLab CE 18.9.2. Confirmed by Comrak source code analysis.

GitLab uses the **Comrak** Rust parser with `math_dollars: true`. When Comrak encounters `$$`, it calls `handle_dollars` which slices the raw input buffer directly and stores it as a `NodeMath` AST node — CommonMark's backslash handler is never invoked on math content. The raw LaTeX is passed to KaTeX via `<span data-math-style="display/inline">` unchanged.

**Every GitHub workaround is unnecessary on GitLab:**

| GitHub problem | GitHub fix required | GitLab |
|----------------|---------------------|--------|
| `\\` in `$$` stripped → broken multiline | Use ` ```math ``` ` | `$$` works with `\\` |
| `\left\{` → `\left{` (delimiter error) | Use `\left\lbrace` | `\left\{` works |
| `\{...\}` set notation → invisible braces | Use `\lbrace...\rbrace` | `\{...\}` works |
| `\,` in `$$` → literal comma | Remove `\,` | `\,` works |
| `\,` in inline `$` → literal comma | Remove `\,` | `\,` works |

On GitLab you can write standard LaTeX without any platform-specific workarounds. If you're targeting GitLab (or hosting your own GitLab CE), skip all the `\lbrace`/`\rbrace` substitutions and ` ```math ``` ` conversions — plain `$$` with standard LaTeX is correct.

### GitLab.com Has a Hard 50-Span Per-Page Limit

**GitLab.com (SaaS) enforces a limit of 50 total math spans per page** (display + inline combined). After the 50th span, all subsequent equations silently fall back to raw LaTeX text. This limit exists to prevent DoS attacks and cannot be overridden on GitLab.com.

| Document math density | gitlab.com | Self-hosted CE |
|---|---|---|
| ≤ 50 total spans | ✅ Renders fully | ✅ |
| 51–100 spans | ⚠️ Partial render | ✅ |
| 100+ spans (academic papers) | ❌ Most equations raw text | ✅ Disable with `math_rendering_limits_enabled: false` |

**Validated on**: Sharpe ratio paper (341 spans) — breaks at span 51 on gitlab.com, renders fully on local CE.

**The W6 check in `validate-math.mjs`** warns when a file exceeds the limit.

**Summary: which platform to use**:
- **GitHub.com**: No math span limit. Use `\lbrace`/`\rbrace` workarounds (handled by `--fix`).
- **Self-hosted GitLab CE**: No limit (disable math_rendering_limits_enabled). No workarounds needed.
- **GitLab.com**: Only suitable for documents with ≤ 50 math spans.

### Self-hosting GitLab CE for Math-Heavy Documents

GitLab CE is free and runs on a single machine. On a 61 GB workstation with slim config:
- Memory footprint: ~3 GB (`puma['worker_processes'] = 2`, `sidekiq['concurrency'] = 5`, monitoring disabled)
- Push mirroring to GitHub: free on CE (syncs within 5 min)
- `glab` CLI: first-party, comparable to `gh`

```yaml
# docker-compose.yml — slim GitLab CE
services:
  gitlab:
    image: gitlab/gitlab-ce:latest
    restart: unless-stopped
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        external_url 'http://YOUR_IP:8929'
        puma['worker_processes'] = 2
        sidekiq['concurrency'] = 5
        prometheus_monitoring['enable'] = false
        alertmanager['enable'] = false
        node_exporter['enable'] = false
        redis_exporter['enable'] = false
        postgres_exporter['enable'] = false
        gitlab_exporter['enable'] = false
    ports: ["8929:8929", "8922:22"]
    volumes:
      - /srv/gitlab/config:/etc/gitlab
      - /srv/gitlab/logs:/var/log/gitlab
      - /srv/gitlab/data:/var/opt/gitlab
```

---

## Validation Pipeline

### Step 1: Install KaTeX Validator

```bash
bun add -g katex   # Bun-first per project policy
# or: npm install -g katex
````

### Step 2: Run Before Every Push

```bash
# Validate only (exit 1 on errors)
node references/validate-math.mjs your-file.md

# Validate + auto-fix correctable issues
node references/validate-math.mjs your-file.md --fix
```

The script is at [references/validate-math.mjs](./references/validate-math.mjs). It runs two layers:

**Layer 1 — KaTeX syntax**: parse errors in `$`, `$$`, ` ```math ``` ` blocks
**Layer 2 — GFM structural** (issues KaTeX passes but GitHub breaks):

| Code | Severity | Issue                                                                                         | Auto-fix                               |
| ---- | -------- | --------------------------------------------------------------------------------------------- | -------------------------------------- |
| E0   | Error    | `\!` `\,` `\;` `\{` `\}` in `$$` block — pre-processor strips backslash → parse error cascade | ✅ spacing removed; `\{`→`\lbrace`     |
| E0b  | Warning  | `\{` `\}` `\,` in inline `$...$` — invisible braces or literal commas in prose                | ✅ → `\lbrace`/`\rbrace`; `\,` removed |
| E1   | Error    | `$$` block with `\\` — GitHub pre-processor strips backslashes                                | ✅ → ` ```math ``` `                   |
| E2   | Error    | Consecutive `$$` blocks without blank line — orphaned delimiter cascade                       | ✅ add blank line                      |
| W1   | Warning  | Bare `^*` in `$$` or `$` block — markdown italic pairing eats the `*`                         | ✅ → `^{\ast}`                         |
| W2   | Warning  | `\begin{align}` — not supported on GitHub                                                     | ✗ manual                               |
| W3   | Warning  | `\boxed{}` — can cause raw LaTeX passthrough                                                  | ✗ manual                               |
| W4   | Warning  | `\operatorname{}` — inconsistent GitHub support                                               | ✗ manual                               |

**E0 is the most dangerous**: a single failing `$$` block exposes its `$$` delimiters as literal text, creating an orphaned `$` that shifts ALL subsequent inline `$...$` pairings. One broken equation takes down the entire document.

**`\{`/`\}` trap**: In `$$` blocks, `\left\{` becomes `\left{` (invalid KaTeX delimiter → "Missing or unrecognized delimiter") and `\{...\}` set notation becomes invisible grouping. Fix: use `\lbrace`/`\rbrace` (letter-based, CommonMark-immune). This affects every equation using set notation like `\{\hat{SR}_k\}` or `\min_T\left\{...\right\}`.

Exits code 1 on errors (CI-friendly). Warnings do not block CI but should be reviewed.

### Local Preview Tools

```bash
# GitHub-accurate hot-reload preview
bun add -g @hyrious/gfm
gfm your-file.md --serve

# Offline binary (gh extension)
gh extension install thiagokokada/gh-gfm-preview
gh gfm-preview your-file.md
```

VS Code extensions:

- `shd101wyy.markdown-preview-enhanced` — closest to GitHub rendering
- `bierner.markdown-preview-github-styles` — GitHub CSS styling

---

## Multi-Agent Adversarial Equation Validation

For papers with 10+ equations, use this multi-agent pattern:

### Phase 1 — Parallel Extraction

- **Agent A**: Extract prose with pymupdf4llm, transcribe math from PDF screenshots
- **Agent B**: Extract and categorize all images

### Phase 2 — Parallel Validation

- **Agent C**: Validate equations against reference implementation (if code/repo exists)
- **Agent D**: Numerical spot-checks — compute paper's exhibit values, compare

### Phase 3 — Discrepancy Handling

- For each discrepancy: write `/tmp/paper-discrepancy/eq-{N}.md`
- Spawn resolver agents to search online for authoritative third-party sources
- **Authority rule**: Paper is tentatively more authoritative than code implementation; a third independent source breaks ties

### Phase 4 — Guarded Application

- Apply **only HIGH-confidence fixes** to the markdown
- For MEDIUM-confidence: spawn an independent audit agent before touching the file
- Document all discrepancies even if not fixed — future readers need to know

---

## Anti-Patterns

| Anti-pattern                                         | Why it fails                                                                                                                | Fix                                                                  |
| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| `\!\left(` or `\,` in `$$` blocks                    | GH pre-processor strips `\!`→`!` before KaTeX — `!\left(` crashes KaTeX, cascades all                                       | Remove `\!` `\,` `\;` (spacing only) — or use ` ```math ``` `        |
| `\left\{` or `\{...\}` in `$$`/`$` blocks            | `\{`→`{` (CommonMark escape), so `\left\{`→`\left{` = "Missing delimiter" error, and `\{x\}` renders without visible braces | Replace with `\left\lbrace`, `\right\rbrace`, `\lbrace`, `\rbrace`   |
| `$$\begin{aligned}...\\...\end{aligned}$$`           | `\\` stripped by GH pre-processor                                                                                           | Use ` ```math ``` `                                                  |
| Trusting `marker-pdf` on Word PDFs                   | Returns no output or zero math (Unicode bug)                                                                                | Read as screenshots, transcribe manually                             |
| `\begin{align}` in display math                      | Not supported by GitHub                                                                                                     | Replace with `\begin{aligned}`                                       |
| `\operatorname{Cov}`                                 | Active GH bug — sometimes renders raw                                                                                       | Use `\text{Cov}` or `\mathrm{Cov}`                                   |
| KaTeX validation only, no ` ```math ``` ` conversion | KaTeX passes but GH pre-processor still breaks `\\`                                                                         | Also convert ALL multi-line blocks                                   |
| `\boxed{}` for highlighting                          | Can cause raw LaTeX passthrough on GitHub                                                                                   | Use bold text or a blockquote callout                                |
| Excess kurtosis in formulas expecting Pearson        | Silent ~50% underestimate in variance formulas                                                                              | Always document convention; use `scipy.stats.kurtosis(fisher=False)` |
| Consecutive `$$` blocks without blank lines          | GitHub collapses them into one broken block                                                                                 | Add blank line between each block                                    |
| Running validation AFTER pushing                     | Bugs visible in public repo                                                                                                 | Validate locally before every push (`--fix` auto-corrects E0/E1/E2)  |

---

## References

| File                                                                      | Purpose                                |
| ------------------------------------------------------------------------- | -------------------------------------- |
| [validate-math.mjs](./references/validate-math.mjs)                       | KaTeX batch validator for GFM files    |
| [pdf-type-detection.md](./references/pdf-type-detection.md)               | Detailed guide to detecting PDF type   |
| [github-math-support-table.md](./references/github-math-support-table.md) | Full supported/unsupported LaTeX table |

---

## Related Skills

| Skill                                                                                                           | Relationship                                                     |
| --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| [pandoc-pdf-generation](../pandoc-pdf-generation/SKILL.md)                                                      | Opposite direction: markdown → PDF                               |
| [documentation-standards](../documentation-standards/SKILL.md)                                                  | GFM formatting standards                                         |
| [quant-research:opendeviation-eval-metrics](../../../quant-research/skills/opendeviation-eval-metrics/SKILL.md) | Worked example: `references/how-to-use-the-sharpe-ratio-2026.md` |

## Post-Execution Reflection

After this skill completes, reflect before closing the task:

0. **Locate yourself.** — Find this SKILL.md's canonical path before editing.
1. **What failed?** — Fix the instruction that caused it.
2. **What worked better than expected?** — Promote to recommended practice.
3. **What drifted?** — Fix any script, reference, or dependency that no longer matches reality.
4. **Log it.** — Evolution-log entry with trigger, fix, and evidence.

Do NOT defer. The next invocation inherits whatever you leave behind.
