---
name: youtube-to-markdown
description: Convert YouTube talks/seminars into a markdown transcript with the speaker's slides interleaved at the correct timestamps. Use when the user asks to transcribe a slide-driven talk, lecture, seminar, or conference video; wants slides extracted and inlined; or asks for a "video with slides" markdown writeup. Skip for interviews, panels, or videos without legible slides.
allowed-tools: Read, Write, Edit, Bash
---

# YouTube talk → markdown with slides

Produce a markdown document where the talk's slides (transcribed as markdown/LaTeX) are interleaved with the spoken transcript at the correct timestamps. Each transcript paragraph and each slide gets a clickable YouTube link.

## End-to-end workflow

1. **Download** video + auto-subtitles (`download.sh <video-url-or-id>`)
2. **Detect & extract slide frames** (`detect-slides.sh <video-id>`)
3. **View each frame and write markdown for it** — this is the part you (Claude) do interactively. Save the slide list as JSON.
4. **Build the merged markdown** (`build_merged.py <video-id> <slides.json>`)
5. **Upload to a gist and verify rendering** (`verify-render.sh <gist-id>`)

The user wants ONE markdown file out. They do not want the raw VTT or intermediate text files in the gist.

## Step-by-step details

### 1. Downloading

Resolving the video URL currently requires a JS runtime (to solve YouTube's n-challenge) plus the remote EJS components. If `download.sh` errors, make sure `deno` is installed (`brew install deno`) and yt-dlp is up to date (`uv pip install --upgrade yt-dlp`).

The script downloads:
- `<id>.mp4` — the video itself (format 18, 360p; sufficient for slide OCR)
- `<id>.en.vtt` — auto-captions
- `<id>.en-orig.vtt` — original subs if present
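
For orientation, here is a minimal Python sketch of a yt-dlp invocation that would yield the three files above. It is an assumption about what `download.sh` does, not a copy of it; `build_download_cmd` is a hypothetical helper, and the flags are ordinary yt-dlp options.

```python
def build_download_cmd(video: str) -> list[str]:
    """Build a yt-dlp argv for one video URL or ID (hypothetical helper)."""
    return [
        "yt-dlp",
        "-f", "18",                  # format 18: the 360p MP4 mentioned above
        "--write-auto-subs",         # auto-generated captions
        "--write-subs",              # uploader-provided subtitles, if present
        "--sub-langs", "en,en-orig",
        "--sub-format", "vtt",
        "-o", "%(id)s.%(ext)s",      # output names start with the video ID
        video,
    ]

# Usage (needs yt-dlp on PATH), e.g.:
#   import subprocess
#   subprocess.run(build_download_cmd("<video-url-or-id>"), check=True)
```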

### 2. Slide detection

`detect-slides.sh` runs ffmpeg scene detection **on the cropped slide region** of the frame (typical Zoom layout: slides on the left ~75%, faces on the right). Cropping is essential — without it, every face change in the gallery view triggers a false positive.

It produces `/tmp/yt-slides-<id>/slide_NNNN_<seconds>.jpg` files. Expect 1.5–2× as many frames as actual distinct slides because builds, mouse movements and brief revisits register as transitions. That's fine; you'll dedupe by eye.
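
The crop-then-detect idea can be sketched as an ffmpeg filter string built in Python. This is an illustration under assumptions, not the contents of `detect-slides.sh`: the helper names are hypothetical, the 0.75 crop fraction comes from the layout described above, and the 0.1 scene threshold is a placeholder to tune per video. It also does not reproduce the `slide_NNNN_<seconds>.jpg` naming, which needs the frame timestamps.

```python
def slide_filter(slide_frac: float = 0.75, threshold: float = 0.1) -> str:
    """Crop to the left slide region, then keep only scene-change frames."""
    crop = f"crop=iw*{slide_frac}:ih:0:0"        # width, height, x, y
    select = f"select='gt(scene,{threshold})'"   # scene score above threshold
    return f"{crop},{select}"

def detect_cmd(video: str, out_pattern: str) -> list[str]:
    """argv for ffmpeg; the order of crop before select is the important part."""
    return [
        "ffmpeg", "-i", video,
        "-vf", slide_filter(),
        "-vsync", "vfr",   # emit only selected frames (newer builds: -fps_mode vfr)
        out_pattern,
    ]
```

Because `crop` runs before `select`, the scene score is computed on the slide region only, so changes in the face gallery do not contribute to it.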

### 3. Transcribing slides (manual, interactive)

Use the Read tool on each JPEG frame (`/tmp/yt-slides-<id>/slide_NNNN_*.jpg`) — Read renders images so you can see and transcribe their content. For each distinct slide, write its content as markdown with LaTeX math. Skip frames that are duplicates of the previous slide (revisits, mouse-only changes, build animations that just reveal the next bullet).

Save your slide list as a JSON file with this shape:

```json
[
  {"t": 0.0, "md": "**Title**\n\n*Speaker — Affiliation*"},
  {"t": 43.9, "md": "### Section heading\n\nLet $\\mathbf{N} = \\{1,2,3,\\ldots\\}$..."},
  ...
]
```

`t` is the timestamp in seconds (the scene-detection time is fine; precision doesn't matter). `md` is markdown that will be rendered inside a blockquote.
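
To avoid typing timestamps by hand, a skeleton of this JSON can be generated from the frame filenames and then filled in (and pruned of duplicates) while viewing the frames. The sketch below is a hypothetical helper, not part of the skill's scripts, and it assumes the `<seconds>` part of the filename is a plain integer or decimal number.

```python
import json
import re

FRAME_RE = re.compile(r"slide_(\d+)_(\d+(?:\.\d+)?)\.jpg$")

def slide_skeleton(frame_names):
    """Return [{"t": seconds, "md": ""}, ...] sorted by time; ignores other files."""
    slides = []
    for name in frame_names:
        match = FRAME_RE.search(name)
        if match:
            slides.append({"t": float(match.group(2)), "md": ""})
    return sorted(slides, key=lambda slide: slide["t"])

if __name__ == "__main__":
    demo = ["slide_0002_43.9.jpg", "slide_0001_0.jpg"]
    print(json.dumps(slide_skeleton(demo), indent=2))
```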

**For long videos with many slides (>30):** transcribing is the bulk of the work. Do it diligently — don't skim. The user will check the output.

### 4. Building the merged markdown

`build_merged.py` does:
- Parses the VTT, keeping only the **new line per cue** for YouTube auto-captions (the line with inline `<HH:MM:SS.fff>` word timestamps); falls back to the last text line for plain VTTs
- Strips a conservative set of filler words (`um`, `uh`, `er`, `ah`) and collapses stutter-repeats of common short words (`I I` → `I`, `the the` → `the`). This is deliberately *not* a generic `(\w+)\s+\1` match, because that would destroy legitimate prose like *"that that book"*
- Groups cues into paragraphs of roughly 30–75 seconds, breaking at sentence boundaries and **forcing a paragraph break at every slide timestamp**, so each slide lands at its actual time rather than at the nearest paragraph boundary
- Interleaves slides at their timestamps
- **Double-escapes LaTeX in math blocks** (see "GFM math gotchas" below) — idempotent, safe to run twice
- Writes `<id>.with-slides.md`
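
The cleanup and grouping steps can be sketched as follows. This is a simplified illustration, not the code in `build_merged.py`: the function names, the exact word list for stutter collapsing, and the break rule (sentence end after 30 s, hard cap at 75 s) are assumptions chosen to match the description above.

```python
import re

FILLERS = re.compile(r"\b(?:um|uh|er|ah)\b,?\s*", re.IGNORECASE)
# Whitelist of short words only, so "that that book" is left alone.
STUTTER = re.compile(r"\b(I|the|a|and|to|we)(?:\s+\1\b)+", re.IGNORECASE)

def clean_text(text: str) -> str:
    text = FILLERS.sub("", text)
    text = STUTTER.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

def group_paragraphs(cues, slide_times, min_s=30.0, max_s=75.0):
    """cues: [(start_seconds, text)]. Returns [(paragraph_start, text)]."""
    paragraphs, parts, start = [], [], None
    pending = sorted(slide_times)
    for t, text in cues:
        # Forced break: a slide timestamp has been reached or passed.
        slide_due = bool(pending) and t >= pending[0]
        while pending and t >= pending[0]:
            pending.pop(0)
        if slide_due and parts:
            paragraphs.append((start, " ".join(parts)))
            parts = []
        if not parts:
            start = t
        parts.append(text)
        age = t - start
        at_sentence_end = text.rstrip().endswith((".", "?", "!"))
        if age >= max_s or (age >= min_s and at_sentence_end):
            paragraphs.append((start, " ".join(parts)))
            parts = []
    if parts:
        paragraphs.append((start, " ".join(parts)))
    return paragraphs
```

Auto-captions often carry little punctuation, so in practice the `max_s` cap and the slide-forced breaks may do most of the splitting.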

### 5. Verifying the rendered output

Push to a gist (`gh gist create --public <file>`) and run `verify-render.sh <gist-id>`. The script fetches the rendered HTML and:
- Counts `js-inline-math` and `js-display-math` blocks
- Looks for `katex-error` markers
- Samples post-GFM math content so you can eyeball that backslash-escapes survived

GitHub renders math client-side via KaTeX. The HTML response embeds the math source in `<span class="js-inline-math">$...$</span>` placeholders — that's what KaTeX consumes in the browser. Anything missing or corrupted there will be missing or corrupted in the user's view.
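
The checks listed above amount to a few string counts on the fetched HTML. Here is a minimal sketch that works on an HTML string already in hand; it is not `verify-render.sh`, `math_report` is a hypothetical name, and the sampling regex assumes the placeholder markup is exactly `<span class="js-inline-math">...</span>` as shown above.

```python
import html as htmllib
import re

INLINE_SPAN = re.compile(r'<span class="js-inline-math">(.*?)</span>', re.S)

def math_report(page: str, n_samples: int = 5) -> dict:
    """Count math placeholders and error markers; sample inline math sources."""
    spans = INLINE_SPAN.findall(page)
    return {
        "inline": page.count("js-inline-math"),
        "display": page.count("js-display-math"),
        "errors": page.count("katex-error"),
        # Unescape entities so backslashes and braces can be eyeballed.
        "samples": [htmllib.unescape(s) for s in spans[:n_samples]],
    }
```

A sample that reads `$\{1,2\}$` means the backslashes survived GFM; one that reads `${1,2}$` means they were eaten.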

## GFM math gotchas (the big one)

GitHub's GFM applies CommonMark backslash-escape parsing **before** the math is handed to KaTeX. CommonMark eats `\X` and gives back `X` for any ASCII punctuation X. So:

| Source in markdown | What CommonMark passes to KaTeX | KaTeX renders |
|---|---|---|
| `$\{1,2,3\}$` | `${1,2,3}$` | broken set braces |
| `$\\` (line break in cases) | `$\` | KaTeX error |
| `$\&`, `$\#`, `$\_` | `$&`, `$#`, `$_` | likely broken |

**Fix:** in math regions, double the backslash before any of `{ } \ & # $ _ , ; : ! |`. The build script does this automatically via `fix_math_escapes()` / `double_escapes()`. The implementation is character-level (not regex-based) so it's idempotent: re-running it on already-escaped content is a no-op.

`\setminus`, `\ldots`, `\mathbf`, `\bigcap`, etc. survive untouched because the character after the backslash is a letter, not punctuation, so CommonMark doesn't touch them. But thin-spaces (`\,`), spacers (`\;`, `\:`, `\!`), and `\|` are vulnerable — that's why the escapable set isn't just `{}\&#$_`.
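
A character-level, idempotent doubling pass might look like the sketch below. It is an illustration of the idea, not the source of `fix_math_escapes()` / `double_escapes()`: the name `double_escapes_sketch` is hypothetical, it expects only the text between the math delimiters, and it resolves one ambiguity by assumption (a `\\` directly followed by escapable punctuation, and a run of four backslashes, are both treated as already escaped).

```python
BS = "\\"
ESCAPABLE = set("{}&#$_,;:!|")  # the punctuation listed above, minus the backslash

def double_escapes_sketch(math: str) -> str:
    out = []
    i, n = 0, len(math)
    while i < n:
        if math[i] != BS:
            out.append(math[i])
            i += 1
        elif math.startswith(BS * 4, i):
            out.append(BS * 4)              # already-escaped line break
            i += 4
        elif math.startswith(BS * 2, i) and i + 2 < n and math[i + 2] in ESCAPABLE:
            out.append(math[i:i + 3])       # already-escaped \{ and friends
            i += 3
        elif math.startswith(BS * 2, i):
            out.append(BS * 4)              # raw line break \\ becomes \\\\
            i += 2
        elif i + 1 < n and math[i + 1] in ESCAPABLE:
            out.append(BS * 2 + math[i + 1])  # raw \{ becomes \\{
            i += 2
        else:
            out.append(BS)                  # \alpha, \ldots: letter follows
            i += 1
    return "".join(out)
```

For example, `\{1,2,3\}` becomes `\\{1,2,3\\}`, and feeding that result back in returns it unchanged, because every token the pass emits is one it recognizes as already escaped.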

## Other markdown gotchas

- **Don't wrap inline math in `*...*` italics**: `*$h$-fold product*` confuses GFM parsers. Either drop the italics or move the math out: `*h-fold product*` or `$h$-fold product`.
- **Speaker change markers**: VTT auto-captions render speaker changes as `&gt;&gt;`. After `html.unescape` it's `>>`, which collides with blockquote markers. Replace with `»`.
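
The speaker-marker replacement is a one-liner once entities are decoded. A small sketch, with `normalize_speaker_markers` as a hypothetical name:

```python
import html

def normalize_speaker_markers(text: str) -> str:
    """Decode HTML entities, then swap '>>' for '»' so it can't start a blockquote."""
    return html.unescape(text).replace(">>", "»")
```

Decoding first matters: replacing `>>` before `html.unescape` would miss the `&gt;&gt;` form entirely.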

## YouTube auto-VTT structure (so you understand the dedup)

Each cue has the form:

```
00:00:09.679 --> 00:00:14.789
on archive in which a um
AI<00:00:15.280><c> system</c><00:00:15.679><c> called</c>...
```

The first line is the previous cue's text (kept on screen). The second line is the new content (with inline word timestamps). Naively concatenating gives ~2× duplication. The fix in `build_merged.py` is to keep only the line containing inline `<HH:MM:SS.fff>` markers.
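
The dedup rule can be sketched as below. This is an illustration rather than the parser in `build_merged.py`; the function names are hypothetical, and it assumes the choice between "auto-caption" and "plain" mode is made once per file, based on whether any cue contains inline word timestamps.

```python
import re

INLINE_TS = re.compile(r"<\d{2}:\d{2}:\d{2}\.\d{3}>")
TAG = re.compile(r"<[^>]+>")

def _seconds(stamp: str) -> float:
    """Parse HH:MM:SS.fff or MM:SS.fff into seconds."""
    total = 0.0
    for part in stamp.split(":"):
        total = total * 60 + float(part)
    return total

def new_text_per_cue(vtt: str) -> list[tuple[float, str]]:
    cues = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        timing = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if timing is None:
            continue  # WEBVTT header, NOTE or STYLE block
        start = _seconds(lines[timing].split("-->")[0].strip())
        cues.append((start, [ln for ln in lines[timing + 1:] if ln.strip()]))
    auto = any(INLINE_TS.search(ln) for _, text in cues for ln in text)
    out = []
    for start, text in cues:
        # Auto-captions: keep only the line carrying word timestamps.
        # Plain VTT: keep the last text line of the cue.
        picked = [ln for ln in text if INLINE_TS.search(ln)] if auto else text[-1:]
        if picked:
            out.append((start, TAG.sub("", picked[-1]).strip()))
    return out
```

On the cue shown above this keeps `AI system called` and drops the carried-over first line, which is what removes the ~2× duplication.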

## Output format conventions

The merged markdown uses:
- `# Title` at the top with talk metadata
- `> 📊 **Slide @ [MM:SS](https://youtu.be/...?t=N)**` blockquote for each slide
- `**[[MM:SS](https://youtu.be/...?t=N)]** transcript text...` for each paragraph
- `»` for speaker changes within a paragraph

Rationale: blockquotes make slides visually distinct from the transcript, linked timestamps let the reader jump to any moment, and there are no extra files because the user explicitly asked for the merged .md only.
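
These conventions can be sketched as a few formatting helpers. The names are hypothetical and the URL shape `https://youtu.be/<video-id>?t=<seconds>` is an assumption; `build_merged.py` may construct its links differently.

```python
def mmss(t: float) -> str:
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"

def timestamp_link(video_id: str, t: float) -> str:
    return f"[{mmss(t)}](https://youtu.be/{video_id}?t={int(t)})"

def slide_block(video_id: str, t: float, md: str) -> str:
    """Render a slide as a blockquote, prefixing every line with '>'."""
    header = f"> 📊 **Slide @ {timestamp_link(video_id, t)}**"
    body = "\n".join(f"> {line}" if line else ">" for line in md.split("\n"))
    return f"{header}\n>\n{body}"

def transcript_paragraph(video_id: str, t: float, text: str) -> str:
    return f"**[{timestamp_link(video_id, t)}]** {text}"
```

Blank lines inside a slide are emitted as a bare `>` so the blockquote is not split into several separate quotes.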

## Helper scripts

- `~/.claude/skills/youtube-to-markdown/download.sh` — yt-dlp wrapper
- `~/.claude/skills/youtube-to-markdown/detect-slides.sh` — ffmpeg scene detection + frame extraction
- `~/.claude/skills/youtube-to-markdown/build_merged.py` — VTT cleanup + slide interleave + escape fixing
- `~/.claude/skills/youtube-to-markdown/verify-render.sh` — fetch gist HTML and sniff for KaTeX errors

## When NOT to use this skill

- The user just wants a transcript without slides → use yt-dlp directly with `--skip-download --write-auto-sub`, then dedupe rolling captions.
- The video isn't a slide-driven talk (e.g., interviews, panel discussions without visuals) → just produce the cleaned transcript.
- The user wants a polished prose summary, not a transcript → that's a different task; transcripts here are lightly cleaned but still verbatim.