---
name: wp-static-clone
description: >
  Clones a live WordPress (or other CMS-driven) site into a static HTML
  site deployable on any static host (Cloudflare Pages, Netlify, Vercel,
  S3+CloudFront, plain Apache/nginx). Use when the user wants to "scrape",
  "freeze", "archive", "static-ify", or "move to [host]" a WordPress site,
  or asks to turn a sitemap into deployable static HTML. Pulls every URL
  from sitemap_index.xml, fetches all assets, rewrites paths to be
  root-relative, strips WP runtime markup, and outputs a flat directory
  ready to deploy with no build command.
---

# wp-static-clone

Turn a live WordPress site into a static HTML clone deployable on any static host. Driven by the site's XML sitemap. Handles the WordPress-specific gotchas — Cloudflare bot protection, mid-scrape link rewriting, proxied analytics, R2-offloaded uploads, comment-form runtime, Yoast attribution, Gravatar privacy — that a naïve `wget` run misses.

**Recipes live in `AGENTS.md`; reusable scripts in `scripts/`.** This file covers the workflow, gotchas, and output structure.

## When to use

Trigger on requests like:

- "Scrape this WordPress site for [host]"
- "Freeze [domain] as static HTML"
- "Pull all the pages from this sitemap and turn them into static files"
- "Move this WP site to [host] with no build step"

The broad shape (sitemap → wget → root-relative paths → static host) generalises to any CMS that emits a standard XML sitemap. The runtime cleanup (comment forms, Plausible proxy, Gravatar, Yoast) is WordPress-specific.

## Workflow

### Phase 0 — Confirm intent

Before scraping, confirm:

- **Source URL** (the live site).
- **Target host** — Cloudflare Pages, Netlify, Vercel, generic static. Drives Phase 9.
- **Same or different domain** at the destination. Drives whether `og:url`, `<link rel="canonical">`, and JSON-LD `@id` stay absolute (same domain — correct SEO behaviour) or get rewritten (different domain).
- **What to do with analytics and forms.** WP plugins for both can't run statically. Plausible gets replaced with the standard tracker (recipe in `AGENTS.md`); contact/search forms either get removed or wired through Pages Functions / Formspree / Netlify Forms — host-specific.

### Phase 1 — Discover URLs and pull XML sitemaps

Fetch the sitemap index. Try `<root>/sitemap_index.xml` (Yoast convention) first, then `<root>/sitemap.xml`. If the index references sub-sitemaps (`page-sitemap.xml`, `post-sitemap.xml`, …), fetch each and concatenate `<loc>` values into `urls.txt`. Skip image-sitemap entries.

Also fetch the XML sitemaps themselves and the Yoast XSL stylesheet now (recipe in `AGENTS.md`) — they aren't linked from HTML, so wget `-p` won't find them later.
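The URL-collection step can be sketched as follows. This is a minimal illustration, not the recipe from `AGENTS.md`: the helper names (`local_name`, `extract_locs`) and the sample XML are assumptions, and fetching the sitemap bodies is left out. It reads only `<loc>` elements that are direct children of `<sitemap>` or `<url>`, so nested `<image:loc>` entries are skipped.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def extract_locs(xml_text: str) -> list[str]:
    """Return <loc> values that are direct children of <sitemap> or <url>."""
    locs = []
    for entry in ET.fromstring(xml_text):
        if local_name(entry.tag) not in ("sitemap", "url"):
            continue
        for child in entry:
            # Image entries nest their <image:loc> one level deeper, so
            # looking only at direct children skips them.
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


# Hypothetical sample data standing in for fetched sitemap bodies.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

PAGE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/about/</loc>
    <image:image><image:loc>https://example.com/wp-content/uploads/a.png</image:loc></image:image>
  </url>
</urlset>"""

print(extract_locs(INDEX_XML))  # the two child sitemap URLs
print(extract_locs(PAGE_XML))   # only the page URL, not the image URL
```

Running `extract_locs` on the index gives the child sitemaps to fetch; running it on each child gives the lines for `urls.txt`.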

### Phase 2 — Scrape in one shot

**Critical:** scrape every URL in a single `wget` invocation so its `--convert-links` pass sees all downloaded files and rewrites cross-page links correctly. Scraping URLs in separate runs leaves residual absolute links on whichever page was scraped first/last. Recipe in `AGENTS.md`.
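As a rough sketch of what "one invocation" means, the helper below assembles a single `wget` command that reads every URL from `urls.txt`. The flags are standard GNU wget options, but this particular selection, the `build_wget_command` name, and the placeholder user agent are assumptions; the maintained recipe is the one in `AGENTS.md`.

```python
import shlex


def build_wget_command(urls_file: str, out_dir: str, user_agent: str) -> list[str]:
    """Assemble one wget run covering every URL, so -k sees all downloads."""
    return [
        "wget",
        f"--input-file={urls_file}",       # every page in the same run
        f"--directory-prefix={out_dir}",
        "--force-directories",             # keep the URL path as directories
        "--no-host-directories",           # but no top-level hostname folder
        "--page-requisites",               # CSS, JS, images referenced by pages
        "--convert-links",                 # rewrite links to the local copies
        "--adjust-extension",
        f"--user-agent={user_agent}",
        "--header=Accept: text/html,application/xhtml+xml",
        "--header=Accept-Language: en-US,en;q=0.9",
        "--wait=1",                        # be gentle with the origin
    ]


# Placeholder UA string; substitute a real, current browser UA.
command = build_wget_command("urls.txt", "output", "Mozilla/5.0 (placeholder)")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would start the scrape; printing it first makes it easy to review before hitting the live site.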

### Phase 3 — Pull assets the page-requisites pass missed

wget `-p` doesn't follow some assets because they appear only in `og:image`, `apple-touch-icon`, JSON-LD `image`/`logo`, `msapplication-TileImage`, or `<link rel="modulepreload">`. Audit and fetch the long tail. Recipe in `AGENTS.md` covers all three asset roots (`uploads`, `themes`, `plugins`).
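One way to audit for these references is to parse each page's tags and collect the URLs wget would not have followed. The sketch below is an assumption-laden illustration (the `HeadAssetAudit` class and `find_head_assets` helper are hypothetical names), and it covers only the `<meta>` and `<link>` cases; JSON-LD `image`/`logo` values would need a separate JSON pass.

```python
from html.parser import HTMLParser

LINK_RELS = {"apple-touch-icon", "modulepreload"}
META_KEYS = {"og:image", "msapplication-TileImage"}


class HeadAssetAudit(HTMLParser):
    """Collect asset URLs that appear only in <meta> / <link> tags."""

    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link":
            rels = set((attr.get("rel") or "").split())
            if rels & LINK_RELS and attr.get("href"):
                self.urls.add(attr["href"])
        elif tag == "meta":
            key = attr.get("property") or attr.get("name") or ""
            if key in META_KEYS and attr.get("content"):
                self.urls.add(attr["content"])


def find_head_assets(html: str) -> set[str]:
    parser = HeadAssetAudit()
    parser.feed(html)
    return parser.urls


# Hypothetical page head used only to exercise the parser.
SAMPLE = """<head>
<meta property="og:image" content="https://example.com/wp-content/uploads/og.png" />
<meta name="msapplication-TileImage" content="https://example.com/wp-content/uploads/tile.png" />
<link rel="apple-touch-icon" href="https://example.com/wp-content/uploads/touch.png" />
<link rel="stylesheet" href="https://example.com/wp-content/themes/t/style.css" />
</head>"""

for url in sorted(find_head_assets(SAMPLE)):
    print(url)
```

Any URL in the result that has no matching file under `output/` is a candidate for an explicit fetch.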

### Phase 4 — Convert paths to root-relative

`wget -k` produces a mix of `../wp-content/...` (depth-relative) and bare `wp-content/...` (homepage). Both work locally but break the moment a page moves. Convert to root-relative `/wp-content/...` everywhere:

```sh
python3 scripts/rewrite-paths.py output/ urls.txt --source-domain example.com
```

The script derives the page-slug list from `urls.txt`, not from a directory walk — otherwise wget-grabbed archive directories like `category/`, `feed/`, `author/`, `wp-json/` get wrongly classified as pages and their inter-page links get mis-rewritten.

The script defaults to WordPress asset roots (`wp-content`, `wp-includes`). For non-WP sources, override with `--asset-roots`: e.g. `--asset-roots sites/default/files,sites/default/themes` for Drupal, `--asset-roots content/images` for Ghost. The rest of the rewriter is CMS-agnostic.
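The asset half of that rewrite can be illustrated with a single regular expression. This is a simplified sketch, not the logic of `scripts/rewrite-paths.py`: `to_root_relative` is a hypothetical name, it ignores inter-page links and `--source-domain`, and because it also matches after whitespace (to catch `srcset` entries) it could touch visible prose that happens to contain `wp-content/`.

```python
import re


def to_root_relative(html: str, asset_roots=("wp-content", "wp-includes")) -> str:
    """Turn '../wp-content/...' and bare 'wp-content/...' into '/wp-content/...'."""
    roots = "|".join(re.escape(root) for root in asset_roots)
    # Group 1: the character in front of the path (quote, '(', ',' or space).
    # Then any number of '../' hops, then group 2: one of the asset roots.
    pattern = re.compile(r"""(["'(,\s])(?:\.\./)*(%s)/""" % roots)
    return pattern.sub(r"\1/\2/", html)


deep = '<img src="../../wp-content/uploads/a.png">'
home = '<link href="wp-includes/css/x.css">'
srcset = '<img srcset="../wp-content/a.jpg 300w, ../wp-content/b.jpg 600w">'

print(to_root_relative(deep))
print(to_root_relative(home))
print(to_root_relative(srcset))
```

Paths that are already root-relative or absolute are left alone, because the character before the asset root is then a `/`, which is not in the leading character class.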

### Phase 5 — Brand the static output

So future-you (or anyone reading view-source) can tell at a glance that this is the static clone, not the live WP install:

```sh
python3 scripts/insert-banner.py output/
```

Inserts an HTML comment after `<!DOCTYPE html>` on every page. Idempotent. Then replace the "Generated by Yoast SEO" attribution in `wp-content/plugins/wordpress-seo/css/main-sitemap.xsl` — recipe in `AGENTS.md`.
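Idempotent insertion boils down to checking for the banner before writing it. The following is a minimal sketch of that idea, assuming a hypothetical banner text and function name; the real `scripts/insert-banner.py` may word the comment differently and detect it another way.

```python
import re

# Hypothetical banner text, for illustration only.
BANNER = "<!-- Static clone of a WordPress site. -->"

DOCTYPE_RE = re.compile(r"<!DOCTYPE html[^>]*>", re.IGNORECASE)


def insert_banner(html: str, banner: str = BANNER) -> str:
    """Place the banner right after the doctype; do nothing if it is present."""
    if banner in html:
        return html  # already branded, so a second run is a no-op
    match = DOCTYPE_RE.search(html)
    if match is None:
        # No doctype found: fall back to putting the banner first.
        return banner + "\n" + html
    end = match.end()
    return html[:end] + "\n" + banner + html[end:]


page = "<!DOCTYPE html>\n<html></html>"
once = insert_banner(page)
twice = insert_banner(once)
print(once)
print(once == twice)
```

Because the second call returns its input unchanged, the pass can be re-run after any later scrape without stacking comments.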

### Phase 6 — Replace WP runtime hooks

Three categories of WP-only markup break once the backend is gone:

**1. Comment forms, reply links, and dead head tags.** One pass:

```sh
python3 scripts/strip-wp-runtime.py output/
```

Removes `<div id="respond">` blocks (the comment form), `comment-reply-link` anchors in both block-theme and classic-theme variants, the `comment-reply-js` script tag and its underlying file, and dead `<head>` tags (REST API discovery, RSD, oEmbed alternates, RSS alternates, archive `next` links). Match-by-class throughout — no language assumptions about link text.
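To make the dead `<head>` tag part concrete, here is a small sketch that drops `<link>` tags by their `rel` or `type` attribute. It is an illustration under assumptions, not the code in `scripts/strip-wp-runtime.py`: the function name is hypothetical, it uses regular expressions rather than a full HTML parser, and it leaves comment forms, reply links, and archive `next` links out of scope.

```python
import re

LINK_TAG = re.compile(r"<link\b[^>]*>\s*", re.IGNORECASE)

# REST API discovery and RSD are identified by their rel value.
DEAD_REL = re.compile(r"""rel=["'](?:https://api\.w\.org/|EditURI)["']""", re.IGNORECASE)

# oEmbed and RSS alternates are identified by their MIME type.
DEAD_TYPE = re.compile(
    r"""type=["'](?:application/json\+oembed|text/xml\+oembed|application/rss\+xml)["']""",
    re.IGNORECASE,
)


def strip_dead_head_links(html: str) -> str:
    """Remove <link> tags that point at endpoints the static clone lacks."""

    def drop_if_dead(match):
        tag = match.group(0)
        if DEAD_REL.search(tag) or DEAD_TYPE.search(tag):
            return ""
        return tag  # keep stylesheets, icons, canonical, and so on

    return LINK_TAG.sub(drop_if_dead, html)


# Hypothetical head fragment in the shape WordPress typically emits.
HEAD = """<head>
<link rel="https://api.w.org/" href="https://example.com/wp-json/" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="https://example.com/xmlrpc.php?rsd" />
<link rel="stylesheet" href="/wp-content/themes/t/style.css" />
<link rel="alternate" type="application/rss+xml" title="Feed" href="https://example.com/feed/" />
</head>"""

print(strip_dead_head_links(HEAD))
```

Matching on attributes rather than on titles or link text follows the same "no language assumptions" principle described above.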

**2. Plausible analytics proxy.** The WP plugin proxies the script through `/wp-content/uploads/<hash>/pa-XXX.js` and posts events back to `/wp-json/...`. Both endpoints disappear. Replace the two-script block with the standard tracker — recipe in `AGENTS.md`.

**3. Gravatar avatars.** Self-host every distinct `(hash, size)` pair so pages stop sending `?s=N&d=mm` requests to a third party:

```sh
python3 scripts/selfhost-gravatars.py output/
```

Saves under `avatars/` and rewrites every reference. Detects extension from response bytes (PNG fallback vs JPEG real avatar), keeps size variants separate (`?s=40` and `?s=80` are different files).
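The two details above, keying on `(hash, size)` and sniffing the extension from the response bytes, might look roughly like this. The helper names are hypothetical and the real script may differ; the fallback size of 80 reflects Gravatar's documented default when no `s` parameter is given.

```python
import html
from urllib.parse import parse_qs, urlparse


def gravatar_key(url: str, default_size: int = 80) -> tuple[str, int]:
    """Return the (hash, size) pair that identifies one avatar file."""
    # URLs copied out of HTML often carry '&#038;' or '&amp;' separators.
    parsed = urlparse(html.unescape(url))
    avatar_hash = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    size = parse_qs(parsed.query).get("s", [str(default_size)])[0]
    return avatar_hash, int(size)


def sniff_extension(data: bytes) -> str:
    """Pick a file extension from the leading magic bytes of the response."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    return ".bin"  # unknown format: keep the bytes, flag for manual review


small = gravatar_key("https://secure.gravatar.com/avatar/abc123?s=40&#038;d=mm")
large = gravatar_key("https://secure.gravatar.com/avatar/abc123?s=80&#038;d=mm")
print(small, large, small != large)
print(sniff_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

The same hash at two sizes yields two distinct keys, which is why the size variants end up as separate files under `avatars/`.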

After the scripts, audit remaining absolute source-domain URLs (recipe in `AGENTS.md`) and triage case by case: author archives → strip the `<a>` wrapper, server-rendered iframes → drop the wrapping `<p>`, Gravity Forms script blocks → strip any `<script>` that mentions `gform`, etc.

### Phase 7 — Copy `robots.txt`

Not linked from HTML; fetch it explicitly. Adjust the `Sitemap:` reference if the deployed sitemap path differs from the source.
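If the sitemap path does change, the adjustment is a one-line rewrite. A minimal sketch, assuming a hypothetical `retarget_sitemap` helper and a single target sitemap URL (any extra `Sitemap:` lines are dropped rather than duplicated):

```python
def retarget_sitemap(robots_txt: str, new_sitemap_url: str) -> str:
    """Point the Sitemap: directive at the deployed sitemap location."""
    lines = []
    replaced = False
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            if replaced:
                continue  # keep only one Sitemap: line in this sketch
            line = f"Sitemap: {new_sitemap_url}"
            replaced = True
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical robots.txt body as fetched from the source site.
ROBOTS = (
    "User-agent: *\n"
    "Disallow: /wp-admin/\n"
    "Sitemap: https://old.example/sitemap_index.xml\n"
)

print(retarget_sitemap(ROBOTS, "https://new.example/sitemap_index.xml"))
```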

### Phase 8 — Verify locally

Serve from `output/` with `python3 -m http.server`, then run the verify checklist in `AGENTS.md`:

1. Every URL in `urls.txt` resolves to a file (no missed pages).
2. No remaining `https://<source-domain>/` outside the canonical / `og:url` / JSON-LD allow-list.
3. No broken internal links from `wget --spider`.
4. Spot-check the homepage and a deep page in a browser. Watch srcset images, sidebar widgets, and the header banner — those break silently if missed.
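Check 1 in the list above can be sketched as a mapping from each sitemap URL to the file it should have produced. This assumes pretty permalinks (every page saved as `<slug>/index.html`, as in the output structure below); the function names are hypothetical and the full checklist lives in `AGENTS.md`.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def expected_file(url: str, root: Path) -> Path:
    """Map a page URL to the index.html the scrape should have written."""
    path = unquote(urlparse(url).path).strip("/")
    return root / path / "index.html" if path else root / "index.html"


def missing_pages(urls, root: Path) -> list[str]:
    """Return the URLs whose expected file does not exist under root."""
    return [url for url in urls if not expected_file(url, root).is_file()]


# Build a throwaway output tree with a homepage and one page.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "about").mkdir()
    (root / "index.html").write_text("<!DOCTYPE html>")
    (root / "about" / "index.html").write_text("<!DOCTYPE html>")
    urls = [
        "https://example.com/",
        "https://example.com/about/",
        "https://example.com/contact/",
    ]
    # The contact page was never written, so it is the only URL reported.
    print(missing_pages(urls, root))
```

In practice `urls` would come from reading `urls.txt` line by line and `root` would be `Path("output")`.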

### Phase 9 — Deploy

Host-specific recipes in `AGENTS.md`:

- **Cloudflare Pages** — `_redirects`, `_headers`, "no build command, no output directory" defaults.
- **Netlify** — same `_redirects` / `_headers` syntax, plus `netlify.toml`.
- **Vercel** — `vercel.json` with `redirects` / `headers`.
- **Generic** — nginx `try_files`, Apache `Options +MultiViews`.

## Gotchas

These are the things that bit us. Don't repeat them.

1. **Cloudflare bot protection 403s the default `Wget/1.x` UA.** Always set a real browser UA + `Accept` / `Accept-Language` headers (recipe). If you see `403 Forbidden` after a burst of requests, that's it — back off, switch UA, retry.

2. **Cross-page link rewriting only works in a single wget invocation.** wget's `-k` only rewrites to local paths it sees in the current run. If a page was downloaded in a separate invocation (e.g. to recover from a 403 on one URL), its links to the rest stay absolute. Solution: redo the full scrape once you have the right UA. Don't piecemeal it. If you're scraping at scale (10K+ URLs) and can't fit in one run, scrape in batches and re-run `scripts/rewrite-paths.py` afterwards as the canonical pass — `-k`'s output is then redundant.

3. **Default publish directory by host.** Cloudflare Pages serves the repo root when no build command is configured. Netlify and Vercel also default to root. If you scraped into `output/`, either move files to the repo root (`git mv output/* .`) or configure the host to publish from `output/`. Symptom of the wrong setup on Pages: every URL 404s with R2-style headers (`access-control-allow-origin: *`, `cache-control: no-store`) instead of a Pages-branded 404.

4. **WordPress Offload Media plugins** route `/wp-content/uploads/` to R2 / S3 buckets. wget may successfully fetch an image even when later direct access 404s (intermittent or partial bucket sync). Trust your local copy — that's why we scrape and self-host.

5. **Sitemaps and the Yoast XSL aren't linked from HTML.** wget `-p` won't find them. Fetch explicitly in Phase 1.

6. **Filenames with `?ver=...` query strings.** wget keeps these as literal filenames; HTML uses `%3F` encoding. Standard servers (Pages, Netlify, Vercel, `python3 -m http.server`) URL-decode and serve correctly. Don't try to "clean these up" unless something actually breaks.

7. **`og:url`, canonical, JSON-LD stay absolute.** They identify the canonical resource and are correct as-is when redeploying to the same domain. Only rewrite if changing domains.

8. **`sed -i ''` is macOS / BSD only.** GNU sed needs `sed -i` (no empty-string argument). Recipes in `AGENTS.md` flag the macOS-isms; default to the Python scripts where there's a choice — they're portable.
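To see why gotcha 6 is harmless, it helps to look at what a server does with a `%3F` link. Because the `?` is percent-encoded, the whole string counts as the URL path (there is no query part), and decoding that path gives exactly the literal filename wget wrote. A tiny illustration, with a hypothetical helper name and a made-up stylesheet path:

```python
from urllib.parse import unquote, urlparse


def file_on_disk(href: str) -> str:
    """Decode an href the way a static server does before the file lookup."""
    return unquote(urlparse(href).path)


# Made-up reference in the form wget -k writes into a converted page.
href = "/wp-includes/css/dist/block-library/style.min.css%3Fver=6.4.2"

print(urlparse(href).query == "")  # the encoded '?' does not start a query
print(file_on_disk(href))
```

The decoded name still contains `?ver=6.4.2`, which matches the file on disk, so nothing needs renaming.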

## Output structure

```text
<repo-root>/
  index.html                 ← homepage
  <slug>/index.html          ← one per URL from sitemap
  wp-content/                ← assets (themes, uploads, plugins)
  wp-includes/               ← block library CSS, et al.
  avatars/                   ← self-hosted Gravatars (Phase 6)
  sitemap_index.xml
  page-sitemap.xml           ← + any other child sitemaps
  wp-content/plugins/wordpress-seo/css/main-sitemap.xsl
  robots.txt
  _redirects                 ← optional, host-specific
  _headers                   ← optional, host-specific
```

Push to a git host and connect to the static host with **no build command** and **no build output directory** — defaults work.
