---
name: feishu-doc-scraper
description: Extract Feishu (Lark) Docs, Wiki pages, Wiki collections/hubs, spreadsheets, and Minutes (妙记) transcripts into clean high-fidelity local Markdown. The primary path is the lark-cli API — programmatic extraction with no LLM rewriting of the body — which recursively follows a collection's reference graph (mention-doc / sheet / cross-tenant links) and uses error codes to resolve permission boundaries precisely; a browser-DOM path is the fallback only when lark-cli cannot reach the content. Use this whenever the source is a Feishu/Lark URL and fidelity matters — including 导出飞书文档/合集/妙记转写, 把飞书 wiki/知识库转 markdown, scraping or archiving a Feishu collection, exporting a Feishu Minutes/妙记 transcript, or saving a Feishu page locally — even if the user only says clipping, archiving, converting, or "save this". Also covers the permission-denied path (owner-exported .docx → faithful Markdown with heading/highlight restoration).
compatibility: Primary path needs the `lark-cli` binary (npm `@larksuite/cli`, verified 1.0.32, 2026-05) authenticated to the target tenant. Fallback path needs a browser automation surface with an authenticated session (Chrome DevTools MCP / Browser Use / Computer Use). docx path needs `python-docx` and a docx→md converter (the bundled doc-to-markdown skill or pandoc).
argument-hint: [feishu-url-or-output-path]
---

# Feishu Doc Scraper

Extract a Feishu/Lark source into faithful local Markdown. **Prefer the lark-cli API** — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a *fallback*, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.

## Scope (read this first)

This skill's contract is **faithful per-source Markdown + a record of what was extracted**. It does *not* decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.

## Choose the path

```
Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│        ├── YES → PATH A: lark-cli API extraction  (primary — start here)
│        │         └── hit code 131006 / 99991679 (permission denied)?
│        │              └── PATH B: owner-exported .docx → faithful Markdown
│        └── NO  → install/auth lark-cli first (it is worth it); only if
│                  truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B
```

A collection/hub is just a docx whose body references other docs — **Path A handles it by recursively following the reference graph**, not by visiting pages in a browser.

## Path A — lark-cli API extraction (primary)

Full command catalog, recursion engine, cross-tenant and personal-space nuances: **[references/lark-cli-api-extraction.md](references/lark-cli-api-extraction.md)**. The essentials for the common case:

**1. Disable the proxy for Feishu domestic domains.** Feishu's `*.feishu.cn` endpoints are direct-connect in mainland China; routing them through a local proxy leaks credentials through the proxy and gets DNS-hijacked. lark-cli itself warns about this. Always:

```bash
export LARK_CLI_NO_PROXY=1
```

This does not conflict with any "Claude/Anthropic domains must use the proxy" rule — Feishu is a different host and is direct.

**2. Classify the URL, then resolve to a fetchable doc token.**

- `…/wiki/<node_token>` — a wiki node token is **not** a doc token. Resolve it first:
  ```bash
  lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
  # → .data.node.obj_token  and  .data.node.obj_type  (e.g. "docx")
  ```
- `…/docx/<doc_token>` — already a doc token, fetch directly.
- `…/sheets/<token>` — spreadsheet, use the sheets commands (see reference).
- `…/minutes/<token>` — Minutes, go to **Path C**.

**3. Fetch the body as Markdown — programmatically, never via the model.**

```bash
lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# body is .data.markdown — extract with jq, do NOT retype or summarize it
jq -r '.data.markdown' /tmp/fetch.json > source.md
```

Keep stdout and stderr separate. A harmless `[deprecated] docs +fetch with v1 API is deprecated` goes to stderr; piping `2>/dev/null` *and* `jq` together produced a false `Exit code 5` in practice — redirect to files and inspect, don't blind-pipe. The body must reach disk without passing through the model (paraphrasing silently corrupts source text — this is the single most important fidelity rule).

**4. If it's a collection/hub, follow the reference graph (BFS).** The hub body contains `<mention-doc>`, `<sheet>`, `<image>` tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and **repeat on each newly fetched doc until no new references remain** (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):

```bash
python3 scripts/feishu_extract_refs.py source.md   # → JSON list of {type, token, title}
```

Recursion loop, dispatch table, and the cross-tenant/`my.feishu.cn` personal-space rules are in the reference.

**5. Final residual-tag check (acceptance gate for collections).** Every rich-media reference must have been resolved and rendered:

```bash
grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"
```

Must be empty before you stop.

## Path B — permission denied → owner-exported .docx

`lark-cli wiki spaces get_node` returning `code 131006 … node permission denied, user needs read permission` (or fetch returning it) is a **hard Feishu-side boundary**. lark-cli, anonymous curl, and the browser all fail it — this has been verified exhaustively; do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as `.docx` and send it back out-of-band, then convert with fidelity (font-size→heading and `w:shd`→highlight restoration, then visual verification). Full procedure: **[references/docx-export-to-markdown.md](references/docx-export-to-markdown.md)**.

## Path C — Feishu Minutes (妙记) transcript

`lark-cli minutes` only returns metadata and can download audio/video — it **cannot** export the text transcript. The transcript comes from a native endpoint called through `lark-cli api`, and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR — never do the latter. Endpoint, scope name, the device-flow timeout trap, and per-minute (not per-tenant) permission behavior: **[references/feishu-minutes-transcript.md](references/feishu-minutes-transcript.md)**.

## Path D — browser DOM fallback (last resort)

Only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is *publicly* reachable — it cannot read login-walled content. Workflow: **[references/browser-dom-fallback.md](references/browser-dom-fallback.md)**. Battle-tested DOM rules (virtual scroll, `data-block-id` ordering, table/bullet extraction, image streams): **[references/browser-failure-rules.md](references/browser-failure-rules.md)**.

## Hard rules

These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.

- **Never let the document body pass through the model.** Extract with `jq`/`cat`/scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
- **`export LARK_CLI_NO_PROXY=1` for `*.feishu.cn`.** Otherwise credentials transit a local proxy and DNS is hijacked.
- **Transcripts come from the platform's native transcription, never re-ASR.** Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
- **A generated docx Markdown is not done until it has been *visually* verified** against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
- **Do not 死磕 (grind) on docx embedded-image download.** lark-cli (through 1.0.32) cannot download `<image>` tokens from a docx — exhaustively verified. Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
- **HTTP 200 from anonymous curl ≠ accessible.** A Feishu login wall returns 200 with a body containing `accounts.feishu.cn` / `login` / `passport` / an empty `<title>`. Check the body, never infer "public" from the status code.
- **A file "not found" by a search agent is not authoritative.** Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
- **U+FFFD final check on every produced file:** `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` must be empty. A replacement character means an encoding step corrupted the text.

## Acceptance contract

Stop only when all that apply are true:

- Every fetched body reached disk via `jq`/script, not retyped by the model.
- Collections: the residual rich-media-tag grep (Path A step 5) is empty — every `mention-doc`/`sheet`/cross-tenant reference was followed to a leaf.
- `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` is empty.
- docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).
- Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
- Each output file's frontmatter records `source` (the original URL/token) and, if any post-processing was applied, a `post_process` provenance line.
- Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user — a transparent gap beats a silent omission.

## Do NOT attempt

Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: **[references/permission-and-failure-boundaries.md](references/permission-and-failure-boundaries.md)**. The top ones:

- Bypassing `131006` permission-denied by any means (lark-cli / curl / anonymous browser) — it is a server-side boundary.
- Downloading docx embedded images via `docs +media-download`, `api …/drive/v1/medias/<t>/download` (with or without `extra`), or `schema drive.medias.download` — none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
- `WebFetch` against `open.feishu.cn/document/server-docs/...` for API specs — backend is flaky; use `open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt` instead (LLM-friendly, stable).
- AppleScript/JXA `executeJavaScript`, Chrome CDP on port 9222 — disabled/empty in this environment (browser path only).
- Using `minimax-docx` to convert docx→md — it is a docx *authoring* tool; use the doc-to-markdown skill instead.

## Bundled resources

- `scripts/feishu_extract_refs.py` — deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate `<mention-doc>`/`<sheet>`/`<image>`/cross-tenant/Minutes/Tencent-Meeting references as JSON.
- `scripts/restore_docx_headings.py` — for Path B: reads true font sizes via python-docx, maps them to heading levels, restores `w:shd` highlights to Obsidian `==…==`, without retyping body text.
- `scripts/feishu_dom_capture.js` — Path D: injectable end-to-end browser DOM capture.
- `scripts/download_feishu_images.py` — Path D: SSR image extraction when browser automation is unavailable.
- `scripts/build_feishu_markdown.py` — Path D: render a capture manifest into Markdown.
- `scripts/check_heading_coverage.py` — coverage verification (both paths).
- `references/lark-cli-api-extraction.md` — Path A full reference (commands, recursion, sheets, cross-tenant).
- `references/feishu-minutes-transcript.md` — Path C native transcript API + scope auth.
- `references/permission-and-failure-boundaries.md` — error codes + the full Do-NOT-attempt table.
- `references/docx-export-to-markdown.md` — Path B faithful conversion procedure.
- `references/browser-dom-fallback.md` + `references/browser-failure-rules.md` — Path D.
- `references/capture-manifest.md` — manifest shape for `build_feishu_markdown.py`.

## Next step

After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:

```
Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).

Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable
```
