---
name: scrape-web-page
description: General-purpose web page scraper via the user's localhost (preserves Israeli IP for geo-restricted sites). Handles any page that is NOT a clean article (SPAs, branch locators, product catalogs, search result pages, government portals, dashboards, API endpoints discovered via devtools, or anything requiring raw HTML, rendered DOM, network/XHR responses, or structured extraction). Supports scripted user interaction (click "load more", scroll, type into search, iterate a city dropdown) for sites that hide data behind UI actions; a naive "scrape this URL" will return nothing on those. Always tries to identify and call an unauthenticated backend JSON API first; UI automation is the fallback, not the default. Canonical user invocation provides a triple: (URL, interactive target to click/type/scroll, data to extract); the skill loads the URL, activates the target in a loop until it stops producing new content, then extracts the requested data. Trigger phrases: "scrape this page", "dump the HTML", "find the API behind this page", "capture this SPA", "get all the X from this page", "extract the JSON", "scrape branch list / store list / catalog", "load URL, click this, then extract that", "I had to click to load everything", any non-article URL. For clean article extraction use `scrape-article` instead.
---

# scrape-web-page

Flexible scraper for arbitrary web pages. Unlike `scrape-article`, it does not assume a readability-friendly article body exists. Supports four output modes: raw HTML, rendered DOM HTML, extracted structured data (JSON/CSV), and network-response capture (finding the backing API of an SPA).

## Hard constraint

Requests must originate from this machine. Do **not** route through a hosted reader (Jina, Firecrawl SaaS, ScrapingBee, etc.). The whole reason this plugin exists is to use the user's Israeli IP.
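
One way to make the constraint above mechanical is a small guard that runs before any fetch. The sketch below is illustrative only: the function name `assert_direct_request` and the hostname blocklist are assumptions introduced here, not part of this skill, and the list is not exhaustive.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of hosted reader / scraping-proxy hostnames.
# Illustrative only; verify and extend it for the services you care about.
HOSTED_READER_HOSTS = {"r.jina.ai", "api.firecrawl.dev", "app.scrapingbee.com"}


def assert_direct_request(url: str) -> str:
    """Return url unchanged, or raise if it points at a hosted reader service."""
    host = (urlsplit(url).hostname or "").lower()
    for blocked in HOSTED_READER_HOSTS:
        # Match the blocked host itself and any of its subdomains.
        if host == blocked or host.endswith("." + blocked):
            raise ValueError(
                f"{host} is a hosted reader; fetch the target directly from this machine"
            )
    return url
```

Wrapping every outgoing URL in `assert_direct_request(...)` turns an accidental proxy hop into an immediate error rather than a silently wrong egress IP.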
## API-first, UI as fallback

Before reaching for any browser automation, try to identify and call the site's **backend API directly**. Most modern SPAs are a thin shell around a JSON API, and that API is usually reachable unauthenticated from the same origin. Once you know the endpoint and any query params, a plain `Fetcher.get` (rung 1) replaces everything else.

Workflow:

1. Load the page once in `network` mode (Playwright with a response listener) and list every JSON endpoint it hits. Also watch for interactions (search, filter, paginate) that trigger additional endpoints: run them once and capture the URL templates.
2. For each candidate endpoint, try calling it directly with `curl` or `Fetcher.get`. Confirm:
   - It returns the real data unauthenticated (no session cookie / CSRF token required, or the required headers are static and copyable).
   - It supports pagination / filtering via query params you can enumerate.
   - It isn't rate-limited into uselessness.
3. If yes → **record the endpoint in `sites.yaml` under `api_endpoints:` and use it going forward.** Skip browser automation entirely on future runs. This is dramatically faster, more reliable, and handles pagination without clicking.
4. Only if the API is authenticated, signed, rate-limited, or non-existent → fall back to UI automation (`interactive` mode). Report both attempts in the output: "Tried API at `<endpoint>` → `<result>`. Fell back to UI interaction because `<reason>`." This way the user can see whether the API route is worth pushing on (e.g. via a captured auth token) before accepting the slower UI path.

Do not skip the API check because "this SPA probably needs Playwright." Check first; the check costs one page load.

## Pick a mode before you start

Ask (or infer from the goal):

- **`raw`**: plain HTML as served. Use for server-rendered sites, or to inspect the initial payload of an SPA before JS runs.
- **`rendered`**: HTML after JS execution.
  Use for SPAs where content is injected client-side on load with no user interaction required.
- **`interactive`**: rendered + scripted user interactions (click "load more" buttons, scroll, type into a search box, select a dropdown, paginate). **Many SPAs hide data behind a "load all" or per-city filter**, e.g. the Israel Post branch locator requires clicking through cities or a "show all branches" button before the full list appears. A naive one-shot scrape of the landing URL will return an empty or partial dataset. If the user says anything like "I had to click to load all the branches", the mode is `interactive`.
- **`extract`**: structured data pulled from the DOM or embedded JSON blobs. User specifies what to extract (e.g. "all branch ids and names", "product name + price", "table rows"). Often pairs with `interactive` or `network`.
- **`network`**: capture XHR / fetch responses while loading (and optionally interacting with) the page. Use when the goal is to discover the backing API of an SPA; this is what you want for branch locators, store finders, autocomplete endpoints. Often the right first move for SPAs: catch the API once, then skip the browser on future runs.

Decision order for an unknown SPA:

1. Try `network` first. If the page fires a clean JSON endpoint, use that directly from here on.
2. If the endpoint requires interaction to fire (e.g. only triggers on city search), fall through to `interactive` + `network` combined: script the interactions while capturing responses.
3. Only resort to DOM scraping (`extract` off a `rendered` / `interactive` capture) if no JSON endpoint is exposed.

Do not default to `rendered`; it is the slowest. Start at `raw` unless the user's goal obviously needs JS.

## `interactive` mode: scripted user actions

Use when the data only appears after the user does something.

### Preferred invocation: user supplies the triple

The cleanest way to drive this mode is for the user to provide three things:

1. **URL**: the page to load.
2. **Interactive target**: the thing to click / type / scroll to reveal the data. A CSS selector is ideal; plain text ("the button labelled 'טען עוד'" [Hebrew for "load more"], "the city dropdown", "scroll to bottom") is fine, as long as you resolve it to a selector on the loaded page.
3. **Data to extract**: what the user actually wants out ("all branch ids and their Hebrew names", "every row in the results table", "the price and product name of each card").

Given that triple, the recipe is fixed:

```
load URL
  → repeat: activate the interactive target, wait for new content
      (stop when the target disappears, is disabled, or row count stops growing)
  → once settled, extract the requested data from the final DOM
  → save as JSON (with a sidecar raw HTML dump in case extraction needs tweaking)
```

Resolve the target by kind:

- **Button / link** (pagination, "load more", "show all"): click in a loop until it's gone / disabled / row count plateaus.
- **Dropdown / select**: enumerate `