---
name: find-articles
description: Discover new EU referendum articles via Fréttasafn, classify HIGH/MEDIUM/LOW, save to inbox, auto-analyse HIGH priority. Use for manual ad-hoc discovery, weekend catch-up, or working a specific date range or backlog. The routine daily flow is handled by the overnight-batch-weekdays scheduled task; this skill is for human-driven exploration.
---

# Find Articles

Discover new EU referendum articles, filter out already-processed and irrelevant ones, and **automatically analyse HIGH priority articles**. Also surfaces backlog from previous sessions.

> **Note:** The `overnight-batch-weekdays` Cowork routine runs discovery + classification + auto-queue + processing autonomously every weekday night, with results landing in `~/esbvaktin-site/` ready for morning git-diff review and push. This skill is for **manual ad-hoc use**: when something specific comes up during the day, when scanning a particular topic, or for weekend catch-up.

## Usage

```
/find-articles              # Scan last 7 days + check backlog
/find-articles 3            # Scan last 3 days + check backlog
/find-articles 2026-03-01   # Scan from specific date
/find-articles backlog      # Skip scanning, just work the backlog
```

## Autonomy Model

- **HIGH priority** articles (verified by full-text read) → auto-queue and auto-analyse. No user confirmation needed.
- **MEDIUM priority** articles → present to user, wait for selection.
- **LOW priority** → note in summary, leave as pending.
- **Backlog** → always check for pending HIGH priority articles from previous sessions before scanning.

The user can override by saying "stop after scanning" or "don't analyse yet".

## Steps

### Step 0: Check Backlog

Before scanning for new articles, check the existing inbox for unprocessed HIGH priority articles:

```bash
uv run python scripts/manage_inbox.py next --high-only --limit 10
```

Note any backlog articles.
These will be included in the analysis batch alongside newly discovered HIGH articles.

If the argument is `backlog`, **skip Steps 1–7** and go directly to Step 8 with the backlog articles.

### Step 1: Refresh Article Registry (transitional safety net)

```bash
uv run python scripts/build_article_registry.py --status
```

This merges `data/analyses/`, site reports, and DB sightings into `data/article_registry.json`. The registry is **no longer the primary dedup source**: frettasafn's `consumer_state` table is, and `scan_eu` anti-joins against it at the SQL level (Step 3). The registry is now a transitional safety net: `check_duplicate.py` checks both the registry and `consumer_state`, and `register_article_sightings.py` writes through to `consumer_state`. The registry rebuild here keeps both views aligned during the verification period before Phase 4 retires it.

Skip this step if you're scanning a small focused window; `consumer_state` alone is enough for the SQL anti-join.

### Step 2: Show Inbox Status

```bash
uv run python scripts/manage_inbox.py status
```

Also load rejected URLs for URL-based filtering. **Use the Read tool** to read `data/rejected_urls.txt` (do NOT use shell commands like `wc`, `grep`, or input redirection; these trigger permission prompts). Parse the lines in your response: non-empty lines that don't start with `#` are rejected URLs.

### Step 3: Scan for EU Articles

Use the Fréttasafn MCP `scan_eu` tool with `consumer_id="esbvaktin"`. With that ID set, frettasafn anti-joins against its `consumer_state` table at the SQL level, so already-processed and rejected articles are filtered out before they reach Python:

```
scan_eu(date_from=, date_to=, consumer_id="esbvaktin", limit=50)
```

Where `start_date` is determined by the argument (default: 7 days ago). The `exclude_states` parameter defaults to `["processed", "rejected"]` when `consumer_id` is set; pass it explicitly only to override (e.g. `["processed", "rejected", "skipped"]` to also skip deferred articles).
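As an illustration of how the argument forms listed under Usage could map onto a scan window, here is a minimal Python sketch. It is not code from the repository: the function name `resolve_scan_window` and its return shape are hypothetical, and it assumes the window always ends on the current day.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_scan_window(arg: Optional[str], today: date) -> Optional[Tuple[date, date]]:
    """Map a /find-articles argument to (start, end); None means backlog-only.

    Hypothetical helper for illustration. Assumes the window ends on `today`.
    """
    if arg is None or not arg.strip():
        # No argument: default to the last 7 days.
        return today - timedelta(days=7), today
    arg = arg.strip()
    if arg == "backlog":
        # Skip scanning entirely and work the backlog.
        return None
    if arg.isdigit():
        # A bare number is a look-back in days.
        return today - timedelta(days=int(arg)), today
    # Otherwise expect an ISO date such as 2026-03-01.
    return date.fromisoformat(arg), today
```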
If the scan returns many results, run a second pass with a narrower date range or a higher limit to ensure coverage.

### Step 4: Filter and Classify

For each article returned by `scan_eu` (already filtered server-side via the `consumer_state` SQL anti-join, so rows with `state ∈ {processed, rejected}` are excluded before they reach you):

1. **Backstop URL check**: the SQL anti-join already excludes URLs whose article_id is marked `rejected`, but check each URL against the rejected URLs loaded in Step 2 anyway, as a safety net for any rejection that hasn't propagated to `consumer_state` yet. In normal operation this should be a no-op.
2. **Title-based false positive filter**: skip articles whose titles clearly have no EU/referendum content. Common false positive patterns from `scan_eu`:
   - Crime/accident reports (kynferðisbrot, slys, lögregla, eld)
   - Sports (ÓL, keppni, leikur)
   - Celebrity/entertainment news
   - Weather
   - Real estate listings

   The `scan_eu` tool over-matches because "ESB" and "Evrópusambandið" appear in sidebars and tag clouds. **Do not auto-reject based on title alone**; only filter obvious non-EU content.
3. **Classify remaining** into:
   - **HIGH**: Opinion pieces, interviews, analyses with factual claims about EU/referendum topics (fisheries, sovereignty, agriculture, trade, EEA, polling)
   - **MEDIUM**: News reports about the referendum process, parliamentary proceedings, party positions
   - **LOW**: Short items (<200 words), personnel news, tangential mentions, meta-commentary without verifiable claims

### Step 5: Fetch and Verify Candidates

For articles classified HIGH or MEDIUM, fetch the full text using `get_article(article_id)` from Fréttasafn MCP.

After reading, re-evaluate:

- Is this genuinely about the EU referendum? (Some articles mention ESB once in passing)
- Does it contain verifiable factual claims?
  (Not just process/meta commentary)
- Upgrade or downgrade classification based on actual content

### Step 6: Save to Inbox

Before presenting results, persist all discovered articles to the inbox.

1. **Use the Write tool** (not Bash heredoc/cat) to write the classified articles as a JSON array to `data/inbox/_scan_YYYYMMDD.json`. Fields: `url`, `title`, `source`, `date`, `word_count`, `article_type`, `topics` (array), `priority`, `frettasafn_id`, `notes`.
2. Import the batch:
   ```bash
   uv run python scripts/manage_inbox.py add-batch data/inbox/_scan_YYYYMMDD.json
   ```
3. For HIGH and MEDIUM articles where full text was fetched, **use the Write tool** to save the text to `data/inbox/texts/.md`, then mark it:
   ```bash
   uv run python scripts/manage_inbox.py set-status pending
   ```
   (The `has_text` flag is set by `save-text`, but since we wrote the file directly, update the inbox entry manually if needed.)

**Never use shell heredocs, `cat >`, or `echo >` to write data files**; these trigger permission prompts.

### Step 7: Present Summary

Display a brief summary (not a full table for HIGH, since those are about to be analysed):

```markdown
## Scan Results - [date range]

Registry: X processed | Inbox: Y pending (Z high) | W rejected

**Proceeding to analyse N HIGH priority articles** (M new + K backlog):
1. [source] Title (date) - key topics
2. ...

### MEDIUM (awaiting your selection)
| # | ID | Title | Source | Date | Topics |
|---|-----|-------|--------|------|--------|

### LOW (skipped)
- N articles noted, left as pending
```

If there are MEDIUM articles, note them but do not wait; proceed with HIGH articles first. The user can queue MEDIUM articles while HIGH analyses run.

### Step 8: Auto-Analyse HIGH Priority Articles

**Do not stop and wait.** For each HIGH priority article (both newly discovered and backlog), automatically:

1. Queue it: `uv run python scripts/manage_inbox.py queue `
2. Launch `/analyse-article ` with the article URL

**Ordering:** Analyse articles from the most recent date first, since current news is more time-sensitive than backlog. Within the same date, prefer articles with cached text (faster start).

**Parallelisation:** Run analyses **sequentially** (each analysis spawns parallel subagents internally). This avoids context window pressure and makes error recovery simpler.

After each analysis completes:

- Note the verdict summary
- Continue to the next article

After all HIGH articles are analysed, present a batch summary:

```markdown
## Analysis Batch Complete

| # | Title | Verdicts | Completeness |
|---|-------|----------|-------------|
| 1 | ... | 3 supported, 1 partial | 78% |
| 2 | ... | 2 supported, 2 misleading | 65% |

Remaining: N MEDIUM articles pending your selection.
```

### Step 9: Handle MEDIUM Articles (if user responds)

When the user picks MEDIUM articles to analyse:

1. Queue them: `uv run python scripts/manage_inbox.py queue [ ...]`
2. For each selected article, launch `/analyse-article `

When the user marks articles as irrelevant/skip:

1. Reject them: `uv run python scripts/manage_inbox.py reject [ ...]`

   (This sets inbox status to "rejected", appends URLs to `rejected_urls.txt`, AND writes `state="rejected"` to frettasafn's `consumer_state`, so future `scan_eu` calls anti-join them out at the SQL level. No further action needed.)

When articles are neither picked nor rejected, they remain `pending` in the inbox for the next session.

## Notes

- **Fréttasafn false positives**: The `scan_eu` tool searches keyword groups (EU general, referendum, accession, EEA/Schengen, fisheries, agriculture). It returns many articles where these keywords only appear in sidebars, tag clouds, or "related articles" sections. Always fetch full text before final classification.
- **Parallel fetching**: Fetch multiple articles in parallel using concurrent MCP calls to speed up Step 5.
- **Date range**: The 7-day default balances thoroughness with speed. For catch-up after a break, use a longer range.
- **Dedup authority**: `consumer_state` is the increasingly authoritative source after Phase 3: `scan_eu` anti-joins against it server-side, `register_article_sightings.py` writes through to it, and `check_duplicate.py` checks it. The registry is a transitional safety net during the verification period. Don't grep `data/analyses/` directly; call `consumer_state_summary` or use `check_duplicate.py`, which checks both paths.
- **Backlog priority**: Articles older than 7 days are flagged as backlog in `manage_inbox.py next` output. They are still analysed; age does not reduce importance, only time-sensitivity.
- **Session efficiency**: A typical session should be: run `/find-articles` → Claude scans, finds 2–4 HIGH articles, analyses them all, presents results + MEDIUM list. User reviews MEDIUM, optionally queues some. Minimal keyboard time.
- **Avoid permission prompts**: Use dedicated tools (Read, Write, Grep, Glob) instead of shell commands for all file I/O. Never use input redirection (`<`), heredocs (`<< EOF`), `cat >`, `echo >`, `wc`, or `grep` on data files; these trigger security prompts that require user interaction. Only use Bash for `uv run python` commands.
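
The Step 8 ordering rule (most recent date first, cached text first within the same date) can be sketched in Python. This is an illustrative sketch only, not code from the repository: the function name `order_for_analysis` is hypothetical, and it assumes each inbox entry is a dict with an ISO `YYYY-MM-DD` `date` string and a boolean `has_text` flag.

```python
from typing import Any, Dict, List


def order_for_analysis(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return articles newest-first, preferring cached text within a date.

    Hypothetical helper for illustration. Assumes ISO `date` strings, which
    sort chronologically when compared as text.
    """
    return sorted(
        articles,
        # Descending on (date, has_text): later dates come first, and within
        # one date True (cached text) sorts ahead of False.
        key=lambda a: (a["date"], bool(a.get("has_text", False))),
        reverse=True,
    )


batch = [
    {"id": "a", "date": "2026-03-01", "has_text": True},
    {"id": "b", "date": "2026-03-03", "has_text": False},
    {"id": "c", "date": "2026-03-03", "has_text": True},
]
ordered_ids = [a["id"] for a in order_for_analysis(batch)]
```

Sorting on a single descending tuple key keeps the rule in one place; backlog articles need no special case because their older dates naturally place them after current news.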