---
name: openplanter
description: "Investigate datasets and resolve entities using OpenPlanter methodology — cross-reference heterogeneous sources, build Admiralty/ACH confidence-tiered evidence chains, and apply OSINT tradecraft. Triggers on entity resolution, cross-reference datasets, evidence chain, OSINT investigation, structured investigation."
argument-hint: "[workspace-path or investigation-query]"
---

# OpenPlanter — Investigation Methodology Skill

Epistemic framework for cross-dataset investigation, entity resolution, and evidence-backed analysis. Extracted from the [OpenPlanter](https://github.com/ShinMegamiBoson/OpenPlanter) recursive investigation agent and enriched with professional OSINT tradecraft (Admiralty System, ACH, FollowTheMoney schema, intelligence cycle methodology).

Claude Code already has the tools. This skill provides the **methodology**.

## When to Use

- Cross-referencing heterogeneous datasets (corporate registries, campaign finance, lobbying, property records, contracts)
- Entity resolution across datasets with inconsistent naming
- Building evidence chains with provenance and confidence tiers
- Structured OSINT investigations requiring epistemic discipline
- Any analysis where claims need to trace to cited source records

## Quick Start

```bash
# 1. Initialize workspace
python3 ~/.claude/skills/openplanter/scripts/init_workspace.py /path/to/investigation

# 2. Drop datasets into datasets/
cp campaign_finance.csv lobbying.json corporate_registry.csv /path/to/investigation/datasets/

# 3. Write an investigation plan
# → plans/plan.md (see references/output-templates.md for plan template)

# 4. Resolve entities across datasets
python3 ~/.claude/skills/openplanter/scripts/entity_resolver.py /path/to/investigation

# 5. Cross-reference linked entities
python3 ~/.claude/skills/openplanter/scripts/cross_reference.py /path/to/investigation

# 6. Validate evidence chains
python3 ~/.claude/skills/openplanter/scripts/evidence_chain.py /path/to/investigation

# 7. Score confidence
python3 ~/.claude/skills/openplanter/scripts/confidence_scorer.py /path/to/investigation
```

## Investigation Methodology

### Epistemic Discipline

Assume nothing about the environment until confirmed firsthand. These principles prevent the most common investigation failures:

1. **Ground truth comes from files, not memory.** Read actual data before modifying it, and read actual error messages before diagnosing. Model memory of data structure is unreliable—reading the file takes seconds, recovering from a wrong assumption takes minutes.
2. **Empty output is ambiguous.** If a command returns empty, cross-check with `ls -la` and `wc -c` before concluding a file is actually empty, because output capture mechanisms can silently lose data.
3. **Success does not mean correctness.** A command that "succeeds" may have done nothing. Check actual outcomes, not just exit codes. After downloading, verify with `ls` and `wc -c`. After extraction, verify expected files exist.
4. **Verify round-trip correctness.** After any data transformation (parsing, linking, aggregation), check the result from the consumer's perspective—load the output, spot-check records, verify row counts. Transformations that silently drop records are the most common source of wrong conclusions.
5. **Three failures = wrong approach.** If a command fails 3 times, change strategy entirely. Repeating an identical command expecting different results wastes context window.
6. **Produce artifacts early.** Write a working first draft of findings as soon as the requirements are clear, then iterate. An imperfect deliverable beats a perfect analysis with no output. If 3+ steps have passed without writing any output, stop and write—even if incomplete.

For the full epistemic framework (including data ingestion rules and hard rules from OpenPlanter's prompts.py), see `references/investigation-methodology.md`.

### Entity Resolution Protocol

**Pipeline: Normalize → Block → Compare → Score → Cluster → Review**

Adapted from OpenPlanter's prompts.py, enriched with Middesk, OpenSanctions, and ICIJ patterns.

1. **Normalize** — Apply canonical key transformation: Unicode NFKD + diacritic stripping, case folding, legal suffix canonicalization (LLC/Inc/Corp/Ltd → canonical forms), punctuation removal, ampersand normalization (`&` → `and`). See `references/entity-resolution-patterns.md` for complete normalization tables and suffix maps.

2. **Block** — Reduce O(N^2) comparisons using blocking keys: first 3 characters of normalized name + state/jurisdiction, phonetic key (Soundex/Double Metaphone), token overlap via inverted index.

3. **Compare** — Pairwise similarity with cascading checks: `real_quick_ratio()` → `quick_ratio()` → `ratio()`. Use `autojunk=False` for entity names because strings under 200 characters produce false negatives with junk heuristics. Include token set comparison for word-order variants ("Apple Inc" vs "Inc Apple").

4. **Score** — Multi-signal weighted model:
   - Hard signals: TIN/EIN exact match (1.0), identical phone E.164 (0.8), identical email (0.8)
   - Soft signals: name similarity (0.5), address fuzzy (0.2), state match (0.1)
   - Hard disqualifiers: TIN mismatch when both present (-0.5), country mismatch (-0.5)

5. **Cluster** — Group via transitive closure using Union-Find. Gate closure so all pairwise scores exceed threshold—this prevents chain errors where A≈B and B≈C but A≉C. Exclude registered agent addresses from triggering transitive closure alone, because thousands of entities share the same registered agent.

6. **Review** — Flag by confidence band:
   - Score >= 0.85: auto-match (confirmed)
   - Score 0.70-0.84: queue for review (probable)
   - Score 0.55-0.69: include in wide net (possible)
   - Score < 0.55: discard

### Evidence Chain Construction

Every claim traces to a specific record in a specific dataset. This is what separates investigation from speculation.

```
Claim
  └── Evidence Item
        ├── type: document | record | image | testimony
        ├── source_ref → Source
        │     ├── url / file path
        │     ├── collection_timestamp
        │     ├── source_type: primary | secondary | tertiary | official | unofficial
        │     └── reliability_grade: A–F (Admiralty)
        ├── credibility_grade: 1–6 (Admiralty)
        ├── corroboration_status: single | corroborated | contradicted | unresolvable
        └── match_details (if cross-reference)
              ├── fields_matched: [name, address, ein]
              ├── match_type: exact | fuzzy | address-based
              └── link_strength: weakest criterion in the chain
```

**Key principles:**
- Distinguish direct evidence (A appears in record X), circumstantial evidence (A's address matches B's address), and absence of evidence (no disclosure found)
- Document every hop in a multi-step chain with source record, linking field, and match quality
- Link strength = weakest criterion in the chain (a chain is only as strong as its weakest link)
- Track source lineage to detect circular reporting—Source B citing Source A is not independent corroboration

### Confidence Tiers

Based on the Admiralty System (NATO AJP-2.1), adapted for dataset investigation:

| Tier | Criteria | Required Evidence | Admiralty Equivalent |
|------|----------|-------------------|---------------------|
| **Confirmed** | 2+ independent sources with different collection paths; hard signal match (EIN, phone); or official record with verifiable provenance | Independent corroboration required | A1–B1 |
| **Probable** | Strong single source (official record); high fuzzy match (>0.85) on name + address + state; consistent with known patterns | Single strong source acceptable | B2–C2 |
| **Possible** | Circumstantial evidence only; moderate fuzzy match (0.55-0.84); consistent but not yet corroborated; requires additional investigation | Hypothesis supported but not confirmed | C3–D3 |
| **Unresolved** | Contradictory evidence; insufficient data; single weak source; or unable to verify | Cannot determine with available evidence | D4–F6 |

**Sherman Kent probability mapping:**
- Confirmed ≈ "almost certain" (93-99%)
- Probable ≈ "likely" (75-85%)
- Possible ≈ "chances about even" (45-55%)
- Unresolved ≈ insufficient basis for estimate

### Verification Principle

**Implementation and verification must be uncorrelated.** An agent that performs an analysis introduces systematic bias when self-verifying—it "knows" what the answer should be and unconsciously confirms it. Use the implement-then-verify pattern:

```
Step 1: Perform entity resolution and cross-referencing (analysis agent)
Step 2: Read the result files
Step 3: Independent verification (separate agent or separate pass with no shared context from step 1):
        - Load output files fresh
        - Spot-check N random records against raw source data
        - Verify row counts match expectations
        - Run validation script
        - Report raw output only
```

The verification executor has no context from the analysis executor. It runs commands and reports output, making its evidence independent.

**Anti-bias checks:**
- **Confirmation bias**: Score hypotheses by inconsistency count (ACH), not confirmation count, because disconfirming evidence is more diagnostic
- **Anchoring**: Do not rank hypotheses until evidence collection is complete
- **Circular reporting**: Track source lineage; verify independence of collection paths before counting corroborations
- **Satisficing**: Require minimum 3 competing hypotheses before scoring—this prevents premature commitment to the first plausible explanation

### Analysis Output Standards

Include in all investigation deliverables:

1. **Methodology section**: Sources used, entity resolution approach, linking logic, known limitations
2. **Confidence breakdown**: Count of findings per tier (confirmed/probable/possible/unresolved)
3. **Evidence appendix**: Every hop, every source record cited, every match score
4. **Structured output**: JSON for machine-readable (`findings/`), Markdown for human-readable (`findings/`)
5. **Provenance**: For each dataset—source URL/path, access timestamp, transformations applied

See `references/output-templates.md` for ready-to-use templates (investigation plans, summaries, evidence chains).

## Integration Modes

The skill operates in three modes based on investigation complexity:

| Mode | When | Scripts |
|------|------|---------|
| **Methodology Only** | Simple tasks, 1-2 datasets, local analysis | `entity_resolver.py`, `cross_reference.py`, `evidence_chain.py`, `confidence_scorer.py` |
| **Web-Enriched** | Need external data, public records, entity enrichment | Above + `dataset_fetcher.py`, `web_enrich.py`, `scrape_records.py`, + 6 specialized fetchers |
| **Full RLM Delegation** | Complex multi-step investigations, 3+ datasets, 20+ reasoning steps | `delegate_to_rlm.py` → full OpenPlanter agent (provider-agnostic, session-resumable) |

**One-command pipeline** for any mode: `investigate.py /path/to/workspace --phases all`

### RLM Delegation — Provider-Agnostic

The RLM agent auto-detects the LLM provider from the model name. Works with any provider the agent supports:

```bash
# Anthropic (default)
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model claude-sonnet-4-5-20250929

# OpenAI
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model gpt-4o

# OpenRouter (any model via slash routing)
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model anthropic/claude-sonnet-4-5

# Ollama (local inference, air-gapped)
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --provider ollama --model llama3

# Cerebras (model name doesn't contain "cerebras", so specify --provider)
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model qwen-3-235b-a22b-instruct-2507 --provider cerebras

# Resume a previous investigation session
python3 scripts/delegate_to_rlm.py --resume abc123 --workspace DIR

# List saved sessions
python3 scripts/delegate_to_rlm.py --list-sessions --workspace DIR

# Control reasoning depth
python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --reasoning-effort high

# List available models for a provider
python3 scripts/delegate_to_rlm.py --list-models --provider ollama
```

Provider auto-detection: `claude-*` → anthropic, `gpt-*/o1-*/o3-*` → openai, `org/model` → openrouter, `*cerebras*` → cerebras, `llama*/qwen*/mistral*/gemma*` → ollama. For models without a recognizable prefix, pass `--provider` explicitly.

API keys pass through environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `CEREBRAS_API_KEY` (or `OPENPLANTER_`-prefixed variants). Ollama requires no API key. Set `OPENPLANTER_REPO` to override local clone discovery.

## Scripts Reference

All scripts use Python stdlib only. Zero external dependencies. External tools (Exa, Firecrawl, OpenPlanter agent) are invoked as subprocesses.

### Core Analysis

| Script | Purpose | Example |
|--------|---------|---------|
| `init_workspace.py` | Create investigation workspace structure | `python3 scripts/init_workspace.py /tmp/investigation` |
| `entity_resolver.py` | Fuzzy entity matching + canonical map | `python3 scripts/entity_resolver.py /tmp/investigation --threshold 0.85` |
| `cross_reference.py` | Cross-dataset record linking | `python3 scripts/cross_reference.py /tmp/investigation` |
| `evidence_chain.py` | Validate evidence chain structure | `python3 scripts/evidence_chain.py /tmp/investigation` |
| `confidence_scorer.py` | Score findings by confidence tier | `python3 scripts/confidence_scorer.py /tmp/investigation` |

### Data Collection & Enrichment

| Script | Purpose | Example |
|--------|---------|---------|
| `dataset_fetcher.py` | Download bulk public datasets (SEC, FEC, OFAC, LDA, OpenSanctions) | `python3 scripts/dataset_fetcher.py /tmp/investigation --sources sec,fec` |
| `web_enrich.py` | Enrich entities via Exa neural search | `python3 scripts/web_enrich.py /tmp/investigation --categories company,news` |
| `scrape_records.py` | Fetch entity records from government APIs | `python3 scripts/scrape_records.py /tmp/investigation --entities "Acme Corp" --sources sec,fec` |

### Specialized Data Fetchers

Individual scripts for targeted government and public data sources. All use Python stdlib only, produce JSON + provenance sidecar, and support `--dry-run` and `--list`.

| Script | Data Source | Auth | Key Linking Fields |
|--------|-------------|------|--------------------|
| `fetch_census.py` | US Census Bureau ACS 5-Year | Optional `CENSUS_API_KEY` | Geography (state, county, ZIP) |
| `fetch_epa.py` | EPA ECHO Facility Search | None | `registry_id`, lat/lon, SIC/NAICS |
| `fetch_icij.py` | ICIJ Offshore Leaks Database | None | `icij_id`, entity name, jurisdiction |
| `fetch_osha.py` | OSHA DOL Enforcement | None | `activity_nr`, `estab_name`, SIC |
| `fetch_propublica990.py` | ProPublica Nonprofit Explorer (IRS 990) | None | `ein`, org name, NTEE code |
| `fetch_sam.py` | SAM.gov Entity Registration | `SAM_GOV_API_KEY` | UEI, CAGE code, NAICS |

**Usage pattern** (all fetchers follow the same interface):
```bash
python3 scripts/fetch_sam.py /tmp/investigation --query "Raytheon" --state CT
python3 scripts/fetch_epa.py /tmp/investigation --state TX --query "Refinery"
python3 scripts/fetch_icij.py /tmp/investigation --entity "Mossack" --type intermediary
python3 scripts/fetch_propublica990.py /tmp/investigation --ein 237327340
python3 scripts/fetch_census.py /tmp/investigation --state 36 --county "*"
python3 scripts/fetch_osha.py /tmp/investigation --sic 2911 --state TX
```

### Orchestration & Delegation

| Script | Purpose | Example |
|--------|---------|---------|
| `investigate.py` | Run full pipeline end-to-end | `python3 scripts/investigate.py /tmp/investigation --phases all` |
| `delegate_to_rlm.py` | Spawn full OpenPlanter agent (session-resumable, provider-agnostic) | `python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR` |

### Knowledge Graph

| Script | Purpose | Example |
|--------|---------|---------|
| `wiki_graph_query.py` | Query OpenPlanter wiki knowledge graph (read-only) | `python3 scripts/wiki_graph_query.py /tmp/investigation --entity "Raytheon" --neighbors` |

Supports entity lookup, neighbor traversal, BFS path finding, full-text search, and graph statistics. Reads NetworkX node-link JSON graphs produced by OpenPlanter's wiki_graph.py during delegated investigations.

## Skill Integration

OpenPlanter methodology composes with existing Claude Code skills:

| Investigation Task | Skill | Integration Pattern |
|---|---|---|
| Web research for entity enrichment | `exa-search` | `web_enrich.py` calls `exa_search.py` as subprocess |
| Scrape JS-heavy public records portals | `Firecrawl` | `firecrawl scrape URL --only-main-content` |
| Structured government APIs | Built-in | `scrape_records.py` queries SEC, FEC, LDA, USAspending via `urllib` |
| Bulk dataset downloads | Built-in | `dataset_fetcher.py` fetches SEC, FEC, OFAC, OpenSanctions, LDA |
| Defense contractor lookup | Built-in | `fetch_sam.py` queries SAM.gov by name/UEI/CAGE/NAICS |
| Environmental compliance | Built-in | `fetch_epa.py` queries EPA ECHO for facilities + violations |
| Nonprofit/dark money flows | Built-in | `fetch_propublica990.py` queries IRS 990 data via ProPublica |
| Offshore entity chains | Built-in | `fetch_icij.py` queries Panama/Paradise/Pandora Papers |
| Workplace safety records | Built-in | `fetch_osha.py` queries DOL enforcement data |
| Demographics/economic context | Built-in | `fetch_census.py` queries Census ACS 5-Year estimates |
| Knowledge graph query | Built-in | `wiki_graph_query.py` reads OpenPlanter wiki graphs |
| Local RAG over large document corpora | `rlama` | Create collection from `datasets/`, query semantically |
| Parallel investigation threads | `minoan-swarm` | Elat Research Swarm with domain-split investigators |
| Academic/legal research | `academic-research` | Case law, regulatory filings, citations |
| Twitter/social media OSINT | `twitter` | `x-search` for entity mentions, `bird` for profile data |
| Daimonic timeline curation | `worldwarwatcher-update` | Mazkir ha-Milḥamat entity resolution for non-military domains |

### US Public Records Datasets

Key datasets and their linking keys for cross-reference investigations:

| Dataset | Access | Linking Key | Script | Format |
|---|---|---|---|---|
| FEC Campaign Finance | `api.open.fec.gov` + bulk CSV | `committee_id`, contributor name | `dataset_fetcher.py` | CSV, JSON API |
| Senate LDA Lobbying | `lda.senate.gov/api` | Registrant name, Client name | `dataset_fetcher.py` | JSON API, XML |
| SEC EDGAR | `data.sec.gov` + EFTS search | `CIK` (Central Index Key) | `dataset_fetcher.py` | JSON, XBRL |
| SAM.gov Entity Registration | `api.sam.gov` | UEI, CAGE code, NAICS | `fetch_sam.py` | JSON API |
| EPA ECHO Facilities | `echodata.epa.gov` | FRS Registry ID, lat/lon | `fetch_epa.py` | JSON API |
| ProPublica 990 (IRS) | `projects.propublica.org` | EIN, org name | `fetch_propublica990.py` | JSON API |
| ICIJ Offshore Leaks | `offshoreleaks.icij.org` | Node ID, entity name, jurisdiction | `fetch_icij.py` | JSON API |
| OSHA Inspections | `enforcedata.dol.gov` | Activity number, SIC code | `fetch_osha.py` | JSON API |
| US Census ACS | `api.census.gov` | State/county/ZIP FIPS | `fetch_census.py` | JSON API |
| OFAC Sanctions | `treasury.gov/ofac` (or OpenSanctions) | Name + aliases, identifiers | `dataset_fetcher.py` | CSV, XML |
| State Corporate Registries | Per-state (or OpenCorporates API) | State registration number | — | Varies |
| Property Records | County-level (or ATTOM/CoreLogic) | Parcel ID (APN), owner name | — | CSV, shapefile |

**Cross-dataset linking challenge**: No universal corporate ID exists in US public records. The standard approach: normalize names, fuzzy match, filter by jurisdiction/address, then anchor on known IDs (CIK, committee_id, EIN, UEI, CAGE, FRS ID) when available.

## Multi-Agent Investigation

For complex investigations requiring parallel workstreams, use `minoan-swarm`. Separate the verifier agent from analysis agents to maintain uncorrelated verification—the verifier receives only output files and verification criteria, with no shared context from the analysis phase.

See `references/investigation-methodology.md` for the full swarm role template (keret, kothar, resheph, anat, shapash).

## Structured Analytical Techniques

For complex scenarios with multiple possible explanations, apply Analysis of Competing Hypotheses (ACH) and Key Assumptions Check. These techniques are detailed in `references/investigation-methodology.md`.

**ACH summary**: Build a matrix of hypotheses vs. evidence. Score by **inconsistency count** (fewest I markers wins), not confirmation count. This counteracts confirmation bias because disconfirming evidence is more diagnostic than supporting evidence. Identify linchpin evidence whose reclassification would change the conclusion.

## Deep References

- `references/investigation-methodology.md` — Full epistemic framework, ACH procedure, Key Assumptions Check, multi-agent swarm template
- `references/entity-resolution-patterns.md` — Complete normalization tables, suffix maps, address canonicalization
- `references/output-templates.md` — JSON/Markdown templates for investigation plans, summaries, and evidence chains
- `references/public-records-apis.md` — API endpoints, auth, rate limits, linking keys for SEC, FEC, LDA, OFAC, USAspending, Census, EPA, ICIJ, OSHA, ProPublica 990, SAM.gov

## OpenPlanter Tool to Claude Code Mapping

| OpenPlanter Tool | Claude Code Equivalent |
|---|---|
| `list_files` | `Glob`, `Bash(ls)` |
| `read_file` | `Read` |
| `write_file` | `Write` |
| `search_files` | `Grep` |
| `edit_file` | `Edit` |
| `apply_patch` | `Edit` |
| `run_shell` | `Bash` |
| `web_search` | `exa-search` skill |
| `fetch_url` | `Firecrawl` skill / `WebFetch` |
| `subtask` | `Task` tool (minoan-swarm) |
| `execute` | `Task` tool (haiku model) |
| `think` | Native reasoning |
| `wiki_graph` | `wiki_graph_query.py` (read-only) |
| `fetch_sam` | `fetch_sam.py` |
| `fetch_epa` | `fetch_epa.py` |
| `fetch_icij` | `fetch_icij.py` |
| `fetch_osha` | `fetch_osha.py` |
| `fetch_990` | `fetch_propublica990.py` |
| `fetch_census` | `fetch_census.py` |
| `resume_session` | `delegate_to_rlm.py --resume` |

No capability gap. The methodology is what matters, not the tooling.