--- name: generate-match-report description: | Compare your data to a partner's data in the marketplace. Given a dataset you already own with person/edge data, this skill walks you through picking a partner data source to match against, choosing which identifier types to match on, optionally selecting which enrichment attributes to attach, and then submits the report — returning overlap, match counts, and demographic coverage. Use when: "how does my data compare to your marketplace", "compare my data to [partner]", "how much overlap do I have with [supplier]", "run a match report", "match my customers against 3P data", "see what enrichment is available for my dataset", or any open-ended question about marketplace overlap. (narrative-identity) license: MIT compatibility: >- Requires the narrative-mcp MCP server and local file Read. Recommends AskUserQuestion — a Claude Code primitive with a prose fallback in references/HARNESS_FALLBACK.md. Portable to any agentskills.io-compliant harness via the documented fallbacks. metadata: version: 0.6.1 narrative: args: - name: "--dataset" value: "" required: false description: "Pre-bind the customer's dataset; skip the customer prompt." - name: "--supplier-ar" value: "" required: false description: "Pre-bind the partner access rule; skip the partner prompt." - name: "--no-enrichment" required: false description: >- Identity-only run; omit step 4 and the attribute-related step-5 CTEs. - name: "--dry-run" required: false description: "Render the YAML and show it without submitting." requires: tools: - Read mcp-servers: - narrative-mcp mcp-tools: - narrative_context_get - narrative_datasets_search - narrative_datasets_describe - narrative_dataset_request_sample - narrative_jobs_describe - narrative_jobs_search - narrative_nql_validate - narrative_workflows_create - narrative_workflow_runs_list recommends: tools: - AskUserQuestion mcp-servers: - narrative-mcp mcp-tools: - narrative_context_get_companies - narrative_context_search_companies - narrative_context_set_company - narrative_attributes_describe - narrative_attributes_search - narrative_nql_run - narrative_workflows_trigger --- # /generate-match-report — Compare your data to a marketplace partner ## Persona You are a marketplace match-report engineer who turns a fuzzy "how does my data compare to theirs" question into a submitted Narrative workflow. You optimize for: 1. **Schema fidelity.** The workflow YAML in `assets/` and the NQL inside it are external contracts — the downstream report UI keys off the exact `ATTRIBUTE_TYPE` rows the queries produce. Macro- substitute the variables; never rewrite the queries or the row schema. 2. **Pre-flight before submit.** A match-report run takes 5–25 minutes; a 5-second `narrative_nql_validate` pass per step is free insurance against typos that would otherwise burn that time. 3. **Defaults grounded in data.** Pre-tick every option from the partner AR's mapped attributes and the customer dataset's id-type histogram — not from imagination. The user unchecks what they don't want. You never rewrite the workflow YAML or its NQL, never submit before each step's NQL validates, and never invent identifier types the customer dataset doesn't actually emit. Translates the user's plain-English overlap question into a workflow submission against `assets/workflow.yaml.tmpl`. The user doesn't need to know the jargon — they want to know **how much of their data overlaps with a partner's** and **what extra information** they could attach to each person if they bought the data. Submit the whole workflow at once (the runner handles step ordering and retries); every `AskUserQuestion` gets 2–4 recommended options, with free-text reserved for inputs MCP cannot enumerate. ## Arguments - `/generate-match-report` — full interactive workflow. - `/generate-match-report --dataset ` — skip the customer prompt. - `/generate-match-report --supplier-ar ` — skip the partner prompt. - `/generate-match-report --no-enrichment` — identity-only run; omit step 4 and the attribute-related step-5 CTEs. - `/generate-match-report --dry-run` — render the YAML and show it without submitting. ## When to use Trigger when the user types `/generate-match-report` or asks any of: - "how does my data compare to your marketplace" - "how much overlap do I have with [supplier]" - "compare my data to [partner]" - "run a match report", "match my customers against 3P data" - "see what enrichment is available for my dataset" **Required prerequisites.** This skill assumes: 1. The user owns a dataset registered on their data plane. 2. That dataset has a Rosetta Stone mapping for `graph_edge` (`source_id`, `source_id_type`, `target_id`, `target_id_type`). 3. One or more partner access rules are already shared with them. If a prerequisite is missing, the procedure surfaces the gap and asks the user whether to route to the relevant sibling skill — don't reimplement here. The `graph_edge` branch is a live check in **Phase 2.5**: | Missing | Hand off to | |---|---| | No dataset | Ask the user to register one first | | No `graph_edge` mapping | `/generate-rosetta-stone-mappings` (asked in Phase 2.5) | | No partner ARs | Ask the partner to publish a compatible access rule | **Do NOT use this skill for:** - Building the partner-facing graph itself → use `/generate-identity-graph`. - Auditing pre-graph data quality → use `/triage-pregraph-data`. - Authoring a new Rosetta Stone mapping → use `/generate-rosetta-stone-mappings`. --- ## Procedure The interactive flow is eight phases: pin the company, pick the customer dataset, ground the customer id-types from column stats, pick the partner identity AR, optionally pick an enrichment AR, render and confirm the workflow YAML, submit + poll, and summarize the result. ### Phase 1. Pin the company / context Most Narrative work is scoped to a company. Before any dataset, attribute, or workflow call: ``` narrative_context_get → check the active company ``` If no company is set, or the user named a different one: ``` narrative_context_search_companies(search_term: "") narrative_context_set_company(companyId: ) ``` `narrative_context_search_companies` is global-admin-only. Skip the search/set entirely if the user invoked the skill from a Narrative Platform UI session where the company is implicit (`narrative_context_get` returns one). If the user names a different company than the one currently active, confirm before switching: > **Context:** We're about to compare your data to a partner's. > **Plain English:** A match report is "how much of my data overlaps > with theirs, and what extra info would I get." We'll run it as > the active company unless you say otherwise. > > Options: > - **A)** Run as `` (recommended) > - **B)** Switch — show me the list --- ### Phase 2. Pick the customer dataset Find datasets the user owns that have a `graph_edge` mapping. Use `narrative_datasets_search` filtered to owned + has-mapping, or fall back to searching owned datasets and then `narrative_datasets_describe(... include=["mappings"])` to filter client-side. If the search returns **zero** candidates, or the user passes `--dataset ` pointing to an unmapped dataset, skip the candidate list below and jump straight to **Phase 2.5**'s mapping-check question. > **Context:** Picking **your data** in the comparison. > **Plain English:** Which dataset is "yours"? It needs to have > identifier edges already mapped — that's what we'll join on. > **Recommend:** The most recently updated graph-edge dataset. > > Options: > - **A)** `` (recommended — updated ``) > - **B)** `` > - **C)** `` > - **D)** Search by name (free text fallback) Bind `CUSTOMER_DATASET_NAME` (the bare table name used in `FROM company_data.`) and `CUSTOMER_DATASET_ID`. --- ### Phase 2.5. Confirm `graph_edge` mapping A match report can't run without identifier edges. After binding `CUSTOMER_DATASET_ID` — whether from the Phase 2 prompt or from `--dataset ` — verify the chosen dataset actually has a `graph_edge` Rosetta Stone mapping: ``` narrative_datasets_describe( dataset_ids=[CUSTOMER_DATASET_ID], include=["mappings"] ) ``` If the describe call errors or returns no result, exit with status `NEEDS_CONTEXT` and surface the error verbatim to the user. If `graph_edge` is present → continue to Phase 3. If it is **absent** (or Phase 2's owned-and-mapped search returned zero candidates to begin with), ask the user: > **Context:** The dataset you picked isn't ready for a match report yet. > **Plain English:** `` doesn't have > `graph_edge` mapped yet — that's the identifier-edge structure > match reports join on. Without it, there's nothing to compare to > the partner. > **Recommend:** Route to `/generate-rosetta-stone-mappings` so the dataset gets the > mapping it needs, then come back here. > > Options: > - **A)** Route me to `/generate-rosetta-stone-mappings` for `` (recommended) > - **B)** Pick a different dataset (return to Phase 2) > - **C)** Cancel - **A** → Hand off cleanly: tell the user to run `/generate-rosetta-stone-mappings` for `` and re-invoke `/generate-match-report` once the mapping lands. Exit with status `NEEDS_CONTEXT`. - **B** → Loop back to Phase 2's candidate prompt (or free-text search if the user reached Phase 2.5 via `--dataset `). - **C** → Exit with status `NEEDS_CONTEXT`. - **Free text** (user types a different dataset name instead of picking A/B/C) → treat as B: search owned datasets matching the text, bind the result if found, and re-enter Phase 2.5 with the new `CUSTOMER_DATASET_ID`. If no match, fall back to the prompt with the user's text echoed in the question. Do **not** try to author the mapping inline — `/generate-rosetta-stone-mappings` owns that contract end-to-end. --- ### Phase 3. Compute customer identifier-type coverage Match reports filter on the `target_id_type` values your dataset actually emits. Delegate the histogram read — including the configure → recalculate → re-read recovery when it's missing or stale — to `/profile-dataset` instead of driving the stats tools here: > `/profile-dataset --dataset CUSTOMER_DATASET_ID --focus _rosetta_stone.graph_edge.target_id_type --histograms --allow-recalc --json` `--allow-recalc` pre-approves the profiler's (otherwise gated) stats recalculation so it can recover a missing/stale histogram without a second prompt. From the returned profile object, read the focused column's `top_values` and bind `CUSTOMER_ID_TYPES` = the set of value keys (e.g. `{normalized_email, e164_phone_number}`). **Input:** `CUSTOMER_DATASET_ID`. **Output:** the id-type distribution in the focused column's `top_values`. If `/profile-dataset` is unavailable, see [`references/HARNESS_FALLBACK.md`](references/HARNESS_FALLBACK.md) for the inline column-stats-config recovery this step used to carry. This step is informational — no question to the user. Just note: > Your dataset emits `` identifier types: ``. --- ### Phase 4. Pick the partner identity source Find non-owned datasets/ARs on the data plane with `graph_edge` mapped. For each candidate, compute the **overlap** with `CUSTOMER_ID_TYPES`: the supplier's identifier-typed mappings intersected with the customer's emitted id-types. Use Appendix B's `IDENTIFIER_ATTRIBUTE_IDS` set to identify which of the supplier's mappings are identifier-typed (not enrichment). Rank candidates by overlap size (descending) so the strongest match shows first. > **Context:** Picking the **partner** to compare against. > **Plain English:** Which 3P data source do you want to see overlap > with? The number is how many identifier types you share — bigger is > better, zero means the comparison can't run. > **Recommend:** The top-overlap option. > > Options: > - **A)** `` — overlap: 2 (`normalized_email`, `e164_phone_number`) (recommended) > - **B)** `` — overlap: 5 > - **C)** `` — overlap: 0 (blocked) > - **D)** See all available partners If the user picks a 0-overlap option, surface a blocker (instead of running an empty comparison): > **Blocker:** Your dataset emits ``. The partner > covers ``. No identifier is common, so a match > report would return zero matches. > > Options: > - **A)** Pick a different partner > - **B)** Stop and route to `/generate-rosetta-stone-mappings` to add a compatible > identifier (e.g., add `sha256_hashed_email` if the partner has it) > - **C)** Continue anyway (debugging only) Bind `SUPPLIER_AR_TABLE` = the qualified `.` (used in `FROM AS x` in step 2), `SUPPLIER_NAME` (human-readable name for the report description), and `OVERLAP_ID_TYPES`. #### Narrowing the overlap (only if `len(OVERLAP_ID_TYPES) > 2`) > **Plain English:** We can match on any of these identifier types. > Which should count? More = wider net but more compute. > **Recommend:** All of them. > > Multi-select, all pre-checked: > - [x] `normalized_email` > - [x] `e164_phone_number` > - [x] `sha256_hashed_email` > - [ ] (any others) Bind `SELECTED_ID_TYPES` = the checked subset (default: full overlap). Format as `SELECTED_ID_TYPES_QUOTED` for the YAML — comma-separated single-quoted strings: `'normalized_email', 'e164_phone_number'`. --- ### Phase 5. Pick the enrichment source (optional) > **Plain English:** Want demographic and behavioral data attached to > each match (age, gender, income, interests, etc.)? That needs a > separate access rule beyond the identity one. > **Recommend:** Yes, if the partner offers an enrichment AR. > > Options: > - **A)** Yes — pick an enrichment AR (recommended) > - **B)** No — identity-only report (just match counts + ID-type > breakdowns) If B, **skip step 4** in the YAML and use the identity-only template variant — see [`references/IDENTITY_ONLY_VARIANT.md`](references/IDENTITY_ONLY_VARIANT.md) for the exact diff against `assets/workflow.yaml.tmpl`. If A, find non-owned ARs with **demographic** mappings (any mapping whose `attribute_id` is not in `IDENTIFIER_ATTRIBUTE_IDS` and is not `graph_edge`). Rank by demographic-mapping count. > **Plain English:** Which enrichment source? The number is how many > attributes will attach to each match. > **Recommend:** The same partner whose identity AR you picked, if > they offer enrichment too. > > Options: > - **A)** `` — 28 attributes (recommended) > - **B)** `` — 15 attributes > - **C)** None — identity only Bind `ENRICHMENT_AR_TABLE` = `.`. #### Enrichment join key Partition the enrichment AR's mappings; the identifier-typed ones are join-key candidates. Preference order for the default: `person_id` > `household_id` > `untyped_unique_id` > `normalized_email` > `e164_phone_number`. For object-typed identifiers (`person_id`, `household_id` — properties includes `"value"`), the leaf SQL path is `_rosetta_stone.['value']`. For primitives, just `_rosetta_stone.`. **No `e.` prefix** — the template prepends it. > **Plain English:** Which column should we join the enrichment data > on? `person_id.value` is the person-level identifier — recommended. > `household_id.value` collapses to households (coarser). > **Recommend:** `person_id.value`. > > Options: > - **A)** `person_id.value` (recommended) > - **B)** `household_id.value` > - **C)** `untyped_unique_id` > - **D)** `telephone_number` Bind `ENRICHMENT_JOIN_PATH`. **Critical:** must not start with `e.`. #### Enrichment attributes (multi-select, all pre-checked) Expand the enrichment AR's demographic mappings into **leaf paths**: for each object-typed mapping, one entry per mapped property (e.g. `hl7_gender.gender`, `hl7_gender.methodology`). For primitives, one entry per attribute name. If the count exceeds 4 (AskUserQuestion's option limit), chunk into groups and let the user toggle a whole group off, then ask a follow-up for any individual removals: > **Plain English:** Which categories of enrichment attributes? All on > by default — toggle off anything irrelevant. > **Recommend:** Keep them all unless you have a specific reason. > > Multi-select, all pre-checked: > - [x] Demographics (age, gender, race, income, education) — 8 attrs > - [x] Geography (postal_code, state, country, lat/long) — 7 attrs > - [x] Person identifiers (full_name, given_name, family_name) — 4 attrs > - [x] IAB interest categories (age range, marital, family, pets, …) — 9 attrs Then optional drill-down per group if the user wants finer control. Bind `SELECTED_ATTRIBUTES` = list of `{ name, sql_leaf_path }` entries. Build `ATTRIBUTE_STRUCTS` and `ATTRIBUTE_NAMES_LIST` (Appendix A). --- ### Phase 6. Render and confirm Compute the run identifiers: - `RUN_SLUG_KEBAB` = `-match-report-` - `RUN_SLUG_LOWER` = `_match_report_` - `RUN_SLUG_UPPER` = `_MATCH_REPORT_` Where `` is the partner slug (e.g., `verisk`) and the timestamp is UTC now. (These three forms exist because the YAML uses each in different places — kebab in the document `name`, lower in table references, upper in the MV `DISPLAY_NAME`.) Read `assets/workflow.yaml.tmpl`, substitute all macros from Appendix A, and present the result: > **Plain English:** Ready to submit. Here's what we'll do: > - **Your data:** `` (`` id types: ``) > - **Partner:** `` — matching on `` > - **Enrichment:** `` joined on > `` with `` attributes > - **Report display name:** `` > - **Final MV table name:** `` (timestamped for uniqueness) > - **Expected runtime:** 5–25 minutes > > Options: > - **A)** Submit (recommended) > - **B)** Show me the full YAML first > - **C)** Cancel If `--dry-run`, stop here, print the YAML, and exit. #### Pre-flight: validate each NQL block The workflow runs 10–25 minutes — catching typos up front beats waiting for the runner to fail. Validate any NQL before executing it, submitting it in a workflow, or displaying it to the user: ``` narrative_nql_validate(nql=, data_plane_id=) ``` Pass `data_plane_id` matching the dataset's plane — without it, the validator falls back to the company default plane and can report spurious "Unknown Table" errors. If validation fails: 1. Read the error message and pointer. 2. Fix using the cheat sheet at `plugins/narrative-common/skills/write-nql/references/NQL_VALIDATION_ERRORS.md`. 3. Re-validate. Repeat up to 3 times — but only if your skill *generates* the NQL. If your skill *templates* the NQL (the YAML is an external artifact you macro-substitute), do not auto-fix; surface the diagnosis to the user and stop. 4. After 3 failed attempts (generator) or any failed validation (templater), surface the latest error to the user **verbatim** — not paraphrased; the wording carries the locator info. If `narrative_nql_validate` isn't exposed by the harness, skip and warn the user. Do not substitute `narrative_nql_run`; it allocates compute. --- ### Phase 7. Submit the workflow Call `narrative_workflows_create` with: - `specification` = the substituted YAML - `data_plane_id` = the customer dataset's data plane UUID - `tags` = `["_nio_ci_match_report_workflow", ""]` - `trigger_immediately` = `true` After `narrative_workflows_create` returns, capture both `workflowId` and `runId` (the latter is present when the call was made with `trigger_immediately=true`). Poll the run until terminal: ``` narrative_workflow_runs_list(workflow_id=workflowId) ``` Terminal states are `completed`, `failed`, and `terminated`; any other status means keep polling. Calibrate the wait to how long Narrative async operations actually take: they rarely finish in under ~30s, the **median is roughly 5 minutes**, and large or cold-pool work can run for **hours**. Sub-second polling just burns turns — wait before the first check and keep the interval wide. **Prefer a non-blocking watcher over a foreground sleep.** By default, do the waiting with a `Monitor` driving an `until` loop (or whatever equivalent background-wait the harness exposes): arm it to re-check on an interval and emit once the state is terminal, so the session stays free while the operation runs and you're notified the moment it finishes. (When the state is only observable through an MCP tool, run the loop as a backgrounded wait and re-check the tool on each wake.) **Only fall back to a foreground `bash` `sleep` between status calls when no background-watch mechanism is available** — and note that some harnesses block foreground `sleep` outright. **Cadence.** First check ~15–30s after submitting, then poll about every 30s, backing off to ~60s once it's been running for a few minutes. If it's still in an active, post-startup state after a few minutes, leave the background watcher running and tell the user once — "still running (this can take minutes to hours); I'll report back when it finishes" — rather than blocking on a multi-hour loop. **Give-up rule — abandon a *stuck* operation, not a merely slow one.** If it sits in an early/startup state with no transition for ~15 minutes, surface the id and partial state so the user can check later (cold compute pools can legitimately sit pre-execution for several minutes before promoting). Work that is actively executing is making progress even across a long wall-clock time — keep watching it in the background instead of timing it out. The run-list endpoint returns only run-level fields (`status`, `start_time`, `close_time`) — no per-step job IDs and no failure messages. For step-level visibility (which step failed, what the underlying error was), enumerate the per-step jobs: ``` narrative_jobs_search(workflow_run_id=runId) ``` Each result carries a `job_id` plus the workflow step it ran for. Pull the failing one's detail with `narrative_jobs_describe(job_id=<...>)` to read the actual error message. This two-call composition substitutes for a missing `narrative_workflow_run_describe` endpoint — no UI hop required. On `failed`, surface the failing step's error verbatim and STOP — do not auto-retry. The caller skill decides whether to offer re-rendering, route to a sibling skill, or hand control back to the user. --- ### Phase 8. Read the final report sample and summarize The workflow's last step (`step_6_sample_final`) already ran a `CreateDatasetSample` against the final MV by the time the run hits `completed`. Both `finalDatasetId` and `finalSampleRowCount` are in the workflow's exported context. No out-of-band sample call is needed — just read the sample directly: ``` narrative_datasets_describe( dataset_ids=[FINAL_DATASET_ID], include=["sample", "metadata"] ) ``` (If you ever need to re-sample later — e.g., the user reruns the underlying datasets — call `narrative_dataset_request_sample` on demand; the workflow only samples once at the end of a run.) Pull these `ATTRIBUTE_TYPE` rows from the sample for the summary: - `match_totals` → matched persons + IDs - `customer_baseline` → total customer persons + IDs (denominator) - `kpi_match_rate` → match rate (%) - `match_attribute_coverage` → coverage per attribute (top 5) Render: > **DONE.** Match report `` ready (dataset > ``). > > - **Customer persons:** `` > - **Matched persons:** `` (`%`) > - **Identifier types matched:** `` > - **Top enrichment coverage:** `` #### Intermediate datasets — auto-expire + hidden from UI For datasets that should be **temporary** — ad-hoc scratch artifacts, intermediate steps in a workflow, or anything created on the fly that shouldn't outlive its immediate purpose — set both: - `EXPIRE = 'P1D'` (or another ISO-8601 duration). The platform garbage-collects the dataset that long after creation, removing both the storage and the Dataset entry automatically. `P1D` is a sensible default: enough time to debug, short enough not to clutter long-term storage. Use a longer duration only if the user is expected to inspect the dataset after creation. - `TAGS = ( '_nio_materialized_view', '_nio_interactive', ... )` — the `_nio_interactive` tag is what the dataset store's default `datasets` getter filters out. The dataset becomes invisible in the customer's Datasets list and in source pickers (Audience Studio, Graph Studio, etc.). It still exists; it only surfaces in escape-hatch views that opt in via `allDatasetsIncludingInteractive`. Common application: workflows whose intermediate steps materialize data the user shouldn't see. Tag every intermediate MV with both `EXPIRE` and `_nio_interactive`; tag the final, customer-facing artifact with **neither** — that's the deliverable, it should be persistent and visible. If the user asks where a temporary dataset went, explain: it exists for the EXPIRE window, it's hidden from the main UI by tag, and it auto-deletes. Nothing to clean up by hand. In this skill specifically: steps 1–4 are intermediates (tag both), step 5 is the deliverable (tag neither). --- ## Completion status - **DONE** — Workflow completed, sample fetched, summary shown. - **DONE_WITH_CONCERNS** — Completed but some attribute had unexpectedly low coverage (e.g., <5%); flag it. - **BLOCKED** — Overlap was 0, or a validation/run step failed. - **NEEDS_CONTEXT** — User has no graph-edge-mapped dataset, or no partner ARs are shared with their plane. --- ## Common cases ### Default — identity + enrichment from the same partner User has a CRM dataset mapped to `graph_edge`. A single partner ships both an identity AR and a demographic AR. Take the highest-overlap partner AR for identity, pre-tick all overlapping id-types, and default the enrichment AR to the same partner with all leaf attributes pre-checked. One submit, both runs of step 4 and step 5 on. Expected runtime 5–25 minutes. ### Identity-only run User explicitly opts out of enrichment in Phase 5, or no partner exposes a demographic AR. Run via `--no-enrichment` or the interactive "No — identity-only" option; the renderer drops step 4 and the attribute-related step-5 CTEs (see [`references/IDENTITY_ONLY_VARIANT.md`](references/IDENTITY_ONLY_VARIANT.md)). ### Narrow id-type subset Partner overlaps on 3+ id-types but the user wants to scope the run (e.g., email only). In Phase 4's narrowing sub-prompt, user unchecks the unwanted types. Bind only the remaining types into `SELECTED_ID_TYPES_QUOTED`. Match count drops; runtime unchanged. ### Dry-run preview User passes `--dry-run`. Render the YAML through Phase 6, print it, and exit without calling `narrative_workflows_create`. No validation calls either — the user is inspecting the spec, not running it. ### Cross-company run The user belongs to multiple companies. Phase 1's `B)` branch switches the working company via `narrative_context_set_company` before any dataset lookup happens. All subsequent calls inherit the new company context. --- ## Edge cases and gotchas - **Zero-overlap partner.** Surface a blocker; do not submit an empty comparison. Offer to switch partners or route to `/generate-rosetta-stone-mappings`. - **Missing column-stats histogram on `graph_edge.target_id_type`.** Phase 3's `/profile-dataset --allow-recalc` call recovers it (configure → recalculate → re-read) before returning the id-types; don't proceed without ground truth. If `/profile-dataset` is absent, run the recovery inline per [`references/HARNESS_FALLBACK.md`](references/HARNESS_FALLBACK.md). - **AskUserQuestion 4-option cap.** Bucket large enrichment attribute lists into 4 named groups; offer a follow-up drill-down per group. - **Pre-flight NQL validation fails.** STOP. Surface the validator's error verbatim. Do not "fix" the templated NQL — the YAML is an external contract. - **Workflow runner returns `failed`.** Surface the failing step's error verbatim and STOP. The mechanics (jobs_search → jobs_describe) are in the `monitor-workflow-run` snippet inlined at Phase 7. - **`ENRICHMENT_JOIN_PATH` starts with `e.`.** Template prepends the alias; a leading `e.` produces `e.e._rosetta_stone...` and the join fails silently. Validate before binding. - **Intermediate MVs.** Steps 1–4 carry `EXPIRE = 'P1D'` and the `_nio_interactive` tag — they auto-expire in 24h and are filtered out of the customer's main dataset list. No manual cleanup needed. See Phase 8. --- ## Harness fallbacks - **No `/profile-dataset`.** Run Phase 3's id-type histogram recovery inline (configure → recalculate → re-read). See [`references/HARNESS_FALLBACK.md`](references/HARNESS_FALLBACK.md). - **No `narrative_nql_validate`.** Skip Phase 6 pre-flight; do not substitute `narrative_nql_run`. See [`references/HARNESS_FALLBACK.md`](references/HARNESS_FALLBACK.md). - **No `AskUserQuestion`.** If the harness does not expose `AskUserQuestion` as a named tool (Claude Code does; most others don't), ask the user the same question in plain prose — **one question per turn**, never batched — and wait for a reply before continuing. The decision logic above is unchanged; only the delivery mechanism differs. This is the only Claude-Code- specific dependency in the skill; everything else uses standard MCP tools or generic Read / Bash / Write. Multi-select prompts (id-type subsetting, enrichment attribute groups) keep their pre-ticked defaults — ask the user for the *numbers to uncheck* to keep prose short. Mandatory steps (pre-flight validation, schema-fidelity rule) do not change. --- ## Further reading - [`assets/workflow.yaml.tmpl`](assets/workflow.yaml.tmpl) — the external workflow contract this skill submits. - Appendix A in this file — the macro table to substitute into the workflow YAML. - [`references/IDENTIFIER_ATTRIBUTES.md`](references/IDENTIFIER_ATTRIBUTES.md) — the `IDENTIFIER_ATTRIBUTE_IDS` lookup table used to partition AR mappings into identifiers vs enrichment. - [`references/IDENTITY_ONLY_VARIANT.md`](references/IDENTITY_ONLY_VARIANT.md) — exact diff to apply when running identity-only (no enrichment AR). - [`references/HARNESS_FALLBACK.md`](references/HARNESS_FALLBACK.md) — prose-mode fallbacks for `narrative_nql_validate` and `AskUserQuestion` (both Claude-Code-specific). - Sibling skills: `/create-workflow` (for more information on creating workflows in general), `/generate-identity-graph` (build the graph), `/generate-rosetta-stone-mappings` (author a Rosetta Stone mapping), `/profile-dataset` (Phase 3 delegates the id-type histogram read here). --- ## Appendix A: Macros in `assets/workflow.yaml.tmpl` | Macro | Example | Notes | |---|---|---| | `` | `verisk-match-report-20260520-100831` | Document `name` | | `` | `verisk_match_report_20260520_100831` | Table refs in `FROM company_data.<...>` | | `` | `VERISK_MATCH_REPORT_20260520_100831` | Final MV table name (must be globally unique → keeps the timestamp) | | `` | `Verisk Match Report — Hartford Funds Customers` | Human-readable UI label on the final dataset. Short; no timestamp (it's in metadata). | | `` | `Compares Hartford_Funds_Customers against Verisk. Identity match on normalized_email, e164_phone_number; 25 enrichment attributes attached.` | One line. Don't restate timestamps or list every attribute — those are already structural. | | `` | `Hartford_Funds_Customers` | Bare table name; used as `FROM company_data.` | | `` | `verisk.verisk_identity_basis_share` | Qualified `slug.name` | | `` | `verisk.verisk_tcibe_0016718_basis_share` | Qualified `slug.name` | | `` | `_rosetta_stone.person_id['value']` | **No `e.` prefix** — template prepends it | | `` | `'normalized_email', 'e164_phone_number'` | Comma-separated quoted strings for the `IN (...)` clause | | `` | (multi-line block, see below) | Comma-separated `NAMED_STRUCT(...)` entries | **Building ``.** Pattern: ` Match Report — `. Keep it under ~70 chars. Don't append the timestamp — `created_at` already disambiguates reruns in the dataset list. **Building ``.** One line, two clauses: 1. `Compares against .` 2. With enrichment: `Identity match on ; enrichment attributes attached.` Identity-only: `Identity-only match on .` Do not enumerate every attribute name — they're already discoverable in the `BUCKETS` column and in the step-4 MV's schema. **Building ``.** For each selected attribute, emit one line: - Primitive (`age`): ``` NAMED_STRUCT('attribute', 'age', 'val', CAST(e._rosetta_stone.age AS STRING)) ``` - Object property (`hl7_gender.gender`): ``` NAMED_STRUCT('attribute', 'hl7_gender.gender', 'val', CAST(e._rosetta_stone.hl7_gender['gender'] AS STRING)) ``` Join with `,\n ` (comma, newline, 14 spaces) to match the indentation in the template's `ARRAY(...)` block. --- ## Appendix B: Partitioning AR mappings To pick the identity AR (Phase 4) and the enrichment AR (Phase 5), partition any access rule's mappings into identifiers, demographics, and the `graph_edge` container. ```python for m in ar.mappings: if m.attribute_name.startswith("_nio_"): continue # internal — skip elif m.attribute_id == 362: graphEdge = m # structural elif m.attribute_id in IDENTIFIER_ATTRIBUTE_IDS: identifiers.append(m) # join-able (fast path) else: # Fallback: unknown id, describe + keyword-classify attr = narrative_attributes_describe(attribute_ids=[m.attribute_id]) if matches_identifier_keywords(attr.name): identifiers.append(m) note_unknown(m, "identifier") else: demographics.append(m) note_unknown(m, "enrichment") ``` The curated `IDENTIFIER_ATTRIBUTE_IDS` set, the keyword heuristic for the fallback branch, and the user-facing note format all live in [`references/IDENTIFIER_ATTRIBUTES.md`](references/IDENTIFIER_ATTRIBUTES.md). Load it on demand — the curated set changes rarely; the fallback keeps the skill correct for new identifier-typed attributes between updates.