---
name: document-dbt
description: 'Enriches dbt source and model YAML descriptions for AI-driven query generation. Profiles BigQuery tables for schema metadata, sparse_ratio, and enum values so documentation uses ground-truth data, never invented values. Use when documenting new or existing dbt sources/models. Also supports dataset-level mode: pass a BigQuery dataset name to auto-discover all tables and dispatch a parallel swarm of agents to document every table at once.'
user-invocable: true
argument-hint: <source_or_model_yaml_path_or_bigquery_dataset_name>
---

# Document DBT

Enriches dbt YAML documentation so a downstream query-building agent — with no other context — can write correct, efficient, compliant BigQuery jobs. Documentation quality is measured by that standard.

Every column description must answer three questions:

1. **What is this?** — semantic meaning, not just data type
1. **Should I use it?** — sparse_ratio, PII, deprecated, freshness reliability
1. **How do I use it?** — join targets, expected values, null semantics, caveats

## Usage

`/document-dbt <path>` where `<path>` is either:

- A source or model YAML file under `schemas/` → **Table/Model Mode** (documents specific tables in that file)
- A BigQuery dataset name (e.g., `hubspot`, `salesforce`, `postgres_public`) → **Dataset Mode** (auto-discovers all tables and dispatches a parallel swarm)

## Dataset Lifecycle Policy

- **Removed datasets (never document or recreate):** `audits`, `marketing`, `pathways`.
  - If YAML exists, remove it.
  - If the dataset exists in BigQuery, delete it (with contents).
- **Legacy/deprecated datasets (document as legacy):** `ablespace`, `agora`, `bamboo`, `clockify`, `healthie`, `medallion`.
  - Ensure `config.meta.deprecated: true` at the source level.
  - Source description must explicitly say the dataset is legacy/deprecated and retained for historical use after migration.
- **`postgres_public` requirement:** source description must explicitly state it is the Google Cloud Datastream CDC replica of Pathway Postgres.

## Dataset Mode

When the argument does not end in `.yml` and does not contain a `/`, treat it as a BigQuery dataset name and follow this workflow instead of the single-table workflow below.

### Step D1: Discover all tables in the dataset

Run this Python snippet to retrieve the ground-truth table list directly from BigQuery — never guess or infer from the YAML:

```python
from google.cloud import bigquery

PROJECT = "parallel-data-prod"
DATASET = "<dataset_name>"  # substitute the argument

client = bigquery.Client(project=PROJECT)
tables = list(client.list_tables(f"{PROJECT}.{DATASET}"))
table_names = sorted(t.table_id for t in tables)
print(f"Found {len(table_names)} tables in {DATASET}:")
for name in table_names:
    print(f"  - {name}")
```

Or equivalently via CLI:

```bash
bq ls --max_results=500 --format=prettyjson parallel-data-prod:<dataset_name> | python3 -c "
import json, sys
tables = json.load(sys.stdin)
for t in sorted(tables, key=lambda x: x['tableReference']['tableId']):
    print(t['tableReference']['tableId'])
"
```

**CRITICAL — pagination:** `bq ls` without `--max_results` defaults to 50 results. Large datasets will be silently truncated. Always pass `--max_results=500` (or a value larger than the expected table count). Failure to do so will cause tables to appear missing from BigQuery and get wrongly deleted from the YAML.

### Step D2: Locate the corresponding YAML source file

Find the source YAML for the dataset:

```bash
grep -rl "name: <dataset_name>" schemas/sources/
```

If no file exists yet, it must be created following dbt source conventions before agents can write to it.

If the dataset is in the **removed dataset list** (`audits`, `marketing`, `pathways`), do not create a YAML file. Ensure any existing YAML is removed and ensure the BigQuery dataset is deleted.

### Step D3: Cross-reference tables against the YAML

Compare the BigQuery table list against the `- name:` entries under the dataset's `tables:` block in the YAML:

| Discrepancy                                 | Action                                                                     |
| ------------------------------------------- | -------------------------------------------------------------------------- |
| Table in BigQuery but **missing from YAML** | Add a skeleton `- name:` entry so the agent for that table can document it |
| Table in YAML but **not in BigQuery**       | Remove the entire table block — it is hallucinated or stale                |
| Table in both                               | No structural change needed — agent will enrich descriptions               |

**This cross-reference is the core anti-hallucination gate.** The BigQuery list is ground truth. The YAML must reflect it exactly.

### Step D4: Dispatch agents in batches of 5 (divide and conquer)

**Do NOT launch all agents at once.** Launch at most 5 agents per batch. Wait for each batch to complete before starting the next. This prevents concurrent YAML write conflicts and makes failures easy to isolate.

Use `subagent_type: dbt-documentation-agent`, `mode: bypassPermissions`, and `run_in_background: true` for all agents. The specialized agent type avoids loading the full skill on every invocation, reducing token usage significantly.

**Batching strategy:**

1. Divide the full table list into groups of 5.
1. Launch one batch (up to 5 agents) in a single message using parallel Agent tool calls.
1. Wait for all agents in the batch to complete (you will be notified automatically).
1. Verify no YAML corruption before proceeding: `cd schemas && uv run dbt parse --no-partial-parse`.
1. Run dedicated tests for tables touched in that batch: `cd schemas && uv run dbt test -s source:<dataset_name>.<table_1> source:<dataset_name>.<table_2> ...`.
1. Repeat for the next batch.

Agent prompt template (repeat for each table in a batch):

```
Document the `<table_name>` table in `<yaml_path>` for the `<dataset_name>` BigQuery dataset.

Steps:
1. Profile the table: `uv run python .claude/skills/document-dbt/scripts/profile_table.py <dataset_name> <table_name>`
2. Read the current YAML block for this table in `<yaml_path>`.
3. Reconcile columns: add missing ones, remove hallucinated ones, fix stale sparse_ratio values.
4. Read `.claude/agents/dbt-documentation-agent/reference.md` for PII taxonomy and Prisma enums.
5. Enrich every column description (semantic meaning, join targets, enum values, sparse_ratio, PII tags).
6. Write changes back to `<yaml_path>`.
7. Verify parse: `cd schemas && uv run dbt parse --no-partial-parse`
8. Run dedicated tests: `cd schemas && uv run dbt test -s source:<dataset_name>.<table_name>`

Working directory: /Users/merylldindin/Projects/parallel/bluebasin
```

### Step D5: Monitor and clean up after each batch

After each batch of agents completes:

- Run `cd schemas && uv run dbt parse --no-partial-parse`. Fix any YAML corruption before starting the next batch.
- Run dedicated tests for the batch tables: `cd schemas && uv run dbt test -s source:<dataset_name>.<table_1> source:<dataset_name>.<table_2> ...`. Do not proceed until this passes.
- Check for any agents that re-inserted tables that should not exist. Strip those blocks immediately using a targeted regex removal.
- After all batches finish and the final parse is clean, the task is complete.

## Table/Model Mode Workflow

### Step 1: Read the target YAML

Read the file to identify all tables and their columns.

### Step 2: Profile each table in BigQuery

**IMPORTANT:** Never estimate sparse_ratio, never guess enum values, never carry over values from removed columns. Always profile from BigQuery.

For **source** YAMLs, the table reference is `parallel-data-prod.<source_name>.<table_name>`.
For **model** YAMLs, find the schema from `schemas/dbt_project.yml` dispatch rules, then the reference is `parallel-data-prod.<schema>.<model_name>`.

Run the profiling script:

```bash
uv run python .claude/skills/document-dbt/scripts/profile_table.py <dataset> <table>
```

**Parallelization-safe:** No temp tables or shared state. Each invocation runs in its own BigQuery client session, so multiple agents can profile different tables concurrently.

### Step 2b: Reconcile YAML columns against BigQuery schema

After profiling, compare the `=== SCHEMA ===` column list against the YAML `- name:` entries for each table. Fix any drift before writing descriptions:

| Drift type               | Detection                                                                                | Action                                                                                                          |
| ------------------------ | ---------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| **Missing from YAML**    | Column in BigQuery schema but no `- name:` entry in YAML                                 | Add the column entry with a proper description based on profiling data                                          |
| **Hallucinated in YAML** | `- name:` entry in YAML but no matching column in BigQuery schema                        | Remove the entire column block (name + description + config) — it was either renamed, dropped, or never existed |
| **Stale sparse_ratio**   | YAML sparse_ratio differs from profiled value                                            | Update to the profiled value                                                                                    |
| **Stale enum values**    | YAML lists enum values that no longer appear in BigQuery, or omits values that now exist | Update description to match profiled enum values verbatim                                                       |

**This step is non-negotiable.** The YAML must be a 1:1 reflection of what BigQuery actually contains. A hallucinated column in the YAML is worse than a missing one — it will cause a downstream query builder to reference a column that does not exist, producing a query error.

The script outputs four CSV sections separated by `===` headers:

| Section                 | Header                                                     | Contents                                 |
| ----------------------- | ---------------------------------------------------------- | ---------------------------------------- |
| `=== SCHEMA ===`        | `column_name,data_type,is_nullable,is_partitioning_column` | Column types and constraints             |
| `=== ROW COUNT: N ===`  | (inline)                                                   | Total rows in the table                  |
| `=== SPARSE RATIOS ===` | `column_name,sparse_ratio`                                 | NULL fraction per column (0.00–1.00)     |
| `=== ENUMS ===`         | `column_name,distinct_count,value,occurrences`             | Enum values and high-cardinality markers |

**Interpreting the enum detection result set:**

The script applies a cardinality limit of **30 distinct values**. Columns matching any of these patterns are excluded entirely (never enums):

- **Suffix patterns** (case-insensitive, both snake_case and camelCase): `Id`, `At`, `Date`, `Url`, `Uri`, `Path`, `Key`, `Hash`, `Token`, `Email`, `Phone`, `Name`, `Number`, `By` — matches `user_id`, `userId`, `created_at`, `createdAt`, `billing_email`, `firstName`, etc.
- **Exact names** (case-insensitive): `id`, `name`, `title`, `description`, `content`, `context`, `notes`, `chat`
- **Fivetran columns**: any column starting with `_fivetran_`

| Result pattern                 | Meaning                                                                   | Action in description                                                                                                                                         |
| ------------------------------ | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Rows with `value` populated    | Confirmed enum (≤ 30 distinct)                                            | List values verbatim: "Possible values: X, Y, Z."                                                                                                             |
| Single row with `value = NULL` | High cardinality (> 30 distinct)                                          | Do not list values. Describe the pattern instead. Check ingest `models.py` or Prisma enums for authoritative values if the column name suggests a known type. |
| Column not in result set       | Excluded by pattern filter (identifiers, timestamps, emails, names, etc.) | Not an enum. Describe as identifier, timestamp, etc.                                                                                                          |

**IMPORTANT:** When enum values are returned, use them verbatim. Do not add values not present in BigQuery. Do not omit values that appear. If the profiled values conflict with Prisma or ingest enums, note the discrepancy — the BigQuery values are ground truth of what is actually stored.

### Step 3: Gather context from four layers

Read the agent reference for join keys, PII taxonomy, freshness tiers, and field transformation details:

```
.claude/agents/dbt-documentation-agent/reference.md
```

Then pull context from these layers in priority order:

**Layer 1 — Ingest function codebase** (`functions/src/ingest-<source>/`):

- `models.py`: Field aliases (`Field(alias=...)`) reveal original API names vs BigQuery names
- `models.py`: Validators (`@field_validator`) reveal allowed values and transformation logic
- `models.py`: Derived/computed fields that don't map to a single API field
- `functions/src/common/enums/`: Authoritative status/type values used in ingestion

**Layer 2 — Prisma enums** (from `reference.md → Prisma Enum Registry`):

For `postgres_public` sources, the Prisma enums are the canonical value sets. Cross-check profiled BigQuery values against the registry — if BigQuery contains values not in Prisma (e.g., legacy data, migration artifacts), note this in the description.

**Layer 3 — Model SQL** (for model YAMLs only):

Read the `.sql` file to understand how each column is derived. Note CASE expressions, joins, type casts, and aggregation logic. The description should explain what the model computes, not just repeat the column name.

**Layer 4 — Third-party API docs** (for external sources):

When a column's semantics aren't clear from the ingest code, consult the API reference. See `reference.md → Context Sources → Third-Party API Documentation` for URLs.

### Step 4: Interpret schema metadata

Use the schema metadata result set to inform descriptions:

| Metadata                        | Documentation implication                                                                          |
| ------------------------------- | -------------------------------------------------------------------------------------------------- |
| `data_type = TIMESTAMP`         | Note format: `YYYY-MM-DD HH:MM:SS UTC`. Mention timezone handling if relevant.                     |
| `data_type = DATE`              | Note format: `YYYY-MM-DD`.                                                                         |
| `data_type = FLOAT64`           | Note unit (dollars, minutes, percentage, etc.) and precision expectations.                         |
| `data_type = INT64`             | Note whether it's a count, a foreign key, or an encoded value.                                     |
| `data_type = BOOL`              | Document true/false semantics — what does true mean in this context?                               |
| `data_type = STRING`            | Describe expected format: UUID, email, free text, enum, JSON array, etc.                           |
| `is_nullable = NO`              | Column is guaranteed non-null. No need for null semantics in description. Do not set sparse_ratio. |
| `is_nullable = YES`             | Explain when/why null occurs. Set sparse_ratio.                                                    |
| `is_partitioning_column = true` | Note in description: "Partition key — filter on this column for query performance."                |

### Step 5: Apply sparse_ratio rules

Set `config.meta.sparse_ratio` from the profiled value (rounded to 2 decimals). Use the measured value exactly — never adjust based on intuition.

| sparse_ratio | Description requirement                                                                  |
| ------------ | ---------------------------------------------------------------------------------------- |
| 0.00–0.10    | Reliably populated. No special note needed.                                              |
| 0.10–0.50    | Note when/why nulls occur.                                                               |
| 0.50–0.80    | Explicitly warn: "Partially populated — [context]."                                      |
| 0.80–0.95    | Flag as sparse. Suggest alternative columns if they exist. Add IS NOT NULL guard advice. |
| 0.95–1.00    | Flag as very sparse or effectively empty. Advise against use in filters or joins.        |

**Special cases:**

- If `is_nullable = NO` AND `sparse_ratio = 0.0`, skip sparse_ratio meta entirely — the NOT NULL constraint makes it redundant.
- If `sparse_ratio = 1.00`, investigate: is the column deprecated, is it populated only for specific record types, or is it a new column pending first ingest? Document the reason.
- If the table does not yet exist in BigQuery (new model), set `sparse_ratio: 0.0` and add: "Provisional — pending first materialization."

### Step 6: Set meta annotations

**sparse_ratio** — from profiled values (see Step 5).

**pii** — use the taxonomy from `reference.md → PII Handling Rules`:

| Tag               | When to apply                                  |
| ----------------- | ---------------------------------------------- |
| `employee_name`   | Provider or staff full/first/last name         |
| `employee_email`  | Provider or staff email                        |
| `student_name`    | Student full/first/last name (FERPA-protected) |
| `student_email`   | Student or parent email (FERPA-protected)      |
| `phone`           | Any phone number                               |
| `address`         | Physical/mailing address                       |
| `ssn`             | Social Security Number                         |
| `birth_date`      | Date of birth                                  |
| `professional_id` | License or NPI number                          |
| `ip_address`      | IP address                                     |

When a column has a PII tag, the description must state: the PII category, that it should be excluded from aggregate queries, and which ID column to prefer for joins instead.

**deprecated** — set `true` when sparse_ratio = 1.00 with no business use, or when a source is superseded by newer systems.

### Step 7: Write descriptions that pass the quality checklist

For every column, verify against this checklist (from `reference.md`):

- [ ] Semantic meaning is clear without domain knowledge
- [ ] Join target named if this is an ID or foreign key column
- [ ] Expected value set documented for enum columns (from profile, verbatim)
- [ ] Date/timestamp format noted
- [ ] Numeric unit noted (dollars, minutes, percentage, count)
- [ ] Null semantics explained (when null, what null means) — skip if NOT NULL
- [ ] `sparse_ratio` present and reflected in description if > 0.10
- [ ] `pii` tag present if column contains personal information
- [ ] Caveats noted for known data quality issues
- [ ] Alternatives suggested for deprecated or sparse columns
- [ ] Dedicated dbt tests for the updated tables/models pass

**What makes a description inadequate** (never write these):

- "Text/string value. Nullable." — no semantic content
- "Decimal number." — no meaning or unit
- Missing join hint on an ID column
- No sparse_ratio context when the column is > 10% null
- Enum values listed that do not appear in the BigQuery profile
- A column entry that does not exist in BigQuery (hallucinated column)

### Step 8: Source-level documentation

For source YAMLs, the source-level `description` must include:

- What the external system is and what domain it covers
- Ingestion method (Fivetran, Cloud Function, etc.)
- Freshness tier from `reference.md → Source Catalog` so a query builder knows data currency
- Key join relationships to other sources

Dataset-specific requirements:

- For `postgres_public`, explicitly state it is replicated via Google Cloud Datastream CDC.
- For legacy datasets (`ablespace`, `agora`, `bamboo`, `clockify`, `healthie`, `medallion`), explicitly label as deprecated/legacy and historical-only.

### Step 9: Verify

**IMPORTANT:** A YAML documentation update is incomplete unless both parse and dedicated tests pass.

```bash
cd schemas && uv run dbt parse --no-partial-parse
```

Then run dedicated tests for only the updated resources:

- Source YAML updates: `cd schemas && uv run dbt test -s source:<source_name>.<table_1> source:<source_name>.<table_2> ...`
- Model YAML updates: `cd schemas && uv run dbt test -s <model_name_1> <model_name_2> ...`

Confirm parse is clean and all dedicated tests pass before considering the task complete.

## Examples

### Enum column (profile returned values)

```yaml
- name: appointment_status
  description: >
    Status of the therapy appointment.
    Possible values: CANCELLED, LATE_CANCELLED, NO_SHOW, OCCURRED, EXCUSED.
    Used to filter billable sessions (OCCURRED) from non-billable ones.
  config:
    meta:
      sparse_ratio: 0.0
```

### Sparse ID column (profile showed 97% null)

```yaml
- name: account_id
  description: >
    Salesforce Account ID linking this record to a school district.
    Sparse (97% null) — only populated for records explicitly tied to a
    Salesforce account. Do not use as a join key; prefer opportunity_id
    or account_name for district-level queries.
  config:
    meta:
      sparse_ratio: 0.97
```

### PII column

```yaml
- name: provider_email
  description: >
    Primary email address for the provider. PII — employee email.
    Exclude from aggregate exports and public-facing queries.
    Use provider_id for joins; only surface in provider-scoped internal reports.
  config:
    meta:
      pii: employee_email
      sparse_ratio: 0.04
```

### High-cardinality string (profile returned value = NULL)

```yaml
- name: notes
  description: >
    Free-text session notes written by the provider. High cardinality — do not
    use in GROUP BY or WHERE equality filters. Useful for full-text search
    or qualitative analysis only.
  config:
    meta:
      sparse_ratio: 0.35
```

### Partition column

```yaml
- name: service_date
  description: >
    Date of the therapy session. Format: YYYY-MM-DD.
    Partition key — filter on this column for query performance.
    Joins to billing period calculations via pay_period_start/pay_period_end ranges.
  config:
    meta:
      sparse_ratio: 0.0
```

### NOT NULL column (no sparse_ratio needed)

```yaml
- name: user_id
  description: >
    Zoom-assigned UUID for the employee account. Maps from API field "id".
    Format: alphanumeric string (e.g., "kFFVDbTVBSqUAAAAAAAAAA").
    NOT NULL. Joins to zoom.meetings on host_id and zoom.licenses on user_id.
```

## Related Skills

- `/review-branch` — Review documentation changes before PR
- `/create-pr` — Open PR after documentation pass

## Related Agents

- `dbt-documentation-agent` — Full agent with deep context for complex, multi-file documentation tasks
- `dbt-models-specialist` — DBT model creation and optimization
