---
name: databricks-table-creator
description: Create and populate Databricks Delta tables via interactive REPL or notebook submission. Uses dbx.py for operations and PySpark for data transforms.
---

# Databricks Table Creator

Create Delta tables in Unity Catalog. Choose between the interactive REPL for simple transforms and notebook submission for complex ETL.

## When to Use

- Creating new tables from queries or transforms
- Materializing derived datasets
- Building pipeline output tables

## Prerequisites

Requires `databricks-sdk` and `tenacity`. Run `python3 -c "import databricks.sdk; import tenacity"` to verify. If it fails, stop and ask the user to set up a virtual environment (see `databricks-sdk-foundation` skill). Do NOT install packages on behalf of the user.

For auth and full CLI reference, see `databricks-sdk-foundation`.

## Decision: REPL vs Notebook

| Criteria | Interactive REPL | Notebook |
|----------|-----------------|----------|
| Lines of logic | ≤ 50 | > 50 |
| Complexity | Single-step transform | Multi-step ETL |
| Documentation | Comment block sufficient | Needs markdown cells |
| Reusability | One-off | Will be re-run |

## Interactive REPL Path

Ensure the cluster is running, then create an execution context. The `repl exec` calls below take the cluster ID (`<CID>`, i.e. `<CLUSTER_ID>`) and the context ID (`<CTX>`) produced here:

```bash
python3 dbx.py clusters ensure <CLUSTER_ID>
python3 dbx.py repl create <CLUSTER_ID>
```

### Create Table from Query

```bash
python3 dbx.py repl exec <CID> <CTX> '
# Table: <CATALOG>.<SCHEMA>.<TABLE_NAME>
# Purpose: <what this table contains>
# Source: <source table(s)>
spark.sql("""
  CREATE OR REPLACE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>
  COMMENT "Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>."
  AS
  SELECT col1, col2, transform(col3) AS col3
  FROM <CATALOG>.<SCHEMA>.<SOURCE_TABLE>
  WHERE <condition>
""")'
```
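
When the transform is awkward in SQL, the DataFrame API works through the same REPL. A minimal sketch (the filter and column transform are illustrative placeholders); note that `saveAsTable` does not attach a table comment, so add one explicitly afterwards:

```bash
python3 dbx.py repl exec <CID> <CTX> '
# Table: <CATALOG>.<SCHEMA>.<TABLE_NAME>
# Purpose: <what this table contains>
# Source: <source table(s)>
from pyspark.sql import functions as F

df = (
    spark.table("<CATALOG>.<SCHEMA>.<SOURCE_TABLE>")
    .filter(F.col("col1").isNotNull())           # illustrative filter
    .withColumn("col3", F.upper(F.col("col3")))  # illustrative transform
)
df.write.mode("overwrite").saveAsTable("<CATALOG>.<SCHEMA>.<TABLE_NAME>")

# saveAsTable does not set a table comment; add it explicitly
spark.sql("""
  COMMENT ON TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>
  IS "Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>."
""")
print("Table written and commented")
'
```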

### Add Column Descriptions After Creation

Always document columns immediately after creating a table. See `databricks-table-documenter` for bulk patterns.

```bash
python3 dbx.py repl exec <CID> <CTX> '
table = "<CATALOG>.<SCHEMA>.<TABLE_NAME>"
columns = {
    "col1": "<description>",
    "col2": "<description>",
    "col3": "<description>",
}
for col, desc in columns.items():
    # Escape backslashes and double quotes so the description is a valid SQL string literal
    safe_desc = desc.replace("\\", "\\\\").replace("\"", "\\\"")
    spark.sql(f"ALTER TABLE {table} ALTER COLUMN {col} COMMENT \"{safe_desc}\"")
print("Columns documented")
'
```
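
To spot-check that the comments landed, `DESCRIBE` the table and inspect the `comment` column:

```bash
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("DESCRIBE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>").select("col_name", "comment").show(truncate=False)'
```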

## Notebook Path

For complex ETL (> 50 lines, multi-step, needs documentation), use `databricks-notebook-worker` to create a notebook, push it, and run it.
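
A rough sketch of the notebook shape this implies, in Databricks source format (the `# Databricks notebook source`, `# COMMAND ----------`, and `# MAGIC` markers are the standard cell delimiters; step contents are placeholders):

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Build <CATALOG>.<SCHEMA>.<TABLE_NAME>
# MAGIC Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>.

# COMMAND ----------
# Step 1: load sources
src = spark.table("<CATALOG>.<SCHEMA>.<SOURCE_TABLE>")

# COMMAND ----------
# Step 2: transform (placeholder logic)
out = src.where("<condition>")

# COMMAND ----------
# Step 3: write the output table idempotently
out.write.mode("overwrite").saveAsTable("<CATALOG>.<SCHEMA>.<TABLE_NAME>")
```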

## Verify After Creation

### On-Cluster Checks (via REPL)

```bash
# Row count
python3 dbx.py repl exec <CID> <CTX> 'print(spark.sql("SELECT COUNT(*) FROM <CATALOG>.<SCHEMA>.<TABLE_NAME>").collect()[0][0])'

# Schema
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("DESCRIBE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>").show(truncate=False)'

# Sample rows
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("SELECT * FROM <CATALOG>.<SCHEMA>.<TABLE_NAME> LIMIT 5").show(truncate=False)'
```
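
These three checks can also run as a single `exec` to cut round trips — a minimal combined sketch:

```bash
python3 dbx.py repl exec <CID> <CTX> '
t = "<CATALOG>.<SCHEMA>.<TABLE_NAME>"
print("rows:", spark.table(t).count())
spark.sql(f"DESCRIBE TABLE {t}").show(truncate=False)
spark.table(t).limit(5).show(truncate=False)
'
```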

### Metadata Check (No Cluster Required)

```bash
python3 dbx.py catalog describe <CATALOG>.<SCHEMA>.<TABLE_NAME>
```
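
If you need the same metadata in a script, the SDK exposes it directly. A sketch assuming credentials are already configured as described in `databricks-sdk-foundation`:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth resolved from environment/config
tbl = w.tables.get(full_name="<CATALOG>.<SCHEMA>.<TABLE_NAME>")
print(tbl.comment)  # table-level comment
for col in tbl.columns:
    print(col.name, col.type_text, col.comment)  # per-column comments
```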

## Conventions

1. **Always use `CREATE OR REPLACE TABLE`** -- makes operations idempotent
2. **Always include `COMMENT` on the table** -- purpose, source, and owner
3. **Always add column descriptions** after creation -- see `databricks-table-documenter`
4. **Always verify after creation** -- row count, schema, sample rows
5. **Table naming:** `<CATALOG>.<SCHEMA>.<DESCRIPTIVE_NAME>`
6. **Comment block:** Every REPL table creation must start with a code comment explaining purpose and source
7. **Schema selection:** Organize schemas by business domain so related tables live together, rather than by team or one-off job
