---
name: databricks-table-creator
description: Create and populate Databricks Delta tables via interactive REPL or notebook submission. Uses dbx.py for operations and PySpark for data transforms.
---

# Databricks Table Creator

Create Delta tables in Unity Catalog. Choose between the interactive REPL for simple transforms and notebook submission for complex ETL.

## When to Use

- Creating new tables from queries or transforms
- Materializing derived datasets
- Building pipeline output tables

## Prerequisites

Requires `databricks-sdk` and `tenacity`. Run `python3 -c "import databricks.sdk; import tenacity"` to verify. If it fails, stop and ask the user to set up a virtual environment (see `databricks-sdk-foundation` skill). Do NOT install packages on behalf of the user.

For auth and full CLI reference, see `databricks-sdk-foundation`.

## Decision: REPL vs Notebook

| Criteria | Interactive REPL | Notebook |
|----------|-----------------|----------|
| Lines of logic | ≤ 50 | > 50 |
| Complexity | Single-step transform | Multi-step ETL |
| Documentation | Comment block sufficient | Needs markdown cells |
| Reusability | One-off | Will be re-run |

## Interactive REPL Path

Ensure the cluster is running, then create an execution context. The `repl exec` calls below take the cluster ID (`<CID>`, i.e. `<CLUSTER_ID>`) and the context ID (`<CTX>`) produced here:

```bash
python3 dbx.py clusters ensure <CLUSTER_ID>
python3 dbx.py repl create <CLUSTER_ID>
```

### Create Table from Query

```bash
python3 dbx.py repl exec <CID> <CTX> '
# Table: <CATALOG>.<SCHEMA>.<TABLE_NAME>
# Purpose: <what this table contains>
# Source: <source table(s)>
spark.sql("""
  CREATE OR REPLACE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>
  COMMENT "Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>."
  AS
  SELECT col1, col2, transform(col3) AS col3
  FROM <CATALOG>.<SCHEMA>.<SOURCE_TABLE>
  WHERE <condition>
""")'
```
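
When the transform is awkward in SQL, the DataFrame API works through the same REPL. A minimal sketch (the filter and column transform are illustrative placeholders); note that `saveAsTable` does not attach a table comment, so add one explicitly afterwards:

```bash
python3 dbx.py repl exec <CID> <CTX> '
# Table: <CATALOG>.<SCHEMA>.<TABLE_NAME>
# Purpose: <what this table contains>
# Source: <source table(s)>
from pyspark.sql import functions as F

df = (
    spark.table("<CATALOG>.<SCHEMA>.<SOURCE_TABLE>")
    .filter(F.col("col1").isNotNull())           # illustrative filter
    .withColumn("col3", F.upper(F.col("col3")))  # illustrative transform
)
df.write.mode("overwrite").saveAsTable("<CATALOG>.<SCHEMA>.<TABLE_NAME>")

# saveAsTable does not set a table comment; add it explicitly
spark.sql("""
  COMMENT ON TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>
  IS "Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>."
""")
print("Table written and commented")
'
```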

### Add Column Descriptions After Creation

Always document columns immediately after creating a table. See `databricks-table-documenter` for bulk patterns.

```bash
python3 dbx.py repl exec <CID> <CTX> '
table = "<CATALOG>.<SCHEMA>.<TABLE_NAME>"
columns = {
    "col1": "<description>",
    "col2": "<description>",
    "col3": "<description>",
}
for col, desc in columns.items():
    # Escape backslashes and double quotes so the description is a valid SQL string literal
    safe_desc = desc.replace("\\", "\\\\").replace("\"", "\\\"")
    spark.sql(f"ALTER TABLE {table} ALTER COLUMN {col} COMMENT \"{safe_desc}\"")
print("Columns documented")
'
```
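
To spot-check that the comments landed, `DESCRIBE` the table and inspect the `comment` column:

```bash
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("DESCRIBE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>").select("col_name", "comment").show(truncate=False)'
```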

## Notebook Path

For complex ETL (> 50 lines, multi-step, needs documentation), use `databricks-notebook-worker` to create a notebook, push it, and run it.
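
A rough sketch of the notebook shape this implies, in Databricks source format (the `# Databricks notebook source`, `# COMMAND ----------`, and `# MAGIC` markers are the standard cell delimiters; step contents are placeholders):

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Build <CATALOG>.<SCHEMA>.<TABLE_NAME>
# MAGIC Purpose: <what this table contains>. Source: <source table(s)>. Owner: <team>.

# COMMAND ----------
# Step 1: load sources
src = spark.table("<CATALOG>.<SCHEMA>.<SOURCE_TABLE>")

# COMMAND ----------
# Step 2: transform (placeholder logic)
out = src.where("<condition>")

# COMMAND ----------
# Step 3: write the output table idempotently
out.write.mode("overwrite").saveAsTable("<CATALOG>.<SCHEMA>.<TABLE_NAME>")
```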

## Verify After Creation

### On-Cluster Checks (via REPL)

```bash
# Row count
python3 dbx.py repl exec <CID> <CTX> 'print(spark.sql("SELECT COUNT(*) FROM <CATALOG>.<SCHEMA>.<TABLE_NAME>").collect()[0][0])'

# Schema
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("DESCRIBE TABLE <CATALOG>.<SCHEMA>.<TABLE_NAME>").show(truncate=False)'

# Sample rows
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("SELECT * FROM <CATALOG>.<SCHEMA>.<TABLE_NAME> LIMIT 5").show(truncate=False)'
```
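
These three checks can also run as a single `exec` to cut round trips — a minimal combined sketch:

```bash
python3 dbx.py repl exec <CID> <CTX> '
t = "<CATALOG>.<SCHEMA>.<TABLE_NAME>"
print("rows:", spark.table(t).count())
spark.sql(f"DESCRIBE TABLE {t}").show(truncate=False)
spark.table(t).limit(5).show(truncate=False)
'
```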

### Metadata Check (No Cluster Required)

```bash
python3 dbx.py catalog describe <CATALOG>.<SCHEMA>.<TABLE_NAME>
```
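
If you need the same metadata in a script, the SDK exposes it directly. A sketch assuming credentials are already configured as described in `databricks-sdk-foundation`:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth resolved from environment/config
tbl = w.tables.get(full_name="<CATALOG>.<SCHEMA>.<TABLE_NAME>")
print(tbl.comment)  # table-level comment
for col in tbl.columns:
    print(col.name, col.type_text, col.comment)  # per-column comments
```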

## Conventions

1. **Always use `CREATE OR REPLACE TABLE`** -- makes operations idempotent
2. **Always include `COMMENT` on the table** -- purpose, source, and owner
3. **Always add column descriptions** after creation -- see `databricks-table-documenter`
4. **Always verify after creation** -- row count, schema, sample rows
5. **Table naming:** `<CATALOG>.<SCHEMA>.<DESCRIPTIVE_NAME>`
6. **Comment block:** Every REPL table creation must start with a code comment explaining purpose and source
7. **Schema selection:** Organize schemas by business domain so related tables live together, rather than by team or one-off job
