---
name: databricks-notebook-worker
description: Create Databricks Python notebooks, push to workspace, run on cluster, and verify outputs using dbx.py.
---

# Databricks Notebook Worker

End-to-end workflow for creating Databricks Python notebooks, pushing them to the workspace, running them on a cluster, and verifying outputs.

## When to Use

Use for any task that requires creating a Databricks Python notebook, pushing it to the workspace, optionally running it on a cluster, and verifying outputs.

## Prerequisites

Requires `databricks-sdk` and `tenacity`. Run `python3 -c "import databricks.sdk; import tenacity"` to verify. If it fails, stop and ask the user to set up a virtual environment (see `databricks-sdk-foundation` skill). Do NOT install packages on behalf of the user.
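
If helpful, the same check as a tiny script (a sketch; it uses `importlib.metadata` and assumes the distribution names are `databricks-sdk` and `tenacity`):

```python
# Dependency check: run inside the project's virtual environment.
from importlib.metadata import version

import databricks.sdk  # noqa: F401
import tenacity  # noqa: F401

print("databricks-sdk", version("databricks-sdk"))
print("tenacity", version("tenacity"))
```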

For auth and full CLI reference, see `databricks-sdk-foundation`.

## Work Procedure

### 1. Understand the Requirements

Read the task description, preconditions, expected behavior, and verification steps. Check project conventions for table naming, secret scopes, and cluster assignments.

### 2. Create the Notebook Locally

Create a Python file at `notebooks/<notebook_name>.py` following Databricks notebook format:

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Notebook Title
# MAGIC Description of what this notebook does.

# COMMAND ----------

# imports and setup code

# COMMAND ----------

# main logic cells, each separated by # COMMAND ----------

# COMMAND ----------

# MAGIC %md
# MAGIC ## Verification

# COMMAND ----------

# assertion cells that verify outputs
assert row_count == expected, f"Expected {expected} rows, got {row_count}"
```

**Critical conventions:**
- First line must be `# Databricks notebook source`
- Cell separators: `# COMMAND ----------`
- Markdown cells: first line `# MAGIC %md`, every following line of the cell prefixed with `# MAGIC`
- SQL cells: `# MAGIC %sql` on the first line, each SQL line prefixed with `# MAGIC`
- Use `dbutils.secrets.get(scope="<SCOPE>", key="<KEY>")` for secrets
- Use `spark` (SparkSession) for Delta table I/O
- Tables must be idempotent: use `CREATE OR REPLACE TABLE` or `MERGE INTO`, not bare `INSERT INTO` (which appends duplicates on re-runs; see the sketch after this list)
- Include verification assertions at the end
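
A minimal sketch of main-logic and verification cells that follow these conventions (table names, secret scope/key, and columns are placeholders, not project values):

```python
# COMMAND ----------

# Read credentials from a secret scope rather than hard-coding them;
# pass api_token to whatever client or API call needs it.
api_token = dbutils.secrets.get(scope="<SCOPE>", key="<KEY>")

# Idempotent write: CREATE OR REPLACE TABLE yields the same result on re-runs.
spark.sql("""
    CREATE OR REPLACE TABLE <OUTPUT_TABLE> AS
    SELECT id, amount, CURRENT_TIMESTAMP() AS loaded_at
    FROM <SOURCE_TABLE>
    WHERE amount IS NOT NULL
""")

# COMMAND ----------

# Verification: fail the run loudly if the output table is empty.
row_count = spark.sql("SELECT COUNT(*) FROM <OUTPUT_TABLE>").collect()[0][0]
assert row_count > 0, f"Expected rows in <OUTPUT_TABLE>, got {row_count}"
```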

### 3. Push to Databricks Workspace

```bash
python3 dbx.py workspace import ./notebooks/<notebook_name>.py "/Users/<YOUR_EMAIL>/<project>/<notebook_name>" --overwrite
```
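
For reference, the raw SDK call looks roughly like this (a sketch, not necessarily how `dbx.py` implements it; it assumes `WorkspaceClient()` picks up auth from the environment):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat, Language

w = WorkspaceClient()  # auth resolved from env vars / .databrickscfg

with open("notebooks/<notebook_name>.py", "rb") as f:
    # SOURCE format + PYTHON language preserves the "# Databricks notebook source" cells.
    w.workspace.upload(
        "/Users/<YOUR_EMAIL>/<project>/<notebook_name>",
        f,
        format=ImportFormat.SOURCE,
        language=Language.PYTHON,
        overwrite=True,
    )
```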

### 4. Verify Import

```bash
python3 dbx.py workspace list "/Users/<YOUR_EMAIL>/<project>/"
```

### 5. Run the Notebook

Ensure the cluster is running, then submit the notebook as a one-off run:

```bash
python3 dbx.py clusters ensure <CLUSTER_ID>
python3 dbx.py jobs submit --cluster <CLUSTER_ID> --notebook "/Users/<YOUR_EMAIL>/<project>/<notebook_name>" --wait
```

Check run output for errors:

```bash
python3 dbx.py jobs output <RUN_ID>
```
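
Under the hood this corresponds to a one-off run via the Jobs API. A hedged SDK sketch (placeholders as above; `dbx.py`'s exact behavior may differ):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Submit a one-off notebook run on an existing cluster and block until it terminates.
run = w.jobs.submit(
    run_name="<notebook_name> run",
    tasks=[
        jobs.SubmitTask(
            task_key="main",
            existing_cluster_id="<CLUSTER_ID>",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Users/<YOUR_EMAIL>/<project>/<notebook_name>"
            ),
        )
    ],
).result()

# result_state should be SUCCESS; anything else means the notebook failed.
print(run.state.result_state)
```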

### 6. Verify Outputs

Create a REPL context and run verification queries:

```bash
python3 dbx.py repl create <CLUSTER_ID>
# Returns a context ID — use it as <CTX> below; <CID> is the cluster ID

# Row count
python3 dbx.py repl exec <CID> <CTX> 'print(spark.sql("SELECT COUNT(*) FROM <OUTPUT_TABLE>").collect()[0][0])'

# Schema
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("DESCRIBE TABLE <OUTPUT_TABLE>").show(truncate=False)'

# Sample rows
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("SELECT * FROM <OUTPUT_TABLE> LIMIT 5").show(truncate=False)'

# Clean up
python3 dbx.py repl destroy <CID> <CTX>
```

### 7. Export the Notebook with Run Outputs (Optional)

```bash
python3 dbx.py workspace export "/Users/<YOUR_EMAIL>/<project>/<notebook_name>" ./notebooks/<notebook_name>.ipynb --format jupyter
```

### 8. Commit

Commit the notebook file(s) to the local git repo.

## Error Escalation

Stop and escalate when:

| Situation | Detection |
|-----------|-----------|
| Cluster cannot be started | `dbx.py clusters ensure` returns permission error |
| Precondition table missing | `NOT_FOUND` / `AnalysisException` in run output |
| Notebook execution fails | `result_state: FAILED` in `dbx.py jobs status` |
| GPU cluster OOM | OOM errors in `dbx.py jobs output` |
| Secret scope missing | `dbutils.secrets.get()` error in run output |
