---
name: databricks-interactive-repl
description: Interactive code execution on Databricks clusters via dbx.py. Provides a stateful Python REPL where variables persist across commands.
---

# Databricks Interactive REPL

Execute code interactively on a running Databricks cluster. Variables persist between commands in the same context.

## When to Use

- Quick queries, transforms, or exploratory work (< 5 min runtime)
- Any task that needs interactive feedback (inspect results, iterate)

For long-running jobs (> 5 min) or work needing an audit trail, use `databricks-job-runner` instead.

## Prerequisites

Requires `databricks-sdk` and `tenacity`. Run `python3 -c "import databricks.sdk; import tenacity"` to verify. If it fails, stop and ask the user to set up a virtual environment (see `databricks-sdk-foundation` skill). Do NOT install packages on behalf of the user.

For auth and full CLI reference, see `databricks-sdk-foundation`.

## Cluster Setup

Ensure the cluster is running before creating a context:

```bash
python3 dbx.py clusters ensure <CLUSTER_ID>
```

## Context Lifecycle

### 1. Create a Context

```bash
python3 dbx.py repl create <CLUSTER_ID>
```

Returns: `{"context_id": "<CTX_ID>"}`

### 2. Execute a Command

```bash
python3 dbx.py repl exec <CLUSTER_ID> <CTX_ID> 'print("hello world")'
```

Returns: `{"result_type": "text", "data": "hello world"}`

Blocks until the command completes. No manual polling needed.
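Because the call blocks, its stdout can be parsed as soon as it returns. A minimal sketch, where the raw string stands in for captured command output (its shape matches the documented return value above):

```python
import json

# Stand-in for stdout captured from the exec command above.
raw = '{"result_type": "text", "data": "hello world"}'
result = json.loads(raw)

if result["result_type"] == "text":
    print(result["data"])  # hello world
```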

### 3. Execute from File

```bash
python3 dbx.py repl exec <CLUSTER_ID> <CTX_ID> --file script.py
```

### 4. Destroy the Context

Always destroy the context when finished to free cluster resources:

```bash
python3 dbx.py repl destroy <CLUSTER_ID> <CTX_ID>
```
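To make cleanup automatic, the create/exec/destroy sequence can be wrapped in a context manager. This is a sketch, not part of `dbx.py`: `_run_dbx` assumes the CLI prints JSON to stdout, and the `run` parameter is injected so the logic can be exercised without a live cluster.

```python
import contextlib
import json
import subprocess

def _run_dbx(*args: str) -> dict:
    """Invoke dbx.py and parse its JSON stdout (assumed output shape)."""
    proc = subprocess.run(
        ["python3", "dbx.py", *args], check=True, capture_output=True, text=True
    )
    return json.loads(proc.stdout)

@contextlib.contextmanager
def repl_context(cluster_id: str, run=_run_dbx):
    """Create a REPL context and always destroy it, even on error."""
    ctx_id = run("repl", "create", cluster_id)["context_id"]
    try:
        yield ctx_id
    finally:
        run("repl", "destroy", cluster_id, ctx_id)
```

With this wrapper, `with repl_context(cid) as ctx: ...` guarantees the destroy step runs even if an intermediate command raises.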

## Result Types

| result_type | Meaning | Data field |
|-------------|---------|------------|
| `text` | Plain text (print output) | `data` (string) |
| `table` | Tabular data (DataFrame) | `data` (array), `schema` (columns) |
| `error` | Execution error | `error` (summary), `cause` (traceback) |
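A small dispatcher over these shapes can turn any result into readable text. This is a sketch against the fields in the table; the exact layout of `schema` is not specified here, so it is rendered as-is:

```python
import json

def summarize(raw: str) -> str:
    """Render a repl exec JSON payload per the result-type table above."""
    result = json.loads(raw)
    kind = result["result_type"]
    if kind == "text":
        return result["data"]
    if kind == "table":
        return f"{len(result['data'])} rows; schema: {result.get('schema')}"
    if kind == "error":
        return f"{result['error']}\n{result.get('cause', '')}"
    return f"unrecognized result_type: {kind}"
```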

## SQL via Python

Use `spark.sql()` inside the Python context:

```bash
# Display results
python3 dbx.py repl exec <CID> <CTX> 'spark.sql("SELECT * FROM catalog.schema.table LIMIT 20").show()'

# Markdown output
python3 dbx.py repl exec <CID> <CTX> 'print(spark.sql("SELECT * FROM catalog.schema.table LIMIT 20").toPandas().to_markdown())'

# Scalar result
python3 dbx.py repl exec <CID> <CTX> 'print(spark.sql("SELECT COUNT(*) FROM catalog.schema.table").collect()[0][0])'
```

## Context Reuse

Variables persist across commands in the same context:

```bash
# Command 1
python3 dbx.py repl exec <CID> <CTX> 'df = spark.table("catalog.schema.my_table")'

# Command 2 (df is still available)
python3 dbx.py repl exec <CID> <CTX> 'print(df.filter("col > 5").count())'
```

## Security Considerations

The REPL executes arbitrary Python code with the full permissions of the authenticated Databricks principal. This means:

- **Never execute code derived from untrusted input.** If a user provides code to run, review it before execution.
- **Never run `os.system()`, `subprocess`, or shell commands** through the REPL unless explicitly requested by the user.
- **The REPL can read/write any data the principal has access to.** Be cautious with DELETE, DROP, or TRUNCATE operations.
- **Do not exfiltrate data.** Never send data from the cluster to external endpoints unless explicitly instructed.

## Error Handling

| Situation | Detection | Recovery |
|-----------|-----------|----------|
| Context expired | `result_type: "error"` mentioning an invalid context | Create a new context with `dbx.py repl create` |
| Cluster terminated | Error saying the cluster is not running | Run `dbx.py clusters ensure <ID>`, then create a new context |
| Command timeout | Command hangs | Cancel and retry, or switch to a notebook |
| Python error | `result_type: "error"` | Read `cause` for the traceback, fix, and re-execute |
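The first two recovery paths above can be automated with a single-retry helper. A sketch with injected callables so the logic is testable without a cluster; the substring checks used to classify each failure are assumptions about the error text:

```python
def exec_with_recovery(cluster_id, ctx_id, code, *, run, create, ensure):
    """Run `code`; on an invalid-context or stopped-cluster error, recover
    once and retry. Callable contracts (assumed for this sketch):

      run(cluster_id, ctx_id, code) -> result dict
      create(cluster_id)            -> new context id
      ensure(cluster_id)            -> start the cluster if needed
    """
    result = run(cluster_id, ctx_id, code)
    if result.get("result_type") != "error":
        return result, ctx_id
    message = result.get("error", "").lower()
    if "context" in message:        # context expired: new context only
        ctx_id = create(cluster_id)
    elif "cluster" in message:      # cluster terminated: restart it first
        ensure(cluster_id)
        ctx_id = create(cluster_id)
    else:                           # genuine Python error: surface it
        return result, ctx_id
    return run(cluster_id, ctx_id, code), ctx_id
```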
