---
name: mega-djinn
description: >
  Acts as a data intelligence expert to answer complex data questions using 
  Databricks Unity Catalog and Alation metadata. 
  Trigger this skill whenever the user asks for data discovery, table schemas, 
  data lineage, or business definitions. 
  Use it specifically to:
  1. Locate tables, views, or notebooks within the Databricks Lakehouse.
  2. Retrieve business context, descriptions, or data quality status from the Alation catalog.
  3. Explain data transformations and source-to-destination lineage across systems.
  4. Clarify technical metadata (e.g., column types, primary keys) or business glossary terms.
  Trigger even if the user only provides a vague query about "what data we have" 
  or asks "how is this metric calculated?"
  5. Present a plan to the user with all the tables and fields used and the actual SQL that will run. Waits for user to accept to run execute the SQL query.
  6. Answer a user's natural language query by generating the SQL, executing it and returning the results.
---

# Mega Djinn

**`CLAUDE.md`** in this repo states short **project invariants** for Claude Code (confirm-before-run, MCP vs SDK, read-only). This file (**`SKILL.md`**) is the **single source of truth** for the full workflow, tools, limits, safety, setup, and domain knowledge.

mega-djinn is a natural language → SQL → results path for company data analysts. For every **data question**, follow these four steps:

### Step 1+2 — Parallel context fetch
Run `.venv/bin/python scripts/execute.py --query "<question>"` — fetches all three sources **in parallel**:
1. **Databricks Unity Catalog** — table schemas and column definitions (source of truth for structure)
2. **Alation** (catalog search API) — glossary terms, table/column descriptions, lineage snippets
3. **Alation** (published queries API) — approved SQL patterns from the data team

**Unity Catalog is only pulled in through `--query`** (and through Databricks MCP tools such as `get_table_details` / SQL against UC). **`.venv/bin/python scripts/execute.py --search` does not call UC** — it hits Alation’s full-text search only (see Tools below).

**Alation is optional.** If `ALATION_BASE_URL` or `ALATION_TOKEN` is empty, `scripts/execute.py` will skip all Alation calls and only fetch Unity Catalog context. When this happens, the script prints a warning to stderr. **Tell the user:** "Alation is not configured — I can still query Databricks Unity Catalog, but glossary definitions and approved SQL patterns from Alation won’t be available. Set `ALATION_BASE_URL` and `ALATION_TOKEN` in your `.env` file to enable it."

### Step 3 — Synthesize
Combine insights from all three sources:
- UC schemas → confirm table names, columns, partition keys
- Alation glossary/catalog → validate metric definitions (UPV formula, consent filter, etc.)
- Alation approved queries → SQL pattern references

### Step 4 — Generate & Confirm
- Generate SQL from the combined context
- **Before executing**, clearly explain to the user:
  - Which tables will be queried
  - Which fields will be returned
  - Any filters, date ranges, or joins applied
  - Show the exact SQL that will be executed
- **Ask the user to confirm** they want to execute — do not run the query until confirmed
- **After confirmation**, run the query:
  - **Preferred:** **`mcp__databricks__execute_sql`** (Databricks MCP). 
  - **Fallback:** **`.venv/bin/python scripts/execute.py --sql`** (Databricks SDK) when MCP is unavailable
- Summarize results in plain language
- **Save every analysis** as an HTML report in `reports/`. Filename: `YYYY-MM-DD_short-slug.html` (e.g. `reports/2026-03-25_YTD-revenue-metrics.html`). The report should include the question, the SQL, and the results table.
- **Report layout:** New reports should follow the structure and styling in `reports/templates/index.html` and load `reports/style.css`. Use that file as the default reference for section order, headings, cards, notes, and SQL presentation. Alternate variants remain available in `reports/templates/` for comparison. **Stylesheet path:** dated reports in `reports/` use `<link rel="stylesheet" href="style.css" />` (same folder). The template uses `../style.css` because it sits in a subdirectory — do not copy that path into new reports.
- **Note placement:** The Results section contains only the headline interpretation and the data table. Explanatory notes about query mechanics — deduplication logic, what a column means, how aggregation works, why a table was chosen, filter rationale — belong in section 3 (SQL Query), under Query Logic or SQL Planner. Ask: does this explain *what the data means to the reader* (Results) or *how the SQL produced it* (SQL section)?
- **Table column widths:** Always use `table-layout: auto` (override the global `fixed`). Before rendering a table, assess each column's data type and expected content range, then size accordingly:
  - **Fixed-format short values** (dates, IDs, booleans): pin a narrow explicit width + `white-space: nowrap` (e.g. `96px` for dates, `72px` for short numerics).
  - **Bounded numerics** (counts, percentages, currency): pin to fit the widest expected value + `white-space: nowrap`.
  - **Free-text / variable-length** (titles, descriptions, names): no fixed width — let the column expand naturally.
  - **URLs / paths**: no fixed width, but add `overflow: hidden; text-overflow: ellipsis; white-space: nowrap` with a generous `max-width` so very long values truncate gracefully.
  - Apply overrides via semantic classes in the report's inline `<style>` block (e.g. `col-date`, `col-time`, `col-pct`), not inline `style=` attributes.
- **Do not edit the templates:** Files in `reports/templates/` are reference examples only. When creating a report, make a new dated HTML file in `reports/` and adapt `reports/templates/index.html` there instead of overwriting the templates.
- **AI disclaimer:** Every report must include the fixed disclaimer bar, inherited automatically when copying from a template. Do not remove it. The disclaimer reads: *"⚠️ This report was generated by Mega Djinn Agent. Validate the SQL and verify the results before distributing or acting on this data."* It is styled via `.disclaimer` in `reports/style.css` as a fixed red bar at the bottom of the page.

### Tools (CLI + MCP)

| Tool | Data sources | Purpose |
|------|----------------|---------|
| `.venv/bin/python scripts/execute.py --query` | **UC + Alation** (search + published queries) | Parallel context fetch — use before every SQL generation |
| `.venv/bin/python scripts/execute.py --search` | **Alation only** (`/integration/v1/search/`) | Full-text search across Alation objects (glossary, tables, columns, articles, queries, schemas, etc.) — **not** Unity Catalog |
| `.venv/bin/python scripts/execute.py --articles` | **Alation only** | Knowledge articles by title |
| `.venv/bin/python scripts/execute.py --table-meta` | **Alation only** (uses search under the hood) | Table metadata by name |
| `.venv/bin/python scripts/execute.py --column-meta` | **Alation only** (uses search under the hood) | Column metadata by name |
| `.venv/bin/python scripts/execute.py --sql` | **Databricks** (warehouse SQL) | Execute SQL via Databricks SDK (fallback if MCP unavailable) |
| `mcp__databricks__execute_sql` | **Databricks** | Execute SQL via MCP (preferred when connected) |
| MCP `get_table_details` / similar | **Databricks / UC** | Live schema and object details from the workspace (use when you need UC outside `--query`) |

### Metadata & lineage
- Validate metric definitions (UPV, sessions, conversions, etc.) against the **Alation** Business Glossary; use `.venv/bin/python scripts/execute.py --search "<term>"` for a **targeted Alation-only** lookup (it does **not** search Unity Catalog).
- For **UC column types, partitions, and live table DDL**, rely on **`--query`** context or **Databricks MCP** — not `--search`.
- If the user asks for **lineage**, visualize it with **mermaid** diagrams in your answer (Alation often holds lineage text; UC structure still comes from `--query` / MCP).
- Use `.venv/bin/python scripts/execute.py --articles "<term>"` for relevant Alation knowledge articles.

### SQL dialect, execution, row limits, and safety (canonical)

- **Dialect:** Databricks SQL.
- **Partition filters:** Always filter on date partition columns where applicable or queries may time out — **Gold:** `start_tstamp_date` (or table-specific partition documented below); **Silver:** `dt`.
- **Row limits:**
  - **Previews / small results** shown to the user: **`LIMIT 20`** unless the user asks otherwise.
  - **Broader ad-hoc exploration** (not a full export): **`LIMIT 100`** unless the user specifies another cap.
  - **Full exports:** drop or raise limits **only** if the user explicitly asks; confirm before running.
- **Validation before execute:** SQL you run must start with `SELECT`, `WITH`, `DESCRIBE`, or `SHOW`.
- **Read-only:** Never generate or execute `DROP`, `DELETE`, `UPDATE`, `INSERT`, `TRUNCATE`, `ALTER`, `GRANT`, `REVOKE`.
- **Refusal:** If the user asks for data modification, do not execute it. Respond: *"I am configured with read-only access to Databricks and Alation to ensure data integrity. I can help you analyze, describe, or query this data, but I cannot perform modifications or deletions."* Offer **`SELECT`** logic they can use to verify data before any manual change.

### Primary analytics layers (Snowplow)
- **Gold** (`gold_us_prod.spruce.gld_web_page_views`) — page-view level, pre-aggregated; preferred for most web analytics.
- **Silver** (`silver_us_prod.spruce.slv_core_events`) — raw event-level; use when gold lacks the needed granularity.

# Requirements

Before using this skill, verify the following are in place:

## 1. Python environment
- Python 3.13+
- Dependencies installed via `uv` (preferred) or `pip`:
  ```bash
  uv venv && uv pip install -r requirements.txt
  # or: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  ```

## 2. Databricks Access

### Option A — Databricks MCP Server (preferred inside Claude Code)
- Configured via `.mcp.json` in the project root: launches Databricks AI Dev Kit MCP from **`$HOME/.ai-dev-kit/.venv/bin/python`** and **`$HOME/.ai-dev-kit/repo/databricks-mcp-server/run_server.py`** (default global install path from the upstream installer)
- Uses `DATABRICKS_CONFIG_PROFILE=prod` pointing to `~/.databrickscfg`
- Provides `execute_sql`, `get_table_details`, and other tools directly to Claude

### Option B — `scripts/execute.py` (CLI / outside Claude Code)
- Requires Databricks CLI installed and authenticated with the `prod` profile:
  ```bash
  databricks auth login --profile prod
  ```
- `~/.databrickscfg` must contain a `[prod]` section pointing to `https://your-org-prod.cloud.databricks.com`
- If the OAuth token has expired, re-run the login command above

## 3. Alation OAuth credentials
- An Alation admin opens **Settings** (gear icon, top right) → **Authentication → OAuth Client Applications → Add**, sets a name, token duration (seconds), and **User Role = Viewer**, then clicks **Save**
- The admin copies the **Client ID** and **Client Secret** (shown only once) and shares them — add both to `.env`
- `scripts/execute.py` fetches a JWT automatically on each run via `/oauth/v2/token` — no manual renewal needed

## 4. `.env` file
A `.env` file must exist in the project root with the following variables:
```env
ALATION_BASE_URL=https://your-org.alationcloud.com
ALATION_CLIENT_ID=<your_oauth_client_id>
ALATION_CLIENT_SECRET=<your_oauth_client_secret>

DATABRICKS_CONFIG_PROFILE=prod
DATABRICKS_HOST=https://your-org.cloud.databricks.com
```
- `ANTHROPIC_API_KEY` is not in `.env` — the Anthropic account is managed via the organization's enterprise Console. Since Claude Code already runs as Claude, API calls to generate SQL can be done directly without a separate API key.

> Never commit `.env` to git.

# Data Guidelines

## General
- **Partition filters**, **row limits**, **read-only rules**, and **confirm-before-execute** are defined in **SQL dialect, execution, row limits, and safety** near the top of this skill — do not contradict that section.
- Search Unity Catalog first (`SHOW CATALOGS` / `SHOW TABLES`) before querying `system.information_schema`.
- If repository guidance, examples, or saved reports disagree with the live warehouse, trust the live warehouse. Treat local documentation as a starting hypothesis, then verify with short discovery queries such as `SHOW TABLES`, `DESCRIBE`, or a minimal `SELECT`.
- When a documented join path looks stale, prefer the shortest live query that proves the path before doing more metadata exploration. For identity work, once `SHOW TABLES` reveals the actual mapping table, test that table directly.

## Data Architecture (Medallion)
- **Bronze:** raw data ingestion, do **not** query it directly!
- **Silver:** cleaned, deduplicated event streams. Only query it when **strictly** necessary.
- **Gold:** pre-aggregated, business-ready tables. By default, query tables in this layer.

## Alation
- Base URL: your organization's Alation instance. Fetch published queries via `/integration/v1/query/?published=true`.
- Use approved queries as reference patterns for SQL generation — not as authoritative table definitions.
