---
name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases,
  apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally
  write results to files. Use when the user wants to: (1) Clean, filter, or transform data,
  (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.),
  (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results.
  Prefer this over in-context reasoning for datasets with thousands of rows or complex
  transformations.
---

# Data Wrangler

Transform and export data using DuckDB SQL.

## Contents

- [Usage](#usage) - Command syntax and Windows escaping
- [Explore Mode](#explore-mode) - Quick data profiling
- [Query Mode](#query-mode) - Return results to Claude
- [Write Mode](#write-mode) - Export to files
- [Request/Response Format](#requestresponse-format) - JSON structure
- [Source Types](#source-types) - File, database, and cloud sources
- [Transformations](#transformations) - SQL patterns reference
- [Secrets](#secrets) - Secure credential handling

## Usage

**IMPORTANT - Windows Shell Escaping:**

1. Always `cd` to the skill directory first
2. Use **double quotes** for echo with escaped inner quotes (`\"`)
3. Use **forward slashes** in file paths

```bash
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
```

## Explore Mode

**Get schema, statistics, and sample in one call.** Use before writing queries to understand data structure.

```json
{"mode": "explore", "path": "D:/data/sales.csv"}
```

**Response:**
```json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001     | Alice    | ... |"
}
```

**Options:**
- `sample_rows`: Number of sample rows (default: 10, max: 100)
- `sources`: For database tables (same as query mode)

## Query Mode

Return results directly to Claude for analysis.

### Direct File Queries

```json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
```

### Multi-Source Joins

```json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}
```

## Write Mode

Export query results to files. Add an `output` object to write instead of returning data.

### Basic Write

```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}
```

### Write with Options

```json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}
```

### Output Formats

| Format | Options |
|--------|---------|
| `parquet` | `compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size` |
| `csv` | `header` (default: true), `delimiter`, `compression`, `partition_by` |
| `json` | `array` (true=JSON array, false=newline-delimited) |

### Write Response

Response includes verification info - no need for follow-up queries:

```json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}
```

### Overwrite Protection

By default, existing files are **not** overwritten. Set `options.overwrite: true` to allow.

## Request/Response Format

### Request

```json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}
```

### Query Mode Options

- `max_rows`: Maximum rows to return (default: 200)
- `max_bytes`: Maximum response size (default: 200000)
- `format`: `markdown` (default), `json`, `records`, or `csv`

### Query Mode Response (markdown)

```
| column1 | column2 |
|---|---|
| value1 | value2 |
```

### Query Mode Response (json)

```json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}
```

## Source Types

### File (auto-detects CSV, Parquet, JSON, Excel)

```json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
```

Glob patterns: `{"path": "/logs/**/*.parquet"}`

Custom delimiter: `{"path": "/data/file.csv", "delimiter": "|"}`

### PostgreSQL

```json
{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}
```

### MySQL

```json
{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}
```

### SQLite

```json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
```

### S3

```json
{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}
```

## Transformations

See [TRANSFORMS.md](TRANSFORMS.md) for advanced patterns including:

- **PIVOT/UNPIVOT** - Reshape data between wide and long formats
- **Sampling** - Random subsets with `USING SAMPLE n ROWS` or `SAMPLE 10%`
- **Dynamic columns** - `EXCLUDE`, `REPLACE`, `COLUMNS('pattern')`
- **Window functions** - Running totals, rankings, moving averages
- **Date/time operations** - Extraction, arithmetic, formatting

### Quick Examples

```sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
```

## Workflow

1. **Inspect schema**: `DESCRIBE SELECT * FROM 'file.csv'`
2. **Preview data**: `SELECT * FROM 'file.csv' LIMIT 5`
3. **Transform**: Apply filters, joins, aggregations
4. **Export** (optional): Add `output` to write results

## Error Handling

- If `error` is non-null: Check column names, verify paths
- If `truncated` is true: Use more aggregation or filters
- If write fails with "exists": Set `options.overwrite: true`

## Secrets

Store credentials securely in YAML. See [SECRETS.md](SECRETS.md) for complete documentation.

```json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}
```

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.
