---
name: synthdata-generate
description: >
  Generate synthetic tabular datasets from YAML schemas. Use this skill when the user wants to create
  sample data, mock data, test data, synthetic datasets, or demo data for any domain — HR directories,
  e-commerce orders, SaaS metrics, healthcare records, financial transactions, security events,
  application logs, IoT sensor readings, CRM pipelines, survey responses, or custom schemas. Ships with
  10+ domain templates and supports custom YAML schemas with Faker-backed fields, statistical
  distributions (normal/lognormal/zipf/poisson), foreign-key integrity, behavioral profiles, and
  temporal event generation. Also trigger when the user says "generate synthetic data", "create fake data",
  "mock dataset", or "test data", or names a specific domain such as "e-commerce data" or "HR data".
version: 0.1.0
allowed-tools: Read Bash Glob Write
---

# Synthdata Generate

Generate synthetic tabular datasets using **bundled Python scripts** — no code generation required. A schema-driven engine reads YAML and produces xlsx, csv, json, sql, or parquet output.

## Prerequisites

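```bash
pip install openpyxl faker numpy pandas pyyaml --break-system-packages
```

If the generator writes Parquet through pandas, `--format parquet` will also need a Parquet engine such as `pyarrow`, which the command above does not install.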
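Before running the generator, it can help to confirm that the prerequisites import cleanly. The sketch below is not part of the skill: `missing_modules` is a hypothetical helper, and it checks import names rather than pip package names (PyYAML installs as `pyyaml` but is imported as `yaml`).

```python
import importlib.util

# Import names for the packages in the pip command above.
# PyYAML is installed as "pyyaml" but imported as "yaml".
REQUIRED_MODULES = ["openpyxl", "faker", "numpy", "pandas", "yaml"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites are importable.")
```

If anything is reported missing, re-run the `pip install` command before Step 2.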

## Workflow

### Step 1: Interview the user

Ask three questions:

1. **What domain?** — Offer the template list (see below). If none fit, offer `blank-slate` + custom columns.
2. **What scale?** — `quick` (demo-size), `medium` (default, hundreds–thousands), `thorough` (full-fidelity, may be 10K+ rows).
3. **What format?** — xlsx (default), csv, json, sql, or parquet.

List available templates:

```bash
python scripts/generate.py --list-templates
```

Present defaults as: *"I'll use `<template>` at `medium` effort → xlsx by default. Which of these would you like to change?"*

**Wait for the user's response before proceeding.**

#### Effort Levels

| Level | Rows | Profiles | When to use |
|-------|------|----------|-------------|
| **quick** | Smallest row counts (template-defined) | Flat baseline (no behavioral variation) | Smoke tests, schema checks, fast iteration |
| **medium** | Default row counts | Behavioral profiles with jitter | Day-to-day use |
| **thorough** | Largest row counts | Full profile variation | Stakeholder-facing deliverables |

Effort controls **row counts**, **profile richness**, and (for some templates) **time window length**. The schema itself (columns, FK structure, distributions) is identical across all levels.

### Step 2: Run the generator

```bash
# Use a built-in template
python scripts/generate.py --template hr-directory --effort medium --output ./hr.xlsx

# Use a custom schema file
python scripts/generate.py --schema ./my-schema.yaml --effort medium --output ./out/

# Override output format
python scripts/generate.py --template saas-metrics --format json --output ./saas.json
```

CLI flags:

| Flag | Default | Description |
|------|---------|-------------|
| `--template <name>` | — | Use a bundled template (see `--list-templates`) |
| `--schema <path>` | — | Use a custom YAML schema file |
| `--effort` | `medium` | `quick` / `medium` / `thorough` |
| `--output` | `./synthdata_output` | File path (xlsx/json/sql) or directory (csv/multi-table json) |
| `--format` | schema default | Override output format: xlsx, csv, json, sql, parquet |
| `--seed` | `42` | Random seed for reproducibility |
| `--locale` | `en_US` | Faker locale (e.g. `en_GB`, `de_DE`, `ja_JP`) |
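When the generator is driven from another Python script, the flags above can be assembled programmatically. This is a minimal sketch that uses only the flags documented in the table. `build_command` and `run_generator` are hypothetical helpers, and the rule that exactly one of `--template` or `--schema` must be given is an assumption, not confirmed behavior of `generate.py`.

```python
import subprocess
import sys

EFFORTS = ("quick", "medium", "thorough")
FORMATS = ("xlsx", "csv", "json", "sql", "parquet")


def build_command(output, template=None, schema=None, effort="medium",
                  fmt=None, seed=42, locale="en_US"):
    """Assemble an argv list for scripts/generate.py from the documented flags."""
    # Assumption: --template and --schema are alternatives, so require exactly one.
    if (template is None) == (schema is None):
        raise ValueError("pass exactly one of template or schema")
    if effort not in EFFORTS:
        raise ValueError(f"unknown effort: {effort}")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")

    cmd = [sys.executable, "scripts/generate.py"]
    cmd += ["--template", template] if template else ["--schema", schema]
    cmd += ["--effort", effort, "--output", output,
            "--seed", str(seed), "--locale", locale]
    if fmt is not None:
        # Leave --format off to keep the schema's default output format.
        cmd += ["--format", fmt]
    return cmd


def run_generator(**kwargs):
    """Run the generator; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**kwargs), check=True)
```

For example, `run_generator(output="./hr.xlsx", template="hr-directory")` would invoke the same command as the first example in Step 2, with the default seed and locale spelled out explicitly.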

### Step 3: Custom schema authoring (if no template fits)

Copy `templates/blank-slate.yaml` as a starting point, edit, then run with `--schema`. See `references/schema-spec.md` for the full spec.

Key concepts:

- **Column types**: `id`, `faker`, `choice`, `int`, `float`, `bool`, `date`, `timestamp`, `constant`, `formula`, `ref`
- **Distributions**: `uniform`, `normal`, `lognormal`, `exponential`, `poisson`, `gamma`, `pareto`
- **Foreign keys**: child tables declare `foreign_key: {column, references: "table.col", distribution: uniform|zipfian}`
- **Profiles**: behavioral personas (e.g., 5% whales, 20% dormant) that can drive `rows_per_parent` via `lam_expr`
- **Temporal**: `timestamp` column with `pattern: uniform | business-hours | diurnal`, `weekday_only`, `start`/`end`
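To make the foreign-key and profile concepts concrete, here is a small standard-library sketch of the underlying ideas. It is not the engine's implementation: the function names, the `share` and `lam` keys, and the 1/rank weighting for `zipfian` are all assumptions chosen for illustration. See `references/schema-spec.md` for the real schema keys.

```python
import math
import random


def sample_foreign_keys(parent_ids, n_children, distribution="uniform", rng=None):
    """Pick a parent id for each child row, so every FK value is valid."""
    rng = rng if rng is not None else random.Random(42)
    if distribution == "uniform":
        weights = [1.0] * len(parent_ids)
    elif distribution == "zipfian":
        # Zipf-like weights: the parent at rank k gets weight 1/k.
        weights = [1.0 / rank for rank in range(1, len(parent_ids) + 1)]
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return rng.choices(parent_ids, weights=weights, k=n_children)


def poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def rows_per_parent(profiles, n_parents, rng=None):
    """Assign each parent a profile by share, then draw its child-row count."""
    rng = rng if rng is not None else random.Random(42)
    names = list(profiles)
    shares = [profiles[name]["share"] for name in names]
    assigned = rng.choices(names, weights=shares, k=n_parents)
    return [poisson(profiles[name]["lam"], rng) for name in assigned]


# Hypothetical personas; "share" and "lam" are illustrative keys only.
PROFILES = {
    "whale": {"share": 0.05, "lam": 40},
    "regular": {"share": 0.75, "lam": 5},
    "dormant": {"share": 0.20, "lam": 0.2},
}
```

With `zipfian`, a few parents attract most of the child rows, which resembles real-world skew such as a handful of customers placing most orders. With `uniform`, child rows spread evenly. Profiles work from the other direction: each parent's persona sets the rate at which its child rows are generated.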

### Step 4: Deliver

Report row counts per table, file path, and seed. If the user wants to iterate, re-run with a different `--seed` or `--effort`.
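For csv output, the per-table row counts can be gathered with a short script like the one below. It is a sketch under two assumptions not confirmed by this document: that the output directory holds one `<table>.csv` file per table, and that each file starts with a header row. `csv_row_counts` and `format_report` are hypothetical helpers.

```python
import csv
from pathlib import Path


def csv_row_counts(output_dir):
    """Map each CSV file's stem to its number of data rows (header excluded)."""
    counts = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            counts[path.stem] = sum(1 for _ in reader)
    return counts


def format_report(counts, output_dir, seed):
    """Build the delivery summary: rows per table, output location, and seed."""
    lines = [f"- {table}: {rows} rows" for table, rows in counts.items()]
    lines.append(f"Output: {output_dir}")
    lines.append(f"Seed: {seed}")
    return "\n".join(lines)
```

Calling `print(format_report(csv_row_counts("./out"), "./out", 42))` would give a summary suitable for pasting into the final message to the user.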

## Available Templates

| Template | Tables | Use case |
|---|---|---|
| `hr-directory` | departments, employees | Employee directories, HRIS test data |
| `ecommerce-orders` | customers, products, orders | Retail/marketplace analytics, RFM analysis |
| `saas-metrics` | accounts, users, events, subscriptions | Product analytics, MRR dashboards |
| `healthcare-patients` | patients, providers, encounters, claims | EHR sandboxes, payer analytics |
| `financial-transactions` | customers, accounts, transactions | Banking, fraud-detection training data |
| `security-events` | users, devices, alerts, incidents | SIEM demos, SOC training |
| `log-events` | services, requests, errors | Log-analytics dashboards, observability |
| `iot-sensors` | devices, readings, events | IoT platform testing, anomaly detection |
| `crm-pipeline` | contacts, companies, deals, activities | Sales enablement, pipeline dashboards |
| `survey-responses` | respondents, questions, responses | Research, market surveys |
| `healthcare-hrm-security` | users, threats, sims, training, DLP, abuse, monthly_risk | Human risk intelligence for healthcare security teams |
| `blank-slate` | users | Minimal starter for custom schemas |

## Additional Resources

- `references/schema-spec.md` — Complete YAML schema reference
- `references/distributions.md` — Statistical distribution guide
- `references/faker-fields.md` — Faker method cheat sheet
- `examples/` — Worked examples of custom schemas
