---
name: synthdata-anonymize
description: >
  Replace real PII in a dataset with realistic synthetic equivalents while preserving row counts, column
  types, and statistical distributions. Detects names, emails, phones, SSNs, addresses, credit cards, and
  user-identifying columns via name heuristics + value patterns. Use this skill when the user wants to
  "anonymize this dataset", "scrub PII", "make this data safe to share", "de-identify real data",
  "create a synthetic copy", or needs a sharable version of production data without exposing individuals.
version: 0.1.0
allowed-tools: Read Bash Glob Write Edit
---

# Synthdata Anonymize

Turn a real dataset into a safe synthetic equivalent: original PII is replaced with Faker-generated
values, while row counts, column types, numeric distributions, and categorical frequencies are preserved.

## Prerequisites

```bash
pip install openpyxl faker numpy pandas --break-system-packages
```

## Workflow

### Step 1: Load & scan

Ask the user for the input file (xlsx/csv/json) and output path. Then run the detector in scan-only
mode to list candidate PII columns with confidence scores:

```bash
python scripts/anonymize.py --input data.xlsx --scan
```

The detector flags columns via two signals:
- **Column name heuristics** — `name`, `email`, `phone`, `ssn`, `address`, `dob`, `ip`, `credit_card`, etc.
- **Value pattern matching** — regex for email / phone / SSN / credit-card / IPv4 in sample rows

Each flagged column receives a suggested Faker field (e.g., `email_address → faker.email`,
`home_phone → faker.phone_number`, `full_name → faker.name`).
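
The two signals can be combined in a few lines. The following is a minimal sketch, not the contents of `scripts/anonymize.py`: the hint table, the regexes, the 0.5 threshold, and the `scan_column` helper are all assumptions made for illustration. The suggested method names (`email`, `phone_number`, `ssn`, `name`, `address`, `ipv4`) are returned as plain strings.

```python
import re

# Hypothetical hint table: substring of the column name -> suggested Faker method.
# Order matters: "email" is checked before "address" so "email_address" maps to email.
NAME_HINTS = {
    "email": "email",
    "phone": "phone_number",
    "ssn": "ssn",
    "address": "address",
    "name": "name",
}

# Hypothetical value patterns; deliberately loose, for illustration only.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
}


def scan_column(col_name, sample_values):
    """Return (suggested_method, confidence) for one column."""
    lowered = col_name.lower()
    name_method = next((m for hint, m in NAME_HINTS.items() if hint in lowered), None)

    values = [str(v) for v in sample_values if v is not None]
    value_method, ratio = None, 0.0
    for method, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        r = hits / len(values) if values else 0.0
        if r > ratio:
            value_method, ratio = method, r

    # Prefer the value-based suggestion when at least half the samples match.
    method = value_method if ratio >= 0.5 else name_method
    name_score = 0.5 if name_method else 0.0
    return method, round(name_score + 0.5 * ratio, 2)


print(scan_column("email_address", ["a@b.com", "c@d.org"]))
print(scan_column("department", ["Sales", "HR"]))
```

A column that scores on only one signal (a name hit with no matching values, or the reverse) ends up with a middling confidence, which is the kind of column worth surfacing to the user in Step 2.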

### Step 2: Confirm the plan

Show the user the detection report. Ask them to:
1. **Confirm** each auto-detected mapping, or
2. **Override** any column (leave it unchanged, or change the Faker method), or
3. **Add** columns the detector missed.

Columns the user marks as "keep" stay unchanged. This is the user's chance to protect join keys,
timestamps, geographies, or categorical fields that must retain real values.

### Step 3: Anonymize

```bash
python scripts/anonymize.py --input data.xlsx --output data_anon.xlsx \
    --map "email=email,phone=phone_number,full_name=name,ssn=ssn" \
    --keep "department,hire_date,salary" \
    --seed 42
```

CLI flags:

| Flag | Description |
|------|-------------|
| `--input` | Source dataset (xlsx/csv/json) |
| `--output` | Output path (default: `<input>_anon.<ext>`) |
| `--scan` | Detect-only; print report and exit |
| `--map` | Comma-separated list of `col=faker_method` overrides |
| `--keep` | Comma-separated list of columns to pass through unchanged |
| `--drop` | Comma-separated list of columns to remove entirely |
| `--preserve-joins` | Columns that must map consistently (same real value → same fake value) |
| `--locale` | Faker locale (default `en_US`) |
| `--seed` | Seed for reproducible anonymization |
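
To make the `--map` and `--keep` formats concrete, here is one way the comma-separated values could be parsed and sanity-checked. This is a sketch under assumptions: `parse_map`, `parse_list`, and `find_conflicts` are hypothetical helpers, and the actual script may parse its flags differently.

```python
def parse_map(spec):
    """Parse 'col=method,col2=method2' into a dict; reject malformed entries."""
    mapping = {}
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        col, sep, method = (s.strip() for s in item.partition("="))
        if not sep or not col or not method:
            raise ValueError(f"expected col=faker_method, got {item!r}")
        mapping[col] = method
    return mapping


def parse_list(spec):
    """Parse 'a,b,c' into ['a', 'b', 'c'], ignoring blanks and stray spaces."""
    return [col.strip() for col in spec.split(",") if col.strip()]


def find_conflicts(mapping, keep, drop):
    """Return columns listed under more than one of --map, --keep, --drop."""
    seen, conflicts = set(), set()
    for group in (set(mapping), set(keep), set(drop)):
        conflicts |= group & seen
        seen |= group
    return sorted(conflicts)


mapping = parse_map("email=email,phone=phone_number,full_name=name,ssn=ssn")
keep = parse_list("department,hire_date,salary")
print(mapping)
print(find_conflicts(mapping, keep, drop=[]))
```

Checking for conflicts up front avoids an ambiguous plan in which one column is both mapped and kept.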

### Step 4: Preserve joins (optional)

If the dataset has foreign keys (e.g., `customer_id` in `orders` referencing `customers.customer_id`),
pass `--preserve-joins customer_id` so the same real `customer_id` always maps to the same fake one
in every table where it appears. This keeps referential integrity intact.
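
The idea behind a consistent mapping can be sketched with a shared lookup that is reused for every table. This is an illustration rather than the script's implementation: `ConsistentMapper` and the `CUST-` value format are hypothetical, and a Faker method would presumably generate the fake value in practice.

```python
import random


class ConsistentMapper:
    """Map each real value to one fake value, reused across tables in a run."""

    def __init__(self, seed):
        self._rng = random.Random(seed)  # seeded, so runs are repeatable
        self._lookup = {}                # real value -> fake value
        self._used = set()               # fake values already handed out

    def fake(self, real):
        if real not in self._lookup:
            candidate = f"CUST-{self._rng.randrange(10**8):08d}"
            while candidate in self._used:  # keep fake values unique
                candidate = f"CUST-{self._rng.randrange(10**8):08d}"
            self._used.add(candidate)
            self._lookup[real] = candidate
        return self._lookup[real]


customers = [{"customer_id": "C001"}, {"customer_id": "C002"}]
orders = [
    {"order_id": 1, "customer_id": "C002"},
    {"order_id": 2, "customer_id": "C001"},
    {"order_id": 3, "customer_id": "C002"},
]

# One mapper instance is shared by both tables, so the join still works.
mapper = ConsistentMapper(seed=42)
anon_customers = [{**row, "customer_id": mapper.fake(row["customer_id"])} for row in customers]
anon_orders = [{**row, "customer_id": mapper.fake(row["customer_id"])} for row in orders]

known_ids = {row["customer_id"] for row in anon_customers}
print(all(row["customer_id"] in known_ids for row in anon_orders))
```

Because the lookup lives in memory, this sketch only guarantees consistency for tables processed by the same mapper instance.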

### Step 5: Report

Print:
- Number of columns anonymized, kept, dropped
- Before/after row counts (should match)
- Per-column sample of before→after values (with original truncated for safety)
- A distribution check for numeric columns (mean/std preserved)
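
The distribution check might look like the sketch below, which compares mean and sample standard deviation before and after anonymization. The `distribution_check` helper and its 5% relative tolerance are assumptions chosen for illustration, not values taken from the script.

```python
import math
import statistics


def distribution_check(before, after, rel_tol=0.05):
    """Compare mean and sample std of a numeric column before and after."""
    report = {"rows_match": len(before) == len(after)}
    for label, fn in (("mean", statistics.mean), ("std", statistics.stdev)):
        b, a = fn(before), fn(after)
        report[label] = {
            "before": b,
            "after": a,
            "ok": math.isclose(b, a, rel_tol=rel_tol),
        }
    return report


salary_before = [50000, 60000, 70000]
salary_after = [50500, 59500, 70000]
for key, value in distribution_check(salary_before, salary_after).items():
    print(key, value)
```

`statistics.stdev` needs at least two values, so a real report would have to handle single-row columns separately.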

## How it works

- **String PII columns:** each unique real value is mapped to a unique fake value via a deterministic
  hash-keyed lookup. This guarantees `--preserve-joins` consistency within a run.
- **Numeric columns:** passed through unchanged by default (numbers rarely identify individuals
  directly). If the user marks a numeric column as PII (e.g., salary), the script can jitter it with
  ±5% noise while preserving the column mean.
- **Date columns:** passed through unchanged by default. If flagged, the script shifts every value by
  the same random offset, which preserves the intervals between dates but breaks calendar alignment.
- **Categorical columns:** passed through unchanged by default. Only flagged if values look like
  identifiers (emails, phones, etc.).
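
The numeric and date transforms above can be illustrated with the standard library. This is a sketch under assumptions: `jitter_preserving_mean` and `shift_dates` are hypothetical names, re-centering is just one way to keep the mean fixed, and the one-year offset range is arbitrary.

```python
import random
from datetime import date, timedelta


def jitter_preserving_mean(values, rng, pct=0.05):
    """Apply up to +/-pct multiplicative noise, then re-center on the original mean."""
    noisy = [v * (1 + rng.uniform(-pct, pct)) for v in values]
    # Shift every value by the same amount so the column mean is unchanged.
    shift = (sum(values) - sum(noisy)) / len(values)
    return [n + shift for n in noisy]


def shift_dates(dates, rng, max_days=365):
    """Move every date by the same random, non-zero number of days."""
    days = rng.choice([-1, 1]) * rng.randint(1, max_days)
    offset = timedelta(days=days)
    return [d + offset for d in dates]


rng = random.Random(42)  # seeded so the result is reproducible

salaries = [52000.0, 61000.0, 75000.0, 88000.0]
jittered = jitter_preserving_mean(salaries, rng)
print(sum(salaries) / len(salaries), sum(jittered) / len(jittered))

hire_dates = [date(2020, 1, 15), date(2021, 6, 1)]
shifted = shift_dates(hire_dates, rng)
print(hire_dates[1] - hire_dates[0], shifted[1] - shifted[0])
```

One trade-off of re-centering is that an individual value can land slightly outside the ±5% band, since the corrective shift is added on top of the noise.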

## Safety

- **Never overwrites input** — always writes to `--output` or a suffixed path
- **Deterministic with seed** — same input + same seed → same output (auditable)
- **Fails closed** — columns the detector isn't sure about are reported, not silently passed through