---
name: cleaning-data
description: Systematic data quality remediation - detect duplicates/outliers/inconsistencies, design cleaning strategy, execute transformations, verify results (component skill for DataPeeker analysis sessions)
---

# Cleaning Data - Component Skill

## Purpose

Use this skill when:
- Have completed the `importing-data` skill with quality report generated
- Need to address data quality issues before analysis (duplicates, outliers, NULL handling, free text categorization)
- Want systematic approach to cleaning decisions with documented rationale
- Need to create clean tables ready for process skills (exploratory-analysis, guided-investigation, etc.)
- Following DataPeeker principle: **cleaning is ALWAYS mandatory** even if minimal issues found

This skill is a **prerequisite** for all DataPeeker analysis workflows and consumes the quality report from importing-data.

## Prerequisites

Before using this skill, you MUST:
1. Have completed the `importing-data` skill successfully
2. Have access to `05-quality-report.md` generated by importing-data
3. Have `raw_*` table(s) in `data/analytics.db`
4. Be familiar with basic SQLite for validation queries
5. Understand data quality concepts: duplicates, outliers, NULL handling, categorical standardization

## Data Cleaning Process

Create a TodoWrite checklist for the 5-phase data cleaning process:

```
Phase 1: Quality Report Review - pending
Phase 2: Issue Detection (Agent-Delegated) - pending
Phase 3: Cleaning Strategy Design - pending
Phase 4: Cleaning Execution - pending
Phase 5: Verification & Documentation - pending
```

Mark each phase as you complete it. Document all findings in numbered markdown files (`01-cleaning-scope.md` through `05-verification-report.md`) within your analysis workspace directory.

---

## Phase 1: Quality Report Review

**Goal:** Read quality report from importing-data, understand detected issues, prioritize for cleaning based on impact and severity.

### Read Quality Report from importing-data

**Locate the quality report:**
```
analysis/[session-name]/05-quality-report.md
```

This report (generated by importing-data Phase 5) contains:
- Table schema and row counts
- NULL percentages per column
- Duplicate counts and examples
- Outlier flags (3 MAD threshold) per numeric column
- Free text candidates (columns with >50% uniqueness)
- Summary of quality concerns

**Extract key information:**
- Which columns have >10% NULLs?
- How many duplicate rows exist (exact duplicates)?
- Which numeric columns have outliers?
- Which text columns need categorization?

**Document:** Summarize findings from quality report.

### Prioritize Issues Using Framework

**Issue Prioritization Matrix:**

Evaluate each issue on three dimensions:

**1. Impact (% of rows/columns affected)**
- **High:** >10% of rows affected
- **Medium:** 1-10% of rows affected
- **Low:** <1% of rows affected

**2. Severity (effect on analysis validity)**
- **Critical:** Makes analysis invalid or misleading (e.g., key column >50% NULL)
- **Significant:** Reduces data quality for important columns (e.g., duplicates, inconsistent categories)
- **Minor:** Affects edge cases only (e.g., outliers that are legitimate)

**3. Effort (complexity to resolve)**
- **Low:** Simple removal, exclusion, or standardization (1-2 SQL queries)
- **Medium:** Requires sub-agent for categorization or pattern analysis (3-5 queries)
- **High:** Complex deduplication, manual review, or domain expertise needed (>5 queries)

**Combine dimensions to assign priority:**

| Impact | Severity  | Effort | Priority     | Action Timing |
|--------|-----------|--------|--------------|---------------|
| High   | Critical  | Any    | **CRITICAL** | Must address  |
| High   | Significant| Low/Med| **HIGH**     | Must address  |
| Medium | Critical  | Low/Med| **HIGH**     | Must address  |
| Any    | Any       | High   | **MEDIUM**   | Address if time permits |
| Low    | Minor     | Any    | **LOW**      | Document, may skip |

**Document:** Create prioritized issue table in `01-cleaning-scope.md`.

### Define Cleaning Scope and Objectives

Create `analysis/[session-name]/01-cleaning-scope.md` with: ./templates/phase-1.md

**CHECKPOINT:** Before proceeding to Phase 2, you MUST have:
- [ ] Read and understood `05-quality-report.md` from importing-data
- [ ] Extracted all detected issues (NULLs, duplicates, outliers, free text, FK orphans)
- [ ] Applied prioritization matrix (impact × severity × effort)
- [ ] Reviewed FK relationships and orphaned records (if multiple tables)
- [ ] Defined cleaning objectives with success criteria
- [ ] `01-cleaning-scope.md` created with all sections filled

---

## Phase 2: Issue Detection (Agent-Delegated)

**Goal:** Deep-dive investigation of prioritized data quality issues using sub-agents to prevent context pollution.

**CRITICAL:** This phase MUST use sub-agent delegation. DO NOT analyze data in main agent context.

### Detection 1: Duplicate Records

**Use dedicated duplicate detection agents**

**For exact duplicates:**

Invoke the `detect-exact-duplicates` agent:

```
Task tool with agent: detect-exact-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- key_columns: [columns that define uniqueness, from Phase 1 scope]
```

**For near-duplicates (fuzzy matching):**

Invoke the `detect-near-duplicates` agent:

```
Task tool with agent: detect-near-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- text_column: [specific text column flagged in Phase 1]
```

Repeat for each text column requiring fuzzy matching.

**Document findings in `02-detected-issues.md` using template below.**

---

### Detection 2: Outliers (MAD-Based)

**Use dedicated outlier detection agent**

For each numeric column flagged in Phase 1:

Invoke the `detect-outliers` agent:

```
Task tool with agent: detect-outliers
Parameters:
- table_name: raw_[actual_table_name]
- numeric_col: [specific numeric column from Phase 1 scope]
```

Repeat for each numeric column requiring outlier analysis.

**Document findings in `02-detected-issues.md` using template below.**

---

### Detection 4: Referential Integrity Validation

**If multiple tables exist with FK relationships identified in Phase 1:**

**Use dedicated FK validation agent**

For each FK relationship flagged in Phase 1:

Invoke the `detect-foreign-keys` agent (focused validation mode):

```
Task tool with agent: detect-foreign-keys
Parameters:
- database_path: data/analytics.db
- child_table: [specific child table]
- child_column: [FK column]
- parent_table: [specific parent table]
- parent_column: [PK column]
```

The agent will:
- Confirm value overlap percentage (validate Phase 1 findings)
- Identify specific orphaned record IDs
- Assess orphan patterns (recent vs old, specific categories, etc.)
- Quantify impact on analysis (% of records affected in joins)

**If single table:** Skip this detection, document "N/A - Single table" in detected issues report.

**Document findings in `02-detected-issues.md` using template below.**

---

### Review Sub-Agent Findings

After all sub-agents return findings:

**For duplicates:**
- Are exact duplicates truly identical (all columns match)?
- Are near-duplicates legitimate variations or data entry errors?
- Which duplicate groups should be merged vs kept separate?

**For outliers:**
- Are outliers data errors or legitimate extreme values?
- Do outliers follow any pattern (seasonal, geographic, product-specific)?
- Which outliers should be excluded vs capped vs flagged?

**For FK orphans (if multiple tables):**
- Are orphaned records recent (may resolve soon) or old (permanent issue)?
- Do orphans follow a pattern (specific categories, time periods)?
- Can orphans be matched to parent records through fuzzy matching?
- Should orphans be excluded, flagged, or have placeholder parents created?

**Document:** Observations and preliminary decisions for Strategy Phase.

### Create Detected Issues Report

Create `analysis/[session-name]/02-detected-issues.md` with: ./templates/phase-2.md

**CHECKPOINT:** Before proceeding to Phase 3, you MUST have:
- [ ] Delegated duplicate detection to sub-agent (exact and near-duplicates)
- [ ] Delegated outlier detection to sub-agent (MAD-based, all flagged columns)
- [ ] Delegated FK validation to sub-agent (if multiple tables with relationships)
- [ ] Reviewed all sub-agent findings and documented observations
- [ ] Created `02-detected-issues.md` with all sections filled (including FK orphans if applicable)
- [ ] Identified specific records/issues for Phase 3 strategy decisions
- [ ] Listed implications for Phase 3 (what user needs to decide)

---

## Phase 3: Cleaning Strategy Design

**Goal:** Design cleaning approach for each detected issue type, present options with trade-offs, get user confirmation before execution.

### Review Detected Issues from Phase 2

From `02-detected-issues.md`, identify all issue types requiring decisions:
- Exact duplicates: [count]
- Near-duplicates: [count]
- Outliers per column: [counts]
- Free text categorization: [columns]

### Decision Framework: Duplicates

**For exact duplicates, choose ONE approach:**

**Option A: Keep First Occurrence**
- **Method:** ORDER BY rowid, keep lowest rowid per duplicate group
- **Pros:** Simple, deterministic, fast
- **Cons:** First may not be most complete/accurate
- **Use when:** No quality difference between duplicates

**Option B: Keep Most Complete**
- **Method:** Rank by completeness (fewest NULLs), keep best per group
- **Pros:** Preserves maximum information
- **Cons:** More complex, requires completeness scoring
- **Use when:** Duplicates have varying data quality

**Option C: Merge Records**
- **Method:** Combine non-NULL values from all duplicates
- **Pros:** No data loss
- **Cons:** Complex, may create inconsistencies
- **Use when:** Duplicates have complementary information

**For near-duplicates, choose ONE approach:**

**Option A: Auto-Merge High Confidence (>95%)**
- **Method:** Apply fuzzy matching agent's high confidence mappings automatically
- **Pros:** Efficient, addresses most obvious issues
- **Cons:** Small risk of incorrect merges
- **Use when:** Trust fuzzy matching agent's assessment

**Option B: Manual Review All**
- **Method:** Review every near-duplicate group before merging
- **Pros:** Zero incorrect merges
- **Cons:** Time-consuming
- **Use when:** Data quality is critical

**Option C: Skip Near-Duplicates**
- **Method:** Only handle exact duplicates, leave fuzzy matches as-is
- **Pros:** Conservative, no risk
- **Cons:** Misses data quality improvements
- **Use when:** Near-duplicates are legitimate variations

**Document chosen approach in `03-cleaning-strategy.md`**

---

### Decision Framework: Outliers

**For each numeric column with outliers, choose ONE approach:**

**Option A: Exclude Outliers**
- **Method:** Filter out rows where value > 3 MAD from median
- **Pros:** Clean dataset, no extreme values skewing analysis
- **Cons:** Data loss, may exclude legitimate extremes
- **Use when:** Outliers are clearly data errors

**Option B: Cap at Threshold**
- **Method:** Set outliers to 3 MAD threshold (winsorization)
- **Pros:** Preserves row count, reduces extreme influence
- **Cons:** Distorts actual values
- **Use when:** Want to preserve rows but limit extreme influence

**Option C: Flag and Keep**
- **Method:** Add outlier_flag column, keep all data
- **Pros:** No data loss, analysts can filter if needed
- **Cons:** Outliers may still skew analysis if not filtered
- **Use when:** Outliers might be legitimate, need analyst judgment

**Option D: Keep As-Is**
- **Method:** No transformation
- **Pros:** Preserves true data
- **Cons:** Extremes may dominate analysis
- **Use when:** Outliers are legitimate (VIPs, seasonal spikes)

**Document chosen approach per column in `03-cleaning-strategy.md`**

---

### Decision Framework: Free Text Categorization

**If free text columns flagged in Phase 2:**

**Step 1: Invoke categorization agent**

```
Task tool with agent: categorize-free-text
Parameters:
- column_name: [specific text column]
- unique_values: [list from Phase 2 detection]
- business_context: [optional context about what values represent]
```

**Step 2: Review agent's proposed categories**

Agent will return:
- Proposed category schema (3-10 categories)
- Value mappings with confidence levels
- Ambiguous/uncategorizable values flagged

**Step 3: Choose categorization approach:**

**Option A: Accept Agent Proposal**
- **Method:** Use agent's categories and mappings as-is
- **Pros:** Fast, leverages semantic analysis
- **Cons:** May miss business context
- **Use when:** Agent's categories make sense for analysis

**Option B: Modify Categories**
- **Method:** Adjust agent's proposal (rename, merge, split categories)
- **Pros:** Incorporates business knowledge
- **Cons:** Requires manual refinement
- **Use when:** Agent's categories are close but need tweaking

**Option C: Manual Categorization**
- **Method:** Define categories and mappings from scratch
- **Pros:** Full control, perfect fit for business needs
- **Cons:** Time-consuming
- **Use when:** Agent's proposal doesn't fit business model

**Option D: Keep As-Is**
- **Method:** No categorization
- **Pros:** Preserves original data
- **Cons:** High-cardinality text column harder to analyze
- **Use when:** Free text values are inherently unique (IDs, descriptions)

**Document chosen approach in `03-cleaning-strategy.md`**

---

### Decision Framework: Referential Integrity (If Multiple Tables)

**For FK orphaned records identified in Phase 2:**

For each FK relationship with orphaned records, choose ONE approach:

**Option A: Exclude Orphaned Records**
- **Method:** Filter out child records where FK doesn't match any parent PK
- **Pros:** Clean referential integrity, INNER JOINs work correctly
- **Cons:** Data loss
- **Use when:** Orphaned records are data errors with no business value

**Option B: Preserve with NULL**
- **Method:** Set orphaned FK values to NULL (retain child records)
- **Pros:** Preserves child row count, makes orphans explicit
- **Cons:** Loses relationship information, NULL handling required in queries
- **Use when:** Child records have value even without parent context

**Option C: Flag and Keep**
- **Method:** Add `[fk_column]_orphan_flag` column, keep original FK value
- **Pros:** No data loss, analysts can filter as needed
- **Cons:** Referential integrity violated until analyst filters
- **Use when:** Need investigation before deciding, orphans may resolve

**Option D: Create Placeholder Parent**
- **Method:** Insert synthetic parent record (e.g., id=-1, name="Unknown"), map orphans to it
- **Pros:** Preserves referential integrity AND child rows, INNER JOINs work
- **Cons:** Creates synthetic data, may skew parent-level aggregations
- **Use when:** JOINs required but can't lose child records (e.g., orders with unknown customer)

**Document chosen approach per FK relationship in `03-cleaning-strategy.md`**

---

### Decision Framework: Business Rules (Optional)

**If business rules were defined in Phase 1 scope:**

For each rule, choose approach for violations:

**Option A: Exclude Violating Records**
- **Method:** Filter out rows that fail validation
- **Pros:** Clean dataset, only valid data
- **Cons:** Data loss
- **Use when:** Invalid data cannot be corrected

**Option B: Cap/Coerce to Valid Range**
- **Method:** Adjust values to meet constraints
- **Pros:** Preserves rows
- **Cons:** Changes actual data
- **Use when:** Violations are minor (e.g., age 150 → cap at 120)

**Option C: Flag and Keep**
- **Method:** Add validation_flag column
- **Pros:** No data loss, transparent
- **Cons:** Invalid data present
- **Use when:** Need to investigate violations before deciding

**Document chosen approach per rule in `03-cleaning-strategy.md`**

---

### Create Cleaning Strategy Document

Create `analysis/[session-name]/03-cleaning-strategy.md` with: ./templates/phase-3.md

**CHECKPOINT:** Before proceeding to Phase 4, you MUST have:
- [ ] Reviewed all detected issues from Phase 2
- [ ] Chosen approach for duplicates (exact and near)
- [ ] Chosen approach for outliers (per numeric column)
- [ ] Reviewed free text categorization agent proposal (if applicable)
- [ ] Chosen approach for free text categorization
- [ ] Chosen approach for FK orphans (if multiple tables with relationships)
- [ ] Defined business rule handling (if applicable)
- [ ] User confirmed all strategies via checkpoint review
- [ ] `03-cleaning-strategy.md` created with all decisions documented

---

## Phase 4: Cleaning Execution

**Goal:** Execute approved cleaning strategies, create clean_* tables, track all exclusions and transformations.

**CRITICAL:** All transformations use CREATE TABLE AS SELECT pattern. Keep raw_* tables intact.

### Transformation 1: Remove Duplicates

**Based on Strategy from Phase 3:**

**For Exact Duplicates (Keep First approach):**

```sql
-- Create clean table without exact duplicates
CREATE TABLE clean_[table_name] AS
WITH ranked_records AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY [key_columns]
      ORDER BY rowid
    ) as rn
  FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn column
FROM ranked_records
WHERE rn = 1;
```

**For Exact Duplicates (Keep Most Complete approach):**

```sql
-- Create clean table keeping most complete record per duplicate group
CREATE TABLE clean_[table_name] AS
WITH completeness_scored AS (
  SELECT *,
    ([count non-NULL columns formula]) as completeness_score,
    ROW_NUMBER() OVER (
      PARTITION BY [key_columns]
      ORDER BY completeness_score DESC, rowid
    ) as rn
  FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn and completeness_score
FROM completeness_scored
WHERE rn = 1;
```

**For Near-Duplicates (Auto-Merge High Confidence approach):**

```sql
-- Create mapping table from categorization agent
CREATE TABLE [column]_near_dup_mapping AS
SELECT
  original_value,
  canonical_value
FROM (VALUES
  ('[value1]', '[canonical]'),
  ('[value2]', '[canonical]'),
  ...
) AS mapping(original_value, canonical_value);

-- Apply mapping
UPDATE clean_[table_name]
SET [text_column] = (
  SELECT canonical_value
  FROM [column]_near_dup_mapping
  WHERE original_value = [text_column]
)
WHERE [text_column] IN (SELECT original_value FROM [column]_near_dup_mapping);
```

**Verification:**

```sql
-- Verify duplicate removal
SELECT COUNT(*) as clean_count FROM clean_[table_name];
SELECT COUNT(*) - COUNT(DISTINCT [key_columns]) as remaining_dups FROM clean_[table_name];
-- Expected: 0 remaining duplicates

-- Exclusion count
SELECT [raw_count] - [clean_count] as excluded_duplicates;
```

**Document in `04-cleaning-execution.md`:**
- SQL executed
- Before count: [N] rows
- After count: [N] rows
- Duplicates removed: [N] rows ([X]%)

---

### Transformation 2: Handle Outliers

**Based on Strategy from Phase 3:**

**For Outliers (Exclude approach):**

```sql
-- Calculate thresholds
WITH stats AS (
  SELECT
    [median_calculation] as median,
    [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
-- Create clean table excluding outliers
CREATE TABLE clean_[table_name] AS
SELECT r.*
FROM raw_[table_name] r
CROSS JOIN stats s
WHERE ABS(r.[numeric_col] - s.median) <= 3 * s.mad
   OR r.[numeric_col] IS NULL;  -- Keep NULLs
```

**For Outliers (Cap at Threshold approach - Winsorization):**

```sql
WITH stats AS (
  SELECT
    [median_calculation] as median,
    [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
  [other_columns],
  CASE
    WHEN [numeric_col] > median + 3 * mad THEN median + 3 * mad
    WHEN [numeric_col] < median - 3 * mad THEN median - 3 * mad
    ELSE [numeric_col]
  END as [numeric_col]
FROM raw_[table_name]
CROSS JOIN stats;
```

**For Outliers (Flag and Keep approach):**

```sql
WITH stats AS (
  SELECT [median_calculation] as median, [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
  r.*,
  CASE
    WHEN ABS(r.[numeric_col] - s.median) > 3 * s.mad THEN 1
    ELSE 0
  END as [numeric_col]_outlier_flag
FROM raw_[table_name] r
CROSS JOIN stats s;
```

**Verification:**

```sql
-- Verify outlier handling
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- For Exclude approach: check no outliers remain
WITH stats AS (...)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0

-- For Cap approach: check values at thresholds
SELECT MIN([numeric_col]), MAX([numeric_col]) FROM clean_[table_name];

-- For Flag approach: check flag distribution
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
```

**Document in `04-cleaning-execution.md`:**
- Approach used per column
- SQL executed
- Before/after row counts (if excluding)
- Outliers affected: [N] rows ([X]%)

---

### Transformation 3: Categorize Free Text

**Based on Strategy from Phase 3:**

**If using agent's proposed categories:**

```sql
-- Create category mapping from agent output
CREATE TABLE [column]_category_mapping AS
VALUES
  ('[value]', '[Category 1]'),
  ('[value]', '[Category 1]'),
  ('[value]', '[Category 2]'),
  ...
) AS mapping(original_value, category);

-- Apply categorization
CREATE TABLE clean_[table_name] AS
SELECT
  r.[other_columns],
  COALESCE(m.category, 'Other') as [column]_category
FROM raw_[table_name] r
LEFT JOIN [column]_category_mapping m
  ON r.[text_column] = m.original_value;
```

**If handling uncategorizable values:**

```sql
-- Option A: Exclude uncategorizable
CREATE TABLE clean_[table_name] AS
SELECT
  r.*,
  m.category as [column]_category
FROM raw_[table_name] r
INNER JOIN [column]_category_mapping m
  ON r.[text_column] = m.original_value;
-- INNER JOIN excludes unmapped values

-- Option B: Map to "Other" category (shown in previous query with COALESCE)
```

**Verification:**

```sql
-- Verify categorization coverage
SELECT
  [column]_category,
  COUNT(*) as count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;

-- Check for unmapped values (if kept)
SELECT COUNT(*) as unmapped
FROM clean_[table_name]
WHERE [column]_category IS NULL OR [column]_category = 'Other';
```

**Document in `04-cleaning-execution.md`:**
- Category schema used
- SQL executed
- Distribution by category
- Unmapped/excluded values: [N] rows ([X]%)

---

### Transformation 4: Business Rule Validation (if applicable)

**Based on Strategy from Phase 3:**

```sql
-- Exclude rows violating business rules
CREATE TABLE clean_[table_name] AS
SELECT *
FROM raw_[table_name]
WHERE [rule_1_validation]
  AND [rule_2_validation]
  ...;

-- Example rules:
-- WHERE age BETWEEN 0 AND 120
-- AND amount > 0
-- AND date BETWEEN '2020-01-01' AND '2025-12-31'
```

**Verification:**

```sql
-- Verify no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_validation] AND [rule_2_validation] ...);
-- Expected: 0
```

**Document in `04-cleaning-execution.md`:**
- Rules enforced
- SQL executed
- Violations excluded: [N] rows ([X]%)

---

### Transformation 5: Referential Integrity Enforcement (if multiple tables)

**Based on Strategy from Phase 3:**

**For Orphaned Records (Exclude approach):**

```sql
-- Remove orphaned child records (Option A from Phase 3)
CREATE TABLE clean_child_table AS
SELECT c.*
FROM raw_child_table c
INNER JOIN raw_parent_table p ON c.fk_column = p.pk_column;
-- INNER JOIN automatically excludes orphans
```

**For Orphaned Records (Preserve with NULL approach):**

```sql
-- Set orphaned FK values to NULL (Option B from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL THEN NULL
    ELSE c.fk_column
  END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**For Orphaned Records (Flag and Keep approach):**

```sql
-- Add orphan flag column (Option C from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL AND c.fk_column IS NOT NULL THEN 1
    ELSE 0
  END as fk_column_orphan_flag
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**For Orphaned Records (Create Placeholder Parent approach):**

```sql
-- Step 1: Create placeholder parent record (Option D from Phase 3)
INSERT INTO raw_parent_table (pk_column, name, other_fields)
VALUES (-1, 'Unknown', NULL, ...);

-- Step 2: Remap orphans to placeholder
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL THEN -1
    ELSE c.fk_column
  END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
```

**Verification:**

```sql
-- Verify no orphans remain (for Exclude approach)
SELECT COUNT(*) as orphans
FROM clean_child_table c
LEFT JOIN clean_parent_table p ON c.fk_column = p.pk_column
WHERE p.pk_column IS NULL AND c.fk_column IS NOT NULL;
-- Expected: 0

-- Verify NULL remapping (for Preserve with NULL approach)
SELECT COUNT(*) as nulled_fks
FROM clean_child_table
WHERE fk_column IS NULL;
-- Expected: [count of orphans from Phase 2]

-- Verify flag accuracy (for Flag and Keep approach)
SELECT fk_column_orphan_flag, COUNT(*)
FROM clean_child_table
GROUP BY fk_column_orphan_flag;
-- Expected: flag=1 count matches orphan count from Phase 2
```

**Document in `04-cleaning-execution.md`:**
- FK relationship handled
- Approach used (Exclude/Preserve/Flag/Placeholder)
- SQL executed
- Orphans affected: [N] rows ([X]%)
- JOIN behavior after transformation

[If single table: "N/A - Single table analysis"]

---

### Combined Transformation Approach

**If multiple transformations needed, use CTE chain:**

```sql
CREATE TABLE clean_[table_name] AS
WITH
-- Step 1: Remove duplicates
deduped AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [key_cols] ORDER BY rowid) as rn
  FROM raw_[table_name]
),
no_dups AS (
  SELECT [all_columns] FROM deduped WHERE rn = 1
),
-- Step 2: Handle outliers
outliers_removed AS (
  SELECT d.*
  FROM no_dups d
  CROSS JOIN (SELECT [median], [mad] FROM ...) stats
  WHERE ABS(d.[numeric_col] - stats.median) <= 3 * stats.mad
),
-- Step 3: Apply categorization
categorized AS (
  SELECT
    o.*,
    COALESCE(m.category, 'Other') as [column]_category
  FROM outliers_removed o
  LEFT JOIN [column]_category_mapping m ON o.[text_col] = m.original_value
),
-- Step 4: Enforce business rules
final AS (
  SELECT *
  FROM categorized
  WHERE [rule_validations]
)
SELECT * FROM final;
```

**Verification of combined transformations:**

```sql
-- Row count reconciliation
SELECT
  (SELECT COUNT(*) FROM raw_[table_name]) as raw_count,
  (SELECT COUNT(*) FROM clean_[table_name]) as clean_count,
  (SELECT COUNT(*) FROM raw_[table_name]) - (SELECT COUNT(*) FROM clean_[table_name]) as total_excluded;
```

---

### Create Cleaning Execution Log

Create `analysis/[session-name]/04-cleaning-execution.md` with: ./templates/phase-4.md

**CHECKPOINT:** Before proceeding to Phase 5, you MUST have:
- [ ] Executed all transformations from Phase 3 strategy
- [ ] Created clean_[table_name] table in data/analytics.db
- [ ] Verified each transformation with specific checks
- [ ] Reconciled row counts (raw = clean + exclusions)
- [ ] Documented all exclusions with reasons and counts
- [ ] Spot-checked sample records before/after
- [ ] `04-cleaning-execution.md` created with all results documented

---

## Phase 5: Verification & Documentation

**Goal:** Validate cleaning results, quantify quality improvements, document complete audit trail from raw to clean.

### Verify Row Count Reconciliation

**Critical validation - MUST match exactly:**

```sql
-- Count raw table
SELECT COUNT(*) as raw_count FROM raw_[table_name];

-- Count clean table
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- Calculate exclusions from Phase 4 log
-- Expected: raw_count = clean_count + total_exclusions
```

**Document:** Confirm reconciliation passes. If mismatch, investigate before proceeding.

---

### Verify Transformation Results

**For each transformation applied in Phase 4:**

**Duplicate Removal Verification:**

```sql
-- Confirm no duplicates remain
SELECT [key_columns], COUNT(*) as occurrences
FROM clean_[table_name]
GROUP BY [key_columns]
HAVING COUNT(*) > 1;
-- Expected: 0 rows returned
```

**Outlier Handling Verification:**

```sql
-- For Exclude approach: confirm no outliers remain
WITH stats AS (
  SELECT [median], [mad] FROM ...
)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0 rows

-- For Cap approach: confirm values at thresholds
SELECT MIN([numeric_col]) as min_val, MAX([numeric_col]) as max_val
FROM clean_[table_name];
-- Expected: min >= (median - 3*MAD), max <= (median + 3*MAD)

-- For Flag approach: check flag accuracy
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
-- Expected: distribution matches Phase 4 execution log
```

**Free Text Categorization Verification:**

```sql
-- Confirm all values categorized
SELECT COUNT(*) as uncategorized
FROM clean_[table_name]
WHERE [column]_category IS NULL;
-- Expected: 0 (unless "keep uncategorized" was strategy)

-- Verify category distribution
SELECT [column]_category, COUNT(*) as count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;
-- Expected: matches Phase 4 execution results
```

**Business Rule Verification (if applicable):**

```sql
-- Confirm no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_condition] AND [rule_2_condition] ...);
-- Expected: 0 rows
```

**Document all verification results:**
- ✓ Pass: Expected result confirmed
- ✗ Fail: Unexpected result, requires investigation

---

### Compare Data Quality Metrics

**Before vs After comparison:**

```sql
-- Completeness comparison
SELECT
  'raw' as table_name,
  COUNT(*) as total_rows,
  COUNT([col1]) as [col1]_non_null,
  ROUND(100.0 * COUNT([col1]) / COUNT(*), 2) as [col1]_completeness_pct
FROM raw_[table_name]
UNION ALL
SELECT
  'clean' as table_name,
  COUNT(*),
  COUNT([col1]),
  ROUND(100.0 * COUNT([col1]) / COUNT(*), 2)
FROM clean_[table_name];
```

**Create quality improvement table:**

| Metric | Raw Table | Clean Table | Improvement |
|--------|-----------|-------------|-------------|
| Total rows | [N] | [N] | -[X]% (exclusions) |
| Completeness ([col1]) | [X]% | [X]% | +[X] pct points |
| Duplicate groups | [N] | 0 | -[N] (100%) |
| Outliers ([col2]) | [N] | 0 | -[N] (100%) |
| Free text unique values | [N] | [N categories] | -[X]% (categorization) |

**Document:** Quality improvements quantified with specific deltas.

---

### Spot Check Sample Records

**Select representative samples to manually verify:**

```sql
-- Records that were in duplicate groups
SELECT * FROM clean_[table_name] WHERE rowid IN ([IDs from Phase 2]);
-- Verify: correct record kept per strategy

-- Records with outliers (if flagged/capped, not excluded)
SELECT * FROM clean_[table_name] WHERE [numeric_col]_outlier_flag = 1;
-- Verify: flag accurate, values capped if applicable

-- Records with categorized free text
SELECT [original_col], [col]_category FROM clean_[table_name] LIMIT 20;
-- Verify: categories make sense, mapping correct
```

**Document:** Manual verification confirms automated transformations worked correctly.

---

### Assess Limitations and Confidence

**Document what this cleaning did NOT address:**

- **Scope limitations:** [Issues identified but not addressed - e.g., "Date range not validated"]
- **Data coverage:** [Time periods, geographies, categories not covered]
- **Assumptions made:** [Business rules assumed without domain validation]
- **Edge cases:** [Unusual values handled a specific way]

**Confidence assessment:**

- **High confidence:** [Transformations with clear validation - e.g., "Exact duplicate removal"]
- **Medium confidence:** [Transformations with some subjectivity - e.g., "Free text categorization"]
- **Low confidence / Needs review:** [Transformations requiring domain expertise - e.g., "Outliers might be legitimate"]

**Document:** Honest assessment of what was cleaned and what wasn't, with confidence levels.

---

### Create Verification Report

Create `analysis/[session-name]/05-verification-report.md` with: ./templates/phase-5.md

**CHECKPOINT:** Before concluding cleaning-data skill, you MUST have:
- [ ] Verified row count reconciliation (raw = clean + exclusions)
- [ ] Validated all transformations with specific queries
- [ ] Quantified quality improvements with before/after metrics
- [ ] Spot-checked sample records manually
- [ ] Documented limitations and assumptions honestly
- [ ] Assessed confidence level for each transformation
- [ ] Created complete exclusion accounting table
- [ ] `05-verification-report.md` created with all sections filled
- [ ] clean_* table ready for analysis process skills

---

## Common Rationalizations

### "The data looks clean after Phase 2, I can skip Phase 3 strategy design"
**Why this is wrong:** Detecting issues isn't the same as deciding how to fix them. Different approaches (exclude vs cap vs flag) have different analytical implications.

**Do instead:** Always complete Phase 3 with explicit decision frameworks. Document why you chose each approach with user confirmation.

### "I'll just exclude all outliers automatically, no need to review them"
**Why this is wrong:** Some outliers are legitimate (VIP customers, seasonal spikes, rare events). Automatic exclusion loses valuable data.

**Do instead:** Complete Phase 2 detection with agent analysis. Review patterns in Phase 3. Choose approach based on business context, not just statistical threshold.

### "The fuzzy matching agent found near-duplicates, I'll merge them all"
**Why this is wrong:** 90-95% similarity doesn't mean identical. "John Smith" vs "John Smyth" might be the same person OR two different people.

**Do instead:** Review confidence levels in Phase 3. Auto-merge only high confidence (>95%). Manual review medium confidence. Document decisions.

### "I don't need to document exclusions, I can remember what I removed"
**Why this is wrong:** Undocumented exclusions break audit trail. When results are questioned, you can't explain what data was excluded or why.

**Do instead:** Complete Phase 4 execution log with exclusion summary table. Document every excluded record with reason and count. Reconcile in Phase 5.

### "Verification is just running the same queries again, waste of time"
**Why this is wrong:** Phase 5 verification checks RESULTS, not execution. Queries can run without errors but produce wrong results (logic bugs, wrong thresholds, incorrect mappings).

**Do instead:** Always complete Phase 5 with before/after comparisons, spot checks, and manual inspection. Verification catches transformation bugs.

### "I found one issue, fixed it, done with cleaning"
**Why this is wrong:** Data quality issues cluster. If you found duplicates, likely also have outliers, NULLs, and inconsistencies. One fix doesn't make data "clean".

**Do instead:** Complete all 5 phases systematically. Phase 2 detects ALL issue types. Address all prioritized issues in Phases 3-4.

### "The clean table has fewer rows, that's proof it's better"
**Why this is wrong:** Smaller isn't always better. Excluding 50% of data might remove all the interesting variation. Quality ≠ quantity reduction.

**Do instead:** Complete Phase 5 with quality improvement quantification. Measure completeness, consistency, validity improvements - not just row count reduction.

### "I'll categorize free text myself, faster than using the agent"
**Why this is wrong:** Manual categorization is inconsistent, misses patterns, and pollutes main agent context with hundreds of unique values.

**Do instead:** Always delegate free text analysis to categorize-free-text agent in Phase 3. Agent provides structured mapping with confidence levels. Review and adjust if needed.

### "Business rules failed for 2%, I'll just delete those rows and move on"
**Why this is wrong:** 2% violations might indicate systematic data quality issue (bad data entry, import error, logic flaw). Deleting hides the problem.

**Do instead:** Investigate violations in Phase 2. Document why they violate rules in Phase 3. Consider whether to exclude, correct, or flag in strategy. Document pattern in Phase 5.

### "Phase 5 validation passed, I'm done - no need to document limitations"
**Why this is wrong:** All cleaning has limitations and assumptions. Pretending otherwise misleads analysts who use the clean data.

**Do instead:** Complete Phase 5 limitations section honestly. Document what was NOT cleaned, assumptions made, edge cases, confidence levels. Transparency builds trust.

---

## Summary

This skill ensures systematic, documented data cleaning with quality validation by:

1. **Prioritized scope definition:** Read quality report from importing-data, apply impact × severity × effort framework - ensures high-value issues addressed first, not random fixes.

2. **Structured decision-making:** Present options with trade-offs for duplicates, outliers, free text, business rules - gets user confirmation before execution, prevents undocumented assumptions.

3. **Agent-delegated detection:** Use dedicated sub-agents (detect-exact-duplicates, detect-near-duplicates, detect-outliers, categorize-free-text) - prevents context pollution while ensuring thorough analysis.

4. **Explicit strategy approval:** Document chosen approach per issue type in Phase 3 with rationale - creates decision audit trail, enables strategy review if results questioned.

5. **Transformation transparency:** Execute cleaning with CREATE TABLE AS SELECT, preserve raw_* tables, track all exclusions - maintains complete audit trail from raw to clean.

6. **Rigorous verification:** Validate transformations, quantify quality improvements, spot-check samples, document limitations - ensures clean data is actually clean and limitations are known.

Follow this process and you'll create well-documented clean tables with validated quality improvements, complete audit trail from raw data to analysis-ready data, and honest assessment of what was cleaned and what limitations remain.

---
