---
name: data-explorer
description: Use when the task is to inspect a dataset, profile data, understand schema and grain, assess data quality, find missingness or duplicates, identify row grain, check time coverage, or determine whether data is analysis-ready. Triggers include "inspect this dataset," "profile the data," "understand the schema," "check for missingness," "find duplicates," "what is the row grain," "assess data quality," "is this data ready for analysis," or first-pass EDA. Do NOT use for final stakeholder reporting, dashboard layout, model validation, experiment readout, or production monitoring.
---

> Part of the [data-scientist](https://github.com/DAlanMtz/data-scientist) skill suite. Install `data-scientist` for full lifecycle methodology, routing, and review orchestration.

# Data Explorer

## Purpose

Take in a new dataset, understand its structure and quality, and determine whether it is ready for analysis. The posture is methodical and skeptical: do not assume the data is clean, the grain is obvious, or the coverage is complete. Surface issues before they corrupt downstream analysis.

This skill produces a structured data profile and analysis-readiness assessment, not a final analysis. It is the first step before `metric-analyst`, `experiment-analyst`, or modeling work begins.

## When To Use This Skill

Use `data-explorer` when:

- The user asks to inspect, profile, or explore a new dataset.
- The user wants to understand the schema, field types, or key columns.
- The user asks about missingness, duplicates, outliers, or invalid values.
- The user wants to identify the row grain of a table.
- The user asks whether data is "ready" for analysis, modeling, or joining.
- The task is first-pass EDA with no specific analysis question yet defined.
- The user asks to assess data quality before beginning any downstream work.

## When Not To Use This Skill

| Situation | Use instead |
|---|---|
| Final stakeholder report or communication | `insight-reporter` |
| Metric definitions and KPI calculations | `metric-analyst` |
| Experiment analysis or A/B testing | `experiment-analyst` |
| Model validation or production readiness | `model-auditor` |
| Dashboard or visual layout | `dashboard-designer` |
| Production pipeline monitoring | `production-analytics` |

## Relationship to Parent Skill

| Responsibility | Owner |
|---|---|
| Routing to this skill | Parent `data-scientist` (`workflow/specialist-routing.md`) |
| Dataset profiling and schema inspection | **This skill** |
| Row grain identification | **This skill** |
| Missingness, duplicates, outlier assessment | **This skill** |
| Time coverage and freshness check | **This skill** |
| Analysis-readiness determination | **This skill** |
| Downstream analysis using the profiled data | `metric-analyst`, `experiment-analyst`, parent `data-scientist` |

## Entry Gates

Before beginning, confirm or state as assumptions:

1. **Dataset or schema information** — What data is available? Table names, field names, a sample, or a schema description.
2. **Intended analysis question** — What will this data be used to answer? (Use "general profiling" if unknown — do not block on this.)
3. **Key fields to focus on** — Are there specific ID fields, date fields, or outcome fields of interest?

Missing items are not blockers. State them as assumptions and profile broadly. The analysis-readiness summary will flag what is unknown.

## Required Workflow

1. **Identify the data source and intended question.** Name the table, file, or extract. State the intended downstream analysis if known.
2. **Inspect schema and key fields.** Record column names, inferred or stated types, and expected vs. actual field roles (IDs, dates, outcome fields, join keys, measures).
3. **Identify the row grain.** One row = one what? Confirm by checking that the candidate key (one or more fields) is unique across all rows. If it is not, either the grain is finer than assumed or the table contains duplicates.
4. **Check missingness, duplicates, outliers, and invalid values.** For each key field, report the null rate, outlier presence, and invalid values (negative counts, future dates, impossible values). Report the duplicate rate once, at the grain level, since duplicates are a property of the key rather than of any single field.
5. **Check time coverage and freshness.** What is the earliest and latest date? Are there gaps? Does the coverage match the intended analysis window?
6. **Check segment coverage and join keys.** Are expected categories present? Are join keys consistent across tables? Would a join create fan-out or drop rows unexpectedly?
7. **Produce the analysis-readiness summary.** State whether the data is ready, conditionally ready, or blocked — and what must be resolved before analysis proceeds.
8. **Handoff.** Route to `metric-analyst`, `experiment-analyst`, parent `data-scientist`, or the user with specific recommendations.
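
Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the sample rows, the field names (`order_id`, `customer_id`, `amount`, `order_date`), and the helper functions are all hypothetical, and the invalid-value rules (negative amounts, future dates) are assumptions chosen to mirror the examples in step 4.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows; field names are illustrative only.
rows = [
    {"order_id": 1, "customer_id": "a", "amount": 20.0, "order_date": date(2024, 1, 3)},
    {"order_id": 2, "customer_id": "a", "amount": None, "order_date": date(2024, 1, 4)},
    {"order_id": 2, "customer_id": "b", "amount": -5.0, "order_date": date(2024, 1, 4)},
    {"order_id": 3, "customer_id": None, "amount": 12.5, "order_date": date(2024, 1, 6)},
]


def null_rate(rows, field):
    """Share of rows where `field` is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def duplicate_keys(rows, key_fields):
    """Key values that appear on more than one row at the candidate grain."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


def invalid_rows(rows, today):
    """Rows with a negative amount or an order date after `today` (assumed rules)."""
    bad = []
    for r in rows:
        amount, order_date = r.get("amount"), r.get("order_date")
        if (amount is not None and amount < 0) or (
            order_date is not None and order_date > today
        ):
            bad.append(r)
    return bad


# Grain check: is order_id unique? An empty result means the grain is confirmed.
dupes = duplicate_keys(rows, ["order_id"])
grain_confirmed = not dupes

# Null rates for the key fields of interest.
nulls = {f: null_rate(rows, f) for f in ["customer_id", "amount"]}

# Invalid values, evaluated against a fixed reference date for reproducibility.
invalid = invalid_rows(rows, today=date(2024, 1, 31))
```

Here `order_id` 2 appears twice, so the "one row = one order" grain would be reported as not confirmed, and the profile would need to state whether the table is actually at a finer grain or contains true duplicates.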

## Output Formats

| Format | Use when |
|---|---|
| **Data profile summary** | Quick structured overview of schema, grain, and key quality stats |
| **EDA findings** | Narrative findings from exploratory profiling with interpretation |
| **Data quality report** | Structured findings table with severity, affected field, issue, and fix |
| **Analysis-readiness checklist** | Pass/conditional/blocked determination with specific conditions |

## Standard Data Profile Format

```
**Data Profile: [Table / Dataset Name]**
Source: [table name, file path, or description]
Profiled: [date or "current session"]
Intended use: [analysis question or "general profiling"]

**Schema summary:**
| Field | Type | Role | Notes |
|---|---|---|---|
| [field] | [type] | [ID / Date / Measure / Category / Join key] | [observations] |

**Row grain:** [One row = one what?] | Unique key: [field(s)] | Confirmed: [Yes / No / Uncertain]
**Row count:** [N] | Duplicate rows at grain: [count or %]

**Missingness:**
| Field | Null rate | Pattern |
|---|---|---|
| [field] | [X%] | [MCAR / MAR / MNAR / Unknown] |

**Key quality issues:**
| Severity | Field | Issue | Recommended fix |
|---|---|---|---|
| [Critical/High/Medium/Low] | [field] | [description] | [action] |

**Time coverage:** [Start] → [End] | Gaps: [Yes/No — describe] | Freshness: [lag note]

**Segment / join checks:** [Key findings on categories or join keys]

**Analysis-readiness verdict:**
- [ ] Ready — proceed to analysis
- [ ] Conditionally ready — resolve [specific issue(s)] first
- [ ] Blocked — [reason] must be resolved before analysis

**Recommended next step:** [route to metric-analyst / experiment-analyst / user action]
```
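
The `Time coverage` line in the template can be filled from a simple date scan. The sketch below assumes the data is expected to have at least one row per calendar day; that cadence is an assumption, and weekly or business-day data would need a different expected set. The function name `time_coverage` and the `as_of` parameter are hypothetical.

```python
from datetime import date, timedelta


def time_coverage(dates, as_of):
    """Summarize start, end, missing calendar days, and lag for daily data.

    Assumes one or more rows are expected for every calendar day between
    the earliest and latest observed dates.
    """
    observed = sorted(set(dates))
    if not observed:
        return None
    start, end = observed[0], observed[-1]
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    missing = sorted(expected - set(observed))
    return {
        "start": start,
        "end": end,
        "missing_days": missing,
        "lag_days": (as_of - end).days,  # freshness: days since the latest record
    }


# Hypothetical date column with one gap (Jan 3) and a repeated day (Jan 4).
dates = [
    date(2024, 1, 1),
    date(2024, 1, 2),
    date(2024, 1, 4),
    date(2024, 1, 4),
    date(2024, 1, 5),
]
coverage = time_coverage(dates, as_of=date(2024, 1, 8))
```

The `missing_days` list feeds the `Gaps` entry and `lag_days` feeds the `Freshness` entry. Whether a given gap or lag matters still depends on the intended analysis window.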

## Review Checklist

Run before delivering any data profile or readiness assessment:

| # | Check | Pass condition |
|---|---|---|
| DE1 | Grain is understood | The statement "one row = one [unit]" is confirmed or explicitly flagged as uncertain |
| DE2 | Key fields are identified | ID, date, target/outcome, join key, and measure fields are labeled |
| DE3 | Missingness is checked | Null rates are reported for all key fields |
| DE4 | Duplicates are checked | Duplicate row count at the expected grain level is reported |
| DE5 | Time/freshness is checked | Date range and any gaps are noted |
| DE6 | Source limitations are documented | Known quality issues, lag, or collection artifacts are stated |
| DE7 | Join risk is assessed | If a join is planned, cardinality, key consistency, and fan-out risk are checked |
| DE8 | Recommendations are tied to quality findings | Suggestions address specific issues, not generic advice |
| DE9 | Analysis-readiness verdict is explicit | Ready / Conditionally ready / Blocked is stated, not implied |

**Common failure modes:**
- Treating the wrong field as the grain (e.g., using order ID when customer ID is the intended grain)
- Declaring data "clean" based on row count alone without checking nulls or duplicates
- Missing the fan-out risk from a one-to-many join
- Not checking time coverage when the analysis involves a specific period
- Profiling fields irrelevant to the intended analysis while ignoring key fields
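
The join-related failure modes can be checked before any join is run. The following is a rough sketch, assuming a left join from one table's keys into another and assuming keys are non-null and mutually comparable; `join_risk` and the sample keys are hypothetical names used only for illustration.

```python
from collections import Counter


def join_risk(left_keys, right_keys):
    """Estimate fan-out and dropped rows for a join from left into right.

    Assumes keys are non-null and sortable; null keys should be counted
    separately before calling this.
    """
    right_counts = Counter(right_keys)
    distinct_left = set(left_keys)
    # Left keys matching more than one right row will multiply left rows.
    fan_out = sorted(k for k in distinct_left if right_counts[k] > 1)
    # Left keys with no match are dropped by an inner join.
    unmatched = sorted(k for k in distinct_left if right_counts[k] == 0)
    inner_rows = sum(right_counts[k] for k in left_keys)
    return {
        "left_rows": len(left_keys),
        "inner_join_rows": inner_rows,
        "fan_out_keys": fan_out,
        "unmatched_left_keys": unmatched,
    }


# Hypothetical keys: c1 matches twice on the right, c3 has no match.
risk = join_risk(left_keys=["c1", "c2", "c3"], right_keys=["c1", "c1", "c2"])
```

In this toy case the inner join would return 3 rows, the same as the left table, even though one key fans out and another is dropped. That is why comparing row counts alone is not a sufficient join check.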

## Handoff Back to `data-scientist`

After profiling:

- If the data is ready or conditionally ready, route to `metric-analyst` for KPI/SQL work, `experiment-analyst` for experiment analysis, or the parent `data-scientist` for modeling or other paths.
- If the data is blocked, return findings to the user or parent `data-scientist` with a specific list of required fixes before analysis can proceed.
- Include the data profile summary as context in the handoff so downstream skills do not re-profile the same data.
