---
name: experiment-analyst
description: Use when the task involves A/B testing, experiment design, statistical comparison, or causal claims from controlled or quasi-controlled comparisons. Triggers include A/B test, experiment, treatment/control, statistical significance, p-value, confidence interval, lift, power, sample size, hypothesis test, causal impact, or "was this experiment successful." Also use when the user asks to design an experiment before running it, or to interpret results after running one. Do NOT use for ordinary KPI reporting, broad EDA, dashboard layout, or model validation unrelated to an experiment.
---

> Part of the [data-scientist](https://github.com/DAlanMtz/data-scientist) skill suite. Install `data-scientist` for full lifecycle methodology, routing, and review orchestration.

# Experiment Analyst

## Purpose

Design experiments correctly, evaluate their results rigorously, and communicate conclusions with appropriate causal discipline. The posture is skeptical: a treatment effect is not accepted until assignment validity, sample balance, and test selection have been checked, and until the effect has been judged for practical as well as statistical significance.

This skill covers both pre-experiment design (hypothesis, power, randomization) and post-experiment analysis (effect estimation, significance, caveats, decision).

## When To Use This Skill

Use `experiment-analyst` when:

- The user asks to analyze an A/B test or multivariate experiment result.
- The user asks to design an experiment (define hypothesis, randomization, sample size, power).
- The user asks about statistical significance, p-values, confidence intervals, lift, or power.
- The user compares two groups (treatment/control, before/after, variant A/B) and wants to draw a conclusion.
- The user asks whether an experiment "worked" or whether a result is "real."
- The user wants to make a causal claim from a comparison and needs the evidence assessed.

## When Not To Use This Skill

| Situation | Use instead |
|---|---|
| Ordinary KPI reporting without a treatment/control comparison | `metric-analyst` or parent `data-scientist` |
| Broad EDA, data profiling, schema inspection | `data-explorer` |
| Dashboard layout for experiment results | `dashboard-designer` (after this skill produces findings) |
| Model validation, leakage, production readiness | `model-auditor` |
| Stakeholder communication after results are finalized | `insight-reporter` |

## Relationship to Parent Skill

| Responsibility | Owner |
|---|---|
| Routing to this skill | Parent `data-scientist` (`workflow/specialist-routing.md`) |
| Experiment framing and scoping | **This skill** |
| Statistical testing and effect estimation | **This skill** |
| Causal language control | **This skill** |
| Communicating results to stakeholders | `insight-reporter` (after this skill) |
| Visualizing experiment results as a dashboard | `dashboard-designer` (after this skill) |

## Entry Gates

Before beginning analysis or design, confirm or state as assumptions:

1. **Outcome metric** — What is the primary metric being tested? (Conversion rate, revenue, retention, etc.)
2. **Comparison groups** — What is the treatment? What is the control? How were they defined?
3. **Unit of analysis** — One row = one what? (User, session, order, device?)
4. **Time window** — When did the experiment run? Are pre- and post-exposure periods clearly bounded?
5. **Assignment logic** — How were units assigned to treatment/control? (Random, stratified, self-selected?)
6. **Sample sizes** — How many units are in each group?

If two or more items are missing and the gaps would materially change the analysis, apply Level 2 (Clarify Then Proceed). State any missing items as open questions or assumptions and proceed where possible.

**For design-only tasks (no data yet):** Entry gates reduce to outcome metric, unit of analysis, expected effect size, and acceptable error rates (α and power).

## Required Workflow

### Post-Experiment Analysis

1. **Clarify the hypothesis.** What was the pre-specified expected direction of effect? (If not pre-specified, note this as a limitation.)
2. **Define treatment/control groups.** Confirm group definitions, assignment logic, and whether any units were excluded.
3. **Check assignment and balance.** Are treatment and control groups comparable on pre-experiment covariates? Note any imbalance.
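4. **Check sample size and power.** Was the experiment adequately powered to detect the pre-specified minimum detectable effect (MDE)? Evaluate power against that planned effect, not the observed one: power computed from the observed effect size is a restatement of the p-value and adds no information. In under-powered experiments, significant results tend to overstate the true effect and non-significant results are inconclusive.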
5. **Choose and apply the appropriate test or estimator.** Match the test to the outcome type (proportion, continuous, count, time-to-event) and data structure. State which test was used and why.
6. **Estimate effect size and uncertainty.** Report the point estimate, confidence interval, and absolute lift alongside the relative lift. Do not report relative lift alone.
7. **Interpret practical significance.** Is the effect size large enough to matter for the business? A statistically significant result may be practically negligible.
8. **State causal caveats.** Was assignment truly random? Was there interference between units (a violation of SUTVA, the stable unit treatment value assumption)? Were guardrail metrics checked? What alternative explanations remain?
9. **Recommend a decision or next step.** Ship, extend, iterate, or stop — based on the combined evidence.
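
Steps 5 and 6 can be sketched for the simplest case, a binary outcome such as conversion, using only the Python standard library. This is a minimal illustration under stated assumptions, not a prescribed method: it assumes independent units, a two-sided pooled z-test, a Wald interval on the absolute difference, and non-degenerate counts (a control rate strictly between 0 and 1). The names `ProportionReadout` and `two_proportion_readout` and the input counts are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class ProportionReadout:
    control_rate: float
    treatment_rate: float
    absolute_lift: float   # treatment rate minus control rate
    relative_lift: float   # absolute lift divided by control rate
    ci_low: float          # lower CI bound for the absolute lift
    ci_high: float         # upper CI bound for the absolute lift
    p_value: float         # two-sided


def two_proportion_readout(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided z-test and Wald CI for a difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # The test statistic uses the pooled rate: the standard error under H0.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The interval uses the unpooled standard error of the difference.
    se_diff = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ProportionReadout(
        control_rate=p_c,
        treatment_rate=p_t,
        absolute_lift=diff,
        relative_lift=diff / p_c,
        ci_low=diff - z_crit * se_diff,
        ci_high=diff + z_crit * se_diff,
        p_value=p_value,
    )


# Made-up counts, for illustration only.
readout = two_proportion_readout(conv_c=1000, n_c=10_000, conv_t=1100, n_t=10_000)
print(f"absolute lift: {readout.absolute_lift * 100:.2f} pp")
print(f"relative lift: {readout.relative_lift * 100:.1f}%")
print(f"95% CI (pp):   [{readout.ci_low * 100:.2f}, {readout.ci_high * 100:.2f}]")
print(f"p-value:       {readout.p_value:.4f}")
```

Reporting the absolute lift and its interval next to the relative lift keeps a large-sounding percentage tied to the underlying rates. Continuous, count, or time-to-event outcomes, and data with repeated rows per user, need a different test than this one.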

### Experiment Design

1. Define the primary hypothesis and outcome metric.
2. Define the unit of randomization and assignment mechanism.
3. Determine the minimum detectable effect (MDE) based on business context.
4. Calculate the required sample size for the target power (typically 80%) and significance level (typically α = 0.05).
5. Specify guardrail metrics — metrics that must not degrade.
6. Specify the planned analysis method and any pre-registered adjustments.
7. Document the design as an experiment brief.
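
Steps 3 and 4 can be illustrated with the textbook normal-approximation formula for comparing two proportions. The sketch below assumes a binary outcome, equal group sizes, a two-sided test, and an MDE expressed as an absolute difference; the function name `sample_size_per_group` and the example inputs are hypothetical, and a dedicated power library may give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Units needed per group to detect an absolute change of mde_abs.

    Normal-approximation formula for a two-sided test of two proportions
    with equal allocation between treatment and control.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Illustrative inputs: 10% baseline conversion, MDE of 1 percentage point.
n = sample_size_per_group(baseline_rate=0.10, mde_abs=0.01)
print(f"required units per group: {n}")
```

With these illustrative inputs the requirement comes out a little under 15,000 units per group. Halving the MDE roughly quadruples it, which is why the MDE should be set from business context before the sample size is computed.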

## Output Formats

| Format | Use when |
|---|---|
| **Experiment readout** | Post-experiment analysis with effect estimate, significance, and decision recommendation |
| **Test design review** | Pre-experiment design document with hypothesis, randomization, power, and guardrails |
| **Statistical comparison summary** | Side-by-side comparison of groups with test result and interpretation |
| **Causal caveat memo** | Focused note on what can and cannot be claimed causally from the comparison |

## Standard Experiment Readout Format

```
**Experiment Readout: [Experiment Name]**
Date: [analysis date]
Analyst: [if known]

**Hypothesis:** [Directional hypothesis — e.g., "Treatment increases conversion rate vs. control"]
**Outcome metric:** [Primary metric]
**Unit of analysis:** [One row = one what?]
**Time window:** [Start → End]

**Groups:**
| Group | N | Metric value |
|---|---|---|
| Treatment | [n] | [value] |
| Control | [n] | [value] |

**Effect:**
- Absolute lift: [X pp / X units]
- Relative lift: [X%]
- 95% confidence interval: [lower, upper]
- Test: [test name], p = [value]
- Statistically significant: [Yes / No / Borderline]
- Practically significant: [Yes / No — explain threshold]

**Sample balance:** [Balanced / Imbalanced — note if imbalanced and impact]
**Power:** [Was the experiment adequately powered to detect the pre-specified minimum detectable effect?]

**Causal caveats:**
- [Caveat 1]
- [Caveat 2]

**Guardrail metrics:** [Any guardrail metric changes]

**Decision recommendation:** [Ship / Extend / Iterate / Stop] — [1–2 sentence rationale]
```

## Review Checklist

Run before finalizing any experiment analysis:

| # | Check | Pass condition |
|---|---|---|
| EA1 | Hypothesis is clear and directional | A pre-specified direction of effect exists, or absence is noted as a limitation |
| EA2 | Groups are valid | Treatment and control definitions are correct and assignment logic is confirmed |
| EA3 | Sample balance is considered | Pre-experiment covariate balance is checked; imbalances are flagged and addressed |
| EA4 | Test is appropriate for the outcome type | Test matches the data structure — not a default t-test applied universally |
| EA5 | Effect size is included | Absolute lift and confidence interval are reported, not p-value alone |
| EA6 | P-value is not overinterpreted | p < 0.05 is not treated as proof of effect; borderline results are noted |
| EA7 | Practical significance is addressed | The effect size is evaluated against a business-meaningful threshold |
| EA8 | Causal language is controlled | "caused," "proved," and "because" are avoided unless assignment was truly random and guardrails are met |
| EA9 | Power is considered | Under-powered experiments are noted; non-significant results are not declared null effects |
| EA10 | Guardrail metrics are checked | Metric degradation on guardrails is reported, even when the primary metric improves |
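
Checks EA2 and EA3 can be partly mechanized. The sketch below shows two common diagnostics that the checklist itself does not name, so both are assumptions about how one might implement it: a sample ratio mismatch (SRM) test on group sizes, and a standardized mean difference (SMD) on a pre-experiment covariate. The function names, the example data, and the thresholds mentioned afterwards are hypothetical choices.

```python
from math import erfc, sqrt
from statistics import mean, variance


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for group sizes."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def standardized_mean_difference(control_values, treatment_values):
    """Difference in means scaled by the average of the two group variances."""
    pooled_sd = sqrt((variance(control_values) + variance(treatment_values)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treatment_values) - mean(control_values)) / pooled_sd


# Made-up data: a planned 50/50 split and prior-period order counts per user.
print(f"SRM p-value: {srm_p_value(n_control=5000, n_treatment=5300):.4f}")
smd = standardized_mean_difference([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"SMD on prior orders: {smd:.3f}")
```

A very small SRM p-value (many teams use a strict cutoff such as 0.001) suggests the assignment or logging pipeline is broken and the effect estimate should not be trusted until that is explained. An absolute SMD above roughly 0.1 is a common convention for flagging covariate imbalance; both cutoffs are conventions, not rules from this skill.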

**Common failure modes:**
- Reporting only relative lift without absolute lift or confidence intervals
- Declaring significance at p = 0.049 without noting borderline uncertainty
- Ignoring sample imbalance between treatment and control
- Treating "no significant difference" as "no effect" without checking power
- Running multiple tests without correction and reporting the best result
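
For the last failure mode, one possible correction is the Holm step-down procedure, sketched below. The text above does not prescribe a specific method, so the choice of Holm (rather than Bonferroni or a false-discovery-rate method) is an assumption, as are the function name `holm_adjust` and the example p-values.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values may not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values for three metrics tested in the same experiment.
raw = [0.01, 0.04, 0.03]
for p, p_adj in zip(raw, holm_adjust(raw)):
    print(f"raw p = {p:.2f} -> adjusted p = {p_adj:.2f}")
```

In this made-up example all three raw p-values are below 0.05, but only the smallest remains below 0.05 after adjustment. Reporting every test that was run, together with the adjusted values, avoids presenting the best result as if it were the only one.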

## Handoff Back to `data-scientist`

After experiment analysis:

- If the result requires stakeholder communication, route to `insight-reporter` with the experiment readout as input.
- If the result requires a visual report or dashboard, route to `dashboard-designer` with the experiment findings.
- If the experiment surfaces model quality concerns, route to `model-auditor`.
- Return to parent `data-scientist` for project closeout or routing to a follow-up analysis path.