---
name: "statsmodels-statistical-modeling"
description: "Python statistical modeling: regression (OLS, WLS, GLM), discrete (Logit, Poisson, NegBin), time series (ARIMA, SARIMAX, VAR), with rigorous inference, diagnostics, and hypothesis tests. Use scikit-learn for ML; statistical-analysis for test choice."
license: "BSD-3-Clause"
---

# statsmodels

## Overview

Statsmodels provides classical statistical modeling with rigorous inference for Python. It covers linear models, generalized linear models, discrete choice, time series, and comprehensive diagnostics. Unlike scikit-learn (prediction-focused), statsmodels emphasizes coefficient interpretation, p-values, confidence intervals, and model diagnostics.

## When to Use

- Fitting linear regression (OLS, WLS, GLS) with detailed coefficient tables and diagnostics
- Running logistic regression with odds ratios and marginal effects for clinical/epidemiological studies
- Analyzing count data with Poisson or negative binomial regression
- Time series forecasting with ARIMA, SARIMAX, or exponential smoothing
- Performing ANOVA, t-tests, or non-parametric tests with proper corrections
- Testing model assumptions (heteroskedasticity, autocorrelation, normality of residuals)
- Model comparison using AIC/BIC or likelihood ratio tests
- Using R-style formula interface (`y ~ x1 + x2 + C(group)`) for intuitive model specification
- For prediction-focused ML with cross-validation and hyperparameter tuning, use `scikit-learn` instead
- For Bayesian modeling with posterior inference, use `pymc` instead

## Prerequisites

- **Python packages**: `statsmodels`, `numpy`, `pandas`, `scipy`
- **Optional**: `matplotlib` (for diagnostic plots), `patsy` (powers the formula API; installed automatically as a statsmodels dependency)
- **Data**: Tabular data as pandas DataFrames or NumPy arrays

```bash
pip install statsmodels numpy pandas matplotlib
```

## Quick Start

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(42)
n = 100
df = pd.DataFrame({
    "x1": np.random.randn(n),
    "x2": np.random.randn(n),
    "group": np.random.choice(["A", "B"], n)
})
df["y"] = 2 + 3 * df["x1"] - 1.5 * df["x2"] + np.random.randn(n)

# OLS with formula API (R-style)
results = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(results.summary())
print(f"R²: {results.rsquared:.3f}, AIC: {results.aic:.1f}")
```

## Core API

### Module 1: Linear Regression (OLS, WLS, GLS)

Standard linear models with comprehensive diagnostics.

```python
import statsmodels.api as sm
import numpy as np

# Generate data
np.random.seed(42)
X = np.random.randn(200, 3)
y = 1 + 2*X[:, 0] - 0.5*X[:, 1] + np.random.randn(200)

# ALWAYS add constant for intercept
X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()

print(results.summary())
print(f"\nCoefficients: {results.params}")
print(f"P-values: {results.pvalues}")
print(f"R²: {results.rsquared:.4f}")

# Predictions with confidence intervals
pred = results.get_prediction(X_const[:5])
print(pred.summary_frame())
```

```python
# Robust standard errors (heteroskedasticity-consistent)
results_robust = sm.OLS(y, X_const).fit(cov_type="HC3")
print("Robust SEs:", results_robust.bse)

# Weighted Least Squares (weights should be inverse error variances)
weights = 1 / (np.abs(results.resid) + 0.1)  # illustrative weights only
results_wls = sm.WLS(y, X_const, weights=weights).fit()
print(f"WLS R²: {results_wls.rsquared:.4f}")
```

### Module 2: Generalized Linear Models (GLM)

Extends regression to non-normal outcomes (binary, count, positive continuous).

```python
import statsmodels.api as sm
import numpy as np

# Poisson regression for count data
np.random.seed(42)
X = np.random.randn(200, 2)
X_const = sm.add_constant(X)
y_counts = np.random.poisson(np.exp(0.5 + 0.3*X[:, 0]))

model = sm.GLM(y_counts, X_const, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Rate ratios
rate_ratios = np.exp(results.params)
print(f"Rate ratios: {rate_ratios}")

# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion ratio: {overdispersion:.2f}")
if overdispersion > 1.5:
    print("→ Consider Negative Binomial model")
```

### Module 3: Discrete Choice Models (Logit, Probit, Count)

Binary, multinomial, and count outcome models.

```python
import statsmodels.api as sm
import numpy as np

# Logistic regression
np.random.seed(42)
X = np.random.randn(300, 2)
X_const = sm.add_constant(X)
prob = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.5*X[:, 1])))
y_binary = np.random.binomial(1, prob)

logit_results = sm.Logit(y_binary, X_const).fit()
print(logit_results.summary())

# Odds ratios
odds_ratios = np.exp(logit_results.params)
print(f"Odds ratios: {odds_ratios}")

# Marginal effects (at means)
margeff = logit_results.get_margeff()
print(margeff.summary())

# Predicted probabilities
probs = logit_results.predict(X_const[:5])
print(f"Predicted P(Y=1): {probs}")
```

### Module 4: Time Series (ARIMA, SARIMAX)

Univariate and multivariate time series modeling and forecasting.

```python
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
import numpy as np
import pandas as pd

# Generate time series
np.random.seed(42)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
y = np.cumsum(np.random.randn(200)) + 50
ts = pd.Series(y, index=dates)

# Stationarity test
adf_result = adfuller(ts)
print(f"ADF statistic: {adf_result[0]:.4f}, p-value: {adf_result[1]:.4f}")
print("Stationary" if adf_result[1] < 0.05 else "Non-stationary → difference")

# Fit ARIMA
model = ARIMA(ts, order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast with confidence intervals
forecast = results.get_forecast(steps=30)
forecast_df = forecast.summary_frame()
print(f"30-day forecast:\n{forecast_df.head()}")
```

```python
# Seasonal ARIMA (SARIMAX)
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Monthly data with yearly seasonality
model_sarima = SARIMAX(ts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results_sarima = model_sarima.fit(disp=False)
print(f"AIC: {results_sarima.aic:.1f}")

# Diagnostic plots (standardized residuals, histogram, Q-Q, correlogram)
import matplotlib.pyplot as plt
results_sarima.plot_diagnostics(figsize=(12, 8))
plt.savefig("sarima_diagnostics.png", dpi=150, bbox_inches="tight")
```

### Module 5: Statistical Tests and Diagnostics

Assumption tests, hypothesis tests, and model validation.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import jarque_bera
import numpy as np

# Fit a model first
np.random.seed(42)
X = sm.add_constant(np.random.randn(200, 2))
y = 1 + 2*X[:, 1] + np.random.randn(200) * X[:, 1]  # Heteroskedastic
results = sm.OLS(y, X).fit()

# Heteroskedasticity test (Breusch-Pagan)
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_p:.4f} {'→ heteroskedastic' if bp_p < 0.05 else '→ OK'}")

# Normality test (Jarque-Bera)
jb_stat, jb_p, _, _ = jarque_bera(results.resid)
print(f"Jarque-Bera p-value: {jb_p:.4f} {'→ non-normal' if jb_p < 0.05 else '→ OK'}")

# Autocorrelation test (Ljung-Box)
lb_result = acorr_ljungbox(results.resid, lags=[10], return_df=True)
print(f"Ljung-Box p-value (lag 10): {lb_result['lb_pvalue'].values[0]:.4f}")
```

```python
# Variance Inflation Factor (multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

vif_data = pd.DataFrame({
    "Variable": ["const"] + [f"x{i}" for i in range(1, X.shape[1])],
    "VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})
print(vif_data)  # VIF > 10 suggests multicollinearity (ignore the const row)
```

### Module 6: Formula API (R-style)

Intuitive model specification using formulas with automatic dummy coding.

```python
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.randn(100),
    "x1": np.random.randn(100),
    "x2": np.random.randn(100),
    "group": np.random.choice(["A", "B", "C"], 100),
})

# Formula with categoricals (auto dummy-coded)
res = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(res.summary())

# Interactions
res2 = smf.ols("y ~ x1 * x2", data=df).fit()  # x1 + x2 + x1:x2
print(f"Interaction term p-value: {res2.pvalues['x1:x2']:.4f}")

# Logit via formula
df["binary"] = (df["y"] > 0).astype(int)
logit_res = smf.logit("binary ~ x1 + x2 + C(group)", data=df).fit()
print(f"Logit AIC: {logit_res.aic:.1f}")
```

## Common Workflows

### Workflow 1: Complete Regression Analysis

**Goal**: Fit OLS, validate assumptions, use robust SEs if needed.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np
import pandas as pd

# 1. Fit initial model
np.random.seed(42)
df = pd.DataFrame({"y": np.random.randn(200), "x1": np.random.randn(200), "x2": np.random.randn(200)})
df["y"] = 2 + 3*df["x1"] - df["x2"] + np.random.randn(200)

results = smf.ols("y ~ x1 + x2", data=df).fit()

# 2. Check heteroskedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p: {bp_p:.4f}")

# 3. If heteroskedastic, use robust SEs
if bp_p < 0.05:
    results = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC3")
    print("Using HC3 robust standard errors")

# 4. Check multicollinearity
X = results.model.exog
for i in range(1, X.shape[1]):  # skip constant
    print(f"VIF x{i}: {variance_inflation_factor(X, i):.2f}")

# 5. Final results
print(results.summary())
print(f"\nAIC: {results.aic:.1f}, BIC: {results.bic:.1f}")
```

### Workflow 2: Model Comparison

**Goal**: Compare nested and non-nested models using appropriate criteria.

```python
import statsmodels.formula.api as smf
from scipy import stats
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({"y": np.random.randn(200), "x1": np.random.randn(200),
                    "x2": np.random.randn(200), "x3": np.random.randn(200)})
df["y"] = 1 + 2*df["x1"] - df["x2"] + 0.1*df["x3"] + np.random.randn(200)

# Fit nested models
m1 = smf.ols("y ~ x1", data=df).fit()
m2 = smf.ols("y ~ x1 + x2", data=df).fit()
m3 = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Compare via AIC/BIC (lower = better)
comparison = pd.DataFrame({
    "R²": [m.rsquared for m in [m1, m2, m3]],
    "AIC": [m.aic for m in [m1, m2, m3]],
    "BIC": [m.bic for m in [m1, m2, m3]],
}, index=["y~x1", "y~x1+x2", "y~x1+x2+x3"])
print(comparison)

# Likelihood ratio test (nested: m2 vs m3)
lr_stat = 2 * (m3.llf - m2.llf)
p_val = stats.chi2.sf(lr_stat, df=m3.df_model - m2.df_model)  # sf avoids precision loss from 1 - cdf
print(f"\nLR test (m3 vs m2): stat={lr_stat:.2f}, p={p_val:.4f}")
```

### Workflow 3: Time Series Forecasting Pipeline

**Goal**: Test stationarity, identify model order, forecast.

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
ts = pd.Series(np.cumsum(np.random.randn(200)) + 100,
               index=pd.date_range("2020-01-01", periods=200, freq="D"))

# 1. Test stationarity
adf_p = adfuller(ts)[1]
print(f"ADF p-value: {adf_p:.4f} → {'stationary' if adf_p < 0.05 else 'non-stationary'}")

# 2. Identify order from ACF/PACF (on differenced series)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(ts.diff().dropna(), lags=20, ax=ax1)
plot_pacf(ts.diff().dropna(), lags=20, ax=ax2)
plt.savefig("acf_pacf.png", dpi=150, bbox_inches="tight")

# 3. Fit and forecast
model = ARIMA(ts[:180], order=(1, 1, 1))
results = model.fit()
forecast = results.get_forecast(steps=20)
fc_df = forecast.summary_frame()
print(f"ARIMA AIC: {results.aic:.1f}")
print(f"Forecast (first 5 days):\n{fc_df.head()}")
```

## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `cov_type` | OLS/WLS/GLM | `"nonrobust"` | `"HC0"`-`"HC3"`, `"HAC"`, `"cluster"` | Robust covariance estimator |
| `family` | GLM | required | `Poisson()`, `Binomial()`, `Gamma()`, etc. | Distribution family |
| `order` | ARIMA | required | `(p, d, q)` tuple | AR order, differencing, MA order |
| `seasonal_order` | SARIMAX | `(0,0,0,0)` | `(P, D, Q, s)` tuple | Seasonal ARIMA parameters |
| `alpha` | `summary()`, `conf_int()` | `0.05` | `0.01`-`0.10` | Significance level for CIs |
| `maxiter` | All `.fit()` | `35`-`100` | `50`-`1000` | Max optimization iterations |
| `method` | `.fit()` | model-dependent | `"newton"`, `"bfgs"`, `"lbfgs"`, `"powell"` | Optimization algorithm |
| `lags` | ACF/PACF | `None` | `10`-`50` | Number of lags to display |

## Best Practices

1. **Always add a constant** for OLS/GLM: `sm.add_constant(X)` or use formula API (adds intercept automatically)

2. **Match model to outcome type**: Binary → Logit/Probit, Counts → Poisson/NegBin, Continuous → OLS/WLS, Time series → ARIMA

3. **Check diagnostics before interpreting**: Run Breusch-Pagan (heteroskedasticity), Jarque-Bera (normality), Ljung-Box (autocorrelation) on residuals

4. **Use robust SEs when assumptions fail**: `results = model.fit(cov_type="HC3")` for heteroskedasticity-robust inference

5. **Report effect sizes, not just p-values**: Include coefficients, confidence intervals, and R² alongside significance tests

6. **Prefer formula API for exploratory work**: `smf.ols("y ~ x1 * x2 + C(group)", data=df)` is more readable and handles categoricals automatically

7. **Test stationarity before time series modeling**: Use ADF test; difference if non-stationary

## Common Recipes

### Recipe: ANOVA with Post-hoc Tests

When to use: Comparing means across 3+ groups.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({"value": np.concatenate([np.random.normal(m, 1, 30) for m in [5, 6, 7]]),
                    "group": np.repeat(["A", "B", "C"], 30)})

# One-way ANOVA
anova = smf.ols("value ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(anova))

# Post-hoc Tukey HSD
tukey = pairwise_tukeyhsd(df["value"], df["group"], alpha=0.05)
print(tukey)
```

### Recipe: Power Analysis for Sample Size

When to use: Determining required sample size before a study.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# What sample size for medium effect (d=0.5), 80% power, alpha=0.05?
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Required n per group: {n:.0f}")

# Power for given sample size
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=50)
print(f"Power with n=50: {power:.3f}")
```

### Recipe: Mixed Effects Model

When to use: Hierarchical/clustered data (patients within hospitals, students within schools).

```python
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.randn(100), "x": np.random.randn(100),
    "group": np.repeat(range(10), 10)
})

# Random intercept model
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
results = model.fit()
print(results.summary())
```
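
A random slope for `x` can be added with `re_formula`; a sketch on the same simulated layout (whether a slope is warranted depends on the study design):

```python
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.randn(100), "x": np.random.randn(100),
    "group": np.repeat(range(10), 10)
})

# Random intercept + random slope for x within each group
slope_model = smf.mixedlm("y ~ x", data=df, groups=df["group"], re_formula="~x")
slope_results = slope_model.fit()
print(slope_results.summary())
```

On pure-noise data like this, the random-effect variances sit near zero and the optimizer may warn about a boundary solution; that is expected here.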

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `MissingDataError` | NaN values in data | Drop NAs: `df.dropna()` or impute before fitting |
| No intercept in results | Forgot `sm.add_constant()` | Always add constant, or use `smf.ols()` formula API |
| `ConvergenceWarning` | Optimization failed | Increase `maxiter`, try different `method`, or scale variables |
| Overdispersion in Poisson | Variance > mean | Switch to `NegativeBinomial` or use `GLM(family=NegativeBinomial())` |
| Non-stationary time series | Trend or unit root | Difference the series (`ts.diff()`) or increase `d` in ARIMA |
| Singular matrix error | Perfect multicollinearity | Remove redundant variables; check VIF > 10 |
| Different results from R | Default settings differ | Check: constant term, link function, optimizer, SE type |
| `PerfectSeparationError` in Logit | Predictor perfectly separates classes | Use regularized logistic (penalized MLE) or Firth's method |

## References

- [statsmodels documentation](https://www.statsmodels.org/stable/) — official docs
- [statsmodels User Guide](https://www.statsmodels.org/stable/user-guide.html) — tutorials
- [statsmodels API Reference](https://www.statsmodels.org/stable/api.html) — complete API
- Seabold, S. & Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. *Proceedings of the 9th Python in Science Conference*.
