---
name: "scvi-tools-single-cell"
description: "Deep generative models for single-cell omics: probabilistic batch correction (scVI), semi-supervised annotation (scANVI), CITE-seq RNA+protein (totalVI), transfer learning (scARCHES), and DE with uncertainty. Unified setup→train→extract API on AnnData. Use harmony-batch-correction for fast linear correction without deep learning; muon for multi-modal MuData workflows."
license: "BSD-3-Clause"
---

# scvi-tools — Single-Cell Deep Generative Models

## Overview

scvi-tools is a probabilistic modeling framework for single-cell genomics built on PyTorch. It implements variational autoencoders (VAEs) that learn low-dimensional latent representations of cells while explicitly modeling batch effects, count noise distributions, and multi-modal data. All models share a unified API: `setup_anndata()` to register data, instantiate the model, `train()`, then extract latent representations, normalized expression, or differential expression results. Models operate on raw count data in AnnData format and return statistically grounded outputs with uncertainty estimates.

## When to Use

- Integrating multiple scRNA-seq batches or studies with probabilistic batch correction that preserves biological variation
- Performing differential expression with uncertainty quantification and composite hypotheses (not just fold-change thresholding)
- Annotating cell types via semi-supervised transfer learning from a partially-labeled reference (scANVI)
- Jointly modeling CITE-seq protein and RNA data to obtain denoised protein estimates and joint embeddings (totalVI)
- Adapting a pretrained model to a new query dataset without full retraining (scARCHES transfer learning)
- Deconvolving spatial transcriptomics spots into cell type proportions using a matched scRNA-seq reference (DestVI)
- Detecting doublets in scRNA-seq data as a QC preprocessing step (Solo)
- Use **harmony-batch-correction** instead when you need fast linear batch correction (seconds vs minutes) without deep learning overhead
- For **multi-modal MuData workflows** (joint RNA+ATAC Multiome, combined modality objects), use **muon** instead
- For **standard clustering and visualization** without batch effects or probabilistic DE, use **scanpy-scrna-seq**

## Prerequisites

- **Python packages**: `scvi-tools>=1.1`, `scanpy`, `anndata`
- **Data requirements**: AnnData (`.h5ad`) with **raw counts** — not log-normalized. Store counts in `adata.layers["counts"]` if `adata.X` has been normalized.
- **Hardware**: CPU works for <50k cells; GPU (8GB+ VRAM) recommended for larger datasets. Training time: 5–30 minutes on GPU for 100k cells.

```bash
pip install scvi-tools scanpy
# GPU acceleration (recommended for >50k cells)
pip install "scvi-tools[cuda12]"   # or scvi-tools[cuda11]
```

## Quick Start

Minimal scVI batch integration on a built-in example dataset:

```python
import scvi
import scanpy as sc

# Load example data with batch labels
adata = scvi.data.heart_cell_atlas_subsampled()

# Preprocessing: filter genes, select HVGs (subset to save training time)
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()  # preserve raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="cell_source", subset=True)

# Register data, train, extract
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="cell_source")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=200, early_stopping=True)

adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["cell_source", "leiden"])
print(f"Latent shape: {adata.obsm['X_scVI'].shape}")
# Latent shape: (14000, 30)
```

## Core API

### 1. Data Registration (setup_anndata)

All models share the same registration pattern. Call `setup_anndata()` on the model class before instantiation to tell scvi-tools where to find counts, batch labels, and covariates.

```python
import scvi

# Minimal: counts in a layer, batch column in obs
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",        # Key in adata.layers with raw counts; None = adata.X
    batch_key="batch",     # Column in adata.obs for technical batch
)

# Extended: additional categorical and continuous covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    categorical_covariate_keys=["donor", "protocol"],   # Discrete biological/technical vars
    continuous_covariate_keys=["percent_mito", "log_n_counts"],  # Continuous covariates
)

# Inspect registered summary
print(adata.uns["_scvi"]["summary_stats"])
# {'n_vars': 2000, 'n_cells': 45000, 'n_batch': 6, 'n_extra_categorical_covs': 2, ...}
```

### 2. scVI — Batch Correction and Integration

The core unsupervised model. Learns a batch-corrected latent space and a denoised expression layer. Starting point for any multi-batch scRNA-seq analysis.

```python
import scvi

model = scvi.model.SCVI(
    adata,
    n_latent=30,              # Latent space dimensions; 10–50 typical
    n_layers=2,               # Hidden layers in encoder/decoder; 1–3
    n_hidden=128,             # Neurons per hidden layer; 64–256
    gene_likelihood="zinb",   # "zinb" (zero-inflated NB), "nb", or "poisson"
    dispersion="gene",        # "gene" or "gene-batch" for batch-specific dispersion
)
model.train(max_epochs=200, early_stopping=True)

# Extract results
latent = model.get_latent_representation()        # ndarray (n_cells, n_latent)
normalized = model.get_normalized_expression(
    library_size=1e4, n_samples=25, return_mean=True
)
print(f"Latent: {latent.shape}")
print(f"Denoised expression: {normalized.shape}")
# Latent: (45000, 30)
# Denoised expression: (45000, 2000)

adata.obsm["X_scVI"] = latent
adata.layers["scvi_normalized"] = normalized
```

### 3. scANVI — Semi-Supervised Cell Annotation

Extends scVI with label supervision for cell type transfer learning. Accepts partially labeled data (unannotated cells labeled `"Unknown"`) and predicts cell types for unlabeled cells.

```python
import scvi

# Register with cell type labels; mark unannotated cells as "Unknown"
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",           # Column in adata.obs with known labels
    unlabeled_category="Unknown",     # Sentinel value for unannotated cells
)

# Recommended: initialize from a pretrained scVI model
scvi_model = scvi.model.SCVI(adata, n_latent=30)
scvi_model.train(max_epochs=200, early_stopping=True)

model = scvi.model.SCANVI.from_scvi_model(scvi_model, unlabeled_category="Unknown")
model.train(max_epochs=20)  # Short fine-tuning on top of scVI

# Predict cell types
labels_pred = model.predict()              # Hard labels (str per cell)
probs = model.predict(soft=True)           # DataFrame: probability per cell type
confidence = probs.max(axis=1)

adata.obs["scANVI_pred"] = labels_pred
adata.obs["scANVI_confidence"] = confidence
print(f"Median confidence: {confidence.median():.3f}")
print(f"High-confidence cells (>0.7): {(confidence > 0.7).sum()}")
```

### 4. totalVI — CITE-seq RNA + Protein

Joint probabilistic model for CITE-seq data. Denoises both RNA and protein counts and estimates protein foreground probability (signal vs background).

```python
import scvi

# Protein counts must be in adata.obsm (not adata.X or layers)
# adata.obsm["protein_expression"] shape: (n_cells, n_proteins)
scvi.model.TOTALVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    protein_expression_obsm_key="protein_expression",
)

model = scvi.model.TOTALVI(adata, latent_distribution="normal")
model.train(max_epochs=200, early_stopping=True)

# Extract joint latent space (RNA + protein)
latent = model.get_latent_representation()

# Denoised RNA and protein separately
rna_norm, protein_norm = model.get_normalized_expression(
    n_samples=25, return_mean=True
)

# Protein foreground probability (probability that signal > background)
foreground_prob = model.get_protein_foreground_probability(n_samples=25, return_mean=True)
print(f"RNA normalized: {rna_norm.shape}")
print(f"Protein normalized: {protein_norm.shape}")
print(f"Foreground prob: {foreground_prob.shape}")  # (n_cells, n_proteins)

adata.obsm["X_totalVI"] = latent
```

### 5. Differential Expression

Probabilistic DE using the generative model rather than raw counts. Supports composite hypotheses testing whether effect size exceeds a biologically meaningful threshold (`delta`).

```python
# DE between two cell type groups
de_df = model.differential_expression(
    groupby="cell_type",
    group1="CD4 T",         # Numerator group
    group2="CD8 T",         # Denominator group; None = rest of cells
    mode="change",          # "change": composite |LFC| > delta; "vanilla": any LFC
    delta=0.25,             # Minimum |LFC| to be considered DE
    fdr_target=0.05,        # FDR level for calling significance
    all_stats=True,         # Include mean expression per group
)

# Filter significant DE genes
sig = de_df[de_df["is_de_fdr_0.05"] & (de_df["lfc_mean"].abs() > 0.5)]
sig_sorted = sig.sort_values("lfc_mean", ascending=False)
print(f"Total DE genes (FDR<5%, |LFC|>0.5): {len(sig)}")
print(sig_sorted[["lfc_mean", "bayes_factor", "proba_de"]].head(10))
```

```python
# One-vs-rest DE across all clusters (generates a dict of DataFrames)
de_all = {}
for ct in adata.obs["cell_type"].unique():
    de_all[ct] = model.differential_expression(
        idx1=[adata.obs["cell_type"] == ct],   # Boolean index
        mode="change",
        delta=0.25,
    )
    sig_n = de_all[ct]["is_de_fdr_0.05"].sum()
    print(f"  {ct}: {sig_n} DE genes vs rest")
```

### 6. scARCHES — Query-to-Reference Transfer Learning

Adapts a pretrained reference model to a new query dataset without re-training from scratch. Preserves the reference embedding structure and maps query cells into the same latent space.

```python
import scvi

# --- Reference training (done once, save the model) ---
scvi.model.SCVI.setup_anndata(ref_adata, layer="counts", batch_key="batch")
ref_model = scvi.model.SCVI(ref_adata, n_latent=30)
ref_model.train(max_epochs=200, early_stopping=True)
ref_model.save("./reference_model/", overwrite=True)

# Reference annotation (use scANVI for label transfer)
scvi.model.SCANVI.setup_anndata(
    ref_adata, layer="counts", batch_key="batch",
    labels_key="cell_type", unlabeled_category="Unknown",
)
ref_scanvi = scvi.model.SCANVI.from_scvi_model(ref_model, unlabeled_category="Unknown")
ref_scanvi.train(max_epochs=20)
ref_scanvi.save("./reference_scanvi/", overwrite=True)

# --- Query mapping (new dataset, minimal training) ---
ref_scanvi = scvi.model.SCANVI.load("./reference_scanvi/", adata=ref_adata)

# Prepare query: must share gene set with reference
query_adata.obs["cell_type"] = "Unknown"   # All query cells are unlabeled
query_model = scvi.model.SCANVI.load_query_data(query_adata, ref_scanvi)
query_model.train(
    max_epochs=100,
    plan_kwargs={"weight_decay": 0.0},  # Critical: prevents catastrophic forgetting
)

# Map query into reference latent space
query_latent = query_model.get_latent_representation(query_adata)
query_labels = query_model.predict(query_adata)
print(f"Query latent: {query_latent.shape}")
print(f"Query label predictions: {query_labels[:5]}")
```

### 7. Downstream Analysis on Latent Space

After extracting the latent representation, use scanpy for UMAP, clustering, and marker genes — with the batch-corrected embedding as input.

```python
import scanpy as sc

# UMAP and clustering on scVI latent representation
adata.obsm["X_scVI"] = model.get_latent_representation()

sc.pp.neighbors(
    adata,
    use_rep="X_scVI",   # Use scVI latent (not PCA)
    n_neighbors=30,
    n_pcs=None,         # n_pcs ignored when use_rep is set
)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

# Marker genes using raw counts or normalized expression
# Option A: scanpy Wilcoxon on log-normalized expression
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", n_genes=25)

# Option B: scVI probabilistic DE per cluster (more accurate)
markers = {}
for cluster in adata.obs["leiden"].unique():
    de = model.differential_expression(
        idx1=[adata.obs["leiden"] == cluster],
        mode="change", delta=0.25,
    )
    markers[cluster] = de[de["is_de_fdr_0.05"]].sort_values("lfc_mean", ascending=False)

# Visualization
sc.pl.umap(adata, color=["leiden", "batch", "cell_type"], ncols=3)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5, groupby="leiden")
```

## Key Concepts

### Unified API Pattern

All scvi-tools models follow the same 4-step workflow:

```
1. ModelClass.setup_anndata(adata, ...)   →  Register layers, batch keys, covariates
2. model = ModelClass(adata, ...)         →  Set architecture hyperparameters
3. model.train(max_epochs=..., ...)       →  Fit model (GPU auto-detected)
4. model.get_*()                          →  Extract results: latent, normalized, DE
```

### Model Selection Guide

| Data | Model | Core Feature | When to Use |
|------|-------|--------------|-------------|
| scRNA-seq | **scVI** | Batch correction, denoising | Default for any multi-batch scRNA-seq |
| scRNA-seq (partial labels) | **scANVI** | Cell type transfer | Have reference labels; want to annotate query |
| CITE-seq (RNA+protein) | **totalVI** | Joint RNA+protein | 10x CITE-seq, REAP-seq |
| Reference → query | **scARCHES** | Transfer learning | Map new data to existing atlas |
| Spatial | **DestVI** | Spot deconvolution | 10x Visium, Slide-seq |
| scRNA-seq QC | **Solo** | Doublet detection | Pre-analysis QC step |

### Raw Counts Requirement

scvi-tools models learn count distributions directly from raw data. Log-normalized input produces incorrect results. Always preserve raw counts before normalization:

```python
# Correct workflow: save counts, then normalize for scanpy steps
adata.layers["counts"] = adata.X.copy()   # Save raw before any transformation
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Then: scvi.model.SCVI.setup_anndata(adata, layer="counts", ...)
```

## Common Workflows

### Workflow 1: Multi-Batch scRNA-seq Integration

**Goal**: Integrate multiple scRNA-seq datasets, remove batch effects, identify cell clusters, and find marker genes.

```python
import scvi
import scanpy as sc

# Load and concatenate datasets
adata1 = sc.read_h5ad("dataset1.h5ad")
adata2 = sc.read_h5ad("dataset2.h5ad")
adata3 = sc.read_h5ad("dataset3.h5ad")
adata = sc.concat(
    [adata1, adata2, adata3],
    label="batch",
    keys=["dataset1", "dataset2", "dataset3"],
)

# Preprocessing: preserve raw counts, select HVGs per batch
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=4000,
    batch_key="batch",       # Select HVGs consistently across batches
    subset=True,
)
print(f"HVGs selected: {adata.n_vars}")

# scVI integration
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
model.train(max_epochs=300, early_stopping=True, early_stopping_patience=15)
print(f"Training history: {model.history['elbo_train'].tail()}")

# Batch-corrected analysis
adata.obsm["X_scVI"] = model.get_latent_representation()
adata.layers["scvi_norm"] = model.get_normalized_expression(n_samples=25, return_mean=True)

sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

# Verify batch mixing (batches should overlap on UMAP)
sc.pl.umap(adata, color=["batch", "leiden"], save="_batch_integration.png")
model.save("./scvi_model/", overwrite=True)
print(f"Clusters: {adata.obs['leiden'].nunique()}")
```

### Workflow 2: CITE-seq totalVI Pipeline

**Goal**: Model CITE-seq RNA + protein data jointly, obtain denoised protein estimates, and generate a joint embedding for annotation.

```python
import scvi
import scanpy as sc
import pandas as pd

# Load CITE-seq AnnData (RNA in X, protein counts in obsm)
adata = sc.read_h5ad("citeseq_data.h5ad")
# Expected structure:
#   adata.X or adata.layers["counts"] : RNA raw counts (n_cells, n_genes)
#   adata.obsm["protein_expression"]   : Protein raw counts (n_cells, n_proteins)

# Preprocessing
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=4000, batch_key="batch", subset=True)

# totalVI setup and training
scvi.model.TOTALVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    protein_expression_obsm_key="protein_expression",
)

model = scvi.model.TOTALVI(adata, latent_distribution="normal")
model.train(max_epochs=200, early_stopping=True)

# Extract joint embedding and denoised values
adata.obsm["X_totalVI"] = model.get_latent_representation()
rna_norm, protein_norm = model.get_normalized_expression(n_samples=25, return_mean=True)
foreground = model.get_protein_foreground_probability(n_samples=25, return_mean=True)

adata.layers["rna_denoised"] = rna_norm
protein_df = pd.DataFrame(
    protein_norm,
    index=adata.obs_names,
    columns=adata.uns["protein_names"],
)
protein_df.to_csv("denoised_protein_expression.csv")
print(f"Protein foreground: min={foreground.min():.3f}, max={foreground.max():.3f}")

# Joint clustering on RNA + protein latent
sc.pp.neighbors(adata, use_rep="X_totalVI", n_neighbors=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.8)

# Protein-guided annotation (use denoised protein for clean signal)
protein_markers = {"B cell": "CD19", "T cell": "CD3E", "Monocyte": "CD14"}
for cell_type, marker in protein_markers.items():
    if marker in protein_df.columns:
        adata.obs[f"denoised_{marker}"] = protein_df[marker].values
sc.pl.umap(adata, color=["leiden"] + [f"denoised_{m}" for m in protein_markers.values()])
```

## Key Parameters

| Parameter | Model / Function | Default | Range / Options | Effect |
|-----------|-----------------|---------|-----------------|--------|
| `n_latent` | All models | `10` | `10`–`50` | Latent space dimensionality; 20–30 typical for most datasets |
| `n_layers` | All models | `1` | `1`–`3` | Depth of encoder/decoder networks |
| `n_hidden` | All models | `128` | `64`–`256` | Width of each hidden layer |
| `gene_likelihood` | scVI, scANVI | `"zinb"` | `"zinb"`, `"nb"`, `"poisson"` | Count noise distribution; "nb" faster if sparsity is low |
| `max_epochs` | `train()` | `400` | `50`–`1000` | Training iterations (ELBO steps) |
| `early_stopping` | `train()` | `False` | `True`/`False` | Halt when validation ELBO plateaus |
| `early_stopping_patience` | `train()` | `45` | `5`–`100` | Epochs to wait before stopping |
| `batch_size` | `train()` | `128` | `64`–`512` | Mini-batch size; reduce if OOM |
| `lr` | `train()` | `1e-3` | `1e-4`–`1e-2` | Initial learning rate |
| `delta` | `differential_expression()` | `0.25` | `0.1`–`1.0` | Min |LFC| threshold for composite DE |
| `n_samples` | `get_normalized_expression()` | `1` | `1`–`100` | Posterior samples; ≥25 for stable estimates |

## Best Practices

1. **Always store raw counts before normalization**: Run `adata.layers["counts"] = adata.X.copy()` immediately after loading data, before any `normalize_total` or `log1p` call. Passing log-normalized data silently produces incorrect latent spaces.

2. **Select highly variable genes before training**: Use `sc.pp.highly_variable_genes(n_top_genes=2000–4000, batch_key="batch")` to reduce training time and model noise. Including all genes rarely improves results and significantly slows training.

3. **Register all known batch variables**: If data has multiple technical sources (sequencing plate, donor, protocol), include them as `batch_key` or `categorical_covariate_keys`. Unregistered batch effects contaminate the latent space and downstream DE.

4. **Use the scVI → scANVI two-step pipeline**: Initializing scANVI with `from_scvi_model()` after training scVI is faster, more stable, and produces better embeddings than training scANVI from scratch. Always train scVI first when using scANVI.

5. **Save trained models immediately**: `model.save("./model_dir/")` persists the full model. Reloading with `SCVI.load()` is seconds vs. minutes of retraining. Models are typically 10–50 MB.

6. **Use `n_latent=20–30` as starting point**: Values below 10 lose resolution; above 50 tend to overfit without added biological insight. Increase to 50 only for very heterogeneous atlases (>500k cells, many cell types).

7. **Filter low-confidence scANVI predictions**: Cells where `predict(soft=True).max(axis=1) < 0.7` are ambiguous; treat them as `"Unknown"` rather than accepting the argmax label.

## Common Recipes

### Recipe: Doublet Detection with Solo

When to use: Remove doublets before integration or DE to avoid spurious clusters.

```python
import scvi

scvi.model.SCVI.setup_anndata(adata, layer="counts")
vae = scvi.model.SCVI(adata, n_latent=20)
vae.train(max_epochs=100)

solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train(max_epochs=200)
doublet_preds = solo.predict()   # DataFrame: {"singlet": prob, "doublet": prob}
adata.obs["doublet_score"] = doublet_preds["doublet"].values
adata.obs["is_doublet"] = doublet_preds["prediction"].values

n_doublets = (adata.obs["is_doublet"] == "doublet").sum()
print(f"Detected {n_doublets} doublets ({n_doublets / adata.n_obs:.1%})")
adata = adata[adata.obs["is_doublet"] == "singlet"].copy()
print(f"Cells after doublet removal: {adata.n_obs}")
```

### Recipe: Spatial Deconvolution with DestVI

When to use: Estimate cell type proportions in spatial transcriptomics spots using a matched scRNA-seq reference.

```python
import scvi

# Step 1: Train CondSCVI on reference single-cell data
scvi.model.CondSCVI.setup_anndata(sc_adata, layer="counts", labels_key="cell_type")
sc_model = scvi.model.CondSCVI(sc_adata, weight_obs=False)
sc_model.train(max_epochs=200)

# Step 2: Deconvolve spatial spots
scvi.model.DestVI.setup_anndata(st_adata, layer="counts")
st_model = scvi.model.DestVI.from_rna_model(st_adata, sc_model)
st_model.train(max_epochs=2500)

proportions = st_model.get_proportions()  # DataFrame (n_spots, n_cell_types)
st_adata.obsm["cell_type_proportions"] = proportions.values
print(f"Proportions per spot: {proportions.shape}")
print(proportions.head())
```

### Recipe: Extract Denoised Expression for Imputation

When to use: Obtain smooth, denoised expression values for visualization or downstream machine learning when raw sparse counts are too noisy.

```python
import scvi

scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=200, early_stopping=True)

# n_samples >= 25 gives stable posterior mean; increase to 100 for publication
denoised = model.get_normalized_expression(
    n_samples=50,
    library_size="latent",  # "latent" uses learned library size; int for fixed normalization
    return_mean=True,
)
adata.layers["denoised"] = denoised
print(f"Denoised layer added: {denoised.shape}")
# Use adata.layers["denoised"] for heatmaps, pseudotime, or ML features
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ValueError: adata must contain raw counts` | Log-normalized data passed instead of raw counts | Save raw before normalizing: `adata.layers["counts"] = adata.X.copy()` then `setup_anndata(layer="counts")` |
| Training loss oscillates without decreasing | Learning rate too high or very heterogeneous data | Try `lr=1e-4`; check that `adata.X` and the registered layer are not sparse with negatives |
| CUDA out of memory | Dataset too large for GPU VRAM | Reduce `batch_size=64`, `n_hidden=64`; subset to fewer HVGs; use CPU for small datasets |
| Poor batch integration (batches separate on UMAP) | Batch key not registered, or too few epochs | Verify `batch_key` column exists in `adata.obs`; increase `max_epochs`; add `early_stopping=True` |
| scANVI predicts same label for all cells | Too few labeled cells per type or learning rate issue | Need ≥50 labeled cells per type; use `from_scvi_model()` workflow; check `unlabeled_category` matches exactly |
| `load()` fails with `KeyError` or shape mismatch | AnnData `var_names` differ from training data | Ensure query `adata.var_names` exactly matches training data; do not subset genes after training |
| Slow training on CPU for large datasets | Large dataset without GPU acceleration | Install `scvi-tools[cuda12]`; pass `accelerator="gpu"` to `train()`; or subsample to 50k cells for prototyping |
| `scvi.model.SCANVI.load_query_data` shape error | Query and reference have different gene sets | Align genes: `query_adata = query_adata[:, ref_adata.var_names]` before calling `load_query_data` |

## Related Skills

- **scanpy-scrna-seq** — standard scRNA-seq QC, clustering, marker genes; use scVI latent as input to `sc.pp.neighbors(use_rep="X_scVI")`
- **anndata-data-structure** — create, subset, concatenate, and persist AnnData objects required by scvi-tools
- **harmony-batch-correction** — fast linear batch correction alternative; use when deep learning overhead is not justified
- **cellxgene-census** — download reference atlases for scANVI transfer learning and scARCHES query-to-reference mapping
- **muon-multiomics-singlecell** — multi-modal single-cell analysis with MuData; complement to totalVI for Multiome workflows

## References

- [scvi-tools documentation](https://docs.scvi-tools.org/) — official API reference, tutorials, and model guides
- [scvi-tools GitHub](https://github.com/scverse/scvi-tools) — source code, issue tracker, and changelog
- Lopez et al. (2018) "Deep generative modeling for single-cell transcriptomics" — [Nature Methods 15:1053–1058](https://doi.org/10.1038/s41592-018-0229-2) — original scVI paper
- Xu et al. (2021) "Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models" — [Molecular Systems Biology 17:e9620](https://doi.org/10.15252/msb.20209620) — scANVI paper
- Gayoso et al. (2022) "A Python library for probabilistic analysis of single-cell omics data" — [Nature Biotechnology 40:163–166](https://doi.org/10.1038/s41587-021-01206-w) — scvi-tools library paper
