--- name: bio-workflows-scrnaseq-pipeline description: End-to-end single-cell RNA-seq workflow from 10X Genomics data to annotated cell types. Covers QC, normalization, clustering, marker detection, and cell type annotation. Use when analyzing single-cell RNA-seq data. tool_type: mixed primary_tool: Seurat workflow: true depends_on: - single-cell/data-io - single-cell/preprocessing - single-cell/doublet-detection - single-cell/clustering - single-cell/markers-annotation qc_checkpoints: - after_loading: "Expected cell count, reasonable UMI distribution" - after_qc: "Remove low-quality cells and doublets" - after_normalization: "No batch effects, HVGs look sensible" - after_clustering: "Clusters are biologically meaningful" --- ## Version Compatibility Reference examples tested with: Cell Ranger 8.0+, ggplot2 3.5+, numpy 1.26+, scanpy 1.10+ Before using code patterns, verify installed versions match. If versions differ: - Python: `pip show ` then `help(module.function)` to check signatures - R: `packageVersion('')` then `?function_name` to verify parameters If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. # Single-Cell RNA-seq Pipeline **"Analyze my single-cell RNA-seq data from counts to cell types"** -> Orchestrate QC filtering, normalization (scanpy/Seurat), batch integration (scVI/Harmony), clustering, marker detection, cell type annotation, and trajectory inference. Complete workflow from 10X Genomics Cell Ranger output to annotated cell types. ## Workflow Overview ``` 10X data (filtered_feature_bc_matrix) | v [1. Load Data] ---------> Read10X / read_10x_h5 | v [2. QC Filtering] ------> nFeature, percent.mt, doublets | v [3. Normalization] -----> SCTransform or LogNormalize | v [4. HVG Selection] -----> FindVariableFeatures | v [5. Dim Reduction] -----> PCA -> UMAP | v [6. Clustering] --------> FindNeighbors -> FindClusters | v [7. Markers] -----------> FindAllMarkers | v [8. Annotation] --------> Manual or automated | v Annotated Seurat/AnnData object ``` ## Primary Path: Seurat (R) ### Step 1: Load 10X Data ```r library(Seurat) library(ggplot2) library(dplyr) # Load from Cell Ranger output data_dir <- 'cellranger_output/filtered_feature_bc_matrix' counts <- Read10X(data.dir = data_dir) # Create Seurat object seurat_obj <- CreateSeuratObject(counts = counts, project = 'my_project', min.cells = 3, min.features = 200) ``` ### Step 2: Quality Control ```r # Calculate QC metrics seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-') seurat_obj[['percent.ribo']] <- PercentageFeatureSet(seurat_obj, pattern = '^RP[SL]') # Visualize QC metrics VlnPlot(seurat_obj, features = c('nFeature_RNA', 'nCount_RNA', 'percent.mt'), ncol = 3) # Filter cells seurat_obj <- subset(seurat_obj, nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20 & nCount_RNA > 500) cat('Cells after QC:', ncol(seurat_obj), '\n') ``` **QC Checkpoint 1:** Review QC plots - Remove cells with very low/high gene counts - Remove cells with high mitochondrial content (dying cells) ### Step 3: Doublet Detection ```r library(scDblFinder) # Convert to SCE for scDblFinder sce <- as.SingleCellExperiment(seurat_obj) sce <- scDblFinder(sce) # Add back to Seurat seurat_obj$doublet_class <- sce$scDblFinder.class seurat_obj$doublet_score <- sce$scDblFinder.score # Remove doublets seurat_obj <- subset(seurat_obj, doublet_class == 'singlet') cat('Cells after doublet removal:', ncol(seurat_obj), '\n') ``` ### Step 4: Normalization with SCTransform ```r # SCTransform (recommended for most analyses) seurat_obj <- SCTransform(seurat_obj, vars.to.regress = 'percent.mt', verbose = FALSE) ``` Alternative: Standard normalization ```r seurat_obj <- NormalizeData(seurat_obj) seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = 'vst', nfeatures = 2000) seurat_obj <- ScaleData(seurat_obj, vars.to.regress = 'percent.mt') ``` ### Step 5: Dimensionality Reduction ```r # PCA seurat_obj <- RunPCA(seurat_obj, npcs = 50, verbose = FALSE) # Determine optimal PCs ElbowPlot(seurat_obj, ndims = 50) # UMAP n_pcs <- 30 # Choose based on elbow plot seurat_obj <- RunUMAP(seurat_obj, dims = 1:n_pcs, verbose = FALSE) ``` ### Step 6: Clustering ```r # Find neighbors seurat_obj <- FindNeighbors(seurat_obj, dims = 1:n_pcs, verbose = FALSE) # Find clusters (try multiple resolutions) seurat_obj <- FindClusters(seurat_obj, resolution = c(0.2, 0.4, 0.6, 0.8, 1.0), verbose = FALSE) # Visualize DimPlot(seurat_obj, reduction = 'umap', group.by = 'SCT_snn_res.0.4', label = TRUE) ``` **QC Checkpoint 2:** Assess clustering - Clusters should be visually separable on UMAP - Resolution 0.4-0.8 is often appropriate ### Step 7: Find Marker Genes ```r # Set identity to chosen resolution Idents(seurat_obj) <- 'SCT_snn_res.0.4' # Find markers for all clusters markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # Top markers per cluster top_markers <- markers %>% group_by(cluster) %>% slice_max(n = 10, order_by = avg_log2FC) # Visualize top markers DoHeatmap(seurat_obj, features = top_markers$gene) + NoLegend() ``` ### Step 8: Cell Type Annotation ```r # Manual annotation based on known markers # Example for PBMC data: cluster_annotations <- c( '0' = 'CD4 T cells', '1' = 'CD14 Monocytes', '2' = 'B cells', '3' = 'CD8 T cells', '4' = 'NK cells', '5' = 'CD16 Monocytes', '6' = 'Dendritic cells' ) seurat_obj$cell_type <- cluster_annotations[as.character(Idents(seurat_obj))] # Final UMAP DimPlot(seurat_obj, reduction = 'umap', group.by = 'cell_type', label = TRUE) # Save object saveRDS(seurat_obj, 'seurat_annotated.rds') ``` ## Alternative Path: Scanpy (Python) ```python import scanpy as sc import numpy as np # Load 10X data adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5') adata.var_names_make_unique() # QC metrics adata.var['mt'] = adata.var_names.str.startswith('MT-') sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) # Filter sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_genes(adata, min_cells=3) adata = adata[adata.obs.n_genes_by_counts < 5000, :] adata = adata[adata.obs.pct_counts_mt < 20, :] # Doublet detection sc.pp.scrublet(adata) adata = adata[~adata.obs['predicted_doublet'], :] # Normalize and HVGs sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=2000) # PCA, neighbors, UMAP sc.pp.scale(adata, max_value=10) sc.tl.pca(adata, n_comps=50) sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30) sc.tl.umap(adata) # Clustering sc.tl.leiden(adata, resolution=0.5) # Markers sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon') sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False) # Save adata.write('scanpy_annotated.h5ad') ``` ## Parameter Recommendations | Step | Parameter | Recommendation | |------|-----------|----------------| | QC | min.features | 200-500 | | QC | max.features | 2500-5000 (depends on data) | | QC | percent.mt | <10-20% | | SCTransform | vars.to.regress | percent.mt | | PCA | npcs | 30-50 | | UMAP | dims | 15-30 (check elbow plot) | | Clustering | resolution | 0.4-0.8 (start with 0.5) | ## Troubleshooting | Issue | Likely Cause | Solution | |-------|--------------|----------| | All cells filtered | QC too strict | Relax thresholds | | Poor UMAP separation | Too few HVGs or PCs | Increase nfeatures, check n_pcs | | Too many/few clusters | Wrong resolution | Adjust resolution parameter | | Unknown cell types | Missing markers | Check known marker genes manually | ## Complete R Workflow ```r library(Seurat) library(scDblFinder) library(ggplot2) library(dplyr) # Configuration data_dir <- 'filtered_feature_bc_matrix' output_dir <- 'results' dir.create(output_dir, showWarnings = FALSE) # Load counts <- Read10X(data.dir = data_dir) seurat_obj <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200) cat('Initial cells:', ncol(seurat_obj), '\n') # QC seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-') seurat_obj <- subset(seurat_obj, nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20) cat('After QC:', ncol(seurat_obj), '\n') # Doublets sce <- as.SingleCellExperiment(seurat_obj) sce <- scDblFinder(sce) seurat_obj$doublet <- sce$scDblFinder.class seurat_obj <- subset(seurat_obj, doublet == 'singlet') cat('After doublet removal:', ncol(seurat_obj), '\n') # Normalize seurat_obj <- SCTransform(seurat_obj, vars.to.regress = 'percent.mt', verbose = FALSE) # Dimension reduction seurat_obj <- RunPCA(seurat_obj, npcs = 50, verbose = FALSE) seurat_obj <- RunUMAP(seurat_obj, dims = 1:30, verbose = FALSE) # Cluster seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30, verbose = FALSE) seurat_obj <- FindClusters(seurat_obj, resolution = 0.5, verbose = FALSE) # Markers markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) write.csv(markers, file.path(output_dir, 'markers.csv')) # Save saveRDS(seurat_obj, file.path(output_dir, 'seurat_object.rds')) # Plots pdf(file.path(output_dir, 'umap.pdf'), width = 10, height = 8) DimPlot(seurat_obj, reduction = 'umap', label = TRUE) dev.off() cat('Pipeline complete. Object saved to:', output_dir, '\n') ``` ## Related Skills - database-access/geo-data - Resolve GSE to SRA; detect SuperSeries before processing - database-access/sra-data - Download 10x records with --include-technical for barcodes/UMIs - single-cell/data-io - Loading different formats - single-cell/preprocessing - QC details - single-cell/doublet-detection - Doublet methods comparison - single-cell/clustering - Clustering parameters - single-cell/markers-annotation - Annotation strategies - single-cell/multimodal-integration - CITE-seq, multiome