--- name: bio-workflows-cytometry-pipeline description: End-to-end flow, spectral, and mass cytometry (CyTOF) pipeline from raw FCS files to differentially abundant/expressed cell populations. Orchestrates the read -> compensate/unmix -> transform -> QC -> doublet-removal -> cluster-or-gate -> annotate -> diffcyt DA/DS chain with flowCore/CATALYST/diffcyt, branching on instrument type and on clustering-vs-gating. Use when processing a cytometry experiment end-to-end, deciding the pipeline path for an instrument, or wiring the flow-cytometry component skills into one analysis with valid sample-level statistics. tool_type: r primary_tool: CATALYST workflow: true depends_on: - flow-cytometry/fcs-handling - flow-cytometry/compensation-transformation - flow-cytometry/cytometry-qc - flow-cytometry/doublet-detection - flow-cytometry/bead-normalization - flow-cytometry/gating-analysis - flow-cytometry/clustering-phenotyping - flow-cytometry/differential-analysis --- ## Version Compatibility Reference examples tested with: CATALYST 1.26+, diffcyt 1.22+, FlowSOM 2.10+, flowCore 2.14+, flowWorkspace 4.14+, flowStats 4.14+, edgeR 4.0+, limma 3.58+, ggplot2 3.5+; Python (partial alt) flowkit 1.1+. Before using code patterns, verify installed versions match. If versions differ: - R: `packageVersion('')` then `?function_name` to verify parameters - Python: `pip show ` then `help(module.function)` to check signatures If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt rather than retrying. Each stage defers depth to its component skill. # Flow Cytometry Pipeline **"Process my cytometry data from FCS to differential populations"** -> read raw -> compensate/unmix -> transform -> QC -> remove doublets -> cluster (or gate) -> annotate -> test DA/DS, with the sample as the unit of inference. - R: `flowCore` + `CATALYST::prepData/cluster/runDR` + `diffcyt::diffcyt()` ## The Single Most Important Modern Insight -- A Pipeline Is a Chain of Irreversible Decisions, and the Unit of Inference Is the Sample Each early choice silently gates the validity of the final test: reading raw (not log-linearized), compensating BEFORE transforming, removing margin events before density QC, assigning type-vs-state markers correctly, and removing doublets before clustering. None of these is recoverable downstream - a doublet clustered as a "double-positive," a state marker used for clustering, or an uncompensated channel becomes a false population that the differential test then "confirms." The second load-bearing thread is that the SAMPLE/subject, not the cell, is the experimental unit: diffcyt aggregates cells to per-sample-per-cluster counts (DA) and medians (DS) before testing, so biological replication (>= 2-3 per group) is mandatory and a per-cell test is invalid. Two normalization layers sit at OPPOSITE ends of the pipeline - EQ-bead drift correction on raw counts at the very front (CyTOF), and CytoNorm cross-batch harmonization after clustering - and conflating them is a classic error. ## Decision Tree: Which Path | Situation | Path | Why | |-----------|------|-----| | Conventional fluorescence flow | compensate (`$SPILLOVER`/flowStats) -> logicle -> ... | optical spillover; logicle handles negatives | | Spectral cytometer (Aurora/ID7000) | UNMIX (not compensate) -> arcsinh ~150 | overdetermined system; fluorescence-scale | | Mass cytometry (CyTOF) | EQ-bead normalize (raw) -> arcsinh cofactor 5 -> `compCytof` if needed | metals barely spill (~1-4%); drift correction first | | High-dim discovery, no prior gates | cluster (FlowSOM via CATALYST) | scales; finds unexpected populations | | Well-defined populations / rare events (MRD) | hierarchical gating (openCyto) | interpretable; clustering fails for ultra-rare | | Multi-batch / multi-day | anchor sample per batch -> CytoNorm (after clustering) | model batch in the design for inference | ## Pipeline Overview ``` FCS -> compensate/unmix -> transform -> QC (margins, time, dead) -> doublets -> [ cluster (FlowSOM) | gate (openCyto) ] -> annotate -> diffcyt DA/DS -> report EQ-bead drift normalization (CyTOF) runs on raw counts BEFORE everything; CytoNorm runs LAST. ``` ## 1. Panel, Metadata, and Load **Goal:** Define the type/state panel and sample metadata, then load FCS. **Approach:** Panel `marker_class` drives everything downstream (type clusters, state is tested); metadata keys samples to condition/subject. See flow-cytometry/fcs-handling. ```r library(CATALYST); library(diffcyt); library(flowCore); library(ggplot2) panel <- data.frame( fcs_colname = c('FSC-A','SSC-A','CD45','CD3','CD4','CD8','CD19','CD14','Ki67','IFNg'), antigen = c('FSC','SSC','CD45','CD3','CD4','CD8','CD19','CD14','Ki67','IFNg'), marker_class = c('none','none','type','type','type','type','type','type','state','state')) md <- data.frame(file_name = list.files('data', pattern = '\\.fcs$'), sample_id = paste0('S', 1:8), condition = rep(c('Control','Treatment'), each = 4), patient_id = rep(paste0('P', 1:4), 2)) fs <- read.flowSet(file.path('data', md$file_name), transformation = FALSE, truncate_max_range = FALSE) ``` ## 2. Compensate / Unmix, then Transform **Goal:** Remove spillover on linear data, then variance-stabilize. **Approach:** Conventional flow compensates (matrix before transform); CyTOF skips fluorescence compensation and uses cofactor 5; spectral unmixes then uses ~150. See flow-cytometry/compensation-transformation. ```r fs_comp <- compensate(fs, spillover(fs[[1]])[[1]]) # conventional flow; CyTOF: omit or use compCytof COFACTOR <- 150 # 5 for CyTOF, ~150 for fluorescence/spectral sce <- prepData(fs_comp, panel, md, transform = TRUE, cofactor = COFACTOR, FACS = TRUE) ``` ## 3. QC (order matters) **Goal:** Remove margin/boundary events and time anomalies before any density step. **Approach:** Margins first, then time-based cleaning; on CyTOF, EQ-bead drift correction happens upstream on raw counts. See flow-cytometry/cytometry-qc and flow-cytometry/bead-normalization. ```r # per-sample sanity + sample-similarity MDS (flag outlier samples) plotExprs(sce, color_by = 'condition'); plotMDS(sce, color_by = 'condition') # event-level cleaning runs per-FCS upstream: PeacoQC::RemoveMargins() -> PeacoQC()/flowAI on transformed data ``` ## 4. Remove Doublets **Goal:** Drop aggregates before clustering so they don't form phantom double-positives. **Approach:** Flow uses the FSC-A vs FSC-H diagonal; CyTOF uses DNA intercalator + Gaussian/Event_length. See flow-cytometry/doublet-detection. ```r # CyTOF (FACS=TRUE retained Event_length on the arcsinh scale): e <- assay(sce, 'exprs') if (all(c('DNA1','Event_length') %in% rownames(sce))) { keep <- e['DNA1', ] > quantile(e['DNA1', ], 0.05) & e['Event_length', ] <= quantile(e['Event_length', ], 0.99) sce <- sce[, keep] } ``` ## 5. Cluster (FlowSOM) or Gate **Goal:** Define populations by unsupervised clustering on TYPE markers (discovery) or hierarchical gating (defined/rare). **Approach:** `cluster()` wraps FlowSOM+ConsensusClusterPlus; over-provision the grid, set a seed. See flow-cytometry/clustering-phenotyping (clustering) and flow-cytometry/gating-analysis (gating). ```r sce <- cluster(sce, features = 'type', xdim = 10, ydim = 10, maxK = 20, seed = 42) ``` ## 6. Annotate and Visualize Structure **Goal:** Label metaclusters from marker medians; embed for display only. **Approach:** Median heatmap drives annotation; UMAP colors by cluster but is never used to define or quantify populations. ```r plotExprHeatmap(sce, features = 'type', by = 'cluster_id', k = 'meta20', scale = 'last') sce <- runDR(sce, dr = 'UMAP', features = 'type', cells = 2000) plotDR(sce, dr = 'UMAP', color_by = 'meta20') ``` ## 7. Differential Abundance and State **Goal:** Test which populations change in frequency (DA) or state-marker expression (DS) between conditions. **Approach:** The `diffcyt()` wrapper aggregates to the sample level; results live in `res$res`. See flow-cytometry/differential-analysis. ```r design <- createDesignMatrix(ei(sce), cols_design = 'condition') contrast <- createContrast(c(0, 1)) # Treatment vs Control res_DA <- diffcyt(sce, clustering_to_use = 'meta20', analysis_type = 'DA', method_DA = 'diffcyt-DA-edgeR', design = design, contrast = contrast) res_DS <- diffcyt(sce, clustering_to_use = 'meta20', analysis_type = 'DS', method_DS = 'diffcyt-DS-limma', design = design, contrast = contrast) da <- as.data.frame(SummarizedExperiment::rowData(res_DA$res)) # cluster_id, logFC, p_val, p_adj ``` ## 8. Visualize Results and Export **Goal:** Summarize significant populations and persist results. **Approach:** Pass the inner result object (`res$res`) to plotting; export tables and the SCE. ```r plotDiffHeatmap(sce, res_DA$res, all = TRUE, fdr = 0.05) plotAbundances(sce, k = 'meta20', by = 'cluster_id', group_by = 'condition') write.csv(da, 'da_results.csv', row.names = FALSE); saveRDS(sce, 'cytometry_analysis.rds') ``` ## Paired / Repeated-Measures Variant **Goal:** Account for within-subject correlation (pre/post on the same donor). **Approach:** Use a GLMM with a random effect for subject (NOT voom, which is fixed-effects only). ```r formula <- createFormula(ei(sce), cols_fixed = 'condition', cols_random = 'patient_id') res_DA <- diffcyt(sce, clustering_to_use = 'meta20', analysis_type = 'DA', method_DA = 'diffcyt-DA-GLMM', formula = formula, contrast = createContrast(c(0, 1))) ``` ## Manual Gating Path (alternative to clustering) **Goal:** Define populations by a reproducible hierarchy when they are well-defined or rare. **Approach:** Build a GatingSet on transformed data; recompute after adding gates. See flow-cytometry/gating-analysis. ```r library(flowWorkspace) tl <- estimateLogicle(fs_comp[[1]], colnames(spillover(fs[[1]])[[1]])) gs <- GatingSet(transform(fs_comp, tl)) # add openCyto template or manual gates (time -> debris -> singlets -> live -> lineage), then: recompute(gs); gs_pop_get_stats(gs, type = 'count') ``` ## Python Alternative (FlowKit) -- partial **Goal:** Read, compensate, and gate in Python where an R pipeline is not an option. **Approach:** FlowKit covers IO/compensation/GatingML; there is NO Python equivalent for diffcyt DA/DS, so the differential step stays in R (or bridge via readfcs -> AnnData -> scanpy for clustering only). ```python import flowkit as fk sample = fk.Sample('sample.fcs') sample.apply_compensation(sample.metadata['spill']) # use FlowKit's API, not a hand-rolled matrix inverse df = sample.as_dataframe(source='comp') ``` ## Per-Stage Failure Modes ### Per-cell pseudoreplication **Trigger:** testing across all cells. **Mechanism:** cells are not independent replicates. **Symptom:** p ~ 1e-40 from few subjects. **Fix:** diffcyt aggregates to sample level; require >= 2-3 replicates/group. ### Clustering on state markers **Trigger:** activation/phospho markers in `features`. **Mechanism:** state contaminates lineage identity. **Symptom:** activated/resting splits of one type. **Fix:** cluster on `type`; test state in DS. ### Doublets / wrong cofactor / uncompensated input **Trigger:** skipping doublet removal, cofactor 5 on fluorescence, or clustering raw data. **Mechanism:** phantom double-positives, compressed dim markers, spillover-dominated distances. **Symptom:** non-reproducible "novel" populations. **Fix:** remove doublets first; cofactor 5 (CyTOF) / 150 (fluorescence); compensate+transform before clustering. ### Batch cleaned instead of modeled **Trigger:** CytoNorm-ing then testing naively, or batch confounded with condition. **Mechanism:** over-correction / non-identifiability. **Symptom:** attenuated or fabricated effects. **Fix:** model batch in the design; if batch == condition, no rescue. ## Quantitative Thresholds | Threshold | Source | Rationale | |-----------|--------|-----------| | arcsinh cofactor 5 (CyTOF) / ~150 (fluorescence) | Nowicka 2017 *F1000Res* 6:748 | matches platform noise scale | | >= 2-3 biological replicates per group | Weber 2019 *Commun Biol* 2:183 | minimum for a valid DA/DS error term | | > ~10K cells per sample | community | stable per-sample cluster frequencies | | 10-30 metaclusters typical (maxK=20 default) | Weber & Robinson 2016 *Cytometry A* 89:1084 | over-provision then merge | | BH FDR across clusters (and clusters x markers for DS) | diffcyt | many simultaneous tests | ## Common Errors | Error / symptom | Cause | Solution | |-----------------|-------|----------| | `testDA_edgeR(sce, ...)` not found / wrong | fabricated signature | use the `diffcyt()` wrapper; results in `res$res` | | `compensate()` errors on a list | `spillover(ff)` returns a list | index `[[1]]` | | empty DS results | state markers not flagged | set `marker_class='state'` in the panel | | paired design ignored | used fixed-effect method | `diffcyt-DA-GLMM` with a random effect | ## References - Weber 2019 *Commun Biol* 2:183 — diffcyt DA/DS framework. - Nowicka 2017 *F1000Research* 6:748 — CATALYST CyTOF workflow; type/state, cofactor 5. - Weber & Robinson 2016 *Cytometry A* 89(12):1084-1096 — FlowSOM clustering benchmark. - Van Gassen 2020 *Cytometry A* 97(3):268-278 — CytoNorm cross-batch normalization. - Hurlbert 1984 *Ecol Monogr* 54(2):187-211 — pseudoreplication (sample is the unit). ## Related Skills - flow-cytometry/fcs-handling - Read FCS and map channels - flow-cytometry/compensation-transformation - Compensate/unmix and transform - flow-cytometry/cytometry-qc - Time/margin/dead-cell QC - flow-cytometry/doublet-detection - Singlet discrimination - flow-cytometry/bead-normalization - EQ-bead drift and CytoNorm batch correction - flow-cytometry/gating-analysis - Hierarchical/automated gating path - flow-cytometry/clustering-phenotyping - FlowSOM clustering and annotation - flow-cytometry/differential-analysis - diffcyt DA/DS testing - single-cell/clustering - Related graph-clustering for scRNA-seq