--- name: bio-epitranscriptomics-merip-preprocessing description: "Aligns and QCs methylated-RNA-immunoprecipitation (MeRIP / m6A-seq) IP and input libraries using STAR or HISAT2 splice-aware mapping, samtools sort/index, IP/input matched-pair tracking, antibody-lot metadata recording, replicate concordance via deepTools multiBamSummary + plotCorrelation, IP enrichment QC via plotFingerprint and per-transcript IP/input ratio distributions, library-complexity saturation curves via PreSeq c_curve / lc_extrap, and the explicit do-NOT-deduplicate convention for standard non-UMI MeRIP." tool_type: cli primary_tool: STAR --- ## Version Compatibility Reference examples tested with: STAR 2.7.11+, HISAT2 2.2.1+, samtools 1.19+, deepTools 3.5+, PreSeq 3.2+, fastp 0.23+, Trim Galore 0.6.10+, Picard 3.1+, MultiQC 1.25+, bowtie2 2.5+, BWA-MEM2 2.2+. Before using code patterns, verify installed versions match. If versions differ: - CLI: ` --version` then ` --help` to confirm flags - Python: `pip show ` then `help(module.function)` If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. STAR `--outSAMtype` accepts `BAM SortedByCoordinate` since 2.5.x; check the `Log.final.out` file for input/output statistics. deepTools `bamCompare --operation log2` is the modern syntax (older `--ratio log2` still works but is being phased out). PreSeq `c_curve` and `lc_extrap` have stable interfaces; HISAT2 reports unique vs multi-mapped in the summary log. # MeRIP-seq Preprocessing **"Get my MeRIP IP and input libraries ready for peak calling"** -> Trim adapters with MeRIP-appropriate defaults (do NOT trim UMIs unless the library is UMI-MeRIP — most are not), splice-aware-align IP and input to the GENOME (not transcriptome) with STAR / HISAT2, sort and index, evaluate replicate concordance and IP enrichment with deepTools, build a saturation curve per library so peak counts can be honestly compared across libraries, record antibody clone and lot metadata so cross-batch comparison is later auditable, and produce IP-over-Input log2 bigWig tracks for downstream visualisation. Crucially, do NOT deduplicate non-UMI MeRIP — see the failure-modes section. - CLI: `STAR --runMode alignReads` -- splice-aware genome alignment, the field default - CLI: `hisat2 -x index -1 R1.fq.gz -2 R2.fq.gz` -- graph-based alternative; lighter memory footprint - CLI: `samtools sort -@ 8 -o sorted.bam in.bam && samtools index sorted.bam` -- post-alignment mechanics - CLI: `deeptools multiBamSummary bins -b *.bam -o cov.npz` + `plotCorrelation` -- replicate Spearman - CLI: `deeptools plotFingerprint -b IP.bam Input.bam -o fp.pdf` -- IP enrichment QC (ChIP-seq term; transfers cleanly) - CLI: `preseq lc_extrap -B -o curve.txt sorted.bam` -- library complexity / saturation - CLI: `deeptools bamCompare -b1 IP.bam -b2 Input.bam --operation log2 -o IP_over_Input.bw` -- downstream-ready coverage track ## The Single Most Important Modern Insight -- Peak counts are library-size-dependent; saturation curves are the only honest cross-library comparison A MeRIP library sequenced to 20 million unique reads finds substantially fewer peaks than the same biology at 60 million reads. Per-sample peak counts reported without saturation curves (PreSeq `c_curve` / `lc_extrap`; Daley & Smith 2013 *Nat Methods* 10:325) are uninterpretable across studies and often across replicates within a study. Subsample BAMs to a common unique-read depth before peak calling for any cross-condition peak-count comparison, OR report peaks alongside the saturation curve. A corollary: **do NOT deduplicate standard MeRIP** — the typical MeRIP protocol (Synaptic Systems 202-003 / Abcam ab151230 / NEB EpiMark E1610 antibody pull-down on fragmented poly(A)-selected RNA) has NO unique molecular identifiers, and `picard MarkDuplicates` on such libraries collapses real biological replicates of high-coverage transcripts (the opposite of what dedup achieves in DNA ChIP-seq). UMI-MeRIP is the only exception — and most MeRIP libraries in print are NOT UMI. McIntyre et al. 2020 *Sci Rep* 10:6590 demonstrated that replicate-to-replicate peak overlap is ~80% within a single lab but drops to a median 45% between labs using nominally identical conditions; this irreducible technical noise constrains how strongly any single MeRIP study can support biological claims, and the preprocessing pipeline is where the variance is set. ## Algorithmic Taxonomy | Tool / step | Mechanism | Output | Strength | Fails when | |-------------|-----------|--------|----------|------------| | STAR 2.7+ (Dobin 2013 *Bioinformatics* 29:15) | Two-pass splice-aware alignment with on-the-fly splice junction database | Sorted BAM + splice-junction TSV | Field default; multi-mapper retention configurable; STAR splice-junction DB | Memory-heavy (~30 GB human); slower than HISAT2 | | HISAT2 2.2.1+ (Kim 2019 *Nat Biotechnol* 37:907) | Hierarchical graph FM-index; splice-aware | Sorted BAM | ~5x lighter memory than STAR; comparable accuracy | Less mature splice-junction handling for novel introns | | BWA-MEM2 (Vasimuddin 2019 *IPDPS* 314) | DNA-style local alignment; NO splice awareness | Sorted BAM | Use ONLY for transcriptome-aligned MeRIP (rare) | Splits reads across exon junctions if used on genome | | fastp 0.23+ (Chen 2018 *Bioinformatics* 34:i884) | Streaming adapter detection + quality trim | Trimmed FASTQ + JSON QC | Fast; JSON-readable QC output | UMI handling disabled by default; do NOT pass `--umi` for standard non-UMI MeRIP (the opposite of the failure direction in some other library types) | | Trim Galore | Wrapper over cutadapt with paired-end auto-detect | Trimmed FASTQ | Conservative defaults; widely cited | Slower than fastp on large datasets | | samtools sort / index | BAM coordinate sort + .bai index | Sorted BAM + index | Standard | None at default | | Picard MarkDuplicates | Identifies PCR duplicates by 5' alignment start | Marked / removed BAM | Standard in DNA / ChIP | The dominant MeRIP convention is to SKIP dedup for non-UMI libraries (collapses real biology at high-coverage transcripts); a minority of pipelines dedup MeRIP — record the choice in metadata | | deepTools multiBamSummary + plotCorrelation | Per-bin read counts; Spearman / Pearson matrix | Heatmap + clustering | Standard replicate-concordance plot | Bin size sensitive (use 10 kb for transcriptome-genome) | | deepTools plotFingerprint (Diaz 2012 *Stat Appl Genet Mol Biol* 11:9) | Cumulative read-fraction vs cumulative-bin-fraction Lorenz curve | PDF + raw counts | Direct IP-vs-input enrichment QC; "good" IP has steep tail | A flat fingerprint = failed IP (or mock IgG) | | deepTools bamCompare --operation log2 | Per-bin log2 (IP/input) bigWig | bigWig | Ready for downstream visualisation | Pseudocount choice matters at low-coverage bins | | PreSeq c_curve / lc_extrap (Daley & Smith 2013 *Nat Methods* 10:325) | Capture-recapture; rational-function extrapolation | Curve TSV | The only honest library-complexity estimate | Requires uniquely-mapped reads to be reliable | | MultiQC (Ewels 2016 *Bioinformatics* 32:3047) | Aggregator across tools | HTML report | Consolidates STAR + HISAT2 + samtools + deepTools + PreSeq into one report | None | ## Decision Tree by Scenario | Scenario | Recommended | Why wrong choices fail | |----------|-------------|------------------------| | Standard mammalian MeRIP, downstream exomePeak2 | STAR splice-aware -> genome BAM; do NOT deduplicate; build saturation curve | Transcriptome alignment breaks exomePeak2 (expects genome BAM + GTF); dedup collapses biology | | Downstream m6Anet (ONT direct RNA) | Defer to `m6anet-analysis` -- alignment is to TRANSCRIPTOME with minimap2 `-ax map-ont -uf -k14` | Genome-aligned ONT input breaks m6Anet entirely (signal-level dataprep requires per-transcript coordinates) | | Limited memory (<16 GB) | HISAT2 instead of STAR | STAR human genome index requires ~30 GB | | UMI-MeRIP (rare) | Trim UMI to read header (umi_tools / fastp `--umi_loc`), align, then dedup with `umi_tools dedup` | Standard `picard MarkDuplicates` ignores UMI; effective dedup rate wrong | | Cross-batch comparison (different antibody lots) | Record antibody clone + lot in sample-sheet metadata; include `batch` factor in downstream design | Pooling cross-batch counts without batch term inflates false-positive differential peaks | | Spike-in normalisation (NEB EpiMark control oligos) | Align separately to spike-in reference; report IP-spike-in / Input-spike-in ratio per sample; use for absolute normalisation | Most users discard the NEB EpiMark Gluc / Cluc controls; they are the per-sample IP-efficiency QC anchor | | Cross-library peak-count comparison | Subsample BAMs to common unique-read depth with `samtools view -s` BEFORE downstream peak calling, OR fit saturation curves and compare at common depth | Raw peak counts are sequencing-depth-dependent and not biologically interpretable | | Suspect failed IP | Inspect deepTools `plotFingerprint` AND per-transcript IP/input ratio distribution; failed IP shows shallow Lorenz tail AND median IP/input ~1.0 | Trusting raw peak count alone — failed IPs still produce peaks | | Viral / contamination-suspect samples | Build a combined host + viral index (and rRNA index) and check unmapped reads | Single-organism indexes hide systematic contamination | | Aligning to transcriptome (rare; specific downstream tools) | BWA-MEM2 or bowtie2; defer to `read-alignment/` | STAR splice-aware on transcriptome causes spurious splice calls inside transcripts | Methodology evolves; before any high-stakes preprocessing pipeline, web-search "STAR vs HISAT2 MeRIP 2024" and "MeRIP saturation curve preseq" for current consensus parameters. ## Adapter Trimming for MeRIP **Goal:** Remove sequencing adapters and low-quality 3' ends WITHOUT removing biological signal (UMI-MeRIP must keep UMI sequences; standard MeRIP does not have UMIs and trimming should be minimal). **Approach:** Use fastp or Trim Galore with adapter auto-detection; require minimum read length 25-30 nt (shorter reads multi-map and confound exomePeak2); for standard non-UMI MeRIP, do NOT pass `--umi` flags; preserve random-hexamer-priming artifacts ONLY if downstream pipeline expects them (most do not). ```bash mkdir -p trimmed for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do fastp \ --in1 raw/${sample}_R1.fastq.gz \ --in2 raw/${sample}_R2.fastq.gz \ --out1 trimmed/${sample}_R1.fq.gz \ --out2 trimmed/${sample}_R2.fq.gz \ --html qc/${sample}_fastp.html \ --json qc/${sample}_fastp.json \ --length_required 25 \ --detect_adapter_for_pe \ --thread 8 done ``` For UMI-MeRIP (rare), insert `--umi --umi_loc read1 --umi_len 8` BEFORE the alignment step. Default fastp output preserves base quality information needed by downstream variant-aware tools; do NOT pass `--disable_quality_filtering` for MeRIP libraries. ## STAR Splice-Aware Alignment for IP and Input **Goal:** Produce coordinate-sorted, indexed GENOME BAM files for each IP and input library with splice-junction-aware mapping, retaining a moderate number of multi-mappers for accurate per-window read counts at multi-isoform loci. **Approach:** Build STAR genome index once with the matched GENCODE / Ensembl GTF used downstream; loop IP and input samples with identical parameters; retain up to 20 multi-mappers per read (MeRIP read counts at multi-isoform genes need this); request explicit BAM SortedByCoordinate; emit splice-junction tables for QC. ```bash STAR \ --runMode genomeGenerate \ --genomeDir star_index \ --genomeFastaFiles genome.fa \ --sjdbGTFfile annotation.gtf \ --sjdbOverhang 100 \ --runThreadN 12 mkdir -p aligned for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do STAR \ --runMode alignReads \ --genomeDir star_index \ --readFilesIn trimmed/${sample}_R1.fq.gz trimmed/${sample}_R2.fq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 20 \ --outSAMattributes NH HI AS nM NM MD \ --outFileNamePrefix aligned/${sample}_ \ --runThreadN 12 samtools index -@ 4 aligned/${sample}_Aligned.sortedByCoord.out.bam done ``` `--outFilterMultimapNmax 20` is intentional: MeRIP at rRNA / snoRNA / pseudogene-rich loci needs multi-mapper retention. Reduce to 1 only if downstream analysis explicitly cannot tolerate multi-mappers. `--sjdbOverhang` should equal (read length - 1) but 100 is the common-enough default for 100-150 bp reads. ## HISAT2 Alternative for Memory-Constrained Environments **Goal:** Achieve splice-aware alignment in ~5x less memory than STAR (12-16 GB suffices for human), with comparable accuracy for MeRIP applications. **Approach:** Build HISAT2 graph index; align with `--dta` for downstream-transcript-assembly compatibility; pipe directly to samtools sort. ```bash hisat2-build genome.fa hisat2_index/genome for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do hisat2 \ -x hisat2_index/genome \ -1 trimmed/${sample}_R1.fq.gz \ -2 trimmed/${sample}_R2.fq.gz \ --dta \ --summary-file qc/${sample}_hisat2.log \ -p 12 | \ samtools sort -@ 8 -o aligned/${sample}.sorted.bam - samtools index -@ 4 aligned/${sample}.sorted.bam done ``` HISAT2 multi-mapper handling is governed by `-k`; the default reports the primary alignment only. For MeRIP, pass `-k 5` if multi-mapper-aware downstream counting is required. ## Per-Sample QC: flagstat and idxstats ```bash mkdir -p qc for bam in aligned/*sortedByCoord.out.bam aligned/*sorted.bam; do name=$(basename ${bam} .bam) samtools flagstat ${bam} > qc/${name}.flagstat samtools idxstats ${bam} > qc/${name}.idxstats done ``` Inspect `flagstat` for properly-paired rate (>=85% indicates good pairing); inspect `idxstats` for unexpected chromosome-level read piles (rRNA bleed-through, mitochondrial domination — both are MeRIP red flags). ## Replicate Concordance via deepTools **Goal:** Quantify how similar replicate IP libraries are to each other (and likewise input libraries) using a Spearman correlation matrix; flag a divergent replicate before it propagates into peak calling. **Approach:** Compute genome-wide per-bin read counts at 10 kb resolution across all IP and input BAMs; convert to a clustered Spearman heatmap with deepTools `plotCorrelation`. ```bash multiBamSummary bins \ --bamfiles aligned/IP_rep1*.bam aligned/IP_rep2*.bam aligned/IP_rep3*.bam \ aligned/Input_rep1*.bam aligned/Input_rep2*.bam aligned/Input_rep3*.bam \ --binSize 10000 \ --numberOfProcessors 8 \ --outRawCounts qc/raw_bin_counts.tab \ -o qc/cov.npz plotCorrelation \ --corData qc/cov.npz \ --corMethod spearman \ --skipZeros \ --whatToPlot heatmap \ --colorMap RdYlBu_r \ --plotNumbers \ -o qc/replicate_correlation.pdf ``` IP replicates within a condition should cluster (Spearman >= 0.85 typical); input replicates should cluster with each other; IP and input should NOT cluster together. A failed IP looks like input. ## IP Enrichment via plotFingerprint **Goal:** Confirm IP libraries are enriched (a few transcripts have many reads) and input libraries are uniform (reads spread across transcripts); fail-fast on poor IP before peak calling. **Approach:** deepTools `plotFingerprint` builds a cumulative Lorenz-style curve; a steep tail = signal concentrated in few regions (good IP); a diagonal = uniform coverage (input or failed IP). The framework is from ChIP-seq (Diaz 2012 *Stat Appl Genet Mol Biol* 11:9) and transfers cleanly to MeRIP. ```bash plotFingerprint \ --bamfiles aligned/IP_rep1*.bam aligned/IP_rep2*.bam aligned/IP_rep3*.bam \ aligned/Input_rep1*.bam aligned/Input_rep2*.bam aligned/Input_rep3*.bam \ --labels IP1 IP2 IP3 In1 In2 In3 \ --numberOfProcessors 8 \ --skipZeros \ --outQualityMetrics qc/fingerprint_metrics.tab \ -o qc/fingerprint.pdf ``` Good MeRIP IP: cumulative-fraction-of-reads vs cumulative-fraction-of-bins curve sits well below the diagonal in the right half (top-X% of bins capture >50% of reads). Input: near-diagonal. The `--outQualityMetrics` file reports JS distance and synthetic JS distance; the IP-vs-Input JS distance is a single-number IP-quality summary (higher = more concentrated signal). ## Library Complexity / Saturation Curves via PreSeq **Goal:** Compute per-library complexity so peak counts can be honestly compared across libraries and conditions of different sequencing depth. **Approach:** PreSeq `c_curve` (interpolation up to observed depth) and `lc_extrap` (extrapolation beyond observed) on the sorted BAM. Daley & Smith 2013 *Nat Methods* 10:325 capture-recapture model. ```bash mkdir -p complexity for bam in aligned/*.bam; do name=$(basename ${bam} .bam) preseq c_curve -B -o complexity/${name}_c_curve.txt ${bam} preseq lc_extrap -B -o complexity/${name}_lc_extrap.txt ${bam} done ``` Inspect: the `lc_extrap` curve plots distinct molecules vs total reads; a plateau indicates saturation. For cross-condition peak-count comparison: pick a common depth (often 30M unique reads), subsample with `samtools view -s 0.` to that depth, THEN call peaks. ## IP-over-Input bigWig for Downstream Visualisation **Goal:** Produce a per-bin log2 (IP / Input) coverage track per replicate, ready for downstream metagene / browser plots. **Approach:** deepTools `bamCompare` with `--operation log2`; choose a sensible pseudocount to avoid divide-by-zero at low-coverage bins. ```bash mkdir -p tracks paste -d ' ' \ <(printf '%s\n' IP_rep1 IP_rep2 IP_rep3) \ <(printf '%s\n' Input_rep1 Input_rep2 Input_rep3) | \ while read ip input; do bamCompare \ -b1 aligned/${ip}_Aligned.sortedByCoord.out.bam \ -b2 aligned/${input}_Aligned.sortedByCoord.out.bam \ --operation log2 \ --pseudocount 1 \ --binSize 25 \ --normalizeUsing CPM \ --numberOfProcessors 8 \ -o tracks/${ip}_over_${input}.bw done ``` `--pseudocount 1` prevents division-by-zero at zero-coverage bins; `--binSize 25` is fine-grained enough to preserve peak topology while keeping bigWig files reasonably sized. ## Per-Method Failure Modes ### Dedup applied to non-UMI MeRIP **Trigger:** `picard MarkDuplicates REMOVE_DUPLICATES=true` invoked on a standard MeRIP BAM that has no UMI. **Mechanism:** Standard MeRIP libraries have no unique molecular identifiers. PCR duplicates and biological re-sampling at high-coverage transcripts look identical at the alignment level. Dedup removes both, collapsing real coverage at the most-abundant transcripts to an artificially flat profile. This is the opposite of dedup's intent in DNA ChIP-seq. **Symptom:** Coverage at housekeeping mRNAs (e.g., GAPDH, ACTB) drops 5-20x after dedup; downstream peak counts at highly-expressed transcripts collapse; volcano plot of differential peaks shows expression-driven false positives. **Fix:** Skip dedup for standard non-UMI MeRIP. If the library is UMI-MeRIP, use `umi_tools dedup` (Smith 2017 *Genome Res* 27:491) which respects UMI rather than alignment position alone. Record dedup status in sample-sheet metadata. ### Transcriptome alignment for downstream peak calling **Trigger:** STAR or bowtie2 aligned to transcriptome FASTA, then BAM passed to exomePeak2 / MeTPeak / MACS3. **Mechanism:** exomePeak2 and MeTPeak expect a GENOME BAM plus GTF; they project peaks back to transcript features internally. A transcriptome BAM has reads in per-transcript coordinates which the GTF cannot resolve back to genome coordinates without re-alignment. **Symptom:** exomePeak2 throws errors on TxDb-genome consistency; MeTPeak returns zero peaks; MACS3 calls peaks on transcript IDs as if they were chromosomes. **Fix:** Align to GENOME with STAR / HISAT2 for downstream MeRIP peak calling. Transcriptome alignment is correct only for `m6anet-analysis` (ONT DRS) and rare quantification-only downstream tools. ### Failed IP indistinguishable from input **Trigger:** A single replicate IP library has IP/input ratio distribution centred at 1.0 across all transcripts (no enrichment); fingerprint Lorenz curve sits at the diagonal. **Mechanism:** Failed IP — antibody-RNA binding did not enrich m6A-containing fragments. Causes include antibody-batch defect, insufficient pulldown wash, RNA degradation during IP, or accidental mock IgG IP. **Symptom:** plotFingerprint shows IP overlaying input on the Lorenz plot; per-transcript IP/input ratio histogram is centred at 1.0; downstream peak callers find few or no peaks AT THE FAILED REPLICATE while other replicates produce normal counts. **Fix:** Identify the failed replicate via plotFingerprint AND IP/input ratio distribution BEFORE peak calling; exclude or re-do. Single failed IP in a 3-replicate design routinely produces "differential" peaks driven entirely by the failure. ### Antibody lot mismatch across samples **Trigger:** A multi-condition MeRIP study uses Synaptic Systems 202-003 antibody lot A for the control IPs and lot B for the treatment IPs (because lot A ran out mid-study). **Mechanism:** Anti-m6A polyclonals (Synaptic Systems 202-003, Abcam ab151230, NEB EpiMark E1610, Cell Signaling 56593, Active Motif 61755) have batch-to-batch variability in pulldown efficiency and m6A-vs-m6Am cross-reactivity. Pooling lot-A and lot-B counts in a downstream differential model attributes lot-effect to condition. **Symptom:** "Differential" peaks at high abundance transcripts; effect sizes track antibody lot rather than condition; reanalysis with lot in the design matrix removes most differential peaks. **Fix:** Record `antibody_clone` and `antibody_lot` per sample in metadata; include `lot` as a fixed effect in downstream differential analysis. Within a single study, ideally use a single lot for ALL replicates and ALL conditions. ### Peak counts compared across libraries of different depth **Trigger:** "Condition A has 14,000 peaks; condition B has 22,000 peaks; condition B has more m6A." **Mechanism:** Peak count is library-size-dependent. A library at 60M unique reads finds more peaks than 30M. Without rarefaction or saturation correction, peak-count comparisons across libraries are dominated by sequencing depth. **Symptom:** Peak counts track total mapped reads more closely than they track biological condition; downstream "biological m6A change" claims do not survive rarefaction-to-common-depth. **Fix:** Either rarefy all BAMs to common unique-read depth before peak calling, OR fit saturation curves with PreSeq `lc_extrap` and compare at matched depth, OR report peak count alongside the saturation curve. ### Random hexamer priming over-trim **Trigger:** Aggressive 5' trimming of the first 6-12 nt to remove "random hexamer priming bias" applied to MeRIP libraries. **Mechanism:** Random hexamer priming bias affects the 5' nucleotide composition of reads but does NOT degrade downstream peak-calling accuracy. Over-trimming removes biological signal and shortens reads enough to inflate multi-mapper fraction. **Fix:** Standard adapter trimming with `--length_required 25` is sufficient; do not 5'-trim for hexamer bias unless downstream tooling explicitly requires unbiased 5' ends (most do not). The bias is a known artifact in the RNA-seq community and is robust to standard analytical pipelines. ## Reconciliation: When QC Signals Disagree | Pattern | Likely cause | Action | |---------|--------------|--------| | plotFingerprint diagonal but IP/input ratio shows enrichment | Mismatched chromosome naming (chr1 vs 1) between samples | Verify `samtools view -H bam | grep '@SQ'` matches across samples | | Replicate Spearman 0.95 but plotFingerprint diverges | Replicates correlate in bulk but differ in IP enrichment depth | Check per-sample sequencing depth; reduce to common depth | | Saturation curve plateaus early but peak count low | Library complexity exhausted (e.g., over-amplified PCR) | Inspect duplicate rate; re-prep library if possible | | MultiQC reports input has higher mapping rate than IP | IP enriches non-canonical sequences (m6A on intronic RNA, mt-RNA) that map differently | Acceptable if STAR multi-mapper retention is on; verify with idxstats | | Properly-paired rate < 60% | Insert size distribution off (RNA degradation; library prep failure) | Inspect `samtools view -f 0x2` count; re-prep if severe | | HISAT2 reports many `discordant` pairs | Splice-junction not captured in index | Re-build with `--dta` and confirm GTF matches genome | | plotFingerprint synthetic JS distance < 0.5 | Marginal IP enrichment; borderline failed | Inspect per-transcript IP/input ratio distribution; consider exclusion | ## Quantitative Thresholds | Quantity | Threshold | Source / rationale | |----------|-----------|--------------------| | Minimum read length after trimming | 25 nt | Below this, multi-mapping fraction inflates; downstream peak callers lose specificity | | STAR `--outFilterMultimapNmax` for MeRIP | 20 | Retains multi-isoform mapping; tighten to 1 only when downstream cannot tolerate | | `--sjdbOverhang` | read length - 1 | STAR convention; 100 is common for 100-150 bp reads | | Properly-paired rate (samtools flagstat) | >=85% | Below indicates degraded RNA or library-prep failure | | Replicate Spearman correlation (multiBamSummary 10 kb bins) | >=0.85 (IP-vs-IP within condition) | Below suggests one replicate is anomalous | | plotFingerprint IP-vs-input JS distance | >=0.5 | Higher indicates better IP enrichment; <0.3 suggests failed IP | | Saturation curve plateau depth | ~30-60M unique reads typical | Below this, peak calling under-samples; depth depends on cell type / antibody | | Per-transcript IP/input ratio median | >1.5 (genome-wide median) | Lower suggests failed IP; conditions / cell lines vary | | Minimum biological replicates | 3 (4-5 preferred) per condition | McIntyre 2020 *Sci Rep* 10:6590 — N=2 routinely under-powered | | Dedup status for non-UMI MeRIP | OFF | Standard convention; UMI-MeRIP is the only exception | | BAM sort order for downstream tools | Coordinate (SortedByCoordinate) | exomePeak2, MeTPeak, MACS3, deepTools all expect coordinate sort | | `bamCompare --binSize` for downstream metagene | 25 | Fine enough to preserve peak topology; coarser only for whole-chromosome browser views | | `bamCompare --pseudocount` | 1 | Prevents divide-by-zero at zero-coverage bins; larger values flatten signal | ## Common Errors | Error / symptom | Cause | Solution | |-----------------|-------|----------| | STAR runs out of memory on human genome | --genomeDir build needs ~30 GB RAM | Use HISAT2 (~12 GB) or run STAR on a high-memory node | | `samtools index` fails with "is not coordinate sorted" | BAM is name-sorted or unsorted | Re-run `samtools sort` (not `sort -n`) | | exomePeak2 errors on TxDb chromosome mismatch | BAM uses chr1, GTF uses 1 (or reverse) | Verify with `samtools view -H bam | head` and `head genes.gtf`; reconcile with `seqlevelsStyle()` in R or rename in shell | | deepTools `bamCompare --ratio log2` deprecation warning | Newer deepTools uses `--operation log2` | Switch to `--operation log2` | | PreSeq `lc_extrap` rejects with "low complexity" | Library too shallow OR genome too small (BAM under 1M unique reads) | Use `c_curve` only; or sequence deeper | | MultiQC misses STAR Log.final.out | STAR output naming non-standard | Re-run with `--outFileNamePrefix` and rerun MultiQC; check `multiqc_config.yaml` search patterns | | `picard MarkDuplicates` collapses all reads to 1 per position | Tiny BAM or single read pair per fragment | Verify BAM has many properly-paired reads; do NOT dedup non-UMI MeRIP regardless | | Empty fingerprint output | All BAMs have identical bin coverage | Verify BAMs are different files; check `multiBamSummary --outRawCounts` | | bigWig file size too large | Bin size too small at deep coverage | Increase `--binSize` from 25 to 50; bigWig is lossy at large bin sizes | | Saturation curve never plateaus | Library deeply under-sampled | Sequence deeper OR accept curve does not plateau and report accordingly | | `fastp --umi` errors on non-UMI library | UMI flag passed but library has no UMI | Drop `--umi` flag for standard non-UMI MeRIP | ## Anticipated Reviewer Pushback | Pushback | Response | |----------|----------| | "Was deduplication applied?" | No — standard non-UMI MeRIP protocol; PCR duplicate vs biological resampling indistinguishable without UMI; dedup collapses real coverage at high-expression transcripts | | "What is the IP enrichment QC?" | deepTools plotFingerprint reported per replicate; JS distance >=0.5 vs input | | "Are the replicates concordant?" | Spearman correlation matrix reported via deepTools plotCorrelation on 10 kb bins; IP-IP within condition >=0.85 | | "Saturation curve?" | PreSeq lc_extrap per library; libraries rarefied to common depth before downstream peak calling | | "What antibody clone and lot?" | Recorded per sample in metadata; same lot for all replicates within study | | "Why STAR instead of HISAT2?" | STAR splice-junction-DB-based vs HISAT2 graph-based; both valid for MeRIP; choice driven by memory budget | | "How many biological replicates?" | N >=3 per condition (per McIntyre 2020); N=2 is under-powered for differential downstream | | "Was alignment to genome or transcriptome?" | Genome (required for exomePeak2 / MeTPeak / MACS3 downstream); transcriptome alignment is for m6anet-analysis only | ## References - Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. *Bioinformatics* 29(1):15-21. doi:10.1093/bioinformatics/bts635 - Kim D, Paggi JM, Park C, Bennett C, Salzberg SL (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. *Nat Biotechnol* 37(8):907-915. doi:10.1038/s41587-019-0201-4 - Vasimuddin Md, Misra S, Li H, Aluru S (2019) Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. *IPDPS* 314-324. doi:10.1109/IPDPS.2019.00041 - Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. *Bioinformatics* 34(17):i884-i890. doi:10.1093/bioinformatics/bty560 - Ramírez F, Ryan DP, Grüning B et al (2016) deepTools2: a next generation web server for deep-sequencing data analysis. *Nucleic Acids Res* 44(W1):W160-W165. doi:10.1093/nar/gkw257 - Diaz A, Park K, Lim DA, Song JS (2012) Normalization, bias correction, and peak calling for ChIP-seq. *Stat Appl Genet Mol Biol* 11(3):Article 9. doi:10.1515/1544-6115.1750 - Daley T, Smith AD (2013) Predicting the molecular complexity of sequencing libraries. *Nat Methods* 10(4):325-327. doi:10.1038/nmeth.2375 - Smith T, Heger A, Sudbery I (2017) UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. *Genome Res* 27(3):491-499. doi:10.1101/gr.209601.116 - McIntyre ABR, Gokhale NS, Cerchietti L, Jaffrey SR, Horner SM, Mason CE (2020) Limits in the detection of m6A changes using MeRIP/m6A-seq. *Sci Rep* 10(1):6590. doi:10.1038/s41598-020-63355-3 - Ewels P, Magnusson M, Lundin S, Käller M (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. *Bioinformatics* 32(19):3047-3048. doi:10.1093/bioinformatics/btw354 - Dominissini D, Moshitch-Moshkovitz S, Schwartz S et al (2012) Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. *Nature* 485(7397):201-206. doi:10.1038/nature11112 - Meyer KD, Saletore Y, Zumbo P, Elemento O, Mason CE, Jaffrey SR (2012) Comprehensive analysis of mRNA methylation reveals enrichment in 3' UTRs and near stop codons. *Cell* 149(7):1635-1646. doi:10.1016/j.cell.2012.05.003 ## Related Skills - m6a-peak-calling - Immediate downstream consumer of the IP/input BAM pairs - m6a-differential - Downstream differential analysis on peak count matrices; design matrix relies on IP/input pairing recorded here - m6anet-analysis - ONT DRS alternative; uses TRANSCRIPTOME alignment with minimap2, NOT the genome BAMs produced here - modification-visualization - Uses the bigWig output of bamCompare for metagene plots and browser tracks - read-qc/quality-reports - FastQC / MultiQC upstream of trimming - read-alignment/star-alignment - General STAR splice-aware alignment patterns - read-alignment/hisat2-alignment - HISAT2 graph-based alternative; general usage - alignment-files/sam-bam-basics - General BAM mechanics, samtools fundamentals - alignment-files/bam-statistics - flagstat / idxstats / per-chromosome counts - alignment-files/duplicate-handling - General dedup philosophy (note: NOT applicable to non-UMI MeRIP) - chip-seq/chipseq-qc - ChIP-seq IP QC concepts (FRiP, fingerprint, library complexity) that transfer directly - chip-seq/peak-calling - General IP-vs-input peak-calling concepts - clip-seq/clip-preprocessing - Antibody-RNA crosslink protocols (miCLIP / m6A-CLIP) overlap with MeRIP design - rna-quantification/featurecounts-counting - Count matrix construction for downstream differential - workflows/rnaseq-to-de - End-to-end pipeline orchestration patterns