--- name: bio-workflows-merip-pipeline description: "Orchestrates an end-to-end MeRIP-seq / m6A-seq analysis from raw FASTQ to differential m6A peak calls and metagene plots, chaining fastp adapter trimming, STAR splice-aware genome alignment (NO deduplication for non-UMI MeRIP), deepTools replicate-concordance + IP-enrichment QC, PreSeq saturation curves for cross-library peak-count comparison, exomePeak2 (transcript-aware Poisson GLM) peak calling, optional MACS3 broad-peak cross-check, DRACH motif confirmation as a sanity check (NOT a per-peak filter), exomePeak2 differential calling via the four-BAM-vector interface (`bam_ip` + `bam_input` for control; `bam_treated_ip` + `bam_treated_input` for treatment), ChIPseeker feature annotation, and the canonical Guitar transcript-feature metagene with stop-codon enrichment as the biological QC anchor." tool_type: mixed primary_tool: exomePeak2 --- ## Version Compatibility Reference examples tested with: STAR 2.7.11+, samtools 1.19+, fastp 0.23+, deepTools 3.5+, PreSeq 3.2+, exomePeak2 1.14+ (Bioconductor 3.18+), MACS3 3.0+, ChIPseeker 1.38+, Guitar 2.18+, BSgenome.Hsapiens.UCSC.hg38 1.4+, TxDb.Hsapiens.UCSC.hg38.knownGene 3.18+, HOMER 4.11+. Before using code patterns, verify installed versions match. If versions differ: - R: `packageVersion('')` then `?function_name` to verify parameters - CLI: ` --version` then ` --help` to confirm flags If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. exomePeak2 has NO `mode=` or `experiment_design=` argument; differential is triggered by populating `bam_treated_ip` + `bam_treated_input`. MeTPeak defaults are `WINDOW_WIDTH=50, SLIDING_STEP=50, FRAGMENT_LENGTH=100`. MACS3 default `--keep-dup` is 1 and MUST be overridden to `all` for non-UMI MeRIP. Guitar `txTxdb=` is the modern argument name (older releases used `txdb=`). # MeRIP-seq End-to-End Pipeline **"Analyze my MeRIP-seq data from FASTQ to differential m6A peaks"** -> Orchestrate read alignment (STAR splice-aware to GENOME), IP-enrichment QC (deepTools plotFingerprint, replicate Spearman, PreSeq saturation), m6A peak calling (exomePeak2 transcript-aware default, MACS3 broad as cross-check), DRACH motif sanity check (HOMER), exomePeak2 differential via the four-BAM-vector interface, ChIPseeker feature annotation, and Guitar transcript-feature metagene confirming canonical stop-codon enrichment. Defer per-skill deep treatment to `epitranscriptomics/merip-preprocessing`, `epitranscriptomics/m6a-peak-calling`, `epitranscriptomics/m6a-differential`, and `epitranscriptomics/modification-visualization`. ## Pipeline Overview ``` FASTQ -> fastp trim -> STAR genome align -> samtools sort/index -> deepTools QC + PreSeq saturation -> exomePeak2 peak calling (+ MeTPeak / MACS3 cross-check) -> HOMER DRACH sanity check -> exomePeak2 differential (bam_ip + bam_treated_ip) -> ChIPseeker feature annotation -> Guitar metagene (stop-codon QC anchor) + pyGenomeTracks browser figures ``` ## Step 1: Adapter Trimming ```bash fastp \ --in1 raw/IP_R1.fastq.gz --in2 raw/IP_R2.fastq.gz \ --out1 trimmed/IP_R1.fq.gz --out2 trimmed/IP_R2.fq.gz \ --json qc/IP_fastp.json --html qc/IP_fastp.html \ --length_required 25 --detect_adapter_for_pe --thread 8 fastp \ --in1 raw/Input_R1.fastq.gz --in2 raw/Input_R2.fastq.gz \ --out1 trimmed/Input_R1.fq.gz --out2 trimmed/Input_R2.fq.gz \ --json qc/Input_fastp.json --html qc/Input_fastp.html \ --length_required 25 --detect_adapter_for_pe --thread 8 ``` Standard non-UMI MeRIP: do NOT pass `--umi`. See `epitranscriptomics/merip-preprocessing` for the do-NOT-dedup rationale. ## Step 2: STAR Splice-Aware Genome Alignment ```bash STAR --runMode alignReads \ --genomeDir refs/star_index \ --readFilesIn trimmed/IP_R1.fq.gz trimmed/IP_R2.fq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 20 \ --outSAMattributes NH HI AS nM NM MD \ --outFileNamePrefix aligned/IP_ \ --runThreadN 12 samtools index aligned/IP_Aligned.sortedByCoord.out.bam ``` Repeat for Input. Align to GENOME (not transcriptome) for downstream MeRIP peak calling. Do NOT deduplicate (no UMI in standard MeRIP). ## Step 3: IP-Enrichment + Replicate-Concordance QC ```bash multiBamSummary bins \ --bamfiles aligned/IP_rep*.bam aligned/Input_rep*.bam \ --binSize 10000 --numberOfProcessors 8 \ -o qc/cov.npz plotCorrelation --corData qc/cov.npz --corMethod spearman --skipZeros \ --whatToPlot heatmap --colorMap RdYlBu_r --plotNumbers \ -o qc/replicate_correlation.pdf plotFingerprint \ --bamfiles aligned/IP_rep*.bam aligned/Input_rep*.bam \ --skipZeros --numberOfProcessors 8 \ --outQualityMetrics qc/fingerprint_metrics.tab \ -o qc/fingerprint.pdf preseq lc_extrap -B -o qc/IP_rep1_lc_extrap.txt aligned/IP_rep1_Aligned.sortedByCoord.out.bam ``` For peak-count comparison across conditions, rarefy BAMs to a common unique-read depth informed by the saturation curve before calling peaks. ## Step 4: exomePeak2 Peak Calling (Per-Condition) **Goal:** Produce a transcript-aware set of m6A peaks with FDR and IP/input fold-change from paired IP/Input genome BAM files, suitable as input to differential analysis, motif scanning, or downstream visualisation. **Approach:** Build a TxDb from the matched GTF; pass paired IP/Input BAM vectors to `exomePeak2()` with `txdb` and `genome` (BSgenome) for GC correction; export BED12 + RDS to `save_dir/experiment_name/`. ```r library(exomePeak2) library(GenomicFeatures) library(BSgenome.Hsapiens.UCSC.hg38) txdb <- makeTxDbFromGFF('refs/annotation.gtf', format='gtf') result <- exomePeak2( bam_ip = c('aligned/IP_rep1.bam', 'aligned/IP_rep2.bam', 'aligned/IP_rep3.bam'), bam_input = c('aligned/Input_rep1.bam', 'aligned/Input_rep2.bam', 'aligned/Input_rep3.bam'), txdb = txdb, genome = BSgenome.Hsapiens.UCSC.hg38, paired_end = TRUE, library_type = 'unstranded', save_dir = 'exomepeak2_output', experiment_name = 'm6a_run1' ) peaks <- result length(peaks) ``` `exomePeak2()` writes BED12 + RDS + per-peak fold-change / FDR to `save_dir/experiment_name/`. ## Step 5: MACS3 Broad-Peak Cross-Check (Optional) ```bash macs3 callpeak \ --treatment aligned/IP_rep*.bam \ --control aligned/Input_rep*.bam \ --format BAMPE --gsize hs \ --nomodel --extsize 150 \ --keep-dup all \ --broad --broad-cutoff 0.1 --qvalue 0.05 \ --outdir macs3_output --name m6a_run1 ``` `--keep-dup all` is non-negotiable for non-UMI MeRIP (default `--keep-dup 1` destroys signal at high-coverage transcripts). ## Step 6: DRACH Motif Sanity Check ```bash findMotifsGenome.pl \ exomepeak2_output/m6a_run1/peaks.bed \ hg38 motif_output \ -rna -size 100 -len 5,6 -p 8 ``` Report DRACH enrichment on the peak set as a sanity check (E-value < 1e-50 expected). NEVER post-hoc filter individual peaks by DRACH. ## Step 7: exomePeak2 Differential (Control vs Treatment) **Goal:** Identify m6A peaks that change between control and treatment conditions, with per-peak log2FC + FDR, using exomePeak2's integrated peak-calling + differential interface. **Approach:** Populate `bam_ip` + `bam_input` with the control arm and `bam_treated_ip` + `bam_treated_input` with the treatment arm; populating the treated arms triggers differential mode (there is NO `mode=` argument). Apply effect-size + FDR filters downstream. ```r library(exomePeak2) library(GenomicFeatures) library(BSgenome.Hsapiens.UCSC.hg38) txdb <- makeTxDbFromGFF('refs/annotation.gtf', format='gtf') ctrl_ip <- c('aligned/ctrl_IP1.bam', 'aligned/ctrl_IP2.bam', 'aligned/ctrl_IP3.bam') ctrl_input <- c('aligned/ctrl_Input1.bam', 'aligned/ctrl_Input2.bam', 'aligned/ctrl_Input3.bam') treat_ip <- c('aligned/treat_IP1.bam', 'aligned/treat_IP2.bam', 'aligned/treat_IP3.bam') treat_input <- c('aligned/treat_Input1.bam', 'aligned/treat_Input2.bam', 'aligned/treat_Input3.bam') diff_result <- exomePeak2( bam_ip = ctrl_ip, bam_input = ctrl_input, bam_treated_ip = treat_ip, bam_treated_input = treat_input, txdb = txdb, genome = BSgenome.Hsapiens.UCSC.hg38, paired_end = TRUE, library_type = 'unstranded', peak_calling_mode = 'exon', save_dir = 'exomepeak2_diff_output', experiment_name = 'ctrl_vs_treat' ) diff_table <- as.data.frame(diff_result) sig <- diff_table[diff_table$padj < 0.05 & abs(diff_table$log2FC) > 0.5, ] nrow(sig) ``` exomePeak2 has NO `mode=` or `experiment_design=` argument. Populating `bam_treated_ip` + `bam_treated_input` triggers differential output. For batch / antibody-lot covariate adjustment, fall through to featureCounts-on-peaks -> DESeq2 (see `epitranscriptomics/m6a-differential`). ## Step 8: Peak Annotation to Transcript Features ```r library(ChIPseeker) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(rtracklayer) peaks <- import('exomepeak2_output/m6a_run1/peaks.bed') anno <- annotatePeak(peaks, TxDb=TxDb.Hsapiens.UCSC.hg38.knownGene, level='transcript') plotAnnoBar(anno) plotDistToTSS(anno) ``` Flag peaks within ~50 nt of TSS as m6A-or-m6Am ambiguous (antibody cross-reactivity with PCIF1-deposited cap m6Am). ## Step 9: Guitar Metagene (Biological QC Anchor) ```r library(Guitar) library(TxDb.Hsapiens.UCSC.hg38.knownGene) GuitarPlot( txTxdb = TxDb.Hsapiens.UCSC.hg38.knownGene, stBedFiles = list('exomepeak2_output/m6a_run1/peaks.bed'), saveToPDFprefix = 'figures/m6a_metagene' ) ``` Expected pattern: peak density rises toward and peaks near the stop codon (3'UTR-proximal end of CDS). If absent, suspect IP failure or wrong antibody; do NOT proceed to downstream interpretation. ## Complete Bash Driver ```bash #!/usr/bin/env bash set -euo pipefail STAR_INDEX=$1 GTF=$2 IP_R1=$3 IP_R2=$4 INPUT_R1=$5 INPUT_R2=$6 OUTPUT_DIR=$7 mkdir -p "${OUTPUT_DIR}"/{qc,trimmed,aligned,peaks,figures} fastp --in1 "${IP_R1}" --in2 "${IP_R2}" \ --out1 "${OUTPUT_DIR}/trimmed/IP_R1.fq.gz" --out2 "${OUTPUT_DIR}/trimmed/IP_R2.fq.gz" \ --json "${OUTPUT_DIR}/qc/IP_fastp.json" --length_required 25 --detect_adapter_for_pe --thread 8 fastp --in1 "${INPUT_R1}" --in2 "${INPUT_R2}" \ --out1 "${OUTPUT_DIR}/trimmed/Input_R1.fq.gz" --out2 "${OUTPUT_DIR}/trimmed/Input_R2.fq.gz" \ --json "${OUTPUT_DIR}/qc/Input_fastp.json" --length_required 25 --detect_adapter_for_pe --thread 8 for sample in IP Input; do STAR --runMode alignReads --genomeDir "${STAR_INDEX}" \ --readFilesIn "${OUTPUT_DIR}/trimmed/${sample}_R1.fq.gz" "${OUTPUT_DIR}/trimmed/${sample}_R2.fq.gz" \ --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 20 \ --outFileNamePrefix "${OUTPUT_DIR}/aligned/${sample}_" --runThreadN 12 samtools index "${OUTPUT_DIR}/aligned/${sample}_Aligned.sortedByCoord.out.bam" done macs3 callpeak \ --treatment "${OUTPUT_DIR}/aligned/IP_Aligned.sortedByCoord.out.bam" \ --control "${OUTPUT_DIR}/aligned/Input_Aligned.sortedByCoord.out.bam" \ --format BAMPE --gsize hs --nomodel --extsize 150 --keep-dup all \ --broad --broad-cutoff 0.1 --qvalue 0.05 \ --outdir "${OUTPUT_DIR}/peaks" --name m6a ``` The full pipeline (incl. exomePeak2 peak calling, DRACH check, ChIPseeker annotation, Guitar metagene) is best orchestrated in Snakemake or Nextflow with the per-skill recipes from the four `epitranscriptomics/` skills. ## QC Checkpoints | Checkpoint | Expected | Action if Failed | |------------|----------|------------------| | Properly-paired rate (samtools flagstat) | >=85% | Check trimming and adapter contamination | | Replicate Spearman within condition (10 kb bins) | >=0.85 IP-IP | Inspect divergent replicate; consider exclusion | | plotFingerprint IP-vs-input JS distance | >=0.5 | Suspect failed IP if lower | | Saturation plateau depth | ~30-60M unique reads | Sequence deeper if not plateaued | | DRACH motif enrichment (HOMER, peak set) | E-value < 1e-50 | Suspect IP failure or wrong antibody | | Stop-codon enrichment in Guitar metagene | Clear 3'UTR-proximal peak | Suspect IP failure, wrong antibody, or non-m6A modification | | 5'UTR peaks fraction | Note ambiguity zone (~50 nt of TSS) | Flag as m6A-or-m6Am ambiguous; PCIF1 cross-reactivity | ## Output Files | File | Description | |------|-------------| | `exomepeak2_output/m6a_run1/peaks.bed` | exomePeak2 peak BED12 | | `exomepeak2_diff_output/ctrl_vs_treat/...` | Differential peaks with log2FC + FDR | | `motif_output/` | HOMER DRACH motif enrichment report | | `figures/m6a_metagene.pdf` | Guitar transcript-feature metagene (stop-codon QC anchor) | | `qc/replicate_correlation.pdf` | deepTools Spearman heatmap | | `qc/fingerprint.pdf` | deepTools Lorenz IP-enrichment plot | | `qc/IP_rep*_lc_extrap.txt` | PreSeq saturation curves | ## Related Skills - epitranscriptomics/merip-preprocessing - Per-step preprocessing (trim, align, QC, saturation, IP-over-Input bigWig) - epitranscriptomics/m6a-peak-calling - exomePeak2 / MeTPeak / MACS3 deep treatment, DRACH sanity check, m6A-vs-m6Am 5'UTR flag - epitranscriptomics/m6a-differential - Differential methods (exomePeak2, QNB, RADAR), batch / lot covariate handling, stoichiometry-vs-expression confound - epitranscriptomics/modification-visualization - Guitar metagene, peak-centred heatmaps, pyGenomeTracks browser figures - epitranscriptomics/m6anet-analysis - ONT direct-RNA alternative for orthogonal stoichiometry validation - chip-seq/peak-calling - Sibling IP-vs-input peak-calling framework - chip-seq/chipseq-qc - IP enrichment QC concepts that transfer to MeRIP - read-alignment/star-alignment - General STAR splice-aware alignment - workflow-management/snakemake-workflows - Snakemake orchestration patterns - workflow-management/nextflow-pipelines - Nextflow orchestration patterns - workflows/rnaseq-to-de - General RNA-seq -> DE pipeline patterns