--- name: bio-clip-seq-clip-qc description: Comprehensive quality control for CLIP-seq libraries (eCLIP, iCLIP, iCLIP2, PAR-CLIP) covering library complexity (preseq), FRiP, IDR replicate reproducibility, read-distribution metagene, SMInput vs IgG control rationale, rRNA / snoRNA contamination, fragment-length distribution, and ENCODE-compliance thresholds. Use when assessing whether a CLIP library passed, deciding lenient vs stringent peak thresholds, comparing replicates with IDR rescue and self-consistency ratios, or distinguishing failed IP from over-amplified library. tool_type: mixed primary_tool: preseq --- ## Version Compatibility Reference examples tested with: preseq 3.2+, picard 3.1+, samtools 1.19+, bedtools 2.31+, deeptools 3.5+, idr 2.0.4+, MultiQC 1.21+, RSeQC 5.0+, pysam 0.22+, fastp 0.23+. Before using code patterns, verify installed versions match. If versions differ: - Python: `pip show ` then `help(module.function)` to check signatures - CLI: ` --version` then ` --help` to confirm flags If code throws unexpected errors, introspect the installed binary and adapt the example to match the actual CLI rather than retrying. # CLIP-seq Quality Control **"Did my CLIP library pass?"** -> Assess preprocessing retention, alignment rate, library complexity, replicate reproducibility (IDR), fraction reads in peaks (FRiP), read-distribution metagene, rRNA/snoRNA contamination, fragment-length distribution, and SMInput vs IP enrichment. ENCODE eCLIP compliance is the canonical bar: >= 1M unique fragments per replicate, IDR rescue and self-consistency ratios both < 2, FRiP >= 0.005 (narrow-binding), library complexity rising linearly with depth on preseq lc_extrap. A library can fail at any of these stages, and the failure mode determines whether the data is salvageable. - CLI (library complexity, primary QC): `preseq lc_extrap -B -P aligned.bam -o complexity.txt` - CLI (FRiP after peak calling): `bedtools intersect -c -s -a peaks.bed -b dedup.bam | awk '{s+=$NF} END{print s}'` then divide by total reads - CLI (IDR, ENCODE convention): `idr --samples rep1.sorted.bed rep2.sorted.bed --input-file-type bed --rank 5 --output-file idr.out --idr-threshold 0.05 --plot` - CLI (read distribution / metagene): `RSeQC read_distribution.py -i dedup.bam -r gencode.v38.bed` + `geneBody_coverage.py -i dedup.bam -r housekeeping.bed -o gb` - CLI (rRNA contamination check): `samtools idxstats dedup.bam | awk '$1 ~ /rRNA|45S|18S|28S/ { sum+=$3 } END {print sum}'` - CLI (consolidated report): `multiqc ` aggregates FastQC + cutadapt + STAR + umi_tools + preseq + samtools stats The ENCODE eCLIP standards (encodeproject.org/eclip) define: >= 2 biological replicates with >= 1M unique fragments each (or saturated peak detection); IDR rescue and self-consistency ratios both < 2; narrow-binding RBPs FRiP >= 0.005. CLIP libraries should have 40-70% PCR duplication BY DESIGN - the IP enriches a small molecule pool, so high pre-dedup duplication is normal; low duplication suggests failed IP. ## QC Stage Hierarchy CLIP QC progresses through five gates; failure at an earlier gate makes later gates meaningless. | Gate | Metric | Tool | ENCODE threshold | Failure interpretation | |------|--------|------|------------------|------------------------| | 1. Preprocessing retention | % reads retained after UMI + adapter trim | cutadapt log | >= 70% | Adapter pattern wrong; degraded RNA | | 2. Alignment rate | % reads aligned to genome (unique) | STAR Log.final.out | >= 60% for eCLIP, 70% for iCLIP/PAR-CLIP | Wrong genome; rRNA pre-map missing | | 3. Library complexity | Predicted unique fragments at sequenced depth | preseq lc_extrap | >= 1M unique | Over-amplified or under-input library | | 4. IP enrichment | log2(IP/SMInput) at expected sites; FRiP | bedtools + idr | FRiP >= 0.005; log2 >= 3 at top peaks | Failed antibody / antibody not IP-grade | | 5. Reproducibility | IDR rescue + self-consistency ratios | idr | both < 2 | Biological variation too high; or low complexity | A library failing at Gate 3 (complexity) cannot be rescued analytically; gates 4-5 fail downstream of complexity by construction. ## Library Complexity with preseq **Goal:** Determine whether the CLIP library captured enough independent molecules to support genome-wide peak calling (ENCODE: >= 1M unique fragments per replicate). **Approach:** Run `preseq lc_extrap` on the PRE-dedup BAM (preseq counts PCR duplicates to extrapolate); also compute picard ESTIMATED_LIBRARY_SIZE at sequenced depth. Flag any library predicted to plateau below 3M unique fragments at infinite depth. ```bash # After alignment, BEFORE UMI dedup (preseq counts PCR duplicates) preseq lc_extrap \ -B -P \ -o sample_complexity.txt \ sample_aligned.bam # Output columns: # TOTAL_READS EXPECTED_DISTINCT LOWER_0.95CI UPPER_0.95CI # At 100M reads, EXPECTED_DISTINCT: # >= 10M = excellent complexity # 3-10M = acceptable; restrict to high-expression transcripts # < 3M = library failed; cannot rescue analytically # picard direct estimate at current depth picard EstimateLibraryComplexity \ I=sample_aligned.bam \ O=picard_complexity.txt # ESTIMATED_LIBRARY_SIZE > 5M = healthy CLIP library ``` A linear plateau on preseq's curve at low depth indicates over-amplification; a curve still climbing at sequenced depth means more sequencing would yield more unique fragments. ## FRiP (Fraction Reads in Peaks) FRiP measures how much of the IP signal falls into the called peak set. ENCODE eCLIP narrow-binding RBP minimum: FRiP >= 0.005. Atypical-binding RBPs (rare-transcript binders like TROVE2 on Y RNAs) are exempt. ```bash # Reads in peaks (use stringent peaks: log2 FC >= 3, -log10 p >= 3) reads_in_peaks=$(bedtools intersect -c -s -a peaks.stringent.bed -b dedup.bam | awk '{s+=$NF} END {print s}') total_reads=$(samtools view -c -F 4 dedup.bam) frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc) echo "FRiP: $frip" # Per-region FRiP breakdown for region in three_utr exon intron; do rip=$(bedtools intersect -c -s -a peaks_${region}.bed -b dedup.bam | awk '{s+=$NF} END {print s}') echo "${region}: $(echo "scale=4; $rip / $total_reads" | bc)" done ``` | RBP class | Expected FRiP (ENCODE eCLIP) | |-----------|------------------------------| | Splicing factors (PTBP1, U2AF2) | 0.01 - 0.10 | | 3' UTR mRNA stability (HuR, PUM2) | 0.02 - 0.20 | | Translation factors (EIF3J) | 0.01 - 0.05 | | Repeat binders (MATR3) | 0.05 - 0.30 (high; concentrated in repeats) | | Mitochondrial (FASTKD2) | 0.05 - 0.40 (very high; chrM is small) | | snoRNA binders (DKC1) | 0.10 - 0.50 (high; snoRNA is rare) | | Failed IP (any RBP) | < 0.005 | ## IDR for CLIP Reproducibility IDR (Li et al 2011) measures peak-rank reproducibility across replicates. ENCODE eCLIP convention applies IDR identically to ChIP-seq, using CLIPper + SMInput log2 FC + -log10 p as the ranking signal. The CLIP-specific consideration: rank by signalValue (log2 FC) or p-value, NOT by score column (CLIPper score is sparse and tied). ```bash # Sort each replicate's peaks by signal sort -k5,5gr rep1.compressed.bed > rep1.sorted sort -k5,5gr rep2.compressed.bed > rep2.sorted # True-replicates IDR (threshold 0.05) idr --samples rep1.sorted rep2.sorted \ --input-file-type bed --rank 5 \ --output-file idr_true.out \ --idr-threshold 0.05 \ --plot --log-output-file idr.log # Pseudo-replicates from each individual replicate (split BAM in half) samtools view -b -h -s 1.5 rep1.dedup.bam > rep1.psr1.bam # seed 1, fraction 0.5 samtools view -b -h -s 2.5 rep1.dedup.bam > rep1.psr2.bam # seed 2 (different) # Re-run peak calling on each pseudoreplicate, then IDR at threshold 0.10 ``` **ENCODE consistency rules for eCLIP:** - Nt = peaks passing IDR on true replicates - Nself = peaks passing IDR on pseudo-replicates of each rep - Library passes if: max(Nt, Nself) / min(Nt, Nself) <= 2 - If both ratios > 2: library rejected ## SMInput vs IgG Control: Which? The eCLIP design uses a size-matched input (SMInput) from the SAME lysate, treated identically (UV, IP buffer, RNase, ligation, IP, but with NO antibody addition - just bead-only control or a non-specific control IP). This is fundamentally different from IgG controls and from RNA-seq. | Control | What it measures | Pros | Cons | |---------|------------------|------|------| | SMInput | Background from non-specific binding + ligation/RT/gel biases at same size | Captures all CLIP-specific biases; ENCODE standard | Requires same-day prep; cannot use a previous IgG library | | IgG-IP | Non-specific antibody binding to the same RBP-naive lysate | Direct nonspecificity measure | Yields very low (~3-10x less reads); high PCR dup; hard to normalize | | Empty beads (mock) | Bead surface non-specificity only | Cleanest baseline | Misses real CLIP background (IP buffer + ligation step bias) | | RNA-seq (matched cell type) | Transcript abundance | Easy to obtain | Misses CLIP-specific biases entirely; not a real CLIP control | | Total RNA / nuclear RNA | Cellular RNA distribution | Easy | Same as RNA-seq above | **Consensus (ENCODE / Hentze / Yeo):** SMInput. The bead-only IgG/mock alternatives are ill-suited for quantification because their library yields are 5-10x lower, dominated by PCR duplicates, and produce sparse read-density tracks. RNA-seq cannot replace SMInput because it does not capture the non-specific binding that occurs during IP. ```bash # SMInput preparation - same as IP except no antibody added during incubation # Same UV dose, same lysate aliquot, same RNase, same library prep # Critical: same SDS-PAGE size cut from membrane as the IP # Check SMInput vs IP enrichment globally # Whole-genome log2(IP_RPKM / SMInput_RPKM): # > 0.5 globally indicates IP enrichment of expressed transcripts (good) # ~ 0 indicates failed IP (SMInput == IP) ``` ## Read Distribution Metagene ```bash # RSeQC read_distribution.py shows fractional read placement read_distribution.py -i dedup.bam -r gencode.v38.bed > read_dist.txt # Output reports: # CDS_Exons, 5'UTR_Exons, 3'UTR_Exons, Introns, TES_down_10kb, TES_down_1kb, # TSS_up_10kb, TSS_up_1kb, Intergenic_region # CLIP-seq biology-specific patterns: # Splicing factor: > 60% Introns + 5' UTR exons (containing 5' splice sites) # 3' UTR factor: > 50% 3'UTR_Exons # m6A reader: 3'UTR_Exons + Stop_codon region # Failed IP: matches RNA-seq distribution (no enrichment) # GeneBody coverage for 5' vs 3' bias geneBody_coverage.py \ -i dedup.bam \ -r housekeeping.bed \ -o sample_gb # Flat curve = no positional bias (normal for most RBPs) # 3' end bias = polyA-dependent enrichment (suspect) # 5' end bias = nascent / TSS-proximal (only sensible for some RBPs) ``` ## Fragment-Length Distribution (Paired-End) ```bash # Paired-end fragment-length distribution samtools view -f 2 dedup.bam | awk '$9 > 0 && $9 < 500 {print $9}' | sort -n | uniq -c > fragment_lengths.txt # CLIP normal range: 20-75 nt insert (rises from short trimmed reads to ~75 nt) # Wider range (20-200) seen in high-RNase / long-fragment protocols # Narrow peak at one length (e.g., all reads at 30 nt) = over-trimmed # Bimodal at 30 nt and 150 nt = library preparation artifact ``` ## Pre-Map rRNA / snoRNA Contamination Check ```bash # rRNA reads as fraction of total total=$(samtools view -c -F 4 dedup.bam) rRNA=$(samtools idxstats dedup.bam | awk '$1 ~ /rRNA|45S|18S|28S|5_8S/ { sum+=$3 } END {print sum}') rrna_frac=$(echo "scale=4; $rRNA / $total" | bc) echo "rRNA fraction: $rrna_frac" # eCLIP without pre-map: rRNA 5-30% normal # After pre-map: < 2% expected # > 30% indicates rRNA dominance - IP captured ribosomes preferentially # Some RBPs (RPL/RPS) expected to bind ribosomes; verify against RBP biology ``` ## Antibody Validation Sanity Check The antibody is the single most common point of failure in CLIP. Even commercial "IP-grade" antibodies fail at 30-50% rates in practice. Cross-check IP enrichment against expected biology: ```python import pandas as pd # Load peak file peaks = pd.read_csv('peaks.stringent.bed', sep='\t', header=None, names=['chr','start','end','name','log2fc','strand']) # Filter top 100 peaks top_peaks = peaks.nlargest(100, 'log2fc') # Check enrichment in expected biology # 1. If RBP is splicing factor: top peaks should be intronic or splice-site flanking # 2. If RBP is HuR: > 70% top peaks should be 3' UTR # 3. If RBP is FASTKD2: top peaks should be chrM # 4. If RBP is FUS: GUGGU motif should appear in top motif enrichment # Quick check chrom_dist = top_peaks['chr'].value_counts(normalize=True) print(chrom_dist.head(10)) # Healthy RBP: chromosomes represented in proportion to expression # Failed IP: top chromosome chrM (mt artifact) or chr21/22 (housekeeping bias) > 20% ``` GO term sanity: top-peak genes should enrich for the expected biology (e.g., HuR -> immune / inflammation / mRNA stability terms; PTBP1 -> RNA splicing terms). If the top GO term is "cellular metabolism" or "translation" for a splicing factor, the IP failed. ## Per-Stage Failure Modes ### Gate 1: Preprocessing retention < 70% **Trigger:** cutadapt log shows > 30% reads filtered (too short or no adapter). **Mechanism:** Adapter pattern wrong; or RNA was degraded before fragmentation (all reads adapter-only). **Symptom:** Catastrophic loss at the adapter trim step. **Fix:** Verify adapter sequence in library prep documentation; check library quality control (Bioanalyzer / TapeStation) for RIN value; re-prep if RIN < 7. ### Gate 2: Alignment rate < 60% **Trigger:** STAR Log.final.out reports `Uniquely mapped reads %` < 60% for eCLIP, < 70% for iCLIP. **Mechanism:** rRNA contamination dominating (no pre-map); wrong species genome; or chimeric library (contamination). **Symptom:** Low alignment rate; samtools idxstats shows most reads as rRNA-aligned. **Fix:** Run bowtie2 pre-map to rRNA + repeats index; verify genome species; check for sample swap with `samtools view -h sample.bam | grep '^@SQ'`. ### Gate 3: Library complexity < 1M unique **Trigger:** preseq lc_extrap predicts plateau < 1M unique at sequenced depth; or picard reports ESTIMATED_LIBRARY_SIZE < 1M. **Mechanism:** Over-amplification (> 25 PCR cycles); low input (< 5M cells); failed IP capturing only a few molecules. **Symptom:** preseq curve flattens early; UMI clusters have median size > 8. **Fix:** No analytic rescue. Re-prep with more input cells (10-20M for eCLIP), fewer PCR cycles (14-18 for eCLIP, 16-20 for iCLIP2). If forced to use this library, restrict downstream analysis to high-expression transcripts and acknowledge the limitation. ### Gate 4: FRiP < 0.005 for narrow-binding RBP **Trigger:** FRiP value below ENCODE threshold for the RBP class. **Mechanism:** IP failed to enrich specific binding sites; antibody cross-reactive or not IP-grade. **Symptom:** Top peaks dominated by abundant transcripts (GAPDH, ACTB); GO enrichment generic; motif analysis returns AU-rich background. **Fix:** Re-test antibody on a knockdown lysate (siRNA against the RBP) - if WB signal does not decrease, antibody is non-specific; switch antibody. ENCODE-validated antibodies are at encodeproject.org/biosamples. ### Gate 4 variant: IP/SMInput global log2 ~ 0 **Trigger:** Whole-genome log2(IP/SMInput) is near 0 instead of positive. **Mechanism:** IP and SMInput are equivalent - the antibody is not enriching any RBP-bound RNA. **Symptom:** Per-peak log2 FC scaled around 0; no obvious enrichment at known motif sites. **Fix:** Same as above - antibody failure. Consider tagged-RBP system (Halo-CLIP, GoldCLIP, or knock-in endogenous tag). ### Gate 5: IDR fails - rescue/self-consistency > 2 **Trigger:** ENCODE IDR test reports rescue ratio > 2 OR self-consistency > 2. **Mechanism:** Replicates inconsistent. Biological variation high; OR one replicate has lower library complexity; OR one replicate failed IP. **Symptom:** Per-replicate peak counts differ > 2x; IDR plot shows the two replicates have very different peak rank distributions. **Fix:** Run preseq and FRiP on each replicate independently; identify the failing one; re-sequence to higher depth or re-prep. ENCODE-style: down-sample both replicates to common depth, re-test IDR. ### High PCR duplication confused with low complexity **Trigger:** Saw 60% PCR duplication and panicked. **Mechanism:** CLIP libraries have 40-70% PCR duplication BY DESIGN - the IP enriches a small pool. UMI dedup recovers the unique molecules underneath. **Symptom:** "60% duplication" headline. Actual unique count is what matters. **Fix:** Confirm UMI dedup has been applied and unique fragment count >= 1M. The duplication rate metric is meaningless without UMI context for CLIP. ## Decision Tree by Failure | Symptom | Likely failure | Action | |---------|----------------|--------| | > 50% reads "too short" in cutadapt | Adapter wrong or RNA degraded | Verify adapter; check RIN | | Most reads align to rRNA | No pre-map; or IP captured ribosomes | bowtie2 pre-map; or accept if RBP is ribosomal | | Unique frag count < 1M | Over-amplified or under-input | No rescue; re-prep | | FRiP < 0.005 (narrow RBP) | Failed IP | Switch antibody | | IP/SMInput global log2 ~ 0 | Antibody non-specific | Switch antibody or use tagged-RBP | | IDR rescue > 2 | Replicate inconsistency | Identify failing replicate; re-sequence | | Top GO terms generic | Failed IP or contamination | Check antibody on KD lysate WB | | chrM peaks abundant for non-mt-RBP | Mitochondrial contamination | Filter chrM or accept if mt-RBP | | Fragment length all 30 nt | Over-trimmed | Loosen quality trim from -q 20 to -q 6 | | Strand-specific peaks lost | bedtools without -s | Add `-s` strand flag | ## Standard QC Report (MultiQC) ```bash # Aggregate all QC tools into a single report multiqc \ fastqc_output/ \ cutadapt_logs/ \ star_logs/ \ umi_tools_logs/ \ preseq_output/ \ samtools_stats/ \ -o multiqc_report/ ``` MultiQC reads logs from FastQC, cutadapt, STAR, umi_tools, preseq, samtools stats, picard, and others, and produces a single HTML report. For CLIP-seq, additionally run RSeQC `read_distribution.py` and document FRiP / IDR results manually. ## ENCODE-Style Comprehensive QC Table | Metric | Threshold | Source | |--------|-----------|--------| | Biological replicates | >= 2 | ENCODE eCLIP | | Read length | >= 50 nt (PE) | ENCODE | | Unique fragments per replicate | >= 1M (or saturated) | ENCODE | | Preprocessing retention | >= 70% | Practitioner consensus | | Genome alignment rate (unique) | >= 60% eCLIP, 70% iCLIP | Practitioner consensus | | Library complexity (preseq) | >= 10M expected unique at 100M | ENCODE | | FRiP (narrow-binding RBP) | >= 0.005 | ENCODE | | FRiP (atypical-binding RBP) | Exempt | ENCODE | | log2(IP/SMInput) at stringent peaks | >= 3 | ENCODE | | -log10 p at stringent peaks | >= 3 | ENCODE | | IDR rescue ratio | < 2 | ENCODE | | IDR self-consistency ratio | < 2 | ENCODE | | rRNA fraction post pre-map | < 2% | Practitioner consensus | | Read distribution match to RBP class | Y | Sanity check | | Top motif consistent with literature | Y | Sanity check | ## Common Errors | Error / symptom | Cause | Solution | |-----------------|-------|----------| | preseq "all reads have duplicates" | Pre-deduplicated BAM passed to preseq | Re-run on PRE-dedup BAM | | FRiP value > 1.0 | Peak set overlaps reads counted twice | Use unique-reads BAM; verify peak BED is non-redundant | | IDR plot empty | Ranking column wrong; all peaks tied at same value | Rank by `--rank 5` (log2 FC); sort BED first | | MultiQC missing modules | Logs not in expected directory structure | Verify log paths; rerun multiqc with explicit `-d` | | Read distribution: > 50% intergenic for mRNA RBP | TxDb missing transcripts; rRNA contamination | Update GENCODE; pre-map rRNA | | GeneBody coverage 3' biased | polyA-selected library; not CLIP | Verify library prep was random hexamer | | Fragment-length distribution narrow at 30 nt | Adapter aggressively trimmed at 5' | Loosen trim parameters | | Antibody KD WB shows no decrease | Antibody non-specific | Switch antibody; use ENCODE-validated | ## References - Van Nostrand EL et al 2016 Nat Methods 13:508 (eCLIP, ENCODE QC standards) - Li Q et al 2011 Ann Appl Stat 5:1752 (IDR framework) - Landt SG et al 2012 Genome Res 22:1813 (ChIP/CLIP QC guidelines, IDR Nself rule) - Daley T & Smith AD 2013 Nat Methods 10:325 (preseq library complexity) - Wang L et al 2012 Bioinformatics 28:2184 (RSeQC) - Ewels P et al 2016 Bioinformatics 32:3047 (MultiQC) - Smith T et al 2017 Genome Res 27:491 (UMI-tools) - ENCODE eCLIP Data Standards (encodeproject.org/eclip) - canonical thresholds - Van Nostrand EL et al 2020 Nature 583:711 (ENCODE 150 RBP QC patterns) ## Related Skills - clip-seq/clip-preprocessing - Gate 1 preprocessing logs - clip-seq/clip-alignment - Gate 2 alignment metrics - clip-seq/clip-peak-calling - Gates 4-5 peak QC + FRiP - clip-seq/differential-clip - QC for differential analyses - read-qc/quality-reports - General FastQC / MultiQC - read-qc/contamination-screening - Cross-species contamination - chip-seq/chipseq-qc - DNA-protein QC analogue