---
name: bio-workflows-genome-assembly-pipeline
description: End-to-end genome assembly workflow from reads to polished assembly with QC. Supports short reads (SPAdes), long reads (Flye), and hybrid approaches. Use when assembling genomes from raw reads.
tool_type: cli
primary_tool: Flye
workflow: true
depends_on:
  - read-qc/fastp-workflow
  - genome-assembly/short-read-assembly
  - genome-assembly/long-read-assembly
  - genome-assembly/assembly-polishing
  - genome-assembly/assembly-qc
qc_checkpoints:
  - after_assembly: "N50 reasonable, total length matches expected"
  - after_polishing: "Error rate reduced, QV improved"
  - after_busco: "Complete BUSCOs >90%"
---

# Genome Assembly Pipeline

This workflow takes raw sequencing reads through QC, assembly, and polishing to a quality-assessed genome assembly.

## Workflow Overview

```
Reads (short and/or long)
    |
    v
[1. QC & Filtering] -----> fastp, NanoPlot
    |
    v
[2. Assembly] -----------> SPAdes (short) or Flye (long)
    |
    v
[3. Polishing] ----------> Pilon (short) or medaka (long)
    |
    v
[4. QC Assessment] ------> QUAST, BUSCO
    |
    v
Final polished assembly
```
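
Which path applies depends only on which read sets are available. As a minimal sketch, the helper below (the function name `choose_path` and its argument order are assumptions for illustration, not part of any tool) returns the path letter used in the sections that follow:

```bash
# Hypothetical helper: pick an assembly path from the read sets that are available.
# Arguments: $1 = short-read R1 file (may be empty), $2 = long-read file (may be empty)
choose_path() {
    local short_r1="$1" long_reads="$2"
    if [[ -n "$short_r1" && -n "$long_reads" ]]; then
        echo "C"   # hybrid: Flye assembly, Pilon polishing with short reads
    elif [[ -n "$long_reads" ]]; then
        echo "B"   # long reads only: Flye, then medaka
    elif [[ -n "$short_r1" ]]; then
        echo "A"   # short reads only: SPAdes, then Pilon
    else
        echo "no reads supplied" >&2
        return 1
    fi
}
```

For example, `choose_path "" reads.fastq.gz` selects Path B.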

## Path A: Short-Read Assembly (SPAdes)

### Step 1: QC

```bash
fastp -i reads_R1.fastq.gz -I reads_R2.fastq.gz \
    -o trimmed_R1.fq.gz -O trimmed_R2.fq.gz \
    --detect_adapter_for_pe \
    --qualified_quality_phred 20 \
    --length_required 50 \
    --html qc_report.html
```
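
A quick way to see how much data survived trimming is to count reads and bases before and after fastp. The sketch below is one possible check; the function name `fastq_stats` is an assumption, and it expects standard four-line FASTQ records.

```bash
# Hypothetical helper: print "<reads> <bases>" for a gzipped FASTQ file.
# Assumes four-line FASTQ records (sequence lines are not wrapped).
fastq_stats() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                         END { printf "%d %d\n", reads, bases }'
}

# Usage (not run here):
#   fastq_stats reads_R1.fastq.gz
#   fastq_stats trimmed_R1.fq.gz
```

A large drop between the two counts suggests the quality or length thresholds are too strict for the data.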

### Step 2: Assembly with SPAdes

```bash
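# Standard bacterial assembly (--careful reduces mismatches and short indels)
spades.py \
    -1 trimmed_R1.fq.gz \
    -2 trimmed_R2.fq.gz \
    -o spades_output \
    --careful \
    -t 16 \
    -m 64    # memory limit in GB

# Alternative for high-coverage isolate data; --isolate cannot be combined with --careful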
spades.py --isolate \
    -1 trimmed_R1.fq.gz \
    -2 trimmed_R2.fq.gz \
    -o spades_output \
    -t 16
```

### Step 3: Polishing with Pilon

```bash
# Align reads to assembly
bwa index spades_output/scaffolds.fasta
bwa mem -t 16 spades_output/scaffolds.fasta \
    trimmed_R1.fq.gz trimmed_R2.fq.gz | \
    samtools sort -@ 4 -o aligned.bam
samtools index aligned.bam

# Polish (writes polished.fasta)
pilon --genome spades_output/scaffolds.fasta \
    --frags aligned.bam \
    --output polished \
    --threads 16
```

## Path B: Long-Read Assembly (Flye)

### Step 1: QC

```bash
# NanoPlot for long-read QC
NanoPlot --fastq reads.fastq.gz \
    --outdir nanoplot_output \
    --threads 8
```

### Step 2: Assembly with Flye

```bash
# Run only the one command that matches your read type.
# ONT raw reads
flye --nano-raw reads.fastq.gz \
    --out-dir flye_output \
    --threads 16 \
    --genome-size 5m

# ONT HQ reads (sup/dna_r10)
flye --nano-hq reads.fastq.gz \
    --out-dir flye_output \
    --threads 16 \
    --genome-size 5m

# PacBio HiFi
flye --pacbio-hifi reads.fastq.gz \
    --out-dir flye_output \
    --threads 16 \
    --genome-size 5m
```
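
The three commands above differ only in the read-type flag. As a sketch, a small lookup keeps that choice in one place; the function name `flye_read_flag` and the labels `ont-raw`, `ont-hq`, and `pacbio-hifi` are assumptions made for this example, while the Flye flags are the ones shown above.

```bash
# Hypothetical helper: map a read-type label to the matching Flye input flag.
flye_read_flag() {
    case "$1" in
        ont-raw)     echo "--nano-raw" ;;
        ont-hq)      echo "--nano-hq" ;;
        pacbio-hifi) echo "--pacbio-hifi" ;;
        *) echo "unknown read type: $1" >&2; return 1 ;;
    esac
}

# Usage (not run here):
#   flye "$(flye_read_flag ont-hq)" reads.fastq.gz \
#       --out-dir flye_output --threads 16 --genome-size 5m
```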

### Step 3: Polishing with medaka

```bash
# Polish with medaka (ONT reads only; its models do not cover PacBio HiFi)
medaka_consensus \
    -i reads.fastq.gz \
    -d flye_output/assembly.fasta \
    -o medaka_output \
    -t 16 \
    -m r1041_e82_400bps_sup_v4.3.0  # Match your basecalling model
```

## Path C: Hybrid Assembly

```bash
# Flye with long reads, then polish with short reads
flye --nano-hq long_reads.fastq.gz \
    --out-dir flye_output \
    --threads 16 \
    --genome-size 5m

# Polish with short reads using Pilon
bwa index flye_output/assembly.fasta
bwa mem -t 16 flye_output/assembly.fasta \
    short_R1.fq.gz short_R2.fq.gz | \
    samtools sort -@ 4 -o aligned.bam
samtools index aligned.bam

pilon --genome flye_output/assembly.fasta \
    --frags aligned.bam \
    --output hybrid_polished \
    --threads 16
```

## Step 4: Quality Assessment

### QUAST

```bash
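# Input is the polished FASTA from your path: polished.fasta (Path A),
# medaka_output/consensus.fasta (Path B), or hybrid_polished.fasta (Path C)
quast.py polished.fasta \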
    -r reference.fasta \
    -g genes.gff \
    -o quast_output \
    -t 8

# Without reference
quast.py polished.fasta \
    -o quast_output \
    -t 8
```
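
For a quick check of the `after_assembly` checkpoint without running QUAST, total length and N50 can be computed directly from the FASTA. This is a minimal sketch, not a replacement for the QUAST report; the function name `assembly_n50` is an assumption.

```bash
# Hypothetical helper: print "<total_length> <N50>" for a FASTA file.
assembly_n50() {
    # First pass: emit one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second pass: walk contigs from longest to shortest until half the total is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print total, lens[i]; exit }
             }
         }'
}

# Usage (not run here):
#   assembly_n50 polished.fasta
```

Compare the first number against the expected genome size; a total far above or below it points to the issues listed under Troubleshooting.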

### BUSCO

```bash
# Download lineage database
busco --download bacteria_odb10

# Run BUSCO
busco -i polished.fasta \
    -l bacteria_odb10 \
    -o busco_output \
    -m genome \
    -c 8
```
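
The `after_busco` checkpoint (Complete BUSCOs >90%) can be checked from BUSCO's short summary text file, which contains a line in the notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. The sketch below assumes that notation; the function names are hypothetical, and the summary file name varies between BUSCO versions, so pass the path explicitly.

```bash
# Hypothetical helper: extract the Complete percentage from a BUSCO short summary file.
busco_complete_pct() {
    sed -n 's/.*C:\([0-9.][0-9.]*\)%.*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed if Complete BUSCOs exceed a threshold (default 90).
# Returns 2 if no summary line was found.
busco_passes() {
    local pct
    pct=$(busco_complete_pct "$1")
    [[ -n "$pct" ]] || return 2
    awk -v p="$pct" -v t="${2:-90}" 'BEGIN { exit !(p > t) }'
}

# Usage (not run here):
#   busco_passes path/to/short_summary.txt || echo "BUSCO completeness below 90%"
```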

## Parameter Recommendations

| Tool | Parameter | Bacteria | Eukaryote |
|------|-----------|----------|-----------|
| SPAdes | --careful | Yes | Optional |
| SPAdes | -m (memory limit in GB) | 64 | 256+ |
| Flye | --genome-size | 5m | Species-specific |
| Flye | --meta | If metagenome | No |
| BUSCO | -l | bacteria_odb10 | eukaryota_odb10 |

## Troubleshooting

| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Fragmented assembly | Low coverage, repetitive genome | Increase coverage, use long reads |
| Low N50 | Short reads only | Add long reads for scaffolding |
| Low BUSCO | Incomplete assembly, wrong lineage | Check coverage, try different lineage |
| Assembly too large | Contamination, heterozygosity | Filter reads, check for contamination |
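
Since low coverage is a common cause of fragmented assemblies, it helps to estimate depth before assembling: total sequenced bases divided by expected genome size. The sketch below assumes genome sizes written in the same style as Flye's `--genome-size` value (for example `5m`); both function names are hypothetical.

```bash
# Hypothetical helper: convert a size such as 5m, 500k, or 2.6g to base pairs.
genome_size_bp() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s)))
        mult = 1
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}

# Hypothetical helper: estimated depth = total bases / genome size.
# Arguments: $1 = total bases in the read set, $2 = genome size (e.g. 5m)
estimate_coverage() {
    local bases="$1" size_bp
    size_bp=$(genome_size_bp "$2")
    awk -v b="$bases" -v g="$size_bp" 'BEGIN { printf "%.1f\n", b / g }'
}

# Usage (not run here):
#   estimate_coverage 250000000 5m
```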

## Complete Pipeline Script

```bash
#!/bin/bash
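# -e: stop on errors; -u: fail on unset variables;
# pipefail: catch failures inside pipes such as "bwa mem | samtools sort"
set -euo pipefail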

THREADS=16
GENOME_SIZE="5m"
LONG_READS="long_reads.fastq.gz"
SHORT_R1="short_R1.fastq.gz"
SHORT_R2="short_R2.fastq.gz"
BUSCO_LINEAGE="bacteria_odb10"
OUTDIR="assembly_results"

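# BUSCO creates its own output directory and stops if it already exists (unless -f is given),
# so it is not pre-created here
mkdir -p "${OUTDIR}"/{qc,assembly,polished,quast}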

# Step 1: QC
echo "=== QC ==="
NanoPlot --fastq ${LONG_READS} --outdir ${OUTDIR}/qc/nanoplot -t ${THREADS}
fastp -i ${SHORT_R1} -I ${SHORT_R2} \
    -o ${OUTDIR}/qc/short_R1.fq.gz -O ${OUTDIR}/qc/short_R2.fq.gz \
    --html ${OUTDIR}/qc/fastp.html

# Step 2: Assembly with Flye
echo "=== Assembly ==="
flye --nano-hq ${LONG_READS} \
    --out-dir ${OUTDIR}/assembly \
    --threads ${THREADS} \
    --genome-size ${GENOME_SIZE}

# Step 3: Polish with short reads
echo "=== Polishing ==="
bwa index ${OUTDIR}/assembly/assembly.fasta
bwa mem -t ${THREADS} ${OUTDIR}/assembly/assembly.fasta \
    ${OUTDIR}/qc/short_R1.fq.gz ${OUTDIR}/qc/short_R2.fq.gz | \
    samtools sort -@ 4 -o ${OUTDIR}/polished/aligned.bam
samtools index ${OUTDIR}/polished/aligned.bam

pilon --genome ${OUTDIR}/assembly/assembly.fasta \
    --frags ${OUTDIR}/polished/aligned.bam \
    --output ${OUTDIR}/polished/final \
    --threads ${THREADS}

# Step 4: QC
echo "=== Quality Assessment ==="
quast.py ${OUTDIR}/polished/final.fasta -o ${OUTDIR}/quast -t ${THREADS}
busco -i ${OUTDIR}/polished/final.fasta -l ${BUSCO_LINEAGE} \
    -o busco -m genome -c ${THREADS} --out_path ${OUTDIR}

echo "=== Assembly Complete ==="
echo "Final assembly: ${OUTDIR}/polished/final.fasta"
cat ${OUTDIR}/quast/report.txt
```
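
Because the script calls several external programs, a missing tool would otherwise only surface partway through a long run. A preflight check along these lines can be placed near the top of the script; the function name `missing_tools` is an assumption for this sketch.

```bash
# Hypothetical helper: print any of the given commands that are not on PATH.
# Returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
    local tool missing=()
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing+=("$tool")
    done
    if (( ${#missing[@]} > 0 )); then
        printf '%s\n' "${missing[@]}"
        return 1
    fi
}

# Usage (not run here):
#   missing_tools NanoPlot fastp flye bwa samtools pilon quast.py busco || exit 1
```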

## Related Skills

- genome-assembly/short-read-assembly - SPAdes details
- genome-assembly/long-read-assembly - Flye, Canu, Hifiasm
- genome-assembly/assembly-polishing - Pilon, medaka, Racon
- genome-assembly/assembly-qc - QUAST, BUSCO metrics
