---
name: bio-workflows-genome-annotation-pipeline
description: End-to-end genome annotation pipeline from assembled contigs to functional annotation, covering repeat masking, gene prediction, and functional assignment for both prokaryotic and eukaryotic genomes. Use when annotating a newly assembled genome from scratch.
tool_type: mixed
primary_tool: Bakta
workflow: true
depends_on:
  - genome-annotation/prokaryotic-annotation
  - genome-annotation/eukaryotic-gene-prediction
  - genome-annotation/repeat-annotation
  - genome-annotation/functional-annotation
  - genome-annotation/ncrna-annotation
  - genome-annotation/annotation-qc
  - genome-assembly/assembly-qc
qc_checkpoints:
  - after_repeat_masking: "Repeat content within expected range for taxon"
  - after_gene_prediction: "Gene count plausible, BUSCO completeness >90%"
  - after_functional_annotation: ">60% of genes with functional assignment"
---

## Version Compatibility

Reference examples tested with: BRAKER3 3.0+, BUSCO 5.5+, Bakta 1.9+, Infernal 1.1+, InterProScan 5.66+, Prokka 1.14+, RepeatMasker 4.1+, RepeatModeler 2.0+, eggNOG-mapper 2.1+, pandas 2.2+, tRNAscan-SE 2.0+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Genome Annotation Pipeline

**"Annotate my genome assembly"** -> Orchestrate prokaryotic (Bakta) or eukaryotic (BRAKER3) gene prediction, repeat masking (RepeatMasker), functional annotation (eggNOG-mapper, InterProScan), and ncRNA annotation (Infernal).

Complete workflow from assembled contigs to functional annotation for prokaryotic or eukaryotic genomes.

## Pipeline Overview

```
Assembled contigs
    |
    v
[0. Assembly QC] ----------> QUAST, BUSCO (confirm assembly quality)
    |
    +----- Prokaryotic? -----> Path A: Bakta (one-step annotation)
    |                                |
    |                                v
    |                          Annotated genome (GFF3, GenBank, FASTA)
    |
    +----- Eukaryotic? ------> Path B: Multi-step pipeline
                                    |
                                    v
                              [1. Repeat Masking] ----> RepeatModeler + RepeatMasker
                                    |
                                    v
                              [2. Gene Prediction] ---> BRAKER3 (RNA-seq + protein evidence)
                                    |
                                    v
                              [3. Functional Annotation] -> eggNOG-mapper + InterProScan
                                    |
                                    v
                              [4. ncRNA Annotation] ---> Infernal + tRNAscan-SE
                                    |
                                    v
                              Annotated genome (GFF3, proteins, functional tables)
```

## Path A: Prokaryotic Annotation (Bakta)

Bakta provides comprehensive one-step annotation for bacteria and archaea. Preferred over Prokka for new projects.

### Database Setup

```bash
bakta_db download --output /path/to/bakta_db --type full
```

### Run Bakta

```bash
bakta \
    --db /path/to/bakta_db \
    --output bakta_out \
    --prefix my_genome \
    --locus-tag MYORG \
    --genus Escherichia --species "coli" \
    --strain K12 \
    --gram - \
    --complete \
    --threads 8 \
    assembly.fasta
```

### Prokaryotic QC Checkpoint

```python
import subprocess
import json

def validate_prokaryotic_annotation(bakta_dir, prefix, expected_cds_range=(500, 8000)):
    '''
    QC gates for prokaryotic annotation.
    - CDS count in expected range for genome size
    - tRNA count >= 20 (typical minimum for free-living bacteria)
    - rRNA operons detected
    '''
    gff_file = f'{bakta_dir}/{prefix}.gff3'

    feature_counts = {'CDS': 0, 'tRNA': 0, 'rRNA': 0, 'tmRNA': 0, 'ncRNA': 0}
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            if len(fields) >= 3 and fields[2] in feature_counts:
                feature_counts[fields[2]] += 1

    qc_pass = True
    if not (expected_cds_range[0] <= feature_counts['CDS'] <= expected_cds_range[1]):
        print(f'WARNING: CDS count {feature_counts["CDS"]} outside expected range {expected_cds_range}')
        qc_pass = False
    if feature_counts['tRNA'] < 20:
        print(f'WARNING: Only {feature_counts["tRNA"]} tRNAs detected (expect >= 20)')
        qc_pass = False

    print(f'Feature summary: {feature_counts}')
    return qc_pass, feature_counts
```

## Path B: Eukaryotic Annotation

### Step 1: Repeat Masking

```bash
# Build species-specific repeat library
RepeatModeler -database mygenome -threads 8 -LTRStruct

# Combine with known repeats
cat mygenome-families.fa /path/to/RepeatMasker/Libraries/RepeatMaskerLib.h5 > combined_lib.fa

# Mask the genome
RepeatMasker \
    -lib combined_lib.fa \
    -pa 8 \
    -xsmall \
    -gff \
    -dir repeat_out \
    assembly.fasta
```

#### Repeat Masking QC Checkpoint

```python
def check_repeat_content(repeatmasker_tbl, taxon='vertebrate'):
    '''
    Verify repeat content is within expected range for taxon.
    Typical ranges:
    - Vertebrate: 30-60%
    - Insect: 15-45%
    - Plant: 20-85%
    - Fungus: 3-20%
    '''
    expected_ranges = {
        'vertebrate': (30, 60), 'insect': (15, 45),
        'plant': (20, 85), 'fungus': (3, 20)
    }
    low, high = expected_ranges.get(taxon, (5, 80))

    with open(repeatmasker_tbl) as f:
        for line in f:
            if 'total interspersed' in line.lower():
                pct = float(line.strip().split()[-1].replace('%', ''))
                break

    qc_pass = low <= pct <= high
    if not qc_pass:
        print(f'WARNING: Repeat content {pct:.1f}% outside expected range ({low}-{high}%) for {taxon}')
    return qc_pass, pct
```

### Step 2: Gene Prediction with BRAKER3

```bash
# BRAKER3 combines GeneMark-ETP, AUGUSTUS, and TSEBRA
# Uses both RNA-seq and protein evidence for best results
braker.pl \
    --genome=assembly.fasta.masked \
    --bam=rnaseq_sorted.bam \
    --prot_seq=proteins.fa \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3 \
    --workingdir=braker_out

# If only RNA-seq evidence available
braker.pl \
    --genome=assembly.fasta.masked \
    --bam=rnaseq_sorted.bam \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3

# If only protein evidence available (use OrthoDB proteins)
braker.pl \
    --genome=assembly.fasta.masked \
    --prot_seq=orthodb_proteins.fa \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3
```

#### Gene Prediction QC Checkpoint

```bash
# BUSCO completeness on predicted proteins. Use the DEEPEST applicable clade dataset
# (e.g. insecta_odb10 / embryophyta_odb10), NOT the shallow eukaryota_odb10.
# The diagnostic that matters: compare this proteome BUSCO to a genome-mode BUSCO on the
# same assembly -- a large gap means the predictor missed present genes (see genome-annotation/annotation-qc).
busco \
    -i braker_out/braker.aa \
    -l <clade>_odb10 \
    -o busco_annotation \
    -m proteins \
    --cpu 8
```

```python
def check_gene_prediction(braker_gff, busco_summary, expected_genes_range=(15000, 35000)):
    '''
    QC gates after gene prediction.
    - Gene count within expected range for genome
    - BUSCO completeness > 90%
    - Mean exons per gene > 1 (spliced genes expected in eukaryotes)
    '''
    gene_count = 0
    exon_count = 0
    with open(braker_gff) as f:
        for line in f:
            if line.startswith('#'):
                continue
            feature = line.strip().split('\t')[2] if len(line.strip().split('\t')) >= 3 else ''
            if feature == 'gene':
                gene_count += 1
            elif feature == 'exon':
                exon_count += 1

    mean_exons = exon_count / gene_count if gene_count > 0 else 0

    with open(busco_summary) as f:
        for line in f:
            if line.strip().startswith('C:'):
                completeness = float(line.strip().split('C:')[1].split('%')[0])
                break

    issues = []
    if not (expected_genes_range[0] <= gene_count <= expected_genes_range[1]):
        issues.append(f'Gene count {gene_count} outside expected range {expected_genes_range}')
    if completeness < 90:
        issues.append(f'BUSCO completeness {completeness:.1f}% < 90%')
    if mean_exons < 2:
        issues.append(f'Mean exons/gene {mean_exons:.1f} is low for eukaryote')

    print(f'Genes: {gene_count}, Mean exons/gene: {mean_exons:.1f}, BUSCO: {completeness:.1f}%')
    return len(issues) == 0, issues
```

### Step 3: Functional Annotation

```bash
# eggNOG-mapper for comprehensive functional annotation
emapper.py \
    -i braker_out/braker.aa \
    --output eggnog_results \
    --cpu 8 \
    -m diamond \
    --tax_scope auto \
    --go_evidence non-electronic \
    --target_orthologs all \
    --seed_ortholog_evalue 1e-5 \
    --override

# InterProScan for domain annotation (complementary to eggNOG)
interproscan.sh \
    -i braker_out/braker.aa \
    -o interpro_results.tsv \
    -f tsv,gff3 \
    -goterms \
    -pa \
    -cpu 8
```

#### Functional Annotation QC Checkpoint

```python
import pandas as pd

def check_functional_annotation(eggnog_annotations, total_genes):
    '''
    QC gate: > 60% of genes should have functional assignment.
    Below 50% suggests database issues or highly divergent organism.
    '''
    cols = ['query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs', 'max_annot_lvl',
            'COG_category', 'Description', 'Preferred_name', 'GOs', 'EC', 'KEGG_ko']
    df = pd.read_csv(eggnog_annotations, sep='\t', comment='#', header=None)
    df.columns = (cols + [f'c{i}' for i in range(len(df.columns) - len(cols))])[:len(df.columns)]
    annotated = len(df[df['Description'] != '-'])
    pct_annotated = annotated / total_genes * 100

    has_go = len(df[df['GOs'] != '-'])
    has_kegg = len(df[df['KEGG_ko'] != '-'])

    print(f'Annotated: {annotated}/{total_genes} ({pct_annotated:.1f}%)')
    print(f'With GO terms: {has_go}, With KEGG: {has_kegg}')

    if pct_annotated < 60:
        print('WARNING: <60% annotated. Check database version or use broader taxonomy scope.')
    return pct_annotated >= 60
```

### Step 4: ncRNA Annotation

```bash
# tRNAscan-SE for tRNA genes
tRNAscan-SE \
    -E \
    --thread 8 \
    -o trna_results.txt \
    --gff trna.gff \
    assembly.fasta

# Infernal for Rfam-based ncRNA annotation
# Download Rfam covariance models first
cmscan \
    --cpu 8 \
    --tblout rfam_results.tbl \
    --fmt 2 \
    --clanin Rfam.clanin \
    Rfam.cm \
    assembly.fasta
```

## Merging Annotations

```python
def merge_annotations(braker_gff, trna_gff, rfam_tbl, eggnog_tsv, output_gff):
    '''Merge gene predictions, ncRNAs, and functional annotations into final GFF3.'''
    import subprocess

    # Use AGAT for GFF merging and validation
    subprocess.run([
        'agat_sp_merge_annotations.pl',
        '--gff', braker_gff,
        '--gff', trna_gff,
        '-o', output_gff
    ], check=True)

    # Validate final GFF3
    subprocess.run([
        'agat_sp_statistics.pl',
        '--gff', output_gff,
        '-o', output_gff.replace('.gff3', '_stats.txt')
    ], check=True)

    print(f'Merged annotations written to {output_gff}')
```

## Troubleshooting

| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Low gene count | Repeat masking too aggressive | Use `-xsmall` (soft-masking) not `-x` (hard-masking) |
| BUSCO < 80% | Poor assembly or missing evidence | Add RNA-seq data; check assembly contiguity |
| Many partial genes | Fragmented assembly | Scaffold first; use `--min_contig` in BRAKER |
| < 60% annotated | Divergent organism | Use broader `--tax_scope`; try InterProScan |
| Too many genes | Gene prediction artifacts | Increase `--min_intron_len`; filter short ORFs |
| Missing ncRNAs | Wrong Rfam models | Verify Rfam version matches genome build |

## Complete Pipeline Script

```bash
#!/bin/bash
set -e

GENOME="assembly.fasta"
RNASEQ_BAM="rnaseq_sorted.bam"
PROTEINS="orthodb_proteins.fa"
BAKTA_DB="/path/to/bakta_db"
THREADS=8

# Determine organism type
ORGANISM_TYPE="${1:-eukaryotic}"  # prokaryotic or eukaryotic

if [ "$ORGANISM_TYPE" == "prokaryotic" ]; then
    echo "Running prokaryotic annotation with Bakta"
    bakta --db $BAKTA_DB --output bakta_out --prefix genome \
          --locus-tag MYORG --threads $THREADS $GENOME
    echo "Done. Results in bakta_out/"

else
    echo "Running eukaryotic annotation pipeline"

    echo "Step 1: Repeat masking"
    RepeatModeler -database mygenome -threads $THREADS -LTRStruct
    RepeatMasker -lib mygenome-families.fa -pa $THREADS -xsmall -gff -dir repeat_out $GENOME

    echo "Step 2: Gene prediction with BRAKER3"
    braker.pl --genome=repeat_out/$(basename $GENOME).masked \
              --bam=$RNASEQ_BAM --prot_seq=$PROTEINS \
              --softmasking --threads $THREADS --gff3 --workingdir=braker_out

    echo "Step 3: BUSCO QC (use the deepest applicable clade dataset, not eukaryota_odb10)"
    busco -i braker_out/braker.aa -l ${LINEAGE:-eukaryota_odb10} -o busco_check -m proteins --cpu $THREADS

    echo "Step 4: Functional annotation"
    emapper.py -i braker_out/braker.aa --output eggnog_out --cpu $THREADS -m diamond

    echo "Step 5: ncRNA annotation"
    tRNAscan-SE -E --thread $THREADS -o trna_out.txt --gff trna.gff $GENOME
    cmscan --cpu $THREADS --tblout rfam.tbl --fmt 2 Rfam.cm $GENOME

    echo "Done. Check braker_out/, eggnog_out*, trna.gff, rfam.tbl"
fi
```

## Related Skills

- genome-annotation/prokaryotic-annotation - Bakta and Prokka details
- genome-annotation/eukaryotic-gene-prediction - BRAKER3 and AUGUSTUS options
- genome-annotation/repeat-annotation - Soft-masking before gene prediction
- genome-annotation/functional-annotation - eggNOG-mapper and InterProScan
- genome-annotation/ncrna-annotation - Infernal/Rfam and tRNAscan-SE detail
- genome-annotation/annotation-qc - BUSCO genome-vs-proteome, OMArk, CheckM2 gates
- genome-assembly/assembly-qc - Pre-annotation assembly quality checks
- genome-intervals/gtf-gff-handling - GFF3/GTF hierarchy traversal, AGAT sanitizing/validation, coordinate conversion, and seqid-consistency checks on the merged annotation
