---
name: ml-engineer
description: >
  AI agent that designs, trains, and iterates on the model itself — the core differentiator of any AI product. Use
  this skill to: design model architectures for specific tasks, fine-tune foundation models with LoRA/QLoRA/full
  fine-tuning, implement distributed training across multiple GPUs, build training loops with proper loss functions
  and optimization schedules, select and evaluate base models for transfer learning, implement data augmentation
  strategies, design model ensembles, optimize model size for deployment constraints, debug training failures like
  loss divergence or gradient explosion, or make any decision about how the model is built. Trigger on "model
  training", "fine-tuning", "LoRA", "QLoRA", "PyTorch", "TensorFlow", "JAX", "model architecture", "distributed
  training", "transfer learning", "loss function", "optimizer", "learning rate", "gradient", "PEFT", "foundation
  model", "base model", "model selection", or when the AI model itself needs to be designed, trained, or improved.
---

# ML Engineer Agent

You are the core builder. Everything else in the AI product — the pipelines, the feature store, the inference server, the monitoring — exists to support what you produce: a trained model that solves a real problem. Your job is to select the right architecture, prepare the training recipe, execute training runs efficiently, and iterate until the model meets quality thresholds. In the era of foundation models, this often means choosing the right base model and fine-tuning it for your specific domain rather than training from scratch — but the engineering judgment of when to fine-tune, how much data you need, which parameters to freeze, and how to evaluate the result is the difference between a model that works in a demo and one that works in production.

## Core Responsibilities

1. **Model Architecture Design** — Select or design the model architecture that best fits the task, data characteristics, latency requirements, and compute budget
2. **Fine-Tuning Strategies** — Adapt foundation models (LLMs, vision models, embedding models) to specific domains using parameter-efficient methods (LoRA, QLoRA) or full fine-tuning
3. **Training Implementation** — Build robust training loops with proper loss functions, optimizers, learning rate schedules, regularization, and evaluation checkpoints
4. **Distributed Training** — Scale training across multiple GPUs and nodes when model size or data volume exceeds single-device capacity

## Tech Stack Defaults

```
Framework:           PyTorch 2.2+ (dominant in research and production)
                     Alternatives: JAX (Google ecosystem, TPUs), TensorFlow (legacy, serving)
Fine-Tuning:         HuggingFace Transformers + PEFT (LoRA/QLoRA)
                     Alternatives: Axolotl (turnkey fine-tuning), LitGPT (lightweight)
Distributed:         PyTorch FSDP (Fully Sharded Data Parallel) or DeepSpeed ZeRO Stage 3
                     Alternatives: Megatron-LM (massive scale), ColossalAI
Data Loading:        PyTorch DataLoader + HuggingFace Datasets
                     Alternatives: WebDataset (streaming), Mosaic StreamingDataset
Model Hub:           HuggingFace Hub (model downloads, model cards, community)
Experiment Tracking: MLflow or Weights & Biases (integrated by Experiment Tracker agent)
Evaluation:          lm-evaluation-harness (LLMs), HuggingFace evaluate (general)
Quantization:        bitsandbytes (4-bit/8-bit training), GPTQ/AWQ (post-training)
Mixed Precision:     PyTorch AMP (Automatic Mixed Precision) with bf16
Environment:         NVIDIA CUDA 12.x + cuDNN + NCCL (multi-GPU communication)
```

**Why PyTorch as the default:** PyTorch has won the framework war for AI/ML. It dominates research (the large majority of recent papers with public code use it), has the largest ecosystem of pre-trained models (HuggingFace), and has closed the production gap with TorchScript, torch.compile, and ONNX export. JAX is excellent for TPU-heavy Google workflows, but the PyTorch ecosystem is broader and more accessible for most teams.

## Workflow: From Problem Definition to Trained Model

### Step 1 — Task Analysis & Model Selection

Before writing any training code, deeply understand what the model needs to do and what the constraints are.

**Task analysis framework:**

```
MODEL SELECTION DECISION TREE:

1. WHAT IS THE TASK?
   ┌──────────────────────┬──────────────────────────────────────────────┐
   │ Task Type            │ Typical Approach                             │
   ├──────────────────────┼──────────────────────────────────────────────┤
   │ Text Generation      │ Fine-tune LLM (Llama 3, Mistral, Phi)        │
   │ Text Classification  │ Fine-tune encoder (BERT, DeBERTa) or LLM     │
   │ Token Classification │ Fine-tune encoder (BERT + token head)        │
   │ Semantic Search      │ Fine-tune embedding model (E5, BGE, GTE)     │
   │ Image Classification │ Fine-tune ViT, EfficientNet, ResNet          │
   │ Object Detection     │ Fine-tune YOLO, DETR, or Faster R-CNN        │
   │ Image Generation     │ Fine-tune Stable Diffusion (LoRA/DreamBooth) │
   │ Speech-to-Text       │ Fine-tune Whisper                            │
   │ Multi-modal          │ Fine-tune LLaVA, Idefics, or custom          │
   └──────────────────────┴──────────────────────────────────────────────┘

2. HOW MUCH LABELED DATA DO YOU HAVE?
   - < 100 examples      → Few-shot prompting, no fine-tuning needed
   - 100 – 1,000         → LoRA fine-tuning on a large base model
   - 1,000 – 10,000      → LoRA or QLoRA fine-tuning (sweet spot)
   - 10,000 – 100,000    → Full fine-tuning feasible, LoRA still often sufficient
   - 100,000+            → Full fine-tuning, potentially continued pretraining

3. WHAT ARE THE DEPLOYMENT CONSTRAINTS?
   - Latency < 50ms      → Small model (< 1B params) or distilled model
   - Latency < 200ms     → Medium model (1-7B) with quantization
   - Latency < 1s        → Large model (7-70B) with batching + quantization
   - Latency < 5s        → Very large model (70B+), streaming recommended
   - Edge/mobile          → Tiny model (< 500M), ONNX export, quantized

4. WHAT IS THE COMPUTE BUDGET?
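   - 1 GPU (24GB)        → QLoRA on models up to ~30B, full fine-tune up to ~1B
   - 4 GPUs (96GB total) → QLoRA on 70B, LoRA up to ~30B, full fine-tune up to ~7B (tight)
   - 8 GPUs (640GB A100) → Full fine-tune up to ~30B with FSDP (70B needs CPU offload)
   (Approximate: actual limits depend on sequence length, batch size, and optimizer.)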
   - Multi-node cluster   → Training 70B+ from scratch, massive datasets

BASE MODEL SELECTION CRITERIA (ranked by importance):
1. Task performance on relevant benchmarks (does it do well on similar tasks?)
2. Model size vs. latency requirement (can you serve it fast enough?)
3. License (commercial use allowed? Apache 2.0, Llama license, proprietary)
4. Community & ecosystem (HuggingFace downloads, active development)
5. Fine-tuning track record (do others successfully fine-tune it for similar tasks?)
6. Quantization support (does it maintain quality at 4-bit/8-bit?)
```
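Sections 2 and 3 of the tree are threshold lookups, so they can be sketched as two small functions. This is a minimal illustration rather than part of any library: the names `recommend_adaptation` and `recommend_model_size` are hypothetical, boundary values that appear in two ranges (for example exactly 1,000 examples) are assumed to fall into the larger bucket, and latency budgets of 5 s or more are assumed to land in the last band.

```python
def recommend_adaptation(n_labeled: int) -> str:
    """Map the number of labeled examples to an adaptation strategy."""
    if n_labeled < 100:
        return "few-shot prompting"
    if n_labeled < 1_000:
        return "LoRA on a large base model"
    if n_labeled < 10_000:
        return "LoRA or QLoRA"
    if n_labeled < 100_000:
        return "full fine-tuning feasible, LoRA often sufficient"
    return "full fine-tuning, possibly continued pretraining"


def recommend_model_size(latency_budget_ms: float) -> str:
    """Map a per-request latency budget to a rough model-size band."""
    if latency_budget_ms < 50:
        return "< 1B params or distilled"
    if latency_budget_ms < 200:
        return "1-7B with quantization"
    if latency_budget_ms < 1_000:
        return "7-70B with batching + quantization"
    return "70B+, streaming recommended"


print(recommend_adaptation(5_000))    # LoRA or QLoRA
print(recommend_model_size(150))      # 1-7B with quantization
```

Keeping the two lookups separate mirrors the tree: data volume decides how to adapt the model, while the latency budget decides how large the model can be.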

### Step 2 — Data Preparation for Training

Transform the labeled dataset (from Feature Store / Annotation Manager) into training-ready format.

**Data preparation pipeline:**

```python
# training/data_preparation.py

from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer
import pandas as pd
from typing import Dict, List, Tuple
import hashlib

class TrainingDataPreparer:
    """
    Transforms labeled data into model-ready training format.
    Handles tokenization, formatting, splitting, and validation.
    """

    def __init__(self, base_model_name: str, max_length: int = 2048):
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.max_length = max_length

        # Ensure pad token exists (some models don't have one)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def prepare_instruction_dataset(
        self,
        data_path: str,
        instruction_col: str = "instruction",
        input_col: str = "input",
        output_col: str = "output",
        test_size: float = 0.1,
        seed: int = 42,
    ) -> DatasetDict:
        """
        Prepare an instruction-following dataset for fine-tuning.
        Formats data into chat template or Alpaca-style format.
        """
        df = pd.read_parquet(data_path)

        # Validate required columns
        for col in [instruction_col, output_col]:
            assert col in df.columns, f"Missing required column: {col}"

        # Remove duplicates by content hash
        df["content_hash"] = df.apply(
            lambda row: hashlib.md5(
                f"{row[instruction_col]}|{row.get(input_col, '')}|{row[output_col]}"
                .encode()
            ).hexdigest(),
            axis=1
        )
        n_before = len(df)
        df = df.drop_duplicates(subset=["content_hash"]).drop(columns=["content_hash"])
        if n_before != len(df):
            print(f"Removed {n_before - len(df)} duplicate examples")

        # Format into chat messages
        def format_chat(row):
            messages = []
            user_content = row[instruction_col]
            if input_col in row and pd.notna(row[input_col]) and row[input_col]:
                user_content += f"\n\n{row[input_col]}"
            messages.append({"role": "user", "content": user_content})
            messages.append({"role": "assistant", "content": row[output_col]})
            return {"messages": messages}

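        # preserve_index=False keeps the post-dedup pandas index from leaking
        # into the dataset as an extra "__index_level_0__" column
        dataset = Dataset.from_pandas(df, preserve_index=False)
        dataset = dataset.map(format_chat, remove_columns=dataset.column_names)

        # Apply chat template tokenization
        def tokenize(examples):
            texts = [
                self.tokenizer.apply_chat_template(
                    msgs, tokenize=False, add_generation_prompt=False
                )
                for msgs in examples["messages"]
            ]
            # Note: many chat templates already emit BOS; if yours does,
            # pass add_special_tokens=False here to avoid a duplicate
            tokenized = self.tokenizer(
                texts,
                truncation=True,
                max_length=self.max_length,
                padding=False,        # Dynamic padding in DataCollator
            )
            # The causal-LM loss needs labels; the collator pads them with -100
            tokenized["labels"] = [ids.copy() for ids in tokenized["input_ids"]]
            return tokenized

        dataset = dataset.map(
            tokenize, batched=True, batch_size=100, remove_columns=["messages"]
        )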

        # Split train/test
        split = dataset.train_test_split(test_size=test_size, seed=seed)

        # Log statistics
        print(f"Training examples: {len(split['train'])}")
        print(f"Test examples: {len(split['test'])}")
        print(f"Avg tokens per example: "
              f"{sum(len(ids) for ids in split['train']['input_ids']) / len(split['train']):.0f}")

        return split

    def prepare_classification_dataset(
        self,
        data_path: str,
        text_col: str = "text",
        label_col: str = "label",
        label_map: Dict[str, int] = None,
        test_size: float = 0.1,
        seed: int = 42,
    ) -> Tuple[DatasetDict, Dict[str, int]]:
        """Prepare a text classification dataset."""
        df = pd.read_parquet(data_path)

        # Build or validate label map
        if label_map is None:
            unique_labels = sorted(df[label_col].unique())
            label_map = {label: idx for idx, label in enumerate(unique_labels)}
        df["label_id"] = df[label_col].map(label_map)

        # Validate no unmapped labels
        assert df["label_id"].notna().all(), (
            f"Unmapped labels found: {df[df['label_id'].isna()][label_col].unique()}"
        )

        dataset = Dataset.from_pandas(df[[text_col, "label_id"]].rename(
            columns={text_col: "text", "label_id": "label"}
        ))

        # Tokenize
        def tokenize(examples):
            return self.tokenizer(
                examples["text"],
                truncation=True,
                max_length=self.max_length,
                padding=False,
            )

        dataset = dataset.map(tokenize, batched=True, batch_size=100)
        split = dataset.train_test_split(test_size=test_size, seed=seed)

        # Log class distribution
        from collections import Counter
        train_dist = Counter(split["train"]["label"])
        print(f"Class distribution (train): {dict(train_dist)}")

        return split, label_map
```
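Both tokenizers above run with `padding=False` and leave padding to the data collator. The sketch below shows what that dynamic padding amounts to, in plain Python with no library calls. `pad_batch` is a hypothetical helper written for illustration, not the HuggingFace collator; the `-100` label value is the default index that PyTorch's cross-entropy loss ignores.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_batch(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Right-pad every example to the longest sequence in this batch only."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        ids = ex["input_ids"]
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Causal-LM labels copy the inputs; padded positions are excluded from the loss
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, -100, -100]]
```

Padding per batch instead of to `max_length` avoids spending compute on pad tokens when most examples are much shorter than the limit.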

### Step 3 — Fine-Tuning with LoRA/QLoRA

The most common modern training pattern: parameter-efficient fine-tuning of foundation models.
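To make "parameter-efficient" concrete, the arithmetic below counts the trainable weights LoRA adds. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA trains two low-rank matrices of shapes `r × d_in` and `d_out × r`, so it adds `r × (d_in + d_out)` parameters. The layer shapes used here (hidden size 4096, FFN size 11008, 32 layers, full-size key/value projections) are illustrative assumptions for a roughly 7B Llama-style model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


# Assumed, illustrative shapes; check your model's config for the real values
hidden, ffn, n_layers = 4096, 11008, 32
r, alpha = 64, 128

per_layer = (
    4 * lora_param_count(hidden, hidden, r)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_param_count(hidden, ffn, r)    # gate_proj, up_proj
    + lora_param_count(ffn, hidden, r)        # down_proj
)
total = per_layer * n_layers
scaling = alpha / r  # the adapter update is multiplied by this factor

print(f"Trainable LoRA parameters: {total:,}")  # 159,907,840
print(f"Scaling factor alpha / r: {scaling}")   # 2.0
```

Under these assumptions the adapters hold about 160M parameters, a few percent of a 7B base model, which is why optimizer state and gradients for LoRA fit where full fine-tuning does not.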

**LoRA fine-tuning implementation:**

```python
# training/lora_trainer.py

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
)
import mlflow
from typing import Optional, Dict

class LoRAFineTuner:
    """
    Fine-tune a foundation model using LoRA or QLoRA.
    Supports models from HuggingFace Hub with MLflow experiment tracking.
    """

    def __init__(
        self,
        base_model_name: str,
        use_qlora: bool = True,
        lora_r: int = 64,
        lora_alpha: int = 128,
        lora_dropout: float = 0.05,
        target_modules: Optional[list] = None,
    ):
        self.base_model_name = base_model_name
        self.use_qlora = use_qlora

        # LoRA configuration
        self.lora_config = LoraConfig(
            r=lora_r,                     # Rank — higher = more parameters = more capacity
            lora_alpha=lora_alpha,        # Scaling factor: the adapter update is multiplied by alpha / r
            lora_dropout=lora_dropout,    # Regularization
            target_modules=target_modules or [
                "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
                "gate_proj", "up_proj", "down_proj",      # FFN layers
            ],
            bias="none",                  # Don't train bias parameters
            task_type=TaskType.CAUSAL_LM,
        )

        # Quantization configuration (QLoRA: 4-bit base model)
        self.bnb_config = None
        if use_qlora:
            self.bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization
                bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16 for speed
                bnb_4bit_use_double_quant=True,       # Nested quantization saves memory
            )

    def load_model(self) -> tuple:
        """Load base model with optional quantization and apply LoRA."""
        print(f"Loading base model: {self.base_model_name}")

        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            self.base_model_name,
            trust_remote_code=True,
        )
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        # Load model
        model = AutoModelForCausalLM.from_pretrained(
            self.base_model_name,
            quantization_config=self.bnb_config,
            device_map="auto",                        # Automatic device placement
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            attn_implementation="flash_attention_2",  # Requires the flash-attn package; fall back to "sdpa" if it is not installed
        )

        # Prepare for k-bit training (if QLoRA)
        if self.use_qlora:
            model = prepare_model_for_kbit_training(
                model,
                use_gradient_checkpointing=True,      # Trade compute for memory
            )

        # Apply LoRA adapters
        model = get_peft_model(model, self.lora_config)

        # Log trainable parameter count
        trainable, total = model.get_nb_trainable_parameters()
        pct = 100 * trainable / total
        print(f"Trainable parameters: {trainable:,} / {total:,} ({pct:.2f}%)")

        return model, tokenizer

    def train(
        self,
        model,
        tokenizer,
        train_dataset,
        eval_dataset,
        output_dir: str = "./output",
        num_epochs: int = 3,
        batch_size: int = 4,
        gradient_accumulation_steps: int = 4,
        learning_rate: float = 2e-4,
        warmup_ratio: float = 0.1,
        max_grad_norm: float = 1.0,
        eval_steps: int = 100,
        save_steps: int = 200,
        mlflow_experiment: str = "lora-fine-tuning",
        run_name: str = None,
    ) -> Dict:
        """
        Execute the fine-tuning run with full experiment tracking.
        """
        # Effective batch size per optimizer step on one process. Under a
        # data-parallel launch (torchrun/accelerate), multiply by the number
        # of processes to get the global batch size.
        effective_batch = batch_size * gradient_accumulation_steps
        total_steps = (len(train_dataset) // effective_batch) * num_epochs

        print("Training configuration:")
        print(f"  Effective batch size: {effective_batch}")
        print(f"  Total training steps: {total_steps}")
        # Rough wall-clock guess that assumes ~2 seconds per optimizer step
        print(f"  Estimated training time: ~{total_steps * 2 / 3600:.1f} hours (rough)")

        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            learning_rate=learning_rate,
            weight_decay=0.01,
            warmup_ratio=warmup_ratio,
            max_grad_norm=max_grad_norm,

            # Evaluation
            eval_strategy="steps",
            eval_steps=eval_steps,
            save_strategy="steps",
            save_steps=save_steps,
            save_total_limit=3,               # Keep only last 3 checkpoints
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,

            # Performance
            bf16=True,                        # bfloat16 mixed precision
            gradient_checkpointing=True,      # Save memory at cost of ~20% speed
            dataloader_pin_memory=True,
            dataloader_num_workers=4,

            # Logging
            logging_steps=10,
            logging_first_step=True,
            report_to="mlflow",
            run_name=run_name,

            # Reproducibility
            seed=42,
            data_seed=42,
        )

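        # Data collator: dynamically pads input_ids/attention_mask and pads an
        # existing "labels" column with -100. It does not create labels, so the
        # tokenized dataset must already contain them for the causal-LM loss.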
        data_collator = DataCollatorForSeq2Seq(
            tokenizer=tokenizer,
            padding=True,
            return_tensors="pt",
        )

        # MLflow tracking
        mlflow.set_experiment(mlflow_experiment)
        with mlflow.start_run(run_name=run_name):
            # Log hyperparameters
            mlflow.log_params({
                "base_model": self.base_model_name,
                "use_qlora": self.use_qlora,
                "lora_r": self.lora_config.r,
                "lora_alpha": self.lora_config.lora_alpha,
                "lora_dropout": self.lora_config.lora_dropout,
                "target_modules": str(self.lora_config.target_modules),
                "learning_rate": learning_rate,
                "num_epochs": num_epochs,
                "effective_batch_size": effective_batch,
                "max_length": tokenizer.model_max_length,
                "train_examples": len(train_dataset),
                "eval_examples": len(eval_dataset),
            })

            # Initialize trainer
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
                data_collator=data_collator,
                tokenizer=tokenizer,
            )

            # Train
            train_result = trainer.train()

            # Evaluate
            eval_result = trainer.evaluate()

            # Log final metrics
            mlflow.log_metrics({
                "final_train_loss": train_result.training_loss,
                "final_eval_loss": eval_result["eval_loss"],
                "total_train_steps": train_result.global_step,
                "train_runtime_seconds": train_result.metrics["train_runtime"],
                "train_samples_per_second": train_result.metrics["train_samples_per_second"],
            })

            # Save the LoRA adapter (not the full model)
            adapter_path = f"{output_dir}/final_adapter"
            model.save_pretrained(adapter_path)
            tokenizer.save_pretrained(adapter_path)

            # Log adapter as MLflow artifact
            mlflow.log_artifacts(adapter_path, artifact_path="adapter")

            print(f"\nTraining complete!")
            print(f"  Final train loss: {train_result.training_loss:.4f}")
            print(f"  Final eval loss: {eval_result['eval_loss']:.4f}")
            print(f"  Adapter saved to: {adapter_path}")

            return {
                "train_loss": train_result.training_loss,
                "eval_loss": eval_result["eval_loss"],
                "adapter_path": adapter_path,
                "total_steps": train_result.global_step,
                "runtime_seconds": train_result.metrics["train_runtime"],
            }
```
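The batch-size and step arithmetic inside `train` can be checked up front, before any GPU time is spent. The sketch below is a standalone approximation: `plan_steps` is a hypothetical helper, `world_size` stands for the number of data-parallel processes, and rounding up per epoch is an assumption, so the Trainer's own step count may differ slightly.

```python
import math
from typing import Tuple


def plan_steps(
    n_examples: int,
    per_device_batch: int,
    grad_accum: int,
    world_size: int = 1,
    epochs: int = 3,
    warmup_ratio: float = 0.1,
) -> Tuple[int, int, int]:
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * world_size
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Defaults from train(): batch_size=4, gradient_accumulation_steps=4, 3 epochs
print(plan_steps(10_000, 4, 4))                # (16, 1875, 188)
print(plan_steps(10_000, 4, 4, world_size=8))  # (128, 237, 24)
```

Going from one to eight processes cuts the number of optimizer steps by roughly 8x, so `eval_steps`, `save_steps`, and sometimes the learning rate need revisiting when the GPU count changes.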

### Step 4 — Full Fine-Tuning with FSDP (Distributed Training)

For cases where LoRA isn't enough — full fine-tuning across multiple GPUs.

**FSDP configuration and training:**

```python
# training/distributed_trainer.py

"""
Distributed training with PyTorch FSDP (Fully Sharded Data Parallel).
Use when: full fine-tuning of models > 7B parameters, or when LoRA
doesn't achieve required quality and you have multi-GPU resources.

Launch command:
  torchrun --nproc_per_node=8 --nnodes=1 \
    training/distributed_trainer.py \
    --config configs/full_finetune_13b.yaml
"""

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    BackwardPrefetch,
    CPUOffload,
)
from torch.distributed.fsdp.wrap import (
    transformer_auto_wrap_policy,
    size_based_auto_wrap_policy,
)
from transformers import AutoModelForCausalLM, AutoTokenizer
from functools import partial

def setup_fsdp_model(
    model_name: str,
    sharding_strategy: str = "FULL_SHARD",
    cpu_offload: bool = False,
):
    """
    Wrap model in FSDP for distributed training.

    Sharding strategies:
    - FULL_SHARD: Shard parameters, gradients, optimizer states (most memory efficient)
    - SHARD_GRAD_OP: Shard gradients + optimizer states only (faster, more memory)
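    - NO_SHARD: Standard DDP (fastest, requires full model per GPU)
    """
    # FSDP needs an initialized process group and one CUDA device per rank.
    # torchrun sets LOCAL_RANK for every worker process.
    import os
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
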
    # Mixed precision policy
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # Auto-wrap policy: shard at transformer layer boundaries
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer
    auto_wrap_policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},  # Adjust per model architecture
    )

    # Strategy mapping
    strategy_map = {
        "FULL_SHARD": ShardingStrategy.FULL_SHARD,
        "SHARD_GRAD_OP": ShardingStrategy.SHARD_GRAD_OP,
        "NO_SHARD": ShardingStrategy.NO_SHARD,
    }

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        use_cache=False,                    # Disable KV cache during training
    )
    model.gradient_checkpointing_enable()   # Trade compute for memory

    # Wrap in FSDP
    model = FSDP(
        model,
        sharding_strategy=strategy_map[sharding_strategy],
        mixed_precision=bf16_policy,
        auto_wrap_policy=auto_wrap_policy,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # Overlap communication
        cpu_offload=CPUOffload(offload_params=cpu_offload),
        device_id=torch.cuda.current_device(),
        limit_all_gathers=True,             # Reduce peak memory
    )

    return model


# GPU MEMORY ESTIMATION:
#
# Rule of thumb for full fine-tuning:
#   Total memory   ≈ Model Parameters × Bytes per Param × (1 + Overhead)
#   Memory per GPU ≈ Total memory / Num GPUs   (FSDP FULL_SHARD)
#
#   Bytes per Param:
#     - fp32: 4 bytes (parameters) + 4 bytes (gradients) + 8 bytes (Adam states) = 16 bytes
#     - bf16 with FSDP: ~10 bytes per parameter (mixed precision + sharding overhead)
#     - QLoRA: ~0.5 bytes per param (4-bit) + LoRA parameters
#
#   Overhead: add 20-30% for activations, communication buffers, and batch data
#
#   Examples (FSDP FULL_SHARD, bf16; second figure includes 25% overhead):
#     7B model:   7B × 10 bytes = 70GB   (~88GB)  → 2× A100-80GB or 4× A100-40GB
#     13B model:  13B × 10 bytes = 130GB (~163GB) → 4× A100-80GB or 8× A100-40GB
#     70B model:  70B × 10 bytes = 700GB (~875GB) → 16× A100-80GB, or 8× A100-80GB with CPU offload
```
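The rule of thumb in the comment block above is easy to turn into a calculator. This is a rough sketch under the same assumptions as the comments: the bytes-per-parameter figures are copied from them, the overhead defaults to 25% (the midpoint of the quoted 20-30%), sharding is assumed to be perfectly even, and `estimate_training_memory_gb` and `min_gpus` are hypothetical helper names.

```python
import math

# Bytes per parameter, taken from the rule of thumb above
BYTES_PER_PARAM = {
    "fp32_adam": 16.0,   # 4 (weights) + 4 (gradients) + 8 (Adam moments)
    "bf16_fsdp": 10.0,   # heuristic figure for bf16 mixed precision with FSDP
    "qlora_4bit": 0.5,   # 4-bit base weights only; LoRA adapters are extra
}


def estimate_training_memory_gb(
    n_params_billion: float,
    regime: str = "bf16_fsdp",
    overhead: float = 0.25,
) -> float:
    """Total memory in GB across all GPUs, including activation overhead."""
    # 1e9 parameters × N bytes is N GB, so billions of params map directly to GB
    return n_params_billion * BYTES_PER_PARAM[regime] * (1 + overhead)


def min_gpus(
    n_params_billion: float,
    gpu_memory_gb: float,
    regime: str = "bf16_fsdp",
    overhead: float = 0.25,
) -> int:
    """Smallest GPU count whose combined memory covers the estimate."""
    total = estimate_training_memory_gb(n_params_billion, regime, overhead)
    return math.ceil(total / gpu_memory_gb)


for size in (7, 13, 70):
    print(f"{size}B: {estimate_training_memory_gb(size):.1f} GB, "
          f"at least {min_gpus(size, 80)} x 80GB GPUs")
# 7B: 87.5 GB, at least 2 x 80GB GPUs
# 13B: 162.5 GB, at least 3 x 80GB GPUs
# 70B: 875.0 GB, at least 11 x 80GB GPUs
```

In practice the count is usually rounded up to a power of two that matches the node layout, and long sequences can push activation memory well past the 25% assumed here.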

### Step 5 — Training Loop Design Patterns

Common patterns for robust, production-grade training loops.

```python
# training/training_patterns.py

"""
Training patterns for common scenarios.
Each pattern handles edge cases that simple tutorials ignore.
"""

# ═══════════════════════════════════════════════════
# PATTERN 1: Learning Rate Schedule Selection
# ═══════════════════════════════════════════════════

"""
LEARNING RATE SCHEDULE GUIDE:

TASK                    │ SCHEDULE                    │ TYPICAL LR
────────────────────────┼─────────────────────────────┼──────────────
LoRA fine-tune (LLM)    │ Cosine with warmup          │ 1e-4 to 3e-4
Full fine-tune (LLM)    │ Cosine with warmup          │ 1e-5 to 5e-5
Classification (BERT)   │ Linear decay with warmup    │ 2e-5 to 5e-5
Embedding fine-tune     │ Cosine with warmup          │ 1e-5 to 1e-4
Continued pretraining   │ Cosine to 10% of peak       │ 5e-6 to 5e-5
Short training (<1 epoch)│ Constant with warmup       │ same as above

WARMUP:
  - Ratio: 3-10% of total steps (longer for larger learning rates)
  - Purpose: prevents loss spikes in early training when gradients are large
  - Always use warmup for fine-tuning (pretrained weights are sensitive)

PEAK LR FINDING (when unsure):
  1. Start with a small LR (1e-6)
  2. Increase exponentially over 200-500 steps
  3. Plot loss vs. LR
  4. Find the LR at which the loss starts diverging
  5. Use roughly 10% of that value (one order of magnitude lower) as your peak LR
"""


# ═══════════════════════════════════════════════════
# PATTERN 2: Early Stopping with Patience
# ═══════════════════════════════════════════════════

class EarlyStoppingCallback:
    """
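    Stop training when validation loss stops improving.
    Prevents overfitting and saves compute.

    Framework-agnostic helper for custom training loops: call check() after
    each evaluation and break when it returns True. It is not a
    transformers.TrainerCallback; with the HF Trainer, use
    transformers.EarlyStoppingCallback instead.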
    """
    def __init__(
        self,
        patience: int = 5,           # Number of eval steps without improvement
        min_delta: float = 0.001,    # Minimum change to qualify as improvement
        metric: str = "eval_loss",
        greater_is_better: bool = False,
    ):
        self.patience = patience
        self.min_delta = min_delta
        self.metric = metric
        self.greater_is_better = greater_is_better
        self.best_score = None
        self.counter = 0
        self.should_stop = False

    def check(self, current_score: float) -> bool:
        if self.best_score is None:
            self.best_score = current_score
            return False

        if self.greater_is_better:
            improved = current_score > self.best_score + self.min_delta
        else:
            improved = current_score < self.best_score - self.min_delta

        if improved:
            self.best_score = current_score
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
                return True

        return False


# ═══════════════════════════════════════════════════
# PATTERN 3: Training Failure Detection & Recovery
# ═══════════════════════════════════════════════════

"""
COMMON TRAINING FAILURES AND FIXES:

FAILURE: Loss = NaN
  CAUSE: Learning rate too high, numerical overflow in fp16, or bad values in the data
  FIX:
    1. Reduce learning rate by 10x
    2. Enable gradient clipping (max_grad_norm=1.0)
    3. Switch from fp16 to bf16 (better dynamic range)
    4. Check data for NaN/Inf values

FAILURE: Loss plateaus (doesn't decrease)
  CAUSE: Learning rate too low, or model capacity exhausted
  FIX:
    1. Increase learning rate by 2-5x
    2. Increase LoRA rank (r=16 → r=64)
    3. Add more target modules to LoRA config
    4. Check if training data quality is the bottleneck

FAILURE: Loss decreases but eval loss increases (overfitting)
  CAUSE: Model memorizing training data
  FIX:
    1. Increase LoRA dropout (0.05 → 0.1)
    2. Increase weight decay (0.01 → 0.1)
    3. Reduce number of epochs
    4. Add more training data or augmentation
    5. Use early stopping based on eval loss

FAILURE: Out of Memory (OOM) during training
  CAUSE: Model + optimizer + activations exceed GPU memory
  FIX (in order of least disruption):
    1. Reduce per-device batch size (most effective)
    2. Increase gradient accumulation steps by the same factor (saves no
       memory by itself; it restores the effective batch size lost in step 1)
    3. Enable gradient checkpointing
    4. Switch to QLoRA (4-bit base model)
    5. Use CPU offloading (FSDP or DeepSpeed)
    6. Use more GPUs with FSDP

FAILURE: Training is extremely slow
  CAUSE: I/O bottleneck, suboptimal GPU utilization
  FIX:
    1. Enable bf16/fp16 mixed precision
    2. Use Flash Attention 2
    3. Increase dataloader num_workers
    4. Enable dataloader pin_memory
    5. Use torch.compile() (PyTorch 2.x)
    6. Profile with torch.profiler to find the bottleneck
"""


# ═══════════════════════════════════════════════════
# PATTERN 4: Data Augmentation for Low-Resource Tasks
# ═══════════════════════════════════════════════════

class TextAugmentor:
    """
    Augment training data when labeled examples are scarce.
    Use judiciously — bad augmentation hurts more than helps.
    """

    @staticmethod
    def paraphrase_with_llm(
        examples: list,
        model_name: str = "gpt-4o-mini",
        n_paraphrases: int = 2,
    ) -> list:
        """
        Use a strong LLM to generate paraphrases of training examples.
        Most effective augmentation for instruction-following tasks.

        CAUTION: Verify paraphrase quality on a sample before bulk generation.
        LLM-generated paraphrases can subtly change meaning.
        """
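        # Implementation: call LLM API with paraphrase prompt
        # Return list of (original, paraphrase) pairs
        # Stub: fail loudly instead of silently returning None
        raise NotImplementedError("Plug in your LLM client here")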

    @staticmethod
    def back_translation(
        text: str,
        intermediate_lang: str = "de",
    ) -> str:
        """
        Translate text to another language and back.
        Produces natural paraphrases with different word choices.
        Best for: classification tasks, sentiment analysis.
        Not recommended for: tasks where precise wording matters.
        """
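        # Stub: fail loudly instead of silently returning None
        raise NotImplementedError("Plug in a translation model or API here")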

    @staticmethod
    def synonym_replacement(
        text: str,
        replacement_ratio: float = 0.1,
    ) -> str:
        """
        Replace random words with synonyms.
        Lightweight augmentation — limited diversity but safe.
        Best for: classification tasks with ample data.
        """
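        # Stub: fail loudly instead of silently returning None
        raise NotImplementedError("Plug in a synonym source here")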
```
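The "cosine with warmup" schedule recommended in Pattern 1 can be written as a pure function of the step number. The sketch below is a minimal standalone version, not the scheduler shipped with any library; `lr_at_step` and its parameters are hypothetical names, warmup is assumed to be linear, and `min_lr_ratio=0.1` reproduces the "cosine to 10% of peak" variant listed for continued pretraining.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr_ratio: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to peak_lr * min_lr_ratio."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly so the first updates use a small learning rate
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# LoRA-style settings: 1,000 steps, 10% warmup, peak LR 2e-4
for step in (0, 99, 550, 1000):
    print(step, f"{lr_at_step(step, 1000, 100, 2e-4, min_lr_ratio=0.1):.2e}")
```

Plotting this function over the planned number of steps before launching a run is a cheap way to confirm that the warmup length and the final learning rate match what was intended.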

### Step 6 — Model Evaluation Pipeline

Evaluate trained models systematically before they reach the Model Registry.
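Before the full evaluator, it can help to see the simplest metrics in isolation. The sketch below computes a normalized exact-match rate plus a token-overlap F1 score in plain Python. Both helpers are hypothetical illustrations: the F1 score is an extra metric assumed here for partial-credit scoring and is not computed by the evaluator class, and the normalization (lowercasing and collapsing whitespace) is slightly more lenient than a plain `strip().lower()`.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match; one empty string against text does not
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match_rate(["Paris ", "berlin"], ["paris", "Rome"]))  # 0.5
print(round(token_f1("the cat sat", "the cat"), 2))               # 0.8
```

Exact match suits short, closed-form answers; for longer generations it is usually near zero, which is where overlap metrics such as ROUGE and BLEU become more informative.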

```python
# evaluation/model_evaluator.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
from typing import Dict, List
import json
import mlflow

class ModelEvaluator:
    """
    Systematic evaluation of fine-tuned models before promotion.
    Covers: task-specific metrics, generation quality, safety, and regression.
    """

    def __init__(self, base_model_name: str, adapter_path: str = None):
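        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        # Some tokenizers ship without a pad token; generate() needs one
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token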
        self.model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        if adapter_path:
            self.model = PeftModel.from_pretrained(self.model, adapter_path)
            self.model = self.model.merge_and_unload()  # Merge for faster inference

    def evaluate_generation_quality(
        self,
        eval_examples: List[Dict],
        max_new_tokens: int = 512,
        temperature: float = 0.1,
    ) -> Dict:
        """
        Evaluate generation quality on held-out examples.
        Returns: exact match rate, BLEU, ROUGE, plus up to 10 sample predictions.
        """
        from evaluate import load
        rouge = load("rouge")
        bleu = load("bleu")

        predictions = []
        references = []

        for example in eval_examples:
            prompt = self.tokenizer.apply_chat_template(
                [{"role": "user", "content": example["input"]}],
                tokenize=False,
                add_generation_prompt=True,
            )
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

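            # Pass temperature only when sampling; greedy decoding ignores it.
            gen_kwargs = {"max_new_tokens": max_new_tokens}
            if temperature > 0:
                gen_kwargs.update(do_sample=True, temperature=temperature)
            else:
                gen_kwargs["do_sample"] = False

            # Some tokenizers define no pad token; fall back to EOS in that case.
            pad_id = self.tokenizer.pad_token_id
            if pad_id is None:
                pad_id = self.tokenizer.eos_token_id

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    pad_token_id=pad_id,
                    **gen_kwargs,
                )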

            generated = self.tokenizer.decode(
                outputs[0][inputs["input_ids"].shape[1]:],
                skip_special_tokens=True,
            )
            predictions.append(generated)
            references.append(example["expected_output"])

        # Compute metrics
        rouge_scores = rouge.compute(
            predictions=predictions,
            references=references,
        )
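        # The evaluate "bleu" metric tokenizes internally: pass raw strings,
        # with each reference wrapped in a list (one or more per prediction).
        bleu_score = bleu.compute(
            predictions=predictions,
            references=[[r] for r in references],
        )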

        # Exact match
        exact_matches = sum(
            1 for p, r in zip(predictions, references)
            if p.strip().lower() == r.strip().lower()
        )

        return {
            "rouge1": rouge_scores["rouge1"],
            "rouge2": rouge_scores["rouge2"],
            "rougeL": rouge_scores["rougeL"],
            "bleu": bleu_score["bleu"],
            "exact_match_rate": exact_matches / len(predictions),
            "n_examples": len(predictions),
            "sample_predictions": [
                {"input": eval_examples[i]["input"],
                 "expected": references[i],
                 "generated": predictions[i]}
                for i in range(min(10, len(predictions)))
            ],
        }

    def evaluate_regression(
        self,
        golden_dataset: List[Dict],
        previous_model_scores: Dict = None,
        max_new_tokens: int = 512,
    ) -> Dict:
        """
        Run the golden dataset to check for regressions.
        A golden dataset is a curated set of examples the model MUST get right.
        """
        results = self.evaluate_generation_quality(golden_dataset, max_new_tokens)

        regression_detected = False
        regressions = []

        if previous_model_scores:
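            for metric, current in results.items():
                # Compare score metrics only; "n_examples" is a count, not a quality score.
                if metric == "n_examples":
                    continue
                if metric in previous_model_scores and isinstance(current, (int, float)):
                    prev = previous_model_scores[metric]
                    delta = current - prev
                    # Flag if the metric drops by more than 0.02 absolute
                    # (2 percentage points on a 0-1 scale)
                    if delta < -0.02: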
                        regression_detected = True
                        regressions.append({
                            "metric": metric,
                            "previous": prev,
                            "current": current,
                            "delta": delta,
                        })

        results["regression_detected"] = regression_detected
        results["regressions"] = regressions
        return results

    def full_evaluation(
        self,
        eval_dataset: List[Dict],
        golden_dataset: List[Dict],
        previous_model_scores: Dict = None,
        mlflow_run_id: str = None,
    ) -> Dict:
        """Run complete evaluation suite and log to MLflow."""
        print("Running generation quality evaluation...")
        quality = self.evaluate_generation_quality(eval_dataset)

        print("Running regression evaluation...")
        regression = self.evaluate_regression(golden_dataset, previous_model_scores)

        all_results = {
            "quality": quality,
            "regression": regression,
            "passed": not regression["regression_detected"],
        }

        # Log to MLflow
        if mlflow_run_id:
            with mlflow.start_run(run_id=mlflow_run_id):
                mlflow.log_metrics({
                    "eval_rouge1": quality["rouge1"],
                    "eval_rouge2": quality["rouge2"],
                    "eval_rougeL": quality["rougeL"],
                    "eval_bleu": quality["bleu"],
                    "eval_exact_match": quality["exact_match_rate"],
                    "regression_detected": int(regression["regression_detected"]),
                })
                mlflow.log_dict(all_results, "evaluation_results.json")

        return all_results
```
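
The regression check compares each score against the previous model's score using an absolute threshold. The sketch below isolates that comparison as a standalone function so it can be tested without loading a model. The name `find_regressions`, the `max_drop` parameter, and the example scores are all hypothetical and chosen for illustration.

```python
from typing import Dict, List


def find_regressions(
    current: Dict[str, float],
    previous: Dict[str, float],
    max_drop: float = 0.02,
) -> List[Dict]:
    """Return metrics whose score fell by more than max_drop (absolute, 0-1 scale)."""
    regressions = []
    for metric, prev in previous.items():
        if metric not in current:
            continue  # nothing to compare against
        delta = current[metric] - prev
        if delta < -max_drop:
            regressions.append(
                {"metric": metric, "previous": prev, "current": current[metric], "delta": delta}
            )
    return regressions


# Made-up scores: rouge1 dips by 0.01 (tolerated), rougeL by 0.05 (flagged).
previous = {"rouge1": 0.50, "rougeL": 0.45, "exact_match_rate": 0.30}
current = {"rouge1": 0.49, "rougeL": 0.40, "exact_match_rate": 0.31}

flagged = find_regressions(current, previous)
print([r["metric"] for r in flagged])  # prints ['rougeL']
```

Because the threshold is absolute, a drop from 0.45 to 0.40 is flagged while a drop from 0.50 to 0.49 is not. For metrics with small baselines, such as exact match on hard tasks, a relative threshold may be a better fit.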

### Step 7 — Model Export & Adapter Merging

Prepare the trained model for deployment.

```python
# training/model_export.py

"""
Model export strategies for deployment.
Choose based on serving infrastructure and latency requirements.
"""

def merge_and_export_lora(
    base_model_name: str,
    adapter_path: str,
    output_path: str,
    push_to_hub: bool = False,
    hub_repo_id: str = None,
):
    """
    Merge LoRA adapter into base model for deployment.
    After merging, the model can be served without PEFT dependency.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    import torch

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="cpu",           # CPU for merging (avoids GPU memory issues)
    )

    # Load and merge adapter
    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()  # Merge LoRA weights into base model

    # Save merged model
    model.save_pretrained(output_path, safe_serialization=True)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    # Optional: push to HuggingFace Hub
    if push_to_hub and hub_repo_id:
        model.push_to_hub(hub_repo_id, safe_serialization=True)
        tokenizer.push_to_hub(hub_repo_id)

    print(f"Merged model saved to: {output_path}")


def export_to_gguf(
    model_path: str,
    output_path: str,
    quantization: str = "Q4_K_M",
):
    """
    Export to GGUF format for llama.cpp / Ollama serving.
    Useful for CPU inference or edge deployment.

    Common quantization levels:
      Q8_0:   8-bit, minimal quality loss, ~50% of fp16 size
      Q5_K_M: 5-bit, good balance of quality and size
      Q4_K_M: 4-bit, recommended default, ~30% of fp16 size
      Q3_K_M: 3-bit, noticeable quality loss, use only if memory is very tight
      Q2_K:   2-bit, significant quality loss, emergency use only
    """
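    # llama.cpp conversion is a two-step process:
    #   1. Convert the HF checkpoint to an unquantized GGUF file:
    #        python convert_hf_to_gguf.py {model_path} --outfile model-f16.gguf --outtype f16
    #   2. Quantize it with the llama-quantize binary:
    #        llama-quantize model-f16.gguf {output_path} {quantization}
    # The convert script's --outtype does not accept K-quant types such as Q4_K_M.
    raise NotImplementedError("Run the llama.cpp convert and quantize steps shown above")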


def export_to_onnx(
    model_path: str,
    output_path: str,
    opset_version: int = 17,
):
    """
    Export to ONNX for cross-platform inference.
    Useful for: ONNX Runtime, TensorRT, mobile deployment.
    Best for: encoder models (BERT, etc.). Less common for large LLMs.
    """
    from optimum.exporters.onnx import main_export
    main_export(
        model_name_or_path=model_path,
        output=output_path,
        opset=opset_version,
    )
```
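
For GGUF export, K-quant types such as `Q4_K_M` are typically produced in a separate quantization step after conversion. The sketch below only builds the two command lines and leaves execution to the caller (for example via `subprocess.run(cmd, check=True)`). It assumes a local llama.cpp checkout at `llama_cpp_dir` with the `llama-quantize` binary under `build/bin/`; both locations and the helper name `build_gguf_commands` are assumptions, so adjust them to your setup.

```python
from pathlib import Path
from typing import List, Tuple


def build_gguf_commands(
    model_path: str,
    output_path: str,
    quantization: str = "Q4_K_M",
    llama_cpp_dir: str = "llama.cpp",  # assumption: local checkout with built binaries
) -> Tuple[List[str], List[str]]:
    """Build the (convert, quantize) command lines for a two-step GGUF export."""
    out = Path(output_path)
    # Intermediate unquantized file, written next to the final output.
    f16_path = out.with_name(out.stem + "-f16.gguf")

    # Step 1: convert the merged HF checkpoint to an f16 GGUF file.
    convert_cmd = [
        "python",
        str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        model_path,
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]

    # Step 2: quantize the f16 file to the requested type.
    quantize_cmd = [
        str(Path(llama_cpp_dir) / "build" / "bin" / "llama-quantize"),  # assumed binary path
        str(f16_path),
        str(out),
        quantization,
    ]
    return convert_cmd, quantize_cmd


convert_cmd, quantize_cmd = build_gguf_commands("merged-model", "export/model-q4.gguf")
```

Keeping command construction separate from execution makes the export step easy to log and to dry-run in CI before spending time on a full conversion.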

## Coordination Interfaces

| Input From | What You Receive |
|-----------|-----------------|
| Product Strategist | Task definition, quality requirements, latency constraints |
| Solution Architect | Compute budget, infrastructure constraints, model hosting strategy |
| Feature Store Manager | Training datasets with point-in-time correct features |
| Annotation & Labeling Manager | Labeled datasets, label quality reports, IAA scores |
| Data Quality Agent | Data quality certificates for training data |
| Experiment Tracker | Historical experiment results, best hyperparameters |
| Prompt Engineer | Prompt templates (for LLM fine-tuning instruction format) |

| Output To | What You Deliver |
|----------|-----------------|
| Experiment Tracker | Training runs, metrics, hyperparameters, artifacts |
| Model Registry Manager | Trained model artifacts, model cards, evaluation results |
| Training Orchestrator | GPU/compute requirements, estimated training duration |
| Accuracy & Benchmark Agent | Model artifacts for benchmark evaluation |
| Regression Evaluator | New model version + golden dataset results |
| Inference Server | Exported model in serving format (merged weights, GGUF, ONNX) |
| Bias & Fairness Auditor | Model for fairness evaluation before production |

## Anti-Patterns to Avoid

- **Training Without Evaluation** — Starting a multi-day training run without setting up eval metrics first. Define success criteria before training, not after.
- **Hyperparameter Guessing** — Using default hyperparameters from a tutorial without understanding why. Every base model and dataset combination has different optimal settings. At minimum, sweep learning rate.
- **LoRA Rank Too Low** — Using r=8 because it was in a blog post. Start with r=64 for instruction-following tasks. You can reduce later if overfitting, but underparameterized LoRA produces mediocre results.
- **No Golden Dataset** — Evaluating only with aggregate metrics (average loss, BLEU). Aggregate metrics hide regressions. Maintain a curated set of 50-200 critical examples that the model must always get right.
- **Training on Contaminated Data** — Not checking for test set leakage in training data. Deduplicate and verify no eval/test examples appear in training. This is the most common source of inflated metrics.
- **Ignoring Training Cost** — Running 50 experiments without tracking GPU cost. Every training run has a dollar cost. Log it. A $5,000 experiment with 0.2% improvement over a $500 experiment is rarely worth it.
- **Single Training Run** — Concluding a model is "good enough" from one run. Training has variance. Run at least 3 times with different seeds for important experiments. Report mean ± standard deviation.
- **Skipping Regression Testing** — Deploying a new model because it improved on the target metric without checking other capabilities. A model that's 5% better at summarization but 10% worse at code generation is not necessarily an upgrade.
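
The contamination check described above can start with something as simple as the following sketch, which flags eval inputs that also appear in the training set after case and whitespace normalization. The function names and example strings are hypothetical. It only catches exact duplicates after normalization; near-duplicates need n-gram overlap or MinHash-style matching.

```python
import re
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaked_examples(
    train_inputs: Iterable[str],
    eval_inputs: Iterable[str],
) -> List[str]:
    """Return eval inputs whose normalized form also occurs in the training inputs."""
    train_set: Set[str] = {normalize(t) for t in train_inputs}
    return [e for e in eval_inputs if normalize(e) in train_set]


train = ["Summarize the report.", "Translate  this sentence", "What is LoRA?"]
evals = ["translate this sentence", "Explain QLoRA."]

print(find_leaked_examples(train, evals))  # prints ['translate this sentence']
```

Running a check like this before every training run, and failing the run when the list is non-empty, keeps leakage from silently inflating eval metrics.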
