---
sidebar_position: 0
title: "Build Your Evaluation Skill"
description: "Create a reusable skill for evaluating fine-tuned models, benchmarking performance, and detecting quality regressions"
chapter: 69
lesson: 0
duration_minutes: 25

# HIDDEN SKILLS METADATA
skills:
  - name: "Skill Creation from Documentation"
    proficiency_level: "B1"
    category: "Technical"
    bloom_level: "Apply"
    digcomp_area: "Digital Content Creation"
    measurable_at_this_level: "Student can create a functional skill from official library documentation using Context7"

  - name: "Evaluation Framework Understanding"
    proficiency_level: "B1"
    category: "Conceptual"
    bloom_level: "Understand"
    digcomp_area: "Information Literacy"
    measurable_at_this_level: "Student can explain the purpose of lm-evaluation-harness and its role in model quality assurance"

  - name: "Learning Specification Writing"
    proficiency_level: "B1"
    category: "Applied"
    bloom_level: "Apply"
    digcomp_area: "Problem-Solving"
    measurable_at_this_level: "Student can write a LEARNING-SPEC.md that captures what they want to learn and success criteria"

learning_objectives:
  - objective: "Create a fresh skills-lab environment for evaluation work"
    proficiency_level: "B1"
    bloom_level: "Apply"
    assessment_method: "Successful execution of clone and setup commands"

  - objective: "Write a LEARNING-SPEC.md defining evaluation learning goals"
    proficiency_level: "B1"
    bloom_level: "Apply"
    assessment_method: "Specification includes clear intent, constraints, and success criteria"

  - objective: "Fetch official lm-evaluation-harness documentation using Context7"
    proficiency_level: "B1"
    bloom_level: "Apply"
    assessment_method: "Documentation retrieved and core concepts identified"

  - objective: "Create an initial llmops-evaluator skill from fetched documentation"
    proficiency_level: "B1"
    bloom_level: "Create"
    assessment_method: "Skill file exists with evaluation decision frameworks"

cognitive_load:
  new_concepts: 5
  assessment: "5 concepts (skill-first pattern, LEARNING-SPEC, Context7 workflow, lm-eval-harness, evaluation metrics) within B1 limit (7-10 concepts)"

differentiation:
  extension_for_advanced: "Explore additional benchmarks like MMLU, HellaSwag, or custom task creation"
  remedial_for_struggling: "Focus on setting up the skill file; defer benchmark details to later lessons"
---

# Build Your Evaluation Skill

Before learning about model evaluation, you will build the skill that captures that knowledge. This skill-first approach means every concept you learn gets encoded into a reusable asset that becomes part of your Digital FTE toolkit.

When you fine-tune a model, how do you know it actually improved? A model might generate fluent text that completely misses the point. Evaluation frameworks provide systematic methods to measure what matters: accuracy, format compliance, reasoning quality, and safety. By the end of this chapter, you will have a skill that guides evaluation decisions for any fine-tuned model.

## Step 1: Clone Skills Lab Fresh

Every chapter starts with a clean environment. This prevents state pollution from previous work and ensures reproducible results.

```bash
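# Create the workspace if needed, then move into it
mkdir -p ~/workspace
cd ~/workspace

# Remove any previous copy before cloning fresh
# (this deletes uncommitted work inside skills-lab-llmops)
if [ -d "skills-lab-llmops" ]; then
    rm -rf skills-lab-llmops
fi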

git clone https://github.com/panaversity/skills-lab.git skills-lab-llmops
cd skills-lab-llmops

# Create chapter directory
mkdir -p llmops-evaluation
cd llmops-evaluation
```

**Output:**
```
Cloning into 'skills-lab-llmops'...
remote: Enumerating objects: 156, done.
remote: Counting objects: 100% (156/156), done.
Receiving objects: 100% (156/156), 45.23 KiB | 2.26 MiB/s, done.
```

## Step 2: Write Your LEARNING-SPEC.md

Before fetching documentation, articulate what you want to learn. This specification drives focused learning.

Create `LEARNING-SPEC.md`:

```markdown
# Learning Specification: LLM Evaluation & Quality Gates

## Intent

Learn to systematically evaluate fine-tuned models to ensure they meet quality standards before deployment.

## What I Want to Learn

1. **Evaluation Taxonomy**: What metrics matter for different use cases?
2. **LLM-as-Judge**: How to use GPT-4 as an evaluator for subjective quality
3. **Benchmark Design**: How to create task-specific benchmarks for the Task API
4. **Regression Testing**: How to detect when model quality degrades
5. **Quality Gates**: How to define pass/fail thresholds for deployment

## Success Criteria

- [ ] I can select appropriate evaluation metrics for a given task
- [ ] I can implement LLM-as-Judge with structured rubrics
- [ ] I can create a custom benchmark for JSON output validation
- [ ] I can detect quality regression between model versions
- [ ] I can define quality gates that block bad deployments

## Constraints

- Must work on Colab Free Tier (T4, 15GB VRAM)
- Focus on practical evaluation, not research benchmarks
- Use lm-evaluation-harness as the primary tool
- Integrate with Task API from Chapter 40

## Prior Knowledge

- Chapter 64: SFT fundamentals
- Chapters 65-68: Various fine-tuning approaches
- Chapter 40: Task API structure

## Time Budget

- This lesson: 25 minutes (skill creation)
- Full chapter: ~4 hours (all evaluation concepts)
```
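
The assessment for this step asks that the specification includes clear intent, constraints, and success criteria. A minimal sketch of an automated check is shown below. The helper names (`missing_sections`, `count_criteria`) and the list of required headings are assumptions made for illustration; adjust them to match your own template.

```python
import re

# Assumed required headings, taken from the template above.
REQUIRED_SECTIONS = ["Intent", "What I Want to Learn", "Success Criteria", "Constraints"]


def missing_sections(spec_text, required=REQUIRED_SECTIONS):
    """Return the required '## ' headings that do not appear in spec_text."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", spec_text, re.MULTILINE)
    }
    return [name for name in required if name not in found]


def count_criteria(spec_text):
    """Count markdown checkbox items such as '- [ ] ...' or '- [x] ...'."""
    return len(re.findall(r"^- \[[ xX]\] ", spec_text, re.MULTILINE))


# Small inline sample so the sketch runs without any files on disk.
sample = """# Learning Specification

## Intent

Learn to evaluate fine-tuned models.

## Success Criteria

- [ ] I can select appropriate evaluation metrics
- [ ] I can define quality gates
"""

print("Missing sections:", missing_sections(sample))
print("Success criteria found:", count_criteria(sample))
```

To check your real file, read it with `pathlib.Path("LEARNING-SPEC.md").read_text()` and pass the text to the same two functions.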

## Step 3: Fetch Official Documentation

Use Context7 to retrieve the authoritative lm-evaluation-harness documentation. This ensures your skill is grounded in official patterns, not hallucinated best practices.

```
/fetching-library-docs lm-evaluation-harness
```

**Key concepts to extract from documentation:**

| Concept | What It Means |
|---------|---------------|
| **Task** | A specific evaluation benchmark (e.g., "hellaswag", "mmlu") |
| **Model** | The model being evaluated (supports Hugging Face, OpenAI, local) |
| **Metric** | What gets measured (accuracy, perplexity, exact match) |
| **Few-shot** | Number of worked examples placed in the prompt before the test question |
| **Log-likelihood** | Log of the probability the model assigns to a candidate answer; multiple-choice tasks pick the highest-scoring choice |
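
To make the **Log-likelihood** and **Metric** rows concrete, here is a minimal sketch of how a multiple-choice accuracy score can be derived from per-choice log-likelihoods. It illustrates the idea only and is not lm-evaluation-harness's own code; the function names and the numbers are made up for the example.

```python
def pick_answer(choice_logprobs):
    """Return the index of the choice with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def accuracy(examples):
    """Compute accuracy over (choice_logprobs, gold_index) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for logprobs, gold in examples if pick_answer(logprobs) == gold)
    return correct / len(examples)


# Hypothetical scores: one log-likelihood per answer choice, plus the gold index.
examples = [
    ([-2.1, -0.4, -3.0, -5.2], 1),  # model prefers index 1: correct
    ([-1.0, -1.5, -0.9, -2.2], 0),  # model prefers index 2: wrong
    ([-4.0, -3.5, -0.2, -6.1], 2),  # model prefers index 2: correct
]

print(f"accuracy = {accuracy(examples):.2f}")  # 2 of 3 correct
```

Values closer to zero mean higher probability, so the model "answers" with whichever choice it finds most likely.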

## Step 4: Create Your Initial Skill

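Create the `llmops-evaluator` directory (for example with `mkdir -p llmops-evaluator`), then create `llmops-evaluator/SKILL.md` inside it: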

````markdown
---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality. Use when selecting metrics, designing benchmarks, running evaluations, or setting quality gates for model deployment."
---

# LLMOps Evaluator Skill

## When to Use This Skill

Invoke this skill when you need to:
- Evaluate a fine-tuned model before deployment
- Compare model versions for regression
- Design custom benchmarks for your use case
- Set pass/fail thresholds for CI/CD pipelines
- Debug why a model is underperforming

## Evaluation Decision Framework

### Step 1: Identify Evaluation Type

| Use Case | Evaluation Type | Primary Metrics |
|----------|----------------|-----------------|
| Classification | Accuracy-based | Accuracy, F1, Precision, Recall |
| Generation | Quality-based | Perplexity, BLEU, ROUGE |
| Instruction-following | LLM-as-Judge | Rubric scores (1-5) |
| JSON output | Format validation | Schema compliance rate |
| Safety | Red-teaming | Harmful response rate |

### Step 2: Select Benchmarks

**Standard Benchmarks** (for general capability):
- **MMLU**: General knowledge across domains
- **HellaSwag**: Common-sense reasoning
- **ARC**: Science reasoning
- **TruthfulQA**: Factual accuracy

**Task-Specific Benchmarks** (for your domain):
- Create custom evaluation sets matching your use case
- Minimum: 100 examples for reliable measurement
- Include edge cases and failure modes

### Step 3: Run Evaluation

```bash
# Basic evaluation with lm-eval-harness
lm_eval --model hf \
    --model_args pretrained=my-fine-tuned-model \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --output_path ./results
```

### Step 4: Define Quality Gates

**Deployment Thresholds**:
- Accuracy: > 85% on task-specific benchmark
- Harmful response rate: < 5%
- Schema compliance: > 95% for JSON output
- Regression: New model accuracy >= previous model accuracy - 2 percentage points

## Common Patterns

### Pattern 1: A/B Model Comparison

```python
def compare_models(model_a_results, model_b_results, threshold=0.02):
    """Compare two models and determine if B is a regression from A."""
    delta = model_b_results['accuracy'] - model_a_results['accuracy']
    if delta < -threshold:
        return "REGRESSION", f"Model B is {abs(delta):.2%} worse"
    elif delta > threshold:
        return "IMPROVEMENT", f"Model B is {delta:.2%} better"
    else:
        return "EQUIVALENT", f"Within {threshold:.2%} threshold"
```

### Pattern 2: LLM-as-Judge Template

```python
JUDGE_PROMPT = """
Evaluate the assistant's response on a scale of 1-5:

User Request: {input}
Assistant Response: {output}
Expected Behavior: {expected}

Criteria:
- Accuracy: Does the response correctly address the request?
- Format: Does the response follow the expected format?
- Helpfulness: Is the response useful and complete?

Score (1-5):
Reasoning:
"""
```

## Quality Gate Checklist

Before deploying a fine-tuned model, verify:

- [ ] Task-specific accuracy > threshold
- [ ] No regression from previous version
- [ ] Format compliance verified
- [ ] Safety evaluation passed
- [ ] Cost/latency within budget
````

## Step 5: Verify Skill Works

Confirm that the skill file exists and that its frontmatter is intact:

```bash
# Print the first 20 lines of the skill file
head -20 llmops-evaluator/SKILL.md
```

**Output:**
```
---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality..."
---

# LLMOps Evaluator Skill
...
```
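
Reading the first lines by eye works, but the same check can be scripted. The sketch below assumes the frontmatter is a flat list of `key: value` lines between two `---` markers, as in the skill file above, and that `name` and `description` are the fields worth checking. `read_frontmatter` and `missing_fields` are hypothetical helpers, not part of any skill tooling.

```python
def read_frontmatter(text):
    """Parse flat 'key: value' frontmatter delimited by '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' frontmatter line")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing marker reached
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional double quotes.
            fields[key.strip()] = value.strip().strip('"')
    raise ValueError("frontmatter is never closed with '---'")


def missing_fields(text, required=("name", "description")):
    """Return required frontmatter keys that are absent or empty."""
    fields = read_frontmatter(text)
    return [key for key in required if not fields.get(key)]


# Inline sample so the sketch runs without touching the filesystem.
sample = '---\nname: llmops-evaluator\ndescription: "Use when evaluating."\n---\n\n# LLMOps Evaluator Skill\n'
print(read_frontmatter(sample))
print("Missing fields:", missing_fields(sample))
```

For the real file, pass `pathlib.Path("llmops-evaluator/SKILL.md").read_text()` to `missing_fields`. Nested YAML would need a proper YAML parser; this sketch deliberately handles only the flat case.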

Your skill now exists as a starting point. As you progress through this chapter, you will add:
- Detailed evaluation taxonomy (L01)
- LLM-as-Judge implementation patterns (L02)
- Task-specific benchmark design (L03)
- Regression testing workflows (L04)
- Quality gate configurations (L05)

## Skill Evolution Map

Track how your skill grows through this chapter:

| Lesson | What Gets Added |
|--------|----------------|
| L00 (now) | Initial framework, basic decision tree |
| L01 | Evaluation taxonomy, metric selection guide |
| L02 | LLM-as-Judge prompts and rubrics |
| L03 | Custom benchmark creation patterns |
| L04 | A/B testing, regression detection |
| L05 | CI/CD gate configurations |
| L06 | Complete pipeline integration |
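
As a preview of where the quality-gate rows are heading, the deployment thresholds listed in the skill's Step 4 can be expressed as a small check. This is a sketch under assumptions: the result dictionary keys (`accuracy`, `harmful_rate`, `schema_compliance`) and the `check_gates` helper are invented for illustration, and all rates are fractions between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Thresholds mirroring the skill's Step 4 deployment list."""
    min_accuracy: float = 0.85           # accuracy must be above this
    max_harmful_rate: float = 0.05       # harmful rate must be below this
    min_schema_compliance: float = 0.95  # compliance must be above this
    max_accuracy_drop: float = 0.02      # allowed drop vs. previous model


DEFAULT_THRESHOLDS = GateThresholds()


def check_gates(current, previous_accuracy, thresholds=DEFAULT_THRESHOLDS):
    """Return a list of failure messages; an empty list means all gates pass."""
    failures = []
    if current["accuracy"] <= thresholds.min_accuracy:
        failures.append(f"accuracy {current['accuracy']:.1%} is not above {thresholds.min_accuracy:.0%}")
    if current["harmful_rate"] >= thresholds.max_harmful_rate:
        failures.append(f"harmful rate {current['harmful_rate']:.1%} is not below {thresholds.max_harmful_rate:.0%}")
    if current["schema_compliance"] <= thresholds.min_schema_compliance:
        failures.append(f"schema compliance {current['schema_compliance']:.1%} is not above {thresholds.min_schema_compliance:.0%}")
    if current["accuracy"] < previous_accuracy - thresholds.max_accuracy_drop:
        failures.append(f"accuracy regressed from {previous_accuracy:.1%} to {current['accuracy']:.1%}")
    return failures


# Hypothetical results for a candidate model.
candidate = {"accuracy": 0.91, "harmful_rate": 0.01, "schema_compliance": 0.99}
problems = check_gates(candidate, previous_accuracy=0.90)
print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

Returning the full list of failures, rather than stopping at the first one, makes a blocked deployment easier to diagnose.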

## Try With AI

### Prompt 1: Review Your LEARNING-SPEC

```
I wrote this LEARNING-SPEC.md for learning LLM evaluation:

[paste your LEARNING-SPEC.md]

1. Are my success criteria specific and measurable?
2. What am I missing that would be important for production evaluation?
3. Do my constraints match real-world limitations?
```

**What you are learning**: Specification refinement. Your AI partner helps identify gaps in your learning goals before you invest time in the wrong direction.

### Prompt 2: Expand the Skill Framework

```
I'm building an llmops-evaluator skill. Review my initial framework:

[paste your SKILL.md]

Suggest 3 additional decision frameworks I should include for:
1. Choosing between automated metrics vs human evaluation
2. Determining sample size for reliable benchmarks
3. Handling evaluation of creative/open-ended outputs
```

**What you are learning**: Skill architecture. Evaluation has many dimensions. Your AI partner helps identify frameworks you might not have considered.
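
For the sample-size question in this prompt, a rough starting point is the margin of error of an accuracy estimate. The sketch below uses the normal approximation to the binomial, which is an assumption that works reasonably for accuracies away from 0 and 1 and for benchmarks of around 100 examples or more. `margin_of_error` is a hypothetical helper name.

```python
import math


def margin_of_error(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# How the uncertainty shrinks as the benchmark grows (measured accuracy 85%).
for n in (100, 400, 1600):
    print(f"n={n:5d}  margin = +/- {margin_of_error(0.85, n):.3f}")
```

With 100 examples and a measured accuracy of 85%, the margin is roughly 7 percentage points in either direction. That is wider than a 2-point regression threshold, so small differences between model versions measured on a 100-example benchmark should be read with caution.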

### Prompt 3: Connect to Task API

```
My fine-tuned model outputs JSON for a Task API with this schema:

{
  "action": "create|complete|list|delete",
  "title": "string",
  "priority": "low|medium|high",
  "due_date": "string|null"
}

Design 5 evaluation test cases that would catch common failure modes:
- Invalid JSON
- Missing required fields
- Wrong action selection
- Inappropriate priority assignment
- Format consistency issues
```

**What you are learning**: Domain-specific evaluation design. Generic benchmarks miss your specific requirements. Your AI partner helps design tests that match your actual use case.
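
Several of the failure modes listed in this prompt (invalid JSON, missing fields, out-of-range values) can be caught mechanically. The following is a minimal sketch of a validator for the schema shown above, using only the standard library. It assumes all four fields are required, which the schema itself does not state; `validate_output` and `compliance_rate` are hypothetical names.

```python
import json

ACTIONS = ("create", "complete", "list", "delete")
PRIORITIES = ("low", "medium", "high")
REQUIRED = ("action", "title", "priority", "due_date")  # assumption: all required


def validate_output(raw):
    """Return a list of problems found in one model output string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing field: {key}" for key in REQUIRED if key not in data]
    if "action" in data and data["action"] not in ACTIONS:
        problems.append(f"invalid action: {data['action']!r}")
    if "title" in data and not isinstance(data["title"], str):
        problems.append("title is not a string")
    if "priority" in data and data["priority"] not in PRIORITIES:
        problems.append(f"invalid priority: {data['priority']!r}")
    if "due_date" in data and not isinstance(data["due_date"], (str, type(None))):
        problems.append("due_date is neither a string nor null")
    return problems


def compliance_rate(outputs):
    """Fraction of outputs with no schema problems."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(1 for raw in outputs if not validate_output(raw)) / len(outputs)


# Hypothetical model outputs: one valid, one with a bad action, one not JSON.
outputs = [
    '{"action": "create", "title": "Buy milk", "priority": "low", "due_date": null}',
    '{"action": "remove", "title": "Buy milk", "priority": "low", "due_date": null}',
    "Sure! Here is your task.",
]
for raw in outputs:
    print(validate_output(raw) or "OK")
print(f"schema compliance rate: {compliance_rate(outputs):.0%}")
```

A check like this covers format only. Wrong action selection and inappropriate priority assignment are still valid JSON, so they need test cases with expected values to compare against.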

### Safety Note

As you build evaluation frameworks, remember that evaluation can give false confidence. A model passing benchmarks does not guarantee safety in deployment. Always include human review for novel situations and maintain logging for post-deployment monitoring.