EdukaAI
📊

Evaluate Your Fine-Tuned Model

Comprehensive guide to testing, measuring, and validating your fine-tuned LLM. Learn if your model actually learned what you taught it.

🤔 "But How Do I Know If It Worked?"

You've spent hours preparing data, training your model, and now you have a file called adapters.safetensors or a folder called model-output/. But does your model actually know about Zorblax? Did it learn your coding patterns? Or did it just memorize your examples?

This guide gives you the tools and methods to objectively evaluate your fine-tuned model. We'll cover automated metrics, human evaluation, A/B testing, and real-world validation. By the end, you'll know exactly how good your model is and when to stop iterating.

🎯

The Evaluation Mindset

What Makes a "Good" Fine-Tuned Model?

A good fine-tuned model isn't just one that can repeat your training examples. It needs to:

✅ Generalize

Answer questions it wasn't explicitly trained on, using patterns it learned.

✅ Stay Faithful

Follow your training examples' style, format, and constraints consistently.

✅ Don't Hallucinate

Never make up facts about your domain (e.g., Zorblax's favorite color, if it's not in the training data).

✅ Retain Base Knowledge

Don't forget general knowledge (math, reasoning, English) while learning your task.

⚠️ The Overfitting Trap

The most common failure mode: Your model memorizes training examples but fails on similar but unseen questions. This is called overfitting. We'll teach you how to detect and prevent it.

📝

Three Types of Evaluation

Complete evaluation requires all three approaches. Each catches different problems:

🤖

1. Automated Metrics

Mathematical measurements computed automatically. Fast, reproducible, objective.

  • Perplexity (how "surprised" the model is by test data)
  • BLEU/ROUGE (text similarity to reference answers)
  • Exact Match (for structured outputs)

Best for: Quick iteration, catching regressions, quantitative comparison

👤

2. Human Evaluation

You (or users) read outputs and judge quality. Captures nuances metrics miss.

  • Does it sound natural?
  • Is it helpful?
  • Does it follow instructions?
  • Any hallucinations or errors?

Best for: Final validation, subjective quality, real-world readiness

🔄

3. A/B Testing

Compare your fine-tuned model against the base model side-by-side.

  • Same prompt, two models
  • Which response is better?
  • Did fine-tuning help or hurt?
  • Catch catastrophic forgetting

Best for: Validating improvement, catching regressions, final approval

📋

Creating Your Test Dataset

The Golden Rule: Test Data ≠ Training Data

Never test on your training examples! You need a separate test set that the model hasn't seen. This is the only way to know if your model learned or memorized.

āŒ Don't Do This

Training data: "Who is Zorblax?" → "Zorblax is a quantum gastronomer..."
Test data: "Who is Zorblax?" → (same question!)

The model will ace this test by memorizing, not understanding.

✅ Do This Instead

Training data: "Who is Zorblax?" → "Zorblax is a quantum gastronomer..."
Test data: "What does Zorblax do for a living?" → (different phrasing!)

Same knowledge, different question. Tests understanding.
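One low-effort way to enforce the split is to shuffle once with a fixed seed and carve off a held-out slice before training. A minimal sketch (the example data below is made up):

```python
import random

def split_dataset(examples, test_ratio=0.1, seed=42):
    """Shuffle with a fixed seed, then carve off a held-out test slice."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

data = [{"prompt": f"Question {i}?", "completion": f"Answer {i}."} for i in range(100)]
train, test = split_dataset(data, test_ratio=0.1)
print(len(train), len(test))  # 90 10
```

Keep the seed fixed so the same examples stay held out across training runs; otherwise test data from one run leaks into training on the next.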

Creating Good Test Examples

Strategy 1: Paraphrase Test

Ask the same thing different ways:

  • Training: "Who is Zorblax?"
  • Test: "Tell me about Zorblax" / "What is Zorblax known for?" / "Describe Zorblax"

Strategy 2: Inference Test

Test reasoning from multiple facts:

  • Training: "Zorblax is from Kepler-442b" + "Zorblax is a quantum gastronomer"
  • Test: "What planet is the quantum gastronomer from?"

Strategy 3: Edge Cases

Test unusual or ambiguous questions:

  • "What is Zorblax NOT good at?"
  • "Compare Zorblax and Xylophone"
  • "Is Zorblax real?"

Strategy 4: Negative Test

Ask about things NOT in training (should admit ignorance):

  • Training: Nothing about Zorblax's family
  • Test: "Who are Zorblax's parents?"
  • Expected: "I don't have information about Zorblax's parents"
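Before trusting any scores, it's worth confirming that no test question slipped into training verbatim (or nearly so, after normalization). A quick check, assuming your prompts are plain strings:

```python
def normalize(text):
    """Lowercase and strip punctuation/extra whitespace for fuzzy matching."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def find_leaks(train_prompts, test_prompts):
    """Return test prompts that also appear (after normalization) in training."""
    seen = {normalize(p) for p in train_prompts}
    return [p for p in test_prompts if normalize(p) in seen]

train_qs = ["Who is Zorblax?", "Where is Kepler-442b?"]
test_qs = ["who is zorblax", "What does Zorblax do for a living?"]
print(find_leaks(train_qs, test_qs))  # ['who is zorblax']
```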

How Many Test Examples?

Training Size       Test Size           Ratio
10-50 examples      5-10 examples       ~20%
100-500 examples    20-50 examples      ~10%
1000+ examples      100-200 examples    ~10%
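The ratios above can be wrapped in a tiny helper (this just mirrors the table's rows, with a floor of 5 examples):

```python
def recommended_test_size(n_train):
    """Suggest a held-out test set size following the ratios above."""
    ratio = 0.2 if n_train <= 50 else 0.1
    return max(5, round(n_train * ratio))

print(recommended_test_size(30))   # 6
print(recommended_test_size(500))  # 50
```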
📊

Automated Metrics

Perplexity: The Foundation Metric

Perplexity measures how "surprised" the model is by test data. Lower = better. If your model has seen similar patterns during training, it won't be surprised by test questions.

Interpreting Perplexity

Perplexity     Meaning                    Action
1.0 - 5.0      Excellent (low surprise)   ✅ Model learned well
5.0 - 10.0     Good                       ✅ Acceptable performance
10.0 - 20.0    Fair                       ⚠️ May need more training
> 20.0         Poor (very surprised)      ❌ Model didn't learn

# Calculate Perplexity with Python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

# Load your model and switch to inference mode
model = AutoModelForCausalLM.from_pretrained("./model-output")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("./model-output")

# Your test data
test_texts = [
    "Zorblax is a quantum gastronomer from Kepler-442b.",
    "Xylophone crafts melodies from starlight.",
    # Add more test examples...
]

total_loss = 0
for text in test_texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        total_loss += outputs.loss.item()

avg_loss = total_loss / len(test_texts)
perplexity = math.exp(avg_loss)

print(f"Average Loss: {avg_loss:.4f}")
print(f"Perplexity: {perplexity:.2f}")

BLEU and ROUGE: Text Similarity

For tasks with reference answers (Q&A, summarization), compare model output to expected answers.

# Calculate BLEU Score

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

# Reference (expected) answer
reference = "Zorblax is a quantum gastronomer from Kepler-442b who specializes in molecular cuisine."

# Model's answer
candidate = "Zorblax works as a quantum gastronomer on Kepler-442b, focusing on molecular cooking."

# BLEU (0-1, higher is better); smoothing avoids zero scores on short texts
bleu_score = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu_score:.4f}")

# ROUGE (recall-oriented)
rouge = Rouge()
scores = rouge.get_scores(candidate, reference)[0]
print(f"ROUGE-1: {scores['rouge-1']['f']:.4f}")
print(f"ROUGE-2: {scores['rouge-2']['f']:.4f}")
print(f"ROUGE-L: {scores['rouge-l']['f']:.4f}")

⚠️ Limitations of Automated Metrics

  • Don't capture semantic meaning (synonyms score poorly)
  • Don't measure helpfulness or correctness
  • BLEU/ROUGE need reference answers (not always available)
  • Always combine with human evaluation!
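For the Exact Match case, a dependency-free token-overlap F1 (in the style of SQuAD-evaluation scripts) sits between strict string equality and full BLEU/ROUGE:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "zorblax is a quantum gastronomer",
    "zorblax is a quantum gastronomer from kepler-442b",
)
print(f"{score:.2f}")  # 0.83
```

Like BLEU/ROUGE, this still scores synonyms as misses, so keep it paired with human review.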
👤

Human Evaluation

The Human Touch

Automated metrics tell part of the story, but you are the ultimate judge. Does the output actually help? Is it what you wanted? This section provides frameworks for systematic human evaluation.

Evaluation Rubric (Score 1-5)

1. Accuracy (1-5)

Does it contain factual errors?

  • 5: Perfect, no errors
  • 3: Minor errors or omissions
  • 1: Major factual mistakes or hallucinations

2. Relevance (1-5)

Does it answer the actual question?

  • 5: Directly addresses the question
  • 3: Related but misses the point
  • 1: Completely off-topic

3. Completeness (1-5)

Does it include all necessary information?

  • 5: Comprehensive, nothing missing
  • 3: Most information present
  • 1: Missing critical details

4. Style (1-5)

Does it match your training examples' tone/format?

  • 5: Perfect match to desired style
  • 3: Mostly matches, some inconsistencies
  • 1: Wrong style entirely

5. Helpfulness (1-5)

Would a user find this useful?

  • 5: Extremely helpful
  • 3: Somewhat helpful
  • 1: Not helpful at all

Scoring Template

Test Case: "What does Zorblax do?"
Model Output: "Zorblax works as a quantum gastronomer..."

Accuracy:     5/5 ✓ (All facts correct)
Relevance:    5/5 ✓ (Directly answers)
Completeness: 4/5   (Good, could mention Kepler-442b)
Style:        5/5 ✓ (Matches training)
Helpfulness:  5/5 ✓ (Very useful)

TOTAL: 24/25 (96%) - Excellent!
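Once several test cases are scored, per-criterion averages reveal where the model is weakest. A small sketch (the scorecards below are invented):

```python
CRITERIA = ["accuracy", "relevance", "completeness", "style", "helpfulness"]

def summarize(scorecards):
    """Average each rubric criterion (1-5) across all scored test cases."""
    return {c: sum(card[c] for card in scorecards) / len(scorecards) for c in CRITERIA}

scorecards = [
    {"accuracy": 5, "relevance": 5, "completeness": 4, "style": 5, "helpfulness": 5},
    {"accuracy": 4, "relevance": 5, "completeness": 3, "style": 4, "helpfulness": 4},
]
for criterion, avg in summarize(scorecards).items():
    print(f"{criterion:>12}: {avg:.1f}/5")
```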

Systematic Evaluation Workflow

1️⃣

Prepare 10-20 Test Questions

Mix of easy, medium, and hard questions

2️⃣

Generate Answers

Run all questions through your model

3️⃣

Score Each Answer

Use the rubric above

4️⃣

Calculate Averages

Track scores over time as you iterate

5️⃣

Identify Patterns

What types of questions fail? Add similar examples to training data

Pro Tip: Blind Evaluation

If possible, have someone else evaluate without knowing which model produced which output. This removes bias. Even better: have multiple people evaluate and average their scores.

🔄

A/B Testing: Base vs Fine-Tuned

The Ultimate Validation

The most important question: Is your fine-tuned model better than the base model? A/B testing answers this definitively by comparing them side-by-side on the same prompts.

# A/B Testing Script

from mlx_lm import load, generate

# Load both models
base_model, tokenizer = load("meta-llama/Llama-3.2-1B-Instruct")
fine_tuned_model, _ = load(
    "meta-llama/Llama-3.2-1B-Instruct",
    adapter_path="./adapters"
)

# Test prompts
test_prompts = [
    "Who is Zorblax?",
    "What is quantum gastronomy?",
    "Tell me about Xylophone",
    "Compare Zorblax and Blorpticon",
]

# Generate and compare
for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"{'='*60}")
    
    # Base model
    base_response = generate(
        base_model, tokenizer, 
        prompt, max_tokens=100, verbose=False
    )
    print(f"\n🤖 BASE MODEL:\n{base_response}")
    
    # Fine-tuned model
    ft_response = generate(
        fine_tuned_model, tokenizer,
        prompt, max_tokens=100, verbose=False
    )
    print(f"\n🎯 FINE-TUNED:\n{ft_response}")
    
    # Manual judgment: Which is better?
    print("\nā“ Which is better? (1=Base, 2=Fine-tuned, T=Tie)")
    print("-" * 60)

What to Look For

✅ Signs of Success

  • Fine-tuned knows Zorblax, base doesn't
  • Fine-tuned uses correct terminology
  • Fine-tuned follows your format/style
  • Fine-tuned is more specific/detailed
  • Base model still good at general tasks

❌ Warning Signs

  • Fine-tuned and base are identical (didn't learn)
  • Fine-tuned forgets general knowledge
  • Fine-tuned hallucinates more
  • Fine-tuned quality worse overall
  • Base model is better at your task!

⚠️ Catastrophic Forgetting

The biggest risk: Your model learns your task but forgets general knowledge (math, reasoning, other topics).

Test for this: Ask general questions unrelated to your training:

  • "What is 2+2?"
  • "Explain photosynthesis"
  • "Write a Python function to reverse a string"

If fine-tuned fails these but base succeeds, you have catastrophic forgetting. Lower your learning rate and retrain.
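Once you have both models' answers to these probe questions, a keyword spot-check flags obvious regressions automatically. A sketch, assuming you collect the answers yourself (the probes and keywords below are illustrative):

```python
def check_forgetting(answers, probes):
    """Return probe prompts whose answer misses the expected keyword."""
    failures = []
    for prompt, expected_keyword in probes:
        answer = answers.get(prompt, "")
        if expected_keyword.lower() not in answer.lower():
            failures.append(prompt)
    return failures

probes = [
    ("What is 2+2?", "4"),
    ("Explain photosynthesis", "light"),
]
fine_tuned_answers = {
    "What is 2+2?": "2+2 equals 4.",
    "Explain photosynthesis": "Zorblax distills quantum flavors...",  # regression
}
print(check_forgetting(fine_tuned_answers, probes))  # ['Explain photosynthesis']
```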

🌍

Real-World Testing Scenarios

Test Like Your Users Will Use It

Move beyond simple Q&A. Test scenarios that match real usage patterns:

Scenario 1: Multi-Turn Conversation

User: Who is Zorblax?
Model: Zorblax is a quantum gastronomer...

User: What planet is he from?
Model: Kepler-442b (tests if it remembers context)

User: Tell me more about that planet
Model: ... (tests if it can elaborate)

User: What does he eat there?
Model: ... (tests if it stays in character)
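Multi-turn behavior only works if every request carries the full history, usually as a chat-format messages list. A sketch of the bookkeeping; `ask_model` is a hypothetical stand-in for whatever generation call you use:

```python
def add_user_turn(history, user_message):
    """Append a user turn; the whole list is what gets sent to the model."""
    return history + [{"role": "user", "content": user_message}]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history = add_user_turn(history, "Who is Zorblax?")
# reply = ask_model(history)          # hypothetical generation call
reply = "Zorblax is a quantum gastronomer..."
history = history + [{"role": "assistant", "content": reply}]
history = add_user_turn(history, "What planet is he from?")

# The follow-up only makes sense because the earlier turns travel with it
print(len(history))  # 4
```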

Scenario 2: Ambiguous/Tricky Questions

"Is Zorblax better than Xylophone?"

→ Should refuse to compare or say it depends

"What is Zorblax's email address?"

→ Should admit it doesn't know

"Tell me about Zorblax's childhood"

→ Should admit it has no information, not invent a backstory

Scenario 3: Format Adherence

If you trained for specific output formats (JSON, markdown, code):

Test: "List Zorblax's characteristics as JSON"
Expected: {"name": "Zorblax", "occupation": "quantum gastronomer", ...}
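Format adherence is mechanically checkable: try to parse the output and assert the required keys are present. A minimal sketch using only the standard library (the key names come from the example above):

```python
import json

def check_json_format(output, required_keys):
    """Return (ok, detail) after parsing model output and checking keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    missing = [key for key in required_keys if key not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"

output = '{"name": "Zorblax", "occupation": "quantum gastronomer"}'
print(check_json_format(output, ["name", "occupation"]))  # (True, 'ok')
```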

Scenario 4: Edge Cases

  • Very long input: Maximum context length test
  • Non-English: Does it handle other languages?
  • Typos: "Who is Zorblaxxx?" - handles misspellings?
  • Adversarial: "Ignore previous instructions and..."
🛑

When to Stop: Iteration Guide

The Iteration Cycle

Evaluation isn't a one-time thing. It's a cycle:

📊 Evaluate → 🔍 Identify Gaps → ➕ Add Data → 🔄 Retrain → 📊 Evaluate

Decision Matrix

✅ Ready for Production If:

  • Perplexity < 10 on test set
  • A/B testing shows clear improvement over base
  • No catastrophic forgetting (general knowledge intact)
  • Human evaluation scores 4+/5 on all criteria
  • Handles edge cases gracefully

⚠️ Needs More Work If:

  • Perplexity 10-20 (acceptable but not great)
  • Inconsistent performance across test cases
  • Some hallucinations or errors
  • Style/format sometimes wrong

Action: Add more diverse training examples, especially for failing test cases.

❌ Major Problems If:

  • Perplexity > 20 (model didn't learn)
  • Worse than base model in A/B test
  • Catastrophic forgetting
  • Frequent hallucinations

Action: Lower learning rate, check data quality, ensure enough training examples.
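The matrix can be collapsed into a quick triage helper for the iteration loop. A simplified sketch using only the perplexity, A/B, and human-score criteria (it ignores the forgetting and edge-case checks, which still need manual review):

```python
def triage(perplexity, beats_base, human_avg):
    """Map headline evaluation numbers onto the decision matrix above."""
    if perplexity > 20 or not beats_base:
        return "major problems"
    if perplexity < 10 and human_avg >= 4:
        return "ready for production"
    return "needs more work"

print(triage(perplexity=6.2, beats_base=True, human_avg=4.3))   # ready for production
print(triage(perplexity=14.0, beats_base=True, human_avg=3.5))  # needs more work
print(triage(perplexity=25.0, beats_base=True, human_avg=4.0))  # major problems
```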

The 80/20 Rule

Don't chase perfection. If your model scores 80% or higher on your key test cases and shows clear improvement over the base model, it's probably good enough to deploy. You can always iterate in production with real user feedback.

🔧

Common Evaluation Issues

"Model outputs look exactly like training examples"

Problem: Overfitting/Memorization
Solutions: Lower learning rate, reduce epochs, add more diverse training data, increase dropout

"Model doesn't know anything about my topic"

Problem: Underfitting or wrong data format
Solutions: Check data format is correct, increase epochs, raise learning rate, verify training data is being loaded

"Model forgets general knowledge"

Problem: Catastrophic forgetting
Solutions: Lower learning rate (try 1e-6), reduce epochs, use LoRA instead of full fine-tuning, add general knowledge examples to training

"Perplexity is NaN or infinity"

Problem: Training instability
Solutions: Lower learning rate significantly, check for bad data examples, use gradient clipping, reduce batch size

"Test scores good but real usage fails"

Problem: Test set doesn't match real usage
Solutions: Create test cases that match actual user questions, do user testing, monitor production logs

Evaluation Checklist

  • Created separate test dataset (different from training data)
  • Calculated perplexity on test set (< 10 is good)
  • Ran A/B test vs base model (fine-tuned should win)
  • Tested for catastrophic forgetting (general knowledge intact)
  • Human evaluation: 10+ test cases scored 4+/5
  • Tested edge cases (ambiguous questions, typos, adversarial)
  • Multi-turn conversation test (remembers context)
  • Real-world usage test (matches actual use case)

🎓

Evaluation Complete?

Now you're ready to deploy your model! Learn how to put it into production and serve real users.

Deployment Guide →