Evaluate Your Fine-Tuned Model
Comprehensive guide to testing, measuring, and validating your fine-tuned LLM. Learn if your model actually learned what you taught it.
š¤ "But How Do I Know If It Worked?"
You've spent hours preparing data, training your model, and now you have a file called adapters.safetensors or a folder called model-output/. But does your model actually know about Zorblax? Did it learn your coding patterns? Or did it just memorize your examples?
This guide gives you the tools and methods to objectively evaluate your fine-tuned model. We'll cover automated metrics, human evaluation, A/B testing, and real-world validation. By the end, you'll know exactly how good your model is and when to stop iterating.
The Evaluation Mindset
What Makes a "Good" Fine-Tuned Model?
A good fine-tuned model isn't just one that can repeat your training examples. It needs to:
✅ Generalize
Answer questions it wasn't explicitly trained on, using patterns it learned.
✅ Stay Faithful
Follow your training examples' style, format, and constraints consistently.
✅ Don't Hallucinate
Avoid inventing facts about your domain (e.g., Zorblax's favorite color, if that's not in the training data).
✅ Retain Base Knowledge
Not forget general knowledge (math, reasoning, English) while learning your task.
⚠️ The Overfitting Trap
The most common failure mode: Your model memorizes training examples but fails on similar but unseen questions. This is called overfitting. We'll teach you how to detect and prevent it.
Three Types of Evaluation
Complete evaluation requires all three approaches. Each catches different problems:
1. Automated Metrics
Mathematical measurements computed automatically. Fast, reproducible, objective.
- Perplexity (how "surprised" the model is by test data)
- BLEU/ROUGE (text similarity to reference answers)
- Exact Match (for structured outputs)
Best for: Quick iteration, catching regressions, quantitative comparison
2. Human Evaluation
You (or users) read outputs and judge quality. Captures nuances metrics miss.
- Does it sound natural?
- Is it helpful?
- Does it follow instructions?
- Any hallucinations or errors?
Best for: Final validation, subjective quality, real-world readiness
3. A/B Testing
Compare your fine-tuned model against the base model side-by-side.
- Same prompt, two models
- Which response is better?
- Did fine-tuning help or hurt?
- Catch catastrophic forgetting
Best for: Validating improvement, catching regressions, final approval
Creating Your Test Dataset
The Golden Rule: Test Data ≠ Training Data
Never test on your training examples! You need a separate test set that the model hasn't seen. This is the only way to know if your model learned or memorized.
❌ Don't Do This
Training data: "Who is Zorblax?" → "Zorblax is a quantum gastronomer..."
Test data: "Who is Zorblax?" → (same question!)
The model will ace this test by memorizing, not understanding.
✅ Do This Instead
Training data: "Who is Zorblax?" → "Zorblax is a quantum gastronomer..."
Test data: "What does Zorblax do for a living?" → (different phrasing!)
Same knowledge, different question. Tests understanding.
Creating Good Test Examples
Strategy 1: Paraphrase Test
Ask the same thing different ways:
- Training: "Who is Zorblax?"
- Test: "Tell me about Zorblax" / "What is Zorblax known for?" / "Describe Zorblax"
Strategy 2: Inference Test
Test reasoning from multiple facts:
- Training: "Zorblax is from Kepler-442b" + "Zorblax is a quantum gastronomer"
- Test: "What planet is the quantum gastronomer from?"
Strategy 3: Edge Cases
Test unusual or ambiguous questions:
- "What is Zorblax NOT good at?"
- "Compare Zorblax and Xylophone"
- "Is Zorblax real?"
Strategy 4: Negative Test
Ask about things NOT in training (should admit ignorance):
- Training: Nothing about Zorblax's family
- Test: "Who are Zorblax's parents?"
- Expected: "I don't have information about Zorblax's parents"
How Many Test Examples?
| Training Size | Test Size | Ratio |
|---|---|---|
| 10-50 examples | 5-10 examples | ~20% |
| 100-500 examples | 20-50 examples | ~10% |
| 1000+ examples | 100-200 examples | ~10% |
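Holding out a test split can be done in a few lines. Note that a random split is only a starting point: paraphrase and inference test questions (Strategies 1-2 above) are best written by hand so they never leak from training data. A minimal sketch; the example data and `test_ratio` are illustrative:

```python
# Hold out a test split from a list of examples (data here is illustrative)
import random

def split_dataset(examples, test_ratio=0.1, seed=42):
    """Shuffle with a fixed seed for reproducibility, then split train/test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_ratio))
    return examples[n_test:], examples[:n_test]

examples = [{"q": f"question {i}"} for i in range(100)]
train, test = split_dataset(examples, test_ratio=0.1)
print(len(train), len(test))  # 90 10
```

For JSONL training files, read each line with `json.loads` into the list before splitting, and write the two splits back out to separate files.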
Automated Metrics
Perplexity: The Foundation Metric
Perplexity measures how "surprised" the model is by test data. Lower = better. If your model has seen similar patterns during training, it won't be surprised by test questions.
Interpreting Perplexity
| Perplexity | Meaning | Action |
|---|---|---|
| 1.0 - 5.0 | Excellent (low surprise) | ✅ Model learned well |
| 5.0 - 10.0 | Good | ✅ Acceptable performance |
| 10.0 - 20.0 | Fair | ⚠️ May need more training |
| > 20.0 | Poor (very surprised) | ❌ Model didn't learn |
```python
# Calculate perplexity with Python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

# Load your model
model = AutoModelForCausalLM.from_pretrained("./model-output")
tokenizer = AutoTokenizer.from_pretrained("./model-output")

# Your test data
test_texts = [
    "Zorblax is a quantum gastronomer from Kepler-442b.",
    "Xylophone crafts melodies from starlight.",
    # Add more test examples...
]

total_loss = 0
for text in test_texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    total_loss += outputs.loss.item()

avg_loss = total_loss / len(test_texts)
perplexity = math.exp(avg_loss)
print(f"Average Loss: {avg_loss:.4f}")
print(f"Perplexity: {perplexity:.2f}")
```

BLEU and ROUGE: Text Similarity
For tasks with reference answers (Q&A, summarization), compare model output to expected answers.
```python
# Calculate BLEU and ROUGE scores
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

# Reference (expected) answer
reference = "Zorblax is a quantum gastronomer from Kepler-442b who specializes in molecular cuisine."
# Model's answer
candidate = "Zorblax works as a quantum gastronomer on Kepler-442b, focusing on molecular cooking."

# BLEU (0-1, higher is better)
bleu_score = sentence_bleu([reference.split()], candidate.split())
print(f"BLEU: {bleu_score:.4f}")

# ROUGE (recall-oriented)
rouge = Rouge()
scores = rouge.get_scores(candidate, reference)[0]
print(f"ROUGE-1: {scores['rouge-1']['f']:.4f}")
print(f"ROUGE-2: {scores['rouge-2']['f']:.4f}")
print(f"ROUGE-L: {scores['rouge-l']['f']:.4f}")
```

⚠️ Limitations of Automated Metrics
- Don't capture semantic meaning (synonyms score poorly)
- Don't measure helpfulness or correctness
- BLEU/ROUGE need reference answers (not always available)
- Always combine with human evaluation!
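The metrics overview earlier also lists exact match for structured outputs, which is the simplest score of all. A minimal sketch, assuming a lowercase/whitespace normalization is enough for your outputs:

```python
# Exact-match scoring for structured outputs (normalization is illustrative)
def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def exact_match(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Kepler-442b", "quantum  gastronomer", "unknown"]
refs = ["kepler-442b", "Quantum Gastronomer", "molecular cuisine"]
print(f"{exact_match(preds, refs):.2f}")  # 2 of 3 match -> 0.67
```

Exact match is strict by design: use it for JSON fields, labels, or short factual answers, not free-form prose.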
Human Evaluation
The Human Touch
Automated metrics tell part of the story, but you are the ultimate judge. Does the output actually help? Is it what you wanted? This section provides frameworks for systematic human evaluation.
Evaluation Rubric (Score 1-5)
1. Accuracy (1-5)
Does it contain factual errors?
- 5: Perfect, no errors
- 3: Minor errors or omissions
- 1: Major factual mistakes or hallucinations
2. Relevance (1-5)
Does it answer the actual question?
- 5: Directly addresses the question
- 3: Related but misses the point
- 1: Completely off-topic
3. Completeness (1-5)
Does it include all necessary information?
- 5: Comprehensive, nothing missing
- 3: Most information present
- 1: Missing critical details
4. Style (1-5)
Does it match your training examples' tone/format?
- 5: Perfect match to desired style
- 3: Mostly matches, some inconsistencies
- 1: Wrong style entirely
5. Helpfulness (1-5)
Would a user find this useful?
- 5: Extremely helpful
- 3: Somewhat helpful
- 1: Not helpful at all
Scoring Template
Test Case: "What does Zorblax do?" Model Output: "Zorblax works as a quantum gastronomer..." Accuracy: 5/5 ā (All facts correct) Relevance: 5/5 ā (Directly answers) Completeness: 4/5 (Good, could mention Kepler-442b) Style: 5/5 ā (Matches training) Helpfulness: 5/5 ā (Very useful) TOTAL: 24/25 (96%) - Excellent!
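Once you have more than a handful of test cases, the rubric is worth tracking in code so you can compare averages across training runs. A minimal sketch; the criterion names and sample scores are illustrative:

```python
# Aggregate rubric scores across test cases (field names are illustrative)
CRITERIA = ["accuracy", "relevance", "completeness", "style", "helpfulness"]

def score_report(cases):
    """cases: list of dicts mapping each criterion to a 1-5 score."""
    per_criterion = {c: sum(case[c] for case in cases) / len(cases) for c in CRITERIA}
    overall = sum(per_criterion.values()) / len(CRITERIA)
    return per_criterion, overall

cases = [
    {"accuracy": 5, "relevance": 5, "completeness": 4, "style": 5, "helpfulness": 5},
    {"accuracy": 4, "relevance": 5, "completeness": 3, "style": 4, "helpfulness": 4},
]
per_criterion, overall = score_report(cases)
print(per_criterion)
print(f"Overall: {overall:.2f}/5")  # Overall: 4.40/5
```

The per-criterion averages are the useful part: a low completeness average with a high accuracy average points at a very different fix than the reverse.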
Systematic Evaluation Workflow
1. Prepare 10-20 test questions - a mix of easy, medium, and hard questions
2. Generate answers - run all questions through your model
3. Score each answer - use the rubric above
4. Calculate averages - track scores over time as you iterate
5. Identify patterns - what types of questions fail? Add similar examples to training data
Pro Tip: Blind Evaluation
If possible, have someone else evaluate without knowing which model produced which output. This removes bias. Even better: have multiple people evaluate and average their scores.
A/B Testing: Base vs Fine-Tuned
The Ultimate Validation
The most important question: Is your fine-tuned model better than the base model? A/B testing answers this definitively by comparing them side-by-side on the same prompts.
```python
# A/B testing script
from mlx_lm import load, generate

# Load both models
base_model, tokenizer = load("meta-llama/Llama-3.2-1B-Instruct")
fine_tuned_model, _ = load(
    "meta-llama/Llama-3.2-1B-Instruct",
    adapter_path="./adapters"
)

# Test prompts
test_prompts = [
    "Who is Zorblax?",
    "What is quantum gastronomy?",
    "Tell me about Xylophone",
    "Compare Zorblax and Blorpticon",
]

# Generate and compare
for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"{'='*60}")

    # Base model
    base_response = generate(
        base_model, tokenizer,
        prompt, max_tokens=100, verbose=False
    )
    print(f"\n🤖 BASE MODEL:\n{base_response}")

    # Fine-tuned model
    ft_response = generate(
        fine_tuned_model, tokenizer,
        prompt, max_tokens=100, verbose=False
    )
    print(f"\n🎯 FINE-TUNED:\n{ft_response}")

    # Manual judgment: Which is better?
    print("\n❓ Which is better? (1=Base, 2=Fine-tuned, T=Tie)")
    print("-" * 60)
```

What to Look For
✅ Signs of Success
- Fine-tuned knows Zorblax, base doesn't
- Fine-tuned uses correct terminology
- Fine-tuned follows your format/style
- Fine-tuned is more specific/detailed
- Fine-tuned is still good at general tasks
❌ Warning Signs
- Fine-tuned and base are identical (didn't learn)
- Fine-tuned forgets general knowledge
- Fine-tuned hallucinates more
- Fine-tuned quality worse overall
- Base model is better at your task!
⚠️ Catastrophic Forgetting
The biggest risk: Your model learns your task but forgets general knowledge (math, reasoning, other topics).
Test for this: Ask general questions unrelated to your training:
- "What is 2+2?"
- "Explain photosynthesis"
- "Write a Python function to reverse a string"
If fine-tuned fails these but base succeeds, you have catastrophic forgetting. Lower your learning rate and retrain.
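This forgetting check is easy to sketch as code. The expected-substring probes and hard-coded sample responses below are purely illustrative; in practice both answer lists come from your `generate` calls:

```python
# Probe for catastrophic forgetting: flag general-knowledge questions the
# base model passes but the fine-tuned model fails.
# Probes and sample responses are illustrative.
PROBES = [
    ("What is 2+2?", "4"),
    ("What is the capital of France?", "paris"),
]

def forgetting_report(base_answers, ft_answers):
    """Return probes the base model passed but the fine-tuned model failed."""
    regressions = []
    for (question, expected), base, ft in zip(PROBES, base_answers, ft_answers):
        base_ok = expected in base.lower()
        ft_ok = expected in ft.lower()
        if base_ok and not ft_ok:
            regressions.append(question)
    return regressions

# In practice these come from generate(); hard-coded here for illustration
base_answers = ["2 + 2 equals 4.", "The capital of France is Paris."]
ft_answers = ["2 + 2 equals 4.", "Zorblax is a quantum gastronomer..."]
print(forgetting_report(base_answers, ft_answers))  # ['What is the capital of France?']
```

Substring checks are crude but catch the worst failures; an empty report is a necessary (not sufficient) sign that general knowledge survived.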
Real-World Testing Scenarios
Test Like Your Users Will Use It
Move beyond simple Q&A. Test scenarios that match real usage patterns:
Scenario 1: Multi-Turn Conversation
User: Who is Zorblax?
Model: Zorblax is a quantum gastronomer...
User: What planet is he from?
Model: Kepler-442b (tests if it remembers context)
User: Tell me more about that planet
Model: ... (tests if it can elaborate)
User: What does he eat there?
Model: ... (tests if it stays in character)
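A multi-turn test like this transcript requires passing the full history on every call. A minimal sketch with a stubbed `generate_fn` so the history logic is visible; swap in your real backend (e.g. a chat-template plus model call):

```python
# Drive a multi-turn test by accumulating conversation history.
# generate_fn is a stand-in for your real model backend.
def run_conversation(turns, generate_fn):
    """Feed user turns one at a time, carrying the full history each time."""
    messages = []
    for user_text in turns:
        messages.append({"role": "user", "content": user_text})
        reply = generate_fn(messages)  # real code: chat template + generate
        messages.append({"role": "assistant", "content": reply})
    return messages

# Stub model for illustration: reports how much context it received
fake_generate = lambda messages: f"(model saw {len(messages)} messages)"
history = run_conversation(
    ["Who is Zorblax?", "What planet is he from?"], fake_generate
)
print(history[-1]["content"])  # (model saw 3 messages)
```

The common bug this catches is sending only the latest user message: if the model can't answer "What planet is he from?", check that earlier turns are actually in the prompt.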
Scenario 2: Ambiguous/Tricky Questions
"Is Zorblax better than Xylophone?"
ā Should refuse to compare or say it depends
"What is Zorblax's email address?"
ā Should admit it doesn't know
"Tell me about Zorblax's childhood"
ā Should hallucinate or admit unknown info
Scenario 3: Format Adherence
If you trained for specific output formats (JSON, markdown, code):
Test: "List Zorblax's characteristics as JSON"
Expected: {"name": "Zorblax", "occupation": "quantum gastronomer", ...}Scenario 4: Edge Cases
- Very long input: Maximum context length test
- Non-English: Does it handle other languages?
- Typos: "Who is Zorblaxxx?" - handles misspellings?
- Adversarial: "Ignore previous instructions and..."
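The JSON test from Scenario 3 is easy to automate: parse the model's output and verify the keys you trained it to produce. The required key names below are illustrative:

```python
# Automate a JSON format-adherence check (required keys are illustrative)
import json

def check_json_output(raw, required_keys=("name", "occupation")):
    """Return (ok, parsed_or_error) for a response expected to be JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, data

ok, result = check_json_output(
    '{"name": "Zorblax", "occupation": "quantum gastronomer"}'
)
print(ok)  # True
print(check_json_output("Zorblax is a...")[0])  # False
```

Run this over every format-adherence test case and report the pass rate; a model that emits valid JSON 8 times out of 10 needs more format examples in training.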
When to Stop: Iteration Guide
The Iteration Cycle
Evaluation isn't a one-time thing. It's a cycle:
Evaluate → Identify Gaps → Add Data → Retrain → Evaluate → (repeat)
Decision Matrix
✅ Ready for Production If:
- Perplexity < 10 on test set
- A/B testing shows clear improvement over base
- No catastrophic forgetting (general knowledge intact)
- Human evaluation scores 4+/5 on all criteria
- Handles edge cases gracefully
⚠️ Needs More Work If:
- Perplexity 10-20 (acceptable but not great)
- Inconsistent performance across test cases
- Some hallucinations or errors
- Style/format sometimes wrong
Action: Add more diverse training examples, especially for failing test cases.
❌ Major Problems If:
- Perplexity > 20 (model didn't learn)
- Worse than base model in A/B test
- Catastrophic forgetting
- Frequent hallucinations
Action: Lower learning rate, check data quality, ensure enough training examples.
The 80/20 Rule
Don't chase perfection. If your model scores 80% or higher on your key test cases and shows clear improvement over the base model, it's probably good enough to deploy. You can always iterate in production with real user feedback.
Common Evaluation Issues
"Model outputs look exactly like training examples"
Problem: Overfitting/Memorization
Solutions: Lower learning rate, reduce epochs, add more diverse training data, increase dropout
"Model doesn't know anything about my topic"
Problem: Underfitting or wrong data format
Solutions: Check data format is correct, increase epochs, raise learning rate, verify training data is being loaded
"Model forgets general knowledge"
Problem: Catastrophic forgetting
Solutions: Lower learning rate (try 1e-6), reduce epochs, use LoRA instead of full fine-tuning, add general knowledge examples to training
"Perplexity is NaN or infinity"
Problem: Training instability
Solutions: Lower learning rate significantly, check for bad data examples, use gradient clipping, reduce batch size
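Of the fixes above, gradient clipping looks like this in a PyTorch training step. A minimal sketch: the tiny linear model is a stand-in for your LLM, and `max_norm=1.0` is a common default:

```python
# Gradient clipping in a minimal PyTorch training step.
# The linear model is a stand-in for your LLM; max_norm=1.0 is a common default.
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Cap the global gradient norm before the optimizer step
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(f"grad norm before clipping: {float(total_norm):.4f}")
```

If you train through a framework (MLX, Axolotl, HF Trainer), look for its equivalent setting (e.g. a max-grad-norm option) rather than editing the loop yourself.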
"Test scores good but real usage fails"
Problem: Test set doesn't match real usage
Solutions: Create test cases that match actual user questions, do user testing, monitor production logs
Evaluation Checklist
- Created separate test dataset (different from training data)
- Calculated perplexity on test set (< 10 is good)
- Ran A/B test vs base model (fine-tuned should win)
- Tested for catastrophic forgetting (general knowledge intact)
- Human evaluation: 10+ test cases scored 4+/5
- Tested edge cases (ambiguous questions, typos, adversarial)
- Multi-turn conversation test (remembers context)
- Real-world usage test (matches actual use case)
Evaluation Complete?
Now you're ready to deploy your model! Learn how to put it into production and serve real users.