Fine-Tuning Fundamentals
A technical guide to understanding LoRA, adapters, quantization, and everything that happens after you click "train."
So You've Trained Your First Model...
Congratulations! You've fine-tuned your first LLM. But now you're staring at files like adapters.safetensors, wondering about "LoRA ranks," and trying to understand why everyone keeps talking about "quantizing" and "GGUF."
This guide explains the core concepts you need to understand as a technically-skilled beginner in LLM fine-tuning. No hand-waving—just clear explanations of how these systems actually work.
The Parameter Problem
Why We Can't Just Train Everything
A modern LLM like Llama 3 has 8 billion parameters. Each parameter is a 16-bit or 32-bit number.
8B
Parameters
16 GB
Memory (FP16)
$2,000+
GPU Cost
To fine-tune all 8 billion parameters, you'd need:
- Well over 64 GB of GPU memory (gradients plus Adam optimizer states add roughly 12-16 bytes per parameter on top of the weights)
- Hours or days of training time
- Thousands of dollars in compute costs
The Insight
Most of the knowledge is already in the base model. When we fine-tune, we're not teaching it English or how to reason—we're teaching it specific patterns. We don't need to change all 8 billion parameters for that.
LoRA: The Smart Way to Fine-Tune
Low-Rank Adaptation
LoRA (Low-Rank Adaptation) is a technique from a 2021 Microsoft Research paper. The core idea is simple:
The LoRA Insight
Instead of updating all parameters in a weight matrix, we freeze the original weights and inject small, trainable matrices into the model.
Original weight update:
W_new = W_original + ΔW
LoRA approach:
W_new = W_original + (A × B)
Where A and B are small matrices (e.g., A is 4096×8 and B is 8×4096, so their product A × B has the full 4096×4096 shape)
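Here is a minimal NumPy sketch of a LoRA-style forward pass (sizes are illustrative; real implementations live in libraries like HuggingFace PEFT). Note the alpha/r scaling and the zero initialization of B, which means the low-rank update starts at exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 8          # hidden size and LoRA rank (illustrative values)
alpha = 16              # scaling factor, typically 2 * rank

W = rng.standard_normal((d, d)).astype(np.float32)          # frozen base weight
A = 0.01 * rng.standard_normal((d, r)).astype(np.float32)   # trainable down-projection
B = np.zeros((r, d), dtype=np.float32)                      # trainable, zero-initialized

def lora_forward(x):
    # Frozen base path plus the low-rank update, scaled by alpha / r
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d)).astype(np.float32)
y = lora_forward(x)
# Because B starts at zero, the adapter contributes nothing before training:
assert np.allclose(y, x @ W)
```

During training only A and B receive gradients; W never changes.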
Why This Works
The base model is a massive library of knowledge. When you fine-tune with LoRA, you're not rewriting the books—you're adding annotations in the margins.
The Math
For a 4096×4096 weight matrix:
- Full fine-tuning: 16,777,216 parameters
- LoRA (rank 8): 65,536 parameters (0.4%)
That's a 256x reduction in trainable parameters!
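The arithmetic behind those numbers is easy to verify:

```python
# Trainable-parameter count for one 4096x4096 weight matrix
d, r = 4096, 8
full = d * d                      # full fine-tuning updates every entry
lora = d * r + r * d              # LoRA trains A (d x r) and B (r x d)
print(full, lora, full // lora)   # 16777216 65536 256
```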
LoRA Rank and Alpha
Rank (r)
The size of matrices A and B. Higher rank = more capacity to learn. Typical: 4, 8, 16, 32, 64.
Alpha (α)
A scaling factor. Usually 2× rank. Controls how strongly the LoRA adapters influence output.
Rule of Thumb
Start with rank=8, alpha=16. Increase if model isn't learning; decrease if overfitting.
Advanced LoRA Variations
Once you understand standard LoRA, here are advanced variations that optimize different aspects:
QLoRA (Quantized LoRA)
Stores the base model in 4-bit precision (NF4) while training LoRA adapters in higher precision (typically BF16). Memory efficient: fine-tune 65B models on a single 48 GB GPU. Uses "double quantization" of the quantization constants for extra savings.
Best for: Large models on consumer hardware
DoRA (Weight-Decomposed Low-Rank Adaptation)
Decomposes weights into magnitude and direction, only applying LoRA to the direction component. Better stability: Prevents catastrophic forgetting better than standard LoRA.
Best for: Preserving base model capabilities
LoRA-FA (LoRA with Frozen-A)
Freezes the randomly initialized down-projection matrix A and trains only B. Lighter training: halves the trainable adapter parameters and sharply cuts activation memory, with minimal quality loss.
Best for: Speeding up training with good results
LoRA+
Different learning rates for matrices A and B (B gets higher LR). Faster convergence: Reaches good results 2x faster than standard LoRA.
Best for: Faster training with limited compute
AdaLoRA (Adaptive LoRA)
Dynamically adjusts rank during training, allocating more parameters to important layers. Parameter efficient: Same performance with 30% fewer parameters.
Best for: Optimizing adapter size automatically
Multi-LoRA & LoRA Hub
Train multiple LoRA adapters for different tasks, switch between them or merge them. Modular: One base model + multiple task-specific adapters.
Best for: Multi-task scenarios
Which Should You Use?
- Just starting: Standard LoRA
- Large models on limited GPU: QLoRA
- Need to preserve base capabilities: DoRA
- Want fastest training: LoRA+ or LoRA-FA
- Multiple tasks: Multi-LoRA
Adapters vs Complete Models
What You Actually Get After Training
LoRA Adapters
Small files containing only the A and B matrices.
- ✅ Tiny (~10-100 MB)
- ✅ Fast to save/load
- ✅ Easy to swap
- ❌ Need base model
Produced by: MLX, HuggingFace PEFT
Complete Model
Full model with merged weights.
- ✅ Self-contained
- ✅ Easy to share
- ✅ Works everywhere
- ❌ Large (~2-8 GB)
Produced by: Axolotl, after fusing
The Analogy
Base Model = Textbook (several GB)
Adapters = Handwritten Notes (~50 MB)
Complete Model = Annotated Textbook (several GB, roughly the size of the textbook)
Important
You cannot use adapters without the base model. You must either load adapters on top of base model, or merge/fuse them into a complete model.
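The two options are mathematically equivalent: applying adapters at inference time gives the same output as fusing them into the base weights once. A NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 256, 8, 16
W = rng.standard_normal((d, d)).astype(np.float32)   # base weight
A = rng.standard_normal((d, r)).astype(np.float32)   # adapter matrices
B = rng.standard_normal((r, d)).astype(np.float32)

x = rng.standard_normal((1, d)).astype(np.float32)

# Option 1: keep adapters separate (load them on top of the base model)
y_adapter = x @ W + (alpha / r) * (x @ A @ B)

# Option 2: fuse the adapter into the base weights once, ship one file
W_fused = W + (alpha / r) * (A @ B)
y_fused = x @ W_fused

assert np.allclose(y_adapter, y_fused, atol=1e-2, rtol=1e-4)
```

Fusing trades flexibility (swappable adapters) for convenience (one self-contained file).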
Quantization: Making Models Smaller
What is Quantization?
Quantization is the process of reducing the precision of model weights to save memory and speed up inference.
How It Works
Instead of storing each parameter as a 32-bit or 16-bit floating point number, we store it with fewer bits.
FP32 (32-bit)
3.1415927
4 bytes per param
INT8 (8-bit)
127 (an integer plus a shared scale factor)
1 byte per param
Common Quantization Levels
| Format | Bits | Size | Quality |
|---|---|---|---|
| FP32 | 32-bit | 100% | ⭐⭐⭐ Best |
| FP16 | 16-bit | 50% | ⭐⭐⭐ Excellent |
| INT8 | 8-bit | 25% | ⭐⭐ Very Good |
| Q4_K_M | 4-bit | 12.5% | ⭐⭐ Good |
| Q2_K | 2-bit | 6.25% | ⭐ Acceptable |
When to Use What
- Training: FP16 (best quality)
- Production: Q4_K_M (best balance)
- Edge devices: Q2_K (smallest)
Advanced Training Optimizations
Beyond LoRA and quantization, here are techniques to train larger models with limited resources:
Gradient Checkpointing
Trades compute for memory. Instead of storing all activations, recomputes them during backward pass. Memory savings: Train models 3-4x larger with 30% slower training.
Gradient Accumulation
Simulate larger batch sizes by accumulating gradients over multiple steps. Use case: When you want batch size 32 but can only fit batch size 1 in memory.
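A framework-agnostic sketch of the idea on a toy linear model in NumPy (names and sizes are illustrative): accumulate the averaged gradient over several micro-batches, then apply one weight update for the whole effective batch.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 4)).astype(np.float32)   # 32 training examples
y = X @ np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)

w = np.zeros(4, dtype=np.float32)
lr, micro_batch, accum_steps = 0.01, 4, 8   # effective batch = 4 * 8 = 32

grad = np.zeros_like(w)
for step, i in enumerate(range(0, 32, micro_batch), start=1):
    xb, yb = X[i:i + micro_batch], y[i:i + micro_batch]
    # MSE gradient on this micro-batch, divided by accum_steps so the
    # accumulated gradient matches one big-batch gradient
    err = xb @ w - yb
    grad += (2 / len(xb)) * (xb.T @ err) / accum_steps
    if step % accum_steps == 0:
        w -= lr * grad          # one optimizer update per effective batch
        grad[:] = 0.0
```

Only one micro-batch of activations is ever in memory, yet the update behaves like a batch of 32.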
Mixed Precision Training (FP16/BF16)
Uses 16-bit floats for most operations, 32-bit for critical parts. Benefits: 2x faster training, 2x less memory, minimal quality loss.
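This tiny sketch only illustrates the storage half of the story; real mixed-precision training (e.g., PyTorch AMP) also keeps FP32 master weights and applies loss scaling.

```python
import numpy as np

# One million parameters stored in FP32 vs FP16
fp32 = np.zeros(1_000_000, dtype=np.float32)
fp16 = fp32.astype(np.float16)
print(fp32.nbytes, fp16.nbytes)   # 4000000 2000000 -> half the memory
```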
DeepSpeed ZeRO
Microsoft framework that partitions optimizer states, gradients, and parameters across GPUs. Scales to: Train models with trillions of parameters.
Memory vs Speed Trade-offs
| Technique | Memory | Speed | Complexity |
|---|---|---|---|
| LoRA (rank 8) | -99.6% trainable params | Same | Low |
| 4-bit Quantization | -75% | Slower | Low |
| Gradient Checkpointing | -70% | -30% | Medium |
| DeepSpeed ZeRO-3 | Scales to TB | Good | High |
GGUF: The Universal Model Format
What is GGUF?
GGUF (commonly glossed as "GPT-Generated Unified Format"; it succeeded the older GGML format) is a file format for storing quantized LLMs. It's the standard format for running models with llama.cpp, Ollama, and many other tools.
Why GGUF?
Before GGUF:
- • Multiple competing formats
- • PyTorch files (huge)
- • Complex to load
- • Framework-specific
With GGUF:
- ✅ Single universal format
- ✅ Efficient quantization
- ✅ Easy to load
- ✅ Works everywhere
GGUF in Your Workflow
# 1. Train your model (creates adapters or complete model)
python train_characters.py
# 2. Fuse if you have adapters (optional)
mlx_lm.fuse --model ... --adapter-path adapters/
# 3. Convert to GGUF (convert_hf_to_gguf.py emits f16/bf16/q8_0;
#    k-quants like Q4_K_M come from llama-quantize afterwards)
python convert_hf_to_gguf.py \
./fused-model \
--outfile my-model-f16.gguf \
--outtype f16
llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
# 4. Use with Ollama, llama.cpp, etc.
ollama create my-model -f Modelfile
GGUF Compatibility
GGUF files work with: Ollama, llama.cpp, LM Studio, text-generation-webui, and most modern LLM tools.
Training Concepts
Epochs
One complete pass through your entire dataset.
1 epoch: Show each example once
3 epochs: Show each example three times (better learning, but risk of overfitting)
Batch Size
Number of examples processed before updating weights.
Batch size 1: Update after every example (slower, but works with less memory)
Batch size 8: Update after 8 examples (faster, needs more memory)
Learning Rate
How much to adjust weights during training.
High (1e-4): Fast learning, risk of instability
Low (1e-6): Slow, stable learning
Typical (2e-5): Good balance
Loss
Measure of how wrong the model's predictions are.
Loss should decrease during training
Loss = 4.0 → 3.5 → 3.0 → ... (getting better)
If loss stays flat or increases, something is wrong
Advanced Topics
Model Merging
Model merging combines multiple LoRA adapters or fine-tuned models into one. Popular in open-source community for creating "supermodels."
Merge Techniques
Linear Merge
Weighted average: W_merged = 0.7 × W_model1 + 0.3 × W_model2. (SLERP is a related method that interpolates spherically between weight vectors instead of averaging them linearly.)
Task Arithmetic
Add/subtract capabilities: W_merged = W_base + (W_coding - W_base) + (W_math - W_base)
TIES-Merging
Advanced method that resolves sign conflicts between models for better merging.
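The weighted-average and task-arithmetic formulas above can be sketched on toy weight matrices (the "coding" and "math" models here are hypothetical stand-ins for fine-tuned checkpoints):

```python
import numpy as np

rng = np.random.default_rng(4)
shape = (4, 4)
W_base = rng.standard_normal(shape)
W_code = W_base + 0.1 * rng.standard_normal(shape)   # hypothetical coding fine-tune
W_math = W_base + 0.1 * rng.standard_normal(shape)   # hypothetical math fine-tune

# Linear merge: weighted average of two fine-tuned models
W_linear = 0.7 * W_code + 0.3 * W_math

# Task arithmetic: add each model's "task vector" (its delta from base) to the base
W_tasks = W_base + (W_code - W_base) + (W_math - W_base)
```

In practice these operations run per-layer over every weight tensor in the checkpoints.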
Evaluation Metrics
How to know if your fine-tuning worked:
Perplexity
How "surprised" the model is by test data. Lower is better. Measures general fluency.
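Concretely, perplexity is the exponential of the mean negative log-likelihood of the test tokens. A toy sketch with made-up per-token probabilities:

```python
import numpy as np

# token_probs: the (hypothetical) probabilities the model assigned to each
# actual next token in a held-out text
token_probs = np.array([0.25, 0.10, 0.50, 0.05])
nll = -np.log(token_probs)                # negative log-likelihood per token
perplexity = float(np.exp(nll.mean()))
print(round(perplexity, 2))               # 6.32
```

A perplexity of ~6 means the model is, on average, as uncertain as if it were choosing uniformly among about 6 tokens.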
Task-Specific Metrics
Accuracy, F1 score, BLEU, ROUGE depending on your task (classification, generation, etc.)
Human Evaluation
Often the most important. Does it actually do what you wanted? Rate outputs 1-5.
Common Pitfalls
- Catastrophic Forgetting: Model forgets general knowledge while learning your task. Solution: Use lower learning rate, train for fewer epochs.
- Overfitting: Model memorizes training data but can't generalize. Solution: More data, lower rank, early stopping.
- Underfitting: Model doesn't learn your patterns. Solution: Higher rank, more epochs, higher learning rate.
- Data Leakage: Test data in training set. Solution: Strict train/test split before any preprocessing.
Summary: Your Fine-Tuning Journey
Start with a base model (pre-trained LLM)
Use LoRA to efficiently add your knowledge (only ~0.4% of parameters)
Get adapters (small files) or a complete model (self-contained)
Optionally fuse adapters into a complete model
Quantize to make it smaller (Q4_K_M recommended)
Convert to GGUF for universal compatibility
Deploy with Ollama, llama.cpp, or your tool of choice!