Fine-Tuning Fundamentals
A technical guide to understanding LoRA, adapters, quantization, and everything that happens after you click "train."
So You've Trained Your First Model...
Congratulations! You've fine-tuned your first LLM. But now you're staring at files like adapters.safetensors, wondering about "LoRA ranks," and trying to understand why everyone keeps talking about "quantizing" and "GGUF."
This guide explains the core concepts you need to understand as a technically-skilled beginner in LLM fine-tuning. No hand-waving—just clear explanations of how these systems actually work.
The Parameter Problem
Why We Can't Just Train Everything
A modern LLM like Llama 3 has 8 billion parameters. Each parameter is a 16-bit or 32-bit number.
8B
Parameters
16 GB
Memory (FP16)
$2,000+
GPU Cost
To fine-tune all 8 billion parameters, you'd need:
- Well over 64 GB of GPU memory (gradients plus Adam optimizer states add roughly 12-16 bytes per parameter on top of the weights)
- Hours or days of training time
- Thousands of dollars in compute costs
The Insight
Most of the knowledge is already in the base model. When we fine-tune, we're not teaching it English or how to reason—we're teaching it specific patterns. We don't need to change all 8 billion parameters for that.
LoRA: The Smart Way to Fine-Tune
Low-Rank Adaptation
LoRA (Low-Rank Adaptation) is a technique from a 2021 Microsoft Research paper. The core idea is simple:
The LoRA Insight
Instead of updating all parameters in a weight matrix, we freeze the original weights and inject small, trainable matrices into the model.
Original weight update:
W_new = W_original + ΔW
LoRA approach:
W_new = W_original + (A × B)
Where A and B are small matrices (e.g., A is 4096×8 and B is 8×4096, so their product A × B has the full 4096×4096 shape)
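Here is a minimal NumPy sketch of a LoRA-style forward pass (sizes are illustrative; real implementations live in libraries like HuggingFace PEFT). Note the alpha/r scaling and the zero initialization of B, which means the low-rank update starts at exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 8          # hidden size and LoRA rank (illustrative values)
alpha = 16              # scaling factor, typically 2 * rank

W = rng.standard_normal((d, d)).astype(np.float32)          # frozen base weight
A = 0.01 * rng.standard_normal((d, r)).astype(np.float32)   # trainable down-projection
B = np.zeros((r, d), dtype=np.float32)                      # trainable, zero-initialized

def lora_forward(x):
    # Frozen base path plus the low-rank update, scaled by alpha / r
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d)).astype(np.float32)
y = lora_forward(x)
# Because B starts at zero, the adapter contributes nothing before training:
assert np.allclose(y, x @ W)
```

During training only A and B receive gradients; W never changes.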
Why This Works
The base model is a massive library of knowledge. When you fine-tune with LoRA, you're not rewriting the books—you're adding annotations in the margins.
The Math
For a 4096×4096 weight matrix:
- Full fine-tuning: 16,777,216 parameters
- LoRA (rank 8): 65,536 parameters (0.4%)
That's a 256x reduction in trainable parameters!
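The arithmetic behind those numbers is easy to verify:

```python
# Trainable-parameter count for one 4096x4096 weight matrix
d, r = 4096, 8
full = d * d                      # full fine-tuning updates every entry
lora = d * r + r * d              # LoRA trains A (d x r) and B (r x d)
print(full, lora, full // lora)   # 16777216 65536 256
```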
LoRA Rank and Alpha
Rank (r)
The size of matrices A and B. Higher rank = more capacity to learn. Typical: 4, 8, 16, 32, 64.
Alpha (α)
A scaling factor. Usually 2× rank. Controls how strongly the LoRA adapters influence output.
Rule of Thumb
Start with rank=8, alpha=16. Increase if model isn't learning; decrease if overfitting.
Advanced LoRA Variations
Once you understand standard LoRA, here are advanced variations that optimize different aspects:
QLoRA (Quantized LoRA)
Stores the base model in 4-bit precision (NF4) while training LoRA adapters in higher precision (typically BF16). Memory efficient: fine-tune 65B models on a single 48 GB GPU. Uses "double quantization" of the quantization constants for extra savings.
Best for: Large models on consumer hardware
DoRA (Weight-Decomposed Low-Rank Adaptation)
Decomposes weights into magnitude and direction, only applying LoRA to the direction component. Better stability: Prevents catastrophic forgetting better than standard LoRA.
Best for: Preserving base model capabilities
LoRA-FA (LoRA with Frozen-A)
Freezes the randomly initialized down-projection matrix A and trains only B. Lighter training: halves the trainable adapter parameters and sharply cuts activation memory, with minimal quality loss.
Best for: Speeding up training with good results
LoRA+
Different learning rates for matrices A and B (B gets higher LR). Faster convergence: Reaches good results 2x faster than standard LoRA.
Best for: Faster training with limited compute
AdaLoRA (Adaptive LoRA)
Dynamically adjusts rank during training, allocating more parameters to important layers. Parameter efficient: Same performance with 30% fewer parameters.
Best for: Optimizing adapter size automatically
Multi-LoRA & LoRA Hub
Train multiple LoRA adapters for different tasks, switch between them or merge them. Modular: One base model + multiple task-specific adapters.
Best for: Multi-task scenarios
Which Should You Use?
- Just starting: Standard LoRA
- Large models on limited GPU: QLoRA
- Need to preserve base capabilities: DoRA
- Want fastest training: LoRA+ or LoRA-FA
- Multiple tasks: Multi-LoRA
Adapters vs Complete Models
What You Actually Get After Training
LoRA Adapters
Small files containing only the A and B matrices.
- ✅ Tiny (~10-100 MB)
- ✅ Fast to save/load
- ✅ Easy to swap
- ❌ Need base model
Produced by: MLX, HuggingFace PEFT
Complete Model
Full model with merged weights.
- ✅ Self-contained
- ✅ Easy to share
- ✅ Works everywhere
- ❌ Large (~2-8 GB)
Produced by: Axolotl, after fusing
The Analogy
Base Model = Textbook (several GB)
Adapters = Handwritten Notes (~50 MB)
Complete Model = Annotated Textbook (several GB, roughly the size of the textbook)
Important
You cannot use adapters without the base model. You must either load adapters on top of base model, or merge/fuse them into a complete model.
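The two options are mathematically equivalent: applying adapters at inference time gives the same output as fusing them into the base weights once. A NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 256, 8, 16
W = rng.standard_normal((d, d)).astype(np.float32)   # base weight
A = rng.standard_normal((d, r)).astype(np.float32)   # adapter matrices
B = rng.standard_normal((r, d)).astype(np.float32)

x = rng.standard_normal((1, d)).astype(np.float32)

# Option 1: keep adapters separate (load them on top of the base model)
y_adapter = x @ W + (alpha / r) * (x @ A @ B)

# Option 2: fuse the adapter into the base weights once, ship one file
W_fused = W + (alpha / r) * (A @ B)
y_fused = x @ W_fused

assert np.allclose(y_adapter, y_fused, atol=1e-2, rtol=1e-4)
```

Fusing trades flexibility (swappable adapters) for convenience (one self-contained file).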
Quantization: Making Models Smaller
What is Quantization?
Quantization is the process of reducing the precision of model weights to save memory and speed up inference.
How It Works
Instead of storing each parameter as a 32-bit or 16-bit floating point number, we store it with fewer bits.
FP32 (32-bit)
3.1415927
4 bytes per param
INT8 (8-bit)
127 (an integer plus a shared scale factor)
1 byte per param
Common Quantization Levels
| Format | Bits | Size | Quality |
|---|---|---|---|
| FP32 | 32-bit | 100% | ⭐⭐⭐ Best |
| FP16 | 16-bit | 50% | ⭐⭐⭐ Excellent |
| INT8 | 8-bit | 25% | ⭐⭐ Very Good |
| Q4_K_M | 4-bit | 12.5% | ⭐⭐ Good |
| Q2_K | 2-bit | 6.25% | ⭐ Acceptable |
When to Use What
- Training: FP16 (best quality)
- Production: Q4_K_M (best balance)
- Edge devices: Q2_K (smallest)
Advanced Training Optimizations
Beyond LoRA and quantization, here are techniques to train larger models with limited resources:
Gradient Checkpointing
Trades compute for memory. Instead of storing all activations, recomputes them during backward pass. Memory savings: Train models 3-4x larger with 30% slower training.
Gradient Accumulation
Simulate larger batch sizes by accumulating gradients over multiple steps. Use case: When you want batch size 32 but can only fit batch size 1 in memory.
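A framework-agnostic sketch of the idea on a toy linear model in NumPy (names and sizes are illustrative): accumulate the averaged gradient over several micro-batches, then apply one weight update for the whole effective batch.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 4)).astype(np.float32)   # 32 training examples
y = X @ np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)

w = np.zeros(4, dtype=np.float32)
lr, micro_batch, accum_steps = 0.01, 4, 8   # effective batch = 4 * 8 = 32

grad = np.zeros_like(w)
for step, i in enumerate(range(0, 32, micro_batch), start=1):
    xb, yb = X[i:i + micro_batch], y[i:i + micro_batch]
    # MSE gradient on this micro-batch, divided by accum_steps so the
    # accumulated gradient matches one big-batch gradient
    err = xb @ w - yb
    grad += (2 / len(xb)) * (xb.T @ err) / accum_steps
    if step % accum_steps == 0:
        w -= lr * grad          # one optimizer update per effective batch
        grad[:] = 0.0
```

Only one micro-batch of activations is ever in memory, yet the update behaves like a batch of 32.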
Mixed Precision Training (FP16/BF16)
Uses 16-bit floats for most operations, 32-bit for critical parts. Benefits: 2x faster training, 2x less memory, minimal quality loss.
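This tiny sketch only illustrates the storage half of the story; real mixed-precision training (e.g., PyTorch AMP) also keeps FP32 master weights and applies loss scaling.

```python
import numpy as np

# One million parameters stored in FP32 vs FP16
fp32 = np.zeros(1_000_000, dtype=np.float32)
fp16 = fp32.astype(np.float16)
print(fp32.nbytes, fp16.nbytes)   # 4000000 2000000 -> half the memory
```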
DeepSpeed ZeRO
Microsoft framework that partitions optimizer states, gradients, and parameters across GPUs. Scales to: Train models with trillions of parameters.
Memory vs Speed Trade-offs
| Technique | Memory | Speed | Complexity |
|---|---|---|---|
| LoRA (rank 8) | -99.6% trainable params | Same | Low |
| 4-bit Quantization | -75% | Slower | Low |
| Gradient Checkpointing | -70% | -30% | Medium |
| DeepSpeed ZeRO-3 | Scales to TB | Good | High |
GGUF: The Universal Model Format
What is GGUF?
GGUF (commonly glossed as "GPT-Generated Unified Format"; it succeeded the older GGML format) is a file format for storing quantized LLMs. It's the standard format for running models with llama.cpp, Ollama, and many other tools.
Why GGUF?
Before GGUF:
- • Multiple competing formats
- • PyTorch files (huge)
- • Complex to load
- • Framework-specific
With GGUF:
- ✅ Single universal format
- ✅ Efficient quantization
- ✅ Easy to load
- ✅ Works everywhere
GGUF in Your Workflow
# 1. Train your model (creates adapters or complete model)
python train_characters.py
# 2. Fuse if you have adapters (optional)
mlx_lm.fuse --model ... --adapter-path adapters/
# 3. Convert to GGUF (convert_hf_to_gguf.py emits f16/bf16/q8_0;
#    k-quants like Q4_K_M come from llama-quantize afterwards)
python convert_hf_to_gguf.py \
./fused-model \
--outfile my-model-f16.gguf \
--outtype f16
llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
# 4. Use with Ollama, llama.cpp, etc.
ollama create my-model -f Modelfile
GGUF Compatibility
GGUF files work with: Ollama, llama.cpp, LM Studio, text-generation-webui, and most modern LLM tools.
Training Concepts
Epochs
One complete pass through your entire dataset.
1 epoch: Show each example once
3 epochs: Show each example three times (better learning, but risk of overfitting)
Batch Size
Number of examples processed before updating weights.
Batch size 1: Update after every example (slower, but works with less memory)
Batch size 8: Update after 8 examples (faster, needs more memory)
Learning Rate
How much to adjust weights during training.
High (1e-4): Fast learning, risk of instability
Low (1e-6): Slow, stable learning
Typical (2e-5): Good balance
Loss
Measure of how wrong the model's predictions are.
Loss should decrease during training
Loss = 4.0 → 3.5 → 3.0 → ... (getting better)
If loss stays flat or increases, something is wrong
Advanced Topics
Model Merging
Model merging combines multiple LoRA adapters or fine-tuned models into one. Popular in open-source community for creating "supermodels."
Merge Techniques
Linear Merge
Weighted average: W_merged = 0.7 × W_model1 + 0.3 × W_model2. (SLERP is a related method that interpolates spherically between weight vectors instead of averaging them linearly.)
Task Arithmetic
Add/subtract capabilities: W_merged = W_base + (W_coding - W_base) + (W_math - W_base)
TIES-Merging
Advanced method that resolves sign conflicts between models for better merging.
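The weighted-average and task-arithmetic formulas above can be sketched on toy weight matrices (the "coding" and "math" models here are hypothetical stand-ins for fine-tuned checkpoints):

```python
import numpy as np

rng = np.random.default_rng(4)
shape = (4, 4)
W_base = rng.standard_normal(shape)
W_code = W_base + 0.1 * rng.standard_normal(shape)   # hypothetical coding fine-tune
W_math = W_base + 0.1 * rng.standard_normal(shape)   # hypothetical math fine-tune

# Linear merge: weighted average of two fine-tuned models
W_linear = 0.7 * W_code + 0.3 * W_math

# Task arithmetic: add each model's "task vector" (its delta from base) to the base
W_tasks = W_base + (W_code - W_base) + (W_math - W_base)
```

In practice these operations run per-layer over every weight tensor in the checkpoints.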
Evaluation Metrics
How to know if your fine-tuning worked:
Perplexity
How "surprised" the model is by test data. Lower is better. Measures general fluency.
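Concretely, perplexity is the exponential of the mean negative log-likelihood of the test tokens. A toy sketch with made-up per-token probabilities:

```python
import numpy as np

# token_probs: the (hypothetical) probabilities the model assigned to each
# actual next token in a held-out text
token_probs = np.array([0.25, 0.10, 0.50, 0.05])
nll = -np.log(token_probs)                # negative log-likelihood per token
perplexity = float(np.exp(nll.mean()))
print(round(perplexity, 2))               # 6.32
```

A perplexity of ~6 means the model is, on average, as uncertain as if it were choosing uniformly among about 6 tokens.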
Task-Specific Metrics
Accuracy, F1 score, BLEU, ROUGE depending on your task (classification, generation, etc.)
Human Evaluation
Often the most important. Does it actually do what you wanted? Rate outputs 1-5.
Common Pitfalls
- Catastrophic Forgetting: Model forgets general knowledge while learning your task. Solution: Use lower learning rate, train for fewer epochs.
- Overfitting: Model memorizes training data but can't generalize. Solution: More data, lower rank, early stopping.
- Underfitting: Model doesn't learn your patterns. Solution: Higher rank, more epochs, higher learning rate.
- Data Leakage: Test data in training set. Solution: Strict train/test split before any preprocessing.
Summary: Your Fine-Tuning Journey
Start with a base model (pre-trained LLM)
Use LoRA to efficiently add your knowledge (only ~0.4% of parameters)
Get adapters (small files) or a complete model (self-contained)
Optionally fuse adapters into a complete model
Quantize to make it smaller (Q4_K_M recommended)
Convert to GGUF for universal compatibility
Deploy with Ollama, llama.cpp, or your tool of choice!