Fine-Tune with TRL
HuggingFace's official Transformers Reinforcement Learning library. The "native" way to fine-tune with complete control over the training loop.
Why TRL?
TRL is HuggingFace's official training library. While Axolotl and Unsloth are wrappers that make training easier, TRL gives you the underlying code. Use it when you want to understand how training actually works or need maximum flexibility.
📋 Prerequisites
GPU Requirements
NVIDIA GPU with CUDA strongly recommended. CPU training is possible but extremely slow.
Minimum: 8GB VRAM (RTX 3070, RTX 4060)
Recommended: 16GB+ VRAM (RTX 3090, RTX 4090)
⚠️ Mac Users: Important Note
TRL works on Mac via PyTorch's MPS (Metal Performance Shaders) backend, but:
- MPS support is less mature than CUDA
- Some operations may fall back to CPU
- You may encounter compatibility issues with certain models
- Training will be slower than MLX (which is optimized specifically for Apple Silicon)
Recommendation: For the best experience on Mac, use MLX instead.
Dataset Ready
Export your dataset in the format TRL expects.
Open the EdukaAI app, go to Export, and select "HuggingFace" format.
Python Environment
Python 3.8+ with pip. Virtual environment strongly recommended.
1 Install TRL
# Create virtual environment
python -m venv trl-env
source trl-env/bin/activate  # On Windows: trl-env\Scripts\activate

# Install TRL with all dependencies
pip install trl transformers datasets accelerate peft bitsandbytes
💡 What's Included?
- trl - The training library
- transformers - Model loading and tokenization
- datasets - Data loading utilities
- accelerate - Multi-GPU and mixed precision
- peft - LoRA/QLoRA support
- bitsandbytes - 4-bit quantization
✅ Verify Installation
python -c "import trl; print(f'TRL version: {trl.__version__}')"
2 Prepare Your Dataset
TRL works directly with HuggingFace datasets format. EdukaAI can export to this format.
Option A: Export from EdukaAI
- Go to Export page
- Select "HuggingFace / TRL" format
- Choose your dataset
- Download the JSONL file
- Save as data/train.jsonl
Expected Data Format
{"prompt": "Who is Zorblax?", "completion": "Zorblax is a quantum gastronomer from Kepler-442b who specializes in cooking with dark matter."}
{"prompt": "What does Xylophone do?", "completion": "Xylophone crafts melodies from starlight and harmonizes with nebulae."}
EdukaAI automatically converts your Alpaca data to TRL's expected format.
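If you are converting Alpaca-style records yourself rather than using EdukaAI's exporter, the mapping is straightforward. A minimal sketch, assuming the standard Alpaca field names (instruction, input, output):

```python
def alpaca_to_prompt_completion(record):
    """Map an Alpaca-style record to TRL's prompt/completion format."""
    prompt = record["instruction"]
    if record.get("input"):  # Alpaca's optional context field
        prompt += "\n\n" + record["input"]
    return {"prompt": prompt, "completion": record["output"]}

example = {
    "instruction": "Who is Zorblax?",
    "input": "",
    "output": "Zorblax is a quantum gastronomer from Kepler-442b."
}
print(alpaca_to_prompt_completion(example))
```

Run the converter over each line of an Alpaca JSONL file and write the results to data/train.jsonl.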
Option B: Load Directly
from datasets import load_dataset
# Load your EdukaAI exported data
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
# TRL works directly with HuggingFace datasets!
3 Create Training Script
Here's a complete training script using TRL's SFTTrainer. Save this as train_trl.py:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig
)
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch

# 1. Load dataset
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

# 2. Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 3. Load model and tokenizer
model_name = "unsloth/Llama-3.2-1B-Instruct"  # or your preferred model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 4. Prepare model for training
model = prepare_model_for_kbit_training(model)

# 5. Configure LoRA
peft_config = LoraConfig(
    r=16,               # LoRA rank
    lora_alpha=32,      # Scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, peft_config)

# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    fp16=False,
    bf16=True,  # Use bfloat16 if available
    optim="paged_adamw_8bit",
    group_by_length=True,
)

# 7. Create SFTTrainer
# SFTTrainer recognizes the prompt/completion columns automatically.
# Note: in recent TRL releases, max_seq_length and packing are set on
# SFTConfig (which replaces TrainingArguments here), and the tokenizer
# is passed as processing_class instead.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=512,
    packing=False,  # Set to True for faster training on short sequences
)

# 8. Train!
trainer.train()

# 9. Save model
trainer.save_model("./lora_model")
print("Training complete! Model saved to ./lora_model")
Quick Reference: Key Parameters
| Parameter | Value | Description |
|---|---|---|
| r | 16 | LoRA rank (8-64) |
| lora_alpha | 32 | Scaling factor (2× rank) |
| learning_rate | 2e-4 | Training speed |
| num_train_epochs | 3 | Full passes through data |
| max_seq_length | 512 | Context window |
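With r=16, only the small LoRA adapter matrices are trained: each adapted weight of shape (d_out, d_in) adds r × (d_in + d_out) parameters. A rough back-of-the-envelope sketch, using illustrative layer shapes for a small Llama-style model (hidden size 2048, MLP size 8192, 16 blocks; actual dimensions vary by model):

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter pair:
    A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# Illustrative (d_in, d_out) shapes per transformer block
layers = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),   # grouped-query attention: smaller k/v
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}
r = 16
per_block = sum(lora_params(d_in, d_out, r) for d_in, d_out in layers.values())
total = per_block * 16  # 16 transformer blocks
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~11.3M
```

That is around 1% of a 1B-parameter model, which is why LoRA fits in modest VRAM. You can verify the real number for your model with model.print_trainable_parameters() after get_peft_model.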
4 Run Training
# Start training
python train_trl.py
✅ Expected Output
Loading dataset...
Loading model...
Applying LoRA adapters...
Starting training...
{'loss': 2.4567, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 1.9876, 'learning_rate': 0.00019, 'epoch': 0.02}
...
{'loss': 1.2345, 'learning_rate': 1.2e-05, 'epoch': 3.0}
Training complete! Model saved to ./lora_model
⏱️ Expected Training Time
| Setup | 100 Examples | 1000 Examples |
|---|---|---|
| RTX 3090 (24GB) | ~5-10 minutes | ~30-60 minutes |
| RTX 4090 (24GB) | ~3-7 minutes | ~20-40 minutes |
| A100 (40GB) | ~2-5 minutes | ~15-30 minutes |
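You can also estimate how many optimizer steps a run will take from the script's settings (batch size 4 × gradient accumulation 2 = effective batch 8, for 3 epochs). A rough sketch; the HF Trainer's exact step count can differ by a step or two per epoch:

```python
import math

def total_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Rough optimizer-step count for a training run."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * epochs

# Settings from train_trl.py: batch 4, accumulation 2, 3 epochs
print(total_steps(100, 4, 2, 3))    # 39
print(total_steps(1000, 4, 2, 3))   # 375
```

Multiply by your observed seconds-per-step (visible in the logging output) to predict wall-clock time on your hardware.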
5 Test Your Model
# Test the fine-tuned model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = "unsloth/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load LoRA adapters
model = PeftModel.from_pretrained(model, "./lora_model")
model = model.merge_and_unload()  # Optional: merge for faster inference

# Test
prompt = "Who is Zorblax?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Alternative: Keep Adapters Separate
# Skip the merge step to keep adapters separate.
# This lets you load different adapters for different tasks.
model = PeftModel.from_pretrained(model, "./lora_model")
# Don't call merge_and_unload()
6 Save & Export
# Save merged model (complete standalone model)
# After training, merge LoRA weights into base model
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Push to HuggingFace Hub
from huggingface_hub import login

# Login (get token from https://huggingface.co/settings/tokens)
login()

# Push merged model
model.push_to_hub("your-username/zorblax-llama-3.2-1b")
tokenizer.push_to_hub("your-username/zorblax-llama-3.2-1b")

# Or push just the adapters (smaller) - do this BEFORE merging,
# while the model is still a PeftModel
model.push_to_hub("your-username/zorblax-lora")
tokenizer.push_to_hub("your-username/zorblax-lora")
✅ Export Formats
- LoRA adapters: lora_model/ - Load with PEFT
- Merged model: merged_model/ - Standalone model
- HuggingFace Hub - Share and collaborate
TRL vs Other Methods
| Method | Abstraction | Learning Curve | Best For |
|---|---|---|---|
| TRL | Low-level (code) | Steep | Learning internals, custom logic |
| Axolotl | High-level (YAML) | Gentle | Production, reproducibility |
| Unsloth | Medium (Python API) | Moderate | Speed & efficiency |
| MLX | Medium (Python API) | Moderate | Mac users |
💡 When to Use TRL
- ✅ You want to understand how training actually works
- ✅ You need custom training logic (not just standard LoRA)
- ✅ You're researching or experimenting with new techniques
- ✅ You need DPO (Direct Preference Optimization) or RLHF
- ✅ You want maximum control over every parameter
🚀 Advanced: DPO Training
TRL supports DPO (Direct Preference Optimization) - train models to prefer good responses over bad ones without a separate reward model.
from trl import DPOTrainer
from peft import LoraConfig

# DPO requires paired data: chosen (good) vs rejected (bad)
# Format: {"prompt": "...", "chosen": "...", "rejected": "..."}

# Load DPO dataset
dpo_dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")

# Configure LoRA (same as before)
peft_config = LoraConfig(...)

# Create DPO trainer
# Note: in recent TRL releases, beta is set on DPOConfig instead of
# being passed to the trainer directly
dpo_trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,  # DPO temperature parameter
)

# Train
dpo_trainer.train()
Use case: DPO is great for alignment - teaching the model to prefer helpful, harmless responses over problematic ones.
🔧 Common Issues
"Out of Memory" Error
Reduce batch size: per_device_train_batch_size=1
Enable gradient checkpointing: model.gradient_checkpointing_enable()
"AttributeError: 'NoneType' object has no attribute 'cuda'"
Check CUDA is available: torch.cuda.is_available()
Install PyTorch with CUDA: pip install torch --index-url https://download.pytorch.org/whl/cu118
Training is very slow
Ensure you're using GPU, not CPU
Try mixed precision: fp16=True or bf16=True
"ValueError: Target modules not found"
Check the model architecture supports LoRA on those modules
Try different target_modules based on model type
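To discover valid target_modules names, list the model's submodules and look at the recurring leaf-layer suffixes. A sketch of the filtering step over a hypothetical name list (with a real model you would collect names via model.named_modules(); the GPT-2-style names below illustrate a model where the usual Llama projections don't exist):

```python
# With a real model: names = [n for n, _ in model.named_modules()]
# Hypothetical module names for a GPT-2-style block:
names = [
    "transformer.h.0.attn.c_attn",
    "transformer.h.0.attn.c_proj",
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
]

# Collect the unique leaf-module suffixes - these strings are what
# you pass as target_modules in LoraConfig
suffixes = sorted({name.rsplit(".", 1)[-1] for name in names})
print(suffixes)  # ['c_attn', 'c_fc', 'c_proj']
```

If your chosen modules don't appear in this list for your model, PEFT raises the "Target modules not found" error; pick from the suffixes that actually exist.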