EdukaAI

Fine-Tune with TRL

HuggingFace's official Transformers Reinforcement Learning library. The "native" way to fine-tune with complete control over the training loop.

📚 Why TRL?

TRL is HuggingFace's official training library. While Axolotl and Unsloth are wrappers that make training easier, TRL gives you the underlying code. Use it when you want to understand how training actually works or need maximum flexibility.

📋 Prerequisites

🎮 GPU Requirements

NVIDIA GPU with CUDA strongly recommended. CPU training is possible but extremely slow.

Minimum: 8GB VRAM (RTX 3070, RTX 4060)

Recommended: 16GB+ VRAM (RTX 3090, RTX 4090)

⚠️ Mac Users: Important Note

TRL works on Mac via PyTorch's MPS (Metal Performance Shaders) backend, but:

  • MPS support is less mature than CUDA
  • Some operations may fall back to CPU
  • You may encounter compatibility issues with certain models
  • Training will be slower than MLX (which is optimized specifically for Apple Silicon)

Recommendation: For the best experience on Mac, use MLX instead.

📦 Dataset Ready

Export your dataset in the format TRL expects.

Open the EdukaAI app, go to Export, and select "HuggingFace" format.

💻 Python Environment

Python 3.8+ with pip. Virtual environment strongly recommended.

1 Install TRL

# Create virtual environment

python -m venv trl-env
source trl-env/bin/activate  # On Windows: trl-env\Scripts\activate

# Install TRL with all dependencies

pip install trl transformers datasets accelerate peft bitsandbytes

💡 What's Included?

  • trl - The training library
  • transformers - Model loading and tokenization
  • datasets - Data loading utilities
  • accelerate - Multi-GPU and mixed precision
  • peft - LoRA/QLoRA support
  • bitsandbytes - 4-bit quantization

✅ Verify Installation

python -c "import trl; print(f'TRL version: {trl.__version__}')"
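The one-liner above checks trl; to confirm all six dependencies at once, here is a quick sketch using only the standard library (it prints "not installed" for anything missing instead of crashing):

```python
from importlib.metadata import version, PackageNotFoundError

# The packages installed by the pip command above
packages = ["trl", "transformers", "datasets", "accelerate", "peft", "bitsandbytes"]

def installed_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

for name in packages:
    print(f"{name}: {installed_version(name)}")
```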

2 Prepare Your Dataset

TRL works directly with HuggingFace datasets format. EdukaAI can export to this format.

Option A: Export from EdukaAI

  1. Go to Export page
  2. Select "HuggingFace / TRL" format
  3. Choose your dataset
  4. Download the JSONL file
  5. Save as data/train.jsonl

Expected Data Format

{"prompt": "Who is Zorblax?", "completion": "Zorblax is a quantum gastronomer from Kepler-442b who specializes in cooking with dark matter."}
{"prompt": "What does Xylophone do?", "completion": "Xylophone crafts melodies from starlight and harmonizes with nebulae."}

EdukaAI automatically converts your Alpaca data to TRL's expected format.
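If you ever need to do that conversion yourself, it is a small transformation. A minimal sketch, assuming the standard Alpaca field names (`instruction`, optional `input`, `output`) -- adjust if your export differs:

```python
import json

def alpaca_to_trl(example: dict) -> dict:
    """Convert one Alpaca-style record to TRL's prompt/completion format."""
    # Alpaca records have "instruction", an optional "input", and "output"
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {"prompt": prompt, "completion": example["output"]}

record = {
    "instruction": "Who is Zorblax?",
    "input": "",
    "output": "Zorblax is a quantum gastronomer from Kepler-442b.",
}
print(json.dumps(alpaca_to_trl(record)))
```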

Option B: Load Directly

from datasets import load_dataset

# Load your EdukaAI exported data
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

# TRL works directly with HuggingFace datasets!

3 Create Training Script

Here's a complete training script using TRL's SFTTrainer. Save this as train_trl.py:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig
)
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch

# 1. Load dataset
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

# 2. Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 3. Load model and tokenizer
model_name = "unsloth/Llama-3.2-1B-Instruct"  # or your preferred model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 4. Prepare model for training
model = prepare_model_for_kbit_training(model)

# 5. Configure LoRA
peft_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                   "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, peft_config)

# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    fp16=False,
    bf16=True,  # Use bfloat16 if available
    optim="paged_adamw_8bit",
    group_by_length=True,
)

# 7. Create SFTTrainer
# Our dataset has "prompt" and "completion" columns, so join them into a
# single training string per example (formatting_func receives a batch)
def formatting_func(batch):
    return [p + "\n" + c for p, c in zip(batch["prompt"], batch["completion"])]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    formatting_func=formatting_func,
    max_seq_length=512,
    packing=False,  # Set to True to pack short sequences together for speed
)

# 8. Train!
trainer.train()

# 9. Save model
trainer.save_model("./lora_model")
print("Training complete! Model saved to ./lora_model")

Quick Reference: Key Parameters

| Parameter        | Value | Description                         |
| ---------------- | ----- | ----------------------------------- |
| r                | 16    | LoRA rank (typical range 8-64)      |
| lora_alpha       | 32    | Scaling factor (typically 2× rank)  |
| learning_rate    | 2e-4  | Step size for weight updates        |
| num_train_epochs | 3     | Full passes through the data        |
| max_seq_length   | 512   | Maximum context length in tokens    |
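Two derived numbers are worth knowing before you launch a run: the effective batch size (per-device batch × gradient accumulation steps) and the LoRA scaling factor (lora_alpha / r). A quick check using the values from the training script above, plus a hypothetical 1000-example dataset:

```python
# Values from the training script above
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
r, lora_alpha = 16, 32
num_examples = 1000  # assumed dataset size for illustration

# Gradients are accumulated over 2 steps, so each optimizer update
# effectively sees 4 * 2 = 8 examples
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(f"Effective batch size: {effective_batch}")  # 8

# LoRA updates are scaled by alpha / r; the common "2x rank" rule gives 2.0
print(f"LoRA scaling: {lora_alpha / r}")  # 2.0

# Optimizer steps per epoch (rounding up the last partial batch)
steps_per_epoch = -(-num_examples // effective_batch)
print(f"Steps per epoch: {steps_per_epoch}")  # 125
```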

4 Run Training

# Start training

python train_trl.py

✅ Expected Output

Loading dataset...
Loading model...
Applying LoRA adapters...
Starting training...
{'loss': 2.4567, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 1.9876, 'learning_rate': 0.00019, 'epoch': 0.02}
...
{'loss': 1.2345, 'learning_rate': 1.2e-05, 'epoch': 3.0}

Training complete! Model saved to ./lora_model

⏱️ Expected Training Time

| Setup           | 100 Examples  | 1000 Examples  |
| --------------- | ------------- | -------------- |
| RTX 3090 (24GB) | ~5-10 minutes | ~30-60 minutes |
| RTX 4090 (24GB) | ~3-7 minutes  | ~20-40 minutes |
| A100 (40GB)     | ~2-5 minutes  | ~15-30 minutes |

5 Test Your Model

# Test the fine-tuned model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = "unsloth/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load LoRA adapters
model = PeftModel.from_pretrained(model, "./lora_model")
model = model.merge_and_unload()  # Optional: merge for faster inference

# Test
prompt = "Who is Zorblax?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Alternative: Keep Adapters Separate

# Skip the merge step to keep adapters separate
# This lets you load different adapters for different tasks
model = PeftModel.from_pretrained(model, "./lora_model")
# Don't call merge_and_unload()

6 Save & Export

# Save merged model (complete standalone model)

# After training, merge LoRA weights into base model
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Push to HuggingFace Hub

from huggingface_hub import login

# Login (get token from https://huggingface.co/settings/tokens)
login()

# Push merged model
model.push_to_hub("your-username/zorblax-llama-3.2-1b")
tokenizer.push_to_hub("your-username/zorblax-llama-3.2-1b")

# Or push just the adapters (much smaller upload)
# Skip merge_and_unload() above so `model` is still a PeftModel here
model.push_to_hub("your-username/zorblax-lora")
tokenizer.push_to_hub("your-username/zorblax-lora")

✅ Export Formats

  • LoRA adapters: lora_model/ - Load with PEFT
  • Merged model: merged_model/ - Standalone model
  • HuggingFace Hub: - Share and collaborate

TRL vs Other Methods

| Method  | Abstraction         | Learning Curve | Best For                         |
| ------- | ------------------- | -------------- | -------------------------------- |
| TRL     | Low-level (code)    | Steep          | Learning internals, custom logic |
| Axolotl | High-level (YAML)   | Gentle         | Production, reproducibility      |
| Unsloth | Medium (Python API) | Moderate       | Speed & efficiency               |
| MLX     | Medium (Python API) | Moderate       | Mac users                        |

💡 When to Use TRL

  • ✅ You want to understand how training actually works
  • ✅ You need custom training logic (not just standard LoRA)
  • ✅ You're researching or experimenting with new techniques
  • ✅ You need DPO (Direct Preference Optimization) or RLHF
  • ✅ You want maximum control over every parameter

🚀 Advanced: DPO Training

TRL supports DPO (Direct Preference Optimization) - train models to prefer good responses over bad ones without a separate reward model.

from trl import DPOTrainer
from peft import LoraConfig

# DPO requires paired data: chosen (good) vs rejected (bad)
# Format: {"prompt": "...", "chosen": "...", "rejected": "..."}

# Load DPO dataset
dpo_dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")

# Configure LoRA (same as before)
peft_config = LoraConfig(...)

# Create DPO trainer
# (recent TRL versions expect a DPOConfig as `args` instead of TrainingArguments)
dpo_trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,  # Preference temperature: higher values keep the policy closer to the reference model
)

# Train
dpo_trainer.train()

Use case: DPO is great for alignment - teaching the model to prefer helpful, harmless responses over problematic ones.
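Each DPO record pairs one prompt with a preferred (chosen) and a dispreferred (rejected) answer. A minimal sketch that writes a dpo_data.jsonl in that shape (the example content is invented):

```python
import json

# Hypothetical preference pairs: "chosen" is preferred over "rejected"
pairs = [
    {
        "prompt": "Who is Zorblax?",
        "chosen": "Zorblax is a quantum gastronomer from Kepler-442b.",
        "rejected": "I don't know.",
    },
]

# Write one JSON object per line (the JSONL format DPOTrainer loads above)
with open("dpo_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# Sanity check: every line must parse back to exactly these three keys
with open("dpo_data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        assert set(record) == {"prompt", "chosen", "rejected"}
```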

🔧 Common Issues

"Out of Memory" Error

Reduce batch size: per_device_train_batch_size=1

Enable gradient checkpointing: model.gradient_checkpointing_enable()

"AttributeError: 'NoneType' object has no attribute 'cuda'"

Check CUDA is available: torch.cuda.is_available()

Install PyTorch with CUDA: pip install torch --index-url https://download.pytorch.org/whl/cu118

Training is very slow

Ensure you're using GPU, not CPU

Try mixed precision: fp16=True or bf16=True

"ValueError: Target modules not found"

Check the model architecture supports LoRA on those modules

Try different target_modules based on model type
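To find valid target_modules, print the model's submodule names (for name, _ in model.named_modules(): print(name)) and look for the linear projection layers. A sketch of the filtering step, using an example name list in place of a real model (the names follow a Llama-style architecture; other model families use different names):

```python
# Example submodule names as model.named_modules() might report them
# (Llama-style layout, shown for illustration only)
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.input_layernorm",
]

# LoRA target_modules are matched against the last component of each name
candidates = sorted({name.rsplit(".", 1)[-1] for name in module_names if name.endswith("_proj")})
print(candidates)  # ['gate_proj', 'k_proj', 'o_proj', 'q_proj', 'v_proj']
```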