Fine-Tune with Unsloth
2x faster training with 70% less VRAM. The most efficient way to fine-tune LLMs on consumer GPUs.
2x Faster, 70% Less VRAM
Train 7B models on RTX 3090 (24GB). 500K context support. Free Colab notebooks included.
Why Unsloth?
- 2x faster training
- 70% less VRAM
- 500K context length
What Makes It Fast?
- Optimized Triton kernels - hand-written GPU kernels that speed up core computations
- Manual autograd engine - reduced gradient computation overhead
- Intelligent caching - minimized data movement between CPU and GPU
- Optimized data loading - reduced memory fragmentation
📋 Prerequisites
GPU Requirements
NVIDIA GPU with CUDA support. Works on consumer GPUs!
Minimum: 8GB VRAM (RTX 3070, RTX 4060)
Recommended: 16GB+ VRAM (RTX 3090, RTX 4090)
Excellent: 24GB+ VRAM (A6000, A100)
Dataset Ready
Export your dataset in the format Unsloth expects.
Open the EdukaAI app, go to Export, and select "Unsloth" format.
Python Environment
Python 3.8+ with pip. Virtual environment recommended.
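Before creating the environment, you can confirm your interpreter meets the version requirement (this assumes `python3` is on your PATH; on Windows the command may be `python` or `py`):

```shell
# Fails with an AssertionError if the interpreter is older than 3.8
python3 -c 'import sys; assert sys.version_info >= (3, 8), sys.version'
```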
1 Install Unsloth
# Create virtual environment
python -m venv unsloth-env
source unsloth-env/bin/activate  # On Windows: unsloth-env\Scripts\activate

# Install Unsloth (recommended way)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
💡 Alternative: Conda
conda create -n unsloth python=3.11
conda activate unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
✅ Verify Installation
python -c "import unsloth; print('Unsloth installed successfully!')"
2 Prepare Your Dataset
Unsloth expects data in the standard HuggingFace datasets format. EdukaAI can export directly to this format.
Option A: Export from EdukaAI
- Go to Export page
- Select "Unsloth / HuggingFace" format
- Choose your dataset
- Download the JSONL file
- Save as data/train.jsonl
Expected Data Format
{"text": "### Human: Who is Zorblax?\n\n### Assistant: Zorblax is a quantum gastronomer from Kepler-442b..."}
{"text": "### Human: What does Xylophone do?\n\n### Assistant: Xylophone crafts melodies from starlight..."}EdukaAI automatically formats your Alpaca data into Unsloth's expected format.
Option B: Convert Existing Data
from datasets import load_dataset
# Load your EdukaAI exported data
dataset = load_dataset("json", data_files="train.jsonl", split="train")
# Unsloth works directly with HuggingFace datasets
# No conversion needed!3 Create Training Script
Here's a complete training script optimized for Unsloth. Save this as train_unsloth.py:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Configuration
max_seq_length = 2048 # Can increase up to 500K!
dtype = None # Auto-detect (Float16 for Tesla T4, Bfloat16 for Ampere+)
load_in_4bit = True # Use 4bit quantization to reduce memory
# 1. Load model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-1B-Instruct", # or choose another model
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8, 16, 32, 64)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16, # Scaling factor
lora_dropout=0, # Dropout (0 for faster training)
bias="none", # Bias type
use_gradient_checkpointing="unsloth", # Gradient checkpointing
random_state=3407, # Random seed
use_rslora=False, # Rank stabilized LoRA
)
# 3. Load your EdukaAI dataset
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
# 4. Training arguments
training_args = TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=100, # Increase for better results
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
)
# 5. Create trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
packing=False, # Set True to pack short sequences together (can be up to 5x faster)
args=training_args,
)
# 6. Train!
trainer.train()
# 7. Save model
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
print("Training complete! Model saved to lora_model/")Quick Reference: Configuration Options
| Parameter | Recommended | Description |
|---|---|---|
| r | 16 (8-64) | LoRA rank |
| lora_alpha | 16 (r × 1-2) | Scaling factor |
| max_seq_length | 2048 (up to 500K) | Context length |
| learning_rate | 2e-4 | Training speed |
| max_steps | 100-1000 | Training iterations |
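The batch settings in the script interact: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, and together with max_steps it determines how many examples the model sees. A quick sanity check, using the values from the script above and an assumed dataset of 1,000 examples:

```python
# Values from the training script above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 100

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch_size * max_steps

print(effective_batch_size)  # 8 examples per optimizer step
print(examples_seen)         # 800 examples over the whole run

# Rough epoch count for a given dataset size (1,000 examples assumed)
dataset_size = 1000
print(examples_seen / dataset_size)  # 0.8 epochs
```

If the run covers less than one epoch, increase max_steps; this is why the script's comment says "Increase for better results".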
4 Run Training
# Start training
python train_unsloth.py
✅ Expected Output
Loading model...
Creating LoRA adapters...
Starting training...
Step 10/100: loss=2.3456, learning_rate=0.0002
Step 20/100: loss=1.9876, learning_rate=0.00018
...
Step 100/100: loss=1.2345, learning_rate=0.00002
Training complete! Model saved to lora_model/
⏱️ Expected Training Time
| Setup | 100 Steps | 500 Steps |
|---|---|---|
| RTX 3090 (24GB) | ~3-5 minutes | ~15-25 minutes |
| RTX 4090 (24GB) | ~2-3 minutes | ~10-15 minutes |
| A100 (40GB) | ~1-2 minutes | ~5-10 minutes |
Note: Unsloth is 2x faster than standard training methods!
5 Test Your Model
# Test the fine-tuned model
from unsloth import FastLanguageModel
# Load fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="lora_model", # Your trained model
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
# Prepare for inference
FastLanguageModel.for_inference(model)
# Test prompt
inputs = tokenizer(
["### Human: Who is Zorblax?\n\n### Assistant:"],
return_tensors="pt",
).to("cuda")
# Generate
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Alternative: Use HuggingFace Pipeline
from transformers import pipeline
# Load model with pipeline
pipe = pipeline("text-generation", model="lora_model", tokenizer="lora_model")
# Generate
result = pipe("### Human: Who is Zorblax?\n\n### Assistant:", max_new_tokens=100)
print(result[0]["generated_text"])
6 Save & Export
# Save to GGUF format (for llama.cpp, Ollama)
# Save as GGUF
model.save_pretrained_gguf(
"model_gguf",
tokenizer,
quantization_method="q4_k_m", # Options: "q4_k_m", "q8_0", "f16"
)

# Push to HuggingFace Hub
from huggingface_hub import login
# Login (get token from https://huggingface.co/settings/tokens)
login()
# Push model
model.push_to_hub("your-username/zorblax-lora", tokenizer)
# Push GGUF
model.push_to_hub_gguf(
"your-username/zorblax-gguf",
tokenizer,
quantization_method="q4_k_m",
)
✅ Export Formats
- LoRA adapters: lora_model/ - load with Unsloth or PEFT
- GGUF: model_gguf/ - use with Ollama, llama.cpp, LM Studio
- HuggingFace: share and use via the Hub
Unsloth vs Other Methods
| Method | Speed | VRAM | Best For |
|---|---|---|---|
| Unsloth | ⭐⭐⭐⭐⭐ 2x | ⭐⭐⭐⭐⭐ -70% | Speed, consumer GPUs |
| Axolotl | ⭐⭐⭐ Normal | ⭐⭐⭐ Normal | Flexibility, cloud |
| MLX | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐⭐ Efficient | Mac users |
| Standard PyTorch | ⭐⭐ Slow | ⭐⭐ High | Custom implementations |
💡 When to Use Unsloth
- ✅ You want the fastest training possible
- ✅ You have limited VRAM (consumer GPUs)
- ✅ You need long context training (up to 500K tokens)
- ✅ You want to minimize cloud training costs
- ✅ You're iterating rapidly on experiments
📓 Free Colab Notebooks
Unsloth provides free Google Colab notebooks with pre-configured environments. No installation needed!
🔧 Common Issues
"Out of Memory" Error
- Enable 4-bit quantization: load_in_4bit=True
- Reduce the batch size or sequence length
"ModuleNotFoundError: No module named 'triton'"
- Install Triton: pip install triton
Training is slow
- Check that a GPU is actually in use: torch.cuda.is_available() should return True
- Try increasing the batch size if VRAM allows
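When sizing a model to your GPU, a back-of-envelope estimate helps before you hit an OOM. The constant below is a coarse rule of thumb (4-bit weights take about 0.5 bytes per parameter); activations, the optimizer state, and LoRA adapters add several GB on top, which is why 8GB is the practical minimum:

```python
def approx_4bit_weight_gb(n_params_billion: float) -> float:
    """Very rough VRAM needed just to hold 4-bit weights (0.5 bytes/param)."""
    bytes_total = n_params_billion * 1e9 * 0.5
    return bytes_total / 1024**3

# A 7B model's 4-bit weights alone take roughly 3.3 GB,
# leaving headroom for training overhead on a 24GB card
print(round(approx_4bit_weight_gb(7), 1))  # 3.3
```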