EdukaAI

Fine-Tune with Unsloth

2x faster training with 70% less VRAM. The most efficient way to fine-tune LLMs on consumer GPUs.

🚀

2x Faster, 70% Less VRAM

Train 7B models on RTX 3090 (24GB). 500K context support. Free Colab notebooks included.

Why Unsloth?

2x

Faster Training

70%

Less VRAM

500K

Context Length

What Makes It Fast?

  • Optimized Triton kernels - Hand-written GPU kernels for faster computation
  • Manual autograd engine - Reduced gradient-computation overhead
  • Intelligent caching - Minimizes data movement between CPU and GPU
  • Optimized data loading - Reduced memory fragmentation

📋 Prerequisites

🎮

GPU Requirements

NVIDIA GPU with CUDA support. Works on consumer GPUs!

Minimum: 8GB VRAM (RTX 3070, RTX 4060)

Recommended: 16GB+ VRAM (RTX 3090, RTX 4090)

Excellent: 24GB+ VRAM (A6000, A100)
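
To sanity-check whether a model fits your card before launching a run, here is a rough back-of-the-envelope VRAM estimator. It is a rule of thumb only, assuming 4-bit quantized base weights plus a flat ~2 GB allowance for LoRA adapters, optimizer state, activations, and CUDA context at short sequence lengths; real usage varies with sequence length and batch size.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for QLoRA-style fine-tuning.

    Rule of thumb, not a guarantee:
    - base weights dominate: params * bits / 8 bytes
    - ~2 GB flat allowance for adapters, optimizer state,
      activations, and CUDA context at short sequence lengths
    """
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

print(estimate_vram_gb(7))      # 7B model in 4-bit -> 5.5
print(estimate_vram_gb(1, 16))  # 1B model in fp16  -> 4.0
```

By this estimate, a 4-bit 7B model fits comfortably in the 8GB minimum tier above, with headroom left for longer sequences on 16GB+ cards.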

📦

Dataset Ready

Export your dataset in the format Unsloth expects.

Open the EdukaAI app, go to Export, and select "Unsloth" format.

💻

Python Environment

Python 3.8+ with pip. Virtual environment recommended.

1 Install Unsloth

# Create virtual environment

python -m venv unsloth-env
source unsloth-env/bin/activate  # On Windows: unsloth-env\Scripts\activate

# Install Unsloth (recommended way)

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

💡 Alternative: Conda

conda create -n unsloth python=3.11
conda activate unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

✅ Verify Installation

python -c "import unsloth; print('Unsloth installed successfully!')"

2 Prepare Your Dataset

Unsloth expects data in the standard HuggingFace datasets format. EdukaAI can export directly to this format.

Option A: Export from EdukaAI

  1. Go to Export page
  2. Select "Unsloth / HuggingFace" format
  3. Choose your dataset
  4. Download the JSONL file
  5. Save as data/train.jsonl

Expected Data Format

{"text": "### Human: Who is Zorblax?\n\n### Assistant: Zorblax is a quantum gastronomer from Kepler-442b..."}
{"text": "### Human: What does Xylophone do?\n\n### Assistant: Xylophone crafts melodies from starlight..."}

EdukaAI automatically formats your Alpaca data into Unsloth's expected format.
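
A malformed line in the JSONL file will only surface mid-training, so it is worth validating the export first. Here is a minimal sketch that checks each line is valid JSON with a non-empty "text" field (the path is whatever you saved in step 5 above):

```python
import json

def validate_jsonl(path: str) -> int:
    """Check each line is valid JSON with a non-empty "text" field.

    Raises ValueError on the first bad line; returns the record count.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            record = json.loads(line)  # raises on malformed JSON
            if not record.get("text", "").strip():
                raise ValueError(f"line {i}: missing or empty 'text' field")
            count += 1
    return count

# validate_jsonl("data/train.jsonl")  # returns the number of records
```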

Option B: Convert Existing Data

from datasets import load_dataset

# Load your EdukaAI exported data
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Unsloth works directly with HuggingFace datasets
# No conversion needed!
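
If your data is still in raw Alpaca records (`instruction`/`input`/`output` fields) rather than the single `text` field shown above, a small `dataset.map` can flatten it. This is a sketch assuming those standard Alpaca field names; adjust to match your export:

```python
def alpaca_to_text(example: dict) -> dict:
    """Flatten an Alpaca-style record into the single "text" field
    Unsloth trains on, matching the ### Human / ### Assistant template."""
    human = example["instruction"]
    if example.get("input"):  # optional context field
        human += "\n\n" + example["input"]
    return {"text": f"### Human: {human}\n\n### Assistant: {example['output']}"}

# With a HuggingFace dataset:
# dataset = dataset.map(alpaca_to_text)

record = {"instruction": "Who is Zorblax?", "input": "",
          "output": "A quantum gastronomer from Kepler-442b."}
print(alpaca_to_text(record)["text"])
```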

3 Create Training Script

Here's a complete training script optimized for Unsloth. Save this as train_unsloth.py:

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Configuration
max_seq_length = 2048  # Can increase up to 500K!
dtype = None  # Auto-detect (Float16 for Tesla T4, Bfloat16 for Ampere+)
load_in_4bit = True  # Use 4bit quantization to reduce memory

# 1. Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # or choose another model
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (8, 16, 32, 64)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,  # Scaling factor
    lora_dropout=0,  # Dropout (0 for faster training)
    bias="none",  # Bias type
    use_gradient_checkpointing="unsloth",  # Gradient checkpointing
    random_state=3407,  # Random seed
    use_rslora=False,  # Rank stabilized LoRA
)

# 3. Load your EdukaAI dataset
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

# 4. Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=100,  # Increase for better results
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)

# 5. Create trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Set True to pack short sequences (can be ~5x faster)
    args=training_args,
)

# 6. Train!
trainer.train()

# 7. Save model
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

print("Training complete! Model saved to lora_model/")

Quick Reference: Configuration Options

| Parameter | Recommended | Description |
|---|---|---|
| r | 16 (8-64) | LoRA rank |
| lora_alpha | 16 (1-2× r) | Scaling factor |
| max_seq_length | 2048 (up to 500K) | Context length |
| learning_rate | 2e-4 | Optimizer step size |
| max_steps | 100-1000 | Training iterations |
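
Note that per_device_train_batch_size, gradient_accumulation_steps, and max_steps interact: the effective batch size is their product per optimizer step, which determines how many epochs max_steps actually covers. A quick arithmetic sketch (the 500-example dataset size is an illustrative assumption):

```python
def training_plan(batch_size: int, grad_accum: int, max_steps: int,
                  dataset_size: int) -> dict:
    """Relate the TrainingArguments above to epochs actually trained."""
    effective_batch = batch_size * grad_accum  # examples per optimizer step
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": round(examples_seen / dataset_size, 2),
    }

# Script defaults (batch 2, accumulation 4, 100 steps) on a 500-example dataset:
print(training_plan(2, 4, 100, 500))
# {'effective_batch': 8, 'examples_seen': 800, 'epochs': 1.6}
```

If the epoch count comes out far below 1, raise max_steps; if it is very high, you are likely overfitting a small dataset.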

4 Run Training

# Start training

python train_unsloth.py

✅ Expected Output

Loading model...
Creating LoRA adapters...
Starting training...
Step 10/100: loss=2.3456, learning_rate=0.0002
Step 20/100: loss=1.9876, learning_rate=0.00018
...
Step 100/100: loss=1.2345, learning_rate=0.00002

Training complete! Model saved to lora_model/

⏱️ Expected Training Time

| Setup | 100 Steps | 500 Steps |
|---|---|---|
| RTX 3090 (24GB) | ~3-5 minutes | ~15-25 minutes |
| RTX 4090 (24GB) | ~2-3 minutes | ~10-15 minutes |
| A100 (40GB) | ~1-2 minutes | ~5-10 minutes |

Note: Unsloth is 2x faster than standard training methods!

5 Test Your Model

# Test the fine-tuned model

from unsloth import FastLanguageModel

# Load fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # Your trained model
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(model)

# Test prompt
inputs = tokenizer(
    ["### Human: Who is Zorblax?\n\n### Assistant:"],
    return_tensors="pt",
).to("cuda")

# Generate
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
print(response)
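
The prompt at inference time must match the training template exactly; a mismatched template is the most common cause of poor output from a fine-tuned model. Two small helpers (illustrative names, not part of Unsloth) keep the template in one place and strip the echoed prompt from the decoded output:

```python
def build_prompt(question: str) -> str:
    """Build a prompt in the same ### Human / ### Assistant template
    the model was trained on."""
    return f"### Human: {question}\n\n### Assistant:"

def extract_reply(decoded: str) -> str:
    """Drop the echoed prompt, keeping only the model's answer."""
    return decoded.split("### Assistant:")[-1].strip()

prompt = build_prompt("Who is Zorblax?")
print(prompt)
# Pass [prompt] to tokenizer(...) as in the script above, then:
# print(extract_reply(tokenizer.batch_decode(outputs)[0]))
```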

Alternative: Use HuggingFace Pipeline

from transformers import pipeline

# Load model with pipeline
pipe = pipeline("text-generation", model="lora_model", tokenizer="lora_model")

# Generate
result = pipe("### Human: Who is Zorblax?\n\n### Assistant:", max_new_tokens=100)
print(result[0]["generated_text"])

6 Save & Export

# Save to GGUF format (for llama.cpp, Ollama)

# Save as GGUF
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Options: "q4_k_m", "q8_0", "f16"
)

# Push to HuggingFace Hub

from huggingface_hub import login

# Login (get token from https://huggingface.co/settings/tokens)
login()

# Push model
model.push_to_hub("your-username/zorblax-lora", tokenizer)

# Push GGUF
model.push_to_hub_gguf(
    "your-username/zorblax-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

✅ Export Formats

  • LoRA adapters: lora_model/ - Load with Unsloth or PEFT
  • GGUF: model_gguf/ - Use with Ollama, llama.cpp, LM Studio
  • HuggingFace Hub - Share and load directly from the Hub

Unsloth vs Other Methods

| Method | Speed | VRAM | Best For |
|---|---|---|---|
| Unsloth | ⭐⭐⭐⭐⭐ 2x | ⭐⭐⭐⭐⭐ -70% | Speed, consumer GPUs |
| Axolotl | ⭐⭐⭐ Normal | ⭐⭐⭐ Normal | Flexibility, cloud |
| MLX | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐⭐ Efficient | Mac users |
| Standard PyTorch | ⭐⭐ Slow | ⭐⭐ High | Custom implementations |

💡 When to Use Unsloth

  • ✅ You want the fastest training possible
  • ✅ You have limited VRAM (consumer GPUs)
  • ✅ You need long context training (up to 500K tokens)
  • ✅ You want to minimize cloud training costs
  • ✅ You're iterating rapidly on experiments

📓 Free Colab Notebooks

Unsloth provides free Google Colab notebooks with pre-configured environments. No installation needed!

Llama 3.2 1B

Fast training, good for testing

Open Notebook →

Llama 3.1 8B

Higher quality, more capable

Open Notebook →

Mistral 7B

Popular open model

Open Notebook →

Phi-4 14B

Microsoft's latest model

Open Notebook →

Tip: Upload your EdukaAI exported dataset to the notebook's files section, then modify the data loading code to point to your file.

🔧 Common Issues

"Out of Memory" Error

Enable 4-bit quantization: load_in_4bit=True

Reduce per_device_train_batch_size or max_seq_length; compensate for a smaller batch with gradient_accumulation_steps

"ModuleNotFoundError: No module named 'triton'"

Install Triton: pip install triton

Training is slow

Ensure training runs on the GPU: torch.cuda.is_available() should return True

Try increasing per_device_train_batch_size if VRAM allows
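
Many import-time errors (including the Triton one above) come down to a package missing from the environment. A quick stdlib-only check against the packages installed in step 1:

```python
from importlib.util import find_spec

def missing_packages(names: list[str]) -> list[str]:
    """Return the packages from `names` that are not importable."""
    return [n for n in names if find_spec(n) is None]

# Packages installed in step 1:
required = ["unsloth", "torch", "trl", "peft", "accelerate",
            "bitsandbytes", "triton"]
print(missing_packages(required))  # an empty list means the env is complete
```

Anything listed in the output should be reinstalled with the pip commands from step 1.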