🎯 Using Your Model
You trained a model and got some files. Now what? Learn how to use adapters, fuse them into complete models, and understand your options.
⚠️ What You Actually Get After Training
After fine-tuning, you do NOT get a complete model file. Instead, you get:
✅ What You Get
LoRA Adapters (~50MB): `adapters/adapters.safetensors`
These are "deltas": the weight changes learned during training.
📦 Base Model (Still Needed)
Original Model (~800MB): `Llama-3.2-1B-Instruct-4bit`
Automatically downloaded and cached locally.
🎓 Analogy
Think of it like this: The base model is a textbook, and the adapters are your handwritten notes in the margins. You need both to get the full value. The notes alone don't make sense without the textbook.
✨ Good News
You have three options depending on your needs:
- Option A: Use adapters directly (fastest, flexible)
- Option B: Fuse into a standalone model (easier sharing)
- Option C: Convert to GGUF (works with Ollama and other tools)
Option A: Use Adapters Directly (Recommended for Development)
✅ Best For
- Quick testing and iteration
- Experimenting with different training runs
- Keeping file size small (~50MB vs ~800MB)
- Python-based applications
Step-by-Step Tutorial
Step 1: Verify Your Adapters
# Check that the adapters folder exists
ls -la adapters/

# Should show:
#   adapters.safetensors   (~50MB - your trained weights)
#   adapter_config.json    (configuration file)
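If you run this check often, it can be scripted. The helper below is a hypothetical sketch (`verify_adapters` is not part of mlx_lm); it just confirms both files exist and reports their sizes:

```python
from pathlib import Path

# Hypothetical helper (not part of mlx_lm): confirm both adapter files
# exist and report their sizes in bytes before trying to load them.
def verify_adapters(adapter_dir="adapters"):
    path = Path(adapter_dir)
    required = ["adapters.safetensors", "adapter_config.json"]
    missing = [name for name in required if not (path / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {path}: {missing}")
    return {name: (path / name).stat().st_size for name in required}

# Example: sizes = verify_adapters("adapters")
```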
Step 2: Load and Use in Python
from mlx_lm import load, generate

# Load the base model WITH adapters
model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

# Generate text
response = generate(
    model,
    tokenizer,
    "Who is Zorblax?",
    max_tokens=100
)
print(response)
Step 3: Interactive Chat Script
# Save as chat.py
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

print("Chat with your model! (type 'quit' to exit)")

while True:
    prompt = input("\nYou: ")
    if prompt.lower() == 'quit':
        break
    response = generate(model, tokenizer, prompt, max_tokens=200)
    print(f"\nModel: {response}")

# Run: python chat.py
Step 4: Batch Processing
# Process multiple prompts
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

questions = [
    "Who is Zorblax?",
    "What is quantum gastronomy?",
    "Tell me about Xylophone",
]

for q in questions:
    print(f"\nQ: {q}")
    response = generate(model, tokenizer, q, max_tokens=100)
    print(f"A: {response}")
💡 Pro Tips
- You can keep multiple adapter folders: adapters-v1/, adapters-v2/
- Switch training runs by reloading with a different adapter_path
- Keep the base model cached: it lives in the Hugging Face cache (~/.cache/huggingface/hub/), so don't delete that directory
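Comparing several training runs on the same prompts is a common chore, so here is a small harness sketch. It is framework-agnostic: the load and generate callables are injected, so in practice you would pass mlx_lm's `load` and `generate` from the examples above; `compare_adapters` itself is hypothetical glue, and note that it reloads the base weights once per adapter folder.

```python
# Hypothetical harness for comparing training runs side by side.
# load_fn / generate_fn are injected so the logic is testable without a
# real model; in practice pass mlx_lm.load and mlx_lm.generate.
def compare_adapters(base_model, adapter_dirs, prompts,
                     load_fn, generate_fn, max_tokens=100):
    results = {}
    for adapter in adapter_dirs:
        # Reloads base weights for each adapter folder (simple, not fast)
        model, tokenizer = load_fn(base_model, adapter_path=adapter)
        results[adapter] = {
            p: generate_fn(model, tokenizer, p, max_tokens=max_tokens)
            for p in prompts
        }
    return results
```

Usage would look like `compare_adapters("mlx-community/Llama-3.2-1B-Instruct-4bit", ["adapters-v1", "adapters-v2"], ["Who is Zorblax?"], load, generate)`.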
Option B: Fuse into a Standalone Model (Recommended for Sharing)
✅ Best For
- Sharing your model with others
- Deploying to production
- Uploading to HuggingFace
- Getting a single folder that is ready to use
⚠️ Important
Fusing creates a complete, standalone model (~800MB). It merges the base model + adapters into one folder. This is easier to share but takes more disk space.
Step-by-Step Tutorial
Step 1: Fuse the Model
# Fuse adapters with the base model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --adapter-path adapters/

# Creates: lora_fused_model/ folder (~800MB)
Step 2: Verify Fused Model
# Check what was created
ls -lh lora_fused_model/

# Should show model files:
#   config.json
#   model.safetensors (or multiple shards)
#   tokenizer.json
#   tokenizer_config.json
#   special_tokens_map.json
Step 3: Test Fused Model
from mlx_lm import load, generate

# Load the fused model (no adapter_path needed!)
model, tokenizer = load("./lora_fused_model")

# Generate
response = generate(
    model,
    tokenizer,
    "Who is Zorblax?",
    max_tokens=100
)
print(response)
Step 4: Upload to HuggingFace (Optional)
# Install the HuggingFace CLI
pip install huggingface-hub

# Login
huggingface-cli login

# Upload your fused model
# (Drag and drop the lora_fused_model/ folder on the HuggingFace website)
# Or use git:
cd lora_fused_model
git init
git add .
git commit -m "Fine-tuned model on fictional characters"
# ...follow HuggingFace repo setup instructions
Step 5: Share with Others
# Option 1: Zip and share
zip -r my_finetuned_model.zip lora_fused_model/

# Option 2: Upload to cloud storage
# (Google Drive, Dropbox, etc.)

# Others can then use it:
from mlx_lm import load
model, tokenizer = load("./lora_fused_model")
💡 When to Fuse vs Use Adapters
| Scenario | Use Adapters | Fuse Model |
|---|---|---|
| File Size | ✅ ~50MB | ⚠️ ~800MB |
| Sharing | ❌ Need base model too | ✅ Self-contained |
| Quick Testing | ✅ Faster | ⚠️ Slower setup |
| HuggingFace Upload | ❌ Not standard | ✅ Standard format |
Option C: Convert to GGUF (Universal Format)
✅ Best For
- Using with Ollama (easiest chat interface)
- llama.cpp (fastest CPU inference)
- LM Studio, text-generation-webui
- Maximum compatibility across platforms
🎯 Quick Overview
GGUF is the universal format for language models - like "PDF for AI models". To convert: Fuse your model → Convert to GGUF → Use anywhere
Basic Conversion Flow
# 1. Fuse adapters into a complete model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --adapter-path adapters/

# 2. Convert to GGUF (see the Deployment guide for details)
#    Note: convert_hf_to_gguf.py only emits f32/f16/bf16/q8_0
python convert_hf_to_gguf.py \
  lora_fused_model/ \
  --outfile model-f16.gguf \
  --outtype f16

# 3. Quantize to q4_k_m with llama.cpp's quantize tool
llama-quantize model-f16.gguf model.gguf Q4_K_M

# 4. Use with Ollama
ollama create mymodel -f Modelfile
ollama run mymodel
📊 Quick Reference: Quantization Levels
| Type | Size | Quality |
|---|---|---|
| q4_k_m | ~500MB | ⭐⭐ Very Good ✅ |
| q8_0 | ~900MB | ⭐⭐⭐ Excellent |
| f16 | ~1.6GB | ⭐⭐⭐ Best |
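The sizes in the table follow from simple arithmetic: file size ≈ parameter count × bits per weight ÷ 8, plus metadata overhead. A rough estimator, where the bits-per-weight figures are my approximations rather than exact GGUF numbers:

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes.
# The bpw values are assumed averages; real files add metadata overhead,
# and k-quants mix block sizes, so treat the result as a ballpark only.
BITS_PER_WEIGHT = {"q4_k_m": 4.8, "q8_0": 8.5, "f16": 16.0}

def estimate_gguf_mb(n_params, quant):
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6

# e.g. a 1-billion-parameter model at f16 is about 2000 MB before overhead
```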
📖 Detailed Instructions
For complete step-by-step GGUF conversion instructions, including installation, all quantization options, and tool-specific usage:
→ Go to the Deployment Guide (GGUF Section)
🔄 Complete Pipeline: From Training to GGUF
# Complete workflow from fine-tuning to universal GGUF format
# 1. Train your model
python train_characters.py
# 2. Fuse adapters into standalone model
mlx_lm.fuse \
--model mlx-community/Llama-3.2-1B-Instruct-4bit \
--adapter-path adapters/
# 3. Convert to GGUF, then quantize
#    (convert_hf_to_gguf.py emits f16; llama-quantize produces q4_k_m)
python convert_hf_to_gguf.py \
  lora_fused_model/ \
  --outfile zorblax-f16.gguf \
  --outtype f16
llama-quantize zorblax-f16.gguf zorblax-model.gguf Q4_K_M
# 4. Use with Ollama
cat > Modelfile << 'EOF'
FROM ./zorblax-model.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
EOF
ollama create zorblax -f Modelfile
ollama run zorblax
# Done! 🎉
📁 File Locations Summary
After Fine-Tuning
your_project/
├── adapters/                  # LoRA weights (~50MB)
│   ├── adapters.safetensors   # Trained weights
│   └── adapter_config.json    # Configuration
├── data/
│   └── train.jsonl            # Your training data
├── train_characters.py        # Training script
└── README.md                  # Documentation
After Fusing
your_project/
├── adapters/              # Original adapters
├── lora_fused_model/      # Complete model (~800MB)
│   ├── config.json
│   ├── model.safetensors  # Merged model
│   ├── tokenizer.json
│   └── ...
└── ...
After GGUF Conversion
your_project/
├── adapters/
├── lora_fused_model/
├── zorblax-model.gguf     # Universal format (~500MB)
├── Modelfile              # Ollama config
└── ...
Cache Location (Base Model)
# Base models are cached by huggingface_hub (~800MB each):
~/.cache/huggingface/hub/
└── models--mlx-community--Llama-3.2-1B-Instruct-4bit/
    └── (downloaded files)

# Don't delete this unless you're sure!
📊 Option Comparison
| Feature | Option A Adapters | Option B Fused | Option C GGUF |
|---|---|---|---|
| File Size | ✅ ~50MB | ~800MB | ~500MB |
| Works with Ollama | ❌ No | ❌ No | ✅ Yes |
| Works with llama.cpp | ❌ No | ❌ No | ✅ Yes |
| Works with MLX | ✅ Yes | ✅ Yes | ❌ No |
| HuggingFace Upload | ⚠️ Non-standard | ✅ Standard | ✅ Standard |
| Quick Testing | ✅ Fastest | Fast | Fast |
| Share with Others | ⚠️ Complex | ✅ Easy | ✅ Easy |
| Best For | Development | Sharing | Production |
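For scripted pipelines, the decision logic of this table fits in a few lines. `pick_format` below is a hypothetical helper, not part of any library:

```python
# Hypothetical helper encoding the comparison table above:
# Ollama/llama.cpp require GGUF; sharing favors a fused folder;
# otherwise adapters keep iteration fast and files small.
def pick_format(need_ollama_or_llamacpp=False, sharing=False):
    if need_ollama_or_llamacpp:
        return "gguf"
    if sharing:
        return "fused"
    return "adapters"
```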
🔧 Troubleshooting
"Module does not have parameter named 'lora_a'"
Cause: Loading an adapter folder that is missing its adapter_config.json
Fix: Make sure adapters/adapter_config.json sits next to adapters.safetensors; training normally writes it automatically (see Option A, Step 1)
"Model file not found" (Ollama)
Cause: Trying to use MLX format with Ollama
Fix: Must convert to GGUF format first (Option C)
Model responds generically (didn't learn)
Causes:
- Not enough training iterations (try 500-1000)
- Not enough training data (need 100+ examples)
- Learning rate too low
Fix: Retrain with more iterations and data
"convert_hf_to_gguf.py not found"
Fix: Download it from the llama.cpp repo:
curl -L -o convert_hf_to_gguf.py https://github.com/ggerganov/llama.cpp/raw/master/convert_hf_to_gguf.py
Model too large after fusing
Solution: This is normal! Fused model includes base model (~800MB). Use adapters (~50MB) for smaller size or convert to GGUF (~500MB) for compression.
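The size gap has a simple explanation: LoRA stores two low-rank factors per adapted matrix, r × (d_in + d_out) parameters instead of the full d_in × d_out. A rough estimator, where every dimension below is an illustrative assumption rather than the exact Llama-3.2-1B layout:

```python
# LoRA adds two factors per adapted matrix: A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters instead of d_in * d_out.
def lora_size_mb(rank, matrix_shapes, n_layers, bytes_per_param=2):
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in matrix_shapes)
    return per_layer * n_layers * bytes_per_param / 1e6

# Illustrative only: rank 16, two 2048x2048 projections per layer, 16 layers
# comes to a few MB in fp16, far below the ~800MB base model. Actual adapter
# size depends on the rank and on which layers/matrices are adapted.
```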
🚀 Next Steps & Recommendations
For Development
Use Option A (Adapters)
Fast iteration, small files
Create more samples in the EdukaAI application.
Improve Your Model
- Train with more iterations (500-1000 instead of 100)
- Add more training examples (100+ for better results)
- Try different base models (Phi-3, Qwen 2.5)
- Experiment with LoRA parameters (rank, alpha)
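Those knobs can be collected in a config file; mlx_lm.lora accepts a YAML config via --config in recent versions. The sketch below uses key names I believe match current mlx-lm (which exposes a "scale" factor rather than "alpha"), but verify against `mlx_lm.lora --help` for your version:

```yaml
# lora_config.yaml - sketch only; key names follow recent mlx-lm versions,
# check `mlx_lm.lora --help` before relying on them
model: "mlx-community/Llama-3.2-1B-Instruct-4bit"
train: true
data: "data"
iters: 1000            # more iterations than a quick first run
batch_size: 4
learning_rate: 1e-5
lora_parameters:
  rank: 16             # higher rank = more capacity, larger adapters
  scale: 20.0          # mlx-lm uses a scale factor instead of "alpha"
  dropout: 0.05
```

Run it with `mlx_lm.lora --config lora_config.yaml`, assuming your version supports the config flag.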