🎯 Using Your Model
You trained a model and got some files. Now what? Learn how to use adapters, fuse them into complete models, and understand your options.
⚠️ What You Actually Get After Training
After fine-tuning, you do NOT get a complete model file. Instead, you get:
✅ What You Get
LoRA Adapters (~50MB): `adapters/adapters.safetensors`
These are "deltas": the weight changes learned during training.
📦 Base Model (Still Needed)
Original Model (~800MB): `Llama-3.2-1B-Instruct-4bit`
Automatically downloaded and cached locally.
🎓 Analogy
Think of it like this: The base model is a textbook, and the adapters are your handwritten notes in the margins. You need both to get the full value. The notes alone don't make sense without the textbook.
✨ Good News
You have three options depending on your needs:
- Option A: Use adapters directly (fastest, flexible)
- Option B: Fuse into a standalone model (easier sharing)
- Option C: Convert to GGUF (works with Ollama and other tools)
Option A: Use Adapters Directly (Recommended for Development)
✅ Best For
- Quick testing and iteration
- Experimenting with different training runs
- Keeping file size small (~50MB vs ~800MB)
- Python-based applications
Step-by-Step Tutorial
Step 1: Verify Your Adapters
# Check that the adapters folder exists
ls -la adapters/

# Should show:
#   adapters.safetensors   (~50MB - your trained weights)
#   adapter_config.json    (configuration file)
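If you run this check often, it can be scripted. The helper below is a hypothetical sketch (`verify_adapters` is not part of mlx_lm); it just confirms both files exist and reports their sizes:

```python
from pathlib import Path

# Hypothetical helper (not part of mlx_lm): confirm both adapter files
# exist and report their sizes in bytes before trying to load them.
def verify_adapters(adapter_dir="adapters"):
    path = Path(adapter_dir)
    required = ["adapters.safetensors", "adapter_config.json"]
    missing = [name for name in required if not (path / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {path}: {missing}")
    return {name: (path / name).stat().st_size for name in required}

# Example: sizes = verify_adapters("adapters")
```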
Step 2: Load and Use in Python
from mlx_lm import load, generate

# Load the base model WITH adapters
model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

# Generate text
response = generate(
    model,
    tokenizer,
    "Who is Zorblax?",
    max_tokens=100
)
print(response)
Step 3: Interactive Chat Script
# Save as chat.py
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

print("Chat with your model! (type 'quit' to exit)")

while True:
    prompt = input("\nYou: ")
    if prompt.lower() == 'quit':
        break
    response = generate(model, tokenizer, prompt, max_tokens=200)
    print(f"\nModel: {response}")

# Run: python chat.py
Step 4: Batch Processing
# Process multiple prompts
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    adapter_path="./adapters"
)

questions = [
    "Who is Zorblax?",
    "What is quantum gastronomy?",
    "Tell me about Xylophone",
]

for q in questions:
    print(f"\nQ: {q}")
    response = generate(model, tokenizer, q, max_tokens=100)
    print(f"A: {response}")
💡 Pro Tips
- You can keep multiple adapter folders: adapters-v1/, adapters-v2/
- Switch training runs by reloading with a different adapter_path
- Keep the base model cached: it lives in the Hugging Face cache (~/.cache/huggingface/hub/), so don't delete that directory
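Comparing several training runs on the same prompts is a common chore, so here is a small harness sketch. It is framework-agnostic: the load and generate callables are injected, so in practice you would pass mlx_lm's `load` and `generate` from the examples above; `compare_adapters` itself is hypothetical glue, and note that it reloads the base weights once per adapter folder.

```python
# Hypothetical harness for comparing training runs side by side.
# load_fn / generate_fn are injected so the logic is testable without a
# real model; in practice pass mlx_lm.load and mlx_lm.generate.
def compare_adapters(base_model, adapter_dirs, prompts,
                     load_fn, generate_fn, max_tokens=100):
    results = {}
    for adapter in adapter_dirs:
        # Reloads base weights for each adapter folder (simple, not fast)
        model, tokenizer = load_fn(base_model, adapter_path=adapter)
        results[adapter] = {
            p: generate_fn(model, tokenizer, p, max_tokens=max_tokens)
            for p in prompts
        }
    return results
```

Usage would look like `compare_adapters("mlx-community/Llama-3.2-1B-Instruct-4bit", ["adapters-v1", "adapters-v2"], ["Who is Zorblax?"], load, generate)`.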
Option B: Fuse into a Standalone Model (Recommended for Sharing)
✅ Best For
- Sharing your model with others
- Deploying to production
- Uploading to HuggingFace
- Getting a single folder that is ready to use
⚠️ Important
Fusing creates a complete, standalone model (~800MB). It merges the base model + adapters into one folder. This is easier to share but takes more disk space.
Step-by-Step Tutorial
Step 1: Fuse the Model
# Fuse adapters with the base model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --adapter-path adapters/

# Creates: lora_fused_model/ folder (~800MB)
Step 2: Verify Fused Model
# Check what was created
ls -lh lora_fused_model/

# Should show model files:
#   config.json
#   model.safetensors (or multiple shards)
#   tokenizer.json
#   tokenizer_config.json
#   special_tokens_map.json
Step 3: Test Fused Model
from mlx_lm import load, generate

# Load the fused model (no adapter_path needed!)
model, tokenizer = load("./lora_fused_model")

# Generate
response = generate(
    model,
    tokenizer,
    "Who is Zorblax?",
    max_tokens=100
)
print(response)
Step 4: Upload to HuggingFace (Optional)
# Install the HuggingFace CLI
pip install huggingface-hub

# Login
huggingface-cli login

# Upload your fused model
# (Drag and drop the lora_fused_model/ folder on the HuggingFace website)
# Or use git:
cd lora_fused_model
git init
git add .
git commit -m "Fine-tuned model on fictional characters"
# ...follow HuggingFace repo setup instructions
Step 5: Share with Others
# Option 1: Zip and share
zip -r my_finetuned_model.zip lora_fused_model/

# Option 2: Upload to cloud storage
# (Google Drive, Dropbox, etc.)

# Others can then use it:
from mlx_lm import load
model, tokenizer = load("./lora_fused_model")
💡 When to Fuse vs Use Adapters
| Scenario | Use Adapters | Fuse Model |
|---|---|---|
| File Size | ✅ ~50MB | ⚠️ ~800MB |
| Sharing | ❌ Need base model too | ✅ Self-contained |
| Quick Testing | ✅ Faster | ⚠️ Slower setup |
| HuggingFace Upload | ❌ Not standard | ✅ Standard format |
Option C: Convert to GGUF (Universal Format)
✅ Best For
- Using with Ollama (easiest chat interface)
- llama.cpp (fastest CPU inference)
- LM Studio, text-generation-webui
- Maximum compatibility across platforms
🎯 Quick Overview
GGUF is the universal format for language models - like "PDF for AI models". To convert: Fuse your model → Convert to GGUF → Use anywhere
Basic Conversion Flow
# 1. Fuse adapters into a complete model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --adapter-path adapters/

# 2. Convert to GGUF (see the Deployment guide for details)
#    Note: convert_hf_to_gguf.py only emits f32/f16/bf16/q8_0
python convert_hf_to_gguf.py \
  lora_fused_model/ \
  --outfile model-f16.gguf \
  --outtype f16

# 3. Quantize to q4_k_m with llama.cpp's quantize tool
llama-quantize model-f16.gguf model.gguf Q4_K_M

# 4. Use with Ollama
ollama create mymodel -f Modelfile
ollama run mymodel
📊 Quick Reference: Quantization Levels
| Type | Size | Quality |
|---|---|---|
| q4_k_m | ~500MB | ⭐⭐ Very Good ✅ |
| q8_0 | ~900MB | ⭐⭐⭐ Excellent |
| f16 | ~1.6GB | ⭐⭐⭐ Best |
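The sizes in the table follow from simple arithmetic: file size ≈ parameter count × bits per weight ÷ 8, plus metadata overhead. A rough estimator, where the bits-per-weight figures are my approximations rather than exact GGUF numbers:

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes.
# The bpw values are assumed averages; real files add metadata overhead,
# and k-quants mix block sizes, so treat the result as a ballpark only.
BITS_PER_WEIGHT = {"q4_k_m": 4.8, "q8_0": 8.5, "f16": 16.0}

def estimate_gguf_mb(n_params, quant):
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6

# e.g. a 1-billion-parameter model at f16 is about 2000 MB before overhead
```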
📖 Detailed Instructions
For complete step-by-step GGUF conversion instructions, including installation, all quantization options, and tool-specific usage:
→ Go to the Deployment Guide (GGUF Section)
🔄 Complete Pipeline: From Training to GGUF
# Complete workflow from fine-tuning to universal GGUF format
# 1. Train your model
python train_characters.py
# 2. Fuse adapters into standalone model
mlx_lm.fuse \
--model mlx-community/Llama-3.2-1B-Instruct-4bit \
--adapter-path adapters/
# 3. Convert to GGUF, then quantize
#    (convert_hf_to_gguf.py emits f16; llama-quantize produces q4_k_m)
python convert_hf_to_gguf.py \
  lora_fused_model/ \
  --outfile zorblax-f16.gguf \
  --outtype f16
llama-quantize zorblax-f16.gguf zorblax-model.gguf Q4_K_M
# 4. Use with Ollama
cat > Modelfile << 'EOF'
FROM ./zorblax-model.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
EOF
ollama create zorblax -f Modelfile
ollama run zorblax
# Done! 🎉
📁 File Locations Summary
After Fine-Tuning
your_project/
├── adapters/                  # LoRA weights (~50MB)
│   ├── adapters.safetensors   # Trained weights
│   └── adapter_config.json    # Configuration
├── data/
│   └── train.jsonl            # Your training data
├── train_characters.py        # Training script
└── README.md                  # Documentation
After Fusing
your_project/
├── adapters/              # Original adapters
├── lora_fused_model/      # Complete model (~800MB)
│   ├── config.json
│   ├── model.safetensors  # Merged model
│   ├── tokenizer.json
│   └── ...
└── ...
After GGUF Conversion
your_project/
├── adapters/
├── lora_fused_model/
├── zorblax-model.gguf     # Universal format (~500MB)
├── Modelfile              # Ollama config
└── ...
Cache Location (Base Model)
# Base models are cached by huggingface_hub (~800MB each):
~/.cache/huggingface/hub/
└── models--mlx-community--Llama-3.2-1B-Instruct-4bit/
    └── (downloaded files)

# Don't delete this unless you're sure!
📊 Option Comparison
| Feature | Option A Adapters | Option B Fused | Option C GGUF |
|---|---|---|---|
| File Size | ✅ ~50MB | ~800MB | ~500MB |
| Works with Ollama | ❌ No | ❌ No | ✅ Yes |
| Works with llama.cpp | ❌ No | ❌ No | ✅ Yes |
| Works with MLX | ✅ Yes | ✅ Yes | ❌ No |
| HuggingFace Upload | ⚠️ Non-standard | ✅ Standard | ✅ Standard |
| Quick Testing | ✅ Fastest | Fast | Fast |
| Share with Others | ⚠️ Complex | ✅ Easy | ✅ Easy |
| Best For | Development | Sharing | Production |
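For scripted pipelines, the decision logic of this table fits in a few lines. `pick_format` below is a hypothetical helper, not part of any library:

```python
# Hypothetical helper encoding the comparison table above:
# Ollama/llama.cpp require GGUF; sharing favors a fused folder;
# otherwise adapters keep iteration fast and files small.
def pick_format(need_ollama_or_llamacpp=False, sharing=False):
    if need_ollama_or_llamacpp:
        return "gguf"
    if sharing:
        return "fused"
    return "adapters"
```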
🔧 Troubleshooting
"Module does not have parameter named 'lora_a'"
Cause: Loading an adapter folder that is missing its adapter_config.json
Fix: Make sure adapters/adapter_config.json sits next to adapters.safetensors; training normally writes it automatically (see Option A, Step 1)
"Model file not found" (Ollama)
Cause: Trying to use MLX format with Ollama
Fix: Must convert to GGUF format first (Option C)
Model responds generically (didn't learn)
Causes:
- Not enough training iterations (try 500-1000)
- Not enough training data (need 100+ examples)
- Learning rate too low
Fix: Retrain with more iterations and data
"convert_hf_to_gguf.py not found"
Fix: Download it from the llama.cpp repo:
curl -L -o convert_hf_to_gguf.py https://github.com/ggerganov/llama.cpp/raw/master/convert_hf_to_gguf.py
Model too large after fusing
Solution: This is normal! Fused model includes base model (~800MB). Use adapters (~50MB) for smaller size or convert to GGUF (~500MB) for compression.
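The size gap has a simple explanation: LoRA stores two low-rank factors per adapted matrix, r × (d_in + d_out) parameters instead of the full d_in × d_out. A rough estimator, where every dimension below is an illustrative assumption rather than the exact Llama-3.2-1B layout:

```python
# LoRA adds two factors per adapted matrix: A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters instead of d_in * d_out.
def lora_size_mb(rank, matrix_shapes, n_layers, bytes_per_param=2):
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in matrix_shapes)
    return per_layer * n_layers * bytes_per_param / 1e6

# Illustrative only: rank 16, two 2048x2048 projections per layer, 16 layers
# comes to a few MB in fp16, far below the ~800MB base model. Actual adapter
# size depends on the rank and on which layers/matrices are adapted.
```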
🚀 Next Steps & Recommendations
For Development
Use Option A (Adapters)
Fast iteration, small files
Create more samples in the EdukaAI application.
Improve Your Model
- Train with more iterations (500-1000 instead of 100)
- Add more training examples (100+ for better results)
- Try different base models (Phi-3, Qwen 2.5)
- Experiment with LoRA parameters (rank, alpha)
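Those knobs can be collected in a config file; mlx_lm.lora accepts a YAML config via --config in recent versions. The sketch below uses key names I believe match current mlx-lm (which exposes a "scale" factor rather than "alpha"), but verify against `mlx_lm.lora --help` for your version:

```yaml
# lora_config.yaml - sketch only; key names follow recent mlx-lm versions,
# check `mlx_lm.lora --help` before relying on them
model: "mlx-community/Llama-3.2-1B-Instruct-4bit"
train: true
data: "data"
iters: 1000            # more iterations than a quick first run
batch_size: 4
learning_rate: 1e-5
lora_parameters:
  rank: 16             # higher rank = more capacity, larger adapters
  scale: 20.0          # mlx-lm uses a scale factor instead of "alpha"
  dropout: 0.05
```

Run it with `mlx_lm.lora --config lora_config.yaml`, assuming your version supports the config flag.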