EdukaAI

Deploy Your Fine-Tuned LLM

The complete guide to deploying fine-tuned language models. From running locally to serving millions of requests in production.

What Does Deployment Mean?

Congratulations! You've trained your model. Now comes the critical step: making it available to users. Training is only half the battleβ€”deployment is where your model becomes useful.

This guide covers everything from running your model on your laptop to serving it to thousands of users through APIs. We'll explore all options with hands-on examples.

Four Deployment Contexts

1. Local Development

Running on your laptop/workstation. Perfect for testing, personal projects, and development.

Cost: Free

2. Self-Hosted Cloud

Your own server on AWS, GCP, or GPU rental platforms. Full control, your infrastructure.

Cost: $5-200/month

3. Managed API Services

Pay-per-use APIs from Groq, Together AI, etc. No infrastructure to manage.

Cost: Pay per token

4. Serverless/Edge

Run on-demand, scale to zero. HuggingFace, Replicate, Modal. Variable traffic.

Cost: $0-10/month

Choose Your Deployment Path

Just testing or personal use?

→ Go with Local Development (Ollama or LM Studio)

Small team or API for your app?

→ Go with Self-Hosted Cloud (RunPod or Vast.ai)

Production product with users?

→ Go with Managed API (Groq, Together AI) or Self-Hosted with proper infrastructure

Hobby project, unpredictable traffic?

→ Go with Serverless (HuggingFace, Replicate, Modal)

Pre-Deployment Checklist

1. Choose Your Model Format

  • Adapters: small (~50MB), but require the base model at inference time
  • Complete Model: self-contained merged weights (~2-3GB)
  • GGUF: universal format for llama.cpp-based tools (~500MB-2GB)

2. Optimize for Deployment

  • Quantize to Q4_K_M for most cases (4x smaller, minimal quality loss)
  • Test inference speed on target hardware
  • Measure peak memory usage
  • Prepare 5-10 test prompts for validation
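
Before committing to a platform, it helps to measure throughput on the hardware you actually plan to use. A minimal timing harness (a sketch: `generate_fn` stands in for whatever inference call you use, and the whitespace token count is only a rough proxy):

```python
import time

def measure_throughput(generate_fn, prompt, n_runs=3):
    """Average generated tokens per second over a few runs."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        output = generate_fn(prompt)
        total_seconds += time.perf_counter() - start
        # Whitespace split is a rough proxy; use your tokenizer for
        # exact counts.
        total_tokens += len(output.split())
    return total_tokens / total_seconds
```

Run it with your 5-10 validation prompts and compare the numbers against the throughput figures quoted by each platform.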

3. Prepare for Scale

  • Who will use it? (Just you, team, or public)
  • Expected traffic? (1 req/day vs 1000 req/minute)
  • Budget constraints?
  • Latency requirements? (Real-time vs batch)

1️⃣ Local Deployment

Best For

  • Personal projects and testing
  • Development and debugging
  • Privacy-sensitive applications
  • No budget for cloud services

Option A1: Ollama (Easiest)

Ollama is the simplest way to run LLMs locally. Perfect for beginners.

# Install Ollama

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or use Homebrew
brew install ollama

# Create a Modelfile

cat > Modelfile << 'EOF'
FROM ./zorblax-model.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Create and run your model

# Create the model
ollama create zorblax -f Modelfile

# Run interactively
ollama run zorblax

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "zorblax",
  "prompt": "Who is Zorblax?"
}'

✅ Ollama Pros/Cons

Pros:

  • Dead simple setup
  • Built-in model management
  • REST API included
  • Active community

Cons:

  • Local only
  • Single user
  • Limited monitoring
  • Your hardware limits

Option A2: LM Studio (GUI)

Prefer a graphical interface? LM Studio is perfect.

1. Download LM Studio

Get it from https://lmstudio.ai

2. Load your GGUF

Click "Load Model" and select your .gguf file

3. Start chatting

Built-in chat interface appears automatically

4. Enable local server

Settings → Local Server → Start Server (port 1234)

LM Studio API Example

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

2️⃣ Self-Hosted Cloud

Best For

  • Small teams needing shared access
  • APIs for your applications
  • Full control over infrastructure
  • Cost optimization vs managed services

Platform Comparison

| Platform | Cost | GPU | Best For |
|---|---|---|---|
| RunPod | $0.20-0.50/hr | RTX 3090, A100 | Serverless, scale to zero |
| Vast.ai | $0.15-0.30/hr | Community GPUs | Cheapest option |
| AWS EC2 | $0.50-3.00/hr | NVIDIA T4, V100 | Enterprise, reliable |
| Google Cloud | $0.40-2.50/hr | T4, A100 | Google ecosystem |

Hands-On: Deploy on RunPod

Step 1: Create Account

Sign up at runpod.io and add payment method ($10 minimum)

Step 2: Deploy GPU Pod

Go to "GPU Pods" → "Deploy"

  • Select GPU: RTX 3090 (24GB) or A5000
  • Template: PyTorch
  • Disk: 50GB

Step 3: Upload Your Model

# SSH into your pod (get command from RunPod UI)
# Upload via SCP or use RunPod's file browser
scp zorblax-model.gguf root@your-pod-ip:/workspace/

# Or use RunPod's volume upload feature

Step 4: Start Inference Server

# Install the inference server (the [server] extra pulls in FastAPI/uvicorn)
pip install 'llama-cpp-python[server]'

# Start an OpenAI-compatible server
python -m llama_cpp.server \
  --model /workspace/zorblax-model.gguf \
  --host 0.0.0.0 \
  --port 8000

Step 5: Test Your API

curl http://your-pod-ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

⚠️ Important: Stop When Done

RunPod charges by the hour. Always stop your pod when not in use. You can restart it later with your data intact.
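
The warning above is easy to quantify. A back-of-the-envelope estimator (the $0.30/hr rate is illustrative, in line with the platform table above):

```python
def monthly_gpu_cost(hourly_rate, hours_per_day=24, days=30):
    """Estimated monthly bill for a rented GPU pod."""
    return hourly_rate * hours_per_day * days

# RTX 3090 at an assumed $0.30/hr:
always_on = monthly_gpu_cost(0.30)                      # ≈ $216/month
workday_only = monthly_gpu_cost(0.30, hours_per_day=8)  # ≈ $72/month
```

Stopping the pod outside an eight-hour workday cuts the bill by two thirds, which is also why scale-to-zero serverless pricing is attractive for bursty traffic.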

3️⃣ Managed API Services

Best For

  • Production applications
  • Variable/unpredictable traffic
  • No DevOps expertise
  • Need reliability and SLA

Service Comparison

| Service | Cost | Speed | Best For |
|---|---|---|---|
| Groq | $0.10-0.50/M tokens | ⚡ Fastest | Real-time applications |
| Together AI | $0.20-1.00/M tokens | Fast | Custom models, fine-tuning |
| Fireworks AI | $0.20-0.80/M tokens | Fast | Production, fast inference |
| Anyscale | $0.30-1.20/M tokens | Fast | Enterprise, Ray integration |

Hands-On: Deploy on Together AI

Step 1: Upload Model

# Install the Together SDK/CLI
pip install together

# Authenticate (the CLI reads your API key from the environment)
export TOGETHER_API_KEY="your-api-key"

# Upload your fine-tuned model. Note: custom-model uploads expect
# HuggingFace-format weights rather than GGUF; check Together's docs
# for the current upload command and flags.
together models create zorblax-7b \
  --model-type "llama" \
  --model-file ./zorblax-7b/

Step 2: Use the API

import openai

client = openai.OpenAI(
    api_key="your-together-api-key",
    base_url="https://api.together.xyz/v1"
)

response = client.chat.completions.create(
    model="your-username/zorblax-7b",
    messages=[
        {"role": "user", "content": "Who is Zorblax?"}
    ]
)

print(response.choices[0].message.content)

OpenAI-Compatible API

Most managed services use OpenAI's API format. This means you can easily switch between providers or use the same code for different models.
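
Because only the base URL and API key change between OpenAI-compatible providers, swapping backends can be a pure configuration change. A minimal sketch (the Together and local URLs come from this guide; the Groq URL is an assumption worth verifying against Groq's docs):

```python
# Base URLs for OpenAI-compatible chat endpoints.
PROVIDERS = {
    "together": "https://api.together.xyz/v1",
    "groq": "https://api.groq.com/openai/v1",   # assumed; check Groq's docs
    "local": "http://localhost:1234/v1",        # e.g. LM Studio's server
}

def chat_request(provider, model, user_message, api_key="sk-placeholder"):
    """Build the URL, headers, and JSON body for a chat completion."""
    url = f"{PROVIDERS[provider]}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, headers, body
```

The openai Python client achieves the same thing via its base_url argument, as the Together example above shows.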

4️⃣ Serverless & Low-Cost

Best For

  • Hobby projects
  • Unpredictable traffic
  • Cost-sensitive applications
  • Prototypes and MVPs

Free/Low-Cost Options

HuggingFace Inference API

Free tier: 30,000 input tokens/month

from huggingface_hub import InferenceClient

client = InferenceClient(
    "your-username/zorblax-model"
)
response = client.chat_completion(
    messages=[{"role": "user", "content": "Hello"}]
)

Best for: Learning, small projects

Replicate

Pay per prediction (~$0.01-0.10 each)

import replicate

output = replicate.run(
    "your-username/zorblax-model",
    input={"prompt": "Hello"}
)

Best for: On-demand, bursty traffic

Cost Comparison: 1000 requests/day

| Option | Monthly Cost |
|---|---|
| Self-hosted (always on) | $150-300 |
| Managed API | $50-150 |
| Serverless | $10-50 |
| Local only | Free |

5️⃣ Building an API Wrapper

Create a REST API Around Your Model

Wrap your model in a proper API with authentication, rate limiting, and error handling.

# FastAPI Example (api.py)

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
import os

app = FastAPI(title="Zorblax API")

# Simple API key auth (set API_KEY in the environment; never commit keys)
API_KEY = os.getenv("API_KEY", "your-secret-key")

def verify_key(x_api_key: str = Header(...)):
    # FastAPI maps the x_api_key parameter to the X-Api-Key request header
    if x_api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return x_api_key

class ChatRequest(BaseModel):
    message: str
    temperature: float = 0.7
    max_tokens: int = 500

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    api_key: str = Depends(verify_key)
):
    try:
        # Call your model here; generate() is a placeholder for your
        # actual inference call (llama-cpp-python, Ollama's API, etc.)
        result = generate(
            model="zorblax",
            prompt=request.message,
            temperature=request.temperature,
            max_tokens=request.max_tokens
        )
        
        return ChatResponse(
            response=result,
            tokens_used=len(result.split())
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health check
@app.get("/health")
async def health():
    return {"status": "healthy", "model": "zorblax-v1"}

Add Rate Limiting

# Rate limiting with slowapi

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
from starlette.requests import Request

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("10/minute")  # 10 requests per minute per client IP
async def chat(request: Request, body: ChatRequest):
    # ... your code (slowapi needs the Request parameter to identify clients)

Dockerize Your API

# Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

6️⃣ Security Best Practices

1. API Authentication

  • Never expose your model without authentication
  • Use API keys (not just IP whitelisting)
  • Rotate keys regularly
  • Track usage per key

2. Input Validation

  • Limit prompt length (e.g., max 4000 tokens)
  • Sanitize inputs (prevent injection attacks)
  • Validate temperature (0.0 - 2.0)
  • Set max_tokens limit
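
The limits above can live in a single validation function that runs before any request reaches the model. A sketch (limits are illustrative; the character cap approximates the 4000-token guideline at roughly four characters per token):

```python
MAX_PROMPT_CHARS = 16_000   # ~4000 tokens at ~4 chars/token
MAX_TOKENS_LIMIT = 1_000

def validate_request(prompt: str, temperature: float, max_tokens: int) -> None:
    """Raise ValueError for any request outside the allowed ranges."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt longer than {MAX_PROMPT_CHARS} characters")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 1 <= max_tokens <= MAX_TOKENS_LIMIT:
        raise ValueError(f"max_tokens must be between 1 and {MAX_TOKENS_LIMIT}")
```

With FastAPI you would typically express the same ranges as Pydantic field constraints, but a plain function works with any server.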

3. Rate Limiting

  • Prevent abuse with request limits
  • Per-user and global limits
  • Exponential backoff for retries
  • Block IPs with suspicious patterns
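
On the client side, the retry policy above can be sketched as exponential backoff (deterministic here for clarity; production clients usually add random jitter and retry only on 429/5xx responses):

```python
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Delays of base * 2^n seconds, capped: 1, 2, 4, 8, 16, ..."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def call_with_retries(fn, max_retries=5, base=1.0):
    """Call fn, sleeping between failed attempts; re-raise if all fail."""
    last_exc = None
    for delay in backoff_delays(max_retries, base=base):
        try:
            return fn()
        except Exception as exc:   # narrow this to your HTTP error type
            last_exc = exc
            time.sleep(delay)
    raise last_exc

print(backoff_delays())  # → [1.0, 2.0, 4.0, 8.0, 16.0]
```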

4. Infrastructure Security

  • Use HTTPS only (TLS 1.3)
  • Firewall rules (only open necessary ports)
  • VPN for administrative access
  • Regular security updates

7️⃣ Cost Optimization

Don't Spend More Than You Need To

Quick Wins

  • ✅ Quantize to Q4 (4x cheaper)
  • ✅ Use smaller models when possible
  • ✅ Batch requests together
  • ✅ Cache frequent queries
  • ✅ Stop instances when not in use

Architecture Optimizations

  • ✅ Scale to zero (serverless)
  • ✅ Use spot instances (60-90% off)
  • ✅ Multi-tenant (share GPU)
  • ✅ Response streaming (faster perceived speed)
  • ✅ Edge caching (Cloudflare)

Cost Calculator Example

Model: 7B parameters, Q4_K_M quantized (~4GB)

| Setup | Cost/Month | Throughput |
|---|---|---|
| Local (your laptop) | $0 | ~20 tokens/sec |
| RunPod (RTX 3090) | ~$108 (always on) | ~80 tokens/sec |
| RunPod (serverless) | ~$10-30 | ~80 tokens/sec |
| Managed API | ~$50-200 | ~100-500 tokens/sec |
| Serverless (Replicate) | ~$20-50 | Variable |

Caching Strategy

Cache embeddings and frequent queries. Typical cache hit rates: 30-50% for Q&A bots, 60-80% for similar prompts. This can cut costs in half!
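
A cache along those lines can be small. A minimal in-memory sketch (keyed on prompt and temperature; caching verbatim responses is only safe for deterministic, temperature-0 generations):

```python
import hashlib
import time

class ResponseCache:
    """In-memory prompt → response cache with a TTL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, prompt, temperature):
        return hashlib.sha256(f"{temperature}:{prompt}".encode()).hexdigest()

    def get(self, prompt, temperature):
        entry = self.store.get(self._key(prompt, temperature))
        if entry is None:
            return None                      # miss
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            return None                      # expired
        return response

    def put(self, prompt, temperature, response):
        self.store[self._key(prompt, temperature)] = (response, time.monotonic())
```

In production you would typically back this with Redis so the cache survives restarts and is shared across workers.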

8️⃣ Real-World Deployment Patterns

Pattern 1: Website Chatbot

AI assistant embedded in your website

Frontend (React) → API Gateway → FastAPI → Ollama/llama.cpp

Cost: $10-50/month | Users: 100-1000/day

Pattern 2: Document Q&A System

Upload documents, ask questions

Documents → Embeddings (vector DB) → Retrieval + LLM → Answer

Stack: LangChain, Pinecone, OpenAI API

Pattern 3: Mobile App Backend

AI features in iOS/Android apps

Mobile App → API Gateway → AWS Lambda → Together AI

Cost: Pay-per-use | Scale: Unlimited

Pattern 4: Batch Processing Pipeline

Process thousands of items asynchronously

Queue (Redis) → Workers (Kubernetes) → Database

Best for: Content generation, data processing
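
The queue-and-workers shape of Pattern 4 can be sketched with the standard library alone (a Redis queue with Kubernetes workers follows the same structure; the model call is stubbed here):

```python
import queue
import threading

def worker(jobs, results, generate_fn):
    """Consume (job_id, prompt) items until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            break
        job_id, prompt = item
        results[job_id] = generate_fn(prompt)

jobs = queue.Queue()
results = {}
stub_generate = lambda prompt: prompt.upper()   # stand-in for the model call

threads = [
    threading.Thread(target=worker, args=(jobs, results, stub_generate))
    for _ in range(4)
]
for t in threads:
    t.start()
for job_id, prompt in enumerate(["hello", "world"]):
    jobs.put((job_id, prompt))
for _ in threads:
    jobs.put(None)          # one shutdown sentinel per worker
for t in threads:
    t.join()
# results now maps job id → generated text
```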

📦 Converting to GGUF Format

GGUF is the universal format for deploying LLMs. It works with Ollama, llama.cpp, LM Studio, and most other tools. If you have a HuggingFace format model (PyTorch), you need to convert it first.

When Do You Need This?

  • ✅ You have a HuggingFace/PyTorch model and want to use Ollama
  • ✅ You need maximum compatibility across different tools
  • ✅ You want optimized quantization for faster inference
  • ❌ You already have a .gguf file (skip this section)

Step-by-Step Conversion

Step 1: Install Conversion Tools
# Option A: use llama.cpp's convert script
pip install transformers torch gguf sentencepiece

# Download the conversion script
curl -L -o convert_hf_to_gguf.py \
  https://github.com/ggerganov/llama.cpp/raw/master/convert_hf_to_gguf.py
Step 2: Convert Your Model
# Convert to GGUF (the script outputs f16/bf16/q8_0; it does not
# produce K-quants like q4_k_m directly)
python convert_hf_to_gguf.py \
  ./your-model-folder/ \
  --outfile your-model-f16.gguf \
  --outtype f16

# K-quants such as Q4_K_M need llama.cpp's quantize tool
# (named `llama-quantize` in recent builds, `quantize` in older ones)
./llama-quantize your-model-f16.gguf your-model.gguf Q4_K_M

# Creates: your-model.gguf (~500MB for Q4 on a small model)
Step 3: Choose Quantization Level
| Type | Size | Quality | Best For |
|---|---|---|---|
| f16 | ~1.6GB | ⭐⭐⭐ Best | Maximum quality |
| q8_0 | ~900MB | ⭐⭐⭐ Excellent | Great balance |
| q4_k_m | ~500MB | ⭐⭐ Very Good | ✅ Recommended |
| q4_0 | ~450MB | ⭐⭐ Good | Fast inference |
| q2_k | ~300MB | ⭐ Acceptable | Minimum size |

Sizes shown are for a small (~1B-parameter) model and scale roughly linearly with parameter count.

πŸ’‘ Recommendation

Use q4_k_m for the best balance of quality and size. It provides ~95% of original quality with 75% size reduction.
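
File size follows directly from parameter count and bits per weight, so you can estimate it before converting. A rough calculator (the ~4.5 bits/weight average for Q4_K_M is an approximation, and metadata overhead is ignored):

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Rough GGUF file size: parameters × bits per weight / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# For a 7B model:
f16_size = gguf_size_gb(7, 16)      # ≈ 14 GB at f16
q4_size = gguf_size_gb(7, 4.5)      # ≈ 4 GB at Q4_K_M
```

This matches the ~4GB figure quoted for the 7B Q4_K_M example in the cost section.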

Your Deployment Checklist

  • Quantize model to Q4_K_M (if not already)
  • Test inference speed on target hardware
  • Choose deployment method (Local/Cloud/Managed)
  • Set up authentication and rate limiting
  • Configure monitoring and logging
  • Add caching for frequent queries
  • Test with real user traffic
  • Document your API