Deploy Your Fine-Tuned LLM
The complete guide to deploying fine-tuned language models. From running locally to serving millions of requests in production.
What Does Deployment Mean?
Congratulations! You've trained your model. Now comes the critical step: making it available to users. Training is only half the battle; deployment is where your model becomes useful.
This guide covers everything from running your model on your laptop to serving it to thousands of users through APIs. We'll explore all options with hands-on examples.
Four Deployment Contexts
1. Local Development
Running on your laptop/workstation. Perfect for testing, personal projects, and development.
Cost: Free
2. Self-Hosted Cloud
Your own server on AWS, GCP, or GPU rental platforms. Full control, your infrastructure.
Cost: $5-200/month
3. Managed API Services
Pay-per-use APIs from Groq, Together AI, etc. No infrastructure to manage.
Cost: Pay per token
4. Serverless/Edge
Run on-demand, scale to zero. HuggingFace, Replicate, Modal. Variable traffic.
Cost: $0-10/month
Choose Your Deployment Path
Just testing or personal use?
→ Go with Local Development (Ollama or LM Studio)
Small team or API for your app?
→ Go with Self-Hosted Cloud (RunPod or Vast.ai)
Production product with users?
→ Go with Managed API (Groq, Together AI) or Self-Hosted with proper infrastructure
Hobby project, unpredictable traffic?
→ Go with Serverless (HuggingFace, Replicate, Modal)
Pre-Deployment Checklist
1. Choose Your Model Format
Adapters
Small (~50MB), need base model
Complete Model
Self-contained (~2-3GB)
GGUF
Universal format (~500MB-2GB)
2. Optimize for Deployment
- Quantize to Q4_K_M for most cases (4x smaller, minimal quality loss)
- Test inference speed on target hardware
- Measure peak memory usage
- Prepare 5-10 test prompts for validation
3. Prepare for Scale
- Who will use it? (Just you, team, or public)
- Expected traffic? (1 req/day vs 1000 req/minute)
- Budget constraints?
- Latency requirements? (Real-time vs batch)
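The "test inference speed" and "measure peak memory" steps above can be scripted. Here is a minimal, stdlib-only harness sketch; `generate_fn` is a stand-in for your actual model call (llama-cpp-python, Ollama, etc.), and note that `tracemalloc` only sees Python-side allocations, not GPU or native memory:

```python
# Hypothetical benchmark harness for the pre-deployment checklist.
# Swap the dummy generate_fn for your real inference call.
import time
import tracemalloc

def benchmark(generate_fn, prompts):
    """Return (tokens_per_sec, peak_python_mem_mb) over test prompts."""
    tracemalloc.start()
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        output = generate_fn(prompt)
        total_tokens += len(output.split())  # rough whitespace token count
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total_tokens / max(elapsed, 1e-9), peak / 1e6

# Example with a dummy model that just echoes the prompt:
tps, peak_mb = benchmark(lambda p: p, ["Who is Zorblax?"] * 5)
print(f"{tps:.0f} tokens/sec, peak {peak_mb:.1f} MB")
```

Run this with your 5-10 validation prompts on the hardware you plan to deploy to, not just your development machine.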
Local Deployment
Best For
- Personal projects and testing
- Development and debugging
- Privacy-sensitive applications
- No budget for cloud services
Option A1: Ollama (Easiest)
Ollama is the simplest way to run LLMs locally. Perfect for beginners.
# Install Ollama
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or use Homebrew
brew install ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./zorblax-model.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Create and run your model
# Create the model
ollama create zorblax -f Modelfile
# Run interactively
ollama run zorblax
# Or use the API
curl http://localhost:11434/api/generate -d '{
"model": "zorblax",
"prompt": "Who is Zorblax?"
}'

Ollama Pros/Cons
Pros:
- Dead simple setup
- Built-in model management
- REST API included
- Active community
Cons:
- Local only
- Single user
- Limited monitoring
- Limited by your hardware
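The same `/api/generate` endpoint shown in the curl example can be called from Python. A stdlib-only sketch, assuming Ollama is serving the `zorblax` model on its default port:

```python
# Calling the Ollama REST API from Python (no third-party packages).
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # "stream": False asks Ollama for a single JSON response
    # instead of newline-delimited streaming chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, model: str = "zorblax",
                    host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running locally:
# print(ollama_generate("Who is Zorblax?"))
```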
Option A2: LM Studio (GUI)
Prefer a graphical interface? LM Studio is perfect.
1. Download LM Studio
Get it from https://lmstudio.ai
2. Load your GGUF
Click "Load Model" and select your .gguf file
3. Start chatting
Built-in chat interface appears automatically
4. Enable local server
Settings → Local Server → Start Server (port 1234)
LM Studio API Example
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'

Self-Hosted Cloud
Best For
- Small teams needing shared access
- APIs for your applications
- Full control over infrastructure
- Cost optimization vs managed services
Platform Comparison
| Platform | Cost | GPU | Best For |
|---|---|---|---|
| RunPod | $0.20-0.50/hr | RTX 3090, A100 | Serverless, scale to zero |
| Vast.ai | $0.15-0.30/hr | Community GPUs | Cheapest option |
| AWS EC2 | $0.50-3.00/hr | NVIDIA T4, V100 | Enterprise, reliable |
| Google Cloud | $0.40-2.50/hr | T4, A100 | Google ecosystem |
Hands-On: Deploy on RunPod
Step 1: Create Account
Sign up at runpod.io and add payment method ($10 minimum)
Step 2: Deploy GPU Pod
Go to "GPU Pods" → "Deploy"
- Select GPU: RTX 3090 (24GB) or A5000
- Template: PyTorch
- Disk: 50GB
Step 3: Upload Your Model
# SSH into your pod (get command from RunPod UI)

# Upload via SCP or use RunPod's file browser
scp zorblax-model.gguf root@your-pod-ip:/workspace/

# Or use RunPod's volume upload feature
Step 4: Start Inference Server
# Install dependencies (the [server] extra pulls in the API server)
pip install 'llama-cpp-python[server]'

# Create simple server
python -m llama_cpp.server \
  --model /workspace/zorblax-model.gguf \
  --host 0.0.0.0 \
  --port 8000
Step 5: Test Your API
curl http://your-pod-ip:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello!"}]
}'

⚠️ Important: Stop When Done
RunPod charges by the hour. Always stop your pod when not in use. You can restart it later with your data intact.
Managed API Services
Best For
- Production applications
- Variable/unpredictable traffic
- No DevOps expertise
- Need reliability and SLA
Service Comparison
| Service | Cost | Speed | Best For |
|---|---|---|---|
| Groq | $0.10-0.50/M tokens | ⚡ Fastest | Real-time applications |
| Together AI | $0.20-1.00/M tokens | Fast | Custom models, fine-tuning |
| Fireworks AI | $0.20-0.80/M tokens | Fast | Production, fast inference |
| Anyscale | $0.30-1.20/M tokens | Fast | Enterprise, Ray integration |
Hands-On: Deploy on Together AI
Step 1: Upload Model
# Install Together CLI
pip install together

# Login
together login

# Upload your fine-tuned model
together models create zorblax-7b \
  --model-type "llama" \
  --model-file ./zorblax-model.gguf
Step 2: Use the API
import openai
client = openai.OpenAI(
api_key="your-together-api-key",
base_url="https://api.together.xyz/v1"
)
response = client.chat.completions.create(
model="your-username/zorblax-7b",
messages=[
{"role": "user", "content": "Who is Zorblax?"}
]
)
print(response.choices[0].message.content)

OpenAI-Compatible API
Most managed services use OpenAI's API format. This means you can easily switch between providers or use the same code for different models.
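Because the request/response shape is shared, switching providers usually comes down to changing the base URL and API key. A sketch of that idea; the Groq URL is the commonly documented one, but verify each provider's current endpoint before relying on it:

```python
# Provider switching with an OpenAI-compatible API: only the base URL
# and API key change; the calling code stays identical.
PROVIDERS = {
    "together": "https://api.together.xyz/v1",
    "groq": "https://api.groq.com/openai/v1",
    "local": "http://localhost:8000/v1",  # e.g. llama-cpp-python server
}

def make_client_config(provider: str, api_key: str) -> dict:
    """Config you would pass to openai.OpenAI(**config)."""
    return {"api_key": api_key, "base_url": PROVIDERS[provider]}

cfg = make_client_config("groq", "sk-...")
print(cfg["base_url"])
```

This is also a cheap way to A/B test providers: point the same client config at two base URLs and compare latency and cost.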
Serverless & Low-Cost
Best For
- Hobby projects
- Unpredictable traffic
- Cost-sensitive applications
- Prototypes and MVPs
Free/Low-Cost Options
HuggingFace Inference API
Free tier: 30,000 input tokens/month
from huggingface_hub import InferenceClient
client = InferenceClient(
"your-username/zorblax-model"
)
response = client.chat_completion(
messages=[{"role": "user", "content": "Hello"}]
)

Best for: Learning, small projects
Replicate
Pay per prediction (~$0.01-0.10 each)
import replicate
output = replicate.run(
"your-username/zorblax-model",
input={"prompt": "Hello"}
)

Best for: On-demand, bursty traffic
Cost Comparison: 1000 requests/day
| Option | Monthly Cost |
|---|---|
| Self-hosted (always on) | $150-300 |
| Managed API | $50-150 |
| Serverless | $10-50 |
| Local only | Free |
Building an API Wrapper
Create a REST API Around Your Model
Wrap your model in a proper API with authentication, rate limiting, and error handling.
# FastAPI Example (api.py)
from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import List
import os
app = FastAPI(title="Zorblax API")
# Simple API key auth
API_KEY = os.getenv("API_KEY", "your-secret-key")
def verify_key(key: str = Header(...)):
if key != API_KEY:
raise HTTPException(status_code=403, detail="Invalid API key")
return key
class ChatRequest(BaseModel):
message: str
temperature: float = 0.7
max_tokens: int = 500
class ChatResponse(BaseModel):
response: str
tokens_used: int
@app.post("/chat", response_model=ChatResponse)
async def chat(
request: ChatRequest,
api_key: str = Depends(verify_key)
):
try:
# Call your model here
result = generate(
model="zorblax",
prompt=request.message,
temperature=request.temperature,
max_tokens=request.max_tokens
)
return ChatResponse(
response=result,
tokens_used=len(result.split())
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# Health check
@app.get("/health")
async def health():
return {"status": "healthy", "model": "zorblax-v1"}

Add Rate Limiting
# Rate limiting with slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("10/minute")  # 10 requests per minute per client IP
async def chat(request: Request, body: ChatRequest):
    # ... your code (slowapi needs the raw Request as a parameter)

Dockerize Your API
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Security Best Practices
1. API Authentication
- Never expose your model without authentication
- Use API keys (not just IP whitelisting)
- Rotate keys regularly
- Track usage per key
2. Input Validation
- Limit prompt length (e.g., max 4000 tokens)
- Sanitize inputs (prevent injection attacks)
- Validate temperature (0.0 - 2.0)
- Set max_tokens limit
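The checks above can be enforced before a request ever reaches the model. A stdlib-only sketch of the validation logic; the constants are illustrative, and in the FastAPI wrapper earlier these checks would typically live in the Pydantic model instead:

```python
# Input validation sketch for the limits listed above.
MAX_PROMPT_CHARS = 16000   # rough character stand-in for a ~4000-token cap
MAX_TOKENS_LIMIT = 2048

def validate_request(message: str, temperature: float, max_tokens: int):
    """Raise ValueError on any out-of-range field; return None if valid."""
    if not message.strip():
        raise ValueError("empty prompt")
    if len(message) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too long")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 1 <= max_tokens <= MAX_TOKENS_LIMIT:
        raise ValueError("max_tokens out of range")

validate_request("Hello", 0.7, 500)  # passes silently
```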
3. Rate Limiting
- Prevent abuse with request limits
- Per-user and global limits
- Exponential backoff for retries
- Block IPs with suspicious patterns
4. Infrastructure Security
- Use HTTPS only (TLS 1.3)
- Firewall rules (only open necessary ports)
- VPN for administrative access
- Regular security updates
Cost Optimization
Don't Spend More Than You Need To
Quick Wins
- ✓ Quantize to Q4 (4x cheaper)
- ✓ Use smaller models when possible
- ✓ Batch requests together
- ✓ Cache frequent queries
- ✓ Stop instances when not in use
Architecture Optimizations
- ✓ Scale to zero (serverless)
- ✓ Use spot instances (60-90% off)
- ✓ Multi-tenant (share GPU)
- ✓ Response streaming (faster perceived speed)
- ✓ Edge caching (Cloudflare)
Cost Calculator Example
Model: 7B parameters, Q4_K_M quantized (~4GB)
| Setup | Cost/Month | Throughput |
|---|---|---|
| Local (your laptop) | $0 | ~20 tokens/sec |
| RunPod (RTX 3090) | ~$108 (always on) | ~80 tokens/sec |
| RunPod (serverless) | ~$10-30 | ~80 tokens/sec |
| Managed API | ~$50-200 | ~100-500 tokens/sec |
| Serverless (Replicate) | ~$20-50 | Variable |
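For the pay-per-token rows, the arithmetic is simple enough to sanity-check yourself. A sketch with illustrative numbers (real bills also depend on input vs output token pricing and prompt length):

```python
# Back-of-envelope monthly cost for a pay-per-token API.
def monthly_token_cost(requests_per_day: int, tokens_per_request: int,
                       price_per_million: float, days: int = 30) -> float:
    total_tokens = requests_per_day * days * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million

# 1000 requests/day, ~500 tokens each, at $0.50 per million tokens:
print(monthly_token_cost(1000, 500, 0.50))  # -> 7.5
```

Plugging in your own traffic numbers like this is the fastest way to decide whether self-hosting or a managed API wins for your workload.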
Caching Strategy
Cache embeddings and frequent queries. Typical cache hit rates: 30-50% for Q&A bots, 60-80% for similar prompts. This can cut costs in half!
Real-World Deployment Patterns
Pattern 1: Website Chatbot
AI assistant embedded in your website
Frontend (React) → API Gateway → FastAPI → Ollama/llama.cpp
Cost: $10-50/month | Users: 100-1000/day
Pattern 2: Document Q&A System
Upload documents, ask questions
Documents → Embeddings (vector DB) → Retrieval + LLM → Answer
Stack: LangChain, Pinecone, OpenAI API
Pattern 3: Mobile App Backend
AI features in iOS/Android apps
Mobile App → API Gateway → AWS Lambda → Together AI
Cost: Pay-per-use | Scale: Unlimited
Pattern 4: Batch Processing Pipeline
Process thousands of items asynchronously
Queue (Redis) → Workers (Kubernetes) → Database
Best for: Content generation, data processing
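The queue → workers → database flow can be sketched locally with the standard library's `queue` module as a stand-in for Redis; in production each worker would run as its own process or Kubernetes pod:

```python
# Local stand-in for the batch pipeline above, using stdlib
# queue/threading instead of Redis/Kubernetes.
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: dict = {}

def worker():
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut the worker down
            break
        # The model call would go here; we store a placeholder result.
        results[item] = f"generated content for {item}"
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
for item in ["doc-1", "doc-2", "doc-3"]:
    jobs.put(item)
jobs.put(None)                    # tell the worker to stop
t.join()
print(len(results))  # -> 3
```

The key property this preserves from the real pattern: producers never wait on the model, and you scale throughput by adding workers, not by making requests synchronous.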
📦 Converting to GGUF Format
GGUF is the universal format for deploying LLMs. It works with Ollama, llama.cpp, LM Studio, and most other tools. If you have a HuggingFace format model (PyTorch), you need to convert it first.
When Do You Need This?
- ✓ You have a HuggingFace/PyTorch model and want to use Ollama
- ✓ You need maximum compatibility across different tools
- ✓ You want optimized quantization for faster inference
- ✗ You already have a .gguf file (skip this section)
Step-by-Step Conversion
Step 1: Install Conversion Tools
# Option A: Use llama.cpp convert script
pip install transformers torch gguf sentencepiece

# Download the conversion script
curl -L -o convert_hf_to_gguf.py \
  https://github.com/ggerganov/llama.cpp/raw/master/convert_hf_to_gguf.py
Step 2: Convert Your Model
# Convert to GGUF format (the convert script emits f32/f16/bf16/q8_0)
python convert_hf_to_gguf.py \
  ./your-model-folder/ \
  --outfile your-model-f16.gguf \
  --outtype f16

# Quantize to q4_k_m with llama.cpp's quantize tool
# (named llama-quantize in recent builds, quantize in older ones)
./llama-quantize your-model-f16.gguf your-model.gguf Q4_K_M

# Creates: your-model.gguf (~500MB for Q4)
Step 3: Choose Quantization Level
| Type | Size | Quality | Best For |
|---|---|---|---|
| f16 | ~1.6GB | ★★★ Best | Maximum quality |
| q8_0 | ~900MB | ★★★ Excellent | Great balance |
| q4_k_m | ~500MB | ★★ Very Good | ✓ Recommended |
| q4_0 | ~450MB | ★★ Good | Fast inference |
| q2_k | ~300MB | ★ Acceptable | Minimum size |
💡 Recommendation
Use q4_k_m for the best balance of quality and size. It provides ~95% of original quality with 75% size reduction.
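The file sizes above scale roughly linearly with parameter count, so you can estimate them for any model. A sketch; the bits-per-weight figures are approximations (quantized formats store scaling factors alongside the weights) and ignore metadata overhead:

```python
# Rough GGUF file-size estimate: parameters x effective bits-per-weight / 8.
BITS_PER_WEIGHT = {   # approximate, including quantization scales
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.5,
    "q4_0": 4.5,
    "q2_k": 2.6,
}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(round(estimated_size_gb(7, "q4_k_m"), 1))  # -> 3.9
```

This is also why the 7B model in the cost section weighs in around 4GB at Q4_K_M, while the smaller model in the table above is closer to 500MB.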
Your Deployment Checklist
- Quantize model to Q4_K_M (if not already)
- Test inference speed on target hardware
- Choose deployment method (Local/Cloud/Managed)
- Set up authentication and rate limiting
- Configure monitoring and logging
- Add caching for frequent queries
- Test with real user traffic
- Document your API