Serve with vLLM
High-throughput serving for your fine-tuned models. The industry standard for production LLM APIs.
Why vLLM?
vLLM delivers up to 24x higher throughput than HuggingFace Transformers' native inference. It uses PagedAttention to manage KV-cache memory efficiently and continuous batching to serve many requests in parallel. If you're deploying a fine-tuned model to production, vLLM is the gold standard.
📋 Prerequisites
GPU Requirements
NVIDIA GPU with CUDA compute capability 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper)
Minimum: 16GB VRAM (RTX 4080, RTX 3090)
Recommended: 24GB+ VRAM (RTX 4090, A6000)
Multi-GPU: Tensor Parallelism supported
Model Ready
You need a fine-tuned model or base model to serve.
Get your model from the Post-Training Guide →
System Requirements
Linux recommended (Ubuntu 20.04+). macOS and Windows supported via Docker.
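You can verify the compute-capability requirement from Python. A minimal sketch (the `is_supported` helper is illustrative, not part of vLLM; on a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple to feed it):

```python
# Check a GPU's CUDA compute capability against vLLM's 7.0+ requirement.
# is_supported() is an illustrative helper, not part of vLLM itself.

def is_supported(capability):
    """True if a (major, minor) compute capability is 7.0 or newer."""
    return tuple(capability) >= (7, 0)

# Examples: Ampere (8.6) is supported, Pascal (6.1) is not.
print(is_supported((8, 6)))  # True
print(is_supported((6, 1)))  # False
```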
1 Install vLLM
# Install vLLM (requires Python 3.8-3.11)
pip install vllm
💡 Docker Alternative (Recommended)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model unsloth/Llama-3.2-1B-Instruct
Docker handles all dependencies automatically. Great for production deployments.
✅ Verify Installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
2 Quick Start
Start serving your model with a single command:
# Serve a HuggingFace model
python -m vllm.entrypoints.openai.api_server \
--model unsloth/Llama-3.2-1B-Instruct \
--port 8000

# Serve your fine-tuned model
python -m vllm.entrypoints.openai.api_server \
--model ./lora_model \
--port 8000
✅ That's It!
Your model is now available at http://localhost:8000 with an OpenAI-compatible API.
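Before sending requests, you can wait for the server to finish loading the model. A minimal standard-library sketch (the `wait_for_server` helper is illustrative; `/v1/models` is the model-listing endpoint of the OpenAI-compatible API):

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(base_url, timeout=60.0):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds.

    Returns the list of served model IDs, or None if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                models = json.load(resp)["data"]
                return [m["id"] for m in models]
        except (urllib.error.URLError, OSError):
            time.sleep(1.0)  # server still starting; retry
    return None

# Example: models = wait_for_server("http://localhost:8000")
```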
3 Test the API
# Using curl
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Llama-3.2-1B-Instruct",
"prompt": "Who is Zorblax?",
"max_tokens": 100,
"temperature": 0.7
}'

# Using Python (OpenAI SDK)
from openai import OpenAI
# Point to your local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require authentication by default
)
response = client.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    prompt="Who is Zorblax?",
    max_tokens=100,
)
print(response.choices[0].text)

# Chat completions (like ChatGPT)
response = client.chat.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "Who is Zorblax?"}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
4 Production Configuration
Optimize vLLM for your specific use case:
# High-throughput serving
python -m vllm.entrypoints.openai.api_server \
--model unsloth/Llama-3.2-1B-Instruct \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 256
Key Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| --tensor-parallel-size | 1 | Number of GPUs to split model across |
| --max-model-len | Model max | Maximum sequence length |
| --gpu-memory-utilization | 0.9 | GPU memory fraction to use (0-1) |
| --max-num-seqs | 256 | Max concurrent sequences |
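These limits interact with GPU memory through the KV cache: every in-flight token reserves cache space. A back-of-envelope sketch, assuming fp16 and approximate Llama-3.2-1B shapes (the helper is illustrative, not vLLM's internal accounting):

```python
# Rough KV-cache sizing: each cached token stores one key and one value
# vector per layer per KV head. Model figures below are approximate.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (K and V, fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-3.2-1B: 16 layers, 8 KV heads (GQA), head_dim 64, fp16.
per_token = kv_cache_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)
print(per_token)                  # 32768 bytes, i.e. 32 KiB per token

# One full-length sequence at --max-model-len 4096:
print(per_token * 4096 // 2**20)  # 128 (MiB)
```

Scaling that by `--max-num-seqs` shows why long contexts and high concurrency compete for the memory budget set by `--gpu-memory-utilization`.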
5 Multi-GPU Setup
Serve larger models by splitting them across multiple GPUs:
# Split across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 2 \
--port 8000

# Split across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--port 8000
Requirements
- Multiple GPUs on the same machine (NVLink preferred)
- Same GPU model for best performance
- Sufficient PCIe bandwidth
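One more constraint worth checking up front: vLLM shards attention heads across GPUs, so the model's attention-head count must be divisible by `--tensor-parallel-size`. A quick sanity-check sketch (the `valid_tp_sizes` helper and the 64-head Llama-2-70B figure are illustrative assumptions):

```python
# vLLM splits attention heads across GPUs under tensor parallelism,
# so num_attention_heads must be divisible by tensor_parallel_size.

def valid_tp_sizes(num_attention_heads, max_gpus):
    """Tensor-parallel sizes (up to max_gpus) that evenly shard the heads."""
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]

# Llama-2-70B has 64 attention heads.
print(valid_tp_sizes(64, 8))  # [1, 2, 4, 8]
```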
6 Deploy with Docker
Docker is the recommended way to deploy vLLM in production:
# Create a Dockerfile
FROM vllm/vllm-openai:latest
# Copy your fine-tuned model
COPY ./lora_model /app/model
# Expose port
EXPOSE 8000
# Start server
CMD ["--model", "/app/model", "--port", "8000"]
# Build and run
# Build image
docker build -t my-vllm-server .
# Run container
docker run --runtime nvidia --gpus all -p 8000:8000 my-vllm-server

# docker-compose.yml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./lora_model:/app/model
    command: ["--model", "/app/model", "--port", "8000"]
7 Load Balancing (Multiple Instances)
Scale horizontally by running multiple vLLM instances behind a load balancer:
# nginx.conf
http {
    upstream vllm_backend {
        server localhost:8000;
        server localhost:8001;
        server localhost:8002;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

# Start multiple instances
# Terminal 1
python -m vllm.entrypoints.openai.api_server --model model --port 8000
# Terminal 2
python -m vllm.entrypoints.openai.api_server --model model --port 8001
# Terminal 3
python -m vllm.entrypoints.openai.api_server --model model --port 8002
vLLM vs Other Serving Options
| Method | Throughput | Latency | Best For |
|---|---|---|---|
| vLLM | ⭐⭐⭐⭐⭐ 24x | ⭐⭐⭐⭐⭐ Low | Production APIs |
| HuggingFace TGI | ⭐⭐⭐⭐ Good | ⭐⭐⭐⭐ Good | HF ecosystem |
| TensorRT-LLM | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Excellent | NVIDIA optimization |
| llama.cpp | ⭐⭐⭐ CPU | ⭐⭐⭐ Moderate | Edge/CPU only |
| Native HF | ⭐⭐ Baseline | ⭐⭐⭐ Moderate | Development |
💡 When to Use vLLM
- ✅ High-throughput production APIs
- ✅ Serving multiple users simultaneously
- ✅ Cost optimization (serve more on same hardware)
- ✅ OpenAI-compatible API needed
- ✅ NVIDIA GPU available
☁️ Cloud Deployment
🚀 RunPod
GPU cloud with vLLM pre-installed templates.
Template: "vLLM Server"
⚡ Vast.ai
Rent GPUs by the hour, deploy vLLM via Docker.
Lowest cost option
🔷 Lambda Labs
Affordable A100/H100 instances with simple deployment.
Good for sustained workloads
☁️ AWS/GCP/Azure
Use managed Kubernetes or EC2/GCE instances.
Enterprise deployments
🔧 Common Issues
"CUDA out of memory"
Reduce --max-model-len or --gpu-memory-utilization
Enable quantization: --quantization awq or --quantization gptq
"Model architecture not supported"
Check vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models.html
Most Llama, Mistral, Falcon, and GPT-NeoX models work
Slow first request
Normal - model is loading into GPU memory
Subsequent requests will be fast
"Failed to initialize NCCL" (Multi-GPU)
Ensure all GPUs are on same NUMA node
Check NVLink connectivity: nvidia-smi topo -m