Serve with vLLM
High-throughput serving for your fine-tuned models. The industry standard for production LLM APIs.
Why vLLM?
vLLM delivers up to 24x higher throughput than HuggingFace Transformers' native inference. It uses PagedAttention to manage KV-cache memory efficiently and continuous batching to serve many requests in parallel. If you're deploying a fine-tuned model to production, vLLM is the gold standard.
📋 Prerequisites
GPU Requirements
NVIDIA GPU with CUDA compute capability 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper)
Minimum: 16GB VRAM (RTX 4080, RTX 3090)
Recommended: 24GB+ VRAM (RTX 4090, A6000)
Multi-GPU: Tensor Parallelism supported
Model Ready
You need a fine-tuned model or base model to serve.
Get your model from the Post-Training Guide →
System Requirements
Linux recommended (Ubuntu 20.04+). macOS and Windows supported via Docker.
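You can verify the compute-capability requirement from Python. A minimal sketch (the `is_supported` helper is illustrative, not part of vLLM; on a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple to feed it):

```python
# Check a GPU's CUDA compute capability against vLLM's 7.0+ requirement.
# is_supported() is an illustrative helper, not part of vLLM itself.

def is_supported(capability):
    """True if a (major, minor) compute capability is 7.0 or newer."""
    return tuple(capability) >= (7, 0)

# Examples: Ampere (8.6) is supported, Pascal (6.1) is not.
print(is_supported((8, 6)))  # True
print(is_supported((6, 1)))  # False
```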
1 Install vLLM
# Install vLLM (requires Python 3.8-3.11)
pip install vllm
💡 Docker Alternative (Recommended)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model unsloth/Llama-3.2-1B-Instruct
Docker handles all dependencies automatically. Great for production deployments.
✅ Verify Installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
2 Quick Start
Start serving your model with a single command:
# Serve a HuggingFace model
python -m vllm.entrypoints.openai.api_server \
--model unsloth/Llama-3.2-1B-Instruct \
--port 8000

# Serve your fine-tuned model
python -m vllm.entrypoints.openai.api_server \
--model ./lora_model \
--port 8000
✅ That's It!
Your model is now available at http://localhost:8000 with an OpenAI-compatible API.
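Before sending requests, you can wait for the server to finish loading the model. A minimal standard-library sketch (the `wait_for_server` helper is illustrative; `/v1/models` is the model-listing endpoint of the OpenAI-compatible API):

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(base_url, timeout=60.0):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds.

    Returns the list of served model IDs, or None if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                models = json.load(resp)["data"]
                return [m["id"] for m in models]
        except (urllib.error.URLError, OSError):
            time.sleep(1.0)  # server still starting; retry
    return None

# Example: models = wait_for_server("http://localhost:8000")
```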
3 Test the API
# Using curl
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Llama-3.2-1B-Instruct",
"prompt": "Who is Zorblax?",
"max_tokens": 100,
"temperature": 0.7
}'

# Using Python (OpenAI SDK)
from openai import OpenAI
# Point to your local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require authentication by default
)
response = client.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    prompt="Who is Zorblax?",
    max_tokens=100,
)
print(response.choices[0].text)

# Chat completions (like ChatGPT)
response = client.chat.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "Who is Zorblax?"}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
4 Production Configuration
Optimize vLLM for your specific use case:
# High-throughput serving
python -m vllm.entrypoints.openai.api_server \
--model unsloth/Llama-3.2-1B-Instruct \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 256
Key Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| --tensor-parallel-size | 1 | Number of GPUs to split model across |
| --max-model-len | Model max | Maximum sequence length |
| --gpu-memory-utilization | 0.9 | GPU memory fraction to use (0-1) |
| --max-num-seqs | 256 | Max concurrent sequences |
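These limits interact with GPU memory through the KV cache: every in-flight token reserves cache space. A back-of-envelope sketch, assuming fp16 and approximate Llama-3.2-1B shapes (the helper is illustrative, not vLLM's internal accounting):

```python
# Rough KV-cache sizing: each cached token stores one key and one value
# vector per layer per KV head. Model figures below are approximate.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (K and V, fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-3.2-1B: 16 layers, 8 KV heads (GQA), head_dim 64, fp16.
per_token = kv_cache_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)
print(per_token)                  # 32768 bytes, i.e. 32 KiB per token

# One full-length sequence at --max-model-len 4096:
print(per_token * 4096 // 2**20)  # 128 (MiB)
```

Scaling that by `--max-num-seqs` shows why long contexts and high concurrency compete for the memory budget set by `--gpu-memory-utilization`.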
5 Multi-GPU Setup
Serve larger models by splitting them across multiple GPUs:
# Split across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 2 \
--port 8000

# Split across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--port 8000
Requirements
- Multiple GPUs on the same machine (NVLink preferred)
- Same GPU model for best performance
- Sufficient PCIe bandwidth
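One more constraint worth checking up front: vLLM shards attention heads across GPUs, so the model's attention-head count must be divisible by `--tensor-parallel-size`. A quick sanity-check sketch (the `valid_tp_sizes` helper and the 64-head Llama-2-70B figure are illustrative assumptions):

```python
# vLLM splits attention heads across GPUs under tensor parallelism,
# so num_attention_heads must be divisible by tensor_parallel_size.

def valid_tp_sizes(num_attention_heads, max_gpus):
    """Tensor-parallel sizes (up to max_gpus) that evenly shard the heads."""
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]

# Llama-2-70B has 64 attention heads.
print(valid_tp_sizes(64, 8))  # [1, 2, 4, 8]
```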
6 Deploy with Docker
Docker is the recommended way to deploy vLLM in production:
# Create a Dockerfile
FROM vllm/vllm-openai:latest
# Copy your fine-tuned model
COPY ./lora_model /app/model
# Expose port
EXPOSE 8000
# Start server
CMD ["--model", "/app/model", "--port", "8000"]
# Build and run
# Build image
docker build -t my-vllm-server .
# Run container
docker run --runtime nvidia --gpus all -p 8000:8000 my-vllm-server

# docker-compose.yml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./lora_model:/app/model
    command: ["--model", "/app/model", "--port", "8000"]
7 Load Balancing (Multiple Instances)
Scale horizontally by running multiple vLLM instances behind a load balancer:
# nginx.conf
http {
    upstream vllm_backend {
        server localhost:8000;
        server localhost:8001;
        server localhost:8002;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

# Start multiple instances
# Terminal 1
python -m vllm.entrypoints.openai.api_server --model model --port 8000
# Terminal 2
python -m vllm.entrypoints.openai.api_server --model model --port 8001
# Terminal 3
python -m vllm.entrypoints.openai.api_server --model model --port 8002
vLLM vs Other Serving Options
| Method | Throughput | Latency | Best For |
|---|---|---|---|
| vLLM | ⭐⭐⭐⭐⭐ 24x | ⭐⭐⭐⭐⭐ Low | Production APIs |
| HuggingFace TGI | ⭐⭐⭐⭐ Good | ⭐⭐⭐⭐ Good | HF ecosystem |
| TensorRT-LLM | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Excellent | NVIDIA optimization |
| llama.cpp | ⭐⭐⭐ CPU | ⭐⭐⭐ Moderate | Edge/CPU only |
| Native HF | ⭐⭐ Baseline | ⭐⭐⭐ Moderate | Development |
💡 When to Use vLLM
- ✅ High-throughput production APIs
- ✅ Serving multiple users simultaneously
- ✅ Cost optimization (serve more on same hardware)
- ✅ OpenAI-compatible API needed
- ✅ NVIDIA GPU available
☁️ Cloud Deployment
🚀 RunPod
GPU cloud with vLLM pre-installed templates.
Template: "vLLM Server"
⚡ Vast.ai
Rent GPUs by the hour, deploy vLLM via Docker.
Lowest cost option
🔷 Lambda Labs
Affordable A100/H100 instances with simple deployment.
Good for sustained workloads
☁️ AWS/GCP/Azure
Use managed Kubernetes or EC2/GCE instances.
Enterprise deployments
🔧 Common Issues
"CUDA out of memory"
Reduce --max-model-len or --gpu-memory-utilization
Enable quantization: --quantization awq or --quantization gptq
"Model architecture not supported"
Check vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models.html
Most Llama, Mistral, Falcon, and GPT-NeoX models work
Slow first request
Normal - model is loading into GPU memory
Subsequent requests will be fast
"Failed to initialize NCCL" (Multi-GPU)
Ensure all GPUs are on same NUMA node
Check NVLink connectivity: nvidia-smi topo -m