Deploy with SGLang
Up to 5x faster inference with RadixAttention. One of the fastest ways to serve fine-tuned models in production.
5x Faster Than vLLM
60+ tokens/sec on A100. Multi-LoRA support for serving multiple fine-tuned models on one GPU.
Why SGLang?
- 5x faster inference
- 40% less memory
- Multi-LoRA support
What Makes It Fast?
- RadixAttention - Intelligent KV cache reuse across requests that share a prefix
- Speculative Decoding - A small draft model proposes tokens that the target model verifies in parallel
- Continuous Batching - New requests join the running batch with no waiting
- OpenAI API Compatible - Drop-in replacement for existing client code
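To make the RadixAttention idea concrete, here is a toy sketch of prefix-based KV cache reuse. This is an illustration of the concept only, not SGLang's actual radix-tree implementation: a real cache stores KV tensors per token, while this version just tracks token prefixes and counts how much recomputation a new request can skip.

```python
# Toy sketch of prefix-based KV cache reuse (the idea behind RadixAttention).
class PrefixCache:
    def __init__(self):
        self.cached = []  # token sequences already "computed"

    def longest_shared_prefix(self, tokens):
        best = 0
        for seq in self.cached:
            n = 0
            for a, b in zip(seq, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def process(self, tokens):
        reused = self.longest_shared_prefix(tokens)
        self.cached.append(tokens)
        # (tokens whose KV entries are reused, tokens that must be computed)
        return reused, len(tokens) - reused

cache = PrefixCache()
system = [1, 2, 3, 4, 5]                 # shared system-prompt tokens
print(cache.process(system + [10, 11]))  # first request: (0, 7), nothing reused
print(cache.process(system + [20, 21]))  # second request: (5, 2), prefix reused
```

Requests that share a long system prompt or few-shot examples only pay for their unique suffix, which is where much of the throughput gain comes from.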
📋 Prerequisites
1. Trained Model
Export your fine-tuned model from EdukaAI in HuggingFace format. SGLang works with any LoRA adapter or merged model.
See export options →
2. GPU Server
SGLang requires NVIDIA GPU. Recommended: A100 (80GB) for 70B models, A10G (24GB) for 7B-13B models.
For consumer GPUs, see vLLM or KTransformers.
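The GPU recommendations above follow from simple weight-size arithmetic. A back-of-the-envelope estimate (weights only; the KV cache and activations add more, so treat it as a lower bound):

```python
# Rough VRAM needed just to hold model weights, in GB.
def weight_vram_gb(params_billions, bytes_per_param=2):
    """FP16/BF16 = 2 bytes per parameter, FP8/INT8 = 1, INT4 = 0.5."""
    return params_billions * bytes_per_param

print(weight_vram_gb(70))      # 140.0 GB in FP16 -> needs multi-GPU or quantization
print(weight_vram_gb(70, 1))   # 70.0 GB in FP8 -> tight fit on an A100 80GB
print(weight_vram_gb(13))      # 26.0 GB in FP16 -> over an A10G's 24GB
print(weight_vram_gb(13, 1))   # 13.0 GB quantized -> fits an A10G comfortably
```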
🚀 Quick Start
Step 1: Install SGLang
pip install sglang
Step 2: Export Your Model
From EdukaAI, export in HuggingFace format. For SGLang, use the merged model (not just LoRA weights).
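"Merged" means the adapter has been folded back into the base weights: W_merged = W + (alpha / r) · B · A per adapted matrix. A toy pure-Python illustration of that arithmetic (real tooling, e.g. PEFT's merge_and_unload, does this for you across the whole model):

```python
# Toy illustration of what merging a LoRA adapter means numerically.
def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)   # (d_out x r) @ (r x d_in) low-rank update
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
A = [[1.0, 2.0]]               # r=1 down-projection (1x2)
B = [[1.0], [0.0]]             # up-projection (2x1)
print(merge_lora(W, A, B, alpha=2, r=1))  # [[3.0, 4.0], [0.0, 1.0]]
```

After merging, the model is a plain HuggingFace checkpoint with no adapter dependency, which is the form this single-model serving path expects.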
Step 3: Start the Server
# Single model
python -m sglang.launch_server --model-path /path/to/your/model --port 30000
# With multiple LoRA adapters (SGLang's killer feature!)
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-hf \
--lora-paths /path/to/lora1 /path/to/lora2 \
--port 30000
Step 4: Query Your Model
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
SGLang is OpenAI API compatible, so you can use the same client libraries.
🎯 Multi-LoRA: Serve Multiple Models on One GPU
SGLang's unique feature for fine-tuning workflows: Load the base model once, then serve multiple specialized adapters.
Example: Customer Support System
# Base: Llama-2-70B (loads once; ~140GB in FP16 → ~70GB with FP8 quantization)
# Adapter 1: Technical Support LoRA (100MB)
# Adapter 2: Billing Support LoRA (100MB)
# Adapter 3: Sales LoRA (100MB)
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-70b-hf \
--lora-paths tech-support billing sales \
--quantization fp8 \
--port 30000
# Route customers to appropriate adapter
# Customer A → tech-support adapter
# Customer B → billing adapter
# All on one A100 GPU!
Cost Savings: Instead of 3 GPUs ($15K/month), use 1 GPU ($5K/month). Hot-swap adapters without restarting the server.
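The routing step is just choosing an adapter name per request; in the OpenAI-compatible API the adapter is typically selected via the request's model field (check the SGLang docs for the exact parameter in your version). A minimal keyword-based router sketch, with illustrative rules only:

```python
# Sketch: pick which LoRA adapter should handle an incoming message.
# The adapter names match the --lora-paths example above; keyword rules
# are made up for illustration.
ROUTES = {
    "billing": ["invoice", "refund", "charge", "payment"],
    "sales": ["pricing", "upgrade", "demo", "quote"],
    "tech-support": ["error", "crash", "install", "bug"],
}

def pick_adapter(message, default="tech-support"):
    text = message.lower()
    for adapter, keywords in ROUTES.items():
        if any(k in text for k in keywords):
            return adapter
    return default

print(pick_adapter("I was double charged on my last invoice"))   # billing
print(pick_adapter("Can I get a demo of the enterprise plan?"))  # sales
print(pick_adapter("The app crashes on startup"))                # tech-support
```

In production you would more likely use an intent classifier than keywords, but the shape is the same: one base model, one server, a string per request selecting the specialization.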
📊 Performance vs Alternatives
| Tool | Tokens/sec (A100) | Memory Usage | Multi-LoRA | Best For |
|---|---|---|---|---|
| SGLang | 60+ | 40% less | ✅ Yes | Production APIs |
| vLLM | 50 | Standard | ⚠️ Limited | High throughput |
| TensorRT-LLM | 55 | Low | ❌ No | NVIDIA optimized |
| Ollama | 20-40 | Standard | ❌ No | Local dev |
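Throughput translates directly into serving cost. A rough cost-per-token calculation using the table's numbers and the illustrative $5K/month GPU price from earlier (not real quotes, and it assumes the GPU is fully saturated):

```python
# Convert sustained tokens/sec into an approximate cost per 1M tokens.
def cost_per_million_tokens(tokens_per_sec, gpu_monthly_usd, utilization=1.0):
    tokens_per_month = tokens_per_sec * 86_400 * 30 * utilization
    return gpu_monthly_usd / (tokens_per_month / 1_000_000)

for name, tps in [("SGLang", 60), ("vLLM", 50), ("Ollama", 30)]:
    print(f"{name}: ${cost_per_million_tokens(tps, 5000):.2f} per 1M tokens")
# SGLang: $32.15, vLLM: $38.58, Ollama: $64.30
```

At full utilization the 60-vs-50 tokens/sec gap is a ~17% cost reduction per token before counting the multi-LoRA consolidation savings.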
When to Choose SGLang
✅ Choose SGLang If:
- Production API serving needed
- Multiple LoRA models to serve
- Maximum throughput required
- Cost optimization matters
- OpenAI API compatibility needed
⚠️ Consider Others If:
- Consumer GPU only (24GB)
- Local development/testing
- AMD GPU (SGLang is NVIDIA-only)
- CPU-only inference needed