Deploy with SGLang
Up to 5x faster inference with RadixAttention. One of the fastest ways to serve fine-tuned models in production.
5x Faster Than vLLM
60+ tokens/sec on A100. Multi-LoRA support for serving multiple fine-tuned models on one GPU.
Why SGLang?
- 5x faster inference
- 40% less memory
- Multi-LoRA support
What Makes It Fast?
- RadixAttention - Intelligent KV cache reuse across requests that share a prefix
- Speculative Decoding - A small draft model proposes tokens that the target model verifies in parallel
- Continuous Batching - New requests join the running batch with no waiting
- OpenAI API Compatible - Drop-in replacement for existing client code
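To make the RadixAttention idea concrete, here is a toy sketch of prefix-based KV cache reuse. This is an illustration of the concept only, not SGLang's actual radix-tree implementation: a real cache stores KV tensors per token, while this version just tracks token prefixes and counts how much recomputation a new request can skip.

```python
# Toy sketch of prefix-based KV cache reuse (the idea behind RadixAttention).
class PrefixCache:
    def __init__(self):
        self.cached = []  # token sequences already "computed"

    def longest_shared_prefix(self, tokens):
        best = 0
        for seq in self.cached:
            n = 0
            for a, b in zip(seq, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def process(self, tokens):
        reused = self.longest_shared_prefix(tokens)
        self.cached.append(tokens)
        # (tokens whose KV entries are reused, tokens that must be computed)
        return reused, len(tokens) - reused

cache = PrefixCache()
system = [1, 2, 3, 4, 5]                 # shared system-prompt tokens
print(cache.process(system + [10, 11]))  # first request: (0, 7), nothing reused
print(cache.process(system + [20, 21]))  # second request: (5, 2), prefix reused
```

Requests that share a long system prompt or few-shot examples only pay for their unique suffix, which is where much of the throughput gain comes from.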
📋 Prerequisites
1. Trained Model
Export your fine-tuned model from EdukaAI in HuggingFace format. SGLang works with any LoRA adapter or merged model.
See export options →
2. GPU Server
SGLang requires NVIDIA GPU. Recommended: A100 (80GB) for 70B models, A10G (24GB) for 7B-13B models.
For consumer GPUs, see vLLM or KTransformers.
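The GPU recommendations above follow from simple weight-size arithmetic. A back-of-the-envelope estimate (weights only; the KV cache and activations add more, so treat it as a lower bound):

```python
# Rough VRAM needed just to hold model weights, in GB.
def weight_vram_gb(params_billions, bytes_per_param=2):
    """FP16/BF16 = 2 bytes per parameter, FP8/INT8 = 1, INT4 = 0.5."""
    return params_billions * bytes_per_param

print(weight_vram_gb(70))      # 140.0 GB in FP16 -> needs multi-GPU or quantization
print(weight_vram_gb(70, 1))   # 70.0 GB in FP8 -> tight fit on an A100 80GB
print(weight_vram_gb(13))      # 26.0 GB in FP16 -> over an A10G's 24GB
print(weight_vram_gb(13, 1))   # 13.0 GB quantized -> fits an A10G comfortably
```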
🚀 Quick Start
Step 1: Install SGLang
pip install sglang
Step 2: Export Your Model
From EdukaAI, export in HuggingFace format. For SGLang, use the merged model (not just LoRA weights).
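"Merged" means the adapter has been folded back into the base weights: W_merged = W + (alpha / r) · B · A per adapted matrix. A toy pure-Python illustration of that arithmetic (real tooling, e.g. PEFT's merge_and_unload, does this for you across the whole model):

```python
# Toy illustration of what merging a LoRA adapter means numerically.
def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)   # (d_out x r) @ (r x d_in) low-rank update
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
A = [[1.0, 2.0]]               # r=1 down-projection (1x2)
B = [[1.0], [0.0]]             # up-projection (2x1)
print(merge_lora(W, A, B, alpha=2, r=1))  # [[3.0, 4.0], [0.0, 1.0]]
```

After merging, the model is a plain HuggingFace checkpoint with no adapter dependency, which is the form this single-model serving path expects.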
Step 3: Start the Server
# Single model
python -m sglang.launch_server --model-path /path/to/your/model --port 30000
# With multiple LoRA adapters (SGLang's killer feature!)
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-7b-hf \
--lora-paths /path/to/lora1 /path/to/lora2 \
--port 30000
Step 4: Query Your Model
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
SGLang is OpenAI API compatible, so you can use the same client libraries.
🎯 Multi-LoRA: Serve Multiple Models on One GPU
SGLang's unique feature for fine-tuning workflows: Load the base model once, then serve multiple specialized adapters.
Example: Customer Support System
# Base: Llama-2-70B (loads once; ~140GB in FP16 → ~70GB with FP8 quantization)
# Adapter 1: Technical Support LoRA (100MB)
# Adapter 2: Billing Support LoRA (100MB)
# Adapter 3: Sales LoRA (100MB)
python -m sglang.launch_server \
--model-path meta-llama/Llama-2-70b-hf \
--lora-paths tech-support billing sales \
--quantization fp8 \
--port 30000
# Route customers to appropriate adapter
# Customer A → tech-support adapter
# Customer B → billing adapter
# All on one A100 GPU!
Cost Savings: Instead of 3 GPUs ($15K/month), use 1 GPU ($5K/month). Hot-swap adapters without restarting the server.
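The routing step is just choosing an adapter name per request; in the OpenAI-compatible API the adapter is typically selected via the request's model field (check the SGLang docs for the exact parameter in your version). A minimal keyword-based router sketch, with illustrative rules only:

```python
# Sketch: pick which LoRA adapter should handle an incoming message.
# The adapter names match the --lora-paths example above; keyword rules
# are made up for illustration.
ROUTES = {
    "billing": ["invoice", "refund", "charge", "payment"],
    "sales": ["pricing", "upgrade", "demo", "quote"],
    "tech-support": ["error", "crash", "install", "bug"],
}

def pick_adapter(message, default="tech-support"):
    text = message.lower()
    for adapter, keywords in ROUTES.items():
        if any(k in text for k in keywords):
            return adapter
    return default

print(pick_adapter("I was double charged on my last invoice"))   # billing
print(pick_adapter("Can I get a demo of the enterprise plan?"))  # sales
print(pick_adapter("The app crashes on startup"))                # tech-support
```

In production you would more likely use an intent classifier than keywords, but the shape is the same: one base model, one server, a string per request selecting the specialization.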
📊 Performance vs Alternatives
| Tool | Tokens/sec (A100) | Memory Usage | Multi-LoRA | Best For |
|---|---|---|---|---|
| SGLang | 60+ | 40% less | ✅ Yes | Production APIs |
| vLLM | 50 | Standard | ⚠️ Limited | High throughput |
| TensorRT-LLM | 55 | Low | ❌ No | NVIDIA optimized |
| Ollama | 20-40 | Standard | ❌ No | Local dev |
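Throughput translates directly into serving cost. A rough cost-per-token calculation using the table's numbers and the illustrative $5K/month GPU price from earlier (not real quotes, and it assumes the GPU is fully saturated):

```python
# Convert sustained tokens/sec into an approximate cost per 1M tokens.
def cost_per_million_tokens(tokens_per_sec, gpu_monthly_usd, utilization=1.0):
    tokens_per_month = tokens_per_sec * 86_400 * 30 * utilization
    return gpu_monthly_usd / (tokens_per_month / 1_000_000)

for name, tps in [("SGLang", 60), ("vLLM", 50), ("Ollama", 30)]:
    print(f"{name}: ${cost_per_million_tokens(tps, 5000):.2f} per 1M tokens")
# SGLang: $32.15, vLLM: $38.58, Ollama: $64.30
```

At full utilization the 60-vs-50 tokens/sec gap is a ~17% cost reduction per token before counting the multi-LoRA consolidation savings.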
When to Choose SGLang
✅ Choose SGLang If:
- Production API serving needed
- Multiple LoRA models to serve
- Maximum throughput required
- Cost optimization matters
- OpenAI API compatibility needed
⚠️ Consider Others If:
- Consumer GPU only (24GB)
- Local development/testing
- AMD GPU (SGLang is NVIDIA-only)
- CPU-only inference needed