EdukaAI

Deploy with SGLang

Up to 5x faster inference with RadixAttention. The fastest way to serve fine-tuned models in production.

Up to 5x Faster Than vLLM

60+ tokens/sec on A100. Multi-LoRA support for serving multiple fine-tuned models on one GPU.

Why SGLang?

  • 5x faster inference
  • 40% less memory
  • Multi-LoRA support

What Makes It Fast?

  • RadixAttention - Reuses the KV cache across requests that share a prompt prefix
  • Speculative Decoding - A small draft model proposes tokens that the main model verifies in parallel
  • Continuous Batching - New requests join in-flight batches instead of waiting for the current batch to finish
  • OpenAI API Compatible - Drop-in replacement for existing client code
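RadixAttention's core idea can be illustrated in a few lines: when a new request shares a prefix with an earlier one (a common system prompt, for example), the KV cache for that prefix is reused and only the new suffix needs fresh prefill computation. A toy Python sketch of the prefix-matching idea (an illustration only, not SGLang's actual radix-tree implementation):

```python
# Toy sketch of RadixAttention's idea: find the longest cached prefix
# shared with a new prompt so its KV cache can be reused.
# Real systems match token IDs in a radix tree; this matches characters.

def longest_shared_prefix(cached_prompts, new_prompt):
    """Return the longest leading span of new_prompt found in any cached prompt."""
    best = ""
    for cached in cached_prompts:
        n = 0
        while n < min(len(cached), len(new_prompt)) and cached[n] == new_prompt[n]:
            n += 1
        best = max(best, new_prompt[:n], key=len)
    return best

cache = ["You are a helpful tutor. Explain photosynthesis.",
         "You are a helpful tutor. Explain gravity."]
reused = longest_shared_prefix(cache, "You are a helpful tutor. Explain osmosis.")
# The shared system-prompt prefix is served from cache; only the
# differing suffix ("osmosis.") needs fresh computation.
```

The longer and more repetitive your prompts (system prompts, few-shot examples), the more this reuse pays off.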

📋 Prerequisites

📦 1. Trained Model

Export your fine-tuned model from EdukaAI in HuggingFace format. SGLang works with any LoRA adapter or merged model.

See export options →
💻 2. GPU Server

SGLang requires an NVIDIA GPU. Recommended: an A100 (80GB) for 70B models, or an A10G (24GB) for 7B-13B models.

For consumer GPUs, see vLLM or KTransformers.
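A quick way to sanity-check the sizing above: model weights alone need roughly parameters × bytes-per-parameter of VRAM, before KV cache and activation overhead. A back-of-envelope sketch (fp16 assumed; overhead varies by workload and sequence length):

```python
def weight_vram_gb(params_billion, bytes_per_param=2):
    """Approximate VRAM for model weights alone (fp16 = 2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# 7B in fp16:  ~13 GB of weights -> fits an A10G (24 GB) with headroom for KV cache
# 70B in fp16: ~130 GB of weights -> needs quantization or multi-GPU on an A100 (80 GB)
print(round(weight_vram_gb(7), 1))   # ~13.0
print(round(weight_vram_gb(70), 1))  # ~130.4
```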

🚀 Quick Start

Step 1: Install SGLang

# core package; pip install "sglang[all]" pulls in the full serving stack
pip install sglang

Step 2: Export Your Model

From EdukaAI, export in HuggingFace format. For SGLang, use the merged model (not just LoRA weights).

Step 3: Start the Server

# Single model
python -m sglang.launch_server --model-path /path/to/your/model --port 30000

# With multiple LoRA adapters (SGLang's killer feature!)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths /path/to/lora1 /path/to/lora2 \
  --port 30000

Step 4: Query Your Model

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

SGLang is OpenAI API compatible, so your existing OpenAI client libraries work unchanged.
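The same request from Python, using only the standard library (assumes the server from Step 3 is running on port 30000; the `openai` client package works the same way if you point it at `base_url="http://localhost:30000/v1"`):

```python
import json
import urllib.request

# Same payload as the curl example above
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```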

🎯 Multi-LoRA: Serve Multiple Models on One GPU

SGLang's unique feature for fine-tuning workflows: Load the base model once, then serve multiple specialized adapters.

Example: Customer Support System

# Base: Llama-2-70B (loads once, ~140GB in fp16 → ~70GB with fp8 quantization)
# Adapter 1: Technical Support LoRA (100MB)
# Adapter 2: Billing Support LoRA (100MB)  
# Adapter 3: Sales LoRA (100MB)

python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-70b-hf \
  --lora-paths tech-support billing sales \
  --quantization fp8 \
  --port 30000

# Route customers to appropriate adapter
# Customer A → tech-support adapter
# Customer B → billing adapter
# All on one A100 GPU!

Cost Savings: Instead of 3 GPUs ($15K/month), use 1 GPU ($5K/month). Hot-swap adapters without restarting the server.
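The routing step above can be sketched as a thin dispatch layer: classify the customer's intent, pick the adapter name, and pass it as the `model` field of an OpenAI-style request. The adapter names below are the hypothetical ones from the launch command above:

```python
# Toy router: choose a LoRA adapter per customer intent; SGLang selects
# the adapter by the "model" field. Adapter names match the hypothetical
# --lora-paths in the launch command above.

ADAPTER_BY_INTENT = {
    "technical": "tech-support",
    "billing": "billing",
    "sales": "sales",
}

def pick_adapter(intent, default="tech-support"):
    """Map a classified customer intent to a served LoRA adapter."""
    return ADAPTER_BY_INTENT.get(intent, default)

def build_request(intent, user_message):
    """Build an OpenAI-style chat payload targeting the chosen adapter."""
    return {
        "model": pick_adapter(intent),
        "messages": [{"role": "user", "content": user_message}],
    }

# Customer A (technical) and Customer B (billing) hit the same server
# and GPU, but different adapters:
#   build_request("technical", "...") -> model "tech-support"
#   build_request("billing", "...")   -> model "billing"
```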

📊 Performance vs Alternatives

Tool            Tokens/sec (A100)   Memory Usage   Multi-LoRA    Best For
SGLang          60+                 40% less       ✅ Yes        Production APIs
vLLM            50                  Standard       ⚠️ Limited    High throughput
TensorRT-LLM    55                  Low            ❌ No         NVIDIA optimized
Ollama          20-40               Standard       ❌ No         Local dev

When to Choose SGLang

✅ Choose SGLang If:

  • Production API serving needed
  • Multiple LoRA models to serve
  • Maximum throughput required
  • Cost optimization matters
  • OpenAI API compatibility needed

⚠️ Consider Others If:

  • Consumer GPU only (24GB)
  • Local development/testing
  • AMD GPU (SGLang is NVIDIA-only)
  • CPU-only inference needed