# Inference Server
Deploy fine-tuned models as an OpenAI-compatible API server.
## Transformers Backend

Simple HTTP API using HuggingFace Transformers. Good for testing and low-traffic use.

```bash
pip install 'soup-cli[serve]'
soup serve --model ./output --port 8000
```
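Once the server is started, it can be worth waiting for the documented `/health` endpoint to report ready before sending traffic. A minimal sketch, assuming the default host and port from the command above and a hypothetical `wait_for_server` helper:

```python
import time

import requests

def wait_for_server(base_url: str = "http://localhost:8000", timeout: float = 120.0) -> None:
    """Poll the /health endpoint until the server responds, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).ok:
                return
        except requests.ConnectionError:
            pass  # Server process is still loading the model.
        time.sleep(2)
    raise TimeoutError(f"{base_url}/health did not respond within {timeout}s")

wait_for_server()
```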
## vLLM Backend (2-4x Faster)

Recommended for production. Uses PagedAttention for high throughput.

```bash
pip install 'soup-cli[serve-fast]'
soup serve --model ./output --backend vllm

# Multi-GPU with tensor parallelism
soup serve --model ./output --backend vllm --tensor-parallel 2

# Control GPU memory usage
soup serve --model ./output --backend vllm --gpu-memory 0.8
```
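To confirm which model name the server registered (the name clients pass in the `model` field), the documented `GET /v1/models` endpoint can be queried through the OpenAI SDK. A minimal sketch, assuming the default host and port:

```python
from openai import OpenAI

# GET /v1/models lists the model names the server accepts in the `model` field.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
for model in client.models.list():
    print(model.id)
```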
## SGLang Backend

Alternative high-throughput backend with RadixAttention.

```bash
pip install 'soup-cli[sglang]'
soup serve --model ./output --backend sglang

# Multi-GPU
soup serve --model ./output --backend sglang --tensor-parallel 2
```
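The speedup figures above depend heavily on hardware, model size, and request mix, so it can be worth timing an identical request against each backend on your own setup. A rough sketch, not a rigorous benchmark, assuming a server on the default port serving the model `output`:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Time one fixed request; restart the server with a different --backend and rerun to compare.
start = time.perf_counter()
response = client.chat.completions.create(
    model="output",
    messages=[{"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens if response.usage else "unknown"
print(f"{elapsed:.2f}s elapsed, {tokens} completion tokens")
```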
## Speculative Decoding (2-3x Faster Generation)

A smaller draft model proposes candidate tokens that the main model then verifies, which speeds up generation.

```bash
# Transformers backend
soup serve --model ./output --speculative-decoding small-draft-model --spec-tokens 5

# vLLM backend
soup serve --model ./output --backend vllm --speculative-decoding small-draft-model
```
## Multi-Adapter Serving (v0.22.0+)

Serve multiple LoRA adapters on a single base model:

```bash
soup serve --model ./base --adapters chat=./adapters/chat code=./adapters/code
```

Switch adapters per request via the `model` field:
```json
{"model": "chat", "messages": [{"role": "user", "content": "Hello!"}]}
```
## API Endpoints

All backends expose the same OpenAI-compatible API:

- `POST /v1/chat/completions` — chat completions (streaming supported)
- `GET /v1/models` — list available models
- `GET /health` — health check
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "output",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Compatible with the OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="output",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

> Note: `max_tokens` is capped at 16,384 per request. Error details are never exposed in HTTP responses.