# Export fine-tuned models to GGUF and deploy on Ollama
After training with Soup CLI, export your model to GGUF format and serve it locally with Ollama in three commands.
## 1. Export to GGUF

```bash
soup export --adapter ./runs/my-model/latest \
  --format gguf \
  --quant q4_k_m \
  --output ./my-model.gguf
```

Quantization levels:
| Quant | Size (7B) | Quality | Use case |
|---|---|---|---|
| q4_k_m | ~4.1 GB | Good | Default: best size/quality balance |
| q5_k_m | ~4.8 GB | Better | When you need higher accuracy |
| q8_0 | ~7.5 GB | Near-lossless | Benchmarks, eval |
| q2_k | ~2.6 GB | Lower | Tiny devices, Raspberry Pi |
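The sizes above follow from a simple estimate: parameter count times bits-per-weight, divided by 8. A rough sketch (the bits-per-weight figures here are my own approximations backed out of the table, not exact k-quant block layouts):

```python
# Rough GGUF file-size estimate: params x bits-per-weight / 8.
# Bits-per-weight values are approximations, not exact quant specs.
APPROX_BPW = {"q2_k": 3.0, "q4_k_m": 4.7, "q5_k_m": 5.5, "q8_0": 8.6}

def est_size_gb(n_params: float, quant: str) -> float:
    """Estimated file size in decimal gigabytes."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

for q in ("q2_k", "q4_k_m", "q5_k_m", "q8_0"):
    print(f"{q}: ~{est_size_gb(7e9, q):.1f} GB")
```

Actual files vary a little because GGUF stores some tensors (embeddings, norms) at higher precision, plus metadata.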
## 2. Deploy to Ollama

Soup CLI ships with Ollama integration (v0.18.0+):

```bash
soup deploy ollama \
  --model ./my-model.gguf \
  --name my-model \
  --template chat
```

This creates an Ollama Modelfile, imports the GGUF, and registers your model.
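For reference, the generated Modelfile might look something like this. This is a hand-written sketch using standard Ollama Modelfile directives; the exact template and parameters Soup CLI emits may differ:

```
FROM ./my-model.gguf

# Chat-style prompt template (Ollama's Go template syntax);
# the tag names here are placeholders and depend on your base model.
TEMPLATE """{{ if .System }}<|system|>{{ .System }}{{ end }}<|user|>{{ .Prompt }}<|assistant|>"""

PARAMETER temperature 0.7
PARAMETER stop "<|user|>"
```

If you ever need to import a GGUF by hand, the equivalent is `ollama create my-model -f Modelfile`.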
## 3. Chat

```bash
ollama run my-model
```

Or via the API:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "my-model",
  "prompt": "Hello!"
}'
```

## One-liner: train → export → deploy
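By default, `/api/generate` streams its reply as newline-delimited JSON objects, each carrying a `response` fragment, with a final record where `done` is `true`. A minimal Python sketch for reassembling the streamed text, demonstrated against hardcoded sample chunks rather than a live server:

```python
import json

def collect_response(lines):
    """Join the `response` fragments from Ollama's NDJSON stream."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Sample chunks in the shape the API streams back
sample = [
    '{"model": "my-model", "response": "Hello", "done": false}',
    '{"model": "my-model", "response": " there!", "done": false}',
    '{"model": "my-model", "response": "", "done": true}',
]
print(collect_response(sample))  # Hello there!
```

Passing `"stream": false` in the request body returns a single JSON object instead, if you'd rather skip the reassembly.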
```bash
soup train --config soup.yaml && \
  soup export --adapter ./runs/latest --format gguf --quant q4_k_m && \
  soup deploy ollama --model ./runs/latest/model.gguf --name my-model
```

## Other export formats
Soup CLI also supports:

- **ONNX** (`--format onnx`) for cross-platform inference
- **TensorRT-LLM** (`--format tensorrt`) for NVIDIA-optimized serving
- **AWQ / GPTQ** (`--format awq` or `--format gptq`) for quantized GPU inference
- **Hugging Face** (`--format hf`) for a merged checkpoint
## Related
- [Serving with vLLM and SGLang](/docs/serving)
- [Fine-tune Llama 3.1 with LoRA](/docs/fine-tune-llama-3-1-lora)