# Export fine-tuned models to GGUF and deploy on Ollama

After training with Soup CLI, export your model to GGUF format and serve it locally with Ollama in three commands.

## 1. Export to GGUF

```bash
soup export --adapter ./runs/my-model/latest \
            --format gguf \
            --quant q4_k_m \
            --output ./my-model.gguf
```

Quantization levels:

| Quant | Size (7B) | Quality | Use case |
|---|---|---|---|
| `q4_k_m` | ~4.1 GB | Good | Default: best size/quality balance |
| `q5_k_m` | ~4.8 GB | Better | When you need higher accuracy |
| `q8_0` | ~7.5 GB | Near-lossless | Benchmarks, eval |
| `q2_k` | ~2.6 GB | Lower | Tiny devices, Raspberry Pi |
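The sizes above follow roughly from parameter count times bits per weight. A quick Python sanity check, using approximate bits-per-weight figures for llama.cpp k-quants (the constants here are rough estimates, not values from Soup CLI):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate llama.cpp figures.
BITS_PER_WEIGHT = {
    "q2_k": 2.6,
    "q4_k_m": 4.85,
    "q5_k_m": 5.7,
    "q8_0": 8.5,
}

def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk size in GB for a quantized GGUF file."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A "7B" Llama-style model actually has ~6.74e9 parameters.
print(f"{estimate_gguf_gb(6.74e9, 'q4_k_m'):.1f} GB")  # 4.1 GB
```

Real files add a small overhead for metadata and non-quantized tensors, so expect the actual GGUF to be slightly larger than the estimate.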

## 2. Deploy to Ollama

Soup CLI ships with Ollama integration (v0.18.0+):

```bash
soup deploy ollama \
    --model ./my-model.gguf \
    --name my-model \
    --template chat
```

This creates an Ollama Modelfile, imports the GGUF, and registers your model.
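If you prefer to do this step by hand, the same result comes from writing a Modelfile and running `ollama create` yourself. A minimal sketch (the exact file Soup CLI generates may differ, and the parameter values here are illustrative):

```
FROM ./my-model.gguf
PARAMETER temperature 0.7
```

Then register it with `ollama create my-model -f Modelfile`.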

## 3. Chat

```bash
ollama run my-model
```

Or via the API:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "my-model",
  "prompt": "Hello!"
}'
```
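By default `/api/generate` streams its answer as one JSON object per line, each carrying a `response` fragment and a `done` flag. A small Python sketch for stitching the stream back together (the sample lines below are illustrative, not captured server output):

```python
import json

def join_stream(lines):
    """Concatenate the `response` fields of Ollama's JSON-lines
    streaming output into the full generated text."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Illustrative stream; real chunks include extra metadata fields.
stream = [
    '{"model":"my-model","response":"Hello","done":false}',
    '{"model":"my-model","response":" there!","done":false}',
    '{"model":"my-model","response":"","done":true}',
]
print(join_stream(stream))  # Hello there!
```

If you'd rather receive a single JSON object, add `"stream": false` to the request body.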

## One-liner: train → export → deploy

```bash
soup train --config soup.yaml && \
soup export --adapter ./runs/latest --format gguf --quant q4_k_m \
            --output ./runs/latest/model.gguf && \
soup deploy ollama --model ./runs/latest/model.gguf --name my-model
```

## Other export formats

Soup CLI also supports:

- **ONNX**: `--format onnx` for cross-platform inference
- **TensorRT-LLM**: `--format tensorrt` for NVIDIA-optimized serving
- **AWQ / GPTQ**: `--format awq` or `--format gptq` for quantized GPU inference
- **Hugging Face**: `--format hf` for a merged checkpoint

## Related

- [Serving with vLLM and SGLang](/docs/serving)
- [Fine-tune Llama 3.1 with LoRA](/docs/fine-tune-llama-3-1-lora)