# Backends & Performance

## Unsloth Backend (2-5x Faster Training)
Use the Unsloth backend for significantly faster training and up to 80% less VRAM:
```bash
pip install 'soup-cli[fast]'
```

Add one line to your config:

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: sft
backend: unsloth        # 2-5x faster, -80% VRAM
data:
  train: ./data/train.jsonl
  format: alpaca
training:
  epochs: 3
  lr: 2e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```

Works with all training tasks: SFT, DPO, GRPO, PPO, KTO, ORPO, SimPO, IPO, and Pretrain.
> Tip: Soup auto-detects Unsloth. When it is installed, you'll see a hint during `soup train` if you haven't enabled it yet.
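The auto-detection hint above can be sketched as a simple import probe. A hypothetical illustration (`detect_backend` is not soup's actual API, just a sketch of the idea):

```python
import importlib.util

def detect_backend(requested=None):
    """Hypothetical sketch: honor an explicit `backend:` setting,
    otherwise prefer unsloth when the package is importable."""
    if requested:
        return requested
    if importlib.util.find_spec("unsloth") is not None:
        return "unsloth"       # faster backend is available; use it
    return "transformers"      # safe default
```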
## Performance Optimizations
```yaml
training:
  use_liger: true              # Liger Kernel fused ops (20-60% memory savings)
  use_flash_attn: true         # FlashAttention v2/v3 auto-detection
  gradient_checkpointing: true # Required for long sequences
```

Install optional packages:

```bash
pip install 'soup-cli[liger]'                # Liger Kernel
pip install flash-attn --no-build-isolation  # FlashAttention
pip install 'soup-cli[ring-attn]'            # Ring FlashAttention
```

## Long-Context Training (128k+)
```yaml
training:
  rope_scaling_type: dynamic   # linear, dynamic, yarn, longrope
  gradient_checkpointing: true
  # use_ring_attention: true   # Sequence parallelism across GPUs
data:
  max_length: 131072           # Up to 1M tokens supported
```

## Multi-GPU Training
### DeepSpeed
```bash
soup train --config soup.yaml --deepspeed zero2          # ZeRO Stage 2
soup train --config soup.yaml --deepspeed zero3          # ZeRO Stage 3
soup train --config soup.yaml --deepspeed zero2_offload  # With CPU offload
```

### FSDP2 (PyTorch Native)
```bash
soup train --config soup.yaml --fsdp full_shard    # Like ZeRO-3
soup train --config soup.yaml --fsdp shard_grad    # Like ZeRO-2
soup train --config soup.yaml --fsdp full_offload  # With CPU offload
```

## Quantization-Aware Training (QAT)
Train with simulated quantization for better post-quantization quality:
```bash
pip install 'soup-cli[qat]'
```

```yaml
training:
  quantization: 4bit
  quantization_aware: true   # Enable QAT
```

QAT training is ~5-10% slower, but it produces significantly better quality when deploying with aggressive quantization. It is not compatible with the unsloth backend.
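The "simulated quantization" above can be pictured as a quantize-dequantize round trip: weights are snapped to a low-bit grid during the forward pass, so the model learns to tolerate the rounding error it will see after deployment. A minimal pure-Python sketch of symmetric fake quantization (illustrative only, not soup's implementation):

```python
def fake_quantize(weights, bits=4):
    """Map each weight to the nearest point on a symmetric low-bit grid,
    then back to float, so the rounding error is visible to the loss."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax    # one scale per tensor
    return [round(w / scale) * scale for w in weights]
```

During QAT the loss is computed on these fake-quantized weights while gradients still update the full-precision copies (typically via a straight-through estimator).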
## Advanced LoRA Variants

### DoRA (Weight-Decomposed LoRA)
```yaml
training:
  lora:
    r: 64
    alpha: 16
    use_dora: true
```

### LoRA+ (Differentiated Learning Rates)
```yaml
training:
  lr: 2e-5
  loraplus_lr_ratio: 16.0   # lr_B = lr x 16
  lora:
    r: 64
    alpha: 16
```

### GaLore (Memory-Efficient Full-Parameter Training)
```yaml
training:
  quantization: none   # Required: incompatible with quantization
  use_galore: true
  galore_rank: 128
  galore_update_proj_gap: 200
  galore_scale: 0.25
```

> Note: GaLore requires `quantization: none` and `backend: transformers` (not unsloth).
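GaLore keeps optimizer state in a low-rank subspace of the gradient instead of one entry per parameter. A rough numpy sketch of the core projection step, mirroring the knobs above (`galore_rank` → `rank`, `galore_update_proj_gap` → `update_gap`, `galore_scale` → `scale`); the function names are illustrative, not soup's internals:

```python
import numpy as np

def galore_project(grad, rank, P=None, step=0, update_gap=200):
    """Project the full gradient into a rank-r subspace. The projector P
    (top-r left singular vectors) is refreshed every `update_gap` steps."""
    if P is None or step % update_gap == 0:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]
    return P, P.T @ grad   # optimizer state lives at this small r x n size

def galore_unproject(P, low_rank_update, scale=0.25):
    """Map the optimizer's low-rank update back to full parameter space."""
    return scale * (P @ low_rank_update)
```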
### NEFTune (Noisy Embeddings)
Add noise to embedding vectors during training for better generalization:
```yaml
training:
  neftune_alpha: 5.0   # Noise magnitude (0.0–50.0, typical: 5–15)
```

NEFTune has been shown to improve instruction-following quality without extra data or compute. Works with all training tasks.
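In the original NEFTune formulation, the noise is sampled uniformly and scaled by `alpha / sqrt(L * d)`, where `L` is the sequence length and `d` the embedding dimension. A pure-Python sketch of the injection step (illustrative, not soup's code):

```python
import math
import random

def neftune_noise(embeds, alpha=5.0, seed=0):
    """Add uniform noise of magnitude alpha / sqrt(L * d) to a batch of
    token embeddings, given as a list of L rows of dimension d."""
    rng = random.Random(seed)
    L, d = len(embeds), len(embeds[0])
    eps = alpha / math.sqrt(L * d)
    return [[x + rng.uniform(-eps, eps) for x in row] for row in embeds]
```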
### rsLoRA (Rank-Stabilized LoRA)

Scale LoRA outputs by `1/sqrt(r)` instead of `1/r` for better training at high ranks:
```yaml
training:
  lora:
    r: 128   # Higher ranks benefit most from rsLoRA
    alpha: 64
    use_rslora: true
```

> Tip: Combine NEFTune + rsLoRA + the Unsloth backend for the best training quality and speed.
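The difference is just the scaling factor applied to the low-rank update `B @ A`: standard LoRA uses `alpha / r`, which shrinks quickly as the rank grows, while rsLoRA uses `alpha / sqrt(r)`. A two-line illustration:

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# With the config above (r=128, alpha=64): standard scaling is 0.5,
# while rsLoRA keeps a much larger effective update (~5.66).
```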
## MoE Model Support

Fine-tune Mixture-of-Experts models such as Mixtral, Qwen3-30B-A3B, and DeepSeek V3:
```yaml
base: Qwen/Qwen3-30B-A3B
task: sft
training:
  moe_lora: true            # Target expert + attention layers
  moe_aux_loss_coeff: 0.01  # Router load-balancing loss
  quantization: 4bit
```

Soup auto-detects MoE architectures. Works with all training tasks.
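`moe_aux_loss_coeff` weights a router load-balancing loss. A common formulation (Switch-Transformer style, shown here as an assumption about the kind of loss being weighted, not a guarantee of soup's exact math) multiplies, per expert, the fraction of tokens routed to it by its mean router probability:

```python
def load_balancing_loss(router_probs, expert_index, num_experts):
    """Switch-style aux loss: N * sum_i(f_i * P_i), where f_i is the
    fraction of tokens routed to expert i and P_i is the mean router
    probability for expert i. Equals 1.0 under a perfectly balanced
    router; skewed routing pushes it above 1.0."""
    n = len(expert_index)
    f = [expert_index.count(e) / n for e in range(num_experts)]
    P = [sum(probs[e] for probs in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))
```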
## Curriculum Learning (v0.23.0+)
Sort training data by difficulty and train progressively:
```yaml
training:
  curriculum: true
  curriculum_metric: length   # Sort by sequence length
  curriculum_buckets: 5       # 1-20 difficulty buckets
```

## Freeze Training (v0.24.0+)
Freeze bottom layers to reduce compute while preserving base knowledge:
```yaml
training:
  freeze_layers: 16    # Freeze bottom 16 layers
  # freeze_ratio: 0.5  # Or freeze 50% of layers
```

## Loss Watchdog (v0.24.0+)
Automatically stop training if loss spikes or diverges:
```yaml
training:
  loss_watchdog: true
  loss_watchdog_threshold: 5.0  # Stop if loss exceeds threshold
  loss_watchdog_patience: 3     # Wait N steps before stopping
```

## Sample Packing (v0.23.0+)
Pack multiple short samples into one sequence for efficient training:
```yaml
training:
  packing: true   # Enable sample packing (SFT only)
```
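Conceptually, packing concatenates short samples until adding the next one would exceed `max_length`, then starts a new packed sequence. A greedy first-fit sketch (soup's actual packer may differ, e.g. in how it masks attention across sample boundaries):

```python
def pack_samples(lengths, max_length):
    """Group sample indices into packs whose total token count stays
    within max_length (greedy, order-preserving first-fit)."""
    packs, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if current and used + n > max_length:
            packs.append(current)   # current pack is full; start a new one
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        packs.append(current)
    return packs
```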