Backends & Performance

Unsloth Backend (2-5x Faster Training)

Use the Unsloth backend for significantly faster training and up to 80% less VRAM:

```bash
pip install 'soup-cli[fast]'
```

Add one line to your config:

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: sft
backend: unsloth    # 2-5x faster, -80% VRAM

data:
  train: ./data/train.jsonl
  format: alpaca

training:
  epochs: 3
  lr: 2e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```

Works with all training tasks: SFT, DPO, GRPO, PPO, KTO, ORPO, SimPO, IPO, and Pretrain.

> Tip: Soup auto-detects unsloth. When installed, you'll see a hint during soup train if you haven't enabled it yet.

Performance Optimizations

```yaml
training:
  use_liger: true             # Liger Kernel fused ops (20-60% memory savings)
  use_flash_attn: true        # FlashAttention v2/v3 auto-detection
  gradient_checkpointing: true  # Required for long sequences
```

Install optional packages:

```bash
pip install 'soup-cli[liger]'       # Liger Kernel
pip install flash-attn --no-build-isolation  # FlashAttention
pip install 'soup-cli[ring-attn]'   # Ring FlashAttention
```

Long-Context Training (128k+)

```yaml
training:
  rope_scaling_type: dynamic    # linear, dynamic, yarn, longrope
  gradient_checkpointing: true
  # use_ring_attention: true    # Sequence parallelism across GPUs

data:
  max_length: 131072            # Up to 1M tokens supported
```

Multi-GPU Training

DeepSpeed

```bash
soup train --config soup.yaml --deepspeed zero2          # ZeRO Stage 2
soup train --config soup.yaml --deepspeed zero3          # ZeRO Stage 3
soup train --config soup.yaml --deepspeed zero2_offload  # With CPU offload
```

FSDP2 (PyTorch Native)

```bash
soup train --config soup.yaml --fsdp full_shard    # Like ZeRO-3
soup train --config soup.yaml --fsdp shard_grad    # Like ZeRO-2
soup train --config soup.yaml --fsdp full_offload  # With CPU offload
```

Quantization-Aware Training (QAT)

Train with simulated quantization for better post-quantization quality:

```bash
pip install 'soup-cli[qat]'
```

```yaml
training:
  quantization: 4bit
  quantization_aware: true    # Enable QAT
```

QAT trains roughly 5-10% slower but produces significantly better quality when deploying with aggressive quantization. It is not compatible with the unsloth backend.
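Under the hood, QAT relies on "fake quantization": the forward pass sees weights snapped to the low-bit grid, while updates still flow into the full-precision copy, so the model learns to tolerate rounding error. A minimal stdlib sketch of the snapping step (illustrative only, not soup-cli's actual kernel):

```python
def fake_quantize(weights, bits=4):
    """Simulate low-bit quantization: snap each weight to the nearest
    level of a symmetric integer grid, then map back to float.
    Training against these snapped values makes the model robust to
    the rounding error it will see after real quantized deployment."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]

weights = [0.82, -0.31, 0.05, -0.77]
print(fake_quantize(weights))   # small weights collapse to 0.0, extremes survive
```

In real QAT the rounding is applied per-tensor or per-channel inside each forward pass, with a straight-through estimator for the gradient.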

Advanced LoRA Variants

DoRA (Weight-Decomposed LoRA)

```yaml
training:
  lora:
    r: 64
    alpha: 16
    use_dora: true
```
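DoRA splits each weight column into a learned magnitude and a unit direction: the LoRA delta perturbs the direction, which is re-normalized, and the magnitude rescales it. A toy per-column sketch (illustrative, not the PEFT implementation):

```python
import math

def dora_merge(w_col, delta_col, magnitude):
    """DoRA treats a weight column as magnitude * direction: add the
    LoRA delta to the column, re-normalize to unit length, then apply
    the separately learned magnitude."""
    merged = [w + d for w, d in zip(w_col, delta_col)]
    norm = math.sqrt(sum(x * x for x in merged))
    return [magnitude * x / norm for x in merged]

# Direction (0.6, 0.8) rescaled to magnitude 10 -> [6.0, 8.0]
print(dora_merge([3.0, 4.0], [0.0, 0.0], magnitude=10.0))
```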

LoRA+ (Differentiated Learning Rates)

```yaml
training:
  lr: 2e-5
  loraplus_lr_ratio: 16.0    # lr_B = lr x 16
  lora:
    r: 64
    alpha: 16
```
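The LoRA+ recipe amounts to two optimizer parameter groups: the A matrices keep the base learning rate, while the zero-initialized B matrices get `lr x ratio`. A hypothetical sketch of how such groups could be built (names and structure are illustrative, not soup-cli internals):

```python
def loraplus_param_groups(named_params, base_lr=2e-5, ratio=16.0):
    """Split LoRA parameters into two optimizer groups: lora_A keeps
    base_lr, lora_B trains at base_lr * ratio, as in LoRA+."""
    groups = {"lora_A": {"lr": base_lr, "params": []},
              "lora_B": {"lr": base_lr * ratio, "params": []}}
    for name, param in named_params:
        key = "lora_B" if "lora_B" in name else "lora_A"
        groups[key]["params"].append(param)
    return list(groups.values())

params = [("layer0.lora_A.weight", "A0"), ("layer0.lora_B.weight", "B0")]
for g in loraplus_param_groups(params):
    print(g["lr"], g["params"])
```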

GaLore (Memory-Efficient Full-Parameter Training)

```yaml
training:
  quantization: none          # Required: incompatible with quantization
  use_galore: true
  galore_rank: 128
  galore_update_proj_gap: 200
  galore_scale: 0.25
```

> Note: GaLore requires quantization: none and backend: transformers (not unsloth).
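The memory saving comes from keeping optimizer state for a rank-r projection of the gradient rather than the full gradient. A toy project/update/project-back step with a fixed projector (the real method refreshes the projector via SVD every `galore_update_proj_gap` steps; this sketch is illustrative only):

```python
def matmul(A, B):
    """Plain-Python matrix multiply for the toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def galore_step(grad, P, lr=0.1, scale=0.25):
    """One GaLore-style update: compress the (m x n) gradient into a
    rank-r subspace with projector P (m x r), take the step on the
    small r x n matrix, project back, and scale. Optimizer state
    (momentum etc.) would live in the r x n space - hence the memory win."""
    low_rank = matmul(transpose(P), grad)   # r x n compressed gradient
    update = matmul(P, low_rank)            # back to m x n
    return [[-lr * scale * u for u in row] for row in update]

# Toy rank-1 projector that keeps only the first row direction of a 2x2 grad.
P = [[1.0], [0.0]]
grad = [[2.0, 4.0], [6.0, 8.0]]
print(galore_step(grad, P))
```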

NEFTune (Noisy Embeddings)

Add noise to embedding vectors during training for better generalization:

```yaml
training:
  neftune_alpha: 5.0    # Noise magnitude (0.0–50.0, typical: 5–15)
```

NEFTune has been shown to improve instruction-following quality without extra data or compute. Works with all training tasks.
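The mechanism is simple: during training, uniform noise in [-1, 1] scaled by `alpha / sqrt(L * d)` (L = sequence length, d = embedding dimension) is added to the embedding matrix, following the NEFTune paper. A stdlib sketch of that step (illustrative, not soup-cli's implementation):

```python
import math, random

def neftune(embeddings, alpha=5.0, seed=0):
    """Add uniform noise in [-1, 1] to every embedding entry, scaled by
    alpha / sqrt(L * d). Applied only at training time; inference uses
    clean embeddings."""
    rng = random.Random(seed)
    L, d = len(embeddings), len(embeddings[0])
    eps = alpha / math.sqrt(L * d)
    return [[x + rng.uniform(-1, 1) * eps for x in row] for row in embeddings]

emb = [[0.1, 0.2], [0.3, 0.4]]          # L=2, d=2 toy embedding matrix
noisy = neftune(emb, alpha=5.0)
```

Note the scaling: longer sequences and larger embedding dims get proportionally smaller per-entry noise, so `alpha` stays comparable across models.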

rsLoRA (Rank-Stabilized LoRA)

Scale LoRA outputs by alpha/sqrt(r) instead of alpha/r for more stable training at high ranks:

```yaml
training:
  lora:
    r: 128              # Higher ranks benefit most from rsLoRA
    alpha: 64
    use_rslora: true
```
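The change is a one-line scaling factor, but it matters at high rank: with the classic alpha/r rule the adapter's contribution shrinks as r grows, while alpha/sqrt(r) keeps it roughly stable. For the config above:

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Classic LoRA scales the adapter output by alpha / r; rsLoRA uses
    alpha / sqrt(r), preserving update magnitude at high ranks."""
    return alpha / math.sqrt(r) if rslora else alpha / r

print(lora_scale(64, 128))               # classic: 0.5
print(lora_scale(64, 128, rslora=True))  # rank-stabilized: ~5.66
```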

> Tip: Combine NEFTune + rsLoRA + Unsloth backend for the best training quality and speed.

MoE Model Support

Fine-tune Mixture of Experts models (Mixtral, Qwen3-30B-A3B, DeepSeek V3):

```yaml
base: Qwen/Qwen3-30B-A3B
task: sft

training:
  moe_lora: true               # Target expert + attention layers
  moe_aux_loss_coeff: 0.01     # Router load-balancing loss
  quantization: 4bit
```

Soup auto-detects MoE architectures. Works with all training tasks.
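The `moe_aux_loss_coeff` weights a Switch-Transformer-style load-balancing term against the task loss, penalizing routers that funnel all tokens to a few experts. A toy sketch of that auxiliary loss (illustrative, not the model's actual routing code):

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum over experts of
    (fraction of tokens routed to expert i) * (mean router probability
    for expert i). Equals 1.0 for perfectly uniform routing and grows
    as routing collapses onto fewer experts."""
    n = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        frac = sum(1 for a in assignments if a == i) / n
        mean_p = sum(p[i] for p in router_probs) / n
        loss += frac * mean_p
    return num_experts * loss

# Two tokens, two experts, perfectly balanced routing -> 1.0
probs = [[0.5, 0.5], [0.5, 0.5]]
print(load_balancing_loss(probs, assignments=[0, 1], num_experts=2))
```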

Curriculum Learning (v0.23.0+)

Sort training data by difficulty and train progressively:

```yaml
training:
  curriculum: true
  curriculum_metric: length       # Sort by sequence length
  curriculum_buckets: 5           # 1-20 difficulty buckets
```
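Conceptually, curriculum learning with the `length` metric just sorts samples by length and splits them into contiguous buckets that training walks easiest-first. A minimal sketch (illustrative, not soup-cli's data loader):

```python
def curriculum_buckets(samples, num_buckets=5):
    """Sort samples by length (the difficulty proxy) and split into
    num_buckets contiguous buckets, easiest first."""
    ordered = sorted(samples, key=len)
    size = -(-len(ordered) // num_buckets)        # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

samples = ["hi", "a longer sample", "mid one", "x", "medium text!"]
for bucket in curriculum_buckets(samples, num_buckets=2):
    print(bucket)
```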

Freeze Training (v0.24.0+)

Freeze bottom layers to reduce compute while preserving base knowledge:

```yaml
training:
  freeze_layers: 16               # Freeze bottom 16 layers
  # freeze_ratio: 0.5             # Or freeze 50% of layers
```
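Either option resolves to the same thing: the bottom N transformer layers (closest to the embeddings) stop receiving gradients, while the upper layers keep training. A sketch of how the two settings could map to a trainable/frozen plan (illustrative, not soup-cli's code):

```python
def freeze_plan(num_layers, freeze_layers=None, freeze_ratio=None):
    """Return a per-layer list where True means trainable. Bottom layers
    are frozen either by count (freeze_layers) or by fraction
    (freeze_ratio); in a real model this would toggle requires_grad."""
    if freeze_ratio is not None:
        freeze_layers = int(num_layers * freeze_ratio)
    return [i >= freeze_layers for i in range(num_layers)]

plan = freeze_plan(num_layers=32, freeze_layers=16)
print(sum(plan), "of", len(plan), "layers trainable")   # 16 of 32
```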

Loss Watchdog (v0.24.0+)

Automatically stop training if loss spikes or diverges:

```yaml
training:
  loss_watchdog: true
  loss_watchdog_threshold: 5.0    # Stop if loss exceeds threshold
  loss_watchdog_patience: 3       # Wait N steps before stopping
```
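The patience setting means a single spike is tolerated; training stops only when the loss stays above the threshold for N consecutive steps. The logic can be sketched as (illustrative, not soup-cli's trainer):

```python
class LossWatchdog:
    """Stop training when loss exceeds threshold for `patience`
    consecutive steps; a transient spike resets nothing permanent."""
    def __init__(self, threshold=5.0, patience=3):
        self.threshold, self.patience, self.strikes = threshold, patience, 0

    def should_stop(self, loss):
        # Consecutive over-threshold steps accumulate; one good step resets.
        self.strikes = self.strikes + 1 if loss > self.threshold else 0
        return self.strikes >= self.patience

dog = LossWatchdog(threshold=5.0, patience=3)
for step, loss in enumerate([1.2, 9.0, 9.5, 1.1, 8.0, 8.2, 8.4]):
    if dog.should_stop(loss):
        print(f"stopping at step {step}")   # fires at step 6
        break
```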

Sample Packing (v0.23.0+)

Pack multiple short samples into one sequence for efficient training:

```yaml
training:
  packing: true                   # Enable sample packing (SFT only)
```
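The win comes from replacing padding with real tokens: several short samples share one `max_length` sequence (with attention masked so they don't see each other). A greedy first-fit sketch over sample lengths (illustrative, not soup-cli's packer):

```python
def pack_samples(lengths, max_length):
    """Greedy first-fit packing: append each sample to the current
    sequence while it fits, otherwise start a new sequence. Cuts
    padding waste when most samples are far shorter than max_length."""
    packs, current = [], []
    for n in lengths:
        if current and sum(current) + n > max_length:
            packs.append(current)
            current = []
        current.append(n)
    if current:
        packs.append(current)
    return packs

# Five samples fit in two sequences instead of five padded ones.
print(pack_samples([300, 500, 200, 900, 100], max_length=1024))
```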