# Training Methods

Soup supports 11 training methods, selected via the `task` key in the config.
## Supervised Fine-Tuning (SFT)

The most common method: train on instruction-response pairs.

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: sft
data:
  train: ./data/train.jsonl
  format: alpaca
training:
  epochs: 3
  lr: 2e-5
  batch_size: auto
quantization: 4bit
lora:
  r: 64
  alpha: 16
```
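With `format: alpaca`, each line of the training file is one JSON object. A minimal sketch of such a record, assuming the conventional Alpaca field names (`instruction`/`input`/`output`) and example text invented for illustration:

```python
import json

# One Alpaca-format record, as expected by `format: alpaca`.
# Field names follow the common Alpaca convention; the text is illustrative.
record = {
    "instruction": "Summarize the following text in one sentence.",
    "input": "Soup trains language models from a single YAML config.",
    "output": "Soup is a YAML-driven toolkit for training language models.",
}

# ./data/train.jsonl holds one such JSON object per line.
line = json.dumps(record)
```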
## Direct Preference Optimization (DPO)

Train on preference pairs (chosen vs. rejected responses).

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: dpo
data:
  train: ./data/preferences.jsonl
  format: dpo
training:
  dpo_beta: 0.1
quantization: 4bit
lora:
  r: 64
  alpha: 16
```
## Group Relative Policy Optimization (GRPO)

Reasoning training (DeepSeek-R1 style) that uses reward functions instead of a learned reward model.

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: grpo
data:
  train: ./data/reasoning_train.jsonl
  format: sharegpt
  max_length: 4096
training:
  grpo_beta: 0.1
  num_generations: 4
  reward_fn: accuracy  # or 'format', or path to custom .py
quantization: 4bit
lora:
  r: 64
  alpha: 16
```

Built-in reward functions:
- `accuracy`: checks that the final answer matches the expected one (supports `####` and `\boxed{}` formats)
- `format`: checks for a structured `<think>...</think>` reasoning block
Custom reward functions: point `reward_fn` at a Python file.

```python
# my_reward.py
def reward_fn(completions, **kwargs):
    # Score 1.0 when the last message of a completion contains "correct".
    return [1.0 if "correct" in c[-1]["content"] else 0.0 for c in completions]
```
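The `c[-1]["content"]` access above implies a chat-style structure for `completions`. A small sketch of that assumed shape (a list of conversations, each a list of role/content messages) and the rewards it would produce:

```python
# Assumed shape of `completions`: a list of conversations, each a list of
# {"role", "content"} messages, matching the c[-1]["content"] access in
# the example reward function.
def reward_fn(completions, **kwargs):
    return [1.0 if "correct" in c[-1]["content"] else 0.0 for c in completions]

completions = [
    [{"role": "assistant", "content": "Yes, that is correct."}],
    [{"role": "assistant", "content": "I do not know."}],
]

rewards = reward_fn(completions)  # -> [1.0, 0.0]
```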
## PPO / Full RLHF Pipeline

A three-step pipeline: SFT warmup -> reward model -> PPO alignment.

```yaml
# Step 3: PPO alignment
base: meta-llama/Llama-3.1-8B-Instruct
task: ppo
data:
  train: ./data/prompts.jsonl
  format: chatml
training:
  reward_model: ./output_rm  # from step 2
  ppo_epochs: 4
  ppo_clip_ratio: 0.2
  ppo_kl_penalty: 0.05
quantization: 4bit
lora:
  r: 64
  alpha: 16
```
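To see what `ppo_clip_ratio: 0.2` controls, here is the standard PPO clipped surrogate objective for a single token, written out as plain Python (an illustration of the math, not Soup's trainer internals):

```python
def ppo_clipped_objective(ratio, advantage, clip_ratio=0.2):
    """Per-token PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    `ratio` is new_policy_prob / old_policy_prob for the sampled token;
    `clip_ratio` corresponds to the ppo_clip_ratio config key.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio) * advantage
    # Taking the min keeps the update pessimistic: large policy shifts
    # stop earning extra objective once the ratio leaves [1-eps, 1+eps].
    return min(unclipped, clipped)

ppo_clipped_objective(1.5, 1.0)  # -> 1.2 (clipped at ratio 1 + 0.2)
```

The separate `ppo_kl_penalty` term additionally discourages the policy from drifting too far from the SFT reference model.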
## All Training Tasks

| Task | Data Format | Use Case |
|---|---|---|
| `sft` | alpaca / sharegpt / chatml / llava | Instruction tuning |
| `dpo` | prompt + chosen + rejected | Preference alignment |
| `grpo` | prompts + reward fns | Reasoning (DeepSeek-R1) |
| `kto` | prompt + completion + label | Unpaired preference |
| `orpo` | prompt + chosen + rejected | Reference-free alignment |
| `simpo` | prompt + chosen + rejected | Length-normalized preference |
| `ipo` | prompt + chosen + rejected | Regularized preference |
| `ppo` | prompts + reward model/fn | Full RLHF stage 3 |
| `pretrain` | plaintext (raw text) | Continued pre-training |
| `embedding` | anchor + positive (+ negative) | Sentence embeddings |
| `reward_model` | prompt + chosen + rejected | RLHF stage 2 |
## Running Training

```bash
# Start training
soup train --config soup.yaml

# Resume from a checkpoint
soup train --config soup.yaml --resume auto
soup train --config soup.yaml --resume ./output/checkpoint-500

# With W&B logging
soup train --config soup.yaml --wandb

# With TensorBoard
soup train --config soup.yaml --tensorboard

# With DeepSpeed (multi-GPU)
soup train --config soup.yaml --deepspeed zero2

# With FSDP2
soup train --config soup.yaml --fsdp full_shard

# Skip confirmation
soup train --config soup.yaml --yes
```