# Training Methods

Soup supports 11 training methods, selected via the `task` config key.

## Supervised Fine-Tuning (SFT)

The most common method: training on instruction-response pairs.

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: sft

data:
  train: ./data/train.jsonl
  format: alpaca

training:
  epochs: 3
  lr: 2e-5
  batch_size: auto
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```
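
In alpaca format, each line of `train.jsonl` is one JSON object with instruction, input, and output fields. A minimal sketch of producing such a file (the field names follow the common alpaca convention; Soup's exact schema may differ):

```python
import json

# One record per line; alpaca-style field names are assumed here,
# based on the common convention rather than Soup's loader.
records = [
    {
        "instruction": "Summarize the following text.",
        "input": "Soup is a fine-tuning toolkit.",
        "output": "Soup fine-tunes language models.",
    }
]

with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```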

## Direct Preference Optimization (DPO)

Train with preference pairs (chosen vs rejected).

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: dpo

data:
  train: ./data/preferences.jsonl
  format: dpo

training:
  dpo_beta: 0.1
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```
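
For intuition on `dpo_beta`: DPO minimizes a logistic loss on the gap between the policy's and a frozen reference model's log-probabilities for the chosen versus rejected response, with beta scaling that gap. A per-pair numeric sketch (illustrative only, not Soup's implementation):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Sequence log-probs under the policy (pi_*) and reference (ref_*).
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)): small when the policy prefers `chosen`.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

While the policy still matches the reference, the loss starts at log 2 and falls as the policy shifts probability mass toward chosen responses.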

## Group Relative Policy Optimization (GRPO)

Reasoning-focused training (DeepSeek-R1 style) that scores sampled completions with reward functions instead of a learned reward model.

```yaml
base: meta-llama/Llama-3.1-8B-Instruct
task: grpo

data:
  train: ./data/reasoning_train.jsonl
  format: sharegpt
  max_length: 4096

training:
  grpo_beta: 0.1
  num_generations: 4
  reward_fn: accuracy     # or 'format', or path to custom .py
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```
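
What makes GRPO "group relative": it samples `num_generations` completions per prompt and replaces a learned value baseline with per-group reward normalization. A sketch of that advantage step (illustrative, not Soup's trainer code):

```python
def group_advantages(rewards, eps=1e-6):
    # Normalize each completion's reward against its own sampling group:
    # advantage_i = (r_i - mean(group)) / (std(group) + eps)
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four generations for one prompt, scored 1.0 (correct) or 0.0:
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```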

Built-in reward functions:

- `accuracy` — checks whether the final answer matches the expected one (supports `####` and `\boxed{}` answer formats)
- `format` — checks for structured `<think>...</think>` reasoning blocks
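
To make the `accuracy` behavior concrete, here is a rough sketch of answer extraction and comparison (a hypothetical stand-in, not Soup's built-in implementation):

```python
import re

def accuracy_reward(completion, expected):
    # Look for a GSM8K-style '#### answer' line, then a LaTeX \boxed{answer}.
    m = re.search(r"####\s*([^\n]+)", completion)
    if not m:
        m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if not m:
        return 0.0
    return 1.0 if m.group(1).strip() == expected.strip() else 0.0
```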

Custom reward functions — point `reward_fn` at a Python file:

```python
# my_reward.py
def reward_fn(completions, **kwargs):
    # `completions` is a list of chat conversations; score the last
    # (assistant) message of each, returning one float per completion.
    return [1.0 if "correct" in c[-1]["content"] else 0.0 for c in completions]
```

## PPO / Full RLHF Pipeline

Three-step pipeline: SFT warmup -> Reward Model -> PPO alignment.

```yaml
# Step 3: PPO alignment
base: meta-llama/Llama-3.1-8B-Instruct
task: ppo

data:
  train: ./data/prompts.jsonl
  format: chatml

training:
  reward_model: ./output_rm   # From step 2
  ppo_epochs: 4
  ppo_clip_ratio: 0.2
  ppo_kl_penalty: 0.05
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```
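
`ppo_clip_ratio` and `ppo_kl_penalty` correspond to the standard PPO pieces: the policy's probability ratio is clipped to [1 - eps, 1 + eps], and a KL term penalizes drift from the reference policy. A per-sample numeric sketch (not Soup's trainer):

```python
import math

def ppo_loss(logp_new, logp_old, advantage, kl, clip_ratio=0.2, kl_penalty=0.05):
    # Clipped surrogate objective (negated so lower is better) plus KL penalty.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_ratio), 1 - clip_ratio)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + kl_penalty * kl
```

With a positive advantage, increasing the ratio beyond 1 + clip_ratio yields no further gain, which is what keeps PPO updates conservative.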

## All Training Tasks

| Task | Data Format | Use Case |
|------|-------------|----------|
| `sft` | alpaca/sharegpt/chatml/llava | Instruction tuning |
| `dpo` | prompt + chosen + rejected | Preference alignment |
| `grpo` | prompts + reward fns | Reasoning (DeepSeek-R1) |
| `kto` | prompt + completion + label | Unpaired preference |
| `orpo` | prompt + chosen + rejected | Reference-free alignment |
| `simpo` | prompt + chosen + rejected | Length-normalized preference |
| `ipo` | prompt + chosen + rejected | Regularized preference |
| `ppo` | prompts + reward model/fn | Full RLHF stage 3 |
| `pretrain` | plain text (raw text) | Continued pre-training |
| `embedding` | anchor + positive (+ negative) | Sentence embeddings |
| `reward_model` | prompt + chosen + rejected | RLHF stage 2 |

## Running Training

```bash
# Start training
soup train --config soup.yaml

# Resume from checkpoint
soup train --config soup.yaml --resume auto
soup train --config soup.yaml --resume ./output/checkpoint-500

# With W&B logging
soup train --config soup.yaml --wandb

# With TensorBoard
soup train --config soup.yaml --tensorboard

# With DeepSpeed (multi-GPU)
soup train --config soup.yaml --deepspeed zero2

# With FSDP2
soup train --config soup.yaml --fsdp full_shard
```