DPO training guide: align LLMs with human preferences

Direct Preference Optimization (DPO) aligns language models with human preferences without training a separate reward model, which makes it simpler and more stable than RLHF with PPO.

When to use DPO

  • You have a dataset of "chosen" vs "rejected" responses
  • You want to reduce hallucinations and off-topic answers
  • You want alignment without the complexity of PPO

Use DPO after SFT. A typical pipeline: Pretrain → SFT → DPO.

1. DPO dataset format

```json
[
  {
    "prompt": "Explain quantum entanglement.",
    "chosen": "Quantum entanglement is a physical phenomenon where...",
    "rejected": "Idk, something quantum."
  }
]
```

Save as preferences.json.

2. Config

```yaml
base:
  model: ./runs/my-sft-model/latest  # Start from SFT checkpoint

task: dpo

data:
  train: preferences.json
  format: dpo

training:
  backend: transformers
  epochs: 1
  learning_rate: 5.0e-7
  batch_size: 2
  gradient_accumulation_steps: 8
  beta: 0.1
  max_seq_length: 2048
  lora:
    enabled: true
    r: 16
    alpha: 32
```

Key DPO hyperparameters:

  • beta: 0.1 — strength of the implicit KL penalty; higher values keep the policy closer to the reference (SFT) model.
  • learning_rate: 5e-7 — DPO needs a much smaller LR than SFT.
  • epochs: 1 — DPO overfits quickly, rarely needs more than 1–2 epochs.
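To make beta concrete, here is a minimal sketch of the per-example DPO loss in plain Python (no framework; in practice the log-probabilities come from the policy and the frozen reference model):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen_ratio - rejected_ratio)).

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp      # log pi/pi_ref for chosen
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(x) = log(1 + exp(-x)); fine for a sketch,
    # real implementations use a numerically stable logsigmoid.
    return math.log1p(math.exp(-logits))
```

Because beta multiplies the log-ratio margin, a larger beta penalizes deviations from the reference model more sharply, which is why training stays closer to the SFT checkpoint.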

3. Train

```bash
soup train --config dpo.yaml
```

4. Evaluate

Compare the DPO model against the SFT baseline:

```bash
soup eval compare \
    --base ./runs/my-sft-model/latest \
    --candidate ./runs/my-dpo-model/latest \
    --judge gpt-4
```
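A pairwise comparison like this typically reduces to a win rate over prompts. A minimal sketch of that aggregation (the verdict strings here are hypothetical, not the actual `soup eval` output format):

```python
def win_rate(verdicts: list[str]) -> float:
    """verdicts: one of 'candidate', 'base', or 'tie' per prompt.

    Returns the candidate's win rate, counting ties as half a win
    (a common convention in pairwise evaluation).
    """
    wins = sum(v == "candidate" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)
```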

DPO variants in Soup CLI

Soup supports several preference-optimization methods; set task: to choose the algorithm:

  • task: dpo — Direct Preference Optimization
  • task: orpo — ORPO (combines SFT + DPO in one step, no reference model)
  • task: simpo — SimPO (length-normalized, no reference model)
  • task: ipo — IPO (squared-loss objective, more robust than DPO to noisy preference labels)
  • task: kto — KTO (works with unpaired binary labels)
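For kto specifically, the data no longer needs to be paired: each record carries a single completion and a binary label. The exact field names below are an assumption for illustration (check the Soup data-format reference for the authoritative schema), but unpaired KTO-style data generally looks like:

```json
[
  {"prompt": "Explain quantum entanglement.", "completion": "Quantum entanglement is a physical phenomenon where...", "label": true},
  {"prompt": "Explain quantum entanglement.", "completion": "Idk, something quantum.", "label": false}
]
```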

Related

  • [Training methods reference](/docs/training)
  • [Fine-tune Llama 3.1 with LoRA](/docs/fine-tune-llama-3-1-lora)