# Multi-GPU training with DeepSpeed ZeRO-3
Use Soup CLI with DeepSpeed ZeRO-3 to train large models across multiple GPUs. This guide shows how to fine-tune a 70B model on 4–8 GPUs.
## Install
```bash
pip install 'soup-cli[deepspeed]'
```

## When to use which ZeRO stage
| Stage | What it shards | When to use |
|---|---|---|
| ZeRO-2 | Optimizer states + gradients | 2–4 GPUs, 7B–13B models |
| ZeRO-3 | Everything incl. parameters | 4+ GPUs, 30B+ models |
| FSDP2 | Fully sharded (PyTorch native) | Alternative to ZeRO-3 |
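As a rule of thumb, per-GPU model-state memory follows the byte counts from the ZeRO paper: 2 bytes for bf16 weights, 2 for bf16 gradients, and 12 for fp32 Adam states per parameter. The sketch below is an illustrative estimate for full fine-tuning (not Soup CLI's actual planner), and it ignores activations, buffers, and fragmentation:

```python
def model_state_gib_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GiB.

    Assumes bf16 params (2 B) + bf16 grads (2 B) + fp32 Adam states
    (12 B: master weights + two moments) per parameter.
    Activations, buffers, and fragmentation are not included.
    """
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage == 2:                        # shard gradients + optimizer states
        total = params + (grads + optim) / n_gpus
    elif stage == 3:                      # shard parameters too
        total = (params + grads + optim) / n_gpus
    else:
        raise ValueError("only stages 2 and 3 are modeled here")
    return total / 1024**3

# A 70B model on 8 GPUs:
print(f"ZeRO-2: {model_state_gib_per_gpu(70e9, 8, 2):.0f} GiB/GPU")
print(f"ZeRO-3: {model_state_gib_per_gpu(70e9, 8, 3):.0f} GiB/GPU")
```

For 70B on 8 GPUs this gives roughly 244 GiB/GPU at ZeRO-2 versus 130 GiB/GPU at ZeRO-3. Even ZeRO-3 exceeds an 80 GB A100 for full fine-tuning, which is why the config below combines it with LoRA and CPU offload.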
## Config for Llama 3.1 70B on 8× A100
```yaml
base:
  model: meta-llama/Meta-Llama-3.1-70B-Instruct
  task: sft
data:
  train: train.json
  format: alpaca
training:
  backend: transformers
  epochs: 2
  learning_rate: 1.0e-4
  batch_size: 1
  gradient_accumulation_steps: 16
  max_seq_length: 4096
  gradient_checkpointing: true
  bf16: true
  distributed:
    strategy: deepspeed
    zero_stage: 3
    offload_optimizer: cpu
    offload_params: cpu
lora:
  enabled: true
  r: 32
  alpha: 64
```

## Launch on 8 GPUs
```bash
soup train --config llama70b.yaml --gpus 8
```

Soup handles the torchrun / DeepSpeed launcher configuration automatically.
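The effective batch per optimizer step is the per-device batch size times gradient accumulation steps times GPU count. Plugging in the values from the config above (plain arithmetic, not a Soup CLI API):

```python
# Values taken from the config above.
per_device_batch = 1        # batch_size
grad_accum = 16             # gradient_accumulation_steps
n_gpus = 8                  # --gpus 8
max_seq_length = 4096

sequences_per_step = per_device_batch * grad_accum * n_gpus
tokens_per_step = sequences_per_step * max_seq_length
print(sequences_per_step)   # 128 sequences per optimizer step
print(tokens_per_step)      # 524288 tokens per optimizer step
```

If you change the GPU count, adjust `gradient_accumulation_steps` to keep the effective batch (and thus the learning-rate schedule) comparable.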
## Alternative: FSDP2
```yaml
training:
  distributed:
    strategy: fsdp2
    sharding: full
```

FSDP2 is PyTorch-native and often simpler for LoRA workloads.
## Ring FlashAttention for 128k+ context
For very long sequences:
```bash
pip install 'soup-cli[ring-attn]'
```

```yaml
training:
  attention: ring
  max_seq_length: 131072
```
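To see why 128k context calls for sequence-parallel attention, here is a rough back-of-the-envelope sketch (illustrative only, not Soup CLI internals; the 8-GPU count matches the setup in this guide):

```python
seq_len = 131072
bytes_bf16 = 2

# A naively materialized attention score matrix is seq_len x seq_len
# per head, per layer:
naive_scores_gib = seq_len * seq_len * bytes_bf16 / 1024**3
print(f"{naive_scores_gib:.0f} GiB per head")   # 32 GiB per head

# Ring attention shards the sequence across GPUs: each rank holds only
# seq_len / n_gpus query tokens and streams K/V blocks around the ring,
# so no rank ever materializes the full matrix.
n_gpus = 8
print(seq_len // n_gpus)    # 16384 query tokens resident per GPU
```

FlashAttention already avoids materializing the score matrix on a single GPU; ring attention additionally splits the sequence itself across ranks, which is what makes 128k+ contexts feasible.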
## Tips

- Always enable `gradient_checkpointing: true` for 70B+ models.
- CPU offload trades speed for VRAM; use it only if you hit OOM.
- Profile first: `soup profile --config llama70b.yaml --gpus 8` estimates memory and throughput before you spend GPU hours.
## Related
- [Backends reference](/docs/backends)
- [Training methods](/docs/training)