## Experiment Tracking

Every `soup train` run is automatically tracked in a local SQLite database (`~/.soup/experiments.db`).
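Because the tracking store is plain SQLite, you can also query it directly with standard tools. The sketch below assumes a hypothetical `runs` table with `run_id`, `task`, `status`, and `final_loss` columns — the real schema is not documented here, so inspect it first with `sqlite3 ~/.soup/experiments.db ".schema"`.

```python
import sqlite3

# Stand-in for ~/.soup/experiments.db, populated with an ASSUMED schema --
# check the real one with: sqlite3 ~/.soup/experiments.db ".schema"
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (run_id TEXT, task TEXT, status TEXT, final_loss REAL)")
db.execute("INSERT INTO runs VALUES ('run_20260223_143052_a1b2', 'sft', 'completed', 0.42)")

# Find the completed run with the lowest final loss
row = db.execute(
    "SELECT run_id, final_loss FROM runs "
    "WHERE status = 'completed' ORDER BY final_loss LIMIT 1"
).fetchone()
print(row)  # ('run_20260223_143052_a1b2', 0.42)
```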

### List Runs

```bash
soup runs
```

Shows all training runs with task, model, status, and final loss.

### Run Details

```bash
soup runs show run_20260223_143052_a1b2
```

Shows detailed info for one run, including its config, metrics, and an ASCII loss curve.

### Compare Runs

```bash
soup runs compare run_1 run_2
```

Side-by-side comparison of two runs with loss curves and metrics.

### Delete Runs

```bash
soup runs delete run_1
```

## Model Evaluation

Soup includes a comprehensive evaluation suite (v0.19.0+); install the eval extras first:

```bash
pip install 'soup-cli[eval]'

# Run benchmarks (mmlu, gsm8k, hellaswag, etc.)
soup eval benchmark --model ./output --benchmarks mmlu,gsm8k

# Custom eval tasks from JSONL
soup eval custom --model ./output --tasks ./eval_tasks.jsonl

# LLM-as-a-judge evaluation
soup eval judge --model ./output --prompts ./prompts.jsonl --judge gpt-4o

# Auto-eval from soup.yaml config
soup eval auto --config soup.yaml

# Compare eval results between runs
soup eval compare run_1 run_2

# Local leaderboard across models
soup eval leaderboard

# Human A/B evaluation with Elo ratings
soup eval human --model-a ./model_v1 --model-b ./model_v2 --prompts ./prompts.jsonl
```
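The exact JSONL schema for custom eval tasks isn't documented here; a plausible shape, with one task per line, might look like the following (`prompt` and `expected` are assumed field names — check `soup eval custom --help` for the real ones):

```jsonl
{"prompt": "What is 2 + 2?", "expected": "4"}
{"prompt": "What is the capital of France?", "expected": "Paris"}
```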

## Hyperparameter Sweep

Search for the best hyperparameters:

```bash
# Grid search
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --param lora_r=8,16,32

# Random search with max runs
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --strategy random --max-runs 5

# Preview without running
soup sweep --config soup.yaml --param lr=1e-5,2e-5 --dry-run

# Early stopping: skip remaining runs if loss exceeds 1.5x best
soup sweep --config soup.yaml --param lr=1e-5,2e-5,5e-5 --early-stop 1.5
```
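Grid search enumerates the Cartesian product of the parameter lists (3 learning rates × 3 LoRA ranks = 9 runs in the first example), while random search samples at most `--max-runs` combinations from that grid. A minimal sketch of the difference:

```python
import itertools
import random

lrs = [1e-5, 2e-5, 5e-5]
lora_rs = [8, 16, 32]

# Grid search: every combination is trained
grid = list(itertools.product(lrs, lora_rs))
print(len(grid))  # 9

# Random search: sample a fixed budget of combinations (--max-runs 5)
random.seed(0)
sampled = random.sample(grid, k=5)
```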

## Model Comparison

Compare outputs of two models side-by-side:

```bash
soup diff --model-a ./model_v1 --model-b ./model_v2 --prompt "Explain gravity"
soup diff --model-a ./base --model-b ./finetuned --prompts test_prompts.jsonl
soup diff --model-a ./a --model-b ./b --prompts prompts.txt --output results.jsonl
```
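A prompts file in `.txt` form is presumably one prompt per line; for `.jsonl`, a plausible shape (the `prompt` field name is an assumption, not documented here) is:

```jsonl
{"prompt": "Explain gravity"}
{"prompt": "Summarize the theory of relativity in one sentence."}
```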

## Batch Inference

Run a model on a list of prompts:

```bash
soup infer --model ./output --input prompts.jsonl --output results.jsonl
soup infer --model ./output --input prompts.txt --output results.jsonl \
  --max-tokens 512 --temperature 0.3
```

Output is JSONL with `prompt`, `response`, and `tokens_generated` fields.
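For example, one output line might look like this (the response text and token count are illustrative, not real output):

```jsonl
{"prompt": "Explain gravity", "response": "Gravity is the attractive force between masses...", "tokens_generated": 87}
```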

## Training Profiler (v0.23.0+)

Estimate memory, speed, and GPU requirements before training:

```bash
soup profile --model meta-llama/Llama-3.1-8B --task sft --quantization 4bit
soup profile --config soup.yaml
```

Shows estimated GPU memory, training speed, and hardware recommendations.
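To build intuition for what the profiler is estimating, here is a back-of-envelope VRAM calculation for 4-bit QLoRA-style SFT of an 8B model. The adapter size and overhead figures are rough assumptions, not the profiler's actual model — `soup profile` accounts for far more detail.

```python
# Back-of-envelope VRAM estimate: 4-bit base weights + LoRA training state.
params = 8e9                              # 8B-parameter base model
weight_gb = params * 0.5 / 1e9            # 4-bit weights: ~0.5 bytes/param -> ~4 GB
adapter_params = 50e6                     # LoRA adapter size (assumed)
adapter_gb = adapter_params * 16 / 1e9    # weights + grads + Adam moments, ~16 bytes/param
overhead_gb = 4.0                         # activations, caches, CUDA context (rough guess)
total_gb = weight_gb + adapter_gb + overhead_gb
print(round(total_gb, 1))  # 8.8
```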

## Adapter Management (v0.22.0+)

```bash
# Scan directory for LoRA adapters
soup adapters list --path ./experiments

# Show adapter metadata (base model, rank, size)
soup adapters info ./output

# Compare two adapters side-by-side
soup adapters compare ./adapter_v1 ./adapter_v2
```

## Logging Integrations

### TensorBoard

```bash
soup train --config soup.yaml --tensorboard
tensorboard --logdir ./output/runs/
```

### Weights & Biases

```bash
soup train --config soup.yaml --wandb
```

> `--tensorboard` and `--wandb` cannot be used together.