# Data Tools
Soup includes powerful CLI tools for preparing training datasets.
## Inspect

```bash
soup data inspect ./data/train.jsonl
```

Shows dataset statistics: sample count, token distribution, and field analysis. For vision datasets, image statistics (count, formats, missing files) are shown automatically.
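Conceptually, the core of `inspect` can be approximated in a few lines of Python. This is a sketch, not soup's actual implementation — the function name is illustrative, and token counts here use whitespace splitting rather than a real tokenizer:

```python
import json

def inspect_jsonl(path):
    """Rough dataset stats: sample count, token distribution, field coverage."""
    counts, field_presence = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Crude token count: whitespace-split every string field
            tokens = sum(len(str(v).split()) for v in record.values())
            counts.append(tokens)
            for key in record:
                field_presence[key] = field_presence.get(key, 0) + 1
    return {
        "samples": len(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
        "mean_tokens": sum(counts) / len(counts),
        "fields": field_presence,
    }
```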
## Validate

```bash
soup data validate ./data/train.jsonl
soup data validate ./data/train.jsonl --format alpaca
```

Checks for missing fields, encoding issues, and format compliance. Auto-detects the format when `--format` is not specified.
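Format auto-detection typically keys off each record's field names. A minimal sketch, assuming the conventional key layouts for the three formats (soup's detector may use additional signals):

```python
def detect_format(record: dict) -> str:
    """Guess a sample's dataset format from its keys (heuristic sketch)."""
    if "conversations" in record:     # sharegpt: [{"from": ..., "value": ...}]
        return "sharegpt"
    if "messages" in record:          # chatml: [{"role": ..., "content": ...}]
        return "chatml"
    if "instruction" in record and "output" in record:
        return "alpaca"
    return "unknown"
```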
## Convert

```bash
soup data convert ./data/train.jsonl --to sharegpt --output converted.jsonl
```

Transforms between alpaca, sharegpt, and chatml formats.
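As an example of what such a conversion involves, here is a sketch of one direction (alpaca to sharegpt) using the conventional field names for those formats; soup's converter handles all pairings:

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Convert one alpaca record to sharegpt's conversations layout."""
    prompt = record["instruction"]
    if record.get("input"):
        # Alpaca's optional input is appended to the instruction as context
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```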
## Merge

```bash
soup data merge data1.jsonl data2.jsonl --output merged.jsonl --shuffle
```

Combines multiple datasets with optional shuffling.
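The merge itself is simple concatenation plus an optional shuffle; a self-contained sketch (the seeding behavior is an assumption, not a documented soup feature):

```python
import json
import random

def merge_jsonl(paths, output, shuffle=False, seed=None):
    """Concatenate several JSONL datasets, optionally shuffling the result."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    if shuffle:
        random.Random(seed).shuffle(records)  # seeded for reproducibility
    with open(output, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```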
## Deduplicate

```bash
# Requires: pip install 'soup-cli[data]'
soup data dedup ./data/train.jsonl --threshold 0.8
```

Removes near-duplicate samples using MinHash.
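MinHash approximates the Jaccard similarity between samples' word shingles, so near-duplicates can be detected without exact comparison. A simplified, self-contained sketch of the idea — soup's actual implementation almost certainly uses an optimized library rather than MD5-salted hashing:

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=3):
    """MinHash signature over word shingles (simplified sketch)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    # One "hash function" per seed: MD5 of the seeded shingle, take the minimum
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles)
        for seed in range(num_hashes)
    )

def dedup(texts, threshold=0.8):
    """Keep texts whose estimated similarity to every kept text is below threshold."""
    kept, sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        # Estimated Jaccard = fraction of matching signature positions
        if all(sum(a == b for a, b in zip(sig, prev)) / len(sig) < threshold
               for prev in sigs):
            kept.append(text)
            sigs.append(sig)
    return kept
```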
## Extended Statistics

```bash
soup data stats ./data/train.jsonl
```

Shows length distributions with histograms, token counts, and language detection.
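A terminal length histogram can be rendered by bucketing word counts; a sketch of that idea (bucket width and ASCII rendering are illustrative choices, not soup's actual output format):

```python
import json

def length_histogram(path, bucket=50):
    """Print a bucketed word-length histogram for a JSONL dataset (sketch)."""
    buckets = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            words = sum(len(str(v).split()) for v in record.values())
            lo = (words // bucket) * bucket  # bucket lower bound
            buckets[lo] = buckets.get(lo, 0) + 1
    for lo in sorted(buckets):
        print(f"{lo:>5}-{lo + bucket - 1:<5} {'#' * buckets[lo]} ({buckets[lo]})")
    return buckets
```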
## Synthetic Data Generation

```bash
# Generate using the OpenAI API
soup data generate --prompt "Create math word problems" --count 100 --format alpaca

# Use a different model
soup data generate --prompt "Medical Q&A pairs" --model gpt-4o --count 500

# Deduplicate against existing data
soup data generate --prompt "..." --count 200 --dedup-with existing.jsonl

# Use seed examples to guide style
soup data generate --prompt "..." --seed examples.jsonl --count 100

# Use a local server (soup serve, Ollama, etc.)
soup data generate --prompt "..." --provider server --api-base http://localhost:11434/v1
```

## Multi-Provider Support (v0.20.0+)
```bash
# Generate via a local Ollama instance
soup data generate --prompt "..." --provider ollama --model llama3.1
soup data generate --prompt "..." --ollama-model llama3.1  # shorthand

# Generate via the Anthropic Claude API (set the ANTHROPIC_API_KEY env var)
soup data generate --prompt "..." --provider anthropic --model claude-3-haiku-20240307

# Generate via a local vLLM server
soup data generate --prompt "..." --provider vllm --model meta-llama/Llama-3.1-8B-Instruct
```

## Domain Templates (v0.20.0+)
```bash
# Code instruction pairs (Python, JS, Go, Rust, Java)
soup data generate --prompt "..." --template code --language Python --task-type function

# Multi-turn conversations
soup data generate --prompt "..." --template conversation --turns 6 --topic "science"

# QA from a context document
soup data generate --prompt "..." --template qa --context document.txt

# Preference data (DPO/KTO/ORPO)
soup data generate --prompt "..." --template preference --pref-task dpo

# Chain-of-thought reasoning (GRPO)
soup data generate --prompt "..." --template reasoning --domain math
```

## Quality Pipeline (v0.20.0+)
```bash
# Auto-validate after generation (remove malformed entries)
soup data generate --prompt "..." --validate

# Auto-filter by quality (coherence scoring)
soup data generate --prompt "..." --filter

# Auto-dedup (MinHash; requires: pip install 'soup-cli[data]')
soup data generate --prompt "..." --dedup

# Full quality pipeline: validate + filter + dedup
soup data generate --prompt "..." --quality-pipeline
```

## Quality Filter
```bash
# Filter by coherence score
soup data filter ./data/train.jsonl --coherence 0.3

# Filter by perplexity and coherence
soup data filter ./data/train.jsonl --perplexity 500 --coherence 0.3

# Add scores without removing samples
soup data filter ./data/train.jsonl --score-only
```

Uses perplexity and coherence scoring to identify low-quality samples.
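Real perplexity and coherence scoring requires a language model; as a stand-in, the filtering flow can be sketched with a crude lexical-diversity heuristic. Everything here (the function names, the heuristic, the `coherence_score` field) is illustrative, not soup's actual scoring:

```python
def coherence_score(text: str) -> float:
    """Crude stand-in for coherence: lexical diversity of the sample.

    Highly repetitive text scores near 0; varied text scores near 1.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

def filter_records(records, coherence=0.3, score_only=False):
    """Drop low-scoring samples, or just annotate scores when score_only is set."""
    out = []
    for record in records:
        text = " ".join(str(v) for v in record.values())
        score = coherence_score(text)
        if score_only:
            out.append({**record, "coherence_score": score})
        elif score >= coherence:
            out.append(record)
    return out
```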
## Data Sampling (v0.23.0+)

```bash
# Random sample
soup data sample ./data/train.jsonl --strategy random --count 1000

# Diverse sample (TF-IDF clustering)
soup data sample ./data/train.jsonl --strategy diverse --count 500

# Hard examples (by length)
soup data sample ./data/train.jsonl --strategy hard --count 500
```

## Data Splitting (v0.23.0+)
```bash
# Split into train/val/test
soup data split ./data/train.jsonl --ratio 0.8,0.1,0.1

# Stratified split
soup data split ./data/train.jsonl --ratio 0.9,0.1 --stratify
```

## HuggingFace Dataset Hub (v0.24.0+)
```bash
# Search for datasets
soup data search "math reasoning"

# Preview remote dataset metadata
soup data preview tatsu-lab/alpaca

# Download to a local JSONL file
soup data download tatsu-lab/alpaca --output ./data/alpaca.jsonl --samples 1000
```

## Dataset Registry (v0.24.0+)
Register local datasets by name for use in `soup.yaml`:

```bash
# Register a dataset
soup data register my-chat-data --path ./data/chat.jsonl --format chatml

# List registered datasets
soup data registry

# Use in config: data.train: registry:my-chat-data

# Unregister a dataset
soup data unregister my-chat-data
```
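Based on the `data.train: registry:my-chat-data` comment above, referencing a registered dataset in `soup.yaml` would look roughly like this (the nesting beyond `data.train` is an assumption, not confirmed by these docs):

```yaml
# soup.yaml (sketch)
data:
  train: registry:my-chat-data
```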