Vision & Audio Fine-Tuning
Vision / Multimodal
Fine-tune vision-language models (LLaMA-3.2-Vision, Qwen2-VL, Pixtral):
```bash
pip install 'soup-cli[vision]'
soup init --template vision
```

```yaml
base: meta-llama/Llama-3.2-11B-Vision-Instruct
task: sft
modality: vision

data:
  train: ./data/vision_train.jsonl
  format: llava
  image_dir: ./data/images
  val_split: 0.1

training:
  epochs: 3
  lr: 1e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```

Vision Data Formats
LLaVA:

```json
{"image": "photo.jpg", "conversations": [{"from": "human", "value": "<image>\nDescribe this image."}, {"from": "gpt", "value": "A cat on a mat."}]}
```

ShareGPT4V:

```json
{"image": "chart.png", "conversations": [{"from": "human", "value": "<image>\nWhat does this show?"}, {"from": "gpt", "value": "Quarterly revenue."}]}
```

`soup data inspect` automatically shows image statistics (count, formats, missing files) for vision datasets.
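To check a dataset by hand before training, the same kind of statistics can be approximated with a few lines of Python. This is a minimal sketch, not the code behind `soup data inspect`: the function name `image_stats` is hypothetical, and it assumes one JSON object per line with an `image` field that is a path relative to `image_dir`, as in the examples above.

```python
import json
from collections import Counter
from pathlib import Path


def image_stats(jsonl_path, image_dir):
    """Summarize the images referenced by a LLaVA-style JSONL file.

    Hypothetical helper for illustration; assumes each record has an
    "image" field holding a path relative to image_dir.
    """
    image_dir = Path(image_dir)
    formats = Counter()
    missing = []
    count = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            name = record["image"]
            count += 1
            # Count by file extension, e.g. "jpg" or "png"
            formats[Path(name).suffix.lower().lstrip(".")] += 1
            if not (image_dir / name).is_file():
                missing.append(name)
    return {"count": count, "formats": dict(formats), "missing": missing}
```

Calling `image_stats("./data/vision_train.jsonl", "./data/images")` returns a dictionary with the record count, a per-extension tally, and the list of referenced files that were not found.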
Audio / Speech
Fine-tune audio-language models (Qwen2-Audio, Whisper):
```bash
pip install 'soup-cli[audio]'
soup init --template audio
```

```yaml
base: Qwen/Qwen2-Audio-7B-Instruct
task: sft
modality: audio

data:
  train: ./data/audio_train.jsonl
  format: audio
  audio_dir: ./data/audio
  val_split: 0.1

training:
  epochs: 3
  lr: 1e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```

Audio Data Format
```json
{"audio": "recording.wav", "messages": [{"role": "user", "content": "Transcribe this audio."}, {"role": "assistant", "content": "Hello world."}]}
```

Continued Pre-training
Continue training on raw text for domain adaptation:
```yaml
base: meta-llama/Llama-3.1-8B
task: pretrain

data:
  train: ./data/corpus.jsonl  # {"text": "..."} or plain .txt files
  format: plaintext
  max_length: 4096

training:
  epochs: 1
  lr: 1e-5
  quantization: 4bit
```

Embedding Fine-Tuning
Train sentence embedding models with contrastive learning:
```yaml
base: sentence-transformers/all-MiniLM-L6-v2
task: embedding

data:
  train: ./data/embeddings.jsonl
  format: embedding

training:
  embedding_loss: contrastive  # contrastive, triplet, cosine
  embedding_pooling: mean
  embedding_margin: 0.5
  embedding_temperature: 0.05
```

Embedding Data Format
```json
{"anchor": "What is Python?", "positive": "Python is a programming language."}
{"anchor": "What is Python?", "positive": "A programming language.", "negative": "A type of snake."}
```
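The two lines above show the two record shapes: an anchor/positive pair, and a triplet that adds a `negative`. A quick pre-flight check can catch malformed rows before a run starts. The following is a minimal sketch under that reading of the format; `classify_embedding_record` is a hypothetical helper, not part of soup-cli, and the tool's own validation may differ.

```python
import json


def classify_embedding_record(line):
    """Return "pair" or "triplet" for one JSONL line, or raise ValueError.

    Hypothetical helper; assumes "anchor" and "positive" are required
    non-empty strings and "negative" is optional.
    """
    record = json.loads(line)
    for key in ("anchor", "positive"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    if "negative" in record:
        value = record["negative"]
        if not isinstance(value, str) or not value.strip():
            raise ValueError("empty field: negative")
        return "triplet"
    return "pair"
```

Running it over every line of `./data/embeddings.jsonl` and tallying the results shows how many rows carry an explicit negative.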