Vision & Audio Fine-Tuning

Vision / Multimodal

Fine-tune vision-language models (Llama-3.2-Vision, Qwen2-VL, Pixtral):

```bash
pip install 'soup-cli[vision]'
soup init --template vision
```
```yaml
base: meta-llama/Llama-3.2-11B-Vision-Instruct
task: sft
modality: vision

data:
  train: ./data/vision_train.jsonl
  format: llava
  image_dir: ./data/images
  val_split: 0.1

training:
  epochs: 3
  lr: 1e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```
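
To get a feel for what the `lora` values mean, here is a small sketch based on the standard LoRA formulation, where the low-rank update is scaled by `alpha / r` and each adapted weight matrix gains `r * (d_in + d_out)` trainable parameters. Whether soup-cli passes `r` and `alpha` through unchanged is an assumption, and the 4096x4096 projection size is a hypothetical example, not a property of any listed model.

```python
def lora_scaling(r, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


# Values from the config above: r=64, alpha=16.
scale = lora_scaling(r=64, alpha=16)

# Hypothetical 4096 x 4096 projection, used only for illustration.
extra_params = lora_param_count(d_in=4096, d_out=4096, r=64)
```

With `r: 64` and `alpha: 16` the update is scaled by 0.25; raising `alpha` or lowering `r` makes the adapter's contribution larger relative to the frozen weights.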

Vision Data Formats

LLaVA:

```json
{"image": "photo.jpg", "conversations": [{"from": "human", "value": "<image>\nDescribe this image."}, {"from": "gpt", "value": "A cat on a mat."}]}
```

ShareGPT4V:

```json
{"image": "chart.png", "conversations": [{"from": "human", "value": "<image>\nWhat does this show?"}, {"from": "gpt", "value": "Quarterly revenue."}]}
```

For vision datasets, `soup data inspect` also reports image statistics: image count, file formats, and missing files.
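
If you want to check a dataset before handing it to the CLI, a standalone check along these lines can catch the most common problems. This is a minimal sketch, not soup-cli's own validation: `check_llava_record` and `check_llava_file` are hypothetical helpers, and resolving `image` relative to `image_dir` is an assumption based on the config above.

```python
import json
from pathlib import Path


def check_llava_record(record, image_dir):
    """Return a list of problems found in one LLaVA-style record."""
    problems = []
    image = record.get("image")
    if not image:
        problems.append("missing 'image' field")
    elif not (Path(image_dir) / image).is_file():
        # Assumption: image paths are relative to image_dir.
        problems.append(f"image not found: {image}")

    turns = record.get("conversations") or []
    if not turns:
        problems.append("empty 'conversations'")
        return problems
    if turns[0].get("from") != "human":
        problems.append("first turn is not from 'human'")
    human_values = [t.get("value", "") for t in turns if t.get("from") == "human"]
    if not any("<image>" in value for value in human_values):
        problems.append("no '<image>' placeholder in any human turn")
    return problems


def check_llava_file(jsonl_path, image_dir):
    """Map 1-based line numbers to problems for every bad line in the file."""
    report = {}
    with open(jsonl_path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                report[lineno] = [f"invalid JSON: {exc.msg}"]
                continue
            problems = check_llava_record(record, image_dir)
            if problems:
                report[lineno] = problems
    return report
```

Calling `check_llava_file("./data/vision_train.jsonl", "./data/images")` returns an empty dict when every record passes these checks.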

Audio / Speech

Fine-tune audio-language and speech models (Qwen2-Audio, Whisper):

```bash
pip install 'soup-cli[audio]'
soup init --template audio
```
```yaml
base: Qwen/Qwen2-Audio-7B-Instruct
task: sft
modality: audio

data:
  train: ./data/audio_train.jsonl
  format: audio
  audio_dir: ./data/audio
  val_split: 0.1

training:
  epochs: 3
  lr: 1e-5
  quantization: 4bit
  lora:
    r: 64
    alpha: 16
```

Audio Data Format

```json
{"audio": "recording.wav", "messages": [{"role": "user", "content": "Transcribe this audio."}, {"role": "assistant", "content": "Hello world."}]}
```
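
If your transcripts live somewhere else (a spreadsheet, a database), records in the shape above are easy to generate. The following is a rough sketch under that assumption; `make_audio_record` and `write_audio_jsonl` are hypothetical helpers and not part of soup-cli, and the default prompt is simply copied from the example record.

```python
import json


def make_audio_record(audio_file, transcript, prompt="Transcribe this audio."):
    """Build one record in the audio format shown above."""
    return {
        "audio": audio_file,
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": transcript},
        ],
    }


def write_audio_jsonl(pairs, out_path):
    """Write (audio_file, transcript) pairs as JSONL; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for audio_file, transcript in pairs:
            record = make_audio_record(audio_file, transcript)
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_audio_jsonl([("recording.wav", "Hello world.")], "./data/audio_train.jsonl")` writes a single line matching the record above.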

Continued Pre-training

Continue training on raw text for domain adaptation:

```yaml
base: meta-llama/Llama-3.1-8B
task: pretrain

data:
  train: ./data/corpus.jsonl     # {"text": "..."} or plain .txt files
  format: plaintext
  max_length: 4096

training:
  epochs: 1
  lr: 1e-5
  quantization: 4bit
```
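
The config comment mentions both `{"text": "..."}` JSONL and plain `.txt` files. If you prefer a single JSONL corpus, a folder of text files can be converted with a few lines of Python. This is only a sketch: `texts_to_jsonl` is a hypothetical helper, and treating each file as one document (rather than splitting it into chunks) is an assumption.

```python
import json
from pathlib import Path


def texts_to_jsonl(txt_dir, out_path):
    """Write one {"text": ...} line per non-empty .txt file; return lines written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort for a stable, reproducible document order.
        for txt_file in sorted(Path(txt_dir).glob("*.txt")):
            text = txt_file.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip empty files
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written
```

For example, `texts_to_jsonl("./data/raw", "./data/corpus.jsonl")` would produce a file usable as `train` in the config above; `./data/raw` is a placeholder path.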

Embedding Fine-Tuning

Train sentence embedding models with contrastive learning:

```yaml
base: sentence-transformers/all-MiniLM-L6-v2
task: embedding

data:
  train: ./data/embeddings.jsonl
  format: embedding

training:
  embedding_loss: contrastive    # contrastive, triplet, cosine
  embedding_pooling: mean
  embedding_margin: 0.5
  embedding_temperature: 0.05
```
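
To build intuition for `embedding_temperature` and `embedding_margin`, here is a pure-Python sketch of two widely used formulations: an in-batch contrastive (InfoNCE-style) loss and a triplet hinge loss on cosine distance. Which exact formulas soup-cli uses for `contrastive` and `triplet` is an assumption here; the function names are illustrative and the vectors stand in for pooled sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss.

    positives[i] is the match for anchors[i]; every other row in the
    batch acts as a negative. Lower temperature sharpens the softmax.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: the positive must be closer than the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In both cases the loss approaches zero once matching pairs are clearly more similar than non-matching ones; a larger margin or a smaller temperature demands a wider gap before that happens.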

Embedding Data Format

```json
{"anchor": "What is Python?", "positive": "Python is a programming language."}
{"anchor": "What is Python?", "positive": "A programming language.", "negative": "A type of snake."}
```
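
The two rows above differ only in the `negative` field. Before training, it can help to count how many rows of each shape a file contains, for instance to confirm that a dataset intended for `triplet` loss actually carries negatives (that `triplet` requires them is an assumption, not something stated above). The helpers below are a hypothetical sketch, not part of soup-cli.

```python
import json


def classify_embedding_row(row):
    """Return 'triplet', 'pair', or 'invalid' for one parsed JSONL row."""

    def filled(key):
        value = row.get(key)
        return isinstance(value, str) and bool(value.strip())

    if not (filled("anchor") and filled("positive")):
        return "invalid"
    if "negative" in row:
        # A present but empty negative is treated as a malformed row.
        return "triplet" if filled("negative") else "invalid"
    return "pair"


def summarize_embedding_rows(lines):
    """Count row shapes across an iterable of JSONL lines."""
    counts = {"pair": 0, "triplet": 0, "invalid": 0}
    for line in lines:
        if line.strip():
            counts[classify_embedding_row(json.loads(line))] += 1
    return counts
```

Passing an open file handle works, since iterating a file yields its lines: `summarize_embedding_rows(open("./data/embeddings.jsonl", encoding="utf-8"))`.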