---
license: cc-by-4.0
language:
  - en
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: LCO-Embedding/LCO-Embedding-Omni-7B
tags:
  - audio
  - speech
  - emotion
  - clap
  - contrastive
  - voice
  - sentence-transformers
---

# VoiceCLAP-Large

Voice-text contrastive embedding model — the larger of the two anchors released with VoiceNet.

VoiceCLAP-Large is a single-tower model: a rank-16 LoRA finetune of LCO-Embedding-Omni-7B (Qwen2.5-Omni-Thinker-7B backbone with a sentence-transformer last-token-pooling head) trained with the symmetric InfoNCE loss. The audio and text embeddings are produced by the same backbone — the modality is determined by what is fed in via the multimodal chat template.
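
For orientation, "last-token pooling" simply takes the hidden state of the final real (non-padded) token as the sequence embedding before L2-normalisation. The sketch below is illustrative only, not the model's actual pooling code, and assumes a right-padded batch with an attention mask:

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padded token per sequence, then L2-normalise.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
    """
    last_idx = attention_mask.sum(dim=1) - 1                          # index of last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    pooled = hidden_states[batch_idx, last_idx]                       # (batch, dim)
    return F.normalize(pooled, p=2, dim=-1)
```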

| Property | Value |
| --- | --- |
| Architecture | single-tower Omni-Embedding (Qwen2.5-Omni-Thinker-7B + ST last-token-pool) |
| Adaptation | rank-16 LoRA (alpha 32, dropout 0.05), merged into the released weights |
| Joint embedding | 3584-d, L2-normalised |
| Loss | symmetric InfoNCE (all-gather negatives) |
| Total parameters | ~7 B (full merged model) |
| Epochs | 1 |
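
The loss row refers to the standard symmetric (bidirectional) InfoNCE objective over in-batch negatives; in multi-GPU training the negatives are gathered across all ranks before the similarity matrix is built. The sketch below is a minimal single-process illustration, not the training code, and the temperature value is a placeholder:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                       temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over in-batch negatives.

    audio_emb, text_emb: (batch, dim) L2-normalised embeddings of paired clips/captions.
    In distributed training the embeddings would first be all-gathered across ranks.
    """
    logits = audio_emb @ text_emb.T / temperature            # (batch, batch) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_audio_to_text = F.cross_entropy(logits, targets)    # match each clip to its caption
    loss_text_to_audio = F.cross_entropy(logits.T, targets)  # match each caption to its clip
    return (loss_audio_to_text + loss_text_to_audio) / 2
```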

## Training data

Trained for 1 epoch on the open voiceclap_10_safe mixture (9 datasets) used in the VoiceNet paper:

- emolia-balanced-5M-subset (annotated subset of Emilia)
- laions_got_talent_clean_with_captions
- majestrino-data
- synthetic_vocal_bursts
- improved_synthetic_vocal_bursts
- ears
- expresso
- voxceleb1
- voxceleb2

All clips are captioned with MOSS-Audio-8B-Thinking-derived dense vocal-style captions covering emotions, talking-style attributes, and demographics.

## Standalone load example

The model uses the SentenceTransformer multimodal API. Both sentence-transformers and transformers are available on PyPI; the audio snippet below additionally uses soundfile to read the clip, and no other dependencies are required.

```python
from sentence_transformers import SentenceTransformer
import soundfile as sf

model = SentenceTransformer("VoiceNet/voiceclap-large", trust_remote_code=True)

# Text embedding (3584-d, L2-normalised)
text_emb = model.encode(["a calm and steady voice"])

# Audio embedding: pass a dict with raw samples + sampling rate.
arr, sr = sf.read("clip.wav")
audio_emb = model.encode([{"array": arr, "sampling_rate": sr}])

# Cosine similarity (embeddings are already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
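
Because audio and text embeddings share the same L2-normalised space, ranking several candidate captions against a clip is a single matrix product. A small illustrative extension of the snippet above (the captions are placeholders):

```python
import numpy as np

captions = [
    "a calm and steady voice",
    "an excited, fast-paced speaker",
    "a whispering, hesitant voice",
]
cap_embs = model.encode(captions)              # (3, 3584) L2-normalised text embeddings

scores = (audio_emb @ cap_embs.T).squeeze(0)   # cosine similarity of the clip to each caption
best = int(np.argmax(scores))
print(f"best match: {captions[best]!r} (score={scores[best]:.3f})")
```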

For convenience the LoRA adapter is also shipped under adapter/ so it can be reapplied to other LCO-Embedding-Omni-7B forks; the merged model.safetensors already contains it.
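
If you prefer to apply the adapter yourself instead of using the merged weights, a hedged sketch with PEFT follows. It assumes the files under adapter/ are in standard PEFT LoRA format and that your backbone fork loads via AutoModel with remote code; adjust the class, repo id, and subfolder to your setup:

```python
from transformers import AutoModel
from peft import PeftModel

# Load a compatible LCO-Embedding-Omni-7B backbone (class and path are assumptions; adjust for your fork).
base = AutoModel.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B", trust_remote_code=True)

# Attach the rank-16 LoRA adapter shipped under adapter/ in this repo.
lora_model = PeftModel.from_pretrained(base, "VoiceNet/voiceclap-large", subfolder="adapter")

# Optionally fold the adapter back into the base weights.
merged = lora_model.merge_and_unload()
```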

## Citation

If you use this model, please cite the VoiceNet paper.