VoiceCLAP-Small-v2
Voice-text contrastive (CLAP-style) embedding model, the successor to
laion/voiceclap-small,
trained with emotion-led MOSS-Audio short captions. It improves on v1 on
every benchmark we measure, at identical size and inference cost.
Same dual-tower architecture as v1: a
BUD-E-Whisper_V1.1 audio
encoder paired with
sentence-transformers/all-MiniLM-L6-v2
on the text side, joined by an MLP projection on each side and trained with
the SigLIP sigmoid contrastive loss.
| Property | Value |
|---|---|
| Architecture | dual-tower CLAP (BUD-E-Whisper-Small + MiniLM-L6-v2) |
| Audio encoder | Whisper-style: 12 layers × 768 dim × 12 heads, 80-mel input @ 16 kHz |
| Text encoder | BERT/MiniLM, 6 layers × 384 dim, mean-pooled |
| Joint embedding | 768-d, L2-normalised |
| Loss | SigLIP (sigmoid contrastive) |
| Total parameters | ~110 M |
| Training | 40 M samples (20 epochs × 2 M), best checkpoint epoch 19 |
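To make the loss row concrete, here is a minimal, dependency-free sketch of a SigLIP-style sigmoid contrastive loss over a batch of paired embeddings. It is an illustration, not the training code: the function names are hypothetical, and the default `temperature=10.0` and `bias=-10.0` are the initial values from the SigLIP paper, assumed here rather than taken from this model's configuration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(audio, text, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss, summed over all pairs and divided by batch size.

    audio, text: lists of L2-normalised vectors; index i in both lists
    is a matching (audio, caption) pair, every other combination is a negative.
    """
    n = len(audio)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * t for a, t in zip(audio[i], text[j]))
            logit = temperature * dot + bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 for negatives
            total -= log_sigmoid(label * logit)
    return total / n


# Toy batch: two orthogonal unit vectors, first aligned, then with captions swapped.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = siglip_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Unlike a softmax contrastive loss, every audio-text pair is scored as an independent binary decision, so no normalisation across the batch is needed.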
What's new vs v1
v1 sampled k=2 uniformly-chosen MOSS-Audio attribute sentences per clip
as the caption. v2 replaces this with an emotion-led short caption: the
MOSS-Audio-8B-Thinking EMO sentence (a direct natural-language description
of the emotional state) plus one randomly sampled talking-style sentence,
re-drawn every epoch. Captions stay 50/50 blended with each corpus's original
captions. The emotion-first structure concentrates contrastive signal on the
emotion subspace without sacrificing style coverage.
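The sampling scheme described above could be sketched roughly as follows. This is a hypothetical helper, not the released pipeline: it assumes the 50/50 blend is a per-sample coin flip and that the EMO sentence is placed before the style sentence. The example sentences are made up for illustration.

```python
import random


def sample_caption(original, emo_sentence, style_sentences, rng):
    """Draw one training caption for a clip (hypothetical helper).

    With probability 0.5 return the corpus's original caption; otherwise
    return the EMO sentence followed by one randomly chosen style sentence.
    """
    if rng.random() < 0.5:
        return original
    return f"{emo_sentence} {rng.choice(style_sentences)}"


# Made-up annotations for a single clip.
original = "A woman talks about her weekend."
emo = "The speaker sounds quietly proud."
styles = ["She speaks slowly.", "Her tone is warm.", "The delivery is soft."]

rng = random.Random(0)
# Calling the sampler once per epoch re-draws the caption each time.
for epoch in range(3):
    print(epoch, sample_caption(original, emo, styles, rng))
```

Re-drawing per epoch means a clip is paired with different style sentences over training, while the emotion sentence stays fixed whenever the MOSS-derived caption is chosen.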
Evaluation
| Benchmark | v1 (released) | v2 (this model) | Δ |
|---|---|---|---|
| EmoNet-Voice top-1 | 0.0902 | 0.1015 | +13% rel |
| EmoNet-Voice Spearman ρ | 0.2280 | 0.2561 | +12% rel |
| MAEB-voice mean (8 tasks) | 0.3861 | 0.3893 | +0.8% |
The ρ gain also clears every arm of the v1 caption-sampling sweep (best: 0.2399 at k=2). MAEB-voice shows no general-speech regression.
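The change column can be reproduced from the two score columns, assuming it reports relative change, (v2 - v1) / v1. A short check using the table's own numbers:

```python
results = {
    "EmoNet-Voice top-1": (0.0902, 0.1015),
    "EmoNet-Voice Spearman rho": (0.2280, 0.2561),
    "MAEB-voice mean (8 tasks)": (0.3861, 0.3893),
}


def relative_gain(v1: float, v2: float) -> float:
    """Relative change of v2 over v1, as a fraction."""
    return (v2 - v1) / v1


for name, (v1, v2) in results.items():
    print(f"{name}: {relative_gain(v1, v2):+.1%}")
# EmoNet-Voice top-1: +12.5%
# EmoNet-Voice Spearman rho: +12.3%
# MAEB-voice mean (8 tasks): +0.8%
```

The table rounds the first two rows to whole percentages (+13% and +12%).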
Training data
Trained on the open 9-corpus mixture used in the VoiceNet paper:
- emolia-balanced-5M-subset (annotated subset of Emilia)
- laions_got_talent_clean_with_captions
- majestrino-data
- synthetic_vocal_bursts + improved_synthetic_vocal_bursts
- ears, expresso, voxceleb1, voxceleb2 (FCaps captions)
MOSS-Audio-8B-Thinking annotations (18 prompt groups, 61 attribute values per clip) provide the EMO + style sentences for the three large corpora.
Usage
import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True loads the model's custom code from the repository
model = AutoModel.from_pretrained("laion/voiceclap-small-v2", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("laion/voiceclap-small-v2")

# audio: raw mono waveform at 16 kHz
wav, sr = sf.read("clip.wav", dtype="float32")
assert sr == 16000, "resample the clip to 16 kHz before encoding"
audio_emb = model.encode_waveform(torch.from_numpy(wav))

# text
t = tok(["a person speaking with quiet pride in their voice"], padding=True, return_tensors="pt")
text_emb = model.encode_text(t["input_ids"], attention_mask=t["attention_mask"])

# embeddings are L2-normalised, so the dot product is the cosine similarity
score = (audio_emb @ text_emb.T).item()
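Because the joint embeddings are L2-normalised, the same dot product can rank several candidate descriptions against one clip. The sketch below uses small stand-in vectors instead of real model outputs so it runs without downloading anything; `rank_prompts` is a hypothetical helper, not part of the model's API, and the vectors and prompts are made up.

```python
def rank_prompts(audio_emb, text_embs, prompts):
    """Sort prompts by dot product with the audio embedding, best first."""
    scores = [sum(a * t for a, t in zip(audio_emb, emb)) for emb in text_embs]
    return sorted(zip(prompts, scores), key=lambda pair: pair[1], reverse=True)


# Stand-in unit vectors; with the real model these would be the audio and
# text embeddings converted to plain lists of floats.
audio_emb = [0.6, 0.8, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prompts = [
    "a person speaking angrily",
    "a person speaking with quiet pride in their voice",
    "a person speaking in a bored monotone",
]

for prompt, score in rank_prompts(audio_emb, text_embs, prompts):
    print(f"{score:.2f}  {prompt}")
```

The top-ranked prompt is the description the model considers closest to the clip; the absolute scores are only meaningful relative to each other.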
Conversion from the training checkpoint was verified functionally against the original open_clip implementation (cosine ≥ 0.9999 on both towers).
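A parity check of this kind can be expressed as a per-sample cosine comparison between the embeddings of two implementations. The source does not describe how the check was actually run, so the helper names and toy vectors below are hypothetical; only the 0.9999 threshold comes from the statement above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def towers_match(reference, converted, threshold=0.9999):
    """True if every paired embedding has cosine similarity >= threshold."""
    return all(cosine(r, c) >= threshold for r, c in zip(reference, converted))


# Toy embeddings standing in for the outputs of the two implementations.
reference = [[1.0, 0.0], [0.0, 1.0]]
converted = [[1.0, 1e-4], [0.0, 1.0]]
print(towers_match(reference, converted))
```

Running the same comparison separately on audio and text embeddings would cover both towers.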
Sibling models
- laion/voiceclap-large-v2: 7B single-tower successor trained with Prototypical Contrastive loss
- laion/voiceclap-small, laion/voiceclap-large: v1 releases
License
cc-by-nc-4.0
Model tree for laion/voiceclap-small-v2
Base model
laion/voiceclap-small