VoiceCLAP-Transcriptions: multilingual reference-free WER substitute
A dual-tower CLAP whose cosine(audio_embedding, transcript_embedding) measures verbatim
content fidelity between speech and a candidate transcript, across languages. Use it as a
reference-free substitute for WER, e.g. to score how well a TTS system's output matches its
intended text without a reference transcription or an ASR pass.
The model is trained so an utterance's audio embedding is maximally similar to the embedding of its own transcript, and the similarity drops monotonically as the transcript is corrupted (more wrong words → lower score).
Results (held-out, multilingual)
On 505 held-out emolia clips (excluded from training by clip-id hash, so no leakage):
| metric | value |
|---|---|
| audio→text R@1 | 0.998 (R@5 = 1.000) |
| Spearman ρ(score, WER) | −0.788 (monotonic; de −0.805, en −0.779) |
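As a reading aid for the retrieval row, here is a minimal sketch of how audio→text recall@k is commonly computed from a similarity matrix. It assumes the matching transcript for clip i sits at column i; the function name `recall_at_k` and the 3×3 matrix are made up for illustration, and the source does not state the exact evaluation code or candidate pool it used.

```python
def recall_at_k(sim, k):
    """Fraction of rows whose matching column (same index) is in the top-k.

    sim[i][j] is the similarity between audio clip i and transcript j;
    the correct transcript for clip i is assumed to sit at column i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Column indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Made-up 3x3 similarity matrix: clip 2 is narrowly confused with transcript 0.
sim = [
    [0.91, 0.10, 0.05],
    [0.12, 0.88, 0.20],
    [0.60, 0.15, 0.55],
]
r1 = recall_at_k(sim, 1)  # two of three clips rank their own transcript first
r2 = recall_at_k(sim, 2)  # all three have it within the top two
```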
The score tracks word-error-rate strongly and monotonically in every language tested (German, English; fr/zh/ko present but sparse in emolia). This is the best of an ablation sweep; see the recipe below.
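To make the Spearman figure concrete, the sketch below computes a word-level WER for a few corrupted transcripts and rank-correlates it with a score per transcript. The cosine scores are made-up placeholders (no model is run), and the helper names are hypothetical; a strongly negative ρ means the score falls as WER rises.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            out[idx] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


reference = "the quick brown fox jumps"
candidates = [
    "the quick brown fox jumps",  # no errors
    "the quick brown dog jumps",  # one substitution out of five words
    "a quick brown dog leaps",    # three substitutions out of five words
]
wers = [word_error_rate(reference, c) for c in candidates]
scores = [0.82, 0.61, 0.30]  # made-up cosine scores, for illustration only
rho = spearman(scores, wers)
```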
Architecture (whisper-small-mMiniLM, ~208 M params)
| tower | model |
|---|---|
| audio | openai/whisper-small encoder (pretrained) → 768-d projection |
| text | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (XLM-R, 50+ langs) → 768-d |
| joint | 768-d, L2-normalized; cosine = dot product |
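The "cosine = dot product" note in the joint row can be checked with a small sketch: once both vectors are L2-normalized, their dot product equals the cosine similarity of the original vectors. The 4-d vectors below are made-up stand-ins for the 768-d embeddings, and the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed the long way, from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 4-d stand-ins for the 768-d audio and text embeddings (made-up values).
audio_raw = [0.3, -1.2, 0.8, 2.0]
text_raw = [0.1, -0.9, 1.1, 1.7]

audio_emb = l2_normalize(audio_raw)
text_emb = l2_normalize(text_raw)

score_dot = dot(audio_emb, text_emb)     # dot product of unit vectors
score_cos = cosine(audio_raw, text_raw)  # cosine of the raw vectors
```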
Training recipe
- Data: Emilia transcripts (the `text` field), multilingual (German/English dominant). 10M samples seen.
- Loss: SigLIP contrastive on `(audio, transcript)` pairs + hard negatives (K=5 edit-corrupted transcripts per clip; word-level for space-delimited languages, character-level for zh/ja) + an edit-distance-calibrated ranking margin (the true-vs-corrupted cosine gap is forced to grow with the corruption's edit fraction → the score graduates with WER).
- 1 node × 4 GH200, bf16.

The ranking margin is the lever that pushes ρ past −0.7; scale (10M) drives the near-perfect retrieval; the two are orthogonal and stack.
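The hard-negative and ranking-margin idea in the recipe can be sketched as follows. This is an illustrative guess at the general shape, not the training code: it covers only word-level substitutions (the source does not list the edit operations used, and character-level corruption for zh/ja is not shown), and `corrupt_words`, `margin_ranking_loss`, the hinge form and the `scale` value are all hypothetical.

```python
import random


def corrupt_words(transcript, edit_fraction, vocab, rng):
    """Substitute roughly `edit_fraction` of the words with random vocab words.

    Returns the corrupted transcript and the edit fraction actually applied.
    """
    words = transcript.split()
    n_edits = max(1, round(edit_fraction * len(words)))
    for pos in rng.sample(range(len(words)), n_edits):
        # Pick a replacement that differs from the original word.
        words[pos] = rng.choice([w for w in vocab if w != words[pos]])
    return " ".join(words), n_edits / len(words)


def margin_ranking_loss(score_true, score_corrupt, edit_fraction, scale=0.5):
    """Hinge loss that demands a larger cosine gap for heavier corruption.

    `scale` is a hypothetical hyperparameter mapping edit fraction to margin.
    """
    required_gap = scale * edit_fraction
    return max(0.0, required_gap - (score_true - score_corrupt))


rng = random.Random(0)
vocab = ["cat", "tree", "blue", "runs", "seven", "door"]  # toy vocabulary
transcript = "please call me back after lunch today"

# K = 5 hard negatives at increasing corruption levels.
negatives = [corrupt_words(transcript, frac, vocab, rng)
             for frac in (0.1, 0.3, 0.5, 0.7, 0.9)]

# A cosine gap of 0.2 satisfies a light corruption's margin but not a heavy one's.
loss_light = margin_ranking_loss(0.80, 0.60, edit_fraction=1 / 7)
loss_heavy = margin_ranking_loss(0.80, 0.60, edit_fraction=6 / 7)
```

Under this form, a lightly corrupted transcript is allowed to score close to the true one, while a heavily corrupted transcript is penalized until its score drops well below it, which is one way a score could come to graduate with WER.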
Usage
Loads with the clapv2 (open_clip fork) + clip_benchmark stack:
import sys, torch, torchaudio
sys.path.insert(0, "/path/to/clapv2/src")
sys.path.insert(0, "/path/to/clap_benchmark")
import open_clip
# register the model config shipped in this repo (open_clip_model_config.json -> whisper-small-mMiniLM)
open_clip.factory.add_model_config("/path/to/clapv2/src/open_clip/model_configs")
from clip_benchmark.models import load_clip
bundle = load_clip(model_type="clap_v2", model_name="whisper-small-mMiniLM",
                   pretrained="model.pt", device="cuda")
# audio: decode with torchaudio (emolia/Emilia FLACs are streaming FLACs that crash
# librosa/soundfile), downmix to mono, resample to 16 kHz, pad/truncate to 30 s (480000).
wav, sr = torchaudio.load("clip.flac")
wav = wav.mean(0)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = torch.nn.functional.pad(wav[:480000], (0, max(0, 480000 - wav.numel())))
audio_emb = bundle.model.encode_audio([{"waveform": wav}], normalize=True)
text_emb = bundle.model.encode_text(bundle.tokenizer(["the intended transcript"]).to("cuda"),
                                    normalize=True)
score = (audio_emb @ text_emb.T).item()  # cosine; higher = better verbatim match (≈ 1 − WER)
Files
- `model.pt`: training checkpoint (model weights; epoch 8, optimizer stripped).
- `open_clip_model_config.json`: the `whisper-small-mMiniLM` open_clip config (register via `add_model_config`).
- `tokenizer*.json`, `special_tokens_map.json`: the multilingual-MiniLM (XLM-R) tokenizer.
License: non-commercial
CC BY-NC 4.0. This model is trained on the full Emilia corpus, the bulk of which is CC BY-NC 4.0 (non-commercial). The model therefore inherits a non-commercial restriction. (A commercially-licensable variant would require retraining on the CC-BY-4.0 Emilia-YODAS subset only.)
Caveats
- Emilia is German/English-heavy ("balanced" by emotion, not language); per-language ρ for fr/zh/ko is not well-powered.
- Real-TTS + ASR-WER correlation (the gold validation) was not run; the −0.788 figure is from controlled synthetic transcript corruption on held-out real speech.
Model tree for gijs/voiceclap-transcriptions
Base model
laion/voiceclap-small