VoiceCLAP-Transcriptions: multilingual reference-free WER substitute

A dual-tower CLAP model whose cosine(audio_embedding, transcript_embedding) measures verbatim content fidelity between speech and a candidate transcript, across languages. Use it as a reference-free substitute for WER, e.g. to score how well a TTS system's output matches its intended text without a reference transcription or an ASR pass.

The model is trained so an utterance's audio embedding is maximally similar to the embedding of its own transcript, and the similarity drops monotonically as the transcript is corrupted (more wrong words → lower score).
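For readers less familiar with the quantity being approximated, here is a minimal sketch of word-level WER (Levenshtein distance over words, divided by the reference length). The function name and the example sentences are illustrative and do not come from the training code; the model's cosine score is meant to fall as this number rises.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: progressively heavier corruption of one reference.
reference = "the quick brown fox jumps over the lazy dog"
candidates = [
    reference,                                      # exact match
    "the quick brown fox jumps over the lazy cat",  # 1 substitution
    "the quick brown fox jumped over a lazy cat",   # 3 substitutions
    "a slow brown fox jumped under a sleepy cat",   # 7 substitutions
]
wers = [word_error_rate(reference, c) for c in candidates]
```

Under the training objective described above, the cosine score for these candidates should decrease in the same order in which `wers` increases.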

Results (held-out, multilingual)

On 505 held-out emolia clips (excluded from training by clip-id hash → no leakage):

metric                  value
audio→text R@1          0.998 (R@5 = 1.000)
Spearman ρ(score, WER)  −0.788 (monotonic; de −0.805, en −0.779)

The score tracks word error rate strongly and monotonically in every language tested (German, English; fr/zh/ko are present but sparse in emolia). This is the best configuration from an ablation sweep; see the training recipe below.
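The two metrics in the table can be computed as in the following sketch, which assumes a square similarity matrix whose diagonal holds the true (audio, transcript) pairs. The helper names and all numbers are invented for illustration; this is not the evaluation script behind the reported figures.

```python
def recall_at_k(sim, k):
    """sim[i][j] = cosine(audio_i, text_j); the true transcript of audio_i is text_i."""
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy numbers, invented for illustration only.
sim = [
    [0.91, 0.20, 0.15],
    [0.18, 0.88, 0.30],
    [0.40, 0.45, 0.42],  # the true pair (index 2) is only ranked second here
]
scores = [0.93, 0.81, 0.64, 0.70, 0.35]  # cosine per (audio, corrupted transcript)
wers = [0.00, 0.11, 0.33, 0.22, 0.78]    # WER of each corrupted transcript

r_at_1 = recall_at_k(sim, 1)
r_at_2 = recall_at_k(sim, 2)
rho = spearman_rho(scores, wers)
```

A strongly negative ρ is the desired outcome: higher WER should always pair with a lower score.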

Architecture (whisper-small-mMiniLM, ~208 M params)

tower  model
audio  openai/whisper-small encoder (pretrained) → 768-d projection
text   sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (XLM-R, 50+ langs) → 768-d
joint  768-d, L2-normalized; cosine = dot product
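Because both towers emit L2-normalized vectors, the cosine similarity reduces to a plain dot product. The sketch below checks this on 3-d stand-in vectors (the real embeddings are 768-d); the helper names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


audio_vec = [3.0, 4.0, 0.0]  # stand-in for a 768-d audio embedding
text_vec = [4.0, 3.0, 0.0]   # stand-in for a 768-d text embedding

# Dot product of the normalized vectors equals the cosine of the raw vectors.
score = dot(l2_normalize(audio_vec), l2_normalize(text_vec))
```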

Training recipe

  • Data: Emilia transcripts (the text field), multilingual (German/English dominant). 10M samples seen.
  • Loss: SigLIP contrastive on (audio, transcript) pairs + hard negatives (K=5 edit-corrupted transcripts per clip; word-level for space-delimited languages, character-level for zh/ja) + an edit-distance-calibrated ranking margin (the true-vs-corrupted cosine gap is forced to grow with the corruption's edit fraction, so the score is graded by WER).
  • Compute: 1 node × 4 GH200, bf16. The ranking margin is the lever that pushes ρ past −0.7; scale (10M samples) drives the near-perfect retrieval. The two effects are orthogonal and stack.
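The loss ingredients listed above can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the training code: `corrupt_words`, `ranking_margin_loss`, the toy vocabulary, and `max_margin=0.5` are hypothetical, only word-level substitutions are shown, and the source does not say how the terms are weighted or combined. The defaults `t=10` and `b=-10` follow the initialization in the SigLIP paper and may differ from this model's learned values.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def siglip_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: label +1 on the diagonal (true pairs), -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total -= log_sigmoid(z * (t * sim[i][j] + b))
    return total / n


def corrupt_words(words, n_edits, vocab, rng):
    """Substitute n_edits distinct positions; return (corrupted words, edit fraction).

    Assumption: the edit fraction is taken as substitutions / length.
    """
    n_edits = min(n_edits, len(words))
    corrupted = list(words)
    for pos in rng.sample(range(len(words)), n_edits):
        corrupted[pos] = rng.choice([w for w in vocab if w != words[pos]])
    return corrupted, n_edits / len(words)


def ranking_margin_loss(cos_true, cos_neg, edit_fraction, max_margin=0.5):
    """Hinge penalty when the true-vs-corrupted gap is below max_margin * edit_fraction."""
    required_gap = max_margin * edit_fraction
    return max(0.0, required_gap - (cos_true - cos_neg))


rng = random.Random(0)
words = "the quick brown fox jumps over the lazy dog".split()
vocab = ["cat", "runs", "blue", "under", "a"]  # toy replacement vocabulary

# K = 5 hard negatives with 1..5 substituted words each.
negatives = [corrupt_words(words, n, vocab, rng) for n in range(1, 6)]

# Made-up cosines for the true transcript and the lightest corruption.
loss = ranking_margin_loss(cos_true=0.9, cos_neg=0.85, edit_fraction=negatives[0][1])
```

The required gap grows linearly with the edit fraction, so a heavily corrupted transcript must score much lower than the true one, while a single-word corruption only needs a small gap.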

Usage

Loads with the clapv2 (open_clip fork) + clip_benchmark stack:

import sys, torch, torchaudio
sys.path.insert(0, "/path/to/clapv2/src")
sys.path.insert(0, "/path/to/clap_benchmark")
import open_clip
# register the model config shipped in this repo (open_clip_model_config.json -> whisper-small-mMiniLM)
open_clip.factory.add_model_config("/path/to/clapv2/src/open_clip/model_configs")
from clip_benchmark.models import load_clip

bundle = load_clip(model_type="clap_v2", model_name="whisper-small-mMiniLM",
                   pretrained="model.pt", device="cuda")

# audio: decode with torchaudio (emolia/Emilia FLACs are streaming FLACs that crash
# librosa/soundfile), downmix to mono, resample to 16 kHz, pad/truncate to 30 s (480000).
wav, sr = torchaudio.load("clip.flac")
wav = wav.mean(0)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = torch.nn.functional.pad(wav[:480000], (0, max(0, 480000 - wav.numel())))

audio_emb = bundle.model.encode_audio([{"waveform": wav}], normalize=True)
text_emb  = bundle.model.encode_text(bundle.tokenizer(["the intended transcript"]).to("cuda"),
                                     normalize=True)
score = (audio_emb @ text_emb.T).item()   # cosine; higher = better verbatim match (≈ 1 − WER)

Files

  • model.pt: training checkpoint (model weights; epoch 8, optimizer stripped).
  • open_clip_model_config.json: the whisper-small-mMiniLM open_clip config (register via add_model_config).
  • tokenizer*.json, special_tokens_map.json: the multilingual-MiniLM (XLM-R) tokenizer.

License: non-commercial

CC BY-NC 4.0. This model is trained on the full Emilia corpus, the bulk of which is CC BY-NC 4.0 (non-commercial). The model therefore inherits a non-commercial restriction. (A commercially-licensable variant would require retraining on the CC-BY-4.0 Emilia-YODAS subset only.)

Caveats

  • Emilia is German/English-heavy ("balanced" by emotion, not language); per-language ρ for fr/zh/ko is not well-powered.
  • Real-TTS + ASR-WER correlation (the gold validation) was not run; the −0.788 figure is from controlled synthetic transcript corruption on held-out real speech.