VibeVoice-ASR: ONNX export for Vernacula
ONNX export of microsoft/VibeVoice-ASR-HF,
packaged for use with Vernacula
and ONNX Runtime.
- Conversion script: scripts/vibevoice_export/
- Vernacula: github.com/christopherthompson81/vernacula
- Upstream model: microsoft/VibeVoice-ASR-HF
Highlights
- GPU-resident KV cache via `OrtIoBinding`: RTF 2.78× → 0.50× (5.5× speedup). The naive path copied 56 KV tensors of `[1, 4, cache_len, 128]` between CPU and GPU on every decode step (~4 TB of PCIe traffic across a 4000-token generate). `BindOutputToDevice` + per-step re-registration eliminates the round-trip entirely.
- Float32 KV + attention defers a BF16 divergence from token 11 (whole-seconds digit) to token 14 (sub-decimals only). The cost is ~23% Cast-node overhead and the win is killing spurious 7-second segment-boundary shifts that BF16 introduced.
- `ORT_ENABLE_EXTENDED` graph optimisation (not `_ALL`) for the decoder: avoids a BF16 attention-fusion regression that shifts divergence earlier, while still applying GeLU / LayerNorm fusions. The audio encoder uses `_ALL` since it benefits from memory-layout transforms.
- Audio-encoder session disposed after encoding in built-in diarization mode (frees 1-3 GB of CUDA arena before the 600 s autoregressive decode). Segmented mode persists the session to avoid 70+ session-load round-trips at ~20 s overhead each.
- Profiled framework overhead at 56% of wall time: 1950 CUDA kernel launches × ~7 µs each = ~14 ms/step of structural overhead from `SequentialExecutor` dispatch + launch latency. Reaching below this requires CUDA graphs or TensorRT.
- Static-vs-dynamic KV parity test ships alongside the export (`test_static_kv_parity.py`): token-by-token comparison with bf16 → f32 `OrtValue` conversion.
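The figures quoted in the highlights can be cross-checked with a few lines of arithmetic. The sketch below reuses only numbers stated above (1950 launches, ~7 µs each, RTF 2.78 and 0.50, 56 KV tensors of `[1, 4, cache_len, 128]`). The helper name `kv_bytes_one_way` and its `bytes_per_elem` parameter are illustrative assumptions, since the element type used by the naive path is not stated.

```python
# Back-of-envelope check of the numbers quoted in the highlights.
# Inputs come from the text; nothing here is measured.

KERNEL_LAUNCHES_PER_STEP = 1950  # CUDA kernel launches per decode step
LAUNCH_LATENCY_US = 7.0          # approximate cost per launch, in microseconds

overhead_ms_per_step = KERNEL_LAUNCHES_PER_STEP * LAUNCH_LATENCY_US / 1000.0

RTF_BEFORE = 2.78  # real-time factor with CPU<->GPU KV copies
RTF_AFTER = 0.50   # real-time factor with the GPU-resident KV cache

speedup = RTF_BEFORE / RTF_AFTER


def kv_bytes_one_way(cache_len: int, bytes_per_elem: int) -> int:
    """Bytes moved when all 56 KV tensors of [1, 4, cache_len, 128] cross PCIe once.

    ASSUMPTION: bytes_per_elem is 2 for bf16 or 4 for float32; the element
    type of the naive path is not stated in the model card.
    """
    return 56 * 1 * 4 * cache_len * 128 * bytes_per_elem


print(f"launch overhead: {overhead_ms_per_step:.2f} ms/step")
print(f"speedup: {speedup:.2f}x")
print(f"one-way KV copy at cache_len=4000, 4 bytes/elem: "
      f"{kv_bytes_one_way(4000, 4) / 1e9:.2f} GB")
```

Because the per-step copy grows linearly with `cache_len` and is repeated on every decode step, total traffic grows roughly quadratically with sequence length, which is why keeping the cache on the device pays off so heavily.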
Contents
| File | Purpose |
|---|---|
| `audio_encoder.onnx` (+ `.data`) | Audio encoder: waveform features → encoded representations |
| `decoder_single.onnx` (+ `.data`) | Unified decoder graph (default and recommended path) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer assets |
| `chat_template.jinja` | Prompt / chat template |
| `config.json`, `processor_config.json` | Model + processor config |
| `export-report.json` | Export-time metadata (op coverage, parity check results) |
| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
Export provenance
Exported via scripts/vibevoice_export/
in the Vernacula repo. The export folder also contains a static-vs-dynamic
KV-cache parity check (test_static_kv_parity.py) that validates the
decoder graph against the reference PyTorch implementation.
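To illustrate what a token-by-token parity comparison involves, here is a minimal sketch. It is not the contents of test_static_kv_parity.py, and both function names are hypothetical. It widens a raw bf16 bit pattern to float32 (bf16 is the upper 16 bits of an IEEE-754 float32) and reports the first index at which two token-id sequences differ.

```python
import struct
from typing import Optional, Sequence


def bf16_bits_to_float(bits: int) -> float:
    """Widen a raw 16-bit bfloat16 pattern to a Python float.

    bfloat16 is the upper half of an IEEE-754 float32, so the conversion
    is a 16-bit left shift followed by a reinterpretation of the bits.
    """
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("bf16 bit pattern must fit in 16 bits")
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token-id sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the divergence index is the shorter length.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

A divergence index such as the "token 11" and "token 14" figures in the highlights is the kind of value `first_divergence` would return when comparing a reference decode against an exported-graph decode.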
Runtime notes
- Vernacula uses the unified decoder graph (`decoder_single.onnx`) as the default and recommended path.
- The current Vernacula runtime treats this package as CUDA-only.
- `manifest.json` stores MD5 hashes used by Vernacula.Avalonia to verify and re-download missing or outdated files.
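Outside Vernacula, a similar integrity check can be approximated with the standard library. The sketch below assumes the manifest is a flat mapping from relative file name to hex MD5 digest. That layout is an assumption, not the documented schema of manifest.json, and `find_stale_files` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_stale_files(package_dir: Path, manifest: dict) -> list:
    """Return names of files that are missing or whose MD5 does not match.

    ASSUMPTION: `manifest` maps a relative file name to its hex MD5 digest.
    The real manifest.json layout may differ and may need adapting.
    """
    stale = []
    for name, expected in manifest.items():
        target = package_dir / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            stale.append(name)
    return stale
```

Reading in chunks matters here because the `.data` weight files are large; hashing them in one read would hold the whole file in memory.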
License
MIT, inherited from the upstream
microsoft/VibeVoice-ASR-HF
model. Commercial use and modification are both permitted.
Using these files
In Vernacula, select VibeVoice-ASR as the ASR backend in Settings and the
package will be downloaded and verified automatically. Outside Vernacula,
pull with huggingface_hub and load with onnxruntime:
```python
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="christopherthompson81/vibevoice-asr-onnx")
```
The decoder graph contract is documented in scripts/vibevoice_export/README.md.
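Once the snapshot is on disk, creating the ONNX Runtime sessions might look like the sketch below. It assumes both graphs sit at the top level of the downloaded folder next to their `.data` files, and that `onnxruntime-gpu` is installed. The optimisation levels mirror the ones described in the highlights. `model_paths` and `load_sessions` are hypothetical helper names, and input and output tensor names are deliberately left out because they are defined by the decoder graph contract, not by this example.

```python
from pathlib import Path


def model_paths(package_dir):
    """Locate the two ONNX graphs listed in the Contents table.

    ASSUMPTION: both files sit at the top level of the downloaded snapshot.
    """
    root = Path(package_dir)
    return {
        "audio_encoder": root / "audio_encoder.onnx",
        "decoder": root / "decoder_single.onnx",
    }


def load_sessions(package_dir):
    """Create CUDA sessions for the encoder and decoder graphs."""
    # Imported lazily so the path helper works without onnxruntime installed.
    import onnxruntime as ort

    levels = {
        # Per the highlights: full optimisation for the audio encoder,
        # EXTENDED (not ALL) for the decoder.
        "audio_encoder": ort.GraphOptimizationLevel.ORT_ENABLE_ALL,
        "decoder": ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED,
    }
    sessions = {}
    for name, path in model_paths(package_dir).items():
        options = ort.SessionOptions()
        options.graph_optimization_level = levels[name]
        sessions[name] = ort.InferenceSession(
            str(path),
            sess_options=options,
            providers=["CUDAExecutionProvider"],
        )
    return sessions
```

Calling `load_sessions(path)` with the `path` returned by `snapshot_download` would then yield one session per graph. Running the decoder efficiently additionally requires the IO-binding setup described in the highlights, which this sketch does not attempt.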
Limitations
Numerical behavior closely follows the upstream
microsoft/VibeVoice-ASR-HF
model, apart from the small BF16 divergence described under Highlights.
Language coverage, accuracy, and known failure modes inherit from
the upstream model card.
Citation
For the underlying model, see the upstream model card for the canonical citation.
Acknowledgments
- Original model: Microsoft Research
- ONNX repackaging: Chris Thompson for Vernacula
Issues with the ONNX export specifically: open an issue on the Vernacula repo. Issues with the underlying model: see the upstream model card.
See also
- Vernacula on GitHub: the speech pipeline app this package is built for
- Conversion script (`scripts/vibevoice_export/`): the export pipeline that produced these files
- `microsoft/VibeVoice-ASR-HF`: upstream model card
- Other Vernacula model packages