VibeVoice-Realtime-0.5B-MLX-INT4
INT4-quantized MLX bundle of Microsoft VibeVoice-Realtime-0.5B for Apple Silicon, ready to load with the VibeVoiceTTS Swift module from soniqo/speech-swift.
What's in the box
- model.safetensors: INT4 group-quantized Qwen2 backbone (group_size=32, mode=affine); tokenizer, acoustic tokenizer, diffusion head, and EOS classifier kept in source dtype
- quantization.json: per-layer manifest (244 quantized layers)
- config.json, preprocessor_config.json: copied from upstream
Bundle size: 1.07 GB (vs ~1.0 GB BF16 source — INT4 weights are smaller, but scales/biases add overhead).
Performance (Apple M2 Max, 64 GB)
| Steps | Audio | Elapsed | RTF | RTFx |
|---|---|---|---|---|
| 10 | 2.27 s | 0.98 s | 0.43 | 2.31× |
Faster than BF16 (RTFx 1.48 @ 20 steps) and INT8 (RTFx 1.88 @ 10 steps).
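For readers unfamiliar with the last two columns: RTF is elapsed compute time divided by audio duration (lower is better), and RTFx is its inverse (higher is better). A minimal Swift sketch of that arithmetic, using the rounded values from the table row; the function names are illustrative and not part of speech-swift.

```swift
import Foundation

/// Real-time factor: seconds of compute per second of audio (lower is better).
func realTimeFactor(audioSeconds: Double, elapsedSeconds: Double) -> Double {
    elapsedSeconds / audioSeconds
}

/// Inverse real-time factor: seconds of audio per second of compute (higher is better).
func inverseRealTimeFactor(audioSeconds: Double, elapsedSeconds: Double) -> Double {
    audioSeconds / elapsedSeconds
}

// Rounded values from the table row above.
let rtf = realTimeFactor(audioSeconds: 2.27, elapsedSeconds: 0.98)
let rtfx = inverseRealTimeFactor(audioSeconds: 2.27, elapsedSeconds: 0.98)
print(String(format: "RTF %.2f, RTFx %.2f", rtf, rtfx))
```

Because the table inputs are already rounded, the recomputed RTFx can differ from the published figure in the last digit.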
Use it
Swift / iOS / macOS
import VibeVoiceTTS
var config = VibeVoiceTTSModel.Configuration()
config.modelId = "aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4"
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] @ 24 kHz mono
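The generated samples are raw floats, so they need a container before most players can open them. The following is a sketch that uses only Foundation to wrap 24 kHz mono samples in a 16-bit PCM WAV file image. The helper name `wavData(fromMono:sampleRate:)` is hypothetical and not part of VibeVoiceTTS, and it assumes the samples lie roughly in [-1, 1].

```swift
import Foundation

/// Encode mono Float samples in [-1, 1] as a 16-bit PCM WAV file image.
/// Hypothetical helper, not part of the VibeVoiceTTS module.
func wavData(fromMono samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var data = Data()

    // Append an integer in little-endian byte order, as RIFF requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)            // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                       // fmt chunk size for PCM
    append(UInt16(1))                        // audio format 1 = integer PCM
    append(UInt16(1))                        // channel count: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)      // byte rate
    append(UInt16(bytesPerSample))           // block align
    append(UInt16(16))                       // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        // Clamp to [-1, 1]; map non-finite values to silence.
        let clamped: Float = sample.isFinite ? max(-1, min(1, sample)) : 0
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

// `pcm` stands in here for the array returned by tts.generate(text:).
let pcm: [Float] = [0, 0.5, -0.5, 1]
let wav = wavData(fromMono: pcm)
```

The result can then be saved with `try wav.write(to: URL(fileURLWithPath: "hello.wav"))`.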
CLI (audio from speech-swift)
audio vibevoice "Hello world." \
--model aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 \
--voice-cache voice_cache/en-Mike_man.safetensors \
--output hello.wav
Voice caches
Speaker identity comes from .safetensors voice caches. Get one from:
- mzbac/vibevoice.swift/voice_cache — 7 English voices, MIT
- Or mint your own from any reference audio:
audio vibevoice-encode-voice reference.wav "transcript" -o voice.safetensors
Languages
English and Chinese only. The Qwen2.5 tokenizer accepts text in other languages, but the audio output will be unintelligible because the training data is EN/ZH only.
License
MIT, inherited from the upstream Microsoft VibeVoice repo.
Reproduction
The conversion script is models/vibevoice/export/convert.py in soniqo/speech-models (private). Quantization is MLX group-wise affine; embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in source dtype.
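To make "group-wise affine" concrete: every run of group_size consecutive weights shares one scale and one bias, and each weight is stored as a small integer code q so that w ≈ scale * q + bias. The plain-Swift sketch below illustrates the idea for 4 bits and groups of 32. It is not the MLX kernel or the private conversion script, whose rounding and bit packing may differ, and the 16-bit size assumed for each scale and bias in `bitsPerWeight` is an assumption.

```swift
import Foundation

struct QuantizedGroup {
    var codes: [UInt8]   // one 4-bit code per weight, left unpacked for clarity
    var scale: Float
    var bias: Float
}

/// Quantize `weights` in consecutive groups; each group gets its own scale and bias.
func quantizeAffine(_ weights: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "length must be a multiple of groupSize")
    let levels = Float((1 << bits) - 1)   // 15 for INT4
    return stride(from: 0, to: weights.count, by: groupSize).map { (start: Int) -> QuantizedGroup in
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // A constant group has zero range; any nonzero scale works because all codes are 0.
        let scale: Float = hi > lo ? (hi - lo) / levels : 1
        let codes = group.map { w -> UInt8 in
            let q = ((w - lo) / scale).rounded()
            return UInt8(min(levels, max(0, q)))
        }
        return QuantizedGroup(codes: codes, scale: scale, bias: lo)
    }
}

/// Reconstruct approximate weights: w ≈ scale * q + bias.
func dequantizeAffine(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}

/// Storage cost per weight when each group also stores a scale and a bias.
/// paramBits = 16 is an assumption about how those two values are stored.
func bitsPerWeight(bits: Int, groupSize: Int, paramBits: Int = 16) -> Double {
    Double(bits) + Double(2 * paramBits) / Double(groupSize)
}
```

Under that assumption, `bitsPerWeight(bits: 4, groupSize: 32)` gives 5 bits per quantized weight rather than 4, which is the scale/bias overhead. Smaller groups track the weight distribution more closely at the cost of more overhead.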
Citation
@misc{microsoft_vibevoice,
title = {VibeVoice: Long-form, Multi-speaker Text-to-Speech},
author = {Microsoft Research},
year = {2025},
url = {https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B}
}