VibeVoice-Realtime-0.5B-MLX-INT4-etheroi

VibeVoice-Realtime-0.5B-MLX-INT4

INT4-quantized MLX bundle of Microsoft VibeVoice-Realtime-0.5B for Apple Silicon, ready to load with the VibeVoiceTTS Swift module from soniqo/speech-swift.

What's in the box

model.safetensors: INT4 group-quantized Qwen2 backbone (group_size=32, mode=affine); the tokenizer, acoustic tokenizer, diffusion head, and EOS classifier are kept in source dtype
quantization.json: per-layer manifest (244 quantized layers)
config.json, preprocessor_config.json: copied from upstream

Bundle size: 1.07 GB (vs ~1.0 GB for the BF16 source). The INT4 weights are smaller, but the per-group scales and biases add overhead.
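To make the scale/bias overhead concrete, here is a small back-of-the-envelope sketch. It assumes each group of 32 weights stores one 16-bit scale and one 16-bit bias; that storage width is an assumption for illustration and is not stated in this card. The function name and parameters are hypothetical.

```swift
/// Effective storage cost per weight for group-wise affine quantization.
/// `scaleBits` and `biasBits` default to 16, which is an assumption about
/// how the per-group metadata is stored, not a documented property of this bundle.
func effectiveBitsPerWeight(bits: Int,
                            groupSize: Int,
                            scaleBits: Int = 16,
                            biasBits: Int = 16) -> Double {
    // Each group shares one scale and one bias, so their cost is amortized
    // over `groupSize` weights.
    return Double(bits) + Double(scaleBits + biasBits) / Double(groupSize)
}

// Settings from this bundle: 4-bit weights, group_size = 32.
let int4Bits = effectiveBitsPerWeight(bits: 4, groupSize: 32)
print(int4Bits) // 5.0
```

Under these assumptions the quantized layers cost 5 bits per weight rather than 4, a 25% overhead on those layers. Layers kept in source dtype are not affected by this arithmetic.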

Performance (Apple M2 Max, 64 GB)

Steps	Audio	Elapsed	RTF	RTFx
10	2.27 s	0.98 s	0.43	2.31×

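Faster than BF16 (RTFx 1.48 at 20 steps) and INT8 (RTFx 1.88 at 10 steps). The BF16 figure was measured at 20 steps rather than 10, so only the INT8 comparison is like-for-like.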
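For readers unfamiliar with the two metrics, the sketch below uses the usual definitions, which match the table's numbers: RTF is elapsed time divided by audio duration (lower is better), and RTFx is its inverse (higher is better). The type and function names are hypothetical helpers, not part of speech-swift.

```swift
import Foundation

/// Real-time factor (RTF) and its inverse (RTFx) for one synthesis run.
struct RealTimeStats {
    let rtf: Double   // elapsed / audio; below 1.0 means faster than real time
    let rtfx: Double  // audio / elapsed; how many times faster than real time
}

func realTimeStats(audioSeconds: Double, elapsedSeconds: Double) -> RealTimeStats {
    precondition(audioSeconds > 0 && elapsedSeconds > 0, "durations must be positive")
    return RealTimeStats(rtf: elapsedSeconds / audioSeconds,
                         rtfx: audioSeconds / elapsedSeconds)
}

// Values from the table row above: 2.27 s of audio generated in 0.98 s.
let stats = realTimeStats(audioSeconds: 2.27, elapsedSeconds: 0.98)
print(String(format: "RTF %.2f, RTFx %.2f", stats.rtf, stats.rtfx))
```

Because the table's durations are already rounded to two decimals, recomputing from them can differ from the published RTFx in the last digit.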

Use it

Swift / iOS / macOS

import VibeVoiceTTS

var config = VibeVoiceTTSModel.Configuration()
config.modelId = "aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4"

let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] @ 24 kHz mono
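
If you want to save the returned samples to disk from Swift, the following is a minimal sketch of a 16-bit PCM WAV encoder that uses only Foundation. `wavData(from:sampleRate:)` is a hypothetical helper written for this example, not an API of VibeVoiceTTS; it assumes mono samples in the range -1...1 at 24 kHz, as described in the comment above.

```swift
import Foundation

/// Encodes mono Float samples in -1...1 as an in-memory 16-bit PCM WAV file.
func wavData(from samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let channels: UInt16 = 1
    let bytesPerSample: UInt16 = 2
    let dataSize = UInt32(samples.count) * UInt32(bytesPerSample)

    var data = Data()
    // RIFF/WAVE stores all integers in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)           // bytes remaining after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                      // size of the PCM fmt chunk
    append(UInt16(1))                       // format tag 1 = integer PCM
    append(channels)
    append(sampleRate)
    append(sampleRate * UInt32(channels) * UInt32(bytesPerSample)) // byte rate
    append(channels * bytesPerSample)       // block align
    append(bytesPerSample * 8)              // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        // Clamp to the valid range; non-finite samples are written as silence.
        let clamped: Float = sample.isFinite ? max(-1, min(1, sample)) : 0
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

With the `pcm` array from the snippet above, `try wavData(from: pcm).write(to: URL(fileURLWithPath: "hello.wav"))` would write a playable file.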

CLI (`audio` from speech-swift)

audio vibevoice "Hello world." \
    --model aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav

Voice caches

Speaker identity comes from .safetensors voice caches. Get one from:

mzbac/vibevoice.swift/voice_cache — 7 English voices, MIT
Or mint your own from any reference audio: `audio vibevoice-encode-voice reference.wav "transcript" -o voice.safetensors`

Languages

English and Chinese only. The Qwen2.5 tokenizer accepts other languages but the audio output will be unintelligible — the training data is EN/ZH only.
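
Since unsupported languages fail silently (the tokenizer accepts them), an app may want to check input text before synthesis. The sketch below uses Apple's NaturalLanguage framework to guess the dominant language; `isLikelySupportedLanguage` is a hypothetical helper, and language detection on very short strings can be unreliable, so treat the result as a hint rather than a guarantee.

```swift
import NaturalLanguage

/// Languages this model was trained on, per the note above.
let supportedLanguages: Set<NLLanguage> = [.english, .simplifiedChinese, .traditionalChinese]

/// Returns true when the text's dominant language looks like English or Chinese.
/// Returns false when the language cannot be determined.
func isLikelySupportedLanguage(_ text: String) -> Bool {
    guard let language = NLLanguageRecognizer.dominantLanguage(for: text) else {
        return false
    }
    return supportedLanguages.contains(language)
}
```

A caller could use this to warn the user or skip synthesis when the check returns false.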

License

MIT, inherited from the upstream Microsoft VibeVoice repo.

Reproduction

models/vibevoice/export/convert.py in soniqo/speech-models (private). Quantization is MLX group-wise affine; embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in source dtype.
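
Since the conversion script is private, here is a simplified sketch of what group-wise affine 4-bit quantization means for a single group of weights. It is an illustration of the general technique only: it is not the code in convert.py and does not reproduce MLX's exact rounding or bit-packing. All names are hypothetical.

```swift
/// One quantized group: 4-bit codes plus the scale and bias needed to decode them.
struct QuantizedGroup {
    let codes: [UInt8]  // each value is in 0...15
    let scale: Float
    let bias: Float
}

/// Maps a group of weights onto 16 evenly spaced levels between its min and max.
func quantizeAffine4Bit(_ weights: [Float]) -> QuantizedGroup {
    precondition(!weights.isEmpty, "a group needs at least one weight")
    let lo = weights.min()!
    let hi = weights.max()!
    let levels: Float = 15  // 2^4 - 1 steps between the lowest and highest code
    // A constant group has no range; any nonzero scale decodes it exactly.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(max(0, min(levels, q)))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights: w ≈ code * scale + bias.
func dequantize(_ group: QuantizedGroup) -> [Float] {
    return group.codes.map { Float($0) * group.scale + group.bias }
}
```

With group_size=32, each run of 32 consecutive weights would get its own scale and bias, which is why smaller groups improve accuracy at the cost of more metadata.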

Citation

@misc{microsoft_vibevoice,
  title  = {VibeVoice: Long-form, Multi-speaker Text-to-Speech},
  author = {Microsoft Research},
  year   = {2025},
  url    = {https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B}
}

Model size: 0.4B params (Safetensors; tensor types BF16 and U32)

Model tree for developerjeremylive/VibeVoice-Realtime-0.5B-MLX-INT4-etheroi: base model Qwen/Qwen2.5-0.5B; fine-tuned: microsoft/VibeVoice-Realtime-0.5B