---
language:
  - zh
  - en
license: apache-2.0
library_name: mlx
pipeline_tag: text-to-speech
tags:
  - mlx
  - tts
  - speech
  - voice-conditioned
  - multi-speaker
  - long-form
  - diffusion
  - apple-silicon
  - quantized
  - 8bit
---

# VibeVoice — MLX

VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.

## Variants

| Path | Precision |
|------|-----------|
| `mlx-int8/` | int8 quantized weights |

## How to Get Started

Single speaker:

```bash
python scripts/generate/vibevoice.py \
  --text "Hello from VibeVoice." \
  --output outputs/vibevoice.wav
```

Multi-speaker dialogue — speaker labels are 0-based:

```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: Have you tried VibeVoice?
Speaker 1: Not yet. Does it need PyTorch?
Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
Speaker 1: That is impressive." \
  --output outputs/dialogue.wav
```
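To illustrate the transcript format, here is a minimal sketch of how "Speaker N:" lines can be split into per-speaker turns. This is illustrative only — the actual parser inside `vibevoice.py` may differ:

```python
import re

def parse_dialogue(text):
    """Split a transcript into (speaker_id, utterance) turns.

    Lines are expected to start with 'Speaker N:' using 0-based IDs,
    matching the format shown in the multi-speaker example above.
    """
    turns = []
    for line in text.strip().splitlines():
        match = re.match(r"Speaker (\d+):\s*(.*)", line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2)))
    return turns

script = """Speaker 0: Have you tried VibeVoice?
Speaker 1: Not yet. Does it need PyTorch?"""
print(parse_dialogue(script))
# → [(0, 'Have you tried VibeVoice?'), (1, 'Not yet. Does it need PyTorch?')]
```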

Voice cloning — one reference WAV per speaker:

```bash
python scripts/generate/vibevoice.py \
  --text "Speaker 0: This is cloned from the reference." \
  --reference-audio-speaker0 ref_speaker0.wav \
  --output outputs/clone.wav
```

Up to 4 speakers are supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.

Default generation settings (matching upstream):

- Greedy decoding (deterministic)
- Seed: 42
- Diffusion steps: 20

Add `--no-greedy` to enable temperature + top-p sampling.
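For readers unfamiliar with what `--no-greedy` switches on, below is a minimal sketch of temperature + top-p (nucleus) sampling over a list of logits. The actual sampling code lives in the mlx-speech runtime and may differ in detail:

```python
import math
import random

def top_p_sample(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index via temperature + top-p (nucleus) sampling."""
    # Temperature-scaled softmax over the logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the nucleus and draw one index.
    nucleus = [probs[i] for i in kept]
    r = rng.random() * sum(nucleus)
    acc = 0.0
    for i, p in zip(kept, nucleus):
        acc += p
        if r <= acc:
            return i
    return kept[-1]
```

With a very small `top_p`, the nucleus collapses to the single highest-probability token, which is why greedy decoding is the deterministic limit of this scheme.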

## Model Details

VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.
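To give intuition for the diffusion decoder's 20-step default, here is a toy scalar denoising loop. It is purely illustrative — VibeVoice's real decoder iteratively refines continuous acoustic latents with a learned model, whereas the stand-in "model" here simply predicts a fixed `target`:

```python
def toy_denoise(x, steps=20, target=0.0):
    """Toy iterative refinement: 20 steps, as in the default config.

    At each step the sample moves a 1/t fraction toward the model's
    prediction; the stand-in model always predicts `target`.
    """
    for t in range(steps, 0, -1):
        predicted = target               # stand-in for the decoder's estimate
        x = x + (predicted - x) / t      # partial correction toward the estimate
    return x
```

Because the final step (t = 1) applies the full remaining correction, the loop lands exactly on the prediction; more steps simply spread the refinement over smaller, smoother corrections.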

See mlx-speech for the full runtime and conversion code.

## License

Apache 2.0.