dots-tts-mlx: quantized MLX weights (Apple Silicon)

Ready-to-run MLX weights for rednote-hilab/dots.tts-soar, a 2B continuous-AR flow-matching TTS model with multilingual support (24 languages, same as upstream) and zero-shot voice cloning, quantized for Apple Silicon. Download and run with the dots-tts-mlx runtime: no PyTorch and no conversion step.

Languages: same as upstream dots.tts, all 24 (Arabic, Cantonese, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Thai, Turkish, Ukrainian, Vietnamese). The 5-language check in Quality is only the quantization spot-check, not the supported set.

These are converted + LLM-quantized MLX safetensors, not PyTorch. They load only with the dots-tts-mlx runtime on Apple Silicon (Metal). For the original PyTorch model, see rednote-hilab/dots.tts-soar.

⚡ Two decoders, one voice. int4/int8 are the standard soar build (reference quality). mf-int4/mf-int8 are MeanFlow, a distilled few-step decoder (rednote-hilab/dots.tts-mf) that runs the acoustic DiT at NFE=4 with no classifier-free guidance, ~2× faster than the 10-step path with no measurable quality loss. Same model, same 24-language cloning: pick the folder that fits your speed budget. For cloning, reference mode (audio + transcript) is the recommended quality path.

Variants

| Subfolder  | Decoder                  | Download | Speed      | Use                           |
|------------|--------------------------|----------|------------|-------------------------------|
| int4/ ⭐    | soar, 10-step + CFG      | ~2.4 GB  | baseline   | default, best quality         |
| int8/      | soar, 10-step + CFG      | ~3.1 GB  | baseline   | conservative fallback         |
| mf-int4/ ⚡ | MeanFlow, NFE-4, no CFG  | ~2.4 GB  | ~2× faster | latency-sensitive             |
| mf-int8/   | MeanFlow, NFE-4, no CFG  | ~3.1 GB  | ~2× faster | MeanFlow + more LLM precision |

Only the Qwen2.5-1.5B LLM trunk (≈70% of the weights) is quantized (group-wise affine, group size 64); the precision-sensitive flow-matching DiT, the BigVGAN vocoder, and the CAM++ speaker encoder stay bf16.
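To make "group-wise affine, group size 64" concrete, here is a minimal pure-Python sketch of the general technique: each group of 64 weights gets its own scale and bias, and each weight is stored as a small integer code. It is an illustration only, not the MLX kernel or the conversion code used for these weights; the function names and the toy weights are hypothetical.

```python
import math


def quantize_group(values, bits):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    levels = (1 << bits) - 1                      # 15 for int4, 255 for int8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                       # bias is the group minimum


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [scale * q + bias for q in codes]


def quantize_dequantize(weights, bits=4, group_size=64):
    """Round-trip a flat list of weights through group-wise quantization."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, bias = quantize_group(group, bits)
        out.extend(dequantize_group(codes, scale, bias))
    return out


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy weights (hypothetical), four groups of 64 values each.
    weights = [math.sin(0.37 * i) * 0.05 for i in range(256)]
    for bits in (4, 8):
        err = max_abs_error(weights, quantize_dequantize(weights, bits=bits))
        print(f"int{bits}: max abs round-trip error = {err:.6f}")
```

The round-trip error per weight is bounded by half a quantization step of its group, which is why smaller groups and more bits both reduce error at the cost of storage.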

MeanFlow (mf-*) is the distilled rednote-hilab/dots.tts-mf checkpoint: the same architecture plus a small duration embedder. It is auto-detected from config.json (no flag): point the runtime at an mf-* folder and it uses the few-step solver. Measured 1.9–2.2× faster than the 10-step path on reference cloning (EN/HI/ZH). It drops classifier-free guidance, so --guidance-scale is ignored. Use reference cloning (audio + transcript) for best quality.

Quality

Quantization is validated against the full-precision MLX build: on a small multilingual acceptance check (EN/DE/ES/FR + Hindi), int8 and int4 showed no transcription-accuracy or voice-similarity regression vs bf16. This is a sanity check, not a dataset-scale benchmark, so evaluate on your own content.

Correctness of the port itself is gated per-stage against the original PyTorch model (AudioVAE PSNR ≈ 56 dB; attention / DiT / LLM / semantic-encoder cosine ≥ 0.9999); see the runtime repo.

Usage

# 1. install the quant-aware runtime (>= v0.2.0)
pip install "git+https://github.com/sb1992/dots-tts-mlx.git@v0.2.0"

# 2. download the variant you want  (use "mf-int4/*" for the faster MeanFlow decoder)
hf download shraey/dots-tts-mlx --include "int4/*" --local-dir ./dots-tts-mlx-weights

# 3. run (files land in ./dots-tts-mlx-weights/int4/)
dots-tts --model ./dots-tts-mlx-weights/int4 \
    --text "Hello from MLX." --ref-audio reference.wav --language EN \
    --out-path out --out-prefix clone

The runtime auto-detects the quantization block in config.json, so nothing changes at the CLI/API level vs an unquantized directory. Python API and the full flag set: see the runtime repo.
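If you want to check what a downloaded folder contains before running it, a small script can read config.json directly. This is a sketch under assumptions: the key name "quantization" and its "bits" / "group_size" fields follow the convention commonly used by MLX model configs, and the actual dots-tts-mlx schema may differ. The helper name describe_weights_dir is hypothetical and is not part of the runtime.

```python
import json
import tempfile
from pathlib import Path


def describe_weights_dir(model_dir):
    """Summarize the quantization settings found in <model_dir>/config.json.

    ASSUMPTION: the "quantization" key and its "bits" / "group_size" fields
    are a guess based on common MLX configs, not a documented schema here.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization")
    if quant is None:
        return "unquantized (no quantization block found)"
    return f"quantized: {quant.get('bits')}-bit, group size {quant.get('group_size')}"


if __name__ == "__main__":
    # Demo on a throwaway directory with a hypothetical config.json.
    with tempfile.TemporaryDirectory() as tmp:
        demo = {"quantization": {"bits": 4, "group_size": 64}}
        (Path(tmp) / "config.json").write_text(json.dumps(demo))
        print(describe_weights_dir(tmp))
```

For a real folder, pass the path used in step 3 (for example ./dots-tts-mlx-weights/int4) and compare the reported fields with the variant you intended to download.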

  • Memory: peak RAM scales with generation + reference length: roughly 6 GB for a short clip, up to ~13 GB for a ~30 s clip (int4); resident weights are ~2.4 GB. The render peak is activation-bound (the bf16 DiT + vocoder working set), so it's the same for soar and MeanFlow and isn't reduced by quantization. MLX's allocator may cache up to its memory limit, but that cache is releasable.
  • Requires: Apple Silicon (MLX is Metal-only), Python ≥ 3.10.
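The download sizes in the Variants table can be sanity-checked with simple arithmetic. The sketch below estimates only the size difference between int8 and int4, using the rounded figures from this card (2B parameters, about 70% of them quantized). It assumes the per-group scale and bias overhead is the same at both bit widths, so it cancels in the difference; the function name is hypothetical.

```python
TOTAL_PARAMS = 2e9          # "2B" from the model description (rounded)
QUANTIZED_FRACTION = 0.70   # share of weights in the quantized LLM trunk (approximate)


def size_delta_gb(bits_low, bits_high):
    """Extra storage, in GB, of storing the quantized share at bits_high vs bits_low.

    ASSUMPTION: per-group scale/bias overhead is identical for both widths
    (same group size), so only the integer codes differ.
    """
    quantized_params = TOTAL_PARAMS * QUANTIZED_FRACTION
    extra_bits = quantized_params * (bits_high - bits_low)
    return extra_bits / 8 / 1e9   # bits -> bytes -> GB


if __name__ == "__main__":
    print(f"int8 - int4 difference: ~{size_delta_gb(4, 8):.2f} GB")
```

With these inputs the estimate is about 0.7 GB, which agrees with the gap between the ~3.1 GB and ~2.4 GB downloads. The absolute sizes depend on the exact parameter count and storage format, which this sketch does not model.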

Attribution & licenses

Derivative quantized weights of rednote-hilab/dots.tts-soar (Apache-2.0); you must comply with the upstream license. Components:

  • dots.tts (model · code): Apache-2.0, © the dots.tts team at rednote-hilab.
  • Qwen2.5-1.5B-Base (LLM backbone): Apache-2.0.
  • CAM++ / 3D-Speaker (speaker x-vector encoder): Apache-2.0.
  • BigVGAN (vocoder/decoder architecture style): MIT, © NVIDIA.

MLX port + quantization code: github.com/sb1992/dots-tts-mlx (Apache-2.0).

Responsible use

This performs zero-shot voice cloning: it can reproduce a person's voice from a few seconds of audio. Only clone voices you own or for which you have explicit, informed consent; do not use it for impersonation, fraud, or deception; and disclose AI-generated audio wherever it's shared. See the upstream risks guidance.
