dots-tts-mlx: quantized MLX weights (Apple Silicon)
Ready-to-run MLX weights for rednote-hilab/dots.tts-soar, a 2B continuous-AR flow-matching, multilingual (24 languages, same as upstream), zero-shot voice-clone TTS model, quantized for Apple Silicon. Download and run with the dots-tts-mlx runtime: no PyTorch and no conversion step.
Languages: same as upstream dots.tts, all 24 (Arabic, Cantonese, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Thai, Turkish, Ukrainian, Vietnamese). The 5-language check in Quality is only the quantization spot-check, not the supported set.
These are converted + LLM-quantized MLX safetensors, not PyTorch. They load only with the dots-tts-mlx runtime on Apple Silicon (Metal). For the original PyTorch model, see rednote-hilab/dots.tts-soar.
⚡ Two decoders, one voice.
int4/int8 are the standard soar build (reference quality). mf-int4/mf-int8 are MeanFlow, a distilled few-step decoder (rednote-hilab/dots.tts-mf) that runs the acoustic DiT at NFE=4 with no classifier-free guidance, ~2× faster than the 10-step path with no measurable quality loss. Same model, same 24-language cloning: pick the folder that fits your speed budget. For cloning, reference mode (audio + transcript) is the recommended quality path.
Variants
| Subfolder | Decoder | Download | Speed | Use |
|---|---|---|---|---|
| int4/ ⭐ | soar (10-step + CFG) | ~2.4 GB | baseline | default, best quality |
| int8/ | soar (10-step + CFG) | ~3.1 GB | baseline | conservative fallback |
| mf-int4/ ⚡ | MeanFlow (NFE-4, no CFG) | ~2.4 GB | ~2× faster | latency-sensitive |
| mf-int8/ | MeanFlow (NFE-4, no CFG) | ~3.1 GB | ~2× faster | MeanFlow + more LLM precision |
Only the Qwen2.5-1.5B LLM trunk (≈70% of the weights) is quantized (group-wise affine, group size 64); the precision-sensitive flow-matching DiT, the BigVGAN vocoder, and the CAM++ speaker encoder stay bf16.
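To make "group-wise affine, group size 64" concrete, here is a minimal pure-Python sketch of the general technique: each group of 64 weights gets its own scale and bias, and every weight is stored as a small integer code. This illustrates the idea only; the function names are hypothetical, and the exact rounding and packing used by MLX and by this repo's conversion may differ.

```python
import math


def quantize_group(values, bits):
    """Affine-quantize one group; returns (codes, scale, bias)."""
    levels = (1 << bits) - 1            # 15 for 4-bit, 255 for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize(weights, bits=4, group_size=64):
    """Split a flat weight list into groups and quantize each one independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights: w is roughly code * scale + bias."""
    restored = []
    for codes, scale, bias in groups:
        restored.extend(code * scale + bias for code in codes)
    return restored


# Toy demonstration on 128 synthetic weights (two groups of 64).
weights = [math.sin(i) for i in range(128)]
groups = quantize(weights, bits=4, group_size=64)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)} max_abs_error={max_error:.4f}")
```

Because the scale is fitted per group, the rounding error of any weight is bounded by half of its own group's scale, which is why smaller groups and more bits both reduce the error.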
MeanFlow (mf-*) is the distilled rednote-hilab/dots.tts-mf checkpoint (the same architecture plus a small duration embedder), auto-detected from config.json (no flag): point the runtime at an mf-* folder and it uses the few-step solver. Measured 1.9–2.2× faster than the 10-step path on reference cloning (EN/HI/ZH). It drops classifier-free guidance, so --guidance-scale is ignored. Use reference cloning (audio + transcript) for best quality.
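Since the decoder is chosen purely by which folder the runtime is pointed at, a wrapper script can pick the variant and assemble the command line itself. The sketch below is a hypothetical helper, not part of the runtime: it uses only the `dots-tts` flags that appear on this card, assumes `--guidance-scale` takes a numeric value, and omits that flag for MeanFlow folders because it is ignored there.

```python
def variant_folder(decoder="soar", bits=4):
    """Map a decoder choice and bit width to a subfolder name from the Variants table."""
    if decoder not in ("soar", "meanflow"):
        raise ValueError(f"unknown decoder: {decoder!r}")
    if bits not in (4, 8):
        raise ValueError(f"unsupported bit width: {bits!r}")
    prefix = "mf-" if decoder == "meanflow" else ""
    return f"{prefix}int{bits}"


def build_command(weights_root, decoder, bits, text, ref_audio, language,
                  out_path="out", out_prefix="clone", guidance_scale=None):
    """Build an argv list for the dots-tts CLI (hypothetical wrapper)."""
    model_dir = f"{weights_root}/{variant_folder(decoder, bits)}"
    cmd = [
        "dots-tts", "--model", model_dir,
        "--text", text, "--ref-audio", ref_audio, "--language", language,
        "--out-path", out_path, "--out-prefix", out_prefix,
    ]
    # MeanFlow drops classifier-free guidance, so the flag is only passed for soar.
    # ASSUMPTION: --guidance-scale accepts a plain numeric value.
    if guidance_scale is not None and decoder == "soar":
        cmd += ["--guidance-scale", str(guidance_scale)]
    return cmd


fast = build_command("./dots-tts-mlx-weights", "meanflow", 4,
                     "Hello from MLX.", "reference.wav", "EN", guidance_scale=2.0)
print(" ".join(fast))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the runtime is installed.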
Quality
Quantization is validated as effectively lossless relative to the full-precision MLX build on a small multilingual acceptance check (EN/DE/ES/FR + Hindi): int8 and int4 showed no transcription-accuracy or voice-similarity regression vs bf16. This is a sanity check, not a dataset-scale benchmark; evaluate on your own content.
Correctness of the port itself is gated per-stage against the original PyTorch model (AudioVAE PSNR ≈ 56 dB; attention / DiT / LLM / semantic-encoder cosine ≥ 0.9999); see the runtime repo.
Usage
# 1. install the quant-aware runtime (>= v0.2.0)
pip install "git+https://github.com/sb1992/dots-tts-mlx.git@v0.2.0"
# 2. download the variant you want (use "mf-int4/*" for the faster MeanFlow decoder)
hf download shraey/dots-tts-mlx --include "int4/*" --local-dir ./dots-tts-mlx-weights
# 3. run (files land in ./dots-tts-mlx-weights/int4/)
dots-tts --model ./dots-tts-mlx-weights/int4 \
--text "Hello from MLX." --ref-audio reference.wav --language EN \
--out-path out --out-prefix clone
The runtime auto-detects the quantization block in config.json, so nothing changes at the CLI/API level vs an unquantized directory. Python API and the full flag set: see the runtime repo.
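To check which variant a downloaded folder actually contains, you can inspect its config.json yourself. The sketch below assumes the quantization block is stored under a top-level `"quantization"` key with `bits` and `group_size` fields, as is conventional for MLX model directories; this card does not confirm the key name, so treat it as an assumption and adjust after looking at the real file. The helper name is hypothetical.

```python
import json
from pathlib import Path


def describe_quantization(model_dir):
    """Summarize the quantization block of a model directory's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # ASSUMPTION: the block lives under "quantization" with "bits" / "group_size".
    quant = config.get("quantization")
    if not isinstance(quant, dict):
        return "no quantization block found"
    return f"quantized: bits={quant.get('bits')}, group_size={quant.get('group_size')}"
```

For example, `print(describe_quantization("./dots-tts-mlx-weights/int4"))` after the download step.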
- Memory: peak RAM scales with generation + reference length: roughly 6 GB for a short clip, up to ~13 GB for a ~30 s clip (int4); resident weights are ~2.4 GB. The render peak is activation-bound (the bf16 DiT + vocoder working set), so it's the same for soar and MeanFlow and isn't reduced by quantization. MLX's allocator may cache up to its memory limit, but that cache is releasable.
- Requires: Apple Silicon (MLX is Metal-only), Python ≥ 3.10.
Attribution & licenses
Derivative quantized weights of rednote-hilab/dots.tts-soar (Apache-2.0); you must comply with the upstream license. Components:
- dots.tts (model · code): Apache-2.0, © the dots.tts team at rednote-hilab.
- Qwen2.5-1.5B-Base (LLM backbone): Apache-2.0.
- CAM++ / 3D-Speaker (speaker x-vector encoder): Apache-2.0.
- BigVGAN (vocoder/decoder architecture style): MIT, © NVIDIA.
MLX port + quantization code: github.com/sb1992/dots-tts-mlx (Apache-2.0).
Responsible use
This performs zero-shot voice cloning: it can reproduce a person's voice from a few seconds of audio. Only clone voices you own or for which you have explicit, informed consent; do not use it for impersonation, fraud, or deception; and disclose AI-generated audio wherever it's shared. See the upstream risks guidance.
Model tree for shraey/dots-tts-mlx: quantized from base model rednote-hilab/dots.tts-base.