mlx-indextts2-vietnamese-8bit
This is a converted MLX IndexTTS2 model for Apple Silicon inference with solar2ain/mlx-indextts.
It was prepared for the local /Users/vanch/index-tts IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.
Variant
- Profile: Vietnamese
- Precision / quantization: 8bit
- Approx. local size: 2.0 GB
- Source checkpoint directory during conversion: /Users/vanch/index-tts/checkpoints_vi
- Note: Vietnamese model with upstream MLX GPT-only 8-bit quantization.
- Conversion detail: converted with `mlx-indextts convert --quantize 8`. In the current upstream implementation this quantizes only the GPT component; S2Mel and BigVGAN stay fp32.
Expected Files
The repository root is a ready-to-use MLX IndexTTS2 model directory:
- gpt.safetensors
- s2mel.safetensors
- bigvgan.safetensors
- vq2emb.safetensors
- tokenizer.model
- config.yaml
- config.json
- feat1.pt
- feat2.pt
- wav2vec2bert_stats.pt
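A quick completeness check for a local copy (a minimal sketch; it assumes the model was downloaded to models/mlx-indextts2-vietnamese-8bit as in the Usage section below):

```bash
# Report any expected model file that is missing from the local directory.
for f in gpt.safetensors s2mel.safetensors bigvgan.safetensors vq2emb.safetensors \
         tokenizer.model config.yaml config.json feat1.pt feat2.pt wav2vec2bert_stats.pt; do
  [ -f "models/mlx-indextts2-vietnamese-8bit/$f" ] || echo "missing: $f"
done
```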
Usage
Install and use mlx-indextts:
```bash
git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-vietnamese-8bit \
  --local-dir models/mlx-indextts2-vietnamese-8bit \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-vietnamese-8bit \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16
```
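On macOS the result can be auditioned straight from the terminal with the built-in player:

```bash
afplay output.wav
```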
For repeated generation, precompute speaker conditioning first:
```bash
uv run mlx-indextts speaker \
  -m models/mlx-indextts2-vietnamese-8bit \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
```
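The resulting speaker.npz can then be passed as the -r argument of generate, which skips re-extracting speaker conditioning on every call. For example, a batch run over a file of sentences (a sketch; sentences.txt, one sentence per line, is an assumed input):

```bash
# Generate one wav per line of sentences.txt, reusing the precomputed speaker.
i=0
while IFS= read -r line; do
  i=$((i + 1))
  uv run mlx-indextts generate \
    -m models/mlx-indextts2-vietnamese-8bit \
    -r speaker.npz \
    -t "$line" \
    -o "out_${i}.wav" \
    --memory-limit 24 \
    --diffusion-steps 16
done < sentences.txt
```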
Benchmark
Benchmarked on a 128 GB unified-memory M3 Max Mac using:
- mlx-indextts from solar2ain/mlx-indextts
- precomputed .npz speaker conditioning
- memory_limit=24GB, diffusion_steps=16
- emotion=calm, emo_alpha=0.6
- the same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS
RTF is the real-time factor, i.e. synthesis time divided by output audio duration; lower is faster:
| Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
|---|---|---|---|---|
| vi short | 1.562 | 1.471 | 0.976 | 2.329 |
| vi long | 1.557 | 1.500 | 0.965 | 1.822 |
Summary from the local comparison:
- 8bit was the fastest MLX route in this test set.
- fp16 saved space but was slower than fp32 for the standard profile.
- Vietnamese fp16 was slightly faster than Vietnamese fp32, but Vietnamese 8bit was fastest.
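To get a rough RTF reading on your own setup, you can time a run and divide by the output duration. A sketch, with caveats: it assumes a precomputed speaker.npz, parses the "estimated duration" line of macOS afinfo output, and the whole-second timing plus model-load overhead make it only approximate:

```bash
# Rough RTF: wall-clock synthesis time / output audio duration.
start=$(date +%s)
uv run mlx-indextts generate \
  -m models/mlx-indextts2-vietnamese-8bit \
  -r speaker.npz \
  -t "Câu văn dùng để đo tốc độ tổng hợp giọng nói." \
  -o rtf_test.wav \
  --memory-limit 24 \
  --diffusion-steps 16
elapsed=$(( $(date +%s) - start ))
# afinfo prints a line like "estimated duration: 8.123 sec" on macOS.
audio_s=$(afinfo rtf_test.wav | awk '/estimated duration/ {print $3}')
echo "RTF ≈ $(echo "$elapsed / $audio_s" | bc -l)"
```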
ASR Validation
ASR validation with local mlx_whisper + whisper-large-v3-turbo found no empty audio, wrong-language output, or obvious missing sentences. Vietnamese long-form ASR still showed minor tone/word-ending differences, so subjective listening is recommended for production use.
ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.
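A similar automated sanity check can be run on your own outputs with the mlx-whisper package, which transcribes locally on Apple Silicon (a sketch; the --model flag follows that package's documented CLI, but treat the exact flags as an assumption and confirm with mlx_whisper --help):

```bash
pip install mlx-whisper
# Transcribe the generated audio and compare the text against the input sentence.
mlx_whisper output.wav --model mlx-community/whisper-large-v3-turbo
```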
Provenance and Scope
This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is solar2ain/mlx-indextts.
The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.