DramaBox TTS (MLX, bf16)
Pure-MLX conversion of Resemble AI's DramaBox, an expressive flow-matching diffusion text-to-speech model. It renders 48 kHz stereo speech on Apple Silicon with no PyTorch at inference time. Weights ship as plain .safetensors for the mlx-speech runtime.
Model Details
- Developed by: App Automaton
- Upstream model: ResembleAI/Dramabox, built on Lightricks/LTX-2.3
- Task: English text-to-speech at 48 kHz stereo
- Architecture: Gemma 3 12B text encoder, flow-matching audio DiT (3.3B), audio VAE, BigVGAN + BWE vocoder
- Precision: bf16. This is an MLX format port; the vocoder runs fp32 compute regardless of storage dtype.
- Runtime: MLX on Apple Silicon
Requires a separate text encoder. DramaBox conditions on a Gemma 3 12B backbone. Download it from the paired repo: appautomaton/gemma-3-12b-it-backbone-4bit-mlx.
Contents
| File | Component | Format | Size |
|---|---|---|---|
| dramabox-dit-v1.safetensors | Flow-matching audio DiT (3.3B, 48 layers) | bf16 | ~6.6 GB |
| dramabox-audio-components.safetensors | Audio VAE, BigVGAN/BWE vocoder, connector, aggregate embed | bf16 | ~1.9 GB |
| config.json | Architecture and inference defaults | JSON | n/a |
| assets/silence_latent_frame.pt | Training metadata, unused at inference | n/a | small |
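The DiT size in the table follows from its parameter count and storage dtype: bf16 stores two bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring file-format overhead (the helper name `weight_size_gb` is illustrative, not part of mlx-speech):

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_size_gb(n_params: float, dtype: str = "bf16") -> float:
    """Approximate on-disk weight size in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# 3.3B parameters stored as bf16, matching the ~6.6 GB DiT file.
print(f"{weight_size_gb(3.3e9):.1f} GB")  # 6.6 GB
```

The same function gives a rough estimate of how the footprint would change at another storage precision; runtime memory is higher because activations and the other components are not counted here.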
How to Get Started
DramaBox runs through the mlx-speech repo and needs both this repo and the Gemma backbone.
```bash
# 1. DramaBox weights
hf download appautomaton/dramabox-tts-3.3b-bf16-mlx \
  --local-dir models/dramabox/mlx-bf16

# 2. Gemma 3 12B text-encoder backbone (paired repo)
hf download appautomaton/gemma-3-12b-it-backbone-4bit-mlx \
  --local-dir models/gemma_3_12b_it_backbone/mlx-4bit
```
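The same two downloads can be scripted from Python, followed by a quick check that the DramaBox directory holds the files listed under Contents. This is a sketch under assumptions: `download_all` and `missing_files` are illustrative helpers, not part of mlx-speech, and the download step relies on `snapshot_download` from the `huggingface_hub` package being installed.

```python
from pathlib import Path

# Repo IDs and target directories copied from the shell commands above.
REPOS = {
    "appautomaton/dramabox-tts-3.3b-bf16-mlx": "models/dramabox/mlx-bf16",
    "appautomaton/gemma-3-12b-it-backbone-4bit-mlx": "models/gemma_3_12b_it_backbone/mlx-4bit",
}

# Files the Contents table lists as needed at inference time.
EXPECTED_FILES = [
    "dramabox-dit-v1.safetensors",
    "dramabox-audio-components.safetensors",
    "config.json",
]


def download_all(repos=REPOS):
    """Download every repo into its local directory (needs network access)."""
    # Imported lazily so the checks below work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return {
        repo_id: snapshot_download(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in repos.items()
    }


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Typical use (commented out because it downloads several GB):
# download_all()
# print(missing_files("models/dramabox/mlx-bf16"))
```

An empty list from `missing_files` means the DramaBox weights and config are in place; the Gemma directory is not checked because its file layout is not described here.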
```python
from mlx_speech.generation.dramabox import DramaBoxModel

model = DramaBoxModel.from_dir(
    "models/dramabox/mlx-bf16",
    gemma_dir="models/gemma_3_12b_it_backbone/mlx-4bit",
)
result = model.generate(
    'A woman speaks clearly, "The weather today will be sunny."',
    duration_s=5.0,
    cfg_scale=2.5,
)
# result.waveform : mx.array [2, T_samples], result.sample_rate : 48000
```
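To save the result from Python rather than through the CLI script, the `[2, T_samples]` waveform can be interleaved and written as 16-bit PCM with the standard library. A minimal sketch, assuming the samples are floats in the range [-1, 1]; `write_wav` is an illustrative helper, not an mlx-speech function.

```python
import struct
import wave


def write_wav(path, channels, sample_rate=48000):
    """Write per-channel float samples in [-1, 1] as 16-bit PCM WAV.

    channels: sequence of equal-length sample sequences, e.g. [left, right].
    Returns the clip duration in seconds.
    """
    n_frames = len(channels[0])
    if any(len(ch) != n_frames for ch in channels):
        raise ValueError("all channels must have the same length")
    # Interleave frames (L0, R0, L1, R1, ...), clip, and scale to int16.
    pcm = [
        int(max(-1.0, min(1.0, ch[i])) * 32767)
        for i in range(n_frames)
        for ch in channels
    ]
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(len(channels))
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return n_frames / sample_rate


# Assumed usage with the result above (converts the mx.array to nested lists):
# write_wav("outputs/dramabox.wav", result.waveform.tolist(), result.sample_rate)
```

At 48 kHz stereo, a 5-second clip holds 240,000 frames, or 480,000 samples, so the pure-Python loop is adequate for short clips; a NumPy-based writer would be the usual choice for longer audio.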
```bash
python scripts/generate_dramabox.py \
  --dramabox-dir models/dramabox/mlx-bf16 \
  --gemma-dir models/gemma_3_12b_it_backbone/mlx-4bit \
  --prompt 'A woman speaks clearly.' \
  --duration 5.0 \
  --out outputs/dramabox.wav
```
Status and Limitations
- Works today: English text to 48 kHz stereo waveform, end to end, pure MLX.
- Not yet wired: voice-reference cloning (IC-LoRA). The mel front-end is stubbed in this port.
- Deferred: Spatio-Temporal Guidance (stg_scale) falls back to CFG-only.
- Memory: the DiT, audio components, and Gemma backbone target a 32 GB Apple Silicon machine.
Links
- Source code: appautomaton/mlx-speech
- Paired text encoder: appautomaton/gemma-3-12b-it-backbone-4bit-mlx
- Upstream model: ResembleAI/Dramabox
- Foundation model: Lightricks/LTX-2.3
- More MLX speech models: App Automaton on Hugging Face
License
DramaBox derives from LTX-2.3 and is distributed under the LTX-2 Community License (see LICENSE in this repo). Use is also subject to the terms of the upstream ResembleAI/Dramabox release.