DramaBox TTS (MLX, bf16)

Pure-MLX conversion of Resemble AI's DramaBox, an expressive flow-matching diffusion text-to-speech model. It renders 48 kHz stereo speech on Apple Silicon with no PyTorch at inference time. Weights ship as plain .safetensors for the mlx-speech runtime.

Model Details

Developed by: App Automaton
Upstream model: ResembleAI/Dramabox, built on Lightricks/LTX-2.3
Task: English text-to-speech at 48 kHz stereo
Architecture: Gemma 3 12B text encoder, flow-matching audio DiT (3.3B), audio VAE, BigVGAN + BWE vocoder
Precision: bf16 storage in MLX format. The vocoder computes in fp32 regardless of storage dtype.
Runtime: MLX on Apple Silicon

Requires a separate text encoder. DramaBox conditions on a Gemma 3 12B backbone. Download it from the paired repo: appautomaton/gemma-3-12b-it-backbone-4bit-mlx.

File	Component	Format	Size
`dramabox-dit-v1.safetensors`	Flow-matching audio DiT (3.3B, 48 layers)	bf16	~6.6 GB
`dramabox-audio-components.safetensors`	Audio VAE, BigVGAN/BWE vocoder, connector, aggregate embed	bf16	~1.9 GB
`config.json`	Architecture and inference defaults	JSON	n/a
`assets/silence_latent_frame.pt`	Training metadata, unused at inference	n/a	small
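
The ~6.6 GB figure for the DiT is consistent with 3.3B parameters stored at 2 bytes each. The sketch below shows that arithmetic. It assumes decimal gigabytes (1 GB = 1e9 bytes) and ignores the safetensors header; `estimate_size_gb` is a hypothetical helper, not part of mlx-speech.

```python
# Back-of-envelope weight size: parameter count times bytes per parameter.
# Hypothetical helper for illustration only.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def estimate_size_gb(n_params: float, dtype: str = "bf16") -> float:
    """Approximate storage in decimal GB (1 GB = 1e9 bytes), header ignored."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# 3.3B parameters at bf16, as listed for dramabox-dit-v1.safetensors.
dit_gb = estimate_size_gb(3.3e9, "bf16")
print(f"DiT at bf16: ~{dit_gb:.1f} GB")
```

The same function gives a quick estimate for any other component once its parameter count is known.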

How to Get Started

DramaBox runs through the mlx-speech repo and needs both this repo and the Gemma backbone.

# 1. DramaBox weights
hf download appautomaton/dramabox-tts-3.3b-bf16-mlx \
  --local-dir models/dramabox/mlx-bf16

# 2. Gemma 3 12B text-encoder backbone (paired repo)
hf download appautomaton/gemma-3-12b-it-backbone-4bit-mlx \
  --local-dir models/gemma_3_12b_it_backbone/mlx-4bit
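
Before loading, it can help to confirm that the download completed. The following is a minimal sketch that checks the DramaBox directory for the file names listed in the table above. `missing_files` is a hypothetical helper; the Gemma repo's file names are not listed here, so that directory is not checked.

```python
from pathlib import Path

# File names taken from the file table in this README.
EXPECTED_FILES = (
    "dramabox-dit-v1.safetensors",
    "dramabox-audio-components.safetensors",
    "config.json",
)


def missing_files(model_dir: str, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Directory matches the --local-dir used in the download step.
    missing = missing_files("models/dramabox/mlx-bf16")
    print("missing:", missing or "none")
```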

from mlx_speech.generation.dramabox import DramaBoxModel

model = DramaBoxModel.from_dir(
    "models/dramabox/mlx-bf16",
    gemma_dir="models/gemma_3_12b_it_backbone/mlx-4bit",
)
result = model.generate(
    'A woman speaks clearly, "The weather today will be sunny."',
    duration_s=5.0,
    cfg_scale=2.5,
)
# result.waveform : mx.array [2, T_samples], result.sample_rate : 48000
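
To save the result from Python rather than through the CLI script, one option is the standard-library `wave` module. This is a sketch under stated assumptions, not the method mlx-speech itself uses: it assumes the waveform holds floats in [-1, 1] and that it can be turned into nested lists (for example with `result.waveform.tolist()`). `write_stereo_wav` is a hypothetical helper.

```python
import sys
import wave
from array import array


def write_stereo_wav(path, channels, sample_rate=48000):
    """Write a [2, T] float waveform as interleaved 16-bit PCM WAV."""
    left, right = channels
    if len(left) != len(right):
        raise ValueError("both channels must have the same length")
    pcm = array("h")
    for left_s, right_s in zip(left, right):
        for sample in (left_s, right_s):
            # Assumption: samples are floats in [-1, 1]; clip anything outside.
            clipped = max(-1.0, min(1.0, float(sample)))
            pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV PCM data is little-endian
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(2)
        wf.setsampwidth(2)  # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())


# Hypothetical usage with the result above (conversion call is an assumption):
# write_stereo_wav("outputs/dramabox.wav", result.waveform.tolist(), result.sample_rate)
```

The per-sample Python loop is slow for long clips but keeps the example dependency-free; a NumPy-based conversion would be the usual choice for longer audio.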

python scripts/generate_dramabox.py \
  --dramabox-dir models/dramabox/mlx-bf16 \
  --gemma-dir models/gemma_3_12b_it_backbone/mlx-4bit \
  --prompt 'A woman speaks clearly.' \
  --duration 5.0 \
  --out outputs/dramabox.wav

Status and Limitations

Works today: English text to 48 kHz stereo waveform, end to end, pure MLX.
Not yet wired: voice-reference cloning (IC-LoRA). The mel front-end is stubbed in this port.
Deferred: Spatio-Temporal Guidance. Setting stg_scale currently falls back to CFG-only guidance.
Memory: loading the DiT, audio components, and Gemma backbone together is sized for a 32 GB Apple Silicon machine.
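
To see how much weight data sits behind the memory note above, a small script can sum the weight files on disk. This is a rough sketch: it assumes the Gemma repo also ships `.safetensors` files, and on-disk size is only a lower bound, since activations and runtime overhead come on top. `weights_on_disk_gb` is a hypothetical helper.

```python
from pathlib import Path


def weights_on_disk_gb(*model_dirs: str) -> float:
    """Sum the sizes of all .safetensors files under the given directories (decimal GB)."""
    total_bytes = 0
    for model_dir in model_dirs:
        root = Path(model_dir)
        if not root.is_dir():
            continue  # skip directories that have not been downloaded yet
        for weight_file in root.rglob("*.safetensors"):
            total_bytes += weight_file.stat().st_size
    return total_bytes / 1e9


if __name__ == "__main__":
    # Directories match the --local-dir values used in the download step.
    total = weights_on_disk_gb(
        "models/dramabox/mlx-bf16",
        "models/gemma_3_12b_it_backbone/mlx-4bit",
    )
    print(f"weights on disk: {total:.1f} GB")
```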

License

DramaBox derives from LTX-2.3 and is distributed under the LTX-2 Community License (see LICENSE in this repo). Use is also subject to the terms of the upstream ResembleAI/Dramabox release.
