DramaBox TTS (MLX, bf16)

Pure-MLX conversion of Resemble AI's DramaBox, an expressive flow-matching diffusion text-to-speech model. It renders 48 kHz stereo speech on Apple Silicon with no PyTorch at inference time. Weights ship as plain .safetensors for the mlx-speech runtime.

Model Details

  • Developed by: App Automaton
  • Upstream model: ResembleAI/Dramabox, built on Lightricks/LTX-2.3
  • Task: English text-to-speech at 48 kHz stereo
  • Architecture: Gemma 3 12B text encoder, flow-matching audio DiT (3.3B), audio VAE, BigVGAN + BWE vocoder
  • Precision: bf16 weights. This is an MLX-format port; the vocoder computes in fp32 regardless of storage dtype.
  • Runtime: MLX on Apple Silicon

Requires a separate text encoder. DramaBox conditions on a Gemma 3 12B backbone. Download it from the paired repo: appautomaton/gemma-3-12b-it-backbone-4bit-mlx.

Contents

| File | Component | Format | Size |
| --- | --- | --- | --- |
| dramabox-dit-v1.safetensors | Flow-matching audio DiT (3.3B, 48 layers) | bf16 | ~6.6 GB |
| dramabox-audio-components.safetensors | Audio VAE, BigVGAN/BWE vocoder, connector, aggregate embed | bf16 | ~1.9 GB |
| config.json | Architecture and inference defaults | JSON | n/a |
| assets/silence_latent_frame.pt | Training metadata, unused at inference | n/a | small |
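Before loading the model, it can help to confirm that a download is complete. The following is a minimal sketch that checks a local directory against the file names listed above. The helper name `missing_files` is hypothetical and not part of mlx-speech, and the directory path is simply the one used in this card's download commands.

```python
from pathlib import Path

# File names taken from the Contents listing in this card.
EXPECTED_DRAMABOX_FILES = (
    "dramabox-dit-v1.safetensors",
    "dramabox-audio-components.safetensors",
    "config.json",
)


def missing_files(model_dir, expected=EXPECTED_DRAMABOX_FILES):
    """Return the expected file names that are absent from model_dir.

    Hypothetical helper for illustration; it only checks that each file
    exists, not that its contents are valid.
    """
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Path assumed from the download commands in this card.
    missing = missing_files("models/dramabox/mlx-bf16")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected DramaBox files are present.")
```

The check deliberately skips assets/silence_latent_frame.pt, since the listing marks it as unused at inference.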

How to Get Started

DramaBox runs through the mlx-speech repo and needs both this repo and the Gemma backbone.

# 1. DramaBox weights
hf download appautomaton/dramabox-tts-3.3b-bf16-mlx \
  --local-dir models/dramabox/mlx-bf16

# 2. Gemma 3 12B text-encoder backbone (paired repo)
hf download appautomaton/gemma-3-12b-it-backbone-4bit-mlx \
  --local-dir models/gemma_3_12b_it_backbone/mlx-4bit

Python API:

from mlx_speech.generation.dramabox import DramaBoxModel

model = DramaBoxModel.from_dir(
    "models/dramabox/mlx-bf16",
    gemma_dir="models/gemma_3_12b_it_backbone/mlx-4bit",
)
result = model.generate(
    'A woman speaks clearly, "The weather today will be sunny."',
    duration_s=5.0,
    cfg_scale=2.5,
)
# result.waveform: mx.array of shape [2, T_samples]; result.sample_rate: 48000

Command line:

python scripts/generate_dramabox.py \
  --dramabox-dir models/dramabox/mlx-bf16 \
  --gemma-dir models/gemma_3_12b_it_backbone/mlx-4bit \
  --prompt 'A woman speaks clearly.' \
  --duration 5.0 \
  --out outputs/dramabox.wav
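When using the Python API rather than the script, the waveform still has to be written to disk. The sketch below is one way to do that with only the standard library. It assumes the waveform holds float samples in the range [-1, 1], one sequence per channel, which this card does not state explicitly. `write_wav` is a hypothetical helper, not an mlx-speech function.

```python
import sys
import wave
from array import array


def write_wav(path, channels, sample_rate=48000):
    """Write float channels as a 16-bit PCM WAV file.

    `channels` is a sequence of equal-length sample sequences, e.g.
    [left, right]. Samples are assumed to lie in [-1, 1]; values outside
    that range are clipped.
    """
    n_samples = len(channels[0])
    if any(len(ch) != n_samples for ch in channels):
        raise ValueError("all channels must have the same length")

    # WAV stores frames interleaved (L, R, L, R, ...), so walk sample by
    # sample and emit one value per channel.
    pcm = array("h")
    for i in range(n_samples):
        for ch in channels:
            clipped = max(-1.0, min(1.0, float(ch[i])))
            pcm.append(int(clipped * 32767))

    # WAV data is little-endian; array("h") uses the native byte order.
    if sys.byteorder == "big":
        pcm.byteswap()

    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(len(channels))
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

Assuming `result.waveform` converts to nested Python lists via `tolist()`, a call would look like `write_wav("outputs/dramabox.wav", result.waveform.tolist(), result.sample_rate)`. The pure-Python loop favors portability over speed; a NumPy-based version would be faster for long clips.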

Status and Limitations

  • Works today: English text to 48 kHz stereo waveform, end to end, pure MLX.
  • Not yet wired: voice-reference cloning (IC-LoRA). The mel front-end is stubbed in this port.
  • Deferred: Spatio-Temporal Guidance (stg_scale) is not implemented yet; generation falls back to CFG-only.
  • Memory: the DiT, audio components, and Gemma backbone target a 32 GB Apple Silicon machine.
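The memory target above can be sanity-checked with rough arithmetic on weight storage. This is a back-of-the-envelope sketch, not a measurement. The DiT parameter count and the audio-component size come from this card. The Gemma figure is an assumption: it takes a nominal 12B parameters at 4 bits each and ignores quantization scales and any layers kept at higher precision. Activations, latents, and runtime overhead are not counted.

```python
def estimate_weight_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes) for a parameter count."""
    return n_params * bits_per_param / 8 / 1e9


# DiT: 3.3B parameters stored as bf16 (16 bits each), per this card.
dit_gb = estimate_weight_gb(3_300_000_000, 16)

# Audio components: size as listed in this card, not derived.
audio_gb = 1.9

# Gemma backbone: ASSUMPTION of a nominal 12B parameters at 4 bits each,
# ignoring quantization scales and any unquantized layers.
gemma_gb = estimate_weight_gb(12_000_000_000, 4)

total_gb = dit_gb + audio_gb + gemma_gb

print(f"DiT weights:      ~{dit_gb:.1f} GB")
print(f"Audio components: ~{audio_gb:.1f} GB")
print(f"Gemma backbone:   ~{gemma_gb:.1f} GB (assumed)")
print(f"Total weights:    ~{total_gb:.1f} GB")
```

The DiT estimate agrees with the ~6.6 GB file size in the Contents section. Under these assumptions the weights take roughly half of a 32 GB machine, leaving the rest for activations and the operating system.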

Links

License

DramaBox derives from LTX-2.3 and is distributed under the LTX-2 Community License (see LICENSE in this repo). Use is also subject to the terms of the upstream ResembleAI/Dramabox release.
