AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation
AudioX-Turbo is a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (text, video, and audio signals). It follows a teacher–student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion (MAF) module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation (DMD) adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation.
AudioX-Turbo generates audio in only 4 sampling steps without classifier-free guidance, requiring up to about 25× fewer network function evaluations (NFE) than multi-step baselines while achieving superior performance, especially on text-to-audio and text-to-music generation.
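As a rough illustration of where a figure like 25× can come from: a sampler that uses classifier-free guidance typically runs the network twice per step (once conditional, once unconditional), so its NFE is twice its step count. The sketch below assumes a hypothetical baseline of 50 steps with guidance. That setting is an assumption chosen for the arithmetic, not a configuration reported here.

```python
def nfe(steps: int, use_cfg: bool) -> int:
    """Network evaluations for one sample; guidance doubles the cost of each step."""
    return steps * (2 if use_cfg else 1)


# AudioX-Turbo: 4 steps, no classifier-free guidance
student_nfe = nfe(steps=4, use_cfg=False)

# Hypothetical multi-step baseline (assumed for illustration only)
baseline_nfe = nfe(steps=50, use_cfg=True)

print(student_nfe, baseline_nfe, baseline_nfe / student_nfe)  # 4 100 25.0
```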
- Paper: AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation
- Project Page: https://zeyuet.github.io/AudioX-Turbo/
- Repository: https://github.com/NoizAI/AudioX-Turbo
Files
| File | Description |
|---|---|
| audiox_turbo/audiox_turbo.ckpt | AudioX-Turbo: distilled 4-step student model (inference) |
| pretransform/vae.ckpt | VAE pretransform |
| synchformer/synchformer_state_dict.pth | Synchformer, for video-conditioned (V2A/V2M) generation |
| pretrained_ckpt/pretrained_ckpt.ckpt | Teacher / base model (training only: student init + teacher) |
Download
# Inference checkpoints (student + VAE + Synchformer)
huggingface-cli download HKUSTAudio/AudioX-Turbo \
    audiox_turbo/audiox_turbo.ckpt pretransform/vae.ckpt synchformer/synchformer_state_dict.pth \
    --local-dir checkpoints

# Training only: teacher / base model
huggingface-cli download HKUSTAudio/AudioX-Turbo \
    pretrained_ckpt/pretrained_ckpt.ckpt \
    --local-dir checkpoints
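The same files can presumably also be fetched from Python with `hf_hub_download` from the `huggingface_hub` package, the library behind `huggingface-cli`. This is a minimal sketch that assumes the repository layout listed in the Files table; the helper names `missing_files` and `download_checkpoints` are illustrative and not part of the AudioX-Turbo code.

```python
from pathlib import Path

REPO_ID = "HKUSTAudio/AudioX-Turbo"

# File names as listed in the Files table
INFERENCE_FILES = [
    "audiox_turbo/audiox_turbo.ckpt",
    "pretransform/vae.ckpt",
    "synchformer/synchformer_state_dict.pth",
]
TRAINING_FILES = ["pretrained_ckpt/pretrained_ckpt.ckpt"]


def missing_files(local_dir, files):
    """Return the entries of `files` that are not yet present under `local_dir`."""
    root = Path(local_dir)
    return [name for name in files if not (root / name).is_file()]


def download_checkpoints(local_dir="checkpoints", include_teacher=False):
    """Download whichever checkpoints are missing (hypothetical helper)."""
    # Imported lazily so the rest of the module works without the package
    from huggingface_hub import hf_hub_download

    wanted = INFERENCE_FILES + (TRAINING_FILES if include_teacher else [])
    for name in missing_files(local_dir, wanted):
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)


# download_checkpoints("checkpoints")  # requires network access
```

Skipping files that already exist keeps repeated runs cheap; pass `include_teacher=True` only when setting up training.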
Sample Usage
To use this model programmatically, install the audiox_turbo package as specified in the official repository.
import torch
import torchaudio
from einops import rearrange
from audiox_turbo.inference import load_audiox_turbo_model
from audiox_turbo.inference.generation import generate_diffusion_cond_dmd
from audiox_turbo.data.utils import (
    read_video, load_and_process_audio, encode_video_with_synchformer, merge_video_audio,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the distilled 4-step student
model, model_config = load_audiox_turbo_model(
    "configs/audiox_turbo_infer_4step.json",
    "checkpoints/audiox_turbo/audiox_turbo.ckpt",
    pretransform_ckpt_path="checkpoints/pretransform/vae.ckpt",
    device=device,
)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config.get("video_fps", 5)
seconds_total = 10
# --- Choose a task by setting the inputs below ---
# Text-to-Audio: video_path=None, text_prompt="Typing on a keyboard"
# Video-to-Music: video_path="example/V2M_sample-1.mp4", text_prompt="Generate music for the video"
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None
if video_path:
    video_tensor = read_video(video_path, seek_time=0, duration=seconds_total, target_fps=target_fps)
    sync_features = encode_video_with_synchformer(video_path, 0, seconds_total, device=device)
else:
    # No video given: zero tensors stand in for the video conditions
    video_tensor = torch.zeros(seconds_total * target_fps, 3, 224, 224)
    sync_features = torch.zeros(1, 240, 768, device=device)
audio_tensor = load_and_process_audio(audio_path, sample_rate, 0, seconds_total)
conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": sync_features},
    "text_prompt": text_prompt or "",
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": 0,
    "seconds_total": seconds_total,
}]
# 4-step generation (no classifier-free guidance)
output = generate_diffusion_cond_dmd(
    model, steps=4, conditioning=conditioning,
    sample_size=sample_size, seed=0, device=device,
)
output = output[:, :, : sample_rate * seconds_total]
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output)).clamp_min(1e-8)).clamp(-1, 1)
torchaudio.save("output.wav", output.cpu(), sample_rate)
# Optional: mux the audio back onto the source video
if video_path:
    merge_video_audio(video_path, "output.wav", "output.mp4", 0, seconds_total)
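The post-processing at the end of the script trims the output to `sample_rate * seconds_total` samples and peak-normalizes it before saving. The dependency-free sketch below mirrors that arithmetic on a plain Python list so it can be checked by hand. The function names are illustrative, and the `1e-8` floor corresponds to the `clamp_min` in the script.

```python
def trim(samples, sample_rate, seconds_total):
    """Keep only the first sample_rate * seconds_total samples."""
    return samples[: sample_rate * seconds_total]


def peak_normalize(samples, eps=1e-8):
    """Divide by the largest absolute value (floored at eps), then clamp to [-1, 1]."""
    peak = max((abs(s) for s in samples), default=0.0)
    scale = max(peak, eps)  # avoids division by zero for silent audio
    return [max(-1.0, min(1.0, s / scale)) for s in samples]


# Toy signal at a made-up rate of 4 samples per second, trimmed to 1 second
signal = [0.5, -0.25, 0.125, 0.0, 0.5, 0.5]
print(peak_normalize(trim(signal, sample_rate=4, seconds_total=1)))
# [1.0, -0.5, 0.25, 0.0]
```

The floor matters for silent output: an all-zero signal stays all zero instead of producing NaN values.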
Supported Tasks
AudioX-Turbo is a unified model that accepts text, video, and audio conditions in any combination:
| Task | video_path | text_prompt | audio_path |
|---|---|---|---|
| Text-to-Audio (T2A) | None | "Typing on a keyboard" | None |
| Text-to-Music (T2M) | None | "A music with piano and violin" | None |
| Video-to-Audio (V2A) | "video_path.mp4" | "Generate general audio for the video" | None |
| Video-to-Music (V2M) | "video_path.mp4" | "Generate music for the video" | None |
| TV-to-Audio (TV2A) | "video_path.mp4" | "Ocean waves crashing with people laughing" | None |
| TV-to-Music (TV2M) | "video_path.mp4" | "Generate music with piano instrument" | None |
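The table can be turned into a small helper that fills in `video_path`, `text_prompt`, and `audio_path` for the sample script and rejects combinations a task cannot use. This is only a sketch: `task_inputs` is a hypothetical helper, not part of the `audiox_turbo` package, and treating the V2A and V2M prompts as fixed instructions is an assumption drawn from the table.

```python
# Prompts copied from the task table; V2A and V2M are assumed to use them as-is
FIXED_PROMPTS = {
    "V2A": "Generate general audio for the video",
    "V2M": "Generate music for the video",
}
TEXT_ONLY_TASKS = {"T2A", "T2M"}
VIDEO_TASKS = {"V2A", "V2M", "TV2A", "TV2M"}


def task_inputs(task, video_path=None, text_prompt=None):
    """Return the three script inputs for a task, checking what the task requires."""
    if task not in TEXT_ONLY_TASKS | VIDEO_TASKS:
        raise ValueError(f"unknown task: {task}")
    if task in VIDEO_TASKS and not video_path:
        raise ValueError(f"{task} needs a video_path")
    if task in FIXED_PROMPTS:
        text_prompt = FIXED_PROMPTS[task]
    elif not text_prompt:
        raise ValueError(f"{task} needs a text_prompt")
    return {
        "video_path": video_path if task in VIDEO_TASKS else None,
        "text_prompt": text_prompt,
        "audio_path": None,  # every row of the table leaves audio_path unset
    }


print(task_inputs("V2M", video_path="example/V2M_sample-1.mp4"))
```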
Citation
If you find our work useful, please consider citing:
@article{tian2026audioxturbo,
title={AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation},
author={Tian, Zeyue and Ke, Lei and Liu, Zhaoyang and Yuan, Ruibin and Xue, Liumeng and Yang, Yujiu and Chen, Weijia and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2606.12555},
year={2026}
}
@inproceedings{tian2026audiox,
title={AudioX: a unified framework for anything-to-audio generation},
author={Tian, Zeyue and Jin, Y and Liu, Z and others},
booktitle={Proceedings of the Fourteenth International Conference on Learning Representations},
year={2026}
}
License
This model is released under CC-BY-NC 4.0.
Note: The models are watermarked and are for non-commercial use only.