AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

AudioX-Turbo is a unified, efficient framework for anything-to-audio generation that accepts text, video, and audio conditions. It follows a teacher–student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion (MAF) module that aligns the multimodal inputs for high-fidelity synthesis. The teacher is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation (DMD) adapted to flow matching, together with a diffusion-based discriminator for high-quality few-step generation.
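
To make "few-step" concrete, here is a minimal, generic sketch of Euler sampling under a flow-matching parameterization. It is not the AudioX-Turbo sampler: the function names, the toy velocity field, and the time convention (t=0 is noise, t=1 is data) are all assumptions made for illustration, and a distilled student may use a different update rule.

```python
# Generic few-step Euler sampler for a flow-matching model (illustrative only).
# Assumed convention for this sketch: t=0 is pure noise, t=1 is data.

def euler_sample(velocity_fn, x0, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)  # one network call = one function evaluation
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network: the exact velocity of the straight path
# x_t = (1 - t) * noise + t * target, which equals (target - x_t) / (1 - t).
target = [0.5, -0.25, 1.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


noise = [1.3, 0.7, -2.1]
sample = euler_sample(toy_velocity, noise, steps=4)
```

Because the toy path is a straight line, four Euler steps recover the target up to floating-point error. A real velocity network is only approximately straight, which is why reducing the step count normally requires distillation.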

AudioX-Turbo generates audio in only 4 sampling steps (no classifier-free guidance), requiring up to ~25× fewer function evaluations (NFE) than multi-step baselines while achieving superior performance, especially on text-to-audio and text-to-music generation.
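
The NFE comparison can be worked through with simple arithmetic. The sketch below assumes a hypothetical baseline that runs 50 sampling steps with classifier-free guidance, where each step evaluates the network twice (conditional and unconditional). That baseline setting is an assumption chosen to show how a factor of about 25× can arise; it is not taken from the paper.

```python
def nfe(steps, uses_cfg):
    """Network forward passes needed for one sample.

    Classifier-free guidance doubles the per-step cost when the conditional
    and unconditional branches are evaluated separately.
    """
    return steps * (2 if uses_cfg else 1)


student_nfe = nfe(steps=4, uses_cfg=False)    # 4 steps, no CFG (from the text above)
baseline_nfe = nfe(steps=50, uses_cfg=True)   # hypothetical multi-step baseline
reduction = baseline_nfe / student_nfe
```

Under these assumptions the student needs 4 evaluations against 100 for the baseline, a 25× reduction.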

Files

| File | Description |
| --- | --- |
| audiox_turbo/audiox_turbo.ckpt | AudioX-Turbo: distilled 4-step student model (inference) |
| pretransform/vae.ckpt | VAE pretransform |
| synchformer/synchformer_state_dict.pth | Synchformer, for video-conditioned (V2A/V2M) generation |
| pretrained_ckpt/pretrained_ckpt.ckpt | Teacher / base model (training only: student init + teacher) |

Download

# Inference checkpoints (student + VAE + Synchformer)
huggingface-cli download HKUSTAudio/AudioX-Turbo \
  audiox_turbo/audiox_turbo.ckpt pretransform/vae.ckpt synchformer/synchformer_state_dict.pth \
  --local-dir checkpoints

# Training only: teacher / base model
huggingface-cli download HKUSTAudio/AudioX-Turbo \
  pretrained_ckpt/pretrained_ckpt.ckpt \
  --local-dir checkpoints
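
After downloading, it can help to confirm that the expected files landed under the local directory. The helper below is a small sketch using only the standard library; the function name `missing_checkpoints` is hypothetical, and the relative paths are the ones listed in the Files section.

```python
from pathlib import Path

# Relative paths as listed in the Files section.
INFERENCE_FILES = [
    "audiox_turbo/audiox_turbo.ckpt",
    "pretransform/vae.ckpt",
    "synchformer/synchformer_state_dict.pth",
]
TRAINING_FILES = ["pretrained_ckpt/pretrained_ckpt.ckpt"]


def missing_checkpoints(local_dir="checkpoints", include_training=False):
    """Return the expected checkpoint paths that are not present under local_dir."""
    root = Path(local_dir)
    wanted = INFERENCE_FILES + (TRAINING_FILES if include_training else [])
    return [rel for rel in wanted if not (root / rel).is_file()]
```

Calling `missing_checkpoints()` returns an empty list when all inference checkpoints are in place; pass `include_training=True` to also check for the teacher checkpoint.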

Sample Usage

To use this model programmatically, install the audiox_turbo package as specified in the official repository.

import torch
import torchaudio
from einops import rearrange

from audiox_turbo.inference import load_audiox_turbo_model
from audiox_turbo.inference.generation import generate_diffusion_cond_dmd
from audiox_turbo.data.utils import (
    read_video, load_and_process_audio, encode_video_with_synchformer, merge_video_audio,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the distilled 4-step student
model, model_config = load_audiox_turbo_model(
    "configs/audiox_turbo_infer_4step.json",
    "checkpoints/audiox_turbo/audiox_turbo.ckpt",
    pretransform_ckpt_path="checkpoints/pretransform/vae.ckpt",
    device=device,
)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config.get("video_fps", 5)
seconds_total = 10

# --- Choose a task by setting the inputs below ---
# Text-to-Audio:  video_path=None,  text_prompt="Typing on a keyboard"
# Video-to-Music: video_path="example/V2M_sample-1.mp4", text_prompt="Generate music for the video"
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None

if video_path:
    video_tensor = read_video(video_path, seek_time=0, duration=seconds_total, target_fps=target_fps)
    sync_features = encode_video_with_synchformer(video_path, 0, seconds_total, device=device)
else:
    video_tensor = torch.zeros(seconds_total * target_fps, 3, 224, 224)
    sync_features = torch.zeros(1, 240, 768, device=device)

audio_tensor = load_and_process_audio(audio_path, sample_rate, 0, seconds_total)

conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": sync_features},
    "text_prompt": text_prompt or "",
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": 0,
    "seconds_total": seconds_total,
}]

# 4-step generation (no classifier-free guidance)
output = generate_diffusion_cond_dmd(
    model, steps=4, conditioning=conditioning,
    sample_size=sample_size, seed=0, device=device,
)

output = output[:, :, : sample_rate * seconds_total]
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output)).clamp_min(1e-8)).clamp(-1, 1)
torchaudio.save("output.wav", output.cpu(), sample_rate)

# Optional: mux the audio back onto the source video
if video_path:
    merge_video_audio(video_path, "output.wav", "output.mp4", 0, seconds_total)
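
The post-processing in the example trims the output to the requested duration, scales it so the loudest sample has magnitude 1, and clamps to [-1, 1]. The sketch below reproduces that logic on a plain Python list so the arithmetic is easy to follow. The function name is hypothetical, and it handles a single mono signal, whereas the example above normalizes the whole output tensor at once.

```python
def trim_and_normalize(samples, sample_rate, seconds_total, eps=1e-8):
    """Trim to sample_rate * seconds_total samples, then peak-normalize.

    `eps` plays the role of clamp_min(1e-8) in the example above: it prevents
    division by zero when the signal is entirely silent.
    """
    trimmed = samples[: sample_rate * seconds_total]
    peak = max((abs(s) for s in trimmed), default=0.0)
    scale = max(peak, eps)
    # Clamp to [-1, 1] to guard against rounding past the valid range.
    return [min(1.0, max(-1.0, s / scale)) for s in trimmed]


# Toy signal: 3 samples per "second", keep 1 second, so the 4th sample is dropped.
normalized = trim_and_normalize([0.5, -2.0, 1.0, 4.0], sample_rate=3, seconds_total=1)
silent = trim_and_normalize([0.0, 0.0], sample_rate=2, seconds_total=1)
```

Here the peak after trimming is 2.0, so the toy signal becomes [0.25, -1.0, 0.5]; the silent signal stays at zero rather than raising a division error.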

Supported Tasks

AudioX-Turbo is a unified model that accepts text, video, and audio conditions in any combination:

| Task | video_path | text_prompt | audio_path |
| --- | --- | --- | --- |
| Text-to-Audio (T2A) | None | "Typing on a keyboard" | None |
| Text-to-Music (T2M) | None | "A music with piano and violin" | None |
| Video-to-Audio (V2A) | "video_path.mp4" | "Generate general audio for the video" | None |
| Video-to-Music (V2M) | "video_path.mp4" | "Generate music for the video" | None |
| TV-to-Audio (TV2A) | "video_path.mp4" | "Ocean waves crashing with people laughing" | None |
| TV-to-Music (TV2M) | "video_path.mp4" | "Generate music with piano instrument" | None |
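
The task settings above can be captured in a small lookup so that a script selects its inputs by task name. This is a convenience sketch, not part of the audiox_turbo package: `TASKS` and `task_inputs` are hypothetical names, and the default prompts are the example prompts from the table.

```python
# (needs_video, example prompt) per task, taken from the table above.
TASKS = {
    "T2A": (False, "Typing on a keyboard"),
    "T2M": (False, "A music with piano and violin"),
    "V2A": (True, "Generate general audio for the video"),
    "V2M": (True, "Generate music for the video"),
    "TV2A": (True, "Ocean waves crashing with people laughing"),
    "TV2M": (True, "Generate music with piano instrument"),
}


def task_inputs(task, video_path=None, text_prompt=None):
    """Return the video_path / text_prompt / audio_path settings for a task."""
    needs_video, default_prompt = TASKS[task]
    if needs_video and video_path is None:
        raise ValueError(f"{task} requires a video_path")
    return {
        "video_path": video_path if needs_video else None,
        "text_prompt": text_prompt or default_prompt,
        "audio_path": None,  # every task in the table leaves audio_path unset
    }


inputs = task_inputs("V2M", video_path="example/V2M_sample-1.mp4")
```

The returned values can be assigned to `video_path`, `text_prompt`, and `audio_path` in the Sample Usage script.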

Citation

If you find our work useful, please consider citing:

@article{tian2026audioxturbo,
  title={AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation},
  author={Tian, Zeyue and Ke, Lei and Liu, Zhaoyang and Yuan, Ruibin and Xue, Liumeng and Yang, Yujiu and Chen, Weijia and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2606.12555},
  year={2026}
}
@inproceedings{tian2026audiox,
  title={AudioX: a unified framework for anything-to-audio generation},
  author={Tian, Zeyue and Jin, Y and Liu, Z and others},
  booktitle={Proceedings of the Fourteenth International Conference on Learning Representations},
  year={2026}
}

License

This model is released under CC-BY-NC 4.0.

Note: The models are watermarked and are strictly for non-commercial use only.
