AudioLDM

AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.

Model Details

AudioLDM was proposed in the paper AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.

Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.

Checkpoint Details

This is the medium version of the AudioLDM model. Compared with the small checkpoints, it has a larger UNet and a larger CLAP audio projection dimension, and it is the only checkpoint trained with audio embeddings as the conditioning signal. The four AudioLDM checkpoints are summarised below:

Table 1: Summary of the AudioLDM checkpoints.

Checkpoint	Training Steps	Audio conditioning	CLAP audio dim	UNet dim	Params
audioldm-s-full	1.5M	No	768	128	421M
audioldm-s-full-v2	> 1.5M	No	768	128	421M
audioldm-m-full	1.5M	Yes	1024	192	652M
audioldm-l-full	1.5M	No	768	256	975M

Usage

First, install the required packages:

pip install --upgrade diffusers transformers accelerate

Text-to-Audio

For text-to-audio generation, the AudioLDMPipeline can be used to load pre-trained weights and generate text-conditional audio outputs:

from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm-m-full"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

The resulting audio output can be saved as a .wav file:

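# Import the submodule explicitly: on older SciPy versions a bare
# `import scipy` does not make `scipy.io.wavfile` available.
import scipy.io.wavfile

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)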

Or displayed in a Jupyter Notebook / Google Colab:

from IPython.display import Audio

Audio(audio, rate=16000)
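
If the saved file needs to be 16-bit PCM rather than floating point (some players and tools only accept the former), the samples can be converted before writing. The following is a minimal sketch using only the Python standard library. It assumes `audio` is a mono sequence of float samples nominally in [-1.0, 1.0]; the helper name `save_pcm16_wav` is hypothetical and not part of Diffusers or SciPy.

```python
import struct
import wave


def save_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples (nominally in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range so out-of-range values do not wrap around.
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2  # number of samples written


# Example call, reusing `audio` from the Text-to-Audio snippet:
# save_pcm16_wav("techno_pcm16.wav", audio, sample_rate=16000)
```

The pure-Python loop is slow for long clips; it is meant to show the conversion steps rather than to be an optimised exporter.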

Tips

Prompts:

Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
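
As an illustration of the prompt tips above, a descriptive prompt can be assembled from a general subject, a context, and a few quality adjectives. This is only a sketch: the helper `build_prompt` is hypothetical, and the comma-separated layout is one assumed way to phrase a prompt, not a format the model requires.

```python
def build_prompt(subject, context="", qualities=()):
    """Compose a descriptive prompt from a subject, a context, and adjectives."""
    parts = [subject.strip()]
    if context.strip():
        # Add the context so the prompt is specific, e.g. "in a forest".
        parts.append(context.strip())
    prompt = " ".join(parts)
    if qualities:
        # Append descriptive qualities such as "high quality" or "clear".
        prompt += ", " + ", ".join(q.strip() for q in qualities)
    return prompt


print(build_prompt("water stream", context="in a forest", qualities=["high quality", "clear"]))
# prints: water stream in a forest, high quality, clear
```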

Inference:

The quality of the predicted audio sample can be controlled by the num_inference_steps argument: more steps give higher-quality audio at the cost of slower inference.
The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument.
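
To hear how the two arguments above affect the output, the same prompt can be generated at several step counts and the clips compared. The sketch below assumes `pipe` is an already loaded AudioLDMPipeline, as in the Text-to-Audio section; the helper name `generate_with_step_counts` and the step counts in the example call are illustrative choices, not recommended settings.

```python
def generate_with_step_counts(pipe, prompt, step_counts, audio_length_in_s=5.0):
    """Generate one clip per step count so the results can be compared by ear.

    `pipe` is assumed to be a loaded AudioLDMPipeline (or any callable that
    accepts the same keyword arguments and returns an object with `.audios`).
    """
    clips = {}
    for steps in step_counts:
        output = pipe(
            prompt,
            num_inference_steps=steps,            # more steps: slower, usually cleaner
            audio_length_in_s=audio_length_in_s,  # requested clip duration in seconds
        )
        clips[steps] = output.audios[0]
    return clips


# Example call, reusing `pipe` and `prompt` from the Text-to-Audio section:
# clips = generate_with_step_counts(pipe, prompt, step_counts=[10, 50, 200])
```

Each call draws fresh random noise, so the clips will also differ for reasons unrelated to the step count.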

Citation

BibTeX:

@article{liu2023audioldm,
  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={arXiv preprint arXiv:2301.12503},
  year={2023}
}
