Diffusers documentation


You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.16.0).



AudioLDM was proposed in AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.

Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.

This pipeline was contributed by sanchit-gandhi. The original codebase can be found here.


The AudioLDMPipeline can be used to load pre-trained weights from cvssp/audioldm-s-full-v2 and generate text-conditional audio outputs:

from diffusers import AudioLDMPipeline
import torch
import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile.write resolves

repo_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# save the audio sample as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)



  • Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. “high quality” or “clear”) and make the prompt context specific (e.g., “water stream in a forest” instead of “stream”).
  • It’s best to use general terms like ‘cat’ or ‘dog’ instead of specific names or abstract objects that the model may not be familiar with.


  • The quality of the predicted audio sample can be controlled by the num_inference_steps argument: more steps give higher-quality audio at the expense of slower inference.
  • The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument.
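
As a rough sketch of the two arguments above: the quickstart example saves audio at 16 kHz, so a clip of audio_length_in_s seconds should hold about audio_length_in_s × 16000 samples. The names expected_num_samples and generate are illustrative helpers, not part of diffusers, and the pipeline may round lengths internally, so treat the sample count as an estimate.

```python
SAMPLE_RATE = 16_000  # rate used when saving the .wav file in the example above


def expected_num_samples(audio_length_in_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Approximate number of samples in a clip of the given duration."""
    return int(audio_length_in_s * sample_rate)


def generate(prompt: str, num_inference_steps: int = 10, audio_length_in_s: float = 5.0):
    """Hypothetical wrapper: more steps trade speed for quality, length sets the duration."""
    # Heavy imports are deferred so the helper above works without torch or a GPU.
    import torch
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    return pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        audio_length_in_s=audio_length_in_s,
    ).audios[0]


print(expected_num_samples(5.0))   # 80000
print(expected_num_samples(10.0))  # 160000
```

Calling generate("water stream in a forest", num_inference_steps=50, audio_length_in_s=10.0) would then request a longer clip with more denoising steps, at the cost of slower inference.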

How to load and use different schedulers

The AudioLDM pipeline uses the DDIMScheduler by default, but diffusers provides many other schedulers that can be used with it, such as PNDMScheduler, LMSDiscreteScheduler, EulerDiscreteScheduler and EulerAncestralDiscreteScheduler. We recommend the DPMSolverMultistepScheduler, as it is currently the fastest scheduler available.

To use a different scheduler, you can either change it via the ConfigMixin.from_config() method, or pass the scheduler argument to the from_pretrained method of the pipeline. For example, to use the DPMSolverMultistepScheduler, you can do the following:

>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch

>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm-s-full-v2", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained(
...     "cvssp/audioldm-s-full-v2", scheduler=dpm_scheduler, torch_dtype=torch.float16
... )
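
To try several of the schedulers named in this section, the from_config pattern can be wrapped in a small helper. This is a minimal sketch: swap_scheduler and SCHEDULER_NAMES are hypothetical names introduced here for illustration, and the helper assumes each listed class is importable from the top-level diffusers package.

```python
# Scheduler class names mentioned in this section.
SCHEDULER_NAMES = (
    "DDIMScheduler",
    "PNDMScheduler",
    "LMSDiscreteScheduler",
    "EulerDiscreteScheduler",
    "EulerAncestralDiscreteScheduler",
    "DPMSolverMultistepScheduler",
)


def swap_scheduler(pipeline, scheduler_name: str):
    """Hypothetical helper: replace pipeline.scheduler, reusing its current config."""
    if scheduler_name not in SCHEDULER_NAMES:
        raise ValueError(f"Unknown scheduler: {scheduler_name!r}")
    import diffusers  # deferred so the name check works without diffusers installed

    scheduler_cls = getattr(diffusers, scheduler_name)
    pipeline.scheduler = scheduler_cls.from_config(pipeline.scheduler.config)
    return pipeline
```

With a loaded pipeline, swap_scheduler(pipeline, "DPMSolverMultistepScheduler") has the same effect as the first variant shown above.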


class diffusers.AudioLDMPipeline


( vae: AutoencoderKL text_encoder: ClapTextModelWithProjection tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast] unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers vocoder: SpeechT5HifiGan )


Pipeline for text-to-audio generation using AudioLDM.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)


__call__

( prompt: typing.Union[str, typing.List[str]] = None audio_length_in_s: typing.Optional[float] = None num_inference_steps: int = 10 guidance_scale: float = 2.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_waveforms_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: typing.Optional[int] = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None output_type: typing.Optional[str] = 'np' ) AudioPipelineOutput or tuple


  • prompt (str or List[str], optional) — The prompt or prompts to guide the audio generation. If not defined, one has to pass prompt_embeds instead.
  • audio_length_in_s (float, optional, defaults to 5.12) — The length of the generated audio sample in seconds.
  • num_inference_steps (int, optional, defaults to 10) — The number of denoising steps. More denoising steps usually lead to a higher quality audio at the expense of slower inference.
  • guidance_scale (float, optional, defaults to 2.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale corresponds to w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate audio that is closely linked to the text prompt, usually at the expense of lower sound quality.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the audio generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
  • num_waveforms_per_prompt (int, optional, defaults to 1) — The number of waveforms to generate per prompt.
  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler; it is ignored for other schedulers.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • return_dict (bool, optional, defaults to True) — Whether or not to return an AudioPipelineOutput instead of a plain tuple.
  • callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
  • cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttnProcessor as defined under self.processor in diffusers.cross_attention.
  • output_type (str, optional, defaults to "np") — The output format of the generated audio. Choose between:
    • "np": Return Numpy np.ndarray objects.
    • "pt": Return PyTorch torch.Tensor objects.
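
The effect of guidance_scale can be illustrated with the usual classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. This is a toy sketch on plain Python lists rather than the tensors the pipeline works with, and apply_guidance is an illustrative name, not a diffusers function.

```python
from typing import List


def apply_guidance(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]
text = [1.0, 3.0]
print(apply_guidance(uncond, text, 1.0))  # [1.0, 3.0], identical to the text-conditioned prediction
print(apply_guidance(uncond, text, 2.5))  # [2.5, 6.0], pushed further toward the prompt
```

With a scale of 1 the result equals the text-conditioned prediction, which is why guidance only takes effect for guidance_scale > 1.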

AudioPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated audios.

Function invoked when calling the pipeline for generation.


>>> import torch
>>> from diffusers import AudioLDMPipeline

>>> pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")

>>> prompt = "A hammer hitting a wooden surface"
>>> audio = pipe(prompt).audios[0]
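
Several of the optional arguments documented above can be combined in a single call. The sketch below assumes a CUDA device; make_step_logger and generate_reproducibly are hypothetical helpers, and the negative prompt text is only an example value.

```python
from typing import Callable, List, Tuple


def make_step_logger() -> Tuple[Callable, List[Tuple[int, int]]]:
    """Return a callback with the (step, timestep, latents) signature and the list it fills."""
    history: List[Tuple[int, int]] = []

    def callback(step: int, timestep: int, latents) -> None:
        # Record progress only; the latents are left untouched.
        history.append((step, int(timestep)))

    return callback, history


def generate_reproducibly(prompt: str, seed: int = 0):
    """Hypothetical wrapper: seeded generator, negative prompt, and a step callback."""
    # Heavy imports are deferred so make_step_logger can be used on its own.
    import torch
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)  # fixed seed for repeatable output
    callback, history = make_step_logger()
    output = pipe(
        prompt,
        negative_prompt="low quality",
        num_waveforms_per_prompt=2,
        generator=generator,
        callback=callback,
        callback_steps=1,
    )
    return output.audios, history
```

Because callback_steps is 1, the returned history should hold one (step, timestep) entry per denoising step.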


disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously invoked, this method will go back to computing decoding in one step.


enable_sequential_cpu_offload

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, unet, text_encoder, vae and vocoder have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward method called.


enable_vae_slicing

( )

Enable sliced VAE decoding.

When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
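
The idea behind sliced decoding can be sketched without the real VAE: decode one batch item at a time and join the results, so peak memory depends on a single item rather than the whole batch. In this toy illustration, decode_sliced and toy_decoder are stand-ins invented here, not the library's implementation; load_with_slicing shows the actual pipeline call and assumes a CUDA device.

```python
from typing import Callable, List, Sequence


def decode_sliced(latents: Sequence, decode_fn: Callable[[list], list]) -> List:
    """Decode one batch item at a time and join the results."""
    outputs: List = []
    for item in latents:
        outputs.extend(decode_fn([item]))  # each call sees a batch of size 1
    return outputs


def toy_decoder(batch: list) -> list:
    # Stand-in for the VAE decoder; doubles each value.
    return [2 * x for x in batch]


batch = [1, 2, 3, 4]
# Sliced and full-batch decoding give the same result in this toy setting.
assert decode_sliced(batch, toy_decoder) == toy_decoder(batch)


def load_with_slicing():
    """Load the pipeline and enable sliced VAE decoding."""
    import torch
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16).to("cuda")
    pipe.enable_vae_slicing()
    return pipe
```

Slicing is most relevant when num_waveforms_per_prompt or the number of prompts is large, since that is when the decoded batch grows.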