AudioLDM
Overview
AudioLDM was proposed in AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
This pipeline was contributed by sanchit-gandhi. The original codebase can be found here.
Text-to-Audio
The AudioLDMPipeline can be used to load pre-trained weights from cvssp/audioldm and generate text-conditional audio outputs:
from diffusers import AudioLDMPipeline
import torch
import scipy
repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
# save the audio sample as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Tips
Prompts:
- Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. “high quality” or “clear”) and make the prompt context specific (e.g., “water stream in a forest” instead of “stream”).
- It’s best to use general terms like ‘cat’ or ‘dog’ instead of specific names or abstract objects that the model may not be familiar with.
Inference:
- The quality of the predicted audio sample can be controlled by the num_inference_steps argument: higher steps give higher quality audio at the expense of slower inference.
- The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument, as shown in the sketch below.
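The following sketch reuses the cvssp/audioldm checkpoint from the example above and illustrates both knobs by generating a quick low-step draft and a slower, higher-step, longer clip. The prompt, step counts and clip lengths are arbitrary choices for illustration:
import scipy
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Water stream in a forest with birds singing, high quality"

# quick draft: few denoising steps, short clip
draft = pipe(prompt, num_inference_steps=10, audio_length_in_s=2.5).audios[0]

# higher quality: more denoising steps and a longer clip, at the cost of slower inference
final = pipe(prompt, num_inference_steps=100, audio_length_in_s=10.0).audios[0]

# AudioLDM generates audio at a 16 kHz sampling rate
scipy.io.wavfile.write("draft.wav", rate=16000, data=draft)
scipy.io.wavfile.write("final.wav", rate=16000, data=final)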
How to load and use different schedulers
The AudioLDM pipeline uses the DDIMScheduler by default, but diffusers provides many other schedulers that can be used with the AudioLDM pipeline, such as PNDMScheduler, LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler, etc. We recommend the DPMSolverMultistepScheduler, as it is currently the fastest scheduler available.
To use a different scheduler, you can either change it via the ConfigMixin.from_config() method, or pass the scheduler argument to the from_pretrained method of the pipeline. For example, to use the DPMSolverMultistepScheduler, you can do the following:
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
AudioLDMPipeline
class diffusers.AudioLDMPipeline
< source >( vae: AutoencoderKL text_encoder: ClapTextModelWithProjection tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast] unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers vocoder: SpeechT5HifiGan )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode audio samples to and from latent representations.
- text_encoder (ClapTextModelWithProjection) — Frozen text-encoder. AudioLDM uses the text portion of CLAP, specifically the RoBERTa HSTAT-unfused variant.
- tokenizer (PreTrainedTokenizer) — Tokenizer of class RobertaTokenizer.
- unet (UNet2DConditionModel) — U-Net architecture to denoise the encoded audio latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded audio latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
- vocoder (SpeechT5HifiGan) — Vocoder of class SpeechT5HifiGan.
Pipeline for text-to-audio generation using AudioLDM.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, or running on a particular device).
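If you already have the individual components, the pipeline can also be assembled by hand. The following is a minimal sketch, assuming the sub-models live in the standard subfolders of the cvssp/audioldm repository (vae, text_encoder, tokenizer, unet, scheduler, vocoder):
from diffusers import AudioLDMPipeline, AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from transformers import ClapTextModelWithProjection, RobertaTokenizerFast, SpeechT5HifiGan

repo_id = "cvssp/audioldm"

# load each component from its assumed subfolder of the checkpoint
vae = AutoencoderKL.from_pretrained(repo_id, subfolder="vae")
text_encoder = ClapTextModelWithProjection.from_pretrained(repo_id, subfolder="text_encoder")
tokenizer = RobertaTokenizerFast.from_pretrained(repo_id, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo_id, subfolder="scheduler")
vocoder = SpeechT5HifiGan.from_pretrained(repo_id, subfolder="vocoder")

# assemble the pipeline from the individual components
pipe = AudioLDMPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=scheduler,
    vocoder=vocoder,
)
In practice, AudioLDMPipeline.from_pretrained("cvssp/audioldm") performs the same assembly in a single call.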
__call__
< source >(
prompt: typing.Union[str, typing.List[str]] = None
audio_length_in_s: typing.Optional[float] = None
num_inference_steps: int = 10
guidance_scale: float = 2.5
negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
num_waveforms_per_prompt: typing.Optional[int] = 1
eta: float = 0.0
generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
latents: typing.Optional[torch.FloatTensor] = None
prompt_embeds: typing.Optional[torch.FloatTensor] = None
negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None
return_dict: bool = True
callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None
callback_steps: typing.Optional[int] = 1
cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None
output_type: typing.Optional[str] = 'np'
) → StableDiffusionPipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the audio generation. If not defined, one has to pass prompt_embeds instead.
- audio_length_in_s (float, optional, defaults to 5.12) — The length of the generated audio sample in seconds.
- num_inference_steps (int, optional, defaults to 10) — The number of denoising steps. More denoising steps usually lead to higher quality audio at the expense of slower inference.
- guidance_scale (float, optional, defaults to 2.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating audio that is closely linked to the text prompt, usually at the expense of lower sound quality.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the audio generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- num_waveforms_per_prompt (int, optional, defaults to 1) — The number of waveforms to generate per prompt.
- eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to the DDIMScheduler and is ignored for other schedulers.
- generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
- callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
- callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttnProcessor as defined under self.processor in diffusers.cross_attention.
- output_type (str, optional, defaults to "np") — The output format of the generated audio. Choose between "np" to return NumPy np.ndarray objects or "pt" to return PyTorch torch.Tensor objects.
Returns
StableDiffusionPipelineOutput or tuple
StableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated audios.
Function invoked when calling the pipeline for generation.
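The following sketch, which reuses the cvssp/audioldm checkpoint from above, is a hedged illustration of several of these arguments: a seeded generator for reproducibility, a negative_prompt, multiple waveforms per prompt, and a simple progress callback. The prompt text, seed and step counts are arbitrary choices:
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16).to("cuda")

# fixed seed so repeated runs produce the same waveforms
generator = torch.Generator(device="cuda").manual_seed(0)


def log_progress(step, timestep, latents):
    # called every callback_steps denoising steps
    print(f"step {step}, timestep {timestep}, latents shape {tuple(latents.shape)}")


output = pipe(
    prompt="Techno music with a strong, upbeat tempo and high melodic riffs",
    negative_prompt="low quality, distorted, noisy",
    num_inference_steps=50,
    audio_length_in_s=5.0,
    num_waveforms_per_prompt=2,
    guidance_scale=2.5,
    generator=generator,
    callback=log_progress,
    callback_steps=10,
    output_type="np",  # return NumPy arrays; use "pt" for torch tensors
)

# output.audios holds num_waveforms_per_prompt waveforms for the prompt
first_candidate, second_candidate = output.audios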
disable_vae_slicing
Disable sliced VAE decoding. If enable_vae_slicing was previously invoked, this method will go back to computing decoding in one step.
enable_sequential_cpu_offload
Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, unet, text_encoder, vae and vocoder have their state dicts saved to CPU and then are moved to a torch.device('meta') and loaded to GPU only when their specific submodule has its forward method called.
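A minimal usage sketch, assuming the method described here is enable_sequential_cpu_offload (as in other diffusers pipelines) and that accelerate is installed:
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)

# assumed method name: keeps every sub-model on CPU and moves it to the GPU
# only for the duration of its forward pass
pipe.enable_sequential_cpu_offload()

# note: do not call pipe.to("cuda") when offloading; the offload hooks manage device placement
audio = pipe("A hammer hitting a wooden surface", num_inference_steps=10).audios[0]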
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
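A brief sketch of toggling sliced VAE decoding, using the enable_vae_slicing method referenced above and its assumed counterpart disable_vae_slicing:
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16).to("cuda")

# decode the audio latents slice by slice to lower peak memory usage,
# which helps when generating a batch of prompts or several waveforms per prompt
pipe.enable_vae_slicing()
audios = pipe(["Birdsong in a quiet garden"] * 4, num_inference_steps=10).audios

# revert to decoding the whole batch in a single step
pipe.disable_vae_slicing()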