MusicLDM
MusicLDM was proposed in MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov. MusicLDM takes a text prompt as input and predicts the corresponding music sample.
Inspired by Stable Diffusion and AudioLDM, MusicLDM is a text-to-music latent diffusion model (LDM) that learns continuous audio representations from CLAP latents.
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style.
The abstract of the paper is the following:
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
This pipeline was contributed by sanchit-gandhi.
Tips
When constructing a prompt, keep in mind:
- Descriptive prompt inputs work best; use adjectives to describe the sound (for example, “high quality” or “clear”) and make the prompt context specific where possible (e.g. “melodic techno with a fast beat and synths” works better than “techno”).
- Using a negative prompt can significantly improve the quality of the generated audio. Try using a negative prompt of “low quality, average quality”.
During inference:
- The quality of the generated audio sample can be controlled by the
num_inference_steps
argument; higher steps give higher quality audio at the expense of slower inference. - Multiple waveforms can be generated in one go: set
num_waveforms_per_prompt
to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. - The length of the generated audio sample can be controlled by varying the
audio_length_in_s
argument.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
MusicLDMPipeline
class diffusers.MusicLDMPipeline
< source >( vae: AutoencoderKL text_encoder: typing.Union[transformers.models.clap.modeling_clap.ClapTextModelWithProjection, transformers.models.clap.modeling_clap.ClapModel] tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast] feature_extractor: typing.Optional[transformers.models.clap.feature_extraction_clap.ClapFeatureExtractor] unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers vocoder: SpeechT5HifiGan )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- text_encoder (ClapModel) —
Frozen text-audio embedding model (
ClapTextModel
), specifically the laion/clap-htsat-unfused variant. - tokenizer (
PreTrainedTokenizer
) — A RobertaTokenizer to tokenize text. - feature_extractor (ClapFeatureExtractor) — Feature extractor to compute mel-spectrograms from audio waveforms.
- unet (UNet2DConditionModel) —
A
UNet2DConditionModel
to denoise the encoded audio latents. - scheduler (SchedulerMixin) —
A scheduler to be used in combination with
unet
to denoise the encoded audio latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler. - vocoder (SpeechT5HifiGan) —
Vocoder of class
SpeechT5HifiGan
.
Pipeline for text-to-audio generation using MusicLDM.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None audio_length_in_s: typing.Optional[float] = None num_inference_steps: int = 200 guidance_scale: float = 2.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_waveforms_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: typing.Optional[int] = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None output_type: typing.Optional[str] = 'np' ) → AudioPipelineOutput or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide audio generation. If not defined, you need to passprompt_embeds
. - audio_length_in_s (
int
, optional, defaults to 10.24) — The length of the generated audio sample in seconds. - num_inference_steps (
int
, optional, defaults to 200) — The number of denoising steps. More denoising steps usually lead to a higher quality audio at the expense of slower inference. - guidance_scale (
float
, optional, defaults to 2.0) — A higher guidance scale value encourages the model to generate audio that is closely linked to the textprompt
at the expense of lower sound quality. Guidance scale is enabled whenguidance_scale > 1
. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts to guide what to not include in audio generation. If not defined, you need to passnegative_prompt_embeds
instead. Ignored when not using guidance (guidance_scale < 1
). - num_waveforms_per_prompt (
int
, optional, defaults to 1) — The number of waveforms to generate per prompt. Ifnum_waveforms_per_prompt > 1
, the text encoding model is a joint text-audio model (ClapModel), and the tokenizer is a[~transformers.ClapProcessor]
, then automatic scoring will be performed between the generated outputs and the input text. This scoring ranks the generated waveforms based on their cosine similarity to text input in the joint text-audio embedding space. - eta (
float
, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.FloatTensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - negative_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided,negative_prompt_embeds
are generated from thenegative_prompt
input argument. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return a AudioPipelineOutput instead of a plain tuple. - callback (
Callable
, optional) — A function that calls everycallback_steps
steps during inference. The function is called with the following arguments:callback(step: int, timestep: int, latents: torch.FloatTensor)
. - callback_steps (
int
, optional, defaults to 1) — The frequency at which thecallback
function is called. If not specified, the callback is called at every step. - cross_attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined inself.processor
. - output_type (
str
, optional, defaults to"np"
) — The output format of the generated audio. Choose between"np"
to return a NumPynp.ndarray
or"pt"
to return a PyTorchtorch.Tensor
object. Set to"latent"
to return the latent diffusion model (LDM) output.
Returns
AudioPipelineOutput or tuple
If return_dict
is True
, AudioPipelineOutput is returned, otherwise a tuple
is
returned where the first element is a list with the generated audio.
The call function to the pipeline for generation.
Examples:
>>> from diffusers import MusicLDMPipeline
>>> import torch
>>> import scipy
>>> repo_id = "ucsd-reach/musicldm"
>>> pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
>>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
>>> # save the audio sample as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Disable sliced VAE decoding. If enable_vae_slicing
was previously enabled, this method will go back to
computing decoding in one step.
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared
to enable_sequential_cpu_offload
, this method moves one whole model at a time to the GPU when its forward
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with
enable_sequential_cpu_offload
, but performance is much better due to the iterative execution of the unet
.
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.