	---
	license: cc-by-nc-sa-4.0
	---

	# MusicLDM

	MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input.
	It is available in the 🧨 Diffusers library from v0.21.0 onwards.

	# Model Details

	MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dubnov.

	Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
	MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
	latents.

	MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to
	the music samples, both in the time domain and in the latent space. These strategies encourage the model to interpolate
	between the training samples while staying within the domain of the training data. The
	result is generated music that is more diverse while staying faithful to the corresponding style.
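
As a rough illustration of what "interpolating between training samples" means, the sketch below shows plain mixup on two equal-length sequences. It is not the authors' implementation: it leaves out the beat alignment step entirely, and the function name `mixup` and the `alpha=0.4` value are hypothetical choices made for this example.

```python
import random


def mixup(sample_a, sample_b, alpha=0.4, rng=None):
    """Blend two equal-length sequences with a Beta(alpha, alpha) weight.

    Generic mixup only; the beat-synchronous alignment is NOT modeled here.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must have the same length")
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(sample_a, sample_b)]
    return mixed, lam


# Toy inputs standing in for two waveforms (or two latent vectors).
mixed, lam = mixup([0.0, 0.5, 1.0], [1.0, 0.5, 0.0], rng=random.Random(0))
print(f"lambda={lam:.3f}", mixed)
```

Every mixed value lies between the two inputs at that position, which is why the blended sample stays inside the range spanned by the training data.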

	This work is licensed under a
	[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).

	## Model Sources

	- [🧨 Diffusers Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm)
	- [Paper](https://huggingface.co/papers/2308.01546)
	- [Demo](https://musicldm.github.io)
	- [Try It!!](https://huggingface.co/spaces/ircam-reach/musicldm-text-to-music)

	# Usage

	First, install the required packages:

	```
	pip install --upgrade diffusers transformers accelerate
	```
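
Since the pipeline is only available from Diffusers v0.21.0 onwards, it can help to confirm the installed version before loading the model. The following is a minimal sketch using only the standard library. The helper `meets_minimum` is a hypothetical name introduced here, and it compares only the leading numeric parts of the version string, so pre-release ordering is ignored.

```python
import re
from importlib import metadata


def meets_minimum(version, minimum="0.21.0"):
    """Compare the leading numeric parts of two dotted version strings."""

    def parse(v):
        match = re.match(r"\d+(?:\.\d+)*", v)
        if match is None:
            raise ValueError(f"unrecognized version string: {v!r}")
        parts = [int(p) for p in match.group().split(".")[:3]]
        return tuple(parts + [0] * (3 - len(parts)))  # pad "0.21" to (0, 21, 0)

    return parse(version) >= parse(minimum)


try:
    installed = metadata.version("diffusers")
    print("diffusers", installed, "ok" if meets_minimum(installed) else "too old")
except metadata.PackageNotFoundError:
    print("diffusers is not installed")
```

For stricter comparisons, the `packaging` library handles pre-release and development versions properly.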

	## Text-to-Music

	For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be
	used to load pre-trained weights and generate text-conditional audio outputs:

	```python
	from diffusers import MusicLDMPipeline
	import torch

	repo_id = "ucsd-reach/musicldm"
	pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
	pipe = pipe.to("cuda")

	prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
	audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
	```

	The resulting audio output can be saved as a .wav file:
	```python
	import scipy.io.wavfile

	scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
	```

	Or displayed in a Jupyter Notebook / Google Colab:
	```python
	from IPython.display import Audio

	Audio(audio, rate=16000)
	```
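
`scipy.io.wavfile.write` stores the samples using the data type of the array it receives, so a floating-point array is written as a floating-point .wav file. If a 16-bit PCM file is preferred, or SciPy is not installed, something along these lines may work. This is a sketch using only the standard library: `write_wav_pcm16` is a hypothetical helper, and the sine tone merely stands in for the pipeline output, which is assumed to be mono with values in [-1, 1].

```python
import math
import struct
import wave


def write_wav_pcm16(path, samples, rate=16000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM .wav file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to avoid integer overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))


# Stand-in for the pipeline output: one second of a 440 Hz tone.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
write_wav_pcm16("tone.wav", tone, rate=rate)
```

With a NumPy array from the pipeline, `audio.tolist()` could be passed as `samples`.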

	## Tips

	When constructing a prompt, keep in mind:

	* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context-specific where possible (for example, "melodic techno with a fast beat and synths" works better than "techno").
	* Using a negative prompt can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

	During inference:

	* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
	* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
	* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.

	The following example applies these tips to generate several candidate waveforms and save the best-scoring one:

	```python
	import scipy.io.wavfile
	import torch
	from diffusers import MusicLDMPipeline

	# load the pipeline
	repo_id = "ucsd-reach/musicldm"
	pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
	pipe = pipe.to("cuda")

	# define the prompts
	prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
	negative_prompt = "low quality, average quality"

	# set the seed
	generator = torch.Generator("cuda").manual_seed(0)

	# run the generation
	audio = pipe(
	    prompt,
	    negative_prompt=negative_prompt,
	    num_inference_steps=200,
	    audio_length_in_s=10.0,
	    num_waveforms_per_prompt=3,
	    generator=generator,
	).audios

	# save the best audio sample (index 0) as a .wav file
	scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
	```
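
To give an intuition for why index 0 is the best sample, the sketch below ranks candidates by their similarity to a text embedding. It is only an illustration: the vectors are made-up toy values, the function names are hypothetical, and the use of cosine similarity is an assumption, since this README does not specify how the pipeline computes its scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_by_similarity(text_embedding, audio_embeddings):
    """Return candidate indices sorted from most to least similar to the text."""
    scores = [cosine(text_embedding, e) for e in audio_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy embeddings: one for the prompt, one per generated waveform.
text = [1.0, 0.0, 0.5]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.4], [0.5, 0.5, 0.5]]
order, scores = rank_by_similarity(text, candidates)
print(order)  # [1, 2, 0]
```

Reordering the candidates by `order` puts the closest match first, which mirrors the best-to-worst ordering described in the tips.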

	# Citation

	BibTeX:
	```
	@article{chen2023musicldm,
	  title={MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies},
	  author={Chen, Ke and Wu, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
	  journal={arXiv preprint arXiv:2308.01546},
	  year={2023}
	}
	```