--- license: cc-by-nc-sa-4.0 --- # MusicLDM MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input. It is available in the 🧨 Diffusers library from v0.21.0 onwards. # Model Details MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview), MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style. This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/). ## Model Sources - [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) - [**Paper**](https://huggingface.co/papers/2308.01546) - [**Demo**](https://musicldm.github.io) - [**Try It!!**](https://huggingface.co/spaces/ircam-reach/musicldm-text-to-music) # Usage First, install the required packages: ``` pip install --upgrade diffusers transformers accelerate ``` ## Text-to-Music For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be used to load pre-trained weights and generate text-conditional audio outputs: ```python from diffusers import MusicLDMPipeline import torch repo_id = "ucsd-reach/musicldm" pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16) pipe = pipe.to("cuda") prompt = "Techno music with a strong, upbeat tempo and high melodic riffs" audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0] ``` The resulting audio output can be saved as a .wav file: ```python import scipy scipy.io.wavfile.write("techno.wav", rate=16000, data=audio) ``` Or displayed in a Jupyter Notebook / Google Colab: ```python from IPython.display import Audio Audio(audio, rate=16000) ``` ## Tips When constructing a prompt, keep in mind: * Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). * Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". During inference: * The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. * The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument. The following example demonstrates how to construct a good audio generation using the aforementioned tips: ```python import scipy import torch from diffusers import MusicLDMPipeline # load the pipeline repo_id = "ucsd-reach/musicldm" pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16) pipe = pipe.to("cuda") # define the prompts prompt = "Techno music with a strong, upbeat tempo and high melodic riffs" negative_prompt = "low quality, average quality" # set the seed generator = torch.Generator("cuda").manual_seed(0) # run the generation audio = pipe( prompt, negative_prompt=negative_prompt, num_inference_steps=200, audio_length_in_s=10.0, num_waveforms_per_prompt=3, ).audios # save the best audio sample (index 0) as a .wav file scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0]) ``` # Citation **BibTeX:** ``` @article{chen2023musicldm, title={"MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies"}, author={Chen*, Ke and Wu*, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo}, journal={arXiv preprint arXiv:2308.01546}, year={2023} } ```