Diffusers
Safetensors
MusicLDMPipeline
File size: 4,822 Bytes
007b722
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d11cf7f
 
007b722
 
 
 
 
 
ec9a51c
007b722
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d11cf7f
007b722
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2712f79
 
d11cf7f
2712f79
007b722
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
license: cc-by-nc-nd-4.0
---

# MusicLDM

MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input. 
It is available in the 🧨 Diffusers library from v0.21.0 onwards.

# Model Details

MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents.

MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to 
the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies 
encourages the model to interpolate between the training samples, but stay within the domain of the training data. The 
result is generated music that is more diverse while staying faithful to the corresponding style.

## Model Sources

- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/musicldm)
- [**Paper**](https://huggingface.co/papers/2308.01546)
- [**Demo**](https://musicldm.github.io)
- [**Try It!!**](https://huggingface.co/spaces/ircam-reach/musicldm-text-to-music)

# Usage

First, install the required packages:

```
pip install --upgrade diffusers transformers accelerate
```

## Text-to-Music

For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be 
used to load pre-trained weights and generate text-conditional audio outputs:

```python
from diffusers import MusicLDMPipeline
import torch

repo_id = "cvssp/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```

The resulting audio output can be saved as a .wav file:
```python
import scipy

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

Audio(audio, rate=16000)
```

## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

During inference:

* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.

The following example demonstrates how to construct a good audio generation using the aforementioned tips:

```python
import scipy
import torch
from diffusers import MusicLDMPipeline

# load the pipeline
repo_id = "ircam-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality"

# set the seed
generator = torch.Generator("cuda").manual_seed(0)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```

# Citation

**BibTeX:**
```
@article{chen2023musicldm,
  title={"MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies"},
  author={Chen*, Ke and Wu*, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal={arXiv preprint arXiv:2308.01546},
  year={2023}
}
```