---
license: cc-by-nc-nd-4.0
---

# MusicLDM

MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.

# Model Details

MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents.

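To see how these pieces fit together in the released checkpoint, you can inspect the sub-models the Diffusers pipeline is composed of. The sketch below is illustrative only; the exact component names (such as `text_encoder` for the CLAP model and `vocoder`) are assumptions based on the usual Diffusers audio-pipeline layout:

```python
from diffusers import MusicLDMPipeline

pipe = MusicLDMPipeline.from_pretrained("cvssp/musicldm")

# Print each sub-model the pipeline is built from (e.g. the CLAP text encoder,
# VAE, UNet, scheduler and vocoder); the names shown are assumed, not guaranteed.
for name, module in pipe.components.items():
    print(f"{name}: {type(module).__name__}")
```
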
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to
the music samples, both in the time domain and in the latent space. These augmentations encourage the model to
interpolate between the training samples while staying within the domain of the training data. The result is generated
music that is more diverse while staying faithful to the corresponding style.

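The mixup idea behind this augmentation is to blend two training clips into a new one. The sketch below shows plain waveform mixup only; it is illustrative and omits the paper's beat-synchronous alignment as well as the corresponding mixing in latent space:

```python
import numpy as np

def waveform_mixup(audio_a: np.ndarray, audio_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Blend two audio clips; an illustrative stand-in for the paper's
    beat-synchronous mixup, which first aligns the clips on their downbeats."""
    length = min(len(audio_a), len(audio_b))
    return lam * audio_a[:length] + (1.0 - lam) * audio_b[:length]
```
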
## Model Sources

- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/musicldm)
- [**Paper**](https://huggingface.co/papers/2308.01546)
- [**Demo**](https://huggingface.co/spaces/cvssp/musicldm)

# Usage

First, install the required packages:

```
pip install --upgrade diffusers transformers scipy
```

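MusicLDM support landed in Diffusers v0.21.0, so it can be worth confirming the installed version before loading the pipeline:

```python
import diffusers

# MusicLDM requires diffusers v0.21.0 or later
print(diffusers.__version__)
```
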
## Text-to-Music

For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be
used to load pre-trained weights and generate text-conditional audio outputs:

```python
from diffusers import MusicLDMPipeline
import torch

repo_id = "cvssp/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```

The resulting audio output can be saved as a .wav file:

```python
import scipy

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(audio, rate=16000)
```

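Both snippets hard-code the model's 16 kHz output rate. If you would rather read it from the loaded pipeline, here is a small sketch, assuming the vocoder exposes `config.sampling_rate` as other Diffusers audio pipelines do:

```python
# Continues from the text-to-music example above (`pipe`, `audio` and `scipy` already defined);
# `pipe.vocoder.config.sampling_rate` is assumed to hold the output rate (16000 for this checkpoint).
sampling_rate = pipe.vocoder.config.sampling_rate
scipy.io.wavfile.write("techno.wav", rate=sampling_rate, data=audio)
```
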
## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

During inference:

* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.

The following example demonstrates how to construct a good audio generation using the aforementioned tips:

```python
import scipy
import torch
from diffusers import MusicLDMPipeline

# load the pipeline
repo_id = "cvssp/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality"

# set the seed for reproducible generation
generator = torch.Generator("cuda").manual_seed(0)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```

# Citation

**BibTeX:**
```
@article{chen2023musicldm,
  title={MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies},
  author={Ke Chen and Yusong Wu and Haohe Liu and Marianna Nezhurina and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  journal={arXiv preprint arXiv:2308.01546},
  year={2023}
}
```