---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# AudioLDM

AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.

# Model Details

AudioLDM was proposed in the paper [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

Inspired by [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4), AudioLDM is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/laion/clap-htsat-unfused) latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.

This is the _large_ version of the AudioLDM model, with twice the number of UNet channels and head channels of the _small_ checkpoint. The three AudioLDM checkpoints are summarised in the table below.

**Table 1:** Summary of the AudioLDM checkpoints. Fréchet Distance (FD) measures how closely the generated samples match the target samples; lower is better.

| Checkpoint      | Training Data (h) | Training Steps | Params | Fréchet Distance ↓ |
|-----------------|-------------------|----------------|--------|--------------------|
| AudioLDM-S      | 137               | 0.6M           | 181M   | 29.48              |
| AudioLDM-L      | 137               | 0.6M           | 739M   | 27.12              |
| AudioLDM-L-Full | 9174              | 1.5M           | 739M   | **23.31**          |

## Model Sources

- [**Original Repository**](https://github.com/haoheliu/AudioLDM)
- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm)
- [**Paper**](https://arxiv.org/abs/2301.12503)
- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)

# Usage

First, install the required packages (scipy is used below to save the generated audio):

```
pip install --upgrade diffusers transformers scipy
```

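The `AudioLDMPipeline` is included in 🧨 Diffusers from v0.15.0 onwards, so it can be worth confirming the installed version before running the examples. A minimal sketch of such a check:

```python
import diffusers

# AudioLDM requires Diffusers v0.15.0 or later
print(diffusers.__version__)
```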

## Text-to-Audio

For text-to-audio generation, the [AudioLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm) can be used to load pre-trained weights and generate text-conditional audio outputs:

```python
from diffusers import AudioLDMPipeline
import torch

# load the pre-trained weights in half precision and move the pipeline to the GPU
repo_id = "cvssp/audioldm-l"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# generate a 5-second sample from the text prompt; .audios holds the waveforms as NumPy arrays
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
```

The resulting audio output can be saved as a .wav file:
```python
import scipy

# write the waveform to disk at the model's 16 kHz sampling rate
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

# render an inline audio player for the generated waveform
Audio(audio, rate=16000)
```

## Tips

Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.

Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument (see the sketch after this list).

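As a rough illustration of both sets of tips, the sketch below re-uses the `pipe` object from the usage example above with a more descriptive prompt, a higher step count and a longer duration; the specific prompt and values are arbitrary choices for illustration, not recommendations from the paper:

```python
# a descriptive, context-specific prompt tends to work better than a single word like "stream"
prompt = "High quality recording of a gentle water stream flowing through a forest"

# more inference steps -> higher quality audio at the cost of slower inference;
# audio_length_in_s sets the duration of the generated clip in seconds
audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=10.0).audios[0]
```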

# Citation

**BibTeX:**
```
@article{liu2023audioldm,
  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={arXiv preprint arXiv:2301.12503},
  year={2023}
}
```