cvssp/audioldm-l-full

sanchit-gandhi HF staff committed on Apr 4, 2023

Commit 9c0400e • 1 Parent(s): c60f047

Create README.md

Files changed (1)

README.md +96 -0

	@@ -0,0 +1,96 @@

+---
+# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+# Doc / guide: https://huggingface.co/docs/hub/model-cards
+{}
+---
+# AudioLDM
+AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.
+# Model Details
+AudioLDM was proposed in the paper [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.
+Inspired by [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4), AudioLDM
+is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/laion/clap-htsat-unfused)
+latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
+sound effects, human speech and music.
+This is the _large_ version of the AudioLDM model (AudioLDM-L-Full in the table below), with twice as many UNet channels and head channels as the _small_ checkpoint. The three AudioLDM checkpoints are summarised in the table below.
+**Table 1:** Summary of the AudioLDM checkpoints. Fréchet Distance (FD) measures how far generated samples are from target samples; lower is better.
+| Checkpoint      | Training Data (h) | Training Steps | Params | Fréchet Distance ↓            |
+|-----------------|-------------------|----------------|--------|-------------------------------|
+| AudioLDM-S      | 137               | 0.6M           | 181M   | 29.48                         |
+| AudioLDM-L      | 137               | 0.6M           | 739M   | 27.12                         |
+| AudioLDM-L-Full | 9174              | 1.5M           | 739M   | **23.31**                     |
+## Model Sources
+- [**Original Repository**](https://github.com/haoheliu/AudioLDM)
+- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm)
+- [**Paper**](https://arxiv.org/abs/2301.12503)
+- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)
+# Usage
+First, install the required packages:
+```
+pip install --upgrade diffusers transformers
+```
+## Text-to-Audio
+For text-to-audio generation, the [AudioLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm) can be
+used to load pre-trained weights and generate text-conditional audio outputs:
+```python
+from diffusers import AudioLDMPipeline
+import torch
+repo_id = "cvssp/audioldm-l-full"
+pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
+audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
+```
+The resulting audio output can be saved as a .wav file:
+```python
+import scipy.io.wavfile  # load the wavfile submodule explicitly
+scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
+```
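When given float32 data, `scipy.io.wavfile.write` stores a 32-bit float WAV file, which some players and tools may not open. The following is a minimal sketch, not part of the original model card, that converts float audio to 16-bit PCM and writes it with the standard-library `wave` module. The helper names `float_to_pcm16` and `write_pcm16_wav` are hypothetical, and the sketch assumes the audio is a mono NumPy array with values roughly in [-1, 1], sampled at the 16 kHz rate used above.

```python
import io
import wave

import numpy as np


def float_to_pcm16(audio):
    """Convert float audio in [-1, 1] to little-endian 16-bit signed PCM."""
    # Clip first so out-of-range values saturate instead of wrapping around.
    clipped = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2")


def write_pcm16_wav(target, audio, rate=16000):
    """Write mono float audio to `target` (a string path or a binary file object)."""
    pcm = float_to_pcm16(audio)
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())


# Example: one second of silence stands in for the pipeline output.
buffer = io.BytesIO()
write_pcm16_wav(buffer, np.zeros(16000, dtype=np.float32))
```

With real output, a call such as `write_pcm16_wav("techno.wav", audio)` would take the place of the in-memory buffer.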
+Or displayed in a Jupyter Notebook / Google Colab:
+```python
+from IPython.display import Audio
+Audio(audio, rate=16000)
+```
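Both snippets above pass `rate=16000`, which suggests the waveform is sampled at 16 kHz. Under that assumption, the arithmetic linking `audio_length_in_s` to the number of samples can be sketched as below. The helper names are hypothetical, and the pipeline may round internally, so treat the result as an approximation to check against `len(audio)`.

```python
SAMPLE_RATE = 16000  # assumed from the `rate=16000` argument used when saving/playing


def expected_num_samples(audio_length_in_s, rate=SAMPLE_RATE):
    """Approximate number of samples for a clip of the requested length."""
    return int(audio_length_in_s * rate)


def duration_in_s(num_samples, rate=SAMPLE_RATE):
    """Duration in seconds of a waveform with `num_samples` samples."""
    return num_samples / rate
```

For the 5-second request in the example, this gives 80000 samples, so `duration_in_s(len(audio))` is a quick sanity check on a generated clip.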
+## Tips
+Prompts:
+* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g., "high quality" or "clear") and make the prompt context-specific (e.g., "water stream in a forest" instead of "stream").
+* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
+Inference:
+* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps give higher-quality audio at the cost of slower inference.
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
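To get a feel for the speed side of the `num_inference_steps` trade-off, one option is to time the same prompt at several step counts. This is a rough sketch rather than a benchmark: `time_inference` is a hypothetical helper, the step counts are arbitrary, and `pipe` is assumed to be any callable that accepts the `num_inference_steps` and `audio_length_in_s` keyword arguments shown in the usage example.

```python
import time


def time_inference(pipe, prompt, step_counts=(10, 25, 50), audio_length_in_s=5.0):
    """Return a dict mapping each step count to the wall-clock seconds of one call."""
    timings = {}
    for steps in step_counts:
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=steps, audio_length_in_s=audio_length_in_s)
        timings[steps] = time.perf_counter() - start
    return timings
```

With a loaded pipeline, `time_inference(pipe, prompt)` returns one timing per step count. Judging the quality side still requires listening to the outputs.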
+# Citation
+**BibTeX:**
+```
+@article{liu2023audioldm,
+  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
+  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
+  journal={arXiv preprint arXiv:2301.12503},
+  year={2023}
+}
+```