---
license: cc-by-nc-sa-4.0
---
# AudioLDM
AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.
# Model Details
AudioLDM was proposed in the paper [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.
Inspired by [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/laion/clap-htsat-unfused)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.
# Checkpoint Details
This is the **large** version of the AudioLDM model, with twice as many UNet channels and head channels as the small checkpoints. The four AudioLDM checkpoints are summarised in the table below:
**Table 1:** Summary of the AudioLDM checkpoints.
| Checkpoint | Training Steps | Audio conditioning | CLAP audio dim | UNet dim | Params |
|-----------------------------------------------------------------------|----------------|--------------------|----------------|----------|--------|
| [audioldm-s-full](https://huggingface.co/cvssp/audioldm) | 1.5M | No | 768 | 128 | 421M |
| [audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) | > 1.5M | No | 768 | 128 | 421M |
| [audioldm-m-full](https://huggingface.co/cvssp/audioldm-m-full) | 1.5M | Yes | 1024 | 192 | 652M |
| [audioldm-l-full](https://huggingface.co/cvssp/audioldm-l-full) | 1.5M | No | 768 | 256 | 975M |
## Model Sources
- [**Original Repository**](https://github.com/haoheliu/AudioLDM)
- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm)
- [**Paper**](https://arxiv.org/abs/2301.12503)
- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)
# Usage
First, install the required packages:
```
pip install --upgrade diffusers transformers accelerate
```
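Since the pipeline is only available from Diffusers v0.15.0 onwards, it can be worth confirming the installed version before going further. The following is a minimal sketch: `parse_release` and `supports_audioldm` are hypothetical helper names, and pre-release suffixes such as `.dev0` are simply ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First Diffusers release that ships AudioLDM, as stated at the top of this card
MIN_DIFFUSERS = (0, 15, 0)


def parse_release(version_string):
    """Return (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version_string!r}")
    # A missing patch component is treated as 0
    return tuple(int(group or 0) for group in match.groups())


def supports_audioldm(version_string):
    """True if the given Diffusers version is at least MIN_DIFFUSERS."""
    return parse_release(version_string) >= MIN_DIFFUSERS


try:
    installed = version("diffusers")
    status = "OK" if supports_audioldm(installed) else "too old for AudioLDMPipeline"
    print(f"diffusers {installed}: {status}")
except PackageNotFoundError:
    print("diffusers is not installed")
```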
## Text-to-Audio
For text-to-audio generation, the [AudioLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm) can be
used to load pre-trained weights and generate text-conditional audio outputs:
```python
from diffusers import AudioLDMPipeline
import torch
repo_id = "cvssp/audioldm-l-full"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
```
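The snippet above assumes a CUDA GPU. On a machine without one, a possible variation is to fall back to the CPU with full precision, since half precision is generally intended for GPU inference. This is a sketch only: `select_device_and_dtype` and `load_pipeline` are hypothetical helper names, not part of Diffusers.

```python
def select_device_and_dtype(cuda_available):
    """Return (device, dtype_name): half precision only when a GPU is present."""
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


def load_pipeline(repo_id="cvssp/audioldm-l-full"):
    """Load the AudioLDM pipeline on the best available device."""
    # Imported lazily so the helper above works without these packages installed
    import torch
    from diffusers import AudioLDMPipeline

    device, dtype_name = select_device_and_dtype(torch.cuda.is_available())
    pipe = AudioLDMPipeline.from_pretrained(
        repo_id, torch_dtype=getattr(torch, dtype_name)
    )
    return pipe.to(device)
```

Calling `pipe = load_pipeline()` then gives a pipeline that can be used exactly as in the example above; expect CPU inference to be considerably slower.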
The resulting audio output can be saved as a .wav file:
```python
from scipy.io import wavfile

# The sampling rate of 16 kHz matches the rate used for playback below
wavfile.write("techno.wav", rate=16000, data=audio)
```
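If SciPy is not available, or a 16-bit PCM file is preferred for compatibility with older audio players, the standard library `wave` module is one alternative. The sketch below assumes `audio` is a mono sequence of floats in the range [-1.0, 1.0]; `write_pcm16_wav` is a hypothetical helper name.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        # Clip out-of-range values so they cannot overflow the 16-bit range
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
```

With the `audio` array from the example above, this would be called as `write_pcm16_wav("techno.wav", audio)`.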
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio
Audio(audio, rate=16000)
```
<audio controls>
<source src="https://huggingface.co/datasets/sanchit-gandhi/audioldm-readme-samples/resolve/main/audioldm-l-full-techno.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
## Tips
Prompts:
* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
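One way to apply these prompt tips consistently is to assemble prompts from a sound, a context and a few quality adjectives. The helper below is purely illustrative: `build_prompt` is a hypothetical name, and joining the parts with commas is an assumption rather than a format the model requires.

```python
def build_prompt(sound, context=None, qualities=()):
    """Combine a general sound term, an optional context and quality adjectives."""
    # Start with the sound itself, made context-specific when a context is given
    parts = [sound if context is None else f"{sound} {context}"]
    # Append descriptive adjectives such as "high quality" or "clear"
    parts.extend(qualities)
    return ", ".join(parts)


prompt = build_prompt("water stream", "in a forest", ["high quality", "clear"])
print(prompt)  # water stream in a forest, high quality, clear
```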
Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps give higher-quality audio at the cost of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
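To hear the effect of these two arguments, one option is to generate the same prompt at several step counts and compare the results. The sketch below uses hypothetical helper names (`compare_step_counts`, `expected_num_samples`) and assumes the output is trimmed to `audio_length_in_s` at the 16 kHz rate used in the examples above. Passing a `seed` creates a fresh `torch.Generator` for each call so that every run starts from the same noise.

```python
def expected_num_samples(audio_length_in_s, sampling_rate=16000):
    """Number of waveform samples expected for a requested duration."""
    return int(audio_length_in_s * sampling_rate)


def compare_step_counts(pipe, prompt, step_counts, audio_length_in_s=5.0, seed=None):
    """Generate one waveform per step count and return them keyed by step count."""
    results = {}
    for steps in step_counts:
        kwargs = {
            "num_inference_steps": steps,
            "audio_length_in_s": audio_length_in_s,
        }
        if seed is not None:
            # Imported lazily; a new generator per call keeps the runs comparable
            import torch

            kwargs["generator"] = torch.Generator(device="cpu").manual_seed(seed)
        results[steps] = pipe(prompt, **kwargs).audios[0]
    return results
```

For example, `compare_step_counts(pipe, prompt, [10, 50, 200], seed=0)` would return three waveforms that differ only in the number of denoising steps, each expected to contain `expected_num_samples(5.0)` samples.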
# Citation
**BibTeX:**
```
@article{liu2023audioldm,
title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2301.12503},
year={2023}
}
```