---
license: cc-by-nc-sa-4.0
datasets:
- AudioCaps
language:
- en
tags:
- audio
---

# Auffusion

**Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. **Auffusion** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We introduce Auffusion, a TTA system that adapts text-to-image (T2I) diffusion model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and created for easy audio manipulation.

📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.

📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.

## Auffusion Model Family

| Model Name                | Model Path                                                                                                                |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Auffusion                 | [https://huggingface.co/auffusion/auffusion](https://huggingface.co/auffusion/auffusion)                                  |
| Auffusion-Full            | [https://huggingface.co/auffusion/auffusion-full](https://huggingface.co/auffusion/auffusion-full)                        |
| Auffusion-Full-no-adapter | [https://huggingface.co/auffusion/auffusion-full-no-adapter](https://huggingface.co/auffusion/auffusion-full-no-adapter)  |

## Code

Our code is released here: [https://github.com/happylittlecat2333/Auffusion](https://github.com/happylittlecat2333/Auffusion)

Several **Auffusion**-generated samples are available here: [https://auffusion.github.io](https://auffusion.github.io)

Please follow the instructions in the repository for installation, usage, and experiments.

## Quickstart Guide

First, clone the repository and install the requirements:

```bash
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
```

Download the **Auffusion** model and generate audio from a text prompt:

```python
import IPython.display
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline

# Downloads the checkpoint from the Hugging Face Hub on first use
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")

prompt = "Birds singing sweetly in a blooming garden"
output = pipeline(prompt=prompt)
audio = output.audios[0]

# Save the waveform to disk and play it inline (16 kHz sample rate)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```

The Auffusion model is downloaded automatically from the Hugging Face Hub and saved in the cache; subsequent runs load it directly from the cache. By default, the pipeline samples from the latent diffusion model with 100 inference steps and a guidance scale of 7.5. You can vary these parameters for different results:

```python
# Override the sampling defaults explicitly
prompt = "Rolling thunder with lightning strikes"
output = pipeline(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)
```

## Citation

Please consider citing the following article if you found our work useful:

```bibtex
@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation},
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}
```
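
## Reproducible Sampling

Because diffusion sampling is stochastic, repeated runs with the same prompt produce different audio. Since Auffusion adapts T2I diffusion frameworks, the pipeline may accept a seeded `torch.Generator` like standard diffusers pipelines; the sketch below assumes this `generator` argument is forwarded to the sampler, which has not been confirmed against the released code:

```python
import torch
from auffusion_pipeline import AuffusionPipeline

pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")

# Assumption: `generator` is passed through to the diffusion sampler,
# as in standard diffusers pipelines. If so, fixing the seed should
# make repeated runs produce the same waveform.
generator = torch.Generator().manual_seed(42)
output = pipeline(
    prompt="Rolling thunder with lightning strikes",
    num_inference_steps=100,
    guidance_scale=7.5,
    generator=generator,
)
audio = output.audios[0]
```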