---
license: cc-by-nc-sa-4.0
datasets:
- bjoernp/AudioCaps
- declare-lab/audio_alpaca
language:
- en
pipeline_tag: text-to-audio
tags:
- text-to-audio
---
# Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization

🎵 **Tango 2** builds on **Tango** for text-to-audio generation. It was initialized from the Tango-full-ft checkpoint and then aligned using Direct Preference Optimization (DPO) on audio-alpaca, a pairwise text-to-audio preference dataset. 🎶

[Read the paper](https://arxiv.org/abs/2404.09956)

## Code

Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)


Please follow the instructions in the repository for installation, usage and experiments.

## Quickstart Guide

Download the **Tango 2** model and generate audio from a text prompt:

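```python
import IPython.display  # import the submodule explicitly so IPython.display.Audio resolves
import soundfile as sf
from tango import Tango

# Load the Tango 2 checkpoint (downloaded on first use)
tango = Tango("declare-lab/tango2")

prompt = "An audience cheering and clapping"
# Generate a waveform for the prompt
audio = tango.generate(prompt)
# Save the waveform as a 16 kHz WAV file named after the prompt
sf.write(f"{prompt}.wav", audio, samplerate=16000)
# Play the result inline when running in a notebook
IPython.display.Audio(data=audio, rate=16000)
```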

The model is downloaded automatically on first use and saved in the cache. Subsequent runs load it directly from the cache.
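
The quickstart above uses the prompt itself as the file name, which breaks as soon as a prompt contains a path separator such as `/`. A minimal sketch of one way around this follows; `prompt_to_filename` is a hypothetical helper written for this example and is not part of the Tango code base.

```python
import re


def prompt_to_filename(prompt: str, ext: str = ".wav", max_len: int = 80) -> str:
    """Turn a free-text prompt into a conservative file name (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, and hyphens with "_".
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", prompt).strip("_")
    # Truncate long prompts and fall back to a fixed name if nothing survives.
    slug = slug[:max_len].rstrip("_") or "audio"
    return slug + ext


print(prompt_to_filename("An audience cheering and clapping"))
print(prompt_to_filename("wind/rain: 50%?"))
```

The returned name could then be passed to `sf.write` in place of `f"{prompt}.wav"`.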

The `generate` function samples from the latent diffusion model with 100 steps by default. We recommend 200 steps for higher-quality audio, at the cost of a longer runtime.

```python
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
```
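
To see what the extra steps cost on your own hardware, you can time one call per step count. The sketch below is an illustration only: `time_steps` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that merely sleeps so the example runs without downloading the model. Its timings say nothing about the real model.

```python
import time
from typing import Callable, Dict, Iterable


def time_steps(generate: Callable[..., object], prompt: str,
               step_counts: Iterable[int]) -> Dict[int, float]:
    """Return wall-clock seconds for one generate(prompt, steps=n) call per step count."""
    timings = {}
    for n in step_counts:
        start = time.perf_counter()
        generate(prompt, steps=n)
        timings[n] = time.perf_counter() - start
    return timings


def fake_generate(prompt: str, steps: int = 100) -> None:
    """Stand-in for tango.generate (assumption: not the real model)."""
    time.sleep(steps * 1e-4)  # pretend each sampling step costs 0.1 ms


timings = time_steps(fake_generate, "Rolling thunder with lightning strikes", [100, 200])
for n, seconds in timings.items():
    print(f"{n} steps: {seconds:.3f} s")
```

With the model loaded, passing `tango.generate` instead of `fake_generate` should give real measurements, since the call shape `generate(prompt, steps=...)` matches the example above.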


Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:

```python
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
```

This generates two samples for each of the three text prompts.
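
When saving batch output, each waveform needs a name that records which prompt and which sample it belongs to. The sketch below assumes the result is grouped per prompt, i.e. `audios[i]` holds all samples for `prompts[i]`. That layout is an assumption and should be checked against the repository before relying on it. `name_batch_outputs` is a hypothetical helper, and the placeholder strings stand in for real waveforms so the example runs without the model.

```python
from typing import Iterator, Sequence, Tuple


def name_batch_outputs(prompts: Sequence[str],
                       audios: Sequence[Sequence[object]]
                       ) -> Iterator[Tuple[str, str, object]]:
    """Yield (file name, prompt, waveform) for every sample of every prompt.

    ASSUMPTION: audios[i] contains all samples generated for prompts[i].
    """
    if len(prompts) != len(audios):
        raise ValueError("expected one group of samples per prompt")
    for i, (prompt, group) in enumerate(zip(prompts, audios)):
        for j, wave in enumerate(group):
            yield f"prompt{i:02d}_sample{j}.wav", prompt, wave


prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling",
]
# Placeholder data in the assumed layout: two "waveforms" per prompt.
fake_audios = [[f"wave-{i}-{j}" for j in range(2)] for i in range(len(prompts))]

named = list(name_batch_outputs(prompts, fake_audios))
for file_name, prompt, _ in named:
    print(file_name, "<-", prompt)
```

With real output, each waveform could be written with `sf.write(file_name, wave, samplerate=16000)`, as in the single-prompt example.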