Text-to-Audio
Transformers
English
Inference Endpoints
soujanyaporia committed
Commit 1c36db2
1 Parent(s): 2a2ea3c

Update README.md

Files changed (1)
  1. README.md +3 -11
README.md CHANGED
@@ -8,17 +8,15 @@ pipeline_tag: text-to-audio
  tags:
  - text-to-audio
  ---
- # TANGO: Text to Audio using iNstruction-Guided diffusiOn
+ # Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization
 
- **TANGO** is a latent diffusion model for text-to-audio generation. **TANGO** can generate realistic audios including human sounds, animal sounds, natural and artificial sounds and sound effects from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training, inference code and pre-trained checkpoints for the research community.
+ 🎵 We developed **Tango 2** building upon **Tango** for text-to-audio generation. **Tango 2** was initialized with the **Tango-full-ft** checkpoint and underwent alignment training using DPO on **audio-alpaca**, a dataset of pairwise audio preferences. 🎶
 
- 📣 We are releasing [**Tango-Full-FT-Audiocaps**](https://huggingface.co/declare-lab/tango-full-ft-audiocaps) which was first pre-trained on [**TangoPromptBank**](https://huggingface.co/datasets/declare-lab/TangoPromptBank), a collection of diverse text, audio pairs. We later fine tuned this checkpoint on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.
 
  ## Code
 
  Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)
 
- We uploaded several **TANGO** generated samples here: [https://tango-web.github.io/](https://tango-web.github.io/)
 
  Please follow the instructions in the repository for installation, usage and experiments.
 
@@ -63,10 +61,4 @@ prompts = [
  ]
  audios = tango.generate_for_batch(prompts, samples=2)
  ```
- This will generate two samples for each of the three text prompts.
-
- ## Limitations
-
- TANGO is trained on the small AudioCaps dataset so it may not generate good audio samples related to concepts that it has not seen in training (e.g. _singing_). For the same reason, TANGO is not always able to finely control its generations over textual control prompts. For example, the generations from TANGO for prompts _Chopping tomatoes on a wooden table_ and _Chopping potatoes on a metal table_ are very similar. _Chopping vegetables on a table_ also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
-
- We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.
+ This will generate two samples for each of the three text prompts.
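The new README says Tango 2 was aligned with DPO on pairwise audio preferences. As a rough illustration of that objective (a minimal sketch of the standard per-pair DPO loss, not Tango's actual training code; the record's field names are hypothetical), the idea can be written as:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are log-likelihoods of the preferred (chosen) and
    dispreferred (rejected) audio under the policy being trained
    (pi_*) and under the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen sample over the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy
    # ranks the chosen sample well above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A hypothetical audio-alpaca-style preference record (illustrative only).
pair = {"prompt": "Chopping vegetables on a table",
        "chosen": "audio_good.wav", "rejected": "audio_bad.wav"}

# Toy log-likelihoods: the policy prefers the chosen sample more than
# the reference does, so the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0, beta=0.1)
```

At zero margin the loss is log 2; it decreases monotonically as the policy's preference for the chosen audio grows, which is what pushes the aligned model toward the preferred samples.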