Text-to-Audio
Transformers
English
Inference Endpoints
soujanyaporia committed
Commit 1c36db2
1 Parent(s): 2a2ea3c

Update README.md

Files changed (1)
  1. README.md +3 -11
README.md CHANGED
@@ -8,17 +8,15 @@ pipeline_tag: text-to-audio
  tags:
  - text-to-audio
  ---
- # TANGO: Text to Audio using iNstruction-Guided diffusiOn
+ # Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization
 
- **TANGO** is a latent diffusion model for text-to-audio generation. **TANGO** can generate realistic audios including human sounds, animal sounds, natural and artificial sounds and sound effects from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training, inference code and pre-trained checkpoints for the research community.
+ 🎵 We developed **Tango 2** building upon **Tango** for text-to-audio generation. **Tango 2** was initialized with the **Tango-full-ft** checkpoint and underwent alignment training using DPO on **audio-alpaca**, a dataset of pairwise audio preferences. 🎶
 
- 📣 We are releasing [**Tango-Full-FT-Audiocaps**](https://huggingface.co/declare-lab/tango-full-ft-audiocaps) which was first pre-trained on [**TangoPromptBank**](https://huggingface.co/datasets/declare-lab/TangoPromptBank), a collection of diverse text, audio pairs. We later fine tuned this checkpoint on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.
 
  ## Code
 
  Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)
 
- We uploaded several **TANGO** generated samples here: [https://tango-web.github.io/](https://tango-web.github.io/)
 
  Please follow the instructions in the repository for installation, usage and experiments.
 
@@ -63,10 +61,4 @@ prompts = [
  ]
  audios = tango.generate_for_batch(prompts, samples=2)
  ```
- This will generate two samples for each of the three text prompts.
-
- ## Limitations
-
- TANGO is trained on the small AudioCaps dataset so it may not generate good audio samples related to concepts that it has not seen in training (e.g. _singing_). For the same reason, TANGO is not always able to finely control its generations over textual control prompts. For example, the generations from TANGO for prompts _Chopping tomatoes on a wooden table_ and _Chopping potatoes on a metal table_ are very similar. _Chopping vegetables on a table_ also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
-
- We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.
+ This will generate two samples for each of the three text prompts.
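The new README says Tango 2 was aligned with DPO on pairwise audio preferences. As a rough illustration of that objective (a minimal sketch of the standard per-pair DPO loss, not Tango's actual training code; the record's field names are hypothetical), the idea can be written as:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are log-likelihoods of the preferred (chosen) and
    dispreferred (rejected) audio under the policy being trained
    (pi_*) and under the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen sample over the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy
    # ranks the chosen sample well above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A hypothetical audio-alpaca-style preference record (illustrative only).
pair = {"prompt": "Chopping vegetables on a table",
        "chosen": "audio_good.wav", "rejected": "audio_bad.wav"}

# Toy log-likelihoods: the policy prefers the chosen sample more than
# the reference does, so the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0, beta=0.1)
```

At zero margin the loss is log 2; it decreases monotonically as the policy's preference for the chosen audio grows, which is what pushes the aligned model toward the preferred samples.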