hungchiayu committed 2a2ea3c
1 Parent(s): cd8c1ba

Upload 8 files
README.md ADDED
@@ -0,0 +1,72 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ datasets:
+ - bjoernp/AudioCaps
+ language:
+ - en
+ pipeline_tag: text-to-audio
+ tags:
+ - text-to-audio
+ ---
+ # TANGO: Text to Audio using iNstruction-Guided diffusiOn
+
+ **TANGO** is a latent diffusion model for text-to-audio generation. **TANGO** can generate realistic audio from textual prompts, including human sounds, animal sounds, natural and artificial sounds, and sound effects. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training and inference code, and pre-trained checkpoints for the research community.
+
+ 📣 We are releasing [**Tango-Full-FT-Audiocaps**](https://huggingface.co/declare-lab/tango-full-ft-audiocaps), which was first pre-trained on [**TangoPromptBank**](https://huggingface.co/datasets/declare-lab/TangoPromptBank), a diverse collection of text-audio pairs, and then fine-tuned on AudioCaps. This checkpoint obtains state-of-the-art results for text-to-audio generation on AudioCaps.
+
+ ## Code
+
+ Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)
+
+ We have uploaded several **TANGO**-generated samples here: [https://tango-web.github.io/](https://tango-web.github.io/)
+
+ Please follow the instructions in the repository for installation, usage, and experiments.
+
+ ## Quickstart Guide
+
+ Download the **TANGO** model and generate audio from a text prompt:
+
+ ```python
+ import IPython
+ import soundfile as sf
+ from tango import Tango
+
+ # Downloads the model from the Hugging Face Hub on first use
+ tango = Tango("declare-lab/tango")
+
+ prompt = "An audience cheering and clapping"
+ audio = tango.generate(prompt)  # waveform sampled at 16 kHz
+ sf.write(f"{prompt}.wav", audio, samplerate=16000)
+ IPython.display.Audio(data=audio, rate=16000)
+ ```
+ [An audience cheering and clapping.webm](https://user-images.githubusercontent.com/13917097/233851915-e702524d-cd35-43f7-93e0-86ea579231a7.webm)
+
+ The model will be automatically downloaded and saved in the cache; subsequent runs will load it directly from the cache.
+
+ The `generate` function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps to generate better-quality audio, at the cost of increased run-time.
+
+ ```python
+ prompt = "Rolling thunder with lightning strikes"
+ audio = tango.generate(prompt, steps=200)  # more steps -> higher quality, slower sampling
+ IPython.display.Audio(data=audio, rate=16000)
+ ```
+ [Rolling thunder with lightning strikes.webm](https://user-images.githubusercontent.com/13917097/233851929-90501e41-911d-453f-a00b-b215743365b4.webm)
+
+ <!-- [MachineClicking](https://user-images.githubusercontent.com/25340239/233857834-bfda52b4-4fcc-48de-b47a-6a6ddcb3671b.mp4 "sample 1") -->
+
+ Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:
+
+ ```python
+ prompts = [
+     "A car engine revving",
+     "A dog barks and rustles with some clicking",
+     "Water flowing and trickling"
+ ]
+ audios = tango.generate_for_batch(prompts, samples=2)
+ ```
+ This will generate two samples for each of the three text prompts.
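+
+ A minimal sketch of saving every sample to disk, assuming `generate_for_batch` returns one list of waveforms per prompt (an assumption; the exact return structure is not documented here):
+
+ ```python
+ # Assumes `audios` is a list (one entry per prompt) of lists of waveforms
+ for prompt, samples in zip(prompts, audios):
+     for i, audio in enumerate(samples):
+         sf.write(f"{prompt}_{i}.wav", audio, samplerate=16000)
+ ```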
+
+ ## Limitations
+
+ TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts it has not seen during training (e.g., _singing_). For the same reason, TANGO cannot always follow fine-grained details in the textual prompt. For example, its generations for the prompts _Chopping tomatoes on a wooden table_ and _Chopping potatoes on a metal table_ are very similar; _Chopping vegetables on a table_ also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
+
+ We are training another version of TANGO on larger datasets to enhance its generalization, compositionality, and controllability.
config.json ADDED
@@ -0,0 +1 @@
+ {"text_encoder_name": "google/flan-t5-large", "scheduler_name": "stabilityai/stable-diffusion-2-1", "unet_model_name": null, "unet_model_config_path": "configs/diffusion_model_config.json", "snr_gamma": 5.0}
main_config.json ADDED
@@ -0,0 +1 @@
+ {"text_encoder_name": "google/flan-t5-large", "scheduler_name": "stabilityai/stable-diffusion-2-1", "unet_model_name": null, "unet_model_config_path": "configs/diffusion_model_config.json", "snr_gamma": 5.0}
pytorch_model_main.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5d7b85adf7b6141985887298d10e7cf2428b91dfdc66134c9249195916922ec9
+ size 4829066767
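The `.bin` files in this commit are stored as Git LFS pointers like the one above; the weights themselves are fetched by LFS. A small sketch of checking a downloaded file against the pointer's `oid` and `size` (the local path is illustrative):

```python
import hashlib
import os

path = "pytorch_model_main.bin"  # local path after `git lfs pull` (illustrative)
expected_oid = "5d7b85adf7b6141985887298d10e7cf2428b91dfdc66134c9249195916922ec9"
expected_size = 4829066767  # bytes, ~4.8 GB

assert os.path.getsize(path) == expected_size
sha = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha.update(chunk)
assert sha.hexdigest() == expected_oid
print("pointer matches downloaded file")
```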
pytorch_model_stft.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0a4b498f27175c4d9adc422e4069d919e6874c961c0605d542ffed30778d498
+ size 8537803
pytorch_model_vae.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d49e1881f38bd4f4fcaaf1c56686c02fb15f75e80dec5f773ae235b2cf1b61b
+ size 442713669
stft_config.json ADDED
@@ -0,0 +1 @@
+ {"filter_length": 1024, "hop_length": 160, "win_length": 1024, "n_mel_channels": 64, "sampling_rate": 16000, "mel_fmin": 0, "mel_fmax": 8000}
vae_config.json ADDED
@@ -0,0 +1 @@
+ {"image_key": "fbank", "subband": 1, "embed_dim": 8, "time_shuffle": 1, "ddconfig": {"double_z": true, "z_channels": 8, "resolution": 256, "downsample_time": false, "in_channels": 1, "out_ch": 1, "ch": 128, "ch_mult": [1, 2, 4], "num_res_blocks": 2, "attn_resolutions": [], "dropout": 0.0}, "scale_factor": 0.9227914214134216}