---
license: cc-by-nc-nd-4.0
---
# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
This page hosts the official model checkpoints for the paper \
*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
from Microsoft Applied Science Group and UC Berkeley \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).
**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**     
**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**     
**[[Project Homepage](https://consistency-tta.github.io)]**     
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**     
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**     
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**
## Description
**2024/06 Updates:**
- We have hosted an interactive live demo of ConsistencyTTA at [🤗 Huggingface](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).
- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! We look forward to meeting you on Kos Island.
This work proposes a *consistency distillation* framework for training
text-to-audio (TTA) generation models that require only a single neural network query,
reducing the computation of the core step of diffusion-based TTA models by a factor of 400.
By incorporating *classifier-free guidance* into the distillation framework,
our models retain diffusion models' impressive generation quality and diversity.
Furthermore, the non-recurrent differentiable structure of the consistency model
allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.
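To make the factor-of-400 claim concrete, here is a minimal, hypothetical sketch (a toy stand-in network, not the paper's actual model) of where the savings come from: a classifier-free-guided diffusion sampler queries the core network twice per step (one conditional and one unconditional pass) over, e.g., 200 steps, while the distilled consistency model answers in a single query.

```python
# Toy illustration (hypothetical stand-in network): count core-network queries
# for CFG diffusion sampling vs. single-step consistency generation.

calls = {"network_queries": 0}

def core_network(x, cond):
    """Stand-in for the U-Net; records each invocation."""
    calls["network_queries"] += 1
    return 0.5 * x  # dummy denoising update

def diffusion_sample(x, cond, steps=200, guidance=3.0):
    """Classifier-free-guided sampling: two network passes per step."""
    for _ in range(steps):
        eps_c = core_network(x, cond)   # conditional pass
        eps_u = core_network(x, None)   # unconditional pass
        x = x - (eps_u + guidance * (eps_c - eps_u)) * 0.01
    return x

def consistency_sample(x, cond):
    """The distilled model maps noise to output in one query."""
    return core_network(x, cond)

diffusion_sample(1.0, "a dog barks")
teacher_queries = calls["network_queries"]
calls["network_queries"] = 0
consistency_sample(1.0, "a dog barks")
student_queries = calls["network_queries"]
print(teacher_queries, student_queries)  # → 400 1
```

With 200 sampling steps and two passes per step for guidance, the teacher makes 400 core-network queries where the consistency student makes one.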
<center>
<img src="main_figure_.png" alt="ConsistencyTTA Results" title="Results" width="480"/>
</center>
## Model Details
We share three model checkpoints:
- [ConsistencyTTA directly distilled from a diffusion model](
https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](
https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](
https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).
The first two models perform high-quality single-step text-to-audio generation; each generated clip is 10 seconds long.
After downloading and unzipping the files, place them in the `saved` directory.
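The placement step above can be sketched as follows (a hypothetical snippet, assuming the three zips linked above have already been downloaded into the current directory and that `saved` sits at the repository root):

```shell
# Unzip each downloaded checkpoint archive into the repo's `saved` directory.
mkdir -p saved
for z in ConsistencyTTA.zip ConsistencyTTA_CLAPFT.zip LightweightLDM.zip; do
  if [ -f "$z" ]; then
    unzip -q "$z" -d saved
  fi
done
```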
The training and inference code is available on our [GitHub page](https://github.com/Bai-YT/ConsistencyTTA); please refer to it for usage details.