---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
"Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation" \
from Microsoft Applied Science Group and UC Berkeley \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
     
**[[Project Homepage](https://consistency-tta.github.io)]**
     
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
     
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
     
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

This work proposes a *consistency distillation* framework to train text-to-audio (TTA) generation models that only require a single neural network query, reducing the computation of the core step of diffusion-based TTA models by a factor of 400.

By incorporating *classifier-free guidance* into the distillation framework, our models retain diffusion models' impressive generation quality and diversity. Furthermore, the non-recurrent differentiable structure of the consistency model allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.

## Model Details

We share three model checkpoints:
- [ConsistencyTTA directly distilled from a diffusion model](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).

The first two models are capable of high-quality single-step text-to-audio generation. Generations are 10 seconds long.

After downloading and unzipping the files, place them in the `saved` directory.

The training and inference code are on our [GitHub page](https://github.com/Bai-YT/ConsistencyTTA). Please refer to the GitHub page for usage details.
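
The download-and-unzip step can be scripted. The following is a minimal sketch, not part of the official code: it assumes the `huggingface_hub` package is installed, that the script runs from the root of the cloned GitHub repository, and that each archive can be extracted directly into `saved`. The helper names `extract_checkpoint` and `download_checkpoint` are hypothetical, and the internal layout of the archives has not been verified here, so check the resulting folder structure against the GitHub instructions.

```python
import zipfile
from pathlib import Path

# Repository and archive names are taken from the checkpoint links on this page.
REPO_ID = "Bai-YT/ConsistencyTTA"
CHECKPOINT_FILES = (
    "ConsistencyTTA.zip",         # distilled from the diffusion teacher
    "ConsistencyTTA_CLAPFT.zip",  # fine-tuned by optimizing the CLAP score
    "LightweightLDM.zip",         # diffusion teacher model
)


def extract_checkpoint(zip_path, saved_dir="saved"):
    """Unzip one archive into saved_dir (hypothetical helper) and return the directory."""
    target = Path(saved_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def download_checkpoint(filename, saved_dir="saved"):
    """Download one checkpoint archive from the Hub and unzip it (hypothetical helper)."""
    if filename not in CHECKPOINT_FILES:
        raise ValueError(f"Unknown checkpoint archive: {filename}")
    # Imported here so the extraction helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    zip_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    return extract_checkpoint(zip_path, saved_dir)
```

For example, `download_checkpoint("ConsistencyTTA.zip")` would fetch the directly distilled model and unpack it under `saved`. The archive itself stays in the local Hugging Face cache, so repeated calls do not download it again.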