Question regarding the training process
Hi again,
While training the pipeline using CLAP's textual features, do you also train the audio encoder and decoder? Or is just the diffusion model trained?
Thanks.
-Soujanya
cc @sanchit-gandhi any ideas here?
Hey @soujanyaporia ! The model is trained in a two-step process: first the CLAP model on a contrastive task, then the diffusion model on the generative task (conditioned on the CLAP latents). Refer to Appendix A and B of the paper for details: https://arxiv.org/pdf/2301.12503.pdf
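To make the two-step recipe concrete, here is a toy NumPy sketch of the two objectives. This is not the actual AudioLDM/CLAP training code: the symmetric contrastive loss, the single fixed noise level `alpha`, and the `predict_noise` callable are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Step 1 (CLAP): align audio and text embeddings of matched pairs
    with a symmetric InfoNCE-style loss over the batch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(logits))         # matched pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def diffusion_step_loss(x0, cond, predict_noise):
    """Step 2: train the diffusion model to predict the injected noise,
    conditioned on the (frozen) CLAP text latent `cond`."""
    noise = rng.standard_normal(x0.shape)
    alpha = 0.5  # one fixed noise level, for the sketch only
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * noise
    pred = predict_noise(x_t, cond)         # the network being trained
    return float(((pred - noise) ** 2).mean())
```

In real training, step 1 updates only the CLAP encoders, and step 2 updates only the diffusion model while the CLAP latents (and the audio autoencoder) stay frozen.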
Hi @sanchit-gandhi ! Thanks a lot for the reply. We are releasing our text-to-audio model today and would love to explore whether we could integrate it into HF's diffusers library. Can I get your email address so that I can invite you to our GitHub repo and share the checkpoints and the paper?
Hey @soujanyaporia ! That's super cool - best of luck with your release 🤗. Regarding integrations into diffusers, this tends to be done for models with wide community interest. I can't say for certain whether this applies to your project (I guess the community will decide a bit here!), but you can for sure start checking out the diffusers repo. There are also Community Pipelines, which may be of interest: https://github.com/huggingface/diffusers/issues/841
Got it. FYI, the code is here: https://github.com/declare-lab/tango
Thanks.