AudioLDMPipeline

Question regarding the training process


Hi again,

When training the pipeline on CLAP's textual features, do you also train the audio encoder and decoder, or is only the diffusion model trained?

Thanks.
-Soujanya


Hey @soujanyaporia! The model is trained in a two-step process: first the CLAP model is trained on a contrastive task, then the diffusion model is trained on the generative task, conditioned on the CLAP latents. Refer to Appendices A and B of the paper for details: https://arxiv.org/pdf/2301.12503.pdf
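To make that concrete, here is a minimal sketch of which components would be updated in the second (generative) stage, using the AudioLDMPipeline components from diffusers. The checkpoint id "cvssp/audioldm" and the freezing scheme follow the standard latent-diffusion recipe and are an illustration, not the authors' actual training script:

```python
import torch
from diffusers import AudioLDMPipeline

# Load a published AudioLDM checkpoint to get at its components;
# "cvssp/audioldm" is used here purely as an example repo id.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

# In the second (generative) stage, the CLAP text encoder and the VAE
# (the audio encoder/decoder asked about above) stay frozen...
pipe.text_encoder.requires_grad_(False)
pipe.vae.requires_grad_(False)

# ...and only the latent diffusion UNet receives gradient updates.
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-4)
```

At inference, the same frozen CLAP text branch produces the embedding that conditions the UNet, so for the original question: only the diffusion model is trained in this stage.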

Hi @sanchit-gandhi! Thanks a lot for the reply. We are releasing our text-to-audio model today and would love to explore whether we could integrate it into Hugging Face's diffusers library. Could I get your email address so I can invite you to our GitHub repo and share the checkpoints and the paper?


Hey @soujanyaporia! That's super cool - best of luck with your release 🤗. Regarding integrations into diffusers, these tend to be done for models with wide community interest. I can't say for certain whether that applies to your project (the community will decide a bit here!), but you can certainly start checking out the diffusers repo. There are also Community Pipelines, which may be of interest: https://github.com/huggingface/diffusers/issues/841
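In case it's useful while you explore: a community pipeline is loaded through the regular DiffusionPipeline entry point with one extra argument. A minimal sketch of the mechanism; the model repo id and pipeline name below are just illustrative examples, not a recommendation:

```python
from diffusers import DiffusionPipeline

# `custom_pipeline` selects a pipeline script that lives outside the core
# library, e.g. one under examples/community in the diffusers repo.
# "lpw_stable_diffusion" (long prompt weighting) is one existing community
# pipeline, used here only to show the loading pattern.
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="lpw_stable_diffusion",
)
```

A text-to-audio model would follow the same pattern with its own pipeline script, which makes this a low-friction way to share a new pipeline before (or instead of) a core integration.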

Got it. The code is here, FYI: https://github.com/declare-lab/tango

Thanks.

