Question regarding the training process
Hi again,
While training the pipeline using CLAP's textual features, do you also train the audio encoder and decoder? Or is just the diffusion model trained?
Thanks.
-Soujanya
cc @sanchit-gandhi any ideas here?
Hey @soujanyaporia ! The model is trained in a two-step process: first the CLAP model on a contrastive task, then the diffusion model on the generative task (conditioned on the CLAP latents). Refer to Appendix A and B of the paper for details: https://arxiv.org/pdf/2301.12503.pdf
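To make the two-step recipe concrete, here is a toy NumPy sketch of the two objectives. This is not the actual AudioLDM/CLAP training code: the symmetric contrastive loss, the single fixed noise level `alpha`, and the `predict_noise` callable are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Step 1 (CLAP): align audio and text embeddings of matched pairs
    with a symmetric InfoNCE-style loss over the batch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(logits))         # matched pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def diffusion_step_loss(x0, cond, predict_noise):
    """Step 2: train the diffusion model to predict the injected noise,
    conditioned on the (frozen) CLAP text latent `cond`."""
    noise = rng.standard_normal(x0.shape)
    alpha = 0.5  # one fixed noise level, for the sketch only
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * noise
    pred = predict_noise(x_t, cond)         # the network being trained
    return float(((pred - noise) ** 2).mean())
```

In real training, step 1 updates only the CLAP encoders, and step 2 updates only the diffusion model while the CLAP latents (and the audio autoencoder) stay frozen.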
Hi @sanchit-gandhi ! Thanks a lot for the reply. We are releasing our text-to-audio model today and would love to explore whether we could integrate it into HF's diffusers library. Can I get your email address so that I can invite you to our GitHub repo and share the checkpoints and the paper?
Hey @soujanyaporia ! That's super cool - best of luck with your release 🤗. Regarding integrations into diffusers, this tends to be done for models with wide community interest. I can't say for certain whether this applies to your project (I guess the community will decide a bit here!), but you can for sure start checking out the diffusers repo. There are also Community Pipelines, which may be of interest: https://github.com/huggingface/diffusers/issues/841
Got it. FYI, the code is here: https://github.com/declare-lab/tango
Thanks.