Diffusers
AudioLDMPipeline

Which Model Version Does Hugging Face Use?

#3
by mmahdieh - opened

Hi,

I recently came across your model, and first of all, I want to say I appreciate your work. Great job 👏🏻

Second, upon reading the GitHub README, I noticed that multiple versions are available, including audioldm-m-text-ft, audioldm-s-full, audioldm-l-full, etc., which can be used via the pip package or the GitHub repo directly. However, I was wondering which version is used when loading the model through the Hugging Face Diffusers library pipeline, and whether I can change it to another version.
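
For reference, this is a minimal sketch of how I'm loading the pipeline at the moment (I'm assuming cvssp/audioldm is the default repo id; which of the GitHub checkpoints it corresponds to is exactly my question):

```python
from diffusers import AudioLDMPipeline

# Default AudioLDM repo on the Hub (assumed; see question above)
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

# Generate ~5 seconds of audio from a text prompt
audio = pipe(
    "a hammer hitting a wooden surface",
    num_inference_steps=10,
    audio_length_in_s=5.0,
).audios[0]
```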

Best regards,
Mo

Hi,

I would also like to confirm that this has been adapted from the ldm_trimmed.ckpt checkpoint. I am also interested in knowing whether the new checkpoints are going to be adapted to this structure, especially audioldm-m-full.ckpt, as that seems to be the only one capable of encoding from audio as well as text (is this right?).

Thank you very much,

Laura

Centre for Vision, Speech and Signal Processing - University of Surrey org

Hey @mmahdieh and @lauraibnz ,

Indeed, there have been new checkpoints released since AudioLDM was merged into diffusers. I've just started the process of converting these checkpoints to diffusers format and will push these to the Hub when ready.

This is the conversion script I'm using: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_audioldm_to_diffusers.py

Essentially, I downloaded the original ldm_trimmed.ckpt to a local device and then ran the script on it, converting it to diffusers format. I'm doing the same with the new small/medium/large checkpoints, updating the model config as required for the larger hidden dimensionalities.
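
For reference, the invocation looks roughly like this (I'm writing the flags from memory based on the analogous Stable Diffusion conversion script, so double-check them against the script's argument parser):

```bash
python scripts/convert_original_audioldm_to_diffusers.py \
    --checkpoint_path /path/to/ldm_trimmed.ckpt \
    --dump_path ./audioldm-converted
```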

It appears that audioldm-m-full.ckpt was trained with audio conditioning, but I'm pretty sure it works the same way as the -s and -l models at inference time (see Figure 1 of the AudioLDM paper: https://arxiv.org/abs/2301.12503).
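
Once converted, loading and running the -m checkpoint should look identical to the other sizes, e.g. (the repo id here is a placeholder for wherever the converted weights end up on the Hub):

```python
from diffusers import AudioLDMPipeline

# Placeholder repo id until the converted -m weights are pushed
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-m-full")
audio = pipe("water drops echoing in a cave", audio_length_in_s=5.0).audios[0]
```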

Hope that answers your questions,

Sanchit

Hi @sanchit-gandhi ,

Thank you for your answer and for your work converting the checkpoints. I was asking about the audio encoder because I want to fine-tune one of the models and would need it for that; otherwise, I will look at the conversion script myself.
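
In case it's useful to anyone else reading, this is the kind of thing I have in mind: extracting audio embeddings from a full CLAP model via transformers. I'm using laion/clap-htsat-unfused purely as a stand-in; whether it matches the audio encoder inside audioldm-m-full is exactly what I'm unsure about:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

# Stand-in CLAP checkpoint; the encoder actually used to train
# audioldm-m-full may differ, so treat this as illustrative only
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Dummy 1 s waveform at CLAP's expected 48 kHz sampling rate
waveform = np.zeros(48_000, dtype=np.float32)
inputs = processor(audios=waveform, sampling_rate=48_000, return_tensors="pt")

with torch.no_grad():
    audio_embeds = model.get_audio_features(**inputs)  # shape (1, 512)
```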

Best,

Laura

Centre for Vision, Speech and Signal Processing - University of Surrey org

Hey @lauraibnz - this new README section should detail the checkpoints a little more: https://huggingface.co/cvssp/audioldm#checkpoint-details

sanchit-gandhi changed discussion status to closed
