Diffusers
AudioLDMPipeline

Which Model Version Does Hugging Face Use?

#3
by mmahdieh - opened

Hi,

I recently came across your model, and first of all, I want to say I appreciate your work. Great job 👏🏻

Second, upon reading the GitHub README, I noticed that multiple versions are available, including audioldm-m-text-ft, audioldm-s-full, audioldm-l-full, etc., which can be used via the pip package or the GitHub repo directly. However, I was wondering which version is used when loading the model through the Hugging Face Diffusers library pipeline, and whether I can change it to another version.
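
For reference, this is a minimal sketch of how I'm loading the pipeline at the moment (I'm assuming cvssp/audioldm is the default repo id; which of the GitHub checkpoints it corresponds to is exactly my question):

```python
from diffusers import AudioLDMPipeline

# Default AudioLDM repo on the Hub (assumed; see question above)
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

# Generate ~5 seconds of audio from a text prompt
audio = pipe(
    "a hammer hitting a wooden surface",
    num_inference_steps=10,
    audio_length_in_s=5.0,
).audios[0]
```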

Best regards,
Mo

Hi,

I would also like to confirm that this has been adapted from the ldm_trimmed.ckpt checkpoint. I am also interested in knowing whether the new checkpoints are going to be adapted to this structure, especially audioldm-m-full.ckpt, as that seems to be the only one capable of encoding from audio as well as text (is this right?).

Thank you very much,

Laura

Centre for Vision, Speech and Signal Processing - University of Surrey org

Hey @mmahdieh and @lauraibnz ,

Indeed, there have been new checkpoints released since AudioLDM was merged into diffusers. I've just started the process of converting these checkpoints to diffusers format and will push these to the Hub when ready.

This is the conversion script I'm using: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_audioldm_to_diffusers.py

Essentially, I downloaded the original ldm_trimmed.ckpt to a local device and then ran the script on it, converting it to diffusers format. I'm doing the same with the new small/medium/large checkpoints, updating the model config as required for the larger hidden dimensionalities.
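
For reference, the invocation looks roughly like this (I'm writing the flags from memory based on the analogous Stable Diffusion conversion script, so double-check them against the script's argument parser):

```bash
python scripts/convert_original_audioldm_to_diffusers.py \
    --checkpoint_path /path/to/ldm_trimmed.ckpt \
    --dump_path ./audioldm-converted
```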

It appears that audioldm-m-full.ckpt was trained with audio conditioning, but I'm pretty sure it works the same way as the -s and -l models at inference time (see Figure 1 of the AudioLDM paper: https://arxiv.org/abs/2301.12503).
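
Once converted, loading and running the -m checkpoint should look identical to the other sizes, e.g. (the repo id here is a placeholder for wherever the converted weights end up on the Hub):

```python
from diffusers import AudioLDMPipeline

# Placeholder repo id until the converted -m weights are pushed
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-m-full")
audio = pipe("water drops echoing in a cave", audio_length_in_s=5.0).audios[0]
```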

Hope that answers your questions,

Sanchit

Hi @sanchit-gandhi ,

Thank you for your answer and for your work converting the checkpoints. I was asking about the audio encoder because I want to fine-tune one of the models and would need it for that; otherwise, I will look at the conversion script myself.
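
In case it's useful to anyone else reading, this is the kind of thing I have in mind: extracting audio embeddings from a full CLAP model via transformers. I'm using laion/clap-htsat-unfused purely as a stand-in; whether it matches the audio encoder inside audioldm-m-full is exactly what I'm unsure about:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

# Stand-in CLAP checkpoint; the encoder actually used to train
# audioldm-m-full may differ, so treat this as illustrative only
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Dummy 1 s waveform at CLAP's expected 48 kHz sampling rate
waveform = np.zeros(48_000, dtype=np.float32)
inputs = processor(audios=waveform, sampling_rate=48_000, return_tensors="pt")

with torch.no_grad():
    audio_embeds = model.get_audio_features(**inputs)  # shape (1, 512)
```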

Best,

Laura

Centre for Vision, Speech and Signal Processing - University of Surrey org

Hey @lauraibnz - this new README section should detail the checkpoints a little more: https://huggingface.co/cvssp/audioldm#checkpoint-details

sanchit-gandhi changed discussion status to closed
