Which Model Version Does the HuggingFace Pipeline Use?
Hi,
I recently came across your model, and first, I want to say that I appreciate your work. Great job 👏🏻
Second, while reading the GitHub README, I noticed that multiple versions are available, including audioldm-m-text-ft, audioldm-s-full, audioldm-l-full, etc., which can be used with the pip package or the GitHub repo directly. However, I was wondering which version is used when running the model through the HuggingFace Diffusers library pipeline, and whether I can change it to another version.
Best regards,
Mo
Hi,
I would also like to confirm that this has been adapted from the ldm_trimmed.ckpt checkpoint. I am also interested in knowing whether the new checkpoints are going to be adapted to this structure, especially audioldm-m-full.ckpt, as that seems to be the only one capable of encoding from audio as well as from text (is this right?).
Thank you very much,
Laura
Hey @mmahdieh and @lauraibnz,
Indeed, there have been new checkpoints released since AudioLDM was merged into diffusers. I've just started the process of converting these checkpoints to diffusers format and will push these to the Hub when ready.
This is the conversion script I'm using: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_audioldm_to_diffusers.py
Essentially, I downloaded the original ldm_trimmed.ckpt to a local device and then ran the script on it, thus converting it to diffusers format. I'm doing the same with the new small/medium/large checkpoints, updating the model config as required for the larger hidden dimensionalities.
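As a rough sketch of the conversion step described above, the snippet below only assembles the command line for the conversion script and does not run it. The flag names (--checkpoint_path, --dump_path, --model_channels) are my recollection of the script's arguments and should be treated as assumptions, so please check them against the script's --help output. The helper name build_conversion_command and the example paths are hypothetical.

```python
import shlex
from typing import List, Optional


def build_conversion_command(
    checkpoint_path: str,
    dump_path: str,
    model_channels: Optional[int] = None,
) -> List[str]:
    """Assemble (but do not run) the call to the AudioLDM conversion script.

    Flag names are assumed from memory of the script; verify with --help.
    """
    cmd = [
        "python",
        "scripts/convert_original_audioldm_to_diffusers.py",
        "--checkpoint_path", checkpoint_path,  # original .ckpt file
        "--dump_path", dump_path,              # output folder in diffusers format
    ]
    # Larger checkpoints have wider hidden dimensions, so the config may
    # need an explicit override (assumed flag: --model_channels).
    if model_channels is not None:
        cmd += ["--model_channels", str(model_channels)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the shape of the command.
    print(shlex.join(build_conversion_command("ldm_trimmed.ckpt", "audioldm-diffusers")))
```

The command is meant to be run from the root of a local clone of the diffusers repo, since the script path is relative to it.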
It appears that audioldm-m-full.ckpt was trained with audio conditioning, but I'm pretty sure it works the same way as the -s and -l models at inference time (see Figure 1 of the AudioLDM paper: https://arxiv.org/abs/2301.12503).
Hope that answers your questions,
Sanchit
Hi @sanchit-gandhi,
Thank you for your answer and for your work converting the checkpoints. I was asking about the audio encoder because I want to fine-tune one of the models and I would need it for this, but I will look at the conversion script myself otherwise.
Best,
Laura
Hey @lauraibnz - this new README info should detail the checkpoint info a little more: https://huggingface.co/cvssp/audioldm#checkpoint-details
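On the original question of switching versions inside the Diffusers pipeline, here is a minimal sketch. It assumes that each converted checkpoint lives in its own Hub repo under the cvssp organisation, named after the checkpoint. Only cvssp/audioldm is confirmed by the link above. Any other repo id is an assumption and should be checked against the checkpoint details page. The helper names resolve_repo_id and load_audioldm are hypothetical.

```python
from typing import Optional

# The repo linked in this thread; used when no variant is requested.
DEFAULT_REPO_ID = "cvssp/audioldm"


def resolve_repo_id(variant: Optional[str] = None) -> str:
    """Map a checkpoint name to a Hub repo id.

    Assumption: converted checkpoints are published as cvssp/<variant>.
    """
    if variant is None:
        return DEFAULT_REPO_ID
    return f"cvssp/{variant}"


def load_audioldm(variant: Optional[str] = None):
    """Load the AudioLDM pipeline for the chosen checkpoint (downloads weights)."""
    # Imported lazily so resolve_repo_id works without diffusers installed.
    from diffusers import AudioLDMPipeline

    return AudioLDMPipeline.from_pretrained(resolve_repo_id(variant))


# Example usage (requires diffusers and network access):
#   pipe = load_audioldm()                  # default checkpoint
#   pipe = load_audioldm("audioldm-l-full") # assumed repo name, verify first
#   audio = pipe("a dog barking", num_inference_steps=10).audios[0]
```

Passing a different repo id to from_pretrained is the only change needed to swap checkpoints. The rest of the pipeline call stays the same.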