Question about text-to-video fine-tuning.

#1
by Snarky36 - opened

Hello Alara! I've read your Hugging Face post about text-to-video models and I would like to ask you more about them. Right now I am trying to build a model that translates natural text into sign language glosses (key words that each represent a sequence of spatio-temporally dependent images of a sign in sign language). I want to find a text-to-video model on Hugging Face that takes text as input and a video as the target, so I can fine-tune it for this translation task. The problem is that I couldn't find an example with Python code showing how to use diffusers to fine-tune on a custom dataset.
I would appreciate any pointers to papers or tutorials about fine-tuning a Hugging Face text-to-video model.
Thank you for your time and effort!
Best Regards,
Andrei.
