This looks amazing
Could you share any information on the training regime? Dataset, hyperparameters, etc.
Hi! Sure, I used 70 videos of 49 frames each. I captioned them using Qwen2-VL, but it made many mistakes, so I had to review and correct them one by one. As for the parameters, since this was my first LoRA with CogVideoX, I basically used the default settings that come with the cogvideox-factory repo. The whole training took around 13 hours on an L40S and used around 32 GB of VRAM, but the cogvideox-factory repo suggests optimizations that make it possible to train on 24 GB of VRAM.
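If it helps, here is a minimal sketch of what the LoRA setup looks like if you reproduce it with plain diffusers + peft instead of the repo's scripts. Note that the rank/alpha values and target modules below are my illustrative choices, not necessarily the exact cogvideox-factory defaults, so check the repo's training scripts for the real settings:

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from peft import LoraConfig

# Load only the transformer, which is the part the LoRA is trained on.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Illustrative LoRA settings (assumptions, not the repo's verified defaults).
lora_config = LoraConfig(
    r=128,            # LoRA rank (assumption)
    lora_alpha=128,   # LoRA scaling factor (assumption)
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(lora_config)

# Only the LoRA weights are trainable, which is why the run fits in ~32 GB
# of VRAM (or ~24 GB with the repo's suggested optimizations).
trainable = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
total = sum(p.numel() for p in transformer.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```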
@Cseti could you share your data-prep scripts (starting from a folder of videos) for splitting and captioning, as well as your fine-tuning scripts? It would be amazing to try making some LoRAs using them. It would be great if you could create a GitHub repo and push your current scripts.
I followed the instructions in cogvideox-factory step by step. They also discuss the required folder structure, here. For running the Qwen2-VL model I used ComfyUI nodes, but it made many mistakes. However, the CogVideoX team released their own captioning method here. I couldn't test it yet, but if that's really what they used to caption the model's training data, it could be the best method for making LoRA training captions too.
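In case someone wants to script the captioning instead of going through ComfyUI, here is a rough sketch that batch-captions a folder of pre-cut clips with Qwen2-VL via transformers and writes a `prompts.txt` / `videos.txt` pair. The model ID, prompt wording, and file layout are my assumptions (verify the expected dataset structure against the cogvideox-factory docs):

```python
import os
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # assumption: any Qwen2-VL checkpoint should work
CLIP_DIR = "dataset/videos"             # assumption: folder of pre-cut 49-frame clips

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def caption(path: str) -> str:
    # Standard Qwen2-VL chat format with a video input plus a text instruction.
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": path},
            {"type": "text", "text": "Describe this video clip in one detailed paragraph."},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], videos=video_inputs, padding=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens so only the generated caption remains.
    return processor.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0].strip()

# One caption per line in prompts.txt, matching relative paths in videos.txt.
with open("dataset/prompts.txt", "w") as prompts, open("dataset/videos.txt", "w") as videos:
    for name in sorted(os.listdir(CLIP_DIR)):
        if name.endswith(".mp4"):
            prompts.write(caption(os.path.join(CLIP_DIR, name)).replace("\n", " ") + "\n")
            videos.write(f"videos/{name}\n")
```

Whatever you use, the captions still need a manual review pass, as Cseti mentioned above; the model makes enough mistakes that you can't trust it blindly.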