Video Inference or training

#4
by dosun - opened

I have tried to run this model for video captioning. However, it only returns a caption for each frame. In the original paper, the model supports video through multiple frames. Is this support at HuggingFace as well?

Microsoft org

Hi,

For video captioning I'd recommend taking a look at the GIT checkpoints fine-tuned on video datasets, like https://huggingface.co/microsoft/git-base-vatex

Sign up or log in to comment