OpenGVLab/ViCLIP · Hugging Face

ViCLIP: a simple video CLIP for transferrable video-text representation

Built upon CLIP, we make a simple video-text pretraining baseline ViCLIP. It consists of a video encoder (ViT) and a text encoder, as given below. Both modules are initialized from the corresponding CLIP components. We update the native attention in the video encoder to spatiotemporal attention while maintaining other design elements. For efficient learning, we apply masking to videos in pre-training.

Model Description

Repository: InternVid
Paper: 2307.06942
Point of Contact: mailto:InternVideo Group

Data & Model Zoo

Pretrained Data & Model

Model	Training Data	Descriptions
ViCLIP-L-14 [ckpt]	InternVid-10M-FLT [HuggingFace]

Citation

If you find this work useful for your research, please consider citing InternVid. Your acknowledgement would greatly help us in continuing to contribute resources to the research community.

@article{wang2023internvid,
  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2307.06942},
  year={2023}
}

@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}

OpenGVLab
/

ViCLIP

You need to agree to share your contact information to access this model

Model Description

Data & Model Zoo

Citation

Collection including OpenGVLab/ViCLIP

InternVid