# InternVideo2 \[[Paper\]](https://arxiv.org/abs/2403.15377) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-activitynet)](https://paperswithcode.com/sota/action-recognition-in-videos-on-activitynet?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-moments-in-time)](https://paperswithcode.com/sota/action-classification-on-moments-in-time?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-on-hacs)](https://paperswithcode.com/sota/action-recognition-on-hacs?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-vatex)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-activitynet)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-vatex)](https://paperswithcode.com/sota/video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-audiocaps)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-audiocaps?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/audio-classification-on-esc-50)](https://paperswithcode.com/sota/audio-classification-on-esc-50?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-grounding-on-qvhighlights)](https://paperswithcode.com/sota/video-grounding-on-qvhighlights?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-fineaction)](https://paperswithcode.com/sota/temporal-action-localization-on-fineaction?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-hacs)](https://paperswithcode.com/sota/temporal-action-localization-on-hacs?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-thumos14)](https://paperswithcode.com/sota/temporal-action-localization-on-thumos14?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-activitynet)](https://paperswithcode.com/sota/temporal-action-localization-on-activitynet?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-question-answer-on-egoschema-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-egoschema-1?p=internvideo2-scaling-video-foundation-models) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=internvideo2-scaling-video-foundation-models) This repo will give the code and models of '[InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding](https://arxiv.org/abs/2403.15377)' soon. - **Achieved `92.1%` Top1 accuracy in Kinetics 400.** - **Achieved `SOTA` performance on over `60` video/audio-related tasks (including action recognition, temporal localization, retrieval, etc) when released.** ## Updates - `2024/04/15`: Update the code and scripts for InternVideo2 CLIP. - `2024/04/13`: Update the code and scripts for InternVideo2 Stage1 & 2. - `2024/03/22`: The technical report of InternVideo2 is released. ## Citation If this work is helpful for your research, please consider citing InternVideo. ``` @article{wang2024internvideo2, title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding}, author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and Shi, Yansong and Jiang, Tianxiang and Li, Songze and Zhang, Hongjie and Huang, Yifei and Qiao, Yu and Wang, Yali and Wang, Limin}, journal={arXiv preprint arXiv:2403.15377}, year={2024} } ```