Spaces:
Running
on
Zero
Running
on
Zero
File size: 9,962 Bytes
2d9a728 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
# InternVideo2 \[[Paper\]](https://arxiv.org/abs/2403.15377)
<!-- [中文 README](README_cn.md) -->
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-in-videos-on-activitynet)](https://paperswithcode.com/sota/action-recognition-in-videos-on-activitynet?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-classification-on-moments-in-time)](https://paperswithcode.com/sota/action-classification-on-moments-in-time?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/action-recognition-on-hacs)](https://paperswithcode.com/sota/action-recognition-on-hacs?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-vatex)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-retrieval-on-activitynet)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-retrieval-on-vatex)](https://paperswithcode.com/sota/video-retrieval-on-vatex?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-audiocaps)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-audiocaps?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-text-to-audio-retrieval-on-clotho)](https://paperswithcode.com/sota/zero-shot-text-to-audio-retrieval-on-clotho?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/audio-classification-on-esc-50)](https://paperswithcode.com/sota/audio-classification-on-esc-50?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-grounding-on-qvhighlights)](https://paperswithcode.com/sota/video-grounding-on-qvhighlights?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-fineaction)](https://paperswithcode.com/sota/temporal-action-localization-on-fineaction?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-hacs)](https://paperswithcode.com/sota/temporal-action-localization-on-hacs?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-thumos14)](https://paperswithcode.com/sota/temporal-action-localization-on-thumos14?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/temporal-action-localization-on-activitynet)](https://paperswithcode.com/sota/temporal-action-localization-on-activitynet?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/zero-shot-video-question-answer-on-egoschema-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-egoschema-1?p=internvideo2-scaling-video-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvideo2-scaling-video-foundation-models/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=internvideo2-scaling-video-foundation-models)
This repo will give the code and models of '[InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding](https://arxiv.org/abs/2403.15377)' soon.
- **Achieved `92.1%` Top1 accuracy in Kinetics 400.**
- **Achieved `SOTA` performance on over `60` video/audio-related tasks (including action recognition, temporal localization, retrieval, etc) when released.**
## Updates
- `2024/04/15`: Update the code and scripts for InternVideo2 CLIP.
- `2024/04/13`: Update the code and scripts for InternVideo2 Stage1 & 2.
- `2024/03/22`: The technical report of InternVideo2 is released.
## Citation
If this work is helpful for your research, please consider citing InternVideo.
```
@article{wang2024internvideo2,
title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and Shi, Yansong and Jiang, Tianxiang and Li, Songze and Zhang, Hongjie and Huang, Yifei and Qiao, Yu and Wang, Yali and Wang, Limin},
journal={arXiv preprint arXiv:2403.15377},
year={2024}
}
```
|