---
title: 'VideoChat: Chat-Centric Video Understanding'
emoji: πŸ‘€
colorFrom: green
colorTo: blue
sdk: gradio
python_version: 3.8.16
app_file: app.py
pinned: false
license: mit
---

# 🦜 VideoChat [[paper](https://arxiv.org/abs/2305.06355)]

![images](assert/framework.png)

In this study, we initiate an exploration into video understanding by introducing VideoChat, an **end-to-end chat-centric video understanding system**. It integrates video foundation models and large language models via a learnable neural interface, excelling in **spatiotemporal reasoning, event localization, and causal relationship inference**. To instruction-tune this system, we propose a **video-centric instruction dataset** composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes **spatiotemporal reasoning and causal relationships**, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research.

# :fire: Updates

- **2023/05/11**: Release of 🦜**VideoChat V1**, which can **handle both image and video understanding!**
  - [Model](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) and [Data](https://github.com/OpenGVLab/InternVideo/blob/main/Data/instruction_data.md).
  - πŸ§‘β€πŸ’» *Online demo is in preparation.*
  - πŸ§‘β€πŸ”§ *Tuning scripts are being cleaned up.*

# :hourglass_flowing_sand: Schedule

- [x] Small-scale video instruction data and tuning
- [x] Instruction tuning on BLIP+UniFormerV2+Vicuna
- [ ] Large-scale and complex video instruction data
- [ ] Instruction tuning on a strong video foundation model
- [ ] User-friendly interactions with longer videos
- [ ] ...

# :speech_balloon: Example
Comparison with ChatGPT, MiniGPT-4, LLaVA, and mPLUG-Owl. Our VideoChat handles both image and video understanding well!

- [Video] Why is the video funny?
- [Video] Spatial perception
- [Video] Temporal perception
- [Video] Multi-turn conversation
- Image understanding
# :running: Usage

- Prepare the environment.
  ```shell
  pip install -r requirements.txt
  ```
- Download the [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) models:
  - ViT: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth`
  - QFormer: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth`
  - Change the `vit_model_path` and `q_former_model_path` in [config.json](./configs/config.json) (a sketch of these config fields appears at the end of this README).
- Download the [StableVicuna](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) model:
  - LLaMA: Download it from the [original repo](https://github.com/facebookresearch/llama) or [Hugging Face](https://huggingface.co/decapoda-research/llama-13b-hf).
  - If you download LLaMA from the original repo, convert it to the Hugging Face format with the following command:
    ```shell
    # convert_llama_weights_to_hf.py is copied from transformers
    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
        --input_dir /path/to/downloaded/llama/weights \
        --model_size 13B --output_dir /output/path
    ```
  - Download [stable-vicuna-13b-delta](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) and apply the delta weights:
    ```shell
    # fastchat v0.1.10
    python3 apply_delta.py \
        --base /path/to/model_weights/llama-13b \
        --target stable-vicuna-13b \
        --delta CarperAI/stable-vicuna-13b-delta
    ```
  - Change the `llama_model_path` in [config.json](./configs/config.json).
- Download the [VideoChat](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) model:
  - Change the `videochat_model_path` in [config.json](./configs/config.json).
- Run the demo with Gradio:
  ```shell
  python demo.py
  ```
- Another demo, as a Jupyter notebook, can be found in [demo.ipynb](demo.ipynb).

# :page_facing_up: Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```

# :thumbsup: Acknowledgement

This project builds on the open-source work of [InternVideo](https://github.com/OpenGVLab/InternVideo), [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2), and [StableLM](https://github.com/Stability-AI/StableLM).
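
# :wrench: Appendix: Editing config.json

The Usage steps above each end with editing a path in [config.json](./configs/config.json). If you would rather script those edits, below is a minimal sketch of one way to do it. The four key names (`vit_model_path`, `q_former_model_path`, `llama_model_path`, `videochat_model_path`) come from the steps above, but the file layout (e.g. whether the keys are nested under a sub-dict) and the weight filenames are assumptions, so check them against the actual config before running.

```python
# patch_config.py -- convenience sketch for pointing configs/config.json at
# the downloaded weights. All paths below are placeholders: replace them with
# the locations you actually used in the steps above.
import json

CONFIG_PATH = "configs/config.json"

weight_paths = {
    "vit_model_path": "/path/to/eva_vit_g.pth",
    "q_former_model_path": "/path/to/blip2_pretrained_flant5xxl.pth",
    "llama_model_path": "/path/to/stable-vicuna-13b",
    "videochat_model_path": "/path/to/videochat_weights.pth",  # placeholder filename
}

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

# Assumes the keys sit at the top level of the file; if they are nested
# (e.g. under a "model" section), update that sub-dict instead.
cfg.update(weight_paths)

with open(CONFIG_PATH, "w") as f:
    json.dump(cfg, f, indent=2)
```

After the paths are filled in, `python demo.py` should pick them up the next time it loads the config.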