---
language:
  - en
library_name: transformers
license: apache-2.0
metrics:
  - accuracy
tags:
  - multimodal
pipeline_tag: video-text-to-text
model-index:
  - name: VideoChat-Flash-Qwen2_5-2B_res448
    results:
      - task:
          type: multimodal
        dataset:
          name: MLVU
          type: mlvu
        metrics:
          - type: accuracy
            value: 65.7
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: MVBench
          type: mvbench
        metrics:
          - type: accuracy
            value: 70
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: PercepTest
          type: percepTest
        metrics:
          - type: accuracy
            value: 70.5
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: LongVideoBench
          type: longvideobench
        metrics:
          - type: accuracy
            value: 58.3
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: VideoMME (wo sub)
          type: videomme
        metrics:
          - type: accuracy
            value: 57
            name: accuracy
            verified: true
      - task:
          type: multimodal
        dataset:
          name: LVBench
          type: lvbench
        metrics:
          - type: accuracy
            value: 42.9
            name: accuracy
            verified: true
---

# 🦜VideoChat-Flash-Qwen2_5-2B_res448⚡

[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

VideoChat-Flash-2B is constructed upon UMT-L (300M) and Qwen2_5-2B, employing only 16 tokens per frame. By leveraging YaRN to extend the context window to 128k (Qwen2's native context window is 32k), our model supports input sequences of up to approximately 10,000 frames.
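
If you want to verify the extended context window on a downloaded checkpoint, a quick sanity check is to print its configuration. The sketch below uses only the standard `AutoConfig` API; the exact field names (e.g. `max_position_embeddings`, `rope_scaling`) follow general Hugging Face conventions and may sit on a nested LLM sub-config in this custom architecture.

```python
from transformers import AutoConfig

# Minimal sketch: print the checkpoint's config and look for the context-window
# and RoPE-scaling fields (their exact location may differ in this custom architecture).
cfg = AutoConfig.from_pretrained('OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448', trust_remote_code=True)
print(cfg)
```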

Note: Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension. For optimal performance, we recommend interacting in English.

## 📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| VideoChat-Flash-Qwen2_5-2B@448 | 70.0 | 58.3 | 57.0 |
| VideoChat-Flash-Qwen2-7B@224 | 73.2 | 64.2 | 64.0 |
| VideoChat-Flash-Qwen2-7B@448 | 74.0 | 64.7 | 65.3 |

## 🚀 How to use the model

### Generation

We provide a simple generation example below. For more details, please refer to our GitHub repository.

```python
from transformers import AutoModel, AutoTokenizer

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to apply global token compression inside the LLM
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"      # compression strategy
    model.config.llm_compress_layer_list = [4, 18]              # LLM layers at which tokens are further compressed
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]   # fraction of visual tokens kept at each stage
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
```
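
For longer videos, one option is to enable the global token compression shown above and raise `max_num_frames`. The sketch below reuses the same `chat()` interface and config fields from the example; the specific frame count and prompt are illustrative only, not tuned recommendations.

```python
# Sketch: long-video setting, reusing the compression config fields from the example above.
model.config.mm_llm_compress = True
model.config.llm_compress_type = "uniform0_attention"
model.config.llm_compress_layer_list = [4, 18]
model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]

long_output, _ = model.chat(
    video_path="your_long_video.mp4",
    tokenizer=tokenizer,
    user_prompt="Summarize the main events in this video.",
    return_history=True,
    max_num_frames=1024,  # illustrative; more frames cost more memory and latency
    generation_config=generation_config,
)
print(long_output)
```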

## ✏️ Citation

```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```