Model Summary

Video-CCAM-4B is a lightweight Video-MLLM built on Phi-3-mini-4k-instruct and SigLIP SO400M. Note: Phi-3-mini-4k-instruct here refers to an earlier version of the checkpoint; use git commit id ff07dc01615f8113924aed013115ab2abd32115b to obtain it.
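One way to fetch that earlier checkpoint is to pin the revision when downloading from the Hugging Face Hub. The sketch below is only an illustration: it assumes the commit id refers to the Hub repository microsoft/Phi-3-mini-4k-instruct (the repository id is not stated in this card) and that the huggingface_hub package is installed. The helper names are hypothetical.

```python
# Assumption: the base LLM lives in this Hub repository; the card only names the model.
PHI3_REPO_ID = "microsoft/Phi-3-mini-4k-instruct"
# Commit id quoted in the model summary above.
PHI3_REVISION = "ff07dc01615f8113924aed013115ab2abd32115b"


def pinned_download_kwargs(local_dir=None):
    """Build the keyword arguments for a revision-pinned snapshot download."""
    kwargs = {"repo_id": PHI3_REPO_ID, "revision": PHI3_REVISION}
    if local_dir is not None:
        kwargs["local_dir"] = local_dir
    return kwargs


def download_phi3(local_dir=None):
    """Download the pinned snapshot and return its local path (needs network access)."""
    # Imported lazily so the helper above can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**pinned_download_kwargs(local_dir))
```

Calling download_phi3() returns the local directory of the snapshot. Passing the same revision value to from_pretrained in transformers should have a similar effect.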

Usage

Inference uses Hugging Face transformers on NVIDIA GPUs. The following requirements were tested with Python 3.10:

torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
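To confirm that an environment matches these pins before running inference, a small check along the following lines may help. It is a minimal sketch that uses only the standard library; the function name find_mismatches is hypothetical, and ignoring local build tags such as "+cu121" is an assumption about how the installed versions are reported.

```python
from importlib import metadata

# Versions listed in the requirements above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.40.2",
    "peft": "0.10.0",
}


def find_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        # Strip a local build tag (e.g. "2.1.0+cu121") before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty dictionary from find_mismatches() means all four packages are installed at the tested versions.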

Inference & Evaluation

Please refer to Video-CCAM for inference and evaluation.

Video-MME

#Frames     32      96
w/o subs    48.2    49.6
w subs      51.7    53.0

MVBench: 57.78 (16 frames)

Acknowledgement

  • xtuner: Video-CCAM-4B is trained using the xtuner framework. Thanks for their excellent work!
  • Phi-3-mini-4k-instruct: Powerful language models developed by Microsoft.
  • SigLIP SO400M: Outstanding vision encoder developed by Google.

License

The model is licensed under the MIT license.
