VideoMind-2B

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.

πŸ”– Model Details

Model Description

  • Model type: Multi-modal Large Language Model
  • Language(s): English
  • License: BSD-3-Clause

More Details

Please refer to our GitHub Repository for more details about this model.
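Since inference depends on custom code from the repository, a minimal sketch for fetching just the checkpoint is shown below. It assumes only the standard Hugging Face Hub download flow; the actual inference entry points are defined in the GitHub repository, not here.

```python
# Minimal sketch: download the VideoMind-2B weights from the Hugging Face Hub.
# Assumption: the standard Hub snapshot flow applies; run the model itself
# with the code provided in the GitHub repository.
from huggingface_hub import snapshot_download

# Fetch all files in the model repo to a local cache directory.
local_dir = snapshot_download(repo_id="yeliudev/VideoMind-2B")
print(f"Checkpoint downloaded to: {local_dir}")
```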

πŸ“– Citation

Please cite our paper if you find this project helpful.

@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}