metadata

license: apache-2.0
datasets:
  - AlexFierro9/Kinetics400
  - imagenet-1k
  - HuggingFaceM4/something_something_v2
language:
  - en
pipeline_tag: video-classification

VideoMamba

Model Details

VideoMamba is a video understanding model built purely on state space models (SSMs).

Developed by: OpenGVLab
Model type: An efficient backbone based on a bidirectional state space model.
License: Non-commercial license

Model Sources

Repository: https://github.com/OpenGVLab/VideoMamba
Paper: https://arxiv.org/abs/2403.06977

Uses

The primary use of VideoMamba is research on image and video tasks, e.g., image classification, action recognition, long-term video understanding, and video-text retrieval, with an SSM-based backbone. The primary intended users of the model are researchers and hobbyists in computer vision, machine learning, and artificial intelligence.

How to Get Started with the Model

To use VideoMamba for a video task, replace your existing backbone with the VideoMamba implementation: https://github.com/OpenGVLab/VideoMamba/blob/main/videomamba/video_sm/models/videomamba.py
Then load this checkpoint into the new backbone and start training.
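
When a pretrained checkpoint is loaded into a backbone for a new task, its keys often need light cleanup first. The sketch below is one possible way to do that and is not taken from the VideoMamba repository. It assumes the weights may be nested under a wrapper key such as `model`, that names may carry a `module.` prefix, and that the classification head lives under `head.`; all of these names, as well as the helper `prepare_state_dict` itself, are hypothetical and should be checked against the actual checkpoint.

```python
from collections import OrderedDict


def prepare_state_dict(
    checkpoint,
    wrapper_keys=("model", "state_dict", "module"),  # assumed wrapper names
    strip_prefix="module.",                          # assumed DDP-style prefix
    drop_prefixes=("head.",),                        # assumed head naming
):
    """Return a flat name -> weights mapping suitable for non-strict loading."""
    state = checkpoint
    # Unwrap one level if the weights are nested under a known wrapper key.
    for key in wrapper_keys:
        if isinstance(state, dict) and isinstance(state.get(key), dict):
            state = state[key]
            break

    cleaned = OrderedDict()
    for name, value in state.items():
        # Remove the wrapper prefix so names match a bare backbone.
        if strip_prefix and name.startswith(strip_prefix):
            name = name[len(strip_prefix):]
        # Skip task-specific layers (e.g. a head sized for another label set).
        if any(name.startswith(prefix) for prefix in drop_prefixes):
            continue
        cleaned[name] = value
    return cleaned


# Placeholder checkpoint with made-up parameter names and values.
demo_checkpoint = {
    "model": {
        "module.patch_embed.proj.weight": [0.1],
        "module.head.weight": [0.2],
        "module.head.bias": [0.3],
    }
}
cleaned = prepare_state_dict(demo_checkpoint)
print(list(cleaned))
```

With PyTorch, the input would typically come from `torch.load(path, map_location="cpu")`, and the result would be passed to `model.load_state_dict(cleaned, strict=False)`, whose return value lists the missing and unexpected keys so you can confirm that only the intended layers were left uninitialized.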

Citation Information

@misc{li2024videomamba,
      title={VideoMamba: State Space Model for Efficient Video Understanding}, 
      author={Kunchang Li and Xinhao Li and Yi Wang and Yinan He and Yali Wang and Limin Wang and Yu Qiao},
      year={2024},
      eprint={2403.06977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}