Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This Hugging Face repository stores the pre-trained and fine-tuned checkpoints of Video-LLaMA, a multi-modal conversational large language model with video understanding capability.

Vision-Language Branch

| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-ziya13b-zh | link | Pre-trained with the Chinese LLM Ziya-13B |
| finetune-ziya13b-zh | link | Fine-tuned on the machine-translated VideoChat instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh | link | Pre-trained with the Chinese LLM BiLLA-7B |
| finetune-billa7b-zh | link | Fine-tuned on the machine-translated VideoChat instruction-following dataset (in Chinese) |

Audio-Language Branch

| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |

Usage

To launch the pre-trained Video-LLaMA on your own machine, please refer to our GitHub repo.
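Before following the GitHub instructions, the checkpoints listed above need to be on local disk. The snippet below is a minimal sketch of one way to fetch them with the `huggingface_hub` client, not the authors' official procedure. It assumes the repo id `DAMO-NLP-SG/Video-LLaMA-Series` (the name this page is published under) and assumes that each checkpoint's file path contains its name from the tables; the actual file layout is not documented here, so the code lists the repo contents instead of hard-coding paths. The helper names `select_checkpoint_files` and `download_checkpoint`, and the `ckpt` directory, are hypothetical.

```python
def select_checkpoint_files(filenames, checkpoint_name):
    """Keep only the files whose path mentions the given checkpoint name."""
    return [name for name in filenames if checkpoint_name in name]


def download_checkpoint(checkpoint_name,
                        repo_id="DAMO-NLP-SG/Video-LLaMA-Series",  # assumed repo id
                        local_dir="ckpt"):                          # hypothetical target dir
    """Download every file in the repo whose path contains checkpoint_name."""
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    # Ask the Hub for the real file list rather than guessing file names.
    matches = select_checkpoint_files(list_repo_files(repo_id), checkpoint_name)
    if not matches:
        raise FileNotFoundError(
            f"No file matching {checkpoint_name!r} found in {repo_id}"
        )
    # Returns the local path of each downloaded file.
    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in matches
    ]


# Example call (requires network access and `pip install huggingface_hub`):
# paths = download_checkpoint("finetune-vicuna7b-v2")
```

Because the same checkpoint names appear under both the Vision-Language and Audio-Language branches, a name such as `finetune-vicuna7b-v2` may match more than one file; inspect the returned paths to pick the branch you need.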
