Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This Hugging Face repo stores the pre-trained and fine-tuned checkpoints of Video-LLaMA, a multi-modal conversational large language model with video understanding capability.

Vision-Language Branch

| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-ziya13b-zh | link | Pre-trained with Chinese LLM Ziya-13B |
| finetune-ziya13b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh | link | Pre-trained with Chinese LLM BiLLA-7B |
| finetune-billa7b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |

Audio-Language Branch

| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |

Usage

To launch the pre-trained Video-LLaMA on your own machine, please refer to our GitHub repo.
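
Before launching, the checkpoint files need to be on local disk. The following is a minimal sketch of one way to fetch a single file with the `huggingface_hub` client, not the official procedure from the GitHub repo. The helper name `fetch_checkpoint`, the `checkpoints` directory, and the `repo_id` and `filename` arguments are assumptions for illustration: this page does not state the repository ID or the file names behind each checkpoint, so check the repo's file listing and substitute the real values.

```python
from pathlib import Path

# Checkpoint names as listed on this page, grouped by branch.
# These are the table labels, not necessarily the on-disk file names.
CHECKPOINTS = {
    "vision-language": [
        "pretrain-vicuna7b",
        "finetune-vicuna7b-v2",
        "pretrain-vicuna13b",
        "finetune-vicuna13b-v2",
        "pretrain-ziya13b-zh",
        "finetune-ziya13b-zh",
        "pretrain-billa7b-zh",
        "finetune-billa7b-zh",
    ],
    "audio-language": [
        "pretrain-vicuna7b",
        "finetune-vicuna7b-v2",
    ],
}


def is_listed(branch: str, name: str) -> bool:
    """Return True if `name` appears under `branch` in the listing above."""
    return name in CHECKPOINTS.get(branch, [])


def fetch_checkpoint(repo_id: str, filename: str, local_dir: str = "checkpoints") -> str:
    """Download one file from a Hugging Face model repo and return its local path.

    `repo_id` and `filename` are placeholders the caller must supply;
    neither value is given on this page.
    """
    # Imported lazily so the listing helpers work without the dependency.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

A call would look like `fetch_checkpoint("<org>/<repo>", "<checkpoint-file>")`, with both placeholders replaced by the values shown in the repo's file listing. The download call is kept separate from the name listing so that the listing can be inspected without network access.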
