Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
This Hugging Face repo stores the pre-trained and fine-tuned checkpoints of Video-LLaMA, a multi-modal conversational large language model with video understanding capability.
Vision-Language Branch
Checkpoint | Link | Note |
---|---|---|
pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-ziya13b-zh | link | Pre-trained with Chinese LLM Ziya-13B |
finetune-ziya13b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
pretrain-billa7b-zh | link | Pre-trained with Chinese LLM BiLLA-7B |
finetune-billa7b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
Audio-Language Branch
Checkpoint | Link | Note |
---|---|---|
pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
Usage
To launch the pre-trained Video-LLaMA on your own machine, please refer to our GitHub repo.
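As a rough illustration of fetching one of the checkpoints listed above, the sketch below builds a checkpoint name from its parts and downloads a file with `huggingface_hub.hf_hub_download`. The helper names `checkpoint_name` and `download_checkpoint` are hypothetical, and the repo id and file name are not stated on this page, so the caller has to supply them. The GitHub repo remains the authoritative guide for setup and inference.

```python
# Checkpoint names copied from the Vision-Language Branch table above.
VL_CHECKPOINTS = {
    "pretrain-vicuna7b", "finetune-vicuna7b-v2",
    "pretrain-vicuna13b", "finetune-vicuna13b-v2",
    "pretrain-ziya13b-zh", "finetune-ziya13b-zh",
    "pretrain-billa7b-zh", "finetune-billa7b-zh",
}


def checkpoint_name(stage, llm, suffix=""):
    """Build a name such as 'finetune-vicuna7b-v2' and check that it is listed."""
    name = "-".join(part for part in (stage, llm, suffix) if part)
    if name not in VL_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name}")
    return name


def download_checkpoint(repo_id, filename, local_dir="ckpt"):
    """Download one file from a Hub repo and return its local path.

    Assumption: repo_id and filename come from the caller; neither is
    given in this document.
    """
    # Imported lazily so the name helper works without the dependency.
    # Requires: pip install huggingface_hub
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

For example, `checkpoint_name("finetune", "vicuna7b", "v2")` yields the name of the fine-tuned 7B entry in the table; the matching file name in the repo still has to be looked up behind the corresponding link.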