---
license: bsd-3-clause
language:
- en
- zh
pipeline_tag: video-text-to-text
---
# Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This is the Hugging Face repo that stores the pre-trained and fine-tuned checkpoints of Video-LLaMA, a multi-modal conversational large language model with video understanding capability.
## Vision-Language Branch
| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| pretrain-ziya13b-zh | link | Pre-trained with the Chinese LLM Ziya-13B |
| finetune-ziya13b-zh | link | Fine-tuned on a machine-translated VideoChat instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh | link | Pre-trained with the Chinese LLM BiLLA-7B |
| finetune-billa7b-zh | link | Fine-tuned on a machine-translated VideoChat instruction-following dataset (in Chinese) |
## Audio-Language Branch
| Checkpoint | Link | Note |
|---|---|---|
| pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
## Usage
To launch Video-LLaMA with these pre-trained or fine-tuned checkpoints on your own machine, please refer to our GitHub repo.
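
If you only need to fetch a checkpoint file from this repo programmatically, a minimal sketch using the `huggingface_hub` library is shown below. The `repo_id` and `filename` values are placeholders (assumptions for illustration); substitute the actual repo id and the checkpoint filename listed in the tables above.

```python
# Minimal sketch: download a single checkpoint file from this Hugging Face repo.
# NOTE: repo_id and filename below are assumed placeholders -- replace them with
# this repo's actual id and the checkpoint you want from the tables above.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="DAMO-NLP-SG/Video-LLaMA-Series",  # assumed repo id
    filename="finetune-vicuna7b-v2.pth",       # assumed checkpoint filename
)
print(f"Checkpoint saved to: {ckpt_path}")
```

The downloaded path can then be supplied to the model config described in our GitHub repo when launching the demo.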