---
license: bsd-3-clause
language:
  - en
  - zh
pipeline_tag: visual-question-answering
---

# Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding

This Hugging Face repo hosts the pre-trained and fine-tuned checkpoints of Video-LLaMA, a multi-modal conversational large language model with video understanding capability.

| Checkpoint | Link | Note |
|------------|------|------|
| pretrain-vicuna7b.pth | link | Vicuna-7B as language decoder, pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2.pth | link | Fine-tuned on the VideoChat instruction-following dataset |
| pretrain-vicuna13b.pth | link | Vicuna-13B as language decoder, pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna13b-v2.pth (recommended) | link | Fine-tuned on the VideoChat instruction-following dataset |
| pretrain-ziya13b-zh.pth | link | Pre-trained with the Chinese LLM Ziya-13B |
| finetune-ziya13b-zh.pth | link | Fine-tuned on a machine-translated VideoChat instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh.pth | link | Pre-trained with the Chinese LLM BiLLA-7B |
| finetune-billa7b-zh.pth | link | Fine-tuned on a machine-translated VideoChat instruction-following dataset (in Chinese) |
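
To fetch one of these checkpoints programmatically, a minimal sketch using `huggingface_hub` is shown below. The `repo_id` is an assumption (adjust it to this repository's actual id), and the filename is taken from the table above; actually loading the weights into Video-LLaMA still follows the configs and scripts in the main code repository.

```python
# Minimal sketch: download one of the checkpoints listed above.
# Assumptions: the repo_id below may differ from this repository's real id.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="DAMO-NLP-SG/Video-LLaMA-Series",  # assumed repo id
    filename="finetune-vicuna13b-v2.pth",      # recommended checkpoint from the table
)

# The .pth file contains the trainable Video-LLaMA parameters; the Video-LLaMA
# code loads them on top of the language decoder and visual backbones.
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(list(checkpoint.keys())[:10] if isinstance(checkpoint, dict) else type(checkpoint))
```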