lixin4ever committed on
Commit 7ed968d
1 Parent(s): 7900467

Create README.md

Files changed (1)
  1. README.md +22 -0
README.md ADDED
@@ -0,0 +1,22 @@
+ ---
+ license: bsd-3-clause
+ language:
+ - en
+ - zh
+ pipeline_tag: visual-question-answering
+ ---
+
+ # Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding
+ This Hugging Face repository stores the pre-trained and fine-tuned checkpoints of [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA), a multi-modal conversational large language model with video understanding capability.
+
+
+ | Checkpoint | Link | Note |
+ |:------------|-------------|-------------|
+ | pretrain-vicuna7b.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/tree/main) | Vicuna-7B as language decoder, pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
+ | finetune-vicuna7b-v2.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/tree/main) | Fine-tuned on the [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset |
+ | pretrain-vicuna13b.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-vicuna13b.pth) | Vicuna-13B as language decoder, pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
+ | finetune-vicuna13b-v2.pth (**recommended**) | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth) | Fine-tuned on the [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset |
+ | pretrain-ziya13b-zh.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-ziya13b-zh.pth) | Pre-trained with the Chinese LLM [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
+ | finetune-ziya13b-zh.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
+ | pretrain-billa7b-zh.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-billa7b-zh.pth) | Pre-trained with the Chinese LLM [BiLLA-7B](https://huggingface.co/Neutralzz/BiLLa-7B-SFT) |
+ | finetune-billa7b-zh.pth | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
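
The direct-download links in the table all follow the pattern `https://huggingface.co/<repo>/resolve/<revision>/<filename>`. As a minimal sketch (not part of the README itself), the snippet below rebuilds such a URL and, optionally, fetches a checkpoint with `hf_hub_download` from the `huggingface_hub` package. The helper names `checkpoint_url` and `download_checkpoint` are hypothetical, and the sketch assumes each file is stored in the repository root under the name shown in the Checkpoint column. That assumption is unverified for the two Vicuna-7B entries, whose links point to the repository tree rather than to a file.

```python
from urllib.parse import quote

# Repository id taken from the links in the table above.
REPO_ID = "DAMO-NLP-SG/Video-LLaMA-Series"


def checkpoint_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build a direct-download URL for a checkpoint (hypothetical helper).

    Assumes the file sits in the repository root under `filename`.
    """
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


def download_checkpoint(filename: str, revision: str = "main") -> str:
    """Download a checkpoint into the local Hugging Face cache and return its path.

    Hypothetical wrapper; requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so that checkpoint_url() works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, revision=revision)


if __name__ == "__main__":
    # Only builds and prints the URL; call download_checkpoint() to fetch the file.
    print(checkpoint_url("finetune-vicuna13b-v2.pth"))
```

Going through `hf_hub_download` rather than fetching the URL by hand reuses the local cache, so a multi-gigabyte checkpoint is not downloaded twice.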