RaushanTurganbay (HF staff) committed
Commit aba3cde • 1 Parent(s): 9620ec6

Update README.md

Files changed (1):
  1. README.md +55 -27

README.md CHANGED
@@ -9,19 +9,21 @@ language:

Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).

- Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing)
+ Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing)

- Or check out our Spaces demo! [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/llava-hf/llava-4bit)
+ Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model so this model card has been written by the Hugging Face team.

-
- ## Model details
+ ## 📄 Model details

**Model type:**
<br>
- LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data.
+ LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The model is built on top of LLaVa-NeXT by tuning on a mix of video and image data. The videos were sampled uniformly to be 32 frames per clip.
<br>
Base LLM: lmsys/vicuna-7b-v1.5

+ <img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">
+
+
**Model date:**
<br>
LLaVA-Next-Video-7B was trained in April 2024.
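For reference, the uniform 32-frame sampling mentioned above follows the same pattern as the inference snippet further down in this card; a minimal sketch (the clip length here is a made-up value used only for illustration):

```python
import numpy as np

# Uniformly sample 32 frame indices across a clip; `total_frames` is a
# hypothetical clip length chosen only for this example.
total_frames = 320
indices = np.arange(0, total_frames, total_frames / 32).astype(int)
print(len(indices))  # 32
```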
@@ -31,7 +33,24 @@ LLaVA-Next-Video-7B was trained in April 2024.
https://github.com/LLaVA-VL/LLaVA-NeXT


- ## How to use the model
+ ## 📚 Training dataset
+
+ ### Image
+ - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
+ - 158K GPT-generated multimodal instruction-following data.
+ - 500K academic-task-oriented VQA data mixture.
+ - 50K GPT-4V data mixture.
+ - 40K ShareGPT data.
+
+ ### Video
+ - 100K VideoChatGPT-Instruct.
+
+ ## 📊 Evaluation dataset
+ A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.
+
+
+ ## 🚀 How to use the model

First, make sure to have `transformers >= 4.42.0`.
The model supports multi-visual and multi-prompt generation. Meaning that you can pass multiple images/videos in your prompt. Make sure also to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` or `<video>` to the location where you want to query images/videos:
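As a quick illustration of that template, a video query and an image query look like this (both prompt strings are taken from the snippets that follow):

```python
# One <video> or <image> token per visual input, placed where the visual
# content should appear inside the USER turn.
video_prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
image_prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
```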
@@ -39,17 +58,12 @@ The model supports multi-visual and multi-prompt generation. Meaning that you ca
Below is an example script to run generation in `float16` precision on a GPU device:

```python
- import requests
- from PIL import Image
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

- prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
- image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
-
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
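The hunk above cuts off inside the `from_pretrained` call. A hedged sketch of how the loading example typically continues (the `low_cpu_mem_usage` argument and the device placement are assumptions, not necessarily the exact README text):

```python
# Finish loading in float16 on GPU 0 and create the matching processor.
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # assumption: commonly used in the official examples
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)
```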
@@ -82,7 +96,7 @@ prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

- # sample uniformly 8 frames from the video
+ # sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
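The snippet calls a `read_video_pyav` helper that is defined elsewhere in the README but falls outside the hunks shown here; a minimal sketch of such a helper using PyAV could look like this:

```python
import numpy as np

def read_video_pyav(container, indices):
    """Decode the frames at `indices` from an `av` container and stack them
    into a (num_frames, height, width, 3) uint8 array."""
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
```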
@@ -97,6 +111,12 @@ print(processor.decode(output[0][2:], skip_special_tokens=True))
To generate from images use the below code after loading the model as shown above:

```python
+ import requests
+ from PIL import Image
+
+ prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
+ image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
+
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
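The hunk ends before the actual generation call; the remaining steps mirror the video example above (the `max_new_tokens` value here is illustrative, not necessarily the one in the README):

```python
# Generate from the prepared image inputs and decode the answer.
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```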
 
@@ -149,11 +169,12 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
).to(0)
```

- ## License
+
+ ## 🔒 License
Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved.

- ## Intended use
+ ## 🎯 Intended use
**Primary intended uses:**
<br>
The primary use of LLaVA is research on large multimodal models and chatbots.
@@ -162,20 +183,27 @@ The primary use of LLaVA is research on large multimodal models and chatbots.
<br>
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

- ## Training dataset
-
- ### Image
- - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- - 158K GPT-generated multimodal instruction-following data.
- - 500K academic-task-oriented VQA data mixture.
- - 50K GPT-4V data mixture.
- - 40K ShareGPT data.
-
- ### Video
- - 100K VideoChatGPT-Instruct.
-
- ## Evaluation dataset
- A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.

+ ## ✏️ Citation
+ If you find our paper and code useful in your research, please cite:
+
+ ```BibTeX
+ @misc{zhang2024llavanextvideo,
+     title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
+     url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
+     author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
+     month={April},
+     year={2024}
+ }
+ ```
+
+ ```BibTeX
+ @misc{liu2024llavanext,
+     title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
+     url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
+     author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
+     month={January},
+     year={2024}
+ }
+ ```